===== MXNet =====
MXNet is a deep learning library with support for Python, R, C++, Scala, Julia, MATLAB and JavaScript. It can run on CPUs, GPUs, as well as across multiple cluster nodes. Please see the project website for detailed information:
[[https://mxnet.readthedocs.io/en/latest/]]
On Razor, MXNet is compiled with GPU support as a Python package and installed in ''/share/apps/python/2.7.11/lib/python2.7/site-packages/mxnet-0.7.0-py2.7.egg/mxnet/''. It requires the following modules: gcc/4.9.1, python/2.7.11, opencv/2.4.13, mkl/16.0.1, and cuda/7.5.
As an example, below we will run an MXNet training session on the MNIST (Modified National Institute of Standards and Technology) handwritten digit image database. Because the package is precompiled with GPU support, the example below requires a node with GPU(s) and the CUDA drivers installed. GPU nodes are available in the following queues:
* gpu8core
* gpu16core
* qcondo (with gpu property, i.e. ''qsub -q qcondo -l nodes=1:k80gpu'')
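The interactive session below can also be wrapped in a PBS batch script for non-interactive runs. A sketch along these lines (the job name, queue, resource request, and walltime are placeholders to adapt to your job):

```shell
#PBS -N mxnet-mnist
#PBS -q gpu16core
#PBS -l nodes=1:ppn=16
#PBS -l walltime=1:00:00
#PBS -j oe

# Load the modules the MXNet build was compiled against
module load gcc/4.9.1 python/2.7.11 mkl/16.0.1 opencv/2.4.13 cuda/7.5

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
python train_mnist.py --gpus 0
```

Submit with ''qsub'' as usual; for the condo queue, add the GPU property shown above (''-q qcondo -l nodes=1:k80gpu'').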
razor-l1:pwolinsk:$ qsub -I -q qcondo -l nodes=1:k80gpu
qsub: waiting for job 1758962.sched to start
qsub: job 1758962.sched ready
PBS_NODEFILE=/var/spool/torque/aux/1758962.sched PBS_NUM_NODES=1 PBS_PPN=neednodes=1:k80gpu PBS_PPA=24
Currently Loaded Modulefiles:
1) os/el6
compute3133:pwolinsk:$
compute3133:pwolinsk:$ module load gcc/4.9.1 python/2.7.11 mkl/16.0.1 opencv/2.4.13 cuda/7.5
compute3133:pwolinsk:$ cp -r /share/apps/python/2.7.11/lib/python2.7/site-packages/mxnet-0.7.0-py2.7.egg/mxnet/example/image-classification/ .
compute3133:pwolinsk:$ cd image-classification/
compute3133:pwolinsk:/image-classification$ python train_mnist.py --gpus 0
2016-09-02 12:39:07,821 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus='0', kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='mlp', num_epochs=10, num_examples=60000, save_model_prefix=None)
[12:39:08] src/io/iter_mnist.cc:91: MNISTIter: load 60000 images, shuffle=1, shape=(128,784)
[12:39:09] src/io/iter_mnist.cc:91: MNISTIter: load 10000 images, shuffle=1, shape=(128,784)
2016-09-02 12:39:09,292 Node[0] Start training with [gpu(0)]
2016-09-02 12:39:22,386 Node[0] Epoch[0] Batch [50] Speed: 54211.10 samples/sec Train-accuracy=0.686719
2016-09-02 12:39:22,386 Node[0] Epoch[0] Batch [50] Speed: 54211.10 samples/sec Train-top_k_accuracy_5=0.936562
2016-09-02 12:39:22,386 Node[0] Epoch[0] Batch [50] Speed: 54211.10 samples/sec Train-top_k_accuracy_10=1.000000
...
2016-09-02 12:39:33,259 Node[0] Epoch[9] Batch [450] Speed: 62463.31 samples/sec Train-top_k_accuracy_10=1.000000
2016-09-02 12:39:33,259 Node[0] Epoch[9] Batch [450] Speed: 62463.31 samples/sec Train-top_k_accuracy_20=1.000000
2016-09-02 12:39:33,296 Node[0] Epoch[9] Resetting Data Iterator
2016-09-02 12:39:33,296 Node[0] Epoch[9] Time cost=0.962
2016-09-02 12:39:33,404 Node[0] Epoch[9] Validation-accuracy=0.973858
2016-09-02 12:39:33,404 Node[0] Epoch[9] Validation-top_k_accuracy_5=0.999399
2016-09-02 12:39:33,404 Node[0] Epoch[9] Validation-top_k_accuracy_10=1.000000
2016-09-02 12:39:33,404 Node[0] Epoch[9] Validation-top_k_accuracy_20=1.000000
compute3133:pwolinsk:/image-classification$
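The ''--gpus'' flag takes a comma-separated list of device indices; inside ''train_mnist.py'' that string is turned into one MXNet device context per GPU for data-parallel training. Conceptually it works like the minimal sketch below (ours, not the script's exact code; the helper name ''parse_device_ids'' is hypothetical):

```python
def parse_device_ids(gpus):
    """Turn a --gpus string such as '0,1,2,3' into a list of device indices.

    An empty or missing value means no GPUs were requested, in which case
    MXNet falls back to CPU training.
    """
    if not gpus:
        return []
    return [int(i.strip()) for i in gpus.split(",")]

# In MXNet these indices become device contexts, e.g. mx.gpu(0), mx.gpu(1), ...
print(parse_device_ids("0,1,2,3"))  # [0, 1, 2, 3]
```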
These commands launched a training session using a single GPU. The NVIDIA driver installation provides the ''nvidia-smi'' tool, which monitors the use of the available GPUs on the node.
compute3133:pwolinsk:/image-classification$ nvidia-smi
Fri Sep 2 12:41:34 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35 Driver Version: 367.35 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:04:00.0 Off | 0 |
| N/A 37C P0 56W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:05:00.0 Off | 0 |
| N/A 31C P0 72W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:84:00.0 Off | 0 |
| N/A 31C P0 56W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:85:00.0 Off | 0 |
| N/A 40C P0 72W / 149W | 0MiB / 11439MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
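Besides the table view, ''nvidia-smi'' offers a machine-readable query mode for scripted monitoring, e.g. ''nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits''. A hedged sketch of parsing that output in Python (the sample string below mirrors the table above; on a live node you would capture the command's stdout, e.g. with ''subprocess'', instead of hard-coding it):

```python
import csv
import io

# Sample output in the shape produced by:
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
# The numbers mirror the nvidia-smi table above (GPU 3 at 98% utilization).
sample = "0, 0\n1, 0\n2, 0\n3, 98\n"

def busiest_gpu(csv_text):
    """Return (gpu_index, utilization_percent) of the most-utilized GPU."""
    rows = [(int(idx), int(util))
            for idx, util in csv.reader(io.StringIO(csv_text))]
    return max(rows, key=lambda r: r[1])

print(busiest_gpu(sample))  # (3, 98)
```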
The output above shows four Tesla K80 GPU devices on the system. In reality there are only two physical K80 cards: each K80 board carries two GPUs, so ''nvidia-smi'' lists four devices. To launch the training session with multiple GPUs:
compute3133:pwolinsk:/image-classification$ python train_mnist.py --gpus 0,1,2,3