==== TensorFlow ====

TensorFlow is an open-source deep learning library for numerical computation using data flow graphs.  Detailed information about the software is available on the project website:

https://www.tensorflow.org/

The library is available as a Python package.  The CPU version is installed for **python/2.7.5** on both clusters and requires three additional modules: **gcc/4.9.1 mkl/16.0.1 java/sunjdk_1.8.0**.  The GPU version is installed for **python/2.7.11** on Razor and requires **gcc/4.9.1 mkl/16.0.1 java/sunjdk_1.8.0 cuda/8.0**.  The example below loads the CPU version and checks that the package imports.

<code>
tres0118:pwolinsk:$ module load gcc/4.9.1 python/2.7.5 mkl/16.0.1 java/sunjdk_1.8.0
tres0118:pwolinsk:$ python
Python 2.7.5 (default, Jul 10 2014, 16:10:08) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
>>> 
</code>
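
The same check can be scripted, for example at the top of a batch job.  The following is a minimal sketch rather than part of the installation: the helper name ''module_version'' is hypothetical, and it assumes the package exposes a version attribute (it falls back to "unknown" if not).

<code python>
import importlib


def module_version(name):
    """Return the version string of an importable module, or None if the import fails."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        # Typically means the required environment modules were not loaded.
        return None
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    version = module_version("tensorflow")
    if version is None:
        print("tensorflow is not importable; check the loaded modules")
    else:
        print("tensorflow version: %s" % version)
</code>

A return value of ''None'' usually indicates that one of the modules listed above has not been loaded in the current shell.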

The tensorflow package is installed on Razor in ''/share/apps/opt/rh/python27/root/usr/lib/python2.7/site-packages/tensorflow''.  The installation contains a few example models: ''image/alexnet image/cifar10 image/imagenet image/mnist embedding''.

We will use the ''image/mnist'' example model to run a training session.

<code>
tres0118:pwolinsk:$ python -m tensorflow.models.image.mnist.convolutional
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
Initialized!
Step 0 (epoch 0.00), 5.1 ms
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
Validation error: 84.6%
Step 100 (epoch 0.12), 203.7 ms
Minibatch loss: 3.282, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 7.1%
...
</code>

The ''-m'' option instructs Python to search its module search path (''sys.path'') for the named module and run it as a script.  You could also specify the full path to the ''convolutional.py'' script:

<code>
python /share/apps/opt/rh/python27/root/usr/lib/python2.7/site-packages/tensorflow/models/image/mnist/convolutional.py
</code>
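
To illustrate what ''-m'' does, the sketch below uses the standard-library ''runpy'' module, which Python itself uses to implement the ''-m'' switch.  The helper name ''run_as_script'' is hypothetical, and ''colorsys'' is only a harmless stand-in module; passing the MNIST module name from the command above instead would start the full training run.

<code python>
import runpy


def run_as_script(module_name):
    """Locate module_name on sys.path and execute it as __main__, similar to `python -m`.

    Returns the globals dictionary of the executed module.
    """
    return runpy.run_module(module_name, run_name="__main__", alter_sys=True)


if __name__ == "__main__":
    # colorsys is a standard-library stand-in with no side effects when run.
    module_globals = run_as_script("colorsys")
    # __file__ is the path that was found on the module search path.
    print("executed file: %s" % module_globals["__file__"])
</code>

The printed path shows which file was resolved, which is the same file you would pass to ''python'' when giving the full path.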

The **cifar10** TensorFlow example has been tested with both CPU and GPU.  To get the example set and load the required modules:
<code>
git clone https://github.com/tensorflow/models
cd models/tutorials/image/cifar10
vi cifar10.py
module load gcc/5.2.1 mkl/16.0.1 python/2.7.11  java/sunjdk_1.8.0 cuda/8.0
</code>
Edit ''cifar10.py'' to change ''/tmp/'' to an appropriate scratch directory such as ''/local_scratch/rfeynman/''.  We set ''CUDA_VISIBLE_DEVICES'' to expose 0, 1, or 4 GPUs to TensorFlow.  Scaling from CPU to 1 GPU to 4 GPUs (two dual-GPU K80 cards) is very modest: in the runs below, throughput is roughly 775 examples/sec on CPU, 1280 on 1 GPU, and 1330 on 4 GPUs, so there is essentially no difference between 1 GPU and 4.
<code>
models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES=""
models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2017-03-17 14:13:46.533942: step 0, loss = 4.68 (29.7 examples/sec; 4.305 sec/batch)
2017-03-17 14:13:49.055265: step 10, loss = 4.66 (781.8 examples/sec; 0.164 sec/batch)
2017-03-17 14:13:50.697406: step 20, loss = 4.63 (771.2 examples/sec; 0.166 sec/batch)
2017-03-17 14:13:52.340482: step 30, loss = 4.60 (771.3 examples/sec; 0.166 sec/batch)

models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES="0"
models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.

2017-03-21 15:41:42.767510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:04:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:41:42.767552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 
2017-03-21 15:41:42.767559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y 
2017-03-21 15:41:42.767571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0)
2017-03-21 15:42:06.767827: step 0, loss = 4.68 (37.1 examples/sec; 3.448 sec/batch)
2017-03-21 15:42:07.701603: step 10, loss = 4.67 (1370.8 examples/sec; 0.093 sec/batch)
2017-03-21 15:42:08.750127: step 20, loss = 4.60 (1220.8 examples/sec; 0.105 sec/batch)
2017-03-21 15:42:09.762612: step 30, loss = 4.61 (1264.2 examples/sec; 0.101 sec/batch)
2017-03-21 15:42:10.769818: step 40, loss = 4.58 (1270.8 examples/sec; 0.101 sec/batch)
2017-03-21 15:42:11.768493: step 50, loss = 4.53 (1281.7 examples/sec; 0.100 sec/batch)
2017-03-21 15:42:12.769582: step 60, loss = 4.52 (1278.6 examples/sec; 0.100 sec/batch)
2017-03-21 15:42:13.769733: step 70, loss = 4.54 (1279.8 examples/sec; 0.100 sec/batch)

models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES="0,1,2,3"
models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
2017-03-21 15:43:37.866128: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:04:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:43:37.866246: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x364c2b0
2017-03-21 15:43:38.104892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 1 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:05:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:43:38.104995: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x36500f0
2017-03-21 15:43:38.349437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 2 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:84:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:43:38.349535: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x3653f60
2017-03-21 15:43:38.600657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 3 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:85:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:43:38.602403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 2 3 
2017-03-21 15:43:38.602412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y Y N N 
2017-03-21 15:43:38.602418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1:   Y Y N N 
2017-03-21 15:43:38.602423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 2:   N N Y Y 
2017-03-21 15:43:38.602428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 3:   N N Y Y 
2017-03-21 15:43:38.602445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0)
2017-03-21 15:43:38.602453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:05:00.0)
2017-03-21 15:43:38.602459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:84:00.0)
2017-03-21 15:43:38.602464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:85:00.0)
2017-03-21 15:43:54.086766: step 0, loss = 4.67 (47.9 examples/sec; 2.674 sec/batch)
2017-03-21 15:43:55.013200: step 10, loss = 4.66 (1381.7 examples/sec; 0.093 sec/batch)
2017-03-21 15:43:56.011015: step 20, loss = 4.65 (1282.8 examples/sec; 0.100 sec/batch)
2017-03-21 15:43:56.967307: step 30, loss = 4.60 (1338.5 examples/sec; 0.096 sec/batch)
2017-03-21 15:43:57.940303: step 40, loss = 4.57 (1315.5 examples/sec; 0.097 sec/batch)
2017-03-21 15:43:58.902810: step 50, loss = 4.54 (1329.9 examples/sec; 0.096 sec/batch)
2017-03-21 15:43:59.859618: step 60, loss = 4.48 (1337.8 examples/sec; 0.096 sec/batch)
</code>
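
The throughput comparison can be reproduced from the training output.  The sketch below is one possible way to do it, not a tool provided on the clusters: the helper name ''mean_throughput'' is hypothetical, the sample strings are excerpts copied from the runs above, and skipping the first logged step (which includes start-up cost) is an assumption about what counts as steady state.

<code python>
import re

# Matches the throughput field in lines such as
# "... step 10, loss = 4.66 (781.8 examples/sec; 0.164 sec/batch)"
THROUGHPUT_RE = re.compile(r"\(([\d.]+) examples/sec;")

CPU_LOG = """\
2017-03-17 14:13:46.533942: step 0, loss = 4.68 (29.7 examples/sec; 4.305 sec/batch)
2017-03-17 14:13:49.055265: step 10, loss = 4.66 (781.8 examples/sec; 0.164 sec/batch)
2017-03-17 14:13:50.697406: step 20, loss = 4.63 (771.2 examples/sec; 0.166 sec/batch)
2017-03-17 14:13:52.340482: step 30, loss = 4.60 (771.3 examples/sec; 0.166 sec/batch)
"""

ONE_GPU_LOG = """\
2017-03-21 15:42:06.767827: step 0, loss = 4.68 (37.1 examples/sec; 3.448 sec/batch)
2017-03-21 15:42:07.701603: step 10, loss = 4.67 (1370.8 examples/sec; 0.093 sec/batch)
2017-03-21 15:42:08.750127: step 20, loss = 4.60 (1220.8 examples/sec; 0.105 sec/batch)
2017-03-21 15:42:09.762612: step 30, loss = 4.61 (1264.2 examples/sec; 0.101 sec/batch)
"""

FOUR_GPU_LOG = """\
2017-03-21 15:43:54.086766: step 0, loss = 4.67 (47.9 examples/sec; 2.674 sec/batch)
2017-03-21 15:43:55.013200: step 10, loss = 4.66 (1381.7 examples/sec; 0.093 sec/batch)
2017-03-21 15:43:56.011015: step 20, loss = 4.65 (1282.8 examples/sec; 0.100 sec/batch)
2017-03-21 15:43:56.967307: step 30, loss = 4.60 (1338.5 examples/sec; 0.096 sec/batch)
"""


def mean_throughput(log_text, skip_steps=1):
    """Average examples/sec over the logged steps, ignoring the first skip_steps entries."""
    rates = [float(m.group(1)) for m in THROUGHPUT_RE.finditer(log_text)]
    rates = rates[skip_steps:]  # drop warm-up entries
    if not rates:
        raise ValueError("no throughput lines found after skipping warm-up")
    return sum(rates) / len(rates)


if __name__ == "__main__":
    cpu = mean_throughput(CPU_LOG)
    for label, log in [("CPU", CPU_LOG), ("1 GPU", ONE_GPU_LOG), ("4 GPUs", FOUR_GPU_LOG)]:
        rate = mean_throughput(log)
        print("%-7s %7.1f examples/sec  (%.2fx CPU)" % (label, rate, rate / cpu))
</code>

To analyse a full run, redirect the training output to a file and pass its contents to ''mean_throughput'' in place of the excerpts.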