==== TensorFlow ====

TensorFlow is an open-source deep learning library for numerical computation using data flow graphs. Detailed information about the software is available on the project website: https://www.tensorflow.org/

The library is available as a Python package. The CPU version is installed for **python/2.7.5** on both clusters and requires three additional modules: **gcc/4.9.1 mkl/16.0.1 java/sunjdk_1.8.0**. The GPU version is installed for **python/2.7.11** on Razor and requires **gcc/4.9.1 mkl/16.0.1 java/sunjdk_1.8.0 cuda/8.0** in addition to **python/2.7.11**.

  tres0118:pwolinsk:$ module load gcc/4.9.1 python/2.7.5 mkl/16.0.1 java/sunjdk_1.8.0
  tres0118:pwolinsk:$ python
  Python 2.7.5 (default, Jul 10 2014, 16:10:08)
  [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import tensorflow
  >>>

The tensorflow package is installed on Razor in ''/share/apps/opt/rh/python27/root/usr/lib/python2.7/site-packages/tensorflow''. The installation contains a few example models: ''image/alexnet image/cifar10 image/imagenet image/mnist embedding''. We will use the image/mnist model to run a training session.

  tres0118:pwolinsk:$ python -m tensorflow.models.image.mnist.convolutional
  Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
  Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
  Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
  Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
  Extracting data/train-images-idx3-ubyte.gz
  Extracting data/train-labels-idx1-ubyte.gz
  Extracting data/t10k-images-idx3-ubyte.gz
  Extracting data/t10k-labels-idx1-ubyte.gz
  Initialized!
  Step 0 (epoch 0.00), 5.1 ms
  Minibatch loss: 12.054, learning rate: 0.010000
  Minibatch error: 90.6%
  Validation error: 84.6%
  Step 100 (epoch 0.12), 203.7 ms
  Minibatch loss: 3.282, learning rate: 0.010000
  Minibatch error: 6.2%
  Validation error: 7.1%
  ...
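Both the ''import tensorflow'' check and the ''python -m'' invocation depend on which copy of the package the loaded Python module finds. As a minimal sketch (the helper name ''module_location'' is hypothetical and not part of TensorFlow or the cluster setup), the following reports the file a module is loaded from, so you can confirm it matches the installation directory given above:

```python
from __future__ import print_function  # keeps print() usable on Python 2.7

import importlib


def module_location(name):
    """Return the file a module is loaded from, or None if unavailable.

    Note: this really imports the module, so a large package such as
    tensorflow may take a few seconds to load.
    """
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Built-in modules have no __file__ attribute, hence the default.
    return getattr(module, "__file__", None)


if __name__ == "__main__":
    # Prints None for tensorflow if the required modules were not loaded.
    for name in ("tensorflow", "json"):
        print(name, "->", module_location(name))
```

If the path printed for ''tensorflow'' is ''None'' or points somewhere unexpected, check that the modules listed above were loaded in the current shell.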
The ''-m'' option tells Python to search its module search path (''sys.path'') for the named module and run it as a script. You can also give the full path to the ''convolutional.py'' script:

  python /share/apps/opt/rh/python27/root/usr/lib/python2.7/site-packages/tensorflow/models/image/mnist/convolutional.py

The **cifar10** TensorFlow example has been tested with CPU and GPU. To get the example set:

  git clone https://github.com/tensorflow/models
  cd models/tutorials/image/cifar10
  vi cifar10.py
  module load gcc/5.2.1 mkl/16.0.1 python/2.7.11 java/sunjdk_1.8.0 cuda/8.0

In ''cifar10.py'', change ''/tmp/'' to an appropriate scratch directory such as ''/local_scratch/rfeynman/''.

We set ''CUDA_VISIBLE_DEVICES'' to expose 0, 1, or 4 GPU devices to the job. Scaling from CPU to 1 GPU to 4 GPUs (two dual-GPU K80 cards) is very modest: excluding the warm-up step 0, the runs below average roughly 775 examples/sec on CPU, 1,280 on one GPU, and 1,330 with four GPUs visible. There is essentially no difference between 1 GPU and 4.

  models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES=""
  models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
  Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
  2017-03-17 14:13:46.533942: step 0, loss = 4.68 (29.7 examples/sec; 4.305 sec/batch)
  2017-03-17 14:13:49.055265: step 10, loss = 4.66 (781.8 examples/sec; 0.164 sec/batch)
  2017-03-17 14:13:50.697406: step 20, loss = 4.63 (771.2 examples/sec; 0.166 sec/batch)
  2017-03-17 14:13:52.340482: step 30, loss = 4.60 (771.3 examples/sec; 0.166 sec/batch)

  models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES="0"
  models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
  Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
  2017-03-21 15:41:42.767510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
  name: Tesla K80
  major: 3 minor: 7 memoryClockRate (GHz) 0.8235
  pciBusID 0000:04:00.0
  Total memory: 11.17GiB
  Free memory: 11.11GiB
  2017-03-21 15:41:42.767552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
  2017-03-21 15:41:42.767559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
  2017-03-21 15:41:42.767571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0)
  2017-03-21 15:42:06.767827: step 0, loss = 4.68 (37.1 examples/sec; 3.448 sec/batch)
  2017-03-21 15:42:07.701603: step 10, loss = 4.67 (1370.8 examples/sec; 0.093 sec/batch)
  2017-03-21 15:42:08.750127: step 20, loss = 4.60 (1220.8 examples/sec; 0.105 sec/batch)
  2017-03-21 15:42:09.762612: step 30, loss = 4.61 (1264.2 examples/sec; 0.101 sec/batch)
  2017-03-21 15:42:10.769818: step 40, loss = 4.58 (1270.8 examples/sec; 0.101 sec/batch)
  2017-03-21 15:42:11.768493: step 50, loss = 4.53 (1281.7 examples/sec; 0.100 sec/batch)
  2017-03-21 15:42:12.769582: step 60, loss = 4.52 (1278.6 examples/sec; 0.100 sec/batch)
  2017-03-21 15:42:13.769733: step 70, loss = 4.54 (1279.8 examples/sec; 0.100 sec/batch)

  models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES="0,1,2,3"
  models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
  2017-03-21 15:43:37.866128: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
  name: Tesla K80
  major: 3 minor: 7 memoryClockRate (GHz) 0.8235
  pciBusID 0000:04:00.0
  Total memory: 11.17GiB
  Free memory: 11.11GiB
  2017-03-21 15:43:37.866246: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x364c2b0
  2017-03-21 15:43:38.104892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 1 with properties:
  name: Tesla K80
  major: 3 minor: 7 memoryClockRate (GHz) 0.8235
  pciBusID 0000:05:00.0
  Total memory: 11.17GiB
  Free memory: 11.11GiB
  2017-03-21 15:43:38.104995: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x36500f0
  2017-03-21 15:43:38.349437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 2 with properties:
  name: Tesla K80
  major: 3 minor: 7 memoryClockRate (GHz) 0.8235
  pciBusID 0000:84:00.0
  Total memory: 11.17GiB
  Free memory: 11.11GiB
  2017-03-21 15:43:38.349535: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x3653f60
  2017-03-21 15:43:38.600657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 3 with properties:
  name: Tesla K80
  major: 3 minor: 7 memoryClockRate (GHz) 0.8235
  pciBusID 0000:85:00.0
  Total memory: 11.17GiB
  Free memory: 11.11GiB
  2017-03-21 15:43:38.602403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 2 3
  2017-03-21 15:43:38.602412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y Y N N
  2017-03-21 15:43:38.602418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1: Y Y N N
  2017-03-21 15:43:38.602423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 2: N N Y Y
  2017-03-21 15:43:38.602428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 3: N N Y Y
  2017-03-21 15:43:38.602445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0)
  2017-03-21 15:43:38.602453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:05:00.0)
  2017-03-21 15:43:38.602459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:84:00.0)
  2017-03-21 15:43:38.602464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:85:00.0)
  2017-03-21 15:43:54.086766: step 0, loss = 4.67 (47.9 examples/sec; 2.674 sec/batch)
  2017-03-21 15:43:55.013200: step 10, loss = 4.66 (1381.7 examples/sec; 0.093 sec/batch)
  2017-03-21 15:43:56.011015: step 20, loss = 4.65 (1282.8 examples/sec; 0.100 sec/batch)
  2017-03-21 15:43:56.967307: step 30, loss = 4.60 (1338.5 examples/sec; 0.096 sec/batch)
  2017-03-21 15:43:57.940303: step 40, loss = 4.57 (1315.5 examples/sec; 0.097 sec/batch)
  2017-03-21 15:43:58.902810: step 50, loss = 4.54 (1329.9 examples/sec; 0.096 sec/batch)
  2017-03-21 15:43:59.859618: step 60, loss = 4.48 (1337.8 examples/sec; 0.096 sec/batch)

Note that in the upstream tutorial ''cifar10_multi_gpu_train.py'' takes a ''--num_gpus'' flag that defaults to 1. The runs shown here do not pass it, so the job with four visible devices may have trained on only one of them, which would be consistent with the nearly identical throughput.
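
To compare the three runs more precisely, the throughput can be averaged from the ''step'' lines of each log. The sketch below is illustrative only: the function name ''mean_throughput'' and the regular expression are assumptions based on the log format shown on this page, and the log text is copied from the runs above. Step 0 is skipped because it includes start-up cost.

```python
from __future__ import print_function  # keeps print() usable on Python 2.7

import re

# Matches lines such as
# "... step 10, loss = 4.66 (781.8 examples/sec; 0.164 sec/batch)"
STEP_RE = re.compile(r"step (\d+), loss = [\d.]+ \(([\d.]+) examples/sec")


def mean_throughput(log_text, skip_steps=(0,)):
    """Average examples/sec over logged steps, ignoring warm-up steps."""
    rates = [float(rate) for step, rate in STEP_RE.findall(log_text)
             if int(step) not in skip_steps]
    return sum(rates) / len(rates) if rates else None


# Step lines copied from the three runs on this page.
CPU_LOG = """
2017-03-17 14:13:46.533942: step 0, loss = 4.68 (29.7 examples/sec; 4.305 sec/batch)
2017-03-17 14:13:49.055265: step 10, loss = 4.66 (781.8 examples/sec; 0.164 sec/batch)
2017-03-17 14:13:50.697406: step 20, loss = 4.63 (771.2 examples/sec; 0.166 sec/batch)
2017-03-17 14:13:52.340482: step 30, loss = 4.60 (771.3 examples/sec; 0.166 sec/batch)
"""

GPU1_LOG = """
2017-03-21 15:42:06.767827: step 0, loss = 4.68 (37.1 examples/sec; 3.448 sec/batch)
2017-03-21 15:42:07.701603: step 10, loss = 4.67 (1370.8 examples/sec; 0.093 sec/batch)
2017-03-21 15:42:08.750127: step 20, loss = 4.60 (1220.8 examples/sec; 0.105 sec/batch)
2017-03-21 15:42:09.762612: step 30, loss = 4.61 (1264.2 examples/sec; 0.101 sec/batch)
2017-03-21 15:42:10.769818: step 40, loss = 4.58 (1270.8 examples/sec; 0.101 sec/batch)
2017-03-21 15:42:11.768493: step 50, loss = 4.53 (1281.7 examples/sec; 0.100 sec/batch)
2017-03-21 15:42:12.769582: step 60, loss = 4.52 (1278.6 examples/sec; 0.100 sec/batch)
2017-03-21 15:42:13.769733: step 70, loss = 4.54 (1279.8 examples/sec; 0.100 sec/batch)
"""

GPU4_LOG = """
2017-03-21 15:43:54.086766: step 0, loss = 4.67 (47.9 examples/sec; 2.674 sec/batch)
2017-03-21 15:43:55.013200: step 10, loss = 4.66 (1381.7 examples/sec; 0.093 sec/batch)
2017-03-21 15:43:56.011015: step 20, loss = 4.65 (1282.8 examples/sec; 0.100 sec/batch)
2017-03-21 15:43:56.967307: step 30, loss = 4.60 (1338.5 examples/sec; 0.096 sec/batch)
2017-03-21 15:43:57.940303: step 40, loss = 4.57 (1315.5 examples/sec; 0.097 sec/batch)
2017-03-21 15:43:58.902810: step 50, loss = 4.54 (1329.9 examples/sec; 0.096 sec/batch)
2017-03-21 15:43:59.859618: step 60, loss = 4.48 (1337.8 examples/sec; 0.096 sec/batch)
"""

if __name__ == "__main__":
    cpu = mean_throughput(CPU_LOG)
    for label, log in (("CPU", CPU_LOG), ("1 GPU", GPU1_LOG), ("4 GPUs", GPU4_LOG)):
        rate = mean_throughput(log)
        print("%-7s %7.1f examples/sec (%.2fx CPU)" % (label, rate, rate / cpu))
```

To analyze a longer run, redirect the training output to a file and pass the file contents to ''mean_throughput''. Only a handful of steps are logged here, so the averages are rough.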