
TensorFlow

TensorFlow is an open-source deep learning library for numerical computation using data flow graphs. Detailed information about the software is available on the project website:

https://www.tensorflow.org/

The library is available as a Python package. The CPU version is installed for python/2.7.5 on both clusters and requires three additional modules: gcc/4.9.1, mkl/16.0.1, and java/sunjdk_1.8.0. The GPU version is installed for python/2.7.11 on Razor and requires gcc/4.9.1, mkl/16.0.1, java/sunjdk_1.8.0, and cuda/8.0 in addition to python/2.7.11.

tres0118:pwolinsk:$ module load gcc/4.9.1 python/2.7.5 mkl/16.0.1 java/sunjdk_1.8.0
tres0118:pwolinsk:$ python
Python 2.7.5 (default, Jul 10 2014, 16:10:08) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
>>>
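The interactive check above can also be scripted, for example at the start of a batch job. The following is a minimal sketch rather than part of the installation: the helper name check_module is hypothetical, and it assumes the modules listed above have already been loaded in the shell that runs it.

```python
from __future__ import print_function  # keeps print() working on Python 2.7

import importlib


def check_module(name):
    """Return (available, version) for the named module.

    version is None when the module is missing or defines no __version__.
    check_module is a hypothetical helper, not part of TensorFlow.
    """
    try:
        module = importlib.import_module(name)
    except ImportError:
        return False, None
    return True, getattr(module, "__version__", None)


if __name__ == "__main__":
    available, version = check_module("tensorflow")
    if available:
        print("tensorflow is importable, version:", version)
    else:
        print("tensorflow not found; check that the required modules are loaded")
```

Running the script in a shell without the modules loaded should print the "not found" message instead of a traceback.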

The TensorFlow package is installed on Razor in /share/apps/opt/rh/python27/root/usr/lib/python2.7/site-packages/tensorflow. The installation contains a few example models: image/alexnet, image/cifar10, image/imagenet, image/mnist, and embedding.

We will use the image/mnist model to run a training session:

tres0118:pwolinsk:$ python -m tensorflow.models.image.mnist.convolutional
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
Initialized!
Step 0 (epoch 0.00), 5.1 ms
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
Validation error: 84.6%
Step 100 (epoch 0.12), 203.7 ms
Minibatch loss: 3.282, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 7.1%
...

The -m option instructs Python to search its module search path (sys.path) for the named module and run it as a script. You could also specify the full path to the convolutional.py script:

python /share/apps/opt/rh/python27/root/usr/lib/python2.7/site-packages/tensorflow/models/image/mnist/convolutional.py
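To make the relationship between the two invocations concrete, the sketch below walks sys.path and reports which file a dotted module name would resolve to. It is a deliberately simplified model of the lookup, not the interpreter's actual import machinery: find_module_file is a hypothetical helper that only handles plain .py files and packages containing a __main__.py, and it ignores zip archives, compiled extensions, and import hooks.

```python
from __future__ import print_function

import os
import sys


def find_module_file(dotted_name, search_path=None):
    """Return the first file on search_path matching dotted_name, or None.

    Simplified illustration of the lookup behind "python -m" (hypothetical
    helper): each directory on the search path is tried in order.
    """
    if search_path is None:
        search_path = sys.path
    parts = dotted_name.split(".")
    for base in search_path:
        # An empty entry in sys.path means the current directory.
        candidate = os.path.join(base or os.curdir, *parts)
        if os.path.isfile(candidate + ".py"):
            return candidate + ".py"
        main_file = os.path.join(candidate, "__main__.py")
        if os.path.isfile(main_file):
            return main_file
    return None


if __name__ == "__main__":
    # Prints None when the module is not on the search path.
    print(find_module_file("tensorflow.models.image.mnist.convolutional"))
```

With the modules from the first example loaded, the printed path would be expected to match the full path used in the command above.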

The cifar10 TensorFlow example has been tested on both CPU and GPU. To get the example set:

git clone https://github.com/tensorflow/models
cd models/tutorials/image/cifar10
vi cifar10.py
module load gcc/5.2.1 mkl/16.0.1 python/2.7.11 java/sunjdk_1.8.0 cuda/8.0

and edit cifar10.py to change /tmp/ to an appropriate scratch directory such as /localscratch/rfeynman/. We set CUDA_VISIBLE_DEVICES to expose 0, 1, or 4 GPUs to TensorFlow. Scaling from CPU to 1 GPU to 4 GPUs (two dual-GPU K80 cards) is very modest: throughput is about 770 examples/sec on the CPU, about 1,280 examples/sec on 1 GPU, and about 1,330 examples/sec on 4 GPUs, so there is essentially no difference between 1 GPU and 4.

<code>
models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES=""
models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2017-03-17 14:13:46.533942: step 0, loss = 4.68 (29.7 examples/sec; 4.305 sec/batch)
2017-03-17 14:13:49.055265: step 10, loss = 4.66 (781.8 examples/sec; 0.164 sec/batch)
2017-03-17 14:13:50.697406: step 20, loss = 4.63 (771.2 examples/sec; 0.166 sec/batch)
2017-03-17 14:13:52.340482: step 30, loss = 4.60 (771.3 examples/sec; 0.166 sec/batch)

models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES="0"
models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2017-03-21 15:41:42.767510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:04:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:41:42.767552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-03-21 15:41:42.767559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-03-21 15:41:42.767571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0)
2017-03-21 15:42:06.767827: step 0, loss = 4.68 (37.1 examples/sec; 3.448 sec/batch)
2017-03-21 15:42:07.701603: step 10, loss = 4.67 (1370.8 examples/sec; 0.093 sec/batch)
2017-03-21 15:42:08.750127: step 20, loss = 4.60 (1220.8 examples/sec; 0.105 sec/batch)
2017-03-21 15:42:09.762612: step 30, loss = 4.61 (1264.2 examples/sec; 0.101 sec/batch)
2017-03-21 15:42:10.769818: step 40, loss = 4.58 (1270.8 examples/sec; 0.101 sec/batch)
2017-03-21 15:42:11.768493: step 50, loss = 4.53 (1281.7 examples/sec; 0.100 sec/batch)
2017-03-21 15:42:12.769582: step 60, loss = 4.52 (1278.6 examples/sec; 0.100 sec/batch)
2017-03-21 15:42:13.769733: step 70, loss = 4.54 (1279.8 examples/sec; 0.100 sec/batch)

models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES="0,1,2,3"
models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
2017-03-21 15:43:37.866128: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:04:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:43:37.866246: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x364c2b0
2017-03-21 15:43:38.104892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 1 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:05:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:43:38.104995: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x36500f0
2017-03-21 15:43:38.349437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 2 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:84:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:43:38.349535: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x3653f60
2017-03-21 15:43:38.600657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 3 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:85:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:43:38.602403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 2 3
2017-03-21 15:43:38.602412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y Y N N
2017-03-21 15:43:38.602418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1: Y Y N N
2017-03-21 15:43:38.602423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 2: N N Y Y
2017-03-21 15:43:38.602428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 3: N N Y Y
2017-03-21 15:43:38.602445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0)
2017-03-21 15:43:38.602453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:05:00.0)
2017-03-21 15:43:38.602459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:84:00.0)
2017-03-21 15:43:38.602464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:85:00.0)
2017-03-21 15:43:54.086766: step 0, loss = 4.67 (47.9 examples/sec; 2.674 sec/batch)
2017-03-21 15:43:55.013200: step 10, loss = 4.66 (1381.7 examples/sec; 0.093 sec/batch)
2017-03-21 15:43:56.011015: step 20, loss = 4.65 (1282.8 examples/sec; 0.100 sec/batch)
2017-03-21 15:43:56.967307: step 30, loss = 4.60 (1338.5 examples/sec; 0.096 sec/batch)
2017-03-21 15:43:57.940303: step 40, loss = 4.57 (1315.5 examples/sec; 0.097 sec/batch)
2017-03-21 15:43:58.902810: step 50, loss = 4.54 (1329.9 examples/sec; 0.096 sec/batch)
2017-03-21 15:43:59.859618: step 60, loss = 4.48 (1337.8 examples/sec; 0.096 sec/batch)
</code>
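The scaling comparison can be quantified directly from the training output. The sketch below is an illustration only: the helper mean_throughput and the regular expression are assumptions about the "step N, loss = ..." line format seen in these logs, and the three log excerpts contain just the first four step lines of each run. Step 0 is skipped because it includes start-up cost.

```python
from __future__ import division, print_function

import re

# Matches lines such as:
#   step 10, loss = 4.66 (781.8 examples/sec; 0.164 sec/batch)
STEP_RE = re.compile(
    r"step (\d+), loss = ([\d.]+) \(([\d.]+) examples/sec; ([\d.]+) sec/batch\)"
)


def mean_throughput(log_text, skip_first=True):
    """Average examples/sec over the step lines found in log_text.

    Hypothetical helper. Returns None if no usable step lines are found.
    """
    rates = [
        float(match.group(3))
        for match in STEP_RE.finditer(log_text)
        if not (skip_first and int(match.group(1)) == 0)
    ]
    return sum(rates) / len(rates) if rates else None


CPU_LOG = """
2017-03-17 14:13:46.533942: step 0, loss = 4.68 (29.7 examples/sec; 4.305 sec/batch)
2017-03-17 14:13:49.055265: step 10, loss = 4.66 (781.8 examples/sec; 0.164 sec/batch)
2017-03-17 14:13:50.697406: step 20, loss = 4.63 (771.2 examples/sec; 0.166 sec/batch)
2017-03-17 14:13:52.340482: step 30, loss = 4.60 (771.3 examples/sec; 0.166 sec/batch)
"""

ONE_GPU_LOG = """
2017-03-21 15:42:06.767827: step 0, loss = 4.68 (37.1 examples/sec; 3.448 sec/batch)
2017-03-21 15:42:07.701603: step 10, loss = 4.67 (1370.8 examples/sec; 0.093 sec/batch)
2017-03-21 15:42:08.750127: step 20, loss = 4.60 (1220.8 examples/sec; 0.105 sec/batch)
2017-03-21 15:42:09.762612: step 30, loss = 4.61 (1264.2 examples/sec; 0.101 sec/batch)
"""

FOUR_GPU_LOG = """
2017-03-21 15:43:54.086766: step 0, loss = 4.67 (47.9 examples/sec; 2.674 sec/batch)
2017-03-21 15:43:55.013200: step 10, loss = 4.66 (1381.7 examples/sec; 0.093 sec/batch)
2017-03-21 15:43:56.011015: step 20, loss = 4.65 (1282.8 examples/sec; 0.100 sec/batch)
2017-03-21 15:43:56.967307: step 30, loss = 4.60 (1338.5 examples/sec; 0.096 sec/batch)
"""

if __name__ == "__main__":
    cpu = mean_throughput(CPU_LOG)
    for label, log in [("CPU", CPU_LOG), ("1 GPU", ONE_GPU_LOG), ("4 GPU", FOUR_GPU_LOG)]:
        rate = mean_throughput(log)
        print("%-6s %7.1f examples/sec  (%.2fx CPU)" % (label, rate, rate / cpu))
```

On these excerpts the speedup is roughly 1.7x from CPU to one GPU and only about 1.04x from one GPU to four, which matches the observation that adding GPUs makes essentially no difference for this example.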