tensorflow

Differences

This shows you the differences between two versions of the page.

tensorflow [2016/08/18 18:51]
pwolinsk created
tensorflow [2017/03/21 20:55] (current)
root
Line 1: Line 1:
 ==== Tensorflow ====
 
-Tensorflow is an open source software library for numerical computation using data flow graphs.  Detailed information about the software is available on the project website:
+Tensorflow is an open source deep learning software library for numerical computation using data flow graphs.  Detailed information about the software is available on the project website:
 
 https://www.tensorflow.org/
 
-The library is available as a python package.  It is installed for **python/2.7.5** and requires 3 additional dependencies **gcc/4.9.1 mkl/16.0.1 java/sunjdk_1.8.0**
+The library is available as a python package.  The CPU version is installed for **python/2.7.5** on both clusters and requires 3 additional dependencies: **gcc/4.9.1 mkl/16.0.1 java/sunjdk_1.8.0**.  The GPU version is installed for **python/2.7.11** on Razor and requires **gcc/4.9.1 mkl/16.0.1 java/sunjdk_1.8.0 cuda/8.0** in addition to **python/2.7.11**.
 
 <code>
 tres0118:pwolinsk:$ module load gcc/4.9.1 python/2.7.5 mkl/16.0.1 java/sunjdk_1.8.0
-tres0118:pwolinsk:/local_scratch/pwolinsk$ python
+tres0118:pwolinsk:$ python
 Python 2.7.5 (default, Jul 10 2014, 16:10:08)
 [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Line 16: Line 16:
 >>>
 </code>
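Once the modules are loaded, it can be useful to confirm that the package actually resolves before starting a long job. The following is a minimal sketch, not part of the original page; the helper name ''package_version'' is hypothetical, and it assumes the package exposes a ''%%__version__%%'' attribute (it falls back to "unknown" otherwise).

```python
import importlib


def package_version(name):
    """Return the package's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        # The package is not on the module search path for this interpreter.
        return None
    # Not every package defines __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")
```

Calling ''package_version("tensorflow")'' in the interpreter started above returns ''None'' if the required modules were not loaded, rather than raising a traceback.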
 +
+The tensorflow package is installed on Razor in ''/share/apps/opt/rh/python27/root/usr/lib/python2.7/site-packages/tensorflow''.  The installation contains a few example models: ''image/alexnet image/cifar10 image/imagenet image/mnist embedding''.
+
+We will use the image/mnist training model to run a training session.
+
+<code>
+tres0118:pwolinsk:$ python -m tensorflow.models.image.mnist.convolutional
+Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
+Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
+Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
+Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
+Extracting data/train-images-idx3-ubyte.gz
+Extracting data/train-labels-idx1-ubyte.gz
+Extracting data/t10k-images-idx3-ubyte.gz
+Extracting data/t10k-labels-idx1-ubyte.gz
+Initialized!
+Step 0 (epoch 0.00), 5.1 ms
+Minibatch loss: 12.054, learning rate: 0.010000
+Minibatch error: 90.6%
+Validation error: 84.6%
+Step 100 (epoch 0.12), 203.7 ms
+Minibatch loss: 3.282, learning rate: 0.010000
+Minibatch error: 6.2%
+Validation error: 7.1%
+...
+</code>
 +
+The ''-m'' option instructs Python to search its module search path (''sys.path'') for the named module and run it as a script.  Alternatively, you can specify the full path to the convolutional.py script:
+
+<code>
+python /share/apps/opt/rh/python27/root/usr/lib/python2.7/site-packages/tensorflow/models/image/mnist/convolutional.py
+</code>
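To see which file ''-m'' would pick up for a given module name, the module can be imported and its ''%%__file__%%'' attribute inspected. This is a small sketch under the assumption that importing the module has no unwanted side effects; ''module_file'' is a hypothetical helper name, not something provided by the installation.

```python
import importlib


def module_file(name):
    """Import the named module and return the file it was loaded from.

    Built-in modules have no backing file, so None is returned for them.
    """
    module = importlib.import_module(name)
    return getattr(module, "__file__", None)
```

For example, ''module_file("tensorflow.models.image.mnist.convolutional")'' should report a path under the site-packages directory shown above if the same installation is on the search path.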
 +
+The **cifar10** tensorflow example has been tested on CPU and GPU.  To get the example set:
+<code>
+git clone https://github.com/tensorflow/models
+cd models/tutorials/image/cifar10
+vi cifar10.py
+module load gcc/5.2.1 mkl/16.0.1 python/2.7.11  java/sunjdk_1.8.0 cuda/8.0
+</code>
+Edit cifar10.py to change ''/tmp/'' to an appropriate scratch directory such as ''/local_scratch/rfeynman/''.  We use ''CUDA_VISIBLE_DEVICES'' to expose 0, 1, or 4 GPUs to the training script.  Throughput improves only modestly from CPU (about 770 examples/sec) to 1 GPU (about 1280 examples/sec), and there is essentially no difference between 1 GPU and 4 GPUs (two dual-GPU K80 cards, about 1330 examples/sec).
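The device masking can also be scripted rather than exported by hand. The sketch below builds an environment for a child process with ''CUDA_VISIBLE_DEVICES'' set to a chosen list of device IDs; the helper ''env_with_devices'' and the labels in ''DEVICE_SETS'' are hypothetical names introduced only for illustration.

```python
import os


def env_with_devices(device_ids, base_env=None):
    """Return a copy of the environment exposing only the given CUDA devices.

    An empty list yields CUDA_VISIBLE_DEVICES="", which hides every GPU.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in device_ids)
    return env


# Hypothetical labels for the three configurations: CPU only, 1 GPU, 4 GPUs.
DEVICE_SETS = {"cpu": [], "1gpu": [0], "4gpu": [0, 1, 2, 3]}
```

Passing ''env=env_with_devices(DEVICE_SETS["1gpu"])'' to ''subprocess.call(["python", "cifar10_multi_gpu_train.py"], ...)'' would start a run restricted to device 0. The shell session below sets the variable by hand instead.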
+<code>
+models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES=""
+models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
+Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
+2017-03-17 14:13:46.533942: step 0, loss = 4.68 (29.7 examples/sec; 4.305 sec/batch)
+2017-03-17 14:13:49.055265: step 10, loss = 4.66 (781.8 examples/sec; 0.164 sec/batch)
+2017-03-17 14:13:50.697406: step 20, loss = 4.63 (771.2 examples/sec; 0.166 sec/batch)
+2017-03-17 14:13:52.340482: step 30, loss = 4.60 (771.3 examples/sec; 0.166 sec/batch)
+
+models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES="0"
+models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
+Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
+
+2017-03-21 15:41:42.767510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
+name: Tesla K80
+major: 3 minor: 7 memoryClockRate (GHz) 0.8235
+pciBusID 0000:04:00.0
+Total memory: 11.17GiB
+Free memory: 11.11GiB
+2017-03-21 15:41:42.767552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
+2017-03-21 15:41:42.767559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y
+2017-03-21 15:41:42.767571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0)
+2017-03-21 15:42:06.767827: step 0, loss = 4.68 (37.1 examples/sec; 3.448 sec/batch)
+2017-03-21 15:42:07.701603: step 10, loss = 4.67 (1370.8 examples/sec; 0.093 sec/batch)
+2017-03-21 15:42:08.750127: step 20, loss = 4.60 (1220.8 examples/sec; 0.105 sec/batch)
+2017-03-21 15:42:09.762612: step 30, loss = 4.61 (1264.2 examples/sec; 0.101 sec/batch)
+2017-03-21 15:42:10.769818: step 40, loss = 4.58 (1270.8 examples/sec; 0.101 sec/batch)
+2017-03-21 15:42:11.768493: step 50, loss = 4.53 (1281.7 examples/sec; 0.100 sec/batch)
+2017-03-21 15:42:12.769582: step 60, loss = 4.52 (1278.6 examples/sec; 0.100 sec/batch)
+2017-03-21 15:42:13.769733: step 70, loss = 4.54 (1279.8 examples/sec; 0.100 sec/batch)
+
+models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES="0,1,2,3"
+models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
+2017-03-21 15:43:37.866128: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
+name: Tesla K80
+major: 3 minor: 7 memoryClockRate (GHz) 0.8235
+pciBusID 0000:04:00.0
+Total memory: 11.17GiB
+Free memory: 11.11GiB
+2017-03-21 15:43:37.866246: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x364c2b0
+2017-03-21 15:43:38.104892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 1 with properties:
+name: Tesla K80
+major: 3 minor: 7 memoryClockRate (GHz) 0.8235
+pciBusID 0000:05:00.0
+Total memory: 11.17GiB
+Free memory: 11.11GiB
+2017-03-21 15:43:38.104995: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x36500f0
+2017-03-21 15:43:38.349437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 2 with properties:
+name: Tesla K80
+major: 3 minor: 7 memoryClockRate (GHz) 0.8235
+pciBusID 0000:84:00.0
+Total memory: 11.17GiB
+Free memory: 11.11GiB
+2017-03-21 15:43:38.349535: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x3653f60
+2017-03-21 15:43:38.600657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 3 with properties:
+name: Tesla K80
+major: 3 minor: 7 memoryClockRate (GHz) 0.8235
+pciBusID 0000:85:00.0
+Total memory: 11.17GiB
+Free memory: 11.11GiB
+2017-03-21 15:43:38.602403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 2 3
+2017-03-21 15:43:38.602412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y Y N N
+2017-03-21 15:43:38.602418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1:   Y Y N N
+2017-03-21 15:43:38.602423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 2:   N N Y Y
+2017-03-21 15:43:38.602428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 3:   N N Y Y
+2017-03-21 15:43:38.602445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0)
+2017-03-21 15:43:38.602453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:05:00.0)
+2017-03-21 15:43:38.602459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:84:00.0)
+2017-03-21 15:43:38.602464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:85:00.0)
+2017-03-21 15:43:54.086766: step 0, loss = 4.67 (47.9 examples/sec; 2.674 sec/batch)
+2017-03-21 15:43:55.013200: step 10, loss = 4.66 (1381.7 examples/sec; 0.093 sec/batch)
+2017-03-21 15:43:56.011015: step 20, loss = 4.65 (1282.8 examples/sec; 0.100 sec/batch)
+2017-03-21 15:43:56.967307: step 30, loss = 4.60 (1338.5 examples/sec; 0.096 sec/batch)
+2017-03-21 15:43:57.940303: step 40, loss = 4.57 (1315.5 examples/sec; 0.097 sec/batch)
+2017-03-21 15:43:58.902810: step 50, loss = 4.54 (1329.9 examples/sec; 0.096 sec/batch)
+2017-03-21 15:43:59.859618: step 60, loss = 4.48 (1337.8 examples/sec; 0.096 sec/batch)
+</code>
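The scaling claim can be checked directly from the logs above. The following sketch extracts the examples/sec figure from each training line, skips step 0 as warm-up (an assumption made here, since the first step includes start-up cost), and compares the mean throughput of the three runs. The names ''RATE'' and ''mean_rate'' are hypothetical; the numbers are copied from the log output, with timestamps omitted.

```python
import re

# Captures the step number and the examples/sec figure from a training log line.
RATE = re.compile(r"step (\d+), loss = [\d.]+ \(([\d.]+) examples/sec")


def mean_rate(log_text, warmup_steps=(0,)):
    """Mean examples/sec in log_text, skipping the listed warm-up steps."""
    rates = [float(rate) for step, rate in RATE.findall(log_text)
             if int(step) not in warmup_steps]
    if not rates:
        raise ValueError("no training steps found in log text")
    return sum(rates) / len(rates)


# Throughput lines copied from the CPU-only run.
CPU_LOG = """
step 0, loss = 4.68 (29.7 examples/sec; 4.305 sec/batch)
step 10, loss = 4.66 (781.8 examples/sec; 0.164 sec/batch)
step 20, loss = 4.63 (771.2 examples/sec; 0.166 sec/batch)
step 30, loss = 4.60 (771.3 examples/sec; 0.166 sec/batch)
"""

# Throughput lines copied from the run with CUDA_VISIBLE_DEVICES="0".
ONE_GPU_LOG = """
step 0, loss = 4.68 (37.1 examples/sec; 3.448 sec/batch)
step 10, loss = 4.67 (1370.8 examples/sec; 0.093 sec/batch)
step 20, loss = 4.60 (1220.8 examples/sec; 0.105 sec/batch)
step 30, loss = 4.61 (1264.2 examples/sec; 0.101 sec/batch)
step 40, loss = 4.58 (1270.8 examples/sec; 0.101 sec/batch)
step 50, loss = 4.53 (1281.7 examples/sec; 0.100 sec/batch)
step 60, loss = 4.52 (1278.6 examples/sec; 0.100 sec/batch)
step 70, loss = 4.54 (1279.8 examples/sec; 0.100 sec/batch)
"""

# Throughput lines copied from the run with CUDA_VISIBLE_DEVICES="0,1,2,3".
FOUR_GPU_LOG = """
step 0, loss = 4.67 (47.9 examples/sec; 2.674 sec/batch)
step 10, loss = 4.66 (1381.7 examples/sec; 0.093 sec/batch)
step 20, loss = 4.65 (1282.8 examples/sec; 0.100 sec/batch)
step 30, loss = 4.60 (1338.5 examples/sec; 0.096 sec/batch)
step 40, loss = 4.57 (1315.5 examples/sec; 0.097 sec/batch)
step 50, loss = 4.54 (1329.9 examples/sec; 0.096 sec/batch)
step 60, loss = 4.48 (1337.8 examples/sec; 0.096 sec/batch)
"""

cpu = mean_rate(CPU_LOG)
one_gpu = mean_rate(ONE_GPU_LOG)
four_gpu = mean_rate(FOUR_GPU_LOG)

print("1 GPU vs CPU:    {:.2f}x".format(one_gpu / cpu))
print("4 GPUs vs 1 GPU: {:.2f}x".format(four_gpu / one_gpu))
```

On these short samples the ratios come out at roughly 1.65x and 1.04x. One possible factor, which this page does not confirm: the upstream ''cifar10_multi_gpu_train.py'' accepts a ''%%--num_gpus%%'' flag that defaults to 1, and the commands above do not pass it, so the four-device run may have trained on a single GPU.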