tensorflow

tensorflow [2016/08/18 19:12]
pwolinsk
tensorflow [2017/03/21 20:55] (current)
root
==== TensorFlow ====
  
TensorFlow is an open source deep learning software library for numerical computation using data flow graphs. Detailed information about the software is available on the project website:
  
https://www.tensorflow.org/
  
The library is available as a Python package. The CPU version is installed for **python/2.7.5** on both clusters and requires three additional dependencies: **gcc/4.9.1 mkl/16.0.1 java/sunjdk_1.8.0**. The GPU version is installed for **python/2.7.11** on Razor and requires **gcc/4.9.1 mkl/16.0.1 java/sunjdk_1.8.0 cuda/8.0** in addition to **python/2.7.11**.
  
<code>
...
</code>
  
The tensorflow package is installed on Razor in ''/share/apps/opt/rh/python27/root/usr/lib/python2.7/site-packages/tensorflow''. The installation contains a few example models: ''image/alexnet'', ''image/cifar10'', ''image/imagenet'', ''image/mnist'', and ''embedding''.

We will use the ''image/mnist'' model to run a training session.
  
<code>
tres0118:pwolinsk:$ python -m tensorflow.models.image.mnist.convolutional
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
Initialized!
Step 0 (epoch 0.00), 5.1 ms
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
Validation error: 84.6%
Step 100 (epoch 0.12), 203.7 ms
Minibatch loss: 3.282, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 7.1%
...
</code>

The ''-m'' option instructs Python to search its module search path (''sys.path'') for the named module and run it as a script. You could also specify the full path to the ''convolutional.py'' script:

<code>
python /share/apps/opt/rh/python27/root/usr/lib/python2.7/site-packages/tensorflow/models/image/mnist/convolutional.py
</code>
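
To make the ''-m'' behaviour concrete, here is a minimal sketch using the standard library's ''runpy'' module, which is what Python uses to implement ''-m''. It uses the standard-library module ''colorsys'' purely as a stand-in (an assumption for illustration), since TensorFlow may not be importable where you try this. Note one difference: ''python -m'' runs the module under the name ''%%__main__%%'', whereas ''run_module'' with its default arguments does not.

```python
import runpy

# Look up a module by name on sys.path and execute its code,
# much as `python -m <name>` does. "colorsys" is only a stand-in
# for tensorflow.models.image.mnist.convolutional.
module_globals = runpy.run_module("colorsys")

# The file that was located and executed:
print(module_globals["__file__"])
```

The printed path is the file you would otherwise have to pass to ''python'' by its full path.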

The **cifar10** TensorFlow example has been tested on CPU and GPU. To get the example set:
<code>
git clone https://github.com/tensorflow/models
cd models/tutorials/image/cifar10
vi cifar10.py
module load gcc/5.2.1 mkl/16.0.1 python/2.7.11 java/sunjdk_1.8.0 cuda/8.0
</code>
Edit ''cifar10.py'' to change ''/tmp/'' to an appropriate scratch directory such as ''/local_scratch/rfeynman/''. We set ''CUDA_VISIBLE_DEVICES'' to expose 0, 1, or 4 GPUs to TensorFlow. Throughput scales only modestly from CPU to one GPU to four GPUs (two dual-GPU K80 cards): in the runs below, about 770 examples/sec on CPU, about 1280 on one GPU, and about 1330 on four. There is essentially no difference between one GPU and four.
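
The same device selection can be made from inside a Python script instead of with ''export''. The sketch below is illustrative only: ''visible_gpu_indices'' is a hypothetical helper (not part of TensorFlow or CUDA) and it handles only the comma-separated integer form of the variable used on this page.

```python
import os

def visible_gpu_indices(value):
    """Parse a CUDA_VISIBLE_DEVICES-style string of comma-separated
    integer indices. Hypothetical helper; an empty string means no GPUs."""
    return [int(token) for token in value.split(",") if token.strip()]

# Set the variable before the CUDA runtime is initialized,
# i.e. before TensorFlow is imported in the same process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

print(visible_gpu_indices(os.environ["CUDA_VISIBLE_DEVICES"]))
```

An empty value hides all GPUs from the process, which is how the CPU-only run is produced.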
<code>
models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES=""
models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2017-03-17 14:13:46.533942: step 0, loss = 4.68 (29.7 examples/sec; 4.305 sec/batch)
2017-03-17 14:13:49.055265: step 10, loss = 4.66 (781.8 examples/sec; 0.164 sec/batch)
2017-03-17 14:13:50.697406: step 20, loss = 4.63 (771.2 examples/sec; 0.166 sec/batch)
2017-03-17 14:13:52.340482: step 30, loss = 4.60 (771.3 examples/sec; 0.166 sec/batch)

models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES="0"
models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.

2017-03-21 15:41:42.767510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:04:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:41:42.767552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-03-21 15:41:42.767559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y
2017-03-21 15:41:42.767571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0)
2017-03-21 15:42:06.767827: step 0, loss = 4.68 (37.1 examples/sec; 3.448 sec/batch)
2017-03-21 15:42:07.701603: step 10, loss = 4.67 (1370.8 examples/sec; 0.093 sec/batch)
2017-03-21 15:42:08.750127: step 20, loss = 4.60 (1220.8 examples/sec; 0.105 sec/batch)
2017-03-21 15:42:09.762612: step 30, loss = 4.61 (1264.2 examples/sec; 0.101 sec/batch)
2017-03-21 15:42:10.769818: step 40, loss = 4.58 (1270.8 examples/sec; 0.101 sec/batch)
2017-03-21 15:42:11.768493: step 50, loss = 4.53 (1281.7 examples/sec; 0.100 sec/batch)
2017-03-21 15:42:12.769582: step 60, loss = 4.52 (1278.6 examples/sec; 0.100 sec/batch)
2017-03-21 15:42:13.769733: step 70, loss = 4.54 (1279.8 examples/sec; 0.100 sec/batch)

models/tutorials/image/cifar10$ export CUDA_VISIBLE_DEVICES="0,1,2,3"
models/tutorials/image/cifar10$ python cifar10_multi_gpu_train.py
2017-03-21 15:43:37.866128: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:04:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:43:37.866246: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x364c2b0
2017-03-21 15:43:38.104892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 1 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:05:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:43:38.104995: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x36500f0
2017-03-21 15:43:38.349437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 2 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:84:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:43:38.349535: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x3653f60
2017-03-21 15:43:38.600657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 3 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:85:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-03-21 15:43:38.602403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 2 3
2017-03-21 15:43:38.602412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y Y N N
2017-03-21 15:43:38.602418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1:   Y Y N N
2017-03-21 15:43:38.602423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 2:   N N Y Y
2017-03-21 15:43:38.602428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 3:   N N Y Y
2017-03-21 15:43:38.602445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:04:00.0)
2017-03-21 15:43:38.602453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:05:00.0)
2017-03-21 15:43:38.602459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:84:00.0)
2017-03-21 15:43:38.602464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:85:00.0)
2017-03-21 15:43:54.086766: step 0, loss = 4.67 (47.9 examples/sec; 2.674 sec/batch)
2017-03-21 15:43:55.013200: step 10, loss = 4.66 (1381.7 examples/sec; 0.093 sec/batch)
2017-03-21 15:43:56.011015: step 20, loss = 4.65 (1282.8 examples/sec; 0.100 sec/batch)
2017-03-21 15:43:56.967307: step 30, loss = 4.60 (1338.5 examples/sec; 0.096 sec/batch)
2017-03-21 15:43:57.940303: step 40, loss = 4.57 (1315.5 examples/sec; 0.097 sec/batch)
2017-03-21 15:43:58.902810: step 50, loss = 4.54 (1329.9 examples/sec; 0.096 sec/batch)
2017-03-21 15:43:59.859618: step 60, loss = 4.48 (1337.8 examples/sec; 0.096 sec/batch)
</code>
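
The scaling comparison can be quantified directly from the step lines above. The following is a rough sketch, not part of the cifar10 example: ''mean_throughput'' and the regular expression are illustrative names, the log excerpts are copied from the three runs shown, and step 0 is skipped on the assumption that it mostly measures start-up cost.

```python
import re

# Matches e.g. "step 10, loss = 4.66 (781.8 examples/sec; 0.164 sec/batch)"
STEP_RE = re.compile(r"step (\d+), loss = [\d.]+ \(([\d.]+) examples/sec")

def mean_throughput(log_text, skip_step0=True):
    """Average examples/sec over the step lines found in a training log."""
    rates = [float(rate) for step, rate in STEP_RE.findall(log_text)
             if not (skip_step0 and step == "0")]
    return sum(rates) / len(rates)

CPU_LOG = """
step 0, loss = 4.68 (29.7 examples/sec; 4.305 sec/batch)
step 10, loss = 4.66 (781.8 examples/sec; 0.164 sec/batch)
step 20, loss = 4.63 (771.2 examples/sec; 0.166 sec/batch)
step 30, loss = 4.60 (771.3 examples/sec; 0.166 sec/batch)
"""
ONE_GPU_LOG = """
step 0, loss = 4.68 (37.1 examples/sec; 3.448 sec/batch)
step 10, loss = 4.67 (1370.8 examples/sec; 0.093 sec/batch)
step 20, loss = 4.60 (1220.8 examples/sec; 0.105 sec/batch)
step 30, loss = 4.61 (1264.2 examples/sec; 0.101 sec/batch)
step 40, loss = 4.58 (1270.8 examples/sec; 0.101 sec/batch)
step 50, loss = 4.53 (1281.7 examples/sec; 0.100 sec/batch)
step 60, loss = 4.52 (1278.6 examples/sec; 0.100 sec/batch)
step 70, loss = 4.54 (1279.8 examples/sec; 0.100 sec/batch)
"""
FOUR_GPU_LOG = """
step 0, loss = 4.67 (47.9 examples/sec; 2.674 sec/batch)
step 10, loss = 4.66 (1381.7 examples/sec; 0.093 sec/batch)
step 20, loss = 4.65 (1282.8 examples/sec; 0.100 sec/batch)
step 30, loss = 4.60 (1338.5 examples/sec; 0.096 sec/batch)
step 40, loss = 4.57 (1315.5 examples/sec; 0.097 sec/batch)
step 50, loss = 4.54 (1329.9 examples/sec; 0.096 sec/batch)
step 60, loss = 4.48 (1337.8 examples/sec; 0.096 sec/batch)
"""

cpu = mean_throughput(CPU_LOG)
one_gpu = mean_throughput(ONE_GPU_LOG)
four_gpu = mean_throughput(FOUR_GPU_LOG)

print("CPU:    %.1f examples/sec" % cpu)
print("1 GPU:  %.1f examples/sec (%.2fx CPU)" % (one_gpu, one_gpu / cpu))
print("4 GPUs: %.1f examples/sec (%.2fx 1 GPU)" % (four_gpu, four_gpu / one_gpu))
```

On these short excerpts the averages work out to a speedup of roughly 1.65x from CPU to one GPU and roughly 1.04x from one GPU to four.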
tensorflow.1471547573.txt.gz · Last modified: 2016/08/18 19:12 by pwolinsk