Singularity (http://singularity.lbl.gov/) is a software container system. It lets users build and run entire scientific workflows, software, and libraries on a specific distribution and version of Linux, all packaged into a single image file. It is based on the Linux "chroot" command, which lets users switch the environment from the operating system installed on the host node to the one inside the Singularity image file.
Many pre-built container images are available for download from the Singularity Hub (https://singularity-hub.org/) and Docker Hub (https://hub.docker.com/) repositories.
Start an interactive job and load the singularity module:
    razor-l2:pwolinsk:$ qsub -I -q tiny12core -l walltime=1:00:00 -l nodes=1:ppn=12
    qsub: waiting for job 3608596.sched to start
    qsub: job 3608596.sched ready
    compute1144:pwolinsk:$ module load singularity
Open a shell inside a prebuilt container stored locally on Razor at /share/apps/singularity/images/hello-world.simg:
    compute1144:pwolinsk:$ cat /etc/issue
    CentOS release 6.8 (Final)
    Kernel \r on an \m
    compute1144:pwolinsk:$ singularity shell /share/apps/singularity/images/hello-world.simg
    Singularity: Invoking an interactive shell within container...
    Singularity hello-world.simg:~> cat /etc/issue
    Ubuntu 14.04.5 LTS \n \l
    Singularity hello-world.simg:~> exit
    exit
    compute1144:pwolinsk:$
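The same host-versus-container comparison can be scripted without an interactive shell. The sketch below assumes `singularity exec` is used to run `cat /etc/issue` inside the image. `issue_name` is a hypothetical helper, not part of Singularity; it keeps the first line of the file and strips the getty escape sequences (such as `\n` and `\l`).

```shell
#!/bin/sh
# issue_name: hypothetical helper. Reads /etc/issue-style text on stdin and
# prints the first line without getty escapes (\n, \l, \r, \m) or trailing blanks.
issue_name() {
    head -n 1 | sed -e 's/\\[A-Za-z]//g' -e 's/[[:space:]]*$//'
}

# Image path taken from the example above.
image=/share/apps/singularity/images/hello-world.simg

# Run the comparison only where singularity and the image are available,
# for example on a compute node after "module load singularity".
if command -v singularity >/dev/null 2>&1 && [ -f "$image" ]; then
    echo "host:      $(issue_name < /etc/issue)"
    echo "container: $(singularity exec "$image" cat /etc/issue | issue_name)"
fi
```

On a machine without Singularity the script prints nothing, so it is safe to try on a login node first.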
Open a shell inside a container stored on Singularity Hub (shub://vsoch/hello-world):
    compute1144:pwolinsk:$ singularity shell shub://vsoch/hello-world
    Progress |===================================| 100.0%
    Singularity: Invoking an interactive shell within container...
    Singularity vsoch-hello-world-master.simg:~> cat /etc/issue
    Ubuntu 14.04.5 LTS \n \l
    Singularity vsoch-hello-world-master.simg:~>
Open a shell inside a container pulled from the Docker repository, and bind the /scratch directory on Razor to the /mnt directory inside the container:
    compute1144:pwolinsk:$ ls /scratch |wc -l
    5535
    compute1144:pwolinsk:$ singularity shell --bind /scratch:/mnt docker://ubuntu
    Docker image path: index.docker.io/library/ubuntu:latest
    Cache folder set to /gpfs_home/pwolinsk/.singularity/docker
    [5/5] |===================================| 100.0%
    Creating container runtime...
    Singularity: Invoking an interactive shell within container...
    Singularity ubuntu:~> cat /etc/issue
    Ubuntu 16.04.3 LTS \n \l
    Singularity ubuntu:~> ls /mnt |wc -l
    5535
    Singularity ubuntu:~>
By default, the container binds the user's $HOME directory, /tmp, and the current working directory from the host to the equivalent directories inside the container. You can bind additional directories with the --bind <localdir>:<containerdir> syntax.
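Several directories can be bound in one invocation. The sketch below assumes that `--bind` accepts a comma-separated list of `<localdir>:<containerdir>` pairs and that the `SINGULARITY_BINDPATH` environment variable is honored by the installed version. `join_binds` is a hypothetical helper, and `/storage:/data` is a made-up pair used only for illustration.

```shell
#!/bin/sh
# join_binds: hypothetical helper that joins <localdir>:<containerdir> pairs
# into a single comma-separated value for --bind.
join_binds() {
    (IFS=,; printf '%s\n' "$*")
}

# /scratch:/mnt comes from the example above; /storage:/data is a made-up pair.
binds=$(join_binds /scratch:/mnt /storage:/data)

# Print the resulting command instead of running it, so the sketch is safe to try anywhere.
echo "singularity shell --bind $binds docker://ubuntu"

# Alternative (assumption): export the list once and omit --bind on later commands.
# export SINGULARITY_BINDPATH="$binds"
```

Depending on the Singularity version and configuration, the target directory may need to exist inside the image already, which is why the example above binds to /mnt.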
Download the example models from the git repository:
    compute1144:pwolinsk:$ git clone https://github.com/tensorflow/models
    Initialized empty Git repository in /gpfs_home/pwolinsk/models/.git/
    remote: Counting objects: 9158, done.
    remote: Compressing objects: 100% (2/2), done.
    remote: Total 9158 (delta 0), reused 0 (delta 0), pack-reused 9156
    Receiving objects: 100% (9158/9158), 293.18 MiB | 32.15 MiB/s, done.
    Resolving deltas: 100% (5162/5162), done.
    compute1144:pwolinsk:$ cd models/tutorials/image/mnist/
Start a shell in the prebuilt Singularity container from within the directory containing the Python training script:
    compute1144:pwolinsk:/models/tutorials/image/mnist$ singularity shell /share/apps/singularity/images/ubuntu-tensorflow-1.4.simg
    Singularity: Invoking an interactive shell within container...
    Singularity ubuntu-tensorflow-1.4.simg:~/models/tutorials/image/mnist> python convolutional.py
    Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
    Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
    Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
    Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
    Extracting data/train-images-idx3-ubyte.gz
    Extracting data/train-labels-idx1-ubyte.gz
    Extracting data/t10k-images-idx3-ubyte.gz
    Extracting data/t10k-labels-idx1-ubyte.gz
    2017-12-01 15:05:15.992688: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2
    Initialized!
    Step 0 (epoch 0.00), 2.4 ms
    Minibatch loss: 8.334, learning rate: 0.010000
    Minibatch error: 85.9%
    Validation error: 84.6%
    ...
Alternatively, instead of starting a shell inside the container, use the exec command to run the script inside the container directly:
    compute1144:pwolinsk:/models/tutorials/image/mnist$ singularity exec /share/apps/singularity/images/ubuntu-tensorflow-1.4.simg python convolutional.py
    Extracting data/train-images-idx3-ubyte.gz
    Extracting data/train-labels-idx1-ubyte.gz
    Extracting data/t10k-images-idx3-ubyte.gz
    Extracting data/t10k-labels-idx1-ubyte.gz
    2017-12-01 15:07:56.905035: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2
    Initialized!
    Step 0 (epoch 0.00), 2.2 ms
    Minibatch loss: 8.334, learning rate: 0.010000
    Minibatch error: 85.9%
    Validation error: 84.6%
    ...
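Because exec needs no interactive shell, the same command can also be placed in a batch job. The following is a minimal sketch rather than a site-approved template: it reuses the queue and resource request from the interactive qsub above, and it assumes that the models repository was cloned into $HOME/models and that the `module` command is available in batch jobs. The script name `mnist_singularity.pbs` and the job name are hypothetical.

```shell
#!/bin/sh
# Write a PBS job script that runs the MNIST example with "singularity exec".
# The queue, walltime, and node request mirror the interactive qsub above;
# the script name, job name, and $HOME/models location are assumptions.
cat > mnist_singularity.pbs <<'EOF'
#!/bin/bash
#PBS -N mnist-singularity
#PBS -q tiny12core
#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=12

module load singularity
cd "$HOME/models/tutorials/image/mnist"
singularity exec /share/apps/singularity/images/ubuntu-tensorflow-1.4.simg python convolutional.py
EOF

echo "wrote mnist_singularity.pbs; submit it with: qsub mnist_singularity.pbs"
```

The heredoc delimiter is quoted ('EOF') so that $HOME is expanded when the job runs on the compute node, not when the script file is written.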
Start an interactive job on a GPU node:
    razor-l1:pwolinsk:$ qsub -I -q gpu16core
    qsub: waiting for job 3927490.sched to start
    qsub: job 3927490.sched ready
    Currently Loaded Modulefiles:
      1) os/el6
    compute0805:pwolinsk:$
Clone the TensorFlow example models:
    compute0805:pwolinsk:$ git clone https://github.com/tensorflow/models
    Initialized empty Git repository in /home/pwolinsk/models/.git/
    remote: Counting objects: 12884, done.
    remote: Compressing objects: 100% (6/6), done.
    remote: Total 12884 (delta 2), reused 3 (delta 2), pack-reused 12876
    Receiving objects: 100% (12884/12884), 412.34 MiB | 27.24 MiB/s, done.
    Resolving deltas: 100% (7276/7276), done.
Load the singularity module and start a shell within the Docker container. The --nv option makes the host's NVIDIA driver libraries available inside the container:
    compute0805:pwolinsk:$ module load singularity
    compute0805:pwolinsk:$ singularity shell --nv /share/apps/singularity/images/nvidia-tensorflow\:18.01-py2-ahpcc.simg
    Singularity: Invoking an interactive shell within container...
    Singularity nvidia-tensorflow:18.01-py2-ahpcc.simg:~> python ~/models/tutorials/image/mnist/convolutional.py
    Extracting data/train-images-idx3-ubyte.gz
    Extracting data/train-labels-idx1-ubyte.gz
    Extracting data/t10k-images-idx3-ubyte.gz
    Extracting data/t10k-labels-idx1-ubyte.gz
    2018-03-06 20:49:48.176645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
    name: Tesla K40c major: 3 minor: 5 memoryClockRate(GHz): 0.745
    pciBusID: 0000:81:00.0
    totalMemory: 11.17GiB freeMemory: 11.09GiB
    2018-03-06 20:49:48.421609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
    name: Tesla K40c major: 3 minor: 5 memoryClockRate(GHz): 0.745
    pciBusID: 0000:82:00.0
    totalMemory: 11.17GiB freeMemory: 11.09GiB
    2018-03-06 20:49:48.421905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
    2018-03-06 20:49:48.421966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
    2018-03-06 20:49:48.421981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0: Y Y
    2018-03-06 20:49:48.421989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1: Y Y
    2018-03-06 20:49:48.422010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1093] Ignoring visible gpu device (device: 0, name: Tesla K40c, pci bus id: 0000:81:00.0, compute capability: 3.5) with Cuda compute capability 3.5. The minimum required Cuda capability is 5.2.
    2018-03-06 20:49:48.422033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1093] Ignoring visible gpu device (device: 1, name: Tesla K40c, pci bus id: 0000:82:00.0, compute capability: 3.5) with Cuda compute capability 3.5. The minimum required Cuda capability is 5.2.
    Initialized!
    Step 0 (epoch 0.00), 40.5 ms
    Minibatch loss: 8.334, learning rate: 0.010000
    Minibatch error: 85.9%
    Validation error: 84.6%
    Step 100 (epoch 0.12), 48.3 ms
    ....
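The log above contains "Ignoring visible gpu device" lines: TensorFlow reports that both Tesla K40c cards have compute capability 3.5, below the 5.2 minimum this build requires. A minimal sketch for spotting this in a saved log follows. The helper name `count_ignored_gpus` and the file name `train.log` are hypothetical, and the sketch assumes the output was captured first, for example with `python convolutional.py 2>&1 | tee train.log`.

```shell
#!/bin/sh
# count_ignored_gpus: hypothetical helper. Prints how many "Ignoring visible
# gpu device" lines appear in the log file given as $1.
count_ignored_gpus() {
    # grep -c exits non-zero when the count is 0, so neutralize the status.
    grep -c 'Ignoring visible gpu device' "$1" || true
}

# Log file to inspect; train.log is an assumed name.
log=${1:-train.log}

if [ -f "$log" ]; then
    n=$(count_ignored_gpus "$log")
    if [ "$n" -gt 0 ]; then
        echo "TensorFlow ignored $n GPU(s); check the 'minimum required Cuda capability' lines in $log"
    else
        echo "no ignored GPUs reported in $log"
    fi
fi
```

A non-zero count, as in the run above, may mean that the computation itself falls back to the CPU even though the devices are visible inside the container.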
While the TensorFlow job is running inside the Singularity container, ssh into the node and verify that the GPUs are in use:
    compute0805:pwolinsk:$ nvidia-smi
    Tue Mar 6 14:49:50 2018
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K40c          Off  | 00000000:81:00.0 Off |                    0 |
    | 24%   48C    P0    66W / 235W |     79MiB / 11441MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla K40c          Off  | 00000000:82:00.0 Off |                    0 |
    | 25%   50C    P0    62W / 235W |     79MiB / 11441MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0     12628      C   python                                        68MiB |
    |    1     12628      C   python                                        68MiB |
    +-----------------------------------------------------------------------------+
    compute0805:pwolinsk:$
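For scripted checks, nvidia-smi can also print selected fields as CSV instead of the full table. The sketch below assumes the `--query-gpu` and `--format=csv,noheader,nounits` options are supported by the installed driver. `gpu_summary` is a hypothetical helper that turns each CSV row into one readable line.

```shell
#!/bin/sh
# gpu_summary: hypothetical helper. Reads "index, utilization, memory" CSV rows
# on stdin and prints one readable line per GPU.
gpu_summary() {
    awk -F', *' '{ printf "GPU %s: %s%% utilization, %s MiB used\n", $1, $2, $3 }'
}

# Query the live GPUs only where nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
               --format=csv,noheader,nounits | gpu_summary
fi
```

Fed the values from the table above (0% utilization and 79 MiB used on each card), the helper would report that the python process holds memory on both GPUs while utilization was 0% at the moment of sampling.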