This page describes the NVIDIA GPU nodes on the Razor cluster and how to use them.

Hardware and Queues

There are nine NVIDIA GPU nodes. Five have dual Intel E5520 2.27 GHz CPUs, 12 GB main memory, and dual NVIDIA GTX 480 GPUs. One has dual Intel E5-2630v3 2.40 GHz CPUs, 64 GB main memory, and dual NVIDIA Tesla K40 GPUs. Three new condo nodes have dual Intel E5-2650v4 2.20 GHz CPUs, 128 GB main memory, and dual NVIDIA Tesla K80 GPUs.

The GTX 480 delivers 1.35 TFlops SP (single precision) and 0.17 TFlops DP (double precision), so it is meaningfully faster than a CPU only for single-precision workloads. The K40 delivers 4.3 TFlops SP and 1.4 TFlops DP, and has ECC memory for better computational reliability. The K80 delivers 4.25 TFlops SP and 1.45 TFlops DP, with ECC, on each of the four logical GPUs in a node. The GTX 480, K40, and K80 have NVIDIA Compute Capability 2.0, 3.5, and 3.7, respectively.

The five GTX 480 nodes are accessed through the gpu8core queue, which has a 72-hour time limit. The K40 node has a gpu16core queue with a 72-hour time limit and a run limit of one job. The shortgpu16core queue covers the K40 node and the three K80 nodes (which actually have 24 cores), with a 6-hour time limit and a run limit of one job.

Many programs that use GPUs assume a single GPU and/or do not properly use cudaSetDevice(), so the default mode for these nodes is half of a dual-GPU node, or a quarter of a K80 node, using one GPU. The GPUs are set to “Exclusive Process mode” (nvidia-smi -c 3) at boot time so that they work correctly with independent programs that assume a single GPU or have no provision for multiple GPUs. If two single-GPU programs are run on a dual-GPU node in “default mode” (nvidia-smi -c 0), they both share GPU 0 while GPU 1 sits idle. “Exclusive Process mode” usually works well. User accounts can run:

sudo /usr/bin/nvidia-smi {options}

If you change the compute mode with -c {0-2}, please set it back to -c 3 at the end of the job.
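One way to make sure the mode is put back is to wrap the program in a small shell function. The following is a minimal sketch, not a system-provided tool: the function name run_in_compute_mode and the NVIDIA_SMI override variable are invented for this example, and only the -c option and the sudo command line come from this page.

```shell
#!/bin/bash
# Command used to reach the driver; the default is the sudo form given on this page.
# NVIDIA_SMI is an illustrative variable, handy for dry runs (e.g. NVIDIA_SMI=echo).
NVIDIA_SMI="${NVIDIA_SMI:-sudo /usr/bin/nvidia-smi}"

# Hypothetical helper: switch the GPUs to a temporary compute mode, run a
# command, then put the GPUs back into Exclusive Process mode (-c 3).
run_in_compute_mode() {
    local mode=$1
    shift
    # Give up early if the mode change itself fails.
    $NVIDIA_SMI -c "$mode" || return 1
    # Run the job and remember its exit status without aborting the function.
    local status=0
    "$@" || status=$?
    # Restore the boot-time setting whether the job succeeded or not.
    $NVIDIA_SMI -c 3
    return "$status"
}
```

In a job script this could be called as run_in_compute_mode 0 ./my_program, where ./my_program stands in for your own executable. The function restores -c 3 only when the program exits on its own; a job killed by the scheduler would need a trap in addition.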

Most CUDA programs follow one of three programming models: (1) programs, often older ones, that assume a single GPU and ignore cudaSetDevice(), so devices=1; (2) programs that count the available GPUs and use them all, so devices=all; (3) MPI programs, which often match a single GPU with a single MPI process (which may also be CPU multi-threaded), so devices=MPI processes per node.
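As a rough illustration of the three models, the sketch below maps each one to a device count. The function name gpus_for_model and the labels single, all, and mpi are made up for this example and are not part of any system script.

```shell
#!/bin/bash
# Hypothetical helper: number of GPUs a program needs under each model.
#   $1 = model label: single | all | mpi   (labels invented for this sketch)
#   $2 = logical GPUs on the node (2 on GTX 480 and K40 nodes, 4 on K80 nodes)
#   $3 = MPI processes per node (only used for the mpi model)
gpus_for_model() {
    local model=$1 gpus_on_node=$2 mpi_per_node=${3:-0}
    case $model in
        single) echo 1 ;;                 # model 1: one GPU, whatever the node has
        all)    echo "$gpus_on_node" ;;   # model 2: every GPU on the node
        mpi)    echo "$mpi_per_node" ;;   # model 3: one GPU per MPI process
        *)      echo "unknown model: $model" >&2; return 1 ;;
    esac
}

gpus_for_model single 2   # older single-GPU code on a K40 node
gpus_for_model all 4      # code that grabs every GPU on a K80 node
gpus_for_model mpi 4 2    # two MPI processes per node on a K80 node
```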

Select your CPU cores accordingly to get a known number of GPUs. Users may select a half node (ppn=4 on 8-core nodes, ppn=8 on 16-core nodes), a quarter or half of a K80 node (ppn=6 or 12 on 24-core nodes), or a full node (ppn=the number of CPU cores). Selecting a full node requires one of the following conditions: (1) the job uses multiple GPUs; or (2) it uses more than half the main memory. Jobs that take up a full node without meeting one of these conditions are subject to cancellation. If your program takes the number of GPUs as an input, remember that each K80 in the condo nodes is essentially two K20X GPUs in a single package, so the operating system sees four GPUs from two K80s.
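The ppn values above all follow the same proportion: the share of cores requested equals the share of GPUs wanted. A small sketch of that arithmetic follows; the function name ppn_for_gpus is invented here, and the proportional rule is an assumption generalized from the listed values (this page names only quarter, half, and full selections).

```shell
#!/bin/bash
# Hypothetical helper: cores to request (ppn) for a given number of GPUs.
#   $1 = CPU cores on the node, $2 = logical GPUs on the node, $3 = GPUs wanted
ppn_for_gpus() {
    local cores=$1 gpus=$2 want=$3
    if [ "$want" -lt 1 ] || [ "$want" -gt "$gpus" ]; then
        echo "GPUs wanted must be between 1 and $gpus" >&2
        return 1
    fi
    # Request the same fraction of the cores as of the GPUs.
    echo $(( cores * want / gpus ))
}

ppn_for_gpus 8 2 1    # GTX 480 node, one GPU
ppn_for_gpus 16 2 1   # K40 node, one GPU
ppn_for_gpus 24 4 2   # K80 node, two of the four logical GPUs
```

The three calls correspond to the half-node and half-K80-node cases described above.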

There is a system script on each GPU node that detects the available unused GPU devices by calling nvidia-smi and parsing the output. The script can be called in backticks, as below, and its output assigned to the environment variable CUDA_VISIBLE_DEVICES to set the devices for your program:

export CUDA_VISIBLE_DEVICES=`/share/apps/bin/ 2`

The script takes a number from 1 to 4 to request that many devices, or no argument for the maximum available. Its output, saved in the environment variable, is a comma-separated string of currently unused devices with 1 to 4 elements.


If the above variable is set before running your CUDA program, the given devices will appear to the program as devices 0 to n-1.
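To make the renumbering concrete, the sketch below prints which physical GPU sits behind each program-visible device number for a given CUDA_VISIBLE_DEVICES string. The function name show_device_map is invented for this example, and the string "1,3" is an arbitrary sample value, not actual script output.

```shell
#!/bin/bash
# Hypothetical helper: list the mapping from program device numbers (0 to n-1)
# to the physical GPUs named in a comma-separated CUDA_VISIBLE_DEVICES string.
show_device_map() {
    local list=$1 i=0 dev
    local IFS=,              # split the list on commas inside this function only
    for dev in $list; do
        echo "program device $i -> physical GPU $dev"
        i=$(( i + 1 ))
    done
}

show_device_map "1,3"        # sample value: physical GPUs 1 and 3 are free
```

Devices are numbered in the order they appear in the list, so a program that always uses device 0 gets the first GPU listed.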

How do I tell how many GPUs my binary program uses?

Run it on a full node and log in to the GPU compute node in a second terminal. In the example below, photoscan uses both GPUs, as shown in the “Processes” table, but is not doing much at the moment, as shown by the near-idle power usage.

sudo /usr/bin/nvidia-smi
Fri Feb 10 16:17:30 2017       
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K40c          Off  | 0000:81:00.0     Off |                    0 |
| 31%   63C    P0    70W / 235W |   1091MiB / 11439MiB |      0%      Default |
|   1  Tesla K40c          Off  | 0000:82:00.0     Off |                    0 |
| 33%   71C    P0    65W / 235W |    905MiB / 11439MiB |      0%      Default |
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|    0     32738    C   /share/apps/photoscan-pro/1.2.6/photoscan     1091MiB |
|    1     32738    C   /share/apps/photoscan-pro/1.2.6/photoscan      905MiB |
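The same answer can be pulled out of the “Processes” table with a short filter. This is a sketch only: the function name gpus_per_pid is invented, and it assumes the row layout shown above for driver 367.48 (GPU, PID, Type, process name, memory usage); other driver versions may lay the table out differently.

```shell
#!/bin/bash
# Hypothetical helper: read nvidia-smi output on stdin and print each PID in
# the "Processes" table followed by the number of distinct GPUs it is using.
gpus_per_pid() {
    awk '
        # Process rows look like: | GPU PID Type name usage |
        $1 == "|" && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ &&
        ($4 == "C" || $4 == "G" || $4 == "C+G") {
            # Count each (PID, GPU) pair once.
            if (!seen[$3, $2]++) count[$3]++
        }
        END { for (pid in count) print pid, count[pid] }
    '
}
```

On a GPU node it could be used as sudo /usr/bin/nvidia-smi | gpus_per_pid. For the output above it would report PID 32738 with 2 GPUs.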


The cuda/7.5 module is the default CUDA module; the newer 8.0 and the older 2.2, 2.3, 3.2, 4.1, 4.2, 5.0, and 5.5 are also available. In most cases you will want the latest version. Software is installed in common areas and should not depend on hardware, so you do not need to be logged into a GPU node to compile, with the exception of the node-locked Portland Group (PGI) compiler. When using nvcc, it usually works best to load a relatively modern GNU compiler module alongside CUDA (such as gcc/4.9.1).


The free academic development version of the PGI compiler is installed on the node controlled by the gpu16core/shortgpu16core queues. Please follow the License Conditions. The usage example below, from the May 2016 NVIDIA workshop, shows OpenMP versions running on the CPU and a (much faster) OpenACC version running on the GPU. The PGI development compiler is limited to 4 OpenMP threads.

razor-l3$ qsub -I -q shortgpu16core -l nodes=1:ppn=8 -l walltime=6:00:00
compute0805$ module purge
compute0805$ module load PGI/2016
compute0805$ cd src
compute0805$ tar zxf /share/apps/cuda/class/workshop/openacc_CUNY.tgz
compute0805$ cd openacc-workshop/exercises/004-laplace2D-profiling
compute0805$ sed -i 's/1000;/200;/' laplace2d.c
compute0805$ make
compute0805$ export OMP_NUM_THREADS=1
compute0805$ ./laplace2d_omp
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
 total: 18.076966 s
compute0805$ export OMP_NUM_THREADS=4
compute0805$ ./laplace2d_omp
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
 total: 7.715069 s
compute0805$ ./laplace2d_acc
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
 total: 1.175254 s
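To put numbers on "much faster", the speedups can be computed directly from the three timings in the session. The helper name speedup is invented for this sketch; the timings are the ones printed above.

```shell
#!/bin/bash
# Hypothetical helper: speedup of a run relative to a baseline time,
# printed with two decimal places. $1 = baseline seconds, $2 = run seconds.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.2f\n", base / t }'
}

speedup 18.076966 7.715069   # 1 OpenMP thread vs 4 OpenMP threads
speedup 18.076966 1.175254   # 1 OpenMP thread vs OpenACC on the GPU
```

For this run that works out to roughly 2.3x for four OpenMP threads and roughly 15.4x for the OpenACC version, both relative to a single CPU thread.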
gpu.txt · Last modified: 2020/09/21 21:11 by root