AHPCC GPU Nodes

This page describes the GPU nodes on the Razor cluster and how to use them.

Hardware and Queues

There are nine NVidia GPU nodes. Five have dual Intel E5520 2.27 GHz CPUs, 12 GB main memory, and dual NVidia GTX 480 GPUs. One has dual Intel E5-2630v3 2.40 GHz CPUs, 64 GB main memory, and dual NVidia K40 GPUs. Three newer condo nodes have dual Intel E5-2650v4 2.20 GHz CPUs, 128 GB main memory, and dual NVidia K80 GPUs.

The GTX 480 delivers 1.35 TFlops single precision (SP) and 0.17 TFlops double precision (DP), so it is only meaningfully faster than a CPU for single-precision workloads. The K40 delivers 4.3 TFlops SP and 1.4 TFlops DP, and also has ECC memory for better computational reliability. The K80 delivers 8.5 TFlops SP and 2.9 TFlops DP, also with ECC. The GTX 480, K40, and K80 have NVidia compute capability 2.0, 3.5, and 3.7 respectively.

The five GTX 480 nodes are accessed through the gpu8core queue with a 72-hour time limit. The K40 node has a gpu16core queue with a 72-hour time limit and a run limit of one job. There is a shortgpu16core queue for the four K40 or K80 nodes with a 6-hour time limit and a run limit of one job.
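For example, a half-node interactive session on one of the GTX 480 nodes (ppn=4, half of the 8 cores, as described below) can be requested with a qsub line like the following; the queue name and walltime cap are the ones listed above.

razor-l3$ qsub -I -q gpu8core -l nodes=1:ppn=4 -l walltime=72:00:00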

Most programs that use GPUs assume a single GPU and/or do not properly call cudaSetDevice(), so the default allocation on these dual-GPU nodes is half a node using one GPU. The GPUs are set to “Exclusive Process” mode (nvidia-smi -c 3) at boot time so that they work correctly with independent programs that assume a single GPU or have no provision for multiple GPUs. If two single-GPU programs are run on a dual-GPU node in “Default” mode (nvidia-smi -c 0), they both share GPU 0 while GPU 1 sits idle.
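For programs written to use multiple GPUs explicitly, the CUDA runtime provides cudaGetDeviceCount() and cudaSetDevice(). The following is a minimal illustrative sketch (the file name multi_gpu.cu is hypothetical, not an AHPCC-provided example) that enumerates the visible devices and directs work to each in turn:

/* multi_gpu.cu -- hypothetical sketch; compile with: nvcc multi_gpu.cu -o multi_gpu */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);          /* number of GPUs the OS exposes */
    printf("visible GPUs: %d\n", count);
    for (int d = 0; d < count; d++) {
        cudaSetDevice(d);                /* direct subsequent CUDA calls to GPU d */
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("GPU %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}

Note that in Exclusive Process mode a second process trying to create a context on an already-claimed GPU will fail, which is exactly what protects independent single-GPU jobs from colliding.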

Users may select either half a node (ppn=4 on 8-core nodes, ppn=8 on 16-core nodes) or a full node. One of the following conditions is required to select a full node: (1) using multiple GPUs, or (2) using more than half the main memory. Jobs that take up a full node without meeting one of these conditions are subject to cancellation. If your program takes the number of GPUs as an input, remember that each K80 in the condo nodes is essentially two K20X GPUs in a single package, so the operating system sees four GPUs on a node with two K80s.
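To verify how many devices the operating system sees on the node you landed on, nvidia-smi can list them; on a condo node with two K80s this should list four GPUs.

nvidia-smi -L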

To use a full node, select ppn equal to the number of CPU cores in the node, and in your job script or interactive session, set Default mode before your job and reset Exclusive Process mode after the job. You can also run nvidia-smi with no arguments to check the current compute mode.

sudo /usr/bin/nvidia-smi -c 0
[run your gpu program]
sudo /usr/bin/nvidia-smi -c 3
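Put together in a batch script, the full-node pattern might look like the following sketch for the K40 node (PBS directives follow the queue limits above; my_dual_gpu_program is a placeholder for your actual executable):

#!/bin/bash
#PBS -q gpu16core
#PBS -l nodes=1:ppn=16
#PBS -l walltime=72:00:00
cd $PBS_O_WORKDIR
sudo /usr/bin/nvidia-smi -c 0    # Default mode: both GPUs usable by one program
./my_dual_gpu_program            # placeholder for your dual-GPU executable
sudo /usr/bin/nvidia-smi -c 3    # restore Exclusive Process mode for the next job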
How do I tell how many GPUs my commercial program uses?

Run it on a full node and log in to the compute node in a second terminal. Here photoscan uses both GPUs, but is not doing much at this moment, as shown by the low power usage.

sudo /usr/bin/nvidia-smi
Fri Feb 10 16:17:30 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 0000:81:00.0     Off |                    0 |
| 31%   63C    P0    70W / 235W |   1091MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 0000:82:00.0     Off |                    0 |
| 33%   71C    P0    65W / 235W |    905MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     32738    C   /share/apps/photoscan-pro/1.2.6/photoscan     1091MiB |
|    1     32738    C   /share/apps/photoscan-pro/1.2.6/photoscan      905MiB |
+-----------------------------------------------------------------------------+
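To watch utilization change while the job runs, the standard watch utility can rerun nvidia-smi periodically (read-only queries normally work without sudo):

watch -n 5 nvidia-smi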

Software

The cuda/7.5 module is the default CUDA module, with newer 8.0 and older 2.2, 2.3, 3.2, 4.1, 4.2, 5.0, and 5.5 also available. In most cases you will want the latest version. Software is installed in common areas and should not depend on the node hardware, so it is not necessary to be logged into a GPU node to compile, with the exception of the node-locked Portland (PGI) compiler. When using nvcc, it usually works best to pair CUDA with a relatively modern GNU compiler module (such as gcc/4.9.1).
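A minimal compile sketch combining the modules named above (saxpy.cu is a placeholder for any CUDA source file; -arch=sm_35 targets the K40's compute capability 3.5, while the GTX 480s would need sm_20):

module load gcc/4.9.1 cuda/7.5
nvcc -arch=sm_35 -o saxpy saxpy.cu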

PGI

The free academic development version of the PGI compiler is installed on the nodes controlled by the gpu16core/shortgpu16core queues. Please follow the License Conditions. A usage example from the May 2016 Nvidia workshop follows, showing OpenMP versions running on the CPU and a (much faster) OpenACC version running on the GPU. The PGI development compiler is limited to 4 OpenMP threads.

razor-l3$ qsub -I -q shortgpu16core -l nodes=1:ppn=8 -l walltime=6:00:00
compute0805$ module purge
compute0805$ module load PGI/2016
compute0805$ cd src
compute0805$ tar zxf /share/apps/cuda/class/workshop/openacc_CUNY.tgz
compute0805$ cd openacc-workshop/exercises/004-laplace2D-profiling
compute0805$ sed -i 's/1000;/200;/' laplace2d.c
compute0805$ make
compute0805$ export OMP_NUM_THREADS=1
compute0805$ ./laplace2d_omp
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
 total: 18.076966 s
compute0805$ export OMP_NUM_THREADS=4
compute0805$ ./laplace2d_omp
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
 total: 7.715069 s
compute0805$ ./laplace2d_acc
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
 total: 1.175254 s
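For reference, the OpenACC approach shown above amounts to annotating loops with acc pragmas and rebuilding with PGI's -acc flag (-Minfo=accel reports what the compiler offloaded). The following self-contained sketch (a simple SAXPY, not part of the workshop tarball) illustrates the idea:

/* saxpy_acc.c -- illustrative OpenACC sketch;
 * build with: pgcc -acc -Minfo=accel -o saxpy_acc saxpy_acc.c */
#include <stdio.h>
#define N 1000000

int main(void)
{
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    #pragma acc parallel loop        /* offloaded to the GPU when built with -acc */
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f (expect 4.0)\n", y[0]);
    return 0;
}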