==== AHPCC GPU Nodes ====

This describes the NVidia GPU nodes on the Razor cluster and how to use them.
  
=== Hardware and Queues ===
  
There are nine NVidia GPU nodes. Five have dual Intel E5520 2.27 GHz CPUs, 12 GB main memory, and dual NVidia GTX 480 GPUs. One has dual Intel E5-2630v3 2.40 GHz CPUs, 64 GB main memory, and dual NVidia Tesla K40 GPUs. Three newer condo nodes have dual Intel E5-2650v4 2.20 GHz CPUs, 128 GB main memory, and dual NVidia Tesla K80 GPUs. The GTX 480 delivers 1.35 TFlops SP (single precision) but only 0.17 TFlops DP (double precision), so it is only meaningfully faster than a CPU for single-precision workloads. The K40 has 4.3 TFlops SP and 1.4 TFlops DP, and also has ECC memory for better computational reliability. The K80 has 4.25 TFlops SP and 1.45 TFlops DP with ECC on each of four logical GPUs per node. The GTX 480, K40, and K80 have NVidia Compute Capability 2.0, 3.5, and 3.7 respectively. The five GTX 480 nodes are accessed through the gpu8core queue with a 72 hour time limit. The K40 node has a gpu16core queue with a 72 hour time limit and a run limit of one job. There is a shortgpu16core queue for the one K40 node or the three K80 nodes (which actually have 24 cores) with a 6 hour time limit and a run limit of one job.
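As a sketch of how to reach these queues (this assumes the cluster's PBS/Torque scheduler and the queue names above; ''my_gpu_program'' is a placeholder for your own executable), a batch job requesting half of a GTX 480 node might look like:
<code>
#PBS -N gpujob
#PBS -q gpu8core
#PBS -l nodes=1:ppn=4,walltime=72:00:00

cd $PBS_O_WORKDIR
./my_gpu_program
</code>
Here ppn=4 requests half of an 8-core node, which corresponds to one GPU as described below.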
  
Many programs that use GPUs assume a single GPU and/or do not properly use cudaSetDevice(), so the default allocation on these dual-GPU nodes is half a node (or a quarter of a K80 node) using one GPU. The GPUs are set to "Exclusive Process" mode (nvidia-smi -c 3) at boot time so that they work correctly with independent programs that assume a single GPU or have no provision for multiple GPUs. If two single-GPU programs are run on a dual-GPU node in "default" mode (nvidia-smi -c 0), they both share GPU 0 while GPU 1 sits idle. Usually "Exclusive Process" mode works well. See [[https://​bugs.schedmd.com/​show_bug.cgi?​id=1458]] and [[https://​www.microway.com/​hpc-tech-tips/​nvidia-smi_control-your-gpus/​]]. User accounts can run
<code>
sudo /usr/bin/nvidia-smi {options}
</code>
If you reset the compute mode with ''-c {0-2}'', please reset it to ''-c 3'' at the end of the job.
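For example, a full-node job that needs "default" compute mode would wrap the run like this, restoring Exclusive Process mode before exiting:
<code>
sudo /usr/bin/nvidia-smi -c 0    # default mode: processes may share a GPU
[run your gpu program]
sudo /usr/bin/nvidia-smi -c 3    # restore Exclusive Process mode
</code>
You can also run ''nvidia-smi'' with no arguments at any time to check the current compute mode and GPU state.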
  
Most CUDA programs follow one of three programming models:
(1) older programs often assume a single GPU and ignore cudaSetDevice(), so devices=1;
(2) some programs count the available GPUs and use them all, so devices=all;
(3) MPI programs often match a single GPU with a single MPI process (which may also be CPU multi-threaded), so devices=MPI processes per node.
  
You should select your CPU cores accordingly to get a known number of GPUs. Users may select a half node (ppn=4 on the 8-core nodes, ppn=8 on the 16-core node), a quarter or half K80 node (ppn=6 or 12 on the 24-core nodes), or a full node (ppn = the number of CPU cores). One of the following conditions is required to select a full node: (1) use multiple GPUs; or (2) use more than half the main memory. Jobs that take up a full node and don't meet these conditions are subject to cancellation. If your program takes the number of GPUs as an input, remember that each K80 in the condo nodes is essentially two K20X GPUs in a single package, so the operating system sees 4 GPUs on a dual-K80 node.
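As a sketch (assuming the PBS/Torque ''qsub'' interface), an interactive session on a full 16-core K40 node, justified by a multi-GPU run, could be requested with:
<code>
qsub -I -q gpu16core -l nodes=1:ppn=16,walltime=2:00:00
</code>
A quarter of a K80 node would instead use ''-l nodes=1:ppn=6'' on the shortgpu16core queue.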
There is a system script on each GPU node which detects the available unused GPU devices by calling ''nvidia-smi'' and parsing the output. The script can be called in backticks, as below, and its output assigned to the environment variable CUDA_VISIBLE_DEVICES to set the devices for your program:
<code>
export CUDA_VISIBLE_DEVICES=`/share/apps/bin/nvidia-available-devices.sh 2`
</code>
The script takes an option of 1 to 4 for that many devices, or blank for the maximum available. Its output, saved in the environment variable, is a comma-separated string of currently unused devices with 1 to 4 elements.
<code>
echo $CUDA_VISIBLE_DEVICES
2,3
</code>
If the variable is set before running your CUDA program, the given devices will be assigned to program devices 0 to n-1.
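Putting this together in a job script (a sketch; ''my_cuda_program'' is a placeholder for your own executable):
<code>
# request one unused device and expose only it to the program
export CUDA_VISIBLE_DEVICES=`/share/apps/bin/nvidia-available-devices.sh 1`
echo $CUDA_VISIBLE_DEVICES
./my_cuda_program
</code>
With one device requested, the program sees that device as CUDA device 0 regardless of its physical index.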
  
==How do I tell how many GPUs my binary program uses?==
  
Run it on a full node and log in to the GPU compute node in a second terminal. Here photoscan uses both GPUs, as shown in the "Processes" table, but is not doing much at this time, as shown by the near-idle power usage.
<code>
sudo /usr/bin/nvidia-smi
</code>
gpu.txt · Last modified: 2017/10/10 20:21 by root