=====How to use the Karpinski Cluster=====

This is a brief “how to” summary for users of the Karpinski cluster.

Karpinski has 18 compute nodes. Each node has a single E5-2620 v4 CPU, 32 GB of RAM and an [[https://www.nvidia.com/en-us/data-center/tesla-t4/ | NVidia T4 GPU]] card. The cluster is primarily intended for Computer Science & Computer Engineering (CSCE) students and faculty as a teaching and training resource in two areas:

  * GPU computing: CUDA programming, machine learning/AI
  * Virtual/cloud computing

Main storage for the Karpinski cluster is a single NFS server, storage120, which hosts a 30TB partition for the user home directories.

More about the name: [[https://en.wikipedia.org/wiki/Jacek_Karpinski|Karpinski]].

===Login===

**karpinski.uark.edu** is the login node for the Karpinski cluster, but the UofA firewall blocks ssh access to that node. To log into Karpinski, go through pinnacle.uark.edu:

  ssh pinnacle.uark.edu

and from there:

  ssh login22

This logs you into the login22 node, which is a shared login node for all Karpinski users. **DO NOT** run any jobs on this node. Use the scheduler to submit jobs to the queue so that they run on the compute nodes.

===Scheduler===

Karpinski shares the SLURM scheduler with the Pinnacle cluster. There are two queues (partitions) set up for Karpinski:

  csce72:    standard compute/gpu jobs, 72 hour limit, 18 nodes
  cscloud72: virtual machine jobs, 72 hour limit, 18 nodes

===Queue and Jobs===

Use the **sinfo** command to get the current status of the nodes in each partition:

  login22:pwolinsk:~$ sinfo -p csce72,cscloud72
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  csce72       up 3-00:00:00     18   idle cs[2201-2218]
  cscloud72    up 3-00:00:00     18   idle cs[2201-2218]

In the output above, all compute nodes, labeled cs2201 through cs2218, are "idle", i.e. available to run jobs.

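When scripting around the queue, it can help to turn the **sinfo** listing into a single number. The following is a minimal sketch, not part of the cluster tooling: the helper name count_idle_nodes is hypothetical, and it assumes the default sinfo column layout shown above (NODES in column 4, STATE in column 5). Rows whose state is not exactly "idle" are not counted.

```shell
#!/bin/sh
# Hypothetical helper: sum the NODES column of every row whose STATE is
# exactly "idle". It reads sinfo's default output on stdin, so it can be
# tried on saved output without access to the scheduler.
count_idle_nodes() {
    awk 'NR > 1 && $5 == "idle" { total += $4 } END { print total + 0 }'
}

# Only query the scheduler when sinfo is actually installed.
if command -v sinfo >/dev/null 2>&1; then
    idle=$(sinfo -p csce72 | count_idle_nodes)
    echo "csce72 idle nodes: $idle"
fi
```

Fed the sinfo output shown above for csce72, the helper would print 18.
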
To submit an interactive job on one of the compute nodes, use the **srun** command:

  login22:pwolinsk:~$ srun -p csce72 -t 1:00:00 -N1 -n 16 --pty /bin/bash
  cs2202:pwolinsk:~$ hostname
  cs2202
  cs2202:pwolinsk:~$ echo $SLURM_NODELIST
  cs2202

Notice that after the srun command is executed, the prompt changes from "login22" to "cs2202". The job has started and we have a prompt on the compute node cs2202. This job will run for up to 1 hour. Type **exit** at any time to end the job; you are then logged out of the compute node and returned to the prompt on the "login22" node. The srun command in this example took multiple arguments:

  -p csce72        <-- use partition "csce72"
  -t 1:00:00       <-- request a walltime of 1 hour, 00 minutes, 00 seconds
  -N1              <-- request a single compute node
  -n 16            <-- use all 16 cores on the node so no other jobs can start on this node
  --pty /bin/bash  <-- since this is an interactive job, use a pseudo terminal to run the bash interpreter

Use the **squeue** command to see a list of all of the jobs in the queue. With the -u $USERNAME parameter you can filter the output to see only your jobs:

  login22:pwolinsk:~$ squeue -u pwolinsk
               JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               18707    csce72     bash pwolinsk  R       2:14      1 cs2202
  login22:pwolinsk:~$

=== Virtual Machines ===

Using the **vm-clone.sh** script you can clone virtual machines from a list of predefined VM templates. When run without any parameters, vm-clone.sh shows a list of the available VM templates:

  login22:pwolinsk:~$ vm-clone.sh
  Usage: vm-clone.sh where is one of:
  tmpl-centos7.6
  tmpl-centos7.6-desktop
  tmpl-ubuntu-18.04
  tmpl-ubuntu-18.04-desktop
  login22:pwolinsk:~$

To clone a virtual machine, specify a template name as the argument:

  login22:pwolinsk:~$ vm-clone.sh tmpl-centos7.6
  Cloning tmpl-centos7.6 for pwolinsk as centos7.6-pwolinsk-csce.....
  Found tmpl-centos7.6 defined on c1329. Cloning....
  Allocating 'centos7.6-pwolinsk-csce.qcow2' | 10 GB 00:00:07
  Clone 'centos7.6-pwolinsk-csce' created successfully.
  Moving centos7.6-pwolinsk-csce to Karpinski cluster....
  Editing centos7.6-pwolinsk-csce definition for use on Karpinski cluster....
  login22:pwolinsk:~$

All of your VMs are stored in the /storage/$USER/.virtual-machines directory. To get a list of your virtual machines:

  login22:pwolinsk:~$ vm-list.sh
  pwolinsk's VMS (Karpinski Cluster)  STATE     VM IP           HOST
  ======================================================================
  centos7.6-pwolinsk-csce             SHUT OFF
  Virtual machines are stored in /storage/pwolinsk/.virtual-machines.
  Total storage on disk: 1.7G total
  login22:pwolinsk:~$

==Run the VM==

To run the virtual machine we have to submit a job to the **cscloud72** partition, passing the following parameters to the **sbatch** command:

  * the name of the virtual machine with the -J flag (-J VM_NAME)
  * the number of virtual cores to use with the -n flag (-n NOVCORES)
  * the number of hours for the VM to run with the -t flag (-t HOURS:00:00)
  * the "cloud" constraint (-C cloud)

The VM itself is created in a job prolog script, which runs before the actual job starts, and is shut down in the epilog script, which runs after the actual job finishes. So we also need to specify some process to run in the actual job portion, which keeps the job from ending immediately. The **waitforvm.sh** script keeps running for as long as the VM is running on the job execution host.

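The parameters listed above can be assembled into one command line. The sketch below only prints that command and does not submit anything; the function name vm_sbatch_command is hypothetical, and the layout of the command (including -N1 and passing the VM name to waitforvm.sh) mirrors the sbatch line that vm-job-launch.sh reports submitting.

```shell
#!/bin/sh
# Hypothetical helper: print the sbatch command for a VM job.
#   $1 = VM name, $2 = number of virtual cores, $3 = lifetime in hours
vm_sbatch_command() {
    vm_name=$1
    vcores=$2
    hours=$3
    # Refuse to build a command when any argument is missing.
    if [ -z "$vm_name" ] || [ -z "$vcores" ] || [ -z "$hours" ]; then
        echo "usage: vm_sbatch_command VM_NAME NOVCORES HOURS" >&2
        return 1
    fi
    echo "sbatch -N1 -n${vcores} -p cscloud72 -C cloud -t ${hours}:00:00 -J ${vm_name} waitforvm.sh ${vm_name}"
}

# Example: the VM cloned earlier, with 2 virtual cores for 3 hours.
vm_sbatch_command centos7.6-pwolinsk-csce 2 3
# prints: sbatch -N1 -n2 -p cscloud72 -C cloud -t 3:00:00 -J centos7.6-pwolinsk-csce waitforvm.sh centos7.6-pwolinsk-csce
```

Printing instead of submitting makes it easy to review the command before pasting it on the login node.
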
To simplify the process, the **vm-job-launch.sh** script is available:

  login22:pwolinsk:~$ vm-job-launch.sh
  Usage: vm-job-launch.sh
  where is one of your defined virtual machines (vm-list.sh)
  is the number of virtual cores for VM, range: 1..16
  is the lifetime of the VM in hours, range: 1..72
  login22:pwolinsk:~$

To start our newly created VM with 2 virtual cores for 3 hours:

  login22:pwolinsk:~$ vm-job-launch.sh centos7.6-pwolinsk-csce 2 3
  Submitting job to the queue with command:
  sbatch -N1 -n2 -p cscloud72 -C cloud -t 3:00:00 -J centos7.6-pwolinsk-csce waitforvm.sh centos7.6-pwolinsk-csce
  Submitted batch job 19659
  Found job #19659
  Waiting for log file /home/pwolinsk/cloud-19659.log ....................
  Press CNTRL-C to exit log
  --------/home/pwolinsk/cloud-19659.log-----------------------------------------
  Starting centos7.6-pwolinsk-csce for pwolinsk
  Domain centos7.6-pwolinsk-csce created from /home/pwolinsk/vmdef-19659.xml
  centos7.6-pwolinsk-csce booting up....IP assigned 172.16.254.179 ... Waiting for SSH access ...........done.
  centos7.6-pwolinsk-csce is Ready.
  "ssh centos@172.16.254.179" password: centos
  ^C
  login22:pwolinsk:~$

The **sbatch** line in the output above is the actual job submission command used to start the VM job. After the job is submitted, the **vm-job-launch.sh** script displays the log messages generated during the VM startup procedure. Once you see the line "VMNAME is Ready", you can press Ctrl-C to stop displaying the log and log into the VM. To log into the VM, run the ssh command listed at the end of the output and use the listed password:

  cs2203:pwolinsk:~$ ssh centos@172.16.254.179
  Warning: Permanently added '172.16.254.179' (ECDSA) to the list of known hosts.
  centos@172.16.254.179's password:
  Last login: Thu May 30 09:49:39 2019
  [centos@vm-centos7 ~]$

To get root access in the VM:

  [centos@vm-centos7 ~]$ sudo /bin/bash
  [sudo] password for centos:
  [root@vm-centos7 centos]# whoami
  root
  [root@vm-centos7 centos]# touch /ihavefullfsaccess
  [root@vm-centos7 centos]#

While the virtual machine job is running in the **cscloud72** partition, you can see it in the queue:

  login22:pwolinsk:~$ squeue -u pwolinsk
               JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               19660 cscloud72 centos7. pwolinsk  R       1:50      1 cs2218
  login22:pwolinsk:~$

It is also listed by the **vm-list.sh** script:

  login22:pwolinsk:~$ vm-list.sh
  pwolinsk's VMS (Karpinski Cluster)  STATE     VM IP           HOST
  ================================================================================
  centos7.6-pwolinsk-csce             RUNNING   172.16.254.179  cs2218 (52:54:00:b1:8d:8f)
  Virtual machines are stored in /storage/pwolinsk/.virtual-machines.
  Total storage on disk: 17G total

To terminate the job (and shut down the VM), you can either let the job expire when its 3 hours are up, or delete the job from the queue using the **scancel** command:

  login22:pwolinsk:~$ squeue -u pwolinsk
               JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               19660 cscloud72 centos7. pwolinsk  R       8:58      1 cs2218
  login22:pwolinsk:~$ scancel 19660
  login22:pwolinsk:~$ squeue -u pwolinsk
               JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  login22:pwolinsk:~$ vm-list.sh
  pwolinsk's VMS (Karpinski Cluster)  STATE     VM IP           HOST
  ================================================================================
  centos7.6-pwolinsk-csce             SHUT OFF
  Virtual machines are stored in /storage/pwolinsk/.virtual-machines.
  Total storage on disk: 17G total
  login22:pwolinsk:~$
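
The default squeue listing truncates the job name (shown as "centos7." above), so looking up a VM job by name is easier with an explicit output format. The following is a rough sketch under a few assumptions: the helper name job_ids_for_name is hypothetical, the VM job is assumed to carry the VM name as its job name (as set with -J), and the script only prints the scancel command it would run rather than cancelling anything.

```shell
#!/bin/sh
# Hypothetical helper: read "JOBID NAME" lines on stdin and print the job
# IDs whose job name is exactly the name given as $1.
job_ids_for_name() {
    awk -v name="$1" '$2 == name { print $1 }'
}

# Only query the scheduler when squeue is actually installed.
if command -v squeue >/dev/null 2>&1; then
    # -h drops the header; -o "%i %j" prints the job ID and the full job name.
    squeue -u "$USER" -h -o "%i %j" |
        job_ids_for_name centos7.6-pwolinsk-csce |
        while read -r job_id; do
            # Dry run: show the command instead of running it.
            echo "To shut down the VM, run: scancel $job_id"
        done
fi
```

Replacing the echo with a direct call to scancel would cancel the matching jobs, which also shuts down the VM.
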