This is a brief “how to” summary of usage for users of the Karpinski cluster.
Karpinski has 18 compute nodes. Each node has a single E5-2620 v4 CPU, 32 GB of RAM, and an NVIDIA T4 GPU card. The cluster is primarily intended for use by Computer Science & Computer Engineering (CSCE) students and faculty as a teaching and training resource in two areas: standard compute/GPU jobs and cloud (virtual machine) jobs.
Main storage for the Karpinski cluster is a single NFS server, storage120, which hosts a 30 TB partition for user home directories.
More about the name, Karpinski.
ssh to karpinski.uark.edu. This will log you into the login22 node, a shared login node for all Karpinski users. DO NOT run any jobs on this node. Please use the scheduler to submit jobs to the queue to run on the compute nodes.
Karpinski shares the SLURM scheduler with the Pinnacle cluster. There are two queues (partitions) set up for Karpinski:
csce72:    standard compute/GPU jobs, 72 hour limit, 18 nodes
cscloud72: virtual machine jobs, 72 hour limit, 18 nodes
Use the sinfo command to get the current status of the nodes in each partition:
login22:pwolinsk:~$ sinfo -p csce72,cscloud72
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
csce72     up     3-00:00:00     18  idle   cs[2201-2218]
cscloud72  up     3-00:00:00     18  idle   cs[2201-2218]
Above, all compute nodes, labeled cs2201 through cs2218, are “idle”, i.e. available to run jobs. To submit an interactive job on one of the compute nodes, use the srun command:
login22:pwolinsk:~$ srun -p csce72 -t 1:00:00 -N1 -n 16 --pty /bin/bash
cs2202:pwolinsk:~$ hostname
cs2202
cs2202:pwolinsk:~$ echo $SLURM_NODELIST
cs2202
Notice that after the srun command is executed, the prompt changed from “login22” to “cs2202”. The job started and we got a prompt on the compute node cs2202. This job will run for up to 1 hour. The user can type exit at any time to end the job. At that point we are logged out of the compute node and get the prompt back on the “login22” node.
The srun command in this example took multiple arguments:
-p csce72        <-- use partition "csce72"
-t 1:00:00       <-- request a walltime of 1 hour, 00 minutes, 00 seconds
-N 1             <-- request a single compute node
-n 16            <-- use all 16 cores on the node so no other jobs can start on this node
--pty /bin/bash  <-- since this is an interactive job, use a pseudo-terminal to run the bash interpreter
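The same resources can also be requested non-interactively with a batch script. The sketch below is illustrative (the script body and name are assumptions, not part of the cluster documentation); it maps each srun argument above to its #SBATCH directive:

```shell
#!/bin/bash
#SBATCH -p csce72        # partition "csce72"
#SBATCH -t 1:00:00       # walltime of 1 hour
#SBATCH -N 1             # a single compute node
#SBATCH -n 16            # all 16 cores on the node
# Commands below run on the allocated compute node.
hostname
echo "$SLURM_NODELIST"
```

Save this as, say, myjob.sh and submit it with "sbatch myjob.sh"; output lands in a slurm-JOBID.out file instead of your terminal.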
Use the squeue command to see a list of all of the jobs in the queue. With the -u $USERNAME parameter you can filter the output to show only your jobs:
login22:pwolinsk:~$ squeue -u pwolinsk
  JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
  18707    csce72     bash pwolinsk  R  2:14      1 cs2202
login22:pwolinsk:~$
Using the vm-clone.sh script you can clone virtual machines from a list of predefined VM templates. Without specifying any parameters, vm-clone.sh will show you a list of available VM templates:
login22:pwolinsk:~$ vm-clone.sh
Usage: vm-clone.sh <template-vm>
  where <templ-vm> is one of:
    tmpl-centos7.6
    tmpl-centos7.6-desktop
    tmpl-ubuntu-18.04
    tmpl-ubuntu-18.04-desktop
login22:pwolinsk:~$
To clone a virtual machine specify a template name as the argument:
login22:pwolinsk:~$ vm-clone.sh tmpl-centos7.6
Cloning tmpl-centos7.6 for pwolinsk as centos7.6-pwolinsk-csce.....
Found tmpl-centos7.6 defined on c1329.  Cloning....
Allocating 'centos7.6-pwolinsk-csce.qcow2'    | 10 GB  00:00:07
Clone 'centos7.6-pwolinsk-csce' created successfully.
Moving centos7.6-pwolinsk-csce to Karpinski cluster....
Editing centos7.6-pwolinsk-csce definition for use on Karpinski cluster....
login22:pwolinsk:~$
All of your VMs are stored in the /storage/$USER/.virtual-machines directory. To get a list of your virtual machines:
login22:pwolinsk:~$ vm-list.sh
pwolinsk's VMS (Karpinski Cluster)
VM                        STATE     IP               HOST
======================================================================
centos7.6-pwolinsk-csce   SHUT OFF

Virtual machines are stored in /storage/pwolinsk/.virtual-machines.
Total storage on disk: 1.7G total
login22:pwolinsk:~$
To run the virtual machine we have to submit a job to the cscloud72 partition, with the appropriate parameters passed to the sbatch command.
The VM itself is created in a job prolog script, which runs before the actual job starts, and is shut down in the epilog script, which runs after the actual job finishes. So we also need to specify some process to run in the actual job portion which will prevent the job from shutting down. The waitforvm.sh script will keep running while the VM is running on the job execution host. To simplify the process, the vm-job-launch.sh script is available:
login22:pwolinsk:~$ vm-job-launch.sh
Usage: vm-job-launch.sh <vm_name> <vm_cores> <vm_hours>
  where <vm_name>  is one of your defined virtual machines (vm-list.sh)
        <vm_cores> is the number of virtual cores for VM, range: 1..16
        <vm_hours> is the lifetime of the VM in hours, range: 1..72
login22:pwolinsk:~$
To start our newly created VM with 2 virtual cores for 3 hours:
login22:pwolinsk:~$ vm-job-launch.sh centos7.6-pwolinsk-csce 2 3
Submitting job to the queue with command:
sbatch -N1 -n2 -p cscloud72 -C cloud -t 3:00:00 -J centos7.6-pwolinsk-csce waitforvm.sh centos7.6-pwolinsk-csce
Submitted batch job 19659
Found job #19659
Waiting for log file /home/pwolinsk/cloud-19659.log
....................
Press CNTRL-C to exit log
--------/home/pwolinsk/cloud-19659.log-----------------------------------------
Starting centos7.6-pwolinsk-csce for pwolinsk
Domain centos7.6-pwolinsk-csce created from /home/pwolinsk/vmdef-19659.xml
centos7.6-pwolinsk-csce booting up....IP assigned 172.16.254.179
...
Waiting for SSH access ...........done.
centos7.6-pwolinsk-csce is Ready.
"ssh firstname.lastname@example.org"
password: centos
^C
login22:pwolinsk:~$
The sbatch line in the output above is the actual job submission command used to start the VM job. After the job is submitted, the vm-job-launch.sh script displays the log messages generated during the VM startup procedure. Once you see the line “VMNAME is Ready” you can press <CONTROL>-C to stop displaying the log and log into the VM.
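For reference, the sbatch line that vm-job-launch.sh prints can be assembled by hand from its three arguments. This is only an illustrative sketch (the variable names are made up; the command format is taken from the log output above):

```shell
#!/bin/bash
# Illustrative: rebuild the submission command that vm-job-launch.sh
# reports, from its <vm_name> <vm_cores> <vm_hours> arguments.
VM_NAME="centos7.6-pwolinsk-csce"   # your VM name, per vm-list.sh
VM_CORES=2                          # virtual cores, 1..16
VM_HOURS=3                          # VM lifetime in hours, 1..72
CMD="sbatch -N1 -n${VM_CORES} -p cscloud72 -C cloud -t ${VM_HOURS}:00:00 -J ${VM_NAME} waitforvm.sh ${VM_NAME}"
echo "${CMD}"
```

Note that the job's only task is waitforvm.sh; the VM itself is started by the prolog and stopped by the epilog, as described above.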
To log into the VM run the ssh command listed at the end of output and use the listed password:
cs2203:pwolinsk:~$ ssh email@example.com
Warning: Permanently added '172.16.254.179' (ECDSA) to the list of known hosts.
firstname.lastname@example.org's password:
Last login: Thu May 30 09:49:39 2019
[centos@vm-centos7 ~]$
To get root access in the VM:
[centos@vm-centos7 ~]$ sudo /bin/bash
[sudo] password for centos:
[root@vm-centos7 centos]# whoami
root
[root@vm-centos7 centos]# touch /ihavefullfsaccess
[root@vm-centos7 centos]#
While the virtual machine job is running in the cscloud72 partition you can see it in the queue:
login22:pwolinsk:~$ squeue -u pwolinsk
  JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
  19660 cscloud72 centos7. pwolinsk  R  1:50      1 cs2218
login22:pwolinsk:~$
As well as with the vm-list.sh script:
login22:pwolinsk:~$ vm-list.sh
pwolinsk's VMS (Karpinski Cluster)
VM                        STATE     IP               HOST
================================================================================
centos7.6-pwolinsk-csce   RUNNING   172.16.254.179   cs2218 (52:54:00:b1:8d:8f)

Virtual machines are stored in /storage/pwolinsk/.virtual-machines.
Total storage on disk: 17G total
To terminate the job (and shut down the VM), you can either let the job expire in 3 hours, or you can delete the job from the queue using the scancel command:
login22:pwolinsk:~$ squeue -u pwolinsk
  JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
  19660 cscloud72 centos7. pwolinsk  R  8:58      1 cs2218
login22:pwolinsk:~$ scancel 19660
login22:pwolinsk:~$ squeue -u pwolinsk
  JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
login22:pwolinsk:~$ vm-list.sh
pwolinsk's VMS (Karpinski Cluster)
VM                        STATE     IP               HOST
================================================================================
centos7.6-pwolinsk-csce   SHUT OFF

Virtual machines are stored in /storage/pwolinsk/.virtual-machines.
Total storage on disk: 17G total
login22:pwolinsk:~$