How to use the Karpinski Cluster

This is a brief “how to” summary of usage for users of the Karpinski cluster.

Karpinski has 18 compute nodes. Each node has a single Intel Xeon E5-2620 v4 CPU, 32 GB of RAM, and an NVIDIA T4 GPU card. The cluster is primarily intended for Computer Science & Computer Engineering (CSCE) students and faculty as a teaching and training resource in two areas:

  • GPU computing: CUDA programming, machine learning/AI
  • Virtual/cloud computing

Main storage for the Karpinski cluster is a single NFS server, storage120, which hosts a 30 TB partition for user home directories.

More about the name, Karpinski.

Login

karpinski.uark.edu is the login node for the Karpinski cluster, but the UofA firewall blocks ssh access to that node. To log into Karpinski, go through pinnacle.uark.edu:

ssh pinnacle.uark.edu

and from there

ssh login22

This logs you into login22, a shared login node for all Karpinski users. DO NOT run any jobs on this node. Use the scheduler to submit jobs to the queue so that they run on the compute nodes.
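The two ssh hops can be combined into a single command with the OpenSSH -J (jump host) option. The helper below is only a sketch: the function name karpinski_login is hypothetical, and it assumes that your local OpenSSH client supports -J (version 7.3 or newer) and that pinnacle.uark.edu permits jump-host connections, which this page does not confirm.

```shell
# Hypothetical helper: reach login22 in one step by jumping through pinnacle.
# Assumption: pinnacle.uark.edu allows jump-host (ProxyJump) connections.
karpinski_login() {
    # First argument: your cluster username (defaults to the local $USER).
    local user="${1:-$USER}"
    ssh -J "${user}@pinnacle.uark.edu" "${user}@login22"
}

# Usage (run on your own machine, not on the cluster):
#   karpinski_login pwolinsk
```

If the jump is refused, the two-step login described above still applies.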

Scheduler

Karpinski shares the SLURM scheduler with the Pinnacle cluster. There are two queues (partitions) set up for Karpinski:

csce72:     standard compute/gpu jobs, 72 hour limit, 18 nodes
cscloud72:  virtual machine jobs,      72 hour limit, 18 nodes

Queue and Jobs

Use the sinfo command to get the current status of the partitions and their nodes:

login22:pwolinsk:~$ sinfo -p csce72,cscloud72
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
csce72       up 3-00:00:00     18   idle cs[2201-2218]
cscloud72    up 3-00:00:00     18   idle cs[2201-2218]

In the output above, all compute nodes, labeled cs2201 through cs2218, are “idle”, i.e. available to run jobs. To submit an interactive job on one of the compute nodes, use the srun command:

login22:pwolinsk:~$ srun -p csce72 -t 1:00:00 -N1 -n 16 --pty /bin/bash
cs2202:pwolinsk:~$ hostname
cs2202
cs2202:pwolinsk:~$ echo $SLURM_NODELIST 
cs2202

Notice that after the srun command is executed, the prompt changes from “login22” to “cs2202”: the job has started and we have a prompt on the compute node cs2202. This job will run for up to 1 hour. Type exit at any time to end the job; you are then logged out of the compute node and returned to the prompt on the “login22” node.

The srun command in this example took several arguments:

-p csce72             <-- use partition "csce72"
-t 1:00:00            <-- set the walltime limit for the job to 1 hour, 00 minutes, 00 seconds
-N 1                  <-- request a single compute node
-n 16                 <-- request all 16 cores on the node so no other jobs can start on this node
--pty /bin/bash       <-- since this is an interactive job, run the bash interpreter in a pseudo-terminal
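For non-interactive work, the same options can go into a batch script that is submitted with sbatch. The sketch below writes a hypothetical script named gpu-job.sh; the file name, the job name gpu-test, and the nvidia-smi call are illustrative assumptions. This page does not say which tools are installed on the compute nodes, or whether a GPU has to be requested explicitly.

```shell
# Write a hypothetical batch script that mirrors the srun options above.
cat > gpu-job.sh <<'EOF'
#!/bin/bash
# Partition for standard compute/GPU jobs.
#SBATCH -p csce72
# Walltime limit: 1 hour.
#SBATCH -t 1:00:00
# One compute node, all 16 cores.
#SBATCH -N 1
#SBATCH -n 16
# Job name (illustrative).
#SBATCH -J gpu-test

echo "Running on: ${SLURM_NODELIST:-unknown}"

# Assumption: nvidia-smi is installed on the compute nodes.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
fi
EOF

# Submit from login22 (not executed here):
#   sbatch gpu-job.sh
```

By default sbatch writes the job's output to a file named slurm-<jobid>.out in the directory the job was submitted from.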

Use the squeue command to see a list of all of the jobs in the queue. With the -u $USERNAME parameter you can filter the output to see only your jobs:

login22:pwolinsk:~$ squeue -u pwolinsk
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             18707    csce72     bash pwolinsk  R       2:14      1 cs2202
login22:pwolinsk:~$ 
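The sinfo and squeue output shown in this section can also be condensed for use in scripts. The two functions below are a sketch, and their names (idle_nodes, my_jobs) are hypothetical. idle_nodes assumes the default sinfo column layout shown earlier; my_jobs uses the standard squeue format fields %i (job ID), %P (partition), %T (state) and %M (elapsed time).

```shell
# Hypothetical helpers built on sinfo and squeue.

# Print the number of idle nodes per partition, parsed from the default
# sinfo columns (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST).
idle_nodes() {
    sinfo -p "${1:-csce72,cscloud72}" |
        awk 'NR > 1 && $5 == "idle" { n[$1] += $4 }
             END { for (p in n) print p, n[p] }' |
        sort
}

# List a user's jobs as "JOBID PARTITION STATE ELAPSED", without a header.
my_jobs() {
    squeue -u "${1:-$USER}" -h -o '%i %P %T %M'
}

# Usage on login22:
#   idle_nodes csce72
#   my_jobs
```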

Virtual Machines

Using the vm-clone.sh script you can clone virtual machines from a list of predefined VM templates. Without specifying any parameters, vm-clone.sh will show you a list of available VM templates:

login22:pwolinsk:~$ vm-clone.sh 
Usage: vm-clone.sh <template-vm>
    where <templ-vm> is one of:

     tmpl-centos7.6
     tmpl-centos7.6-desktop
     tmpl-ubuntu-18.04
     tmpl-ubuntu-18.04-desktop

login22:pwolinsk:~$ 

To clone a virtual machine specify a template name as the argument:

login22:pwolinsk:~$ vm-clone.sh tmpl-centos7.6
Cloning tmpl-centos7.6 for pwolinsk as centos7.6-pwolinsk-csce.....
Found tmpl-centos7.6 defined on c1329.  Cloning....
Allocating 'centos7.6-pwolinsk-csce.qcow2'                                                                                       
                                                       |  10 GB  00:00:07     
Clone 'centos7.6-pwolinsk-csce' created successfully.

Moving centos7.6-pwolinsk-csce to Karpinski cluster....
Editing centos7.6-pwolinsk-csce definition for use on Karpinski cluster....
login22:pwolinsk:~$ 

All of your VMs are stored in the /storage/$USER/.virtual-machines directory. To get a list of your virtual machines:

login22:pwolinsk:~$ vm-list.sh

pwolinsk's VMS (Karpinski Cluster)STATE       VM IP              HOST 
======================================================================
centos7.6-pwolinsk-csce           SHUT OFF   

Virtual machines are stored in /storage/pwolinsk/.virtual-machines.
Total storage on disk: 1.7G	total

login22:pwolinsk:~$ 

Run the VM

To run the virtual machine, submit a job to the cscloud72 partition with the following parameters passed to the sbatch command:

  • the name of the virtual machine with the -J flag (-J VM_NAME)
  • the number of virtual cores to use with the -n flag (-n NOVCORES)
  • the number of hours for the VM to run with the -t flag (-t HOURS:00:00)
  • the “cloud” constraint (-C cloud).

The VM itself is created by a job prolog script, which runs before the actual job starts, and is shut down by the epilog script, which runs after the actual job finishes. The job itself therefore needs a process that keeps it alive between the two. The waitforvm.sh script serves this purpose: it keeps running for as long as the VM is running on the job execution host. To simplify the process, the vm-job-launch.sh script is available:

login22:pwolinsk:~$ vm-job-launch.sh
Usage: vm-job-launch.sh <vm_name> <vm_cores> <vm_hours>

    where <vm_name> is one of your defined virtual machines (vm-list.sh)
          <vm_core> is the number of virtual cores for VM, range: 1..16
          <vm_hours> is the lifetime of the VM in hours,   range: 1..72

login22:pwolinsk:~$ 
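The ranges in the usage message can be checked before a job is submitted. The wrapper below is a sketch under that reading of the usage text; the function name launch_vm and its error messages are hypothetical, and vm-job-launch.sh may well perform the same checks itself.

```shell
# Hypothetical wrapper around vm-job-launch.sh that validates its arguments
# against the ranges in the usage message (cores 1..16, hours 1..72).
launch_vm() {
    local name="$1" cores="$2" hours="$3"

    if [ -z "$name" ]; then
        echo "usage: launch_vm <vm_name> <vm_cores> <vm_hours>" >&2
        return 2
    fi

    # Reject anything that is not a plain non-negative integer.
    case "$cores" in
        ''|*[!0-9]*) echo "vm_cores must be an integer" >&2; return 2 ;;
    esac
    case "$hours" in
        ''|*[!0-9]*) echo "vm_hours must be an integer" >&2; return 2 ;;
    esac

    if [ "$cores" -lt 1 ] || [ "$cores" -gt 16 ]; then
        echo "vm_cores must be in the range 1..16" >&2
        return 2
    fi
    if [ "$hours" -lt 1 ] || [ "$hours" -gt 72 ]; then
        echo "vm_hours must be in the range 1..72" >&2
        return 2
    fi

    vm-job-launch.sh "$name" "$cores" "$hours"
}

# Usage on login22:
#   launch_vm centos7.6-pwolinsk-csce 2 3
```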

To start our newly created VM with 2 virtual cores for 3 hours:

login22:pwolinsk:~$ vm-job-launch.sh centos7.6-pwolinsk-csce 2 3
Submitting job to the queue with command:

   sbatch -N1 -n2 -p cscloud72 -C cloud -t 3:00:00 -J centos7.6-pwolinsk-csce waitforvm.sh centos7.6-pwolinsk-csce

Submitted batch job 19659
Found job #19659
Waiting for log file /home/pwolinsk/cloud-19659.log ....................

Press CNTRL-C to exit log
--------/home/pwolinsk/cloud-19659.log-----------------------------------------
Starting centos7.6-pwolinsk-csce for pwolinsk
Domain centos7.6-pwolinsk-csce created from /home/pwolinsk/vmdef-19659.xml


centos7.6-pwolinsk-csce booting up....IP assigned 172.16.254.179 ... Waiting for SSH access ...........done.

centos7.6-pwolinsk-csce is Ready. "ssh centos@172.16.254.179" password: centos
^C
login22:pwolinsk:~$ 

The sbatch line in the output above is the actual job submission command used to start the VM job. After the job is submitted, the vm-job-launch.sh script displays the log messages generated during the VM startup procedure. Once you see the line “VMNAME is Ready”, you can press <CONTROL>-C to stop displaying the log and log into the VM.

To log into the VM, run the ssh command listed at the end of the output and use the listed password:

cs2203:pwolinsk:~$ ssh centos@172.16.254.179
Warning: Permanently added '172.16.254.179' (ECDSA) to the list of known hosts.
centos@172.16.254.179's password: 
Last login: Thu May 30 09:49:39 2019
[centos@vm-centos7 ~]$ 

To get root access in the VM:

[centos@vm-centos7 ~]$ sudo /bin/bash
[sudo] password for centos: 
[root@vm-centos7 centos]# whoami
root
[root@vm-centos7 centos]# touch /ihavefullfsaccess
[root@vm-centos7 centos]# 

While the virtual machine job is running in the cscloud72 partition you can see it in the queue:

login22:pwolinsk:~$ squeue -u pwolinsk
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             19660 cscloud72 centos7. pwolinsk  R       1:50      1 cs2218
login22:pwolinsk:~$

The VM also appears in the vm-list.sh output:

login22:pwolinsk:~$ vm-list.sh 

pwolinsk's VMS (Karpinski Cluster)          STATE       VM IP              HOST 
================================================================================
centos7.6-pwolinsk-csce                     RUNNING    172.16.254.179     cs2218  (52:54:00:b1:8d:8f)

Virtual machines are stored in /storage/pwolinsk/.virtual-machines.
Total storage on disk: 17G	total
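When the IP address of a running VM is needed in a script, it can be pulled out of the vm-list.sh listing. The function below is a sketch that assumes the column layout shown above stays as it is (state in the second column, IP address in the third); the function name running_vms is hypothetical.

```shell
# Hypothetical helper: print "NAME IP" for each running VM.
# Assumes vm-list.sh keeps the column layout shown above, with the state
# in the second column and the IP address in the third.
running_vms() {
    vm-list.sh | awk '$2 == "RUNNING" { print $1, $3 }'
}

# Usage on login22:
#   running_vms
```

VMs in the SHUT OFF state are skipped, because their second column is never RUNNING.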

To terminate the job (and shut down the VM), either let the job expire when its 3 hours are up, or delete the job from the queue with the scancel command:

login22:pwolinsk:~$ squeue -u pwolinsk
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             19660 cscloud72 centos7. pwolinsk  R       8:58      1 cs2218
login22:pwolinsk:~$ scancel 19660
login22:pwolinsk:~$ squeue -u pwolinsk
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
login22:pwolinsk:~$ vm-list.sh

pwolinsk's VMS (Karpinski Cluster)          STATE       VM IP              HOST 
================================================================================
centos7.6-pwolinsk-csce                     SHUT OFF   

Virtual machines are stored in /storage/pwolinsk/.virtual-machines.
Total storage on disk: 17G	total

login22:pwolinsk:~$ 
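Because vm-job-launch.sh submits the job with -J set to the VM name, the job can also be cancelled by name instead of by job ID. The function below is a sketch using the standard scancel options -u (user), -p (partition) and -n (job name); the function name stop_vm is hypothetical.

```shell
# Hypothetical helper: cancel the job that runs a given VM, selected by
# job name (vm-job-launch.sh submits the job with -J <vm_name>).
stop_vm() {
    local name="$1"
    if [ -z "$name" ]; then
        echo "usage: stop_vm <vm_name>" >&2
        return 2
    fi
    # Restrict the match to your own jobs in the cscloud72 partition.
    scancel -u "$USER" -p cscloud72 -n "$name"
}

# Usage on login22:
#   stop_vm centos7.6-pwolinsk-csce
```
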
karpinski_usage.txt · Last modified: 2020/07/16 22:23 by pwolinsk