This section provides information on the workload manager Slurm used by ALICE and SHARK.
Slurm (Simple Linux Utility for Resource Management) is an open-source job scheduler that allocates compute resources on clusters for jobs. Slurm has been deployed at various national and international computing centres, and is used by approximately 60% of the TOP500 supercomputers in the world.
The following pages will give you a basic overview of Slurm on ALICE. You can learn much more about Slurm and its commands from the official Slurm website.
This section is intended as an overview of the fundamental concepts of using Slurm. In the Tutorials, we provide a more practical introduction to using Slurm, including examples of batch job scripts (e.g., Your first job). However, we recommend that novice users read through both chapters.
Slurm basics
Common user commands
The following is a list of common user commands that will be discussed in more detail in this section and the tutorials.
Command | Definition |
---|---|
sbatch | Submit a batch job script for execution (queued) |
scancel | Delete a job |
scontrol | Job status (detailed), several options only available to root |
sinfo | Display state of partitions and nodes |
squeue | Display state of all (queued) jobs |
salloc | Submit a job for execution or initiate job in real-time (interactive job) |
srun | Run parallel jobs |
seff | Get a summary of how efficiently requested resources were used for a given job |
For a full overview, enter man <command>
or check the documentation on the Slurm website.
There are different ways to submit jobs to Slurm. We recommend primarily using batch scripts submitted with the sbatch
command instead of interactive jobs requested with salloc.
Accounting info
The following table lists a number of commands for retrieving accounting information and job statistics.
Command | Info |
---|---|
sacct | Displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database |
sstat | Display various status information of a running job/step |
sacctmgr | Used to view and modify Slurm account information |
sreport | Generate reports from the Slurm accounting data |
Scheduling info
The following commands provide useful information for understanding the scheduling of your job:
Command | Info |
---|---|
sprio | View the factors that comprise a job's scheduling priority |
sshare | Tool for listing the shares of associations to a cluster |
Environment variables
Any environment variables that you have set with the sbatch
command will be passed to your job. For this reason, if your program needs certain environment variables set to function properly, it is best to put them in your job script. This also makes it easier to reproduce your job results later, if necessary.
In addition to setting environment variables yourself, Slurm provides some environment variables of its own that you can use in your job scripts. Information on some of the common Slurm environment variables is listed in the chart below. For additional information, see the man page for sbatch.
Environment variable | Definition |
---|---|
$SLURM_JOB_ID | ID of job allocation |
$SLURM_SUBMIT_DIR | Directory from which the job was submitted |
$SLURM_JOB_NODELIST | List of nodes allocated to the job |
$SLURM_CPUS_ON_NODE | Processors available to the job on this node |
$SLURM_LAUNCH_NODE_IPADDR | IP address of node where job launched |
$SLURM_NNODES | Total number of nodes used by the job |
$SLURM_NODEID | Relative node ID of current node |
$SLURM_NTASKS | Total number of processes in current job |
$SLURM_PROCID | MPI rank (or relative process ID) of the current process |
$SLURM_TASK_PID | Process ID of task started |
$SLURM_TASKS_PER_NODE | Number of tasks to be run on each node |
$CUDA_VISIBLE_DEVICES | List of GPUs that are available for use |
Environment variables override any options set in a batch script. Command-line options override any previously set environment variables.
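As a minimal sketch of how these variables might be used inside a job script, the fragment below prints a short job summary. The job name env_demo, the function name job_summary and the "not set" fallbacks are illustrative assumptions; the fallbacks are only there so that the fragment also runs outside of a Slurm job.

```shell
#!/bin/bash
#SBATCH --job-name=env_demo
#SBATCH --ntasks=1

# Print a short summary of the job from Slurm's environment variables.
# job_summary is a hypothetical helper name, not part of Slurm.
# The ":-" fallbacks only matter when the script is run outside Slurm.
job_summary() {
    echo "Job ID: ${SLURM_JOB_ID:-not set}"
    echo "Submit directory: ${SLURM_SUBMIT_DIR:-not set}"
    echo "Node list: ${SLURM_JOB_NODELIST:-not set}"
    echo "Number of tasks: ${SLURM_NTASKS:-not set}"
}

job_summary
```

Writing such a summary at the top of the job output can make it easier to match an output file to a job later on.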
Job settings and configuration information
With the Slurm command scontrol show
you can get a more detailed overview of your running job, node hardware and partitions:
Command | Info |
---|---|
scontrol show job <jobid> | Get settings for a specific job |
scontrol show partition <partition_name> | See the configuration of a given partition |
scontrol show node <node_name> | See the configuration of a given node |
Slurm partitions
Slurm makes nodes on a cluster available as partitions, i.e., one partition gives access to a certain number of nodes and specific resource limits. Jobs are always submitted to either a default partition or a user-specified partition.
ALICE
You can find an overview of all currently available partitions on ALICE here: Partitions on ALICE
SHARK
You can find an overview of all currently available partitions on SHARK here: Partitions on SHARK
sinfo
sinfo
provides information about the current state of partitions on a cluster. The following commands are useful:
Command | Description |
---|---|
sinfo | Shows the partitions available to the user and basic state information |
sinfo -a | Shows all partitions on the cluster including those not available to the user |
sinfo -s | Prints a summary which includes the number of nodes in the various states |
sinfo -N | Provides a node-centric view of the partitions |
sinfo -N --Format=Nodelist:20,Available:10,Time:15,Partition:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10 | Provides a quick overview of the available resources for all nodes on the cluster. |
sinfo -p <partition_name> -N --Format=Nodelist:20,Available:10,Time:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10 | Provides a quick overview of the available resources for all nodes in the partition |
sinfo --Format=Nodelist:20,Available:10,Time:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10,Partition | Provides a quick overview of the available resources per partition. Adding |
Here, we show the output of some of the above commands for both ALICE and SHARK. Note that the output will most likely look different when you run them, because of the current load of the system and possible changes made to the partition system. We encourage you to try them yourself.
ALICE
[me@nodelogin02 ~]$ sinfo
PARTITION   AVAIL  TIMELIMIT   NODES  STATE  NODELIST
testing     up     1:00:00     1      mix    nodelogin01
testing     up     1:00:00     1      idle   nodelogin02
cpu-short*  up     4:00:00     20     alloc  node[001-020]
cpu-medium  up     1-00:00:00  19     alloc  node[002-020]
cpu-long    up     7-00:00:00  18     alloc  node[003-020]
cpu-short   up     4:00:00     7      mix    node[852-853,856-860]
gpu-short   up     4:00:00     2      alloc  node[854-855]
gpu-short   up     4:00:00     1      idle   node851
gpu-medium  up     1-00:00:00  7      mix    node[852-853,856-860]
gpu-medium  up     1-00:00:00  2      alloc  node[854-855]
gpu-medium  up     1-00:00:00  1      idle   node851
gpu-long    up     7-00:00:00  7      mix    node[852-853,856-860]
gpu-long    up     7-00:00:00  2      alloc  node[854-855]
mem         up     14-00:00:0  1      mix    node801
amd-short   up     4:00:00     1      mix    node802

[me@nodelogin02 ~]$ sinfo -s
PARTITION   AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
testing     up     1:00:00     0/2/0/2         nodelogin[01-02]
cpu-short*  up     4:00:00     20/0/0/20       node[001-020]
cpu-medium  up     1-00:00:00  19/0/0/19       node[002-020]
cpu-long    up     7-00:00:00  18/0/0/18       node[003-020]
gpu-short   up     4:00:00     9/1/0/10        node[851-860]
gpu-medium  up     1-00:00:00  9/1/0/10        node[851-860]
gpu-long    up     7-00:00:00  9/0/0/9         node[852-860]
mem         up     14-00:00:0  1/0/0/1         node801
amd-short   up     4:00:00     1/0/0/1         node802

[me@nodelogin02 ~]$ sinfo -p gpu-long -N --Format=Nodelist:20,Available:10,Time:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10
NODELIST  AVAIL  TIMELIMIT   STATE      MEMORY  FREE_MEM  CPU_LOAD  CPUS(A/I/O/T)  GRES_USED  AVAIL_FEATURES     REASON
node852   up     7-00:00:00  mixed      380851  286301    13.42     22/2/0/24      gpu:3      Geforce.rtx2080Ti  none
node853   up     7-00:00:00  mixed      380851  320623    1.04      22/2/0/24      gpu:3      Geforce.rtx2080Ti  none
node854   up     7-00:00:00  allocated  380851  280041    39.07     24/0/0/24      gpu:4      Geforce.rtx2080Ti  none
node855   up     7-00:00:00  allocated  380851  306605    41.95     24/0/0/24      gpu:4      Geforce.rtx2080Ti  none
node856   up     7-00:00:00  mixed      380851  313819    1.20      4/20/0/24      gpu:4      Geforce.rtx2080Ti  none
node857   up     7-00:00:00  mixed      380851  345208    1.22      4/20/0/24      gpu:4      Geforce.rtx2080Ti  none
node858   up     7-00:00:00  mixed      380851  331898    4.26      4/20/0/24      gpu:4      Geforce.rtx2080Ti  none
node859   up     7-00:00:00  mixed      380851  235598    4.21      14/10/0/24     gpu:4      Geforce.rtx2080Ti  none
node860   up     7-00:00:00  mixed      380851  306182    2.76      18/6/0/24      gpu:4      Geforce.rtx2080Ti  none
SHARK
[me@res-hpc-lo02 ~]$ sinfo
PARTITION        AVAIL  TIMELIMIT  NODES  STATE  NODELIST
all*             up     infinite   13     mix    res-hpc-exe[001,014,024,036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
all*             up     infinite   13     idle   res-hpc-exe[002-003,005,007-009,011-013,027,029-031]
gpu              up     infinite   3      mix    res-hpc-gpu[01-02,09]
gpu              up     infinite   2      idle   res-hpc-gpu[03-04]
lumcdiagnostics  up     infinite   15     mix    res-hpc-exe[001,014,024,032-033,036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
lumcdiagnostics  up     infinite   15     idle   res-hpc-exe[002-003,005,007-009,011-013,027,029-031,034-035]
highmem          up     infinite   2      mix    res-hpc-mem[01-02]
short            up     1:00:00    16     mix    res-hpc-exe[001,014,024,032-033,036,039-043],res-hpc-gpu[01-02,05-06,09]
short            up     1:00:00    19     idle   res-hpc-exe[002-003,005,007-009,011-013,027,029-031,034-035],res-hpc-gpu[03-04],res-hpc-path[01-02]
highmemgpu       up     infinite   1      idle   res-hpc-gpu07

[me@res-hpc-lo02 ~]$ sinfo -s
PARTITION        AVAIL  TIMELIMIT  NODES(A/I/O/T)  NODELIST
all*             up     infinite   13/13/0/26      res-hpc-exe[001-003,005,007-009,011-014,024,027,029-031,036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
gpu              up     infinite   3/2/0/5         res-hpc-gpu[01-04,09]
lumcdiagnostics  up     infinite   15/15/0/30      res-hpc-exe[001-003,005,007-009,011-014,024,027,029-036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
highmem          up     infinite   2/0/0/2         res-hpc-mem[01-02]
short            up     1:00:00    16/19/0/35      res-hpc-exe[001-003,005,007-009,011-014,024,027,029-036,039-043],res-hpc-gpu[01-06,09],res-hpc-path[01-02]
highmemgpu       up     infinite   0/1/0/1         res-hpc-gpu07

[me@res-hpc-lo02 ~]$ sinfo -p gpu -N --Format=Nodelist:20,Available:10,Time:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:30,Features:20,Reason:10
NODELIST       AVAIL  TIMELIMIT  STATE  MEMORY  FREE_MEM  CPU_LOAD  CPUS(A/I/O/T)  GRES_USED                      AVAIL_FEATURES  REASON
res-hpc-gpu01  up     infinite   mixed  515000  110911    5.53      9/39/0/48      gpu:TitanXp:1(IDX:0)           Platinum8160    none
res-hpc-gpu02  up     infinite   mixed  514000  198863    0.00      8/40/0/48      gpu:TitanXp:1(IDX:1)           Platinum8160    none
res-hpc-gpu03  up     infinite   idle   48000   13443     0.00      0/6/0/6        gpu:GRIDV10032g:0(IDX:N/A)     Gold6252        none
res-hpc-gpu04  up     infinite   idle   48000   23317     0.00      0/6/0/6        gpu:GRIDV10032g:0(IDX:N/A)     Gold6252        none
res-hpc-gpu09  up     infinite   mixed  48000   17064     0.00      4/2/0/6        gpu:GRIDV10016g:1(IDX:0)       Gold6252        none
The node states have the following meaning:
State | Description |
---|---|
idle | the node is not used, but available for new jobs to run |
mix(ed) | there is 1 or more jobs running on the node, but there are still resources free on this node for more jobs |
alloc(ated) | the whole node is allocated by 1 or more jobs, no additional jobs can run on this node |
draining or drained | the node is being drained or has been drained. New jobs cannot be scheduled on it until the node is undrained |
The abbreviations (A/I/O/T) in NODES(A/I/O/T) or CPUS(A/I/O/T) mean (Allocated/Idle/Other/Total).
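To make the A/I/O/T notation concrete, here is a small sketch that splits such a value into its four named fields. The helper name split_aiot is hypothetical and not part of Slurm; the sample value is the gpu-short entry from the ALICE sinfo -s output shown earlier.

```shell
# Split an "Allocated/Idle/Other/Total" value, as printed by sinfo -s,
# into its four fields. split_aiot is a hypothetical helper name.
split_aiot() {
    echo "$1" | awk -F/ '{ printf "allocated=%s idle=%s other=%s total=%s\n", $1, $2, $3, $4 }'
}

# Example with the gpu-short value from the ALICE output
split_aiot "9/1/0/10"
# prints: allocated=9 idle=1 other=0 total=10
```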
squeue
With squeue
, you can get information about your running jobs and jobs from other users:
Command | Description |
---|---|
squeue | Returns a list of jobs in the current queue for all partitions available to the user |
squeue -a | Returns a list of jobs in the current queue for all partitions on the cluster |
| Same as |
squeue -u <username> | Lists only a user’s jobs in the queue |
squeue -u <username> --start | Lists a user’s pending jobs and their estimated starting time |
Jobs typically pass through several states in the course of their execution.
The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED. An explanation of some states follows:
State | State (full) | Explanation |
---|---|---|
CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
F | FAILED | Job terminated with non-zero exit code or other failure condition. |
PD | PENDING | Job is awaiting resource allocation. |
R | RUNNING | Job currently has an allocation. |
S | SUSPENDED | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. |
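As a rough illustration of working with these state codes, the sketch below counts how many jobs are in each state. It reads one state code per line, which on a cluster could be produced with squeue's -h (no header) and -o "%t" (compact state) options. The helper name count_states is hypothetical, and a fixed sample is used here so that the example runs without Slurm.

```shell
# Count jobs per state code from a list of state codes on standard input
# (one code per line). count_states is a hypothetical helper name.
count_states() {
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# On a cluster, the state codes could come from squeue, for example:
#   squeue -h -u "$USER" -o "%t" | count_states
# Fixed sample input, so the sketch runs anywhere:
printf 'R\nPD\nPD\nR\nCG\n' | count_states
```

For the sample input this reports one completing, two pending and two running jobs.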
Job information and cluster configuration with scontrol
With the Slurm command scontrol
you can get a more detailed overview of your running job, node hardware and partitions, e.g.,
[user@res-hpc-lo02 ~]$ scontrol show job 260
JobId=260 JobName=IMB
   UserId=user(225812) GroupId=Domain Users(513) MCS_label=N/A
   Priority=35603 Nice=0 Account=dnst-ict QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:13 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2020-01-23T10:27:45 EligibleTime=2020-01-23T10:27:45
   AccrueTime=2020-01-23T10:27:45
   StartTime=2020-01-23T10:27:45 EndTime=2020-01-23T10:57:45 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-01-23T10:27:45
   Partition=all AllocNode:Sid=res-hpc-ma01:46428
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=res-hpc-exe[013-014]
   BatchHost=res-hpc-exe013
   NumNodes=2 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,mem=64G,node=2,billing=32
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/user/Software/imb/mpi-benchmarks/imb.slurm
   WorkDir=/home/user/Software/imb/mpi-benchmarks
   StdErr=/home/user/Software/imb/mpi-benchmarks/job.%J.err
   StdIn=/dev/null
   StdOut=/home/user/Software/imb/mpi-benchmarks/job.%J.out
   Power=
   MailUser=user@gmail.com MailType=BEGIN,END,FAIL

[user@res-hpc-lo02 ~]$ scontrol show node res-hpc-exe014
NodeName=res-hpc-exe014 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=16 CPUTot=24 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=res-hpc-exe014 NodeHostName=res-hpc-exe014 Version=20.02.0-0pre1
   OS=Linux 4.18.0-80.11.2.el8_0.x86_64 #1 SMP Tue Sep 24 11:32:19 UTC 2019
   RealMemory=386800 AllocMem=32768 FreeMem=380208 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=all
   BootTime=2019-12-11T11:51:40 SlurmdStartTime=2020-01-14T15:36:20
   CfgTRES=cpu=24,mem=386800M,billing=24
   AllocTRES=cpu=16,mem=32G
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[user@res-hpc-lo02 ~]$ scontrol show partition all
PartitionName=all
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=res-hpc-exe[013-014]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=40 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=2048 MaxMemPerNode=UNLIMITED
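Because scontrol prints its information as key=value pairs, single values can be picked out with standard text tools. The following is a minimal sketch; get_field is a hypothetical helper name, and the approach assumes that the value itself contains no spaces (which does not hold for every field, e.g. GroupId in the job output).

```shell
# Print the value of one key from "key=value" output such as that of
# "scontrol show job". get_field is a hypothetical helper name.
# Limitation: values that contain spaces are cut off at the first space.
get_field() {
    # $1 = key to look up; the text to search is read from standard input
    tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# On a cluster:  scontrol show job <jobid> | get_field JobState
# Fixed sample line from the job output, so the sketch runs anywhere:
echo "JobState=RUNNING Reason=None Dependency=(null)" | get_field JobState
# prints: RUNNING
```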
Job Resources
Determining what resources to request
Requesting the right amount of resources for jobs is one of the most essential aspects of using Slurm (or running any jobs on an HPC cluster).
Before you submit a job for batch processing, it is important to know what the requirements of your program are so that it can run properly. Each program and workflow has unique requirements so we advise that you determine what resources you need before you submit your job.
Keep in mind that increasing the amount of compute resources may also increase the amount of time that your job spends waiting in the queue. Within some limits, you may request whatever resources you need but bear in mind that other researchers need to be able to use those resources as well.
It is vital that you specify the resources you need in as much detail as possible. This will help Slurm to better schedule your job and to allocate free resources to other users.
Finding the right settings often requires a bit of trial and error. It usually helps to run a few small test jobs and use their performance to estimate the resources for the production job. After each job, you can check how much of the requested resources were actually used. It is also worthwhile to check this after production jobs to see if you can or should make adjustments.
Below are some ways to specify the resources to ask for in your job script. These are options defined for the sbatch
and salloc
commands. There are additional options that you can find by checking the man pages for each command or the Slurm website.
Specifying resources and other settings for jobs
Slurm has its own syntax to request compute resources. In addition, Slurm has a number of settings for jobs that make it easier to organize jobs. Below is a summary table of some commonly requested resources and the Slurm syntax to get it. These options can be passed to sbatch
or provided in a batch job script using the #SBATCH --option=<value>
syntax. For a complete listing of request syntax, run the command man sbatch
or check the Slurm website.
Category | Option | Meaning |
---|---|---|
Job Organization | --job-name=<name> | Setting the name of the job (will be displayed in squeue output) |
 | --mail-user=<address> | Setting where to send email alerts to |
 | --mail-type=<type> | Setting when to send email alerts |
 | --output=<file> | Setting the name of the output file (default is slurm-<jobid>.out). It is highly recommended to set a name for the output file that can be readily associated with the job. An easy way to achieve this is to use the Slurm filename patterns “%x” for the job name and “%j” for the job id, i.e., --output=%x_%j.out |
 | --error=<file> | Setting a name for a separate file for error messages only |
Nodes and CPUs | --nodes=<count> | Request this many nodes on the cluster (default is 1 unless other parameters allow splitting up the job over multiple nodes). By default, 1 core is used on each node. |
 | --ntasks=<count> | Request this many tasks on the cluster. A task is an instance of a running program. Use this for example with MPI. Defaults to 1 task per node. You can request multiple nodes, multiple tasks and multiple CPUs per task and/or per node. For jobs with |
 | --cpus-per-task=<count> | Request this many CPUs per task set by |
Running time | --time=<d-hh:mm:ss> | The walltime or running time of your job. If you do not define how long your job will run, a default value might be set for a given partition. The maximum walltime that is available also depends on the partition that you use. |
Memory | --mem=<size> | Request this amount of memory for your job (single node). Unit suffixes are K, M, G or T. If not specified, a default value will be set depending on the partition, which might not be suitable for your job. For parallelized jobs, it is better to use |
 | --mem-per-cpu=<size> | Minimum memory required per allocated CPU. Unit suffixes are K, M, G or T. |
Partition | --partition=<partition> | Request specified partition/queue |
GPU | | Request a number of GPUs for each node, optionally specifying the type of GPU (default is 0 GPUs). The number and type of GPUs available depend on the nodes in the cluster and partitions. If you need a GPU, you always have to request one explicitly. If you do not specify a GPU, Slurm will not assign one to your job. |
 | | Request GPUs for the job. We recommend using For jobs with |
 | --mem-per-gpu=<size> | Minimum memory required per allocated GPU. Unit suffixes are K, M, G or T. |
Other settings | --constraint=<attribute> | Request nodes which have the specific attributes (e.g. avx, IB) |
If you want to run a job that can make use of multiple CPUs, but which does not use MPI, then it is usually best to use #SBATCH --ntasks=1
and #SBATCH --cpus-per-task=c
where “c” is the number of CPUs you want. In most cases, setting --ntasks>1
only makes sense when running MPI or creating job steps with srun for parallelization in the batch file.
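As an illustration of the single-task, multi-CPU case, here is a minimal sketch of a batch script for a multithreaded program. It assumes an OpenMP-style program that reads its thread count from OMP_NUM_THREADS; the job name, the resource values and the program name my_program are placeholders, and the fallback to 1 thread is only there so that the fragment also runs outside Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=threads_demo
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:10:00

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given.
# The fallback to 1 is an assumption so the script also runs outside Slurm.
NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

# Many OpenMP programs read the number of threads from OMP_NUM_THREADS.
export OMP_NUM_THREADS="$NUM_THREADS"

echo "Running with $OMP_NUM_THREADS thread(s)"
# ./my_program    # hypothetical program, replace with your own
```

Deriving the thread count from the Slurm variable, rather than hard-coding it, keeps the program in step with the request if you later change --cpus-per-task.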
Adjusting job settings after submission
While scontrol show
is a powerful command for showing information about your job, scontrol update
lets you change certain settings as long as your job is on hold or pending. First put your job on hold, update the settings and then release your job. Here is a brief, generic example:
scontrol hold <jobid>
scontrol update job <jobid> NumNodes=2-2 NumTasks=2 Features=intel16
scontrol release <jobid>
See the man page for the scontrol
command or the Slurm website for more information.
Adjusting job priority after submission
You can adjust the priority of your job yourself to some extent. You cannot change the Priority value directly because it is not a fixed value, but you can manipulate the “Nice” factor to influence it. The Nice factor will be subtracted from the Priority value that Slurm calculated.
For this, you do not have to hold your pending job, but you can adjust the value directly:
scontrol update job <jobid> Nice=<some_value>
Note that you can only give Nice a positive value which will decrease your priority. You cannot give it a negative value to increase your priority.
Batch Jobs
Batch processing is the recommended and most common way to use a Slurm-controlled HPC cluster. It is a non-interactive workflow for running jobs which is reusable and reproducible if set up correctly. It makes use of a self-contained shell script, a so-called batch script, that is submitted to Slurm using the sbatch command. The batch script contains all the information Slurm needs to set up and run your job without the need for you to take action.
General workflow for batch processing
The workflow for batch processing can be summarized like this:
1. Prepare your job, e.g., develop/write/compile your program/script/software and download data to the cluster
2. Write your batch script
3. Submit your batch script with
sbatch batch_file_name
4. Wait for Slurm to start your job
5. Let Slurm run your job
Optionally, you can monitor your job while it runs
6. Come back after Slurm has finished running your job and retrieve results (and assess job performance)
7. Go back to step 1 or 2 and run more jobs.
You can of course cancel your pending (step 4) or running (step 5) job at any point in time and go back to step 2 or even 1, in particular if you encountered any issues.
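When scripting this workflow, it can help to capture the job id at submission time so that it can be reused for monitoring and for assessing performance. The sketch below relies on sbatch's --parsable option, which prints only the job id, optionally followed by a semicolon and the cluster name. The helper name parse_jobid, the file name my_job.slurm and the cluster name in the sample are illustrative assumptions.

```shell
# Extract the job id from the output of "sbatch --parsable", which prints
# "<jobid>" or "<jobid>;<cluster>". parse_jobid is a hypothetical helper name.
parse_jobid() {
    echo "$1" | cut -d';' -f1
}

# On a cluster (my_job.slurm is a placeholder file name):
#   jobid=$(parse_jobid "$(sbatch --parsable my_job.slurm)")
#   squeue -j "$jobid"    # while the job is pending or running
#   seff "$jobid"         # after the job has finished
# Fixed sample with a made-up cluster name, so the sketch runs anywhere:
parse_jobid "492489;mycluster"
# prints: 492489
```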
The batch script
The batch script contains everything Slurm needs to make sure your job runs properly and successfully assuming that your program can run without errors. It is written as
Below, you can find an example of the basic layout of a batch script. Specific examples can be found in the section Tutorials.
#!/bin/bash
# The shebang line above must always be the first line

###########################
# Settings for slurm
# Not all of them are always necessary
# Some examples,
###########################
#SBATCH --job-name=<job_name>
#SBATCH --output=%x_%j.out
#SBATCH --mail-user=<your_e-mail_address>
#SBATCH --mail-type=<set_mail_type>
#SBATCH --partition=<partition>
#SBATCH --time=<d-hh:mm:ss>
#SBATCH --ntasks=<number_of_tasks>
...

###########################
# Set up your software environment
# e.g.,
###########################
# For example, load modules, set environment variables
...

###########################
# Copy your data to local scratch
###########################
# Stage your data on the local scratch
# if you want to avoid using network storage
# for reading and writing files while your job runs

###########################
# Execute tasks
###########################
# All the commands that you need
# for executing your program
# including the call to your program itself

###########################
# Move data products back to network storage
###########################
# If you staged or wrote data to
# the local scratch storage,
# you probably want to keep some data products
# after your job has finished and move those
# to your scratch directory in network storage
Job arrays
Job arrays allow you to submit multiple jobs with a single batch script. The jobs have to have the same #SBATCH settings to begin with, but subsequent commands in your batch script can make use of Slurm environment variables specific to array jobs that allow you to adapt commands to individual jobs in the job array. More information is available on the Slurm website: Slurm Workload Manager - Job Array Support (schedmd.com)
Job arrays are only available for batch jobs.
The id of an array job consists of the general job id and the id of the job in the array separated by an underscore, i.e., <jobid>_<array_id_counter>
Batch commands for job arrays
In order to tell Slurm that your batch file should be run as a job array, add the sbatch setting --array
to your batch script, for example:
# For a job array with an array index from 1 to 20
#SBATCH --array=1-20
# For a job array with specific indices
#SBATCH --array=1,3,5,7,9
# For a job array with an index from 1 to 20, limited to 4 jobs running at the same time
#SBATCH --array=1-20%4
When submitting an array job, squeue
will show the job id as <primary_id>_<array_index>
. However, internally Slurm will also log the job id of a job in a job array as <primary_id> + <array_index>
As such you have two options to easily specify your output file with a unique name.
# Option 1 using %j
# Here, %j will be %j=<primary_id> + <array_index>
#SBATCH --output=%x_%j.out
# Option 2 using %A (primary id) and %a (array index)
#SBATCH --output=%x_%A_%a.out
For seff
, you can also specify the job id in both ways.
Limitations for job arrays
To prevent users from submitting so many jobs that they would occupy the cluster for too long, a number of limitations apply:
ALICE
The total number of jobs that can be submitted with a job array is currently limited by MaxArraySize:
[me@nodelogin02 ~]$ scontrol show config | grep MaxArraySize
MaxArraySize = 1001
In addition, QOS settings limit the number of jobs that can run at any given time.
SHARK
The total number of jobs that can be submitted with a job array is currently limited by MaxArraySize:
[me@res-hpc-lo02 ~]$ scontrol show config | grep MaxArraySize
MaxArraySize = 125
Environment variables for job arrays
These are some of the Slurm environment variables specific to job arrays:
Environment variable | Comment |
---|---|
SLURM_ARRAY_JOB_ID | set to the first job ID of the array |
SLURM_ARRAY_TASK_ID | set to the job array index value |
SLURM_ARRAY_TASK_COUNT | set to the number of tasks in the job array |
SLURM_ARRAY_TASK_MAX | set to the highest job array index value |
SLURM_ARRAY_TASK_MIN | set to the lowest job array index value |
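A common use of SLURM_ARRAY_TASK_ID is to let each array task pick its own input from a list. The following is a minimal sketch under that assumption: the job name, the helper name nth_line and the sample file names are illustrative, and the list is written to a temporary file only to keep the example self-contained (in practice you would prepare it before submitting).

```shell
#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --array=1-3
#SBATCH --output=%x_%A_%a.out

# Return line number $1 of file $2. nth_line is a hypothetical helper name.
nth_line() {
    sed -n "${1}p" "$2"
}

# Placeholder input list with one (made-up) input file name per line.
INPUT_LIST=$(mktemp)
printf 'sample_a.dat\nsample_b.dat\nsample_c.dat\n' > "$INPUT_LIST"

# Outside a job array SLURM_ARRAY_TASK_ID is unset, so fall back to 1.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"
INPUT_FILE=$(nth_line "$TASK_ID" "$INPUT_LIST")

echo "Array task $TASK_ID processes $INPUT_FILE"
```

With --array=1-3, the three array tasks would each process a different line of the list.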
Interactive Jobs
It is also possible to run interactive jobs on a Slurm-based HPC cluster like ALICE or SHARK. They can be useful for quick interactive tests and data visualization, but they are not recommended for production jobs.
Interactive jobs also require you to specify resources for your job similar to batch jobs and they also have to go through the queue. This means that it depends on the load of the cluster or partition when your interactive job will run and you might have to wait.
The Slurm command salloc
allows you to request an interactive job. Here is an example:
ALICE
[me@nodelogin02 ~]$ salloc --ntasks=1 -p cpu-short --mem=1G --time=00:05:00
salloc: Granted job allocation 492489
salloc: Waiting for resource configuration
salloc: Nodes node001 are ready for job
[me@nodelogin02 ~]$ ssh node001
Last login: Tue Dec 14 16:32:57 2021
[me@node001 ~]$ echo $HOSTNAME
node001
[me@node001 ~]$ exit
logout
Connection to node001 closed.
[me@nodelogin02 ~]$ exit
exit
salloc: Relinquishing job allocation 492489
In the example above, we did not run a command, so we ended up in a bash environment on the login node from which we requested the allocation. We can then log in to the allocated node with ssh and run commands on it. With exit
we left the node and another exit
left the environment and released the node.
SHARK
[me@res-hpc-lo02 ~]$ salloc -N1
salloc: Granted job allocation 267
salloc: Waiting for resource configuration
salloc: Nodes res-hpc-exe013 are ready for job
[me@res-hpc-exe013 ~]$ squeue
JOBID  PARTITION  USER  ST  TIME  NODES  NODELIST(REASON)
267    all        user  R   0:04  1      res-hpc-exe013
[me@res-hpc-exe013 ~]$ exit
exit
salloc: Relinquishing job allocation 267
[me@res-hpc-lo02 ~]$
In the example above, we did not run a command so we ended up in the bash environment. With exit
we left the environment and we released the node.
If you need X11 forwarding, you can enable it for your interactive session by adding the option --x11
, for example:
salloc --ntasks=1 --mem=1G --time=00:05:00 --x11
Job Monitoring
Slurm provides a number of ways for you to monitor the state of your job.
One option is to use squeue
to get the overall state.
If you use entire nodes, you can query sinfo
to get basic load information.
If your program writes out sufficient information to the slurm output file (or a separate log file), you can check its content on a regular basis.
Another option is to log in to the node on which your job is running. After a job has started to run, Slurm grants you permission to log in to the compute node via ssh until the job terminates. You can use this to look at the utilization. Note that anything you do on the compute node will count against the resources that you requested, so you should make sure not to run any resource-intensive tasks.
Job Performance
A quick and easy way to get information about the performance of your job is to use the command seff
followed by the id of your job, i.e.
seff <jobid>
If you told Slurm to notify you after your job has finished, the e-mail sent by Slurm will most likely also contain information about the resources used.
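To help interpret a CPU efficiency figure such as the one seff reports, the sketch below computes the ratio of CPU time actually used to the CPU time that was reserved (allocated cores times wall time). This is an approximation of how such a figure is commonly derived, not a description of seff's exact implementation; cpu_efficiency is a hypothetical helper name and the numbers are made up.

```shell
# CPU efficiency in percent = CPU time used / (allocated cores * wall time).
# cpu_efficiency is a hypothetical helper; all times are in seconds.
cpu_efficiency() {
    # $1 = total CPU time used, $2 = number of allocated cores, $3 = wall time
    awk -v cpu="$1" -v cores="$2" -v wall="$3" \
        'BEGIN { printf "%.1f\n", 100 * cpu / (cores * wall) }'
}

# The input values could be taken from the accounting data, for example:
#   sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,AllocCPUS
# Example: 4 cores for 1 hour (3600 s) with 7200 s of CPU time used
cpu_efficiency 7200 4 3600
# prints: 50.0
```

A value far below 100 suggests that the job requested more cores than it could keep busy.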
More detailed performance statistics about your job or the program that you ran cannot be provided through Slurm-based tools. In this case, you need to deploy your own method to gather metrics that allow you to assess the performance.
Cancelling jobs
With the command scancel
, you can cancel any of your jobs (running or pending) or even all of your jobs. Here are some example commands:
Command | Meaning |
---|---|
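scancel <jobid> | Cancels the job with the given job id |
scancel <jobid1> <jobid2> <jobid3> | Cancels jobs with the given space-separated list of job ids |
scancel <jobid>_<array_id> | Cancels the job of a job array matching the specified job and array id |
scancel -u <username> | Cancels all jobs of the given user |
scancel -u <username> -t PENDING -p <partition_name> | Cancels all pending jobs of the given user in the given partition |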
See the man page for the scancel
command or the Slurm website for more information.