This section provides information on the workload manager Slurm used by ALICE and SHARK.

Slurm (Simple Linux Utility for Resource Management) is an open-source job scheduler that allocates compute resources on clusters for jobs. Slurm has been deployed at various national and international computing centres and is used by approximately 60% of the TOP500 supercomputers in the world.

The following pages will give you a basic overview of Slurm on ALICE. You can learn much more about Slurm and its commands from the official Slurm website.

This section is intended as an overview of the fundamental concepts of using Slurm. In the Tutorials, we provide a more practical introduction on how to use Slurm, including examples of batch job scripts (e.g., Your first job). We recommend that novice users read through both chapters.

Slurm basics

Common user commands

The following is a list of common user commands that will be discussed in more detail in this section and the tutorials.

| Command | Definition |
|---|---|
| sbatch | Submit a batch job script for execution (queued) |
| scancel | Delete a job |
| scontrol | Job status (detailed), several options only available to root |
| sinfo | Display state of partitions and nodes |
| squeue | Display state of all (queued) jobs |
| salloc | Submit a job for execution or initiate job in real-time (interactive job) |
| srun | Run parallel jobs |
| seff | Get a summary of how efficiently requested resources were used for a given job |

For a full overview of a command, enter man <command> or check the documentation on the Slurm website.

There are different ways to submit jobs to Slurm. We recommend primarily using batch scripts submitted with the sbatch command instead of interactive jobs requested with salloc.

Accounting info

The following commands provide accounting information and job statistics.

| Command | Info |
|---|---|
| sacct | Displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database |
| sstat | Displays various status information of a running job/step |
| sacctmgr | Used to view and modify Slurm account information |
| sreport | Generates reports from the Slurm accounting data |

Scheduling info

The following commands can provide useful information to understand the scheduling of your job:

| Command | Info |
|---|---|
| sprio | View the factors that comprise a job's scheduling priority |
| sshare | Tool for listing the shares of associations to a cluster |

Environment variables

Any environment variables that are set in your shell when you run the sbatch command will be passed to your job by default. Because a job then depends on the state of the shell from which it was submitted, it is best to set any environment variables that your program needs explicitly in your job script. This also makes it easier to reproduce your job results later, if necessary.

In addition to setting environment variables yourself, Slurm provides some environment variables of its own that you can use in your job scripts. Information on some of the common Slurm environment variables is listed in the chart below. For additional information, see the man page for sbatch.

| Environment variable | Definition |
|---|---|
| $SLURM_JOB_ID | ID of job allocation |
| $SLURM_SUBMIT_DIR | Directory from which the job was submitted |
| $SLURM_JOB_NODELIST | List of nodes allocated to the job |
| $SLURM_NTASKS | Total number of tasks (processes) in the current job |
| $SLURM_CPUS_ON_NODE | Processors available to the job on this node |
| $SLURM_LAUNCH_NODE_IPADDR | IP address of node where job launched |
| $SLURM_NNODES | Total number of nodes used by the job |
| $SLURM_NODEID | Relative node ID of current node |
| $SLURM_PROCID | MPI rank (or relative process ID) of the current process |
| $SLURM_TASK_PID | Process ID of task started |
| $SLURM_TASKS_PER_NODE | Number of tasks to be run on each node |
| $CUDA_VISIBLE_DEVICES | List of GPUs available for use |

Environment variables override any options set in a batch script. Command-line options override any previously set environment variables.
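As a minimal sketch of how these variables might be used, the following snippet prints a short job summary, for example at the start of a batch script. The function name describe_job and the fallback values after ":-" are assumptions made for illustration, so that the snippet also runs outside of a Slurm job.

```shell
#!/bin/bash
# Sketch: summarize the Slurm-provided environment inside a job.
# describe_job is a hypothetical helper name. The fallbacks after ":-"
# are only used when the snippet runs outside of a Slurm job.

describe_job() {
    echo "Job ID:           ${SLURM_JOB_ID:-not set}"
    echo "Submit directory: ${SLURM_SUBMIT_DIR:-$PWD}"
    echo "Allocated nodes:  ${SLURM_JOB_NODELIST:-not set}"
    echo "Number of tasks:  ${SLURM_NTASKS:-1}"
}

describe_job
```

Writing such a summary to the job's output file can make it easier to relate results to a specific job later on.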

Job settings and configuration information

With the Slurm command scontrol show, you can get a more detailed overview of your running job, node hardware and partitions:

| Command | Info |
|---|---|
| scontrol show job <job_id> | Get settings for a specific job |
| scontrol show partition <partition> | See the configuration of a given partition |
| scontrol show node <node> | See the configuration of a given node |

Slurm partitions

Slurm makes nodes on a cluster available as partitions, i.e., one partition gives access to a certain number of nodes and specific resource limits. Jobs are always submitted to either a default partition or a user-specified partition.


ALICE

You can find an overview of all currently available partitions on ALICE here: Partitions on ALICE


SHARK

You can find an overview of all currently available partitions on SHARK here: Partitions on SHARK


sinfo

sinfo provides information about the current state of partitions on a cluster. The following commands are useful:

| Command | Description |
|---|---|
| sinfo | Shows the partitions available to the users and basic state information |
| sinfo -a | Shows all partitions on the cluster including those not available to the user |
| sinfo -s | Prints a summary which includes the number of nodes in the various states |
| sinfo -N -l | Provides a node-centric view of the partitions |
| sinfo -N --Format=Nodelist:20,Available:10,Time:15,Partition:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10 | Provides a quick overview of the available resources for all nodes on the cluster |
| sinfo -p <partition_name> -N --Format=Nodelist:20,Available:10,Time:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10 | Provides a quick overview of the available resources for all nodes in the partition <partition_name> |
| sinfo --Format=Nodelist:20,Available:10,Time:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10,Partition | Provides a quick overview of the available resources per partition. Adding -p <partition_name> will show the same information for the given partition only |

Here, we show the output of some of the above commands for both ALICE and SHARK. Note that the output will most likely look different when you run them because of the current load of the system and possible changes made to the partitions. We encourage you to try them yourself.


ALICE

[me@nodelogin02 ~]$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
testing       up    1:00:00      1    mix nodelogin01
testing       up    1:00:00      1   idle nodelogin02
cpu-short*    up    4:00:00     20  alloc node[001-020]
cpu-medium    up 1-00:00:00     19  alloc node[002-020]
cpu-long      up 7-00:00:00     18  alloc node[003-020]
cpu-short     up    4:00:00      7    mix node[852-853,856-860]
gpu-short     up    4:00:00      2  alloc node[854-855]
gpu-short     up    4:00:00      1   idle node851
gpu-medium    up 1-00:00:00      7    mix node[852-853,856-860]
gpu-medium    up 1-00:00:00      2  alloc node[854-855]
gpu-medium    up 1-00:00:00      1   idle node851
gpu-long      up 7-00:00:00      7    mix node[852-853,856-860]
gpu-long      up 7-00:00:00      2  alloc node[854-855]
mem           up 14-00:00:0      1    mix node801
amd-short     up    4:00:00      1    mix node802

[me@nodelogin02 ~]$ sinfo -s
PARTITION  AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
testing       up    1:00:00          0/2/0/2 nodelogin[01-02]
cpu-short*    up    4:00:00        20/0/0/20 node[001-020]
cpu-medium    up 1-00:00:00        19/0/0/19 node[002-020]
cpu-long      up 7-00:00:00        18/0/0/18 node[003-020]
gpu-short     up    4:00:00         9/1/0/10 node[851-860]
gpu-medium    up 1-00:00:00         9/1/0/10 node[851-860]
gpu-long      up 7-00:00:00          9/0/0/9 node[852-860]
mem           up 14-00:00:0          1/0/0/1 node801
amd-short     up    4:00:00          1/0/0/1 node802

[me@nodelogin02 ~]$ sinfo -p gpu-long -N --Format=Nodelist:20,Available:10,Time:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10
NODELIST            AVAIL     TIMELIMIT      STATE          MEMORY         FREE_MEM       CPU_LOAD       CPUS(A/I/O/T)  GRES_USED      AVAIL_FEATURES      REASON
node852             up        7-00:00:00     mixed          380851         286301         13.42          22/2/0/24      gpu:3          Geforce.rtx2080Ti   none
node853             up        7-00:00:00     mixed          380851         320623         1.04           22/2/0/24      gpu:3          Geforce.rtx2080Ti   none
node854             up        7-00:00:00     allocated      380851         280041         39.07          24/0/0/24      gpu:4          Geforce.rtx2080Ti   none
node855             up        7-00:00:00     allocated      380851         306605         41.95          24/0/0/24      gpu:4          Geforce.rtx2080Ti   none
node856             up        7-00:00:00     mixed          380851         313819         1.20           4/20/0/24      gpu:4          Geforce.rtx2080Ti   none
node857             up        7-00:00:00     mixed          380851         345208         1.22           4/20/0/24      gpu:4          Geforce.rtx2080Ti   none
node858             up        7-00:00:00     mixed          380851         331898         4.26           4/20/0/24      gpu:4          Geforce.rtx2080Ti   none
node859             up        7-00:00:00     mixed          380851         235598         4.21           14/10/0/24     gpu:4          Geforce.rtx2080Ti   none
node860             up        7-00:00:00     mixed          380851         306182         2.76           18/6/0/24      gpu:4          Geforce.rtx2080Ti   none

SHARK

[me@res-hpc-lo02 ~]$ sinfo
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*               up   infinite     13    mix res-hpc-exe[001,014,024,036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
all*               up   infinite     13   idle res-hpc-exe[002-003,005,007-009,011-013,027,029-031]
gpu                up   infinite      3    mix res-hpc-gpu[01-02,09]
gpu                up   infinite      2   idle res-hpc-gpu[03-04]
lumcdiagnostics    up   infinite     15    mix res-hpc-exe[001,014,024,032-033,036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
lumcdiagnostics    up   infinite     15   idle res-hpc-exe[002-003,005,007-009,011-013,027,029-031,034-035]
highmem            up   infinite      2    mix res-hpc-mem[01-02]
short              up    1:00:00     16    mix res-hpc-exe[001,014,024,032-033,036,039-043],res-hpc-gpu[01-02,05-06,09]
short              up    1:00:00     19   idle res-hpc-exe[002-003,005,007-009,011-013,027,029-031,034-035],res-hpc-gpu[03-04],res-hpc-path[01-02]
highmemgpu         up   infinite      1   idle res-hpc-gpu07

[me@res-hpc-lo02 ~]$ sinfo -s
PARTITION       AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
all*               up   infinite       13/13/0/26 res-hpc-exe[001-003,005,007-009,011-014,024,027,029-031,036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
gpu                up   infinite          3/2/0/5 res-hpc-gpu[01-04,09]
lumcdiagnostics    up   infinite       15/15/0/30 res-hpc-exe[001-003,005,007-009,011-014,024,027,029-036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
highmem            up   infinite          2/0/0/2 res-hpc-mem[01-02]
short              up    1:00:00       16/19/0/35 res-hpc-exe[001-003,005,007-009,011-014,024,027,029-036,039-043],res-hpc-gpu[01-06,09],res-hpc-path[01-02]
highmemgpu         up   infinite          0/1/0/1 res-hpc-gpu07

[me@res-hpc-lo02 ~]$ sinfo -p gpu -N --Format=Nodelist:20,Available:10,Time:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:30,Features:20,Reason:10
NODELIST            AVAIL     TIMELIMIT      STATE          MEMORY         FREE_MEM       CPU_LOAD       CPUS(A/I/O/T)  GRES_USED                     AVAIL_FEATURES      REASON
res-hpc-gpu01       up        infinite       mixed          515000         110911         5.53           9/39/0/48      gpu:TitanXp:1(IDX:0)          Platinum8160        none
res-hpc-gpu02       up        infinite       mixed          514000         198863         0.00           8/40/0/48      gpu:TitanXp:1(IDX:1)          Platinum8160        none
res-hpc-gpu03       up        infinite       idle           48000          13443          0.00           0/6/0/6        gpu:GRIDV10032g:0(IDX:N/A)    Gold6252            none
res-hpc-gpu04       up        infinite       idle           48000          23317          0.00           0/6/0/6        gpu:GRIDV10032g:0(IDX:N/A)    Gold6252            none
res-hpc-gpu09       up        infinite       mixed          48000          17064          0.00           4/2/0/6        gpu:GRIDV10016g:1(IDX:0)      Gold6252            none

The node states have the following meaning:

| State | Description |
|---|---|
| idle | The node is not used, but available for new jobs to run |
| mix(ed) | One or more jobs are running on the node, but there are still resources free on this node for more jobs |
| alloc(ated) | The whole node is allocated by one or more jobs; no additional jobs can run on this node |
| draining or drained | The node is being drained or has been drained. New jobs cannot be scheduled until the node is undrained |

The abbreviations (A/I/O/T) in NODES(A/I/O/T) or CPUS(A/I/O/T) mean (Allocated/Idle/Other/Total).
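To illustrate the (A/I/O/T) notation, the following sketch splits such a field into its four parts and prints the idle count. The helper name idle_count is hypothetical; the sample value is the CPUS(A/I/O/T) entry of node852 from the ALICE output above.

```shell
#!/bin/bash
# Sketch: split an (A/I/O/T) field such as "22/2/0/24" into its parts.
# idle_count is a hypothetical helper name.

idle_count() {
    local allocated idle other total
    # Use "/" as the field separator for this single read only
    IFS=/ read -r allocated idle other total <<< "$1"
    echo "$idle"
}

idle_count "22/2/0/24"   # prints 2 (2 of the 24 CPUs are idle)
```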

squeue

With squeue, you can get information about your running jobs and jobs from other users:

| Command | Description |
|---|---|
| squeue | Returns a list of jobs in the current queue for all partitions available to the user |
| squeue -a | Returns a list of jobs in the current queue for all partitions on the cluster |
| squeue -l | Same as squeue, but with additional information on the jobs in the queue |
| squeue --me | Lists only a user’s jobs in the queue |
| squeue --me --start | Lists a user’s pending jobs and their estimated starting time |

Jobs typically pass through several states in the course of their execution. The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED. An explanation of some states follows:

| State | State (full) | Explanation |
|---|---|---|
| CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
| CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
| CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
| F | FAILED | Job terminated with non-zero exit code or other failure condition. |
| PD | PENDING | Job is awaiting resource allocation. |
| R | RUNNING | Job currently has an allocation. |
| S | SUSPENDED | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. |
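When post-processing squeue output in a script, the compact state codes can be mapped back to their full names. The following is a minimal sketch that covers only the codes from the table above; the function name describe_state is hypothetical.

```shell
#!/bin/bash
# Sketch: translate compact job state codes into their full names.
# Only the states listed in the table above are covered;
# describe_state is a hypothetical helper name.

describe_state() {
    case "$1" in
        CA) echo "CANCELLED" ;;
        CD) echo "COMPLETED" ;;
        CG) echo "COMPLETING" ;;
        F)  echo "FAILED" ;;
        PD) echo "PENDING" ;;
        R)  echo "RUNNING" ;;
        S)  echo "SUSPENDED" ;;
        *)  echo "UNKNOWN ($1)" ;;
    esac
}

describe_state PD   # prints PENDING
```

On a cluster, the codes could come from, e.g., squeue --me --noheader --format=%t, which prints one compact state code per job.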

Job information and cluster configuration with scontrol

With the Slurm command scontrol you can get a more detailed overview of your running job, node hardware and partitions, e.g.,

[user@res-hpc-lo02 ~]$ scontrol show job 260
JobId=260 JobName=IMB
   UserId=user(225812) GroupId=Domain Users(513) MCS_label=N/A
   Priority=35603 Nice=0 Account=dnst-ict QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:13 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2020-01-23T10:27:45 EligibleTime=2020-01-23T10:27:45
   AccrueTime=2020-01-23T10:27:45
   StartTime=2020-01-23T10:27:45 EndTime=2020-01-23T10:57:45 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-01-23T10:27:45
   Partition=all AllocNode:Sid=res-hpc-ma01:46428
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=res-hpc-exe[013-014]
   BatchHost=res-hpc-exe013
   NumNodes=2 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,mem=64G,node=2,billing=32
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/user/Software/imb/mpi-benchmarks/imb.slurm
   WorkDir=/home/user/Software/imb/mpi-benchmarks
   StdErr=/home/user/Software/imb/mpi-benchmarks/job.%J.err
   StdIn=/dev/null
   StdOut=/home/user/Software/imb/mpi-benchmarks/job.%J.out
   Power=
   MailUser=user@gmail.com MailType=BEGIN,END,FAIL

[user@res-hpc-lo02 ~]$ scontrol show node res-hpc-exe014
NodeName=res-hpc-exe014 Arch=x86_64 CoresPerSocket=12 
   CPUAlloc=16 CPUTot=24 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=res-hpc-exe014 NodeHostName=res-hpc-exe014 Version=20.02.0-0pre1
   OS=Linux 4.18.0-80.11.2.el8_0.x86_64 #1 SMP Tue Sep 24 11:32:19 UTC 2019 
   RealMemory=386800 AllocMem=32768 FreeMem=380208 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=all 
   BootTime=2019-12-11T11:51:40 SlurmdStartTime=2020-01-14T15:36:20
   CfgTRES=cpu=24,mem=386800M,billing=24
   AllocTRES=cpu=16,mem=32G
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[user@res-hpc-lo02 ~]$ scontrol show partition all
PartitionName=all
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=res-hpc-exe[013-014]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=40 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=2048 MaxMemPerNode=UNLIMITED

Job Resources

Determining what resources to request

Requesting the right amount of resources for jobs is one of the most essential aspects of using Slurm (or running any jobs on an HPC cluster).

Before you submit a job for batch processing, it is important to know what the requirements of your program are so that it can run properly. Each program and workflow has unique requirements so we advise that you determine what resources you need before you submit your job.

Keep in mind that increasing the amount of compute resources may also increase the amount of time that your job spends waiting in the queue. Within some limits, you may request whatever resources you need but bear in mind that other researchers need to be able to use those resources as well.

It is vital that you specify the resources you need in as much detail as possible. This will help Slurm to better schedule your job and to allocate free resources to other users.

Finding the right settings often requires a bit of trial and error. It usually helps to run a few small test jobs and use their performance to estimate the resources for the production job. After each job, you can check how much of the requested resources was actually used. It is also worthwhile to check this after production jobs to see if you can or should make adjustments.

Below are some ways to specify the resources to ask for in your job script. These are options defined for the sbatch and salloc commands. There are additional options that you can find by checking the man pages for each command or the Slurm website.

Specifying resources and other settings for jobs

Slurm has its own syntax to request compute resources. In addition, Slurm has a number of settings for jobs that make it easier to organize jobs. Below is a summary table of some commonly requested resources and the Slurm syntax to get it. These options can be passed to sbatch or provided in a batch job script using the #SBATCH --option=<value> syntax. For a complete listing of request syntax, run the command man sbatch or check the Slurm website.

| Category | Option | Meaning |
|---|---|---|
| Job organization | --job-name=<job_name> | Setting the name of the job (will be displayed in squeue output) |
| | --mail-user=<email> | Setting where to send email alerts to |
| | --mail-type=<type> | Setting when to send email alerts; <type> can be BEGIN, END, FAIL, REQUEUE or ALL |
| | --output=<out_file> | Setting the name of the output file (default is slurm-<jobid>.out). It is highly recommended to set a name for the output file that can be readily associated with the job. An easy way to achieve this is to use the Slurm filename patterns “%x” for the job name and “%j” for the job id, i.e., --output=%x_%j.out |
| | --error=<error_file> | Setting a name for a separate file for error messages only |
| Nodes and CPUs | --nodes=<number> | Request this many nodes on the cluster (default is 1 unless other parameters allow splitting up the job over multiple nodes). By default, 1 core is used on each node. |
| | --ntasks=<number> | Request this many tasks on the cluster. A task is an instance of a running program. Use this, for example, with MPI. Defaults to 1 task per node. You can request multiple nodes, multiple tasks and multiple CPUs per task and/or per node. For jobs with --ntasks>1, Slurm will decide on how many nodes to use based on the available resources if --nodes or --ntasks-per-node are not set. |
| | --cpus-per-task=<number> | Request this many CPUs for each task set by --ntasks. |
| Running time | --time=<hh:mm:ss> | The walltime or running time of your job. If you do not define how long your job will run, a default value might be set for a given partition. The maximum walltime that is available also depends on the partition that you use. |
| Memory | --mem=<number[unit]> | Request this amount of memory for your job (per node). Unit suffixes are K, M, G or T. If not specified, a default value will be set depending on the partition, which might not be suitable for your job. For parallelized jobs, it is better to use --mem-per-cpu. |
| | --mem-per-cpu=<number[unit]> | Minimum memory required per allocated CPU. Unit suffixes are K, M, G or T. |
| Partition | --partition=<partition_name> | Request specified partition/queue |
| GPU | --gres=gpu:[type of gpu]:[number of gpus] | Request a number of GPUs for each node, optionally specifying the type of GPU (default is 0 GPUs). The number and type of GPUs available depend on the nodes in the cluster and partitions. If you need a GPU, you always have to request one explicitly. If you do not specify a GPU, Slurm will not assign one to your job. |
| | --gpus=[number of gpus] | Request GPUs for the job. We recommend using --gres=gpu:[number of gpus] instead to fine-tune the GPU usage on a single node. For jobs with --gpus>1, Slurm will decide on how many nodes to use based on the available resources if --nodes or --gpus-per-node are not set. This may split up the GPUs across nodes, which makes communication of results between them more complicated with added overhead (and requires your program to work with, e.g., MPI). |
| | --mem-per-gpu=<number[unit]> | Minimum memory required per allocated GPU. Unit suffixes are K, M, G or T. |
| Other settings | --constraint=<attribute> | Request nodes which have the specific attributes (e.g., avx, IB) |

If you want to run a job that can make use of multiple CPUs, but which does not use MPI, then it is usually best to use #SBATCH --ntasks=1 and #SBATCH --cpus-per-task=c where “c” is the number of CPUs you want. In most cases, setting --ntasks>1 only makes sense when running MPI or creating job steps with srun for parallelization in the batch file.
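A minimal sketch of such a multi-threaded (non-MPI) job is shown here. It assumes a program that is parallelized with OpenMP and therefore reads OMP_NUM_THREADS; the job name, the resource values and the program name are placeholders, not recommendations.

```shell
#!/bin/bash
#SBATCH --job-name=threads_example
#SBATCH --output=%x_%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1G
#SBATCH --time=00:10:00

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given.
# The fallback of 1 is only used when running outside of Slurm.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "Running with ${OMP_NUM_THREADS} thread(s)"
# ./my_threaded_program   # hypothetical program, replace with your own
```

Deriving the thread count from SLURM_CPUS_PER_TASK keeps the program's parallelism in line with the request, so the value only has to be changed in one place.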

Adjusting job settings after submission

While scontrol show is a powerful command for displaying information about your job, scontrol update lets you change certain settings as long as your job is on hold or pending. First put your job on hold, update the settings, and then release your job. Here is a brief and generic example:

scontrol hold <jobid>
scontrol update job <jobid> NumNodes=2-2 NumTasks=2 Features=intel16
scontrol release <jobid>

See the man page for the scontrol command or the Slurm website for more information.

Adjusting job priority after submission

You can adjust the priority of your job yourself to some extent. You cannot change the Priority value directly because it is not a fixed value, but you can set the “Nice” factor, which is subtracted from the Priority value that Slurm calculated.

For this, you do not have to hold your pending job, but you can adjust the value directly:

scontrol update job <jobid> Nice=<some_value>

Note that you can only give Nice a positive value which will decrease your priority. You cannot give it a negative value to increase your priority.

Batch Jobs

Batch processing is the recommended and common way to use a Slurm-controlled HPC cluster. It is a non-interactive workflow for running jobs which is reusable and reproducible if set up correctly. It makes use of a self-contained shell script, a so-called batch script, that is submitted to Slurm using the sbatch command. The batch script contains all the information Slurm needs to set up and run your job without the need for you to take action.

General workflow for batch processing

The workflow for batch processing can be summarized like this:

  1. Prepare your job, e.g., develop/write/compile your program/script/software and download data to the cluster

  2. Write your batch script

  3. Submit your batch script with sbatch batch_file_name

  4. Wait for Slurm to start your job

  5. Let Slurm run your job

    1. Optionally, you can monitor your job while it runs

  6. Come back after Slurm has finished running your job and retrieve results (and assess job performance)

  7. Go back to step 1 or 2 and run more jobs.

You can of course cancel your pending (step 4) or running (step 5) job at any point in time and go back to step 2 or even 1, in particular if you encounter any issues.

The batch script

The batch script contains everything Slurm needs to make sure your job runs properly and successfully, assuming that your program can run without errors. It is written as a shell script.

Below, you can find an example of a basic layout of a batch script. Specific examples can be found in the section Tutorials.

#!/bin/bash
# The line above must always be the first line, without a trailing comment

###########################
# Settings for slurm
# Not all of them are always necessary
# Some examples,
###########################
#SBATCH --job-name=<job_name>
#SBATCH --output=%x_%j.out
#SBATCH --mail-user=<your_e-mail_address>
#SBATCH --mail-type=<set_mail_type>
#SBATCH --partition=<partition>
#SBATCH --time=<d-hh:mm:ss>
#SBATCH --ntasks=<number_of_tasks>
...

###########################
# Set up your software environment
# e.g.,
###########################
# For example, load modules, set environment variables
...

###########################
# Copy your data to local scratch
###########################
# Stage your data on the local scratch
# if you want to avoid using network storage 
# for reading and writing files while your job runs

###########################
# Execute tasks
###########################
# All the commands that you need
# for executing your program
# including the call to your program itself

###########################
# Move data products back to network storage
###########################
# If you staged or wrote data to
# the local scratch storage,
# you probably want to keep some data products
# after your job has finished and move those
# to your scratch directory in network storage

Job arrays

Job arrays allow you to submit multiple jobs with a single batch script. The jobs have to have the same #SBATCH settings to begin with, but subsequent commands in your batch script can make use of Slurm environment variables specific to array jobs that allow you to adapt commands to the individual jobs in the job array. More information is available on the Slurm website: Slurm Workload Manager - Job Array Support (schedmd.com)

Job arrays are only available for batch jobs.

The id of an array job consists of the general job id and the id of the job in the array separated by an underscore, i.e., <jobid>_<array_id_counter>.

Batch commands for job arrays

In order to tell Slurm that your batch file should be run as a job array, add the sbatch setting --array to your batch script, for example:

# For a job array with an array index from 1 to 20
#SBATCH --array=1-20

# For a job array with specific indices
#SBATCH --array=1,3,5,7,9

# For a job array with an index from 1 to 20, limited to 4 jobs running at the same time
#SBATCH --array=1-20%4

When submitting an array job, squeue will show the job id as <primary_id>_<array_index>. However, internally Slurm also assigns each job in a job array its own unique job id.

As such, you have two options to easily specify your output file with a unique name.

# Option 1 using %j
# Here, %j will be the unique job id of the individual job in the array
#SBATCH --output=%x_%j.out

# Option 2 using %A (primary id) and %a (array index)
#SBATCH --output=%x_%A_%a.out

For seff, you can also specify the job id in both ways.

Limitations for job arrays

In order to limit users from submitting too many jobs that would occupy the cluster for too long, a number of limitations apply:


ALICE

The maximum number of jobs that can be submitted with a single job array is currently set by MaxArraySize:

[me@nodelogin02 ~]$ scontrol show config | grep MaxArraySize
MaxArraySize            = 1001

In addition, QOS settings limit the number of jobs that can run at any given time.


SHARK

The maximum number of jobs that can be submitted with a single job array is currently set by MaxArraySize:

[me@res-hpc-lo02 ~]$ scontrol show config | grep MaxArraySize
MaxArraySize            = 125


Environment variables for job arrays

These are some of the Slurm environment variables specific to job arrays:

| Environment variable | Comment |
|---|---|
| SLURM_ARRAY_JOB_ID | set to the first job ID of the array |
| SLURM_ARRAY_TASK_ID | set to the job array index value |
| SLURM_ARRAY_TASK_COUNT | set to the number of tasks in the job array |
| SLURM_ARRAY_TASK_MAX | set to the highest job array index value |
| SLURM_ARRAY_TASK_MIN | set to the lowest job array index value |
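As a minimal sketch of how these variables can adapt a single batch script to each job in the array, the following example derives an input file name from the array index. The job name, the resource values and the naming scheme input_<index>.txt are assumptions made for illustration.

```shell
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --output=%x_%A_%a.out
#SBATCH --array=1-20%4
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

# Hypothetical naming scheme: one input file per array index,
# e.g., input_1.txt ... input_20.txt
# The fallback index of 1 is only used when running outside of Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
input_file="input_${task_id}.txt"

echo "Array task ${task_id} would process ${input_file}"
# ./my_program "${input_file}"   # hypothetical program, replace with your own
```

Each job in the array runs the same script, so only the value of SLURM_ARRAY_TASK_ID differs between them.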

Interactive Jobs

It is also possible to run interactive jobs on a Slurm-based HPC cluster like ALICE or SHARK. They can be useful for quick interactive tests and data visualization, but they are not recommended for production jobs.

Interactive jobs also require you to specify resources for your job similar to batch jobs and they also have to go through the queue. This means that it depends on the load of the cluster or partition when your interactive job will run and you might have to wait.

The Slurm command salloc allows you to request an interactive job. Here is an example:


ALICE

[me@nodelogin02 ~]$ salloc --ntasks=1 -p cpu-short --mem=1G --time=00:05:00
salloc: Granted job allocation 492489
salloc: Waiting for resource configuration
salloc: Nodes node001 are ready for job
[me@nodelogin02 ~]$ ssh node001
Last login: Tue Dec 14 16:32:57 2021
[me@node001 ~]$ echo $HOSTNAME
node001
[me@node001 ~]$ exit
logout
Connection to node001 closed.
[me@nodelogin02 ~]$ exit
exit
salloc: Relinquishing job allocation 492489

In the example above, we did not run a command, so we ended up in a bash environment on the login node from which we requested the allocation. We can then log in to the allocated node with ssh and run commands on it. With the first exit, we left the node; with the second exit, we left the environment and released the node.


SHARK

[me@res-hpc-lo02 ~]$ salloc -N1
salloc: Granted job allocation 267
salloc: Waiting for resource configuration
salloc: Nodes res-hpc-exe013 are ready for job

[me@res-hpc-exe013 ~]$ squeue 
             JOBID  PARTITION      USER  ST        TIME   NODES NODELIST(REASON) 
               267        all      user   R        0:04       1 res-hpc-exe013 
[me@res-hpc-exe013 ~]$ exit
exit
salloc: Relinquishing job allocation 267

[me@res-hpc-lo02 ~]$ 

In the example above, we did not run a command so we ended up in the bash environment. With exit we left the environment and we released the node.


If you need X11 forwarding, you can enable it for your interactive session by adding the option --x11, for example:

salloc --ntasks=1 --mem=1G --time=00:05:00 --x11

Job Monitoring

Slurm provides a number of ways for you to monitor the state of your job.

One option is to use squeue to get the overall state.

If you use entire nodes, you can query sinfo to get basic load information.

If your program writes out sufficient information to the slurm output file (or a separate log file), you can check its content on a regular basis.

Another option is to log in to the node on which your job is running. After a job has started to run, Slurm grants you permission to log in to the compute node via ssh until the job terminates. You can use this to look at the utilization. Note that anything you do on the compute node will count against the resources that you requested, so you should make sure not to run any resource-intensive tasks.

Job Performance

A quick and easy way to get information about the performance of your job is to use the command seff followed by the id of your job, i.e.

seff <jobid>

If you told Slurm to notify you after your job has finished, the e-mail sent by Slurm will most likely also contain information about the resources used.

More detailed performance statistics about your job or the program that you ran cannot be provided through Slurm-based tools. In this case, you need to deploy your own method to gather metrics that allow you to assess the performance.

Cancelling jobs

With the command scancel, you can cancel any of your jobs (running or pending) or even all of your jobs. Here are some example commands:

| Command | Meaning |
|---|---|
| scancel <jobid> | Cancels the job with the given job id |
| scancel <jobid1> <jobid2> | Cancels the jobs with the given space-separated list of job ids |
| scancel <jobid_[min_arrayid-max_arrayid]> | Cancels the jobs of a job array that match the specified job id and range of array ids |
| scancel --user=<username> | Cancels all jobs of the given user |
| scancel --state=PENDING --user=<username> --partition=<partition> | Cancels all pending jobs of the given user in the given partition |

See the man page for the scancel command or the Slurm website for more information.
