Running jobs on ALICE or SHARK - Workload manager Slurm

This section provides information on the workload manager Slurm used by ALICE and SHARK.

Slurm (Simple Linux Utility for Resource Management) is an open-source job scheduler that allocates compute resources on clusters for jobs. Slurm has been deployed at various national and international computing centres, and it is used by approximately 60% of the TOP500 supercomputers in the world.

The following pages will give you a basic overview of Slurm on ALICE. You can learn much more about Slurm and its commands from the official Slurm website.

This section is intended as an overview of the fundamental concepts of using Slurm. On the page https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37027894, we provide a more practical introduction to using Slurm, including examples of batch job scripts (e.g., https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37027911). We recommend that novice users read through both chapters.

Slurm basics

Common user commands

The following is a list of common user commands that will be discussed in more detail in this section and the tutorials.

| Command | Definition |
| --- | --- |
| sbatch | Submit a batch job script for execution (queued) |
| scancel | Cancel a job |
| scontrol | Job status (detailed); several options are only available to root |
| sinfo | Display state of partitions and nodes |
| squeue | Display state of all (queued) jobs |
| salloc | Request resources for a job in real time (interactive job) |
| srun | Run parallel jobs |
| seff | Get a summary of how efficiently requested resources were used for a given job |

For a full overview of a command, enter man <command> or check the documentation on the Slurm website.

There are different ways to submit jobs to Slurm. We recommend primarily using batch scripts submitted with the sbatch command instead of interactive jobs requested with salloc.

Accounting info

The following table lists a number of commands for getting accounting information and job statistics.

| Command | Info |
| --- | --- |
| sacct | Displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database |
| sstat | Display various status information of a running job/step |
| sacctmgr | Used to view and modify Slurm account information |
| sreport | Generate reports from the Slurm accounting data |

Scheduling info

The following commands can provide useful information for understanding the scheduling of your job:

| Command | Info |
| --- | --- |
| sprio | View the factors that comprise a job's scheduling priority |
| sshare | Tool for listing the shares of associations to a cluster |

Environment variables

By default, the environment variables that are set in your shell when you run the sbatch command will be passed to your job. Because this environment can differ from one session to the next, it is best to set any environment variables that your program needs to function properly in your job script. This also makes it easier to reproduce your job results later, if necessary.

In addition to setting environment variables yourself, Slurm provides some environment variables of its own that you can use in your job scripts. Information on some of the common Slurm environment variables is listed in the chart below. For additional information, see the man page for sbatch.

| Environment variable | Definition |
| --- | --- |
| $SLURM_JOB_ID | ID of job allocation |
| $SLURM_SUBMIT_DIR | Directory from which the job was submitted |
| $SLURM_JOB_NODELIST | List of nodes allocated to the job |
| $SLURM_CPUS_ON_NODE | Processors available to the job on this node |
| $SLURM_LAUNCH_NODE_IPADDR | IP address of node where job launched |
| $SLURM_NNODES | Total number of nodes used by the job |
| $SLURM_NODEID | Relative node ID of current node |
| $SLURM_NTASKS | Total number of tasks (processes) in current job |
| $SLURM_PROCID | MPI rank (or relative process ID) of the current process |
| $SLURM_TASK_PID | Process ID of task started |
| $SLURM_TASKS_PER_NODE | Number of tasks to be run on each node |
| $CUDA_VISIBLE_DEVICES | List of GPUs available for use |

Environment variables override any options set in a batch script. Command-line options override any previously set environment variables.
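As a minimal sketch of how the Slurm-provided variables can be used in a job script, the following snippet prints a short job summary. The function name print_job_summary and the "not set" fallback text are illustrative assumptions and not part of Slurm; the fallbacks are only there so that the snippet also runs outside of a Slurm job.

```shell
#!/bin/bash
# Print a short summary of the current job from Slurm's environment variables.
# print_job_summary is a hypothetical helper name, not a Slurm command.
# The ":-" fallbacks keep the function usable outside of a Slurm job.
print_job_summary() {
    echo "job id:     ${SLURM_JOB_ID:-not set}"
    echo "submit dir: ${SLURM_SUBMIT_DIR:-not set}"
    echo "node list:  ${SLURM_JOB_NODELIST:-not set}"
    echo "tasks:      ${SLURM_NTASKS:-not set}"
}

print_job_summary
```

Placing such a call at the top of a batch script records the job settings in the Slurm output file, which can help when reproducing results later.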

Job settings and configuration information

With the Slurm command scontrol show, you can get a more detailed overview of your running job, node hardware and partitions:

| Command | Info |
| --- | --- |
| scontrol show job <job_id> | Get settings for a specific job |
| scontrol show partition <partition> | See the configuration of a given partition |
| scontrol show node <node> | See the configuration of a given node |

 

Slurm partitions

Slurm makes nodes on a cluster available as partitions, i.e., one partition gives access to a certain number of nodes and specific resource limits. Jobs are always submitted to either a default partition or a user-specified partition.


ALICE

You can find an overview of all currently available partitions on ALICE here: https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37519504


SHARK

You can find an overview of all currently available partitions on SHARK here: https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37519928


sinfo

sinfo provides information about the current state of partitions on a cluster. The following commands are useful:

| Command | Description |
| --- | --- |
| sinfo | Shows the partitions available to the user and basic state information |
| sinfo -a | Shows all partitions on the cluster including those not available to the user |
| sinfo -s | Prints a summary which includes the number of nodes in the various states |
| sinfo -N -l | Provides a node-centric view of the partitions |
| sinfo -N --Format=Nodelist:20,Available:10,Time:15,Partition:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10 | Provides a quick overview of the available resources for all nodes on the cluster |
| sinfo -p <partition_name> -N --Format=Nodelist:20,Available:10,Time:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10 | Provides a quick overview of the available resources for all nodes in the partition <partition_name> |
| sinfo --Format=Nodelist:20,Available:10,Time:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10,Partition | Provides a quick overview of the available resources per partition. Adding -p <partition_name> will show the same information for the given partition only |

Here, we show the output of some of the above commands for both ALICE and SHARK. Note that the output will most likely look different when you run them, because it depends on the current load of the system and on possible changes made to the partition system. We encourage you to try them yourself.


ALICE

    [me@nodelogin02 ~]$ sinfo
    PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
    testing      up     1:00:00     1      mix    nodelogin01
    testing      up     1:00:00     1      idle   nodelogin02
    cpu-short*   up     4:00:00     20     alloc  node[001-020]
    cpu-medium   up     1-00:00:00  19     alloc  node[002-020]
    cpu-long     up     7-00:00:00  18     alloc  node[003-020]
    gpu-short    up     4:00:00     7      mix    node[852-853,856-860]
    gpu-short    up     4:00:00     2      alloc  node[854-855]
    gpu-short    up     4:00:00     1      idle   node851
    gpu-medium   up     1-00:00:00  7      mix    node[852-853,856-860]
    gpu-medium   up     1-00:00:00  2      alloc  node[854-855]
    gpu-medium   up     1-00:00:00  1      idle   node851
    gpu-long     up     7-00:00:00  7      mix    node[852-853,856-860]
    gpu-long     up     7-00:00:00  2      alloc  node[854-855]
    mem          up     14-00:00:0  1      mix    node801
    amd-short    up     4:00:00     1      mix    node802

    [me@nodelogin02 ~]$ sinfo -s
    PARTITION    AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
    testing      up     1:00:00     0/2/0/2         nodelogin[01-02]
    cpu-short*   up     4:00:00     20/0/0/20       node[001-020]
    cpu-medium   up     1-00:00:00  19/0/0/19       node[002-020]
    cpu-long     up     7-00:00:00  18/0/0/18       node[003-020]
    gpu-short    up     4:00:00     9/1/0/10        node[851-860]
    gpu-medium   up     1-00:00:00  9/1/0/10        node[851-860]
    gpu-long     up     7-00:00:00  9/0/0/9         node[852-860]
    mem          up     14-00:00:0  1/0/0/1         node801
    amd-short    up     4:00:00     1/0/0/1         node802

    [me@nodelogin02 ~]$ sinfo -p gpu-long -N --Format=Nodelist:20,Available:10,Time:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10
    NODELIST  AVAIL  TIMELIMIT   STATE      MEMORY  FREE_MEM  CPU_LOAD  CPUS(A/I/O/T)  GRES_USED  AVAIL_FEATURES     REASON
    node852   up     7-00:00:00  mixed      380851  286301    13.42     22/2/0/24      gpu:3      Geforce.rtx2080Ti  none
    node853   up     7-00:00:00  mixed      380851  320623    1.04      22/2/0/24      gpu:3      Geforce.rtx2080Ti  none
    node854   up     7-00:00:00  allocated  380851  280041    39.07     24/0/0/24      gpu:4      Geforce.rtx2080Ti  none
    node855   up     7-00:00:00  allocated  380851  306605    41.95     24/0/0/24      gpu:4      Geforce.rtx2080Ti  none
    node856   up     7-00:00:00  mixed      380851  313819    1.20      4/20/0/24      gpu:4      Geforce.rtx2080Ti  none
    node857   up     7-00:00:00  mixed      380851  345208    1.22      4/20/0/24      gpu:4      Geforce.rtx2080Ti  none
    node858   up     7-00:00:00  mixed      380851  331898    4.26      4/20/0/24      gpu:4      Geforce.rtx2080Ti  none
    node859   up     7-00:00:00  mixed      380851  235598    4.21      14/10/0/24     gpu:4      Geforce.rtx2080Ti  none
    node860   up     7-00:00:00  mixed      380851  306182    2.76      18/6/0/24      gpu:4      Geforce.rtx2080Ti  none

SHARK

    [me@res-hpc-lo02 ~]$ sinfo
    PARTITION        AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    all*             up     infinite   13     mix    res-hpc-exe[001,014,024,036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
    all*             up     infinite   13     idle   res-hpc-exe[002-003,005,007-009,011-013,027,029-031]
    gpu              up     infinite   3      mix    res-hpc-gpu[01-02,09]
    gpu              up     infinite   2      idle   res-hpc-gpu[03-04]
    lumcdiagnostics  up     infinite   15     mix    res-hpc-exe[001,014,024,032-033,036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
    lumcdiagnostics  up     infinite   15     idle   res-hpc-exe[002-003,005,007-009,011-013,027,029-031,034-035]
    highmem          up     infinite   2      mix    res-hpc-mem[01-02]
    short            up     1:00:00    16     mix    res-hpc-exe[001,014,024,032-033,036,039-043],res-hpc-gpu[01-02,05-06,09]
    short            up     1:00:00    19     idle   res-hpc-exe[002-003,005,007-009,011-013,027,029-031,034-035],res-hpc-gpu[03-04],res-hpc-path[01-02]
    highmemgpu       up     infinite   1      idle   res-hpc-gpu07

    [me@res-hpc-lo02 ~]$ sinfo -s
    PARTITION        AVAIL  TIMELIMIT  NODES(A/I/O/T)  NODELIST
    all*             up     infinite   13/13/0/26      res-hpc-exe[001-003,005,007-009,011-014,024,027,029-031,036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
    gpu              up     infinite   3/2/0/5         res-hpc-gpu[01-04,09]
    lumcdiagnostics  up     infinite   15/15/0/30      res-hpc-exe[001-003,005,007-009,011-014,024,027,029-036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
    highmem          up     infinite   2/0/0/2         res-hpc-mem[01-02]
    short            up     1:00:00    16/19/0/35      res-hpc-exe[001-003,005,007-009,011-014,024,027,029-036,039-043],res-hpc-gpu[01-06,09],res-hpc-path[01-02]
    highmemgpu       up     infinite   0/1/0/1         res-hpc-gpu07

    [me@res-hpc-lo02 ~]$ sinfo -p gpu -N --Format=Nodelist:20,Available:10,Time:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:30,Features:20,Reason:10
    NODELIST       AVAIL  TIMELIMIT  STATE  MEMORY  FREE_MEM  CPU_LOAD  CPUS(A/I/O/T)  GRES_USED                     AVAIL_FEATURES  REASON
    res-hpc-gpu01  up     infinite   mixed  515000  110911    5.53      9/39/0/48      gpu:TitanXp:1(IDX:0)          Platinum8160    none
    res-hpc-gpu02  up     infinite   mixed  514000  198863    0.00      8/40/0/48      gpu:TitanXp:1(IDX:1)          Platinum8160    none
    res-hpc-gpu03  up     infinite   idle   48000   13443     0.00      0/6/0/6        gpu:GRIDV10032g:0(IDX:N/A)    Gold6252        none
    res-hpc-gpu04  up     infinite   idle   48000   23317     0.00      0/6/0/6        gpu:GRIDV10032g:0(IDX:N/A)    Gold6252        none
    res-hpc-gpu09  up     infinite   mixed  48000   17064     0.00      4/2/0/6        gpu:GRIDV10016g:1(IDX:0)      Gold6252        none

The node states have the following meaning:

| State | Description |
| --- | --- |
| idle | The node is not in use and is available for new jobs |
| mix(ed) | One or more jobs are running on the node, but there are still resources free on this node for more jobs |
| alloc(ated) | The whole node is allocated to one or more jobs; no additional jobs can run on this node |
| draining or drained | The node is being drained or has been drained. New jobs cannot be scheduled on it until the node is undrained |

The abbreviations (A/I/O/T) in NODES(A/I/O/T) or CPUS(A/I/O/T) mean (Allocated/Idle/Other/Total).
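To make the A/I/O/T notation concrete, here is a small sketch that splits such a field into its four counts. The function name split_aiot is a hypothetical helper chosen for illustration; the example value is taken from the CPUS(A/I/O/T) column for node852 in the ALICE output above.

```shell
#!/bin/bash
# Split an "A/I/O/T" field, e.g. from the CPUS(A/I/O/T) column of sinfo,
# into its four counts. split_aiot is a hypothetical helper name.
split_aiot() {
    # $1 is a string such as "22/2/0/24" (Allocated/Idle/Other/Total)
    IFS=/ read -r allocated idle other total <<EOF
$1
EOF
    echo "allocated=$allocated idle=$idle other=$other total=$total"
}

# node852: 22 CPUs allocated, 2 idle, 0 in another state, 24 in total
split_aiot "22/2/0/24"
```

A node in state mixed, like this one, still has idle CPUs that a small job could use.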

squeue

With squeue, you can get information about your running jobs and jobs from other users:

| Command | Description |
| --- | --- |
| squeue | Returns a list of jobs in the current queue for all partitions available to the user |
| squeue -a | Returns a list of jobs in the current queue for all partitions on the cluster |
| squeue -l | Same as squeue, but with additional information on the jobs in the queue |
| squeue --me | Lists only your own jobs in the queue |
| squeue --me --start | Lists your own pending jobs and their estimated starting time |

Jobs typically pass through several states in the course of their execution.
The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED. An explanation of some states follows:

| State | State (full) | Explanation |
| --- | --- | --- |
| CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
| CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
| CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
| F | FAILED | Job terminated with non-zero exit code or other failure condition. |
| PD | PENDING | Job is awaiting resource allocation. |
| R | RUNNING | Job currently has an allocation. |
| S | SUSPENDED | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. |
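As an illustration of working with these states, the following sketch counts your own jobs per state. The helper name count_states is an assumption made for this example; the squeue options used are -h (suppress the header) and -o "%T" (print the full state name). The squeue call is guarded so that the snippet also runs on a machine without Slurm.

```shell
#!/bin/bash
# Count jobs per state from a list of state names, one per line on stdin.
# count_states is a hypothetical helper name, not a Slurm command.
count_states() {
    sort | uniq -c | awk '{print $2 ": " $1}'
}

# On a cluster, feed it the state column of your own jobs.
# The guard skips this step where squeue is not installed.
if command -v squeue >/dev/null 2>&1; then
    squeue --me -h -o "%T" | count_states
fi
```

For example, three jobs in the states RUNNING, PENDING and RUNNING would be summarized as one PENDING and two RUNNING jobs.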

Job information and cluster configuration with scontrol

With the Slurm command scontrol you can get a more detailed overview of your running job, node hardware and partitions, e.g.,

    [user@res-hpc-lo02 ~]$ scontrol show job 260
    JobId=260 JobName=IMB
       UserId=user(225812) GroupId=Domain Users(513) MCS_label=N/A
       Priority=35603 Nice=0 Account=dnst-ict QOS=normal
       JobState=RUNNING Reason=None Dependency=(null)
       Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
       RunTime=00:00:13 TimeLimit=00:30:00 TimeMin=N/A
       SubmitTime=2020-01-23T10:27:45 EligibleTime=2020-01-23T10:27:45
       AccrueTime=2020-01-23T10:27:45
       StartTime=2020-01-23T10:27:45 EndTime=2020-01-23T10:57:45 Deadline=N/A
       SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-01-23T10:27:45
       Partition=all AllocNode:Sid=res-hpc-ma01:46428
       ReqNodeList=(null) ExcNodeList=(null)
       NodeList=res-hpc-exe[013-014]
       BatchHost=res-hpc-exe013
       NumNodes=2 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
       TRES=cpu=32,mem=64G,node=2,billing=32
       Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=*
       MinCPUsNode=16 MinMemoryCPU=2G MinTmpDiskNode=0
       Features=(null) DelayBoot=00:00:00
       OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
       Command=/home/user/Software/imb/mpi-benchmarks/imb.slurm
       WorkDir=/home/user/Software/imb/mpi-benchmarks
       StdErr=/home/user/Software/imb/mpi-benchmarks/job.%J.err
       StdIn=/dev/null
       StdOut=/home/user/Software/imb/mpi-benchmarks/job.%J.out
       Power=
       MailUser=user@gmail.com MailType=BEGIN,END,FAIL

    [user@res-hpc-lo02 ~]$ scontrol show node res-hpc-exe014
    NodeName=res-hpc-exe014 Arch=x86_64 CoresPerSocket=12
       CPUAlloc=16 CPUTot=24 CPULoad=0.00
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=(null)
       NodeAddr=res-hpc-exe014 NodeHostName=res-hpc-exe014 Version=20.02.0-0pre1
       OS=Linux 4.18.0-80.11.2.el8_0.x86_64 #1 SMP Tue Sep 24 11:32:19 UTC 2019
       RealMemory=386800 AllocMem=32768 FreeMem=380208 Sockets=2 Boards=1
       State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=all
       BootTime=2019-12-11T11:51:40 SlurmdStartTime=2020-01-14T15:36:20
       CfgTRES=cpu=24,mem=386800M,billing=24
       AllocTRES=cpu=16,mem=32G
       CapWatts=n/a
       CurrentWatts=0 AveWatts=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

    [user@res-hpc-lo02 ~]$ scontrol show partition all
    PartitionName=all
       AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
       AllocNodes=ALL Default=YES QoS=N/A
       DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
       MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
       Nodes=res-hpc-exe[013-014]
       PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
       OverTimeLimit=NONE PreemptMode=OFF
       State=UP TotalCPUs=40 TotalNodes=2 SelectTypeParameters=NONE
       JobDefaults=(null)
       DefMemPerCPU=2048 MaxMemPerNode=UNLIMITED

 

Job Resources

Determining what resources to request

Requesting the right amount of resources for jobs is one of the most essential aspects of using Slurm (or running any jobs on an HPC cluster).

Before you submit a job for batch processing, it is important to know what the requirements of your program are so that it can run properly. Each program and workflow has unique requirements so we advise that you determine what resources you need before you submit your job.

Keep in mind that increasing the amount of compute resources may also increase the amount of time that your job spends waiting in the queue. Within some limits, you may request whatever resources you need but bear in mind that other researchers need to be able to use those resources as well.

It is vital that you specify the resources you need as precisely as possible. This will help Slurm to better schedule your job and to allocate free resources to other users.

Finding the right settings often requires a bit of trial and error. It usually helps to run a few small test jobs and use their performance to estimate the resources for the production job. After each job, you can check how much of the requested resources were actually used. It is also worthwhile to check this after production jobs to see if you can or should make adjustments.

Below are some ways to specify the resources to ask for in your job script. These are options defined for the sbatch and salloc commands. There are additional options that you can find by checking the man pages for each command or the Slurm website.

Specifying resources and other settings for jobs

Slurm has its own syntax to request compute resources. In addition, Slurm has a number of settings for jobs that make it easier to organize jobs. Below is a summary table of some commonly requested resources and the Slurm syntax to get it. These options can be passed to sbatch or provided in a batch job script using the #SBATCH --option=<value> syntax. For a complete listing of request syntax, run the command man sbatch or check the Slurm website.

| Category | Option | Meaning |
| --- | --- | --- |
| Job Organization | --job-name=<job_name> | Setting the name of the job (will be displayed in squeue output) |
| | --mail-user=<email> | Setting where to send email alerts to |
| | --mail-type=<type> | Setting when to send email alerts (BEGIN, END, FAIL, REQUEUE, ALL) |
| | --output=<out_file> | Setting the name of the output file (default is slurm-<jobid>.out). It is highly recommended to set a name for the output file that can be readily associated with the job. An easy way to achieve this is to use the Slurm filename patterns "%x" for the job name and "%j" for the job id, i.e., --output=%x_%j.out |
| | --error=<error_file> | Setting a name for a separate file for error messages only |
| Nodes and CPUs | --nodes=<number> | Request this many nodes on the cluster (default is 1 unless other parameters allow splitting up the job over multiple nodes). By default, 1 core is used on each node. |
| | --ntasks=<number> | Request this many tasks on the cluster. A task is an instance of a running program. Use this, for example, with MPI. Defaults to 1 task per node. You can request multiple nodes, multiple tasks and multiple CPUs per task and/or per node. For jobs with --ntasks>1, Slurm will decide on how many nodes to use based on the available resources if --nodes or --ntasks-per-node are not set. |
| | --cpus-per-task=<number> | Request this many CPUs for each task set by --ntasks. |
| Running time | --time=<hh:mm:ss> | The walltime or running time of your job. If you do not define how long your job will run, a default value might be set for a given partition. The maximum walltime that is available also depends on the partition that you use. |
| Memory | --mem=<number[unit]> | Request this amount of memory per node for your job. Unit suffixes are K, M, G or T. If not specified, a default value will be set depending on the partition, which might not be suitable for your job. For parallelized jobs, it is better to use --mem-per-cpu. |
| | --mem-per-cpu=<number[unit]> | Minimum memory required per allocated CPU. Unit suffixes are K, M, G or T. |
| Partition | --partition=<partition_name> | Request specified partition/queue |
| GPU | --gres=gpu:[type of gpu]:[number of gpus] | Request a number of GPUs for each node, optionally specifying the type of GPU (default is 0 GPUs). The number and type of GPUs available depend on the nodes in the cluster and partitions. If you need a GPU, you always have to request one explicitly. If you do not specify a GPU, Slurm will not assign one to your job. |
| | --gpus=[number of gpus] | Request GPUs for the job. We recommend using --gres=gpu:[number of gpus] instead to fine-tune the GPU usage on a single node. For jobs with --gpus>1, Slurm will decide on how many nodes to use based on the available resources if --nodes or --gpus-per-node are not set. This may split up the GPUs across nodes, which makes communication of results between them more complicated with added overhead (and requires your program to work with, e.g., MPI). |
| | --mem-per-gpu=<number[unit]> | Minimum memory required per allocated GPU. Unit suffixes are K, M, G or T. |
| Other settings | --constraint=<attribute> | Request nodes which have the specific attributes (e.g. avx, IB) |

If you want to run a job that can make use of multiple CPUs, but which does not use MPI, then it is usually best to use #SBATCH --ntasks=1 and #SBATCH --cpus-per-task=c where “c” is the number of CPUs you want. In most cases, setting --ntasks>1 only makes sense when running MPI or creating job steps with srun for parallelization in the batch file.
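The recommendation above could be put into practice with a batch script along the following lines. This is only a sketch: the job name, resource values and the program name my_threaded_program are placeholders, and setting OMP_NUM_THREADS assumes that your program is parallelized with OpenMP. The fallback to 1 thread is only there so that the script also runs outside of Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=threads_example
#SBATCH --output=%x_%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1G
#SBATCH --time=00:10:00

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given.
# Fall back to 1 so the script also runs outside of Slurm.
threads="${SLURM_CPUS_PER_TASK:-1}"

# OMP_NUM_THREADS is read by OpenMP programs; other threaded programs
# may need their own option or variable instead.
export OMP_NUM_THREADS="$threads"

echo "Running with $OMP_NUM_THREADS thread(s)"
# ./my_threaded_program   # hypothetical program name, replace with your own
```

Deriving the thread count from SLURM_CPUS_PER_TASK means that the number of threads only has to be changed in one place, the #SBATCH line.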

Adjusting job settings after submission

While scontrol show is a powerful command for showing information about your job, scontrol update lets you change certain settings as long as your job is on hold or pending. First put your job on hold, then update the settings, and finally release your job. Here is a brief and generic example:

    scontrol hold <jobid>
    scontrol update job <jobid> NumNodes=2-2 NumTasks=2 Features=intel16
    scontrol release <jobid>

See the man page for the scontrol command or the Slurm website for more information.

Adjusting job priority after submission

You can adjust the priority of your job yourself to some extent. You cannot change the Priority value directly because it is not a fixed value, but you can set the “Nice” factor to influence it. The Nice factor will be subtracted from the Priority value that Slurm calculated.

For this, you do not have to hold your pending job, but you can adjust the value directly:

scontrol update job <jobid> Nice=<some_value>

Note that you can only give Nice a positive value which will decrease your priority. You cannot give it a negative value to increase your priority.

Batch Jobs

Batch processing is the recommended and common way to use a Slurm-controlled HPC cluster. It is a non-interactive workflow for running jobs which is reusable and reproducible if set up correctly. It makes use of a self-contained shell script, a so-called batch script, that is submitted to Slurm using the sbatch command. The batch script contains all necessary information for Slurm to set up and run your job without the need for you to take action.

General workflow for batch processing

The workflow for batch processing can be summarized like this:

  1. Prepare your job, e.g., develop/write/compile program/script/software, download data to the cluster

  2. Write your batch script

  3. Submit your batch script with sbatch batch_file_name

  4. Wait for Slurm to start your job

  5. Let Slurm run your job

    1. Optionally, you can monitor your job while it runs

  6. Come back after Slurm has finished running your job and retrieve results (and assess job performance)

  7. Go back to step 1 or 2 and run more jobs.

You can of course cancel your pending (step 4) or running job (step 5) at any point in time and go back to step 2 or even 1, in particular if you encountered any issues.

The batch script

The batch script contains everything Slurm needs to make sure your job runs properly and successfully, assuming that your program can run without errors. It is written as a shell script.

Below, you can find an example for a basic layout of a batch script. Specific examples can be found in the section https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37027894

    #!/bin/bash
    # The line above must always be the first line

    ###########################
    # Settings for Slurm
    # Not all of them are always necessary
    # Some examples:
    ###########################
    #SBATCH --job-name=<job_name>
    #SBATCH --output=%x_%j.out
    #SBATCH --mail-user=<your_e-mail_address>
    #SBATCH --mail-type=<set_mail_type>
    #SBATCH --partition=<partition>
    #SBATCH --time=<d-hh:mm:ss>
    #SBATCH --ntasks=<number_of_tasks>
    ...

    ###########################
    # Set up your software environment
    # e.g.,
    ###########################
    # For example, load modules, set environment variables
    ...

    ###########################
    # Copy your data to local scratch
    ###########################
    # Stage your data on the local scratch
    # if you want to avoid using network storage
    # for reading and writing files while your job runs

    ###########################
    # Execute tasks
    ###########################
    # All the commands that you need
    # for executing your program
    # including the call to your program itself

    ###########################
    # Move data products back to network storage
    ###########################
    # If you staged or wrote data to
    # the local scratch storage,
    # you probably want to keep some data products
    # after your job has finished and move those
    # to your scratch directory in network storage
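The stage-in, execute and stage-out steps of this layout might look as follows in a very small job. This is a sketch under stated assumptions: the location of the node-local scratch is cluster-specific, so TMPDIR (or /tmp) is used here as a stand-in; the input file is created instead of copied and the tr command stands in for a real program. All file and directory names are hypothetical.

```shell
#!/bin/bash
#SBATCH --job-name=staging_example
#SBATCH --output=%x_%j.out
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

# Assumption: TMPDIR (or /tmp) points to node-local storage.
# The fallbacks ("test", $PWD) let the script run outside of Slurm.
workdir="$(mktemp -d "${TMPDIR:-/tmp}/job_${SLURM_JOB_ID:-test}_XXXXXX")"
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"
resultdir="$submit_dir/results_${SLURM_JOB_ID:-test}"

# Stage in: normally you would copy your input data to local scratch.
# Here a stand-in input file is created instead.
echo "example data" > "$workdir/input.txt"

# Execute: stand-in for the real program, reading and writing on local scratch
tr 'a-z' 'A-Z' < "$workdir/input.txt" > "$workdir/output.txt"

# Stage out: copy the results back to network storage, then clean up
mkdir -p "$resultdir"
cp "$workdir/output.txt" "$resultdir/"
rm -rf "$workdir"
```

Including the job id in the names of the working and result directories keeps several jobs from overwriting each other's files.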

Job arrays

Job arrays allow you to submit multiple jobs with a single batch script. The jobs have to have the same #SBATCH settings to begin with, but subsequent commands in your batch script can make use of SLURM environment variables specific for array jobs that allow you to adapt commands to jobs in the job array. More information is available on the Slurm website: Slurm Workload Manager - Job Array Support (schedmd.com)

Job arrays are only available for batch jobs.

The id of array jobs consists of the general job id and the id of the job in the array separated by an underscore, i.e., <jobid>_<array_id_counter>

Batch commands for job arrays

In order to tell Slurm that your batch file should be run as a job array, add the sbatch setting --array to your batch script, for example:

    # For a job array with an array index from 1 to 20
    #SBATCH --array=1-20

    # For a job array with specific indices
    #SBATCH --array=1,3,5,7,9

    # For a job array with an index from 1 to 20, limited to 4 jobs running at the same time
    #SBATCH --array=1-20%4

When submitting an array job, squeue will show the job id as <primary_id>_<array_index>. However, internally Slurm also assigns each job in a job array its own unique job id.

As such, you have two options to easily specify your output file with a unique name:

    # Option 1 using %j
    # Here, %j is the unique job id that Slurm assigns to each job in the array
    #SBATCH --output=%x_%j.out

    # Option 2 using %A (primary id) and %a (array index)
    #SBATCH --output=%x_%A_%a.out

For seff, you can also specify the job id in both ways.

Limitations for job arrays

In order to limit users from submitting too many jobs that would occupy the cluster for too long, a number of limitations apply:


ALICE

The maximum number of jobs that can be submitted with a single job array is currently:

    [me@nodelogin02 ~]$ scontrol show config | grep MaxArraySize
    MaxArraySize = 1001

In addition, QOS settings limit the number of jobs that can run at any given time.


SHARK

The maximum number of jobs that can be submitted with a single job array is currently:

    [me@res-hpc-lo02 ~]$ scontrol show config | grep MaxArraySize
    MaxArraySize = 125

 


Environment variables for job arrays

These are some of the Slurm environment variables specific to job arrays:

| Environment variable | Comment |
| --- | --- |
| SLURM_ARRAY_JOB_ID | set to the first job ID of the array |
| SLURM_ARRAY_TASK_ID | set to the job array index value |
| SLURM_ARRAY_TASK_COUNT | set to the number of tasks in the job array |
| SLURM_ARRAY_TASK_MAX | set to the highest job array index value |
| SLURM_ARRAY_TASK_MIN | set to the lowest job array index value |
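A common use of SLURM_ARRAY_TASK_ID is to let each job in the array work on a different input. The following sketch selects one entry from a list based on the array index. The helper name nth_line and the file names are hypothetical, and the fallback index 1 is only there so that the script also runs outside of a job array.

```shell
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --output=%x_%A_%a.out
#SBATCH --array=1-3
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

# Print the N-th line of a list read from stdin, where N is given as $1.
# nth_line is a hypothetical helper name.
nth_line() {
    sed -n "${1}p"
}

# Fall back to index 1 so the script also runs outside of a job array
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Stand-in for a list of input files, one entry per job in the array
inputs="sample_a.dat
sample_b.dat
sample_c.dat"

input="$(echo "$inputs" | nth_line "$task_id")"
echo "Array task $task_id processes $input"
```

With --array=1-3, each of the three jobs picks a different line of the list, so a single batch script covers all inputs.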

Interactive Jobs

It is also possible to run interactive jobs on a Slurm-based HPC cluster like ALICE or SHARK. They can be useful for quick interactive tests and data visualization, but they are not recommended for production jobs.

Interactive jobs also require you to specify resources for your job similar to batch jobs and they also have to go through the queue. This means that it depends on the load of the cluster or partition when your interactive job will run and you might have to wait.

The Slurm command salloc allows you to request an interactive job. Here is an example:


ALICE

    [me@nodelogin02 ~]$ salloc --ntasks=1 -p cpu-short --mem=1G --time=00:05:00
    salloc: Granted job allocation 492489
    salloc: Waiting for resource configuration
    salloc: Nodes node001 are ready for job
    [me@nodelogin02 ~]$ ssh node001
    Last login: Tue Dec 14 16:32:57 2021
    [me@node001 ~]$ echo $HOSTNAME
    node001
    [me@node001 ~]$ exit
    logout
    Connection to node001 closed.
    [me@nodelogin02 ~]$ exit
    exit
    salloc: Relinquishing job allocation 492489

In the example above, we did not run a command, so we ended up in a bash environment on the login node from which we requested the allocation. We can then log in to the allocated node with ssh and run commands on it. With the first exit, we left the node; the second exit left the environment and released the allocation.


SHARK

    [me@res-hpc-lo02 ~]$ salloc -N1
    salloc: Granted job allocation 267
    salloc: Waiting for resource configuration
    salloc: Nodes res-hpc-exe013 are ready for job
    [me@res-hpc-exe013 ~]$ squeue
    JOBID  PARTITION  USER  ST  TIME  NODES  NODELIST(REASON)
    267    all        user  R   0:04  1      res-hpc-exe013
    [me@res-hpc-exe013 ~]$ exit
    exit
    salloc: Relinquishing job allocation 267
    [me@res-hpc-lo02 ~]$

In the example above, we did not run a command, so we ended up in a bash environment on the allocated node. With exit, we left the environment and released the node.


If you need X11 forwarding, you can enable it for your interactive session by adding the option --x11, for example:

salloc --ntasks=1 --mem=1G --time=00:05:00 --x11

Job Monitoring

Slurm provides a number of ways for you to monitor the state of your job.

One option is to use squeue to get the overall state.

If you use entire nodes, you can query sinfo to get basic load information.

If your program writes out sufficient information to the slurm output file (or a separate log file), you can check its content on a regular basis.

Another option is to log in to the node on which your job is running. After a job has started to run, Slurm grants you permission to log in to the compute node via ssh until the job terminates. You can use this to look at the utilization. Note that anything you do on the compute node will count against the resources that you requested, so make sure not to run any resource-intensive tasks.

Job Performance

A quick and easy way to get information about the performance of your job is to use the command seff followed by the id of your job, i.e.

seff <jobid>

If you told Slurm to notify you after your job has finished, the e-mail sent by Slurm will most likely also contain information about the resources used.
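To illustrate the kind of number that such a summary contains, the following sketch computes a CPU efficiency as the CPU time used divided by the elapsed time multiplied by the number of allocated CPUs. The function names to_seconds and cpu_efficiency are hypothetical helpers, and the sketch assumes time strings in the form hh:mm:ss or d-hh:mm:ss. It is not how seff is implemented, only a way to reproduce the idea by hand.

```shell
#!/bin/bash
# Convert a time string of the form [d-]hh:mm:ss to seconds.
# to_seconds and cpu_efficiency are hypothetical helper names.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1*86400 + $2*3600 + $3*60 + $4;
        else         print $1*3600 + $2*60 + $3;
    }'
}

# CPU efficiency in percent: used CPU time / (elapsed time * number of CPUs)
cpu_efficiency() {
    # $1: total CPU time used, $2: elapsed wall time, $3: allocated CPUs
    used="$(to_seconds "$1")"
    wall="$(to_seconds "$2")"
    awk -v u="$used" -v w="$wall" -v n="$3" \
        'BEGIN { printf "%.1f\n", 100 * u / (w * n) }'
}

# Example: 4 CPUs allocated for 1 hour, of which 2 CPU-hours were used
cpu_efficiency 02:00:00 01:00:00 4
```

An efficiency well below 100% for a multi-CPU job is a hint that fewer CPUs could have been requested.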

More detailed performance statistics about your job or the program that you ran cannot be provided through Slurm-based tools. In this case, you need to deploy your own method to gather metrics that allow you to assess the performance.

Cancelling jobs

With the command scancel, you can cancel any of your jobs (running or pending) or even all of your jobs. Here are some example commands:

| Command | Meaning |
| --- | --- |
| scancel <jobid> | Cancels the job with the given job id |
| scancel <jobid1> <jobid2> | Cancels the jobs with the given space-separated list of job ids |
| scancel <jobid>_[<min_arrayid>-<max_arrayid>] | Cancels the jobs of a job array matching the specified job id and range of array ids |
| scancel --user=<username> | Cancels all jobs of the given user |
| scancel --state=PENDING --user=<username> --partition=<partition> | Cancels all pending jobs of the given user in the given partition |

See the man page for the scancel command or the Slurm website for more information.