Running jobs on ALICE or SHARK - Workload manager Slurm

This section provides information on the workload manager Slurm used by ALICE and SHARK.

Slurm (Simple Linux Utility for Resource Management) is an open-source job scheduler that allocates compute resources on clusters for jobs. Slurm has been deployed at various national and international computing centres, and by approximately 60% of the TOP500 supercomputers in the world.

The following pages will give you a basic overview of Slurm on ALICE. You can learn much more about Slurm and its commands from the official Slurm website.

This section is intended as an overview of the fundamental concepts of using Slurm. In the https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37027894, we provide a more practical introduction on how to use Slurm, including examples for batch job scripts (e.g., https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37027911). However, we recommend novice users to read through both chapters.

Slurm basics

Common user commands

The following is a list of common user commands that will be discussed in more detail in this section and the tutorials.

Command

Definition

Command

Definition

sbatch

Submit a batch job script for execution (queued)

scancel

Delete a job

scontrol

Job status (detailed), several options only available to root

sinfo

Display state of partitions and nodes

squeue

Display state of all (queued) jobs

salloc

Submit a job for execution or initiate job in real-time (interactive job)

srun

Run parallel jobs

seff

Get a summary of how efficiently requested resources were used for a given job

If you want to get a full overview, have a look at the Slurm documentation or enter man <command> or check the documentation on the Slurm website.

There are different ways of how to submit jobs to Slurm. We always recommend to primarily use batch scripts submitted with the sbatch command instead of interactive jobs requested with salloc

Accounting info

The following list a number of commands to get accounting and job statistics.

Command

Info

Command

Info

sacct

Displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database

sstat

Display various status information of a running job/step

sacctmgr

Used to view and modify Slurm account information

sreport

Generate reports from the slurm accounting data

Scheduling info

The following commands can provide useful information to understand the scheduling of your job

Command

Info

Command

Info

sprio

View the factors that comprise a job's scheduling priority

sshare

Tool for listing the shares of associations to a cluster

Environment variables

Any environment variables that you have set with the sbatch command will be passed to your job. For this reason, if your program needs certain environment variables set to function properly, it is best to put them in your job script. This also makes it easier to reproduce your job results later, if necessary.

In addition to setting environment variables yourself, Slurm provides some environment variables of its own that you can use in your job scripts. Information on some of the common Slurm environment variables is listed in the chart below. For additional information, see the man page for sbatch.

Environmental Variable

Definition

Environmental Variable

Definition

$SLURM_JOB_ID

ID of job allocation

$SLURM_SUBMIT_DIR

Directory job where was submitted

$SLURM_JOB_NODELIST

File containing allocated hostnames

$SLURM_NTASKS

Total number of cores for job

$SLURM_CPUS_ON_NODE

Processors available to the job on this node

$SLURM_LAUNCH_NODE_IPADDR

IP address of node where job launched

$SLURM_NNODES

Total number of nodes use by the job

$SLURM_NODEID

Relative node ID of current node

$SLURM_NTASKS

Total number of processes in current job

$SLURM_PROCID

MPI rank (or relative process ID) of the current process

$SLURM_TASK_PID

Process ID of task started

$SLURM_TASKS_PER_NODE

Number of tasks to be run on each node

$CUDA_VISIBLE_DEVICES

List of GPUs are available for use

Environment variables override any options set in a batch script. Command-line options override any previously set environment variables.

Job settings and configuration information

With the slurm command scontrol show you can get a more detailed overview of your running job, node hardware and partitions:

Command

Info

Command

Info

scontrol show <job_id>

Get settings for a specific job

scontrol show <partition>

See the configuration of a given partition

scontrol show <node>

See the configuration of a given node

 

Slurm partitions

Slurm makes nodes on a cluster available as partitions, i.e., one partition gives access to a certain number of nodes and specific resource limits. Jobs are always submitted to either a default partition or a user-specified partition.


ALICE

You can find an overview of all currently available partitions on ALICE here: https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37519504


SHARK

You can find an overview of all currently available partitions on SHARK here:


sinfo

sinfo provides information about the current state of partitions on a cluster. The following commands are useful:

Command

Description

Command

Description

sinfo

Shows the partitions available to the users and basic state information

sinfo -a

Shows all partitions on the cluster including those not available to the user

sinfo -s

Prints a summary which includes the number of nodes in the various states state

sinfo -N -l

Provides a node-centric view of the partitions

sinfo -N --Format=Nodelist:20,Available:10,Time:15,Partition:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10

Provides a quick overview of the available resources for all nodes on the cluster.

sinfo -p <parition_name> -N --Format=Nodelist:20,Available:10,Time:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10

Provides a quick overview of the available resources for all nodes in the partition <partition_name>

sinfo --Format=Nodelist:20,Available:10,Time:15,StateLong:15,Memory:15,FreeMem:15,CPUsLoad:15,CPUsState:15,GresUsed:15,Features:20,Reason:10,Partition

Provides a quick overview of the available resources per partition. Adding -p <parition_name> will show the same information for the given partition only.

Here, we show the output for some the above commands for both ALICE and SHARK. Note that the output will most likely look different when you run it because of the current load of the system and possible changes made to the partition system. We encourage you to try them yourself.


ALICE


SHARK


The node states have the following meaning:

State

Description

State

Description

idle

the node is not used, but available for new jobs to run

mix(ed)

there is 1 or more jobs running on the node, but there are still resources free on this node for more jobs

alloc(ated)

the whole node is allocated by 1 or more jobs, no additional jobs can run on this node

draining or drained

the node is being drained or drained. New jobs cannot be scheduled until the node is undrained

The abbreviations (A/I/O/T) in NODES(A/I/O/T) or CPUS(A/I/O/T) mean (Allocated/Idle/Other/Total)

squeue

With squeue, you can get information about your running jobs and jobs from other users:

Command

Description

Command

Description

squeue

Returns a list of jobs in the current queue for all partitions available to the user

squeue -a

Returns a list of jobs in the current queue for all partitions on the cluster

squeue -l

Same as squeue, but with additional information on the jobs in the queue

squeue --me

Lists only a user’s jobs in the queue

squeue --me --start

Lists a user’s pending jobs and their estimated starting time

Jobs typically pass through several states in the course of their execution.
The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED. An explanation of some state follows:

State

State (full)

Explanation

State

State (full)

Explanation

CA

CANCELLED

Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.

CD

COMPLETED

Job has terminated all processes on all nodes with an exit code of zero.

CG

COMPLETING

Job is in the process of completing. Some processes on some nodes may still be active.

F

FAILED

Job terminated with non-zero exit code or other failure condition.

PD

PENDING

Job is awaiting resource allocation.

R

RUNNING

Job currently has an allocation.

S

SUSPENDED

Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.

Job information and cluster configuration with scontrol

With the Slurm command scontrol you can get a more detailed overview of your running job, node hardware and partitions, e.g.,

 

Job Resources

Determining what resources to request

Requesting the right amount of resources for jobs is one the most essential aspects of using Slurm (or running any jobs on an HPC).

Before you submit a job for batch processing, it is important to know what the requirements of your program are so that it can run properly. Each program and workflow has unique requirements so we advise that you determine what resources you need before you submit your job.

Keep in mind that increasing the amount of compute resources may also increase the amount of time that your job spends waiting in the queue. Within some limits, you may request whatever resources you need but bear in mind that other researchers need to be able to use those resources as well.

It is vital that you specify the resources you need as detailed as possible. This will help Slurm to better schedule your job and to allocate free resources to other users.

Finding the right settings often requires a bit of trial and error. It usually helps to run a few small test jobs and use their performance to estimate the resources for the production job. After each job, you can check how much of the requested resources was actually use. It is also worthwhile to check this after production jobs to see if you can or should make adjustments.

Below are some ways to specify the resources to ask for in your job script. These are options defined for the sbatch and salloc commands. There are additional options that you can find by checking the man pages for each command or the Slurm website.

Specifying resources and other settings for jobs

Slurm has its own syntax to request compute resources. In addition, Slurm has a number of settings for jobs that make it easier to organize jobs. Below is a summary table of some commonly requested resources and the Slurm syntax to get it. These options can be passed to sbatch or provided in a batch job script using the #SBATCH --option=<value> syntax. For a complete listing of request syntax, run the command man sbatch or check the Slurm website.

Category

Option

Meaning

Category

Option

Meaning

Job Organization

--job-name=<job_name>

Setting the name of the job (will be displayed in squeue output)

 

--mail-user=<email>

Setting where to send email alerts to

 

--mail-type="<BEGIN|END|FAIL|REQUEUE|ALL>"

Setting when to send email alerts

 

 --output=<out_file>

Setting the name of the output file (Default is slurm-<jobid>.out).

It is highly recommendable to set a name for the output file that can be readily associated with the job. An easy way to achieve this by using Slurm name patterns “%x” for the job name and “%j” for the job id, i.e., --output=%x_%j.out

 

--error=<error_file>

Setting a name for a separate file for error messages only

Nodes and CPUs

--nodes=<number>

Request this many nodes on the cluster (default is 1 unless other parameters allow splitting up the job over multiple nodes). Uses by default 1 core on each node.

 

--ntasks=<number>

Request this many tasks on the cluster. A task is an instance of a running program. Use this for example with MPI. Defaults to 1 task per node. You can request, multiple nodes, multiple tasks and multiple CPUs per task and/or per nodes.

 

--cpus-per-task=<number>

Request this many CPUs per task set by --ntasks.

Running time

--time=<hh:mm:ss>

The walltime or running time of your job. If you do not define how long your job will run, a default value might be set for a given partition. The maximum walltime that is available also depends on the partition that you use.

Memory

--mem=<number[unit]>

Request this amount of memory for your job (single node). Suffix for unit are [K|M|G|T]. If not specified, a default value will be set depending on the partition which might not be suitable for your job. For parallelized jobs, it is better to use --mem-per-cpu

 

--mem-per-cpu=<number[unit]>

Minimum memory required per allocated CPU. Suffix for unit are [K|M|G|T].

Partition

--partition=<partition_name>

Request specified partition/queue

GPU

--gres=gpu:[type of gpu]:[number of gpus]

Request a number of GPUs for each node, optionally specifying the type of GPU (default is 0 GPUs). The amount and type of GPUs available depend on the nodes in the cluster and partitions.

 

--gpus=[number of gpus]

Request GPUs for the job. We recommend using --gres=gpu:[number of gpus] instead to fine-tune the GPU usage on a single node.

 

--mem-per-gpu=<number[unit]>

Minimum memory required per allocated GPU. Suffix for unit are [K|M|G|T].

Other settings

--constraint=<attribute>

Request nodes which have the specific attributes (e.g. avx, IB)

Adjusting job settings after resubmission

While scontrol show a powerful command is to show info about your job, with scontrol update, you can change certain settings as long as your job is on hold or pending. First put your job on hold, update the settings and then release your job. Here is brief and generic example:

See the man page for the scontrol command or the Slurm website for more information.

Adjusting job priority after submission

You can adjust the priority of your job manually yourself to some extent. You cannot actually change the Priority value because it is not a fixed value, but you can manipulate the “Nice” factor to change the Priority value. The Nice factor will be subtracted from the Priority value that Slurm calculated.

For this, you do not have to hold your pending job, but you can adjust the value directly:

Batch Jobs

Batch processing is the recommended and common way to use a Slurm-controlled HPC cluster. It is a non-interactive workflow for running jobs which is reusable and reproducible if set up correctly. It makes use of a self-contained shell script, a so-called batch script, that is submitted to slurm using the sbatch command. The batch script contains all necessary information for Slurm to setup and run your job without the need for you to take action.

General workflow for batch processing

The workflow for batch processing can be summarized like this:

  1. Prepare your job e.g, develop/write/compile program/script/software, download data to the cluster

  2. Write your batch script

  3. Submit your batch script with sbatch batch_file_name

  4. Wait for Slurm to start your job

  5. Let Slurm run your job

    1. Optionally, you can monitor your job while it runs

  6. Come back after Slurm has finished running your job and retrieve results (and assess job performance)

  7. Go back to step 1 or 2 and run more jobs.

You can of course cancel your pending (step 4) or running job (stage 5) at any point in time and go back to stage 2 or even 1, in particular if you encountered any issues.

The batch script

The batch script contains everything Slurm needs to make sure your job runs properly and successfully assuming that your program can run without errors. It is written as

Below, you can find an example for a basic layout of a batch script. Specific examples can be found in the section

Job arrays

Job arrays allow you to submit multiple jobs with a single batch script. The jobs have to have the same #SBATCH settings to begin with, but subsequent commands in your batch script can make use of SLURM environment variables specific for array jobs that allow you to adapt commands to jobs in the job array. More information is available on the Slurm website: Slurm Workload Manager - Job Array Support (schedmd.com)

The id of array jobs consists of the general job id and the id of the job in the array separate by an underscore, i.e., <jobid>_<array_id_counter>

Batch commands for job arrays

In order to tell Slurm that your batch file should be run as a job array, add the sbatch setting --array your batch script, for example

When submitting an array job, squeue will show the job id as <primary_id>_<array_index>. However, internally Slurm will also log the job id of a job in a job array as <primary_id> + <array_index>

As such you have two options to easily specify your output file with a unique name.

For seff, you can also specify the job id in both ways.

Limitations for job arrays

In order to limit users from submitting too many jobs that would occupy the cluster for too long, a number of limitations apply:


ALICE

The total number of jobs that can be submitted with a job array currently are:

In addition, QOS settings limit the number of jobs that can run at any given time.


SHARK

The total number of jobs that can be submitted with a job array currently are:

 


Environment variables for job arrays

These are some of the Slurm environment variables specific to job arrays:

Environment variable

Comment

Environment variable

Comment

SLURM_ARRAY_JOB_ID

set to the first job ID of the array

SLURM_ARRAY_TASK_ID

set to the job array index value

SLURM_ARRAY_TASK_COUNT

set to the number of tasks in the job array

SLURM_ARRAY_TASK_MAX

set to the highest job array index value

SLURM_ARRAY_TASK_MIN

set to the lowest job array index value

Interactive Jobs

It is also possible to run interactive jobs on a Slurm-based HPC cluster like ALICE or SHARK. They can be useful for quick interactive tests and data visualization, but they are not recommended for production jobs.

Interactive jobs also require you to specify resources for your job similar to batch jobs and they also have to go through the queue. This means that it depends on the load of the cluster or partition when your interactive job will run and you might have to wait.

The Slurm command salloc allows you to request an interactive job. Here is an example:


ALICE

In the example above, we did not run a command so we ended up in the bash environment on the node from which we requested the node. We can then log in to the node with ssh and run commands on it. With exit we left the node and another exit left the environment and released the node.


SHARK

In the example above, we did not run a command so we ended up in the bash environment. With exit we left the environment and we released the node.


If you need X11 forwarding, you can enable it for your interactive session by adding the option --x11, for example:

Job Monitoring

Slurm provides a number of ways for you to monitor the state of your job.

One option is to use squeue to get the overall state.

If you use entire nodes, you can query sinfo to get basic load information.

If your program writes out sufficient information to the slurm output file (or a separate log file), you can check its content on a regular basis.

Another option is to log in to the node on which your job is running. After a job has started to run, Slurm grants you permission to login to the compute node via ssh until the job terminates. You can use this to look at the utilization. Note that anything you do on the compute node, will count against the resources that you requested. So, you should make sure not to run any resource-intensive tasks.

Job Performance

A quick and easy way to get information about the performance of your job is to use the command seff followed by the id of your job, i.e.

If you told Slurm to notify you after your job as finished, the e-mail send from slurm will most likely also contain information about used resources.

More detailed performance statistics about your job or the program that you ran cannot be provided through Slurm-based tools. In this case, you need to deploy your own method to gather metrics that allow you to assess the performance.

Cancelling jobs

With the command scancel, you can cancel any of your jobs (running or pending) or even all of your jobs. Here are some example commands:

Command

Meaning

Command

Meaning

scancel <jobid>

Cancels the job with the given job id

scancel <jobid1> <jobid2>

Cancels jobs with the given space-separated list of job ids

scancel <jobid_[min_arrayid-max_arrayid]>

Cancels jobs of a job array matching the specified and job and array id.

scancel --user=<username>

Cancel all jobs of the given user

scancel --state=PENDING --user=<username> --partition=<partition>

Cancel all pending jobs of the given user in the given partition

See the man page for the scontrol command or the Slurm website for more information