
About this tutorial

This tutorial will guide you through setting up and submitting a very basic job using only bash commands without any modules. The focus of this tutorial is on the workflow with Slurm.

What will you learn?

  • Writing a batch file for your job

  • Submitting your job

  • Monitoring your job

  • Collecting information about your job

What will this example not cover?

  • Loading and using modules for your job

  • Compiling code

What should you know before starting?

While you can go through this tutorial without prior knowledge of Slurm, it is recommended that you first read the section Running jobs on ALICE or SHARK - Workload manager Slurm.


Preparations

Log in to ALICE or SHARK if you have not done so yet.

Before you set up your job or submit it, it is always best to have a look at the current job load on the cluster and what partitions are available to you. You can do this with the Slurm command sinfo. The output might look something like this:


ALICE

Code Block
 [me@nodelogin02]$ sinfo
 PARTITION      AVAIL   TIMELIMIT  NODES  STATE NODELIST
 testing           up     1:00:00      2   idle nodelogin[01-02]
 cpu-short*        up     3:00:00     11    mix node[002-007,013-014,018-020]
 cpu-short*        up     3:00:00      1  alloc node001
 cpu-short*        up     3:00:00      8   idle node[008-012,015-017]
 cpu-medium        up  1-00:00:00     11    mix node[002-007,013-014,018-020]
 cpu-medium        up  1-00:00:00      8   idle node[008-012,015-017]
 cpu-long          up  7-00:00:00     10    mix node[003-007,013-014,018-020]
 cpu-long          up  7-00:00:00      8   idle node[008-012,015-017]
 gpu-short         up     3:00:00     10    mix node[851-860]
 gpu-medium        up  1-00:00:00     10    mix node[851-860]
 gpu-long          up  7-00:00:00      9    mix node[852-860]
 mem               up 14-00:00:00      1   idle node801
 mem_mi            up  4-00:00:00      1   idle node802
 amd-short         up     4:00:00      1   idle node802

SHARK

Code Block
[me@res-hpc-lo02 ~]$ sinfo
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*               up   infinite     13    mix res-hpc-exe[001,014,024,036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
all*               up   infinite     13   idle res-hpc-exe[002-003,005,007-009,011-013,027,029-031]
gpu                up   infinite      3    mix res-hpc-gpu[01-02,09]
gpu                up   infinite      2   idle res-hpc-gpu[03-04]
lumcdiagnostics    up   infinite     15    mix res-hpc-exe[001,014,024,032-033,036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
lumcdiagnostics    up   infinite     15   idle res-hpc-exe[002-003,005,007-009,011-013,027,029-031,034-035]
highmem            up   infinite      2    mix res-hpc-mem[01-02]
short              up    1:00:00     16    mix res-hpc-exe[001,014,024,032-033,036,039-043],res-hpc-gpu[01-02,05-06,09]
short              up    1:00:00     19   idle res-hpc-exe[002-003,005,007-009,011-013,027,029-031,034-035],res-hpc-gpu[03-04],res-hpc-path[01-02]
highmemgpu         up   infinite      1   idle res-hpc-gpu07

You can see that some nodes are idle, i.e., they are not running any jobs; some nodes are allocated, i.e., they are running one or more jobs that require all of their resources; and some nodes are in a mix state, which means that they are running jobs but still have free resources left.
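
If you only want to see which nodes of a particular partition are currently free, you can narrow down the sinfo output. Below is a minimal sketch; the partition name cpu-short is just an example taken from the ALICE output above, so substitute a partition from your own sinfo output:

Code Block
 sinfo --partition=cpu-short --states=idle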

Let's also create a directory for our job on the cluster and change into it.

Code Block
 mkdir -p $HOME/user_guide_tutorials/first_bash_job
 cd $HOME/user_guide_tutorials/first_bash_job

Creating the batch file

A Slurm batch file generally consists of the following three elements:

  1. Interpreter

  2. Slurm settings

  3. Job commands

We will first go through each element separately and then combine them into one batch script. While it is not included here, the “Job commands” element can, for example, also contain commands to stage data on local scratch and to move data back from local scratch.

Interpreter

Defining the interpreter for your shell commands is usually done in the first line of a batch script. Here, we will use bash:

Code Block
#!/bin/bash

It is recommended to set this to the same shell that you use for logging in.
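
If you are not sure which shell you are logging in with, you can check it on the login node. A quick check, assuming a standard Linux environment:

Code Block
 echo $SHELL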

Slurm settings

As you probably have seen in Running jobs on ALICE or SHARK - Workload manager Slurm, there are basically two types of Slurm settings that go into your batch file:

  1. Settings for job organisation

  2. Settings for job resources/execution

Settings for job organization

Let us start with the first type. It is never too early to get used to organising your jobs. This will help you in the long run to keep an overview of all your jobs and their products. It will also make it much easier to repeat jobs with the same settings if necessary. It might not look important when you only write a simple test script like this, but it will be once you run all kinds of different jobs.

This is how these Slurm settings could look for this example:

Code Block
#SBATCH --job-name=test_helloworld
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your-email-address>"
#SBATCH --mail-type="ALL"

You can consider these settings the minimum of what you should put in your batch script to organize your jobs, so let us go through them one by one.

  • Line 1: this sets the name of the job to test_helloworld. Defining the job name will make it easier for you later to find the information about the job status.

  • Line 2: here, we have defined the file to which Slurm will write the standard output, including error messages. You probably have noticed that the file name looks somewhat unusual. This is because we have used replacement symbols that are available for batch files. %x is the symbol for the job name which we defined first. %j corresponds to the job ID, which Slurm assigns to the job once we submit it. Of course, you are free to name the output file however you want. However, we strongly advise you to always add %j to the file name in order to prevent Slurm from writing the output of different jobs to the same file (see the sketch after this list for an optional variation with a separate error file).

  • Lines 3-4: these settings tell Slurm to send us notifications about our job to the e-mail address set in --mail-user. Because of --mail-type="ALL", Slurm will inform us about all events related to our job.
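
As a side note, Slurm also provides an --error option that accepts the same replacement symbols, in case you prefer to keep error messages in a separate file. A small sketch (the file names are only examples):

Code Block
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err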

While the settings covering the e-mail notification will probably not change very much for your different jobs, you will most likely adjust the first two settings for your various jobs.

Settings for job resources/execution

There is a range of different settings that affect how your job is scheduled and executed by Slurm. Therefore, they might change significantly from job to job.

The job that we will run in this example does not require a lot of resources. Therefore, the following settings are sufficient:

Code Block
#SBATCH --partition="<partition>"
#SBATCH --time=00:00:15
#SBATCH --ntasks=1
#SBATCH --mem=10M

Let us go through them:

  • Line 1: here, we set the partition that we want to use. You should replace “<partition>” with the name of the partition. Since this will be a very simple test, we do not require a lot of processing time. Therefore, you could use the “cpu-short” or “testing” partition if you are on ALICE or the “short” partition if you are on SHARK (a quick way to check the available partitions and their time limits is sketched after this list).

  • Line 2: this setting tells Slurm that we will need a maximum compute time of 15 s for this job. The job will not take that long, but we want to include a small time buffer. If our job exceeds this time limit, it will be cancelled.

  • Line 3: this tells Slurm the number of tasks (and with that, by default, the number of CPU cores) that we need. We will only require one core for this job.

  • Line 4: here, we let Slurm know that we need about 10 MB of memory. Setting a memory limit is important to make sure that no default value is applied and that Slurm knows how much memory to reserve.
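
If you are unsure which partitions you can use and what their time limits are, you can ask Slurm directly. A minimal sketch using sinfo's format option (%P prints the partition name, %l its time limit):

Code Block
 sinfo --format="%P %l"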

Job commands

Now that we have the Slurm settings in place, we can define the environment variables and commands that will be executed. All we want to do here is execute a set of bash commands and, of course, print out "Hello World". We will also make use of some Slurm-specific environment variables, so that we get used to them. We will not use or move any data.

Code Block
echo "#### Starting Test"
echo "This is $SLURM_JOB_USER and my first job has the ID $SLURM_JOB_ID"
# get the current working directory
CWD=$(pwd)
echo "This job was submitted from $SLURM_SUBMIT_DIR and I am currently in $CWD"
# get the current time and date
DATE=$(date)
echo "It is now $DATE"
echo "Hello World from $HOSTNAME"
echo "#### Finished Test. Have a nice day"

Let us go through some of them:

  • We use echo to print out a bunch of messages.

  • Line 2: Here, we make use of two important environment variables that Slurm provides automatically. $SLURM_JOB_USER contains our Slurm user name and $SLURM_JOB_ID stores the ID of our job (a quick way to list all Slurm-provided variables is sketched after this list).

  • Line 4: This uses pwd to get the current working directory and assigns it to a new variable.

  • Line 5: Another Slurm environment variable is used here to get the directory from where we submitted the job.

  • Lines 7-8: The first line gets the current date and time and the second one prints it out.

  • Line 9: This line finally returns the name of the host using the system environment variable $HOSTNAME.
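
Slurm sets quite a few more environment variables than the ones used above. If you are curious which ones are available inside your job, you can list them from within the batch script. A small, optional sketch that you could append to the job commands:

Code Block
# list all environment variables that Slurm provides to the job (optional)
env | grep "^SLURM_"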

The complete batch script

We have finished assembling the batch script for your first job. This is how it looks when put together:

Code Block
#!/bin/bash
#SBATCH --job-name=test_helloworld
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="your-email-address"
#SBATCH --mail-type="ALL"
#SBATCH --partition="cpu-short"
#SBATCH --time=00:00:15
#SBATCH --ntasks=1
#SBATCH --mem=10M

echo "#### Starting Test"
echo "This is $SLURM_JOB_USER and my first job has the ID $SLURM_JOB_ID"
# get the current working directory
CWD=$(pwd)
echo "This job was submitted from $SLURM_SUBMIT_DIR and I am currently in $CWD"
# get the current time and date
DATE=$(date)
echo "It is now $DATE"
echo "Hello World from $HOSTNAME"
echo "#### Finished Test. Have a nice day"

Remember to replace "your-email-address" with your real e-mail address.

Save the batch file on the cluster either using a command-line editor (such as emacs, vim, or nano) or an editor with a graphical user interface (e.g., gedit), or save the file on your local workstation. In this tutorial, we will call the file test_bash.slurm.

Info

For Windows users:

If you write your batch script on Windows and want to copy it to the cluster later, make sure you use a suitable editor. Windows and Linux use different characters for encoding line endings. Some editors for Windows (such as “Notepad++” or “Visual Studio Code”) can save the file with the correct line endings. However, the standard Windows editor Notepad cannot.

Running your job

Since this is a fairly simple job, it is okay to run it from a directory in your $HOME. Depending on the type of job that you want to run later on, this might have to change.

If you have written the file on your local workstation, copy it to the directory user_guide_tutorials/first_bash_job in your home directory on the cluster.
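
How you copy the file depends on your local operating system. On Linux or macOS, a sketch with scp could look like this; <username> and <login-node> are placeholders for your cluster account and the address of the login node you normally use:

Code Block
scp test_bash.slurm <username>@<login-node>:~/user_guide_tutorials/first_bash_job/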

Info

For Windows users:

If you copied the batch file from Windows and you are not sure if it is correctly encoded, you can run a simple check:

cat -v test_bash.slurm

This will print out the content of the batch file on the command line and make the line endings visible. If you see the characters ^M at the end of each line, you need to fix the line endings before you can submit your job. There are different ways to do this. The best way is to use an editor on Windows that writes the correct line endings.

If you need to fix the file on the cluster after you have copied it, you can run, for example,

sed -i 's/\r$//g' test_bash.slurm

or

dos2unix test_bash.slurm

Run cat -v test_bash.slurm again to check that the line endings no longer show ^M.

You are ready to submit your job like this:

Code Block
sbatch test_bash.slurm

Immediately after you have submitted it, you should see something like this:

Code Block
[me@nodelogin02 first_bash_job]$ sbatch test_bash.slurm
Submitted batch job <job_id>

where <job_id> is the ID that Slurm has assigned to your job.
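
If you want to capture just the numeric job ID, for example to reuse it in a follow-up command, sbatch offers the --parsable option. A small sketch:

Code Block
jobid=$(sbatch --parsable test_bash.slurm)
echo "Submitted job $jobid"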

Monitoring your job

As mentioned in https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37749335/Running+jobs+on+ALICE+or+SHARK+-+Workload+manager+Slurm#Job-Monitoring, there are various ways to monitor your job.

Probably one of the first things that you want to know is when your job is likely to start:

Code Block
 squeue --start -u <username>

If you try this right after your submission, you might not see a start date yet, because it usually takes Slurm a few seconds to estimate the starting date of your job.

Eventually, you should see something like this:

Code Block
 JOBID         PARTITION         NAME     USER ST             START_TIME  NODES SCHEDNODES           NODELIST(REASON)
 <job_id>  <partition_name> <job_name>  <username> PD 2020-09-17T10:45:30      1 (null)               (Resources)

Depending on how busy the system is, your job might not run right away. Instead, it will be pending in the queue until resources are available for the job to run. The NODELIST(REASON) column gives you an idea of why your job needs to wait, but we will not go into detail on this here. It might also be useful to simply check the entire queue with squeue.
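
If you want to keep an eye on the queue without re-typing the command, you can combine squeue with the standard Linux tool watch, assuming it is available on the login node. A small sketch (replace <username> with your cluster account):

Code Block
# refresh the view of your own jobs every 10 seconds; press Ctrl+C to stop
watch -n 10 squeue -u <username>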

Info

If you do not see anything after running the command, it is possible that Slurm is already running your job or that your job has already finished.
If you want to have more time, add the following lines after the line echo "Hello World from $HOSTNAME" in the batch script:
sleep 60
echo "And now, it is "$(date)

This will cause the job to wait for 60s and then print out the current time and date again before finishing. You will also have to increase the running time of your job:
#SBATCH --time=00:01:15

Then, resubmit the job.

Once your job starts running, you will receive an e-mail from the cluster. It will only have a subject line, which will look something like this:

Code Block
 Slurm Job_id=<job_id> Name=test_helloworld Began, Queued time 00:00:01

Since this is a very short job, you might only receive this e-mail after your job has finished.

Once the job has finished, you will receive another e-mail which contains more information about your job's performance. The subject will look like this if your job completed:

Code Block
 Slurm Job_id=<job_id> Name=test_helloworld Ended, Run time 00:00:01, COMPLETED, ExitCode 0

The body of the message will probably contain some information about your job.

Checking the output of your Job

In the directory from which you submitted your job, there should be a new file created by Slurm: test_helloworld_<jobid>.out. The name follows the pattern that we set with --output=%x_%j.out, i.e., the job name followed by the job ID. This file contains all the output from your job that would normally have been written to the command line.

Have a look at the output in the file. If you are on the command line, you can run

Code Block
cat test_helloworld_<jobid>.out

to print its content directly on the command line, or you can view it with a basic command-line pager such as less

Code Block
less test_helloworld_<jobid>.out

You can leave “less” by typing “q”. Of course, you can also open the file with an editor with a graphical user interface like gedit or copy the file to your local workstation and view it there.
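
For longer jobs, it can also be handy to follow the output file while the job is still running. A small sketch using tail (press Ctrl+C to stop following):

Code Block
tail -f test_helloworld_<jobid>.out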

The content of the file should look something like this:

Code Block
#### Starting Test
This is <username> and my first job has the ID <jobid>
This job was submitted from <some_path> and I am currently in <another_path>
It is now <date>
Hello World from <node_name>
#### Finished Test. Have a nice day

where “<username>”, “<jobid>”, “<some_path>”, “<another_path>”, “<date>”, and “<node_name>” will have been replaced by the actual values.

Checking resource usage of your job

A quick overview of your resource usage can be retrieved with the command seff:

Code Block
 [me@nodelogin02]$ seff <job_id>

Alternatively, you can also get information with Slurm's sacct command:

Code Block
 [me@nodelogin02]$ sacct -n --jobs=<job_id> --format "JobID,JobName,User,AllocNodes,NodeList,Partition,AllocTRES,AveCPUFreq,AveRSS,Submit,Start,End,CPUTime,Elapsed,MaxRSS,ReqCPU"
 <job_id>        <job_name>  <username>        1         node017  cpu-short billing=1+                       2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01               Unknown
 <job_id>.batch       batch                    1         node017            cpu=1,mem+      1.21G      1348K 2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01      1348K          0
 <job_id>.extern     extern                    1         node017            billing=1+      1.10G      1320K 2020-09-17T10:45:30 2020-09-17T10:45:30 2020-09-17T10:45:31   00:00:01   00:00:01      1320K          0
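
If the full field list above is too wide for your terminal, a shorter selection of fields is often enough. A minimal sketch:

Code Block
 sacct --jobs=<job_id> --format="JobID,JobName,Elapsed,MaxRSS,State"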

Cancelling your job

If you want to cancel your job after you have submitted it, you can run:

Code Block
scancel <job_id>

You can use it to cancel the job at any stage in the queue, i.e., pending or running.
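
scancel also accepts other selectors. For example, you can cancel all of your own jobs or all jobs with a given name; a small sketch (replace the placeholders accordingly):

Code Block
scancel -u <username>
scancel --name=test_helloworld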

Note that you might not be able to cancel the job in this example, because it has already finished.