Your first bash job



About this tutorial

This tutorial will guide you through setting up and submitting a very basic job using only bash commands without any modules. The focus of this tutorial is on the workflow with Slurm.

What you will learn

  • Writing a batch file for your job

  • Submitting your job

  • Monitoring your job

  • Collecting information about your job

What this example will not cover

  • Loading and using modules for your job

  • Compiling code

What you should know before starting

While you can go through this tutorial without prior knowledge of Slurm, it is recommended that you first read the section Running jobs on ALICE or SHARK - Workload manager Slurm.

Preparations

Log in to ALICE or SHARK if you have not done so already.

Before you set up your job or submit it, it is always best to have a look at the current job load on the cluster and what partitions are available to you. You can do this with the Slurm command sinfo. The output might look something like this:


ALICE

[me@nodelogin02]$ sinfo
PARTITION   AVAIL  TIMELIMIT    NODES  STATE  NODELIST
testing     up     1:00:00      2      idle   nodelogin[01-02]
cpu-short*  up     3:00:00      11     mix    node[002-007,013-014,018-020]
cpu-short*  up     3:00:00      1      alloc  node001
cpu-short*  up     3:00:00      8      idle   node[008-012,015-017]
cpu-medium  up     1-00:00:00   11     mix    node[002-007,013-014,018-020]
cpu-medium  up     1-00:00:00   8      idle   node[008-012,015-017]
cpu-long    up     7-00:00:00   10     mix    node[003-007,013-014,018-020]
cpu-long    up     7-00:00:00   8      idle   node[008-012,015-017]
gpu-short   up     3:00:00      10     mix    node[851-860]
gpu-medium  up     1-00:00:00   10     mix    node[851-860]
gpu-long    up     7-00:00:00   9      mix    node[852-860]
mem         up     14-00:00:00  1      idle   node801
mem_mi      up     4-00:00:00   1      idle   node802
amd-short   up     4:00:00      1      idle   node802

SHARK

[me@res-hpc-lo02 ~]$ sinfo
PARTITION        AVAIL  TIMELIMIT  NODES  STATE  NODELIST
all*             up     infinite   13     mix    res-hpc-exe[001,014,024,036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
all*             up     infinite   13     idle   res-hpc-exe[002-003,005,007-009,011-013,027,029-031]
gpu              up     infinite   3      mix    res-hpc-gpu[01-02,09]
gpu              up     infinite   2      idle   res-hpc-gpu[03-04]
lumcdiagnostics  up     infinite   15     mix    res-hpc-exe[001,014,024,032-033,036,039-043],res-hpc-gpu[01-02],res-hpc-mem[01-02]
lumcdiagnostics  up     infinite   15     idle   res-hpc-exe[002-003,005,007-009,011-013,027,029-031,034-035]
highmem          up     infinite   2      mix    res-hpc-mem[01-02]
short            up     1:00:00    16     mix    res-hpc-exe[001,014,024,032-033,036,039-043],res-hpc-gpu[01-02,05-06,09]
short            up     1:00:00    19     idle   res-hpc-exe[002-003,005,007-009,011-013,027,029-031,034-035],res-hpc-gpu[03-04],res-hpc-path[01-02]
highmemgpu       up     infinite   1      idle   res-hpc-gpu07

You can see that some nodes are idle, i.e., they are not running any jobs; some nodes are allocated, i.e., they run one or more jobs that require all of their resources; some nodes are in a mix state which means that they are running jobs, but have free resources left.

Let's also create a directory for our job on the cluster and change into it.

mkdir -p $HOME/user_guide_tutorials/first_bash_job
cd $HOME/user_guide_tutorials/first_bash_job

Creating the batch file

A Slurm batch file generally consists of the following three elements:

  1. Interpreter

  2. Slurm settings

  3. Job commands

We will first go through each element separately and then combine them into one batch script. While this is not included here, the "Job commands" element can, for example, also include commands to stage data on local scratch and to move data back from local scratch.

Interpreter

Defining the type of interpreter for your shell commands is usually the first line in a batch script. Here, we will use bash:
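#!/bin/bash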

It is recommended to set this to the same shell that you use for logging in.

Slurm settings

As you probably have seen in Running jobs on ALICE or SHARK - Workload manager Slurm, there are basically two types of Slurm settings that go into your batch file:

  1. Settings for job organisation

  2. Settings for job resources/execution

Settings for job organization

Let us start with the first type. It is never too early to get used to organising your jobs. This will help you in the long run to keep an overview of all your jobs and their products. It will also make it much easier to repeat jobs with the same settings if necessary. It might not look important when you only write a simple test script like this, but it will be once you run all kinds of different jobs.

This is how these Slurm settings could look for this example.
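The sketch below shows one possible version of these settings; the exact output and error file names are an assumption here, but they follow the %x and %j naming scheme described further down:

#SBATCH --job-name=test_helloworld
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="your-email-address"
#SBATCH --mail-type="ALL"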

You can consider these settings the minimum of what you should put in your batch script to organize your jobs, so let us go through them one by one.

  • Line 1: this sets the name of the job to test_helloworld. Defining the job name will make it easier for you later to find the information about the job status.

  • Lines 2-3: here, we have defined the files to which Slurm will write the standard output and error messages. You have probably noticed that the file names look somewhat unusual. This is because we have used replacement symbols that are available for batch files. %x is the symbol for the job name which we defined first. %j corresponds to the job ID number which will be assigned to the job by Slurm once we submit it. Of course, you are free to name the output files however you want. However, we strongly advise you to always add %j to your file names in order to prevent different jobs from writing to the same file.

  • Lines 4-5: these settings tell Slurm to send us notifications about our job to the e-mail address set in --mail-user. Because of --mail-type="ALL", Slurm will inform us about all events related to our job.

While the e-mail notification settings will probably not change much between jobs, you will most likely adjust the first two settings for each new job.

Settings for job resources/execution

There is a range of different settings that affect how your job is scheduled or executed by Slurm. Therefore, they might change significantly from job to job.

The job that we will run in this example does not require a lot of resources. Therefore, the following settings are sufficient.
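A possible version of these settings is sketched below; whether the single core is requested with --ntasks or, for example, --cpus-per-task is an assumption here, the values themselves follow the description below:

#SBATCH --partition=<partition>
#SBATCH --time=00:00:15
#SBATCH --ntasks=1
#SBATCH --mem=10M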

Let us go through them:

  • Line 1: here, we set the partition that we want to use. You should replace "<partition>" with the name of the partition. Since this will be a very simple test, we do not require a lot of processing time. Therefore, you could use the "cpu-short" or "testing" partition if you are on ALICE, or the "short" partition if you are on SHARK.

  • Line 2: this setting tells Slurm that we will need a maximum compute time of 15s for this job. The job will not take that long, but we want to include a small time buffer. If our job goes beyond this time limit, it will be cancelled.

  • Line 3: this will tell Slurm the number of cores that we will need. We will only require one core for this job.

  • Line 4: here, we let Slurm know that we need about 10M of memory. Setting a memory limit is important to make sure that no default value is applied and that Slurm knows how much memory to reserve.

Job commands

Now that we have the Slurm settings in place, we can define the environment variables and commands that will be executed. All we want to do here is execute a set of bash commands and, of course, print out "Hello World". We will also make use of some Slurm-specific environment variables, so that we get used to them. We will not use or move any data.
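A sketch of such a set of job commands is shown below; the exact wording of the echo messages is illustrative, but the structure matches the line-by-line description that follows:

echo "Hello World, this is my first bash job"
echo "This is $SLURM_JOB_USER and my job has the ID $SLURM_JOB_ID"
# get and report the working and submission directories
CWD=$(pwd)
echo "My job was submitted from: $SLURM_SUBMIT_DIR"
echo "My current working directory is: $CWD"
CURRENT_DATE=$(date)
echo "It is now: $CURRENT_DATE"
echo "I am running on node: $HOSTNAME"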

Let us go through some of them:

  • We use echo to print out a bunch of messages.

  • Line 2: Here, we make use of two important environment variables that are provided by Slurm automatically. $SLURM_JOB_USER contains our Slurm user name and $SLURM_JOB_ID stores the id of our job.

  • Line 4: This uses pwd to get the current working directory and assigns it to a new variable.

  • Line 5: Another Slurm environment variable, $SLURM_SUBMIT_DIR, is used here to get the directory from which we submitted the job.

  • Lines 7-8: The first one gets the current date and the second one prints it out.

  • Line 9: This line finally returns the name of the host using the system environment variable $HOSTNAME

The complete batch script

We have finished assembling the batch script for your first job. This is how it looks when put together:
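Using the sketches from above (with the same caveats about the exact wording of the messages):

#!/bin/bash
#SBATCH --job-name=test_helloworld
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="your-email-address"
#SBATCH --mail-type="ALL"
#SBATCH --partition=<partition>
#SBATCH --time=00:00:15
#SBATCH --ntasks=1
#SBATCH --mem=10M

echo "Hello World, this is my first bash job"
echo "This is $SLURM_JOB_USER and my job has the ID $SLURM_JOB_ID"
# get and report the working and submission directories
CWD=$(pwd)
echo "My job was submitted from: $SLURM_SUBMIT_DIR"
echo "My current working directory is: $CWD"
CURRENT_DATE=$(date)
echo "It is now: $CURRENT_DATE"
echo "I am running on node: $HOSTNAME"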

Remember to replace "your-email-address" with your real e-mail address and "<partition>" with the partition you want to use.

Save the batch file on the cluster either using a command-line editor (such as emacs, vim, nano) or an editor with a graphical user interface (e.g., gedit), or save the file on your local workstation. In this tutorial, we will call the file test_bash.slurm.

For Windows users:

If you write your batch script on Windows and want to copy it to the cluster later, make sure you use a proper editor. Windows and Linux use different characters for encoding line endings. Some editors for Windows (such as “Notepad++” or “Visual Studio Code”) can already save the file in the correct format. However, the standard editor Notepad cannot.

Running your job

Since this is a fairly simple job, it is okay to run it from a directory in your $HOME. Depending on the type of job that you want to run later on, this might have to change.

If you have written the file on your local workstation, copy it to the directory user_guide_tutorials/first_bash_job in your home directory on the cluster.
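For example, with scp from your local workstation (here, <username> and <cluster_address> are placeholders for your cluster user name and the address of the login node):

scp test_bash.slurm <username>@<cluster_address>:~/user_guide_tutorials/first_bash_job/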

For Windows users:

If you copied the batch file from Windows and you are not sure if it is correctly encoded, you can run a simple check:

cat -v test_bash.slurm

This will print out the content of the batch file to the command line and show the line endings. If you see the characters ^M at the end of each line, then you need to fix the line endings before you can submit your job. There are different ways to do this. The best way is to use an editor on Windows that uses the correct line endings.

If you need to do it on the cluster after you copied the file, you can run for example

sed -i 's/\r$//g' test_bash.slurm

or

dos2unix test_bash.slurm

Run cat -v test_bash.slurm again to check that the line endings no longer show ^M.

You are ready to submit your job like this:
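sbatch test_bash.slurm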

Immediately after you have submitted it, you should see something like this:
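Submitted batch job <job_id>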

where <job_id> is the ID that Slurm has assigned to your job.

Monitoring your job

As mentioned in Running jobs on ALICE or SHARK - Workload manager Slurm | Job Monitoring, there are various ways of how to monitor your job.

Probably one of the first things that you want to know is when your job is likely to start:
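For example (replace <username> with your own user name on the cluster):

squeue -u <username> --start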

If you try this right after your submission, you might not see a start date yet, because it usually takes Slurm a few seconds to estimate the starting date of your job.

Eventually, you should see something like this:
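The exact columns depend on the Slurm version and configuration, but the output will look roughly like the following; all values here are placeholders, the partition is just an example, and squeue truncates long job names:

   JOBID PARTITION     NAME       USER ST            START_TIME  NODES SCHEDNODES  NODELIST(REASON)
<job_id> cpu-short test_hel <username> PD  <expected_start_time>      1 (null)     (Priority)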

Depending on how busy the system is, your job will not be running right away. Instead, it will be pending in the queue until resources are available for the job to run. The NODELIST(REASON) column gives you an idea of why your job needs to wait, but we will not go into detail on this here. It might also be useful to simply check the entire queue with squeue.

If you do not see anything after running the command, it is possible that Slurm is already running your job or that your job has already finished.
If you want to have more time, add the following lines after the last echo command in the batch script:
sleep 60
echo "And now, it is "$(date)

This will cause the job to wait for 60s and then print out the current time and date again before finishing. You will also have to increase the running time of your job:
#SBATCH --time=00:01:15

Then, resubmit the job.

Once your job starts running, you will get an e-mail from the cluster. It will only have a subject line, which will look something like this:
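The exact wording can vary with the Slurm configuration, but typically it is along the lines of:

Slurm Job_id=<job_id> Name=test_helloworld Began, Queued time 00:00:01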

Since this is a very short job, you might receive the email after your job has finished.

Once the job has finished, you will receive another e-mail which will contain more information about your job's performance. The subject will look like this if your job completed:
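Again, the exact wording can vary, but typically:

Slurm Job_id=<job_id> Name=test_helloworld Ended, Run time 00:00:05, COMPLETED, ExitCode 0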

The body of the message will probably contain some information about your job.

Checking the output of your job

In the directory from which you launched your job, there should be a new file created by Slurm: test_helloworld_<jobid>.out (following the %x_%j naming pattern defined earlier). It contains all the output from your job that would normally have been written to the command line.

Have a look at the output in the file. If you are on the command line, you can run
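cat test_helloworld_<jobid>.out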

to print its content directly on the command line, or you can view it with a basic command-line pager such as less:
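less test_helloworld_<jobid>.out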

You can leave “less” by typing “q”. Of course, you can also open the file with an editor with a graphical user interface like gedit, or copy the file to your local workstation and view it there.

The content of the file should look something like this:
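Assuming the sketch of the job commands from above, the content would be something like:

Hello World, this is my first bash job
This is <username> and my job has the ID <jobid>
My job was submitted from: <some_path>
My current working directory is: <another_path>
It is now: <date>
I am running on node: <node_name>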

where “<username>”, “<jobid>”, “<some_path>”, “<another_path>”, “<date>”, and “<node_name>” will have been replaced by the actual values.

Checking resource usage of your job

A quick overview of your resource usage can be retrieved using the command seff:
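seff <job_id>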

Alternatively, you can also get information with Slurm's sacct command:
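For example, to show some basic accounting information for your job (the choice of fields here is just one possible selection):

sacct -j <job_id> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ReqMem,AllocCPUS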

Cancelling your job

If you want to cancel your job after you have submitted it, you can run:
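scancel <job_id>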

You can use it to cancel the job at any stage in the queue, i.e., pending or running.

Note that you might not be able to cancel the job in this example, because it has already finished.