Your first R job

About this tutorial

R is a programming language and software environment for statistical computing and graphics.

This tutorial will guide you through creating and running simple serial and parallel jobs using R on ALICE and SHARK. The examples used here are based on a tutorial from the Ohio Supercomputer Center (link)

What you will learn?

  • Setting up the batch script for a simple R job

  • Loading the necessary modules

  • Submitting your job

What this example will not cover?

What you should know before starting?

  • Basic R. This tutorial is not intended as a tutorial on R. If you are completely new to R, we recommend that you go through a generic R tutorial first.

  • Basic knowledge of how to use a Linux OS from the command line.

  • How to connect to ALICE or SHARK:

  • How to move files to and from ALICE or SHARK:

  • How to setup a simple batch job as shown in:

R on ALICE and SHARK

There are different versions of R available on ALICE and SHARK. You can find a list of available versions with


ALICE

module -r avail '^R/'

Some R modules have also been built with CUDA support.


SHARK

module avail /R/

The command R --version returns the version of R you have loaded:

[me@nodelogin02 ~]$ R --version
R version 4.4.0 (2024-04-24) -- "Puppy Cup"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

Preparations

Log in to ALICE or SHARK if you have not done so yet.

Before you set up your job or submit it, it is always best to have a look at the current job load on the cluster and what partitions are available to you.

Also, it helps to run some short, resource-friendly tests to see if your setup is working and your batch file is correct. The “testing” partition on ALICE or the “short” partition on SHARK can be used for this purpose. The examples in this tutorial are safe to use on those partitions.

Here, we will assume that you have already created a directory called user_guide_tutorials in your $HOME from the previous tutorials. For this job, let's create a sub-directory and change into it:
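mkdir -p $HOME/user_guide_tutorials/first_R_job
cd $HOME/user_guide_tutorials/first_R_job

(The sub-directory name first_R_job is the one used in the job-submission examples later in this tutorial.)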

Since this tutorial will go through different example R jobs, further preparations are discussed for each example.

A serial R job

We will create a simple R program that calculates the sum of vectors sampled from a normal distribution. Each time the function is executed, a new simulation is performed.

Here, we will run the simulations in a serial manner on a single core.

Preparations

The R script

First, we have to create an R file for our simulation. In the following, we will assume that this file is called test_R_serial.R and looks like this:
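The full script is not reproduced here, so below is a minimal sketch of what it could contain. The function name mySim, its run argument, the print statements and the running_time variable follow the description in this tutorial; the number of runs (10) matches the parallel examples later on, and everything else (vector length, variable names) is an illustrative choice.

# test_R_serial.R
# Simulate the sum of vectors sampled from a normal distribution.
# Each call to mySim() performs one independent simulation run.

n_samples <- 1e7   # length of each random vector (illustrative value)
n_runs    <- 10    # number of simulation runs

mySim <- function(run) {
  # the print statements only show which process executes which run
  print(paste("Run", run, "started on process", Sys.getpid()))
  result <- sum(rnorm(n_samples))
  print(paste("Run", run, "finished on process", Sys.getpid()))
  return(result)
}

start_time <- Sys.time()

# run the simulations one after the other
results <- list()
for (run in 1:n_runs) {
  results[[run]] <- mySim(run)
}

running_time <- Sys.time() - start_time

print(unlist(results))
print(running_time)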

We have added a few print statements to the mySim-function which are only there to visualize that the parallelization in the next example is working properly. The run argument, too, is only there for the output messages. That said, starting with more verbosity in a program can sometimes help with debugging.

The Slurm batch file

Next, we will create the batch file test_R_serial.slurm. We make use of the testing partition on ALICE and the short partition on SHARK. The time and memory requirements were set after the job had already been run once. Usually, it is best to make a conservative estimate for the test runs and then adjust the resources accordingly:


ALICE
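The batch file itself is not reproduced here; a minimal sketch for ALICE could look like the following. The partition (testing) and the output-file pattern follow this tutorial, while the time and memory values and the exact name of the R module are assumptions; check module -r avail '^R/' for the versions that are actually installed.

#!/bin/bash
#SBATCH --job-name=test_R_serial
#SBATCH --output=%x_%j.out
#SBATCH --partition=testing
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G

# print some information about the job
echo "#### Starting serial R job on $(hostname) at $(date)"
echo "#### Slurm job ID: $SLURM_JOB_ID"

# load an R module (replace the version with one listed by module avail)
module load R/4.4.0

# run the simulation
Rscript test_R_serial.R

echo "#### Finished serial R job at $(date)"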


SHARK
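On SHARK, the batch file would look essentially the same as the ALICE sketch above; the main expected differences are the partition and possibly the name of the R module to load:

#SBATCH --partition=short

Check module avail /R/ on SHARK for the exact module name.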


The batch script will also print out some additional information.

Job submission

Now that we have the R script and the batch file, we are ready to run our job.

Please make sure that you are in the same directory as the scripts. If not, change into it first:
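cd $HOME/user_guide_tutorials/first_R_job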

You are ready to submit your job like this:
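sbatch test_R_serial.slurm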

Immediately after you have submitted it, you should see something like this:
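Submitted batch job <jobid>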

Job output

In the directory where you launched your job, there should be a new file created by Slurm: test_R_serial_<jobid>.out. It contains all the output from your job which would normally have been written to the command line. Check the file for any possible error messages. The content of the file should look something like this:

Note how the process id (PID) is the same for all simulation runs because they are done in serial. Also, take note of the running time of the job for comparison when we move on to parallelizing this simulation.

You can get a quick overview of the resources actually used by your job by running:
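If the seff utility is available on the cluster, a quick summary can be obtained with:

seff <jobid>

Alternatively, sacct -j <jobid> shows the accounting information that Slurm recorded for the job.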

It might look something like this:

A first parallel R job

Running this simulation in a serial manner is inefficient because each simulation run is independent of the others. This makes it a classic case for parallelization. R comes with different options for parallelization. Here, we will make use of the parallel package and its mclapply function.

Preparations

In order to parallelize our test job, we will have to make a few small changes to the R script and the batch file.

Parallel R script

First, we will make a copy of the file from the serial example (test_R_serial.R), which we will name test_R_parallel.R.
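On the command line, this can be done with:

cp test_R_serial.R test_R_parallel.R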

Next, open test_R_parallel.R with your favorite editor and add library(parallel) after the first three lines of the script. The beginning of your script should now look like this:
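Based on the sketch of test_R_serial.R above, the top of test_R_parallel.R would then read:

# test_R_parallel.R
# Simulate the sum of vectors sampled from a normal distribution.
# Each call to mySim() performs one independent simulation run.
library(parallel)

n_samples <- 1e7   # length of each random vector (illustrative value)
n_runs    <- 10    # number of simulation runs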

We want our R script to automatically pick up the number of cores that Slurm has assigned to us. In principle, you can do this either by reading out the Slurm environment variable SLURM_CPUS_PER_TASK or by letting R detect the cores itself. Here, we will use the latter. After the definition of the mySim-function, add the following lines:
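One way to do this is with the detectCores() function from the parallel package (a sketch; the variable name numCores is just a choice for this example). Note that detectCores() reports all cores of the node, which on a shared node can be more than your Slurm allocation; in that case, reading SLURM_CPUS_PER_TASK, as in the second parallel example, is the safer choice.

numCores <- detectCores()
print(paste("Number of cores detected:", numCores))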

Finally, we have to replace the for-loop in the script with a single call to mclapply, so that the simulation runs are distributed over the available cores instead of being executed one after the other:
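Based on the serial sketch above, the loop

results <- list()
for (run in 1:n_runs) {
  results[[run]] <- mySim(run)
}

becomes a single call to mclapply:

results <- mclapply(1:n_runs, mySim, mc.cores = numCores)

Like the loop, mclapply fills a list with one result per run, so the rest of the script (unlist(results), running_time) does not need to change.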

Slurm batch file

Since this is a different simulation setup with a new R script, it is always best to also create a new Slurm batch file for running it. This makes it much easier to debug issues, reproduce your job, and tweak settings and resources.

Let's make a copy of our existing Slurm batch file (test_R_serial.slurm) and name it test_R_parallel.slurm.

We have to change a few #SBATCH settings. Apart from the name of the job, we need to specify the number of cores that we want to request using --cpus-per-task. We will also change --mem to --mem-per-cpu to tell Slurm how much memory we need per core. So, the total amount of memory that we will request is mem-per-cpu * cpus-per-task. The beginning of your batch file should now look something like this:


ALICE
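A sketch of how the top of test_R_parallel.slurm could look on ALICE; requesting 10 cores matches the 10 simulation runs of the example, and the time and memory values are again only placeholders:

#!/bin/bash
#SBATCH --job-name=test_R_parallel
#SBATCH --output=%x_%j.out
#SBATCH --partition=testing
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem-per-cpu=1G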


SHARK
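On SHARK, the same header applies with the partition changed accordingly:

#SBATCH --partition=short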


Lastly, we just have to change the name of the R script that the batch file executes, i.e., we replace Rscript test_R_serial.R with Rscript test_R_parallel.R.

Job submission

If you have completed the previous steps, it is time to run your first parallel R job. Assuming you are in the directory $HOME/user_guide_tutorials/first_R_job, you can submit your job like this:
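sbatch test_R_parallel.slurm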

Job output

In the directory where you launched your job, there should be a new file created by Slurm: test_R_parallel_<jobid>.out. It contains all the output from your job which would normally have been written to the command line. Check the file for any possible error messages. The content of the file should look something like this:

You can clearly see how the running time has gone down by using multiple cores. The parallelization is also evident from the fact that the PID changes (there should be 10 different PIDs in use) and the output from the simulation runs is out of order.

You can get a quick overview of the resources actually used by your job by running:

It might look something like this:

A second parallel R job

Here, we will make use of R's doParallel package (a parallel backend for the foreach package) to parallelize the simulation.

Preparations

R script with doparallel

Once more, we will make a copy of the file from the serial example (test_R_serial.R), but this time we will name it test_R_doparallel.R.

You can remove all print(paste(...)) statements in the new file since these will not work with the doParallel package.

As was the case with the first parallel R script, we need to load the necessary R packages. The beginning of your R script should now look something like this:
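Following the earlier sketches, the top of test_R_doparallel.R could look like this (loading doParallel also pulls in the foreach and parallel packages):

# test_R_doparallel.R
# Simulate the sum of vectors sampled from a normal distribution.
# Each call to mySim() performs one independent simulation run.
library(doParallel)

n_samples <- 1e7   # length of each random vector (illustrative value)
n_runs    <- 10    # number of simulation runs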

Next, we will add a few lines that get and print out the number of cores available to our R job. To mix it up, we will read out the Slurm environment variable this time. Also, we will tell doParallel how many cores it can use:
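A possible version of these lines, reusing the variable name numCores from the earlier sketches:

# read the number of cores assigned by Slurm (fall back to 1 if the variable is not set)
numCores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
print(paste("Number of cores assigned by Slurm:", numCores))

# register the parallel backend that foreach/%dopar% will use
registerDoParallel(cores = numCores)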

This time, we will replace the for-loop in the serial script with:
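With the doParallel backend registered, the for-loop from the serial sketch could be rewritten as a foreach construct with the %dopar% operator:

results <- foreach(run = 1:n_runs) %dopar% {
  mySim(run)
}

Like mclapply, foreach returns a list by default, so unlist(results) at the end of the script still works.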

At the end of the script, we will remove our compute environment by adding the following lines after running_time:
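When the backend was registered with registerDoParallel(cores = numCores) as in the sketch above, the corresponding clean-up is:

# shut down the workers that registerDoParallel() started implicitly
stopImplicitCluster()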

Slurm batch file

If you worked through the first example, you can just create a copy of test_R_parallel.slurm and name it test_R_doparallel.slurm. Then, you only have to change the job name and the name of the R script. Your sbatch settings should look like this now:


ALICE
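Following the earlier sketches, the ALICE header could then look like this (only the job name differs from test_R_parallel.slurm):

#!/bin/bash
#SBATCH --job-name=test_R_doparallel
#SBATCH --output=%x_%j.out
#SBATCH --partition=testing
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem-per-cpu=1G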


SHARK
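On SHARK, again only the partition differs:

#SBATCH --partition=short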


and replace Rscript test_R_parallel.R with Rscript test_R_doparallel.R.

Job submission

Assuming you are in the directory $HOME/user_guide_tutorials/first_R_job, you can submit this R job like this:
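sbatch test_R_doparallel.slurm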

Job output

In the directory where you launched your job, there should be a new file created by Slurm: test_R_doparallel_<jobid>.out. It contains all the output from your job which would normally have been written to the command line. Check the file for any possible error messages. The content of the file should look something like this:

You can clearly see that the running time has gone down compared to the serial R script and is only slightly higher than with parallel's mclapply.

You can get a quick overview of the resources actually used by your job by running:

It might look something like this:

Cancelling your job

In case you need to cancel the job that you have submitted, you can use the following command
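scancel <jobid>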

You can use it to cancel the job at any stage in the queue, i.e., pending or running.

Note that you might not be able to cancel the job in this example, because it has already finished.