About this tutorial

R is a programming language and software environment for statistical computing and graphics.

This tutorial will guide you through creating and running a simple serial job and a parallel job using R on ALICE. The examples used here are based on the tutorial from the Ohio Supercomputing Center (link).

What you will learn

  • Setting up the batch script for a simple R job

  • Loading the necessary modules

  • Submitting your job

What this example will not cover

  • Installing R packages (see the section on https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37749364/Installing+your+own+software#Installing-R-packages)

  • Using RMPI (for a simple example see More on using R)

  • How to parallelize code

  • Using RStudio (see More on using R)

What you should know before starting

  • Basic R. This tutorial is not intended as a tutorial on R. If you are completely new to R, we recommend that you go through a generic R tutorial first.

  • Basic knowledge of how to use a Linux OS from the command line.

  • How to connect to ALICE or SHARK: How to login to ALICE or SHARK

  • How to move files to and from ALICE or SHARK: Transferring Data

  • How to setup a simple batch job as shown in: Your first bash job


R on ALICE and SHARK

There are different versions of R available on ALICE and SHARK. You can find a list of available versions with


ALICE

Code Block
module -r avail '^R/'

Some R modules have also been built with CUDA support.


SHARK

Code Block
module avail /R/

The command R --version returns the version of R you have loaded:

Code Block
 R --version
 R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
 Copyright (C) 2016 The R Foundation for Statistical Computing
 Platform: x86_64-pc-linux-gnu (64-bit)

Preparations

Log in to ALICE or SHARK if you have not done so yet.

Before you set up your job or submit it, it is always best to have a look at the current job load on the cluster and what partitions are available to you.

Also, it helps to run some short, resource-friendly tests to check that your setup works and that your batch file is correct. The “testing” partition on ALICE or the “short” partition on SHARK can be used for this purpose. The examples in this tutorial are safe to use on those partitions.
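For example, the standard Slurm command-line tools can give you a quick overview of the partitions and the current job load (shown here as a sketch; the exact partition names and output depend on the cluster):

```shell
# list the partitions available to you and their state
sinfo

# show all jobs currently queued or running
squeue

# show only your own jobs
squeue -u $USER
```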

Here, we will assume that you have already created a directory called user_guide_tutorials in your $HOME from the previous tutorials. For this job, let's create a sub-directory and change into it:

Code Block
 mkdir -p $HOME/user_guide_tutorials/first_R_job
 cd $HOME/user_guide_tutorials/first_R_job

Since this tutorial goes through several example R jobs, further preparations are discussed for each example.

A serial R job

We will create a simple R program that calculates the sum of a vector sampled from a normal distribution. Each time the function is executed, a new simulation is performed.

Here, we will run the simulations in a serial manner on a single core.

Preparations

The R script

First, we have to create an R file for our simulation. In the following, we will assume that this file is called test_R_serial.R and looks like this:

Code Block
languager
# Test script for serial R job
# Based on example from OSC
# https://www.osc.edu/resources/available_software/software_list/r#9

# The function that does the actual work
mySim <- function(run, size=1000000) {
  # print out process ID and run number
  pid <- Sys.getpid()
  # Generate the vector
  vec <- rnorm(size)
  # Sum the values of the vector
  sum_vec <- sum(vec)
  # Print out PID, run and sum
  print(paste("Result of run ", run, " (with PID ", pid,"): ", sum_vec))
  # return sum
  return(sum(vec))
}

# Get the starting time of the script
start_time <- proc.time()

# Go through the simulation runs
for(i in 1:100) {
  mySim(i)
}

# Get the running time of the script and print it
print(paste("Running time of script:"))
running_time <- proc.time() - start_time
running_time

We have added a few print statements to the mySim function; they are only there to visualize that the parallelization in the next example works properly. The run argument, too, exists only for the output messages. That said, starting with more verbosity in a program can sometimes help with debugging.

The Slurm batch file

Next, we will create the batch file test_R_serial.slurm. We make use of the testing partition on ALICE and the short partition on SHARK. The time and memory requirements below were set after the job had already been run once. Usually, it is best to make a conservative estimate for the test runs and then adjust the resources accordingly:


ALICE

Code Block
#!/bin/bash
#SBATCH --job-name=test_R_serial
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=testing
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --mem=10M

#
# loading the R module
#
module load R/4.0.5-foss-2020b

#
# the actual job commands
#
echo "#### Running R serial test"

# just to illustrate some native slurm environment variables
echo "This is $SLURM_JOB_USER and this job has the ID $SLURM_JOB_ID"
echo "This job was submitted from $SLURM_SUBMIT_DIR"
echo "This job runs on $SLURMD_NODENAME"
# get the current working directory
CWD=$(pwd)
echo "I am currently in $CWD"
# get the current time and date
DATE=$(date)
echo "It is now $DATE"

# Run the file
echo "[$SHELL] Run script"
Rscript test_R_serial.R
echo "[$SHELL] Script finished"

echo "#### Finished R serial test"

SHARK

Code Block
#!/bin/bash
#SBATCH --job-name=test_R_serial
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=short
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --mem=10M

#
# loading the R module
#
module load statistical/R/4.1.2/gcc.8.3.1

#
# the actual job commands
#
echo "#### Running R serial test"

# just to illustrate some native slurm environment variables
echo "This is $SLURM_JOB_USER and this job has the ID $SLURM_JOB_ID"
echo "This job was submitted from $SLURM_SUBMIT_DIR"
echo "This job runs on $SLURMD_NODENAME"
# get the current working directory
CWD=$(pwd)
echo "I am currently in $CWD"
# get the current time and date
DATE=$(date)
echo "It is now $DATE"

# Run the file
echo "[$SHELL] Run script"
Rscript test_R_serial.R
echo "[$SHELL] Script finished"

echo "#### Finished R serial test"

The batch script will also print out some additional information.

Job submission

Now that we have the R script and the batch file, we are ready to run our job.

Please make sure that you are in the same directory where the scripts are. If not, change into it:

Code Block
 cd $HOME/user_guide_tutorials/first_R_job

You are ready to submit your job like this:

Code Block
 sbatch test_R_serial.slurm

Immediately after you have submitted it, you should see something like this:

Code Block
 [me@<login_node> first_R_job]$ sbatch test_R_serial.slurm
 Submitted batch job <job_id>

Job output

In the directory where you launched your job, there should be a new file created by Slurm: test_R_serial_<jobid>.out. It contains all the output from your job that would normally have been written to the command line. Check the file for any possible error messages. The content of the file should look something like this:

Code Block
#### Running R serial test
This is <me> and this job has the ID <jobid>
This job was submitted from /home/<me>/User_Guide/First_Job/First_R_Job
I am currently in /home/<me>/User_Guide/First_Job/First_R_Job
This job runs on nodelogin01
It is now Tue Apr  6 16:27:25 CEST 2021
[/bin/bash] Run script
[1] "Result of run  1  (with PID  303059 ):  797.015491864457"
[1] "Result of run  2  (with PID  303059 ):  20.3788396192479"
[1] "Result of run  3  (with PID  303059 ):  475.990385694449"
...
[1] "Result of run  100  (with PID  303059 ):  1142.99359880815"
[1] "Running time of script:"
   user  system elapsed
  6.698   0.376   7.074
[/bin/bash] Script finished
#### Finished R serial test

Note how the process ID (PID) is the same for all simulation runs because they are executed serially. Also, note the running time of the job for comparison when we move on to parallelizing this simulation.

You can get a quick overview of the resources actually used by your job by running:

Code Block
 seff <job_id>

It might look something like this:

Code Block
Job ID: <job_id>
Cluster: <cluster_name>
User/Group: <user_name>/<group_name>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:11
CPU Efficiency: 55.00% of 00:00:20 core-walltime
Job Wall-clock time: 00:00:20
Memory Utilized: 1.29 MB
Memory Efficiency: 12.85% of 10.00 MB

A first parallel R job

Running this simulation serially is inefficient because each simulation run is independent of the others. This makes it a classic case for parallelization. R comes with different options for parallelization. Here, we will make use of the parallel package and its mclapply function.
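As a minimal illustration of how mclapply works (independent of this tutorial's simulation): it applies a function to each element of a vector or list, distributes the calls over several processes, and returns a list with one element per input value.

```r
library(parallel)

# square the numbers 1 to 8, distributing the calls over 2 processes;
# mclapply returns a list, one element per input value
squares <- mclapply(1:8, function(x) x^2, mc.cores = 2)

# collapse the list into a plain numeric vector
print(unlist(squares))
```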

Preparations

In order to parallelize our test job, we will have to make a few small changes to the R script and the batch file.

Parallel R script

First, we will make a copy of the file from the serial example (test_R_serial.R), which we will name test_R_parallel.R.

Next, open test_R_parallel.R with your favorite editor and add library(parallel) after the first three lines of the script. The beginning of your script should look now like this:

Code Block
# Test script for serial R job
# Based on example from OSC
# https://www.osc.edu/resources/available_software/software_list/r#9

# loading libraries
library(parallel)

We want our R script to automatically pick up the number of cores that Slurm has assigned to it. In principle, you can do this by reading out the Slurm environment variable SLURM_CPUS_PER_TASK or by using R's system function. Here, we will use the latter. After the definition of the mySim function, add the following lines:

Code Block
...
  return(sum(vec))
}

# get the number of cores and print them out;
# system() returns a character string, so convert it to a number
cores <- as.numeric(system("nproc", intern=TRUE))
print(paste("Using ",cores, " cores"))
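If you prefer the Slurm environment variable mentioned above, a hedged alternative could look like this (note that SLURM_CPUS_PER_TASK is only set when --cpus-per-task is used in the batch file, so a fallback is useful):

```r
# read the number of cores from Slurm; Sys.getenv returns "" when
# the variable is not set, e.g. when running interactively
cores <- suppressWarnings(as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK")))

# fall back to a single core if the variable was not set
if (is.na(cores)) {
  cores <- 1
}
print(paste("Using ", cores, " cores"))
```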

Finally, we have to replace the for-loop in the script with mclapply, i.e., instead of

Code Block
# Go through the simulation runs
for(i in 1:100) {
  mySim(i)
}

your script should contain just one line

Code Block
# Go through the simulation runs in parallel
result <- mclapply(1:100, function(i) mySim(i), mc.cores=cores)
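The result variable returned by mclapply is a list with one entry per run; the script above does not use it further, but you can easily post-process it. A minimal, self-contained sketch (with a simplified stand-in for mySim, not the tutorial's exact function):

```r
library(parallel)

# simplified stand-in for the tutorial's mySim function
mySim <- function(run, size = 1000) {
  sum(rnorm(size))
}

# run the 100 simulations in parallel; mclapply returns a list
result <- mclapply(1:100, function(i) mySim(i), mc.cores = 2)

# collapse the list into a numeric vector and summarize it
sums <- unlist(result)
print(paste("Number of runs:", length(sums)))
print(paste("Mean of run sums:", round(mean(sums), 2)))
```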

Slurm batch file

Since this is a different simulation setup with a new R script, it is always best to also create a new Slurm batch file for running it. This greatly helps with debugging any issues, with the reproducibility of your job, and with tweaking settings and resources.

Let's make a copy of our existing Slurm batch file (test_R_serial.slurm) and name it test_R_parallel.slurm.

We have to change a few #SBATCH settings. Apart from the name of the job, we need to specify the number of cores that we want to request using --cpus-per-task. We will also change --mem to --mem-per-cpu to tell Slurm how much memory we need per core. So, the total amount of memory that we request will be mem-per-cpu * cpus-per-task. The beginning of your batch file should now look something like this:


ALICE

Code Block
#!/bin/bash
#SBATCH --job-name=test_R_parallel     # <- new job name
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=testing
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10             # <- new option to set the number of cores for our job
#SBATCH --mem-per-cpu=10M              # <- set the memory per core

SHARK

Code Block
#!/bin/bash
#SBATCH --job-name=test_R_parallel     # <- new job name
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=short
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10             # <- new option to set the number of cores for our job
#SBATCH --mem-per-cpu=10M              # <- set the memory per core
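As a quick sanity check of the total memory request that follows from these settings (10 cores at 10M per core):

```shell
# total memory request = mem-per-cpu * cpus-per-task
echo $((10 * 10))M   # prints "100M"
```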

Lastly, we just have to change the name of the R script that the batch file executes, i.e., we replace

Code Block
Rscript test_R_serial.R

by

Code Block
Rscript test_R_parallel.R

Job submission

If you have completed the previous steps, it is time to run your first parallel R job on ALICE. Assuming you are in the directory $HOME/user_guide_tutorials/first_R_job, you can submit your job like this:

Code Block
[me@nodelogin02 first_R_job]$ sbatch test_R_parallel.slurm
Submitted batch job <job_id>

Job output

In the directory where you launched your job, there should be a new file created by Slurm: test_R_parallel_<jobid>.out. It contains all the output from your job that would normally have been written to the command line. Check the file for any possible error messages. The content of the file should look something like this:

Code Block
#### Running R serial test
This is <me> and this job has the ID <jobid>
This job was submitted from /home/<me>/User_Guide/First_Job/First_R_Job
This job runs on nodelogin01
I am currently in /home/<me>/User_Guide/First_Job/First_R_Job
It is now Wed Apr  7 09:31:48 CEST 2021
[/bin/bash] Run script
[1] "Using  10  cores"
[1] "Result of run  6  (with PID  464024 ):  -25.4041249177298"
[1] "Result of run  7  (with PID  464025 ):  695.207612786061"
[1][1] "Result of run  8  (with PID  464026 ):  1953.82997266006"
 "Result of run  9  (with PID  464027 ):  65.457175604765"
...
[1] "Result of run  92  (with PID  464020 ):  -600.477274309403"
[1] "Running time of script:"
   user  system elapsed
  6.582   0.513   0.943
[/bin/bash] Script finished

You can clearly see how the running time has gone down by using multiple cores. The parallelization is also evident from the fact that the PID changes (there should be 10 different PIDs in use) and the output from the simulation runs is out of order.

You can get a quick overview of the resources actually used by your job by running:

Code Block
 seff <job_id>

It might look something like this:

Code Block
Job ID: <job_id>
Cluster: <cluster_name>
User/Group: <user_name>/<group_name>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 10
CPU Utilized: 00:00:09
CPU Efficiency: 30.00% of 00:00:30 core-walltime
Job Wall-clock time: 00:00:03
Memory Utilized: 1.46 MB
Memory Efficiency: 1.46% of 100.00 MB

A second parallel R job

Here, we will make use of R's doParallel package to parallelize the simulation.

Preparations

R script with doParallel

Once more, we will make a copy of the file from the serial example (test_R_serial.R), but this time we will name it test_R_doparallel.R.

You can remove all print(paste(...)) statements in the new file, since these will not work with the doParallel package.

As was the case with the first parallel R script, we need to add code to load the necessary R packages. The beginning of your R script should now look something like this:

Code Block
# Test script for serial R job
# Based on example from OSC
# https://www.osc.edu/resources/available_software/software_list/r#9

# loading libraries
library(doParallel, quiet = TRUE)
library(foreach) 

Next, we will add code to get and print out the number of cores used by our R job. To mix it up, we will read out the Slurm environment variable this time. Also, we will tell doParallel how many cores it can use:

Code Block
...
  return(sum(vec))
}

# get the number of cores and print them out
cores <- Sys.getenv("SLURM_CPUS_PER_TASK")
print(paste("Using ",cores, " cores"))

# initiate compute environment for doParallel;
# reserve one core for the main R process
cl <- makeCluster(as.numeric(cores)-1)
registerDoParallel(cl)

This time, we will replace the for-loop in the serial script with:

Code Block
# Go through the simulation runs in parallel
result <- foreach(i=1:100, .combine=c) %dopar% {
  mySim(i)
} 
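To see what .combine=c does, here is a minimal, self-contained sketch (it assumes the foreach and doParallel packages are installed): without .combine, foreach returns a list; with .combine=c the per-iteration results are concatenated into a plain vector.

```r
library(doParallel)

# start a small cluster and register it with foreach
cl <- makeCluster(2)
registerDoParallel(cl)

# with .combine=c the 10 per-iteration results are concatenated
# into a single numeric vector instead of a list
squares <- foreach(i = 1:10, .combine = c) %dopar% {
  i^2
}
print(squares)

# remove compute environment
stopCluster(cl)
```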

At the end of the script, we will shut down our compute environment by adding the following lines after running_time:

Code Block
# remove compute environment
invisible(stopCluster(cl))

Slurm batch file

If you worked through the first example, you can just create a copy of test_R_parallel.slurm and name it test_R_doparallel.slurm. Then, you only have to change the job name and the name of the R script. Your #SBATCH settings should now look like this:


ALICE

Code Block
#!/bin/bash
#SBATCH --job-name=test_R_doparallel     # <- new job name
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=testing
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10             # <- same as for first parallel example
#SBATCH --mem-per-cpu=10M              # <- same as for first parallel example 

SHARK

Code Block
#!/bin/bash
#SBATCH --job-name=test_R_doparallel     # <- new job name
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=short
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10             # <- same as for first parallel example
#SBATCH --mem-per-cpu=10M              # <- same as for first parallel example 

and replace Rscript test_R_serial.R with:

Code Block
Rscript test_R_doparallel.R

Job submission

Assuming you are in the directory $HOME/user_guide_tutorials/first_R_job, you can submit this R job like this:

Code Block
[me@nodelogin02 first_R_job]$ sbatch test_R_doparallel.slurm
Submitted batch job <job_id>

Job output

In the directory where you launched your job, there should be a new file created by Slurm: test_R_doparallel_<jobid>.out. It contains all the output from your job that would normally have been written to the command line. Check the file for any possible error messages. The content of the file should look something like this:

Code Block
#### Running R serial test
This is <me> and this job has the ID <jobid>
This job was submitted from /home/<me>/User_Guide/First_Job/First_R_Job
This job runs on nodelogin01
I am currently in /home/<me>/User_Guide/First_Job/First_R_Job
It is now Thu Apr  8 13:52:37 CEST 2021
[/bin/bash] Run script
[1] "Using  10  cores"
   user  system elapsed
  0.115   0.029   1.220
[/bin/bash] Script finished
#### Finished R serial test 

You can clearly see that the running time has gone down compared to the serial R script; it is only slightly higher than when using the parallel package with mclapply.

You can get a quick overview of the resources actually used by your job by running:

Code Block
 seff <job_id>

It might look something like this:

Code Block
Job ID: <job_id>
Cluster: <cluster_name>
User/Group: <user_name>/<group_name>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 10
CPU Utilized: 00:00:02
CPU Efficiency: 2.22% of 00:01:30 core-walltime
Job Wall-clock time: 00:00:09
Memory Utilized: 1.46 MB
Memory Efficiency: 1.46% of 100.00 MB

Cancelling your job

In case you need to cancel the job that you have submitted, you can use the following command

Code Block
 scancel <job_id>

You can use it to cancel the job at any stage in the queue, i.e., while it is pending or running.

Note that you might not be able to cancel the job in this example, because it will likely have finished already.