R on ALICE and SHARK
There are different versions of R available on ALICE and SHARK. You can find a list of available versions with
ALICE
Code Block
module -r avail '^R/'
Some R modules have also been built with CUDA support.
SHARK
Code Block
module avail /R/
The command R --version
returns the version of R you have loaded:
Code Block
R --version
R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
Preparations
Log in to ALICE or SHARK if you have not done so yet.
Before you set up or submit your job, it is always best to have a look at the current job load on the cluster and which partitions are available to you.
It also helps to run some short, resource-friendly tests to check that your setup works and that your batch file is correct. The "testing" partition on ALICE or the "short" partition on SHARK can be used for this purpose. The examples in this tutorial are safe to run on those partitions.
Here, we will assume that you have already created a directory called user_guide_tutorials
in your $HOME
from the previous tutorials. For this job, let's create a sub-directory and change into it:
Code Block
mkdir -p $HOME/user_guide_tutorials/first_R_job
cd $HOME/user_guide_tutorials/first_R_job
Since this tutorial goes through several example R jobs, the remaining preparations are discussed separately for each example.
A serial R job
We will create a simple R program that calculates the sums of vectors sampled from a normal distribution. Each time the function is executed, a new simulation is run.
Here, we will run the simulations in a serial manner on a single core.
Preparations
The R script
First, we have to create an R file for our simulation. In the following, we will assume that this file is called test_R_serial.R
and looks like this:
Code Block
# Test script for serial R job
# Based on example from OSC
# https://www.osc.edu/resources/available_software/software_list/r#9

# The function that does the actual work
mySim <- function(run, size=1000000) {

  # print out process ID and run number
  pid <- Sys.getpid()

  # Generate the vector
  vec <- rnorm(size)

  # Sum the values of the vector
  sum_vec <- sum(vec)

  # Print out PID, run and sum
  print(paste("Result of run ", run, " (with PID ", pid, "): ", sum_vec))

  # return sum
  return(sum(vec))
}

# Get the starting time of the script
start_time <- proc.time()

# Go through the simulation runs
for(i in 1:100) {
  mySim(i)
}

# Get the running time of the script and print it
print(paste("Running time of script:"))
running_time <- proc.time() - start_time
running_time
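If you want repeated runs of the serial script to produce identical results, you could optionally fix R's random seed near the top of the script. This is an optional addition and not part of the tutorial script:
Code Block
# Optional (not in the tutorial script): fix the random seed so that
# repeated runs of the serial script draw the same random vectors
set.seed(42)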
We have added a few print statements to the mySim function; they are only there to show that the parallelization in the next example works properly. The run argument, too, exists only for the output messages. That said, a bit of extra verbosity can sometimes help with debugging a program.
The Slurm batch file
Next, we will create the batch file test_R_serial.slurm
. We make use of the testing partition on ALICE and the short partition on SHARK. The time and memory requirements below were set after the job had already been run once. Usually, it is best to make a conservative estimate for the test runs and then adjust the resources accordingly:
ALICE
Code Block
#!/bin/bash
#SBATCH --job-name=test_R_serial
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=testing
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --mem=10M

#
# loading the R module
#
module load R/4.0.5-foss-2020b

#
# the actual job commands
#
echo "#### Running R serial test"

# just to illustrate some native slurm environment variables
echo "This is $SLURM_JOB_USER and this job has the ID $SLURM_JOB_ID"
echo "This job was submitted from $SLURM_SUBMIT_DIR"
echo "This job runs on $SLURMD_NODENAME"

# get the current working directory
CWD=$(pwd)
echo "I am currently in $CWD"

# get the current time and date
DATE=$(date)
echo "It is now $DATE"

# Run the file
echo "[$SHELL] Run script"
Rscript test_R_serial.R
echo "[$SHELL] Script finished"

echo "#### Finished R serial test"
SHARK
Code Block
#!/bin/bash
#SBATCH --job-name=test_R_serial
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=short
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --mem=10M

#
# loading the R module
#
module load statistical/R/4.1.2/gcc.8.3.1

#
# the actual job commands
#
echo "#### Running R serial test"

# just to illustrate some native slurm environment variables
echo "This is $SLURM_JOB_USER and this job has the ID $SLURM_JOB_ID"
echo "This job was submitted from $SLURM_SUBMIT_DIR"
echo "This job runs on $SLURMD_NODENAME"

# get the current working directory
CWD=$(pwd)
echo "I am currently in $CWD"

# get the current time and date
DATE=$(date)
echo "It is now $DATE"

# Run the file
echo "[$SHELL] Run script"
Rscript test_R_serial.R
echo "[$SHELL] Script finished"

echo "#### Finished R serial test"
The batch script will also print out some additional information.
Job submission
Now that we have the R script and the batch file, we are ready to run our job.
Please make sure that you are in the directory where the scripts are. If not, change into it:
Code Block
cd $HOME/user_guide_tutorials/first_R_job
You are ready to submit your job like this:
Code Block
sbatch test_R_serial.slurm
Immediately after you have submitted it, you should see something like this:
Code Block
[me@<login_node> first_R_job]$ sbatch test_R_serial.slurm
Submitted batch job <job_id>
Job output
In the directory where you launched your job, there should be a new file created by Slurm: test_R_serial_<jobid>.out
. It contains all the output from your job that would normally have been written to the command line. Check the file for any possible error messages. The content of the file should look something like this:
Code Block
#### Running R serial test
This is <me> and this job has the ID <jobid>
This job was submitted from /home/<me>/User_Guide/First_Job/First_R_Job
I am currently in /home/<me>/User_Guide/First_Job/First_R_Job
This job runs on nodelogin01
It is now Tue Apr 6 16:27:25 CEST 2021
[/bin/bash] Run script
[1] "Result of run 1 (with PID 303059 ): 797.015491864457"
[1] "Result of run 2 (with PID 303059 ): 20.3788396192479"
[1] "Result of run 3 (with PID 303059 ): 475.990385694449"
...
[1] "Result of run 100 (with PID 303059 ): 1142.99359880815"
[1] "Running time of script:"
   user  system elapsed
  6.698   0.376   7.074
[/bin/bash] Script finished
#### Finished R serial test
Note how the process ID (PID) is the same for all simulation runs because they are executed serially. Also, keep the running time of the job in mind for when we move on to parallelizing this simulation.
You can get a quick overview of the resources actually used by your job by running:
Code Block
seff <job_id>
It might look something like this:
Code Block
Job ID: <job_id>
Cluster: <cluster_name>
User/Group: <user_name>/<group_name>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:11
CPU Efficiency: 55.00% of 00:00:20 core-walltime
Job Wall-clock time: 00:00:20
Memory Utilized: 1.29 MB
Memory Efficiency: 12.85% of 10.00 MB
A first parallel R job
Running this simulation in a serial manner is inefficient because each simulation run is independent of the others. This makes it a classic case for parallelization. R comes with different options for parallelization; here, we will make use of the parallel package and its mclapply
function.
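To give an idea of how mclapply works before we modify the tutorial script: it behaves like lapply, but forks the function calls onto several cores. The following is a minimal, self-contained sketch for illustration only; the numbers, the squaring function and the choice of 4 cores are not part of the tutorial files:
Code Block
# Minimal sketch of parallel::mclapply (illustration only, not a tutorial file):
# apply a function to the numbers 1 to 10, spread over 4 cores
library(parallel)

squares <- mclapply(1:10, function(x) x^2, mc.cores = 4)

# mclapply returns a list; unlist() turns it into a plain vector
print(unlist(squares))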
Preparations
In order to parallelize our test job, we will have to make a few small changes to the R script and the batch file.
Parallel R script
First, we will make a copy of the file from the serial example (test_R_serial.R
) which we will name test_R_parallel.R
.
Next, open test_R_parallel.R
with your favorite editor and add library(parallel)
after the first three lines of the script. The beginning of your script should look now like this:
Code Block
# Test script for serial R job
# Based on example from OSC
# https://www.osc.edu/resources/available_software/software_list/r#9

# loading libraries
library(parallel)
We want our R script to automatically pick up the number of cores that Slurm has assigned to us. In principle, you can do this by reading out the Slurm environment variable SLURM_CPUS_PER_TASK
or by calling an external command through R's system
function. Here, we will use the latter (a sketch of the former follows below). After the definition of the mySim
-function, add the following lines:
Code Block
...
  return(sum(vec))
}

# get the number of cores and print them out
cores <- system("nproc", intern=TRUE)
print(paste("Using ", cores, " cores"))
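As mentioned above, the alternative would be to read the Slurm environment variable directly. Here is a minimal sketch of that variant; the fallback to 1 core when the variable is not set (for example, when testing outside a Slurm job) is an assumption and not part of the tutorial script:
Code Block
# Alternative (sketch): read the core count from Slurm's environment variable.
# The fallback to 1 core when the variable is unset is an assumption for
# interactive testing and not part of the tutorial script.
cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
print(paste("Using ", cores, " cores"))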
Finally, we have to replace the for-loop in the script with mclapply
, i.e., instead of
Code Block
# Go through the simulation runs
for(i in 1:100) {
  mySim(i)
}
your script should contain just one line
Code Block
# Go through the simulation runs in parallel
result <- mclapply(1:100, function(i) mySim(i), mc.cores=cores)
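Note that mclapply returns a list with one element per simulation run. If you want to use the sums afterwards, you could, for example, flatten the list into a numeric vector. The lines below are an optional addition and not part of the original example:
Code Block
# Optional: collect the returned sums into a single numeric vector
sums <- unlist(result)
print(paste("Mean of all sums: ", mean(sums)))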
Slurm batch file
Since this is a different simulation setup with a new R script, it is always best to also create a new Slurm batch file for running it. This makes debugging issues, reproducing your job, and tweaking settings and resources much easier.
Let's make a copy of our existing Slurm batch file (test_R_serial.slurm
) and name it test_R_parallel.slurm
.
We have to change a few #SBATCH
settings. Apart from the name of the job, we need to specify the number of cores that we want to request using --cpus-per-task
. We will also change --mem
to --mem-per-cpu
to tell Slurm how much memory we need per core. The total amount of memory that we will request is therefore mem-per-cpu * cpus-per-task (in this example, 10 cores x 10M = 100M)
. The beginning of your batch file should now look something like this:
ALICE
Code Block
#!/bin/bash
#SBATCH --job-name=test_R_parallel   # <- new job name
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=testing
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10           # <- new option to set the number of cores for our job
#SBATCH --mem-per-cpu=10M            # <- set the memory per core
SHARK
Code Block
#!/bin/bash
#SBATCH --job-name=test_R_parallel   # <- new job name
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=short
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10           # <- new option to set the number of cores for our job
#SBATCH --mem-per-cpu=10M            # <- set the memory per core
Lastly, we just have to change the name of the R script that the batch file executes, i.e., we replace
Code Block
Rscript test_R_serial.R
by
Code Block
Rscript test_R_parallel.R
Job submission
If you have completed the previous steps, it is time to run your first parallel R job. Assuming you are in the directory $HOME/user_guide_tutorials/first_R_job
, you can submit your job like this:
Code Block
[me@nodelogin02 first_R_job]$ sbatch test_R_parallel.slurm
Submitted batch job <job_id>
Job output
In the directory where you launched your job, there should be a new file created by Slurm: test_R_parallel_<jobid>.out
. It contains all the output from your job that would normally have been written to the command line. Check the file for any possible error messages. The content of the file should look something like this:
Code Block
#### Running R serial test
This is <me> and this job has the ID <jobid>
This job was submitted from /home/<me>/User_Guide/First_Job/First_R_Job
This job runs on nodelogin01
I am currently in /home/<me>/User_Guide/First_Job/First_R_Job
It is now Wed Apr 7 09:31:48 CEST 2021
[/bin/bash] Run script
[1] "Using 10 cores"
[1] "Result of run 6 (with PID 464024 ): -25.4041249177298"
[1] "Result of run 7 (with PID 464025 ): 695.207612786061"
[1][1] "Result of run 8 (with PID 464026 ): 1953.82997266006"
 "Result of run 9 (with PID 464027 ): 65.457175604765"
...
[1] "Result of run 92 (with PID 464020 ): -600.477274309403"
[1] "Running time of script:"
   user  system elapsed
  6.582   0.513   0.943
[/bin/bash] Script finished
You can clearly see how the running time has gone down by using multiple cores. The parallelization is also evident from the fact that the PID changes (there should be 10 different PIDs in use) and the output from the simulation runs is out of order.
You can get a quick overview of the resources actually used by your job by running:
Code Block
seff <job_id>
It might look something like this:
Code Block
Job ID: <job_id>
Cluster: <cluster_name>
User/Group: <user_name>/<group_name>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 10
CPU Utilized: 00:00:09
CPU Efficiency: 30.00% of 00:00:30 core-walltime
Job Wall-clock time: 00:00:03
Memory Utilized: 1.46 MB
Memory Efficiency: 1.46% of 100.00 MB
A second parallel R job
Here, we will make use of R's doParallel
package to parallelize the simulation.
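Before we modify the tutorial files, here is a minimal, self-contained sketch of how foreach together with %dopar% distributes iterations over a set of worker processes. The numbers, the squaring function and the choice of 4 workers are for illustration only and not part of the tutorial files:
Code Block
# Minimal sketch of foreach/%dopar% (illustration only, not a tutorial file)
library(doParallel)
library(foreach)

cl <- makeCluster(4)      # start 4 worker processes
registerDoParallel(cl)    # register them as the parallel backend

# each iteration runs on one of the workers; .combine=c collects the
# results into a single vector
squares <- foreach(i = 1:10, .combine = c) %dopar% {
  i^2
}
print(squares)

stopCluster(cl)           # shut the workers down again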
Preparations
R script with doParallel
Once more, we will make a copy of the file from the serial example (test_R_serial.R
), but this time we will name it test_R_doparallel.R.
You can remove all print(paste(...))
statements in the new file since these will not work with the doParallel package.
As with the first parallel R script, we need to add loading the necessary R packages. The beginning of your R script should now look something like this:
Code Block
# Test script for serial R job
# Based on example from OSC
# https://www.osc.edu/resources/available_software/software_list/r#9

# loading libraries
library(doParallel, quiet = TRUE)
library(foreach)
Next, we will add getting and printing out the number of cores used by our R job. To mix it up, we will read out the Slurm environment variable this time. Also, we will tell doParallel how many cores it can use:
Code Block
...
  return(sum(vec))
}

# get the number of cores and print them out
cores <- Sys.getenv("SLURM_CPUS_PER_TASK")
print(paste("Using ", cores, " cores"))

# initiate compute environment for doparallel
cl <- makeCluster(as.numeric(cores)-1)
registerDoParallel(cl)
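By default, output produced by the worker processes that makeCluster starts is discarded, which is why the print statements were removed above. If you do want worker output to appear in the job log, the cluster can be created with outfile = ""; this is an optional tweak and not part of the tutorial script:
Code Block
# Optional (not part of the tutorial script): forward the workers' output
# to the job's standard output instead of discarding it
cl <- makeCluster(as.numeric(cores) - 1, outfile = "")
registerDoParallel(cl)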
This time, we will replace the for
-loop in the serial script with:
Code Block
# Go through the simulation runs in parallel
result <- foreach(i=1:100, .combine=c) %dopar% {
  mySim()
}
At the end of the script, we will remove our compute environment by adding the following lines after running_time
:
Code Block
# remove compute environment
invisible(stopCluster(cl))
Slurm batch file
If you worked through the first example, you can just create a copy of test_R_parallel.slurm
and name it test_R_doparallel.slurm
. Then, you only have to change the job name and the name of the R script. Your sbatch settings should look like this now:
ALICE
Code Block
#!/bin/bash
#SBATCH --job-name=test_R_doparallel   # <- new job name
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=testing
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10             # <- same as for first parallel example
#SBATCH --mem-per-cpu=10M              # <- same as for first parallel example
SHARK
Code Block
#!/bin/bash
#SBATCH --job-name=test_R_doparallel   # <- new job name
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --partition=short
#SBATCH --time=00:00:30
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10             # <- same as for first parallel example
#SBATCH --mem-per-cpu=10M              # <- same as for first parallel example
and replace Rscript test_R_parallel.R
with:
Code Block
Rscript test_R_doparallel.R
Job submission
Assuming you are in the directory $HOME/user_guide_tutorials/first_R_job
, you can submit this R job like this:
Code Block
[me@nodelogin02 first_R_job]$ sbatch test_R_doparallel.slurm
Submitted batch job <job_id>
Job output
In the directory where you launched your job, there should be a new file created by Slurm: test_R_doparallel_<jobid>.out
. It contains all the output from your job that would normally have been written to the command line. Check the file for any possible error messages. The content of the file should look something like this:
Code Block
#### Running R serial test
This is <me> and this job has the ID <jobid>
This job was submitted from /home/<me>/User_Guide/First_Job/First_R_Job
This job runs on nodelogin01
I am currently in /home/<me>/User_Guide/First_Job/First_R_Job
It is now Thu Apr 8 13:52:37 CEST 2021
[/bin/bash] Run script
[1] "SLURM: Using 10 cores"
   user  system elapsed
  0.115   0.029   1.220
[/bin/bash] Script finished
#### Finished R serial test
You can clearly see that the running time has gone down compared to the serial R script and is only slightly higher compared to using parallel
with mclapply
.
You can get a quick overview of the resources actually used by your job by running:
Code Block
seff <job_id>
It might look something like this:
Code Block
Job ID: <job_id>
Cluster: <cluster_name>
User/Group: <user_name>/<group_name>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 10
CPU Utilized: 00:00:02
CPU Efficiency: 2.22% of 00:01:30 core-walltime
Job Wall-clock time: 00:00:09
Memory Utilized: 1.46 MB
Memory Efficiency: 1.46% of 100.00 MB
Cancelling your job
In case you need to cancel the job that you have submitted, you can use the following command:
Code Block
scancel <job_id>
You can use it to cancel the job at any stage in the queue, i.e., pending or running.
Note that you might not be able to cancel the job in this example, because it has already finished.