OpenMPI on ALICE and SHARK
There are various versions of OpenMPI available on ALICE and SHARK. You can get an overview by running the following command:
ALICE
module -r avail ^OpenMPI
Various modules on ALICE have been built with OpenMPI. When you load one of these modules, the version of OpenMPI that was used to build it will be loaded automatically.
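For example, when you load a module that was built with OpenMPI, the matching OpenMPI module appears among your loaded modules. The module name below is only a placeholder; use the module avail command above to find modules that are actually installed:
# load a module that was built with OpenMPI (placeholder name, pick one from "module avail")
module load <module-built-with-OpenMPI>
# the matching OpenMPI module should now show up in the list of loaded modules
module list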
SHARK
module avail /mpi/
For this tutorial, we will be using OpenMPI 4.1.1.
Preparations
Log in to ALICE or SHARK if you have not done so yet.
Before you set up or submit your job, it is always best to have a look at the current job load on the cluster and at the partitions that are available to you.
Also, it helps to run some short, resource-friendly tests to check that your setup is working and that your batch file is correct. The “testing” partition on ALICE or the “short” partition on SHARK can be used for this purpose. The examples in this tutorial are safe to run on those partitions.
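A quick way to check the partitions and the current load is with the standard Slurm commands sinfo and squeue, for example:
# list the partitions that are available to you and their current state
sinfo
# show the jobs that are currently queued or running (add -u <username> to see only your own jobs)
squeue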
Here, we will assume that you have already created a directory called user_guide_tutorials
in your $HOME
from the previous tutorials. For this job, let's create a sub-directory and change into it:
mkdir -p $HOME/user_guide_tutorials/first_MPI_job
cd $HOME/user_guide_tutorials/first_MPI_job
We will first create the MPI program and then write the slurm batch file.
MPI program
This is a very basic Hello-World type of MPI program. It will print out information about the rank and node that it is running on. We will name this file helloworld_mpi.c
#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
    int rank, size, processor_name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &processor_name_len);

    printf("Hello World from rank %03d out of %03d running on %s!\n", rank, size, name);

    if (rank == 0)
        printf("MPI World size = %d processes\n", size);

    MPI_Finalize();
    return 0;
}
Next, we load a version of OpenMPI and then we use mpicc
to compile our program:
ALICE
module load OpenMPI/4.1.1-GCC-10.3.0
mpicc helloworld_mpi.c -o helloworld_mpi
SHARK
module load library/mpi/openmpi/4.1.1
mpicc helloworld_mpi.c -o helloworld_mpi
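If you want to verify which compiler and OpenMPI installation the mpicc wrapper uses, the following commands should work on both clusters once the module is loaded:
# show the version of the underlying compiler
mpicc --version
# show the full compile command that mpicc wraps, including the OpenMPI include and library paths
mpicc -show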
Slurm batch file
The Slurm batch script helloworld_mpi.slurm
for our MPI example program looks like this:
ALICE
#!/bin/bash
#SBATCH --job-name=helloworld_mpi
#SBATCH --mail-user="<your-email-address>"
#SBATCH --mail-type="ALL"
#SBATCH --time=00:00:10
#SBATCH --partition=testing
#SBATCH --output=%x_%j.out
#SBATCH --nodes=2
#SBATCH --ntasks=10
#SBATCH --mem-per-cpu=10M
#SBATCH --constraint=ib

# making sure we start with a clean module environment
module purge

echo "## Loading module"
module load slurm
module load OpenMPI/4.1.1-GCC-10.3.0

TEST_DIR=$(pwd)
echo "## Current directory $TEST_DIR"

echo "## Running test"
srun ./helloworld_mpi
# alternative command, but not needed because srun takes care of it
# mpirun -np $SLURM_NTASKS ./helloworld_mpi

echo "## Test finished. Goodbye"
SHARK
#!/bin/bash
#SBATCH --job-name=helloworld_mpi
#SBATCH --mail-user="<your-email-address>"
#SBATCH --mail-type="ALL"
#SBATCH --time=00:00:10
#SBATCH --partition=short
#SBATCH --output=%x_%j.out
#SBATCH --nodes=2
#SBATCH --ntasks=10
#SBATCH --mem-per-cpu=10M

# making sure we start with a clean module environment
module purge

echo "## Loading module"
module load slurm
module load library/mpi/openmpi/4.1.1

TEST_DIR=$(pwd)
echo "## Current directory $TEST_DIR"

echo "## Running test"
srun ./helloworld_mpi
# alternative command, but not needed because srun takes care of it
# mpirun -np $SLURM_NTASKS ./helloworld_mpi

echo "## Test finished. Goodbye"
where you should replace <your-email-address>
with your e-mail address. Here, we have requested two nodes to run 10 tasks. Slurm will distribute the tasks automatically over the two nodes.
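If you prefer to control the distribution yourself rather than leaving it to Slurm, you could, for example, fix the number of tasks per node. A minimal sketch of the relevant #SBATCH lines (not needed for this tutorial):
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=5   # 2 nodes x 5 tasks per node = 10 tasks in total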
The output from our MPI program will go into the Slurm output file. This is fine for the example here, but in general it is not the best approach, because all processes running in parallel write to the same file.
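If you do want one output file per MPI task, one option is to let srun write each task's output to a separate file. srun accepts filename patterns such as %x (job name), %j (job ID) and %t (task number), so a sketch of an alternative srun line would be:
# write the output of each task to its own file, e.g. helloworld_mpi_<jobid>_0.out, helloworld_mpi_<jobid>_1.out, ...
srun --output=%x_%j_%t.out ./helloworld_mpi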
The resources set in the batch script have been determined after running the job at least once with more conservative estimates. With this configuration, it is fine to run the job on the testing partition on ALICE or the short partition on SHARK.
Job submission
Let us submit this MPI job to Slurm:
sbatch helloworld_mpi.slurm
Immediately after you have submitted this job, you should see something like this:
[me@<login_node> first_MPI_job]$ sbatch helloworld_mpi.slurm
Submitted batch job <job_id>
Job output
In the directory where you launched your job, there should be a new file created by Slurm: helloworld_mpi_<jobid>.out
. It contains all the output from your job that would normally have been written to the command line. Check the file for any possible error messages. The content of the file should look something like this:
## Loading module
## Current directory <your_path>
## Running test
Hello World from rank 000 out of 010 running on nodelogin01!
MPI World size = 10 processes
Hello World from rank 001 out of 010 running on nodelogin01!
Hello World from rank 002 out of 010 running on nodelogin01!
Hello World from rank 004 out of 010 running on nodelogin01!
Hello World from rank 003 out of 010 running on nodelogin01!
Hello World from rank 006 out of 010 running on nodelogin02!
Hello World from rank 007 out of 010 running on nodelogin02!
Hello World from rank 009 out of 010 running on nodelogin02!
Hello World from rank 008 out of 010 running on nodelogin02!
Hello World from rank 005 out of 010 running on nodelogin02!
## Test finished. Goodbye
Because this is a parallel job, the output from each process may appear out of order.
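If you want to read the per-rank lines in order, you can simply sort them afterwards; for example, using the output file name from above:
# print the Hello World lines sorted by rank (the zero-padded rank numbers make a plain sort sufficient)
grep "Hello World" helloworld_mpi_<jobid>.out | sort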
You can get a quick overview of the resources actually used by your job by running:
seff <job_id>
It might look something like this:
Job ID: <job_id>
Cluster: <cluster_name>
User/Group: <user_name>/<group_name>
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 5
CPU Utilized: 00:00:01
CPU Efficiency: 5.00% of 00:00:20 core-walltime
Job Wall-clock time: 00:00:02
Memory Utilized: 1.35 MB
Memory Efficiency: 0.13% of 1000.00 MB
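If you need more detail than seff provides, sacct can report accounting information for the job and its individual steps, for example:
# show elapsed time, number of tasks, memory high-water mark and final state for the job and its steps
sacct -j <job_id> --format=JobID,JobName,Elapsed,NTasks,MaxRSS,State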
Cancelling your job
If you need to cancel your job, you can do so with:
scancel <job_id>
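If you want to cancel all of your own jobs at once, you can pass your username instead of a single job ID:
# cancel every job that belongs to you
scancel -u <username>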