Your first GPU job
About this tutorial
This tutorial will guide you through running a job using a GPU on ALICE or SHARK. Here, we will use a script from NVIDIA’s CUDA samples repository to run on the GPU.
What you will learn
Setting up the batch script for a job using GPUs
Loading the necessary modules
Submitting your job
Collecting information about your job
What this example will not cover
Using multiple GPUs
Writing and compiling code for GPU
What you should know before starting
Basic knowledge of how to use a Linux OS from the command line.
How to connect to ALICE or SHARK: How to login to ALICE or SHARK
How to move files to and from ALICE or SHARK: Transferring Data
How to setup a simple batch job as shown in: Your first bash job
CUDA on ALICE and SHARK
CUDA is available on both clusters in different versions. You can get an overview by running:
ALICE
module -r avail ^CUDA
Various modules on ALICE have been built with CUDA (e.g., pytorch, tensorflow). When you load these modules, the version of CUDA that was used to build the module will be loaded automatically.
SHARK
module avail /cuda/
In this tutorial, we will make use of CUDA 11.3.
Preparations
As usual, it is always helpful to check the current cluster status and load. The GPU nodes are being used quite extensively at the moment, so it might take longer for your job to be scheduled. This makes it even more important to define the resources in your batch script as precisely as possible to help Slurm schedule your job.
If you have been following the previous tutorial, you should already have a directory called user_guide_tutorials
in your $HOME
. Let's create a directory for this job and change into it:
mkdir -p $HOME/user_guide_tutorials/first_gpu_job
cd $HOME/user_guide_tutorials/first_gpu_job
CUDA samples
The CUDA samples for each CUDA release are available on NVIDIA/cuda-samples. In principle, you can download the release of CUDA samples specific to the CUDA version that you will use, but here we will just clone the repository and get the latest version. This should work fine, but if you encounter any issues, please let us know.
This should have created a directory called cuda-samples
.
Next, we need to build one of the sample programs so that we can run it. Here, we will build cuda-samples/Samples/6_Performance/transpose/
. There are two more samples in this directory, and you are welcome to try them all if you like.
As of the writing of this tutorial, the CUDA samples cannot be compiled with CUDA 10 or older.
ALICE
On ALICE, you can build the CUDA samples this way:
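A minimal build sketch is shown below. The module name CUDA/11.3.1 is an assumption for illustration; check the output of `module -r avail ^CUDA` on ALICE and substitute the version you find there:

```shell
# Load a CUDA 11.3 module (the exact name may differ; check
# `module -r avail ^CUDA` and adjust accordingly).
module load CUDA/11.3.1

# Change into the sample's directory and compile it with make.
cd cuda-samples/Samples/6_Performance/transpose/
make
```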
SHARK
On SHARK, you can build the CUDA samples like this:
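The sketch below assumes a module name of the form cuda/11.3; the exact name on SHARK may differ, so check `module avail /cuda/` and adjust:

```shell
# Load the CUDA 11.3 module (the exact name may differ; check
# `module avail /cuda/` and adjust accordingly).
module load cuda/11.3

# Change into the sample's directory and compile it with make.
cd cuda-samples/Samples/6_Performance/transpose/
make
```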
The batch script
Next, we are going to set up our batch script. For this purpose, we change back to the top-level directory for this job with cd $HOME/user_guide_tutorials/first_gpu_job
.
You can copy the content below directly into a text file, which we name test_gpu.slurm
. The batch file is again a bit more elaborate than strictly necessary, but it is always helpful to have a few extra log messages at the beginning.
ALICE
SHARK
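A minimal sketch of such a batch script follows. The partition name (gpu-short), the resource numbers, and the CUDA module name are assumptions for illustration; check your cluster's documentation for the partitions, limits, and module names that apply to you:

```shell
#!/bin/bash
#SBATCH --job-name=test_gpu
#SBATCH --output=test_gpu_%j.out   # %j is replaced by the job ID
#SBATCH --partition=gpu-short      # assumed partition name; adjust for your cluster
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:05:00
#SBATCH --gres=gpu:1               # request a single GPU

# A few log messages to make the job output easier to interpret.
echo "## Starting GPU test on $HOSTNAME"
echo "## Job ID: $SLURM_JOB_ID"

# Load the CUDA module used to build the sample (name is an
# assumption; check `module avail` on your cluster).
module load CUDA/11.3.1

# Run the compiled transpose sample.
echo "## Running the transpose sample"
./cuda-samples/Samples/6_Performance/transpose/transpose

echo "## Finished"
```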
Running your job
Now, we have everything that we need to run this job. Please make sure that you are in the same directory where the batch script is. If not, change into it first.
You are ready to submit your job like this:
Immediately after you have submitted it, you should see something like this:
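Submitting is done with sbatch; Slurm confirms the submission by printing the ID it assigned to your job:

```shell
# Submit the batch script to Slurm.
sbatch test_gpu.slurm
# Slurm responds with a line of the form:
#   Submitted batch job <jobid>
```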
Job output
In the directory where you launched your job, there should be a new file created by Slurm: test_gpu_<jobid>.out
. It contains all the output from your job which would have been normally written to the command line. Check the file for any possible error messages. The content of the file should look something like this:
You can get a quick overview of the resources actually used by your job by running:
It might look something like this:
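One common way to get such an overview is Slurm's seff utility, assuming it is installed on your cluster; it summarises CPU efficiency, memory efficiency, and wall-clock time for a finished job:

```shell
# Summarise the resource usage of a finished job.
# Replace <jobid> with the ID reported when you submitted the job.
seff <jobid>
```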
As you can see, this job runs very quickly and does almost nothing on the CPU.
Cancelling your job
In case you need to cancel the job that you have submitted, you can use the following command:
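Cancelling is done with scancel and the job ID:

```shell
# Cancel the job with the given ID.
# Replace <jobid> with the ID reported when you submitted the job.
scancel <jobid>
```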
You can use it to cancel the job at any stage in the queue, i.e., pending or running.
Note that you might not be able to cancel the job in this example, because it has already finished.