Your first GPU job

About this tutorial

This tutorial guides you through running a job that uses a GPU on ALICE or SHARK. We will use one of the samples from NVIDIA’s CUDA samples repository to run on the GPU.

What you will learn?

  • Setting up the batch script for a job using GPUs

  • Loading the necessary modules

  • Submitting your job

  • Collecting information about your job

What this example will not cover?

  • Using multiple GPUs

  • Writing and compiling code for GPU

What you should know before starting?

CUDA on ALICE and SHARK

CUDA is available on both clusters in different versions. You can get an overview by running:


ALICE

module -r avail ^CUDA

Various modules on ALICE have been built with CUDA (e.g., pytorch, tensorflow). When you load one of these modules, the version of CUDA that was used to build it is loaded automatically.


SHARK

module avail /cuda/

In this tutorial, we will make use of CUDA 11.3.

Preparations

As usual, it is helpful to check the current cluster status and load. The GPU nodes are being used quite extensively at the moment, so it might take longer for your job to be scheduled. This makes it even more important to define the resources in your batch script as precisely as possible to help Slurm schedule your job.

If you have been following the previous tutorial, you should already have a directory called user_guide_tutorials in your $HOME. Let's create a directory for this job and change into it:

mkdir -p $HOME/user_guide_tutorials/first_gpu_job
cd $HOME/user_guide_tutorials/first_gpu_job

CUDA samples

The CUDA samples for each CUDA release are available on NVIDIA/cuda-samples. In principle, you can download the release of CUDA samples specific to the CUDA version that you will use, but here we will just clone the repository and get the latest version. This should work fine, but if you encounter any issues, please let us know.

This should have created a directory called cuda-samples.

Next, we need to build one sample so that we can run it. Here, we will build cuda-samples/Samples/6_Performance/transpose/. There are two more samples in this directory, and you are welcome to try them all if you like.
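As a rough, cluster-independent sketch, the build could look like the following. It assumes that a CUDA module (and with it nvcc) has already been loaded, and that the checked-out release of the samples still ships a Makefile in each sample directory. More recent releases of the repository use a CMake-based build instead, in which case the repository's README describes the steps. The module name is deliberately left as a placeholder because it differs between ALICE and SHARK.

```shell
# Assumption: a CUDA module has been loaded beforehand, for example with
#   module load <cuda-module>    # placeholder, pick a name from `module avail`

# Change into the sample directory and build it with its Makefile
cd cuda-samples/Samples/6_Performance/transpose
make
```

If the build succeeds, an executable called transpose should appear in the sample directory.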

As of writing of this tutorial, the CUDA samples cannot be compiled with CUDA samples 10 or higher.


ALICE

On ALICE, you can build the CUDA samples this way:


SHARK

On SHARK, you can build the CUDA samples like this:


The batch script

Next, we are going to set up our batch script. For this, we change back to the top-level directory for this job with cd $HOME/user_guide_tutorials/first_gpu_job.

You can copy the content below directly into a text file, which we name here test_gpu.slurm. The batch file is again a bit more elaborate than perhaps necessary, but it is always helpful to have a few log messages more in the beginning.
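The sketch below writes a minimal batch script of this kind to test_gpu.slurm. It is not the cluster-specific script: the partition and the CUDA module are placeholders that you need to replace with values valid on ALICE or SHARK, and the time and memory limits are illustrative assumptions. The path to the executable assumes the sample was built in place with its Makefile.

```shell
# Write a minimal GPU batch script (sketch; replace the placeholders before submitting)
cat > test_gpu.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=test_gpu
#SBATCH --output=%x_%j.out
# Placeholder: replace with a GPU partition of your cluster
#SBATCH --partition=<gpu-partition>
# Illustrative limits (assumptions): one task, one GPU, short runtime
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00
#SBATCH --mem=1G

echo "#### Starting job $SLURM_JOB_ID on $(hostname)"

# Placeholder: uncomment and replace with the CUDA module of your cluster
# module load <cuda-module>

echo "#### GPUs visible to this job: $CUDA_VISIBLE_DEVICES"

# Run the sample built earlier (assumed location of the executable)
./cuda-samples/Samples/6_Performance/transpose/transpose

echo "#### Finished"
EOF
```

With the job name test_gpu, the output pattern %x_%j.out makes Slurm write the job output to test_gpu_<jobid>.out.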


ALICE


SHARK


Running your job

Now we have everything that we need to run this job. Please make sure that you are in the directory where the batch script is located. If not, change into $HOME/user_guide_tutorials/first_gpu_job.

You are ready to submit your job like this:
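A minimal sketch of the submission, assuming the batch script is saved as test_gpu.slurm in the current directory. sbatch is Slurm's standard submission command and squeue lists your jobs in the queue:

```shell
# Submit the batch script; Slurm replies with "Submitted batch job <jobid>"
sbatch test_gpu.slurm

# Check whether the job is still pending or already running
squeue -u "$USER"
```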

Immediately after you have submitted it, you should see something like this:

Job output

In the directory where you launched your job, there should be a new file created by Slurm: test_gpu_<jobid>.out. It contains all the output from your job that would normally have been written to the command line. Check the file for any error messages. The content of the file should look something like this:

You can get a quick overview of the resources actually used by your job by running:
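One way to do this is sketched below. The job ID is a hypothetical example and must be replaced with your own. seff is a Slurm contrib tool and is assumed to be installed on the cluster; sacct is part of Slurm itself and reports similar accounting data once the job has finished.

```shell
# Hypothetical job ID; replace it with the ID printed by sbatch
JOBID=12345

# Efficiency summary (assumes the seff utility is installed)
seff "$JOBID"

# Alternative using Slurm accounting directly
sacct -j "$JOBID" --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State
```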

It might look something like this:

As you can see, this job runs very quickly and does almost nothing on the CPU.

Cancelling your job

In case you need to cancel the job that you have submitted, you can use the following command:
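A minimal sketch using Slurm's scancel command; the job ID is a hypothetical example and must be replaced with the ID of your own job:

```shell
# Hypothetical job ID; replace it with the ID printed by sbatch
JOBID=12345

# Cancel the job
scancel "$JOBID"
```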

You can use it to cancel the job at any stage in the queue, i.e., pending or running.

Note that you might not be able to cancel the job in this example, because it has already finished.
