About this tutorial

This tutorial will guide you through running a job using a GPU on ALICE or SHARK. Here, we will use one of the samples from NVIDIA’s CUDA samples repository to run on the GPU.

What you will learn?

Setting up the batch script for a job using GPUs
Loading the necessary modules
Submitting your job
Collecting information about your job

What this example will not cover?

Using multiple GPUs
Writing and compiling code for GPU

What you should know before starting?

Basic knowledge of how to use a Linux OS from the command line.
How to connect to ALICE or SHARK: How to login to ALICE or SHARK
How to move files to and from ALICE or SHARK: Transferring Data
How to set up a simple batch job as shown in: Your first bash job

CUDA on ALICE and SHARK

CUDA is available on both clusters in different versions. You can get an overview by running:

ALICE

module -r avail ^CUDA

Various modules on ALICE have been built with CUDA (e.g., pytorch, tensorflow). When you load these modules, the version of CUDA that was used to build the module will be loaded automatically.

SHARK

module avail /cuda/

In this tutorial, we will make use of CUDA 11.3.

Preparations

As usual, it is helpful to check the current cluster status and load. The GPU nodes are being used quite extensively at the moment, so it might take longer for your job to be scheduled. This makes it even more important to define the resources in your batch script as precisely as possible to help Slurm schedule your job.
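
One way to check the state of a partition from the login node is with Slurm's sinfo and squeue commands. The following is only a minimal sketch: the helper name show_partition_status is hypothetical, and the partition name gpu is the one used in the SHARK batch script in this tutorial, so adjust it to the cluster and partition you plan to use.

```shell
# Hypothetical helper (not part of Slurm): summarize one partition.
show_partition_status() {
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found; run this on a cluster login node"
        return 1
    fi
    # %P partition, %a availability, %D node count, %t node state,
    # %G generic resources such as GPUs
    sinfo -p "$1" -o "%P %a %D %t %G"
    # Pending jobs in the partition give a rough idea of the current load
    squeue -p "$1" -t PENDING
}

# "gpu" is an assumption; replace it with the partition you intend to use
show_partition_status gpu || true
```

Many idle nodes and few pending jobs suggest that your job will start quickly.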

If you have been following the previous tutorial, you should already have a directory called user_guide_tutorials in your $HOME. Let's create a directory for this job and change into it:

 mkdir -p $HOME/user_guide_tutorials/first_gpu_job
 cd $HOME/user_guide_tutorials/first_gpu_job

CUDA samples

The CUDA samples for each CUDA release are available on NVIDIA/cuda-samples. In principle, you can download the release of CUDA samples specific to the CUDA version that you will use, but here we will just clone the repository and get the latest version. This should work fine, but if you encounter any issues, please let us know.

git clone https://github.com/NVIDIA/cuda-samples.git

This should have created a directory called cuda-samples.

Next, we need to build one of the samples so that we can run it. Here, we will build cuda-samples/Samples/6_Performance/transpose/. There are two more samples in this directory, and you are welcome to try them all if you like.

As of the writing of this tutorial, the CUDA samples cannot be compiled with GCC 10 or higher.

ALICE

On ALICE, you can build the CUDA samples this way:

module purge
module load CUDA/11.3.1
module load GCC/9.3.0
cd cuda-samples/Samples/6_Performance/transpose/
make SMS="75 80"

SHARK

On SHARK, you can build the CUDA samples like this:

module purge
module load library/cuda/11.3/gcc.8.3.1

cd cuda-samples/Samples/6_Performance/transpose/
make
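
Before moving on, it can help to confirm that the build actually produced an executable. The sketch below assumes the directory layout used in this tutorial; the helper name check_binary is hypothetical.

```shell
# Hypothetical helper: report whether a build produced an executable file.
check_binary() {
    if [ -f "$1" ] && [ -x "$1" ]; then
        echo "OK: $1"
    else
        echo "MISSING: $1"
        return 1
    fi
}

# Path of the transpose sample as used throughout this tutorial
check_binary "$HOME/user_guide_tutorials/first_gpu_job/cuda-samples/Samples/6_Performance/transpose/transpose" || true
```

If the helper reports MISSING, check the output of make for compiler errors before submitting a job.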

The batch script

Next, we are going to set up our batch script. For this, we change back to the top-level directory for this job with cd $HOME/user_guide_tutorials/first_gpu_job.

You can copy the content below directly into a text file, which we name test_gpu.slurm here. The batch file is again a bit more elaborate than strictly necessary, but a few extra log messages are always helpful in the beginning.

ALICE

#!/bin/bash
#SBATCH --job-name=test_gpu
#SBATCH --mail-user="<your-email-address>"
#SBATCH --mail-type="ALL"
#SBATCH --time=00:01:00
#SBATCH --partition=testing
#SBATCH --output=%x_%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --gres=gpu:1

# making sure we start with a clean module environment
module purge

echo "## Starting GPU test on $HOSTNAME"

echo "## Loading module"
module load slurm
module load CUDA/11.3.1
module load GCC/9.3.0

TEST_DIR=$(pwd)
echo "## Current directory $TEST_DIR"

echo "## Visible CUDA device IDs: $CUDA_VISIBLE_DEVICES"

echo "## Checking status of CUDA device with nvidia-smi"
nvidia-smi

echo "## Running test"
$HOME/user_guide_tutorials/first_gpu_job/cuda-samples/Samples/6_Performance/transpose/transpose

echo "## Test finished. Goodbye"

SHARK

#!/bin/bash
#SBATCH --job-name=test_gpu
#SBATCH --mail-user="<your-email-address>"
#SBATCH --mail-type="ALL"
#SBATCH --time=00:01:00
#SBATCH --partition=gpu
#SBATCH --output=%x_%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --gres=gpu:1

# making sure we start with a clean module environment
module purge

echo "## Starting GPU test on $HOSTNAME"

echo "## Loading module"
module load library/cuda/11.3/gcc.8.3.1

TEST_DIR=$(pwd)
echo "## Current directory $TEST_DIR"

echo "## Visible CUDA device IDs: $CUDA_VISIBLE_DEVICES"

echo "## Checking status of CUDA device with nvidia-smi"
nvidia-smi

echo "## Running test"
$HOME/user_guide_tutorials/first_gpu_job/cuda-samples/Samples/6_Performance/transpose/transpose

echo "## Test finished. Goodbye"
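
Note that CUDA_VISIBLE_DEVICES holds a comma-separated list of the device IDs visible to the job, not a device count, so a value of 0 means that the device with ID 0 is available. If you would like the number of devices in your log, a sketch like the following could be added to the batch script. The function name count_cuda_devices is hypothetical.

```shell
# Hypothetical helper: count the entries of a comma-separated device list.
count_cuda_devices() {
    # An empty or unset list means no device was assigned
    if [ -z "${1:-}" ]; then
        echo 0
        return 0
    fi
    # Split the list on commas (with globbing disabled) and count the fields
    local old_ifs="$IFS"
    IFS=','
    set -f
    set -- $1
    set +f
    IFS="$old_ifs"
    echo "$#"
}

echo "## Number of CUDA devices: $(count_cuda_devices "${CUDA_VISIBLE_DEVICES:-}")"
```

For the single-GPU job in this tutorial, the count should be 1.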

Running your job

Now we have everything that we need to run this job. Please make sure that you are in the directory that contains the batch script. If not, change into it:

 cd $HOME/user_guide_tutorials/first_gpu_job

You are ready to submit your job like this:

 sbatch test_gpu.slurm

Immediately after you have submitted it, you should see something like this:

 [me@<login_node> first_gpu_job]$ sbatch test_gpu.slurm
 Submitted batch job <job_id>
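
If you want to reuse the job ID in later commands, you can capture it at submission time instead of copying it by hand. The following is a minimal sketch that relies on the format of the confirmation message shown above; the helper name extract_job_id is hypothetical.

```shell
# Hypothetical helper: pull the job ID out of sbatch's confirmation message.
extract_job_id() {
    # The job ID is the last space-separated word of the message
    echo "${1##* }"
}

# Only submit when Slurm and the batch script are actually available
if command -v sbatch >/dev/null 2>&1 && [ -f test_gpu.slurm ]; then
    job_id=$(extract_job_id "$(sbatch test_gpu.slurm)")
    echo "Submitted job $job_id"
    # Show the current state of the job in the queue;
    # this may report an error if the job has already finished
    squeue -j "$job_id" || true
fi
```

Keep in mind that running this snippet submits the job, so use it in place of the plain sbatch call rather than in addition to it.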

Job output

In the directory where you launched your job, there should now be a new file created by Slurm: test_gpu_<job_id>.out. It contains all the output from your job that would normally have been written to the command line. Check the file for any error messages. The content of the file should look something like this:

## Starting GPU test on <compute_node>
## Loading module
## Current directory /home/<user_name>/user_guide_tutorials/first_gpu_job
## Visible CUDA device IDs: 0
## Checking status of CUDA device with nvidia-smi
Thu Aug 11 14:15:44 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100D-16Q      On   | 00000000:02:00.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |   1104MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
## Running test
Transpose Starting...

GPU Device 0: "Volta" with compute capability 7.0

> Device 0: "GRID V100D-16Q"
> SM Capability 7.0 detected:
> [GRID V100D-16Q] has 80 MP(s) x 64 (Cores/MP) = 5120 (Cores)
> Compute performance scaling factor = 1.00

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose simple copy       , Throughput = 386.2985 GB/s, Time = 0.02022 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 378.6300 GB/s, Time = 0.02063 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive             , Throughput = 269.5899 GB/s, Time = 0.02898 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced         , Throughput = 400.9141 GB/s, Time = 0.01949 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized         , Throughput = 393.8769 GB/s, Time = 0.01983 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained    , Throughput = 389.4536 GB/s, Time = 0.02006 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained      , Throughput = 399.0269 GB/s, Time = 0.01958 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal          , Throughput = 379.7667 GB/s, Time = 0.02057 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed
## Test finished. Goodbye
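
Instead of reading the whole file, you can also search it for the "Test passed" line that the transpose sample prints on success. The sketch below assumes the output file naming from the batch script above; the helper name check_gpu_output is hypothetical.

```shell
# Hypothetical helper: report whether a job output file contains "Test passed".
check_gpu_output() {
    if grep -q "^Test passed" "$1"; then
        echo "PASSED: $1"
    else
        echo "FAILED: $1"
        return 1
    fi
}

# Check every output file of this tutorial's job in the current directory
for out_file in test_gpu_*.out; do
    if [ -e "$out_file" ]; then
        check_gpu_output "$out_file" || true
    fi
done
```

A FAILED result only tells you that the success line is missing, so you still need to open the file to find the actual error message.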

You can get a quick overview of the resources actually used by your job by running:

 seff <job_id>

It might look something like this:

Job ID: <job_id>
Cluster: <cluster_name>
User/Group: <user_name>/<group_name>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:01
CPU Efficiency: 0.00% of 00:00:00 core-walltime
Job Wall-clock time: 00:00:00
Memory Utilized: 1.35 MB
Memory Efficiency: 1.35% of 100.00 MB

As you can see, this job runs very quickly and does almost nothing on the CPU.

Cancelling your job

If you need to cancel the job that you have submitted, you can use the following command:

 scancel <job_id>

You can use it to cancel the job at any stage in the queue, i.e., pending or running.

Note that you might not be able to cancel the job in this example, because it has already finished.
