About this tutorial

This tutorial will guide you through running a job using a GPU on ALICE or SHARK. Here, we will use a sample program from NVIDIA’s CUDA samples repository and run it on the GPU.

What you will learn

  • Setting up the batch script for a job using GPUs

  • Loading the necessary modules

  • Submitting your job

  • Collecting information about your job

What this example will not cover

  • Using multiple GPUs

  • Writing and compiling code for GPUs

What you should know before starting


CUDA on ALICE and SHARK

CUDA is available on both clusters in different versions. You can get an overview by running:


ALICE

Code Block
module -r avail ^CUDA
Info

Various modules on ALICE have been built with CUDA (e.g., PyTorch, TensorFlow). When you load these modules, the version of CUDA that was used to build the module will be loaded automatically.
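For example, you can check this by loading one of these modules and then listing the loaded modules. The module name below is only a hypothetical example; pick an actual one from the output of module avail:

Code Block
module load PyTorch/1.12.1-foss-2022a-CUDA-11.7.0  # hypothetical module name
module list                                        # the matching CUDA module now appears in the list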


SHARK

Code Block
module avail /cuda/

In this tutorial, we will make use of CUDA 11.3.

Preparations

As usual, it is always helpful to check the current cluster status and load. The GPU nodes are being used quite extensively at the moment, so it might take longer for your job to be scheduled. This makes it even more important to define the resources in your batch script as precisely as possible to help Slurm schedule your job.
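A quick way to gauge how busy the GPU nodes are is to query Slurm directly. This is a minimal sketch; the partition name gpu is an assumption and may differ on your cluster:

Code Block
# Show the state and node count of the GPU partition (partition name is an assumption)
sinfo -p gpu -o "%P %t %D"
# Show the jobs currently pending or running in that partition
squeue -p gpu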

If you have been following the previous tutorial, you should already have a directory called user_guide_tutorials in your $HOME. Let's create a directory for this job and change into it:

Code Block
 mkdir -p $HOME/user_guide_tutorials/first_gpu_job
 cd $HOME/user_guide_tutorials/first_gpu_job

CUDA samples

The CUDA samples for each CUDA release are available on NVIDIA/cuda-samples. In principle, you can download the release of the CUDA samples that matches the CUDA version you will use, but here we will simply clone the repository and get the latest version. This should work fine, but if you encounter any issues, please let us know.

Code Block
git clone https://github.com/NVIDIA/cuda-samples.git

This should have created a directory called cuda-samples.
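If you would rather pin the samples to a release that matches your CUDA version, you can check out the corresponding tag after cloning. The tag name below is an assumption; list the available tags first:

Code Block
cd cuda-samples
git tag --list      # show the available release tags
git checkout v11.3  # hypothetical tag matching CUDA 11.3
cd ..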

Next, we need to build one of the sample programs so that we can run it. Here, we will build cuda-samples/Samples/6_Performance/transpose/. There are two more samples in this directory and you are welcome to try them all if you like.
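You can list the contents of this directory to see the other samples:

Code Block
ls cuda-samples/Samples/6_Performance/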

Note

As of the writing of this tutorial, the latest CUDA samples cannot be compiled with CUDA versions older than 11.


ALICE

On ALICE, you can build the CUDA samples this way:

Code Block
module purge
module load ALICE/default
module load CUDA/11.3.1
module load GCC/9.3.0
cd cuda-samples/Samples/6_Performance/transpose/
make SMS="75 80"

SHARK

On SHARK, you can build the CUDA samples like this:

Code Block
module purge
module load library/cuda/11.3/gcc.8.3.1

cd cuda-samples/Samples/6_Performance/transpose/
make SMS="75 80"
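In both cases, the SMS variable tells the Makefile which GPU compute capabilities to generate code for: 75 corresponds to Turing-class cards and 80 to Ampere-class cards. If your job may land on an older card, include its capability as well. For example, to also cover a V100 (compute capability 7.0):

Code Block
make SMS="70 75 80"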

The batch script

Next, we are going to set up our batch script. For this purpose, we change back to the top-level directory for this job with cd $HOME/user_guide_tutorials/first_gpu_job.

You can copy the content below directly into a text file, which we will name test_gpu.slurm here. The batch file is again a bit more elaborate than strictly necessary, but it is always helpful to have a few extra log messages in the beginning.


ALICE

Code Block
#!/bin/bash
#SBATCH --job-name=test_gpu
#SBATCH --mail-user="<your-email-address>"
#SBATCH --mail-type="ALL"
#SBATCH --time=00:01:00
#SBATCH --partition=testing
#SBATCH --output=%x_%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --gres=gpu:1

# making sure we start with a clean module environment
module purge

echo "## Starting GPU test on $HOSTNAME"

echo "## Loading module"
module load ALICE/default
module load slurm
module load CUDA/11.3.1
module load GCC/9.3.0

TEST_DIR=$(pwd)
echo "## Current directory $TEST_DIR"

echo "## CUDA device(s) visible to this job: $CUDA_VISIBLE_DEVICES"

echo "## Checking status of CUDA device with nvidia-smi"
nvidia-smi

echo "## Running test"
$HOME/user_guide_tutorials/first_gpu_job/cuda-samples/Samples/6_Performance/transpose/transpose

echo "## Test finished. Goodbye"

SHARK

Code Block
#!/bin/bash
#SBATCH --job-name=test_gpu
#SBATCH --mail-user="<your-email-address>"
#SBATCH --mail-type="ALL"
#SBATCH --time=00:01:00
#SBATCH --partition=gpu
#SBATCH --output=%x_%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --gres=gpu:1

# making sure we start with a clean module environment
module purge

echo "## Starting GPU test on $HOSTNAME"

echo "## Loading module"
module load library/cuda/11.3/gcc.8.3.1

TEST_DIR=$(pwd)
echo "## Current directory $TEST_DIR"

echo "## CUDA device(s) visible to this job: $CUDA_VISIBLE_DEVICES"

echo "## Checking status of CUDA device with nvidia-smi"
nvidia-smi

echo "## Running test"
$HOME/user_guide_tutorials/first_gpu_job/cuda-samples/Samples/6_Performance/transpose/transpose

echo "## Test finished. Goodbye"
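In both scripts, the #SBATCH --gres=gpu:1 line is what actually reserves a GPU for the job. On clusters that define GPU types in Slurm, you can also request a specific model; the type name below is a hypothetical example, so check the cluster documentation for the names that are actually configured:

Code Block
#SBATCH --gres=gpu:v100:1  # hypothetical GPU type; requests one V100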

Running your job

Now, we have everything that we need to run this job. Please make sure that you are in the same directory as the batch script. If not, change into it:

Code Block
 cd $HOME/user_guide_tutorials/first_gpu_job

You are ready to submit your job like this:

Code Block
 sbatch test_gpu.slurm

Immediately after you have submitted it, you should see something like this:

Code Block
 [me@<login_node> first_gpu_job]$ sbatch test_gpu.slurm
 Submitted batch job <job_id>
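While your job is pending or running, you can check its state in the queue with:

Code Block
 squeue -u $USER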

Job output

In the directory where you launched your job, there should be a new file created by Slurm: test_gpu_<jobid>.out. It contains all the output from your job that would normally have been written to the command line. Check the file for any possible error messages. The content of the file should look something like this:

Code Block
## Starting GPU test on <compute_node>
## Loading module
## Current directory /home/<user_name>/user_guide_tutorials/first_gpu_job
## CUDA device(s) visible to this job: 0
## Checking status of CUDA device with nvidia-smi
Thu Aug 11 14:15:44 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100D-16Q      On   | 00000000:02:00.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |   1104MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
## Running test
Transpose Starting...

GPU Device 0: "Volta" with compute capability 7.0

> Device 0: "GRID V100D-16Q"
> SM Capability 7.0 detected:
> [GRID V100D-16Q] has 80 MP(s) x 64 (Cores/MP) = 5120 (Cores)
> Compute performance scaling factor = 1.00

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose simple copy       , Throughput = 386.2985 GB/s, Time = 0.02022 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 378.6300 GB/s, Time = 0.02063 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive             , Throughput = 269.5899 GB/s, Time = 0.02898 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced         , Throughput = 400.9141 GB/s, Time = 0.01949 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized         , Throughput = 393.8769 GB/s, Time = 0.01983 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained    , Throughput = 389.4536 GB/s, Time = 0.02006 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained      , Throughput = 399.0269 GB/s, Time = 0.01958 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal          , Throughput = 379.7667 GB/s, Time = 0.02057 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed
## Test finished. Goodbye

You can get a quick overview of the resources actually used by your job by running:

Code Block
 seff <job_id>

It might look something like this:

Code Block
Job ID: <job_id>
Cluster: <cluster_name>
User/Group: <user_name>/<group_name>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:01
CPU Efficiency: 0.00% of 00:00:00 core-walltime
Job Wall-clock time: 00:00:00
Memory Utilized: 1.35 MB
Memory Efficiency: 1.35% of 100.00 MB

As you can see, this job runs very quickly and does almost nothing on the CPU.
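If you need more detail than seff provides, sacct can report individual accounting fields for your job (a minimal example; adjust the field list as needed):

Code Block
 sacct -j <job_id> --format=JobID,JobName,Elapsed,State,MaxRSS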

Cancelling your job

In case you need to cancel the job that you have submitted, you can use the following command:

Code Block
 scancel <job_id>

You can use it to cancel the job at any stage in the queue, i.e., pending or running.
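If you want to cancel all of your pending and running jobs at once, you can pass your user name instead of a job ID:

Code Block
 scancel -u $USER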

Note that you might not be able to cancel the job in this example, because it has already finished.