About this tutorial
This tutorial will guide you through running a job using a GPU on ALICE or SHARK. Here, we will use a script from NVIDIA's CUDA samples repository to run on the GPU.
What you will learn
Setting up the batch script for a job using GPUs
Loading the necessary modules
Submitting your job
Collecting information about your job
What this example will not cover
Using multiple GPUs
Writing and compiling code for GPUs
What you should know before starting
Basic knowledge of how to use a Linux OS from the command line.
How to connect to ALICE or SHARK: How to login to ALICE or SHARK
How to move files to and from ALICE or SHARK: Transferring Data
How to set up a simple batch job as shown in: Your first bash job
CUDA on ALICE and SHARK
CUDA is available on both clusters in different versions. You can get an overview by running:
ALICE
module -r avail ^CUDA
Various modules on ALICE have been built with CUDA (e.g., pytorch, tensorflow). When you load these modules, the version of CUDA that was used to build the module will be loaded automatically.
SHARK
module avail /cuda/
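Before loading a module, you can also inspect what it will add to your environment with module show; for example, on ALICE:

module show CUDA/12.3.2

On SHARK, the same works with the module names listed there, e.g. module show library/cuda/11.3/gcc.8.3.1.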
In this tutorial, we will make use of CUDA 12.3.2 on ALICE and CUDA 11.3 on SHARK.
Preparations
As usual, it is always helpful to check the current cluster status and load. The GPU nodes are being used quite extensively at the moment, so it might take longer for your job to be scheduled. This makes it even more important to define the resources in your batch script as precisely as possible to help Slurm schedule your job.
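One way to check how busy the GPU nodes are is Slurm's sinfo command. A minimal sketch, using the gpu partition from the SHARK example below; substitute whichever partition you intend to use on your cluster:

sinfo -p gpu -o "%P %n %G %t"

This lists each node in the partition together with its generic resources (the GPUs) and its current state, e.g. idle, mix, or alloc.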
If you have been following the previous tutorial, you should already have a directory called user_guide_tutorials in your $HOME. Let's create a directory for this job and change into it:
mkdir -p $HOME/user_guide_tutorials/first_gpu_job
cd $HOME/user_guide_tutorials/first_gpu_job
CUDA samples
The CUDA samples for each CUDA release are available on NVIDIA/cuda-samples. In principle, you can download the release of CUDA samples specific to the CUDA version that you will use, but here we will just clone the repository and get the latest version. This should work fine, but if you encounter any issues, please let us know.
git clone https://github.com/NVIDIA/cuda-samples.git
This should have created a directory called cuda-samples.
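If you do want the samples matching a specific CUDA release, the repository tags its releases by version. A hedged example, assuming a tag exists for your CUDA version:

git -C cuda-samples tag          # list the available release tags
git -C cuda-samples checkout v11.3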
Next, we need to build one sample script so that we can run it. Here, we will build cuda-samples/Samples/6_Performance/transpose/. There are two more samples in this directory and you are welcome to try them all if you like.
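To see what else is available there, you can simply list the directory:

ls cuda-samples/Samples/6_Performance/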
As of the writing of this tutorial, the CUDA samples cannot be compiled with CUDA 10 or older.
ALICE
On ALICE, you can build the CUDA samples this way:
module purge
module load ALICE/default
module load CUDA/12.3.2
module load GCC/11.3.0
cd cuda-samples/Samples/6_Performance/transpose/
make SMS="75 80"
SHARK
On SHARK, you can build the CUDA samples like this:
module purge
module load library/cuda/11.3/gcc.8.3.1
cd cuda-samples/Samples/6_Performance/transpose/
make SMS="75 80"
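The SMS variable tells the build which GPU compute capabilities to generate code for (here 7.5 and 8.0). If you are unsure which compute capability the GPUs you will run on have, recent NVIDIA drivers can report it; a sketch, assuming your driver is new enough to support the compute_cap query field:

nvidia-smi --query-gpu=name,compute_cap --format=csv

Run this on a GPU node, for example from within a job; login nodes typically have no GPU.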
The batch script
Next, we are going to set up our batch script. For this purpose, we change back to the top-level directory for this job with cd $HOME/user_guide_tutorials/first_gpu_job.
You can copy the content below directly into a text file, which we name here test_gpu.slurm. The batch file is again a bit more elaborate than perhaps necessary, but it is always helpful to have a few extra log messages in the beginning.
ALICE
#!/bin/bash
#SBATCH --job-name=test_gpu
#SBATCH --mail-user="<your-email-address>"
#SBATCH --mail-type="ALL"
#SBATCH --time=00:01:00
#SBATCH --partition=testing
#SBATCH --output=%x_%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --gres=gpu:1

# making sure we start with a clean module environment
module purge

echo "## Starting GPU test on $HOSTNAME"

echo "## Loading modules"
module load ALICE/default
module load slurm
module load CUDA/12.3.2
module load GCC/11.3.0

TEST_DIR=$(pwd)
echo "## Current directory: $TEST_DIR"

echo "## CUDA devices visible to this job: $CUDA_VISIBLE_DEVICES"

echo "## Checking status of CUDA device with nvidia-smi"
nvidia-smi

echo "## Running test"
$HOME/user_guide_tutorials/first_gpu_job/cuda-samples/Samples/6_Performance/transpose/transpose

echo "## Test finished. Goodbye"
SHARK
#!/bin/bash
#SBATCH --job-name=test_gpu
#SBATCH --mail-user="<your-email-address>"
#SBATCH --mail-type="ALL"
#SBATCH --time=00:01:00
#SBATCH --partition=gpu
#SBATCH --output=%x_%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --gres=gpu:1

# making sure we start with a clean module environment
module purge

echo "## Starting GPU test on $HOSTNAME"

echo "## Loading modules"
module load library/cuda/11.3/gcc.8.3.1

TEST_DIR=$(pwd)
echo "## Current directory: $TEST_DIR"

echo "## CUDA devices visible to this job: $CUDA_VISIBLE_DEVICES"

echo "## Checking status of CUDA device with nvidia-smi"
nvidia-smi

echo "## Running test"
$HOME/user_guide_tutorials/first_gpu_job/cuda-samples/Samples/6_Performance/transpose/transpose

echo "## Test finished. Goodbye"
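If you would rather try things out interactively before submitting a batch job, srun can give you a shell on a GPU node. A minimal sketch; the partition name is an assumption and differs per cluster (e.g. testing on ALICE or gpu on SHARK):

srun --partition=gpu --ntasks=1 --cpus-per-task=1 --mem=1G --time=00:15:00 --gres=gpu:1 --pty bash

Remember to exit the shell when you are done so that the GPU is released for other users.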
Running your job
Now, we have everything that we need to run this job. Please make sure that you are in the same directory as the batch script. If not, change into it:
cd $HOME/user_guide_tutorials/first_gpu_job
You are ready to submit your job like this:
sbatch test_gpu.slurm
Immediately after you have submitted it, you should see something like this:
[me@<login_node> first_gpu_job]$ sbatch test_gpu.slurm
Submitted batch job <job_id>
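While the job is pending or running, you can keep an eye on it with squeue; for example:

squeue -u $USER

The ST column shows the job state, e.g. PD for pending and R for running.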
Job output
In the directory where you launched your job, there should be a new file created by Slurm: test_gpu_<jobid>.out. It contains all the output from your job that would normally have been written to the command line. Check the file for any possible error messages. The content of the file should look something like this:
## Starting GPU test on <compute_node>
## Loading modules
## Current directory: /home/<user_name>/user_guide_tutorials/first_gpu_job
## CUDA devices visible to this job: 0
## Checking status of CUDA device with nvidia-smi
Thu Aug 11 14:15:44 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100D-16Q      On   | 00000000:02:00.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |   1104MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
## Running test
Transpose Starting...

GPU Device 0: "Volta" with compute capability 7.0

> Device 0: "GRID V100D-16Q"
> SM Capability 7.0 detected:
> [GRID V100D-16Q] has 80 MP(s) x 64 (Cores/MP) = 5120 (Cores)
> Compute performance scaling factor = 1.00

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose simple copy       , Throughput = 386.2985 GB/s, Time = 0.02022 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 378.6300 GB/s, Time = 0.02063 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive             , Throughput = 269.5899 GB/s, Time = 0.02898 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced         , Throughput = 400.9141 GB/s, Time = 0.01949 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized         , Throughput = 393.8769 GB/s, Time = 0.01983 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained    , Throughput = 389.4536 GB/s, Time = 0.02006 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained      , Throughput = 399.0269 GB/s, Time = 0.01958 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal          , Throughput = 379.7667 GB/s, Time = 0.02057 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed

## Test finished. Goodbye
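For longer output files, a simple pattern search is often enough to spot problems quickly; for example:

grep -iE "error|fail" test_gpu_<jobid>.out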
You can get a quick overview of the resources actually used by your job by running:
seff <job_id>
It might look something like this:
Job ID: <job_id>
Cluster: <cluster_name>
User/Group: <user_name>/<group_name>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:01
CPU Efficiency: 0.00% of 00:00:00 core-walltime
Job Wall-clock time: 00:00:00
Memory Utilized: 1.35 MB
Memory Efficiency: 1.35% of 100.00 MB
As you can see, this job runs very quickly and does almost nothing on the CPU.
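If you want more detail than seff provides, Slurm's accounting tool sacct can report statistics per job step; for example:

sacct -j <job_id> --format=JobID,JobName,Elapsed,State,MaxRSS

Here MaxRSS shows the peak memory actually used, which is helpful for tuning the --mem request of future jobs.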
Cancelling your job
In case you need to cancel the job that you have submitted, you can use the following command:
scancel <job_id>
You can use it to cancel the job at any stage in the queue, i.e., pending or running.
Note that you might not be able to cancel the job in this example, because it has already finished.
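If you have several jobs in the queue and want to cancel all of them at once, scancel also accepts a user filter:

scancel -u $USER

Be careful: this cancels all of your pending and running jobs on the cluster.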