Warning

This tutorial is being revised

About this tutorial

This tutorial will guide you through running a job on one of ALICE's GPU nodes. It uses TensorFlow and Keras to train a model on an example dataset using one GPU. You can find the full tutorial here: Link to TensorFlow Tutorial (MIT License)

What you will learn

  • Setting up the batch script for a job using GPUs

  • Setting up a basic TensorFlow+Keras job

  • Moving data to and from the node's local scratch

  • Loading the necessary modules

  • Submitting your job

  • Monitoring your job

  • Collecting information about your job

What this example will not cover

  • Introducing TensorFlow, Keras or machine learning in general

  • Installing your own or special Python modules

  • Using multiple GPUs

  • Compiling code for GPU

What you should know before starting

  • Basic knowledge of Python is recommended. This tutorial is not intended as a tutorial on Python. If you are completely new to Python, we recommend that you go through a generic Python tutorial first. There are many great ones out there.

  • Basic understanding of machine learning or TensorFlow is not required, but helpful. This is a kind of HelloWorld programme for TensorFlow. Therefore, you do not need prior knowledge of TensorFlow.

  • Basic knowledge of how to use a Linux OS from the command line.

  • How to connect to ALICE or SHARK: /wiki/spaces/ALICEWIKI/pages/11108355

  • How to move files to and from ALICE or SHARK: /wiki/spaces/ALICEWIKI/pages/17072131

  • How to set up a simple batch job as shown in: Your first bash job


Preparations

As usual, it is always helpful to check the current cluster status and load. The GPU nodes are being used quite extensively at the moment, so it might take longer for your job to be scheduled. This makes it even more important to define the resources in your batch script as precisely as possible to help Slurm schedule your job.
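
If you want to check the load on the GPU partition yourself before submitting, the standard Slurm commands give a quick overview; a minimal sketch (the partition name gpu-short is the one used later in this tutorial):

Code Block
# overview of partitions and node states
sinfo
# jobs currently queued or running in the gpu-short partition
squeue -p gpu-short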

If you have been following the previous tutorial, you should already have a directory called user_guide_tutorials in your $HOME. Let's create a directory for this job and change into it:

Code Block
mkdir -p $HOME/user_guide_tutorials/first_gpu_job
cd $HOME/user_guide_tutorials/first_gpu_job

The Python Script

Based on the TensorFlow tutorial, we will use the following Python3 script to train a model using example data available in TensorFlow and apply it once. The script also runs some basic tests to confirm that it will work with the GPU.

Copy the Python code below into a file, which we assume here is named test_gpu_tensorflow.py and stored in $HOME/user_guide_tutorials/first_gpu_job.

Code Block
"""
This is a HelloWorld-type of script to run on the GPU nodes. 
It uses Tensorflow with Keras and is based on this TensorFlow tutorial:
https://www.tensorflow.org/tutorials/keras/classification
"""

# Import TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras

# Some helper libraries
import os
import numpy as np
import matplotlib.pyplot as plt

# Some helper functions
# +++++++++++++++++++++
def plot_image(i, predictions_array, true_label, img):
  true_label, img = true_label[i], img[i]
  plt.grid(False)
  plt.xticks([])
  plt.yticks([])

  plt.imshow(img, cmap=plt.cm.binary)

  predicted_label = np.argmax(predictions_array)
  if predicted_label == true_label:
    color = 'blue'
  else:
    color = 'red'

  plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],
                                100*np.max(predictions_array),
                                class_names[true_label]),
                                color=color)

def plot_value_array(i, predictions_array, true_label):
  true_label = true_label[i]
  plt.grid(False)
  plt.xticks(range(10))
  plt.yticks([])
  thisplot = plt.bar(range(10), predictions_array, color="#777777")
  plt.ylim([0, 1])
  predicted_label = np.argmax(predictions_array)

  thisplot[predicted_label].set_color('red')
  thisplot[true_label].set_color('blue')

# Run some tests
# ++++++++++++++

# get the version of TensorFlow
print("TensorFlow version: {}".format(tf.__version__))

# Check that TensorFlow was built with CUDA so that it can use the GPUs
print("Device name: {}".format(tf.test.gpu_device_name()))
print("Built with GPU support? {}".format(tf.test.is_built_with_gpu_support()))
print("Built with CUDA? {}".format(tf.test.is_built_with_cuda()))

# Get the data
# ++++++++++++

# Get an example dataset
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Class names for later use
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
               
# Get some information about the data
print("Size of training dataset: {}".format(train_images.shape))
print("Number of labels training dataset: {}".format(len(train_labels)))
print("Size of test dataset: {}".format(test_images.shape))
print("Number of labels test dataset: {}".format(len(test_labels)))

# Convert the data from integer to float
train_images = train_images / 255.0
test_images = test_images / 255.0

# plot the first 25 images of the training Set
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[train_labels[i]])
plt.savefig("./plots/trainingset_example.png",bbox_inches='tight',overwrite=True)
plt.close('all')

# Set and train the model
# +++++++++++++++++++++++


# Set up the layers
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10)
])

# Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=10)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

print('\nTest accuracy: {}'.format(test_acc))

# Use the model
# +++++++++++++

# grab an image
img_index=10
img = test_images[img_index]
print(img.shape)

# add image to a batch
img = (np.expand_dims(img,0))
print(img.shape)

# to make predictions, add a new layer
probability_model = tf.keras.Sequential([model, 
                                         tf.keras.layers.Softmax()])

# predict the label for the image
predictions_img = probability_model.predict(img)

print("Predictions for image {}:".format(img_index))
print(predictions_img[0])
print("Label with highest confidence: {}".format(np.argmax(predictions_img[0])))

# plot it
plt.figure(figsize=(6, 3))
plt.subplot(1,2, 1)
plot_image(img_index, predictions_img[0], test_labels, test_images)
plt.subplot(1,2,2)
plot_value_array(img_index, predictions_img[0], test_labels)
plt.savefig("./plots/trainingset_prediction_img{}.png".format(img_index),bbox_inches='tight',overwrite=True)

The batch script

The batch script is again a bit more elaborate than strictly necessary, but a few extra log messages at the beginning are always helpful.

You can copy the different elements below directly into a text file. We assume that you name it test_gpu_tensorflow.slurm and that you place it in the same location as the Python file.

Slurm settings

The Slurm settings are very similar to those in the previous examples:

Code Block
#!/bin/bash
#SBATCH --job-name=test_gpu_tensorflow
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --mem=5G
#SBATCH --time=00:02:00
#SBATCH --partition=gpu-short
#SBATCH --ntasks=1
#SBATCH --gpus=1

Note that we changed the partition to one of the GPU partitions. Since the job won't take long, the gpu-short partition is sufficient. Another important change is that we added #SBATCH --gpus=1. This tells Slurm to give us one of the four GPUs on the node. It is vital that you specify the number of GPUs that you need so that the remaining ones can be used by other users (if resources permit).
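
Depending on the Slurm configuration, the same request can also be written with the generic-resources syntax; a minimal sketch (the resource name gpu is an assumption about the cluster's GRES setup, and you should use only one of the two forms):

Code Block
#SBATCH --gres=gpu:1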

Job Commands

First, let's load the modules that we need. We assume here that you do not have any other modules loaded except for the default ones after logging in. While it is not strictly necessary, we explicitly define the versions of the modules that we want to use. This improves the reproducibility of our job in case the default modules change.

Code Block
# load modules (assuming you start from the default environment)
# we explicitly call the modules to improve reproducibility
# in case the default settings change
module load Python/3.7.4-GCCcore-8.3.0
module load SciPy-bundle/2019.10-fosscuda-2019b-Python-3.7.4
module load matplotlib/3.1.1-foss-2019b-Python-3.7.4
module load TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4
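
If you are unsure which module versions are available before pinning them, you can ask the module system; a minimal sketch (the output depends on the software stack installed at the time):

Code Block
# list all available TensorFlow modules
module avail TensorFlow
# show the modules currently loaded in your environment
module list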

Let's define a few variables and get some basic information. Note that we print out which of the four GPUs is being used.

Code Block
 
echo "[$SHELL] #### Starting GPU TensorFlow test"
echo "[$SHELL] This is $SLURM_JOB_USER and my first job has the ID $SLURM_JOB_ID"
# get the current working directory
export CWD=$(pwd)
echo "[$SHELL] CWD: "$CWD

# Which GPU has been assigned
echo "[$SHELL] Using GPU: "$CUDA_VISIBLE_DEVICES

# Set the path to the python file
export PATH_TO_PYFILE=$CWD
echo "[$SHELL] Path of python file: "$PATH_TO_PYFILE

# Set name of the python file
export PYFILE=$CWD/test_gpu_tensorflow.py

Since the Python script will write out files, we want to use the scratch space local to the node for our job instead of the shared scratch. In this example, it is not really necessary because we only write two small files. However, if you want to process large amounts of data and have to perform a lot of I/O, it is highly recommended to use the node's local scratch for this. It will generally be faster than the network storage, which is shared by all users.

Code Block
# Create a directory of local scratch on the node
echo "[$SHELL] Node scratch: "$SCRATCH
export RUNDIR=$SCRATCH/test_tf
mkdir $RUNDIR
echo "[$SHELL] Run directory"$RUNDIR

# Create directory for plots
export PLOTDIR=$RUNDIR/plots
mkdir $PLOTDIR

# copy script to local scratch directory and change into it
cp $PYFILE $RUNDIR/
cd $RUNDIR

Next, we add the command that runs the Python script to the batch file:

Code Block
# Run the file
echo "[$SHELL] Run script"
python3 test_gpu_tensorflow.py
echo "[$SHELL] Script finished"

Last but not least, we have to copy the files written to the node's local scratch back to our shared scratch space. This is very important because all files on the node's local scratch will be deleted after our job has finished. Make sure that you only copy back the data products that you really need.

Code Block
# Copy the plots directory back to CWD
echo "[$SHELL] Copy files back to cwd"
cp -r $PLOTDIR $CWD/

echo "[$SHELL] #### Finished GPU TensorFLow test. Have a nice day"

Running your job

Now that we have the Python script and batch file, we are ready to run our job.

Please make sure that you are in the directory where the scripts are. If not, change into it:

Code Block
 cd $HOME/user_guide_tutorials/first_gpu_job

You are ready to submit your job like this:

Code Block
 sbatch test_gpu_tensorflow.slurm

Immediately after you have submitted it, you should see something like this:

Code Block
 [me@nodelogin02 first_gpu_job]$ sbatch test_gpu_tensorflow.slurm
 Submitted batch job <job_id>
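
While the job is pending or running, you can monitor it in the Slurm queue; a minimal sketch (replace <job_id> with the ID that sbatch reported):

Code Block
 # show all of your jobs in the queue
 squeue -u $USER
 # show only this particular job
 squeue -j <job_id>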

Cancelling your job

In case you need to cancel the job that you have submitted, you can use the following command:

Code Block
 scancel <job_id>

You can use it to cancel the job at any stage in the queue, i.e., pending or running.

Note that you might not be able to cancel the job in this example, because it has already finished.
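
Once the job has finished, you can still collect basic accounting information about it with Slurm's sacct command; a minimal sketch (the available fields depend on the cluster's accounting setup):

Code Block
 sacct -j <job_id> --format=JobID,JobName,Partition,State,Elapsed,ExitCode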