This tutorial is being revised

About this tutorial

This tutorial guides you through running a job on one of ALICE's GPU nodes. It uses TensorFlow and Keras to train a model on an example dataset using one GPU. It is based on the TensorFlow tutorial (MIT License) available at https://www.tensorflow.org/tutorials/keras/classification

What will you learn?

Setting up the batch script for a job using GPUs
Setting up a basic TensorFlow+Keras job
Moving data to and from local node scratch
Loading the necessary modules
Submitting your job
Monitoring your job
Collecting information about your job

What will this example not cover?

Introducing TensorFlow, Keras or machine learning in general
Installing your own or special Python modules
Using multiple GPUs
Compiling code for GPU

What should you know before starting?

Basic Python is recommended. This tutorial is not a Python tutorial. If you are completely new to Python, we recommend that you work through a generic Python tutorial first. There are many great ones out there.
A basic understanding of machine learning or TensorFlow is helpful but not required, because this is a HelloWorld-type programme for TensorFlow.
Basic knowledge of how to use a Linux OS from the command line.
How to connect to ALICE or SHARK: https://pubappslu.atlassian.net/wiki/spaces/ALICEWIKI/pages/11108355
How to move files to and from ALICE or SHARK: https://pubappslu.atlassian.net/wiki/spaces/ALICEWIKI/pages/17072131
How to set up a simple batch job as shown in: Your first bash job

1 About this tutorial
2 Preparations
- 2.1 The Python Script
- 2.2 The batch script
  - 2.2.1 Slurm settings
  - 2.2.2 Job Commands
3 Running your job
4 Cancelling your job

Preparations

As usual, it is helpful to check the current cluster status and load before you start. The GPU nodes are being used quite extensively at the moment, so it might take longer for your job to be scheduled. This makes it even more important to define the resources in your batch script as precisely as possible to help Slurm schedule your job.
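One way to check the load is to ask Slurm directly. The lines below are a minimal sketch that assumes the standard Slurm client tools sinfo and squeue are available on the login node. The partition name gpu-short is the one used later in this tutorial.

```shell
# Partition used later in this tutorial
PARTITION="gpu-short"

if command -v sinfo >/dev/null 2>&1; then
    # Node states in the partition (idle, mix, alloc, ...)
    sinfo --partition="$PARTITION"
    # Jobs currently pending or running in the partition
    squeue --partition="$PARTITION"
else
    echo "Slurm client tools not found; run this on a login node." >&2
fi
```

If most nodes are reported as allocated and many jobs are pending, expect a longer wait before your job starts.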

If you have been following the previous tutorial, you should already have a directory called user_guide_tutorials in your $HOME. Let's create a directory for this job and change into it:

mkdir -p $HOME/user_guide_tutorials/first_gpu_job
cd $HOME/user_guide_tutorials/first_gpu_job

The Python Script

Based on the TensorFlow tutorial, we will use the following Python 3 script to train a model on example data available in TensorFlow and apply it once. The script also runs some basic tests to confirm that it will work with the GPU.

Copy the Python code below into a file, which we assume here is named test_gpu_tensorflow.py and stored in $HOME/user_guide_tutorials/first_gpu_job.

"""
This is a HelloWorld-type script to run on the GPU nodes.
It uses TensorFlow with Keras and is based on this TensorFlow tutorial:
https://www.tensorflow.org/tutorials/keras/classification
"""

# Import TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras

# Some helper libraries
import os
import numpy as np
import matplotlib.pyplot as plt

# Some helper functions
# +++++++++++++++++++++
def plot_image(i, predictions_array, true_label, img):
  true_label, img = true_label[i], img[i]
  plt.grid(False)
  plt.xticks([])
  plt.yticks([])

  plt.imshow(img, cmap=plt.cm.binary)

  predicted_label = np.argmax(predictions_array)
  if predicted_label == true_label:
    color = 'blue'
  else:
    color = 'red'

  plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],
                                100*np.max(predictions_array),
                                class_names[true_label]),
                                color=color)

def plot_value_array(i, predictions_array, true_label):
  true_label = true_label[i]
  plt.grid(False)
  plt.xticks(range(10))
  plt.yticks([])
  thisplot = plt.bar(range(10), predictions_array, color="#777777")
  plt.ylim([0, 1])
  predicted_label = np.argmax(predictions_array)

  thisplot[predicted_label].set_color('red')
  thisplot[true_label].set_color('blue')

# Run some tests
# ++++++++++++++

# get the version of TensorFlow
print("TensorFlow version: {}".format(tf.__version__))

# Check that TensorFlow was built with CUDA to use the GPUs
print("Device name: {}".format(tf.test.gpu_device_name()))
print("Built with GPU support? {}".format(tf.test.is_built_with_gpu_support()))
print("Built with CUDA? {}".format(tf.test.is_built_with_cuda()))

# Get the data
# ++++++++++++

# Get an example dataset
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Class names for later use
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
               
# Get some information about the data
print("Size of training dataset: {}".format(train_images.shape))
print("Number of labels training dataset: {}".format(len(train_labels)))
print("Size of test dataset: {}".format(test_images.shape))
print("Number of labels test dataset: {}".format(len(test_labels)))

# Scale the pixel values from integers in [0, 255] to floats in [0, 1]
train_images = train_images / 255.0
test_images = test_images / 255.0

# Plot the first 25 images of the training set
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[train_labels[i]])
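# Make sure the output directory exists, then save the figure.
# savefig replaces an existing file by default and has no 'overwrite' argument.
os.makedirs("./plots", exist_ok=True)
plt.savefig("./plots/trainingset_example.png", bbox_inches='tight')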
plt.close('all')

# Set and train the model
# +++++++++++++++++++++++


# Set up the layers
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10)
])

# Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=10)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

print('\nTest accuracy: {}'.format(test_acc))

# Use the model
# +++++++++++++

# grab an image
img_index=10
img = test_images[img_index]
print(img.shape)

# add image to a batch
img = (np.expand_dims(img,0))
print(img.shape)

# the model outputs logits; append a softmax layer to get probabilities
probability_model = tf.keras.Sequential([model, 
                                         tf.keras.layers.Softmax()])

# predict the label for the image
predictions_img = probability_model.predict(img)

print("Predictions for image {}:".format(img_index))
print(predictions_img[0])
print("Label with highest confidence: {}".format(np.argmax(predictions_img[0])))

# plot it
plt.figure(figsize=(6, 3))
plt.subplot(1,2, 1)
plot_image(img_index, predictions_img[0], test_labels, test_images)
plt.subplot(1,2,2)
plot_value_array(img_index, predictions_img[0], test_labels)
# savefig replaces an existing file by default and has no 'overwrite' argument
plt.savefig("./plots/trainingset_prediction_img{}.png".format(img_index), bbox_inches='tight')

The batch script

The batch script is again a bit more elaborate than strictly necessary, but it is always helpful to have a few extra log messages at the beginning.

You can copy the different elements below directly into a text file. We assume that you name it test_gpu_tensorflow.slurm and that you place it in the same location as the Python file.

Slurm settings

The Slurm settings are very similar to those in the previous examples:

#!/bin/bash
#SBATCH --job-name=test_gpu_tensorflow
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --mem=5G
#SBATCH --time=00:02:00
#SBATCH --partition=gpu-short
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1

Note that we changed the partition to one of the GPU partitions. Since the job won't take long, the gpu-short partition is sufficient. Another important change is that we added #SBATCH --gres=gpu:1. This tells Slurm to give us one of the four GPUs on the node. It is vital that you specify the number of GPUs that you need so that the remaining ones can be used by other users (if resources permit).

Job Commands

First, let's load the modules that we need. We assume here that you do not have any modules loaded other than the default ones you get after logging in. While it is not strictly necessary, we explicitly specify the versions of the modules that we want to use. This improves the reproducibility of our job in case the default modules change.
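As a rough sketch, this part of the batch file could look like the lines below. The module names are placeholders and not confirmed ALICE module names. Replace each one with a full name/version entry reported by module avail on the cluster.

```shell
# Placeholder names (assumptions, not confirmed ALICE modules): replace each
# with a full "name/version" entry reported by `module avail`.
TF_MODULE="TensorFlow/<version>"
PLOT_MODULE="matplotlib/<version>"

if command -v module >/dev/null 2>&1; then
    module load "$TF_MODULE"
    module load "$PLOT_MODULE"
    # Record the loaded modules in the job log
    module list
else
    echo "module command not available in this shell" >&2
fi
```

Printing the list of loaded modules into the job log makes it easier to reproduce the run later.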

Let's define a few variables and print some basic information. Note that we print out which of the four GPUs is being used.
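A minimal sketch of such log lines follows. It assumes that Slurm sets SLURM_JOB_ID inside the job and exposes the index of the assigned GPU through CUDA_VISIBLE_DEVICES. The variable name CWD and the fallback values are choices made for this illustration.

```shell
# Remember where the job was started so results can be copied back later
CWD="$(pwd)"

# The fallbacks only matter when these lines are tried outside of a Slurm job
echo "## User:              ${USER:-$(id -un)}"
echo "## Node:              $(uname -n)"
echo "## Job ID:            ${SLURM_JOB_ID:-none}"
echo "## Working directory: $CWD"
# With --gres=gpu:1, the assigned GPU is expected to show up here
echo "## Using GPU:         ${CUDA_VISIBLE_DEVICES:-none}"
```

These messages end up in the job's output file and help when you need to trace on which node and GPU a job ran.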

Since the Python script will write out files, we want to use the scratch space local to the node for our job instead of the shared scratch. In this example, this is not really necessary because we only write two small files. However, if you want to process large amounts of data and have to perform a lot of I/O, it is highly recommended to use the node's local scratch. It will generally be faster than the network storage, which is shared by all users.
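The following sketch sets up a run directory on local scratch. It assumes that the node-local scratch directory of the job is exposed through an environment variable named SCRATCH; please check the cluster documentation for the actual name. The directory name test_tf and the variable names are hypothetical.

```shell
# Assumption: the job's node-local scratch directory is exposed as $SCRATCH.
# The temporary-directory fallback only exists so the lines can be tried elsewhere.
CWD="${CWD:-$(pwd)}"
LOCAL_SCRATCH="${SCRATCH:-$(mktemp -d)}"

# Run directory on local scratch, including the plots sub-directory
# that the Python script writes to
RUNDIR="$LOCAL_SCRATCH/test_tf"
mkdir -p "$RUNDIR/plots"
echo "## Run directory: $RUNDIR"

# Copy the Python script to local scratch and work from there
if [ -f "$CWD/test_gpu_tensorflow.py" ]; then
    cp "$CWD/test_gpu_tensorflow.py" "$RUNDIR/"
else
    echo "## test_gpu_tensorflow.py not found in $CWD" >&2
fi
cd "$RUNDIR"
```

Working from the run directory means that the relative path ./plots used in the Python script points to local scratch.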

Next, we add running the Python script to the batch file:
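A minimal sketch of this step is shown below. It assumes that python3 is provided by the loaded modules and that the current directory already contains the script. The extra log messages are optional.

```shell
SCRIPT="test_gpu_tensorflow.py"

echo "## Starting $SCRIPT on $(uname -n)"
if [ -f "$SCRIPT" ]; then
    # python3 is expected to come from the modules loaded earlier
    if python3 "$SCRIPT"; then
        echo "## $SCRIPT finished successfully"
    else
        echo "## $SCRIPT failed" >&2
    fi
else
    echo "## $SCRIPT not found in $(pwd)" >&2
fi
```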

Last but not least, we have to copy the files written to the node's local scratch back to our shared scratch space. This is very important because all the files on the node's local scratch will be deleted after our job has finished. Make sure that you only copy back the data products that you really need.
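The copy step could be sketched as follows. Here the plots are copied back to the directory from which the job was started; change the destination if you want the results elsewhere. As before, SCRATCH as the location of local scratch and the directory name test_tf are assumptions made for this illustration.

```shell
# Fall back to sensible values if CWD and RUNDIR were not set earlier.
# $SCRATCH as node-local scratch and the name test_tf are assumptions.
CWD="${CWD:-${SLURM_SUBMIT_DIR:-$(pwd)}}"
RUNDIR="${RUNDIR:-${SCRATCH:-/tmp}/test_tf}"

if [ -d "$RUNDIR/plots" ]; then
    # Copy only the plots; everything else on local scratch is discarded
    cp -r "$RUNDIR/plots" "$CWD/"
    echo "## Copied plots to $CWD"
else
    echo "## No plots found in $RUNDIR" >&2
fi
echo "## Done"
```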

Running your job

Now that we have the Python script and batch file, we are ready to run our job.

Please make sure that you are in the directory that contains both scripts. If not, change into it first.
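Assuming you used the directory created at the beginning of this tutorial, changing into it could look like this:

```shell
JOB_DIR="$HOME/user_guide_tutorials/first_gpu_job"

if cd "$JOB_DIR" 2>/dev/null; then
    # The Python script and the batch file should both be listed
    ls
else
    echo "Directory $JOB_DIR not found" >&2
fi
```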

You are ready to submit your job like this:
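Submission uses the standard Slurm command sbatch with the batch file name assumed in this tutorial. The check for sbatch is only there so that the lines fail gracefully on a machine without Slurm.

```shell
BATCH_SCRIPT="test_gpu_tensorflow.slurm"

if command -v sbatch >/dev/null 2>&1; then
    sbatch "$BATCH_SCRIPT"
else
    echo "sbatch not found; submit the job from a login node" >&2
fi
```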

Immediately after you have submitted it, you should see something like this:
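sbatch confirms the submission with a single line of the form "Submitted batch job <job_id>". With that job ID you can follow the job while it is pending or running and collect information about it afterwards. The sketch below uses the standard Slurm commands squeue and sacct. JOB_ID is a placeholder that you need to replace with your own job ID, and the selection of sacct fields is just one possible choice.

```shell
# Placeholder: replace <job_id> with the number printed by sbatch
JOB_ID="${JOB_ID:-<job_id>}"

if command -v squeue >/dev/null 2>&1; then
    # All of your jobs that are currently pending or running
    squeue --user="${USER:-$(id -un)}"
    # Accounting information for this job, also available after it has finished
    sacct --jobs="$JOB_ID" --format=JobID,JobName,Partition,State,Elapsed,MaxRSS
else
    echo "Slurm client tools not found; run this on a login node" >&2
fi
```

Once the job has finished, it no longer appears in the squeue output, but sacct can still report on it.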

Cancelling your job

In case you need to cancel the job that you have submitted, you can use the scancel command followed by the ID of your job.
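As a short sketch, with JOB_ID again standing in for the job ID reported at submission:

```shell
# Placeholder: replace <job_id> with the ID of the job you want to cancel
JOB_ID="${JOB_ID:-<job_id>}"

if command -v scancel >/dev/null 2>&1; then
    scancel "$JOB_ID"
else
    echo "scancel not found; run this on a login node" >&2
fi
```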

You can use it to cancel the job at any stage in the queue, i.e., pending or running.

Note that you might not be able to cancel the job in this example, because it has already finished.
