TensorFlow example

This tutorial is being revised

About this tutorial

This tutorial will guide you through running a job on one of ALICE's GPU nodes. It uses TensorFlow and Keras to train a model on an example dataset using one GPU. It is based on this TensorFlow tutorial (MIT License): https://www.tensorflow.org/tutorials/keras/classification

What you will learn

  • Setting up the batch script for a job using GPUs

  • Setting up a basic TensorFlow+Keras job

  • Moving data to and from the node's local scratch

  • Loading the necessary modules

  • Submitting your job

  • Monitoring your job

  • Collecting information about your job

What this example will not cover

  • Introducing TensorFlow, Keras or machine learning in general

  • Installing your own or special Python modules

  • Using multiple GPUs

  • Compiling code for GPU

What you should know before starting

Preparations

As usual, it is always helpful to check the current cluster status and load. The GPU nodes are being used quite extensively at the moment, so it might take longer for your job to be scheduled. This makes it even more important to define the resources in your batch script as precisely as possible to help Slurm schedule your job.
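If you want a quick impression from the command line of how busy the GPU partition used in this tutorial is, you can query Slurm directly:

# state of the nodes in the gpu-short partition
sinfo -p gpu-short
# jobs currently queued or running in that partition
squeue -p gpu-short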

If you have been following the previous tutorial, you should already have a directory called user_guide_tutorials in your $HOME. Let's create a directory for this job and change into it:

mkdir -p $HOME/user_guide_tutorials/first_gpu_job
cd $HOME/user_guide_tutorials/first_gpu_job

The Python Script

Based on the TensorFlow tutorial, we will use the following Python 3 script to train a model on an example dataset that ships with TensorFlow and then apply it once. The script also runs some basic tests to confirm that TensorFlow can use the GPU.

Copy the Python code below into a file, which we assume here is named test_gpu_tensorflow.py and stored in $HOME/user_guide_tutorials/first_gpu_job.

"""
This is a HelloWorld-type of script to run on the GPU nodes.
It uses TensorFlow with Keras and is based on this TensorFlow tutorial:
https://www.tensorflow.org/tutorials/keras/classification
"""

# Import TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras

# Some helper libraries
import os
import numpy as np
import matplotlib.pyplot as plt


# Some helper functions
# +++++++++++++++++++++
def plot_image(i, predictions_array, true_label, img):
    true_label, img = true_label[i], img[i]
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])

    plt.imshow(img, cmap=plt.cm.binary)

    predicted_label = np.argmax(predictions_array)
    if predicted_label == true_label:
        color = 'blue'
    else:
        color = 'red'

    plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],
                                         100*np.max(predictions_array),
                                         class_names[true_label]),
               color=color)


def plot_value_array(i, predictions_array, true_label):
    true_label = true_label[i]
    plt.grid(False)
    plt.xticks(range(10))
    plt.yticks([])
    thisplot = plt.bar(range(10), predictions_array, color="#777777")
    plt.ylim([0, 1])
    predicted_label = np.argmax(predictions_array)

    thisplot[predicted_label].set_color('red')
    thisplot[true_label].set_color('blue')


# Run some tests
# ++++++++++++++
# get the version of TensorFlow
print("TensorFlow version: {}".format(tf.__version__))

# Check that TensorFlow was built with CUDA to use the GPUs
print("Device name: {}".format(tf.test.gpu_device_name()))
print("Built with GPU Support? {}".format(tf.test.is_built_with_gpu_support()))
print("Built with CUDA? {}".format(tf.test.is_built_with_cuda()))

# Get the data
# ++++++++++++
# Get an example dataset
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Class names for later use
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Get some information about the data
print("Size of training dataset: {}".format(train_images.shape))
print("Number of labels training dataset: {}".format(len(train_labels)))
print("Size of test dataset: {}".format(test_images.shape))
print("Number of labels test dataset: {}".format(len(test_labels)))

# Convert the data from integer to float
train_images = train_images / 255.0
test_images = test_images / 255.0

# Make sure the output directory for the plots exists
os.makedirs("./plots", exist_ok=True)

# plot the first 25 images of the training set
plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[train_labels[i]])
plt.savefig("./plots/trainingset_example.png", bbox_inches='tight')
plt.close('all')

# Set up and train the model
# ++++++++++++++++++++++++++
# Set up the layers
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10)
])

# Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=10)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print('\nTest accuracy: {}'.format(test_acc))

# Use the model
# +++++++++++++
# grab an image
img_index = 10
img = test_images[img_index]
print(img.shape)

# add the image to a batch
img = (np.expand_dims(img, 0))
print(img.shape)

# to make predictions, add a new layer
probability_model = tf.keras.Sequential([model, tf.keras.layers.Softmax()])

# predict the label for the image
predictions_img = probability_model.predict(img)
print("Predictions for image {}:".format(img_index))
print(predictions_img[0])
print("Label with highest confidence: {}".format(np.argmax(predictions_img[0])))

# plot it
plt.figure(figsize=(6, 3))
plt.subplot(1, 2, 1)
plot_image(img_index, predictions_img[0], test_labels, test_images)
plt.subplot(1, 2, 2)
plot_value_array(img_index, predictions_img[0], test_labels)
plt.savefig("./plots/trainingset_prediction_img{}.png".format(img_index), bbox_inches='tight')

The batch script

The batch script is again a bit more elaborate than strictly necessary, but it is always helpful to have a few extra log messages at the beginning.

You can copy the different elements below directly into a text file. We assume that you name it test_gpu_tensorflow.slurm and that you place it in the same location as the Python file.

Slurm settings

The Slurm settings are very similar to those in the previous examples:

#!/bin/bash
#SBATCH --job-name=test_gpu_tensorflow
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --mem=5G
#SBATCH --time=00:02:00
#SBATCH --partition=gpu-short
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1

Note that we changed the partition to one of the GPU partitions. Since the job won't take long, the gpu-short partition is sufficient. Another important change is that we added #SBATCH --gres=gpu:1. This tells Slurm to give us one of the four GPUs on the node. It is vital that you specify the number of GPUs that you need so that the remaining ones can be used by other users (if resources permit).

Job Commands

First, let's load the modules that we need. We assume here that you do not have any other modules loaded except for the default ones after logging in. While it is not strictly necessary, we explicitly define the versions of the modules that we want to use. This improves the reproducibility of our job in case the default modules change.
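A minimal sketch of what this could look like; the module name and version below are placeholders, so check module avail TensorFlow on the cluster and pick a CUDA-enabled version that is actually installed:

# load a CUDA-enabled TensorFlow module (name and version are placeholders)
module load TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4
# log the modules that are now loaded
module list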

Let's define a few variables and get some basic information into the job log. Note that we also print out which of the node's four GPUs is being used.
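The corresponding lines in the batch script could look like the sketch below. $CUDA_VISIBLE_DEVICES is set by Slurm to the GPU(s) assigned to the job; the other variables are standard Slurm environment variables:

# some basic information about the job for the log file
echo "## Starting GPU TensorFlow test on $HOSTNAME"
echo "## Job ID: $SLURM_JOB_ID, submitted from: $SLURM_SUBMIT_DIR"
echo "## Current working directory: $(pwd)"
# which of the node's GPUs has Slurm assigned to this job?
echo "## Assigned GPU(s): $CUDA_VISIBLE_DEVICES"
# optional: show the assigned GPU as seen by the driver
nvidia-smi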

Since the Python script will write out files, we want to use the scratch space local to the node for our job instead of the shared scratch. In this example, it is not really necessary, because we only write two small files. However, if you process large amounts of data and have to perform a lot of I/O, it is highly recommended to use the node's local scratch. It will generally be faster than the network storage, which is shared by all users.
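A sketch of how this could look, assuming the node-local scratch is reachable under a base path like /scratchdata (this path is an assumption; check the cluster documentation for the actual location of the node-local scratch):

# create a run directory on the node-local scratch (base path is an assumption)
export RUNDIR=/scratchdata/${SLURM_JOB_USER}/${SLURM_JOB_ID}
mkdir -p $RUNDIR/plots
echo "## Created run directory $RUNDIR on the local scratch of $HOSTNAME"

# copy the Python script over and change into the run directory
cp $SLURM_SUBMIT_DIR/test_gpu_tensorflow.py $RUNDIR/
cd $RUNDIR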

Next, we add the command that runs the Python script to the batch file:
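In the batch script this is a single command, wrapped in a couple of log messages; python3 is provided by the TensorFlow module loaded above:

echo "## Starting the TensorFlow test script"
python3 test_gpu_tensorflow.py
echo "## Finished the TensorFlow test script"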

Last but not least, we have to copy the files written to the node's local scratch back to our shared scratch space. This is very important because all files on the node's local scratch are deleted after our job has finished. Make sure that you only copy back the data products that you really need.
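A sketch of this step, reusing the $RUNDIR variable assumed above and copying the plots back to the directory the job was submitted from; copying them to a directory on the shared scratch would work the same way:

# copy only the plots we need back to the submission directory
echo "## Copying plots back to $SLURM_SUBMIT_DIR"
cp -r $RUNDIR/plots $SLURM_SUBMIT_DIR/
# clean up the run directory on the node-local scratch
rm -rf $RUNDIR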

Running your job

Now that we have the Python script and batch file, we are ready to run our job.

Please make sure that you are in the same directory as the scripts. If not, change into it:
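cd $HOME/user_guide_tutorials/first_gpu_job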

You are ready to submit your job like this:
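sbatch test_gpu_tensorflow.slurm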

Immediately after you have submitted it, you should see something like this:
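Submitted batch job <job_id>

where <job_id> is the numeric ID that Slurm assigned to your job. You will need this ID if you want to monitor or cancel the job.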

Cancelling your job

In case you need to cancel the job that you have submitted, you can use the following command:
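scancel <job_id>

Replace <job_id> with the ID that sbatch returned when you submitted the job.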

You can use it to cancel the job at any stage in the queue, i.e., pending or running.

Note that you might not be able to cancel the job in this example, because it has already finished.