TensorFlow with TensorBoard example

About this tutorial

This tutorial will guide you through running a job on one of ALICE's GPU nodes. It uses TensorFlow and Keras to train a simple model on an example dataset using one GPU. The purpose of this tutorial is to show how to interactively analyze the statistical results of the training run using TensorBoard. You can find the source of the tutorial here: Link to TensorBoard tutorial (CC-BY 4.0/Apache 2.0 License)

What will you learn?

This tutorial will not cover in depth the topics that are already covered in the TensorFlow example tutorial, which you should use as a starting point if you are new to TensorFlow. We will, however, discuss the following topics:

  • Moving data to and from a node's local scratch space

  • Submitting your job

  • Starting TensorBoard

  • Collecting information about your job

What should you know before starting?

  • Basic knowledge of Python

  • Basic knowledge of machine learning. It is not necessary to know exactly what metrics are involved and so on, but it can be helpful. This is a kind of Hello World program for TensorFlow. Therefore, you do not need prior knowledge of TensorFlow.

  • How to connect to ALICE or SHARK

  • How to move files to and from ALICE or SHARK

  • How to set up a simple batch job as shown in Your first bash job

  • Everything you could learn from the Tensorflow example

Preparations

As usual, it is always helpful to check the current cluster status and load. The GPU nodes are often used quite heavily, so it might take longer for your job to be scheduled. This makes it all the more important to define the resources in your batch script as precisely as possible to help Slurm schedule your job.

If you have been following the previous tutorial, you should already have a directory called user_guide_tutorials in your $HOME. Let's create a directory for this job and change into it:

mkdir -p $HOME/user_guide_tutorials/advanced_gpu_job
cd $HOME/user_guide_tutorials/advanced_gpu_job

ALICE

For this tutorial, we only need the TensorFlow module, which also loads the required Python modules and makes tensorboard available. However, you may want to load additional modules in case an actual script uses them. While it is not strictly necessary, we explicitly define the versions of the modules that we want to use. This improves the reproducibility of our job in case the default modules change.
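For example (the module version shown here is only a placeholder; run module avail TensorFlow on ALICE to see which versions are actually installed):

module avail TensorFlow
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0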


SHARK

TensorFlow is not available as a specific module on SHARK. One option to set it up is to load a CUDA, cudnn, and Python module and install TensorFlow in a Python virtual environment.
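One possible way to set this up once on the login node is sketched below; all module names, versions, and the environment path are placeholders, so check module avail for the CUDA, cuDNN, and Python modules that are actually installed on SHARK:

# Load CUDA, cuDNN, and Python (placeholder versions)
module load cuda/11.8
module load cudnn/8.6
module load python/3.10
# Create and activate a virtual environment, then install TensorFlow into it
python3 -m venv <path_to_tensorflow_venv>
source <path_to_tensorflow_venv>/bin/activate
pip install --upgrade pip
pip install tensorflow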

The Python script

Based on the TensorFlow tutorial, we will use the following Python3 script to train a model using example data available in TensorFlow and write TensorBoard logs during training.

Copy the Python code below into a file which we assume here is named test_gpu_tensorboard.py and stored in $HOME/user_guide_tutorials/advanced_gpu_job.

# Example on using TensorBoard from TensorFlow
import datetime
from pathlib import Path
import sys

import tensorflow as tf

# Using the MNIST dataset as the example, normalize the data and use a function
# that creates a simple Keras model for classifying the images into 10 classes
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def create_model():
    return tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

# When training with Keras's Model.fit(), add the tf.keras.callbacks.TensorBoard
# callback to ensure that logs are created and stored. Additionally, enable
# histogram computation every epoch with histogram_freq=1 (off by default)
model = create_model()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

log_path = Path(sys.argv[1] if len(sys.argv) > 1 else "logs/fit")
log_dir = log_path / datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=str(log_dir),
                                                      histogram_freq=1)

model.fit(x=x_train,
          y=y_train,
          epochs=5,
          validation_data=(x_test, y_test),
          callbacks=[tensorboard_callback])

The batch script

The bash script is again a bit more elaborate than strictly necessary, but it is always helpful to have a few extra log messages at the beginning.

You can copy the different elements below directly into a text file. We assume that you name it test_gpu_tensorflow.slurm and that you place it in the same location as the Python file.

Slurm settings

The Slurm settings are very similar to the previous examples:


ALICE

#!/bin/bash
#SBATCH --job-name=test_gpu_tensorboard
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --mem=5G
#SBATCH --time=00:15:00
#SBATCH --partition=gpu-short
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1

Since the job won't take long, the gpu-short partition is sufficient.


SHARK
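The Slurm settings for SHARK could look like the sketch below; the partition name is an assumption, so check sinfo for the GPU partition that is available to you:

#!/bin/bash
#SBATCH --job-name=test_gpu_tensorboard
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_e-mail>"
#SBATCH --mail-type="ALL"
#SBATCH --mem=5G
#SBATCH --time=00:15:00
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --gpus=1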

 

 


We again request one GPU on the node, as declared by #SBATCH --gres=gpu:1 (on SHARK, #SBATCH --gpus=1). It is vital that you specify the number of GPUs that you need; otherwise, Slurm will not assign a GPU to your job.

Once the TensorFlow part of the job is done, the resources are freed again on the nodes. The TensorBoard part does not use the worker nodes.

Job commands

First, let's load the modules that we need. We assume here that you do not have any other modules loaded except for the default ones after you logged in.


ALICE
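In the batch script, this could look as follows (the module version is a placeholder; use the version you found with module avail during the preparations):

# Load the TensorFlow module; this also provides Python and tensorboard
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
# Print the loaded modules to the job log for reproducibility
module list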


SHARK
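A sketch of the corresponding commands (module names and versions are placeholders; load the same modules you used to create the environment):

# Load the modules used to build the virtual environment
module load cuda/11.8
module load cudnn/8.6
module load python/3.10
# Activate the TensorFlow virtual environment
source <path_to_tensorflow_venv>/bin/activate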

where <path_to_tensorflow_venv> should be replaced by the path to the virtual environment. If you used different modules to create the environment, you will have to adjust the modules that are loaded here, too.


Let's define a few variables and print some basic information about the assigned GPU and the directories we will use. Since the Python script will write out files, we want to use the scratch space local to the node for our job instead of the shared scratch. In the example here, this is not really necessary, because we only write a few log files, and these are stored in $HOME (see next step). However, if you want to process large amounts of data and have to perform a lot of I/O, it is highly recommended to use the node's local scratch for this. It will generally be faster than the network storage, which is shared by all users. In our case, the Python script downloads the MNIST dataset to the local scratch space.
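This part of the batch script could look like the sketch below; the local scratch path is an assumption, so use the location documented for your cluster:

# Print some basic information about the job and the assigned GPU
echo "## Job started on $(hostname) at $(date)"
echo "## Assigned GPU(s): $CUDA_VISIBLE_DEVICES"
nvidia-smi
# Create and enter a job-specific directory on the node-local scratch (path is an assumption)
export WORKDIR=/scratchdata/$USER/$SLURM_JOB_ID
mkdir -p $WORKDIR
cd $WORKDIR
echo "## Working directory: $(pwd)"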

Next, we add running the Python script to the batch file. We tell the script to put the log files in a directory below our $HOME. This means that these files will be written over the network while they are updated, so care should be taken that the number of files is low and that the files are small.
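For example (the log directory name is an assumption; the Python script falls back to logs/fit if no argument is given):

# Directory below $HOME where the TensorBoard logs will be written
export LOGDIR=$HOME/user_guide_tutorials/advanced_gpu_job/logs/fit
# Run the training script and pass the log directory as the first argument
python3 $HOME/user_guide_tutorials/advanced_gpu_job/test_gpu_tensorboard.py $LOGDIR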

If you have other results to collect, you can copy them back after the Python script is done, but before the job terminates, by adding commands to the script. This is very important because all files on the node's local scratch will be deleted after our job has finished. Make sure that you only copy back the files that you really need.
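For example (the file name is a placeholder; copy whatever your own script produces, and $WORKDIR refers to the scratch directory from the sketch above):

# Copy results from the node-local scratch back to the job directory in $HOME
cp -r $WORKDIR/<your_results> $HOME/user_guide_tutorials/advanced_gpu_job/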

Running your job

Now that we have the Python script and batch file, we are ready to run our job.

Please make sure that you are in the same directory where the scripts are. If not, change into it first:
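cd $HOME/user_guide_tutorials/advanced_gpu_job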

You are ready to submit your job like this:
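sbatch test_gpu_tensorflow.slurm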

Immediately after you have submitted it, you should see something like this:
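Submitted batch job <job_id>

where <job_id> is the ID that Slurm assigned to your job.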

Even though the job might take a little time to get out of the pending state in the queue, you can now start tensorboard on the login node by running:
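tensorboard --logdir $HOME/user_guide_tutorials/advanced_gpu_job/logs/fit --port=0

The --logdir argument must point to the directory where your job writes its logs; the path above matches the sketch used in the job commands.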

To connect, you can first open an additional ssh connection by running:
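# "alice" is the host alias from your SSH configuration; adjust it to the alias you use
ssh -L 6006:localhost:<port> alice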

Replace <port> with the port number shown in the output of tensorboard. You can also select a specific port by changing the --port=0 part. This assumes that the configuration from wiki page 'Login to ALICE or SHARK from Linux' -> 'For regular users or "the more elegant way"' is used. Then, in your browser, navigate to: http://localhost:6006

If the data appears to be missing, you can always use the 'Update' button in TensorBoard to refresh the data.

Cancelling your job

In case you need to cancel the job that you have submitted, you can use the following command:
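scancel <job_id>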

You can use it to cancel the job at any stage in the queue, i.e., pending or running.

Note that you might not be able to cancel the job in this example, because it has already finished.