Checkpointing and requeue

Checkpointing refers to the technique of storing a running program's execution state so that the program can later resume from exactly that state, either on the same machine or on another one. This applies to serial as well as parallel/distributed computing, on CPUs and GPUs.

On HPC clusters like ALICE and SHARK, checkpointing is useful for various reasons, for example:

  1. With checkpointing, it is possible to split a job that would normally run for a (very) long time (perhaps exceeding the limit of the partition) into shorter jobs. On ALICE, the running time is limited on all partitions and jobs are terminated automatically when the time limit is reached. Short jobs also sometimes start faster, because they might receive a higher priority in the queue than jobs in long partitions.

  2. In case the job fails for whatever reason after already running for a significant amount of time, it is possible to continue in a new job from the last checkpoint. This is all the more important because a job can fail for any number of reasons:

    1. A job might fail because of the job itself, for example because it ran out of memory or ran into other kinds of issues within the program.

    2. A job might also fail because of an issue on the compute node or in case the compute node needed to be rebooted to apply critical security patches. This can be a concern on both clusters, but in particular on SHARK because partitions on SHARK do not have a time limit.
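To split one long run into a chain of shorter jobs (point 1 above), you can submit several dependent jobs that each resume from the checkpoint written by the previous one. A sketch, assuming your job script can resume from its own checkpoints; the script name job.slurm and the chain length of 5 are placeholders:

```shell
#!/bin/bash
# Submit the first job, then four follow-up jobs. With afterok, each
# job starts only after the previous one finished successfully.
jobid=$(sbatch --parsable job.slurm)
for i in {2..5}; do
    jobid=$(sbatch --parsable --dependency=afterok:${jobid} job.slurm)
done
echo "Submitted chain ending in job ${jobid}"
```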

If your jobs are running fine, you probably do not need to worry about checkpointing.

Some applications already have built-in support for checkpointing, while others require additional tools.

Requeue

A simple approach is to stop your program shortly before the wall-time limit and requeue the job. The application must be able to continue from a checkpoint; otherwise the job would loop forever without making progress. The option --open-mode=append ensures that Slurm appends to your output files instead of overwriting them when a job restarts or is requeued.

In this example, your_program is stopped 1 hour before the wall-time limit and the job is requeued.

```bash
#!/bin/bash
#SBATCH --time=7-00:00:00
#SBATCH --ntasks=1
#SBATCH --open-mode=append

# GNU timeout accepts a single duration, so 6 days + 23 hours is
# written as 167h, i.e. 1 hour before the 7-day time limit.
timeout 167h your_program

# timeout exits with status 124 when it had to stop the program
if [[ $? == 124 ]]; then
    scontrol requeue $SLURM_JOB_ID
fi
```

The following example uses an inline Python script for simplicity; you can call your real program the same way.
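To avoid an endless requeue loop when the program can never finish, you can cap the number of restarts. Slurm sets SLURM_RESTART_COUNT in the environment of requeued jobs; the helper function below and the cap of 10 are our own sketch, not part of Slurm:

```shell
# SLURM_RESTART_COUNT is unset (or 0) on the first run and is
# incremented by Slurm each time the job is requeued.
should_requeue() {
    local count=${SLURM_RESTART_COUNT:-0}
    local max=${1:-10}   # arbitrary cap, adjust to taste
    [[ ${count} -lt ${max} ]]
}

# Only requeue while under the cap:
# if should_requeue 10; then scontrol requeue $SLURM_JOB_ID; fi
```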

```bash
python3 - <<'PY'
import json, os, time, sys

ckpt_file = os.environ.get("CKPT_FILE", "checkpoints/state.json")
progress_file = os.environ.get("PROGRESS_FILE", "checkpoints/progress.txt")
os.makedirs(os.path.dirname(ckpt_file), exist_ok=True)

# Load the last completed step if present
start_idx = 0
if os.path.exists(progress_file):
    with open(progress_file) as f:
        try:
            start_idx = int(f.read().strip())
        except ValueError:
            start_idx = 0

total = 1000000  # pretend this is big
print(f"[start] Resuming from step {start_idx} of {total}", flush=True)

for i in range(start_idx, total):
    # Simulated work
    time.sleep(0.01)

    # Periodically save progress (cheap checkpoint)
    if i % 1000 == 0:
        with open(progress_file, "w") as f:
            f.write(str(i))

print("[done] Work completed.", flush=True)
PY
```
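Writing the progress file in place, as above, can leave a truncated file if the job is killed mid-write. A common remedy is to write to a temporary file and then atomically rename it over the checkpoint; a minimal sketch (the function name save_checkpoint is our own):

```python
import os
import tempfile

def save_checkpoint(path: str, data: str) -> None:
    """Write data to path atomically: write a temporary file in the
    same directory, then rename it over the target. On POSIX systems
    os.replace is atomic, so a restarting job never reads a
    half-written checkpoint."""
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit the disk
        os.replace(tmp, path)     # atomic rename over the old file
    except BaseException:
        os.unlink(tmp)            # clean up the temp file on failure
        raise

# Usage inside the loop above: save_checkpoint(progress_file, str(i))
```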

Here, we aim to provide some examples of using checkpointing. We will expand this section further.