Checkpointing and requeue

Checkpointing refers to the technique of storing a running program's execution state so that the program can later resume from exactly that state, either on the same machine or on another one. This applies to serial as well as parallel/distributed computing, on CPUs and GPUs.

On HPC clusters like ALICE and SHARK, checkpointing is useful for various reasons, for example:

  1. With checkpointing, it is possible to split a job that would normally run for a (very) long time (perhaps exceeding the limit of the partition) into shorter jobs. On ALICE, the running time is limited on all partitions and jobs are terminated automatically when the time limit is reached. Short jobs also sometimes start faster, because they might receive a higher priority in the queue than jobs in long partitions.

  2. In case the job fails for whatever reason after already running for a significant amount of time, it is possible to continue in a new job from the last checkpoint. This is all the more important because a job can fail for any number of reasons:

    1. A job might fail because of the job itself, for example because it ran out of memory or ran into other kinds of issues within the program.

    2. A job might also fail because of an issue on the compute node or in case the compute node needed to be rebooted to apply critical security patches. This can be a concern on both clusters, but in particular on SHARK because partitions on SHARK do not have a time limit.
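To split one long run into a chain of shorter jobs (point 1 above), you can submit several dependent jobs that each resume from the checkpoint written by the previous one. A sketch, assuming your job script can resume from its own checkpoints; the script name job.slurm and the chain length of 5 are placeholders:

```shell
#!/bin/bash
# Submit the first job, then four follow-up jobs. With afterok, each
# job starts only after the previous one finished successfully.
jobid=$(sbatch --parsable job.slurm)
for i in {2..5}; do
    jobid=$(sbatch --parsable --dependency=afterok:${jobid} job.slurm)
done
echo "Submitted chain ending in job ${jobid}"
```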

If your jobs are running fine, you probably do not need to worry about checkpointing.

Some applications already have built-in support for checkpointing, while others require additional tools.

Requeue

A simple approach is to stop your program shortly before the wall-time limit and requeue the job. The application must be able to continue from a checkpoint; otherwise the job would loop forever without making progress. The option --open-mode=append ensures that Slurm appends to your output files instead of overwriting them when a job restarts or is requeued.

In this example, your_program is stopped 1 hour before the wall-time limit and the job is requeued.

```bash
#!/bin/bash
#SBATCH --time=7-00:00:00
#SBATCH --ntasks=1
#SBATCH --open-mode=append

# GNU timeout accepts a single duration, so 6 days + 23 hours is
# written as 167h, i.e. 1 hour before the 7-day time limit.
timeout 167h your_program

# timeout exits with status 124 when it had to stop the program
if [[ $? == 124 ]]; then
    scontrol requeue $SLURM_JOB_ID
fi
```

The following example uses an inline Python script for simplicity; you can call your real program the same way.
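To avoid an endless requeue loop when the program can never finish, you can cap the number of restarts. Slurm sets SLURM_RESTART_COUNT in the environment of requeued jobs; the helper function below and the cap of 10 are our own sketch, not part of Slurm:

```shell
# SLURM_RESTART_COUNT is unset (or 0) on the first run and is
# incremented by Slurm each time the job is requeued.
should_requeue() {
    local count=${SLURM_RESTART_COUNT:-0}
    local max=${1:-10}   # arbitrary cap, adjust to taste
    [[ ${count} -lt ${max} ]]
}

# Only requeue while under the cap:
# if should_requeue 10; then scontrol requeue $SLURM_JOB_ID; fi
```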

```bash
python3 - <<'PY'
import json, os, time, sys

ckpt_file = os.environ.get("CKPT_FILE", "checkpoints/state.json")
progress_file = os.environ.get("PROGRESS_FILE", "checkpoints/progress.txt")
os.makedirs(os.path.dirname(ckpt_file), exist_ok=True)

# Load the last completed step if present
start_idx = 0
if os.path.exists(progress_file):
    with open(progress_file) as f:
        try:
            start_idx = int(f.read().strip())
        except ValueError:
            start_idx = 0

total = 1000000  # pretend this is big
print(f"[start] Resuming from step {start_idx} of {total}", flush=True)

for i in range(start_idx, total):
    # Simulated work
    time.sleep(0.01)

    # Periodically save progress (cheap checkpoint)
    if i % 1000 == 0:
        with open(progress_file, "w") as f:
            f.write(str(i))

print("[done] Work completed.", flush=True)
PY
```
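Writing the progress file in place, as above, can leave a truncated file if the job is killed mid-write. A common remedy is to write to a temporary file and then atomically rename it over the checkpoint; a minimal sketch (the function name save_checkpoint is our own):

```python
import os
import tempfile

def save_checkpoint(path: str, data: str) -> None:
    """Write data to path atomically: write a temporary file in the
    same directory, then rename it over the target. On POSIX systems
    os.replace is atomic, so a restarting job never reads a
    half-written checkpoint."""
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit the disk
        os.replace(tmp, path)     # atomic rename over the old file
    except BaseException:
        os.unlink(tmp)            # clean up the temp file on failure
        raise

# Usage inside the loop above: save_checkpoint(progress_file, str(i))
```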

Here, we aim to provide some examples of using checkpointing. We will expand this section further.