Maintenance on ALICE

This section is used to announce upcoming maintenance and provide information before, during and after it. For general information about our maintenance policy, please have a look here: https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37519739

Next Maintenance:

Update 26 April 2024:

Due to circumstances beyond our control, we have not been able to complete all required maintenance. This means that we are not yet able to bring ALICE online again. We hope to have ALICE fully operational again by May 3rd, but we will keep you informed of our progress.

ALICE will be offline for system maintenance from 12 April 2024 till 26 April 2024.

On the weekend of 13/14 April 2024, various tests will be performed in the data center that houses ALICE. Those tests require us to take ALICE completely offline, and we will start the shutdown procedure on Friday, 12 April.

Moreover, we will use this opportunity to perform extensive system maintenance. Because of the amount of work involved, we are conservatively planning a downtime of two weeks. We will try to bring ALICE back sooner, but we do not want to make any promises.

We will update the current status of ALICE on the HPC wiki page as usual: https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37519792

We realize the impact that such a long downtime of the cluster has on you. Please know that this maintenance is absolutely necessary. If you have any questions, please let us know.

What will we do?

The system maintenance will be used for major updates to ALICE, such as:

  • migrate to a new operating system (RHEL 9), because CentOS 7 reaches end of life at the end of June 2024

  • update the cluster management software

  • migrate the shared scratch /data1 storage system from BeeGFS to CephFS, because we unfortunately cannot continue using BeeGFS

  • install a new software stack (only one primary software stack to simplify usage)

  • change the SLURM partitions

The first two items alone require that we completely re-install the entire cluster.

What does this mean for you?

During the system maintenance, you will not be able to use ALICE, run any jobs or access your data. All currently running or pending jobs will be cancelled.

As has always been our recommendation, it is vital that you keep a separate copy of all relevant data that you store on the shared scratch (/data1), including data in project directories that you have access to. While we are planning to migrate the user data to the new scratch storage system, it is possible that data will be lost.
If possible, please remove any data on the cluster that is no longer needed.

We also have to update the NFS server that hosts your home directory. Again, we strongly recommend that you keep a copy of relevant data and remove data that is no longer needed.

What will change after the maintenance?

  1. The gpu-* and amd-gpu-* partitions will be merged, as will the cpu-* and amd-short/long partitions. All public resources will then be available from the cpu-* and gpu-* partitions, with the exception of the mem partition and, of course, private resources. For legacy purposes, we will keep the amd-* partitions, but we recommend that users migrate to the cpu-* and gpu-* partitions.

  2. It will not be possible to run jobs on the gpu partitions without requesting a GPU.

  3. We have to re-install our scientific software stacks because of the migration to the new operating system. The stacks have grown significantly and include many old packages and toolchains, which is why we will not install everything. Instead, we will install the most commonly used packages based on module usage. If you are missing a module, just let us know and we can add it.

  4. In addition, we will move from two software stacks to a single primary software stack that can run on all nodes. This will make it easier for you to run your jobs on different nodes, but it also means that we cannot build the software with full optimization for the underlying CPU architecture. However, for many workloads on ALICE this is not an issue. For users who need optimized modules, we are planning a separate stack after the maintenance is done.

  5. Because we have to reinstall the software stack and install a new operating system, you will have to reinstall or recompile any software or packages that you installed locally, including Python environments, conda environments, R packages, etc. For Python and conda environments in particular, we recommend that you create requirements or environment files so that you can quickly set up your environment again.
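
As an illustration of the last point, the following is a minimal sketch of how the packages of a Python environment could be recorded in a requirements-style file before the maintenance. It is not an official ALICE tool. The function name write_requirements and the output file name are assumptions chosen for this example, and the script only approximates what pip itself would record (for example, it does not treat editable installs specially). Run it with the Python interpreter of the environment you want to record; it needs Python 3.8 or newer.

```python
from importlib import metadata  # part of the standard library since Python 3.8
from pathlib import Path


def write_requirements(target):
    """Write one 'name==version' line per installed package to `target`.

    Returns the sorted list of lines that were written.
    (Hypothetical helper for illustration only.)
    """
    pins = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        # Skip broken installations that lack readable metadata.
        name = meta.get("Name") if meta is not None else None
        version = meta.get("Version") if meta is not None else None
        if name and version:
            pins.add("{}=={}".format(name, version))
    lines = sorted(pins, key=str.lower)
    Path(target).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return lines


if __name__ == "__main__":
    # "requirements.txt" is only the conventional name; any path works.
    written = write_requirements("requirements.txt")
    print("Recorded {} packages in requirements.txt".format(len(written)))
```

On the command line, the usual way to get the same kind of file is `pip freeze > requirements.txt` for Python environments and `conda env export > environment.yml` for conda environments. Keep these files together with the copy of your data, so that they are available when you rebuild your environments after the maintenance.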

Important dates

  • 12 April 2024 at 10:00: ALICE goes offline. All jobs will be cancelled. No further access to ALICE.

  • 13/14 April 2024: Data center tests

  • 15 April: Start system maintenance

  • 26 April: Expected end of system maintenance