This section is used to announce upcoming maintenance and provide information before, during and after it. For general information about our maintenance policy, please have a look here: Maintenance Policy

Next Maintenance:

Note

ALICE will be offline for system maintenance from 12 April 2024 till 26 April 2024.

On the weekend of 13/14 April 2024, various tests will be performed in the data center that houses ALICE. These tests require us to take ALICE completely offline. We will start the shutdown procedure on Friday, 12 April.

Moreover, we will use this opportunity to perform extensive system maintenance. Because of the amount of work involved, we are conservatively planning a downtime of two weeks. We will try to bring ALICE back sooner, but we do not want to make any promises.

We will update the current status of ALICE on the HPC wiki page as usual: ALICE status page

We realize the impact that such a long downtime of the cluster has on you. Please know that this maintenance is absolutely necessary. If you have any questions, please let us know.

What will we do?

The system maintenance will be used for major updates to ALICE, such as:

  • migrate to a new operating system (RHEL 9), because CentOS 7 reaches end of life at the end of June

  • update the cluster management software

  • migrate the shared scratch storage system (/data1) from BeeGFS to CephFS, because it is unfortunately not possible for us to continue using BeeGFS

  • introduce a new software stack (only one primary software stack to simplify usage)

  • change the SLURM partitions

The first two items alone require that we completely re-install the entire cluster.

What does this mean for you?

During the system maintenance, you will not be able to use ALICE, run any jobs or access your data. All currently running or pending jobs will be cancelled.

As has always been our recommendation, it is vital that you keep a copy, stored elsewhere, of all relevant data that you have on the shared scratch (/data1), including project directories that you have access to. While we are planning to migrate the user data to the new scratch storage system, it is possible that data will be lost.
If possible, please remove any data on the cluster that is no longer needed.
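
To decide what to copy and what to remove, it can help to see which directories take up the most space. The following is a minimal Python sketch, not an official ALICE tool. The function names `directory_sizes` and `format_report` are illustrative, and the path in the usage comment is a placeholder that you need to replace with your own directory.

```python
import os
from pathlib import Path


def directory_sizes(root):
    """Return {subdirectory name: total bytes of regular files below it}.

    Only the immediate subdirectories of `root` are listed. Symbolic links
    are skipped so that data outside `root` is not counted.
    """
    sizes = {}
    for entry in Path(root).iterdir():
        if entry.is_symlink() or not entry.is_dir():
            continue
        total = 0
        # os.walk does not follow symbolic links to directories by default.
        for dirpath, _dirnames, filenames in os.walk(entry):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue
                try:
                    total += os.path.getsize(path)
                except OSError:
                    # The file vanished or is unreadable; ignore it.
                    pass
        sizes[entry.name] = total
    return sizes


def format_report(sizes):
    """Format the sizes as text lines, largest directory first."""
    ordered = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    return "\n".join(
        f"{size / 1024**3:10.2f} GiB  {name}" for name, size in ordered
    )


# Usage (placeholder path, replace it with your own directory):
# print(format_report(directory_sizes("/path/to/your/directory")))
```

The report only shows sizes. Copying the data to another system and deleting what is no longer needed remain manual steps.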

We also have to update the NFS server that hosts your home directory. Again, we strongly recommend that you keep a copy of relevant data and remove data that is no longer needed.

We have to re-install our scientific software stacks because of the migration to the new operating system. The stacks have grown significantly and include many old packages and toolchains, which is why we will not install everything. Instead, we will install the most commonly used packages based on module usage. If you are missing a module, just let us know and we can add it.
In addition, we will move from two software stacks to a single primary software stack that can run on all nodes. This will make it easier for you to run your jobs on different nodes, but it also means that we cannot build the software with full optimization for the underlying CPU architecture. However, for many workloads on ALICE this is not an issue. For users who need optimized modules, we are planning a separate stack after the maintenance is done.

Important dates

  • 12 April 2024 at 10:00: ALICE goes offline. All jobs will be cancelled. No further access to ALICE.

  • 13/14 April 2024: Data center tests

  • 15 April: Start system maintenance

  • 26 April: Expected end of system maintenance

    TBA