Next Maintenance:
Note |
---|
Update 26 April 2024: Due to circumstances beyond our control, we have not been able to complete all required maintenance. This means that we are not yet able to bring ALICE online again. We hope to have ALICE fully operational again by May 3rd, but we will keep you informed of our progress. |
Note |
---|
ALICE will be offline for system maintenance from 12 April 2024 till 26 April 2024. |
On the weekend of 13/14 April 2024, various tests will be performed in the data center that houses ALICE. Those tests require that we take ALICE completely offline, and we will start the shutdown procedure on Friday, 12 April.
Moreover, we will use this opportunity to perform extensive system maintenance. Because of the amount of work involved, we are conservatively planning a downtime of two weeks. We will try to bring ALICE back sooner, but we do not want to make any promises.
We will update the current status of ALICE on the HPC wiki page as usual: ALICE status page
We realize the impact that such a long downtime of the cluster has on you. Please know that this maintenance is absolutely necessary. If you have any questions, please let us know.
What will we do?
The system maintenance will be used for major updates to ALICE, such as:
migrate to a new operating system (RHEL 9), because CentOS 7 will reach end of life at the end of June
update the cluster management software
migrate the shared scratch /data1 storage system from BeeGFS to CephFS, because it is unfortunately not possible for us to continue using BeeGFS
new software stack (only one primary software stack to simplify usage)
changes to the SLURM partitions
The first two items alone require that we completely re-install the entire cluster.
What does this mean for you?
During the system maintenance, you will not be able to use ALICE, run any jobs or access your data. All currently running or pending jobs will be cancelled.
As has always been our recommendation, it is vital that you keep your own copy of all relevant data that is stored on the shared scratch (/data1), including project directories that you have access to. While we are planning to migrate the user data to the new scratch storage system, it is possible that data will be lost.
If possible, please remove any data on the cluster that is no longer needed.
We also have to update the NFS server that hosts your home directory. Again, we strongly recommend that you keep a copy of relevant data and remove data that is no longer needed.
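One way to check that a copy of your data is complete is to compare checksums of the original and the copy. The following is a minimal sketch using only the Python standard library. The function names are illustrative and not part of any ALICE tooling, and any directory you pass in (for example a project directory under /data1) is your own choice.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its SHA-256 checksum."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and not p.is_symlink()
    }


def compare_manifests(original, copy):
    """Return (missing, changed): files absent from the copy, and files whose checksum differs."""
    missing = sorted(set(original) - set(copy))
    changed = sorted(k for k in original if k in copy and original[k] != copy[k])
    return missing, changed
```

Calling `compare_manifests(build_manifest(source_dir), build_manifest(backup_dir))` should return two empty lists if every file in the source is present and identical in the copy. Hashing reads every byte, so this can take a while on large directories.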
What will change after the maintenance?
The gpu-* and amd-gpu-* partitions will be merged, as will the cpu-* and amd-short/long partitions. This means that all public resources will be available from the cpu-* and gpu-* partitions, with the exception of partition mem and, of course, private resources. For legacy purposes, we will keep the amd-* partitions, but we recommend that users migrate to the cpu-* and gpu-* partitions. It will not be possible to run jobs on the gpu partitions without requesting a GPU.
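To illustrate the GPU rule, here is a small sketch that generates the `#SBATCH` header of a job script and refuses to build one for a GPU partition without a GPU request. It rests on several assumptions: the partition names `gpu-short` and `cpu-short` are placeholders for whatever the final cpu-* and gpu-* partitions are called, the helper `sbatch_header` is hypothetical, and `--gres=gpu:N` is the generic Slurm way of requesting GPUs, which may not be the form preferred on ALICE. The scheduler enforces the actual rule; this is only a local sanity check.

```python
def sbatch_header(partition, gpus=0, job_name="example"):
    """Build the #SBATCH lines for a job script.

    Raises ValueError if a GPU partition (name starting with "gpu-",
    an assumed naming scheme) is requested without at least one GPU.
    """
    if partition.startswith("gpu-") and gpus < 1:
        raise ValueError(
            f"partition {partition!r} requires at least one GPU (--gres=gpu:N)"
        )
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
    ]
    if gpus > 0:
        # Generic Slurm GPU request; adjust if the site documents another form.
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    return "\n".join(lines)
```

For example, `sbatch_header("gpu-short", gpus=1)` yields a header that includes `#SBATCH --gres=gpu:1`, while `sbatch_header("gpu-short")` raises a `ValueError`.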
We have to re-install our scientific software stacks because of the migration to the new operating system. The stacks have grown significantly and include many old packages and toolchains, which is why we will not install everything. Instead, we will install the most commonly used packages based on module usage. If you are missing a module, just let us know and we can add it.
In addition, we will move from two software stacks to a single primary software stack that can run on all nodes. This will make it easier for you to run your jobs on different nodes, but it also means that we cannot build the software with full optimization for the underlying CPU architecture. However, for many workloads on ALICE this is not an issue. For users who need optimized modules, we are planning a separate stack after the maintenance is done.
Because we have to reinstall the software stack and install a new operating system, you will have to reinstall or recompile software and packages that you installed locally, including Python environments, conda environments, R packages, etc. For Python and conda environments in particular, we recommend that you create requirements or environment files so that you can quickly set up your environment again.
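The usual commands for this are `pip freeze > requirements.txt` for pip-based environments and `conda env export > environment.yml` for conda environments. As a rough sketch of what such a file contains, the following standard-library snippet records the Python packages visible to the interpreter that runs it. The function names are illustrative, and the snippet does not capture non-Python packages installed through conda, so it is not a replacement for `conda env export`.

```python
from importlib import metadata


def pinned_requirements():
    """Return 'name==version' lines for installed distributions, sorted by name."""
    pins = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.version
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if name and version:
            pins[name.lower()] = f"{name}=={version}"
    return [pins[key] for key in sorted(pins)]


def write_requirements(path="requirements.txt"):
    """Write the pinned requirements to path and return the number of packages."""
    lines = pinned_requirements()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("".join(line + "\n" for line in lines))
    return len(lines)
```

Run it with the environment's own interpreter before the maintenance starts and keep the resulting file outside the cluster. Afterwards, `pip install -r requirements.txt` in a fresh environment reinstalls the same versions, provided they are still available for the new operating system.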
Important dates
12 April 2024 at 10:00: ALICE goes offline. All jobs will be cancelled. No further access to ALICE.
13/14 April 2024: Data center tests
15 April: Start system maintenance
26 April: Expected end of system maintenance