This section is used to announce upcoming maintenance and provide information before, during and after it. For general information about our maintenance policy, please have a look here: Maintenance Policy

Next Maintenance:

Update 8 May 2024:

ALICE mostly available again!

We are very happy to announce that almost all ALICE nodes are available again, including 12 of our 14 A100 GPU nodes.

Also, if you need to use any of the older modules that were available prior to the maintenance, use

  • module load ALICE/legacy

Update 7 May 2024:

ALICE is (partially) available again!

We are very happy to announce that ALICE is (partially) available again. Available are

  • the login nodes

  • the CPU nodes

  • the 'old' GPU nodes with NVIDIA RTX 2080 Ti GPUs

  • nodes acquired by specific users or groups, such as mem_mi, strw_gpu, etc.

  • all data volumes, including /data1 with all data fully preserved during the maintenance

Currently not available:

  • the 'new' GPU nodes with NVIDIA A100 GPUs. This is due to hardware instabilities, which we are still investigating with the supplier.

NOTE
There are still a few quirks with the system. Most notably, the "slurm" module is not always loaded automatically. If the Slurm commands (sbatch, salloc, sinfo) are not available after logging in, use
module load slurm
to make them available again. This quirk will be fixed very soon.
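
As a minimal sketch of this workaround, the check can be wrapped in a small shell function. It assumes a bash login shell in which the module command is defined; ensure_slurm is a hypothetical helper name, not something provided on ALICE.

```shell
# ensure_slurm is a hypothetical helper name, not part of ALICE or Slurm.
# It loads the "slurm" module only when sbatch cannot be found.
ensure_slurm() {
    if ! command -v sbatch >/dev/null 2>&1; then
        module load slurm
    fi
}
```

Calling ensure_slurm after logging in, or at the top of a script that submits jobs, leaves the session untouched when the Slurm commands are already available.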

ALICE has been upgraded from CentOS 7 to Red Hat Enterprise Linux 9 (RHEL 9). This means that your old code may no longer run and may need to be recompiled for the new operating system. This applies especially to existing Conda environments, which will most likely need to be redeployed.
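
For Conda environments, redeployment could look like the following sketch. It assumes that an environment file is at hand (for example one written earlier with conda env export) and that conda is available in your shell. The function name recreate_env, the environment name and the file name are hypothetical placeholders.

```shell
# recreate_env is a hypothetical helper; pass it a new environment name
# and the path to an existing environment file.
recreate_env() {
    env_name="$1"
    env_file="$2"
    if [ ! -f "$env_file" ]; then
        echo "environment file not found: $env_file" >&2
        return 1
    fi
    # Build a fresh environment under the new operating system.
    conda env create --name "$env_name" --file "$env_file"
}
```

A call such as recreate_env myenv-rhel9 environment.yml creates the environment under a new name, so the environment built under the old operating system stays in place until you decide to remove it.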

Also note that we have not yet rebuilt all modules that were available in the old installation. Over the next weeks, we will re-add those modules that users report as missing or not working.

What has been done?

The system maintenance was used for major updates to ALICE. We have:

  • migrated to a new operating system (RHEL 9), because CentOS 7 reaches end of life at the end of June 2024

  • updated the cluster management software

  • migrated the shared scratch /data1 storage system from BeeGFS to CephFS, because it is unfortunately not possible for us to continue using BeeGFS

  • installed a new software stack (only one primary software stack to simplify usage)

  • changed the Slurm partitions

The first two items alone required that we completely re-install the entire cluster.

What will change after the maintenance?

  1. The gpu-* and amd-gpu-* partitions will be merged, as will the cpu-* and amd-short/long partitions. This means that all public resources will be available from the cpu-* and gpu-* partitions, with the exception of the mem partition and, of course, private resources. For legacy purposes, we will keep the amd-* partitions, but we recommend that users migrate to the cpu-* and gpu-* partitions.

  2. It will not be possible to run jobs on the gpu partitions without requesting a GPU.

  3. We have to re-install our scientific software stacks because of the migration to the new operating system. The stacks have grown significantly and include many old packages and toolchains, which is why we will not install everything. Instead, we will install the most commonly used packages based on module usage. If you are missing a module, just let us know and we can add it.

  4. In addition, we will move from two software stacks to a single primary software stack that can run on all nodes. This will make it easier for you to run your jobs on different nodes, but it also means that we cannot build the software with full optimization for the underlying CPU architecture. For many workloads on ALICE, however, this is not an issue. For users who need optimized modules, we are planning a separate stack after the maintenance is done.

  5. Because we have to reinstall the software stack and install a new operating system, you will have to reinstall or recompile software and packages that you installed locally, including Python environments, Conda environments, R packages, etc. For Python and Conda environments in particular, we recommend that you create requirements or environment files so that you can quickly set up your environment again.
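
Regarding item 2 above, a job script for the gpu partitions could request a GPU along the following lines. This is only a sketch: the partition name gpu-short, the job name and the time limit are assumptions, so substitute the values that apply to your job.

```shell
#!/bin/bash
#SBATCH --job-name=gpu_check      # hypothetical job name
#SBATCH --partition=gpu-short     # assumed partition name matching gpu-*
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --time=00:10:00           # assumed time limit

# Slurm typically sets CUDA_VISIBLE_DEVICES for jobs that were allocated GPUs.
gpu_list="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${gpu_list}"
```

The script would be submitted with sbatch. The --gres=gpu:1 line is the part that matters here, since jobs on the gpu partitions will need an explicit GPU request.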

Important dates

  • 12 April 2024 at 10:00: ALICE goes offline. All jobs will be cancelled. No further access to ALICE.

  • 13/14 April 2024: Data center tests

  • 15 April: Start system maintenance

  • 26 April: Expected end of system maintenance
