Next Maintenance:
Note
Update 8 May 2024: ALICE mostly available again!
We are very happy to announce that almost all ALICE nodes are available again, including 12 of our 14 A100 GPU nodes.
Also, if you need to use any of the older modules that were available prior to the maintenance, use
module load ALICE/legacy
Note
Update 7 May 2024: ALICE is (partially) available again!
We are very happy to announce that ALICE is (partially) available again. The following are available:
the login nodes
the CPU nodes
the 'old' GPU nodes with NVIDIA RTX 2080 Ti GPUs
nodes acquired by specific users or groups, such as mem_mi, strw_gpu etc
all data volumes, including /data1 with all data fully preserved during the maintenance
Currently not available:
'new' GPU nodes with NVIDIA A100 GPUs. This is due to hardware instabilities, which we are continuing to investigate with the supplier.
NOTE
There are still a few quirks with the system. Most notably, the loading of the module "slurm" does not always happen automatically. If the slurm commands (sbatch, salloc, sinfo) are not available after logging in, then use
module load slurm
to make them available again. This quirk will be fixed very soon.
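Until the quirk is fixed, the check described above could be automated, for example at the top of a job script. The following is only a minimal sketch: the function name ensure_slurm is made up for illustration, and it assumes that the module command is already defined in your shell.

```shell
# Hypothetical helper: load the slurm module only if the Slurm commands are missing.
ensure_slurm() {
    # Nothing to do when sbatch is already on the PATH.
    if command -v sbatch >/dev/null 2>&1; then
        return 0
    fi
    # The module command is assumed to be provided by the cluster's module system.
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found; cannot load slurm" >&2
        return 1
    fi
    module load slurm
}

# Try to make the Slurm commands available; only warn if that is not possible.
ensure_slurm || echo "slurm could not be loaded" >&2
```

Once the module is loaded automatically again, the helper simply does nothing and can be removed.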
ALICE has been upgraded to run Red Hat Enterprise Linux 9 (an upgrade from CentOS 7). This means that your old code may no longer run and may need to be recompiled for the new operating system. This is also (and especially) true for any existing Conda environments, which will most likely need to be redeployed.
Also note that we have not rebuilt all modules available with the old installation (yet). Over the next weeks, we will re-add those modules that users report missing or non-functioning.
What has been done?
The system maintenance was used for major updates to ALICE:
migrated to a new operating system (RHEL 9), because CentOS 7 reaches end of life at the end of June 2024
updated the cluster management software
migrated the shared scratch /data1 storage system from BeeGFS to CephFS, because it is unfortunately not possible for us to continue using BeeGFS
introduced a new software stack (only one primary software stack to simplify usage)
changed the SLURM partitions
The first two items alone required that we completely re-install the entire cluster.
What will change after the maintenance?
The gpu-* and amd-gpu-* partitions will be merged, as will the cpu-* and amd-short/long partitions. This will mean that all public resources will be available from the cpu-* and gpu-* partitions, with the exception of partition mem and of course private resources. For legacy purposes, we will keep the amd-* partitions, but we recommend that users migrate to the cpu-* and gpu-* partitions.
It will not be possible to run jobs on the gpu partitions without requesting a GPU.
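As an illustration of the GPU requirement, a job script for the merged gpu partitions could look roughly like the sketch below. The partition name gpu-short, the job name and the time limit are assumptions chosen for the example, so check sinfo for the partitions that actually exist. The --gres option is the standard Slurm way to request a GPU.

```shell
#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --partition=gpu-short   # assumed name following the gpu-* pattern
#SBATCH --gres=gpu:1            # request one GPU, as required on the gpu partitions
#SBATCH --time=00:10:00         # example time limit

# Slurm usually sets CUDA_VISIBLE_DEVICES for jobs that were allocated GPUs;
# fall back to "none" when the variable is not set.
gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${gpus}"
```

The script would be submitted with sbatch. A job sent to a gpu partition without a GPU request such as the --gres line is expected not to run there.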
We have to re-install our scientific software stacks because of the migration to the new operating system. The stacks have grown significantly and include many old packages and toolchains, which is why we will not install everything. Instead, we will install the most commonly used packages based on module usage. If you are missing a module, just let us know and we can add it.
In addition, we will move from two software stacks to a single primary software stack that can run on all nodes. This will make it easier for you to run your jobs on different nodes, but it also means that we cannot build the software with full optimization for the underlying CPU architecture. However, for many workloads on ALICE this is not an issue. For users who need optimized modules, we are planning a separate stack after the maintenance is done.
Because we have to reinstall the software stack and install a new operating system, you will have to reinstall/compile software or packages that you installed locally, including python environments, conda environments, R packages, etc. For python and conda environments in particular, we recommend that you create requirements or environment files so that you can quickly set up your environment again.
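One possible way to create such files is sketched below. It assumes that the environment you want to preserve is currently active, and the directory name env-backup is only an example.

```shell
#!/bin/bash
# Example directory (an assumption) in which the environment descriptions are stored.
backup_dir="${PWD}/env-backup"
mkdir -p "$backup_dir"

# Save the pip-installed packages of the active Python environment, if pip is available.
if command -v python3 >/dev/null 2>&1 && python3 -m pip --version >/dev/null 2>&1; then
    python3 -m pip freeze > "$backup_dir/requirements.txt"
fi

# Save the description of the active conda environment, if conda is available.
if command -v conda >/dev/null 2>&1; then
    conda env export > "$backup_dir/environment.yml"
fi
```

The environments can later be recreated with python3 -m pip install -r requirements.txt and conda env create -f environment.yml. If the exported conda file does not resolve on the new operating system, exporting with conda env export --no-builds may help, because it leaves out the build strings.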
Important dates
12 April 2024 at 10:00: ALICE goes offline. All jobs will be cancelled. No further access to ALICE.
13/14 April 2024: Data center tests
15 April: Start system maintenance
26 April: Expected end of system maintenance