Maintenance on ALICE

This section is used to announce upcoming maintenance and provide information before, during and after it. For general information about our maintenance policy, please have a look here: https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37519739

ALICE User Migration to the New Cluster Management System

This migration affects all users on ALICE.

Status of Migration

finishing - Last Updated: Nov 5, 2025

  • 5 Nov

    • Logging into the gateway (ssh-gw) with new passwords now works.

    • The Open OnDemand portal has been migrated and now offers new resources in the interactive partition.

  • 29 Oct

    • All nodes have been migrated. The old queueing system is down, as the license of the old management system has expired.

    • The Open OnDemand portal and the ssh gateway are still being worked on.

    • eduVPN is an alternative to the ssh-gw if you have a ULCN account: eduVPN (with ULCN) - HPC wiki

  • 15 Oct

    • The third batch of nodes has been migrated to the new environment and is available through Slurm.

  • 10 Oct

    • The third batch of nodes has been placed in a reservation and will be migrated on 15 October (see “Migration of compute resources on 15 Oct 2025“ below).

    • The private partitions have all been migrated to the new system.

    • The second batch of nodes has been placed in a reservation and will be migrated on 9 October.

  • 2 Oct

    • Early access to the new system is now available through the new login nodes login3 and login4.

Not all compute resources have been migrated, because the “old” system is still in use during the transition phase. Until the end of the transition phase, we will gradually migrate all compute resources to the new system.

Open issues

  • RDP on login3

    • RDP on login3 is not working properly yet. Apps can start, but you will see a black background. RDP is working fine on login4.

  • Open OnDemand has not yet been migrated.

  • Passwords can differ between the ssh-gw and the login nodes, because the ssh-gw has not been migrated (yet).

Questions and Assistance

If you have any questions or need assistance, do not hesitate to contact us through the ALICE Helpdesk email address.

Access

  • the ssh gateway for the new system is not yet online

  • for users created before 01 Oct. 2025

    • for LEI users

      • you can directly connect to the login3 and login4 without the ssh gateway using eduVPN

      • or connect through the ssh gateway, but we recommend that you set up ssh keys for this

    • Other users

      • you can reach login3 and login4 by tunneling through the current ALICE ssh gateway. If you have not done so yet, we recommend setting up ssh keys.

  • for users created after 01 Oct. 2025

    • for LEI users

      • you can directly connect to login3 and login4 without the ssh gateway using eduVPN

      • or connect through the ssh gateway, but you will have to set up ssh keys for this

    • Other users

      • you can reach login3 and login4 by tunneling through the current ALICE ssh gateway, but you will have to set up ssh keys.

  • The Open OnDemand portal is not yet available on the new system
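Tunneling through the current gateway can be set up once in your ssh config so that you reach login3 and login4 in one step. The following is a minimal sketch; the hostnames and username are placeholders, not the actual ALICE addresses, so substitute the values from the documentation:

```shell
# ~/.ssh/config -- sketch only; hostnames and username below are placeholders
Host alice-gw
    HostName ssh-gateway.example.nl     # replace with the current ALICE gateway address
    User your-alice-username
    IdentityFile ~/.ssh/id_ed25519      # ssh keys are recommended (required for new accounts)

Host login3 login4
    HostName %h.example.nl              # replace with the actual login node addresses
    User your-alice-username
    ProxyJump alice-gw                  # tunnel through the gateway
```

With this in place, `ssh login3` connects through the gateway automatically.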

Slurm changes

  • long and medium partitions have been merged and replaced by hardware-specific partitions (e.g., cpu-zen4, gpu-l4)

  • selecting specific hardware for the cpu-short and gpu-short partitions can still be done through features using --constraint

  • you will always have to specify a time limit for jobs

  • gpu partitions are only for jobs that need a gpu

  • separate testing partition for testing and debugging jobs

  • separate interactive partition

  • check available partitions and nodes with sinfo
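As an illustration of these rules, a minimal job script for the new layout might look as follows. The partition name is taken from this page, but the feature name is an assumption; verify both with `sinfo` and the documentation:

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=cpu-short        # or a hardware-specific partition such as cpu-zen4
#SBATCH --constraint=zen4            # feature name is an assumption; check available features
#SBATCH --time=00:10:00              # a time limit is now always required
#SBATCH --ntasks=1

module load ALICE                    # make the module stack available
echo "Running on $(hostname)"
```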

Software

  • There have been no changes to the module stack. If the stack is not automatically available, just run

    module load ALICE
  • If you see the following warning after logging in:

    Lmod has detected the following error: The following module(s) are unknown: "slurm"

    and/or

    Lmod has detected the following error: The following module(s) are unknown: "gcc"

    you have to remove the lines “module load slurm” and/or “module load gcc” from your .bashrc. On the new system, Slurm is no longer a module, and gcc has to be loaded through a different module.

  • Software installed in your own user environment should continue to work on the new system.
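Removing the stale lines from your .bashrc can be done by hand in an editor, or with a small sed command. A sketch that keeps a backup first:

```shell
# Remove the now-invalid "module load slurm" / "module load gcc" lines from .bashrc
rcfile="$HOME/.bashrc"
if [ -f "$rcfile" ]; then
    cp "$rcfile" "$rcfile.bak"                                      # keep a backup
    sed -i '/^module load slurm$/d; /^module load gcc$/d' "$rcfile"
fi
```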

Migration of compute resources on 09 Oct 2025

On 09 Oct 2025, we will migrate the next set of compute resources from all partitions, which is why they have been placed in a reservation.

We will also migrate all private partitions:

  • cpu_natbio

  • cpu_lorentz

  • gpu_strw

  • gpu_lion

  • gpu_cml

  • gpu_lucdh

This is the complete list of nodes that will be migrated on this day:

  • node[011-015,018-024,801-802,857-862,867-873,877-879]

During the migration, the nodes will be temporarily offline.

After the migration is complete, the compute resources will become available on the new system.

If you want to make use of the private partitions, you will have to switch to login3 or login4.

Migration of compute resources on 15 Oct 2025

On 15 Oct 2025, we will migrate the next set of compute resources from all partitions, which is why they have been placed in a reservation. They will continue to process jobs that finish before the reservation starts. If you need to run longer jobs, please migrate to the new environment.

This is the complete list of nodes that will be migrated on this day:

  • node[005-010,030-033,851-852,863-864,880-883]

During the migration, the nodes will be temporarily offline.

After the migration is complete, the compute resources will become available on the new system.

Overview

We have been building a new management node, because we are migrating to a new cluster management system (TrinityX) for ALICE. This step is necessary because of the increasing licensing costs for the current cluster management system (Bright).

The process requires a complete rebuild of the ALICE cluster nodes (except the storage). On the plus side, it means that we can finally make all the new compute hardware that was bought earlier this year available to you. The new system will also make it easier for us to integrate new hardware into the cluster.

User Migration

Instead of a hard switch of all users from one system to the other, we aim for a transition phase for a limited amount of time. The transition phase will start on 01 Oct and end on 26 Oct. 2025.
During this transition phase both systems will run and users can start getting to know the new system, while we move nodes from the old to the new system.

There is a hard deadline for the migration, though, set by the expiration of the license for the current cluster management system. After 26 Oct., the “old” environment, including access to login1 and login2, will no longer be available.

What does this mean for you?

During this migration, you will keep access to your data and the ability to submit jobs. The storage will be unaffected and shared between the systems.

We will update the user documentation for ALICE to reflect the changes.

On the new system, accessible via two new login nodes, the partition layout will be slightly different. We hope that this will make the partitions more intuitive.

  • the testing partition will be moved to compute nodes and will no longer run on the login nodes

  • we will add a dedicated partition for interactive jobs

  • there will only be a short and long partition. The medium partition will be dropped.

  • separate partitions per gpu type (e.g., gpu-l4, gpu-2080ti, gpu-a100, ..)

The Slurm accounting will start fresh, so job ids will start (almost) from the beginning. We recommend that you store job output files for jobs on the new system in a different location so that old files do not get overwritten.
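One way to keep output from the new system separate is to point job output at a dedicated directory; the directory and script names below are just examples:

```shell
# Collect output from jobs on the new system in its own directory
mkdir -p "$HOME/jobs-new-system"
sbatch --output="$HOME/jobs-new-system/%x-%j.out" myjob.slurm   # %x = job name, %j = job id
```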

A new ssh gateway and new login nodes will be used for the new system. You will have to adjust your ssh config settings accordingly. Your ssh keys will remain unaffected. We will communicate when the new gateway is available. Until then, you can use the current gateway to access the new login nodes, but we recommend that you set up ssh keys.
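If you have not set up ssh keys yet, the usual steps are sketched below; the gateway hostname and username are placeholders, so use the actual address from the documentation:

```shell
# Generate an ed25519 key pair (skip this if you already have one)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

# Copy the public key to the gateway; hostname and username are placeholders
ssh-copy-id -i ~/.ssh/id_ed25519.pub your-username@ssh-gateway.example.nl
```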

The software and ALICE module stack will be shared. There will be no changes to the ALICE module stack. The current stack will continue to work.

For some time now, we have been recommending RDP as a replacement for X2Go. On the new system, only RDP will be available. There is still an alternative to RDP through the ALICE Open OnDemand portal. We are planning to migrate the Open OnDemand portal on 27 Oct 2025 after the end of the transition phase.

Please try running your jobs on the new system as soon as it becomes available to you (a separate announcement will follow). The closer we get to the deadline for the transition phase, the fewer resources will be available on the old system. If you run into issues, please let us know.

Timeline

  • 01 Oct. 2025: New system will come online. First users will get access

  • 09 Oct. 2025: New system will be generally available to all users

  • 15 Oct. 2025: Third batch of resources will be moved

  • 26 Oct. 2025: Transition phase will end. Old system will become unavailable. Only the new system will be available

  • 27 Oct. 2025: Migration of Open OnDemand

 

FAQ and issues

  • The hostkey of login.alice.universiteitleiden.nl has changed, because the alias to the login servers now points to login3 and login4.
    If you get the warning below, update your ~/.ssh/known_hosts file by removing the old entry with: ssh-keygen -R 'login.alice.universiteitleiden.nl'

    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
    Someone could be eavesdropping on you right now (man-in-the-middle attack)!
    It is also possible that a host key has just been changed.
    The fingerprint for the ED25519 key sent by the remote host is
    SHA256:7+NddwW6meoT3eFEUUt5dt1juFnvmmgllhnsrih0AQo.
  • Some users had problems loading Matlab in the graphical user interface, with the error: Failed to load module "canberra-gtk-module".
    This was solved by first loading GTK3 from the software stack: “module load GTK3”.

Questions and Assistance

If you have any questions or need assistance, do not hesitate to contact us through the ALICE Helpdesk email address.