Maintenance Policy

Most maintenance is performed during regular hours with no interruption to service, so you will usually not notice it. However, we also need to perform maintenance that affects your job processing or data access. This page describes how we generally perform maintenance on ALICE.

General

Maintenance generally occurs on Mondays during working hours.

Minor Maintenance

Minor maintenance affects only a small part of the cluster (on the order of one node) and will not significantly affect you as a user. Examples include:

  • Testing a new node image on a single node

  • Testing new features that require reserving a node

  • Restarting nodes to update images (including reserving/draining the nodes in question)

Minor maintenance can occur on any Monday without prior notice to users.

Major Maintenance

Major maintenance affects large parts of the cluster and your ability to run jobs and/or access data. Examples include:

  • Updates to one or more compute node groups which would make entire partitions unavailable

  • Updates to the cluster management nodes

  • Updates to the storage server

Major maintenance can be scheduled on the first Monday of every month and will be announced at least one week in advance. If there is no need for major maintenance, it will not take place and there will be no announcement. The corresponding Monday will still be available for minor maintenance.
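If you want to plan long-running jobs around the possible major maintenance slots, the scheduling rule above can be sketched in a few lines of Python. This is only an illustration of the "first Monday of every month" rule and the one-week notice period. The function names are hypothetical and not part of any ALICE tooling, and a computed date does not mean that maintenance will actually take place on that day.

```python
from datetime import date, timedelta


def first_monday(year: int, month: int) -> date:
    """Return the first Monday of the given month."""
    first = date(year, month, 1)
    # date.weekday() returns 0 for Monday, so this is the number of
    # days from the 1st of the month to the first Monday (0 if the 1st is a Monday).
    offset = (7 - first.weekday()) % 7
    return first + timedelta(days=offset)


def next_major_maintenance_slot(today: date) -> date:
    """Return the next possible major maintenance slot on or after `today`."""
    slot = first_monday(today.year, today.month)
    if slot < today:
        # This month's slot has already passed, so use the following month.
        if today.month == 12:
            slot = first_monday(today.year + 1, 1)
        else:
            slot = first_monday(today.year, today.month + 1)
    return slot


def latest_announcement_date(slot: date) -> date:
    """Return the latest date by which a slot must be announced (one week before)."""
    return slot - timedelta(days=7)


if __name__ == "__main__":
    slot = next_major_maintenance_slot(date(2024, 1, 2))
    print(slot)                            # 2024-02-05
    print(latest_announcement_date(slot))  # 2024-01-29
```

If no announcement has been made by the date returned by latest_announcement_date, the policy above implies that no major maintenance will take place in that slot.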

System Maintenance

System maintenance refers to maintenance that requires the entire cluster to be taken offline. Such maintenance can happen once every six months and will be announced in advance.

Critical/Emergency Maintenance

Critical/emergency maintenance refers to maintenance that is required because of a sudden, critical issue that needs immediate attention. While this can happen at any time, regular maintenance helps ensure that such events remain rare. Naturally, critical/emergency maintenance cannot be announced in advance. However, we will strive to inform you when it happens, most likely through the maintenance page (see below).

Updates and Announcement of Maintenance

Information about upcoming or ongoing maintenance can be found here: https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37519592. Major and system maintenance will also be announced via email.