About

Partition gpu_strw with nodes node86[1-2] is a private partition with priority access for researchers from the group of E. Rossi (STRW) and general access for all users from STRW.

The hardware configuration of the nodes can be found here: https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/37519378/About+ALICE#Hardware-Description


Access

  • Priority access is granted only after confirmation from the PI.

  • Priority access to the partition can be requested via e-mail to the ALICE Helpdesk, with confirmation from the PI.

  • Users who belong to the group of E. Rossi need to be a member of the group gpu_strw and have access to the account gpu_strw

    • you can check whether you are a member of the group by running the command id on the command line

    • you can check whether you have access to the account by running sacctmgr show associations user=<username> where <username> should be replaced by your ALICE user name.

  • Other STRW users also have access through their STRW group and account
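The two checks above can be combined into one short script. This is only a sketch: has_group is a hypothetical helper written for this example, not an ALICE tool, and the sacctmgr step is skipped when the command is not available (for example on your own laptop).

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the space-separated group list in $1
# contains exactly the group name in $2.
has_group() {
    local group_list="$1" wanted="$2" g
    for g in $group_list; do
        if [ "$g" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

me="$(id -un)"

# id -nG prints the names of all groups the current user belongs to.
if has_group "$(id -nG)" gpu_strw; then
    echo "${me} is a member of the group gpu_strw"
else
    echo "${me} is NOT a member of the group gpu_strw"
fi

# sacctmgr is only available on the cluster, so skip this step elsewhere.
if command -v sacctmgr >/dev/null 2>&1; then
    sacctmgr show associations user="${me}" format=user,account,partition,qos
else
    echo "sacctmgr not found; run this on an ALICE login node"
fi
```

If the account gpu_strw does not show up in the sacctmgr output, contact the ALICE Helpdesk.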

Partition Settings

Partition settings can be adjusted to the needs of the group. Requests for changes can be made either by the PI or by a group member with confirmation from the PI, and should be sent to the ALICE Helpdesk.

Job submission

This partition requires a specific account.

  • Members of the group of E. Rossi

    • should use their account gpu_strw for running jobs. This is necessary so that usage of this account does not impact the fairshare of other STRW users.

      • In your batch script, you have to add: #SBATCH --account=gpu_strw

      • If you have access to the partition, you have been added to the account.

      • For other jobs on ALICE, your regular ALICE account is sufficient and you do not need to set this.

    • should use the qos of their PI

  • You can check this with the following command (where <username> should be replaced by your ALICE username):

Code Block
sacctmgr show association where user=<username>
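A missing account line is easy to overlook before submitting. The sketch below checks a batch script for the directive. The helper has_account_directive and the default file name job.sh are assumptions made for this example; only the --account=<name> form used on this page is recognised.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the batch script $1 contains a line
# "#SBATCH --account=<name>" for the account name given in $2.
has_account_directive() {
    local script="$1" account="$2"
    grep -Eq "^#SBATCH[[:space:]]+--account=${account}([[:space:]]|\$)" "$script"
}

script="${1:-job.sh}"   # assumed file name; pass your own script as first argument

if [ ! -f "$script" ]; then
    echo "$script: file not found"
elif has_account_directive "$script" gpu_strw; then
    echo "$script: account gpu_strw is set"
else
    echo "$script: no '#SBATCH --account=gpu_strw' line found"
fi
```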

Software

AMD-specific software stack

  • Because this node uses AMD CPUs, it is recommended to use the AMD software stack if you want to make use of the modules available on ALICE.

    • The main ALICE software stack was compiled for Intel CPUs. Modules from the Intel software stack may work on this node, but it is recommended to try the AMD software stack first.

    • New software can be added to the AMD software stack by contacting the ALICE Helpdesk.

  • To get a list of currently available modules in the AMD software stack when you are on a login node, use module load ALICE/AMD and then module avail

    • If you want to go back to the Intel software stack, use module load ALICE/Intel

  • For a Slurm batch script that you submit to gpu_strw, you can also use module load ALICE/default which will load the correct software stack based on the CPU architecture of the node.

    You can include module load ALICE/default in all your batch scripts if you like.
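To illustrate the idea of picking a software stack from the CPU type, here is a minimal sketch. It is not how ALICE/default is implemented. The helper stack_for_vendor is hypothetical, and reading the vendor from /proc/cpuinfo is an assumption that only holds on Linux. In a batch script, module load ALICE/default remains the simpler choice.

```shell
#!/bin/bash
# Hypothetical helper: map a CPU vendor string to a software stack module.
stack_for_vendor() {
    case "$1" in
        AuthenticAMD) echo "ALICE/AMD" ;;
        GenuineIntel) echo "ALICE/Intel" ;;
        *)            echo "ALICE/default" ;;
    esac
}

# Read the vendor of the first CPU listed in /proc/cpuinfo (Linux only).
vendor="$(awk -F': ' '/^vendor_id/ {print $2; exit}' /proc/cpuinfo 2>/dev/null)"
stack="$(stack_for_vendor "$vendor")"
echo "CPU vendor: ${vendor:-unknown} -> ${stack}"

# The module command only exists where the module system is set up.
if type module >/dev/null 2>&1; then
    module load "$stack"
    module avail
fi
```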

  • Other STRW members can use the partition gpu_strw without any specific settings.


Scientific software stack

  • You can make use of the general scientific software stack which can be accessed by running

    Code Block
    module load ALICE/default

    It is recommended to add this to your batch scripts, too.

  • If you want to use software fully optimized for the CPU architecture of the nodes, you have to build the software yourself.

Your own scripts/programmes

  • Because this node has a different CPU than the Intel-based login nodes, conda environments or other software that you build on the login nodes might not work if the build is optimized for the CPU architecture.

  • This also applies to installing conda and Python virtual environments.

  • If you plan to make use of both the Intel and the AMD nodes, it might be necessary to set up two versions. Of course, you can always test whether one version is sufficient.

  • If software built on the login nodes does not work on this node, you need to compile such scripts/software on the node as part of a batch or interactive job

    • One way to do this is to create a short Slurm batch job specifically for compiling your software, setting up your conda/Python environments, etc. If you only need to do this once, then there is no need to make this part of your production batch job.

    • Another option is to compile the first time you run your programme as part of a job. In this first job, you copy the compiled program back to your shared storage or home directory. For the next job, you use the already compiled version (see example below).

  • You can still use the login nodes for testing/debugging. In this case, you need to compile on the login nodes, run your test, and then compile again on the compute node for your job.
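The second option (compile in the first job, reuse the binary afterwards) can be sketched as a small check at the start of a job. The helper needs_build is hypothetical. The file names are those of the example on this page, and gcc with -fopenmp is assumed to be available on the node.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if binary $2 is missing, not executable,
# or older than its source file $1.
needs_build() {
    local src="$1" bin="$2"
    [ ! -x "$bin" ] || [ "$src" -nt "$bin" ]
}

src="omp_hello.c"      # source file from the example on this page
bin="omp_hello_amd"    # binary name used in the example on this page

if [ ! -f "$src" ]; then
    echo "$src not found, nothing to build"
elif needs_build "$src" "$bin"; then
    echo "compiling $src on $(hostname)"
    gcc -fopenmp -o "$bin" "$src"
else
    echo "reusing existing $bin"
fi
```

Keeping separate binary names for builds made on the login nodes and on the compute node avoids accidentally running one build on the other CPU type.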

Example

Here is an example of what a Slurm batch script for this node could look like. It includes a HelloWorld OpenMP program to demonstrate compiling on the node and using the local scratch storage.

If you are new to HPC, ALICE or Slurm, have a look at https://pubappslu.atlassian.net/wiki/spaces/HPCWIKI/pages/5963809 first.

Batch script

Code Block
#!/bin/bash
#SBATCH --partition=gpu_strw
#SBATCH --account=gpu_strw
#SBATCH --job-name=test_job
#SBATCH --time=0-00:02:00
#SBATCH --output=%x_%j.out
#SBATCH --nodes=1
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=3
#SBATCH --mem=10G
#SBATCH --mail-user="your-email-address"
#SBATCH --mail-type="ALL"

module load ALICE/default
module load OpenMPI/4.0.5-GCC-9.3.0

echo "#### Test started"

# return the name of the node
echo "## Which node is this: $HOSTNAME"

# check the number of cores (ntasks*cpus-per-task)
echo "How many cores do I have access to: ${SLURM_CPUS_ON_NODE}"

# Just to check that the AMD software stack is loaded
echo "Am I loading from the right module path?"
echo "${MODULEPATH%%:*}"

# get the current working directory
CWD=$(pwd)

echo "## Where am I: ${CWD}"

# check out the node's local scratch
echo "## My local scratch space on the node is: ${SCRATCH}"
cd "$SCRATCH"

echo "## Let us go there: $(pwd)"

# In case the file has already been compiled
# and stored in $CWD, the following six lines
# are not necessary  
echo "## Let us copy the C script to it"
cp $CWD/omp_hello.c $SCRATCH/  
echo "## Is the file there?"
ls -la omp_hello.c
echo "## Now we compile it on the node"
gcc -o omp_hello_amd -fopenmp omp_hello.c

# In case the file is already compiled
# the next four lines would copy it
# and check that it is there:
#echo "## Let us copy the compiled C programme to it"
#cp $CWD/omp_hello_amd $SCRATCH/
#echo "## Is the file there?"
#ls -la omp_hello_amd

echo "## Let us run it"
# Slurm sets SLURM_CPUS_PER_TASK (singular) when --cpus-per-task is given
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./omp_hello_amd

# Copy those files back to shared scratch or home
# that should be kept for later.
# Here, it is just the compiled C programme.
# It does not need to be copied back of course
# if it came from shared scratch or home.
echo "## Saving files that should be saved."
cp $SCRATCH/omp_hello_amd $CWD/

echo "## Now that this is done, I want to go home"
cd $CWD
echo "## Good to be back $(pwd)"

echo "#### Test finished"
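The batch script takes the OpenMP thread count from Slurm. Slurm only sets SLURM_CPUS_PER_TASK when --cpus-per-task is requested, so the variable is empty in a test run on a login node. A minimal sketch of a fallback follows. The helper name threads_for_job and the default of one thread are assumptions made for this example.

```shell
#!/bin/bash
# Hypothetical helper: print the number of OpenMP threads to use.
# Inside a job with --cpus-per-task this is the value set by Slurm;
# otherwise fall back to a single thread.
threads_for_job() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS="$(threads_for_job)"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

With the settings of the batch script above (--cpus-per-task=3), each of the 5 tasks would run 3 threads, which matches the 15 cores requested in total.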

OpenMP script

Here is the content of the file omp_hello.c from https://computing.llnl.gov/tutorials/openMP/samples/C/omp_hello.c

Code Block
/******************************************************************************
* FILE: omp_hello.c
* DESCRIPTION:
*   OpenMP Example - Hello World - C/C++ Version
*   In this simple example, the master thread forks a parallel region.
*   All threads in the team obtain their unique thread number and print it.
*   The master thread only prints the total number of threads.  Two OpenMP
*   library routines are used to obtain the number of threads and each
*   thread's number.
* AUTHOR: Blaise Barney  5/99
* LAST REVISED: 04/06/05
******************************************************************************/
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
int nthreads, tid;

/* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel private(nthreads, tid)
  {

  /* Obtain thread number */
  tid = omp_get_thread_num();
  printf("Hello World from thread = %d\n", tid);

  /* Only master thread does this */
  if (tid == 0)
    {
    nthreads = omp_get_num_threads();
    printf("Number of threads = %d\n", nthreads);
    }

  }  /* All threads join master thread and disband */

}