Best Practices

This page contains a number of best-practice recommendations for using ALICE or SHARK.

Best Practices - Shared File System

Your I/O activity can have a dramatic effect on the performance of your own jobs and on other users.  If you are uncertain, ask for advice on improving your I/O activity; the time spent can be repaid many times over in faster job execution.

  • Be aware of I/O load. If your workflow creates a lot of I/O activity, then launching dozens of jobs that all do the same thing may be detrimental.

  • Avoid storing many files in a single directory. Hundreds of files is probably ok; tens of thousands is not.

  • Avoid opening and closing files repeatedly in tight loops.  If possible, open files once at the beginning of your workflow / program, then close them at the end (see the example at the end of this list).

  • Watch your quotas.  You are limited in both capacity and file count. Use "uquota" to check your current usage. Note that in /home the scheduler also writes files to a hidden directory assigned to you.

  • Avoid writing frequent snapshot files, which can stress the storage.

  • Limit file copy sessions. You share the bandwidth with others.  Two or three scp sessions are probably ok; >10 is not.

  • Consolidate files. If you are transferring many small files consider collecting them in a tarball first.

  • Use parallel I/O libraries where they are available, e.g. "module load phdf5".

  • Use local storage for working space. Using the local storage on each node for your data will improve the performance of your job and reduce the I/O load on the shared file systems (see the sketch after this list).
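
As an illustration of opening files once rather than repeatedly (see above), the bash sketch below writes loop output by redirecting the whole loop to the output file once instead of appending on every iteration; the file name and loop body are placeholders.

    # Inefficient: results.txt is opened and closed on every iteration
    for i in $(seq 1 10000); do
        echo "step $i" >> results.txt
    done

    # Better: the output file is opened once for the whole loop
    for i in $(seq 1 10000); do
        echo "step $i"
    done > results.txt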

 
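Below is a minimal sketch of a job script that stages data to node-local storage, runs there, and copies results back.  It assumes a Slurm batch job; the local scratch location, input/output paths and the application name (my_app) are placeholders, so check the cluster documentation for the real node-local path.

    #!/bin/bash
    #SBATCH --job-name=local_scratch_example
    #SBATCH --ntasks=1
    #SBATCH --time=01:00:00

    # Placeholder for the node-local scratch area; the real path is cluster-specific
    WORKDIR="${TMPDIR:-/tmp}/${SLURM_JOB_ID}"
    mkdir -p "$WORKDIR"

    # Stage input from the shared file system to local storage
    cp "$HOME/data/input.dat" "$WORKDIR/"
    cd "$WORKDIR"

    # Run against local storage (my_app is a placeholder for your application)
    "$HOME/my_app" input.dat > output.dat

    # Copy results back to the shared file system and clean up
    cp output.dat "$HOME/results/"
    rm -rf "$WORKDIR"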

Best Practices - Login nodes

Know when you are on a login node.  Your Linux prompt or the command hostname will tell you the name of the node that you are currently on. Note that the ssh gateway host itself is a secure portal from the outside and serves no compute function.
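
For example (the node name shown here is hypothetical):

    $ hostname
    login01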

  • Appropriate activities on the login nodes:

    1. Compiling code and developing applications,

    2. Defining and submitting your job,

    3. Post-processing and managing data,

    4. Monitoring running applications,

    5. Changing your user password.

  • Avoid computationally intensive activity on the login nodes.

    1. Don't run research applications.  Use an interactive session (see the example after this list) if running a batch job is not appropriate.

    2. Don't launch too many simultaneous processes.  While it is fine to compile on a login node, avoid using all of its resources. For example, "make -j 2" will use two cores.

    3. That script you run to monitor job status several times a second should probably run every few minutes.

    4. I/O activity can slow the login node down for everyone, for example multiple simultaneous copy operations or "ls -l" on directories with thousands of files.
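
If you do need to work interactively instead of on a login node, Slurm can allocate resources for an interactive shell on a compute node.  A minimal sketch, assuming a standard Slurm setup (options, partitions and limits may differ on your cluster):

    # Request one CPU for 30 minutes and start an interactive shell on a compute node
    srun --ntasks=1 --time=00:30:00 --pty bash -i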

Best Practices - Running jobs

  • Don't ask for more time than you really need.  The scheduler will have an easier time finding a slot for the 2 hours you need than for the 48 hours you might otherwise request.  When a job finishes it reports the time used, which you can use as a reference for future jobs.  However, don't cut the time too tight.  If something like shared I/O activity slows the job down and it runs out of time, the job will fail.

  • Specify the resources you need as precisely as possible. Do not just specify the partition; be clear about the main job resources, i.e., number of nodes, number of CPUs/GPUs, walltime, etc. The more information you give Slurm, the better for you and other users (see the sketch after this list).

  • Test your submission scripts.  Start small.  You can use the testing queue, which has a higher priority but a short maximum run time.

  • Use the testing queue.  It has a higher priority which is useful for running tests that can complete in less than 10 minutes.

  • Respect memory limits.  If your application needs more memory than is available, your job could fail and leave the node in a state that requires manual intervention.

  • Do not run scripts that automate job submissions. Executing large numbers of sbatch calls in rapid succession can overload the scheduler, leading to problems with overall system performance. A better alternative is to submit job arrays (see the sketch after this list).
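
As a sketch of what an explicit resource request might look like, the job-script header below specifies walltime, nodes, tasks, CPUs per task and memory.  The partition name, the resource values and the application (my_app) are placeholders; adapt them to your cluster and workload.

    #!/bin/bash
    #SBATCH --job-name=my_job          # placeholder job name
    #SBATCH --partition=cpu-short      # placeholder partition; check sinfo for the real names
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=8G                   # stay within the memory available on the node
    #SBATCH --time=02:00:00            # request what you need, with a sensible margin

    ./my_app                           # placeholder for your application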

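Instead of submitting many near-identical jobs in a loop, a job array lets the scheduler manage them as a single submission.  A minimal sketch, where the array size, input file names and the application are placeholders:

    #!/bin/bash
    #SBATCH --job-name=array_example
    #SBATCH --array=1-100              # 100 tasks, numbered 1..100
    #SBATCH --ntasks=1
    #SBATCH --time=00:30:00

    # Each array task picks its own input file via SLURM_ARRAY_TASK_ID
    ./my_app "input_${SLURM_ARRAY_TASK_ID}.dat"
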
Best Practices - Data Transfer

  • Use scp for transferring a few smaller files, and sftp or rsync for larger transfers or many files (see the examples at the end of this section).

  • If you need to copy the same directories again and again, synchronise them so that only the files that have been updated are copied, instead of copying all files.

  • If you have a lot of small files to copy, combine them into a zip or tar file before copying. Copying a large number of small files usually takes more time on both ends than copying fewer, larger files.

  • You can use git to clone repositories to your user directory on the cluster.

    • You can also consider doing development directly on the cluster.
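
The commands below sketch the transfer recommendations above: a one-off copy with scp, a repeated synchronisation with rsync, and bundling many small files with tar before transfer.  Host names, user names and paths are placeholders.

    # One-off copy of a single file to the cluster
    scp results.tar.gz user@cluster.example.org:/home/user/

    # Repeated synchronisation of a directory: only new or changed files are transferred
    rsync -av my_project/ user@cluster.example.org:/home/user/my_project/

    # Bundle many small files into one compressed archive before transferring ...
    tar czf my_results.tar.gz my_results/

    # ... and unpack it again at the destination
    tar xzf my_results.tar.gz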