Python on ALICE and SHARK
There are different versions of Python available on ALICE and SHARK. Some have also been built with CUDA support. You can choose whichever version is available and suitable for you.
Note |
---|
Python is also always available by default from the operating system. Do not use this version of Python for your jobs. Always make use of a module. |
ALICE
You can find a list of available versions with
Code Block |
---|
module -r avail '^Python/' |
Choose a module and add it to your environment by loading it, e.g.,:
Code Block |
---|
module load Python/3.11.5-GCCcore-13.2.0 |
The command python --version
returns the version of Python you have loaded:
Code Block |
---|
[me@nodelogin01 ~]$ python --version
Python 3.11.5 |
The command which python
returns the location where the Python executable resides:
Code Block |
---|
which python
/cm/shared/easybuild/software/Python/3.11.5-GCCcore-13.2.0/bin/python |
There are also several Python packages available as modules, as well as other applications that have been built with Python support. You can find them by running
Code Block |
---|
module avail Python |
Miniconda is also available on ALICE in addition to applications that use Miniconda. You can get an overview by running
Code Block |
---|
module avail conda |
This tutorial will not go into detail on using Miniconda. Note that conda environments can become quite large. If you are not sure whether they will fit into your quota-limited home directory, use the shared scratch space.
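If you do use conda, you can place an environment outside your home directory with the --prefix option. Below is a minimal sketch, assuming a conda module has been loaded and using a placeholder scratch path that you need to adjust to your own directory:
Code Block |
---|
# the scratch path is a placeholder - replace it with your own directory
conda create --prefix /path/to/your/scratch/conda_envs/my_env python numpy
# depending on how conda is set up, activation may differ slightly
conda activate /path/to/your/scratch/conda_envs/my_env |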
SHARK
You can find a list of available versions with
Code Block |
---|
module avail /python/ |
Choose a module and add it to your environment by loading it, e.g.,:
Code Block |
---|
module load system/python/3.10.2 |
The command python --version
returns the version of Python you have loaded:
Code Block |
---|
[me@res-hpc-lo02 ~]$ python --version
Python 3.10.2 |
The command which python
returns the location where the Python executable resides:
Code Block |
---|
which python
alias python='/share/software/system/python/3.10.2/bin/python3.10'
/share/software/system/python/3.10.2/bin/python3.10 |
There are also several Python packages available as modules, as well as other applications that have been built with Python support. You can find them by running
Code Block |
---|
module avail python |
Miniconda is also available on SHARK in addition to applications that use Miniconda. You can get an overview by running
Code Block |
---|
module avail conda |
This tutorial will not go into detail on using Miniconda. Note that conda environments can become quite large. If you are not sure whether they will fit into your quota-limited home directory, use the shared scratch space.
Preparations
It is always a good idea to look at the current load of the cluster before you submit a job. It also helps to run some short, resource-friendly tests to check that your setup works and that your batch file is correct.
The “testing” partition on ALICE or the “short” partition on SHARK can be used for this purpose. The examples in this tutorial are safe to use on those partitions.
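For example, you can get a quick impression of the partitions and of the jobs that are currently queued or running with the standard Slurm commands sketched below (the partition name shown is just the one used in this tutorial):
Code Block |
---|
# overview of partitions and node states
sinfo
# jobs currently queued or running in, e.g., the testing partition
squeue -p testing |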
Here, we will assume that you have already created a directory called user_guide_tutorials
in your $HOME
from the previous tutorials. For this job, let's create a sub-directory and change into it:
Code Block |
---|
mkdir -p $HOME/user_guide_tutorials/first_python_job
cd $HOME/user_guide_tutorials/first_python_job |
Since this tutorial will go through different examples of Python jobs, further preparations are discussed for each example.
We will make use of the NumPy package in this tutorial. For demonstration purposes, this tutorial will show you how to install it in your user environment in a Python virtual environment.
Setting up NumPy in a virtual environment
In most cases, it is best to set up your own Python environment and install all necessary packages manually from the command line rather than making the installation part of the Slurm batch file. Alternatively, you can create a separate job that only takes care of installing the virtual environment (see the sketch at the end of this section). Here, we will install manually.
First, we have to load one of the available Python modules:
ALICE
You are free to use a Python module of your choice. For this tutorial, we will use:
Code Block |
---|
module load Python/3.11.5-GCCcore-13.2.0 |
Info |
---|
On ALICE, you can also make use of the SciPy-bundle module, which includes NumPy as well as several other packages and dependencies for HPC purposes. |
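To check which versions of this bundle are installed, a module search similar to the ones above should work:
Code Block |
---|
module avail SciPy-bundle |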
SHARK
You are free to use a Python module of your choice. For this tutorial, we will use:
Code Block |
---|
module load system/python/3.9.13 |
Next, we will create the virtual environment in the directory of our test job (assuming that you have changed into this directory):
Code Block |
---|
python -m venv python_test_venv |
To activate the newly created virtual environment, we have to source it:
Code Block |
---|
source python_test_venv/bin/activate |
Note how the command line prompt changed from [me@<nodename> first_python_job]
to (python_test_venv) [me@<nodename> first_python_job]
indicating the active virtual environment. You can also see this by listing the packages installed in the virtual environment; the list is quite different from what you get outside the environment and might look like this (version numbers will most likely differ):
Code Block |
---|
(python_test_venv) [me@<nodename> first_python_job]$ pip list
Package    Version
---------- -------
pip        19.0.3
setuptools 40.8.0 |
Before we install any packages, we update the existing pip and setuptools packages and install the package wheel by running
Code Block |
---|
pip install --upgrade pip
pip install --upgrade setuptools
pip install wheel |
Now, we are ready to install the Python packages that we need. In this case, we just need NumPy, so we run
Code Block |
---|
pip install numpy |
If the installation was successful, you should see a message such as this: Successfully installed numpy-<version>.
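As a quick sanity check (not required), you can verify that NumPy can be imported from the active virtual environment:
Code Block |
---|
python -c "import numpy; print(numpy.__version__)" |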
You can also create a requirements file which lists all packages that you want to install. If you pass this file to pip, it will install all of the listed packages. This helps with reproducibility because you can easily re-create a virtual environment with the same package configuration. Conda has a similar feature.
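A minimal sketch of this workflow (the file name requirements.txt is just a common convention):
Code Block |
---|
# record the currently installed packages and their versions
pip freeze > requirements.txt
# later, install exactly those packages into a (new) virtual environment
pip install -r requirements.txt |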
You can leave the virtual environment by running:
Code Block |
---|
deactivate |
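If you prefer to set up the virtual environment through a separate Slurm job instead of doing it on the command line, such a job could look roughly like the sketch below (shown for ALICE; the module name and paths are the ones used in this tutorial, so adjust them to your own setup):
Code Block |
---|
#!/bin/bash
#SBATCH --job-name=install_python_venv
#SBATCH --output=%x_%j.out
#SBATCH --partition=testing
#SBATCH --time=00:10:00
#SBATCH --mem=1G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# load the same Python module that you will use for your jobs
module load Python/3.11.5-GCCcore-13.2.0

# create and activate the virtual environment
cd $HOME/user_guide_tutorials/first_python_job
python -m venv python_test_venv
source python_test_venv/bin/activate

# update the basic packaging tools and install the packages you need
pip install --upgrade pip setuptools wheel
pip install numpy

deactivate |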
A serial Python job
First, we will prepare and run a simple Python job that calculates the median of a randomly generated array several times. Here, we will do this in a serial manner on a single core.
Preparations
The Python script
We will use the following Python script for this example and save it as test_python_simple.py
.
Code Block |
---|
"""
Python test script for the ALICE user guide.
Serial example
"""
import numpy as np
import os
import socket
from time import time


def mysim(run, size=1000000):
    """
    Function to calculate the median of a randomly generated array
    """
    # get pid
    pid = os.getpid()
    # initialize
    rng = np.random.default_rng(seed=run)
    # create random array
    rnd_array = rng.random(size)
    # get median
    arr_median = np.median(rnd_array)

    print("(PID {0}) Run {1}: Median of simulation: {2} ".format(pid, run, arr_median))

    return arr_median


if __name__ == "__main__":

    # get starting time of script
    start_time = time()

    print("Python test started on {}".format(socket.gethostname()))

    # how many simulation runs:
    n_runs = 100
    size = 10000000

    print("Running {0} simulations of size {1}".format(n_runs, size))

    # go through the simulations
    for i in range(n_runs):
        # run the simulation
        run_result = mysim(i, size=size)

    print("Python test finished (running time: {0:.1f}s)".format(time() - start_time)) |
For demonstration purposes, the script contains quite a few print statements. Since this is a very basic example, we will not use proper logging, but write everything out to the Slurm output file.
The Slurm batch file
The next step is to create the corresponding Slurm batch file which we will name test_python_simple.slurm
. We will make use of the testing partition on ALICE or the short partition on SHARK. Make sure to change the partition and resource requirements for your production jobs. The running time and amount of memory have already been set to fit the resources that this job needs. If you do not know the resource needs of your job beforehand, it is best to start with a conservative estimate and reduce the requirements later.
ALICE
Code Block |
---|
#!/bin/bash
#SBATCH --job-name=test_python_simple
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_email_address>"
#SBATCH --mail-type="ALL"
#SBATCH --mem=100M
#SBATCH --time=00:05:00
#SBATCH --partition=testing
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# load modules (assuming you start from the default environment)
# we explicitly call the modules to improve reproducibility
# in case the default settings change
module load Python/3.11.5-GCCcore-13.2.0

# Source the Python virtual environment
source $HOME/user_guide_tutorials/first_python_job/python_test_venv/bin/activate

echo "[$SHELL] #### Starting Python test"
echo "[$SHELL] ## This is $SLURM_JOB_USER on $HOSTNAME and this job has the ID $SLURM_JOB_ID"

# get the current working directory
export CWD=$(pwd)
echo "[$SHELL] ## current working directory: "$CWD

# Run the file
echo "[$SHELL] ## Run script"
python3 test_python_simple.py
echo "[$SHELL] ## Script finished"

echo "[$SHELL] #### Finished Python test. Have a nice day" |
SHARK
Code Block |
---|
#!/bin/bash
#SBATCH --job-name=test_python_simple
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_email_address>"
#SBATCH --mail-type="ALL"
#SBATCH --mem=1G
#SBATCH --time=00:05:00
#SBATCH --partition=short
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# load modules (assuming you start from the default environment)
# we explicitly call the modules to improve reproducibility
# in case the default settings change
module load system/python/3.9.13

# Source the Python virtual environment
source $HOME/user_guide_tutorials/first_python_job/python_test_venv/bin/activate

echo "[$SHELL] #### Starting Python test"
echo "[$SHELL] ## This is $SLURM_JOB_USER on $HOSTNAME and this job has the ID $SLURM_JOB_ID"

# get the current working directory
export CWD=$(pwd)
echo "[$SHELL] ## current working directory: "$CWD

# Run the file
echo "[$SHELL] ## Run script"
python3 test_python_simple.py
echo "[$SHELL] ## Script finished"

echo "[$SHELL] #### Finished Python test. Have a nice day" |
where you should replace <your_email_address>
with your e-mail address.
The batch file will also print out some information to the Slurm output file. To separate this output from what the Python script produces, we prefix these lines with [$SHELL].
Job submission
Let us submit this Python job to Slurm:
Code Block |
---|
sbatch test_python_simple.slurm |
Immediately after you have submitted this job, you should see something like this:
Code Block |
---|
[me@<node_name> first_python_job]$ sbatch test_python_simple.slurm
Submitted batch job <job_id> |
Job output
In the directory where you launched your job, there should be a new file created by Slurm: test_python_simple_<jobid>.out
. It contains all the output from your job that would normally have been written to the command line. Check the file for any possible error messages. The content of the file should look something like this:
Code Block |
---|
[/bin/bash] #### Starting Python test
[/bin/bash] ## This is <username> on <node_name> and this job has the ID <job_id>
[/bin/bash] ## current working directory: /home/<username>/user_guide_tutorials/first_python_job
[/bin/bash] ## Run script
Python test started on nodelogin01
Running 100 simulations of size 10000000
(PID 355612) Run 0: Median of simulation: 0.5000570098580963
(PID 355612) Run 1: Median of simulation: 0.4998579857833511
...
(PID 355612) Run 98: Median of simulation: 0.49996481928029896
(PID 355612) Run 99: Median of simulation: 0.5001124362538245
Python test finished (running time: 26.2s)
[/bin/bash] ## Script finished
[/bin/bash] #### Finished Python test. Have a nice day |
The running time might differ when you run it. The process ID (PID) is printed out for demonstration purposes. Because this is a serial job, the PID does not change.
You can get a quick overview of the resources actually used by your job by running:
Code Block |
---|
seff <job_id> |
The output from seff
will probably look something like this:
Code Block |
---|
Job ID: <jobid>
Cluster: <cluster_name>
User/Group: <user_name>/<group_name>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:22
CPU Efficiency: 100.00% of 00:00:22 core-walltime
Job Wall-clock time: 00:00:22
Memory Utilized: 1.36 MB
Memory Efficiency: 1.36% of 100.00 MB |
A parallel Python job
The simulations that you ran in the previous example are independent of each other. This makes it possible to run them in parallel on multiple cores.
Preparations
Parallel Python script
There are different ways to parallelize in Python. Here, we will make use of the multiprocessing package, which is part of the Python standard library. This is just one example and not necessarily the best option for your case.
We will name the Python script test_python_mp.py
and put it in the same directory as the previous script. While this is fine for this tutorial, in a realistic case it is probably best to use a separate directory in order to avoid having too many files in one directory.
Code Block |
---|
"""
Python test script for the ALICE user guide.
Multi-processing example
"""
import numpy as np
import os
import socket
from time import time
import multiprocessing as mp


def mysim(run, size=1000000):
    """
    Function to calculate the median of a randomly generated array
    """
    # get pid
    pid = os.getpid()
    # initialize
    rng = np.random.default_rng(seed=run)
    # create random array
    rnd_array = rng.random(size)
    # get median
    arr_median = np.median(rnd_array)

    # just for demonstration
    # do not do this here in a production run
    print("(PID {0}) Run {1}: Median of simulation: {2} ".format(pid, run, arr_median))

    return arr_median


if __name__ == "__main__":

    # get starting time of script
    start_time = time()

    print("Python MP test started on {}".format(socket.gethostname()))

    # how many simulation runs:
    n_runs = 100
    size = 10000000

    print("Running {0} simulations of size {1}".format(n_runs, size))

    # Important: only way to get the correct core count
    # Alternatively use SLURM_CPUS_PER_TASK
    n_cores = os.environ['SLURM_JOB_CPUS_PER_NODE']
    print("The number of cores available from SLURM: {}".format(n_cores))

    # go through the simulations in parallel
    pool = mp.Pool(processes=int(n_cores))
    # use starmap because mysim has multiple inputs
    res = pool.starmap(mysim, [(i, size) for i in range(n_runs)])
    pool.close()
    pool.join()

    print("Python MP test finished (running time: {0:.1f}s)".format(time() - start_time)) |
Note |
---|
Do not use multiprocessing's own functions (such as mp.cpu_count()) to get the core count that you set; they report the cores of the whole node rather than those allocated to your job by Slurm. You have to read out the Slurm environment variable SLURM_JOB_CPUS_PER_NODE (or SLURM_CPUS_PER_TASK) as shown in the script above. |
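A small sketch of such a read-out, with a fallback to a single core so the snippet also runs outside of Slurm (the fallback is added here for illustration and is not part of the script above):
Code Block |
---|
import os

# prefer the per-task setting if present, otherwise fall back to 1 core
n_cores = int(os.environ.get("SLURM_CPUS_PER_TASK",
                             os.environ.get("SLURM_JOB_CPUS_PER_NODE", "1")))
print("Using {} cores".format(n_cores)) |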
Slurm batch file
The Slurm batch file will be named test_python_mp.slurm
ALICE
Code Block |
---|
#!/bin/bash
#SBATCH --job-name=test_python_mp
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_email_address>"
#SBATCH --mail-type="ALL"
#SBATCH --mem-per-cpu=10M
#SBATCH --time=00:05:00
#SBATCH --partition=testing
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10

# load modules (assuming you start from the default environment)
# we explicitly call the modules to improve reproducibility
# in case the default settings change
module load Python/3.11.5-GCCcore-13.2.0

# Source the Python virtual environment
source $HOME/user_guide_tutorials/first_python_job/python_test_venv/bin/activate

echo "[$SHELL] #### Starting Python test"
echo "[$SHELL] ## This is $SLURM_JOB_USER on $HOSTNAME and this job has the ID $SLURM_JOB_ID"

# get the current working directory
export CWD=$(pwd)
echo "[$SHELL] ## current working directory: "$CWD

# Run the file
echo "[$SHELL] ## Run script"
python3 test_python_mp.py
echo "[$SHELL] ## Script finished"

echo "[$SHELL] #### Finished Python test. Have a nice day" |
SHARK
Code Block |
---|
#!/bin/bash
#SBATCH --job-name=test_python_mp
#SBATCH --output=%x_%j.out
#SBATCH --mail-user="<your_email_address>"
#SBATCH --mail-type="ALL"
#SBATCH --mem-per-cpu=1G
#SBATCH --time=00:05:00
#SBATCH --partition=short
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10

# load modules (assuming you start from the default environment)
# we explicitly call the modules to improve reproducibility
# in case the default settings change
module load system/python/3.9.13

# Source the Python virtual environment
source $HOME/user_guide_tutorials/first_python_job/python_test_venv/bin/activate

echo "[$SHELL] #### Starting Python test"
echo "[$SHELL] ## This is $SLURM_JOB_USER on $HOSTNAME and this job has the ID $SLURM_JOB_ID"

# get the current working directory
export CWD=$(pwd)
echo "[$SHELL] ## current working directory: "$CWD

# Run the file
echo "[$SHELL] ## Run script"
python3 test_python_mp.py
echo "[$SHELL] ## Script finished"

echo "[$SHELL] #### Finished Python test. Have a nice day" |
where you should replace <your_email_address>
with your e-mail address.
Note the changes that were made to the list of resources: The number of cores has been set to 10 (--cpus-per-task
) and the amount of memory is specified as per core (--mem-per-cpu
). We have also changed the name of the job to keep it consistent with the name of the Python script.
Job submission
Let us submit this Python job to Slurm:
Code Block |
---|
sbatch test_python_mp.slurm |
Immediately after you have submitted this job, you should see something like this:
Code Block |
---|
[me@<node_name> first_python_job]$ sbatch test_python_mp.slurm
Submitted batch job <job_id> |
Job output
The job should have created test_python_mp_<jobid>.out
. As before, check the .out
-file for the output from the script and any possible error messages. It should look something like this:
Code Block |
---|
[/bin/bash] #### Starting Python test
[/bin/bash] ## This is <username> on <node_name> and this job has the ID <job_id>
[/bin/bash] ## current working directory: /home/<username>/user_guide_tutorials/first_python_job
[/bin/bash] ## Run script
Python MP test started on nodelogin01
Running 100 simulations of size 10000000
The number of cores available from SLURM: 10
(PID 167665) Run 15: Median of simulation: 0.4998555612812697
(PID 167665) Run 16: Median of simulation: 0.4997970892172718
...
(PID 167664) Run 97: Median of simulation: 0.5001516237583952
(PID 167664) Run 98: Median of simulation: 0.49996481928029896
Python MP test finished (running time: 3.5s)
[/bin/bash] ## Script finished
[/bin/bash] #### Finished Python test. Have a nice day |
Note how the running time changed compared to the serial job, as is expected from using multiple cores. You can also see the multi-processing at work because there are different PIDs and the output is out of order.
You can get a quick overview of the resources actually used by your job by running:
Code Block |
---|
seff <job_id> |
It might look something like this:
Code Block |
---|
Job ID: <job_id>
Cluster: <cluster_name>
User/Group: <user_name>/<group_name>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 10
CPU Utilized: 00:00:25
CPU Efficiency: 50.00% of 00:00:50 core-walltime
Job Wall-clock time: 00:00:05
Memory Utilized: 1.34 MB
Memory Efficiency: 0.01% of 10.00 GB |