Using environment variables
When a SLURM batch script is submitted, the job scheduler exports a number of environment variables to the job's environment. Some of the most important variables are:
Variable | Meaning |
---|---|
SLURM_SUBMIT_DIR | Directory from which the script was submitted |
SLURM_JOB_USER | The username of the user that submitted the job |
SLURM_EXPORT_ENV | Which environment variables are propagated to the job's environment |
SLURM_NNODES | The number of nodes allocated for the job |
SLURM_JOBID | A numeric value that uniquely identifies the job |
SLURM_NODELIST | The names of the nodes on which the job will run |
SLURM_SUBMIT_HOST | The hostname of the node from which the job was submitted |
SLURM_NTASKS_PER_NODE | Number of tasks requested per node |
SLURM_NTASKS_PER_SOCKET | Number of tasks requested per socket (physical processor) |
All of these variables may be used inside a batch script, for example to control processing or to keep track of a job's progress. For example:
#!/bin/bash
#SBATCH -J my_parallel_job
#SBATCH --cpus-per-task=2
#SBATCH --ntasks=16
#SBATCH --partition=parallel
srun my_parallel_program
echo "Job with ID: $SLURM_JOBID used $SLURM_NNODES number of nodes and $SLURM_NTASKS_PER_SOCKET task per socket"
Using Python as the batch script interpreter
It is possible to use Python as the batch script interpreter, which means that you can write a batch script entirely in Python. For example:
#!/usr/bin/env python
#SBATCH --time=00:30:00
#SBATCH --ntasks=100
#SBATCH --partition=parallel
from mpi4py import MPI
from pk import work          # "pk" here is an example module providing a work() function
comm = MPI.COMM_WORLD        # communicator spanning all MPI tasks
rank = comm.Get_rank()       # rank of this task
size = comm.Get_size()       # total number of tasks
work(rank, size)
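Assuming the script above is saved as, say, pyjob.py (the filename is only a placeholder), it is submitted exactly like any other batch script:
sbatch pyjob.py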
Disabling Hyper-Threading
Sometimes the use of Hyper-Threading Technology (HT) can cause serious performance issues. Since HT is enabled by default on all nodes of the Metropolis cluster, if you need to disable it you must instruct the scheduler to do so. This can be achieved with the --extra-node-info switch:
#SBATCH --extra-node-info=2:10:1
In the above example, we instruct the job scheduler that each node has 2 processors and 10 cores per processor, but that only 1 thread per core should be used. In this way HT is virtually disabled and our tasks will use just one thread per core.
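For reference, a minimal batch script using this switch could look like the following (the job name, task count and program name are only placeholders):
#!/bin/bash
#SBATCH -J no_ht_job
#SBATCH --ntasks=20
#SBATCH --partition=parallel
#SBATCH --extra-node-info=2:10:1   # 2 sockets x 10 cores, 1 thread per core
srun my_parallel_program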
Hybrid jobs: MPI and OpenMP
It is possible to have a job that uses both MPI and OpenMP. To take advantage of OpenMP, we need to specify the number of OpenMP threads via the OMP_NUM_THREADS environment variable. Additionally, to avoid running all the threads on a single core, we should disable core affinity. A batch script that handles such a job could look like this:
#!/bin/bash
#SBATCH -J my_hybrid_job
export OMP_NUM_THREADS=16        # number of OpenMP threads per MPI task
export MV2_ENABLE_AFFINITY=0     # disable core affinity in the MPI library
module load mpi/ofed/mpich2
srun -n 64 my_hybrid_executable  # launch 64 MPI tasks
In the above example, we set the number of OpenMP threads to 16 and we instruct MVAPICH2 to disable affinity so that the threads are not pinned to a single core. Please note that this approach may result in decreased performance due to thread migration by the operating system.
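If you also want SLURM to reserve the cores that the OpenMP threads will run on, the thread count can be expressed in the resource request as well. A minimal sketch, assuming 64 MPI tasks with 16 threads each (the program name and counts are only placeholders; adjust them to your own job):
#!/bin/bash
#SBATCH -J my_hybrid_job
#SBATCH --ntasks=64              # number of MPI tasks
#SBATCH --cpus-per-task=16       # CPUs (OpenMP threads) per MPI task
#SBATCH --partition=parallel
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match the thread count to the allocation
export MV2_ENABLE_AFFINITY=0
module load mpi/ofed/mpich2
srun my_hybrid_executable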
Allocating tasks to nodes, cores and threads
Requeueing jobs
There is always the possibility that a node will fail while executing your job (e.g. a hardware failure). In this case, you may or may not want your job to be automatically requeued. You can control this behavior with the --requeue or the --no-requeue switch (both are available on the command line and as batch-script directives).
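For example, to prevent a job from ever being requeued you can submit it as follows (the script name is only a placeholder):
sbatch --no-requeue myjob.sh
The equivalent directive form inside the batch script would be:
#SBATCH --no-requeue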
Job Arrays
SLURM offers a mechanism for submitting and managing collections of similar jobs, provided that all jobs share the same initial options (these can later be altered with the scontrol update command). To explain the mechanism used for job arrays, we will use a simple example. Let's assume that we need to run a program named "vectormap" which maps 10 datasets to 10 individual vectors. We could create several batch files, one for each run, or even a single batch file that calls the vectormap program several times. With job arrays, we have the flexibility to create a single batch script that calls the vectormap program only once, and SLURM will split the work into 10 individual jobs.
In order to achieve that, first we need to create a batch script (let's call it vectormap.sh) like this:
#!/bin/bash
#SBATCH -J vectormap
#SBATCH -o vectormap%A%a.out # Standard output (%A = master job ID, %a = array task index)
#SBATCH -e vectormap%A%a.err # Standard error
vectormap dataset"${SLURM_ARRAY_TASK_ID}".inp
and then we can submit the script using the sbatch command with the --array switch:
sbatch --array=1-10 vectormap.sh
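As mentioned above, options shared by the array's jobs can still be changed after submission with scontrol update. A brief example, assuming the array was assigned job ID 1234 (the ID and the new time limit are only placeholders):
scontrol update JobId=1234 TimeLimit=02:00:00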
For a more detailed description of job arrays, please visit the SLURM Job Arrays page.