Resource allocation and job scripts
The easiest way to use the SLURM batch job system is to write a batch job file and submit it to the scheduler with the sbatch command. A batch script is a simple shell script that contains directives for the scheduler, the actual program to run, and possibly some shell commands that set up the working environment or perform additional tasks. Any text editor can be used to create such scripts. For instance, here is a basic sbatch submission script for a parallel job:
#!/bin/bash
#test.sbatch, a sample slurm job
#################
#SBATCH --job-name=TEST
#SBATCH --output=TEST.out
#SBATCH --error=TEST.err
#SBATCH --time=24:00
#SBATCH --nodes=2
#SBATCH --mem=1024
#SBATCH --ntasks-per-node=16
#SBATCH --partition=parallel
#################
module load mpi/openmpi-gcc
srun <path>/executable
Lines starting with #SBATCH contain information for the job scheduler. The syntax of these lines is:
#SBATCH --option=argument
In the example above, the sbatch options we used are explained as follows:
- --job-name: defines the name of the batch job. It does not have to be unique, but should be representative of the job for convenience
- --output, --error: define the file names for the STDOUT and STDERR of the batch job. If omitted, the job scheduler will create its own output files with a standard naming convention
- --time: the time limit for this job (acceptable formats include minutes:seconds, hours:minutes:seconds and days-hours)
- --nodes: the number of nodes on which to run
- --mem: the required amount of memory per node (in MB)
- --ntasks-per-node: defines the number of tasks that will be spawned on each node (very useful for parallel jobs)
- --partition: submits this job to the specified partition
- <path>/executable: is the executable to run along with its path (if needed)
The lines following the scheduler directives are plain shell commands (like the module load command in the example above) and, finally, the last line launches the executable with the srun command. The srun command submits the executable for processing, and the scheduler allocates the required resources (number of nodes, memory and tasks per node) based on the #SBATCH directives described above. Upon job submission, the scheduler also assigns a unique numeric identifier to the job, which the user can later use to track its progress.
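Note that the same options can also be passed on the sbatch command line at submission time, in which case they override the corresponding #SBATCH directives in the script. For example, to submit the same script with a different job name and time limit (the values here are only illustrative):
% sbatch --job-name=TEST2 --time=48:00 test.sbatch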
Actually, the #SBATCH lines can contain more detailed information for the scheduler, for instance the number of tasks per core, the number of CPUs per task, the amount of memory per CPU and much more. Please consult the relevant manual page for a complete list of options (man sbatch).
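For illustration, a few of these finer-grained directives might look like the following (the values are placeholders only, not recommendations):
#SBATCH --ntasks-per-core=1    # how many tasks may share a physical core
#SBATCH --cpus-per-task=4      # CPUs (cores) allocated to each task
#SBATCH --mem-per-cpu=2048     # memory per allocated CPU, in MB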
Here are some examples of sbatch scripts for serial and parallel jobs:
Serial job, 1 job per node:
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time=00:30:00
#SBATCH --partition=serial
srun ./my_program

Serial jobs, 2 jobs per node:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 2
#SBATCH --time=00:30:00
#SBATCH --partition=serial
srun -n 1 ./job1.sh &
srun -n 1 ./job2.sh &
wait

Parallel (MPI) job:
#!/bin/bash
#SBATCH --nodes 2
#SBATCH --time=01:00:00
#SBATCH --partition=parallel
#SBATCH -c 2
module load mpi/openmpi-<compiler>
srun -n 16 ./mpi_executable
After creating the batch script, it can be submitted with the following command (the "%" sign stands for the standard shell prompt and a typical output follows each command):
% sbatch test.sbatch
Submitted batch job 98765
The number "98765" is the numeric identifier of the job. By using the number, a user can track the job progress. For example, to display information about job 98765, a user can use the scontrol command with the following arguments:
% scontrol show job 98765
UserId=user1(4353) GroupId=users(100)
Priority=4294883847 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=2-02:22:00 TimeLimit=30-00:00:00 TimeMin=N/A
SubmitTime=2015-05-05T15:44:13 EligibleTime=2015-05-05T15:44:13
StartTime=2015-05-05T15:44:13 EndTime=2015-06-04T15:44:13
........... more lines of output ..............
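If you are only interested in a single field of this output, it can be filtered with standard shell tools; for example, to check only the state of the job:
% scontrol show job 98765 | grep JobState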
To cancel a job, use the scancel command (some installations also provide a qdel wrapper for Torque/PBS compatibility):
% scancel 98765
- or -
% scancel -u user_id
(to cancel all jobs belonging to a user)
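scancel can also select jobs by attributes other than the job id, for example by job name or by state:
% scancel --name=TEST
(to cancel all jobs named TEST)
% scancel -u user_id -t PENDING
(to cancel only the pending jobs of a user)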
To display a full listing of the submitted jobs along with their status, use the squeue command:
% squeue
JOBID PARTITION     NAME   USER ST        TIME NODES NODELIST(REASON)
14685    master     xyz1  user1  R 15-05:07:43     7 node-ib[01-07]
20127    master    abc24  user2  R  7-07:27:58     1 node-ib01
23600    master  run_all  user3  R  2-07:03:54     7 node-ib[17,29,36,39,42,45,49]
Among other things, the squeue command displays the nodes allocated to each job (last column).
Note: While a job is running, you can log in (with ssh) to the nodes it is running on for debugging purposes.
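For example, assuming job 98765 from above is running and one of its nodes is node-ib01, you could locate its nodes and log in like this:
% squeue -j 98765 -o "%N"
(to display only the node list of job 98765)
% ssh node-ib01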
Finally, to get an overview of all available partitions, you can use the sinfo command.
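For example (using the parallel partition from the earlier examples):
% sinfo
(to list all partitions together with their state and nodes)
% sinfo -p parallel -l
(to show more detailed information about the parallel partition only)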
Other ways to allocate resources
You can use the salloc command and then run commands on the nodes that have been allocated to you. For example:
% salloc -N 1 -n 4
This command will allocate a single node for 4 tasks. You can get a list of the allocated node(s) by examining the value of the $SLURM_JOB_NODELIST environment variable (echo $SLURM_JOB_NODELIST) or by using the squeue command. Then, you can use the srun command to run the job(s) on the allocated node(s). A more compact form of the above procedure would be:
% salloc -N 1 -n 4 srun <path>/executable
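A complete salloc session could look like the sketch below; the job id, node name and the hostname test command are only illustrative:
% salloc -N 1 -n 4
salloc: Granted job allocation 98766
% echo $SLURM_JOB_NODELIST
node-ib01
% srun -n 4 hostname
node-ib01
node-ib01
node-ib01
node-ib01
% exit
salloc: Relinquishing job allocation 98766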
Interactive batch jobs
Interactive batch jobs can be used for tasks that require input from the user during execution. The easiest way is to allocate resources with the salloc command and then run your interactive program with the srun command, more or less as described in the previous paragraph. Please bear in mind that interactive execution is not the most efficient way to work in an HPC environment, except for special cases (e.g. data visualization with interactive software and so on).
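As a minimal sketch, assuming the installation permits pseudo-terminal job steps (the --pty option of srun), an interactive session could look like this (my_interactive_program is a hypothetical example):
% salloc -N 1 -n 1 --time=01:00:00
(allocate one task on one node for one hour)
% srun --pty ./my_interactive_program
(run the program attached to your terminal)
% exit
(release the allocation)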
Job dependencies
There are cases where a user must wait for a job to finish before submitting another one, for example when the input of one job is the output of another. For these cases, the sbatch command has a special option, "--dependency". With this option a user can instruct the scheduler to execute a job only after some other job has finished running. For example:
% sbatch job1.sbatch
Submitted batch job 98765
% sbatch --dependency=afterok:98765 job2.sbatch
In the above example the user submits two jobs, job1.sbatch and job2.sbatch. The scheduler will allocate resources and execute job1, but instead of executing job2 at once, it will wait for job1 to finish and will start job2 only after job1 has successfully completed (please note the "afterok" condition). The available conditions are:
- after: The dependent job starts after the specified job begins its execution
- afterany: The dependent job starts after the specified job terminates, regardless of its exit status
- afterok: The dependent job starts only after the specified job completes successfully
- afternotok: The dependent job starts only if the specified job fails (terminates with a non-zero exit code, is cancelled, or times out)
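Since sbatch prints the job id in its "Submitted batch job" message, a dependency chain can also be built inside a small shell script. The following is only a sketch that extracts the id with awk:
#!/bin/bash
# submit the first job and keep its numeric identifier
jid1=$(sbatch job1.sbatch | awk '{print $4}')
# submit the second job so that it starts only if the first one completes successfully
sbatch --dependency=afterok:${jid1} job2.sbatch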
Job Status Codes
Each job in the queue has a status code indicating its current state. The codes and their meanings are listed below; a short example of filtering the queue by state follows the table:
Status Code | Meaning |
---|---|
PD | Pending - usually waiting for resources |
R | Running - normal execution |
S | Suspended - job has an allocation, but its execution has been suspended |
CG | Completing - job ending but some processes are still active |
CD | Completed - job ended |
CF | Configuring - job has been allocated resources and is waiting for them to become ready |
CA | Canceled - job canceled by the user or an administrator |
F | Failed - job has failed |
TO | Timeout - job terminated due to time limit |
PR | Preempted - job terminated due to preemption |
NF | Node fail - job ended due to node failure |
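These codes appear in the ST column of squeue and can also be used to filter the queue by state, for example:
% squeue -u user1 -t PENDING
(to list only the pending (PD) jobs of user1)
% squeue -t RUNNING
(to list all running (R) jobs)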