SLURM Job Arrays

Job arrays make it easy to submit and manage large numbers of similar jobs quickly: arrays with millions of tasks can be submitted in milliseconds (subject to size limits). They are particularly useful when you need to run the same analysis many times with different inputs or parameters.

This saves time and reduces manual work, as you don't need to submit each job individually. Instead, you submit a single job array and let SLURM schedule the individual tasks. This is especially helpful when time or resources are limited, since you can use the cluster's computing power more efficiently by running many tasks in parallel.

Maximum allowed SLURM job count and job array size

There is also a limit on the total number of jobs the SLURM scheduler on HPC can handle; once that value is reached, no further jobs can be submitted to the cluster.

The scheduler can hold at most 30,000 jobs before new submissions are rejected.

To prevent users from flooding the job queue with very large job arrays, there is a limit on the total size of a job array that can be submitted.

The maximum job array size is set to 10,000, but it is recommended to submit smaller arrays to avoid heavy resource use on HPC that can prevent other users from running their jobs.

SLURM header for job arrays

The most basic configuration for a job array is as follows:

#!/bin/bash
#SBATCH --partition=general
#SBATCH --job-name=jarray-example
#SBATCH --output=out/array_%A.out
#SBATCH --error=err/array_%A.err
#SBATCH --array=1-6
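Assuming the header above is saved in a script called job_array.sh (the file name is just an illustration), it is submitted like any other batch script, and the array tasks can be inspected with squeue:

sbatch job_array.sh    # submits one array job with 6 tasks
squeue -u $USER        # each task appears as <jobid>_<taskid>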

This will ask SLURM to run the same script 6 times. Job arrays have two additional environment variables set. SLURM_ARRAY_JOB_ID (%A) is set to the first job ID of the array, and SLURM_ARRAY_TASK_ID (%a) is set to the job array index value. In general, we want to pass the latter as an argument to our script. If you are using R, the task ID can be retrieved using task_id <- Sys.getenv("SLURM_ARRAY_TASK_ID"). If you are using job arrays with Python, the task ID can be obtained using:

import os

# SLURM_ARRAY_TASK_ID is exported as a string; convert with int() if needed
task_id = os.getenv('SLURM_ARRAY_TASK_ID')

In the SLURM header defined above, the output and error files will be overwritten whenever a "task" (one of the executions of the script through the job array) finishes. The modification below ensures that each task ID gets its own output and error files.

#!/bin/bash
#SBATCH --partition=general
#SBATCH --job-name=jarray-example
#SBATCH --output=out/array_%A_%a.out
#SBATCH --error=err/array_%A_%a.err
#SBATCH --array=1-6

When submitting a large array, we kindly ask you to use the % separator to limit how many tasks run simultaneously. For instance, below we specify an array of 600 tasks running 20 at a time.
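Only the array directive changes relative to the earlier headers:

#SBATCH --array=1-600%20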

Whenever you specify the memory, number of nodes, number of CPUs, etc., such a specification will be applied to each task. Therefore, if we set the header of our submission file as follows
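A sketch of such a header, reconstructed from the description in the next paragraph (the partition, job name, and file paths simply mirror the earlier examples):

#!/bin/bash
#SBATCH --partition=general
#SBATCH --job-name=jarray-example
#SBATCH --output=out/array_%A_%a.out
#SBATCH --error=err/array_%A_%a.err
#SBATCH --nodes=2
#SBATCH --mem=128G
#SBATCH --array=1-600%20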

SLURM will run 20 tasks simultaneously, where each task (represented by a task ID) will use two nodes with 128G of RAM each. Most of the time, a single task suffices.

Lastly, we usually use job arrays for embarrassingly parallel jobs. If that's your case and the job executed for each task ID does not use any multithreaded libraries, then you can use the following header to avoid wasting resources:
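A possible sketch of that header, again reusing the earlier partition and file-name conventions, requesting a single CPU on a single node per task:

#!/bin/bash
#SBATCH --partition=general
#SBATCH --job-name=jarray-example
#SBATCH --output=out/array_%A_%a.out
#SBATCH --error=err/array_%A_%a.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --array=1-600%20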

Be considerate when using job arrays. If you request too many resources, other users will be severely impacted. If we judge that to be the case, we will cancel your jobs.

Scancel Command Use

If the job ID of a job array is specified as input to the scancel command, then all elements of that job array will be cancelled. Alternatively, an array ID, optionally using a task range expression, may be specified for job cancellation.
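For example, assuming the array's job ID is 1234:

# Cancel the entire array
scancel 1234

# Cancel only task 3
scancel 1234_3

# Cancel tasks 1 through 5
scancel 1234_[1-5]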

Example: Multiple Input Array

If you have 10 input files named test_1, test_2, test_3, ..., test_10 that you want to pass through a piece of software, the following would be a good way to do so:
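A minimal sketch, assuming your software is invoked as ./my_program (a hypothetical placeholder) and takes the input file as its argument:

#!/bin/bash
#SBATCH --partition=general
#SBATCH --job-name=jarray-example
#SBATCH --output=out/array_%A_%a.out
#SBATCH --error=err/array_%A_%a.err
#SBATCH --array=1-10

# Each task picks the input file matching its task ID
./my_program test_${SLURM_ARRAY_TASK_ID}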

 

For more info see https://slurm.schedmd.com/job_array.html.