Slurm: A workload manager for Supercomputers and Compute Clusters

Image by Taylor Vick (https://unsplash.com/photos/cable-network-M5tzZtFCOfs)

Introduction

This post provides a loose tutorial for beginners on interacting with supercomputers and compute clusters through Slurm. The material is heavily influenced by the official Slurm documentation. Further information was taken from the computing centers at the University of Kaiserslautern, the University of Chicago, and Harvard University. The reader is expected to be able to establish an SSH session to their available infrastructure and to be familiar with a shell along with the most basic Linux commands.

Primarily, this post serves as my own mnemonic device and a reference to get colleagues started!

Data Transfer

The first step in submitting a job to any compute cluster running Slurm involves moving the compiled binaries or executable scripts, as well as any supplementary materials, to the cluster. Without a doubt, relying on a version control system and git clone or git pull is the most reliable way to move code from one place to another. Still, the cluster you are working with will most likely also support a variety of file transfer protocols (e.g., SCP or SFTP) to move related files to the destination system, such as large datasets that may not be tracked by your version control system.
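
For example, a large dataset that is not tracked by git could be copied over with scp; the user name, host name, and paths below are placeholders.

# Copy a local dataset folder into the project directory on the cluster
scp -r ./instances user@cluster.example.org:~/project/
# Pull results back to your machine after a job has finished
scp user@cluster.example.org:~/project/Job-12345.out .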

Submitting Jobs to Slurm

For the remainder of this post, we consider the following file system structure on a compute cluster running Slurm, to which a properly tested project has been cloned.

$HOME
├── project
│   ├── .gitignore
│   ├── instances
│   │   ├── **/*.csv
│   ├── submit
│   ├── work.cpp
├── ...

Overview of the sbatch command

Jobs are submitted to the Slurm scheduler through the sbatch command. A submit script may be stored in a file with the following content. Using sbatch submit, the information about the job (in particular, the requested resources) is handed over to the central Slurm scheduler.

#!/bin/bash
#SBATCH --job-name=Foo
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=1 
#SBATCH --output=Job-%j.out
#SBATCH --error=Job-%j.err
#SBATCH --time=6-1:23:45  # 6d-1h:23m:45s
#SBATCH --mem=4G
#SBATCH --constraint="XEON_E5_2670|XEON_E5_2640v3|XEON_SP_6126"
#SBATCH --mail-type=END

module load a_required_module

echo "Executing job on $HOSTNAME"
./work  # binary compiled from work.cpp

⚠️ Do not run bash submit, as your program will execute on the current host, instead of being scheduled according to your resource requirements!
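
Submitting the job then boils down to a single command; sbatch queues the job, prints the assigned job ID, and returns immediately (the ID below is made up).

sbatch submit
# Prints "Submitted batch job 123456"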

Specifying resources and using the right nodes, tasks, and cores

Most of the resources, such as the requested amount of memory, time, or further constraints such as the type of CPU / GPU, should be self-explanatory. However, it may take a while to wrap your head around the differences between nodes, ntasks, ntasks-per-core, and cpus-per-task. Detailed descriptions can be found in the Slurm manual:

  • A node is a distinct physical server in the compute cluster – it generally contains more than one CPU with non-uniform memory access.
  • A task can be considered as an independent process that you want to execute via srun.
  • A core is a logical CPU that can work on a task.

For example, depending on the number of tasks, the following program may take 20 seconds (--ntasks=2) or 30 seconds (--ntasks=1) to complete.

#!/bin/bash
#SBATCH --ntasks=2

srun sleep 10 & 
srun sleep 20 &
wait

The correct usage of course depends on your concrete application. This concerns whether or not you launch independent processes and whether you are able to exploit parallelism through multithreading on a single node or across several nodes, e.g., by utilizing the Message Passing Interface (MPI).

#!/bin/bash
# Using MPI and not concerned about how tasks are distributed
#SBATCH --ntasks=8

# Launch 8 independent processes
#SBATCH --ntasks=8

# Spread tasks across distinct nodes (one task per node)
#SBATCH --ntasks=8 --ntasks-per-node=1 # or
#SBATCH --ntasks=16 --nodes=16

# No interference from other jobs
#SBATCH --exclusive

# 8 processes spread across 4 nodes (2 processes per node)
#SBATCH --ntasks=8 --ntasks-per-node=2

# 8 processes on the same node
#SBATCH --ntasks=8 --ntasks-per-node=8

# 1 process that utilizes 8 cores for multithreading
#SBATCH --ntasks=1 --cpus-per-task=8

# 2 processes that utilize 2 cores each for multithreading
#SBATCH --ntasks=2 --cpus-per-task=2
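
For the multithreading case, the job script should also tell the threading runtime how many of the allocated cores to use. A minimal sketch for an OpenMP program, where the binary name omp_work is a placeholder:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Use exactly the cores allocated by Slurm for the OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./omp_work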

The Environment Modules system: module

Your compute cluster is most likely utilizing environment modules. Environment modules simplify shell initialization in each session, generally by updating variables such as PATH to point to the desired software. The basic commands for interacting with modules are the following.

# Query modules installed on the compute cluster
module avail
# Load a given module (e.g., the GNU C Compiler gcc)
module load gcc
# which is equivalent to
module load gcc/latest
# Alternative: use a specific version
module load python/3.10
# Query currently loaded modules
module list
# Reload (unload then load) all loaded modules
module reload
# Remove / unload a specific module
module rm gcc/latest
# Remove / unload all loaded modules
module purge
# Get further information about a module
module whatis gcc
module show gcc

Why even use modules? In a nutshell, your compute cluster likely has several different versions of the same software installed. Rather than keeping track of the right paths on a per-project basis, it is much easier to deal with module files. Consider the following example.

echo $PATH
# Prints "foo"
module load python/3.10
echo $PATH
# Prints "/software/python/3.10.6/bin:foo"
# The command 'python3' now points to Python 3.10.6
module load python/3.12
echo $PATH
# Prints "/software/python/3.12.0/bin:/software/python/3.10.6/bin:foo"
# The command 'python3' now points to Python 3.12.0
module purge
echo $PATH
# Prints "foo"
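
Inside a submission script, it is therefore good practice to start from a clean module environment and to pin explicit versions so that jobs remain reproducible; the module names and versions below are placeholders that depend on your cluster.

# Start from a clean environment, then pin explicit versions
module purge
module load gcc/12.2.0
module load python/3.10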

Job Arrays

In many applications, it is convenient or even necessary to call the same program multiple times with different arguments (e.g., to load different instances). In the simplest case, the program receives a single integer argument for further processing inside the program. The most straightforward extension of a submission script involves the use of a job array as follows.

# A job array with index values between 0 and 10 (inclusive)
#SBATCH --array=0-10
# A job array with index values of 1, 2, 5, 19, 27
#SBATCH --array=1,2,5,19,27
# A job array with index values between 1 and 7 with a step size of 2
#SBATCH --array=1-7:2
# A job array where at most 5 jobs are active at any time
#SBATCH --array=1-100%5

In filename patterns, the array index can be accessed via %a, while %A is the job array’s master job allocation number (assigned by Slurm). Inside the script, the index is available through the environment variable SLURM_ARRAY_TASK_ID.

#!/bin/bash
#SBATCH --job-name=foo
#SBATCH --array=0-10
#SBATCH --nodes=1
#SBATCH --time=1:00
#SBATCH --mem=4G
#SBATCH --output=%A_%a.out # Standard output
#SBATCH --error=%A_%a.err # Standard error

./work instances/**/${SLURM_ARRAY_TASK_ID}.csv

If the names of the *.csv files are non-sequential (e.g., they represent named objects), the submission script may easily be adjusted. In such cases, we might read the list of files into an array FILES and then use the shell variable SLURM_ARRAY_TASK_ID to work our way through this array as follows:

# Populate the array first, e.g., FILES=(instances/**/*.csv)
FILENAME=${FILES[$SLURM_ARRAY_TASK_ID]}
# or (note that sed line numbers start at 1)
FILENAME=$(ls **/*.csv | sed -n ${SLURM_ARRAY_TASK_ID}p)
./work "$FILENAME"
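
Putting this together, a complete submission script could look like the following sketch; the array range is a placeholder and has to match the number of instance files (here, 42 files indexed 0 to 41).

#!/bin/bash
#SBATCH --job-name=foo
#SBATCH --array=0-41         # adjust to (number of instance files - 1)
#SBATCH --time=1:00
#SBATCH --mem=4G
#SBATCH --output=%A_%a.out
#SBATCH --error=%A_%a.err

shopt -s globstar            # enable recursive ** globbing in bash
FILES=(instances/**/*.csv)   # collect all instance files into an array
./work "${FILES[$SLURM_ARRAY_TASK_ID]}"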

⚠️ Array jobs make it easy to run many similar tasks, but if each task is short (on the order of seconds or a few minutes), array jobs quickly bog down the Slurm scheduler, and more time may be spent managing jobs than actually doing work.
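
One remedy is to let each array element process a chunk of instances instead of a single one. A rough sketch, where the chunk size of 100 is an arbitrary assumption:

#!/bin/bash
#SBATCH --array=0-9          # 10 array tasks ...
CHUNK=100                    # ... each handling up to 100 instances

shopt -s globstar
FILES=(instances/**/*.csv)
START=$((SLURM_ARRAY_TASK_ID * CHUNK))
for FILE in "${FILES[@]:$START:$CHUNK}"; do
    ./work "$FILE"
done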

Dealing with running jobs: squeue and scancel

To query the current state of submitted jobs, we use the squeue command, which returns the following header and the list of related jobs.

JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
12345     batch    rf  anne  R 13:07     2 node[007-008]
12346     batch  opti  bart CG  0:00     1 node001
12347     batch   dnn  carl PD  0:00     6 (Resources)

The ID is assigned to each job by the Slurm scheduler. The assigned partition depends on the configuration of the compute cluster, which may provide different queues (e.g., for short- vs. long-running jobs). States include running (R), completing (CG), and pending (PD), along with the corresponding runtime. Nodes lists the number of nodes on which a particular job is being executed, and the nodelist contains the names of these nodes or the reason why the job is still pending (no available resources, quality of service, …).
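
A few variations of squeue come in handy in practice (the user name and job ID are placeholders):

# Show only your own jobs
squeue -u mike
# Show a specific job
squeue -j 12345
# Show the scheduler's estimated start time of pending jobs
squeue --start -u mike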

Cancelling any job can be achieved with the scancel command. Common scenarios include:

# Cancel all jobs of user mike
scancel -u mike
# Cancel a job with a specific JOBID
scancel JOBID
# Cancel a job with a specific NAME
scancel --name NAME
# Cancel all pending jobs
scancel -t PENDING -u mike
# Cancel array tasks 1 to 3 of the job array with JOBID 20
scancel 20_[1-3]
# Cancel array tasks 4 and 5 of the job array with JOBID 20
scancel 20_4 20_5
# Cancel job with JOBID 20 (all elements, if array)
scancel 20

Conclusion

Getting started with Slurm is quite simple for anyone who is already familiar with a shell. We did not cover further Slurm commands (e.g., sinfo, sacct, srun) or more exotic use cases, e.g., jobs that require GPUs or a RAM disk. Nevertheless, this should be sufficient to get you started and to keep you out of trouble with the administrators of your compute cluster! Happy computing! 🚀