Using the Batch System

The Slurm (Simple Linux Utility for Resource Management) workload manager is a software package for submitting, scheduling, and monitoring jobs on large compute clusters. Slurm is available on Chinook for submitting and monitoring user jobs.

Similar to PBS/TORQUE, Slurm accepts user jobs specified in batch scripts. More information on Slurm batch scripts may be found below.

Common Slurm commands, Slurm batch scripts, translating from PBS/TORQUE to Slurm, and running interactive jobs are discussed below. SchedMD, the company behind Slurm, has also put together a quick reference for Slurm commands.

Batch overview

The general principle behind batch processing is automating repetitive tasks. Single tasks are known as jobs, while a set of jobs is known as a batch. This distinction is largely academic, since the terms job and batch job are now used almost interchangeably, but we'll keep them distinct here.

There are three basic steps in a batch or job-oriented workflow:

  1. Copy input data from archival storage to scratch space
  2. Run computational tasks over the input data
  3. Copy output to archival storage

On Chinook the first and last steps must occur on login nodes, and the computation step on compute nodes. This is enforced by the login nodes having finite CPU ulimits set and $ARCHIVE not being present on the compute nodes.
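
A minimal sketch of this workflow from the command line (the project and scratch paths, and the batch script name, are hypothetical placeholders):

# Step 1, on a login node: stage input data from archival storage to scratch space
cp -r $ARCHIVE/myproject/input /path/to/scratch/myproject/

# Step 2, submitted from a login node but run on compute nodes: the computational work
sbatch myproject.slurm

# Step 3, on a login node after the job completes: copy output back to archival storage
cp -r /path/to/scratch/myproject/output $ARCHIVE/myproject/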

Different jobs may require different combinations of computational resources depending on their scale and characteristics. Obtaining these resources is a combination of:

  • Choosing which partition to submit the job to
  • Choosing what resources to request from the partition

This is done by writing batch scripts whose directives specify these resources.

Available partitions

  • debug: 2 nodes, 1 hour max walltime, 1-2 nodes per job. For debugging job scripts.
  • t1small: 71 nodes, 1 day max walltime, 1-2 nodes per job. For short, small jobs with quick turnover.
  • t1standard: 71 nodes, 4 days max walltime, 3-71 nodes per job. The default, general-purpose partition.
  • t2small: 71 nodes, 2 days max walltime, 1-2 nodes per job. Tier 2 users only; increased priority and walltime. Tier 2 version of t1small.
  • t2standard: 71 nodes, 7 days max walltime, 3-71 nodes per job. Tier 2 users only; increased priority and walltime. Tier 2 general-purpose partition.
  • transfer: 1 node, 1 day max walltime, 1 node per job. Shared use. For copying files between archival storage and scratch space.

Selecting a partition is done by adding a directive to the job submission script, such as #SBATCH --partition=t1standard, or by specifying the partition on the command line: $ sbatch -p t1standard

Anyone interested in gaining access to the higher-priority Tier 2 partitions (t2small, t2standard) by subscribing to support the cluster or procuring additional compute capacity should contact uaf-rcs@alaska.edu.

Common Slurm Commands

sacct

The sacct command is used for viewing information about submitted jobs. This can be useful for monitoring job progress or diagnosing problems that occurred during job execution. By default, sacct will report the job ID, job name, partition, account, allocated CPU cores, job state, and the exit code for all of the current user's jobs that have been submitted since midnight of the current day.

sacct's output, as with most Slurm informational commands, can be customized in a large number of ways. Here are a few of the more useful options:

Command Result
sacct --starttime 2016-03-01 select jobs since midnight of March 1, 2016
sacct --allusers select jobs from all users (default is only the current user)
sacct --accounts=account_list select jobs whose account appears in a comma-separated list of accounts
sacct --format=field_names print fields specified by a comma-separated list of field names
sacct --helpformat print list of fields that can be specified with --format
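
For example, these options can be combined to review all of your jobs since March 1, 2016 with a custom set of fields (run sacct --helpformat to confirm the available field names):

sacct --starttime 2016-03-01 --format=JobID,JobName,Partition,AllocCPUS,State,ExitCode,Elapsed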

For more information on sacct, please visit https://slurm.schedmd.com/sacct.html.

sbatch

The sbatch command is used for submitting jobs to the cluster. Although it is possible to supply command-line arguments to sbatch, it is generally a good idea to put all or most resource requests in the batch script for reproducibility.

Sample usage:

sbatch mybatch.sh

On successful batch submission, sbatch will print out the new job's ID. sbatch may fail if the resources requested cannot be satisfied by the indicated partition.
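
If needed, individual directives can still be overridden at submission time without editing the script, since command-line options take precedence over the corresponding #SBATCH directives. For example, to send the same script to the debug partition:

sbatch --partition=debug mybatch.sh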

For more information on sbatch, please visit https://slurm.schedmd.com/sbatch.html.

scontrol

The scontrol command is used for monitoring and modifying queued or running jobs. Although many scontrol subcommands apply only to cluster administration, there are some that may be useful for users:

Command Result
scontrol hold job_id place hold on job specified by job_id
scontrol release job_id release hold on job specified by job_id
scontrol show reservation show details on active or pending reservations
scontrol show nodes show hardware details for compute nodes
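
For example, to temporarily prevent a queued job from starting and later release it (using a hypothetical job ID of 8137):

scontrol hold 8137
scontrol release 8137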

For more information on scontrol, please visit https://slurm.schedmd.com/scontrol.html.

sinfo

The sinfo command is used for viewing compute node and partition status. By default, sinfo will report the partition name, availability, time limit, node count, node state, and node list.

sinfo's output, as with most Slurm informational commands, can be customized in a large number of ways. Here are a few of the more useful options:

Command Result
sinfo --partition=t1standard show node info for the partition named 't1standard'
sinfo --summarize group by partition, aggregate node state counts as A/I/O/T (Allocated/Idle/Other/Total)
sinfo --reservation show Slurm reservation information
sinfo --format=format_tokens print fields specified by format_tokens
sinfo --Format=field_names print fields specified by comma-separated field_names

There are a large number of fields hidden by default that can be displayed using --format and --Format. Refer to the sinfo manual page for the complete list of fields.
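
As an illustration, the following uses --Format to show per-node details for the debug partition (the field names here are assumptions; check the manual page for the authoritative list):

sinfo --partition=debug --Format=nodehost,cpusstate,memory,statelong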

For more information on sinfo, please visit https://slurm.schedmd.com/sinfo.html.

smap

The smap command is an ncurses-based tool useful for viewing the status of jobs, nodes, and node reservations. It aggregates data exposed by other Slurm commands, such as sinfo and squeue.

Command Result
smap -i 15 Run smap, refreshing every 15 seconds

For more information on smap, please visit https://slurm.schedmd.com/smap.html.

squeue

The squeue command is used for viewing job status. By default, squeue will report the ID, partition, job name, user, state, time elapsed, nodes requested, nodes held by running jobs, and reason for being in the queue for queued jobs.

squeue's output, as with most Slurm informational commands, can be customized in a large number of ways. Here are a few of the more useful options:

Command Result
squeue --user=user_list filter by a comma-separated list of usernames
squeue --start print expected start times of pending jobs
squeue --format=format_tokens print fields specified by format_tokens
squeue --Format=field_names print fields specified by comma-separated field_names
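
For example, --user and --start can be combined to show the expected start times of only your own pending jobs:

squeue --user=$USER --start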

The majority of squeue's customization is done using --format or --Format. The lowercase --format allows for controlling which fields are present, their alignments, and other contextual details such as whitespace, but comes at the cost of readability and completeness (not all fields can be specified using the provided tokens). In contrast, the capitalized --Format accepts a complete set of verbose field names, but offers less flexibility with contextual details.

As an example, the following command produces output identical to squeue --start:

squeue --format="%.18i %.9P %.8j %.8u %.2t %.19S %.6D %20Y %R" --sort=S --states=PENDING

--Format can produce equivalent (but not identical) output:

squeue --Format=jobid,partition,name,username,state,starttime,numnodes,schednodes,reasonlist --sort=S --states=PENDING

For more information on squeue, please visit https://slurm.schedmd.com/squeue.html.

sreport

The sreport command is used for generating job and cluster usage reports. Statistics will be shown for jobs run since midnight of the current day by default. Although many of sreport's reports are more useful for cluster administrators, there are some commands that may be useful to users:

Command Result
sreport cluster AccountUtilizationByUser -t Hours start=2016-03-01 report hours used since Mar 1, 2016, grouped by account
sreport cluster UserUtilizationByAccount -t Hours start=2016-03-01 Users=$USER report hours used by the current user since Mar 1, 2016

For more information on sreport, please visit https://slurm.schedmd.com/sreport.html.

srun

The srun command is used to launch a parallel job step. Typically, srun is invoked from a Slurm batch script to perform part (or all) of the job's work. srun may be used multiple times in a batch script, allowing for multiple program runs to occur in one job.
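
As a sketch (the preprocessing and worker executables here are hypothetical), a batch script with two job steps might look like this:

#!/bin/bash
#SBATCH --partition=t1standard
#SBATCH --ntasks=72
#SBATCH --tasks-per-node=24

# First job step: run a single serial preprocessing task
srun --ntasks=1 ./preprocess
# Second job step: launch one copy of the worker program on every allocated task
srun --ntasks=$SLURM_NTASKS ./worker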

Alternatively, srun can be run directly from the command line on a login node, in which case srun will first create a resource allocation for running the job. Use command-line keyword arguments to specify the parameters normally used in batch scripts, such as --partition, --nodes, --ntasks, and others. For example, srun --partition=debug --nodes=1 --ntasks=8 whoami will obtain an allocation consisting of 8 cores on 1 node and then run the command whoami on all of them.

Please note that srun does not inherently parallelize programs; it simply runs independent instances of the specified program in parallel across the nodes assigned to the job. Put another way, srun launches a program in parallel, but it makes no guarantee that the program itself is designed to run in parallel to any degree.

See Interactive Jobs for an example of how to use srun to allocate and run an interactive job (i.e. a job whose input and output are attached to your terminal).

A note about MPI: srun is designed to run MPI applications without the need for using mpirun or mpiexec, but this ability is currently not available on Chinook. It may be made available in the future. Until then, please refer to the directions on how to run MPI applications on Chinook below.

For more information on srun, please visit https://slurm.schedmd.com/srun.html.

sview

The sview command is a graphical interface useful for viewing the status of jobs, nodes, partitions, and node reservations. It aggregates data exposed by other Slurm commands, such as sinfo, squeue, and smap, and refreshes every few seconds.

For more information on sview, please visit https://slurm.schedmd.com/sview.html.

Batch Scripts

Batch scripts are plain-text files that specify a job to be run. They consist of batch scheduler (Slurm) directives, which specify the resources requested for the job, followed by the shell commands that run the program.

Here is a simple example of a batch script that will be accepted by Slurm on Chinook:

#!/bin/bash
#SBATCH --partition=debug
#SBATCH --ntasks=24
#SBATCH --tasks-per-node=24

echo "Hello world"

On submitting the batch script to Slurm using sbatch, the job's ID is printed:

$ ls
hello.slurm
$ sbatch hello.slurm
Submitted batch job 8137

Among other things, Slurm stores what the current working directory was when sbatch was run. Upon job completion (nearly immediate for a trivial job like the one specified by hello.slurm), output is written to a file in that directory.

$ ls
hello.slurm  slurm-8137.out
$ cat slurm-8137.out
Hello world

Running an MPI Application

Here is what a batch script for an MPI application might look like:

#!/bin/sh

#SBATCH --partition=t1standard
#SBATCH --ntasks=<NUMTASKS>
#SBATCH --tasks-per-node=24
#SBATCH --mail-user=<USERNAME>@alaska.edu
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --output=<APPLICATION>.%j

ulimit -s unlimited
ulimit -l unlimited

# Load any desired modules, usually the same as loaded to compile
. /etc/profile.d/modules.sh
module purge
module load toolchain/pic-intel/2016b
module load slurm

cd $SLURM_SUBMIT_DIR
# Generate a list of allocated nodes; will serve as a machinefile for mpirun
srun -l /bin/hostname | sort -n | awk '{print $2}' > ./nodes.$SLURM_JOB_ID
# Launch the MPI application
mpirun -np $SLURM_NTASKS -machinefile ./nodes.$SLURM_JOB_ID ./<APPLICATION>
# Clean up the machinefile
rm ./nodes.$SLURM_JOB_ID

Replace the placeholders as follows:

  • <APPLICATION>: The executable to run in parallel
  • <NUMTASKS>: The number of parallel tasks requested from Slurm
  • <USERNAME>: Your Chinook username (same as your UA username)

There are many environment variables that Slurm defines at runtime for jobs. Here are the ones used in the above script:

  • $SLURM_JOB_ID: The job's numeric id
  • $SLURM_NTASKS: The value supplied as <NUMTASKS>
  • $SLURM_SUBMIT_DIR: The current working directory when "sbatch" was invoked

Interactive Jobs

Command Line Interactive Jobs

Interactive jobs are possible on Chinook using srun:


chinook:~$ srun -p debug --ntasks=24 --exclusive --pty /bin/bash

The above command will reserve one node in the debug partition and launch an interactive shell job. The --pty option executes task zero in pseudo terminal mode and implicitly sets --unbuffered and --error and --output to /dev/null for all tasks except task zero, which may cause those tasks to exit immediately.

Displaying X Windows from Interactive Jobs

A new module named "sintr" is available to create an interactive job that forwards application windows from the first compute node back to the local display. This relies on X11 forwarding over SSH, so make sure to enable graphics when connecting to a Chinook login node. You will also need to generate an SSH key pair on a Chinook login node, which can be done by running the ssh-keygen -t rsa command:


chinook00 % ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/u1/uaf/<USERNAME>/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
...

The command will prompt you for the location to save the file, using /u1/uaf/<USERNAME>/.ssh/id_rsa as the default. The RSA key pair must be saved in that default location. You will also be prompted for a passphrase for the key pair, which will be used when connecting to a compute node with sintr. The contents of $HOME/.ssh/id_rsa.pub must then be added to $HOME/.ssh/authorized_keys. This can be done with the following command:

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The sintr command accepts the same command line arguments as sbatch. To launch a single node interactive job in the debug partition, for example, follow these steps:


chinook:~$ module load sintr
chinook:~$ sintr -p debug -N 1
Waiting for JOBID #### to start.
...

The command will wait for a node to be assigned and the job to launch. As soon as that happens, the next prompt should be on the first allocated compute node, and the DISPLAY environment variable will be set to send X windows back across the SSH connection. It is now possible to load and execute a desired windowed application. Here's an example with TotalView.


bash-4.1$ module load totalview
bash-4.1$ totalview

After exiting an application, exit the session too. This will release the allocated node(s) and end the interactive job.


bash-4.1$ exit
exit
[screen is terminating]
chinook:~$