Account Information

acctinfo

view relevant account information

Flag

Description

Example

<netID>

shows info needed to submit jobs to each partition

acctinfo jth10001

outputs info. for the user jth10001

acctinfo `whoami`

outputs info. about yourself; note that the characters around whoami are backticks (`), not apostrophes

Example output

acctinfo `whoami`
      User      Account            Partition                  QOS
---------- ------------ -------------------- --------------------
jth10001       ucb99411              hi-core              general
jth10001       ucb99411          general-gpu              general
jth10001       ucb99411                debug              general
jth10001       ucb99411              lo-core              general
jth10001       ucb99411              general              general
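
The Account, Partition, and QOS columns map directly onto the -A, -p, and -q job submission flags described below. A minimal sketch is shown here; the partition, account, and QOS values are taken from the example output above, so substitute your own:

srun -n 1 -t 01:00:00 -p general -A ucb99411 -q general --pty bash
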
 HPC Admin's Note

Note: acctinfo is an alias that modifies the output of sacctmgr (docs). The full alias is:

acctinfo(){
    sacctmgr list assoc user=$1 -o format=user,account%12,partition%20,qos
}

Job Submission

srun

initiate an interactive session on the HPC (docs)

sbatch

submit a script to be run in the background when resources become available (docs)

Flag

Description

Example

-J

job name you see in squeue

-J name_of_job

-o

out file name

-o jobname.out

prints command line output from submitted script to a file with the above name

-o jobname_%j.out

same as the above but it also includes the job number; useful if you use the same script more than once. This allows you to go back and investigate if anything goes wrong with a given job.

-n

number of cores you want

-n 1

tells SLURM you want 1 core

-n 20 -N 1

tells SLURM you want 20 cores on a single node

-N

number of nodes your cores will come from

-n 10 -N 1

tells SLURM that you want 10 cores on the same node. Having all of your cores on the same node increases performance because communication between nodes is slow

--mem=

amount of memory (RAM) your job will need

Please note: the default memory per core is 2 gigabytes, but users in need of more than 2 GB of memory can override the overall memory available with the --mem flag.

--mem=5G

tells SLURM you want 5 gigabytes (GB) of memory total

--mem=10G

tells SLURM you want 10 GB of memory total

--mem-per-cpu=

amount of memory (RAM) per core your job will need

--mem-per-cpu=<# per core> × <# cores requested with -n> = total RAM

Please note: the default memory per core is 2 gigabytes, but users in need of more than 2 GB of memory can override it with the --mem-per-cpu flag. Also, when in conflict, the --mem flag overrides the --mem-per-cpu flag.

--mem-per-cpu=4G

tells SLURM you want 4 gigabytes per core

--mem-per-cpu=4G -n 2 -N 1

tells SLURM you want 2 cores on the same node, each with 4 gigabytes of memory, giving you a total of 8 gigabytes of memory available

--gres=gpu:

number of GPUs you want to use

--gres=gpu:1

SLURM will give you a node that has one GPU available

--gres=gpu:2

SLURM will give you a node that has two GPUs available

-p

name of the partition you are targeting

Please note that you will only be able to use priority partitions if your lab has priority access. To access priority partitions, one must also use the -A and -q flags.

-p general

SLURM will look for available nodes on the general partition

-p hi-core

SLURM will look for available nodes on the hi-core partition

-p priority-gpu

SLURM will look for available nodes on the priority-gpu partition.

-t

length of time you’d like the job to run, follows the below format

-t HH:MM:SS, H=hour, M=minute, S=second

Please note that most partitions have maximum time limits. Jobs cannot run longer than the time limits shown here.

-t 01:00:00

SLURM will allocate resources to you for one hour

-t 12:00:00

SLURM will allocate resources to you for 12 hours

-b

tells SLURM to hold off on starting the job until HH:MM or MM/DD/YY

-b 13:15

SLURM will not try starting the job until today at 1:15 pm

-b 01/01/24

SLURM will not try starting the job until January 1st, 2024

-C

“C” stands for “constraints.” Use this flag to constrain SLURM to look only for nodes with specific features

Please note that ALL nodes on HPC 2.0 have features (e.g., gpu, skylake). See full list here.

-C cpuonly

SLURM will look for a node that only has CPUs. Helpful for jobs that do not use GPUs

-C gpu

SLURM will look for nodes that have GPUs

-C cpuonly,skylake

SLURM will look for skylake nodes without GPUs

-x

“x” stands for exclude. This flag tells SLURM the nodes you do NOT want.

-x cn451

Do not submit my job to cn451

-x cn[451-455]

Do not submit job to any node between cn451-cn455

-x cn[451-455],gpu[14,15]

Do not submit job to gpu14, gpu15, or any node between cn451-cn455

-w

tells SLURM to only submit job to a specific node, or a specific set of nodes

Please note that this flag is rarely useful unless trying to backfill a node that you are already partially using.

-w cn459

tells SLURM to only submit job to cn459, even if other nodes are open. SLURM will wait until cn459 is open to run your job.

-A

“A” stands for account. This is normally the netID of the head/PI of your lab.

To check the account your username is associated with, see here.

-A jth10001

tells SLURM that the netID of my advisor is jth10001.

-q

“q” stands for Quality of Service. In practice, we use this to restrict access to priority partitions.

To check the QOS needed to access priority partitions for your account, see here.

-q huskylab

tells SLURM the QOS I need to access a given partition is huskylab.

Please replace huskylab with the QOS you need to access a given partition.

--mail-type=

tells SLURM when to send email notifications related to a given job: BEGIN, END, FAIL, ALL

--mail-type=BEGIN

SLURM will send you an email when your job begins

--mail-type=FAIL

SLURM will send you an email if your job fails

--mail-type=ALL

SLURM will send you an email when the job begins, when it ends, and if it fails

--mail-user=

tells SLURM which email address to send notifications to. This is only needed if you use the --mail-type= flag.

--mail-user=jon.husky@uconn.edu

sends email notifications to jon.husky@uconn.edu.

Please replace jon.husky@uconn.edu with your email address.

--array=

enables job array submissions; useful when you need to run a large number of similar and/or parallel jobs.

For further info, please see SLURM Job Arrays.

--array=1-6

submits a job array of six jobs.

--array=1-6%2

also submits a job array of six jobs, but the % symbol tells SLURM to only execute two at a time.

--x11

enables X-forwarding in interactive jobs; useful when you want to use a GUI for a given software.

Please note that X-forwarding will only work if you also log into the HPC with X-forwarding enabled. For more info, please see GUIs on the HPC.

srun -n 1 --x11 --pty bash

starts an interactive session with one core and X-forwarding enabled

--no-requeue

The scheduler is configured to automatically requeue batch jobs that fail or are preempted.

But sometimes you might not want a job to be resubmitted. In that scenario, you can use the --no-requeue flag.

--no-requeue

Overrides the default behavior and prevents jobs from being automatically requeued.

These flags can be used with srun or in the #SBATCH header of a batch script, and they can often be combined. For a more comprehensive list of flag options, use the man srun command.
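
As an illustration, below is a minimal sketch of a batch script combining several of the flags above. The resource amounts, partition, time limit, email address, and final command are placeholders; adjust them to match your job.

#!/bin/bash
# Job name, output file (%j becomes the job number), and email notifications
#SBATCH -J name_of_job
#SBATCH -o jobname_%j.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jon.husky@uconn.edu
# Resources: 20 cores on a single node with 10 GB of memory total
#SBATCH -n 20
#SBATCH -N 1
#SBATCH --mem=10G
# Partition and time limit
#SBATCH -p general
#SBATCH -t 12:00:00

# Replace the line below with the commands your job should run.
echo "Running on $(hostname)"

If this were saved as myscript.sh (a placeholder name), it would be submitted with sbatch myscript.sh.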


Job Management

squeue

view information about jobs in the queue, a.k.a. jobs that are running or pending (docs)

Flag

Description

Example

<default>

output information on all jobs in the queue

squeue

--me

filter output to only show your jobs

squeue --me

-p

filter output by partition

squeue -p general-gpu

shows jobs running in the general-gpu partition

squeue -p general,priority

shows jobs running in the general partition and then in the priority partition

-n

filter output by specific name of job

squeue -n job_name

shows jobs with the job name “job_name” submitted by any user

-u

filter output by specific user or list of users

squeue -u jth10001

shows jobs submitted by user jth10001

squeue -u jth10001,tpd23099

shows jobs submitted by jth10001 and tpd23099

-A

filter output to show jobs submitted by all users associated with a given PI’s account

squeue -A erm12009

shows jobs submitted by all members of erm12009's lab

scancel

cancel jobs or job arrays (docs)

Flag

Description

Example

<default>

cancel a job with a specific job id

Please note: You only have access to cancel jobs that you have submitted.

scancel 1234567

cancels the job with job ID 1234567

--me

cancel all of your pending or running jobs

scancel --me

-n

cancel all jobs with a given jobname

scancel --me -n job_name

cancels all of your jobs with the name “job_name”

-t

cancel all jobs with a given state

R=running, PD=pending, CG=completing

scancel --me -t PD

cancels all of your pending jobs

-p

cancel all jobs on a given partition

scancel --me -p priority

cancels all of your jobs that are in the priority partition's job queue

shist

view information about jobs that are pending, running, or completed

Flag

Description

Example

<default>

output information about your recent jobs

shist

-r

filter output to only show jobs from a specific partition

shist -r debug

outputs info. about jobs on the debug partition

-s

filter output by job state, including:

pending (pd), running (r), completed (cd), failed (f), timeout (to), and node_fail (nf)

shist -s cd

outputs info. about jobs which have completed

shist -s f,to

outputs info. about jobs which have failed or timed out

-u

filter output to a specific user or list of users

shist -u jth10001

outputs info. about jth10001’s jobs

shist -u jth10001,tpd23099

outputs info. about jth10001’s jobs and then tpd23099’s jobs

-S

filter output by the time it was submitted/run

Please note: This is useful to show info. about jobs which ran more than a few days ago.

Valid time formats include:

  • HH:MM[:SS] [AM|PM] 

  • MM/DD[/YY]  

  • now[{+|-}count[seconds|minutes|hours|days|weeks]]

shist -S 09:00:00AM

outputs info. about jobs submitted after 9:00 AM today

shist -S 09/01/23

outputs info. about jobs submitted on or after September 1, 2023

shist -S now-6days

outputs info. about jobs submitted in the last 6 days

shist -S now-12hours

outputs info. about jobs submitted in the last 12 hours

 HPC Admin's Note

Note 1: shist is an alias that modifies the output of sacct (docs). The full command is:

sacct -o "JobID,Partition,QOS,JobName,User,State,Elapsed,NNodes,NCPUs,NodeList,ExitCode,End" -X

Note 2: If you notice a discrepancy between the output of shist and your job’s out files, it may be worthwhile to use scontrol instead (docs). An example command to check a job’s status is below:

scontrol show job <job_id> -dd
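
Putting these commands together, a typical monitoring workflow might look like the sketch below. The script name myscript.sh and the job ID 1234567 are placeholders.

sbatch myscript.sh                 # prints "Submitted batch job 1234567"
squeue --me                        # confirm the job is pending or running
shist -S now-12hours               # review your recent jobs, including this one
scontrol show job 1234567 -dd      # detailed status if shist and the .out file disagree
scancel 1234567                    # cancel the job if something looks wrong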

Partition and Node Information

nodeinfo

view information about nodes and partitions

Flag

Description

Example

<default>

output information about the status of nodes on each partition

nodeinfo

-p

filter output to only show nodes from a specific partition

nodeinfo -p general-gpu

outputs info. about nodes on the general-gpu partition

-t

filter output by node state, including:

idle - available, no jobs running

alloc - allocated, not available

mixed - partially allocated, some cores available

drain - not accepting jobs, queued for maintenance

down - not available, needs maintenance

nodeinfo -t idle

outputs info. about nodes which are idle

nodeinfo -t idle,mixed

outputs info. about nodes which are either idle or partially allocated with some cores still available

-S

sort output by a given column including:

P - partition

t - state

l - maximum job run time allowed

f - node features (e.g., gpu, skylake)

nodeinfo -S f

sort output by node features

nodeinfo -S P,t

sort output first by partition and then by node state

-i

update output after a specified number of seconds

nodeinfo -i 15

updates output every 15s

Example output

[jth10001@login6 ~]$ nodeinfo -p general-gpu
PARTITION      NODES  STATE  TIMELIMIT   CPUS    GRES MEMORY   ACTIVE_FEATURES   NODELIST
general-gpu        1  maint   12:00:00     64   gpu:1 515404   epyc64,a100,gpu   gpu30
general-gpu        1   drng   12:00:00     64   gpu:3 515404   epyc64,a100,gpu   gpu22
general-gpu        1   drng   12:00:00     64   gpu:1 515404   epyc64,a100,gpu   gpu23
general-gpu        1    mix   12:00:00     36   gpu:3 191954   gpu,v100,skylake  gpu05
general-gpu        3    mix   12:00:00     64   gpu:3 515404   epyc64,a100,gpu   gpu[20-21,29]
general-gpu        4    mix   12:00:00     64   gpu:1 515404   epyc64,a100,gpu   gpu[31-34]
general-gpu        1   idle   12:00:00     36   gpu:1 191954   gpu,v100,skylake  gpu06
general-gpu        8   idle   12:00:00     64   gpu:3 515404   epyc64,a100,gpu   gpu[14-15,35-40]
general-gpu        9   idle   12:00:00     64   gpu:1 515404   epyc64,a100,gpu   gpu[16-19,24-28]
  • Partition: lists the partition

  • Nodes: lists the number of nodes in a given partition with a specific state

  • State: describes the state of the node. Reminder, we can only submit to idle or mix.

  • TimeLimit: maximum amount of time a job can run on a given partition

  • CPUs: number of cores on a given node

  • GRes: lists the number of GPUs available on a given node (ranges from 0-8)

  • Memory: amount of memory available in megabytes (divide by 1024 for memory in gigabytes)

  • Active Features: lists the features of a given node; we can submit jobs to nodes with desirable features (e.g., a node that has GPUs) using the -C flag with srun or sbatch (see the sketch after this list)

  • Nodelist: lists the name of nodes that match the characteristics described in the previous columns
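
As a minimal sketch of putting this output to use, the commands below find idle GPU nodes and then request matching hardware with the -C and --gres flags. The a100 feature comes from the Active_Features column above; substitute the feature and partition you need.

nodeinfo -t idle -p general-gpu
srun -n 1 -t 01:00:00 -p general-gpu --gres=gpu:1 -C a100 --pty bash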

Want more detail on nodes and partitions? Try the sinfo command (docs). Or do you prefer GUIs? Then you can always use sview, which gives you a graphical look at node availability and traits.

 HPC Admin's Note

Note that nodeinfo is an alias for the SLURM command sinfo (docs). The output format of nodeinfo has been optimized to show relevant info for our HPC. The nodeinfo alias’s actual command is below.

nodeinfo(){
    sinfo -o '%14P %.5D %.6t %.10l %.6c %.7G %8m %32b %N' $1 $2 $3 $4 | sed 's/location=local,//g' | sed 's/ACTIVE_FEATURES                  NODELIST/ACTIVE_FEATURES   NODELIST/g';
}

 References

This page was inspired by the Slurm Cheatsheet on the University of Southern California’s Center for Advanced Research Computing website.
