...
Flag | Description | Example |
---|---|---|
<netID> | shows the account, partition, and QOS info needed to submit jobs to each partition | acctinfo jth10001 outputs info. for the user jth10001 acctinfo `whoami` outputs info. about yourself; note that the characters around whoami are backticks (left quotes), not apostrophes |
Example output
```
acctinfo `whoami`
      User      Account            Partition                  QOS
---------- ------------ -------------------- --------------------
  jth10001     ucb99411              hi-core              general
  jth10001     ucb99411          general-gpu              general
  jth10001     ucb99411                debug              general
  jth10001     ucb99411              lo-core              general
  jth10001     ucb99411              general              general
```
...
Flag | Description | Example |
---|---|---|
-J | job name you see in squeue | -J name_of_job |
-o | out file name | -o jobname.out prints command line output from submitted script to a file with the above name -o jobname_%j.out same as the above but it also includes the job number; useful if you use the same script more than once. This allows you to go back and investigate if anything goes wrong with a given job. |
-n | number of cores you want | -n 1 tells SLURM you want 1 core -n 20 -N 1 tells SLURM you want 20 cores on a single node |
-N | number of nodes your cores will come from | -n 10 -N 1 tells SLURM that you want 10 cores on the same node. Having all your cores on the same node increases performance because communication between nodes is slow |
--mem= | amount of memory (RAM) your job will need Please note: the default memory per core is 2 gigabytes, but users in need of more than 2 GB of memory can override the overall memory available with the --mem flag. | --mem=5G tells SLURM you want 5 gigabytes (GB) of memory total --mem=10G tells SLURM you want 10 GB of memory total |
--mem-per-cpu= | amount of memory (RAM) per core your job will need --mem-per-cpu=<memory per core> X -n <# cores> = total RAM Please note: the default memory per core is 2 gigabytes, but users in need of more than 2 GB of memory can override it with the --mem-per-cpu flag. Also, when in conflict, the --mem flag overrides the --mem-per-cpu flag. | --mem-per-cpu=4G tells SLURM you want 4 gigabytes per core --mem-per-cpu=4G -n 2 -N 1 tells SLURM you want 2 cores on the same node, each with 4 gigabytes of memory, giving you a total of 8 gigabytes |
--gres=gpu: | number of GPUs you want to use | --gres=gpu:1 SLURM will give you a node that has one GPU available --gres=gpu:2 SLURM will give you a node that has two GPUs available |
-p | name of the partition you are targeting Please note that you will only be able to use priority partitions if your lab has priority access. To access priority partitions, one must also use the -A and -q flags. | -p general SLURM will look for available nodes on the general partition -p hi-core SLURM will look for available nodes on the hi-core partition -p priority-gpu SLURM will look for available nodes on the priority-gpu partition. |
-t | length of time you’d like the job to run, in the format -t HH:MM:SS (H=hours, M=minutes, S=seconds) Please note that most partitions have maximum time limits. Jobs cannot run longer than the time limits shown here. | -t 01:00:00 SLURM will allocate resources to you for one hour -t 12:00:00 SLURM will allocate resources to you for 12 hours |
-b | tells SLURM not to start the job until a given time, in HH:MM or MM/DD/YY format | -b 13:15 SLURM will not try starting the job until today at 1:15 pm -b 01/01/24 SLURM will not try starting the job until January 1st, 2024 |
-C | “C” stands for “constraints.” Use this flag to constrain SLURM to look only for nodes with specific features Please note that ALL nodes on HPC 2.0 have features (e.g., gpu, skylake). See full list here. | -C cpuonly SLURM will look for a node that only has CPUs. Helpful for jobs that do not use GPUs -C gpu SLURM will look for nodes that have GPUs -C cpuonly,skylake SLURM will look for skylake nodes without GPUs |
-x | “x” stands for exclude. This flag tells SLURM the nodes you do NOT want. | -x cn451 Do not submit my job to cn451 -x cn[451-455] Do not submit job to any node between cn451-cn455 -x cn[451-455],gpu[14,15] Do not submit job to gpu14, gpu15, or any node between cn451-cn455 |
-w | tells SLURM to only submit job to a specific node, or a specific set of nodes Please note that this flag is rarely useful unless trying to backfill a node that you are already partially using. | -w cn459 tells SLURM to only submit job to cn459, even if other nodes are open. SLURM will wait until cn459 is open to run your job. |
-A | “A” stands for account. This is normally the netID of the head/PI of your lab. To check the account your username is associated with, see here. | -A jth10001 tells SLURM that the netID of my advisor is jth10001. |
-q | “q” stands for Quality of Service. In practice, we use this to restrict access to priority partitions. To check the QOS needed to access priority partitions for your account, see here. | -q huskylab tells SLURM the QOS I need to access a given partition is huskylab. Please replace huskylab with the QOS you need to access a given partition. |
--mail-type= | tells SLURM when to send email notifications related to a given job: BEGIN, END, FAIL, ALL | --mail-type=BEGIN SLURM will send you an email when your job begins --mail-type=FAIL SLURM will send you an email if your job fails --mail-type=ALL SLURM will send you an email when the job begins, ends, and if the job fails |
--mail-user= | tells SLURM what email to send email notifications to. This is only needed if you use the --mail-type= flag. | --mail-user=jon.husky@uconn.edu sends email notifications to jon.husky@uconn.edu. Please replace jon.husky@uconn.edu with your email address. |
--array= | enables job array submissions; useful when you need to run a large number of similar and/or parallel jobs. For further info, please see SLURM Job Arrays. | --array=1-6 submits a job array of six jobs. --array=1-6%2 also submits a job array of six jobs, but the % symbol tells SLURM to only execute two at a time. |
--x11 | enables X-forwarding in interactive jobs; useful when you want to use a GUI for a given piece of software. Please note that X-forwarding will only work if you also log into the HPC with X-forwarding enabled. For more info, please see GUIs on the HPC. | srun -n 1 --x11 --pty bash starts an interactive session with one core and X-forwarding enabled |
--no-requeue | The scheduler is configured to automatically requeue batch jobs that fail or are preempted, but sometimes you might not want a job to be resubmitted. In that scenario, you can use the --no-requeue flag. | --no-requeue overrides the default behavior and prevents the job from being automatically requeued. |
Info: These flags can be used with both srun and sbatch. A sample batch script combining several of them is sketched below.
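To see how these flags fit together in practice, here is a minimal batch-script sketch. It uses only flags documented above; the job name, partition, resource amounts, email address, and the final command are placeholders to replace with your own values.

```bash
#!/bin/bash
#SBATCH -J example_job                    # job name shown in squeue
#SBATCH -o example_job_%j.out             # output file; %j is replaced with the job number
#SBATCH -p general                        # target partition (placeholder)
#SBATCH -n 4                              # request 4 cores
#SBATCH -N 1                              # keep all 4 cores on one node
#SBATCH --mem=8G                          # 8 GB of memory total
#SBATCH -t 01:00:00                       # 1 hour time limit
#SBATCH --mail-type=END,FAIL              # email when the job ends or fails
#SBATCH --mail-user=jon.husky@uconn.edu   # replace with your email address

# Replace the line below with the commands your job actually runs
./my_program                              # hypothetical executable
```

Save the script (e.g., as example_job.sh), submit it with sbatch example_job.sh, and squeue --me should show it in the queue.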
...
Flag | Description | Example |
---|---|---|
<default> | output information on all jobs in the queue | squeue |
--me | filter output to only show your jobs | squeue --me |
-p | filter output by partition | squeue -p general-gpu shows jobs running in the general-gpu partition squeue -p general,priority shows jobs running in the general and priority partitions |
-n | filter output by specific name of job | squeue -n job_name shows jobs with the job name “job_name” submitted by any user |
-u | filter output by specific user or list of users | squeue -u jth10001 shows jobs submitted by user jth10001 squeue -u jth10001,tpd23099 shows jobs submitted by jth10001 and tpd23099 |
-A | filter output to show jobs submitted by all users associated with a given PI’s account | squeue -A erm12009 shows jobs submitted by all members of erm12009’s lab |
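These filters can be combined in a single command. A brief sketch; the partition, username, and job name below are placeholders:

```bash
# show only your jobs in the general-gpu partition
squeue --me -p general-gpu

# show a specific user's jobs with a specific job name
squeue -u tpd23099 -n job_name
```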
scancel
cancel jobs or job arrays (docs)
Flag | Description | Example |
---|---|---|
<default> | cancel a job with a specific job id Please note: You only have access to cancel jobs that you have submitted. | scancel 1234567 cancels job with the job_id: 1234567 |
--me | cancel all of your pending or running jobs | scancel --me |
-n | cancel all jobs with a given jobname | scancel --me -n job_name cancels all of your jobs with the name “job_name” |
-t | cancel all jobs with a given state: R=running, PD=pending, CG=completing | scancel --me -t PD cancels all of your pending jobs |
-p | cancel all jobs on a given partition | scancel --me -p priority cancels all of your jobs that are in the priority partition’s job queue |
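As with squeue, these filters can be combined to target exactly the jobs you want to cancel. A brief sketch using only the flags above (the job name is a placeholder):

```bash
# cancel only your pending jobs waiting in the priority partition's queue
scancel --me -t PD -p priority

# cancel your jobs named "job_name" that are still pending
scancel --me -n job_name -t PD
```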
shist
view information about jobs that are pending, running, or have completed
Flag | Description | Example |
---|---|---|
<default> | output information about your recent jobs | shist |
-r | filter output to only show jobs from a specific partition | shist -r debug outputs info. about jobs on the debug partition |
-s | filter output by job state, including: pending (pd), running (r), completed (cd), failed (f), timeout (to), and node_fail (nf) | shist -s cd outputs info. about jobs which have completed shist -s f,to outputs info. about jobs which have failed or timed out |
-u | filter output to a specific user or list of users | shist -u jth10001 outputs info. about jth10001’s jobs shist -u jth10001,tpd23099 outputs info. about jth10001’s jobs and then tpd23099’s jobs |
-S | filter output by the time a job was submitted/run Please note: This is useful to show info. about jobs which ran more than a few days ago. Valid time formats include those shown in the examples: HH:MM:SS (with optional AM/PM), MM/DD/YY, and now-N hours/days (e.g., now-6days, now-12hours). | shist -S 09:00:00AM outputs info. about jobs submitted after 9:00 AM today shist -S 09/01/23 outputs info. about jobs submitted on or after September 1, 2023 shist -S now-6days outputs info. about jobs submitted in the last 6 days shist -S now-12hours outputs info. about jobs submitted in the last 12 hours |
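Assuming shist accepts its filters in combination (each flag is documented individually above), a typical query might look like this sketch:

```bash
# your jobs that failed or timed out within the last 6 days
shist -u `whoami` -s f,to -S now-6days
```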
...
Flag | Description | Example |
---|---|---|
<default> | output information about the status of nodes on each partition | nodeinfo |
-p | filter output to only show nodes from a specific partition | nodeinfo -p general-gpu outputs info. about nodes on the general-gpu partition |
-t | filter output by node state, including: idle - available, no jobs running; alloc - allocated, not available; mixed - partially allocated, some cores available; drain - not accepting jobs, queued for maintenance; down - not available, needs maintenance | nodeinfo -t idle outputs info. about nodes which are idle nodeinfo -t idle,mixed outputs info. about nodes which are idle or in partial use with some cores available |
-S | sort output by a given column, including: P - partition; t - state; l - maximum job run time allowed; f - node features (e.g., gpu, skylake) | nodeinfo -S f sorts output by node features nodeinfo -S P,t sorts output first by partition and then by node state |
-i | update output after a specified number of seconds | nodeinfo -i 15 updates output every 15s |
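Assuming the filters combine the same way the single-flag examples do, a typical query looks like the sketch below; the partition name is just an example:

```bash
# idle or partially used nodes on the general-gpu partition, sorted by state
nodeinfo -p general-gpu -t idle,mixed -S t
```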
Example output
```
[jth10001@login6 ~]$ nodeinfo -p general-gpu
PARTITION    NODES STATE TIMELIMIT CPUS GRES  MEMORY ACTIVE_FEATURES  NODELIST
general-gpu  1     maint 12:00:00  64   gpu:1 515404 epyc64,a100,gpu  gpu30
general-gpu  1     drng  12:00:00  64   gpu:3 515404 epyc64,a100,gpu  gpu22
general-gpu  1     drng  12:00:00  64   gpu:1 515404 epyc64,a100,gpu  gpu23
general-gpu  1     mix   12:00:00  36   gpu:3 191954 gpu,v100,skylake gpu05
general-gpu  11    mix   12:00:00  64   gpu:3 515404 epyc64,a100,gpu  gpu[14-15,20-21,29,35-40]
general-gpu  14    mix   12:00:00  64   gpu:1 515404 epyc64,a100,gpu  gpu24gpu[31-34]
general-gpu  1     idle  12:00:00  36   gpu:1 191954 gpu,v100,skylake gpu06
general-gpu  8     idle  12:00:00  64   gpu:3 515404 epyc64,a100,gpu  gpu[14-15,35-40]
general-gpu  9     idle  12:00:00  64   gpu:1 515404 epyc64,a100,gpu  gpu[16-19,2524-28,31-34]
```
Partition: lists the partition
Nodes: lists the number of nodes in a given partition with a specific state
State: describes the state of the node. Reminder, we can only submit to idle or mix nodes.
Timelimit: maximum amount of time a job can run on a given partition
CPUs: number of cores on a given node
GRes: lists the number of GPUs available on a given node (ranges from 0-8)
Memory: amount of memory available in megabytes (divide by 1024 for memory in gigabytes)
Active Features: lists the features of a given node; we can submit jobs to nodes with desirable features (e.g., a node that has GPUs) using the -C flag with srun or sbatch
Nodelist: lists the names of the nodes that match the characteristics described in the previous columns
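Putting this output to use: once nodeinfo shows which features have idle or mixed nodes, you can target them with the -C and --gres flags described earlier. A hedged sketch using feature names from the example output above:

```bash
# interactive session on a general-gpu node advertising the a100 feature, with one GPU
srun -n 1 -p general-gpu --gres=gpu:1 -C a100 --pty bash
```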
...