...
Flag | Description | Example |
---|---|---|
<netID> | shows the account, partition, and QOS info needed to submit jobs to each partition | acctinfo jth10001 outputs info. for the user jth10001 acctinfo `whoami` outputs info. about yourself; note that the characters around whoami are backticks (left quotes), not apostrophes |
Example output
```
acctinfo `whoami`
User       Account      Partition            QOS
---------- ------------ -------------------- --------------------
jth10001   ucb99411     hi-core              general
jth10001   ucb99411     general-gpu          general
jth10001   ucb99411     debug                general
jth10001   ucb99411     lo-core              general
jth10001   ucb99411     general              general
```
...
Flag | Description | Example |
---|---|---|
-J | job name you see in squeue | -J name_of_job |
-o | out file name, lowercase “o” | -o jobname.out prints command line output from submitted script to a file with the above name -o jobname_%j.out same as the above but it also includes the job number; useful if you use the same script more than once. This allows you to go back and investigate if anything goes wrong with a given job. |
-n | number of cores you want | -n 1 tells SLURM you want 1 core -n 20 -N 1 tells SLURM you want 20 cores, all on a single node |
-N | number of nodes your cores will come from | -n 10 -N 1 tells SLURM that you want 10 cores on the same node. Having all your cores on the same node increases performance because communication between nodes is slow |
--mem= | amount of memory (RAM) your job will need Please note: the default memory per core is 2 gigabytes, but users in need of more than 2 GB of memory can override the overall memory available with the --mem flag. | --mem=5G tells SLURM you want 5 gigabytes (GB) of memory total --mem=10G tells SLURM you want 10 GB of memory total |
--mem-per-cpu= | amount of memory (RAM) per core your job will need --mem-per-cpu=# x -n <# cores> = total RAM Please note: the default memory per core is 2 gigabytes, but users in need of more than 2 GB of memory per core can override it with the --mem-per-cpu flag. Also, when in conflict, the --mem flag overrides the --mem-per-cpu flag. | --mem-per-cpu=4G tells SLURM you want 4 gigabytes per core --mem-per-cpu=4G -n 2 -N 1 tells SLURM you want 2 cores on the same node, each with 4 gigabytes of memory, giving you a total of 8 gigabytes of memory |
--gres=gpu: | number of GPUs you want to use | --gres=gpu:1 SLURM will give you a node that has one GPU available --gres=gpu:2 SLURM will give you a node that has two GPUs available |
-p | name of the partition you are targeting, lowercase “p” Please note that you will only be able to use priority partitions if your lab has priority access. To access priority partitions, you must also use the -A and -q flags. | -p general SLURM will look for available nodes on the general partition -p hi-core SLURM will look for available nodes on the hi-core partition -p priority-gpu SLURM will look for available nodes on the priority-gpu partition |
-t | length of time you’d like the job to run, follows the below format -t HH:MM:SS, H=hour, M=minute, S=second Please note that most partitions have maximum time limits. Jobs cannot run longer than the time limits shown here. | -t 01:00:00 SLURM will allocate resources to you for one hour -t 12:00:00 SLURM will allocate resources to you for 12 hours |
-b | tells SLURM to hold off on starting the job until HH:MM or MM/DD/YY | -b 13:15 SLURM will not try starting the job until today at 1:15 pm -b 01/01/24 SLURM will not try starting the job until January 1st, 2024 |
-C | “C” stands for “constraints.” Use this flag to constrain SLURM to look only for nodes with specific features. uppercase “C” Please note that ALL nodes on HPC 2.0 have features (e.g., gpu, skylake). See full list here. | -C cpuonly SLURM will look for a node that only has CPUs. Helpful for jobs that do not use GPUs -C gpu SLURM will look for nodes that have GPUs -C cpuonly,skylake SLURM will look for skylake nodes without GPUs |
-x | “x” stands for exclude. This flag tells SLURM the nodes you do NOT want. lowercase “x” | -x cn451 Do not submit my job to cn451 -x cn[451-455] Do not submit job to any node between cn451-cn455 -x cn[451-455],gpu[14,15] Do not submit job to gpu14, gpu15, or any node between cn451-cn455 |
-w | tells SLURM to only submit job to a specific node, or a specific set of nodes, lowercase “w” Please note that this flag is rarely useful unless trying to backfill a node that you are already partially using. | -w cn459 tells SLURM to only submit job to cn459, even if other nodes are open. SLURM will wait until cn459 is open to run your job. |
-A | “A” stands for account. This is normally the netID of the head/PI of your lab. To check the account your username is associated with, see here. | -A jth10001 tells SLURM that the netID of my advisor is jth10001. |
-q | “q” stands for Quality of Service. In practice, we use this to restrict access to priority partitions. To check the QOS needed to access priority partitions for your account, see here. | -q huskylab tells SLURM the QOS I need to access a given partition is huskylab. Please replace huskylab with the QOS you need to access a given partition. |
--mail-type= | tells SLURM when to send email notifications related to a given job: BEGIN, END, FAIL, ALL | --mail-type=BEGIN SLURM will send you an email when your job begins --mail-type=FAIL SLURM will send you an email if your job fails --mail-type=ALL SLURM will send you an email when the job begins, ends, and if the job fails |
--mail-user= | tells SLURM what email to send email notifications to. This is only needed if you use the --mail-type= flag. | --mail-user=jon.husky@uconn.edu sends email notifications to jon.husky@uconn.edu. Please replace jon.husky@uconn.edu with your email address. |
--array= | enables job array submissions; useful when you need to run a large number of similar and/or parallel jobs. For further info, please see SLURM Job Arrays. | --array=1-6 submits a job array of six jobs. --array=1-6%2 also submits a job array of six jobs, but the % symbol tells SLURM to only execute two at a time. |
--x11 | enables X-forwarding in interactive jobs; useful when you want to use a GUI for a given software. Please note that X-forwarding will only work if you also log into the HPC w/ X-forwarding enabled. For more info, please see GUIs on the HPC. | srun -n 1 --x11 --pty bash starts an interactive session with one core and X-forwarding enabled |
--no-requeue | The scheduler is configured to automatically requeue batch jobs that fail or are preempted, but sometimes you might not want a job to be resubmitted. In that scenario, you can use the --no-requeue flag. | --no-requeue overrides the default behavior and prevents jobs from being automatically requeued. |
Please note: These flags can be used with both srun and sbatch.
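For example, here is a minimal sketch of a batch script that combines several of the flags above. The job name, core count, memory, time limit, and partition are illustrative values, and my_program is a hypothetical placeholder for your own command:

```
#!/bin/bash
#SBATCH -J example_job                   # job name shown in squeue
#SBATCH -o example_job_%j.out            # output file; %j expands to the job ID
#SBATCH -n 4 -N 1                        # 4 cores, all on the same node
#SBATCH --mem=8G                         # 8 GB of memory total
#SBATCH -p general                       # target the general partition
#SBATCH -t 02:00:00                      # 2-hour time limit
#SBATCH --mail-type=ALL                  # email when the job begins, ends, or fails
#SBATCH --mail-user=jon.husky@uconn.edu  # replace with your email address

./my_program                             # hypothetical program; replace with your command
```

Submit the script with sbatch (e.g., sbatch myscript.sh); the same flags can also be passed directly on an srun command line for interactive jobs.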
...
Flag | Description | Example |
---|---|---|
<default> | output information on all jobs in the queue | squeue |
--me | filter output to only show your jobs | squeue --me |
-p | filter output by partition, lowercase “p” | squeue -p general-gpu shows jobs running in the general-gpu partition squeue -p general,priority shows jobs running in both the general and priority partitions |
-n | filter output by specific name of job | squeue -n job_name shows jobs with the job name “job_name” submitted by any user |
-u | filter output by specific user or list of users | squeue -u jth10001 shows jobs submitted by user jth10001 squeue -u jth10001,tpd23099 shows jobs submitted by jth10001 and tpd23099 |
-A | filter output to show jobs submitted by all users associated with a given PI’s account | squeue -A erm12009 shows jobs submitted by all members of erm12009’s lab |
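These filters can be combined on one command line; a small sketch, with the partition, usernames, and job name as illustrative values:

```
# show only your own jobs in the general-gpu partition
squeue --me -p general-gpu

# show jobs named job_name submitted by two specific users
squeue -u jth10001,tpd23099 -n job_name
```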
scancel
cancel jobs or job arrays (docs)
Flag | Description | Example |
---|---|---|
<default> | cancel a job with a specific job id Please note: You can only cancel jobs that you have submitted. | scancel 1234567 cancels job with the job_id: 1234567 |
--me | cancel all of your pending or running jobs | scancel --me |
-n | cancel all jobs with a given jobname | scancel --me -n job_name cancels all of your jobs with the name “job_name” |
-t | cancel all jobs with a given state R=running, PD=pending, CG=completing | scancel --me -t PD cancels all of your pending jobs |
-p | cancel all jobs on a given partition, lowercase “p” | scancel --me -p priority cancels all of your jobs that are in the priority partition’s job queue |
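The scancel filters can also be combined; a short sketch, with the partition name as an illustrative value:

```
# cancel one specific job by its job ID
scancel 1234567

# cancel all of your pending jobs in the priority partition
scancel --me -t PD -p priority
```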
shist
view information about jobs that are pending, running, or have completed
Flag | Description | Example |
---|---|---|
<default> | output information about your recent jobs | shist |
-r | filter output to only show jobs from a specific partition | shist -r debug outputs info. about jobs on the debug partition |
-s | filter output by job state (lowercase “s”), including: pending (pd), running (r), completed (cd), failed (f), timeout (to), and node_fail (nf) | shist -s cd outputs info. about jobs which have completed shist -s f,to outputs info. about jobs which have failed or timed out |
-u | filter output to a specific user or list of users | shist -u jth10001 outputs info. about jth10001’s jobs shist -u jth10001,tpd23099 outputs info. about jth10001’s jobs and then tpd23099’s jobs |
-S | filter output by the time the job was submitted/run, uppercase “S” Please note: This is useful to show info. about jobs which ran more than a few days ago. Valid time formats include HH:MM:SS[AM|PM], MM/DD/YY, and now-{count}{hours|days}. | shist -S 09:00:00AM outputs info. about jobs submitted after 9:00 AM today shist -S 09/01/23 outputs info. about jobs submitted on or after September 1, 2023 shist -S now-6days outputs info. about jobs submitted in the last 6 days shist -S now-12hours outputs info. about jobs submitted in the last 12 hours |
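Assuming shist accepts these filters in combination (as the underlying SLURM accounting tools do), a quick sketch; the netID and time window are illustrative:

```
# your jobs that failed or timed out in the last 2 days
shist -u jth10001 -s f,to -S now-2days
```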
...
Flag | Description | Example |
---|---|---|
<default> | output information about the status of nodes on each partition | nodeinfo |
-p | filter output to only show nodes in a specific partition, lowercase “p” | nodeinfo -p general-gpu outputs info. about nodes in the general-gpu partition |
-t | filter output by node state, including: idle - available, no jobs running; alloc - allocated, not available; mixed - partially allocated, some cores available; drain - not accepting jobs, queued for maintenance; down - not available, needs maintenance | nodeinfo -t idle outputs info. about nodes which are idle nodeinfo -t idle,mixed outputs info. about nodes which are in partial use but have some cores available |
-S | sort output by a given column, uppercase “S”; columns include: P - partition; t - state; l - maximum job run time allowed; f - node features (e.g., gpu, skylake) | nodeinfo -S f sorts output by node features nodeinfo -S P,t sorts output first by partition and then by node state |
-i | update output after a specified number of seconds | nodeinfo -i 15 updates output every 15s |
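A short sketch combining the flags above; the partition and refresh interval are illustrative:

```
# idle or partially used nodes in the general-gpu partition, refreshed every 30 seconds
nodeinfo -p general-gpu -t idle,mixed -i 30
```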
Example output
```
[jth10001@login6 ~]$ nodeinfo -p general-gpu
PARTITION    NODES  STATE  TIMELIMIT  CPUS  GRES   MEMORY  ACTIVE_FEATURES   NODELIST
general-gpu  1      maint  12:00:00   64    gpu:1  515404  epyc64,a100,gpu   gpu30
general-gpu  1      drng   12:00:00   64    gpu:3  515404  epyc64,a100,gpu   gpu22
general-gpu  1      drng   12:00:00   64    gpu:1  515404  epyc64,a100,gpu   gpu23
general-gpu  1      mix    12:00:00   36    gpu:3  191954  gpu,v100,skylake  gpu05
general-gpu  3      mix    12:00:00   64    gpu:3  515404  epyc64,a100,gpu   gpu[20-21,29]
general-gpu  4      mix    12:00:00   64    gpu:1  515404  epyc64,a100,gpu   gpu[31-34]
general-gpu  1      idle   12:00:00   36    gpu:1  191954  gpu,v100,skylake  gpu06
general-gpu  8      idle   12:00:00   64    gpu:3  515404  epyc64,a100,gpu   gpu[14-15,35-40]
general-gpu  9      idle   12:00:00   64    gpu:1  515404  epyc64,a100,gpu   gpu[16-19,24-28]
```
Partition: lists the partition
Nodes: lists the number of nodes in a given partition with a specific state
State: describes the state of the node. Reminder, we can only submit to idle or mix nodes.
TimeLimit: maximum amount of time a job can run on a given partition
CPUs: number of cores on a given node
GRes: lists the number of GPUs available on a given node (ranges from 0-8)
Memory: amount of memory available in megabytes (divide by 1024 for memory in gigabytes)
Active Features: lists the features of a given node; we can submit jobs to nodes with desirable features (e.g., a node that has GPUs) using the -C flag with srun or sbatch
Nodelist: lists the names of nodes that match the characteristics described in the previous columns
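For example, a minimal sketch of requesting an interactive session constrained to a GPU node with the -C flag; the core count, time limit, and partition are illustrative:

```
# one core on a node that has GPUs, interactive shell for one hour
srun -n 1 -t 01:00:00 -p general-gpu -C gpu --pty bash
```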
...