Long wait times for your jobs? Errors about unavailable resources? We’ve been there and understand how frustrating it can be for jobs to take a long time to run. It’s an unfortunate consequence of having such a strong computational research community at UConn. LOTS of incredible research happens here, but it also means that there are LOTS of people competing for resources. There’s no getting around that problem, but there are a couple of steps we can take to increase the odds that our jobs get on ASAP.

  1. Check what resources are available before we submit a job.

  2. Target our submission to those available resources.

This FAQ will offer guidance on how to do both of those things.


Checking for Available Resources

The sinfo command below will give you a high-level view of which nodes are fully available (“idle”), in partial use (“mix”), not available (“alloc”, for allocated), or otherwise out of service or in need of maintenance (all other states besides idle, mix, or alloc). The nodes are listed in order by partition.

Code Block
sinfo -o '%14P %.5a %.10l %.6D %.6t %30N %b'

The output for that command will look like this, but it will be much longer and provide info on every partition.

Code Block
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST                       ACTIVE_FEATURES
GeoSciMP          up 1-00:00:00     38    mix cn[410-447]                    location=local,epyc64,cpuonly
class             up    4:00:00      3  inval cn[407-409]                    location=local,skylake,cpuonly
class             up    4:00:00      1  maint cn348                          location=local,skylake,cpuonly
class             up    4:00:00     19  down* cn[244-246,252,254-255,268-271 location=local,haswell,cpuonly
class             up    4:00:00      7  down* cn[335-337,339-340,373,406]    location=local,skylake,cpuonly
class             up    4:00:00      1   comp cn376                          location=local,skylake,cpuonly
class             up    4:00:00     10    mix cn[333,352,362,374-375,391,402 location=local,skylake,cpuonly
class             up    4:00:00     25  alloc cn[329-332,334,338,341-342,345 location=local,skylake,cpuonly
... (etc.)
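
Since the full output can run to hundreds of rows, it may help to boil it down to the partitions that actually have usable nodes. The sketch below is one possible way to do that, not an official tool: `usable_nodes` is a hypothetical helper name, and it assumes the column order produced by the format string above (partition first, node count fourth, state fifth). It runs on a few sample rows copied from the output above so it can be tried without access to the cluster.

```shell
#!/usr/bin/env bash
# usable_nodes is a hypothetical helper name, not part of Slurm.
# It reads sinfo-style rows on stdin and totals the idle and mix nodes per partition.
usable_nodes() {
    # Fields: $1 = partition, $4 = node count, $5 = state; NR > 1 skips the header row.
    awk 'NR > 1 && ($5 == "idle" || $5 == "mix") { total[$1] += $4 }
         END { for (p in total) print p, total[p] }' | LC_ALL=C sort
}

# Sample rows copied from the output above.
sample='PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST                       ACTIVE_FEATURES
GeoSciMP          up 1-00:00:00     38    mix cn[410-447]                    location=local,epyc64,cpuonly
class             up    4:00:00      1  maint cn348                          location=local,skylake,cpuonly
class             up    4:00:00     10    mix cn[333,352,362,374-375,391,402 location=local,skylake,cpuonly
class             up    4:00:00     25  alloc cn[329-332,334,338,341-342,345 location=local,skylake,cpuonly'

printf '%s\n' "$sample" | usable_nodes
```

On the cluster, the same function could read live data by piping the sinfo command above into it in place of the sample rows.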

The above command gives us an overarching picture of usage on the cluster. From there, we can use a more targeted command to get more information on individual nodes within a partition, like how many cores are in use, how many are available, and which GPUs each node has. The sinfo command below targets the priority-gpu partition, but we could amend it to target any other partition, like hi-core for example.

Code Block
sinfo -p priority-gpu -t idle,mix -o%10n%20C%10G%10t%20R%b
HOSTNAMES CPUS(A/I/O/T)       GRES      STATE     PARTITION           ACTIVE_FEATURES
gpu06     13/23/0/36          gpu:1     mix       priority-gpu        location=local,gpu,v100,skylake
gpu20     36/28/0/64          gpu:3     mix       priority-gpu        location=local,epyc64,a100,gpu
gpu21     25/39/0/64          gpu:3     mix       priority-gpu        location=local,epyc64,a100,gpu
gpu22     3/61/0/64           gpu:3     mix       priority-gpu        location=local,epyc64,a100,gpu
gpu23     33/31/0/64          gpu:1     mix       priority-gpu        location=local,epyc64,a100,gpu
gtx02     2/18/0/20           gpu:3     mix       priority-gpu        location=local,gpu,gtx,skylake
gtx03     12/8/0/20           gpu:3     mix       priority-gpu        location=local,gpu,gtx,skylake
gtx08     1/19/0/20           gpu:2     mix       priority-gpu        location=local,gpu,gtx,skylake
gtx11     2/18/0/20           gpu:2     mix       priority-gpu        location=local,gpu,gtx,skylake
gtx15     28/4/0/32           gpu:8     mix       priority-gpu        location=local,gpu,rtx,skylake
gtx09     0/20/0/20           gpu:2     idle      priority-gpu        location=local,gpu,gtx,skylake

The column titled “CPUS(A/I/O/T)” tells us how many cores are available. A stands for Allocated, I stands for Idle, O stands for Other (which you can ignore), and T stands for Total. Since gpu21 shows 39 in the Idle position, 39 cores are available to use on that node. In this example, however, all 3 of the GPUs on gpu21 are in use, so we can’t use any GPUs on that node. (Note that the GRES column lists the GPUs a node has, not how many of them are free.) So, that gives us an idea of the resources: if I only needed cores and no GPUs, I could target gpu21.
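
Reading the Idle figure by eye gets tedious on a long listing, so here is a minimal sketch of how it could be pulled out automatically. `idle_cores` is a hypothetical helper name, not a Slurm command, and it assumes the hostname and the A/I/O/T figure are the first two whitespace-separated columns, as in the output above (a hostname of 10 or more characters would run into the next column under that format string). It is shown on sample rows copied from the output above.

```shell
#!/usr/bin/env bash
# idle_cores is a hypothetical helper name, not part of Slurm.
# It reads sinfo-style rows on stdin and prints "hostname idle_cores"
# for every node with at least the requested number of idle cores.
idle_cores() {
    local min_idle="${1:-1}"
    # $1 = hostname, $2 = CPUS(A/I/O/T); NR > 1 skips the header row.
    awk -v min="$min_idle" 'NR > 1 {
        split($2, cpu, "/")   # cpu[1]=allocated, cpu[2]=idle, cpu[3]=other, cpu[4]=total
        if (cpu[2] + 0 >= min) print $1, cpu[2]
    }'
}

# Sample rows copied from the output above.
sample='HOSTNAMES CPUS(A/I/O/T)       GRES      STATE     PARTITION           ACTIVE_FEATURES
gpu06     13/23/0/36          gpu:1     mix       priority-gpu        location=local,gpu,v100,skylake
gpu21     25/39/0/64          gpu:3     mix       priority-gpu        location=local,epyc64,a100,gpu
gpu22     3/61/0/64           gpu:3     mix       priority-gpu        location=local,epyc64,a100,gpu
gtx15     28/4/0/32           gpu:8     mix       priority-gpu        location=local,gpu,rtx,skylake'

# List the nodes with at least 30 idle cores.
printf '%s\n' "$sample" | idle_cores 30
```

With the sample rows, this lists gpu21 and gpu22, the two nodes with 30 or more idle cores. On the cluster, the sinfo command above could be piped into `idle_cores` in place of the sample.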

In summary, these two commands can give us a picture of what partitions have resources available, and then what resources are available on individual nodes within that partition.


Targeting a specific partition

The next step is submitting a job that targets a specific partition. If you’re not sure how to do that, please visit our SLURM Guide, where you will see examples of submission scripts that target different partitions and architectures. Another key part of targeting a specific partition is knowing which partitions you are allowed to use and which “account” and “QOS” you must use to access them. To check this, you can use the following command:

Code Block
sacctmgr list assoc user=`whoami` -o format=user,account,partition%20,qos%20
      User    Account            Partition                  QOS 
---------- ---------- -------------------- -------------------- 
net10004    jth10001               hi-core              general 
net10004    jth10001           general-gpu              general 
net10004    jth10001                 debug              general 
net10004    jth10001               lo-core              general 
net10004    jth10001               general              general
net10004    jth10001          priority-gpu          jth10001gpu

This tells me that I have access to 6 partitions. To access the priority-gpu partition, I need to include the three flags below in the #SBATCH header of my submission script. These values are different for every user, so you will have to substitute the partitions you have access to and the account and QOS that are associated with your username.

Code Block
#SBATCH -p priority-gpu     # partition I want to access
#SBATCH -A jth10001         # account I am associated with
#SBATCH -q jth10001gpu      # QOS needed to access partition
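
To avoid copying the account and QOS by hand, the three header lines can be generated from the association list. The sketch below is only an illustration: `sbatch_flags` is a hypothetical helper name, and it assumes pipe-delimited rows in the order user, account, partition, QOS. It is shown on sample rows taken from the table above.

```shell
#!/usr/bin/env bash
# sbatch_flags is a hypothetical helper name, not part of Slurm.
# It reads "user|account|partition|qos" rows on stdin and prints the
# #SBATCH header lines for the partition given as the first argument.
sbatch_flags() {
    local partition="$1"
    awk -F'|' -v p="$partition" '$3 == p {
        printf "#SBATCH -p %s\n#SBATCH -A %s\n#SBATCH -q %s\n", $3, $2, $4
    }'
}

# Sample rows taken from the table above, rewritten as pipe-delimited text.
sample='net10004|jth10001|hi-core|general
net10004|jth10001|general|general
net10004|jth10001|priority-gpu|jth10001gpu'

printf '%s\n' "$sample" | sbatch_flags priority-gpu
```

On the cluster, rows in this shape should be obtainable with sacctmgr's `-n` (no header) and `-P` (pipe-delimited) options, for example `sacctmgr -n -P list assoc user="$(whoami)" format=user,account,partition,qos | sbatch_flags priority-gpu`. Treat that invocation as an assumption to verify against your own output.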

If you have further questions about how to check what resources are available and how to target them, please feel free to contact the Storrs HPC admins by sending an email to hpc@uconn.edu.
