...

acctinfo`whoami` User Account Partition QOS ---------- ---------- -------------------- -------------------- net10004 jth10001

You can also use the following to see available recources for a particular partition in a simpler format. For example:

Expand

Long wait times for your jobs? Errors about unavailable resources? We’ve been there and understand how frustrating it can be for jobs to take a long time to run. It’s an unfortunate consequence of having such a strong computational research community at UConn. LOTS of incredible research happens here, but it also means that there are LOTS of people competing for resources. There’s no getting around that problem, but there are a couple of steps we can take to increase the odds that our jobs get on ASAP.

Check what resources are available before we submit a job. (And then)
Target our submission to those available resources.

This FAQ will offer guidance on how to do both of those things.

Checking for Available Resources

The below nodeinfo command will give you a high-level view of what nodes are fully available (“idle”), in partial use (“mix”), not available (“alloc”. for allocated), or otherwise in need of maintenance (all other states besides idle, mix, or alloc). The nodes will be broken down in order by partition.

Code Block
nodeinfo

The output for that command will look like this, but it will be much longer and provide info on every partition (not just class and debug).

Code Block

PARTITION      NODES  STATE  TIMELIMIT   CPUS    GRES MEMORY   ACTIVE_FEATURES   NODELIST
class              1  inval    4:00:00     36   gpu:3 191954   gpu,v100,skylake  gpu05
class              1 maint*    4:00:00    128  (null) 515404   epyc128,cpuonly   cn563
class              3 drain$    4:00:00     36  (null) 191845+  skylake,cpuonly   cn[360-361,363]
class              8  maint    4:00:00     36  (null) 191845+  skylake,cpuonly   cn[362,365-366,368-372]
class              1  maint    4:00:00    128  (null) 515404   epyc128,cpuonly   cn561
class              1  comp*    4:00:00    128  (null) 515404   epyc128,cpuonly   cn564
class              3   drng    4:00:00    128  (null) 515404   epyc128,cpuonly   cn[560,565,569]
class              6  alloc    4:00:00    128  (null) 515404   epyc128,cpuonly   cn[559,562,566-568,570]
debug              1 maint*      30:00    128  (null) 515404   epyc128,cpuonly   cn563

... (etc.)

The above command gives us an overarching picture of usage on the cluster, and from there, we can use a more targeted command to get more information on individual nodes within a partition, like how many cores or GPUs are in use and how many are available. The base sinfo command is below and it targets the priority-gpu partition but we could amend it to target any other partition, like hi-core for example.

Code Block

sinfo -p priority-gpu -t idle,mix -o%10n%20C%10G%10t%20R%b
HOSTNAMES CPUS(A/I/O/T)       GRES      STATE     PARTITION           ACTIVE_FEATURES
gpu06     13/23/0/36          gpu:1     mix       priority-gpu        location=local,gpu,v100,skylake
gpu20     36/28/0/64          gpu:3     mix       priority-gpu        location=local,epyc64,a100,gpu
gpu21     25/39/0/64          gpu:3     mix       priority-gpu        location=local,epyc64,a100,gpu
gpu22     3/61/0/64           gpu:3     mix       priority-gpu        location=local,epyc64,a100,gpu
gpu23     33/31/0/64          gpu:1     mix       priority-gpu        location=local,epyc64,a100,gpu
gtx02     2/18/0/20           gpu:3     mix       priority-gpu        location=local,gpu,gtx,skylake
gtx03     12/8/0/20           gpu:3     mix       priority-gpu        location=local,gpu,gtx,skylake
gtx08     1/19/0/20           gpu:2     mix       priority-gpu        location=local,gpu,gtx,skylake
gtx11     2/18/0/20           gpu:2     mix       priority-gpu        location=local,gpu,gtx,skylake
gtx15     28/4/0/32           gpu:8     mix       priority-gpu        location=local,gpu,rtx,skylake
gtx09     0/20/0/20           gpu:2     idle      priority-gpu        location=local,gpu,gtx,skylake

The column titled “CPUS (A/I/O/T)” tells us how many cores are available. “A” stands for Allocated, “I” stands for Idle, and “T” stands for Total. (“O” stands for Other but you can ignore that column) Since there are 39 cores in the “Idle” column for GPU21, that means 39 cores are available to use. But all 3 of the GPUs on GPU21 are in use so we can’t use any GPUs on that node. So, that gives us an idea of the resources. If I only needed cores and no GPUs, I could target GPU21.

In summary, these two commands can give us a picture of what partitions have resources available, and then what resources are available on individual nodes within that partition.

Targeting a specific partition

The next step is submitting a job targeting a specific partition. If you’re not sure how to target a specific partition, please visit our SLURM Guide where you will see examples of submission scripts that target different partitions and architectures. Another key part of targeting a specific partition is knowing what partitions you are allowed to use and what “account” and “QOS” you must use to access them. To check what partitions you’re allowed to use and how to access them you can use this command.

Code Block

Code Block
sinfo -p priority -o%C \| sed 's/CPUS//; s/A/Alloc/; s/I/Idle/; s/O/Other/; s/T/Total/'\| tr "/" "\t" \| tr -d "()" Alloc Idle Other Total 4667 8664 705 14036

In summary, these two commands can give us a picture of what partitions have resources available, and then what resources are available on individual nodes within that partition.

Targeting a specific partition

The next step is submitting a job targeting a specific partition. If you’re not sure how to target a specific partition, please visit our SLURM Guide where you will see examples of submission scripts that target different partitions and architectures. Another key part of targeting a specific partition is knowing what partitions you are allowed to use and what “account” and “QOS” you must use to access them. To check what partitions you’re allowed to use and how to access them you can use this command.

Code Block

acctinfo`whoami`
      User    Account          hi-core    Partition          general  net10004    jth10001  QOS 
       general-gpu   ---------- ---------- -------------------- -------------------- 
net10004    jth10001           general  net10004  hi-core  jth10001            general 
net10004   debug jth10001           general-gpu  general  net10004    jth10001      general 
net10004    jth10001   lo-core              generaldebug  net10004    jth10001        general 
net10004    jth10001 general              general
net10004    jth10001          priority-gpu          jth10001gpu

This tells me that I have access to 6 partitions. To access the priority-gpu partition, I need to include the below three flags in the #SBATCH header of my submission script. This will be different for every individual so you will have to modify this with the partitions you have access to and the account and QOS that are associated with your account.

Code Block
#SBATCH -p priority-gpu # partition I want to access #SBATCH -A jth10001 # account I am associated with #SBATCH -q jth10001gpu # QOS needed to access partition

If you have further questions about how to check what resources are available and how to target them, please feel free to contact the Storrs HPC admins by sending an email to hpc@uconn.edu.

 lo-core              general 
net10004    jth10001               general              general
net10004    jth10001          priority-gpu          jth10001gpu

This tells me that I have access to 6 partitions. To access the priority-gpu partition, I need to include the below three flags in the #SBATCH header of my submission script. This will be different for every individual so you will have to modify this with the partitions you have access to and the account and QOS that are associated with your account.

Code Block
#SBATCH -p priority-gpu # partition I want to access #SBATCH -A jth10001 # account I am associated with #SBATCH -q jth10001gpu # QOS needed to access partition

If you have further questions about how to check what resources are available and how to target them, please feel free to contact the Storrs HPC admins by sending an email to hpc@uconn.edu.

How does the HPC decide which jobs run first?

Expand

HPCs have a lot of similarities with airports. Let’s talk about how:

Security and Access

Accessing an HPC is a bit like entering an airport. To get inside an airport, you have to show your ID and go through security. Similarly, you need to both have an account and use your password (or an ssh key) to access the HPC.

Image Added

Once you get inside, you hang around the terminal until your flight starts boarding. The HPC equivalent of the terminal is called the login node. There are some basic things you can do on login nodes---move files, edit documents, etc.---but you shouldn’t run any really computationally programs or analyses on the login nodes. It’d be like blatantly cutting in front of everyone in line at the airport Starbuck’s. Not only would it be disrespectful to everyone else in line (i.e., on the login node), when the Starbuck’s staff (i.e., Storrs HPC Admins) saw what happened, that person would be kicked out of the cafe.

Intense computational programs or analyses, a.k.a. jobs, should be run on compute nodes. In this analogy, jobs are groups of passengers (e.g., families) and compute nodes are planes. Just as planes have a limited number of seats, compute nodes have a limited number of cores (a.k.a. CPUs). Jobs can only get onto compute nodes that have enough cores (i.e., “seats”) available.

Image Added

People who buy tickets (submit jobs) ahead of time generally have their seats reserved ahead of people who just show up at the gate looking for a seat (requesting an interactive job). But there are exceptions. When a flight is overbooked (more cores requested than the HPC has available), people who are part of an airline’s frequent flyer program (have priority access on the HPC) get first dibs on (HPC resources).

Okay, this analogy works well for how individuals relate to the HPC, but we have to use a different analogy to understand the structure of HPCs and how they operate as a whole.

Structure and Organization

To understand the broader structure of HPCs and how they work, we can look at how airports are organized. All major airports have an air traffic control tower and an air traffic controller (ATC) working inside it. The HPC’s equivalent of the ATC tower is called the head node and the role of the ATC is played by a program called SLURM. The ATC’s (SLURM’s) main job is directing which planes (jobs) get to use the runways (compute nodes) because there are usually more planes flying in the air (running jobs) and waiting on the ground (pending jobs) than there are runways (nodes) available.

Image Added

The ATC (SLURM) takes many things into account when deciding which planes (jobs) get to use a given runway (node) next. Here a few that are similar to HPCs.

how long a plane (job) has been waiting on the ground (pending in the job queue)
the size of the plane (how many cores the job is requesting)

In general, SLURM will let a job that has been waiting for 6 hours will get on the next open node before a job that has been waiting for 1 hour. But if a job is massive, smaller jobs may get on before it.

Here’s where SLURM differs from an airport’s ATC. Access to airport runways generally operate on a first come-first served basis, but SLURM adds another consideration to prevent a single HPC user from monopolizing all of the HPC’s resources. It takes into account the number (and size) of jobs a given user has submitted recently. A user who has submitted thousands of jobs in the last week will be pushed down the list to give all users fair, equitable access to HPC resources.

The last important consideration for an HPC which doesn’t generally apply to airports is that users can buy priority access to HPC resources. Priority access is like TSA PreCheck. Priority users still wait in line, but it’s a much shorter line and they tend to get through faster.

When will my pending job start running?

Expand

You can use the following command to see SLURM’s estimate of the start time:

Code Block
squeue -j <jobid> --start

This is only an estimate, and in most cases will not be the actual start times of the jobs.

...

Troubleshooting problems

Why did my job fail?

...

Expand

Short answer: If you received this error, your job most likely failed because the amount of memory (RAM) it needed was larger than the default. We can request more memory for your job using the --mem or --mem-per-cpu flag.

Long answer: There are several reasons a job may fail from insufficient memory. As of January 2023, the default amount of memory available per CPU is 2 gigabytes. The default mem-per-cpu of 2 GB is okay for many users, but not for all. Users who receive OOM errors need to request more memory when submitting or initiating jobs. Those users can easily override the default using the -mem-per-cpu flag in your submission script. The new line would look something like this:

Code Block
#SBATCH -n 2 # asks for 2 cores #SBATCH --mem-per-cpu=3G # asks for 3 GB or RAM per core, or 6 GB total

Adding this line will tell SLURM to use 3 gigabytes of RAM per CPU you request. That means if we ask for 2 cores (-n 2), then we’ll be allocated 6 gigabytes of RAM total. Please note that the --mem-per-cpu flag must be used with the -n flag specifying the number of cores you want. Alternatively, users can use the --mem flag to specify the total amount of RAM they need regardless of how many CPUs are requested.

Code Block
#SBATCH --mem=5G # asks for 105 gigabytes total

We encourage users to please adjust the --mem-per-cpu or --mem flags in a step-wise fashion. First, we try 3 gigabytes, then 4, then 5, etc. until our jobs start running without failing from memory errors. That strategy helps ensure that every HPC user's jobs get on quickly and run efficiently. For more info on fisbatch, srun, and #SBATCH flags, see this link.

...

Expand

Short answer: Your job is being “held.” To release the job and re-submit it to the job queue you can use the scontrol release {JOBID} command.

Long Answer: Your job failed. We have a separate FAQ on figuring out why a job failed here, but here we will focus on why your job is being held. When jobs fail, they used to be automatically re-queued. This was a problem for a number of users because re-running the job would overwrite their previous data. In January 2024, we re-configured SLURM to prevent this problem. Now, when jobs fail, they are not immediately re-queued. Instead, the jobs will be “held” from the queue until the submitting user “releases” those jobs back into the queue. This change prevents jobs from requeueing automatically and allows users to make a conscious choice to re-queue their jobs. You can re-queue jobs using the below commands:

To release a single job
Code Block
scontrol release {JOBID}

To release multiple jobs

Code Block
scontrol release {JOBID_1},{JOBID_2},{JOBID_3}

To release all jobs with a given job name
Code Block
scontrol release jobname={JOBNAME}

To release all of your held jobs

Code Block
squeue --me \| grep ' SE ' \| awk '{print $1}' \| xargs -n1 scontrol release

If you release your jobs into the queue and they keep ending up back in the “held” state, that is an indication that there may be something failing within your submission script in which case you should cancel your jobs and start troubleshooting. Please note that jobs which that are left in the queue with the “SE” state will be cancelled canceled after seven days.

Please feel free to contact us at hpc@uconn.edu with any questions or concerns.

...

Expand

Short answer: One of the login nodes is most likely not working properly. Try to ssh into any of the three login nodes directly.

Long Answer: When you ssh into hpc2 login.storrs.hpc.uconn.edu, you are directed to one of our three login nodes (login4, login5, or login6). Occasionally, one of these three login nodes will become faulty. If you ssh into the cluster and your account is directed to the faulty node, then you may be given a “Permission denied” error message. If you experience this problem, we recommend you try to ssh directly into one of the login nodes. Here are the three commands one can use to login into our three login nodes. Please replace netID with your own netID.

login4

Code Block
ssh -Y netID@login4.storrs.hpc.uconn.edu

login5

Code Block
ssh -Y netID@login5.storrs.hpc.uconn.edu

login6

Code Block
ssh -Y netID@login6.storrs.hpc.uconn.edu

If one of these allows you to log in but another gives you a permission denied error, then we can be sure that there is something wrong with one of the login nodes. If you have the time, we’ll ask that you please send us a screenshot of the login node which that is giving you a problem so that we can tend to any problems on that faulty node. This will help us ensure that this problem is fixed as soon as possible.

If you receive this “Permission denied” error when ssh-ing directly into all three login nodes, then the problem may be with your netID. You may have to reset your netID password which can be done at this link.

...

Expand

Short answer: Some modules can only be run on certain architectures. Try loading that module on a node with the Epyc or Skylake architectures.

Long Answer: These GLIBC errors often happen because the software you’re trying to load expects to find a newer GLIBC library than what’s available in your current node’s Red Hat Enterprise Linux (RHEL) version. Older architectures sometimes have older RHEL versions and therefore older GLIBC libraries. Newer architectures typically have newer RHEL versions and GLIBC libraries.

Switching to a newer or different architecture may resolve this GLIBC error. See the following guide for instructions on how to target specific architectures.

How do I fix an error

...

that says, “Can't open display, Failed initializing GUI, exiting?”

Expand

Short answer: This is an X-forwarding error. The most common fix is to enable X-forwarding when you ssh into the HPC using the -X or -Y flag. But there are other problems which can cause this too.

Code Block
ssh -Y netID@hpc2.storrs.hpc.uconn.edu

Long Answer: X-forwarding allows programs being run on the HPC to be opened up in a GUI on our local machines. It is convenient and nice to work with but it can take a bit of effort to set up if you are working on a Mac or Windows device. Linux users have it easy because all they normally have to do is use the -X or -Y flag we referred to above. For users, of Windows or Mac devices, we have written up a more in-depth guide which you can find here.

Modules that I installed used to work properly on the original HPC but they are not loading properly on HPC 2.0. How do I resolve this problem?

Expand

There are many reasons this might be occurring, but a common problem with user-installed programs is that the module names and versions have changed slightly between the old HPC and HPC 2.0. It may be that the dependencies your program used to rely on are no longer available on HPC 2.0. For instance, the GCC compilers have been updated and some of the old ones are no longer available. In this case, the ideal situation would be to install your program again using the newer compilers---this is often a good idea anyway because newer compilers sometimes increase the performance and reduce the chance of bugs.

If there are extenuating circumstances that prevent you from using a program with new compilers or are experiencing other module-related problems, we invite you to submit a request for assistance by emailing hpc@uconn.edu. Then we can discuss options of how to set up a module that meets your needs.

How do I fix an error

...

that says one module conflicts with other modules(s)?

Expand

If the 'module load' command returns the following errors:

Code Block

$ module load <Module1>
<Module1>(4):ERROR:150: Module '<Module1>' conflicts with the currently loaded module(s) '<Module2>'
<Module1>(4):ERROR:102: Tcl command execution failed: conflict <Module_Group>

This means that the module you want to load conflicts with the currently loaded module, <Module2>. To fix it, please unload <Module2> and then load <Module1> again:

Code Block
$ module unload <Module2> $ module load <Module1>

Or

Code Block
$ module switch <Module2> <Module1>

Or, if neither of these works, you can purge all the modules with

Code Block
$ module purge

and start fresh.

How do I fix an error

...

that says the module I’m trying to load depends on other modules?

Expand

If the 'module load' command returns the following errors:

Code Block
$ module load <Module1> <Module1>(9):ERROR:151: Module '<Module1>' depends on one of the module(s) '<Module2>' <Module1>(9):ERROR:102: Tcl command execution failed: prereq <Modle2>

This means that the module you want to load depends on the module <Module2>. To fix it, please load <Module2> prior to <Module1>:

Code Block
$ module load <Module2> <Module1>

You may encounter the above errors many times. Please load/unload the requested/conflicted modules and try again.

...

How can I acknowledge the Storrs HPC in our publications?

Expand

If you would like to acknowledge/reference When acknowledging the Storrs HPC cluster in your publications, you can acknowledge Storrs HPC to something along the lines of use the following suggested text:

“The "The computational work performed on for this project was done with help from conducted using resources provided by the Storrs High-Performance Computing (HPC) cluster. We would like extend our gratitude to thank the UConn Storrs HPC and HPC its team for providing the their resources and support, which aided in achieving these results."

For a more detailed acknowledgment, consider including specific information about the resources used, such as:

The types of compute nodes utilized
Storage resources accessed
Software applications or libraries supported by the HPC

Including these details can provide a clearer picture of the HPC resources that contributed to these results.” your research.

Versions Compared

Old Version 52

New Version Current

Key

How does the HPC decide which jobs run first?

Security and Access

Structure and Organization

When will my pending job start running?

Troubleshooting problems

Why did my job fail?

How do I fix an error

that says, “Can't open display, Failed initializing GUI, exiting?”

Modules that I installed used to work properly on the original HPC but they are not loading properly on HPC 2.0. How do I resolve this problem?

How do I fix an error

that says one module conflicts with other modules(s)?

How do I fix an error

that says the module I’m trying to load depends on other modules?

How can I acknowledge the Storrs HPC in our publications?

Page Comparison

Versions Compared

Old Version 52

New Version Current

Key

How does the HPC decide which jobs run first?

Security and Access

Structure and Organization

When will my pending job start running?

Troubleshooting problems

Why did my job fail?

How do I fix an error

that says, “Can't open display, Failed initializing GUI, exiting?”

Modules that I installed used to work properly on the original HPC but they are not loading properly on HPC 2.0. How do I resolve this problem?

How do I fix an error

that says one module conflicts with other modules(s)?

How do I fix an error

that says the module I’m trying to load depends on other modules?

How can I acknowledge the Storrs HPC in our publications?