...
Expand | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Long wait times for your jobs? Errors about unavailable resources? We’ve been there and understand how frustrating it can be for jobs to take a long time to run. It’s an unfortunate consequence of having such a strong computational research community at UConn. LOTS of incredible research happens here, but it also means that there are LOTS of people competing for resources. There’s no getting around that problem, but there are a couple of steps we can take to increase the odds that our jobs get on ASAP.
This FAQ offers guidance on both of those steps: checking which resources are available, and targeting a specific partition.

Checking for Available Resources

The below
The output for that command will look like this, but it will be much longer and provide info on every partition (not just class and debug).
The above command gives us an overarching picture of usage on the cluster, and from there, we can use a more targeted command to get more information on individual nodes within a partition, like how many cores or GPUs are in use and how many are available. The base
The column titled “CPUS (A/I/O/T)” tells us how many cores are available. “A” stands for Allocated, “I” for Idle, “O” for Other (which you can ignore), and “T” for Total. Since there are 39 cores in the “Idle” column for GPU21, 39 cores are available to use. However, all 3 of the GPUs on GPU21 are in use, so we can’t use any GPUs on that node. That gives us an idea of the available resources: if I only needed cores and no GPUs, I could target GPU21.

You can also use the following to see available resources for a particular partition in a simpler format. For example:
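To make the “CPUS (A/I/O/T)” column a little more concrete, here is a minimal shell sketch that splits one such value into its four counts. It is a separate illustration, not the cluster command referred to above. The helper name `split_cpus` is made up for this example, and the sample value `1/39/0/40` is hypothetical, chosen only so that the idle count matches the 39 idle cores discussed for GPU21.

```shell
#!/bin/bash
# Split a "CPUS (A/I/O/T)" value into allocated/idle/other/total counts.
# split_cpus is an illustrative helper, not part of SLURM.
split_cpus() {
    local allocated idle other total
    # Use "/" as the field separator for this single read only.
    IFS='/' read -r allocated idle other total <<< "$1"
    echo "allocated=$allocated idle=$idle other=$other total=$total"
}

# Hypothetical value for a node with 39 idle cores.
split_cpus "1/39/0/40"
```

The second field is the one that matters when hunting for free cores: it is the number of idle cores a new job could use on that node.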
In summary, these two commands give us a picture of which partitions have resources available, and then which resources are available on individual nodes within a partition.

Targeting a specific partition

The next step is submitting a job that targets a specific partition. If you’re not sure how to do that, please visit our SLURM Guide, where you will find example submission scripts that target different partitions and architectures. Another key part of targeting a specific partition is knowing which partitions you are allowed to use and which “account” and “QOS” you must use to access them. To check which partitions you’re allowed to use and how to access them, you can use this command.
This tells me that I have access to 6 partitions. To access the priority-gpu partition, I need to include the below three flags in the #SBATCH header of my submission script. This will be different for every individual so you will have to modify this with the partitions you have access to and the account and QOS that are associated with your account.
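As a rough sketch of what such a header could look like, the script below uses SLURM's standard `--partition`, `--account`, and `--qos` options, which we assume are the three flags meant here. The account and QOS names are placeholders, not real values; replace them with the ones listed for your own user.

```shell
#!/bin/bash
# Partition named in the text above.
#SBATCH --partition=priority-gpu
# Placeholder: replace with the account listed for your user.
#SBATCH --account=my_account
# Placeholder: replace with the QOS listed for your user.
#SBATCH --qos=my_qos

# SLURM sets SLURM_JOB_PARTITION inside a running job. The fallback value
# lets this script also run outside the scheduler for a quick check.
partition="${SLURM_JOB_PARTITION:-not-in-a-job}"
echo "Partition: $partition"
```

The comments sit on their own lines rather than after the flags, so that each `#SBATCH` line contains only the option itself.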
If you have further questions about how to check what resources are available and how to target them, please feel free to contact the Storrs HPC admins by sending an email to hpc@uconn.edu. |
How does the HPC decide which jobs run first?
Expand |
---|
HPCs have a lot of similarities with airports. Let’s talk about how.

Security and Access

Accessing an HPC is a bit like entering an airport. To get inside an airport, you have to show your ID and go through security. Similarly, you need both an account and a password (or an SSH key) to access the HPC. Once you get inside, you hang around the terminal until your flight starts boarding. The HPC equivalent of the terminal is called the login node. There are some basic things you can do on login nodes (move files, edit documents, etc.), but you shouldn’t run any computationally intensive programs or analyses there. It’d be like blatantly cutting in front of everyone in line at the airport Starbucks. Not only would it be disrespectful to everyone else in line (i.e., on the login node), but when the Starbucks staff (i.e., the Storrs HPC admins) saw what happened, that person would be kicked out of the cafe.

Intense computational programs or analyses, a.k.a. jobs, should be run on compute nodes. In this analogy, jobs are groups of passengers (e.g., families) and compute nodes are planes. Just as planes have a limited number of seats, compute nodes have a limited number of cores (a.k.a. CPUs). Jobs can only get onto compute nodes that have enough cores (i.e., “seats”) available. People who buy tickets (submit jobs) ahead of time generally have their seats reserved ahead of people who just show up at the gate looking for a seat (requesting an interactive job). But there are exceptions. When a flight is overbooked (more cores requested than the HPC has available), people who are part of an airline’s frequent flyer program (have priority access on the HPC) get first dibs on seats (HPC resources).

Okay, this analogy works well for how individuals relate to the HPC, but we have to use a different analogy to understand the structure of HPCs and how they operate as a whole.
Structure and Organization

To understand the broader structure of HPCs and how they work, we can look at how airports are organized. All major airports have an air traffic control tower and an air traffic controller (ATC) working inside it. The HPC’s equivalent of the ATC tower is called the head node, and the role of the ATC is played by a program called SLURM. The ATC’s (SLURM’s) main job is directing which planes (jobs) get to use the runways (compute nodes), because there are usually more planes flying in the air (running jobs) and waiting on the ground (pending jobs) than there are runways (nodes) available. The ATC (SLURM) takes many things into account when deciding which planes (jobs) get to use a given runway (node) next. Here are a few that are similar to HPCs.
In general, SLURM will let a job that has been waiting for 6 hours get on the next open node before a job that has been waiting for 1 hour. But if a job is massive, smaller jobs may get on before it.

Here’s where SLURM differs from an airport’s ATC. Access to airport runways generally operates on a first-come, first-served basis, but SLURM adds another consideration to prevent a single user from monopolizing all of the HPC’s resources: it takes into account the number (and size) of jobs a given user has submitted recently. A user who has submitted thousands of jobs in the last week will be pushed down the list to give all users fair, equitable access to HPC resources.

The last important consideration for an HPC, which doesn’t generally apply to airports, is that users can buy priority access to HPC resources. Priority access is like TSA PreCheck. Priority users still wait in line, but it’s a much shorter line and they tend to get through faster. |
When will my pending job start running?
...
Expand | ||||
---|---|---|---|---|
Short answer: If you received this error, your job most likely failed because the amount of memory (RAM) it needed was larger than the default. We can request more memory for your job using the

Long answer: There are several reasons a job may fail from insufficient memory. As of January 2023, the default amount of memory available per CPU is 2 gigabytes. The default
Adding this line will tell SLURM to use 3 gigabytes of RAM per CPU you request. That means if we ask for 2 cores (-n 2), then we’ll be allocated 6 gigabytes of RAM total. Please note that the
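The arithmetic can be sketched in a few lines of shell. This assumes the line in question is SLURM's `--mem-per-cpu` option, which is consistent with the per-CPU behavior described above but is not confirmed by the surviving text. The variable names are illustrative only.

```shell
#!/bin/bash
# Assumed header lines, shown here as plain comments for reference:
#   #SBATCH -n 2
#   #SBATCH --mem-per-cpu=3G

cores=2            # value passed to -n
mem_per_cpu_gb=3   # gigabytes requested per CPU

# Total memory for the job = number of cores x memory per CPU.
total_mem_gb=$(( cores * mem_per_cpu_gb ))
echo "Total memory: ${total_mem_gb}G"
```

Running the same calculation with your own core count is a quick way to check whether a per-CPU request adds up to the total memory your job needs.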
We encourage users to please adjust the |
...
How can I acknowledge the Storrs HPC in our publications?
Expand |
---|
When acknowledging the Storrs HPC cluster in your publications, you can use the following suggested text: "The computational work for this project was conducted using resources provided by the Storrs High-Performance Computing (HPC) cluster. We extend our gratitude to the UConn Storrs HPC and its team for providing their resources and support, which aided in achieving these results." For a more detailed acknowledgment, consider including specific information about the resources used, such as:
Including these details can provide a clearer picture of the HPC resources that contributed to your research. |