...
Long wait times for your jobs? Errors about unavailable resources? We’ve been there and understand how frustrating it can be for jobs to take a long time to run. It’s an unfortunate consequence of having such a strong computational research community at UConn. LOTS of incredible research happens here, but it also means that there are LOTS of people competing for resources. There’s no getting around that problem, but there are two steps we can take to increase the odds that our jobs start as soon as possible: checking which resources are available, and then targeting a partition that has them. This FAQ offers guidance on both.

Checking for Available Resources

The command below gives an overview of usage on every partition.
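Assuming a standard SLURM installation like the one Storrs HPC runs, the overview command is likely `sinfo` (the exact flags used in the original screenshot are not preserved here):

```shell
# Summarize every partition: availability, time limit, node counts, and state.
sinfo

# A compact one-line-per-partition summary (allocated/idle/other/total nodes):
sinfo -s
```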
The output for that command will look like this, but it will be much longer and provide info on every partition (not just class and debug).
The above command gives us an overarching picture of usage on the cluster. From there, we can use a more targeted command to get more information on individual nodes within a partition, like how many cores or GPUs are in use and how many are available. The base command for that per-node view is below.
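Assuming SLURM again, a per-node view for one partition can be produced with `sinfo`'s node-oriented flags. The partition name `gpu` here is a placeholder; `%C` prints the CPUS (A/I/O/T) figures discussed below and `%G` lists each node's generic resources (GPUs):

```shell
# Show each node in the partition with its CPU state
# (Allocated/Idle/Other/Total) and its generic resources (GPUs):
sinfo -N -p gpu -o "%N %C %G"
```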
The column titled “CPUS (A/I/O/T)” tells us how many cores are available: “A” stands for Allocated, “I” for Idle, and “T” for Total (“O” stands for Other, but you can ignore that column). Since there are 39 cores in the “Idle” column for GPU21, 39 cores are available to use. But all 3 of the GPUs on GPU21 are in use, so we can’t use any GPUs on that node. If I only needed cores and no GPUs, I could target GPU21. In summary, these two commands give us a picture of which partitions have resources available, and then what resources are free on individual nodes within a partition.

Targeting a specific partition

The next step is submitting a job that targets a specific partition. If you’re not sure how to do that, please visit our SLURM Guide, where you will find example submission scripts that target different partitions and architectures. Another key part of targeting a partition is knowing which partitions you are allowed to use and which “account” and “QOS” you must specify to access them. You can check both with the command below.
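Assuming SLURM's accounting tools are available on the login nodes, your own association list shows the account, partition, and QOS combinations you may submit with (`$USER` expands to your NetID on the cluster):

```shell
# List the account/partition/QOS combinations your user may submit with:
sacctmgr show associations user=$USER format=account,partition,qos
```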
This tells me that I have access to 6 partitions. To access the priority-gpu partition, I need to include three flags in the #SBATCH header of my submission script. This will be different for every individual, so you will have to modify them with the partitions you have access to and the account and QOS associated with your account.
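A sketch of those three header lines, with placeholder account and QOS names (the values here are assumptions for illustration; substitute the ones reported for your own account):

```shell
#SBATCH --partition=priority-gpu
#SBATCH --account=your_account    # replace with the account shown by sacctmgr
#SBATCH --qos=your_qos            # replace with the QOS shown by sacctmgr
```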
If you have further questions about how to check what resources are available and how to target them, please feel free to contact the Storrs HPC admins by sending an email to hpc@uconn.edu.
...
Short answer: Your job is being “held.” To release the job and re-submit it to the job queue you can use the Long Answer: Your job failed. We have a separate FAQ on figuring out why a job failed here, but here we will focus on why your job is being held. When jobs fail, they used to be automatically re-queued. This was a problem for a number of users because re-running the job would overwrite their previous data. In January 2024, we re-configured SLURM to prevent this problem. Now, when jobs fail, they are not immediately re-queued. Instead, the jobs will be “held” from the queue until the submitting user “releases” those jobs back into the queue. This change prevents jobs from requeueing automatically and allows users to make a conscious choice to re-queue their jobs. You can re-queue jobs using the below commands:
If you release your jobs into the queue and they keep ending up back in the “held” state, that is an indication that something is failing within your submission script, in which case you should cancel your jobs and start troubleshooting. Please note that jobs left in the queue in the “SE” state will be canceled after seven days. Please feel free to contact us at hpc@uconn.edu with any questions or concerns.
...
Short answer: One of the login nodes is most likely not working properly. Try to ssh into each of the three login nodes directly.

Long answer: When you ssh into HPC 2.0 (login.storrs.hpc.uconn.edu), you are directed to one of our three login nodes: login4, login5, or login6. Occasionally, one of these nodes becomes faulty. If you ssh into the cluster and your connection is directed to the faulty node, you may be given a “Permission denied” error message. If you experience this problem, we recommend trying to ssh directly into each of the three login nodes in turn. Please replace netID with your own netID.
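A sketch of the three direct-login commands, assuming the login nodes follow the cluster's usual hostname pattern (these exact hostnames are an assumption; check the Storrs HPC documentation for the canonical names):

```shell
# Try each login node directly; replace netID with your own netID.
ssh netID@login4.storrs.hpc.uconn.edu
ssh netID@login5.storrs.hpc.uconn.edu
ssh netID@login6.storrs.hpc.uconn.edu
```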
If one of these allows you to log in but another gives a permission-denied error, then we can be sure that something is wrong with one of the login nodes. If you have the time, please send us a screenshot from the login node that is giving you a problem so that we can tend to the faulty node. This will help us ensure the problem is fixed as soon as possible. If you receive the “Permission denied” error when ssh-ing directly into all three login nodes, then the problem may be with your netID. You may have to reset your netID password, which can be done at this link.
...
Short answer: Some modules can only be run on certain architectures. Try loading the module on a node with the Epyc or Skylake architecture. Long answer: These GLIBC errors often happen because the software you’re trying to load expects a newer GLIBC library than what is available in your current node’s Red Hat Enterprise Linux (RHEL) version. Older architectures sometimes run older RHEL versions and therefore have older GLIBC libraries; newer architectures typically have newer RHEL versions and GLIBC libraries. Switching to a newer or different architecture may resolve the error. See the following guide for instructions on how to target specific architectures.
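One quick way to confirm the mismatch is to check which GLIBC version the current node provides; the version printed will vary by node and RHEL release:

```shell
# Print the GLIBC version available on the node you are logged into:
ldd --version
```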
How do I fix an error that says, “Can't open display, Failed initializing GUI, exiting?”
Short answer: This is an X-forwarding error. The most common fix is to enable X-forwarding with ssh’s forwarding flag when you log in to the HPC. Long answer: X-forwarding allows programs running on the HPC to open a GUI on our local machines. It is convenient and nice to work with, but it can take a bit of effort to set up if you are working on a Mac or Windows device. Linux users have it easy, because all they normally have to do is pass ssh’s X-forwarding flag when logging in.
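As a sketch, assuming an OpenSSH client on your local machine, X-forwarding is enabled with the `-X` flag (or `-Y` for trusted forwarding, which some setups require):

```shell
# Enable X-forwarding for the session; replace netID with your own netID.
ssh -X netID@login.storrs.hpc.uconn.edu

# Some setups require trusted forwarding instead:
ssh -Y netID@login.storrs.hpc.uconn.edu
```

Mac users typically also need a local X server such as XQuartz running, and Windows users need one such as the X server bundled with MobaXterm.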
Modules that I installed used to work properly on the original HPC but they are not loading properly on HPC 2.0. How do I resolve this problem?
There are many reasons this might occur, but a common problem with user-installed programs is that the module names and versions changed slightly between the old HPC and HPC 2.0. It may be that the dependencies your program used to rely on are no longer available on HPC 2.0. For instance, the GCC compilers have been updated and some of the old ones are no longer available. In this case, the ideal solution is to install your program again using the newer compilers. This is often a good idea anyway, because newer compilers can increase performance and reduce the chance of bugs. If there are extenuating circumstances that prevent you from rebuilding your program with the new compilers, or you are experiencing other module-related problems, we invite you to submit a request for assistance by emailing hpc@uconn.edu. Then we can discuss options for setting up a module that meets your needs.
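Before rebuilding, you can check which compiler versions HPC 2.0 actually provides. This assumes an Lmod-style module system, as the `module load` commands elsewhere in this FAQ suggest:

```shell
# List the GCC module versions available on the cluster:
module avail gcc

# Or search the full module tree, including modules hidden behind dependencies:
module spider gcc
```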
How do I fix an error that says one module conflicts with other module(s)?
If the 'module load' command returns an error saying that the module you want to load, <Module1>, conflicts with the currently loaded module, <Module2>, the fix is to unload <Module2> and then load <Module1> again. Alternatively, you can swap <Module2> for <Module1> in a single step. If neither of these works, you can purge all loaded modules and start fresh.
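A sketch of the three fixes, using standard environment-modules commands (replace <Module1> and <Module2> with the actual module names from the error message):

```shell
# Option 1: unload the conflicting module, then load the one you want.
module unload <Module2>
module load <Module1>

# Option 2: swap the two modules in a single step.
module swap <Module2> <Module1>

# Option 3: if neither works, unload everything and start fresh.
module purge
module load <Module1>
```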
How do I fix an error that says the module I’m trying to load depends on other modules?
If the 'module load' command returns an error saying that the module you want to load, <Module1>, depends on the module <Module2>, the fix is to load <Module2> prior to loading <Module1>.
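With the placeholder names above, the dependency must simply be loaded first:

```shell
# Load the dependency first, then the module that requires it.
module load <Module2>
module load <Module1>
```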
You may encounter these errors several times in a row. Load or unload the requested or conflicting modules and try again.
...
If you would like to acknowledge or reference the Storrs HPC cluster in your publications, you can use something along the lines of the following: “The computational work performed on this project was done with help from the Storrs High-Performance Computing cluster. We would like to thank the UConn Storrs HPC and HPC team for providing the resources and support that contributed to these results.”