Table of Contents
Getting Started
How do I get an account?
If you don’t already have an account, please fill out the cluster application form. Students and postdoctoral research associates who are requesting an account will need their advisor’s (PI’s) NetID so that we can verify their membership in the advisor’s research group. If you don’t know your advisor’s NetID, you can look it up in the UConn PhoneBook.
...
Short answer: In general, data cannot be stored on the Storrs HPC if it is considered Sensitive University Data or covered by the university’s Export Control Policy.
Long answer: The Storrs HPC cluster cannot be used to generate or store data that is considered Sensitive University Data or covered by the university’s Export Control Policy; for more detail on each of those classifications, see UConn’s Data Classification Policy. Data that was classified as confidential or sensitive may be eligible for storage on the Storrs HPC if it has been de-identified sufficiently that the classification no longer applies. Any de/re-identification key (if one was used) cannot be transmitted or stored with the data within the HPC environment. All data stored on the cluster is subject to these restrictions, and data that is not in compliance may be removed. If you have read the above documents and you’re still not sure how your data would be classified, please send an email to security@uconn.edu.
...
Using the HPC
How do I log in to the Storrs HPC when I am off campus?
Short answer: First, connect to UConn’s VPN; then you should be able to access the HPC. Long answer: The HPC cluster only accepts SSH connections from machines on the campus network. To connect to the HPC when you are off campus, you will first need to connect to the UConn Virtual Private Network (VPN). After connecting to the VPN, you will be able to log in to the HPC as you normally do. For instructions on how to install, set up, and connect your personal device(s) to UConn’s VPN, please go to this webpage.
Why am I suddenly getting a “Permission denied” error when I try to ssh into the HPC?
Short answer: One of the login nodes is most likely not working properly. Try to ssh into each of the three login nodes directly. Long answer: When you ssh into hpc2.storrs.hpc.uconn.edu, you are directed to one of our three login nodes (login4, login5, or login6). Occasionally, one of these login nodes becomes faulty. If your connection is directed to the faulty node, you may receive a “Permission denied” error message. If you experience this problem, we recommend trying to ssh directly into each of the three login nodes in turn, replacing netID with your own NetID.
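The exact commands were lost from this page. Assuming the login nodes are reachable at hostnames of the form loginN.storrs.hpc.uconn.edu (an inference from the cluster address above, not confirmed by the original text), the direct logins would look like:

```shell
# Replace netID with your own NetID.
# Hostnames are an assumed pattern based on hpc2.storrs.hpc.uconn.edu.
ssh netID@login4.storrs.hpc.uconn.edu
ssh netID@login5.storrs.hpc.uconn.edu
ssh netID@login6.storrs.hpc.uconn.edu
```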
If one of these allows you to log in but another gives a permission-denied error, then something is wrong with that login node. If you have the time, please send us a screenshot from the faulty login node so that we can fix the problem as soon as possible. If you receive the “Permission denied” error when ssh-ing directly into all three login nodes, then the problem may be with your NetID; you may have to reset your NetID password, which can be done at this link.
How do I check what node I am on?
The node you are on is normally shown next to your netID in the terminal prompt when you log in to the Storrs HPC. For instance, if Jonathan the Husky’s netID were jtk10001, a prompt showing “login6” next to jtk10001 would tell us that Jonathan is on the node called “login6.” Another way to check which node you are on is to use the
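The command name was lost from this page. A standard way to check (an assumption, not confirmed by the original text) is the hostname command, available on essentially all Linux systems:

```shell
# Prints the name of the node you are currently on, e.g. "login6"
hostname
```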
What programs am I allowed to run on the login nodes?
Programs running on a login node share that node with everyone who is logged in, so resource-intensive work belongs on the compute nodes. Below is a list of programs that are allowed to run on the login node without restrictions:
How do I get priority access to HPC resources?
Short answer: Faculty can purchase priority access for 5 years by paying the upfront cost of the nodes. Long answer: High-priority access is available under a “condo model,” in which faculty purchase semi-dedicated nodes that are made available to all users when there are unused compute cycles. Under the condo model, faculty researchers fund the capital equipment costs of individual compute nodes, while the university funds the operating costs of running those nodes for five years. Faculty who purchase compute nodes receive access to equivalent resources at a higher priority than other researchers, and they can designate others (graduate students, postdoctoral researchers, etc.) to receive access at the same priority level. With priority access, computational jobs are moved higher in the queuing system and, in most cases, begin execution within twelve hours, depending on other FairShare factors. A priority user can utilize their resources indefinitely. All access to resources is managed through the cluster’s job scheduler. When a priority user is not using their assigned resources, the nodes are made available to all UConn researchers for general use.
How do I access old data after I leave the university?
Short answer: Email us at hpc@uconn.edu, and we will set up an affiliate account so you can access those files. Long answer: When people leave the university, their netIDs are deactivated, and access to their Storrs HPC resources and files is (typically) transferred to their supervisors. To regain access, Storrs HPC staff need to create an account affiliated with the former supervisor.
How do I install specific R packages?
Since installing R packages is important to so many HPC users’ research, we have created a brief guide on installing R packages, which is linked here.
How do I install specific python libraries?
We have created a brief guide on installing Python libraries, which can be found here.
How can I check for available resources before submitting a job?
...
Long wait times for your jobs? Errors about unavailable resources? We’ve been there and understand how frustrating it can be for jobs to take a long time to run. It’s an unfortunate consequence of having such a strong computational research community at UConn. Lots of incredible research happens here, but it also means that lots of people are competing for resources. There’s no getting around that problem, but there are two steps we can take to increase the odds that our jobs start as soon as possible:

1. Check what resources are available before submitting a job.
2. Target the submission to those available resources.
This FAQ will offer guidance on how to do both of those things.
Checking for Available Resources
The sinfo command below will give you a high-level view of which nodes are fully available (“idle”), in partial use (“mix”), fully allocated (“alloc”), or otherwise in need of maintenance (any state besides idle, mix, or alloc). The nodes are listed in order by partition.
```shell
sinfo -o '%14P %.5a %.10l %.6D %.6t %30N %b'
```
The output for that command will look like this, but it will be much longer and provide info on every partition.
...
The above command gives us an overarching picture of usage on the cluster. From there, we can use a more targeted command to get more information on individual nodes within a partition, like how many cores or GPUs are in use and how many are available.
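The exact command was lost from this page. A sinfo invocation that produces the per-node “CPUS (A/I/O/T)” view described below might look like this (the partition name is a placeholder you would replace with your own):

```shell
# Per-node view of one partition: node name, CPUs as Allocated/Idle/Other/Total,
# and generic resources such as GPUs. "general-gpu" is a hypothetical partition name.
sinfo -p general-gpu -N -o '%12N %.14C %20G'
```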
The column titled “CPUS (A/I/O/T)” tells us how many cores are in each state: “A” stands for Allocated, “I” for Idle, “O” for Other (you can ignore that column), and “T” for Total. Since there are 39 cores in the “Idle” column for GPU21, 39 cores are available to use. But all 3 of the GPUs on GPU21 are in use, so we can’t use any GPUs on that node. If I only needed cores and no GPUs, I could target GPU21. In summary, these two commands give us a picture of which partitions have resources available, and then what resources are available on individual nodes within those partitions.

Targeting a specific partition

The next step is submitting a job that targets a specific partition. If you’re not sure how to do that, please visit our SLURM Guide, where you will find example submission scripts that target different partitions and architectures. Another key part of targeting a specific partition is knowing which partitions you are allowed to use and what “account” and “QOS” you must use to access them. To check which partitions you’re allowed to use and how to access them, you can use this command.
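The command itself was lost from this page. One standard SLURM command that lists your allowed account/partition/QOS combinations (an assumption on our part; the cluster may provide its own wrapper script) is:

```shell
# Lists the associations (account, partition, QOS) available to your user.
sacctmgr show associations user=$USER format=account,partition,qos
```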
This tells me that I have access to 6 partitions. To access the priority-gpu partition, I need to include three flags in the #SBATCH header of my submission script. These will be different for every individual, so you will have to substitute the partitions you have access to and the account and QOS associated with your account.
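The three flags were lost from this page; based on the surrounding text they set the partition, account, and QOS. A sketch, with hypothetical account and QOS names that you must replace with the values shown for your own account:

```shell
#SBATCH --partition=priority-gpu
#SBATCH --account=your_account   # hypothetical; use the account listed for you
#SBATCH --qos=your_qos           # hypothetical; use the QOS listed for you
```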
If you have further questions about how to check what resources are available and how to target them, please feel free to contact the Storrs HPC admins by sending an email to hpc@uconn.edu.
...
Troubleshooting problems
Why did my job fail?
There are many reasons a job may fail. A good first step is to review the failed job’s accounting record, which shows its state, exit code, and the node it ran on. Here’s an example of the output for a job that failed immediately with an error.

The next clue to investigate is the node (or nodes) the job ran on. Once you see that the job has failed multiple times on the same node, you can feel confident that a faulty node is likely the cause. Please submit a help request to Storrs HPC, including a screenshot of the output, so that we can tend to any problems on that faulty node.
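The diagnostic command was lost from this page. A standard SLURM way to inspect a finished job’s state, exit code, and node (an assumption on our part; the cluster may provide its own wrapper) is:

```shell
# Shows the state, exit code, and node(s) for a past job.
# Replace 123456 with your own job ID.
sacct -j 123456 --format=JobID,JobName,State,ExitCode,NodeList
```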
...
Short answer: If you received this error, your job most likely failed because the amount of memory (RAM) it needed was larger than the default. You can request more memory for your job with SLURM’s --mem-per-cpu option. Long answer: There are several reasons a job may fail from insufficient memory. As of January 2023, the default amount of memory available per CPU is 2 gigabytes. The default can be overridden by adding a --mem-per-cpu line to your submission script.
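The original example line was lost from this page; given that the following paragraph describes 3 gigabytes per CPU, it was presumably equivalent to this #SBATCH directive:

```shell
# Request 3 GB of RAM for each CPU allocated to the job.
#SBATCH --mem-per-cpu=3G
```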
Adding this line tells SLURM to allocate 3 gigabytes of RAM per CPU you request. That means if we ask for 2 cores (-n 2), we’ll be allocated 6 gigabytes of RAM total. Please note that the request is per CPU, not per job, so the total memory scales with the number of cores requested. We encourage users to adjust the memory request to match what their jobs actually need.
Some of my files were deleted. Is it possible to recover them? If so, how?
Short answer: Possibly; it depends on where the files were stored and how long ago they were deleted. Send us an email at hpc@uconn.edu explaining the situation so we can discuss options. Long answer: Files stored in the /shared/ and /home/ directories are backed up daily at 5:00 am, and the backups are kept for 30 days. If the accidentally deleted files were stored in either of those directories, and you contact us within 30 days, then there’s a chance we will be able to help you restore them. Please contact us at hpc@uconn.edu so we can discuss a way forward. Unfortunately, if the deleted files were stored on /scratch/, we will not be able to restore them: the /scratch/ directory is never backed up. If you’re reading this after accidentally deleting files from scratch, we understand that this news can come as a huge blow. We wish that we could back up /scratch/, but we only have enough room to back up /shared/ and /home/. We cannot help you fix the current situation, but there are steps we can take to prevent this from happening again. You can follow these links for more info on HPC data storage, backing up files, or using Globus.
Why do I get a GLIBC Error when trying to load certain modules?
...
If the script could not run via sbatch, the errors usually look like this:

This is usually due to the wrong file format: your file is still in the Windows format (CRLF line endings) rather than the Linux format (LF line endings).
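The original fix shown here was lost. The standard remedy (an assumption, not from the original page) is to strip the carriage returns, either with dos2unix or with an equivalent sed one-liner:

```shell
# Simulate a script saved with Windows (CRLF) line endings.
printf '#!/bin/bash\r\necho hello\r\n' > myscript.sh
# Convert CRLF -> LF in place; "dos2unix myscript.sh" does the same where installed.
sed -i 's/\r$//' myscript.sh
```

After the conversion, the script runs under sbatch as a normal Linux-format file.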
...
Guidelines for Citing the HPC
How can I acknowledge the Storrs HPC in our publications?
...