- Created by Douglas Kirby, last modified by Jeffrey Tamucci on Feb 16, 2023
How do I get an account?
If you don’t already have an account, please fill out the cluster application form.
Students and postdoctoral research associates who are requesting an account will need their advisor’s (PI’s) NetID so that we can verify their membership in the advisor’s research group. If you don’t know your advisor’s NetID, you can look it up on the UConn PhoneBook.
How much does it cost to have an account with Storrs HPC?
In general, it costs nothing. Basic access to the Storrs HPC is provided for free to UConn students, staff, and faculty for research purposes. If you would like to have your jobs run more quickly, we offer priority access for a fee. See: How do I get priority access to HPC resources?
How do I login to the Storrs HPC when I am off campus?
Short answer: First, you need to connect to UConn’s VPN. Then, you should be able to access the HPC.
Long answer: The HPC cluster only accepts SSH connections from computers on the campus network, for example:
computers in the UConn libraries
computers in campus offices and labs
computers connected to UCONN-SECURE WiFi
To connect to the HPC when you are off campus, you will first need to connect to the UConn Virtual Private Network (VPN). After connecting to the VPN, you will be able to log in to the HPC as you normally do.
For instructions on how to install, set up, and connect your personal device(s) to UConn’s VPN, please go to this webpage.
How do I check what node I am on?
The node you are on will normally be shown next to your netID when you login to the Storrs HPC. For instance, if Jonathan the Husky’s netID was jtk10001, his terminal might look like this.
[jtk10001@login6 ~]$
This tells us that Jonathan is on the node called “login6.” Another way to check what node you are on is to use the hostname command. See below.
[jtk10001@login6 ~]$ hostname
login6
How do I get priority access to HPC resources?
Short answer: Faculty can purchase priority access for 5 years if they pay the upfront cost for the nodes.
Long answer: High priority access is available under a “condo model,” where faculty are able to purchase semi-dedicated nodes which get made available to all users when there are unused compute cycles. Under the condo model, faculty researchers fund the capital equipment costs of individual compute nodes, while the university funds the operating costs of running these nodes for five years. Faculty who purchase compute nodes receive access to equivalent resources at a higher priority than other researchers. The faculty can designate others to receive access at the same priority level, such as their graduate students, postdoctoral researchers, etc. With priority access, computational jobs are moved higher in the queuing system, and in most cases begin execution within twelve hours, depending upon other FairShare factors. A priority user can utilize their resources indefinitely. All access to resources is managed through the cluster’s job scheduler. When a priority user is not using their assigned resources, the nodes are made available to all UConn researchers for general use.
How do I access old data after I leave the university?
Short answer: Email us at hpc@uconn.edu. We will set up an affiliate account so you can access those files.
Long answer: When people leave the university, their netIDs are deactivated, and access to their Storrs HPC resources and files is (typically) transferred to their supervisors. To regain access, Storrs HPC staff need to create an account affiliated with the former supervisor.
How do I install specific R packages?
Since installing R packages is important to so many HPC users' research, we have created a brief guide on installation of R packages which is linked here.
How do I install specific python libraries?
We have created a brief guide on installation of python libraries which can be found here.
How can I check for available resources before submitting a job?
Long wait times for your jobs? Errors about unavailable resources? We’ve been there and understand how frustrating it can be for jobs to take a long time to run. It’s an unfortunate consequence of having such a strong computational research community at UConn. LOTS of incredible research happens here, but it also means that there are LOTS of people competing for resources. There’s no getting around that problem, but there are a couple of steps we can take to increase the odds that our jobs start ASAP.
Check what resources are available before we submit a job.
Target our submission to those available resources.
This FAQ will offer guidance on how to do both of those things.
Checking for Available Resources
The sinfo command below will give you a high-level view of which nodes are fully available (“idle”), in partial use (“mix”), not available (“alloc”, for allocated), or otherwise in need of maintenance (all other states besides idle, mix, or alloc). The nodes will be broken down in order by partition.
sinfo -o '%14P %.5a %.10l %.6D %.6t %30N %b'
The output for that command will look like this, but it will be much longer and provide info on every partition.
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST                        ACTIVE_FEATURES
GeoSciMP   up     1-00:00:00  38     mix    cn[410-447]                     location=local,epyc64,cpuonly
class      up     4:00:00     3      inval  cn[407-409]                     location=local,skylake,cpuonly
class      up     4:00:00     1      maint  cn348                           location=local,skylake,cpuonly
class      up     4:00:00     19     down*  cn[244-246,252,254-255,268-271  location=local,haswell,cpuonly
class      up     4:00:00     7      down*  cn[335-337,339-340,373,406]     location=local,skylake,cpuonly
class      up     4:00:00     1      comp   cn376                           location=local,skylake,cpuonly
class      up     4:00:00     10     mix    cn[333,352,362,374-375,391,402  location=local,skylake,cpuonly
class      up     4:00:00     25     alloc  cn[329-332,334,338,341-342,345  location=local,skylake,cpuonly
... (etc.)
The above command gives us an overarching picture of usage on the cluster. From there, we can use a more targeted command to get more information on individual nodes within a partition, such as how many cores or GPUs are in use and how many are available. The sinfo command below targets the priority-gpu partition, but it can be amended to target any other partition, such as hi-core.
sinfo -p priority-gpu -t idle,mix -o%10n%20C%10G%10t%20R%b
HOSTNAMES  CPUS(A/I/O/T)  GRES   STATE  PARTITION     ACTIVE_FEATURES
gpu06      13/23/0/36     gpu:1  mix    priority-gpu  location=local,gpu,v100,skylake
gpu20      36/28/0/64     gpu:3  mix    priority-gpu  location=local,epyc64,a100,gpu
gpu21      25/39/0/64     gpu:3  mix    priority-gpu  location=local,epyc64,a100,gpu
gpu22      3/61/0/64      gpu:3  mix    priority-gpu  location=local,epyc64,a100,gpu
gpu23      33/31/0/64     gpu:1  mix    priority-gpu  location=local,epyc64,a100,gpu
gtx02      2/18/0/20      gpu:3  mix    priority-gpu  location=local,gpu,gtx,skylake
gtx03      12/8/0/20      gpu:3  mix    priority-gpu  location=local,gpu,gtx,skylake
gtx08      1/19/0/20      gpu:2  mix    priority-gpu  location=local,gpu,gtx,skylake
gtx11      2/18/0/20      gpu:2  mix    priority-gpu  location=local,gpu,gtx,skylake
gtx15      28/4/0/32      gpu:8  mix    priority-gpu  location=local,gpu,rtx,skylake
gtx09      0/20/0/20      gpu:2  idle   priority-gpu  location=local,gpu,gtx,skylake
The column titled “CPUS(A/I/O/T)” tells us how many cores are available. A stands for Allocated, I stands for Idle, and T stands for Total. (O stands for Other, but you can ignore that value.) Since gpu21 shows 39 in the Idle position, 39 cores are available to use. But all 3 of the GPUs on gpu21 are in use, so we can’t use any GPUs on that node. That gives us an idea of the resources: if I only needed cores and no GPUs, I could target gpu21.
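As a rough illustration of reading the A/I/O/T column, the sketch below pulls the idle-core count out of such a string and filters sinfo-style rows by a minimum number of idle cores. The function names idle_cores and nodes_with_idle are hypothetical helpers, not Storrs HPC or SLURM commands, and the input is assumed to be two whitespace-separated columns (hostname, then A/I/O/T).

```shell
# Hypothetical helper: print the idle-core count from an "A/I/O/T" string.
idle_cores() {
    # Fields are Allocated/Idle/Other/Total, so Idle is the second field.
    printf '%s\n' "$1" | cut -d/ -f2
}

# Hypothetical helper: read "hostname A/I/O/T" rows on stdin and print the
# hostnames that have at least $1 idle cores.
nodes_with_idle() {
    awk -v min="$1" '{
        split($2, c, "/")               # c[2] holds the idle count
        if (c[2] + 0 >= min + 0) print $1
    }'
}

# Sample rows taken from the sinfo output shown in this FAQ.
printf '%s\n' 'gpu21 25/39/0/64' 'gtx15 28/4/0/32' 'gtx09 0/20/0/20' |
    nodes_with_idle 20
```

With the three sample rows, this prints gpu21 and gtx09, the two nodes with 20 or more idle cores. On the cluster, the output of a command such as sinfo -h -p priority-gpu -o '%n %C' should be usable as input in the same way.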
In summary, these two commands can give us a picture of what partitions have resources available, and then what resources are available on individual nodes within that partition.
Targeting a specific partition
The next step is submitting a job targeting a specific partition. If you’re not sure how to target a specific partition, please visit our SLURM Guide where you will see examples of submission scripts that target different partitions and architectures. Another key part of targeting a specific partition is knowing what partitions you are allowed to use and what “account” and “QOS” you must use to access them. To check what partitions you’re allowed to use and how to access them you can use this command.
sacctmgr list assoc user=`whoami` -o format=user,account,partition%20,qos%20
User       Account    Partition            QOS
---------- ---------- -------------------- --------------------
net10004   jth10001   hi-core              general
net10004   jth10001   general-gpu          general
net10004   jth10001   debug                general
net10004   jth10001   lo-core              general
net10004   jth10001   general              general
net10004   jth10001   priority-gpu         jth10001gpu
This tells me that I have access to 6 partitions. To access the priority-gpu partition, I need to include the three flags below in the #SBATCH header of my submission script. These values differ for every user, so you will have to substitute the partitions you have access to and the account and QOS associated with them.
#SBATCH -p priority-gpu   # partition I want to access
#SBATCH -A jth10001       # account I am associated with
#SBATCH -q jth10001gpu    # QOS needed to access partition
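The mapping from a row of the sacctmgr output to the three #SBATCH flags can be sketched as follows. The helper name sbatch_flags_for is hypothetical, and the input is assumed to be header-free rows with four whitespace-separated columns in the order user, account, partition, QOS.

```shell
# Hypothetical helper: read "user account partition qos" rows on stdin and
# print the #SBATCH flags for the partition named in $1.
sbatch_flags_for() {
    awk -v part="$1" '$3 == part {
        printf "#SBATCH -p %s\n#SBATCH -A %s\n#SBATCH -q %s\n", $3, $2, $4
    }'
}

# Two sample rows taken from the sacctmgr output shown in this FAQ.
printf '%s\n' \
    'net10004 jth10001 general general' \
    'net10004 jth10001 priority-gpu jth10001gpu' |
    sbatch_flags_for priority-gpu
```

For the sample rows, this prints the same three flags shown above for priority-gpu. If the partition is not in the list, nothing is printed, which would suggest the account has no association with that partition.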
If you have further questions about how to check what resources are available and how to target them, please feel free to contact the Storrs HPC admins by sending an email to hpc@uconn.edu.
Why did my job fail?
There are many reasons a job may fail. A good first step is to use the shist command to check the ExitCode SLURM gave it. The command follows this format: shist --starttime YYYY-MM-DD.
Here’s an example of the output for a job that failed immediately with an ExitCode of 1.
JobID        Partition  QOS        JobName    User      State      Elapsed    NNodes   NCPUS      NodeList        ExitCode End
------------ ---------- ---------- ---------- --------- ---------- ---------- -------- ---------- --------------- -------- -------------------
73088        priority-+ erm12009g+ submit_rx  jdt10005  FAILED     00:00:00   1        32         gtx21           1:0      2022-11-17T10:05:34
The ExitCode of a job will be a number between 0 and 255. An ExitCode of 0 means that, as far as SLURM is concerned, the job ran and completed properly. Any non-zero ExitCode indicates that the job failed. You can then search for the ExitCode online to investigate what SLURM thinks caused the job to fail. Sometimes this is helpful, but not always. Either way, take note of what you find for future reference.
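The ExitCode column in the example output has the form 1:0. In SLURM's accounting output, the number before the colon is the exit code and the number after it is the signal that terminated the job, if any. The sketch below splits such a value and prints a short description; describe_exit_code is a hypothetical helper, and it assumes shist reports ExitCode in that same code:signal form.

```shell
# Hypothetical helper: describe a SLURM "code:signal" ExitCode value.
describe_exit_code() {
    code="${1%%:*}"      # text before the colon: the exit code
    signal="${1##*:}"    # text after the colon: the terminating signal
    if [ "$code" -eq 0 ] && [ "$signal" -eq 0 ]; then
        echo "completed"
    elif [ "$signal" -ne 0 ]; then
        echo "killed by signal $signal"
    else
        echo "failed with exit code $code"
    fi
}

describe_exit_code 1:0
```

For the failed job in the example output, this reports an exit code of 1 with no signal involved.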
The next clue to investigate is the NodeList column. Sometimes a job fails because there is something wrong with the compute node our job was run on. If the compute node is the problem (and Storrs HPC staff haven’t fixed it already), the job should fail again with the same ExitCode. We can submit our job specifically to that same node to see if the job fails again. Try adding this to the #SBATCH header of your script to target a specific node. Here, we target gtx21 because that was the node listed in the NodeList column above.
#SBATCH --nodelist=gtx21
Once you see that the job has failed multiple times on the same node, you can feel confident that a faulty node is the likely cause. Please submit a help request to Storrs HPC, including a screenshot of the shist output.
My jobs are suddenly failing due to insufficient memory. Or with an “OOM” error. Why is this happening? And how do I fix this?
Short answer: If you received this error, your job most likely failed because the amount of memory (RAM) it needed was larger than the default allocation. You can request more memory for your job using the --mem or --mem-per-cpu flag.
Long answer: There are several reasons a job may fail due to insufficient memory. The most likely reason this problem is suddenly affecting you is that the Storrs HPC Admins had to change a default setting on HPC 2.0. We’ll explain why in a moment, but first we’ll go over the solution (assuming this is the problem).
As of January 2023, the default amount of memory available per CPU is 2 gigabytes. But you can easily override the default using the --mem-per-cpu flag in your submission script. The new line will look like this:
#SBATCH -n 2               # asks for 2 cores
#SBATCH --mem-per-cpu=3G   # asks for 3 GB of RAM per core, or 6 GB total
Adding this line will tell SLURM to use 3 gigabytes of RAM per CPU you request. That means if we ask for 2 cores (-n 2), then we’ll be allocated 6 gigabytes of RAM total. Please note that the --mem-per-cpu flag should be used together with the -n flag, which specifies the number of cores you want. Alternatively, users can use the --mem flag to specify the total amount of RAM they need.
#SBATCH --mem=5G   # asks for 5 gigabytes total
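The arithmetic behind --mem-per-cpu is simply cores multiplied by gigabytes per core. A minimal sketch, using a hypothetical helper named total_mem_gb and assuming both values are whole numbers of gigabytes:

```shell
# Hypothetical helper: total RAM in GB implied by "-n <cores>" together with
# "--mem-per-cpu=<gb_per_core>G".
total_mem_gb() {
    echo $(( $1 * $2 ))
}

total_mem_gb 2 3   # 2 cores at 3 GB per core, as in the example above
```

This prints 6, matching the 6 GB total described for -n 2 with --mem-per-cpu=3G. Comparing this total against what the job actually needs can help decide whether --mem-per-cpu or a single --mem value is the simpler request.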
We encourage users to adjust the --mem-per-cpu or --mem flags in a step-wise fashion: first try 3 gigabytes, then 4, then 5, and so on, until your jobs start running without failing from memory errors. That strategy helps ensure that every HPC user's jobs start quickly and run efficiently.
Now, we’ll explain why this problem is suddenly affecting our users. On HPC 1.0, the memory flags didn’t work, so there wasn’t a great way to prevent jobs from failing due to insufficient memory. The problem didn’t happen very often, though. The memory flags do work on HPC 2.0, but the default settings were initially too strict. SLURM was only letting 1 job run per node because it assumed all jobs needed the entire node’s memory. There were nodes with 128 cores and 500 GB of RAM where only 1 core and 1 gigabyte of RAM were being used. Jobs were piling up in the queue, and the job wait times were very long.
So, the Storrs HPC Admins reset the default memory allocation to 2 gigabytes of RAM per core. We had to set it this low because some of the node architectures have much less memory than others. Resetting this variable allows SLURM to run more than one job on a node, provided there is enough memory on that node, which reduced the job wait times. This default mem-per-cpu of 2 GB is okay for many users, but not for all. Some of our users run more RAM-intensive programs, for which the default is not sufficient. In those cases, it has to be adjusted manually in the #SBATCH header. For more info on #SBATCH flags, see this link.
I accidentally deleted some files. Is it possible to recover them? If so, how?
Short answer: Possibly. It depends on where the files were stored and how long ago they were deleted. Send us an email at hpc@uconn.edu explaining the situation so we can discuss options.
Long answer: Files stored in the /shared/ and /home/ directories are backed up daily at 5:00 am, and the backups are kept for 30 days. If the accidentally deleted files were stored within either of those directories, and you contact us within 30 days, then there’s a chance we will be able to help you restore those files. Please contact us at hpc@uconn.edu so we can discuss a way forward.
Unfortunately, if the deleted files were stored on /scratch/, we will not be able to restore them. The /scratch/ directory is never backed up. If you’re reading this after accidentally deleting files on scratch, we understand that this news can come as a huge blow. We wish that we could back up /scratch/, but we only have enough room to back up /shared/ and /home/. We cannot help you fix the current situation, but there are steps we can take to prevent this from happening again. You can follow these links for more info on HPC data storage, backing up files, or using Globus.
Why do I get a GLIBC Error when trying to load certain modules?
Short answer: Some modules can only be run on certain architectures. Try loading that module on a node with the Epyc or Skylake architectures.
Long Answer: These GLIBC errors often happen because the software you’re trying to load expects to find a newer GLIBC library than what’s available in your current node’s Red Hat Enterprise Linux (RHEL) version. Older architectures sometimes have older RHEL versions and therefore older GLIBC libraries. Newer architectures typically have newer RHEL versions and GLIBC libraries.
Switching to a newer or different architecture may resolve this GLIBC error. See the following guide for instructions on how to target specific architectures.
Modules that I installed used to work properly on the original HPC but they are not loading properly on HPC 2.0. How do I resolve this problem?
There are many reasons this might be occurring, but a common problem with user-installed programs is that the module names and versions have changed slightly between the old HPC and HPC 2.0. It may be that the dependencies your program used to rely on are no longer available on HPC 2.0. For instance, the GCC compilers have been updated and some of the old ones are no longer available. In this case, the ideal solution would be to install your program again using the newer compilers. This is often a good idea anyway because newer compilers sometimes increase performance and reduce the chance of bugs.
If there are extenuating circumstances that prevent you from using a program with the new compilers, or if you are experiencing other module-related problems, we invite you to submit a request for assistance by emailing hpc@uconn.edu. Then we can discuss options for setting up a module that meets your needs.
How do I fix an error which says one module conflicts with other module(s)?
If the 'module load' command returns the following errors:
$ module load <Module1>
<Module1>(4):ERROR:150: Module '<Module1>' conflicts with the currently loaded module(s) '<Module2>'
<Module1>(4):ERROR:102: Tcl command execution failed: conflict <Module_Group>
This means that the module you want to load conflicts with the currently loaded module, <Module2>. To fix it, please unload <Module2> and then load <Module1> again:
$ module unload <Module2>
$ module load <Module1>
Or
$ module switch <Module2> <Module1>
Or, if neither of these works, you can purge all the modules with
$ module purge
and start fresh.
How do I fix an error which says the module I’m trying to load depends on other module(s)?
If the 'module load' command returns the following errors:
$ module load <Module1>
<Module1>(9):ERROR:151: Module '<Module1>' depends on one of the module(s) '<Module2>'
<Module1>(9):ERROR:102: Tcl command execution failed: prereq <Module2>
This means that the module you want to load depends on the module <Module2>. To fix it, please load <Module2> prior to <Module1>:
$ module load <Module2> <Module1>
You may encounter the above errors many times. Please load/unload the requested/conflicted modules and try again.
Why do I get syntax errors when I try to run scripts via sbatch?
If a script cannot run via sbatch, the errors usually look like this:
sh: -c: line 0: unexpected EOF while looking for matching `"'
sh: -c: line 1: syntax error: unexpected end of file
This is usually caused by the wrong file format: the file is still in Windows format rather than Linux format. The commands below show how to check the format and convert it.
$ file comsol.sh   # with wrong format
comsol.sh: Little-endian UTF-16 Unicode English text, with CRLF line terminators
$ iconv -f utf-16 -t ascii comsol.sh -o comsol.sh   # Convert to ASCII first.
$ dos2unix comsol.sh   # change CRLF line terminators to Unix format
$ file comsol.sh
comsol.sh: Bourne-Again shell script text executable
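If the file command is not conclusive, a quick check for Windows line endings can be sketched as below. The helper name line_endings is hypothetical; it reports CRLF if the file contains any carriage-return character and LF otherwise, and it assumes an ASCII or UTF-8 text file (a UTF-16 file would still need the iconv step first).

```shell
# Hypothetical helper: report whether a text file contains carriage returns,
# which is the usual sign of Windows (CRLF) line endings.
line_endings() {
    cr=$(printf '\r')            # a literal carriage-return character
    if grep -q "$cr" "$1"; then
        echo "CRLF"
    else
        echo "LF"
    fi
}

# Demonstration on a throwaway file written with Windows line endings.
tmp=$(mktemp)
printf '#!/bin/bash\r\necho hello\r\n' > "$tmp"
line_endings "$tmp"
rm -f "$tmp"
```

The demonstration prints CRLF. Running the same check after dos2unix should report LF, at which point the script is ready to submit with sbatch.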