Getting Started

How do I get an account?

If you don’t already have an account, please fill out the cluster application form.

Students and postdoctoral research associates who are requesting an account will need their advisor’s (PI’s) NetID so that we can verify their membership in the advisor’s research group. If you don’t know your advisor’s NetID, you can look it up on the UConn PhoneBook.

How much does it cost to have an account with Storrs HPC?

In general, it costs nothing. Basic access to the Storrs HPC is provided for free to UConn students, staff, and faculty for research purposes. If you would like to have your jobs run more quickly, we offer priority access for a fee. See: How do I get priority access to HPC resources?

What kind of data can be stored on the Storrs HPC?

Short answer: In general, data cannot be stored on the Storrs HPC if it contains any of the following:

  • personally identifiable information (e.g., SSN, Passport Number, Driver's License Number, D.O.B.)

  • credit card and/or banking information (e.g., account numbers)

  • university records for individual students (e.g., grades)

  • personal health information / HIPAA-protected data (with few exceptions)

  • data for which the data owner has specified data protection requirements that are not compatible with the Storrs HPC environment

  • data protected by UConn’s Export Control Policy (e.g., DoD-related, requires a security clearance)


Long Answer: The Storrs HPC cluster cannot be used to generate or store data that is considered Sensitive University Data or covered by the university's Export Control Policy; for more info, see UConn’s Data Classification Policy. The documents linked in the previous sentence provide more detail about each of those classifications.

Data that may have been classified as confidential or sensitive may be eligible for storage on the Storrs HPC if it has been de-identified sufficiently that the data classification no longer applies. Any de-identification or re-identification key (if one was used) cannot be transmitted or stored with the data within the HPC environment.

All data that is stored on the cluster is subject to these restrictions, and data that is not in compliance may be removed. If you have read the above documents and you’re still not sure how your data would be classified, please send an email to security@uconn.edu.

How do I get priority access to HPC resources?

Short answer: Faculty can purchase priority access for 5 years if they pay the upfront cost for the nodes.

Long answer: High-priority access is available under a “condo model,” where faculty are able to purchase semi-dedicated nodes which get made available to all users when there are unused compute cycles. Under the condo model, faculty researchers fund the capital equipment costs of individual compute nodes, while the university funds the operating costs of running these nodes for five years. Faculty who purchase compute nodes receive access to equivalent resources at a higher priority than other researchers. The faculty can designate others to receive access at the same priority level, such as their graduate students, postdoctoral researchers, etc. With priority access, computational jobs are moved higher in the queuing system, and in most cases begin execution within twelve hours, depending upon other FairShare factors. A priority user can utilize their resources indefinitely. All access to resources is managed through the cluster’s job scheduler. When a priority user is not using their assigned resources, the nodes are made available to all UConn researchers for general use.

Please note that priority users will not be given separate partitions. Instead, they will be given a custom QoS, because the QoS governs access to priority resources (a.k.a. Trackable RESources, or TRES).

If you are interested in investing in HPC resources, please fill out the HPC Condo Request form.


Using the HPC

How do I log in to the Storrs HPC when I am off campus?

Short answer: First, you need to connect to UConn’s VPN. Then, you should be able to access the HPC.

Long Answer: The HPC cluster only accepts SSH connections from computers on the campus network, for example:

  • computers in the UConn libraries

  • computers in campus offices/labs

  • computers connected to UCONN-SECURE WiFi

To connect to the HPC when you are off campus, you will first need to connect to the UConn Virtual Private Network (VPN). After connecting to the VPN, you will be able to log in to the HPC as you normally do.

For instructions on how to install, set up, and connect your personal device(s) to UConn’s VPN, please go to this webpage.

How do I check what node I am on?

The node you are on will normally be shown next to your netID when you log in to the Storrs HPC. For instance, if Jonathan the Husky’s netID was jth10001, his terminal might look like this.

[jth10001@login6 ~]$

This would tell us that Jonathan is on the node called “login6.” Another way to check what node you are on is to use the hostname command. See below.

[jth10001@login6 ~]$ hostname
login6
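
The same check can be made from inside a script. The sketch below is illustrative only: the function name is hypothetical, and it assumes login nodes are named "login" followed by a number, as in the prompt shown above.

```shell
# Sketch: decide whether a host name looks like a login node.
# Assumption: login nodes are named "login<number>" (e.g. login6).
is_login_node() {
    # Use the supplied name, or fall back to the current host's name.
    node=${1:-$(hostname 2>/dev/null || uname -n)}
    node=${node%%.*}                 # drop any domain suffix
    case "$node" in
        login[0-9]*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Example guard for the top of a script:
if is_login_node; then
    echo "This is a login node; submit heavy work through the scheduler." >&2
fi
```

A guard like this is one way to avoid accidentally starting long-running work on a login node.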

What programs am I allowed to run on the login nodes?

Programs that run on a login node (login.storrs.hpc.uconn.edu) without using the job scheduler are subject to certain restrictions. Any program not listed below that runs on a login node for longer than 1 hour, uses more than 5% of CPU power, and/or uses more than 5% of RAM may be throttled or terminated without notice.

Below is a list of programs that are allowed to run on the login node without restrictions:

  • awk

  • basemount

  • bash

  • bzip

  • chgrp

  • chmod

  • cmake

  • comsollauncher

  • cp

  • du

  • emacs

  • find

  • fort

  • gcc

  • gfortran

  • grep

  • gunzip

  • gzip

  • icc

  • ifort

  • jservergo

  • less

  • ls

  • make

  • more

  • mv

  • nano

  • ncftp

  • nvcc

  • perl

  • rm

  • rsync

  • ruby

  • setfacl

  • sftp

  • smbclient

  • ssh

  • tail

  • tar

  • ukbfetch

  • vim

  • wget

How do I access old data after I leave the university?

Short answer: Email us at hpc@uconn.edu. We will set up an affiliate account so you can access those files.

Long answer: When folks leave the university, their netIDs are deactivated and access to their Storrs HPC resources and files is (typically) transferred to their supervisors. To restore access, Storrs HPC staff need to create an account affiliated with the former supervisor.

How do I install specific R packages?

Since installing R packages is important to so many HPC users' research, we have created a brief guide on the installation of R packages which is linked here.

How do I install specific python libraries?

We have created a brief guide on the installation of python libraries which can be found here.

How can I check for available resources before submitting a job?

Long wait times for your jobs? Errors about unavailable resources? We’ve been there and understand how frustrating it can be for jobs to take a long time to run. It’s an unfortunate consequence of having such a strong computational research community at UConn. LOTS of incredible research happens here, but it also means that there are LOTS of people competing for resources. There’s no getting around that problem, but there are a couple of steps we can take to increase the odds that our jobs get on ASAP.

  1. Check what resources are available before we submit a job. (And then)

  2. Target our submission to those available resources.

This FAQ will offer guidance on how to do both of those things.


Checking for Available Resources

The nodeinfo command below gives a high-level view of which nodes are fully available (“idle”), in partial use (“mix”), not available (“alloc,” for allocated), or otherwise in need of maintenance (any state other than idle, mix, or alloc). The nodes are listed by partition.

nodeinfo

The output for that command will look like this, but it will be much longer and provide info on every partition (not just class and debug).

PARTITION      NODES  STATE  TIMELIMIT   CPUS    GRES MEMORY   ACTIVE_FEATURES   NODELIST
class              1  inval    4:00:00     36   gpu:3 191954   gpu,v100,skylake  gpu05
class              1 maint*    4:00:00    128  (null) 515404   epyc128,cpuonly   cn563
class              3 drain$    4:00:00     36  (null) 191845+  skylake,cpuonly   cn[360-361,363]
class              8  maint    4:00:00     36  (null) 191845+  skylake,cpuonly   cn[362,365-366,368-372]
class              1  maint    4:00:00    128  (null) 515404   epyc128,cpuonly   cn561
class              1  comp*    4:00:00    128  (null) 515404   epyc128,cpuonly   cn564
class              3   drng    4:00:00    128  (null) 515404   epyc128,cpuonly   cn[560,565,569]
class              6  alloc    4:00:00    128  (null) 515404   epyc128,cpuonly   cn[559,562,566-568,570]
debug              1 maint*      30:00    128  (null) 515404   epyc128,cpuonly   cn563

... (etc.)
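
Because the full listing is long, it can help to total the node counts per state for one partition. This is a minimal sketch, not a site-provided tool: the function name is hypothetical, and it assumes the column order shown in the sample output above (partition, node count, state).

```shell
# Sketch: total the NODES column per STATE for one partition.
# Reads nodeinfo-style text on stdin; assumes column 1 is the partition,
# column 2 the node count, and column 3 the state.
nodes_by_state() {
    awk -v part="$1" '
        $1 == part { count[$3] += $2 }
        END { for (s in count) print s, count[s] }
    ' | sort
}

# Demo on rows copied from the sample output (no scheduler needed):
nodes_by_state class <<'EOF'
PARTITION      NODES  STATE  TIMELIMIT   CPUS    GRES MEMORY   ACTIVE_FEATURES   NODELIST
class              8  maint    4:00:00     36  (null) 191845+  skylake,cpuonly   cn[362,365-366,368-372]
class              1  maint    4:00:00    128  (null) 515404   epyc128,cpuonly   cn561
class              6  alloc    4:00:00    128  (null) 515404   epyc128,cpuonly   cn[559,562,566-568,570]
debug              1 maint*      30:00    128  (null) 515404   epyc128,cpuonly   cn563
EOF
# prints:
# alloc 6
# maint 9
```

On the cluster, the live listing could be piped in instead, for example nodeinfo | nodes_by_state general.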

The above command gives us an overarching picture of usage on the cluster, and from there, we can use a more targeted command to get more information on individual nodes within a partition, like how many cores or GPUs are in use and how many are available. The base sinfo command is below and it targets the priority-gpu partition but we could amend it to target any other partition, like hi-core for example.

sinfo -p priority-gpu -t idle,mix -o%10n%20C%10G%10t%20R%b
HOSTNAMES CPUS(A/I/O/T)       GRES      STATE     PARTITION           ACTIVE_FEATURES
gpu06     13/23/0/36          gpu:1     mix       priority-gpu        location=local,gpu,v100,skylake
gpu20     36/28/0/64          gpu:3     mix       priority-gpu        location=local,epyc64,a100,gpu
gpu21     25/39/0/64          gpu:3     mix       priority-gpu        location=local,epyc64,a100,gpu
gpu22     3/61/0/64           gpu:3     mix       priority-gpu        location=local,epyc64,a100,gpu
gpu23     33/31/0/64          gpu:1     mix       priority-gpu        location=local,epyc64,a100,gpu
gtx02     2/18/0/20           gpu:3     mix       priority-gpu        location=local,gpu,gtx,skylake
gtx03     12/8/0/20           gpu:3     mix       priority-gpu        location=local,gpu,gtx,skylake
gtx08     1/19/0/20           gpu:2     mix       priority-gpu        location=local,gpu,gtx,skylake
gtx11     2/18/0/20           gpu:2     mix       priority-gpu        location=local,gpu,gtx,skylake
gtx15     28/4/0/32           gpu:8     mix       priority-gpu        location=local,gpu,rtx,skylake
gtx09     0/20/0/20           gpu:2     idle      priority-gpu        location=local,gpu,gtx,skylake

The column titled “CPUS(A/I/O/T)” tells us how many cores are available. “A” stands for Allocated, “I” stands for Idle, and “T” stands for Total. (“O” stands for Other, but you can ignore that value.) Since gpu21 shows 39 in the Idle position, 39 of its cores are available to use. But all 3 of the GPUs on gpu21 are in use, so we can’t use any GPUs on that node. If we only needed cores and no GPUs, we could target gpu21.
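
Reading the Idle value by eye gets tedious on large partitions, so the filter below is one possible way to automate it. It is only a sketch: the function name is hypothetical, it assumes the first two columns are the host name and CPUS(A/I/O/T) as in the output above, and it says nothing about GPU availability.

```shell
# Sketch: list nodes that have at least MIN idle cores.
# Reads sinfo-style text on stdin; assumes column 1 is the host name and
# column 2 is CPUS(A/I/O/T).
nodes_with_idle_cores() {
    awk -v min="$1" '
        $2 ~ /^[0-9]+\/[0-9]+\/[0-9]+\/[0-9]+$/ {
            split($2, cpus, "/")          # cpus[1]=allocated, cpus[2]=idle
            if (cpus[2] + 0 >= min + 0) print $1, cpus[2]
        }
    '
}

# Demo on rows copied from the sample output:
nodes_with_idle_cores 30 <<'EOF'
HOSTNAMES CPUS(A/I/O/T)       GRES      STATE     PARTITION           ACTIVE_FEATURES
gpu06     13/23/0/36          gpu:1     mix       priority-gpu        location=local,gpu,v100,skylake
gpu21     25/39/0/64          gpu:3     mix       priority-gpu        location=local,epyc64,a100,gpu
gpu22     3/61/0/64           gpu:3     mix       priority-gpu        location=local,epyc64,a100,gpu
gtx15     28/4/0/32           gpu:8     mix       priority-gpu        location=local,gpu,rtx,skylake
EOF
# prints:
# gpu21 39
# gpu22 61
```

On the cluster, the sinfo command shown earlier could be piped into nodes_with_idle_cores in place of the sample rows.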

In summary, these two commands can give us a picture of what partitions have resources available, and then what resources are available on individual nodes within that partition.


Targeting a specific partition

The next step is submitting a job targeting a specific partition. If you’re not sure how to target a specific partition, please visit our SLURM Guide where you will see examples of submission scripts that target different partitions and architectures. Another key part of targeting a specific partition is knowing what partitions you are allowed to use and what “account” and “QOS” you must use to access them. To check what partitions you’re allowed to use and how to access them you can use this command.

acctinfo `whoami`
      User    Account            Partition                  QOS 
---------- ---------- -------------------- -------------------- 
net10004    jth10001               hi-core              general 
net10004    jth10001           general-gpu              general 
net10004    jth10001                 debug              general 
net10004    jth10001               lo-core              general 
net10004    jth10001               general              general
net10004    jth10001          priority-gpu          jth10001gpu

This tells me that I have access to 6 partitions. To access the priority-gpu partition, I need to include the three flags below in the #SBATCH header of my submission script. These values are different for every user, so you will have to substitute the partition you want along with the account and QOS associated with your own account.

#SBATCH -p priority-gpu     # partition I want to access
#SBATCH -A jth10001         # account I am associated with
#SBATCH -q jth10001gpu      # QOS needed to access partition
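
To avoid copying these values by hand, the header lines can be generated from the acctinfo listing. This is a sketch under assumptions: the function name is hypothetical, and the columns are assumed to be User, Account, Partition, QOS in that order, as in the output above.

```shell
# Sketch: print the three #SBATCH lines for one partition.
# Reads acctinfo-style text on stdin; assumes the columns are
# User, Account, Partition, QOS.
sbatch_flags_for() {
    awk -v part="$1" '
        $3 == part {
            printf "#SBATCH -p %s\n#SBATCH -A %s\n#SBATCH -q %s\n", $3, $2, $4
            found = 1
            exit
        }
        END { if (!found) exit 1 }   # non-zero status if the partition is not listed
    '
}

# Demo on rows copied from the sample output:
sbatch_flags_for priority-gpu <<'EOF'
      User    Account            Partition                  QOS
---------- ---------- -------------------- --------------------
net10004    jth10001               hi-core              general
net10004    jth10001          priority-gpu          jth10001gpu
EOF
# prints:
# #SBATCH -p priority-gpu
# #SBATCH -A jth10001
# #SBATCH -q jth10001gpu
```

On the cluster, the acctinfo output for your own account could be piped in instead of the sample rows.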

If you have further questions about how to check what resources are available and how to target them, please feel free to contact the Storrs HPC admins by sending an email to hpc@uconn.edu.


Troubleshooting problems

Why did my job fail?

There are many reasons a job may fail. A good first step is to use the shist command to check the ExitCode SLURM gave it. The command follows this format: shist --starttime YYYY-MM-DD.

Here’s an example of the output for a job that failed immediately with an ExitCode of 1.

JobID         Partition        QOS    JobName      User      State    Elapsed   NNodes      NCPUS        NodeList ExitCode                 End 
------------ ---------- ---------- ---------- --------- ---------- ---------- -------- ---------- --------------- -------- -------------------
73088        priority-+ erm12009g+  submit_rx  jdt10005     FAILED   00:00:00        1         32           gtx21      1:0 2022-11-17T10:05:34 

The ExitCode of a job will be a number between 0 and 255. An ExitCode of 0 means that—as far as SLURM is concerned—the job ran and was completed properly. Any non-zero ExitCode will indicate that the job failed. One could then search the ExitCode on Google to investigate what SLURM thinks caused the job to fail. Sometimes this is helpful but not always. Either way, take note of what you find for future reference. Common ExitCodes are listed below for reference:

  • 0 → success

  • non-zero → failure

  • Exit code 1 indicates a general failure

  • Exit code 2 indicates incorrect use of shell builtins

  • Exit codes 3-124 indicate some error in job (check software exit codes)

  • Exit code 125 indicates out of memory

  • Exit code 126 indicates command cannot execute

  • Exit code 127 indicates command not found

  • Exit code 128 indicates invalid argument to exit

  • Exit codes 129-192 indicate jobs terminated by Linux signals

    • For these, subtract 128 from the number and match to signal code

    • Enter kill -l to list signal codes

    • Enter man signal for more information

    • Please note: when a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, separated by a colon (:).
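
The list above can be turned into a small lookup helper. The sketch below simply mirrors that list; the function name is hypothetical, and the wording is a convenience rather than an official SLURM mapping.

```shell
# Sketch: turn an ExitCode field ("code" or "code:signal") into a short hint.
# The ranges and wording follow the list above.
explain_exit_code() {
    code=${1%%:*}                      # keep the part before the colon
    if   [ "$code" -eq 0 ];   then echo "success"
    elif [ "$code" -eq 1 ];   then echo "general failure"
    elif [ "$code" -eq 2 ];   then echo "incorrect use of shell builtins"
    elif [ "$code" -le 124 ]; then echo "error in job (check software exit codes)"
    elif [ "$code" -eq 125 ]; then echo "out of memory"
    elif [ "$code" -eq 126 ]; then echo "command cannot execute"
    elif [ "$code" -eq 127 ]; then echo "command not found"
    elif [ "$code" -eq 128 ]; then echo "invalid argument to exit"
    elif [ "$code" -le 192 ]; then echo "terminated by signal $((code - 128))"
    else                           echo "unlisted exit code"
    fi
}

explain_exit_code 1:0    # prints: general failure
explain_exit_code 137    # prints: terminated by signal 9
```

The signal number it reports can then be looked up with kill -l, as described in the list.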

The next clue to investigate is the NodeList column. Sometimes a job fails because there is something wrong with the compute node our job was run on. If the compute node is the problem (and Storrs HPC staff haven’t fixed it already), the job should fail again with the same ExitCode. We can submit our job specifically to that same node to see if the job fails again. Try adding this to the #SBATCH header of your script to target a specific node. Here, we target gtx21 because that was the node listed in the NodeList column above.

#SBATCH --nodelist=gtx21

Once you see the job has failed multiple times on the same node but does not fail on other nodes, then you can feel confident that a faulty node is a likely cause. Please submit a help request to Storrs HPC including a screenshot from the shist output.

My jobs are failing due to insufficient memory. Or with an “out of memory” or “OOM” error. Why is this happening? And how do I fix this?

Short answer: If you received this error, your job most likely failed because the amount of memory (RAM) it needed was larger than the default. We can request more memory for your job using the --mem or --mem-per-cpu flag.

Long answer: There are several reasons a job may fail from insufficient memory. As of January 2023, the default amount of memory available per CPU is 2 gigabytes. That default is fine for many users, but not for all. Users who receive OOM errors need to request more memory when submitting or initiating jobs. They can override the default using the --mem-per-cpu flag in their submission script. The new lines would look something like this:

#SBATCH -n 2                            # asks for 2 cores
#SBATCH --mem-per-cpu=3G                # asks for 3 GB of RAM per core, or 6 GB total

Adding this line will tell SLURM to use 3 gigabytes of RAM per CPU you request. That means if we ask for 2 cores (-n 2), then we’ll be allocated 6 gigabytes of RAM total. Please note that the --mem-per-cpu flag must be used with the -n flag specifying the number of cores you want. Alternatively, users can use the --mem flag to specify the total amount of RAM they need regardless of how many CPUs are requested.

#SBATCH --mem=5G                        # asks for 5 gigabytes total

We encourage users to please adjust the --mem-per-cpu or --mem flags in a step-wise fashion. First, we try 3 gigabytes, then 4, then 5, etc. until our jobs start running without failing from memory errors. That strategy helps ensure that every HPC user's jobs get on quickly and run efficiently. For more info on fisbatch, srun, and #SBATCH flags, see this link.
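
The arithmetic behind --mem-per-cpu is just cores multiplied by memory per core. The sketch below works through the step-wise values mentioned above; the function name is hypothetical and it only handles whole gigabytes.

```shell
# Sketch: total memory implied by --mem-per-cpu, in whole gigabytes.
total_mem_gb() {
    cores=$1
    per_cpu_gb=$2
    echo $((cores * per_cpu_gb))
}

# Step-wise increase for a 2-core job: one attempt per row.
for per_cpu in 3 4 5; do
    echo "-n 2 --mem-per-cpu=${per_cpu}G  ->  $(total_mem_gb 2 "$per_cpu") GB total"
done
# prints:
# -n 2 --mem-per-cpu=3G  ->  6 GB total
# -n 2 --mem-per-cpu=4G  ->  8 GB total
# -n 2 --mem-per-cpu=5G  ->  10 GB total
```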

My jobs are failing due to Timeout. I do not have access to priority; how can I resume a job after it times out?

Short answer: Once the job is canceled by SLURM due to timeout, it cannot be resumed from that point because SLURM sets the exit code to “0” which denotes job completion. As far as SLURM is concerned, the job is now complete, with no state to resume from.

Long answer: One thing you can try is to use the timeout command to stop your program just before SLURM does. You can tell from the return code if the timeout was reached or not. It should set exit code “124”. If so, you can then requeue it with scontrol. Try the following:

In your submission script, add the following:

#SBATCH --open-mode=append
#SBATCH --time=12:00:00

Then, use the timeout command to call your program:

timeout 11h ./your_program
if [[ $? == 124 ]]; then
  scontrol requeue $SLURM_JOB_ID
fi

Note: The --open-mode=append flag ensures that the output of each run is appended to the file specified by #SBATCH --output=, so the previous run’s output is preserved in the same file.

Disclaimer: This is untested on Storrs HPC; however, it should work as long as everything else is working correctly.
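
The exit-status behavior of the timeout command can be tried on any Linux machine before it is used in a job script. This is a minimal sketch with a hypothetical function name; it only prints what it would do, and it does not call scontrol.

```shell
# Sketch: run a command under a time limit and report whether it hit the limit.
# Relies on the coreutils `timeout` command exiting with status 124 on timeout.
run_with_limit() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "requeue"               # the limit was reached
    else
        echo "finished:$status"      # the command ended on its own
    fi
}

run_with_limit 1 sleep 3     # prints: requeue
run_with_limit 5 true        # prints: finished:0
```

In a real submission script, the "requeue" branch is where the scontrol requeue call from the example above would go.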

My job is not running. It says “JobHeldUser” and its state is “SE.” Why is this happening? And how do I get my job running again?

Short answer: Your job is being “held.” To release the job and re-submit it to the job queue you can use the scontrol release {JOBID} command.

Long Answer: Your job failed. We have a separate FAQ on figuring out why a job failed here, but here we will focus on why your job is being held. When jobs fail, they used to be automatically re-queued. This was a problem for a number of users because re-running the job would overwrite their previous data. In January 2024, we re-configured SLURM to prevent this problem. Now, when jobs fail, they are not immediately re-queued. Instead, the jobs will be “held” from the queue until the submitting user “releases” those jobs back into the queue. This change prevents jobs from requeueing automatically and allows users to make a conscious choice to re-queue their jobs. You can re-queue jobs using the below commands:

  1. To release a single job

    scontrol release {JOBID}
  2. To release multiple jobs

    scontrol release {JOBID_1},{JOBID_2},{JOBID_3}
  3. To release all jobs with a given job name

    scontrol release jobname={JOBNAME}
  4. To release all of your held jobs

    squeue --me | grep ' SE ' | awk '{print $1}' | xargs -n1 scontrol release

If you release your jobs into the queue and they keep ending up back in the “held” state, that is an indication that there may be something failing within your submission script in which case you should cancel your jobs and start troubleshooting. Please note that jobs which are left in the queue with the “SE” state will be cancelled after seven days.
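
When selecting held jobs, matching on the state column is a little stricter than searching each line for " SE ", since a job that happens to be named SE would also match the search. The sketch below shows one way to do that. The function name is hypothetical, the sample rows are made up for illustration, and it assumes squeue's default column order, where the state is the fifth column.

```shell
# Sketch: print the IDs of jobs whose state column is exactly "SE".
# Reads squeue-style text on stdin; assumes the default column order
# (JOBID PARTITION NAME USER ST ...), so the state is column 5.
held_job_ids() {
    awk '$5 == "SE" { print $1 }'
}

# Demo on made-up rows (job IDs and names are illustrative only):
held_job_ids <<'EOF'
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  73088   general submit_r jth10001 SE       0:00      1 (JobHeldUser)
  73089   general submit_r jth10001  R       5:12      1 cn561
  73090   general       SE jth10001 PD       0:00      1 (Priority)
EOF
# prints:
# 73088
```

On the cluster, this could replace the grep and awk stages of the last command above: squeue --me | held_job_ids | xargs -n1 scontrol release.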

Please feel free to contact us at hpc@uconn.edu with any questions or concerns.

Some of my files were deleted. Is it possible to recover them? If so, how?

Short answer: Possibly, it depends on where the files were stored and how long ago they were deleted. Send us an email at hpc@uconn.edu explaining the situation so we can discuss options.

Long Answer: Files stored in the /shared/ and /home/ directories are backed up daily at 5:00 am and the data is stored for 30 days. If the accidentally deleted files were stored within either of those directories, and you contact us in under 30 days, then there’s a chance we will be able to help you restore those files. Please contact us at hpc@uconn.edu so we can discuss a way forward.

Unfortunately, if the deleted files were stored on /scratch/, we will not be able to restore them. The /scratch/ directory is never backed up. If you’re reading this after accidentally deleting files from scratch, we understand that this news can come as a huge blow. We wish that we could back up /scratch/, but we only have enough room to back up /shared/ and /home/. We cannot help you fix the current situation, but there are steps you can take to prevent this from happening again. You can follow these links for more info on HPC data storage, backing up files, or using Globus.

Why am I suddenly getting a “Permission denied” error when I try to ssh into the HPC?

Short answer: One of the login nodes is most likely not working properly. Try to ssh into any of the three login nodes directly.

Long Answer: When you ssh into hpc2.storrs.hpc.uconn.edu, you are directed to one of our three login nodes (login4, login5, or login6). Occasionally, one of these three login nodes will become faulty. If you ssh into the cluster and your account is directed to the faulty node, then you may be given a “Permission denied” error message. If you experience this problem, we recommend you try to ssh directly into one of the login nodes. Here are the three commands you can use to log in to each of our three login nodes. Please replace netID with your own netID.

login4

ssh -Y netID@login4.storrs.hpc.uconn.edu

login5

ssh -Y netID@login5.storrs.hpc.uconn.edu

login6

ssh -Y netID@login6.storrs.hpc.uconn.edu

If one of these allows you to log in but another gives you a permission denied error, then we can be sure that there is something wrong with one of the login nodes. If you have the time, we’ll ask that you please send us a screenshot of the login node which is giving you a problem so that we can tend to any problems on that faulty node. This will help us ensure that this problem is fixed as soon as possible.

If you receive this “Permission denied” error when ssh-ing directly into all three login nodes, then the problem may be with your netID. You may have to reset your netID password which can be done at this link.

Why do I get a GLIBC Error when trying to load certain modules?

Short answer: Some modules can only be run on certain architectures. Try loading that module on a node with the Epyc or Skylake architectures.

Long Answer: These GLIBC errors often happen because the software you’re trying to load expects to find a newer GLIBC library than what’s available in your current node’s Red Hat Enterprise Linux (RHEL) version. Older architectures sometimes have older RHEL versions and therefore older GLIBC libraries. Newer architectures typically have newer RHEL versions and GLIBC libraries.

Switching to a newer or different architecture may resolve this GLIBC error. See the following guide for instructions on how to target specific architectures.
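
Before switching nodes, it can be useful to see which GLIBC version the current node provides and compare it with the version named in the error message. This is a rough sketch: the function name is hypothetical, the required version is a placeholder, and it assumes a glibc-based system with GNU sort available.

```shell
# Sketch: compare dotted version strings, e.g. glibc versions.
# Relies on GNU `sort -V` (version sort).
version_at_least() {
    have=$1
    need=$2
    lowest=$(printf '%s\n%s\n' "$have" "$need" | sort -V | head -n 1)
    [ "$lowest" = "$need" ]
}

# `getconf GNU_LIBC_VERSION` prints something like "glibc 2.28" on glibc systems.
node_glibc=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{ print $2 }')
required=2.28      # placeholder: use the version named in your error message
if [ -n "$node_glibc" ] && version_at_least "$node_glibc" "$required"; then
    echo "glibc $node_glibc is new enough"
else
    echo "glibc ${node_glibc:-unknown} does not meet $required"
fi
```

Running this on nodes of different architectures shows which ones meet the requirement.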

How do I fix an error which says, “Can't open display, Failed initializing GUI, exiting?”

Short answer: This is an X-forwarding error. The most common fix is to enable X-forwarding when you ssh into the HPC using the -X or -Y flag. But there are other problems which can cause this too.

ssh -Y netID@hpc2.storrs.hpc.uconn.edu

Long Answer: X-forwarding allows programs running on the HPC to be opened in a GUI on our local machines. It is convenient and nice to work with, but it can take a bit of effort to set up if you are working on a Mac or Windows device. Linux users have it easy because all they normally have to do is use the -X or -Y flag we referred to above. For users of Windows or Mac devices, we have written up a more in-depth guide which you can find here.

Modules that I installed used to work properly on the original HPC but they are not loading properly on HPC 2.0. How do I resolve this problem?

There are many reasons this might be occurring, but a common problem with user-installed programs is that the module names and versions have changed slightly between the old HPC and HPC 2.0. It may be that the dependencies your program used to rely on are no longer available on HPC 2.0. For instance, the GCC compilers have been updated and some of the old ones are no longer available. In this case, the ideal solution would be to install your program again using the newer compilers. This is often a good idea anyway because newer compilers sometimes improve performance and reduce the chance of bugs.

If extenuating circumstances prevent you from using a program with the new compilers, or if you are experiencing other module-related problems, we invite you to submit a request for assistance by emailing hpc@uconn.edu. Then we can discuss options for setting up a module that meets your needs.

How do I fix an error which says one module conflicts with other module(s)?

If the 'module load' command returns the following errors:

$ module load <Module1>
<Module1>(4):ERROR:150: Module '<Module1>' conflicts with the currently loaded module(s) '<Module2>'
<Module1>(4):ERROR:102: Tcl command execution failed: conflict <Module_Group>

This means that the module you want to load conflicts with the currently loaded module, <Module2>. To fix it, please unload <Module2> and then load <Module1> again:

$ module unload <Module2>
$ module load <Module1>

Or

$ module switch <Module2> <Module1>

Or, if neither of these works, you can purge all the modules with

$ module purge

and start fresh.

How do I fix an error which says the module I’m trying to load depends on other modules?

If the 'module load' command returns the following errors:

$ module load <Module1>
<Module1>(9):ERROR:151: Module '<Module1>' depends on one of the module(s) '<Module2>'
<Module1>(9):ERROR:102: Tcl command execution failed: prereq <Module2>

This means that the module you want to load depends on the module <Module2>. To fix it, please load <Module2> prior to <Module1>:

$ module load <Module2> <Module1>

You may encounter these errors several times in a row. Each time, load the required module or unload the conflicting module, and try again.

Why do I get syntax errors when I try to run scripts via sbatch?

If a script fails to run via sbatch, the errors usually look like this:

sh: -c: line 0: unexpected EOF while looking for matching `"'
sh: -c: line 1: syntax error: unexpected end of file

This is usually caused by the wrong file format: the file is still in Windows format rather than Linux (Unix) format.

$ file comsol.sh # with wrong format
comsol.sh: Little-endian UTF-16 Unicode English text, with CRLF line terminators
$ iconv -f utf-16 -t ascii comsol.sh -o comsol.tmp && mv comsol.tmp comsol.sh # Convert to ASCII first, via a temporary file.
$ dos2unix comsol.sh # change CRLF line terminators to Unix format
$ file comsol.sh
comsol.sh: Bourne-Again shell script text executable
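
If dos2unix is not available, the carriage returns can also be detected and removed with standard tools. The sketch below is an alternative under assumptions, not the procedure above: the function names are hypothetical, and it assumes the file is already plain ASCII or UTF-8 (a UTF-16 file still needs the iconv step first).

```shell
# Sketch: detect and strip Windows (CRLF) line endings without dos2unix.
# Assumes the file is already plain ASCII/UTF-8 (not UTF-16).
has_crlf() {
    # True if any line ends with a carriage return.
    grep -q "$(printf '\r')\$" "$1"
}

strip_crlf() {
    # Filter into a temporary file, then copy back so the original file's
    # permissions are kept. Note: this removes every carriage return.
    tmp=$(mktemp) && tr -d '\r' < "$1" > "$tmp" && cat "$tmp" > "$1" && rm -f "$tmp"
}

# Demo on a throwaway file:
demo=$(mktemp)
printf 'echo hello\r\n' > "$demo"
has_crlf "$demo" && strip_crlf "$demo"
has_crlf "$demo" || echo "line endings are clean"   # prints: line endings are clean
rm -f "$demo"
```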

Guidelines for Citing the HPC

How can I acknowledge the Storrs HPC in our publications?

If you would like to acknowledge/reference the Storrs HPC cluster in your publications, you can use wording along the lines of the following:

“The computational work performed on this project was done with help from the Storrs High Performance Computing cluster. We would like to thank the UConn Storrs HPC and HPC team for providing the resources and support that contributed to these results.”

 
