General HPC Guidelines

Welcome to the Storrs HPC community! There are about 2,000 students, faculty, and staff from all backgrounds and disciplines who regularly access and do research on the Storrs HPC.

With so many people sharing the same resources, it is important that we be considerate of other users on the Storrs HPC. For that reason, we’ve laid out some general guidelines that we can all follow to make sure our work progresses smoothly without negatively impacting other HPC users. If you have any questions or comments, please feel free to reach out to us by email at hpc@uconn.edu.


1. Please do not run computationally intensive programs (i.e., modules), scripts, analyses, or calculations on the login nodes.

  • Why: The login nodes are where we land when we log in to the cluster. Login nodes have limited memory (a.k.a., RAM) that is shared among all users on that node. Running programs on the login nodes is both inefficient and problematic because we’re sharing memory with many other users, and computationally intensive programs slow down that login node for all HPC users.

  • Exceptions: Very basic programs (e.g., vim) or file-management commands (e.g., rsync in a screen session) can be used on the login nodes, provided the node is not already running slowly. See the full list of allowed programs here. A brief sketch of the screen-plus-rsync workflow is shown below.
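  • For example, a long-running rsync can be started inside a screen session on a login node so the transfer survives a dropped connection. This is a minimal sketch; the session name and paths are placeholders.

      # start a named screen session (the name "transfer" is arbitrary)
      screen -S transfer

      # inside the session, run the transfer; -a preserves attributes, -v is verbose
      rsync -av /path/to/source/ /path/to/destination/

      # detach with Ctrl-A then D; the transfer keeps running in the background
      # reattach later with:
      screen -r transfer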

Unfamiliar terms? If some of these terms—like module, HPC, or SLURM—are new to you, feel free to check out our Glossary page!

2. Always use the SLURM scheduler for any jobs or calculations.

  • Why: The SLURM scheduler (i.e., sbatch, fisbatch) helps us ensure that access to HPC resources is distributed equitably among our many users.

  • Whenever submitting a script, please use the sbatch command, and whenever initiating an interactive session, please use the fisbatch (or srun) command. Minimal examples of both are shown after this list.

  • Accessing nodes directly (i.e., bypassing the SLURM scheduler) may impact your performance or someone else’s because the SLURM scheduler cannot allocate resources correctly. To reduce wait times for your jobs, consider contacting the Storrs HPC team for information about priority access.
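  • As a rough sketch, a batch submission script might look like the one below; the partition, time limit, module version, and program names are placeholders to adapt to your own workload, not recommendations.

      #!/bin/bash
      #SBATCH --partition=general        # partition to run on
      #SBATCH --ntasks=1                 # number of cores
      #SBATCH --time=01:00:00            # wall-clock limit
      #SBATCH --output=job_%j.out        # output file; %j is the job ID

      module load python/3.9.2           # example module; check "module avail" for actual versions
      python my_analysis.py              # placeholder program

  • Submit the script with sbatch my_job.sh (the file name is up to you), or start an interactive session through the scheduler with, for example, fisbatch -n 1 -p general.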

 

3. When submitting a help request to the Storrs HPC, please include as much information as possible, including screenshots, about the steps leading up to the problem.

  • Why: We know how frustrating it can be when technical problems are slowing research progress. Sharing screenshots of the steps leading up to where problems arise tells us that the problem is reproducible and gives us a place to start troubleshooting together.

  • To submit a help request, send an email explaining the problem you’re facing to hpc@uconn.edu.

    • Examples of helpful things to screenshot include:

      • Error messages in the terminal

      • Paths to relevant files

      • Scripts that are failing

      • Modules that are being loaded

      • Partition (e.g., general-gpu) and node (e.g., cn123) where you’re working
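  • If you are unsure which partition or node your job landed on, squeue can report it. This is a generic SLURM sketch; the format string simply selects the job ID, partition, state, and node list for your own jobs.

      squeue -u $USER -o "%.10i %.12P %.10T %N"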

 

4. Only use what you need.

  • The SLURM scheduler ensures resources are distributed fairly, but requesting more resources than you need takes away from what’s available to other HPC users. We recommend that all users benchmark different configurations to see where they get optimal performance. If increasing the amount of resources does not improve performance, please do not request more cores.

  • In most cases, if your analyses are not parallelizable, you should request only one core; requesting more would be wasteful because the extra cores would sit idle. An example of how to request one core in an interactive session is below.

    • fisbatch -n 1 -p general --constraint="cpuonly"
    • Exception: Some analyses are not parallelizable but require more memory than is available to one core (the default is 2 GB per core). If your analysis fails due to insufficient memory, you’ll need to request more memory using the --mem flag or request additional cores purely for their memory, even though your analysis cannot be parallelized. Examples are shown below.
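    • As a rough sketch, either form below requests about 8 GB for a single-core job; 8 GB is a placeholder value, and this assumes fisbatch passes standard SLURM flags like --mem through to the scheduler.

        # interactive: one core with 8 GB of memory
        fisbatch -n 1 -p general --mem=8G

        # batch script header: the equivalent directives
        #SBATCH --ntasks=1
        #SBATCH --mem=8G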

5. Please do not use scratch for long-term storage. Instead, back up files to shared or ARCHIVE.

  • It is every researcher’s worst nightmare to have data deleted and then have to restart a research project. That’s why we should not use scratch for long-term storage. Scratch is NOT backed up. Moreover, if scratch storage becomes limited, any files older than 60 days can be deleted.

  • Our best defense against data loss is to back up files regularly to either our shared or archive directories (which are backed up once a week). You can make a folder for yourself in your lab’s shared or archive directories, which can generally be found at the paths below:

    • shared: /shared/<group_name>/your_name_here

    • archive: /shared/<group_name>/ARCHIVE/your_name_here

  • Scratch should only be used to store data you’re actively working on. Scratch is best for that because it has the fastest performance.

  • Follow these links for more information on HPC data storage, backing up files, or using Globus.
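  • Combined with the screen workflow shown under guideline 1, a simple backup might look like the sketch below; the paths are placeholders following the patterns above.

      # create your personal folder in the lab's shared directory (one time)
      mkdir -p /shared/<group_name>/your_name_here

      # copy results from scratch into your shared folder
      rsync -av /scratch/<path_to_your_data>/ /shared/<group_name>/your_name_here/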

6. Please install niche programs in your home or shared directories.

  • Storrs HPC staff members gladly install and maintain the most commonly used libraries (e.g., gcc), programming languages (e.g., R, Python, MATLAB), and programs (e.g., Gaussian).

    • Storrs HPC does not install niche programs or packages. Instead, we facilitate installation by providing the base dependencies for most programs, such as GCC compilers or base versions of Python. If there is a specific program or package you or your lab will be using, we recommend installing those programs/packages in your shared or home directory.

    • Follow this link for more information on software installation on the cluster. If you have questions about how to build or compile, please discuss them with your PI first; if you still have difficulty, feel free to reach out to the Storrs HPC staff. A brief installation sketch is shown below.
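    • As one common pattern, a Python package can be installed into a virtual environment under your home directory. This is a minimal sketch; the module version and package name are placeholders, so check module avail python for what is actually on the cluster.

        # load a base Python provided by the cluster (version is a placeholder)
        module load python/3.9.2

        # create and activate a virtual environment in your home directory
        python -m venv ~/envs/myproject
        source ~/envs/myproject/bin/activate

        # install the package you need into that environment (numpy is just an example)
        pip install numpy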

7. Storrs HPC staff are not experts in most of the software on the HPC.

  • The staff at Storrs HPC are glad to be your first stop for questions about partitions, compiling software, or submitting jobs efficiently. You can contact Storrs HPC staff by sending an email to hpc@uconn.edu.

  • But if you have specific software or coding questions, we recommend talking with your advisor or searching forums dedicated to that software, where you have a better chance of finding someone with experience troubleshooting it.

  • If you have significant programming experience, copying and pasting the error message you’re receiving into a search engine (e.g., Google) may help you find forums where people have faced and fixed the same errors you’re experiencing. Even if no one has found a solution, reading forums or posts on StackOverflow citing the same error may at least feel validating. Sometimes, it’s nice to know that the coding problem you’re facing seems to be confusing others as well.

8. Please exercise caution when writing scripts that automatically resubmit themselves.

  • Problem: Scripts that automatically resubmit themselves can be useful for jobs that need to run repeatedly or to extend jobs from previous checkpoints. But when there’s an error in the script, those scripts can get stuck in a loop where they resubmit themselves tens of thousands of times, which slows the HPC down for all users.

  • Solution: Include logic in your script that limits the number of times a given job can resubmit itself. Here is an example of a submission script written in bash.

      #!/bin/bash
      #SBATCH -o job_%j.out   # creates an out file with the prefix "job_"

      # change to the directory where your files are
      dir=/scratch/advisor/student/path/to/directory
      cd $dir

      # set the name of your script
      script=$dir/name_of_script.sh

      # count the out files with the "job_" prefix;
      # the number of out files with the "job_" prefix = the number of times the job has run
      num_out=$(ls ./job_*.out | wc -l)

      # if the job has run more than 5 times, it will not be resubmitted
      if [ $num_out -gt 5 ]; then
          echo "Script reached maximum number of submissions."
      # if it has run fewer than 5 times, resubmit the script
      elif [ $num_out -lt 5 ]; then
          sbatch $script
      else
          echo "Something may have gone wrong. Check the job_*.out files for errors."
      fi

9. Please debug and test your code.

  • You can submit your test job to the debug partition, which allows a maximum run time of 30 minutes. Use these short test runs to estimate your job’s resource requirements, and then tailor your full job submission so it runs more efficiently. An example is shown below.
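  • As a sketch, a short test run and a follow-up resource check might look like the following; the script name is a placeholder, and sacct requires SLURM accounting to be enabled, as it is on most clusters.

      # submit a short test run to the debug partition (30-minute maximum)
      sbatch -p debug --time=00:30:00 my_test_job.sh

      # after it finishes, review how much time and memory the job actually used
      sacct -j <job_id> --format=JobID,Elapsed,MaxRSS,AllocCPUS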

10. Please refrain from using sudo. It will not work.