Glossary

Throughout our documentation wiki you will find many terms related to the use of High-Performance Computing (HPC). This glossary will help you better understand many of the concepts that distinguish HPC from other technologies and scientific instrumentation.

General Definitions

HPC Terminology

job – One or more commands (i.e., steps) contained in a script that is dispatched to a compute node to run unattended (i.e., in the background).
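
For example, a minimal job script might look like the sketch below; the module name, version, and script name are placeholders rather than actual Storrs HPC values:

    #!/bin/bash
    #SBATCH --job-name=my_analysis   # a name for the job
    #SBATCH --ntasks=1               # run a single task
    #SBATCH --time=01:00:00          # wall-clock limit of one hour

    # Each command below is a "step" that runs unattended on a compute node.
    module load python/3.9           # hypothetical module and version
    python my_script.py              # hypothetical analysis script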

Slurm – A job-scheduling application used by many HPC facilities (including Storrs HPC) to manage use of the cluster. Slurm has three major functions:

  • allocating users exclusive or non-exclusive access to compute nodes so they can perform work for some duration of time,

  • providing a framework for starting, running, and monitoring jobs, including parallel jobs that require multiple allocated nodes, and

  • determining the order in which pending jobs get allocated resources based on resource availability, wait time, and the user’s priority access.
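
As a sketch, these are the standard Slurm commands most users encounter; the commands themselves are part of Slurm, though their output varies by site:

    sbatch my_job.sh    # submit a job script to the scheduler
    squeue -u $USER     # monitor your pending and running jobs
    scancel <job_id>    # cancel a job by its ID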

Message Passing Interface (MPI) – A standardized means of passing information between the processes of a parallel program running across multiple compute nodes, each of which has its own local memory.
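
As an illustration, an MPI program is typically launched from within a job script that spans several nodes. This sketch assumes a hypothetical executable named my_mpi_program and a hypothetical MPI module:

    #!/bin/bash
    #SBATCH --nodes=2             # span two compute nodes
    #SBATCH --ntasks-per-node=4   # four MPI processes per node

    module load mpi               # hypothetical module name
    srun ./my_mpi_program         # srun starts one process per task across the nodes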

module – Essentially, a synonym for software. On a laptop, we click icons to open programs; on the HPC, we load modules and can then use the corresponding programs. However, some programs are incompatible with one another. The module system used on most clusters (including ours) allows many different, and sometimes incompatible, software applications to be available for users to search through and load as needed.
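
For instance, a typical module workflow looks like the following; the module name and version are illustrative, not a guarantee of what is installed:

    module avail              # search the software available on the cluster
    module load gcc/11.2.0    # load a specific (hypothetical) version of a compiler
    module list               # show which modules are currently loaded
    module unload gcc/11.2.0  # unload it when finished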

Graphical User Interface (GUI) – (pronounced like “gooey”) A visual interface that allows users to interact with a program by clicking buttons or icons, typing information, and using a mouse or touchscreen. For example, TikTok has a GUI that allows users to touch a phone’s screen to post videos, search, etc.

Command Line Interface (CLI) – (pronounced using its initials) A text-based way of interacting with a program by typing commands at a prompt, thereby telling the program what to do.
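
For example, a CLI session consists of typing a command at the prompt and pressing Enter; the path below is a hypothetical home directory:

    $ ls -l /home/username    # list the contents of a directory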

Hardware

cluster – A collection of technology components – including servers, networks, and storage – deployed together to form a platform for scientific computation. Also referred to as a High-Performance Computing (HPC) cluster, or a supercomputer.

processor – A server in the HPC environment contains two physical processors. Each processor is sometimes referred to as a socket, chip, or Central Processing Unit (CPU). Each of the two processors contains many individual cores. The two processors are connected to each other by a high-speed bus, as well as to other system components (memory, data storage, networks) that are local to that server.

core – A processor contains multiple cores, each of which can be used to execute instructions from a computational job.

compute node – A compute node has two physical processors, each with a finite number of cores. These cores are the unit of measurement used by the cluster's job scheduler when allocating resources. A node also contains a finite amount of random access memory (RAM), high-speed flash storage, and network interconnects.
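
Because cores and memory are the units the scheduler allocates, a job script requests them explicitly. A sketch with placeholder values:

    #SBATCH --ntasks=1          # one task...
    #SBATCH --cpus-per-task=8   # ...using eight cores on a single node
    #SBATCH --mem=16G           # 16 GB of RAM for the whole job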

login node – A login node validates each user’s credentials (i.e., their username and password) and is the node you land on after connecting to the cluster via SSH. The Storrs HPC has three login nodes.
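
Connecting typically looks like the following; the username and hostname are placeholders, not the actual Storrs HPC address:

    ssh your_username@cluster.example.edu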

GPU – A graphics processing unit (GPU) is a type of processor specially designed to handle computational tasks common in graphically intensive applications, such as highly parallelized matrix and vector operations. Also referred to as a graphics card or video card.
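
In Slurm, GPUs are requested as a "generic resource"; a sketch with a placeholder count:

    #SBATCH --gres=gpu:1    # request one GPU on the allocated node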

partition – A group of compute nodes that share similar hardware (e.g., GPUs), types of usage (e.g., long jobs vs. short, high-throughput jobs), and/or the level of priority required to access them.
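
You can list the partitions available to you and direct a job to a specific one; the partition name below is illustrative:

    sinfo                                 # list partitions and the state of their nodes
    sbatch --partition=general my_job.sh  # submit to a specific (hypothetical) partition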

InfiniBand – A high-speed, low-latency network that connects all compute nodes to each other and to data storage. This network is sometimes referred to as a fabric. It enables independent compute nodes to communicate with each other much faster than over a traditional network, allowing computational jobs that span multiple servers to operate more efficiently, often through the Message Passing Interface (MPI) described above.

memory – A component within your computer that provides short-term storage for the programs and data your computer is actively working on; also referred to as random access memory (RAM).

storage drive – A component of your computer that stores data you are not currently using so you can access it again in the future. Storage drives (e.g., hard disk drives, HDD, or solid-state drives, SSD) are also where the operating system and other applications are stored for long-term use.

Software & Installation

(coming soon!)

Basics of the Condo Model

The Storrs High-Performance Computing (HPC) cluster is delivered under a business model known as the "condo model". The glossary below defines terms related to this business model so that researchers have a clear understanding of its implications.

condo model – A financial model for managing HPC resources where faculty researchers fund the capital equipment costs of individual compute nodes, while the university funds the operating costs of running these nodes for five years.

priority compute node – The smallest unit of resources that a researcher can purchase is an individual server, known as a compute node, as described above in the General Definitions. The cost of a compute node includes the capital equipment costs associated with deploying that node, including external network and fabric interconnects, power delivery, software licenses, etc.

operating costs – The university centrally operates all compute nodes through University Information Technology Services (UITS) so that researchers can focus on science rather than systems management. The costs of running the cluster are split between operating costs and capital costs. Operating costs refer to all non-equipment costs, such as staff salaries and contracts for equipment maintenance.

capital costs – The actual physical equipment that the university purchases to operate the HPC cluster, including servers, switches, storage, etc. Most capital equipment is deployed for a period of five years.

priority users – Faculty who purchase compute nodes receive access to equivalent resources at a higher priority than other researchers. Faculty can designate others, such as their graduate students and postdoctoral researchers, to receive access at the same priority level. With priority access, computational jobs are moved to the front of the queuing system and are guaranteed to begin execution within twelve hours. A priority user can utilize their resources indefinitely. Access to resources is managed through the cluster's job scheduler (i.e., Slurm). Users do not receive direct access to compute nodes or privileged ("root") access.

open access users – Any user of HPC resources who has not contributed to the purchase of those resources. A computational job submitted by an open access user is placed into a queue until sufficient resources are available. This queue is prioritized by multiple factors, including a "fair-share" score. A user who has priority access to a subset of resources is considered an open access user on all other resources to which they did not contribute.

job prioritization – Multiple factors determine the priority assigned to any given computational job. The factor with the most weight is the partition to which the job is submitted; partitions used by priority users receive the highest priority. The second most important factor is a user’s fair-share score, which the job scheduler assigns to each user to help ensure that HPC resources are used equitably by all non-priority users. If a user has executed a disproportionately large number of jobs recently, they may temporarily receive a lower priority than a user who has not executed jobs recently. The third factor is the age of the job: the longer a job sits in the queue, the higher its priority grows. If every job has the same priority, then the scheduling method becomes "first in, first out".
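
Slurm exposes these factors through standard commands; the commands below are part of Slurm, though the exact output format varies by site:

    sprio -j <job_id>   # show the priority factors computed for a pending job
    sshare -u $USER     # show your recent usage and fair-share score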

job preemption – There is one partition, named general_requeue, in which jobs may be canceled at any time. If a job is canceled in this partition, the job scheduler will automatically resubmit it. Jobs using this partition should be short-running, or should frequently save their state so that they can resume when requeued.
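
A sketch of a requeue-friendly job script, assuming a hypothetical program that can save its state to, and resume from, a checkpoint file:

    #!/bin/bash
    #SBATCH --partition=general_requeue   # jobs here may be canceled at any time
    #SBATCH --requeue                     # allow the scheduler to resubmit this job

    # Hypothetical solver that periodically writes its state to checkpoint.dat
    # and resumes from that file automatically when the job is requeued.
    ./my_solver --checkpoint checkpoint.dat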