There are many reasons a job may fail. A good first step is to use the shist command to check the ExitCode SLURM gave it. The command follows this format: shist --starttime YYYY-MM-DD.
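For example, to list jobs that started on or after November 17, 2022 (a date chosen only to match the sample output below), you would run:

Code Block
shist --starttime 2022-11-17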

Here’s an example of the output for a job that failed immediately with an ExitCode of 1.

Code Block
JobID         Partition        QOS    JobName      User      State    Elapsed   NNodes      NCPUS        NodeList ExitCode                 End 
------------ ---------- ---------- ---------- --------- ---------- ---------- -------- ---------- --------------- -------- -------------------
73088        priority-+ erm12009g+  submit_rx  jdt10005     FAILED   00:00:00        1         32           gtx21      1:0 2022-11-17T10:05:34 

The ExitCode of a job will be a number between 0 and 255. An ExitCode of 0 means that, as far as SLURM is concerned, the job ran and completed properly. Any non-zero ExitCode indicates that the job failed. (In the output above, the ExitCode appears as 1:0; the number before the colon is the exit code, and the number after it is the signal that ended the job, if any.) You can then search for the ExitCode online to investigate what SLURM thinks caused the job to fail. Sometimes this is helpful, but not always. Either way, take note of what you find for future reference.
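As a minimal illustration of where a non-zero ExitCode comes from (the program name below is hypothetical), a batch job reports the exit status of its script, which is normally the exit status of the last command that ran:

Code Block
#!/bin/bash
#SBATCH --job-name=exitcode_demo
#SBATCH -n 1

# If this (hypothetical) program exits with status 1, the job ends in the
# FAILED state and shist reports an ExitCode of 1:0.
./my_analysis --input data.csv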

The next clue to investigate is the NodeList column. Sometimes a job fails because there is something wrong with the compute node it ran on. If the compute node is the problem (and Storrs HPC staff haven't already fixed it), the job should fail again with the same ExitCode. You can submit the job specifically to that same node to see whether it fails again. Try adding this to the #SBATCH header of your script to target a specific node. Here, we target gtx21 because that is the node listed in the NodeList column above.

Code Block
#SBATCH --nodelist=gtx21
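Alternatively, instead of editing the script, you can pin a few resubmissions to the node from the command line (the script name submit_rx.sh is an assumption based on the JobName in the example output above):

Code Block
# Resubmit the same job a few times, pinned to the suspect node gtx21.
for i in 1 2 3; do
    sbatch --nodelist=gtx21 submit_rx.sh
done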

Once you see the job fail multiple times on the same node, you can feel confident that a faulty node is the likely cause. Please submit a help request to Storrs HPC, including a screenshot of the shist output.

My jobs are failing due to insufficient memory, or with an “out of memory” (OOM) error. Why is this happening, and how do I fix it?

Short answer: If you received this error, your job most likely failed because it needed more memory (RAM) than the default allocation provides. You can request more memory for your job using the --mem or --mem-per-cpu flag.

Long answer: There are several reasons a job may fail from insufficient memory. The most likely reason this problem is suddenly affecting you is that the Storrs HPC Admins had to implement a new change to HPC 2.0. We'll explain why in a moment, but first, let's go over the solution (assuming this is the problem). As of January 2023, the default amount of memory available per CPU is 2 gigabytes. That default of 2 GB is okay for many users, but not for all. Users who receive OOM errors need to request more memory when submitting or initiating jobs. Those users can easily override the default using the --mem-per-cpu flag in their submission script. The new line would look something like this:

Code Block
#SBATCH -n 2                            # asks for 2 cores
#SBATCH --mem-per-cpu=3G                # asks for 3 GB of RAM per core, or 6 GB total

Adding this line tells SLURM to allocate 3 gigabytes of RAM per CPU you request. That means if you ask for 2 cores (-n 2), you'll be allocated 6 gigabytes of RAM total. Please note that the --mem-per-cpu flag must be used with the -n flag specifying the number of cores you want. Alternatively, users can use the --mem flag to specify the total amount of RAM they need, regardless of how many CPUs are requested.

Code Block
#SBATCH --mem=5G                        # asks for 5 gigabytes total

We encourage users to adjust the --mem-per-cpu or --mem flags in a step-wise fashion: first try 3 gigabytes, then 4, then 5, and so on, until your jobs run without failing from memory errors. That strategy helps ensure that every HPC user's jobs start quickly and run efficiently.
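As a minimal sketch of that step-wise approach (the script name my_job.sh is hypothetical), you could resubmit from the command line with progressively larger requests:

Code Block
sbatch --mem-per-cpu=3G my_job.sh    # start with 3 GB of RAM per core
# If the job still fails with an out-of-memory error, step it up:
sbatch --mem-per-cpu=4G my_job.sh
# ...and continue (5G, 6G, ...) until the job runs to completion.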

Now, we'll explain why this problem is suddenly affecting our users. On HPC 1.0, the memory flags didn't work, so there wasn't a great way to prevent jobs from failing due to insufficient memory, though the problem didn't happen very often. The memory flags do work on HPC 2.0, but the initial default settings were too strict: SLURM let only one job run per node because it assumed every job needed the entire node's memory. There were nodes with 128 cores and 500 GB of RAM where only 1 core and 1 gigabyte of RAM were actually in use. Jobs piled up in the queue, and wait times were very long.

So, the Storrs HPC Admins reset the default memory available per core to 2 gigabytes of RAM. We had to set it this low because some of the node architectures have much less memory than others. Resetting this default enabled SLURM to run more than one job per node (provided there is enough memory on that node), which reduced job wait times. This default mem-per-cpu of 2 GB is okay for many users, but not for all. Some of our users run more RAM-intensive programs, so the default is not sufficient for them and has to be adjusted manually in the #SBATCH header. For more info on fisbatch, srun, and #SBATCH flags, see this link.
