...

Short answer: If you received this error, your job most likely failed because the amount of memory (RAM) it needed was larger than the default. You can request more memory for your job using the --mem or --mem-per-cpu flag.

Long answer: There are several reasons a job may fail from insufficient memory. As of January 2023, the default amount of memory available per CPU is 2 gigabytes. The default mem-per-cpu of 2 GB is enough for many users, but not all. Users who receive OOM errors need to request more memory when submitting or initiating jobs. You can easily override the default with the --mem-per-cpu flag in your submission script. The new line would look something like this:

Code Block
#SBATCH -n 2                            # asks for 2 cores
#SBATCH --mem-per-cpu=3G                # asks for 3 GB of RAM per core, or 6 GB total

Adding this line tells SLURM to allocate 3 gigabytes of RAM per CPU you request. That means if you ask for 2 cores (-n 2), you will be allocated 6 gigabytes of RAM total. Please note that the --mem-per-cpu flag must be used with the -n flag specifying the number of cores you want. Alternatively, you can use the --mem flag to specify the total amount of RAM you need, regardless of how many CPUs are requested.

Code Block
#SBATCH --mem=5G                        # asks for 5 gigabytes total

We encourage users to adjust the --mem-per-cpu or --mem flag in a step-wise fashion: first try 3 gigabytes, then 4, then 5, and so on, until your jobs run without failing from memory errors. That strategy helps ensure that every HPC user's jobs start quickly and run efficiently. For more info on fisbatch, srun, and #SBATCH flags, see this link.
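Putting those flags into context, a minimal submission script might look like the sketch below. The program name my_program and the output file name my_job.out are placeholders; substitute your own.

Code Block
#!/bin/bash
#SBATCH -n 2                            # asks for 2 cores
#SBATCH --mem-per-cpu=3G                # 3 GB of RAM per core, 6 GB total; raise step-wise if OOM errors persist
#SBATCH --output=my_job.out             # placeholder output file name

./my_program                            # placeholder for your actual program or srun command

After the job finishes, sacct -j <jobid> --format=JobID,MaxRSS,State reports the peak memory each job step actually used, which makes it easier to choose the next memory value to try instead of guessing.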

My jobs are failing due to Timeout. I do not have access to priority; how can I resume a job after it times out?


Short answer: Once a job is cancelled by SLURM due to timeout, it cannot be resumed from that point because SLURM sets the exit code to “0”, which denotes job completion. As far as SLURM is concerned, the job is now complete, with no state to resume from.

Long answer: One thing you can try is to use the timeout command to stop your program just before SLURM does. You can tell from the return code whether the timeout was reached: timeout sets the exit code to “124” when it has to kill the program. If so, you can then requeue the job with scontrol. Try the following:

In your submission script, add the following:

Code Block
#SBATCH --open-mode=append
#SBATCH --time=12:00:00

Then, use the timeout command to call your program:

Code Block
timeout 11h ./your_program
if [[ $? == 124 ]]; then
  scontrol requeue $SLURM_JOB_ID
fi

Note: --open-mode=append ensures the output of each run is appended to the file specified by #SBATCH --output=, preserving the previous run’s output in the same file.
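Putting the two pieces together, a complete submission script might look like the sketch below. It assumes a 12-hour time limit as above; the output file name my_job.out and the program name your_program are placeholders.

Code Block
#!/bin/bash
#SBATCH --time=12:00:00                 # SLURM's hard limit for the job
#SBATCH --output=my_job.out             # placeholder output file name
#SBATCH --open-mode=append              # append each run's output to the same file

# Stop the program an hour before SLURM would, so we can requeue it ourselves.
timeout 11h ./your_program

# timeout exits with 124 when it had to kill the program, i.e. the time ran out.
if [[ $? == 124 ]]; then
  scontrol requeue $SLURM_JOB_ID
fi

Keep in mind that requeueing starts the program over from the beginning, so this approach is most useful when your_program writes its own checkpoints and can resume from them.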

Disclaimer: This is untested on Storrs HPC; however, it should work as long as everything else is working correctly.
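If you want to confirm how SLURM recorded a job that hit its time limit, you can look up its state and exit code with sacct (replace <jobid> with the job’s ID):

Code Block
sacct -j <jobid> --format=JobID,JobName,State,ExitCode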

Some of my files were deleted. Is it possible to recover them? If so, how?

...