...

Short answer: Once SLURM cancels a job for exceeding its time limit, the job cannot be resumed from that point. SLURM sets the exit code to “0”, which denotes job completion, so as far as SLURM is concerned the job is complete and there is no state to resume from.

Long answer: One thing you can try is to use the timeout command to stop your program shortly before SLURM would. When the limit is reached, timeout exits with code “124”, so the return code tells you whether your program was cut off. If it was, you can then requeue the job with scontrol.

In your submission script, add the following:

Code Block
#SBATCH --open-mode=append
#SBATCH --time=12:00:00

Then, use the timeout command to call your program:

Code Block
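# Stop the program after 11 hours, one hour before the 12-hour SLURM limit
timeout 11h ./your_program
# timeout exits with 124 when it had to stop the program
if [[ $? -eq 124 ]]; then
  scontrol requeue "$SLURM_JOB_ID"
fi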

Note: The --open-mode=append option ensures that the output of each run is appended to the file specified by #SBATCH --output=, so the previous run’s output is preserved in the same file.
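For reference, the two pieces above could be combined into a single submission script along the following lines. This is only a sketch: the function name run_or_requeue is hypothetical, ./your_program stands in for your own executable, and the 11h/12:00:00 values are simply the ones used above.

```shell
#!/bin/bash
#SBATCH --open-mode=append
#SBATCH --time=12:00:00

# Hypothetical helper: run a command under a time limit and requeue
# this job if the limit was hit.
# $1 is a duration understood by timeout (e.g. 11h); the rest is the command.
run_or_requeue() {
  local limit=$1
  shift
  timeout "$limit" "$@"
  local status=$?
  if [[ $status -eq 124 ]]; then
    # 124 is the status timeout returns when it had to stop the command
    echo "Time limit reached; requeueing job ${SLURM_JOB_ID}"
    scontrol requeue "$SLURM_JOB_ID"
  fi
  return "$status"
}

# Only launch the workload when running inside a SLURM job
if [[ -n "${SLURM_JOB_ID:-}" ]]; then
  run_or_requeue 11h ./your_program
fi
```

Keep in mind that a requeued job runs the submission script again from the top, so your program needs to save and reload its own progress if it should continue where it stopped.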

Disclaimer: This is untested on Storrs HPC; however, it should work as long as everything else is working correctly.

My job is not running. It says “JobHeldUser” and its state is “SE.” Why is this happening? And how do I get my job running again?

Short answer: Your job is being “held.” To release the job back into the job queue, use the scontrol release {JOBID} command.

Long answer: Your job failed. We have a separate FAQ on figuring out why a job failed; this answer focuses on why your job is being held. Failed jobs used to be re-queued automatically. This was a problem for a number of users because re-running the job would overwrite their previous data. In January 2024, we re-configured SLURM to prevent this problem. Now, when a job fails, it is not re-queued immediately. Instead, it is “held” out of the queue until the submitting user “releases” it. This change lets users make a conscious choice to re-queue their jobs. You can release held jobs using the commands below:

  1. To release a single job

    Code Block
    scontrol release {JOBID}
  2. To release multiple jobs

    Code Block
    scontrol release {JOBID_1},{JOBID_2},{JOBID_3}
  3. To release all jobs with a given job name

    Code Block
    scontrol release jobname={JOBNAME}
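If you have many held jobs, it may help to list them before releasing anything. The sketch below assumes the held jobs show the reason “JobHeldUser” in squeue, as described above; the function names held_ids_from and release_held_jobs are hypothetical helpers, not part of SLURM.

```shell
#!/bin/bash

# Read "JOBID REASON" lines on stdin and print the IDs whose reason is
# JobHeldUser as a single comma-separated list.
held_ids_from() {
  awk '$2 == "JobHeldUser" { print $1 }' | paste -sd, -
}

# Find the current user's held jobs and release them in one call.
release_held_jobs() {
  local ids
  # -h drops the header; %i is the job ID and %r is the reason
  ids=$(squeue -u "$USER" -h -o "%i %r" | held_ids_from)
  if [[ -z "$ids" ]]; then
    echo "No held jobs found"
    return 0
  fi
  echo "Releasing: $ids"
  scontrol release "$ids"
}
```

After sourcing these definitions, running release_held_jobs releases every matching job. Since re-running a job can overwrite its earlier output, check the printed list, or move your previous results, before relying on it.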

Please feel free to contact us at hpc@uconn.edu with any questions or concerns.

Some of my files were deleted. Is it possible to recover them? If so, how?

...