Short answer: Your job is being “held.” To release the job and resubmit it to the job queue, use the scontrol release {JOBID} command.

Long answer: Your job failed. We have a separate FAQ on figuring out why a job failed here, but this answer focuses on why your job is being held. Failed jobs used to be re-queued automatically. This caused problems for a number of users because re-running a job could overwrite the data from its previous run. In January 2024, we reconfigured SLURM to prevent this. Now, when a job fails, it is not immediately re-queued. Instead, it is “held” out of the queue until the submitting user “releases” it back into the queue, so re-queueing a failed job is always a conscious choice. You can release held jobs using the commands below:

  1. To release a single job

    scontrol release {JOBID}
  2. To release multiple jobs

    scontrol release {JOBID_1},{JOBID_2},{JOBID_3}
  3. To release all jobs with a given job name

    scontrol release jobname={JOBNAME}
  4. To release all of your held jobs

    squeue --me | grep ' SE ' | awk '{print $1}' | xargs -n1 scontrol release
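
As a variation on the last command, the sketch below compares the state field directly instead of grepping the whole squeue line, so a job whose name happens to be “SE” cannot match by accident. The function names held_job_ids and release_held_jobs are hypothetical and not part of Slurm. The sketch assumes a squeue that supports --me and an xargs that supports -r (GNU xargs), which skips scontrol entirely when no jobs are held.

```shell
# Read "JOBID STATE" lines on stdin and print the IDs of jobs in the SE state.
held_job_ids() {
    awk '$2 == "SE" { print $1 }'
}

# Release every held (SE) job belonging to the current user.
# Requires Slurm's squeue and scontrol; %i is the job ID, %t the compact state.
release_held_jobs() {
    squeue --me --noheader --format='%i %t' \
        | held_job_ids \
        | xargs -r -n1 scontrol release
}
```

After sourcing these definitions, running release_held_jobs should have the same effect as the one-line pipeline above. Running the squeue command on its own first is a reasonable way to preview which jobs would be released.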

If you release your jobs and they keep ending up back in the “held” state, something is probably failing within your submission script. In that case, cancel your jobs and start troubleshooting. Please note that jobs left in the queue in the “SE” (held) state will be cancelled after seven days.

Please feel free to contact us at hpc@uconn.edu with any questions or concerns.
