...
Expand | ||||
---|---|---|---|---|
Short answer: Once SLURM cancels a job because it reached its time limit, the job cannot be resumed from that point. SLURM sets the exit code to “0”, which denotes job completion, so as far as SLURM is concerned the job is complete, with no state to resume from. Long answer: One thing you can try is to use the timeout command to stop your program just before SLURM does. The return code tells you whether the timeout was reached: if it was, timeout exits with code “124”, and you can then requeue the job with scontrol. Try the following: In your submission script, add the following:
Then, use the timeout command to call your program:
Note: The Disclaimer: This is untested on Storrs HPC; however, it should work as long as everything else is working correctly. |
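The approach described above can be sketched as a submission script. This is a minimal, untested sketch rather than a verified recipe for any particular cluster: the `--requeue` and `--open-mode=append` directives, the 12-hour limit, the program name `./my_program`, the helper `run_with_requeue`, and the `REQUEUE_CMD` override are all assumptions introduced for illustration.

```shell
#!/bin/bash
# Assumption: allow this job to be requeued.
#SBATCH --requeue
# Assumption: append to the existing output file on each run instead of overwriting it.
#SBATCH --open-mode=append
# Hypothetical walltime limit of 12 hours.
#SBATCH --time=12:00:00

# Run a command under `timeout` and requeue the job if the limit was hit.
# Usage: run_with_requeue DURATION COMMAND [ARGS...]
run_with_requeue() {
    local limit="$1"
    shift
    timeout "$limit" "$@"
    local status=$?
    if [ "$status" -eq 124 ]; then
        # GNU timeout exits with 124 when the time limit was reached.
        echo "Time limit reached; requeueing job ${SLURM_JOB_ID:-unknown}"
        # REQUEUE_CMD is a hypothetical override so the logic can be tried outside SLURM.
        ${REQUEUE_CMD:-scontrol requeue} "${SLURM_JOB_ID:-}"
    fi
    return "$status"
}

# Only start the program when running inside a SLURM job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    # Stop after 715 minutes, 5 minutes before the 12-hour limit above.
    run_with_requeue 715m ./my_program
fi
```

A requeued job starts the script again from the beginning, so this only helps if the program writes checkpoints and can pick up from them when it restarts.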
My job is not running. It says “JobHeldUser” and its state is “SE.” Why is this happening? And how do I get my job running again?
Expand | ||||||
---|---|---|---|---|---|---|
Short answer: Your job is being “held.” To release the job and re-submit it to the job queue you can use the Long Answer: Your job failed. We have a separate FAQ on figuring out why a job failed here, but here we will focus on why your job is being held. Failed jobs used to be re-queued automatically. This was a problem for a number of users because re-running the job would overwrite their previous data. In January 2024, we re-configured SLURM to prevent this problem. Now, when a job fails, it is not immediately re-queued. Instead, it is “held” out of the queue until the submitting user “releases” it back into the queue. This change prevents jobs from re-queueing automatically and lets users make a conscious choice to re-queue their jobs. You can re-queue jobs using the commands below:
Please feel free to contact us at hpc@uconn.edu with any questions or concerns. |
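As a hedged illustration of releasing held jobs, the sketch below wraps `scontrol release`, which is the standard SLURM command for releasing a held job. The helper name `release_held_jobs` and the `SCONTROL` override are hypothetical, and the exact commands recommended for this cluster may differ.

```shell
#!/bin/bash
# Hypothetical helper: release every job ID passed as an argument.
# SCONTROL is an assumed override so the loop can be tried without SLURM.
release_held_jobs() {
    local job_id
    for job_id in "$@"; do
        ${SCONTROL:-scontrol} release "$job_id"
    done
}

# Example usage (the job ID is a placeholder):
#   release_held_jobs 1234567
#
# Release all of your jobs that squeue reports in the SE (SPECIAL_EXIT) state:
#   release_held_jobs $(squeue -u "$USER" --states=SPECIAL_EXIT --noheader --format=%i)
```

Before releasing a job, check that re-running it will not overwrite output you want to keep, since that is the situation the hold is meant to prevent.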
Some of my files were deleted. Is it possible to recover them? If so, how?
...