There are many reasons a job may fail. A good first step is to use the shist command to check the ExitCode SLURM assigned to the job. The command follows this format: shist --starttime YYYY-MM-DD. Here is an example of the output for a job that failed immediately with an ExitCode of 1.
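The output below came from an invocation along these lines (the date here is illustrative; use a date on or before the job you are investigating):

shist --starttime 2022-11-17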
JobID Partition QOS JobName User State Elapsed NNodes NCPUS NodeList ExitCode End
------------ ---------- ---------- ---------- --------- ---------- ---------- -------- ---------- --------------- -------- -------------------
73088 priority-+ erm12009g+ submit_rx jdt10005 FAILED 00:00:00 1 32 gtx21 1:0 2022-11-17T10:05:34
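If you ever need the same information from standard SLURM tooling (for example, on a cluster without shist), sacct can print the same columns. This is a sketch, assuming job accounting is enabled; the date is the same illustrative one as above:

sacct --starttime 2022-11-17 \
      --format=JobID,Partition,QOS,JobName,User,State,Elapsed,NNodes,NCPUS,NodeList,ExitCode,End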
The ExitCode of a job will be a number between 0 and 255. An ExitCode of 0 means that, as far as SLURM is concerned, the job ran and completed properly. Any non-zero ExitCode indicates that the job failed. You can then search the ExitCode online to investigate what SLURM thinks caused the job to fail. Sometimes this is helpful, but not always. Either way, take note of what you find for future reference. Common ExitCodes are listed here for reference:

0 → success
non-zero → failure
Exit code 1 indicates a general failure
Exit code 2 indicates incorrect use of shell builtins
Exit codes 3-124 indicate some error in the job (check software exit codes)
Exit code 125 indicates out of memory
Exit code 126 indicates the command cannot execute
Exit code 127 indicates the command was not found
Exit code 128 indicates an invalid argument to exit
Exit codes 129-192 indicate jobs terminated by Linux signals
  For these, subtract 128 from the number and match it to a signal code
  Enter kill -l to list signal codes
  Enter man signal for more information

Please note: when a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, delineated by a colon (:). In the example output above, the ExitCode 1:0 means the job exited with code 1 and was not terminated by a signal.
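As a quick sketch of the subtract-128 rule (the ExitCode 137 used here is hypothetical, not taken from the example job):

code=137                  # hypothetical signal-based ExitCode reported by SLURM
if [ "$code" -gt 128 ]; then
    sig=$((code - 128))   # 137 - 128 = 9
    kill -l "$sig"        # prints KILL, meaning the job was terminated by SIGKILL
fi

An ExitCode like 137 often shows up when a job is killed for exceeding its memory request, though the exact cause varies from cluster to cluster.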
The next clue to investigate is the NodeList column. Sometimes a job fails because there is something wrong with the compute node the job ran on. If the compute node is the problem (and Storrs HPC staff have not already fixed it), the job should fail again with the same ExitCode. You can submit the job specifically to that node to see whether it fails again. Try adding this to the #SBATCH header of your script to target a specific node; here, we target gtx21 because that was the node listed in the NodeList column above:
#SBATCH --nodelist=gtx21
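For context, here is a minimal sketch of a submission script with the directive in place. Everything except --nodelist=gtx21 is a placeholder (the job name, partition, resource request, and my_program are hypothetical; reuse the values from your failed job):

#!/bin/bash
#SBATCH --job-name=node_test    # hypothetical name for the test run
#SBATCH --partition=general     # placeholder; use the partition from the failed job
#SBATCH --ntasks=1              # placeholder; request the same resources as before
#SBATCH --nodelist=gtx21        # pin the test to the suspect node

# Re-run the same command that failed; ./my_program is a placeholder
./my_program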
Once you see the job has failed multiple times on the same node, you can feel confident that a faulty node is the likely cause. Please submit a help request to Storrs HPC, including a screenshot of the shist output.