...
There are many reasons a job may fail. A good first step is to use the Here’s an example of the output for a job that failed immediately with an
The The next clue to investigate is the
Once you see that the job has failed multiple times on the same node, you can be reasonably confident that a faulty node is the likely cause. Please submit a help request to Storrs HPC including a screenshot from the

It is recommended to check the exit codes listed in the output above. They are listed here for reference:

- 0 → success
- non-zero → failure
- Exit code 1 indicates a general failure
- Exit code 2 indicates incorrect use of shell builtins
- Exit codes 3-124 indicate some error in the job (check software exit codes)
- Exit code 125 indicates out of memory
- Exit code 126 indicates command cannot execute
- Exit code 127 indicates command not found
- Exit code 128 indicates invalid argument to exit
- Exit codes 129-192 indicate jobs terminated by Linux signals
  - For these, subtract 128 from the number and match the result to a signal code
  - Enter kill -l to list signal codes
  - Enter man signal for more information

When a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, separated by a colon (:).
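The exit code rules listed above can be applied mechanically. The following is a minimal sketch, not an official Storrs HPC tool: it assumes the exit status is available as a string of the form `exit_code:signal` (for example `137:0` or `0:9`), and the function names `describe_exit_code` and `signal_name` are hypothetical helpers introduced only for illustration. Signal names are looked up with Python's standard `signal` module, so they reflect the machine where the script runs.

```python
import signal


def signal_name(number: int) -> str:
    """Return the local name for a signal number, if Python knows one."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # Not every number has a named entry (e.g. most real-time signals).
        return "unknown signal"


def describe_exit_code(field: str) -> str:
    """Describe an 'exit_code:signal' string using the reference list above."""
    code_text, _, signal_text = field.partition(":")
    code = int(code_text)
    sig = int(signal_text) if signal_text else 0

    # A non-zero value after the colon means a signal ended the job or step.
    if sig:
        return f"terminated by signal {sig} ({signal_name(sig)})"
    if code == 0:
        return "success"
    # Codes 129-192 encode a signal: subtract 128 to get the signal number.
    if 129 <= code <= 192:
        sig = code - 128
        return f"terminated by signal {sig} ({signal_name(sig)})"

    named = {
        1: "general failure",
        2: "incorrect use of shell builtins",
        125: "out of memory",
        126: "command cannot execute",
        127: "command not found",
        128: "invalid argument to exit",
    }
    if code in named:
        return named[code]
    if 3 <= code <= 124:
        return "error in job (check software exit codes)"
    return "unknown exit code"


if __name__ == "__main__":
    for example in ["0:0", "1:0", "127:0", "137:0", "0:9"]:
        print(example, "->", describe_exit_code(example))
```

For instance, `137:0` falls in the 129-192 range, so the sketch subtracts 128 and reports signal 9, which can be cross-checked against the output of kill -l on the cluster.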
My jobs are failing due to insufficient memory, or with an “out of memory” (“OOM”) error. Why is this happening, and how do I fix it?
...