Steps to Successfully Troubleshoot on the HPC

Encountered an error and not sure what to do next? You are in the right place!

Step 1:

Gather information.

  • Did the job/command fail? Use the command “shist” to double-check. Note the “Elapsed”, “NodeList”, “ExitCode”, and “End” columns. Each of these gives you information on why the job could have failed.

  • What are the outputs (if there are any)? Search the output files and log files for lines that indicate an error has occurred.

    • Many programs will link to documentation or guides along with the logged error.

    • Check the log file from your program as well as the Slurm output log file. Is it a program error or a cluster error?

  • scontrol show job <job_number> prints detailed information about your job, including the paths to its log files (StdErr and StdOut).

    [username@login4 ~]$ scontrol show job 1234567
    JobId=1234567 JobName=example_job_name
       UserId=username(user_id) GroupId=group_name(group_id) MCS_label=N/A
       Priority=priority_value Nice=0 Account=account_name QOS=qos_name
       JobState=RUNNING Reason=None Dependency=(null)
       Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
       RunTime=19:18:20 TimeLimit=1-00:00:00 TimeMin=N/A
       SubmitTime=2024-06-10T14:12:14 EligibleTime=2024-06-10T14:12:14
       AccrueTime=2024-06-10T14:12:14
       StartTime=2024-06-10T17:10:35 EndTime=2024-06-11T17:10:35 Deadline=N/A
       SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-06-10T17:10:35 Scheduler=Main
       Partition=partition_name AllocNode:Sid=login_node:session_id
       ReqNodeList=(null) ExcNodeList=(null)
       NodeList=node_name
       BatchHost=node_name
       NumNodes=1 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
       TRES=cpu=32,mem=64G,node=1,billing=32,gres/gpu=8
       Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
       MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
       Features=feature_name DelayBoot=00:00:00
       OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
       Command=/path/to/command.sh
       WorkDir=/path/to/workdir
       StdErr=/path/to/job_1234567.log
       StdIn=/dev/null
       StdOut=/path/to/job_1234567.log
       Power=
       CpusPerTres=gpu:1
       TresPerNode=gres:gpu:8
       MailUser=email@example.com MailType=END
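The information-gathering checks above can be sketched as two small shell helpers. This is a minimal sketch, not an official cluster tool: the function names job_log_paths and find_error_lines are hypothetical, the keyword list is an assumption you will likely need to adjust for your program, and log paths that contain spaces are not handled.

```shell
#!/usr/bin/env bash
# Sketch only: helpers for the information-gathering step.
# Both function names are hypothetical; they are not part of Slurm.

# Read `scontrol show job` output on stdin and print only the
# StdOut=... and StdErr=... entries (one per line).
job_log_paths() {
    # scontrol prints whitespace-separated KEY=VALUE tokens, so walk
    # every field on every line and keep the two keys of interest.
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^(StdOut|StdErr)=/) print $i }'
}

# Print the numbered lines of a log file that look like errors.
# The match is case-insensitive; the keyword list is an assumption.
find_error_lines() {
    grep -n -i -E 'error|fail|killed|exceeded|traceback' "$1"
}

# Typical use on a login node (needs Slurm, so left commented out here):
#   scontrol show job 1234567 | job_log_paths
#   find_error_lines /path/to/job_1234567.log
# sacct is the standard Slurm accounting command and can report the
# same columns mentioned above:
#   sacct -j 1234567 --format=JobID,Elapsed,NodeList,ExitCode,End
```

Keeping the helpers as functions means they can be pasted into a terminal session or sourced from a file without running anything against the scheduler until you call them.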

Step 2:

Perform a simple Google search of the error.

Diagnose whether the error is HPC-related or program-related. Most common problems are covered in the first few Google search results, so make sure to spend some time going through those forums!
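As a rough first pass at the program-versus-cluster question, the ExitCode value can be split into its two parts. Slurm reports it as code:signal, where the first number is the exit status and the second is the signal that terminated the job, if any. The sketch below is only a heuristic under stated assumptions: the function name classify_exit_code and the wording of the hints are hypothetical, and the job State is not considered.

```shell
# Sketch only: give a first hint from a Slurm ExitCode string "code:signal".
# classify_exit_code is a hypothetical helper, not a Slurm command.
classify_exit_code() {
    code="${1%%:*}"     # text before the colon: the exit status
    signal="${1##*:}"   # text after the colon: the terminating signal, 0 if none

    if [ "$signal" != "0" ]; then
        # A signal can come from the scheduler (for example a cancellation)
        # or from the program crashing, so check both logs.
        echo "signal $signal: the job was terminated by a signal"
    elif [ "$code" != "0" ]; then
        echo "exit $code: the program reported a failure, check the program log"
    else
        echo "0:0: no failure recorded"
    fi
}

# Example calls:
#   classify_exit_code 1:0
#   classify_exit_code 0:9
```

Including the raw ExitCode value and the exact error line in your search query usually gives more relevant results than the hint text alone.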

Step 2.5:

Using Google and the Knowledge Base.

Just like the article you’re currently reading, the KB holds a wealth of information that can assist you. The KB’s own search function might not be perfect, but searching on Google and adding “UConn HPC Knowledge Base” to the end of your query can narrow down the results considerably.

Step 3:

Seek help locally.

Ask your PI, your peers, and anyone else you know who works on similar programs or projects. More often than not, you aren’t the first person to encounter the error. Troubleshooting along with your lab mates is one of the best ways to learn together!

Step 4:

Submit a ticket.

With all the information you’ve gathered, send an email to hpc@uconn.edu to start a ticket.

At minimum, the email should include:

  • JobID

  • Detailed steps or screenshots along with associated files used to recreate the error.

  • Actual output files/log files that indicate the error.

  • Any screenshots of other information that might be important.

  • Your NetID.
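One way to collect the items listed above before writing the email is a small helper that gathers them into a single directory. This is a minimal sketch under stated assumptions: make_ticket_bundle is a hypothetical helper, the file names summary.txt and scontrol.txt are arbitrary, and $USER is assumed to match your NetID.

```shell
# Sketch only: gather ticket material into one directory to attach to the email.
# make_ticket_bundle is a hypothetical helper; $USER is assumed to be your NetID.
make_ticket_bundle() {
    jobid="$1"; outdir="$2"; shift 2
    mkdir -p "$outdir" || return 1

    # Summary file with the identifiers the ticket needs.
    {
        echo "JobID: $jobid"
        echo "NetID: ${USER:-unknown}"
    } > "$outdir/summary.txt"

    # Save the scheduler's view of the job when Slurm is available here.
    if command -v scontrol >/dev/null 2>&1; then
        scontrol show job "$jobid" > "$outdir/scontrol.txt" 2>&1
    fi

    # Copy every log or input file named on the command line.
    for f in "$@"; do
        cp -- "$f" "$outdir/" || echo "could not copy $f" >&2
    done
}

# Example call (paths are placeholders):
#   make_ticket_bundle 1234567 "$HOME/ticket_1234567" /path/to/job_1234567.log /path/to/command.sh
```

Review the directory before attaching it, and add the steps needed to recreate the error in the body of the email.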


We appreciate you taking the time to read this and to follow these steps in your future projects!

Happy troubleshooting!

~ Storrs HPC