Monitoring SLURM job resources

To help you get the most out of our compute clusters, we encourage you to keep an eye on the resources your SLURM jobs are using. We track CPU, GPU, and RAM usage for each job, and you can view detailed statistics at https://monitor.ccs.uky.edu. These statistics can help you tune your workflows and request only the resources your jobs actually need.

On LCC and MCC, you can use the command squeue -u $USER -t R -o "%.18A %.8j %.8u %.2t %.20P" to list your running jobs in a simple table. The first column lists the JOBIDs of your currently running jobs. The -o format flag is essential for array jobs: without it, their tasks are shown as XXXXXX_YY, where XXXXXX is the array JOBID and YY is the TASKID, whereas the %A field prints the individual JOBID of each task.
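If the JOBIDs are needed in a script rather than read off the table, the first column can be extracted from the squeue output. The following is a minimal sketch: the helper name running_jobids is hypothetical (it is not part of SLURM), and it assumes the header row and column order produced by the format string above.

```shell
# Hypothetical helper (not part of SLURM): read squeue-style output on stdin
# and print only the first column (the JOBID), skipping the header row.
running_jobids() {
    awk 'NR > 1 && NF > 0 { print $1 }'
}

# Only query the scheduler where squeue is actually installed (e.g. a login node).
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -t R -o "%.18A %.8j %.8u %.2t %.20P" | running_jobids
fi
```

Each ID printed this way can be entered in the Slurm_Job_ID box of the dashboard.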

Log in with CILogon, the federated identity management service.

Click the "Sign in with CILogon" button.

After clicking on the "Sign in with CILogon" button, you will be redirected to the CILogon page.
Select an Identity Provider from the list. (Most users will select "University of Kentucky")
Then click on Log-on.
You will then be redirected to your identity provider's authentication page; in this example, that is the University of Kentucky's authentication page.
Log in with your Link Blue credentials. Do not add @uky.edu to your username.

Choose the correct SLURM Job Stats dashboard.

The left-hand column displays the dashboards available for MCC, and the right-hand column displays those for LCC. LCC has two SLURM Job Stats dashboards: “Compute Jobs” and “GPU Jobs.”
Select the dashboard reflecting the stats you are interested in. In this example, we are interested in Compute Jobs on LCC.

Explore the recorded stats.

First, enter the JOBID in the top-left entry box labeled Slurm_Job_ID and press Enter.
You should immediately see Job Information and graphs displayed. In this example, we have a job that is running across multiple nodes, as indicated in the Job Information panel.
Further, you can see that the job uses 99.8% of all CPUs allocated to it (this is across the multiple nodes!) and only 3.67% of the total allocated RAM, as indicated by the Overall CPU Efficiency and Overall RAM Efficiency sliders.
The next sections of the dashboard are node-specific. Due to the multi-node nature of this job, there will be three separate replicas of the stats, one for each node. The first row displays the average CPU usage over the last hour for the CPUs on the specific node.
Next is the Current CPU Utilization for the given node. Note that only cores 35, 39, 43, 46, and 47 are shown because those cores are allocated to the job on this node.
The next three plots present time series of the processor utilization. On top is the Indiv. Process CPU Usage plot, which displays an individual trace for each process running within the job on the node. In this particular case, there is a single process, and its name is hidden. Note that a single process can show more than 100% CPU usage if it is multi-threaded. In the middle is the Indiv. CPU Core Utilization time series, which displays the CPU utilization percentage per core; here the traces overlap and appear as a single line. At the bottom is the Node CPU Efficiency plot, which shows the average CPU utilization of the CPUs allocated to the job on the particular node.
Finally, there are two plots representing RAM usage. The top plot is a time series of the RAM usage for the node. The bottom plot is a time series of the RAM usage divided by the total amount of RAM requested for the particular node.
Below these plots, the next panel displays the same information, this time for the second node of the job.
By default, only data for the last 6 hours is displayed. If you are interested in longer-term stats, you can change the time range at the top right of the page.
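To make the relationship between the per-core plot and the Node CPU Efficiency plot concrete, the sketch below averages per-core utilization values in the way the description above suggests. The function name and the sample percentages are hypothetical, and the dashboard's exact computation may differ.

```shell
# Hypothetical helper: average the utilization (in percent) of the cores
# allocated to a job on one node. Each argument is one core's percentage.
node_cpu_efficiency() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR > 0) printf "%.2f\n", sum / NR }'
}

# Made-up example: five allocated cores, four fully busy and one idle.
node_cpu_efficiency 100 100 100 100 0
```

With these sample values the function prints 80.00. In the same made-up scenario, a single multi-threaded process keeping four cores fully busy would typically appear at roughly 400% in the per-process plot, while each of those cores stays at 100% in the per-core plot.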

Ideally, a job should approach 100% efficiency in both CPU and RAM usage.
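As a rough illustration of acting on a low RAM efficiency, the sketch below expresses usage as a percentage of the amount requested and derives a smaller memory request from an observed peak. The helper names, the 20% headroom, and the sample numbers are assumptions made for illustration only, not cluster policy.

```shell
# Hypothetical helper: usage as a percentage of the amount requested.
efficiency_pct() {
    awk -v used="$1" -v requested="$2" 'BEGIN { printf "%.2f\n", 100 * used / requested }'
}

# Hypothetical helper: peak usage in MB plus a headroom percentage, rounded up.
suggest_mem_mb() {
    peak_mb=$1
    headroom_pct=${2:-20}   # assumed default headroom, not a site recommendation
    echo $(( (peak_mb * (100 + headroom_pct) + 99) / 100 ))
}

# Made-up numbers: a job that requested 64000 MB but peaked at 16000 MB.
efficiency_pct 16000 64000
suggest_mem_mb 16000
```

For these sample numbers the script prints 25.00 and 19200. A value like the second one could then be passed to sbatch with the --mem option on the next submission, after checking that the observed peak is representative of the job's typical runs.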