Monitoring SLURM job resources

The Center for Computational Sciences (CCS) collects performance metrics from compute nodes and aggregates them into Grafana dashboards.

These dashboards let you monitor CPU, GPU, and memory usage for your SLURM jobs, which helps you diagnose performance issues and improve efficiency.


Step 1 — Identify Your Job ID

To list your currently running jobs on LCC or MCC, run:

squeue -u $USER -t R -o "%.18A %.8j %.8u %.2t %.20P"

The first column shows the SLURM Job ID.

Important:
The %A field in the -o format string prints a unique numeric Job ID for every task of an array job, instead of the default <base_job_id>_<index> array format.

Make a note of the Job ID.
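
If you look up jobs often, it may be convenient to print only the Job IDs so they are easy to copy. The sketch below wraps the same squeue query in a small helper. The function name running_job_ids is an illustrative assumption, not something CCS provides; -h (no header) and %A are standard squeue options.

```shell
# Hypothetical helper: print the SLURM Job IDs of your running jobs, one per line.
# -h suppresses the header row; %A prints a unique ID for every array task.
running_job_ids() {
    squeue -u "${USER:-$(id -un)}" -t R -h -o "%A"
}

# Example use on a login node: keep the last listed ID for the later steps.
# JOBID=$(running_job_ids | tail -n 1)
```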


Step 2 — Open the Grafana Monitoring Dashboard

Grafana login page showing username and password fields and a highlighted 'Sign in with CILogon' button.

Open a web browser and navigate to:

CCS Grafana monitoring portal


Step 3 — Sign in Using CILogon

  1. Click Sign in with CILogon.

  2. Select your identity provider
    (most users choose University of Kentucky).

  3. Click Log On.

  4. Enter your Link Blue credentials
    (do not include @uky.edu).


Step 4 — Select the SLURM Job Statistics Dashboard

Grafana Resource Monitoring page displaying dashboard options for MCC and LCC clusters including SLURM Job Stats dashboards.

After logging in, choose the SLURM Job Stats dashboard that matches your cluster (MCC or LCC) and job type:

    • Compute Jobs (CPU jobs)

    • GPU Jobs (GPU jobs)


Step 5 — Enter Your SLURM Job ID

Grafana SLURM Job Stats dashboard showing the Slurm_Job_ID input field and job efficiency gauges for CPU and memory usage.

  1. Locate the field labeled Slurm_Job_ID.

  2. Enter your Job ID.

  3. Press Enter.

The dashboard will update to display job statistics.


Step 6 — Interpret Job Efficiency Metrics

The Job Information panel displays:

  • Number of nodes used

  • CPU efficiency

  • Memory efficiency

Efficient jobs typically show:

  • high CPU efficiency

  • memory usage close to the requested allocation

Low efficiency may indicate:

  • over-requested resources

  • I/O bottlenecks

  • idle compute time
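
To make the two efficiency gauges concrete, the sketch below computes them as they are commonly defined: CPU efficiency as CPU time used divided by allocated cores times elapsed time, and memory efficiency as peak memory divided by requested memory. These definitions are an assumption; the dashboard may calculate its gauges differently. The function names and the sample numbers are illustrative only.

```shell
# Assumed definition: CPU efficiency (%) = CPU seconds used / (cores * elapsed seconds) * 100.
cpu_efficiency() {
    # $1 = CPU seconds used, $2 = allocated cores, $3 = elapsed wall-clock seconds
    awk -v used="$1" -v cores="$2" -v wall="$3" \
        'BEGIN { printf "%.1f\n", 100 * used / (cores * wall) }'
}

# Assumed definition: memory efficiency (%) = peak memory / requested memory * 100.
mem_efficiency() {
    # $1 = peak memory used, $2 = requested memory (same unit for both, e.g. GB)
    awk -v peak="$1" -v req="$2" 'BEGIN { printf "%.1f\n", 100 * peak / req }'
}

# Illustrative job: 8 cores for 1 hour (3600 s) that used 14400 CPU seconds
# and peaked at 8 GB out of 32 GB requested.
cpu_efficiency 14400 8 3600   # prints 50.0
mem_efficiency 8 32           # prints 25.0
```

In this made-up example the job kept about half of its cores busy and used a quarter of its requested memory, so a smaller request would likely have been sufficient.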


Step 7 — Review Node-Level Metrics

For multi-node jobs, separate metric panels appear for each node.

These panels show:

  • CPU utilization per core

  • process-level CPU usage

  • memory usage over time
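
To know which node panels to expect, you can list the nodes assigned to a job from the command line. The sketch below is one way to do that, assuming squeue and scontrol are available on the login node; the function name job_nodes is hypothetical. %N is squeue's node list field, and scontrol show hostnames expands a compressed node list into one hostname per line.

```shell
# Hypothetical helper: print one hostname per line for a running job.
job_nodes() {
    # $1 = SLURM Job ID
    # squeue prints the compressed node list; scontrol expands it.
    scontrol show hostnames "$(squeue -j "$1" -h -o "%N")"
}

# Example use on a login node:
# job_nodes "$JOBID"
```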


Step 8 — Adjust the Time Range

Grafana dashboard interface with the time range selector highlighted in the upper-right corner.

By default, Grafana displays recent data.

To view a longer time range:

  1. Click the time range selector in the upper-right corner.

  2. Choose a new range (for example, Last 24 hours).
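
When choosing a range, it can help to know how long the job has been running. squeue reports the time used through its %M field in the form [days-][hours:]minutes:seconds. The sketch below converts such a value into whole hours, rounded up, so you can pick a range at least that long. The function name elapsed_hours is an illustrative assumption.

```shell
# Convert a SLURM elapsed time ([D-][HH:]MM:SS) into whole hours, rounded up.
elapsed_hours() {
    echo "$1" | awk '{
        days = 0; t = $0
        # Split off the optional "days-" prefix.
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n == 3)      secs = p[1] * 3600 + p[2] * 60 + p[3]
        else if (n == 2) secs = p[1] * 60 + p[2]
        else             secs = p[1]
        secs += days * 86400
        hours = int(secs / 3600); if (secs % 3600 > 0) hours++
        print hours
    }'
}

# Example use on a login node (requires squeue):
# elapsed_hours "$(squeue -j "$JOBID" -h -o "%M")"
elapsed_hours "1-02:30:00"   # prints 27
```

A job that has run for 27 hours would not fit in Last 24 hours, so a longer range such as Last 2 days would be needed to see it from the start.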
