Monitoring a node's resources

We periodically collect metrics from our compute nodes and aggregate them into Grafana dashboards for visualization. If one of your jobs is experiencing problems and you'd like to monitor its resource usage (CPU, RAM, etc.), then this is the tool for you.

On LCC and MCC, you can use the command squeue -u $USER -t R to list all of your running jobs in a simple table. The last column in the printed table is the node (or list of nodes) that each job is running on.
Once you've identified one or more nodes you'd like to monitor, open a web browser and visit https://monitor.ccs.uky.edu to view our Grafana dashboards.
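
If you have many running jobs, picking node names out of the table by eye can be tedious. The following is a minimal sketch, not an official tool, that reads squeue's default table on standard input and prints each distinct value from the last column. The function name running_nodes and the sample job table (job IDs, partition, job names, user) are hypothetical and exist only to make the example self-contained.

```shell
# Print the unique node names from squeue's default table (read on stdin).
# Skips the header row and takes the last column of each remaining line.
running_nodes() {
    awk 'NR > 1 && NF > 0 { print $NF }' | sort -u
}

# Hypothetical sample output, for illustration only.
sample='  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  12345    normal  my_job1  user123  R    1:02:03      1 rome001
  12346    normal  my_job2  user123  R      10:15      1 rome002
  12347    normal  my_job3  user123  R       5:40      1 rome001'

printf '%s\n' "$sample" | running_nodes
```

On a login node you would pipe the real command into the function instead: squeue -u "$USER" -t R | running_nodes. For multi-node jobs, squeue prints a compressed list such as rome[001-003]; passing that string to scontrol show hostnames should expand it into one node name per line.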

Log in through CILogon, a federated identity management service.

Click the "Sign in with CILogon" button.

You will be redirected to the CILogon page.
Select an identity provider from the list (most users will select "University of Kentucky").
Then click "Logon".

You will now be redirected to your identity provider's authentication page. In this example, that is the University of Kentucky's authentication page.

Log in with your Link Blue credentials. Do not add @uky.edu to your username.

Upon logging in successfully, you will be presented with our Grafana dashboard.
Select the Cluster (MCC or LCC) you'd like to view metrics for.

After you click a cluster, you will see the default Node Statistics dashboard with a generic compute node selected.

Select the node you'd like to view metrics for. Assume for this example that the job experiencing problems is running on the node rome001.
Just click on the box labeled "Host" and search for or scroll down to find the node rome001.

The dashboard will now display metrics for rome001 only.

Select the time range you're interested in. By default, the metrics displayed will be for the last 24 hours. In the top-right corner of your screen,
you can click on the "Last 24 hours" box to select a time range you're interested in (for example, the last 7 days).
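
If you return to the same node often, you may be able to encode the node and time range in the dashboard URL rather than clicking through each time. Grafana generally accepts from and to query parameters with relative times (such as now-7d) and var-NAME parameters for dashboard variables. The sketch below rests on assumptions: that the "Host" box is backed by a dashboard variable named Host, and that you pass in the dashboard URL copied from your browser's address bar. The function name node_dashboard_link and the example dashboard path are hypothetical.

```shell
# Build a Grafana dashboard link for one node and a relative time range.
#   $1: dashboard URL copied from the browser's address bar
#   $2: node name, e.g. rome001
#   $3: relative range, e.g. 24h or 7d
# ASSUMPTION: the "Host" box is backed by a dashboard variable named "Host".
node_dashboard_link() {
    base=$1
    host=$2
    range=$3
    # Append with "&" if the URL already has a query string, otherwise "?".
    case $base in
        *\?*) sep='&' ;;
        *)    sep='?' ;;
    esac
    printf '%s%svar-Host=%s&from=now-%s&to=now\n' "$base" "$sep" "$host" "$range"
}

# Hypothetical dashboard path, for illustration only.
node_dashboard_link 'https://monitor.ccs.uky.edu/d/EXAMPLE/node-statistics' rome001 7d
```

If the URL you copied already contains var-Host, from, or to parameters, remove them before passing it in so they are not duplicated. If the generated link does not select the expected node, the variable name on the dashboard likely differs from the assumed one.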

View the metrics you're interested in. Many metrics are available in expandable/collapsible panels on the page. Scroll down and look through the panels to find the ones you need.

Now, suppose you're viewing a graph over a large time range and you notice an anomaly. You think you've found your problem, but you want to drill in to investigate further.
With your mouse, simply click the graph and drag over the region you're interested in to highlight it. Upon releasing your mouse, the dashboard will refresh to
"zoom into" the region you selected and display it with more granularity.
