How to profile a CUDA job with NVIDIA Nsight on LCC

Background

NVIDIA Nsight is a suite of performance analysis and debugging tools designed to help developers understand how their applications utilize GPUs. While CUDA itself provides the programming framework for NVIDIA GPUs, Nsight tools give insight into where applications spend time, how GPU resources are used, and where performance bottlenecks occur.

The Nsight family includes:

  • Nsight Systems (nsys)
    A system-wide profiler that captures CPU, GPU, memory, and interconnect activity across the application’s lifetime. It helps answer questions like:
    “Which kernels take the most time?”
    “Is the workload CPU-bound, GPU-bound, or I/O-bound?”
    “Are there synchronization issues between MPI ranks or threads?”

  • Nsight Compute (ncu)
    A kernel-level profiler that provides deep, per-kernel metrics such as memory bandwidth utilization, warp divergence, cache hit rates, and achieved occupancy. It is used to answer:
    “Why is this kernel underperforming?”
    “Is global memory access coalesced?”
    “Are we reaching peak throughput?”

Together, Nsight Systems and Nsight Compute allow developers and researchers to profile applications from the top down (application-level bottlenecks) and then bottom up (kernel-level inefficiencies). These tools are critical for high-performance computing (HPC), scientific simulation, and machine learning workloads running on GPU-enabled clusters.

 

Allocate an interactive GPU node (SLURM)

To profile interactively, request a GPU node with srun and start a shell on the compute host:

SLURM command to allocate an interactive GPU node
srun -p P4V12_SKY32M192_L \
     -A gol_<linkblue>_uksr \
     -t 1:00:00 \
     -n 8 \
     --gres=gpu:1 \
     --pty bash

What each flag means

  • -p P4V12_SKY32M192_L – the partition (GPU queue). Choose one of LCC’s GPU partitions:

    • P4V16_HAS16M128_L

    • P4V12_SKY32M192_L

    • P4V12_SKY32M192_D

    • V4V16_SKY32M192_L

    • V4V32_SKY32M192_L

    • A2V80_ICE56M256_L

  • -A gol_<linkblue>_uksr – the account to charge; replace <linkblue> with your PI’s linkblue ID.

  • -t 1:00:00 – walltime (hh:mm:ss).

  • -n 8 – total number of tasks.

  • --gres=gpu:1 – one GPU on the node.

  • --pty bash – open an interactive shell on the allocated node.
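For longer profiling runs, the same resources can be requested non-interactively with sbatch. A minimal batch-script sketch, assuming the same partition and account placeholders as above (the job name and output file name are arbitrary choices):

```shell
#!/bin/bash
#SBATCH -p P4V12_SKY32M192_L        # GPU partition (pick one from the list above)
#SBATCH -A gol_<linkblue>_uksr      # replace <linkblue> with your PI's linkblue ID
#SBATCH -t 1:00:00                  # walltime (hh:mm:ss)
#SBATCH -n 8                        # total tasks
#SBATCH --gres=gpu:1                # one GPU
#SBATCH -J matmul-profile           # job name (arbitrary)
#SBATCH -o matmul-profile-%j.out    # stdout file; %j expands to the job ID

module load ccs/cuda/12.2.0_535.54.03
nsys profile -o matmul_nsys --trace=cuda,osrt ./matmul
```

Submit the script with sbatch script-name.sh; the profiler report is written to the working directory when the job finishes.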

Load CUDA Module

Before running sanity checks or profiling, load the site-provided CUDA toolkit module. On LCC:

Module command to load CUDA toolkit
module load ccs/cuda/12.2.0_535.54.03

This ensures the correct compiler, headers, libraries, and Nsight tools are in your path.

 

Sanity Checks

After the CUDA module is loaded, confirm the environment:

Commands to verify CUDA environment and Nsight availability
hostname
nvidia-smi
which nsys && nsys --version
which ncu && ncu --version

 

Example CUDA Application (Matrix Multiplication)

To give the profilers something to measure, you can use a simple CUDA C program that multiplies two square matrices.

Save this to matmul.cu:

CUDA C program for matrix multiplication (matmul.cu)
#include <stdio.h>
#include <stdlib.h>

#define N 1024  // matrix size N x N

__global__ void matMulKernel(float *C, float *A, float *B, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    if (row < n && col < n) {
        for (int k = 0; k < n; k++) {
            sum += A[row * n + k] * B[k * n + col];
        }
        C[row * n + col] = sum;
    }
}

int main() {
    int size = N * N * sizeof(float);

    // Host memory
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize matrices
    for (int i = 0; i < N * N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 1.0f;
    }

    // Device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((N + 15) / 16, (N + 15) / 16);

    // Launch kernel
    matMulKernel<<<blocksPerGrid, threadsPerBlock>>>(d_C, d_A, d_B, N);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", h_C[0]);  // quick correctness check

    // Cleanup
    free(h_A); free(h_B); free(h_C);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}

 

Compile

Use nvcc from the CUDA module:

nvcc compilation command for the matrix multiplication program
nvcc -O2 -o matmul matmul.cu
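If you intend to drill down with Nsight Compute, it can help to also compile with -lineinfo, which embeds source line information for kernel-to-source correlation while keeping optimizations on (unlike the device-debug flag -G). A sketch:

```shell
# -lineinfo keeps -O2 optimizations but lets ncu map metrics back to source lines
nvcc -O2 -lineinfo -o matmul matmul.cu
```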

 

Profile with Nsight

Nsight Systems:

Nsight Systems profiling command for the matmul application
nsys profile -o matmul_nsys --trace=cuda,osrt ./matmul

Reports (.nsys-rep) can be copied to your workstation and opened with NVIDIA Nsight Systems.
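If you prefer to stay on the compute node, a text summary can be generated from the report directly, and Nsight Compute can then drill into the kernel itself. A sketch, assuming the report and binary names used above:

```shell
# Summarize the Nsight Systems report in the terminal (no GUI needed)
nsys stats matmul_nsys.nsys-rep

# Collect per-kernel metrics with Nsight Compute; writes matmul_ncu.ncu-rep
ncu -o matmul_ncu ./matmul

# Profile only the matMulKernel launches and print metrics to the terminal
ncu --kernel-name matMulKernel ./matmul
```

The resulting .ncu-rep file can likewise be copied to a workstation and opened in the Nsight Compute GUI.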

Figure: NVIDIA Nsight Systems timeline view displaying CPU and CUDA performance metrics across multiple threads, showing kernel memory usage, CUDA API calls, and profiling data for a matmul operation.


Center for Computational Sciences