How to profile a CUDA job with NVIDIA Nsight on LCC
Background
NVIDIA Nsight is a suite of performance analysis and debugging tools designed to help developers understand how their applications utilize GPUs. While CUDA itself provides the programming framework for NVIDIA GPUs, Nsight tools give insight into where applications spend time, how GPU resources are used, and where performance bottlenecks occur.
The Nsight family includes:
Nsight Systems (nsys)
A system-wide profiler that captures CPU, GPU, memory, and interconnect activity across the application’s lifetime. It helps answer questions like:
“Which kernels take the most time?”
“Is the workload CPU-bound, GPU-bound, or I/O-bound?”
“Are there synchronization issues between MPI ranks or threads?”
Nsight Compute (ncu)
A kernel-level profiler that provides deep, per-kernel metrics such as memory bandwidth utilization, warp divergence, cache hit rates, and achieved occupancy. It is used to answer:
“Why is this kernel underperforming?”
“Is global memory access coalesced?”
“Are we reaching peak throughput?”
Together, Nsight Systems and Nsight Compute allow developers and researchers to profile applications from the top down (application-level bottlenecks) and then bottom up (kernel-level inefficiencies). These tools are critical for high-performance computing (HPC), scientific simulation, and machine learning workloads running on GPU-enabled clusters.
Allocate an interactive GPU node (SLURM)
To profile interactively, request a GPU node with srun and start a shell on the compute host:
SLURM command to allocate an interactive GPU node
srun -p P4V12_SKY32M192_L \
-A gol_<PIlinkblue>_uksr \
-t 1:00:00 \
-n 8 \
--gres=gpu:1 \
--pty bash
What each flag means
-p P4V12_SKY32M192_L – the partition (GPU queue). Choose one of LCC’s GPU partitions:
P4V16_HAS16M128_L
P4V12_SKY32M192_L
P4V12_SKY32M192_D
V4V16_SKY32M192_L
V4V32_SKY32M192_L
A2V80_ICE56M256_L
-A gol_<PIlinkblue>_uksr – replace <PIlinkblue> with your PI’s linkblue account
-t 1:00:00 – walltime (hh:mm:ss)
-n 8 – total tasks
--gres=gpu:1 – one GPU on the node
--pty bash – open an interactive shell on the allocated node
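For longer profiling runs, the same resources can be requested non-interactively with a batch script. The sketch below mirrors the srun flags above; the job name and output filename are placeholders, not site requirements:

```shell
#!/bin/bash
#SBATCH -p P4V12_SKY32M192_L          # GPU partition (see list above)
#SBATCH -A gol_<PIlinkblue>_uksr      # replace with your PI's account
#SBATCH -t 1:00:00                    # walltime hh:mm:ss
#SBATCH -n 8                          # total tasks
#SBATCH --gres=gpu:1                  # one GPU on the node
#SBATCH -J matmul_profile             # job name (placeholder)
#SBATCH -o matmul_profile.%j.out      # stdout file (placeholder)

module load ccs/cuda/12.2.0_535.54.03
nsys profile -o matmul_nsys --trace=cuda,osrt ./matmul
```

Submit the script with sbatch; the profiler report lands in the job's working directory.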
Load CUDA Module
Before running sanity checks or profiling, load the site-provided CUDA toolkit module. On LCC:
Module command to load CUDA toolkit
module load ccs/cuda/12.2.0_535.54.03
This ensures the correct compiler, headers, libraries, and Nsight tools are in your path.
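If the exact version string has changed since this page was written, you can list the CUDA modules the cluster provides (assuming LCC uses a standard Lmod/Environment Modules setup):

```shell
# List the CUDA toolkit modules available under the ccs namespace
module avail ccs/cuda
```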
Sanity Checks
After the CUDA module is loaded, confirm the environment:
Commands to verify CUDA environment and Nsight availability
hostname
nvidia-smi
which nsys && nsys --version
which ncu && ncu --version
Example CUDA Application (Matrix Multiplication)
To verify that profiling works end to end, use a simple CUDA C program that multiplies two square matrices.
Save this to matmul.cu:
CUDA C program for matrix multiplication (matmul.cu)
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 1024  // matrix size N x N

__global__ void matMulKernel(float *C, float *A, float *B, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++) {
            sum += A[row * n + k] * B[k * n + col];
        }
        C[row * n + col] = sum;
    }
}

int main() {
    size_t size = N * N * sizeof(float);

    // Host memory
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize matrices with all ones
    for (int i = 0; i < N * N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 1.0f;
    }

    // Device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((N + 15) / 16, (N + 15) / 16);

    // Launch kernel
    matMulKernel<<<blocksPerGrid, threadsPerBlock>>>(d_C, d_A, d_B, N);
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    printf("C[0] = %f\n", h_C[0]);  // quick correctness check: expect 1024.000000

    // Cleanup
    free(h_A); free(h_B); free(h_C);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
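The example above omits error checking for brevity, so a failed allocation or kernel launch would silently produce garbage. A common pattern (illustrative, not part of the listing above) is to wrap every CUDA runtime call in a checking macro:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Abort with file/line and a readable message if a CUDA call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Usage sketch:
//   CUDA_CHECK(cudaMalloc((void**)&d_A, size));
//   matMulKernel<<<blocksPerGrid, threadsPerBlock>>>(d_C, d_A, d_B, N);
//   CUDA_CHECK(cudaGetLastError());       // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches asynchronous kernel errors
```

Checking cudaGetLastError() and cudaDeviceSynchronize() after the launch matters because kernel launches are asynchronous: errors may only surface at the next synchronizing call.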
Compile
Use nvcc from the CUDA module:
nvcc compilation command for the matrix multiplication program
nvcc -O2 -o matmul matmul.cu
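For kernel-level profiling it also helps to compile with -lineinfo, which embeds source-line information that Nsight Compute uses to correlate metrics with source lines, without the heavy slowdown of a -G debug build:

```shell
# -lineinfo enables Nsight Compute's source-level view
nvcc -O2 -lineinfo -o matmul matmul.cu
```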
Profile with Nsight
Nsight Systems:
Nsight Systems profiling command for the matmul application
nsys profile -o matmul_nsys --trace=cuda,osrt ./matmul
Reports (.nsys-rep) can be copied to your workstation and opened with NVIDIA Nsight Systems.
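After the system-wide trace identifies the hot kernel, drill into it with Nsight Compute. A minimal sketch (note that the full metric set is slow to collect, and --kernel-name restricts profiling to the matrix-multiply kernel):

```shell
# Collect the full metric set for matMulKernel launches only
ncu -o matmul_ncu --set full --kernel-name matMulKernel ./matmul
```

The resulting .ncu-rep report can likewise be copied to your workstation and opened in the Nsight Compute GUI.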
Figure: NVIDIA Nsight Systems timeline view showing CPU and CUDA activity across multiple threads, including kernel launches, memory usage, and CUDA API calls for the matmul run.