Using Intel's VTune Profiler on LCC/MCC.
Intel VTune Profiler is a performance analysis tool that helps identify hotspots, optimize threading, and analyze microarchitectural performance. This guide outlines the basic steps for profiling an application with VTune on both the Intel-based LCC cluster and the non-Intel MCC cluster.
1. Matrix Multiplication Example
Here is a simple matrix multiplication C++ program that we will use to demonstrate profiling with VTune. The program multiplies two 500x500 matrices.
matrix_mul.cpp:
#include <iostream>
#include <vector>
#include <chrono>
using namespace std;
using namespace chrono;
const int N = 500;
void matrixMultiply(const vector<vector<int>> &A, const vector<vector<int>> &B, vector<vector<int>> &C) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            C[i][j] = 0;
            for (int k = 0; k < N; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
int main() {
    // Initialize matrices A and B with some values
    vector<vector<int>> A(N, vector<int>(N, 1));
    vector<vector<int>> B(N, vector<int>(N, 2));
    vector<vector<int>> C(N, vector<int>(N));
    // Start timing
    auto start = high_resolution_clock::now();
    // Perform matrix multiplication
    matrixMultiply(A, B, C);
    // Stop timing
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(stop - start);
    // Print the time taken
    cout << "Time taken by matrix multiplication: " << duration.count() << " ms" << endl;
    return 0;
}
Compilation:
Compile the program with debugging symbols:
g++ -g -O2 -o matrix_mul matrix_mul.cpp
2. Launching the VTune GUI
Access Open OnDemand:
Navigate to http://ood.ccs.uky.edu (LCC) or http://mcc-ood.ccs.uky.edu (MCC).
Launch VNC Viewer:
Click on the "Interactive Apps" tab.
Select "Morgan Compute Cluster (MCC)" or "Lipscomb Compute Cluster (LCC)" from the dropdown menu below "Desktops".
Fill in the required details for your SLURM allocation, including:
Number of cores
Duration (in hours)
Partition
Click the "Launch" button.
Start VTune GUI
Once the desktop has loaded, start a terminal emulator.
Run the following on LCC/MCC to load the latest VTune version:
module load vtune/latest
Start the GUI with the command:
vtune-gui
3. Starting VTune Profiling
Once the VTune GUI is open:
Create a New Project
Click on "New Project".
Name your project (e.g., matrix_mul_analysis).
In the “Configure Analysis” tab, under "Application", click "Browse" and select the path to your compiled matrix_mul binary.
Configure Analysis Type
In the “How” portion of the “Configure Analysis” tab, press the down arrow to choose "Hotspots" as the type of analysis. This analysis focuses on identifying where the most CPU time is spent.
You can keep the default settings or customize the options if needed, for example by adding more advanced memory or threading analysis. Note that these advanced analyses are only available on LCC, our Intel-based cluster.
Run the Analysis
Click the "Start" button to run the profiling on your application. VTune will automatically collect data on CPU usage, including hotspots, execution time, and more.
Analyze the Results
After the analysis completes, VTune will show a summary report of the run:
Top Hotspots: These functions consume the most CPU time.
Call Stack: Shows the call hierarchy and which functions are the main contributors to the workload.
Source View: Allows you to view the specific lines in your code that are responsible for the highest CPU consumption.
For the matrix multiplication example, you should see that matrixMultiply is the top hotspot function.
4. Interpreting the VTune GUI Results
Here’s how to read the key metrics in VTune’s GUI:
Summary View:
This provides an overview of CPU time, effective time, and total thread count. For the matrix multiplication example, the key entries are:
CPU Time: Time spent executing code on the CPU.
Effective Time: Time during which the CPU was actively performing useful work.
Top Hotspots: These are functions where most of the CPU time is spent.
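To make the difference between CPU time and elapsed (wall-clock) time concrete, here is a small sketch that measures both around a workload using only the C++ standard library. It is an illustration, not how VTune collects its metrics: measure and busyWork are hypothetical names introduced here, and std::clock() only approximates the processor time used by the process.

```cpp
#include <chrono>
#include <ctime>
#include <functional>
#include <utility>

// Hypothetical helper: runs `work` once and returns
// {CPU seconds used by the process, elapsed wall-clock seconds}.
std::pair<double, double> measure(const std::function<void()> &work) {
    const std::clock_t cpuStart = std::clock();               // processor time
    const auto wallStart = std::chrono::steady_clock::now();  // elapsed time

    work();

    const std::clock_t cpuStop = std::clock();
    const auto wallStop = std::chrono::steady_clock::now();

    const double cpuSeconds =
        static_cast<double>(cpuStop - cpuStart) / CLOCKS_PER_SEC;
    const double wallSeconds =
        std::chrono::duration<double>(wallStop - wallStart).count();
    return {cpuSeconds, wallSeconds};
}

// Hypothetical CPU-bound workload: sums the integers 0 .. n-1.
long long busyWork(long long n) {
    volatile long long sum = 0;  // volatile keeps the loop from being optimized away
    for (long long i = 0; i < n; ++i) {
        sum = sum + i;
    }
    return sum;
}
```

For a single-threaded, CPU-bound workload such as the matrix multiplication above, the two values should be close. On Linux, std::clock() counts processor time across all threads of the process, so with several busy threads the CPU figure can exceed the elapsed time, which is also why a CPU Time larger than the program's run time is not necessarily an error.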
Hotspots:
Click on "Hotspots" to drill down into the top time-consuming functions. You should see that the matrixMultiply function takes the majority of the time.
Source View:
You can double-click on any hotspot function to see the source code and corresponding CPU usage for specific lines of code. This is useful for identifying bottlenecks within the code.
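As a hypothetical illustration of acting on line-level data: if Source View attributes most of the time to the innermost statement, one thing to check is how that statement walks through memory. In the original loop order, B[k][j] moves down a column as k increases, so each iteration touches a different row vector. The sketch below is not part of the original program (the name matrixMultiplyIKJ and the Matrix alias are assumptions made here); it interchanges the two inner loops so that B and C are both traversed along rows. Whether this is actually faster on LCC or MCC should be confirmed by profiling again.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Computes the same product as the i-j-k version, but with the j and k loops
// interchanged so the innermost loop reads B[k] and updates C[i] contiguously.
// Assumes A is n x m, B is m x p, and every row has a consistent length.
Matrix matrixMultiplyIKJ(const Matrix &A, const Matrix &B) {
    const std::size_t n = A.size();
    const std::size_t m = B.size();
    const std::size_t p = m ? B[0].size() : 0;
    Matrix C(n, std::vector<int>(p, 0));  // zero-initialized accumulator

    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < m; ++k) {
            const int a = A[i][k];  // reused for the whole inner loop
            for (std::size_t j = 0; j < p; ++j) {
                C[i][j] += a * B[k][j];
            }
        }
    }
    return C;
}
```

Because the result is identical to the original loop order, the two versions can be profiled back to back and compared in the Hotspots view.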
5. Optimizing the Matrix Multiplication Code
Based on the VTune results, if the matrix multiplication function is identified as a hotspot, here are some ways to optimize it:
1. Algorithmic Optimization:
You can implement a more efficient matrix multiplication algorithm such as Strassen's algorithm, which reduces the time complexity from O(n^3) to roughly O(n^2.81).
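For context, the saving comes from replacing the eight half-size block products of the straightforward divide-and-conquer scheme with seven. A short sketch of the standard recurrences, assuming n is a power of two (a 500x500 matrix would need padding, or a fallback to the classical loop for odd-sized blocks):

```latex
\begin{aligned}
T_{\text{classical}}(n) &= 8\,T_{\text{classical}}(n/2) + \Theta(n^2)
  = \Theta\!\left(n^{\log_2 8}\right) = \Theta(n^3),\\
T_{\text{Strassen}}(n) &= 7\,T_{\text{Strassen}}(n/2) + \Theta(n^2)
  = \Theta\!\left(n^{\log_2 7}\right) \approx \Theta(n^{2.807}).
\end{aligned}
```

The extra additions and temporary matrices mean the asymptotic gain typically only pays off above some matrix size, so it is worth profiling both versions before committing to the change.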
2. Parallelization:
Consider parallelizing the matrix multiplication using OpenMP to take advantage of multiple CPU cores.
Example of parallelizing the matrixMultiply function (compile with g++'s -fopenmp flag so that the pragma takes effect):
void matrixMultiply(const vector<vector<int>> &A, const vector<vector<int>> &B, vector<vector<int>> &C) {
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            C[i][j] = 0;
            for (int k = 0; k < N; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
With OpenMP, VTune will also show thread-level performance in the Threading tab, giving you insights into parallel execution efficiency.
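For experimenting outside the full program, here is a self-contained sketch of the same idea. It assumes an OpenMP-enabled build; without OpenMP support the pragma is ignored and the loop simply runs serially. The name matrixMultiplyParallel and the Matrix alias are illustrative choices, not part of the original code.

```cpp
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Multiplies A (n x m) by B (m x p) and stores the result in C,
// which the caller must already have sized to n x p.
// Each iteration of the outer loop writes only row i of C, so rows can be
// distributed across threads without any synchronization.
void matrixMultiplyParallel(const Matrix &A, const Matrix &B, Matrix &C) {
    const int n = static_cast<int>(A.size());
    const int m = static_cast<int>(B.size());
    const int p = m > 0 ? static_cast<int>(B[0].size()) : 0;

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < p; ++j) {
            int sum = 0;  // declared inside the parallel loop, so private to each thread
            for (int k = 0; k < m; ++k) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
```

Accumulating into a local sum rather than into C[i][j] keeps the inner loop free of repeated vector indexing on the output. When profiling, the OMP_NUM_THREADS environment variable can be set to match the number of cores requested in your SLURM allocation.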
6. Best Practices for VTune GUI Usage
Focus on Hotspots: After profiling, focus your optimization efforts on the functions consuming the most CPU time.
Use Source View: Utilize the Source View in VTune to get detailed line-by-line analysis and make targeted code improvements.
Explore Different Analysis Types: Beyond hotspots, VTune offers various analysis types such as Threading, Memory Access, and I/O to gain deeper insights into your application’s performance.