Using Intel's VTune Profiler on LCC/MCC.

Intel VTune Profiler is a performance analysis tool that helps identify hotspots, optimize threading, and analyze microarchitectural performance. This guide outlines the basic steps for profiling an application with VTune and applies to both Intel (LCC) and non-Intel (MCC) processors.

1. Matrix Multiplication Example

Here is a simple matrix multiplication C++ program that we will use to demonstrate profiling with VTune. The program multiplies two 500x500 matrices.

matrix_mul.cpp:

C++ matrix multiplication program (matrix_mul.cpp)
#include <iostream>
#include <vector>
#include <chrono>

using namespace std;
using namespace chrono;

const int N = 500;

void matrixMultiply(const vector<vector<int>> &A, const vector<vector<int>> &B, vector<vector<int>> &C) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            C[i][j] = 0;
            for (int k = 0; k < N; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

int main() {
    // Initialize matrices A and B with some values
    vector<vector<int>> A(N, vector<int>(N, 1));
    vector<vector<int>> B(N, vector<int>(N, 2));
    vector<vector<int>> C(N, vector<int>(N));

    // Start timing
    auto start = high_resolution_clock::now();

    // Perform matrix multiplication
    matrixMultiply(A, B, C);

    // Stop timing
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(stop - start);

    // Print the time taken
    cout << "Time taken by matrix multiplication: " << duration.count() << " ms" << endl;

    return 0;
}

Compilation:

Compile the program with debugging symbols:

Compile matrix_mul.cpp with debugging symbols
g++ -g -O2 -o matrix_mul matrix_mul.cpp

2. Launching the VTune GUI

  1. Access Open OnDemand:

  2. Launch VNC Viewer:

    • Click on the "Interactive Apps" tab.

    • Select "Morgan Compute Cluster (MCC)" or "Lipscomb Compute Cluster (LCC)" from the dropdown menu below "Desktops".

    • Fill in the required details for your SLURM allocation, including:

      • Number of cores

      • Duration (in hours)

      • Partition

    • Click the "Launch" button.

  3. Start VTune GUI

    • Once the desktop has loaded, start a terminal emulator.

    • Run the following command on LCC/MCC to load the latest VTune version:

      Load VTune module

      module load vtune/latest
    • Start the GUI with the command:

      Launch VTune GUI

      vtune-gui

3. Starting VTune profiling

Once the VTune GUI is open:

  1. Create a New Project

    1. Click on "New Project".

    2. Name your project (e.g., matrix_mul_analysis).

    3. In the “Configure Analysis” tab, under "Application", click "Browse" and select the path to your compiled matrix_mul binary.

  2. Configure Analysis Type

    1. In the “How” portion of the “Configure Analysis” tab, press the down arrow to choose "Hotspots" as the type of analysis. This analysis focuses on identifying where the most CPU time is spent.

    2. You can keep the default settings or customize the options if needed, for example by adding more advanced memory or threading analysis. (Note: these advanced options are only available on LCC, our Intel-based cluster.)

  3. Run the Analysis

    1. Click the "Start" button to run the profiling on your application. VTune will automatically collect data on CPU usage, including hotspots, execution time, and more.

  4. Analyze the Results

    1. After the analysis completes, VTune will show a summary report of the run:

      1. Top Hotspots: These functions consume the most CPU time.

      2. Call Stack: Shows the call hierarchy and which functions are the main contributors to the workload.

      3. Source View: Allows you to view the specific lines in your code that are responsible for the highest CPU consumption.

For the matrix multiplication example, you should see that matrixMultiply is the top hotspot function.

4. Interpreting the VTune GUI Results

Here’s how to read the key metrics in VTune’s GUI:

Summary View:

This provides an overview of CPU time, effective time, and total thread count. For the matrix multiplication example, the key entries to look at are:

  • CPU Time: Time spent executing code on the CPU.

  • Effective Time: CPU time spent running application code, excluding spin and overhead time.

  • Top Hotspots: These are functions where most of the CPU time is spent.

Hotspots:

Click on "Hotspots" to drill down into the top time-consuming functions. You should see that the matrixMultiply function takes the majority of the time.

Source View:

You can double-click on any hotspot function to see the source code and corresponding CPU usage for specific lines of code. This is useful for identifying bottlenecks within the code.

5. Optimizing the Matrix Multiplication Code

Based on the VTune results, if the matrix multiplication function is identified as a hotspot, here are some ways to optimize it:

1. Algorithmic Optimization:

  • You can implement a more efficient matrix multiplication algorithm such as Strassen's algorithm, which reduces the time complexity from O(n^3) to roughly O(n^2.81).

2. Parallelization:

  • Consider parallelizing the matrix multiplication using OpenMP to take advantage of multiple CPU cores.

    Example of parallelizing the matrixMultiply function:

    Parallelized matrix multiplication using OpenMP

    void matrixMultiply(const vector<vector<int>> &A, const vector<vector<int>> &B, vector<vector<int>> &C) {
        #pragma omp parallel for
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) {
                C[i][j] = 0;
                for (int k = 0; k < N; ++k) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
    }

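    Recompile with the -fopenmp flag so that g++ honors the pragma; without it the loop still runs serially. With OpenMP, VTune will also show thread-level performance in the Threading tab, giving you insights into parallel execution efficiency.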

6. Best Practices for VTune GUI Usage

  • Focus on Hotspots: After profiling, focus your optimization efforts on the functions consuming the most CPU time.

  • Use Source View: Utilize the Source View in VTune to get detailed line-by-line analysis and make targeted code improvements.

  • Explore Different Analysis Types: Beyond hotspots, VTune offers various analysis types such as Threading, Memory Access, and I/O to gain deeper insights into your application’s performance.

Center for Computational Sciences