Intel's VTune Amplifier is a performance profiling tool for C, C++, and Fortran code that can identify where in the code time is being spent in both serial and threaded applications. For threaded applications it can also determine the amount of concurrency and identify bottlenecks created by synchronization primitives.
| x86_64 Linux | /usr/local/tools/vtune* | Multiple versions are available. Use Dotkit to load. |
Before getting started, compile your code with -g to generate debug information that allows VTune Amplifier to correlate timing information with specific locations in your source code. Keep your usual optimization flags enabled so that the profile gives an accurate representation of production run times. VTune Amplifier uses dynamic instrumentation and thus does not require the Intel compilers or any other special compiler flags.
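As a minimal sketch of the build step described above, the following composes a compile line that keeps optimization on while adding debug information. The compiler (gcc), the optimization level (-O2), and the file names (my_app.c, my_app) are assumptions for illustration only; substitute your own build settings.

```shell
#!/bin/sh
# Hypothetical compiler and file names; only -g comes from the text above.
CC=gcc
OPT_FLAGS="-O2"      # keep your production optimization level (assumed -O2 here)
DEBUG_FLAGS="-g"     # debug info so samples can be mapped back to source lines
compile_cmd="$CC $OPT_FLAGS $DEBUG_FLAGS -o my_app my_app.c"

# Print the command for review; run it with: eval "$compile_cmd"
echo "$compile_cmd"
```

The command is printed rather than executed so that the flags can be checked against an existing build system first.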
VTune Amplifier includes both a graphical user interface (GUI) and a command line (CL) interface, accessed with the amplxe-gui and amplxe-cl commands, respectively. When running the GUI, start by creating a new project and entering the executable path, arguments, environment variables, and any other options. Once the VTune Amplifier project is created, set up a new analysis.
The analysis types include hotspots, concurrency, and locks and waits. The hotspots analysis profiles your code's execution to determine which functions consume the most time and are therefore targets for optimization; it includes timing information from all threads and from sub-processes. The concurrency analysis measures how well a threaded application takes advantage of multi-core hardware and identifies functions and times during execution where available CPUs aren't fully utilized. The locks and waits analysis adds the ability to identify synchronization points that contribute to underutilization of CPUs. VTune Amplifier uses sampling to gather profile information and should incur only about 5% execution-time overhead.
Once the analysis type is configured, click "Start" to run the analysis. Note that there is also a "Show Command Line" button in the bottom right-hand corner, which is a handy way to generate the CL equivalent of your project and analysis configuration.
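The three analysis types described above can also be collected from the CL interface. The sketch below prints one amplxe-cl command per analysis, following the same option layout as the MPI example in this document. The application name, its arguments, and the result directory names are hypothetical, and the analysis names concurrency and locksandwaits are assumed from this generation of VTune Amplifier; the "Show Command Line" button in the GUI is the authoritative source for your installed version.

```shell
#!/bin/sh
# Hypothetical application and arguments.
APP="./my_app arg1 arg2"

# One collection per analysis type, each writing to its own results directory.
# Names other than "hotspots" are assumptions; confirm them for your version.
cmds=$(for analysis in hotspots concurrency locksandwaits; do
    echo "amplxe-cl -collect $analysis -r result_$analysis -- $APP"
done)

# Print the commands for review instead of running them.
echo "$cmds"
```

Using a separate results directory for each analysis keeps the collections from colliding and makes it clear which result to open later.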
After the analysis is complete, several options allow for viewing the collected data. A set of tabs/buttons near the top of the GUI lets you choose among various windows. The Summary window gives an overview of the analysis. The Bottom-up window displays performance data from the perspective of the bottom-level functions. The Top-Down Tree window displays inclusive and exclusive performance data from the perspective of the function call stacks during execution. Within the Bottom-up and Top-Down Tree windows, a call stack pane displays the stack trace for the sampled data and a timeline pane shows the CPU activity of the threads over time. At the bottom of these windows there are also filters that let you restrict the data to a particular executed module, individual thread, or specific process. Above the window tabs/buttons you can also change the viewpoint by clicking on the yellow wrench next to the analysis title. Each viewpoint is a preset configuration that filters the performance data to focus on specific performance issues. The GUI contains a wealth of other features that are beyond the scope of this document; refer to the product documentation for more information.
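Collected results can also be inspected without stepping through the GUI windows. The sketch below prints commands that would produce text reports with amplxe-cl and open an existing result in amplxe-gui. The results directory name my_result is hypothetical, and the report names summary and hotspots are assumed to be available in your installed version.

```shell
#!/bin/sh
# Hypothetical results directory from an earlier collection.
RESULT_DIR=my_result

summary_cmd="amplxe-cl -report summary -r $RESULT_DIR"     # overall overview of the run
hotspots_cmd="amplxe-cl -report hotspots -r $RESULT_DIR"   # functions ranked by time
gui_cmd="amplxe-gui $RESULT_DIR"                           # open the same result in the GUI

# Print the commands for review instead of running them.
printf '%s\n' "$summary_cmd" "$hotspots_cmd" "$gui_cmd"
```

Text reports are convenient on login nodes or in batch jobs where an X display is unavailable; the GUI remains the better choice for the call stack and timeline panes.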
LC CHAOS Linux systems provide an MPI-enabled CL interface that can be accessed via amplxe-cl-mpi. This command takes the same arguments as the serial amplxe-cl command and automatically appends the MPI rank of each process to the name of the results directory. An example usage with MPI would be to run
srun -n 16 amplxe-cl-mpi -r my_result -collect hotspots -- my_mpi_app arg1 arg2
to create results directories my_result.0 through my_result.15. The GUI does not provide a mechanism to run an analysis of an MPI application; amplxe-cl-mpi is currently the only way to run VTune Amplifier on MPI applications. The GUI also does not provide a way to view aggregated MPI results, so each MPI task's results must be opened individually in the GUI.
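Because each rank's results land in a separate directory, one way to get a quick overview is to generate a text report per rank. The sketch below prints the commands for the 16-rank example above. The hotspots report name and the .txt output file names are assumptions for illustration.

```shell
#!/bin/sh
# Result prefix and rank count matching the srun example above.
RESULT_PREFIX=my_result
NUM_RANKS=16

rank=0
report_cmds=""
while [ "$rank" -lt "$NUM_RANKS" ]; do
    # One text report per rank: my_result.0 -> my_result.0.txt, and so on.
    dir="$RESULT_PREFIX.$rank"
    report_cmds="${report_cmds}amplxe-cl -report hotspots -r $dir > $dir.txt
"
    rank=$((rank + 1))
done

# Print the commands for review instead of running them.
printf '%s' "$report_cmds"
```

The resulting text files can then be compared with ordinary tools such as diff or grep to spot ranks whose profiles differ from the rest.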
As of VTune version 2015, some, but not all, of the previously unavailable features are now available (via Linux Perf), particularly with respect to gathering performance counters. The "Advanced Hotspots" and "General Exploration" analyses should now be runnable. However, some options, such as gathering call counts, loop counts, and context switches in the Advanced Hotspots analysis, are not obtainable without the Intel drivers installed. Similarly, memory access and bandwidth information is not obtainable with just Linux Perf. Options that are unavailable may result in an error message stating, "Cannot enable Hardware Event-based Sampling: problem with the driver (sep*/sepdrv*)." These capabilities are enabled on a small set of nodes on various LC systems (rzoz and hyperion).
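To anticipate the driver error quoted above, it can help to check whether an Intel sampling driver is loaded on the node before choosing analysis options. The following is a rough sketch: the helper name has_sep_driver is hypothetical, and matching module names that begin with "sep" is an assumption based only on the "sep*/sepdrv*" wording of the error message.

```shell
#!/bin/sh
# Succeeds if a kernel module whose name starts with "sep" appears in the
# given module list (default: /proc/modules). The name pattern is an
# assumption taken from the "sep*/sepdrv*" wording of the error message.
has_sep_driver() {
    modules_file="${1:-/proc/modules}"
    [ -r "$modules_file" ] && grep -Eq '^sep' "$modules_file"
}

if has_sep_driver; then
    echo "sep driver loaded: driver-dependent options may be available"
else
    echo "no sep driver found: expect Linux Perf-based collection only"
fi
```

Run the check on the compute node where the job will execute, since login nodes and compute nodes may be configured differently.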
VTune Amplifier provides a tutorial in /usr/local/tools/vtune/documentation/en/tutorials/index.htm or on Intel's Web site. The example C++ code in the tutorial can be found in /usr/local/tools/vtune/samples/en/C++/.
The VTune Amplifier documentation can be found in /usr/local/tools/vtune/documentation/en/documentation_amplifier.htm or on Intel's Web site.
For more information, visit Intel's VTune Amplifier Web page.