Apollo: Fast, Lightweight, Dynamic Tuning for Data-Dependent Code
Background
Performance portability is a critical capability in parallel simulation codes. Multi-physics codes contain thousands of independent kernels, and developers typically fix each kernel's architectural tuning parameters at compile time. Existing auto-tuning approaches run many instances of a kernel while performing a guided search of the performance parameter space, so even fast searches take minutes, and tuning pays off only if the code's behavior changes slowly. In adaptive mesh refinement simulations, for example, where the code's behavior can evolve during a run, tuning a kernel for one set of inputs may not improve performance on another; each kernel may need a different tuning over time. To tune adaptive applications for the best performance, statically optimized code paths need to be chosen dynamically.
Approach and Benefits
Apollo, an auto-tuning extension of the RAJA portability framework, provides a lightweight, efficient solution to these performance portability challenges by tuning parameters each time a kernel runs. Machine-learned classifiers map arbitrary runtime features to the fastest tuning parameter values, reducing each tuning decision to a few conditional statements evaluated before each kernel launch. This approach enables Apollo to respond quickly to changes in the data being processed and frees developers from choosing parameters manually.
Making an on-line tuning decision can be costly, and existing approaches cannot adapt quickly to changing inputs. Apollo avoids this lag by learning off-line how to select tuning parameter values directly, using decision tree classifiers trained on measured features. The cost of modeling is thus amortized over multiple runs, and the overhead of constructing the model is removed from the runtime entirely.
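As a rough illustration of the off-line step, the sketch below labels each observed feature vector with the fastest policy measured for it; a standard decision tree learner would then fit these labeled pairs. The Sample and label_fastest names are hypothetical, not Apollo's actual data structures.

#include <map>
#include <string>
#include <utility>
#include <vector>

// One timed kernel execution: the observed feature vector, the policy
// that ran, and the measured time.
struct Sample {
    std::vector<double> features; // e.g., {num_indices, num_segments, ...}
    std::string         policy;   // e.g., "seq_exec" or "omp_exec"
    double              seconds;  // measured kernel time
};

// Label each distinct feature vector with its fastest measured policy.
// The labeled pairs become training data for the decision tree learner.
std::map<std::vector<double>, std::string>
label_fastest(const std::vector<Sample>& samples) {
    std::map<std::vector<double>, std::pair<std::string, double>> best;
    for (const auto& s : samples) {
        auto it = best.find(s.features);
        if (it == best.end() || s.seconds < it->second.second)
            best[s.features] = {s.policy, s.seconds};
    }
    std::map<std::vector<double>, std::string> labels;
    for (const auto& entry : best)
        labels[entry.first] = entry.second.first;
    return labels;
}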
In tests, Apollo selected the fastest parameter values up to 98% of the time and increased performance by up to 4.8x.
Figure 1. Apollo workflow, counter-clockwise from top left: The application generates a set of training data samples, selecting the fastest known template variant at runtime. The samples feed the model-generation framework, which builds a decision tree classifier that predicts the fastest parameter values. The generated C++ models are then linked into the application dynamically at startup, without recompilation.
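In practice, the dynamic linking step in the figure could look roughly like the sketch below; the library name libapollo_model.so and the symbol apollo_install_model are illustrative assumptions, not Apollo's actual interface.

#include <dlfcn.h>
#include <cstdio>

// Hypothetical signature of the model's entry point.
using install_fn = void (*)();

int main() {
    // Load the compiled decision-tree model at startup; re-training the
    // model only requires rebuilding this library, not the application.
    void* handle = dlopen("./libapollo_model.so", RTLD_NOW);
    if (!handle) {
        std::fprintf(stderr, "%s\n", dlerror());
        return 1;
    }
    auto install = reinterpret_cast<install_fn>(
        dlsym(handle, "apollo_install_model"));
    if (install)
        install(); // make the model available to kernel call sites
    // ... run the application; kernels consult the installed model ...
    dlclose(handle);
    return 0;
}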
Applications
Apollo’s first version runs as an extension of the RAJA programming model and uses Caliper to collect training data. Tests with CleverLeaf and LULESH have yielded compelling performance data.
- RAJA maintains single-source code while leveraging standard C++ features for portability and integration. RAJA interfaces with Apollo's recording and tuning components; these are decoupled from the application, so the same executable can run in either recording or tuning mode. This flexibility allows the decision model to be re-trained on new data without recompiling the application.
- Caliper is an introspection tool that connects independent data through contextual annotations. It measures kernel runtimes and stores arbitrary attribute values representing additional features. Apollo uses these capabilities to collect the features that feed its models (see the annotation sketch after this list).
- CleverLeaf is an adaptive mesh refinement application whose kernels work on variably sized patches of data, making it an ideal test case for a dynamic auto-tuning framework. In strong scaling, where the adaptively refined mesh is divided into ever smaller subdomains, Apollo speeds up execution.
- LULESH is a proxy application that models shock hydrodynamics. With kernels iterating over domain elements as well as material regions, the code’s algorithms and programming features are representative of standard C++ applications.
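A kernel annotated for feature collection with Caliper might look roughly like this; the apollo.* attribute names are illustrative assumptions, though the annotation macros and byname functions are part of Caliper's standard C API.

#include <caliper/cali.h>

// Time a kernel region and attach feature values to it. Caliper records
// the region timing together with the attributes, which can later be
// exported as training samples.
void run_kernel(int num_indices, int num_segments) {
    cali_set_int_byname("apollo.num_indices", num_indices);   // feature
    cali_set_int_byname("apollo.num_segments", num_segments); // feature

    CALI_MARK_BEGIN("my_kernel");
    // ... execute the RAJA kernel here ...
    CALI_MARK_END("my_kernel");
}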
Table 1. Apollo runtime features: During auto-tuning, kernel, instruction, and application features are collected for each RAJA kernel. Additional multi-physics code features are within scope for future versions of Apollo.
| Feature Type | Feature | Description |
|---|---|---|
| Kernel | func | Name of function |
| | func_size | Total number of instructions in kernel body |
| | index_type | Type of RAJA IndexSet |
| | loop_id | Address identifying kernel |
| | num_indices | Number of indices in each segment |
| | num_segments | Number of segments |
| | stride | Stride of indices in each segment |
| Instruction | add, and, call, cmp, comisd, divsd, inc, jb, lea, loop, maxsd, minsd, mov, mulpd, nop, pop, push, pxor, ret, sar, shl/sal, sqrtsd, sub, test, ucomisd, unpckhpd, unpcklpd, xor, xorps | Instruction mnemonics are grouped to save space (for example, the add mnemonic covers add, addpd, and addsd) |
| Application | timestep | Current cycle |
| | problem_size | Global problem size |
| | problem_name | Name of the input deck |
| | patch_id | Numeric ID assigned to the AMR subdomain being processed (CleverLeaf only) |
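As a concrete illustration of how these features group together, a plain record for one kernel invocation might look like the following; the KernelFeatures name and field types are assumptions for illustration, not Apollo's internal representation.

#include <string>

// Illustrative grouping of a subset of the Table 1 features for one
// kernel invocation; field names mirror the table rows.
struct KernelFeatures {
    // Kernel features
    std::string func;          // name of function
    long        func_size;     // instructions in kernel body
    std::string index_type;    // type of RAJA IndexSet
    long        num_indices;   // number of indices in each segment
    long        num_segments;  // number of segments
    // Application features
    long        timestep;      // current cycle
    long        problem_size;  // global problem size
    std::string problem_name;  // name of the input deck
};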
Figure 2. Apollo decision tree: A simple code-generation algorithm translates the decision tree into conditional statements that test the feature value corresponding to each node. To predict an unseen sample, Apollo evaluates a node's condition to determine which child node to visit. This example branches on num_indices.
void
apollo_begin_forall_iset(const RAJA::IndexSet& iset, void*) {
  RAJA::apollo::model_params p;
  const int num_indices = iset.getLength();
  // Use the decision tree to predict the best execution policy
  if ( num_indices <= 103938.0 ) {
    if ( num_indices <= 19965.5 ) {
      p.policy = seq_exec;
    } else {
      p.policy = omp_exec;
    }
  } else {
    // Both leaves of this subtree predict the same policy
    if ( num_indices <= 2382100.0 ) {
      p.policy = omp_exec;
    } else {
      p.policy = omp_exec;
    }
  }
  // Write the predicted model parameters to the blackboard
  RAJA::apollo::set_model_params(p);
}
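The generated conditionals above can be produced by a short recursive walk over the trained tree. Below is a minimal sketch of such a code generator, assuming a hypothetical TreeNode representation rather than Apollo's actual model format.

#include <iostream>
#include <memory>
#include <string>

// Hypothetical decision-tree node: internal nodes split on one feature,
// leaves name the policy to emit.
struct TreeNode {
    std::string feature;    // e.g., "num_indices" (internal nodes)
    double threshold = 0.0; // split value (internal nodes)
    std::string policy;     // e.g., "omp_exec" (leaves)
    std::unique_ptr<TreeNode> left, right;
    bool is_leaf() const { return !left && !right; }
};

// Recursively emit nested if/else statements mirroring the tree, in the
// style of the generated function above.
void emit(const TreeNode& n, std::ostream& os, int depth = 1) {
    const std::string pad(2 * depth, ' ');
    if (n.is_leaf()) {
        os << pad << "p.policy = " << n.policy << ";\n";
        return;
    }
    os << pad << "if ( " << n.feature << " <= " << n.threshold << " ) {\n";
    emit(*n.left, os, depth + 1);
    os << pad << "} else {\n";
    emit(*n.right, os, depth + 1);
    os << pad << "}\n";
}

Calling emit on the trained root, wrapped in the function prologue and epilogue shown above, reproduces code of exactly this shape.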
Learn More
For more information, contact Giorgis Georgakoudis, David Beckingsale, or Todd Gamblin. Apollo will be open source and extensible, and GitHub links will be provided upon release.