Performance portability is a critical capability in parallel simulation codes. Multi-physics codes contain thousands of independent kernels, for which developers set architectural tuning parameters at compile time. Existing approaches to auto-tuning typically run many instances of a kernel while performing a guided search of the performance parameter space. As a result, even fast searches take minutes to complete, and tuning is effective only if the code’s behavior changes slowly. In adaptive mesh refinement simulations, for example, where the code’s behavior can evolve during a run, tuning a kernel for one set of inputs may not improve performance on another; a different tuning of each kernel may be necessary. To achieve the best performance in adaptive applications, statically optimized code paths must be chosen dynamically.

Approach and Benefits

Apollo, an auto-tuning extension of the RAJA portability framework, provides a lightweight and efficient solution to performance portability challenges by tuning parameters each time a kernel runs. Machine-learning classifiers map arbitrary runtime features to the fastest tuning parameter values, and the resulting conditional statements are evaluated before each kernel executes. This approach enables Apollo to respond quickly to changes in the data being processed, freeing developers from choosing parameters manually.

Making an on-line tuning decision can be costly because existing approaches cannot adapt quickly to changing inputs. Apollo mitigates this performance lag by using off-line training to learn decision tree classifiers that select tuning parameter values directly from measured features. The cost of modeling is thus amortized over multiple runs, and the overhead of constructing the model is removed from the runtime.
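The off-line training step can be illustrated with a hypothetical sketch (not Apollo's actual implementation): given samples that pair a measured feature with the fastest observed policy, find the single threshold split that best separates the policies. This split is the building block of a decision tree classifier.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: Policy, Sample, and train_stump are hypothetical
// names, not part of Apollo or RAJA.
enum Policy { SEQ, OMP };

struct Sample { double feature; Policy fastest; };

// Return the threshold that minimizes misclassification when predicting
// SEQ below the threshold and OMP at or above it (a one-node "stump";
// a full decision tree recurses on each side of the split).
double train_stump(std::vector<Sample> samples) {
    std::sort(samples.begin(), samples.end(),
              [](const Sample& a, const Sample& b) { return a.feature < b.feature; });
    double best_threshold = samples.front().feature;
    std::size_t best_errors = samples.size();
    for (std::size_t split = 1; split < samples.size(); ++split) {
        std::size_t errors = 0;
        for (std::size_t i = 0; i < samples.size(); ++i) {
            Policy predicted = (i < split) ? SEQ : OMP;
            if (predicted != samples[i].fastest) ++errors;
        }
        if (errors < best_errors) {
            best_errors = errors;
            // Place the threshold halfway between the neighboring samples.
            best_threshold = (samples[split - 1].feature + samples[split].feature) / 2.0;
        }
    }
    return best_threshold;
}
```

Because training happens off-line, this search cost is paid once per training run rather than before every kernel launch.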

In tests, Apollo selected the fastest parameter values up to 98% of the time and improved performance by up to 4.8x.

Figure 1. Apollo workflow, counter-clockwise from top left: The application generates a set of training data samples while selecting the fastest known template variant at runtime. The samples are fed into the model-generation framework, which builds a decision tree classifier that predicts the fastest parameter values. The generated C++ models are loaded at startup and linked into the application dynamically, without recompilation.

Applications

Apollo’s first version runs as an extension of the RAJA programming model and uses Caliper to collect training data. Tests with CleverLeaf and LULESH have yielded compelling performance data.

  • RAJA maintains single-source code while leveraging standard C++ features for portability and integration. RAJA interfaces with Apollo’s recording and tuning components; these are decoupled from the application, so the same executable can be run in either mode. This flexibility allows the decision model to re-train on new data without recompiling the application.
  • Caliper is an introspection tool that connects independent data through contextual annotations. This system measures kernel runtime and stores arbitrary attribute values representing additional features. Apollo uses these capabilities to determine features that can be added to the model.
  • CleverLeaf is an adaptive mesh refinement application whose kernels work on variably sized patches of data, making it an ideal test case for a dynamic auto-tuning framework. In strong scaling, where the adaptively refined mesh is divided into smaller subdomains, Apollo speeds up execution.
  • LULESH is a proxy application that models shock hydrodynamics. With kernels iterating over domain elements as well as material regions, the code’s algorithms and programming features are representative of standard C++ applications.
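To illustrate the single-source pattern described above, the sketch below uses a simplified stand-in for RAJA's templated forall (not RAJA's actual API): the kernel body is written once, and the execution policy, which a tuning model could select at runtime, decides how it executes.

```cpp
#include <cassert>

// Hypothetical stand-in for RAJA execution policies.
enum class Policy { seq_exec, omp_exec };

// Single-source loop: the body is written once; the policy chosen at
// runtime (e.g., by Apollo's model) decides how it is executed.
template <typename Body>
void forall(Policy policy, int begin, int end, Body body) {
    switch (policy) {
    case Policy::seq_exec:
        for (int i = begin; i < end; ++i) body(i);
        break;
    case Policy::omp_exec:
        // A real implementation would use '#pragma omp parallel for' here;
        // we fall back to a sequential loop to stay self-contained.
        for (int i = begin; i < end; ++i) body(i);
        break;
    }
}
```

Because the policy is an ordinary runtime value, the same executable can switch between sequential and threaded execution per kernel invocation, which is what allows Apollo to re-train and re-dispatch without recompiling the application.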

Table 1. Apollo runtime features: During auto-tuning, kernel, instruction, and application features are collected for each RAJA kernel. Additional multi-physics code features are within scope for future versions of Apollo.

Feature Type   Feature         Description
Kernel         func            Name of function
               func_size       Total number of instructions in kernel body
               index_type      Type of RAJA IndexSet
               loop_id         Address identifying kernel
               num_indices     Number of indices in each segment
               num_segments    Number of segments
               stride          Stride of indices in each segment
Instruction    add, and, call, cmp, comisd, divsd, inc, jb, lea, loop,
               maxsd, minsd, mov, mulpd, nop, pop, push, pxor, ret, sar,
               shl/sal, sqrtsd, sub, test, ucomisd, unpckhpd, unpcklpd,
               xor, xorps
                               Instruction mnemonics are grouped to save
                               space (for example, the add mnemonic
                               corresponds to add, addpd, and addsd)
Application    timestep        Current cycle
               problem_size    Global problem size
               problem_name    Name of the input deck
               patch_id        Numeric ID assigned to the AMR subdomain
                               being processed (CleverLeaf only)
Figure 2. Apollo decision tree: Using a simple code-generation algorithm, a decision tree is created containing conditional statements that test the values of features corresponding to each node. To predict an unseen sample, Apollo evaluates a node and compares feature values to determine which child node to visit. This example is based on num_indices.
void
apollo_begin_forall_iset(const RAJA::IndexSet& iset, void*) {
  RAJA::apollo::model_params p;
  const int num_indices = iset.getLength();

  // Use decision tree to predict best execution policy
  if ( num_indices <= 103938.0 ) {
    if ( num_indices <= 19965.5 ) {
      p.policy = seq_exec;
    } else {
      p.policy = omp_exec;
    }
  } else {
    if ( num_indices <= 2382100.0 ) {
      p.policy = omp_exec;
    } else {
      p.policy = omp_exec;
    }
  }

  // Write predicted model parameters to the blackboard
  RAJA::apollo::set_model_params(p);

}
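The "simple code-generation algorithm" mentioned in the Figure 2 caption can be sketched as follows, using a hypothetical tree-node structure rather than Apollo's actual one: each internal node becomes an if/else on its feature threshold, and each leaf becomes a policy assignment.

```cpp
#include <memory>
#include <sstream>
#include <string>

// Hypothetical trained-tree node (illustrative; not Apollo's internal
// representation): internal nodes hold a feature and threshold, leaves
// hold the predicted execution policy.
struct Node {
    std::string feature;                // e.g., "num_indices"
    double threshold = 0.0;
    std::string policy;                 // non-empty for leaves, e.g., "seq_exec"
    std::unique_ptr<Node> left, right;  // taken when feature <= threshold / otherwise
};

// Recursively emit nested C++ conditionals like those in Figure 2.
void emit(const Node& n, std::ostringstream& out, int depth) {
    std::string pad(2 * depth, ' ');
    if (!n.policy.empty()) {            // leaf: assign the predicted policy
        out << pad << "p.policy = " << n.policy << ";\n";
        return;
    }
    out << pad << "if ( " << n.feature << " <= " << n.threshold << " ) {\n";
    emit(*n.left, out, depth + 1);
    out << pad << "} else {\n";
    emit(*n.right, out, depth + 1);
    out << pad << "}\n";
}
```

Emitting the model as plain C++ is what lets the generated predictor be compiled and linked into the application without recompiling the application itself.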