The vast supercomputing power at our disposal in the exascale era will go to waste unless we ensure that our applications can run at their peak performance. At those scales, the amount of communication in an application will be the primary determinant of performance, because cores are becoming much faster while networks lag disproportionately behind. Fast, scalable, and accurate modeling and simulation of an application's communication is required to prepare parallel applications for exascale. Modeling and simulation have several use cases: performance prediction on a future architecture, network hardware co-design, and algorithmic research for future network topologies, including communication-avoiding algorithms, topology-aware job scheduling, task mapping, and network routing protocols.
TraceR/CODES: Framework for Scalable Simulations
Preparing existing applications for next-generation high performance computing (HPC) systems requires understanding the tradeoffs between computation and communication. To enable such explorations, LLNL has developed the TraceR/CODES framework—a highly accurate simulation infrastructure that uses current applications and systems to predict the performance of applications on future systems.
The TraceR/CODES framework provides a trace-driven toolchain for packet-level simulations of traffic flow on HPC networks. It captures complexities encountered at different levels of the software stack:
- Traces from production applications enable replay of communication patterns and computation–communication overlap behavior of real applications.
- TraceR, a scalable packet-level network simulator, simulates the intricate details, protocols, and collective algorithms used in the Message Passing Interface (MPI) for multi-job workloads.
- CODES provides a unified application programming interface (API) to reproduce the packet-level flow of traffic as it happens on real networks.
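To make "packet-level" concrete, here is a minimal sketch of how a single message might be broken into packets and timed over one link. It assumes a deliberately simplified model (one link, fixed bandwidth and latency, packets sent back to back, no contention or routing). The function names and parameters are hypothetical and are not part of the TraceR or CODES APIs.

```python
def packetize(message_bytes, packet_bytes):
    """Split a message into packet sizes; the last packet may be shorter."""
    full, rest = divmod(message_bytes, packet_bytes)
    return [packet_bytes] * full + ([rest] if rest else [])


def message_arrival_time(message_bytes, packet_bytes,
                         bandwidth_bytes_per_s, latency_s, start_s=0.0):
    """Return the time at which the last packet of a message arrives.

    Assumed model: packets are serialized one after another on a single
    link, and every packet experiences the same fixed propagation latency.
    """
    t = start_s
    for size in packetize(message_bytes, packet_bytes):
        t += size / bandwidth_bytes_per_s  # serialization time of this packet
    return t + latency_s  # the last packet still has to cross the link
```

For example, `packetize(10, 4)` splits a 10-byte message into packets of 4, 4, and 2 bytes. A real packet-level simulator would additionally track each packet through routers, buffers, and competing traffic, which this sketch leaves out.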
TraceR is designed as an application on top of the CODES simulation framework. It uses traces generated by BigSim's emulation framework to simulate an application's communication behavior by leveraging the network API exposed by CODES. Under the hood, CODES uses the Rensselaer Optimistic Simulation System (ROSS) as the parallel discrete-event simulation (PDES) engine to drive the simulation. ROSS provides the environment for replaying the framework's events as time-stamped discrete events. Because ROSS is optimistic, it processes events speculatively and rolls back only when it detects that events were handled out of timestamp order. This optimistic execution drives the scalability of the TraceR/CODES framework and enables fast simulation using large core counts in comparison to other simulation frameworks.
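The idea of a time-stamped, discrete-event replay can be illustrated with a small sketch. The code below is a sequential event loop, not the parallel optimistic engine that ROSS provides, and it does not use the ROSS, CODES, or TraceR APIs. The `Simulator` class, the `replay` function, the trace format `(send_time_s, src, dst, nbytes)`, and the latency-plus-bandwidth cost model are all assumptions made for illustration.

```python
import heapq
import itertools


class Simulator:
    """Minimal sequential discrete-event engine (illustrative only)."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = itertools.count()  # tie-breaker for equal timestamps

    def schedule(self, delay, handler, *args):
        """Schedule handler(sim, *args) to run `delay` seconds from now."""
        heapq.heappush(self._queue,
                       (self.now + delay, next(self._seq), handler, args))

    def run(self):
        """Process events in timestamp order until none remain."""
        while self._queue:
            self.now, _, handler, args = heapq.heappop(self._queue)
            handler(self, *args)


def replay(trace, latency_s, bandwidth_bytes_per_s):
    """Replay a hypothetical trace of (send_time_s, src, dst, nbytes) tuples.

    Returns (arrival_time_s, src, dst, nbytes) records in arrival order.
    """
    sim = Simulator()
    delivered = []

    def on_send(sim, src, dst, nbytes):
        # Assumed cost model: fixed latency plus size divided by bandwidth.
        delay = latency_s + nbytes / bandwidth_bytes_per_s
        sim.schedule(delay, on_arrive, src, dst, nbytes)

    def on_arrive(sim, src, dst, nbytes):
        delivered.append((sim.now, src, dst, nbytes))

    for send_time, src, dst, nbytes in trace:
        sim.schedule(send_time, on_send, src, dst, nbytes)
    sim.run()
    return delivered
```

Calling `replay([(0.0, 0, 1, 1000), (0.1, 1, 0, 100)], latency_s=0.5, bandwidth_bytes_per_s=1000.0)` delivers the smaller message from rank 1 first, even though it was sent later, because events are ordered by simulated arrival time rather than by trace order. An optimistic parallel engine distributes this event queue across many processes and adds rollback to recover from out-of-order processing, which this sketch does not attempt.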