Operating at the leading edge of high-performance computing (HPC), as Livermore does, requires innovation in the software tools that ensure scientific codes will run on supercomputers. The codes and computers are often too complex for commercial products, so LLNL teams are developing specialized software to accommodate build dependencies, data storage, memory management, scalable workflows, mathematical algorithms, and more.
Regression testing is a key area where efficiency and dependability can be improved. This essential activity helps uncover bugs in a scientific application after enhancements, patches, or other changes have been deployed. Routinely running a representative set of regression test cases verifies that the code continues to run the same way and produce the same answers as before. Computational scientist Shawn Dawson explains, “Many of Livermore’s programmatic codes are too massive and dynamic for manual testing processes. For large, frequently updated codes, regression testing must be automated.”
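To make the idea of "the same answers as before" concrete, here is a minimal sketch of a regression check in plain Python. It is not ATS code and does not use the ATS API; the function name `find_regressions`, the result names, and the tolerance are all hypothetical and chosen only for illustration.

```python
import math


def find_regressions(baseline, current, rel_tol=1e-9):
    """Return the names of results that are missing or differ from the baseline."""
    failures = []
    for name, expected in baseline.items():
        actual = current.get(name)
        # A result regresses if it disappeared or moved outside the tolerance.
        if actual is None or not math.isclose(actual, expected, rel_tol=rel_tol):
            failures.append(name)
    return failures


# Hypothetical values: a trusted earlier run and a run made after a code change.
baseline = {"total_energy": 1.2345678, "peak_pressure": 98.7654, "cycle_count": 400}
current = {"total_energy": 1.2345678, "peak_pressure": 98.9000, "cycle_count": 400}

print(find_regressions(baseline, current))  # prints ['peak_pressure']
```

In practice the tolerance is a judgment call for each code team: floating-point results often shift slightly across compilers and platforms, so an exact comparison can flag differences that are not real bugs.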
The Automated Testing System (ATS) project was born from these circumstances nearly two decades ago. This Livermore-developed, scalable, Python 3–based tool automates the running of a scientific code’s regression tests. With tens of thousands of tests run every day, ATS helps mission-driven projects meet milestones and maintain code integrity. By adopting this common testing framework, code teams can use HPC resources more efficiently without needing to develop their own ad hoc solutions.
ATS features include generating and comparing plot files, testing code correctness and performance with different software libraries and compiler configurations, and archiving metrics related to each test run. Furthermore, the tool is flexible enough for experts and non-experts alike. The Python 3 code base keeps pace with modern development practices, and users need only learn a few basic statements to start running tests. Users can specify which types of tests to run and which directories to run them in, as well as define post-processing functions such as creating graphs. ATS generates files and reports that users view to identify run differences due to source code changes. Tests can run across many parallel MPI processes, with OpenMP threads, and on graphics processing unit (GPU) devices on current machines.
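The workflow described above (choosing which types of tests to run, running them concurrently, and recording metrics for each run) can be sketched with the Python standard library alone. This is an illustrative stand-in, not the ATS interface: the `TESTS` table, the `run_test` and `run_suite` functions, and the "smoke" and "nightly" labels are all assumptions, and each command is a trivial one-liner standing in for a real simulation.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical test table: (label, kind, command).
TESTS = [
    ("smoke_ok", "smoke", [sys.executable, "-c", "print('ok')"]),
    ("smoke_fail", "smoke", [sys.executable, "-c", "raise SystemExit(1)"]),
    ("long_ok", "nightly", [sys.executable, "-c", "print('done')"]),
]


def run_test(item):
    """Run one command and record pass/fail, wall time, and captured output."""
    label, command = item
    start = time.perf_counter()
    proc = subprocess.run(command, capture_output=True, text=True)
    return {
        "label": label,
        "passed": proc.returncode == 0,
        "seconds": time.perf_counter() - start,
        "stdout": proc.stdout.strip(),
    }


def run_suite(tests, kind, max_workers=2):
    """Run every test of the requested kind concurrently, keeping table order."""
    selected = [(label, command) for label, k, command in tests if k == kind]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_test, selected))


for result in run_suite(TESTS, "smoke"):
    status = "PASS" if result["passed"] else "FAIL"
    print(f"{status} {result['label']}")
```

A production tool must also handle job scheduling, MPI launch commands, and GPU allocation, all of which this sketch leaves out.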
ATS is extendable to new machines, and users can evaluate their code’s performance over time and across platforms. ATS can be run as part of a project’s continuous integration (CI) strategy. It may be driven by GitLab CI, for instance, as part of the code update cycle. Throughout ATS development, upgrades have enabled application portability to ever-evolving HPC architectures, beginning with the IBM Blue Gene machines of the early 2000s and continuing into the petascale era of GPU-based supercomputers. ATS now runs on a range of unclassified and classified systems that use BlueOS or TOSS (Tri-Laboratory Operating System Stack), including the Lab’s flagship supercomputer Sierra as well as Lassen, RZAnsel, and several commodity clusters. When El Capitan comes online in 2023, the ATS project plans to use LLNL’s Flux resource manager to run reliably on the exascale-class machine.
Today ATS is managed by Dawson and David Bloss from Computing’s Applications, Simulation, & Quality Division. Its primary users are domain scientists and project teams in the Lab’s Weapons and Complex Integration organization. According to Dawson, user feedback is crucial to the tool’s success. He explains, “Testing is a complex issue. We can never test 100% of a given application due to variations in data input, but we test the large majority of use cases. When a user finds an area we missed, we incorporate it into the test suite.”
In 2021, the team released ATS as open source—see the GitHub repository and documentation, which includes a tutorial—and contributions are welcome. Dawson states, “With the open sourcing of ATS, and the current work on the Flux scheduler, we are motivated to continue to port ATS to the newest machines. We welcome contributions and feedback and suggestions from LLNL as well as other HPC centers.”