Kathryn Mohror Loves HPC, Faults and All
Kathryn Mohror’s decision about where to begin her computer science career was easy. “I work in high performance computing,” says Kathryn, “and at Livermore, the biggest systems in the world are at my fingertips.”
At LLNL’s Center for Applied Scientific Computing, Kathryn researches scalable fault-tolerant computing and data storage techniques. The large number of components in today’s leading-edge computing systems has increased the rate of hardware faults, which can halt calculations and degrade performance. Applications make checkpoints—snapshots of the computer’s state—at regular intervals in order to recover after a fault occurs, but writing to and accessing checkpoints in a supercomputer’s parallel file system consumes valuable computational resources. Kathryn is part of an ongoing effort to create an efficient and reliable scheme for storing checkpoints at a range of levels and locations. She and her colleagues have implemented the Scalable Checkpoint/Restart library on Livermore’s Linux supercomputers, and are currently porting it to the largest and most powerful system, Sierra.
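For readers who want a feel for what storing checkpoints at more than one level means, here is a minimal Python sketch. It is not the Scalable Checkpoint/Restart library's API; the function names, the directory arguments, and the `pfs_every` parameter are all hypothetical, chosen only to illustrate the idea. Every checkpoint goes to a fast but less reliable location (standing in for node-local storage), and only every Nth one is also written to a slower, more durable location (standing in for the parallel file system).

```python
import json
import os


def save_checkpoint(state, step, local_dir, pfs_dir, pfs_every=4):
    """Write checkpoint number `step` and return the directories written to.

    Hypothetical two-level scheme: every checkpoint goes to `local_dir`
    (fast, may be lost with the node); every `pfs_every`-th checkpoint is
    also copied to `pfs_dir` (slow, assumed to survive a node failure).
    """
    levels = [local_dir]
    if step % pfs_every == 0:
        levels.append(pfs_dir)
    for directory in levels:
        os.makedirs(directory, exist_ok=True)
        final_path = os.path.join(directory, f"ckpt_{step:08d}.json")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"step": step, "state": state}, f)
        # Rename only after the write finishes, so a crash mid-write
        # never leaves a half-written file under the final name.
        os.replace(tmp_path, final_path)
    return levels


def latest_checkpoint(*dirs):
    """Return the newest checkpoint found in any of `dirs`, or None."""
    best_step, best_path = -1, None
    for directory in dirs:
        if not os.path.isdir(directory):
            continue  # this level was lost or never written
        for name in os.listdir(directory):
            if name.startswith("ckpt_") and name.endswith(".json"):
                step = int(name[len("ckpt_"):-len(".json")])
                if step > best_step:
                    best_step, best_path = step, os.path.join(directory, name)
    if best_path is None:
        return None
    with open(best_path) as f:
        return json.load(f)
```

On restart, an application would call `latest_checkpoint(local_dir, pfs_dir)`. If the fast copy survived, little work is lost. If it did not, the run falls back to the older copy on the durable level. The trade-off is that the expensive writes happen less often.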
Kathryn’s work focuses not only on rapidly recovering from failures but also on maximizing the effectiveness of applications running on large-scale parallel computers. Performance measurement and analysis tools were central to her doctoral research and remain an abiding interest. When an application runs on a machine with hundreds of thousands of processors, or even more than a million, it can be difficult to ascertain whether the application is running at its best and, if not, what is hindering performance. Kathryn reviews historical data for applications to identify inhibiting factors and develops tools that give researchers the information they need to tune their programs and maximize results. After all, says Kathryn, “It’s all about getting the answers more quickly.”
Kathryn’s first encounter with computer programming came while she was earning her bachelor’s degree in chemistry at Portland State University. She excelled at chemistry but found it uninspiring, and that brief programming experience convinced her to switch to computer science. After earning her doctorate from Portland State in 2010, she joined Livermore as a postdoctoral scholar, later transitioning to a staff scientist position. Having developed a close relationship with Livermore researchers during graduate school, Kathryn knew that LLNL would be a good fit for her. What she appreciates most about her colleagues is the enthusiasm they demonstrate for their work and for coming to work each day. She is also grateful for Livermore’s alternative work options. Because Kathryn’s husband is a small business owner who lives and works in Portland, Oregon, she spends her weekends there and works remotely from Portland one week each month. “This flexible arrangement lets me see my spouse and do the job I love,” says Kathryn.
I help the fastest machines in the world run efficiently. Failures are a big problem on machines this big, and they can happen at any time during a program run. My job is to keep these failures from impacting application run time so that domain scientists can get the answers they need as quickly as possible. Every day on the way to my office I walk by the machine room and get to see Sierra, one of the fastest computers in the world. There’s really not much that can top that in my book.
I also love gardening, cooking, yoga, running, and cycling.