Refreshing the controls and IT infrastructure at NIF
As LLNL’s NIF enters its second decade of full-scale operations, the demands on all aspects of its information technology (IT) infrastructure are becoming more varied, complex, and critical. To meet these demands, the NIF IT team has embarked on an IT modernization effort to push NIF’s technology infrastructure into the next paradigm.
“To support NIF’s core missions, including nuclear stockpile stewardship and scientific discovery, scientists and engineers require a reliable, performant, and secure IT infrastructure,” says Phil Adams, chief technology officer and lead architect for NIF IT. “The compute infrastructure and many of the applications that enable the NIF to do amazing science are now approaching 18 years old.”
NIF’s unique environment demands constant planning and attention to the IT engine that drives it. Adams explains, “Operating environments such as NIF have different constraints than most facilities at the Laboratory and in industry. There are limited options for programmable logic controllers, scopes, and safety interlock systems, and these technologies are expensive to develop, test, and deploy on a one-of-a-kind laser control system.”
As developers and engineers work to upgrade and replace NIF’s systems and hardware with solutions that are more likely to interoperate, the end-user experience and the program mission are at the forefront of their planning. The challenge, as Adams points out, is making the very old technologies communicate with the very new without compromising the operational integrity of NIF.
To this end, the team leverages data to increase its understanding of the ecosystem. Extensive use of Splunk provides some of the analytic capabilities, along with the ability to pull information from myriad relational databases, that yield valuable insight.
“We’re just starting to use the Splunk Machine Learning Toolkit to respond to IT issues and to predict abnormal behavior within the laser facility,” says Marvin Christensen, cybersecurity manager for the NIF and Photon Sciences Principal Directorate. “Data is key to NIF operations, so the ability to correlate multiple data sources into one view has been game-changing for our team.”
A new partnership with ExtraHop Networks is adding another layer of security and analysis to the IT infrastructure at NIF. The ExtraHop tool analyzes all network interactions in real time and uses machine learning to identify threats. It is providing deeper visibility into how applications and diagnostic systems use NIF networks and which protocols are in use.
The NIF IT team maintains close partnerships with several other vendors, including Cisco, NetApp, and Oracle. “By leveraging key vendor relationships and making them trusted partners in the ongoing success of NIF, we’re able to homogenize NIF’s IT environment,” says Adams. “Where possible, we’ve tried to simplify our architecture, eliminate custom code, and utilize open source and commercial software to leverage the constant innovation in the industry.” By collaborating with vendors in a continuous peer review process, the team has had its fundamental impact areas and architectural path forward validated by other experts in the field, ensuring a greater likelihood of interoperability.
Cybersecurity is an increasing focus area for the NIF IT team. “Defending industrial environments from cyber threats is not just about ensuring that anti-virus definitions are up to date,” says Adams. “We must understand what aspects of the ecosystem make NIF potentially vulnerable and then mitigate the risks. We’re doing this by taking an intelligence-driven approach to cybersecurity.”
The team’s priority is to secure NIF’s data center while giving developers the flexibility to continue accessing the sensitive areas of the controls system and production tools. Because no single tool collects and analyzes every facet of the enterprise, the team uses multiple tools to cover the critical aspects and visualizes the resulting security posture in Splunk.
The team is also figuring out how best to leverage the cloud paradigm. “The potential financial and operational benefits of cloud computing make it compelling,” says Adams, who is working with vendors and LLNL’s Livermore Information Technology (LivIT) program to explore various pathways to adoption. He continues, “There is real appeal there in terms of solutions for tiered storage, redundancy in data centers, and function-based computing that could make NIF more durable and capable for the future.”
When it comes to modernization, especially the move to the cloud, Adams and his team are taking prudent and measured steps, largely because the consequences of an error on NIF systems are high: a system failure would be felt immediately and acutely by the scientific community. The team takes a similarly exacting approach to the evolutionary, and sometimes radical, IT changes that must be implemented and integrated without disrupting NIF’s 24/7 operations. The team manages infrastructure transitions through Saturday IT maintenance outages held roughly every six weeks. “All work is planned, tested extensively, and communicated to the user community before pushing the changes into production,” says Allan Casey, NIF’s IT manager.
“Sometimes it feels like these changes happen at a glacial pace,” says Casey, “but we do things carefully and deliberately to ensure that our infrastructure is optimized and fully capable of supporting this massively complex, awesome machine.”
Ultimately, IT infrastructure modernization boils down to one thing: making NIF more successful. “Our primary mission is to provide an infrastructure that supports and enables NIF’s drive to explore fusion ignition and to safeguard the nuclear stockpile,” Adams says. “We’ve always made IT decisions based on making NIF more efficient and capable, and we’ll continue to do so. Modern software, systems, and tools allow us to do that faster and better.”
Figure 1. NIF relies on a Cisco/NetApp converged infrastructure called FlexPod to provide a resilient, highly available platform that keeps shot operations running. Even the slightest unforeseen downtime can cause cascading failures across various application systems. This engineered system has the capacity to handle sudden bursts of activity induced by laser operations without the latency seen on less efficient IT systems.