Leveraging data-intensive computing for sleuthing seismic signals
Since the 1960s, scientists at Lawrence Livermore have been developing methods to locate, identify, and distinguish nuclear explosions from earthquakes and other types of seismic activity. Today, high-performance computers and new data-intensive computing techniques are improving our country’s ability to monitor nuclear explosions and allowing researchers to better define seismically active faults.
In the last decade, the ability to monitor international seismic events has grown more computationally challenging. The volume of data collected from seismology sensors deployed around the world has been increasing at an exponential rate, and the advent of ultra-cheap sensors promises to sustain this trend. With ubiquitous sensors comes the need to store and process enormous data volumes.
“Livermore and other organizations tasked with tracking seismicity around the world now must monitor, analyze, and respond to data that is one million times more voluminous than what we were dealing with 10 years ago,” explains Stan Ruppert, who leads a team of computer scientists tasked with developing data analytic capabilities to address the problem. “Today we are handling millions of signal sources, hundreds of thousands of recording sensors, and billions of derived parameters and measurements. The growing data volume is overwhelming the community’s traditional processing techniques.”
One of the most significant problems associated with combining data from different sensors is organizing the metadata that come with them. Often, the data collected are of unreliable quality, updates arrive unpredictably, key values are inconsistent, and formats vary from provider to provider. In addition, duplicate and conflicting metadata must be coalesced into a single logical entity. Currently, these issues are solved by manual data fixes or complex data manipulations buried in application code—methods that are unsustainable given the substantial increase in incoming sensor data and the accompanying exponential growth in metadata.
To address this issue, the Livermore team is designing a new metadata processing architecture that accepts incoming data in any format, isolates downstream applications from ingestion and transformation details, and preserves the complete provenance of the data. The proposed architecture leverages NoSQL data warehousing techniques developed for large-scale text processing and analysis. The new metadata processing design will provide a highly scalable and maintainable solution that also improves the quality and availability of critical sensor metadata.
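The core of such a design is coalescing duplicate and conflicting records into one logical entity while keeping every raw record available for audit. The sketch below is a minimal, hypothetical illustration of that idea in plain Python (the record fields `station`, `source`, `lat`, and `lon` are invented for the example, not taken from the actual Livermore schema):

```python
from collections import defaultdict

def coalesce(raw_records):
    """Coalesce duplicate/conflicting sensor-metadata records into one
    logical entity per station, keeping every raw record as provenance.
    Record shape (hypothetical): dicts with 'station', 'source', and
    arbitrary metadata fields such as 'lat' and 'lon'."""
    entities = defaultdict(lambda: {"fields": {}, "provenance": []})
    for rec in raw_records:
        entity = entities[rec["station"]]
        entity["provenance"].append(rec)  # nothing is ever discarded
        for key, value in rec.items():
            if key in ("station", "source"):
                continue
            # Last-writer-wins on conflicts; the losing value survives
            # in provenance, so the choice can be audited or revised later.
            entity["fields"][key] = value
    return dict(entities)

records = [
    {"station": "MK31", "source": "bulletin_a", "lat": 46.79, "lon": 82.29},
    {"station": "MK31", "source": "bulletin_b", "lat": 46.793, "lon": 82.29},
]
merged = coalesce(records)
print(merged["MK31"]["fields"]["lat"])    # 46.793 (latest record wins)
print(len(merged["MK31"]["provenance"]))  # 2 — full history retained
```

A NoSQL store generalizes this pattern: raw records are appended immutably, and the coalesced view is rebuilt whenever the merge rules change, so downstream applications never see the ingestion details.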
Big data techniques are also being used to help understand new seismic events by comparing the waves generated to “templates” of past underground nuclear tests and other seismicity. A wide variety of nuclear proliferation/detonation detection capabilities rely on recognizing the specific signature of an event in a large volume of streaming real-time or archived data. If templates of the target signatures can be defined, then automated template matching (e.g., correlation, pattern recognition) offers a potential way to detect, locate, and identify the source of interest, even when the signal is buried in a noisy background.
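The template-matching idea can be sketched with normalized cross-correlation: slide a known signature along a noisy recording and look for a correlation peak. The snippet below is an illustrative toy, not the production pipeline; the signal lengths and noise levels are invented for the example.

```python
import numpy as np

def normalized_xcorr(template, stream):
    """Slide a waveform template along a longer signal and return the
    normalized cross-correlation coefficient (-1..1) at each lag."""
    n = len(template)
    t = (template - template.mean()) / (template.std() * n)
    coeffs = np.empty(len(stream) - n + 1)
    for lag in range(len(coeffs)):
        window = stream[lag:lag + n]
        std = window.std()
        if std == 0:
            coeffs[lag] = 0.0  # flat window: no meaningful correlation
        else:
            coeffs[lag] = np.dot(t, (window - window.mean()) / std)
    return coeffs

# Detect a known signature buried in noise.
rng = np.random.default_rng(0)
template = np.sin(np.linspace(0, 8 * np.pi, 200)) * np.hanning(200)
stream = rng.normal(0, 0.5, 2000)
stream[1200:1400] += template  # hide the "event" at sample 1200
coeffs = normalized_xcorr(template, stream)
print(coeffs.argmax())  # correlation peaks at (or near) sample 1200
```

Because the coefficient is normalized, a detection threshold (say, 0.6) works across recordings with very different noise amplitudes, which is what makes the technique usable on heterogeneous streaming data.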
An important step for utilizing big data has been the use of distributed-data platforms, such as Hadoop and Storm. A partnership between Livermore’s Geophysical Monitoring Program (GMP) team and Livermore Computing (LC) has helped demonstrate the potential of data-intensive computing to solve earth science problems.
To better understand the distribution and behavior of correlated seismic events, the team cross-correlated a 50-terabyte global dataset consisting of more than 300 million seismograms. Running on a conventional distributed cluster, the computation took 42 days to complete. In anticipation of processing much larger datasets, the team re-architected the system to run as a series of MapReduce jobs on a custom LC Hadoop cluster, dubbed “Bigfoot.” The results were impressive—a 19-fold performance increase on a 1-terabyte test dataset. Given the results, the team expects to be able to rerun the entire 50-terabyte dataset in 2 days versus the original 42 days.
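Casting pairwise cross-correlation as MapReduce is what makes the workload scale: the map phase groups waveforms by a shared key (here, the recording station), and the reduce phase performs the expensive correlations only within each group. The sketch below shows the shape of such a job in pure Python; the two-phase structure mirrors a Hadoop job, but the data layout and keys are assumptions for illustration, not the actual Bigfoot implementation.

```python
from itertools import combinations
import numpy as np

def map_phase(seismograms):
    """Map: key each waveform by its recording station so candidate
    pairs are grouped before the expensive correlation step."""
    for event_id, station, waveform in seismograms:
        yield station, (event_id, waveform)

def reduce_phase(station, records):
    """Reduce: compute the correlation coefficient for every pair of
    events recorded at one station."""
    for (id_a, wf_a), (id_b, wf_b) in combinations(records, 2):
        a = (wf_a - wf_a.mean()) / (wf_a.std() * len(wf_a))
        b = (wf_b - wf_b.mean()) / wf_b.std()
        yield (station, id_a, id_b), float(np.dot(a, b))

# Toy run: two similar events and one unrelated event at station "STA1".
rng = np.random.default_rng(1)
base = np.sin(np.linspace(0, 4 * np.pi, 500))
data = [
    ("evA", "STA1", base + rng.normal(0, 0.1, 500)),
    ("evB", "STA1", base + rng.normal(0, 0.1, 500)),
    ("evC", "STA1", rng.normal(0, 1.0, 500)),
]

grouped = {}  # the framework's shuffle step, done by hand here
for key, value in map_phase(data):
    grouped.setdefault(key, []).append(value)
results = {k: v for station, recs in grouped.items()
           for k, v in reduce_phase(station, recs)}
print(results[("STA1", "evA", "evB")])  # near 1.0 — correlated events
```

In a real Hadoop job the shuffle is handled by the framework and each reducer sees one station's records, so the quadratic pairwise work is partitioned across the cluster rather than run on a single node.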
“We think Hadoop will lead to a viable signal correlation architecture for processing streaming seismic data in the next few years,” says Doug Dodge, who co-authored a paper detailing the results on Bigfoot. “This will dramatically improve our ability to conduct research on massive seismic datasets.”
Fundamental algorithmic transformations were required to achieve this performance increase. No single data-intensive tool (e.g., MapReduce) will solve every analytic workflow, so the LC and GMP teams are investigating other software techniques and hybrid hardware, such as Livermore’s Catalyst system. One prototype is helping to ensure the quality, reliability, and accuracy of seismic data archived by Incorporated Research Institutions for Seismology (IRIS), the world’s largest seismic data distributor. In partnership with IRIS, the team is analyzing a representative sample of global seismic data (45 terabytes) drawn from the IRIS repository, applying LLNL tools from the distributed-computing ecosystem along with techniques from digital signal processing, statistics, and machine learning. The goal is to make it easier for the worldwide seismology community to extract high-quality data.
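One simple statistical screen of the kind such a quality-control prototype might apply is flagging channels whose amplitude is a robust outlier among their peers, which catches dead or mis-calibrated instruments. The sketch below is a hypothetical illustration using a median-based z-score; it is not the actual LLNL/IRIS tooling.

```python
import numpy as np

def flag_suspect_channels(waveforms, z_threshold=3.5):
    """Flag waveforms whose RMS amplitude is a robust outlier among
    peers — a basic screen for dead or mis-calibrated channels."""
    rms = np.array([np.sqrt(np.mean(w ** 2)) for w in waveforms])
    median = np.median(rms)
    mad = np.median(np.abs(rms - median)) or 1e-12  # avoid divide-by-zero
    z = 0.6745 * (rms - median) / mad               # robust z-score
    return np.abs(z) > z_threshold

# Nine healthy channels plus one "dead" (all-zero) channel.
rng = np.random.default_rng(2)
waveforms = [rng.normal(0, 1.0, 1000) for _ in range(9)]
waveforms.append(np.zeros(1000))
flags = flag_suspect_channels(waveforms)
print(flags)  # only the dead channel is flagged
```

Using the median and median absolute deviation rather than the mean and standard deviation keeps the screen itself from being skewed by the very outliers it is trying to find.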
These innovative seismic-monitoring technologies should ultimately extend to other data-rich domains also important to Livermore’s national security missions, including applied electromagnetics, streaming signals, and the electric grid.