Math for Data Mining

Finding patterns within massive amounts of unexplored data requires the use of sophisticated linear algebra and presents a unique challenge. Van Emden Henson, Geoff Sanders, and their team at Livermore’s Center for Applied Scientific Computing (CASC) have developed improved matrix factorization algorithms to address the common problems encountered when parsing extremely large, intricate datasets. The project, funded by the Laboratory Directed Research and Development (LDRD) program, builds on work completed during a previous LDRD-funded project by expanding factorization techniques to yield new, robust capabilities aimed at more complicated data sets.

The results of these factorizations allow researchers to discover relationships in the data, such as hierarchies, groups, and communities. Their current work has significant potential applications in the Global Security directorate and is also closely aligned with the Laboratory’s cyber and intelligence missions.

The team’s previous LDRD-funded work for CASC dealt with data sets referred to as symmetric square matrices, which are “very well-behaved and involve only real numbers in their calculations and therefore are easy to work with on the computer—interpretation is simpler,” explains Sanders. “What we’ve moved on to with this current project,” adds Henson, “are data sets and matrices that do not behave as nicely. We often run into rectangular matrices and nonsymmetric square matrices, whose structure severely complicates the linear algebra.”

Says Sanders, “Factoring such matrices introduces imaginary numbers and other ‘ugly’ math that is so difficult that researchers often artificially symmetrize this kind of data to make it easier to deal with, even though doing so skews the results.”

In order to find meaning within these datasets, the team has applied the algorithms they developed for symmetric square matrices, but with several key alterations. “What we bring to the table with this kind of data is a multilevel approach—performing calculations on several levels of data. This relatively unexplored avenue could prove to be faster and more accurate than more traditional approaches to these matrices,” says Henson. Both Henson and Sanders come from a background in multigrid and multilevel analysis, skills that have been essential to efficiency improvement within physics simulation codes at the Laboratory.

Rectangular matrices, such as those that might represent the frequency of specific words in an extensive collection of unfamiliar documents, are problematic to analyze. In this example, researchers might want to know what information these documents contain and the most important topics within that information.

To uncover this, the team developed a method of building the rectangular matrices into larger matrices and then factoring them so they can be analyzed in a similar way to the symmetric square matrices. This approach can also allow researchers to find hierarchies within groups, at many different levels of detail or breadth.

“Through revealing factorization,” says Henson, “the data can find the relationships for us by grouping things together in ways that we never would have noticed in reading it ourselves.” One proof-of-concept exercise conducted by members of the team demonstrates the new approach. Using a random group of profiles on the career networking website LinkedIn, colleagues at Waterloo University created matrices representing scientists and their listed skills. They used the techniques developed in this LDRD project, analyzing the matrices to find relationships between the people, their careers, and their skills.

For instance, what skills are most commonly found in which careers? Which people have which skills in common? Clustering, partitioning, and community identification within this data set was accomplished using the newly developed factorization algorithms in conjunction with algebraic multigrid analysis.

Nonsymmetric square matrix analysis presents equally challenging but different hurdles. Some data, for example, contain cyclic relationships among several groups of vertices, but highly cyclic structures are undetectable when the matrix is symmetrized and traditional factorizations are employed. For this problem, the team discovered a factorization of nonsymmetric square matrices that readily reveals the cyclic structure. They demonstrated that patterns in the complex numbers present in the factorization reveal the cyclic community structure, and also that the factorization is easier to compute when highly cyclic structure is present.

Team member Christine Klymko has performed much of the project’s work in this area; her presentation of the findings won the Best Poster award at the Laboratory’s internal postdoctoral poster session. Five summer students have contributed to the overall project as well.

These solutions to revealing opaque data have built the solid foundation for approaching similar mathematical issues in the future, with highly valuable application to practical problems. With further development and refinement, the project will have a major impact on the way the Laboratory is able to process and analyze crucial data.