M&IC System History, Governance Model, and Strategy

In the latter half of the 1990s, the maturation of parallel computing technology made it possible for the nation to contemplate the development of production-level 3D scientific applications requiring super-teraflop computational capability. The Stockpile Stewardship Program (SSP) led in identifying this potential and proposed the Accelerated Strategic Computing Initiative (ASCI) as its spearhead to enable certification in the absence of underground testing, in conjunction with subcritical and other experiments and theory.

LLNL, as an institution, recognized that if one of its major programs was embarking on an adventure that had the potential to revolutionize scientific methods in the next century, the health of the institution depended on a science and technology (S&T) base that also had access to powerful ASCI-class computing environments. This strategic move kept the disciplines at the forefront and positioned LLNL as the pre-eminent simulation site today. From this notion was born Multiprogrammatic and Institutional Computing.

M&IC is truly institutional. Many directorates invest, and the institution invests. The growth of M&IC since 1997 has been significant, as shown in Figure 1 and Table 1; the total capacity currently available to M&IC scientists is about 550 TF/s.

Figure 1. Growth of M&IC computing power (in GF/s) from 1997-2009.

 

System             FY97   FY98   FY99   FY00   FY01   FY02   FY03   FY04   FY05   FY06   FY07   FY08   FY09
T3D                  37      0      0      0      0      0      0      0      0      0      0      0      0
Compass              35     70     70     70     70      0      0      0      0      0      0      0      0
TC98                  0      0     96    108    176    147      0      0      0      0      0      0      0
SUN                   0     12     24     12     12     12      0      0      0      0      0      0      0
Qbert                 0      0      0      0      0      0     12     12      0      0      0      0      0
Snowbert              0      0      0      0      0      0      0      0     57     57     57     57      0
Ebert                 0      0      0      0      0      0      0      0      0      0      0      0    563
TC2K                  0      0      0    683    683    683    683    683      0      0      0      0      0
GPS                   0      0      0      0      0    192    277    277    277    277      0      0      0
LX                    0      0      0      0     43    101      0      0      0      0      0      0      0
iLX                   0      0      0      0      0      0    634    678    678    678      0      0      0
ASCI Blue             0     16     89     99     74     74      0      0      0      0      0      0      0
ASCI Frost            0      0      0      0    326    326      0      0      0      0      0      0      0
MCR                   0      0      0      0      0  11059  11059  11059  11059  11059      0      0      0
Thunder               0      0      0      0      0      0      0  22938  22938  22938  22938  22938      0
Atlas                 0      0      0      0      0      0      0      0      0      0  44237  44237  44237
Zeus                  0      0      0      0      0      0      0      0      0      0  11059  11059  11059
Yana                  0      0      0      0      0      0      0      0      0      0   3034   3034   3034
uBGL                  0      0      0      0      0      0      0      0      0      0      0      0 229376
Hive (256 GB*4)       0      0      0      0      0      0      0      0      0      0      0      0    563
Hera                  0      0      0      0      0      0      0      0      0      0      0      0 121650
Total Peak GF        72     98    279    972   1384  12594  12665  35647  35009  35009  81325  81325 410482

Table 1. Growth of M&IC computing power (in GF/s) from 1997-2009.
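
To put the growth in Table 1 in perspective, the short Python sketch below computes the overall growth factor and the implied average annual growth from the "Total Peak GF" row. The function names are illustrative, and the average is a simple compound-growth estimate that assumes smooth year-over-year growth; actual growth came in steps as systems were installed and retired.

```python
# Total peak capacity per fiscal year in GF/s, copied from the
# "Total Peak GF" row of Table 1.
total_peak_gf = {
    1997: 72, 1998: 98, 1999: 279, 2000: 972, 2001: 1384, 2002: 12594,
    2003: 12665, 2004: 35647, 2005: 35009, 2006: 35009, 2007: 81325,
    2008: 81325, 2009: 410482,
}


def growth_factor(totals, start, end):
    """Ratio of peak capacity in fiscal year `end` to that in `start`."""
    return totals[end] / totals[start]


def average_annual_growth(totals, start, end):
    """Compound average growth per year between two fiscal years."""
    years = end - start
    return growth_factor(totals, start, end) ** (1.0 / years)


overall = growth_factor(total_peak_gf, 1997, 2009)
annual = average_annual_growth(total_peak_gf, 1997, 2009)
print(f"Overall growth FY97-FY09: {overall:.0f}x")
print(f"Average annual growth:    {annual:.2f}x per year")
```

Under that assumption, capacity grew by a factor of several thousand over the twelve years, or roughly a doubling each year on average.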

The M&IC governance model is both grass roots and hierarchical. At the grass-roots level, the "board of directors" (the Institutional Computing Executive Group, or ICEG) consists of well-known LLNL scientists who are qualified to identify deficiencies and request improvements. Typically, ICEG members are appointed by ADs in the various directorates. Hierarchically, M&IC management reports to the Director's Office, namely to the Deputy Director for S&T, who provides guidance relative to the institution's overall S&T goals and, at the highest level, manages allocations. Generally, the scientists' requests and the institution's goals can both be met, and M&IC facilitates reconciling the two.

Lest the investment levels highlighted in Figure 2 be viewed as excessive, we note that the M&IC environment is comparable to the best unclassified environments anywhere in the country, yet the total investment at LLNL is only about $11 million per year. Such is the power of leverage and momentum from partnering with the Advanced Simulation and Computing (ASC) Program.

Figure 2. M&IC cost history (in $K), FY03-FY10.

The institution covers all operational costs and also invests in the high performance computing (HPC) hardware; the programs and directorates invest only in the hardware. Each investor receives a share of the computing resource (called a bank) whose size is proportional to its level of investment. Access to the institution's banks is managed through an HPC request process that depends on the size of the request: smaller requests are awarded by the M&IC program office, while large requests must compete under the Grand Challenge process.
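
The proportional relationship between investment and bank size can be illustrated with a minimal Python sketch. The investor names, dollar amounts, and capacity figure are hypothetical and chosen only for the arithmetic; this does not represent the actual M&IC allocation procedure, which the text describes only as proportional to investment.

```python
def bank_shares(investments, capacity_gf):
    """Split a machine's capacity among investors in proportion to
    each investor's hardware investment.

    investments: mapping of investor name -> amount invested (any unit)
    capacity_gf: total capacity to divide, in GF/s
    """
    total = sum(investments.values())
    if total <= 0:
        raise ValueError("total investment must be positive")
    return {
        name: capacity_gf * amount / total
        for name, amount in investments.items()
    }


# Hypothetical example: three investors sharing a 1000 GF/s system.
example_investments = {
    "Institution": 600,      # hypothetical amount, in $K
    "Directorate A": 300,    # hypothetical amount, in $K
    "Directorate B": 100,    # hypothetical amount, in $K
}
banks = bank_shares(example_investments, capacity_gf=1000)
for name, size in banks.items():
    print(f"{name}: {size:.0f} GF/s")
```

In this sketch an investor contributing 30 percent of the hardware funds receives a bank equal to 30 percent of the capacity.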

Because of strong and consistent investments, LLNL has the benefit of one of the most experienced and well-staffed scientific computing centers in the world. An investment in hardware is leveraged by attention from experienced integrators, operators, and services staff, and by a well-engineered foundation in networks and storage. All of this considerably mitigates the risks inherent in investing in the newest and best cost-performance technologies.

Our platform strategy has been to straddle three complementary technology curves, shown in Figure 3, to appropriately balance risk and benefit. The first allows support for today's stockpile needs, the second delivers an affordable path to a future petaflop system, and the third provides a low-cost transition to the next generation of platforms. M&IC investments have favored curve #2, open source commodity clusters. We believe that for the next two- to three-year cycle, clusters are the best solution for M&IC.

Figure 3. Platform strategy technology curves.