

  1. Making Red Storm a Success Subtitle: It Takes a Village to Build a Supercomputer June 7-9, 2006 Sue Kelly Sandia National Laboratories smkelly@sandia.gov, 505-845-9770 SAND-2006-3384P Unlimited Release Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

  2. Outline • Red Storm brief background • A year of continual improvement • Sandia contributions to Red Storm • Current status and future work

  3. Red Storm Project Background • ASC Program Capability (versus capacity) HPC machine • Resource to NNSA’s Stockpile Stewardship program for advanced simulations • Available on a limited basis to other national security programs and scientific endeavors • Timeline • Contract awarded to Cray in September 2002 • Hardware delivered September 2004 through January 2005 • Achieved initial operation in March 2005 • Achieved limited availability in September 2005 • Machine general availability targeted for September 2006

  4. Red Storm Configuration

  5. Topic 2 - A Year of Continual Improvement • Significant efforts in hardware and software reliability bore fruit • Performance improvements were integrated into the production source base, giving the best of both worlds – performance and reliability • Chart: MPI Latency (lower is better); a measurement sketch follows this slide
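
The latency chart referenced above reflects ping-pong style measurements. For context, below is a minimal sketch of how zero-byte MPI latency is typically measured; it is an illustrative microbenchmark written for this transcript, not the code used to produce the Red Storm numbers.

    /* Minimal zero-byte MPI ping-pong latency sketch (illustrative only). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int iters = 10000;
        int rank;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)  /* one-way latency = half the average round-trip time */
            printf("zero-byte latency: %.2f usec\n",
                   (t1 - t0) / (2.0 * iters) * 1.0e6);

        MPI_Finalize();
        return 0;
    }

One-way latency is reported as half the average round-trip time, the usual convention for latency charts of this kind.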

  6. Topic 3 - Sandia Contributions During Initial Development • Programmatic: • Active management • Mentorship of Cray developers • Milestone-based payment schedule • Technical: • Sandia-developed architecture • Based on more than a decade of experience with MPP (Massively Parallel Processor) systems • Created a Statement of Work that embodied the design • Software for application run-time environment • Compute node lightweight kernel operating system • Virtual file system library • Logarithmic job launcher (see the fan-out sketch after this slide) • Compute processor allocator
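
The logarithmic job launcher mentioned above fans the launch request out over a tree instead of contacting every compute node serially, so a full-machine launch takes O(log N) forwarding steps rather than O(N). The sketch below illustrates the idea with a binomial-tree child calculation; compute_children() is a hypothetical helper written for illustration, not the actual Red Storm launcher interface.

    /* Sketch of a binomial-tree (logarithmic) fan-out, as used conceptually
     * by tree-structured job launchers.  Not the Red Storm launcher API. */
    #include <stdio.h>

    /* Fill children[] with the nodes that node `rank` should forward the
     * launch request to, in a binomial tree rooted at node 0.
     * A node forwards to rank + 2^k for every power of two greater than
     * its own rank, so the tree depth is ceil(log2(nnodes)). */
    static int compute_children(int rank, int nnodes, int children[], int max)
    {
        int n = 0;
        for (int bit = 1; bit < nnodes; bit <<= 1) {
            if (bit > rank && rank + bit < nnodes && n < max)
                children[n++] = rank + bit;
        }
        return n;
    }

    int main(void)
    {
        int children[32];
        const int nnodes = 8;           /* small example machine */
        for (int rank = 0; rank < nnodes; rank++) {
            int n = compute_children(rank, nnodes, children, 32);
            printf("node %d forwards launch to:", rank);
            for (int i = 0; i < n; i++)
                printf(" %d", children[i]);
            printf("\n");
        }
        return 0;
    }

On a machine with more than 10,000 compute nodes, this kind of fan-out completes the launch in roughly 14 forwarding steps instead of thousands of serial contacts.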

  7. Risk Mitigation Efforts • Developed C version of SeaStar NIC (Network Interface Chip) firmware • Implementation reduced latency from >25 μsec to 7 μsec • Deployed in version 1.2 of Cray’s XT3 software release • Provided first version of NIC-resident network software/firmware (protocol offload engine) • Introduced compatible MPI enhancements • Developed interim booting mechanism • Used for development • Scaled to several thousand nodes • Developed interim parallel I/O file system since Lustre was not ready in time for initial operation • The HCFS was an extension of PVFS

  8. Ties to Research • Light Weight Kernel OS (Operating System) • SUNMOS -> Puma -> Cougar -> Catamount • Network Interconnects • Ongoing Portals work; Red Storm implements Portals V3.3 • Prototypes of NIC-based Portals implementations • CANNOT OVEREMPHASIZE THE IMPORTANCE OF THIS PRIOR RESEARCH • MPI • Collectives • Overlap and independent progress (see the nonblocking MPI sketch after this slide) • I/O • Parallel I/O • High Performance I/O • Reliability, Availability, and Serviceability (RAS) • Cluster Integration Tool Kit • Theoretical RAS analysis • Supplemental slides contain references
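
The "overlap and independent progress" item concerns whether MPI can move data while the application computes. Below is a minimal sketch of the application-side pattern: post nonblocking operations, compute on independent data, then complete the communication. Whether the transfer actually progresses during the compute phase depends on NIC offload and the MPI progress engine, which is what the cited Sandia work evaluates; do_local_work() and the buffer sizes are placeholders.

    /* Sketch of communication/computation overlap with nonblocking MPI. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N (1 << 20)

    static void do_local_work(double *x, int n)
    {
        for (int i = 0; i < n; i++)     /* stand-in for real computation */
            x[i] = x[i] * 1.000001 + 1.0;
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *sendbuf = calloc(N, sizeof *sendbuf);
        double *recvbuf = calloc(N, sizeof *recvbuf);
        double *work    = calloc(N, sizeof *work);
        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;

        /* Post the exchange first ... */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... compute on data that does not depend on the exchange ... */
        do_local_work(work, N);

        /* ... then complete the communication. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        free(sendbuf); free(recvbuf); free(work);
        MPI_Finalize();
        return 0;
    }

With true offload and independent progress, the wait at the end completes almost immediately because the message moved while the compute loop ran; without it, most of the transfer time is paid inside MPI_Waitall.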

  9. The Hero Runs • The early adopters • Science runs • POP, SEAM, CTH • Application verification • CTH, ITS, Sage, Partisn, Salinas, Alegra, Presto, Calore • Application scaling • POP, SEAM, CTH, Salinas, Sage, ITS • Benchmarking • UMT2K, sPPM, HPL, HPCC • The first “production” users • Bert Still – “redstorm-s seems to be well ahead of other MPP machines at a comparable point in their development cycle; other systems i've used are continuing to have lustre problems well into their 2nd and 3rd years of service.  ultimately, the people make the system work, and the staff at SNL are outstanding” • Tim Jones – “Analysis codes run fast on red storm”, “I/O improvements has helped my analysis time tremendously; jobs are ~3-4 time faster” • Ref: Daly talk

  10. The Village • Champions for the computer facility • The networking guys • Viz and data services • Security team • User account processes and programs • Accounting system • Help desk • Machine oversight and work prioritization committee

  11. Topic 4: Current Status - Exceptional scaling results • Charts: SEAM Benchmark on Red Storm and Blue Gene; Sage Results

  12. Numerous Production Successes • LANL classified work • 5000 nodes • Running since January • “Scaling is nearly linear” • Mean Time Between Interrupts (MTBI) was 17 hours during last 3 months • CTH • 2000, 5000 node jobs • Ran for 5 days • Salinas • 400 nodes – multiple jobs • Each simulation ran approximately 5 days • Scaling study up to 2048 nodes • Fuego • 1024 nodes – multiple jobs • Running since February • Presto • 512, 1024 nodes • “achieved a computational rate of 7.5msec/day (simulation time over clock time). This is the fastest computational rate that we have seen for a full body B61 model (1.5 million elements)” • pF3d • 128, 256 nodes (due to unavailability of more nodes) • Running since Fall ’05 • “There are now significant changes in the NIF target designs, which will vastly improve margin and robustness…with the information gained from the Redstorm runs, we can streamline a lot of our work by reducing parameter space."

  13. Current & Future Red Storm Efforts • Ramping up to support hardware upgrade: 5th row, dual-cores, SeaStar 2.1 NIC—all due this summer • Added support for dual-core AMD Opterons to the compute node light weight kernel OS • Implemented as two virtual nodes • Master processor does network I/O for both • Adding support for network protocol offload engine in the Seastar network interface chip • reduces zero-byte latency from ~7msec to ~4msec • Combining above two efforts to support 4-way AMD Opterons • Formed a team to improve Lustre performance • Initial efforts are aimed at assessing current state • Developing a methodology for analyzing progress • Stratifying the relevant components and measuring performance at each point in the pipe line

  14. Summary • This talk focused on Sandia contributions and how the Sandia research program was critical to the success of the efforts. • While many challenges remain, Red Storm has evolved into a high-performing, scalable platform for production use. • The XT3 product line, based on Red Storm, has helped other scientific communities accomplish their goals (PSC, ORNL, ERDC, CSCS, AWE, …).

  15. Selected References for Each Research Area • Light Weight Kernel OS (Operating System) • Brightwell, Ron, Rolf Riesen, Keith Underwood, Trammell B. Hudson, Patrick Bridges, Arthur B. Maccabe, "A Performance Comparison of Linux and a Lightweight Kernel," Conference Paper, IEEE International Conference on Cluster Computing, December 2003. • Maccabe, Arthur B., Patrick G. Bridges, Ron B. Brightwell, Rolf E. Riesen, Trammell B. Hudson, "Highly Configurable Operating Systems for Ultrascale Systems," Workshop Paper, First International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters, June 2004. • Kelly, Suzanne M., Ron B. Brightwell, John P. VanDyke, "Catamount Software Architecture with Dual Core Extensions," Conference Paper, Cray User Group, May 2006. • Network Interconnects • Pedretti, Kevin and Ron Brightwell, "A NIC-Offload Implementation of Portals for Quadrics QsNet," Proceedings of the Fifth LCI International Conference on Linux Clusters, May 2004. • Brightwell, Ron B., Douglas Doerfler, Keith D. Underwood, "A Comparison of 4X InfiniBand and Quadrics Elan-4 Technologies," Conference Paper, 2004 International Conference on Cluster Computing (Cluster 2004), September 2004. • Brightwell, Ron, Trammell Hudson, Kevin Pedretti, Keith D. Underwood, Rolf Riesen, "Implementation and Performance of Portals 3.3 on the Cray XT3," Conference Paper, IEEE International Conference on Cluster Computing, September 2005. • Brightwell, Ron, Trammell Hudson, Kevin Pedretti, Keith D. Underwood, "Cray's SeaStar Interconnect: Balanced Bandwidth for Scalable Performance," Journal Article, IEEE Micro, Accepted/Published June 2006.

  16. Selected References for Each Research Area (cont) • MPI • Brightwell, Ron, Keith D. Underwood, "Evaluation of an Eager Protocol Optimization for MPI," Conference Paper, Tenth European PVM/MPI User Group Conference, September 2003. • Brightwell, Ron, Sue Goudy, Arun Rodrigues, Keith D. Underwood, "Implications of Application Usage Characteristics for Collective Communication Offload," Journal Article, International Journal of High-Performance Computing and Networking, Special Issue: Design and Performance Evaluation of Group Communication in Parallel and Distributed Systems, Vol. 4, No. 2, Accepted/Published February 2006. • Brightwell, Ron B., Rolf Riesen, Keith D. Underwood, "Analyzing the Impact of Overlap, Offload, and Independent Progress for MPI," Journal Article, International Journal of High Performance Computing Applications, Vol. 19, No. 2, pp. 103–117, Accepted/Published August 2005. • I/O • Coloma, Kenin, Alok N. Choudhary, Wei-keng Liao, Lee Ward, Eric Russell, Neil Pundit, "Scalable High-level Caching for Parallel I/O," IPDPS 2004. • Oldfield, Ron A., David F. Kotz, "Improving Data Access for Computational Grid Applications," Journal Article, Cluster Computing: The Journal of Networks, Software Tools and Applications, Accepted/Published June 2005. • Coloma, Kenin, Alok N. Choudhary, Avery Ching, Wei-keng Liao, Seung Woo Son, Mahmut T. Kandemir, Lee Ward, "Power and Performance in I/O for Scientific Applications," IPDPS 2005.

  17. Selected References for Each Research Area (cont) • Reliability, Availability, and Serviceability (RAS) • Laros, James H., III, Lee Ward, Nathan W. Dauchy, Ron B. Brightwell, Trammell B. Hudson, Ruth A. Klundt, "An Extensible, Portable, Scalable Cluster Management Software Architecture," Conference Paper, IEEE International Conference on Cluster Computing, September 2002. • Laros, James H., III, Lee H. Ward, Nathan W. Dauchy, James Vasak, Ruth A. Klundt, Glenn A. Laguna, Marcus R. Epperson, Jon R. Stearley, "The Cluster Integration Toolkit," Conference Paper, Cluster World Conference and Expo, June 2003. • Kelly, Suzanne M., "A Use Case Model for RAS in an MPP Environment," Conference Paper, Cray User Group, May 2004. • Stearley, Jon R., "Defining and Measuring Supercomputer Reliability, Availability, and Serviceability (RAS)," Conference Paper, Linux Clusters Institute (LCI05), April 2005. • Laros, James H., III, "A Software and Hardware Architecture for a Modular, Portable, Extensible Reliability Availability and Serviceability System," Conference Paper, 2nd Workshop on High Performance Computing Reliability Issues, in conjunction with the 12th International Symposium on High Performance Computer Architecture, February 2006.
