
The Challenge of Scale

Presentation Transcript


  1. The Challenge of Scale Dan Reed Dan_Reed@unc.edu Chancellor’s Eminent Professor Vice Chancellor for IT and CIO University of North Carolina at Chapel Hill Director, Renaissance Computing Institute (RENCI) Duke University North Carolina State University University of North Carolina at Chapel Hill Renaissance Computing Institute

  2. On Being the Right Size The most obvious differences between different animals are differences of size, but for some reason the zoologists have paid singularly little attention to them. … But yet it is easy to show that a hare could not be as large as a hippopotamus, or a whale as small as a herring. For every type of animal there is a most convenient size, and a large change in size inevitably carries with it a change of form. J. B. S. Haldane Renaissance Computing Institute

  3. You Might Be A Big System Geek If … • You think a $2M cluster • is a nice, single user development platform • You need binoculars • to see the other end of your machine room • You order storage systems • and analysts issue “buy” orders for disk stocks • You measure system network connectivity • in hundreds of kilometers of cable/fiber • You dream about cooling systems • and wonder when fluorinert will make a comeback • You telephone the local nuclear power plant • before you boot your system Renaissance Computing Institute

  4. How Big Is Big? • Every 10X brings new challenges • 64 processors was once considered large • it hasn’t been “large” for quite a while • 1024 processors is today’s “medium” size • 2048-8192 processors is today’s “large” • we’re struggling even here • 100K processor systems • are under construction • we have fundamental challenges … • … and no integrated research program Norman et al Renaissance Computing Institute

  5. Large Systems Renaissance Computing Institute

  6. Petascale Systems In 2004 • Component count for 1 PF peak • 200,000 5 GF processors • assuming 4-way nodes • 50,000 NICs and associated switches • enough fiber to wire a small country • (optional) 50,000 local scratch disks • assuming 1 GB/processor • 200 TB of DRAM • they’re called jellybeans for a reason! • Other sundry (but stay tuned for alternatives) • one unused football field (either flavor) • conveniently located electrical generating station • >20 MW (w/o cooling) at ~100 watts/processor • Power issues • 20 MW at $0.05/kWh is $1000/hour or $8.7M/year IBM Blue Gene/L Renaissance Computing Institute
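
As a back-of-the-envelope check on the power figures above, here is a minimal Python sketch of the slide's arithmetic; the processor count, per-processor wattage, and electricity price are the values quoted on the slide.

```python
# Power-cost arithmetic for the hypothetical 2004 petascale design above.
processors = 200_000            # 5 GF processors for ~1 PF peak
watts_per_processor = 100       # ~100 W/processor, excluding cooling
price_per_kwh = 0.05            # $0.05 per kWh, as quoted on the slide

power_kw = processors * watts_per_processor / 1000        # 20,000 kW (20 MW)
cost_per_hour = power_kw * price_per_kwh                  # ~$1000/hour
cost_per_year = cost_per_hour * 24 * 365                  # ~$8.7M/year

print(f"{power_kw / 1000:.0f} MW -> ${cost_per_hour:,.0f}/hour, "
      f"${cost_per_year / 1e6:.2f}M/year")
```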

  7. Petascale Systems In 2008 • Technology trends • multicore processors • IBM Power4 and SUN UltraSPARC IV • Itanium “Montecito” in 2005 • quad-core and beyond are coming • reduced power consumption • laptop and mobile market drivers • increased I/O and memory interconnect integration • PCI Express, Infiniband, … • Let’s look forward a few years to 2008 • 8-way or 16-way cores (8 or 16 processors/chip) • ~10 GF cores (processors) and 4-way nodes (4, 8-way cores/node) • 12x Infiniband-like interconnect • With 10 GF processors • 100K processors and 3100 nodes (4-way with 8 cores each) • 1-3 MW of power, at a minimum Renaissance Computing Institute
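
A similar sketch for the 2008 projection: socket and core counts come from the slide, while the 10-30 W per core used for the power estimate is an assumed range chosen to be consistent with the slide's 1-3 MW figure.

```python
# Node-count and power arithmetic for the projected 2008 configuration.
total_cores = 100_000           # "100K processors"
cores_per_chip = 8              # 8-way cores
chips_per_node = 4              # 4-way nodes

cores_per_node = cores_per_chip * chips_per_node           # 32
nodes = total_cores // cores_per_node                       # 3125 (~3100)

for watts_per_core in (10, 30):                             # assumed range
    megawatts = total_cores * watts_per_core / 1e6
    print(f"{nodes} nodes, ~{megawatts:.0f} MW at {watts_per_core} W/core")
```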

  8. Power Consumption “The power problem is the No. 1 issue in the long-term for computing. It's time for us to stop making 6-mile-per-gallon gas guzzlers.” Greg Papadopoulos SUN Chief Technology Officer “This is a seminal shift in the industry. If all you do is make chips smaller, you will hit the power crisis. I'm optimistic that it is an opportunity to do more holistic design that takes into account everything . . . not just performance.” Bernie Meyerson IBM Chief Technology Officer “What matters most to the computer designers at Google is not speed, but power – low power because data centers can consume as much power as a city.” Eric Schmidt Google CEO Renaissance Computing Institute

  9. Power Consumption • Power has many implications • cost, for sure • but also physical system size and reliability • Blue Gene/L uses low power processors • that’s no accident • Moore’s law isn’t a birthright • CMOS scaling issues are now a challenge • power, junction size, fab line costs, … • Scaling also affects power and RAS • at ~50 nm feature size • static power (leakage) is comparable to dynamic (switching) power • leakage increases dramatically with operating temperature • SRAM soft error rate (SER) increased 30X (Intel) • when moving from 0.25 to 0.18 micron geometry and from 2 to 1.6V • ECC does not catch all errors (Compaq) • perhaps 10 percent uncaught • worse, cheap systems have no ECC Renaissance Computing Institute

  10. Node Failure Challenges • Domain decomposition • spreads vital data across all nodes • each spatial cell exists in one memory • except possible ghost or halo cells • Single node failure • causes blockage of the overall simulation • data is lost and must be recovered • “Bathtub” failure model operating regimes • infant mortality • normal mode • late failure mode • Simple checkpointing helps; the optimum interval is roughly τ ≈ √(2δM), where δ is the time to complete a checkpoint, M is the time before failure, and R is the restart time due to lost work [Figure: “bathtub” curve of failure rate vs. elapsed time: burn in, normal aging, late failure] Renaissance Computing Institute
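
The transcript lost the slide's formula image; the relation described is the standard first-order (Young) approximation of the optimal checkpoint interval, and the sketch below assumes that form, with δ the checkpoint time and M the mean time before failure as defined on the slide.

```python
from math import sqrt

def optimal_checkpoint_interval(delta, mttf):
    """First-order approximation tau ~ sqrt(2 * delta * M): checkpoint
    often enough to bound lost work, but not so often that checkpoint
    overhead dominates. delta and mttf must use the same time unit."""
    return sqrt(2.0 * delta * mttf)

# Example with assumed values: 5-minute checkpoints, 10-hour MTTF.
delta = 5 * 60          # seconds to write a checkpoint
mttf = 10 * 3600        # seconds of mean time before failure
tau = optimal_checkpoint_interval(delta, mttf)
print(f"checkpoint roughly every {tau / 60:.0f} minutes")
```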

  11. ASCI Q Petascale Reliability • Facing the issues • ASCI Q boot time is ~8 hours • not far from the system MTTF • application checkpoint frequency • component MTTF ≈ 1/λ, where λ = 1 − r • A few assumptions • assume independent component failures • an optimistic and not realistic assumption • N is the number of processors • r is probability a component operates for 1 hour • R is probability the system operates for 1 hour • Then R = r^N, or for large N the system MTTF ≈ 1/(N(1 − r)) hours [Figures: 1-hour reliability and MTTF (hours) vs. system size] Renaissance Computing Institute
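
The reliability relations on this slide are easy to sanity-check numerically; the per-component hourly reliability r below is an assumed value for illustration, not a figure from the slide.

```python
# R = r**N is the probability the whole system survives an hour when N
# independent components each survive it with probability r; for large N
# the system MTTF is roughly 1 / (N * (1 - r)) hours.
def one_hour_reliability(r, n):
    return r ** n

def system_mttf_hours(r, n):
    return 1.0 / (n * (1.0 - r))

r = 0.99999   # assumed: each component fails about once per 100,000 hours
for n in (1_000, 10_000, 100_000):
    print(f"N={n:>7}: R(1 h) = {one_hour_reliability(r, n):.3f}, "
          f"MTTF ~ {system_mttf_hours(r, n):,.0f} h")
```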

  12. ASCI White Availability (LLNL) Hardware failures dominate Source: Mark Seager Renaissance Computing Institute

  13. Experimental Fault Assessment • Memory • random single bit flips • text, data, heap and stack • regular and floating-point registers • Message passing • random bit flips (MPI level) • payload corruption • single bit flip or multiple bit flip (burst error) • Failure modes • application crash • MPI error detected via MPI error handler • application detected via assertion checks • other (e.g., segmentation fault) • application hang (no termination) • application execution completion • correct or incorrect output (fault not manifest) Renaissance Computing Institute
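
A toy sketch of the memory fault model described above: flip one randomly chosen bit in a buffer and record where. The actual study instrumented running MPI applications; this only illustrates the injection step itself.

```python
import random

def flip_random_bit(buf: bytearray):
    """Inject a single-bit fault by XOR-ing one randomly chosen bit,
    returning (byte_index, bit_index) so the trial can be logged."""
    byte_index = random.randrange(len(buf))
    bit_index = random.randrange(8)
    buf[byte_index] ^= 1 << bit_index
    return byte_index, bit_index

data = bytearray(b"application state under test")
print("flipped bit at", flip_random_bit(data))
print(data)
```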

  14. Fault Code Suite Characteristics See C-D. Lu and D. A. Reed, “Assessing Fault Sensitivity in MPI Applications,” SC2004, to appear Source: Charng-da Lu Renaissance Computing Institute

  15. NAMD: Applications Can Help • Output is correct if each number is identical, up to 3 decimal places, to the result from a baseline run • NAMD has assertion and range checks • on certain messages and values (e.g., molecular velocities) Source: Charng-da Lu Renaissance Computing Institute
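
A minimal sketch of that correctness criterion, reading "identical up to 3 decimal places" as agreement within 10^-3 of the baseline value; the tolerance interpretation is an assumption, not NAMD's actual comparison code.

```python
def outputs_match(baseline, candidate, places=3):
    """True if every value agrees with the baseline to the given number
    of decimal places (interpreted here as an absolute tolerance)."""
    tol = 10.0 ** (-places)
    return len(baseline) == len(candidate) and all(
        abs(b - c) < tol for b, c in zip(baseline, candidate)
    )

print(outputs_match([1.23456, 2.71828], [1.23449, 2.71828]))  # True
print(outputs_match([1.234, 2.718], [1.237, 2.718]))          # False
```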

  16. NAMD Working Sets Renaissance Computing Institute

  17. CAM: Or Maybe Not • CAM from CCSM • Community Atmosphere Model Source: Charng-da Lu Renaissance Computing Institute

  18. What is a Petascale System? • Embrace failure, complexity, and scale • a mind set change Renaissance Computing Institute

  19. Autonomic Behavior • Learn from biology • a cold is not fatal • systems build resistance • Petascale implications • monitor internal state • even if substantial resources are required • respond to warnings, not just failures • develop adaptation strategies Renaissance Computing Institute

  20. ACPI Power Control • ACPI • Advanced Configuration and Power Interface • HP, Intel, Microsoft, Phoenix, Toshiba, … • OS management of system power consumption • originally targeted at laptop/mobile device market • ACPI defines the following • hardware registers on chip • BIOS interfaces • Thermal failure is a big issue for HPC systems • monitor and react in many ways • processor clock speed based on code • disk spin down to conserve power/reduce heat Renaissance Computing Institute
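
On Linux, the ACPI thermal zones the slide refers to are exposed under /sys/class/thermal; a minimal monitoring sketch is below. Zone names, availability, and units (millidegrees Celsius) are platform details and may vary.

```python
from pathlib import Path

def read_acpi_thermal_zones():
    """Return {zone_name: temperature_C} for ACPI thermal zones exposed
    by the Linux kernel; sysfs reports values in millidegrees Celsius."""
    readings = {}
    for zone in sorted(Path("/sys/class/thermal").glob("thermal_zone*")):
        try:
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue            # zone present but no readable temperature
        readings[zone.name] = millideg / 1000.0
    return readings

print(read_acpi_thermal_zones())   # e.g. {'thermal_zone0': 47.0}
```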

  21. SMART Disks • SMART • Self Monitoring, Analysis and Reporting Technology • on-disk monitoring and data analysis • ATA/IDE and SCSI support • Typical SMART capabilities • head flying height, data throughput, spin up time • reallocated sector count, seek error rate, seek time performance • spin retry count, drive calibration retry count, temperature • Drive spin up time (for example) • indicative of motor or bearing failure • By monitoring, one can identify • performance problems • failure probability Renaissance Computing Institute
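
The SMART attributes above can be read with smartmontools; the sketch below shells out to `smartctl -A` and scans for a temperature attribute. Attribute names and column layout vary by drive vendor, so the parsing here is a loose assumption rather than a fixed format.

```python
import subprocess

def smart_temperature(device="/dev/sda"):
    """Run 'smartctl -A <device>' and return the raw value of the first
    attribute whose name mentions temperature, or None if not found."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True, check=False)
    for line in result.stdout.splitlines():
        if "Temperature" in line:
            return line.split()[-1]   # raw value is typically the last column
    return None

print(smart_temperature())            # usually requires root privileges
```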

  22. Power Consumption and Failures [Diagram: fuzzy logic decision process: system sensors feed a fuzzifier; tolerance rules and cluster centroids drive the decision; a defuzzifier turns outputs into actuator commands, with contract monitor task(s) closing the loop] • Intelligent monitoring • Autopilot tagged sensors • SMART (disk temperature) • ACPI (thermal zone) • active cooling policy and throttling • lm_tools • CPU and board temperature • accessible via Autopilot Manager • statistical sampling for monitoring • failure prediction based on history • Failure model applications • adaptive checkpointing • batch queue selection Renaissance Computing Institute

  23. Linpack Temperature Measurements • Configuration • x86 Linux cluster • Myrinet interconnect • Measurements • microprocessor (shown at left) • motherboard measured at six locations (next slide) [Figure: microprocessor temperature in Celsius over a Linpack run, with the point where computation terminated marked] Renaissance Computing Institute

  24. Linpack Temperature Measurements • Why? Thermal dynamics matter • reliability and fault management • power and economics • Arrhenius equation temperature implications • mean time to catastrophic failure of commercial silicon drops 2X for every 10 C above 70 C [Figure: motherboard temperatures in Celsius over a Linpack run, with the point where computation terminated marked] Renaissance Computing Institute
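
A small sketch of the slide's rule of thumb: failure rates roughly double (so MTTF roughly halves) for every 10 C above 70 C. The 1,000,000-hour baseline MTTF below is an assumed figure for illustration, echoing the component MTTF mentioned on the next slide.

```python
def derated_mttf(temp_c, base_mttf_hours, ref_c=70.0):
    """Scale a baseline MTTF (valid at ref_c) to an operating temperature
    using the 2x-per-10-C acceleration quoted on the slide."""
    acceleration = 2.0 ** max(0.0, (temp_c - ref_c) / 10.0)
    return base_mttf_hours / acceleration

for t in (70, 80, 90):
    print(f"{t} C: MTTF ~ {derated_mttf(t, 1_000_000):,.0f} hours")
```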

  25. Failures and Autonomic Recovery • 10^6 hours for component MTTF • sounds like a lot until you divide by 10^5! • It’s time to take RAS seriously • systems do provide warnings • soft bit errors – ECC memory recovery • disk read/write retries, packet loss and retransmission • status and health provide guidance • node temperature/fan duty cycles • Software and algorithmic responses • diagnostic-mediated checkpointing • algorithm-based fault tolerance • domain-specific fault tolerance • loosely synchronous algorithms • optimal system size for minimum execution time [Figure: LANL 10 TF Pink node temperature] Renaissance Computing Institute

  26. Software Evolution and Faults • Cost dynamics • people costs are rising • hardware costs are falling • Two divergent software world views • parallel systems • life is good – deus ex machina • Internet • we’ll all die horribly – trust no one • What does this mean for software? • abandon the pre-industrial “craftsman model” • adopt an “automated evolution” model Renaissance Computing Institute

  27. Evolutionary Software: A Revolution? [Related fields: artificial life, chaos theory, genetic algorithms, dynamical systems, neural networks, decentralized control, genetic programming] • Learn some biological lessons • environmental adaptation • homeostatic behavior and immunity • social structures and specialization • ants, termites, … • Evolve software components • adaptive software agents • interacting building blocks • challenges • define basic building blocks • specify evolutionary rules Renaissance Computing Institute

  28. One Possible Model [Diagram: components include software building blocks, candidate libraries, a genetic programming engine, fitness functions, fault models, fault rules, fault injection, a fault monitor, sensors, system execution and behavior, performance measurement and data, failure indicators, software controls, and a fuzzy logic assessment with fuzzifier, fuzzy sets, and defuzzifier] Renaissance Computing Institute

  29. Renaissance Computing Institute • Vision • a multidisciplinary institute • academe, commerce and society • broad in scope and participation • from art to zoology • Objectives • enrich and empower human potential • communities at all levels • create multidisciplinary partnerships • science, engineering and computing • commerce, humanities and the arts • develop and deploy leading infrastructure • driven by collaborative opportunities • computing, communications and data management • visualization, collaboration and manufacturing • enable and sustain economic development Renaissance Computing Institute
