
Testing and Evaluation


Presentation Transcript


  1. Testing and Evaluation Greg Lindahl (formerly University of Virginia, now HPTi)

  2. My Qualifications • assisted Wall Street clustering (130 machines) • built Centurion I & II (300 nodes) at UVa • 276-node Forecast Systems Lab cluster • Myrinet, Fibre Channel, complex software • every one is different • “we” (the community) would like to do 6,000+ nodes

  3. Motivations for Test and Eval • Roll your own: installation testing • Acceptance testing: trust but verify • Uh, does it run as fast as it’s supposed to? • We’ll pretend that “It doesn’t work; what the heck do we do now?” never happens

  4. Classes of Machine • Boring: 300 nodes • occasional failures OK, as long as the user job eventually runs • Exciting: 6000 nodes • much more hardware to be subtly flaky on you • a user’s job may never finish due to a series of errors (see the sketch below) • system software more likely to flake out
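
The gap between “boring” and “exciting” falls straight out of the arithmetic. Here is a back-of-the-envelope sketch in Python; the per-node failure rate is purely illustrative, not a number from the talk:

```python
# Probability that a capability job sees no node failure, assuming
# independent per-node failures. The failure rate below is purely
# illustrative, not a number from the talk.

def job_survival(nodes, hours, p_fail_per_node_hour=1e-5):
    """Chance that every node stays up for the whole job."""
    return (1.0 - p_fail_per_node_hour) ** (nodes * hours)

for n in (300, 6000):
    print(f"{n:5d} nodes, 24 h job: P(no failure) = {job_survival(n, 24):.3f}")
```

At this made-up rate, a 24-hour job survives about 93% of the time on 300 nodes but only about 24% of the time on 6,000 nodes, so a job that simply restarts after each failure may indeed never finish.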

  5. Classes of Failure • Infant mortality • burn in, you’re done (ha) • Systematic errors (broken network adapter, 10% bad cables) • a capability test can catch these • Software • Weird failures • Compaq shipped me 276 bad power supplies at FSL; only statistics pointed the finger at them (see the sketch below)
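
The power-supply story is a pure statistics problem: no single failure looks unusual, but the tallies do. A minimal sketch, assuming failure events have been harvested from repair logs; the event data and the 2x-average threshold are made up for illustration:

```python
# Tally failures per component and flag anything far above the fleet
# average, the way statistics fingered the FSL power supplies. The
# event data and the 2x threshold are illustrative only.

from collections import Counter

events = [("n012", "psu"), ("n044", "psu"), ("n101", "disk"),
          ("n200", "psu"), ("n233", "nic"), ("n250", "psu"),
          ("n301", "psu")]          # (node, failed component) pairs

counts = Counter(component for _, component in events)
expected = sum(counts.values()) / len(counts)   # naive uniform baseline

for component, n in counts.most_common():
    flag = "  <-- suspicious" if n > 2 * expected else ""
    print(f"{component:5s} {n:3d}{flag}")
```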

  6. How to Test • I use the same suite for both installation testing and acceptance testing • Simulated Use Testing is King (see the harness sketch below) • apps, I/O, both capability and capacity jobs, job mix • add in apps from other disciplines to stress the machine in unusual ways • NAS Parallel Benchmarks • Big Pile of Benchmarks (Susan and I)
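
One way to read “Simulated Use Testing is King” as code: throw a realistic job mix at the machine and count what comes back. A minimal harness sketch; the job scripts are hypothetical, and submit() assumes a qsub-style batch scheduler, so swap in whatever your site runs:

```python
# Submit a realistic mix of capability and capacity jobs, interleaved
# the way real users would submit them, and count failed submissions.

import random
import subprocess

JOB_MIX = [
    ("mm5_capability.sh", 1),   # one big job spanning most of the nodes
    ("nas_bt_small.sh", 10),    # many small capacity jobs
    ("io_stress.sh", 4),        # I/O load from another discipline
]

def submit(script):
    """Submit one job; returns the scheduler's exit status."""
    return subprocess.run(["qsub", script]).returncode

jobs = [s for s, count in JOB_MIX for _ in range(count)]
random.shuffle(jobs)            # interleave, the way real users would
failures = sum(submit(s) != 0 for s in jobs)
print(f"{failures} of {len(jobs)} submissions failed")
```

A real harness would also poll the queue and verify each job's output, not just the submission status.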

  7. Terascale is Different • Occasional failures that are no big deal on a small machine are fatal to large capability jobs • My 300-node reliability daemon has too many false positives for a terascale machine! (see the sketch below) • Probability of un-isolated problems increases
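
One common way to tame a reliability daemon’s false positives at terascale (not necessarily what the speaker did) is to require several consecutive failed probes before flagging a node. A runnable sketch, with a simulated 2% transient miss rate standing in for a real health check:

```python
# Flag a node only after K consecutive failed health checks, so a
# single transient miss does not raise an alarm. check_node() is a
# simulated stand-in for a real probe (ping, heartbeat, ...).

import random

K = 3                               # tunable; higher = fewer false alarms
strikes = {}                        # node -> consecutive failure count

def check_node(node):
    return random.random() > 0.02   # 2% transient miss rate, simulated

def sweep(nodes):
    flagged = []
    for node in nodes:
        if check_node(node):
            strikes[node] = 0
        else:
            strikes[node] = strikes.get(node, 0) + 1
            if strikes[node] == K:
                flagged.append(node)
    return flagged

nodes = [f"n{i:04d}" for i in range(6000)]
for hour in range(24):
    for node in sweep(nodes):
        print(f"hour {hour}: flag {node} after {K} straight misses")
```

With these made-up numbers, roughly one node per day gets flagged instead of the ~120 transient alarms a single-miss policy would raise on every sweep.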

  8. RAS Features Needed: Small and Terascale • Reliability: monitoring of networks, daemons, logfiles (see the log-scan sketch below) • today, not all relevant info comes out (soft memory errors, IDE drive retries) • some straightforward development needed (Myrinet has SNMP now, etc.) • symptoms can be subtle: a bad fibre caused an apparent software problem on an SGI, but now we know...
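
The “not all relevant info comes out” bullet is largely a log-scraping problem. A minimal sketch that scans a syslog for the two symptoms the slide names; the regexes and log path are illustrative, since real kernel and driver message formats vary:

```python
# Count occurrences of subtle hardware symptoms in a syslog. The
# patterns below are plausible examples, not authoritative formats.

import re
from collections import Counter

PATTERNS = {
    "soft_mem_error": re.compile(r"EDAC|ECC.*corrected", re.I),
    "ide_retry":      re.compile(r"hd[a-z]:.*(retry|reset)", re.I),
}

def scan(path="/var/log/messages"):
    hits = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            for name, pat in PATTERNS.items():
                if pat.search(line):
                    hits[name] += 1
    return hits

if __name__ == "__main__":
    for symptom, n in scan().items():
        print(f"{symptom}: {n} occurrences")
```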

  9. RAS Continued • Higher availability • reasonably fault tolerant: a compute node failure only takes out one job and doesn’t require operator intervention • Serviceability: • cluster partition (test a new software release) • rolling upgrade (nodes upgraded as available; see the sketch below) • checkpoint allows much better access to nodes without discomfiting users
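
A rolling upgrade reduces to a small loop once the scheduler exposes a few hooks. A sketch with hypothetical drain/upgrade/release functions; the real calls depend entirely on your batch system:

```python
# Upgrade nodes as they become free, never taking the machine down.
# drain/upgrade/release are hypothetical hooks into the scheduler and
# install system; idle() reports whether a node currently has no job.

import time

def drain(node): pass      # stop scheduling new jobs onto the node
def upgrade(node): pass    # push the new software release
def release(node): pass    # return the node to the scheduling pool

def rolling_upgrade(nodes, idle, poll_seconds=60):
    pending = set(nodes)
    while pending:
        for node in [n for n in pending if idle(n)]:
            drain(node)
            upgrade(node)
            release(node)
            pending.discard(node)
        time.sleep(poll_seconds)
```

Checkpointing makes this even easier: a node can be drained by checkpointing its job rather than waiting for the job to finish.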

  10. Summary • Lots of raw materials exist to attack the testing and evaluation problem • The methodology exists, too • Even if you buy a complete system, you still need to know how to write the acceptance test • Open-source RAS has a long way to go • but there is a small-system solution today

  11. Gong This Slide: MM5 Performance (t3a) [performance chart; y-axis in Gigaflops]
