
Testing and Evaluation


Presentation Transcript


  1. Testing and Evaluation Greg Lindahl (formerly University of Virginia, now HPTi)

  2. My Qualifications • assisted Wall Street clustering (130 machines) • built Centurion I & II (300 nodes) at UVa • 276-node Forecast Systems Lab cluster • Myrinet, Fibre Channel, complex software • every one is different • “we” (the community) would like to do 6,000+ nodes

  3. Motivations for Test and Eval • Roll your own: installation testing • Acceptance testing: trust but verify • Uh, does it run as fast as it’s supposed to? • We’ll pretend that “It doesn’t work; what the heck do we do now?” never happens

  4. Classes of Machine • Boring: 300 nodes • occasional failures OK, as long as the user job eventually runs • Exciting: 6000 nodes • much more hardware to be subtly flaky on you • a user’s job may never finish due to a series of errors (see the sketch below) • system software more likely to flake out
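
The gap between “boring” and “exciting” falls straight out of the arithmetic. Here is a back-of-the-envelope sketch in Python; the per-node failure rate is purely illustrative, not a number from the talk:

```python
# Probability that a capability job sees no node failure, assuming
# independent per-node failures. The failure rate below is purely
# illustrative, not a number from the talk.

def job_survival(nodes, hours, p_fail_per_node_hour=1e-5):
    """Chance that every node stays up for the whole job."""
    return (1.0 - p_fail_per_node_hour) ** (nodes * hours)

for n in (300, 6000):
    print(f"{n:5d} nodes, 24 h job: P(no failure) = {job_survival(n, 24):.3f}")
```

At this made-up rate, a 24-hour job survives about 93% of the time on 300 nodes but only about 24% of the time on 6,000 nodes, so a job that simply restarts after each failure may indeed never finish.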

  5. Classes of Failure • Infant mortality • burn in, you’re done (ha) • Systematic errors (broken network adapter, 10% bad cables) • a capability test can catch these • Software • Weird failures • Compaq shipped me 276 bad power supplies at FSL; only statistics pointed the finger at them (see the sketch below)
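
The power-supply story is a pure statistics problem: no single failure looks unusual, but the tallies do. A minimal sketch, assuming failure events have been harvested from repair logs; the event data and the 2x-average threshold are made up for illustration:

```python
# Tally failures per component and flag anything far above the fleet
# average, the way statistics fingered the FSL power supplies. The
# event data and the 2x threshold are illustrative only.

from collections import Counter

events = [("n012", "psu"), ("n044", "psu"), ("n101", "disk"),
          ("n200", "psu"), ("n233", "nic"), ("n250", "psu"),
          ("n301", "psu")]          # (node, failed component) pairs

counts = Counter(component for _, component in events)
expected = sum(counts.values()) / len(counts)   # naive uniform baseline

for component, n in counts.most_common():
    flag = "  <-- suspicious" if n > 2 * expected else ""
    print(f"{component:5s} {n:3d}{flag}")
```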

  6. How to Test • I use the same suite for both installation testing and acceptance testing • Simulated Use Testing is King (see the harness sketch below) • apps, I/O, both capability and capacity jobs, job mix • add in apps from other disciplines to stress the machine in unusual ways • NAS Parallel Benchmarks • Big Pile of Benchmarks (Susan and I)
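
One way to read “Simulated Use Testing is King” as code: throw a realistic job mix at the machine and count what comes back. A minimal harness sketch; the job scripts are hypothetical, and submit() assumes a qsub-style batch scheduler, so swap in whatever your site runs:

```python
# Submit a realistic mix of capability and capacity jobs, interleaved
# the way real users would submit them, and count failed submissions.

import random
import subprocess

JOB_MIX = [
    ("mm5_capability.sh", 1),   # one big job spanning most of the nodes
    ("nas_bt_small.sh", 10),    # many small capacity jobs
    ("io_stress.sh", 4),        # I/O load from another discipline
]

def submit(script):
    """Submit one job; returns the scheduler's exit status."""
    return subprocess.run(["qsub", script]).returncode

jobs = [s for s, count in JOB_MIX for _ in range(count)]
random.shuffle(jobs)            # interleave, the way real users would
failures = sum(submit(s) != 0 for s in jobs)
print(f"{failures} of {len(jobs)} submissions failed")
```

A real harness would also poll the queue and verify each job's output, not just the submission status.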

  7. Terascale is Different • Occasional failures that are no big deal on a small machine are fatal to large capability jobs • My 300-node reliability daemon has too many false positives for a terascale machine! (see the sketch below) • Probability of un-isolated problems increases
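
One common way to tame a reliability daemon’s false positives at terascale (not necessarily what the speaker did) is to require several consecutive failed probes before flagging a node. A runnable sketch, with a simulated 2% transient miss rate standing in for a real health check:

```python
# Flag a node only after K consecutive failed health checks, so a
# single transient miss does not raise an alarm. check_node() is a
# simulated stand-in for a real probe (ping, heartbeat, ...).

import random

K = 3                               # tunable; higher = fewer false alarms
strikes = {}                        # node -> consecutive failure count

def check_node(node):
    return random.random() > 0.02   # 2% transient miss rate, simulated

def sweep(nodes):
    flagged = []
    for node in nodes:
        if check_node(node):
            strikes[node] = 0
        else:
            strikes[node] = strikes.get(node, 0) + 1
            if strikes[node] == K:
                flagged.append(node)
    return flagged

nodes = [f"n{i:04d}" for i in range(6000)]
for hour in range(24):
    for node in sweep(nodes):
        print(f"hour {hour}: flag {node} after {K} straight misses")
```

With these made-up numbers, roughly one node per day gets flagged instead of the ~120 transient alarms a single-miss policy would raise on every sweep.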

  8. RAS Features Needed: Small and Terascale • Reliability: monitoring of networks, daemons, logfiles (see the log-scan sketch below) • today, not all relevant info comes out (soft memory errors, IDE drive retries) • some straightforward development needed (Myrinet has SNMP now, etc.) • symptoms can be subtle: a bad fibre caused an apparent software problem on an SGI, but now we know...
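
The “not all relevant info comes out” bullet is largely a log-scraping problem. A minimal sketch that scans a syslog for the two symptoms the slide names; the regexes and log path are illustrative, since real kernel and driver message formats vary:

```python
# Count occurrences of subtle hardware symptoms in a syslog. The
# patterns below are plausible examples, not authoritative formats.

import re
from collections import Counter

PATTERNS = {
    "soft_mem_error": re.compile(r"EDAC|ECC.*corrected", re.I),
    "ide_retry":      re.compile(r"hd[a-z]:.*(retry|reset)", re.I),
}

def scan(path="/var/log/messages"):
    hits = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            for name, pat in PATTERNS.items():
                if pat.search(line):
                    hits[name] += 1
    return hits

if __name__ == "__main__":
    for symptom, n in scan().items():
        print(f"{symptom}: {n} occurrences")
```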

  9. RAS Continued • Higher availability • reasonably fault tolerant: a compute node failure only takes out one job and doesn’t require operator intervention • Serviceability: • cluster partition (test a new software release) • rolling upgrade (nodes upgraded as available; see the sketch below) • checkpoint allows much better access to nodes without discomfiting users
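
A rolling upgrade reduces to a small loop once the scheduler exposes a few hooks. A sketch with hypothetical drain/upgrade/release functions; the real calls depend entirely on your batch system:

```python
# Upgrade nodes as they become free, never taking the machine down.
# drain/upgrade/release are hypothetical hooks into the scheduler and
# install system; idle() reports whether a node currently has no job.

import time

def drain(node): pass      # stop scheduling new jobs onto the node
def upgrade(node): pass    # push the new software release
def release(node): pass    # return the node to the scheduling pool

def rolling_upgrade(nodes, idle, poll_seconds=60):
    pending = set(nodes)
    while pending:
        for node in [n for n in pending if idle(n)]:
            drain(node)
            upgrade(node)
            release(node)
            pending.discard(node)
        time.sleep(poll_seconds)
```

Checkpointing makes this even easier: a node can be drained by checkpointing its job rather than waiting for the job to finish.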

  10. Summary • Lots of raw materials exist to attack the testing and evaluation problem • The methodology exists, too • Even if you buy a complete system, you still need to know how to write the acceptance test • Open-source RAS has a long way to go • but there is a small-system solution today

  11. Gong This Slide: MM5 Performance (t3a) [performance chart; y-axis in Gigaflops]
