
Software Testing Doesn’t Scale


Presentation Transcript


  1. Software Testing Doesn’t Scale James Hamilton JamesRH@microsoft.com Microsoft SQL Server

  2. Overview • The Problem: • S/W size & complexity inevitable • Short cycles reduce S/W reliability • S/W testing is the real issue • Testing doesn’t scale • trading complexity for quality • Cluster-based solution • The Inktomi lesson • Shared-nothing cluster architecture • Redundant data & metadata • Fault isolation domains

  3. S/W Size & Complexity Inevitable • Successful S/W products grow large • # features used by a given user small • But union of per-user feature sets is huge • Reality of commodity, high-volume S/W • Large feature sets • Same trend as consumer electronics • Example mid-tier & server-side S/W stack: • SAP: ~47 mloc • DB: ~2 mloc • NT: ~50 mloc • Testing all feature interactions impossible
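A back-of-the-envelope sketch of why "testing all feature interactions" is impossible: even pairwise combinations grow quadratically and full combinations grow exponentially in the feature count. The feature counts below are illustrative, not figures from the talk.

```python
# Illustrative only: feature counts are made up, not from the slides.
from math import comb

for features in (10, 100, 1000):
    pairwise = comb(features, 2)   # distinct feature pairs to test together
    subsets = 2 ** features        # all possible feature combinations
    print(f"{features:>5} features: {pairwise:>8} pairs, "
          f"~{subsets:.2e} combinations")
```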

  4. Short Cycles Reduce S/W Reliability • Reliable TP systems typically evolve slowly & conservatively • Modern ERP systems can go through 6+ minor revisions/year • Many e-commerce sites change even faster • Fast revisions a competitive advantage • Current testing and release methodology: • As much testing as dev time • Significant additional beta-cycle time • Unacceptable choice: • reliable but slow-evolving, or fast-changing yet unstable and brittle

  5. Testing the Real Issue • 15 yrs ago test teams tiny fraction of dev group • Now test teams of similar size to dev & growing rapidly • Current test methodology improving incrementally: • Random grammar-driven test case generation • Fault injection • Code path coverage tools • Testing remains effective at feature testing • Ineffective at finding inter-feature interactions • Only a tiny fraction of Heisenbugs found in testing (www.research.microsoft.com/~gray/Talks/ISAT_Gray_FT_Avialiability_talk.ppt) • Beta testing because testing known to be inadequate • Test team growth scales exponentially with system complexity • Test and beta cycles already intolerably long
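A minimal sketch of the "random grammar-driven test case generation" the slide names. The toy SQL-ish grammar, weights, and depth limit are invented for illustration; they are not from the talk or from any actual SQL Server test tool.

```python
import random

# Toy grammar: each symbol maps to a list of alternative productions.
GRAMMAR = {
    "query": [["SELECT ", "cols", " FROM t", "where"]],
    "cols":  [["a"], ["a, b"], ["*"]],
    "where": [[""], [" WHERE ", "pred"]],
    "pred":  [["a = 1"], ["b < 2"], ["pred", " AND ", "pred"]],
}

def generate(symbol: str, depth: int = 0) -> str:
    """Expand a grammar symbol into a random test statement (depth-limited)."""
    if symbol not in GRAMMAR:
        return symbol                       # terminal token
    rules = GRAMMAR[symbol]
    # Fall back to the simplest production once deep, so recursion terminates.
    rule = rules[0] if depth > 6 else random.choice(rules)
    return "".join(generate(part, depth + 1) for part in rule)

if __name__ == "__main__":
    for _ in range(5):
        print(generate("query"))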

  6. The Inktomi Lesson • Inktomi web search engine (SIGMOD’98) • Quickly evolving software: • Memory leaks, race conditions, etc. considered normal • Don’t attempt to test & beta until quality high • System availability of paramount importance • Individual node availability unimportant • Shared-nothing cluster • Exploit ability to fail individual nodes: • Automatic reboots avoid memory leaks • Automatic restart of failed nodes • Fail fast: fail & restart when redundant checks fail • Replace failed hardware weekly (mostly disks) • Dark machine room • No panic midnight calls to admins • Mask failures rather than futilely attempting to avoid them
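A minimal sketch of the fail-fast-and-restart discipline the slide attributes to Inktomi: when a redundant check fails, the node process exits immediately rather than trying to repair itself, and a supervisor restarts it ("dark machine room" style). The function names and restart policy below are invented for illustration.

```python
import subprocess
import sys
import time

def assert_invariant(condition: bool, message: str) -> None:
    """Fail fast: don't patch in-memory state, just exit and let restart clean up."""
    if not condition:
        print(f"invariant violated: {message}; failing fast", file=sys.stderr)
        sys.exit(1)

def supervise(cmd: list[str], max_restarts: int = 100) -> None:
    """Automatically restart the node process whenever it exits."""
    for attempt in range(max_restarts):
        proc = subprocess.run(cmd)          # blocks until the node process dies
        print(f"node exited with {proc.returncode}, restart #{attempt + 1}")
        time.sleep(1)                       # brief back-off before restarting
```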

  7. Apply to High Value TP Data? • Inktomi model: • Scales to 100’s of nodes • S/W evolves quickly • Low testing costs and no-beta requirement • Exploits ability to lose individual node without impacting system availability • Ability to temporarily lose some data w/o significantly impacting query quality • Can’t lose data availability in most TP systems • Redundant data allows node loss w/o loss of data availability • Inktomi model with redundant data & metadata a solution to exploding test problem
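A sketch of one way redundant data placement gives the node-loss tolerance the slide calls for: each partition lives on two distinct nodes, so losing any single node leaves a surviving copy of everything. The round-robin scheme and names are my illustration, not the talk's actual placement algorithm.

```python
from itertools import cycle

def place_partitions(partitions: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Round-robin each partition onto two distinct nodes (2-way redundancy)."""
    ring = cycle(nodes)
    placement = {}
    for p in partitions:
        placement[p] = [next(ring), next(ring)]
    return placement

placement = place_partitions([f"P{i}" for i in range(6)], ["n1", "n2", "n3", "n4"])
# Losing one node: every partition still has a copy on a surviving node.
for part, replicas in placement.items():
    print(part, replicas)
```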

  8. Connection Model/Architecture • All data & metadata multiply redundant • Shared nothing • Single system image • Symmetric server nodes • Any client connects to any server • All nodes SAN-connected [Diagram: Client → Server Node → Server Cloud]
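A sketch of "any client connects to any server": with symmetric nodes and a single system image, a client can dial any node and simply fail over to another if the first is unreachable. The hostnames and port are placeholders, not a real deployment or SQL Server API.

```python
import random
import socket

NODES = ["node1.example", "node2.example", "node3.example"]   # placeholders

def connect_any(nodes: list[str], port: int = 1433) -> socket.socket:
    """Try nodes in random order until one accepts the connection."""
    for host in random.sample(nodes, k=len(nodes)):
        try:
            return socket.create_connection((host, port), timeout=2)
        except OSError:
            continue    # node down: symmetric nodes mean any other will do
    raise ConnectionError("no server node reachable")
```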

  9. Compilation & Execution Model • Query execution on many subthreads synchronized by a root thread [Diagram: Client → Server Thread → Server Cloud; pipeline: Lex analyze → Parse → Normalize → Optimize → Code generate → Query execute]
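A sketch of the execution side of this model as I read the slide: a root thread dispatches compiled plan fragments to worker subthreads and synchronizes on their results. The fragment function and data are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def run_fragment(fragment_id: int) -> list[int]:
    """Stand-in for executing one compiled plan fragment on one node."""
    return [fragment_id * 10 + i for i in range(3)]

def root_execute(n_fragments: int) -> list[int]:
    """Root thread: dispatch all fragments, then merge (synchronize) results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(run_fragment, range(n_fragments)))
    return [row for part in partials for row in part]

print(root_execute(4))
```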

  10. Node Loss/Rejoin • Lose node (execution in progress): • Recompile • Re-execute • Rejoin: • Node local recovery • Rejoin cluster • Recover global data at rejoining node [Diagram: Client → Server Cloud]
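A sketch of the node-loss half of this slide: if a participating node disappears mid-query, the coordinator recompiles the plan against the surviving nodes and re-executes, rather than trying to resume the broken execution. The exception type and helper signatures are illustrative.

```python
class NodeLost(Exception):
    """Raised when a participating node fails during execution; args[0] is the node."""

def execute_with_retry(compile_plan, execute_plan, live_nodes, max_retries=3):
    """Recompile & re-execute on node loss rather than attempting to resume."""
    for _ in range(max_retries):
        plan = compile_plan(live_nodes)      # optimizer sees only surviving nodes
        try:
            return execute_plan(plan)
        except NodeLost as err:
            live_nodes = [n for n in live_nodes if n != err.args[0]]
    raise RuntimeError("query failed after repeated node losses")
```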

  11. Redundant Data Update Model • Updates are standard parallel plans • Optimizer knows all redundant data paths • Generated plan updates all • No significant new technology • Like materialized view & index updates today [Diagram: Client → Server Cloud]
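A sketch of the update model described here: the generated plan expands one logical update into a physical update per redundant copy, much as materialized views and indexes are maintained today. The in-memory placement map and row format are illustrative.

```python
def plan_update(placement: dict[str, list[str]], partition: str, row: dict):
    """Expand one logical update into physical updates, one per replica."""
    return [(node, partition, row) for node in placement[partition]]

placement = {"P0": ["n1", "n3"], "P1": ["n2", "n4"]}
for node, part, row in plan_update(placement, "P0", {"id": 7, "qty": 2}):
    print(f"apply to {part} on {node}: {row}")
```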

  12. Fault Isolation Domains • Trade single-node perf for redundant data checks: • Fairly common…but complex error recovery is even more likely to be wrong than original forward processing code • Many of the best redundant checks are compiled out of “retail versions” when shipped (when needed most) • Fail fast rather than attempting to repair: • Bring down node for mem-based data structure faults • Never patch inconsistent data…other copies keep system available • If anything goes wrong “fire” the node and continue: • Attempt node restart • Auto-reinstall O/S, DB and recreate DB partition • Mark node “dead” for later replacement
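A sketch of the escalation policy on this slide: never patch inconsistent state in place; restart the node, then auto-reinstall it, and if that fails mark it dead for later replacement while the other replicas keep the system available. The node-management callbacks are placeholders.

```python
def handle_node_fault(node: str, restart, reinstall, mark_dead) -> str:
    """'Fire' the faulty node and continue; redundant copies keep data available."""
    if restart(node):
        return "restarted"
    if reinstall(node):          # auto-reinstall O/S, DB, recreate DB partition
        return "reinstalled"
    mark_dead(node)              # leave for weekly hardware replacement
    return "dead"
```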

  13. Summary • 100 MLOC of server-side code and growing: • Can’t fight it & can’t test it … • quality will continue to decline if we don’t do something different • Can’t afford 2 to 3 year dev cycle • 60’s large system mentality still prevails: • Optimizing precious machine resources is false economy • Continuing focus on single-system perf dead wrong: • Scalability & system perf rather than individual node performance • Why are we still incrementally attacking an exponential problem? • Any reasonable alternatives to clusters?

  14. Software Testing Doesn’t Scale James Hamilton JamesRH@microsoft.com Microsoft SQL Server
