
Lecture 12: DB Benchmarking


Presentation Transcript


  1. Lecture 12: DB Benchmarking Oct 4, 2006 ChengXiang Zhai Most slides are adapted from Kevin Chang’s lecture slides

  2. Why benchmarking? What does it measure? Price Functionality Performance

  3. Why Benchmarks? • The three most important aspects of a DBMS? • functionality, price, and performance • Performance is hard to figure out • what to implement? How to implement? • Performance is hard to compare • response time? throughput? cost? • ease of use? maintenance?

  4. Before the Wisconsin Benchmarks • Vendors quoted performance numbers for marketing, but • none were published or verified • performance figures were generally not comparable • Big customers could afford benchmarking competitions • use real target applications • but difficult and confusing without a standard procedure • Vendors only as serious as necessary to make the sale • “Vendors had little incentive to publish their performance because it was often embarrassing.” [TBH91] • contributed little to moving the state of the art forward

  5. When customers do their own • “A customer will typically involve one or more db vendors and a hardware vendor in this process. These organizations will not encourage the customer to conduct more thorough and detailed tests because such tests take longer and are more likely to uncover problems that might kill the sale. The customer will be encouraged to hurry the testing process and make the selection.” [TBH91] • “A [customer-defined] benchmark will present many opportunities for debate [over interpretation]. Both managers and technicians will be involved in rulings that require fundamental tradeoffs between realism, fairness, and expense. … A complex benchmark will leave managers with the feeling that Solomon had it easy.” [TBH91]

  6. When DB vendors do their own • “They like to set up and perform preliminary testing in private, bring the customer in to witness the test, and then get the customer out quickly before anything can go wrong.” [TBH91]

  7. The Wisconsin Benchmarks The Wisconsin benchmarks changed all that • around 1981 – 1983 • The benchmark: • a synthesized data set: the WISC database • control various parameters: • selectivity, # of duplicate tuples, # of aggregate groups • a set of 32 single-user complex SQL queries • selections, joins, projections, aggregates, updates
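
To make the parameter-control idea concrete, here is a minimal Python/sqlite3 sketch of how a WISC-style relation with a known selectivity structure might be generated. The column names and sizes are illustrative assumptions, not the actual WISC schema or the 32-query set.

```python
import random
import sqlite3

# Illustrative sketch: a synthetic relation whose column values control
# selectivity and duplicate counts, so the expected result size of each
# benchmark query is known in advance.
NUM_TUPLES = 10_000

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wisc (unique1 INTEGER, unique2 INTEGER, hundred INTEGER)")

perm = list(range(NUM_TUPLES))
random.shuffle(perm)                                   # unique1: random permutation, no duplicates
rows = [(v, i, v % 100) for i, v in enumerate(perm)]   # hundred: each value repeated 100 times
conn.executemany("INSERT INTO wisc VALUES (?, ?, ?)", rows)

# A 1% selection: the predicate on 'hundred' matches exactly 1/100 of the tuples.
n = conn.execute("SELECT COUNT(*) FROM wisc WHERE hundred = 7").fetchone()[0]
print(n, "of", NUM_TUPLES, "tuples selected")
```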

  8. Wisconsin Benchmarking Results Several major vendors were benchmarked by name • INGRES (university version and commercial version) • IDM (Intelligent Database Machines) of Britton-Lee • with and without DAC (database accelerator) • DIRECT (a multiprocessor DB machine) of Wisconsin • ORACLE

  9. Wisconsin Benchmarking Results DeWitt (then an assistant prof.) vs. Ellison: “The relatively poor performance of ORACLE made it apparent that the system had some fairly serious problems that needed correction, for ORACLE was typically a factor of 5 slower than INGRES and the IDM 500 on most selection queries.” “In retrospect, the reasons for this popularity were only partially due to its technical quality. The primary reason for its success was that it was the first evaluation containing impartial measures of real products.” -- DeWitt [TBH91]

  10. Consequently • Angry vendors: • an angry vendor called the author’s boss • demanded a recount (a recoding or remeasuring) • published their own numbers for the query set • began to patch problems in their DBMSs • use the Wisconsin benchmarks for regression testing • The DeWitt clause: in most software license agreements • can’t publish performance numbers • DB gurus criticized the shortcomings: • the synthesized relations were hard to scale (make larger) • they were not “real” • Customers began to demand Wisc. bench. results

  11. Benchmark Wars Followed “Benchmark wars start if someone loses an important or visible benchmark evaluation. The loser reruns it using regional specialists and gets new and winning numbers. Then the opponent reruns it using his regional specialists and of course gets even better numbers. The loser then reruns it using some one-star gurus. This progression can continue all the way to five-star gurus. At a certain point, a special version of the system is employed, with promises that the enhanced performance features will be included in the next regular release.” [TBH91]

  12. The Long-term Effects of the WB • Vendors equalized their performance on the Wisc. Bench. queries (cross-vendor, release-to-release) • Gurus thought long and hard about the characteristics of a “good” DB benchmark---and they are still thinking • Vendors started to learn how to cheat on benchmarks • Customers and gurus began to think about how to stop cheating

  13. The WB Shortcomings • Not “realistic”: • queries were of interest for the authors’ parallel platform • but did not reflect the OLTP systems of the day (banks) • Not multiuser • System price wasn’t factored in • The data set was hard to scale up (make larger) • 2MB, 10,000 tuples • systems will grow, so should benchmarks • Successors: • the Anon et al. paper: DebitCredit or “TP1” • TPC: TPC-A, TPC-B, TPC-C • addressed these shortcomings, measuring concurrent TPS

  14. The Anon et al. Paper • Jim Gray, 1984 • an early version was distributed to professionals in academia and industry for comments • published as “Anon et al.” to suggest group effort • The benchmark tests: • an interactive OLTP emulation: DebitCredit • modeled after the actual state of Bank of America in the early 1970s • two batch tests that stress IO: • Scan: scans and updates 1000 records • Sort: disk sort of one million records
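
A minimal sketch of the spirit of the two batch tests, in Python/sqlite3. The schema, record counts, and timing harness are illustrative assumptions, not the paper's specification; a real run would sort one million disk-resident records to stress the I/O path rather than the CPU.

```python
import sqlite3
import time

# Illustrative table of records to scan, update, and sort.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(i, (i * 7919) % 1000) for i in range(1000)])
conn.commit()

# Scan: sequentially read and update 1000 records, timing the pass.
t0 = time.perf_counter()
conn.execute("UPDATE records SET payload = payload + 1")
conn.commit()
print(f"scan+update: {time.perf_counter() - t0:.4f}s")

# Sort: order the batch by a non-key column.
t0 = time.perf_counter()
conn.execute("SELECT id FROM records ORDER BY payload").fetchall()
print(f"sort: {time.perf_counter() - t0:.4f}s")
```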

  15. What’s Good About DebitCredit? • Why was it popular and influential?

  16. The DebitCredit Benchmark • Relevance: modeled bank OLTP • Simplicity: • one deposit transaction over the ABTH files: • account, branch, teller, and history (logs) • Scalability: DB size scales up as TPS does • for each TPS: 100k accounts, 10 branches, 100 tellers • e.g., 10 TPS: 1000k A, 100 B, and 1000 T • Comparability: • system requirements: • 95% of transactions within 1 sec response time • configuration control: • instead of specifying “equivalent” configurations • use cost as the normalization factor • simple summary metrics: TPS, $/TPS
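
As one concrete reading of the scaling rule and the shape of the ABTH transaction, here is a small Python/sqlite3 sketch. The table and column names are simplified, hypothetical stand-ins, not the benchmark's exact record layouts.

```python
import sqlite3

def scaled_sizes(tps: int) -> dict:
    """DebitCredit scaling rule: database size grows with the claimed TPS rating."""
    return {"accounts": 100_000 * tps, "tellers": 100 * tps, "branches": 10 * tps}

def debit_credit(conn, account_id, teller_id, branch_id, delta):
    """One DebitCredit deposit over the ABTH files (simplified, hypothetical schema)."""
    with conn:  # sqlite3: commits the four statements as a single transaction on success
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (delta, account_id))
        conn.execute("UPDATE teller  SET balance = balance + ? WHERE id = ?", (delta, teller_id))
        conn.execute("UPDATE branch  SET balance = balance + ? WHERE id = ?", (delta, branch_id))
        conn.execute(
            "INSERT INTO history (account_id, teller_id, branch_id, delta) VALUES (?, ?, ?, ?)",
            (account_id, teller_id, branch_id, delta),
        )

print(scaled_sizes(10))  # 10 TPS -> 1,000,000 accounts, 1,000 tellers, 100 branches

# Tiny in-memory demo so the transaction can be exercised end to end.
conn = sqlite3.connect(":memory:")
for table in ("account", "teller", "branch"):
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute(f"INSERT INTO {table} VALUES (1, 0)")
conn.execute("CREATE TABLE history (account_id, teller_id, branch_id, delta)")
debit_credit(conn, account_id=1, teller_id=1, branch_id=1, delta=100)
```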

  17. Who Uses Benchmarks Today? • DBMS vendors • to tune, tweak, optimize, and hide weaknesses • Hardware vendors • with DBMS vendor liaisons to promote sales • Third-party product reviewers • Academic institutions • as part of research projects • Customers • to help make buying decisions

  18. How To Cheat on a Benchmark? • Use a special version of the system that is not released, or that customers will never use • Use a data organization or non-standard approach that customers never use in practice • Disable every feature not explicitly required by the benchmark • Price the benchmarked system using special discounts • Just lie • More later

  19. How to Solve These Problems? • Cheaters’ creativity knows no bounds • details later • “Benchmarketing” wars heated up • gurus/consultants often hired as auditors to certify results • The Transaction Processing Performance Council, 1988 • an independent “benchmark” body: • to design and publish standardized DB benchmarks • consultant Omri Serlin’s proposal, with 8 vendors initially

  20. TPC today • http://www.tpc.org • 20+ member companies (all the major DB vendors) • Early benchmarks: • TPC-A: TPC version of DebitCredit • TPC-B: stress test of core backend DB servers • Current benchmarks: • TPC-C: OLTP warehouse order entry and inventory monitoring • TPC-H: ad-hoc decision support queries • TPC-R: also decision support, with “non ad-hoc” queries • TPC-W: Web e-commerce transactions

  21. TPC Web Site Highlights (www.tpc.org) • Who is the fastest on each benchmark? What are the trends? What do the published numbers mean? Why are there both hardware and software vendors listed? What happened to TPC-A and TPC-B? How do we know the numbers aren’t lies? How do we find out the details of how the benchmark was run? What are the “withdrawn” results? How detailed are the benchmark specs? How is the price computed?

  22. Point to Ponder • Why more hardware vendors than software on TPC results?

  23. Why More HW than SW Vendors? • HW evolves faster while SW is relatively “stable”? • More players in HW than in SW? • the SW market is saturated by just a few vendors • HW vendors view DBs as driving applications? • SW sells on marketing more than on performance? • compatibility, market dominance

  24. What makes a good benchmark? (Gray) • Relevant • must resemble an actual class of apps • extrapolation is impossible • hence special benchmarks for ECAD (OO7), Sci DBs (Sequoia), … • Portable • should be easy to implement on many different platforms • Scalable • should apply to small and large systems • e.g.: different size DBs for different size platforms • Simple • must be understandable for credibility

  25. Writing a Benchmark: Focuses Benchmark focuses: • Core speed and potential bottlenecks • “micro-benchmarks”, e.g., Sort • Functionalities • e.g.: Wisconsin • End-to-end “scenarios” performance: • e.g.: DebitCredit, TPC-C

  26. Writing a Benchmark: Metrics • Performance: • Wisconsin: total elapsed time of various queries • DebitCredit: TPS • defines the DebitCredit transaction as the “unit transaction” • simple, allows easy comparison • the single-number metric made the benchmark popular • Price/Performance: • Anon et al.’s approach • inspired by the bank bidding case study • 100 TPS at $5M: $50K/TPS • 100 TPS at $25M: $250K/TPS • adopted by TPC
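
The price/performance arithmetic from the slide, written out as a tiny helper; a sketch only, with the slide's example dollar figures.

```python
def price_per_tps(total_system_cost_usd: float, tps: float) -> float:
    """Cost normalized by throughput: the Anon et al. / TPC-style $/TPS metric."""
    return total_system_cost_usd / tps

print(price_per_tps(5_000_000, 100))   # 50000.0  -> $50K/TPS
print(price_per_tps(25_000_000, 100))  # 250000.0 -> $250K/TPS
```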

  27. What operations to include? • A specified mix of common operations? • E.g., concurrent sales assembly 30%, summary reports 10%, and everything else can be new order entry • Use a probability distribution to determine the type of the next operation • Harder than running each operation in isolation, or in a predetermined order
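
A minimal sketch of a driver drawing the next operation type from a specified mix. The weights mirror the slide's example percentages; the operation names are illustrative.

```python
import random

# Weights from the slide's example mix; "everything else" becomes new order entry.
MIX = {
    "sales_assembly": 0.30,
    "summary_report": 0.10,
    "new_order_entry": 0.60,
}

def next_operation(rng=random) -> str:
    """Pick the next operation type according to the specified probability distribution."""
    ops, weights = zip(*MIX.items())
    return rng.choices(ops, weights=weights, k=1)[0]

# The driver interleaves operation types rather than running each in isolation
# or in a fixed order, which is what makes the workload harder to game.
print([next_operation() for _ in range(10)])
```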

  28. What operations to include? • Utility functions? • Recovery, data loading, index construction… • Is logging required? Locking? What granularity? Can they be turned on/off or tuned during the benchmark? Same physical DB for all operations? • Not glamorous, but customers need them • “Panicked vendors may state that none of their customers uses level-x but uses level-y with no dire consequences. Vendor user groups are good sources of sanity.” [TBH91]

  29. What operations to include? • Application logic? • “Unless you have compute-intensive applications, application logic should be the first thing eliminated from a benchmark. Application logic rarely accounts for more than 10-15% of the CPU load. Specifying application logic, then verifying it across all bidders, is time consuming and error prone.”

  30. Famous Cheating Methods • Put the entire benchmark inside a single precompiled stored procedure in the DBMS • run it with a single call (much cheaper) • no run-time query optimization • Divide the DB into several physical DBs • to hide the fact that locking is only at the DB level • Use local clients instead of remote ones

  31. Famous Cheating Methods • Use an unreleased version of the DBMS • and promise it will be ready soon • Use your 5-star wizards to tune the DBMS • Leads to escalating wizard wars, especially for customer-supplied benchmarks • Help the query optimizer: • reorder query conditions for optimizer to pick the “right” plan • break query into a series of queries, or un-nested subqueries • Remove functionalities: • turn off logging, locking, or anything not explicitly required

  32. Famous Problems Uncovered • Lock the entire table for one insert • Can’t handle tables larger than one disk • Bulk load too slow to be practical • Nested queries always use sequential scan • Core dumps during the benchmark • Can’t support tuples longer than 2KB

  33. Famous Problems uncovered • During heavy multi-user insertions, DBMS corrupts the db • Application hangs if multiple users call SQL within 5 seconds of each other • Different number of rows in answer with/without Order-By clause in query • Query requires temporary space equal to db size

  34. To Find Out More • TPC’s web site is a great starting point • www.tpc.org • The Benchmark Handbook [TBH91] • edited by Jim Gray, is the authoritative text • online at www.benchmarkresources.com/handbook/

  35. Conclusion: Benchmark = Bound • The guru’s promise: • while cleverness should be rewarded, the clever people may disappear after you own the equipment • The performance guarantee: • a benchmark result is a guarantee that your performance will never exceed the published result • well, but sometimes comparing their “upper bounds” is still meaningful, better than nothing!

  36. What You Should Know • Wisconsin benchmarks • DebitCredit benchmark • Why is benchmarking challenging? • Benchmark = Bound

  37. Carry Away Messages • Benchmarks can stimulate technology progress • Benchmarks can bias our research directions • How to benchmark search engines?
