
The TEXTURE Benchmark: Measuring Performance of Text Queries on a Relational DBMS


Presentation Transcript


  1. The TEXTURE Benchmark: Measuring Performance of Text Queries on a Relational DBMS Vuk Ercegovac David J. DeWitt Raghu Ramakrishnan

  2. Applications Combining Text and Relational Data Query: SELECT SCORE, P.id FROM Products P WHERE P.type = ‘PDA’ and CONTAINS(P.complaint, ‘short battery life’, SCORE) ORDER BY SCORE DESC (Slide shows an example ProductComplaints table.) How should such an application be expected to perform?

  3. Possibilities for Benchmarking • Existing benchmarks: TREC VLC2 [1], TREC [2], TPC [3], Full-text Document Retrieval Benchmark [4], Set Query [8], AS3AP [10]

  4. Contributions of TEXTURE • Design a micro-benchmark to compare response time using a mixed relational + text query workload • Develop TextGen to synthetically grow a text collection given a real text collection • Evaluate TEXTURE on three commercial systems

  5. Why a Micro-benchmark Design? • A fine level of control for experiments is needed to differentiate effects due to: • How text data is stored • How documents are assigned a score • Optimizer decisions

  6. Why use Synthetic Text? • Allows for systematic scale-up • User’s current data set may be too small • Users may be more willing to share synthetic data • We show empirically that measurements on synthetic data closely match the same measurements on real data

  7. A Note on Quality • Measuring quality is important! • It is easy to quickly return poor results • We assume that the three commercial systems strive for high-quality results • Some participated in TREC • There is large overlap between their result sets

  8. Outline • TEXTURE Components • Evaluation • Synthetic Text Generation

  9. TEXTURE Components [Diagram: QueryGen instantiates query templates into a workload (Query 1, Query 2, …, Query n); DBGen populates the relational attributes and TextGen populates the text attributes; each system under test (System A, System B) runs the workload, yielding Response Time A and Response Time B.]

  10. Overview of Data • Schema based on Wisconsin Benchmark [5] • Used to control relational predicate selectivity • Relational attributes populated by DBGen [6] • Text attributes populated by TextGen (new) • Input: • D: document collection, m: scale-up factor • Output: • D’: document collection with |D| x m documents • Goal: Same response times for workloads on D’ and corresponding real collection 5. D. DeWitt. The Wisconsin Benchmark: Past, Present, and Future. The Benchmark Handbook, 1991. 6. J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly Generating Billion-record Synthetic Databases. ACM SIGMOD, 1994

  11. Overview of Queries • Query workloads derived from query templates with the following parameters • Text expressions: • Vary number of keywords, keyword selectivity, and type of expression (i.e., phrase, Boolean, etc.) • Keywords chosen from text collection • Relational expression: • Vary predicate selectivity, join condition selectivity • Sort order: • Choose between relational attribute or score • Retrieve ALL or TOP-K results
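
To make the template mechanism concrete, here is a minimal sketch of how one such query template could be instantiated. The helper name, template text, and parameter defaults are hypothetical illustrations, not the actual QueryGen implementation, and the top-k clause syntax differs across systems.

```python
import random

# Hypothetical sketch of instantiating a query template (not the actual QueryGen code).
TEMPLATE = (
    "SELECT SCORE, num_id, txt_short FROM R "
    "WHERE NUM_5 = {rel_value} "
    "and CONTAINS(R.txt_long, '{text_expr}', SCORE) "
    "ORDER BY {sort_key} DESC"
)

def instantiate(vocabulary, num_keywords, rel_value, sort_by_score=True, top_k=None):
    """Draw keywords from the collection's vocabulary and fill in the template."""
    keywords = random.sample(vocabulary, num_keywords)
    text_expr = " ".join(keywords)            # phrase; join with AND/OR for Boolean expressions
    sort_key = "SCORE" if sort_by_score else "num_id"
    sql = TEMPLATE.format(rel_value=rel_value, text_expr=text_expr, sort_key=sort_key)
    if top_k is not None:
        sql += f" FETCH FIRST {top_k} ROWS ONLY"   # top-k syntax varies by DBMS
    return sql

# Example: a 100-query workload of two-keyword phrase queries with a fixed relational predicate.
vocabulary = ["battery", "screen", "stylus", "charger", "reset"]
workload = [instantiate(vocabulary, 2, rel_value=3, top_k=10) for _ in range(100)]
```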

  12. Example Queries • Example of a single-relation, mixed relational and text query that sorts according to a relevance score. SELECT SCORE, num_id, txt_short FROM R WHERE NUM_5 = 3 and CONTAINS(R.txt_long, ‘foo bar’, SCORE) ORDER BY SCORE DESC • Example of a join query, sorting according to a relevance score on S.txt_long. SELECT S.SCORE, S.num_id, S.txt_short FROM R, S WHERE R.num_id = S.num_id and S.NUM_05 = 2 and CONTAINS(S.txt_long, ‘foo bar’, S.SCORE) ORDER BY S.SCORE DESC

  13. Outline • TEXTURE Components • Evaluation • Synthetic Text Generation

  14. Overview of Experiments • How is response time affected as the database grows in size? • How is response time affected by sort order and top-k optimizations? • How do the results change when the input collection to TextGen differs?

  15. Data and Query Workloads • TextGen input is TREC AP Vol.1[1] and VLC2 [2] • Output: relations w/ {1, 2.5, 5, 7.5, 10} x 84,678 tuples • Corresponds to ~250 MB to 2.5 GB of text data • Text-only queries: • Low (< 0.03%) vs. high selectivity (< 3%) • Phrases, OR, AND • Mixed, single relation queries: • Low (<0.01%) vs. high selectivity (5%) • Pair with all text-only queries • Mixed, multi relation queries: • 2, 3 relations, vary text attribute used, vary selectivity • Each query workload consists of 100 queries 1. http://es.csiro.au/TRECWeb/vlc2info.html 2. http://trec.nist.gov

  16. Methodology for Evaluation • Set up the database and query workloads • Run the workload per system multiple times to obtain warm numbers • Discard the first run, report the average of the remaining runs • Repeat for all systems (A, B, C) • Platform: Microsoft Windows 2003 Server, dual-processor 1.8 GHz AMD, 2 GB of memory, eight 120 GB IDE drives
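
As a rough illustration of this warm-run methodology, the sketch below assumes a DB-API style connection object and four runs per workload; both details are illustrative assumptions rather than specifics from the benchmark.

```python
import time

def run_workload(conn, queries):
    """Execute every query in the workload once; return elapsed wall-clock seconds."""
    start = time.perf_counter()
    for sql in queries:
        cur = conn.cursor()   # assumes a DB-API style connection object
        cur.execute(sql)
        cur.fetchall()        # force all results to be produced
        cur.close()
    return time.perf_counter() - start

def warm_average(conn, queries, runs=4):
    """Run the workload several times, discard the first (cold) run, average the rest."""
    times = [run_workload(conn, queries) for _ in range(runs)]
    return sum(times[1:]) / (len(times) - 1)
```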

  17. Scaling: Text-Only Workloads • How does response time vary per system as the data set scales up? • Query workload: low text selectivity (0.03%) • Text data: synthetic based on TREC AP Vol. 1

  18. Mixed Text/Relational Workloads • Drill down on scale factor 5 (~450K tuples) • Query workload Low: text selectivity (0.03%) • Query workload High: text selectivity (3%) • Do the systems take advantage of the relational predicate for mixed workload queries? • Query workload Mix: high text selectivity, low relational selectivity (0.01%) [Chart: seconds per system and workload (synthetic TREC)]

  19. Top-k vs. All Results • Compare retrieving all vs. top-k results • Query workload is Mix from before • High selectivity text expression (3%) • Low selectivity relational predicate (0.01%) [Chart: seconds per system and workload (450K tuples, synthetic TREC)]

  20. Varying Sort Order • Compare sorting by score vs. sorting by a relational attribute • When retrieving all results, the outcome is similar to the previous slide • Results for retrieving top-k shown below [Chart: seconds per system and workload (450K tuples, synthetic TREC)]

  21. Varying the Input Collection • What is the effect of different input text collections on response time? • Query workload: low text selectivity (0.03%) • All results retrieved • Text data: synthetic TREC and VLC2 [Chart: seconds per system and collection (450K tuples)]

  22. Outline • Benchmark Components • Evaluation • Synthetic Text Generation

  23. Synthetic Text Generation • TextGen: • Input: document collection D, scale-up factor m • Output: document collection D’ with |D| x m documents • Problem: Given documents D, how do we add documents to obtain D’ ? • Goal: Same response times for workloads on D’ and corresponding real collection C, |C|=|D’| • Approach: Extract “features” from D and draw |D’| samples according to features

  24. Document Collection Features • Features considered • W(w, c): word distribution • G(n, v): vocabulary growth • U, L: number of unique and total words per document • C(w1, w2, …, wn, c): co-occurrence of word groups • Each feature is estimated by a model • Ex.: Zipf [11] or an empirical distribution for W • Ex.: Heaps’ law [7] for G 7. H. S. Heaps, Information Retrieval, Computational and Theoretical Aspects. Academic Press, 1978. 11. G. Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Hafner Publications, 1949.
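
As a rough sketch of what these feature models look like, the snippet below builds a Zipf-style rank-frequency distribution for W and uses Heaps' law for vocabulary growth G; the exponent and constants are placeholders that would in practice be fitted to the input collection D.

```python
import numpy as np

def zipf_probabilities(vocab_size, s=1.0):
    """Rank-frequency probabilities P(r) proportional to 1/r^s (Zipf [11])."""
    ranks = np.arange(1, vocab_size + 1)
    weights = 1.0 / ranks ** s
    return weights / weights.sum()

def heaps_vocabulary_size(total_words, k=30.0, beta=0.5):
    """Heaps' law [7]: expected number of distinct words after n total words,
    V(n) = k * n^beta; k and beta would be estimated from the input collection."""
    return int(k * total_words ** beta)
```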

  25. Process to Generate D’ • Pre-process: estimate features • Depends on model used for feature • Generate |D’| documents • Generate each document by sampling W according to U and L • Grow vocabulary according to G • Post-process: Swap words between documents in order to satisfy co-occurrence of word groups C
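
A minimal sketch of the per-document generation step follows, assuming a word distribution like the one in the previous snippet and hypothetical per-document counts for total and unique words; vocabulary growth and the co-occurrence-preserving word swaps are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(vocab, word_probs, total_len, unique_len):
    """Sample U distinct words according to the word distribution W, then fill the
    document to its total length L by drawing uniformly from those U words."""
    L = max(1, total_len)
    U = max(1, min(unique_len, L, len(vocab)))
    distinct = rng.choice(len(vocab), size=U, replace=False, p=word_probs)
    words = rng.choice(distinct, size=L, replace=True)
    return " ".join(vocab[i] for i in words)

# Example with placeholder inputs: a 10,000-word vocabulary and a Zipf word distribution.
vocab = [f"w{i}" for i in range(10_000)]
probs = 1.0 / np.arange(1, len(vocab) + 1)
probs /= probs.sum()
doc = generate_document(vocab, probs, total_len=400, unique_len=180)
```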

  26. Feature-Model Combinations • Considered 3 instances of TextGen, each a combination of features/models

  27. Which TextGen is a Good Generator? • Goal: response time measured on synthetic (S) and real (D) collections should be similar across systems • Does the use of randomized words in D’ affect response time accuracy? • How does the choice of features and models affect response time accuracy as the data set scales?

  28. Use of Random Words • Words are strings composed of a random permutation of letters • Random words are useful for: • Vocabulary growth • Sharing text collections • Do randomized words affect measured response times? • What is the effect on stemming, compression, and other text-processing components?
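
One plausible reading of the randomization step is sketched below: each distinct word is mapped once to a shuffled-letter string, so every occurrence of the same real word is replaced by the same random word and term frequencies (and hence query selectivities) are preserved. The consistent per-word mapping and the helper names are assumptions for illustration, not details stated on the slide.

```python
import random

def build_word_map(vocabulary, seed=0):
    """Assumed scheme: map each distinct word to one fixed random permutation of its
    letters, so every occurrence is replaced consistently and term statistics survive."""
    rng = random.Random(seed)
    mapping = {}
    for word in vocabulary:
        letters = list(word)
        rng.shuffle(letters)
        mapping[word] = "".join(letters)
    return mapping

def randomize_text(text, mapping):
    """Replace every known word in a whitespace-tokenized document."""
    return " ".join(mapping.get(w, w) for w in text.split())
```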

  29. Effect of Randomized Words • Experiment: create two TEXTURE databases and compare across systems • Database AP based on TREC AP Vol. 1 • Database R-AP: randomize each word in AP • Query workload: low & high selectivity keywords • Result: response times differ on average by < 1%, not exceeding 4.4% • Conclusion: using random words is reasonable for measuring response time

  30. Effect of Features and Models • Experiment: compare response times over same sized synthetic (S) and real (D) collections • Sample s documents of D • Use TextGen to produce S at several scale factors • |S| = 10, 25, 50, 75, and 100% of |D| • Compare response time across systems • Must repeat for each type of text-only query workload • Used as framework for picking features/models
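
A small sketch of the comparison itself, assuming response times on the real and synthetic collections have already been measured; the data layout and the example numbers are hypothetical.

```python
def accuracy_report(measurements):
    """measurements: {(system, scale_pct): (real_seconds, synthetic_seconds)}.
    Returns the relative response-time error of the synthetic collection per point;
    a good TextGen configuration keeps these values small at every scale factor."""
    return {key: abs(real - synth) / real
            for key, (real, synth) in measurements.items()}

# Hypothetical example: System A at 50% of |D|.
print(accuracy_report({("A", 50): (12.3, 12.9)}))
```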

  31. TextGen Evaluation Results • How does response time measured on real data compare to the synthetic TextGen collections? • Query workload: low selectivity text only query (0.03%) • Graph is for System A • Similar results obtained for other systems

  32. Future Work • How should quality measurements be incorporated? • Extend the workload to include updates • Allow correlations between attributes when generating database

  33. Conclusion • We propose TEXTURE to fill the benchmarking gap for applications that use mixed relational and text queries • We can scale up a text collection through synthetic text generation in such a way that response time is accurately reflected • The results of our evaluation illustrate significant differences between current commercial relational systems

  34. References 1. http://es.csiro.au/TRECWeb/vlc2info.html 2. http://trec.nist.gov 3. http://www.tpc.org 4. S. DeFazio. Full-text Document Retrieval Benchmark, chapter 8. Morgan Kaufmann, 2nd edition, 1993. 5. D. DeWitt. The Wisconsin Benchmark: Past, Present, and Future. The Benchmark Handbook, 1991. 6. J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly Generating Billion-Record Synthetic Databases. ACM SIGMOD, 1994. 7. H. S. Heaps. Information Retrieval: Computational and Theoretical Aspects. Academic Press, 1978. 8. P. O’Neil. The Set Query Benchmark. The Benchmark Handbook, 1991. 9. K. A. Shoens, A. Tomasic, and H. Garcia-Molina. Synthetic Workload Performance Analysis of Incremental Updates. In Research and Development in Information Retrieval, 1994. 10. C. Turbyfill, C. Orji, and D. Bitton. AS3AP: A Comparative Relational Database Benchmark. IEEE Compcon, 1989. 11. G. Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Hafner Publications, 1949.

  35. Questions?
