Enhancing Query Processing through Approximate Lineage in Probabilistic Databases

Approximate Lineage for Probabilistic Databases Christopher Ré and Dan Suciu University of Washington

Approximate Lineage in One Slide • Lineage (provenance): in a view, lineage is all derivations of a tuple • In QP, used to track correlations (probabilistic databases) • Explains query/view results • VLDBs have lots of lineage, especially with complex queries/views • Chokes QP • Hard for users to understand • Obs: lineage contains a lot of redundancy! • This work: approximate the lineage by keeping only the most important correlations

Overview • Motivation & Preliminaries • An apx lineage approach: Sufficient Lineage • Experiments • Conclusions

Inspired by the Geneontology (GO) Database • A protein database • Standard pDB, e.g. Mystiq, Trio • Example table: Process (P), with atoms and lineage (l) • Data are from somewhere, so lineage (l) is important • Lineage comes from a wide variety of sources, not all trusted the same: manually created or machine inferred • Some with confidence, too!

Review: Lineage tracking • Systems: PRA [Fuhr & Rolleke 97], Trio [Widom 05], Mystiq [R, Dalvi, S 07] • Lineage propagates with queries/views • Example: “Proteins related to same process as ‘Aac11’” • V(y) :- P(x,y), P(‘Aac11’, y), x ≠ ‘Aac11’ • How do we derive the lineage l1? Lineage tracks all derivations over Process (P) • Prob QP: Pr[V(‘AGO2’)] = Pr[l1] • Big DB = big lineage: in GO, 1 tuple, 10MB lineage! • Big lineage chokes the engine!
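A minimal sketch of how lineage could be collected for a view of this shape, assuming each tuple of P carries one atom and that a derivation is the conjunction of the two tuples joined on y. The rows, the atom names t1..t4 and the helper `view_lineage` are hypothetical and are not taken from GO or from any pDB system.

```python
from collections import defaultdict

def view_lineage(P, target="Aac11"):
    """Lineage of V(y) :- P(x, y), P(target, y), x != target.

    P is a list of (x, y, atom) rows. The result maps each y to a DNF,
    stored as a set of monomials; each monomial is a frozenset of atoms.
    """
    rows_by_y = defaultdict(list)
    for x, y, atom in P:
        rows_by_y[y].append((x, atom))

    lineage = {}
    for y, rows in rows_by_y.items():
        target_atoms = [atom for x, atom in rows if x == target]
        other_atoms = [atom for x, atom in rows if x != target]
        # One monomial per derivation: a non-target tuple joined with a target tuple.
        monomials = {frozenset((o, t)) for o in other_atoms for t in target_atoms}
        if monomials:
            lineage[y] = monomials
    return lineage

# Hypothetical toy instance: (x, y, atom).
P = [
    ("Aac11", "y1", "t1"),
    ("p2", "y1", "t2"),
    ("p3", "y1", "t3"),
    ("p2", "y2", "t4"),  # no matching target tuple for y2, so no derivation
]
lin = view_lineage(P)
```

Every extra tuple that joins with the target adds another monomial, which is why the lineage of a single output tuple can grow with the size of the database.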

Problems with Large Lineage in pDB • Lineage is used to: • Process queries (large: chokes QP) • Give explanations to users (large: many redundant explanations) • Find influential atoms (large: needle in a haystack) • This talk: on VLDBs, it is helpful to shrink (approximate) the lineage

Approximate Lineage Approach • Original VLDB → Level 1 database (big lineage, l) → Level 2 database (small lineage, a), with error e • a is a smaller, approximate formula • All (most) querying runs on the Level 2 database (using a instead of l) • Focus is on the Level 2 database

Sufficient lineage (SL) • Represent a's? DNF formulae that logically imply l, so a is a lower bound on l • Use a's to answer queries? Reuse existing systems! • Provide explanations? Find influential tuples? See paper • Build good a's efficiently? The remainder of this talk • Nugget: an algorithm that always finds small, good SL

Formalizing “good a's” • Choosing an approximation a for a lineage function l • An atom is a Boolean proposition; a world is the set of true atoms • Formalizing this: E[l − a] ≤ e, i.e. the expectation of the difference over all worlds should be small • Intuition: a should agree with l on most worlds • NB: really the standard ℓ2 distance
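Spelled out, the condition reads as below. This is a sketch that assumes independent atoms with marginal probabilities p_x; the symbols W (a world) and p_x are introduced here for illustration and do not appear on the slide.

```latex
\mathbb{E}[l - a]
  \;=\; \sum_{W} \Pr[W]\,\bigl(l(W) - a(W)\bigr) \;\le\; e,
\qquad
\Pr[W] \;=\; \prod_{x \in W} p_x \prod_{x \notin W} (1 - p_x).
```

When a logically implies l, the difference l(W) − a(W) is either 0 or 1 in every world, so the expectation equals Pr[l] − Pr[a], and squaring the difference changes nothing, which is why the measure coincides with a squared ℓ2 distance.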

Illustrating Good Lineage • E[l − a] = E[l] − E[a] ≤ e • Pr[l] = 0.9 * (1 − (1 − 0.8)(1 − 0.3)) = 0.9 * 0.86 = 0.774 • Pr[a] = 0.9 * 0.8 = 0.72 • e = 0.774 − 0.72 = 0.054 • Intuition: Pr[a] high means good lineage
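The arithmetic on this slide is consistent with a lineage of the shape X ∧ (Y ∨ Z) approximated by X ∧ Y, with probabilities 0.9, 0.8 and 0.3. The atom names X, Y, Z and the helper `dnf_prob` are assumptions made for this sketch; the brute-force enumeration over worlds is only meant for tiny formulas.

```python
from itertools import product

def dnf_prob(dnf, probs):
    """Exact Pr[dnf] by enumerating every world over independent atoms."""
    atoms = sorted(probs)
    total = 0.0
    for bits in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, bits))
        # Probability of this world under atom independence.
        weight = 1.0
        for atom in atoms:
            weight *= probs[atom] if world[atom] else 1.0 - probs[atom]
        # The DNF holds if some monomial has all of its atoms true.
        if any(all(world[atom] for atom in mono) for mono in dnf):
            total += weight
    return total

probs = {"X": 0.9, "Y": 0.8, "Z": 0.3}   # assumed atom names
l = [{"X", "Y"}, {"X", "Z"}]             # original lineage
a = [{"X", "Y"}]                         # sufficient lineage: a implies l
error = dnf_prob(l, probs) - dnf_prob(a, probs)
```

Because a is a subset of the monomials of l, the error is never negative, and it equals the probability of the worlds where l holds but a does not.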

1st step: Lineage DNFs to “graphs” • Example: (X1 ∧ Y1) ∨ (X2 ∧ Y1), over atoms X1, X2, …, Xn and Y1, Y2, …, Ym • We can think of DNFs as graphs (k-DNF → a k-hypergraph) • Atoms = nodes • Monomials = edges • Trick: a matching is an SL formula • Goal: given error e, find a subset of edges with error smaller than e and small size, i.e. a best lower bound
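A minimal sketch of the graph view, assuming independent atoms: monomials are treated as (hyper)edges, and a matching is built greedily from edges that share no atom. The function names and the toy formula are hypothetical. Since the edges of a matching are atom-disjoint, its probability has a closed form.

```python
def greedy_matching(dnf):
    """Scan monomials in order and keep those that share no atom with earlier picks."""
    used = set()
    matching = []
    for mono in dnf:
        if used.isdisjoint(mono):
            matching.append(mono)
            used |= set(mono)
    return matching

def matching_prob(matching, probs):
    """Pr of a disjunction of atom-disjoint monomials over independent atoms."""
    none_true = 1.0
    for mono in matching:
        p_mono = 1.0
        for atom in mono:
            p_mono *= probs[atom]
        none_true *= 1.0 - p_mono
    return 1.0 - none_true

# Hypothetical 2-DNF: edges between X and Y atoms.
dnf = [("X1", "Y1"), ("X2", "Y1"), ("X2", "Y2")]
probs = {"X1": 0.5, "X2": 0.5, "Y1": 0.5, "Y2": 0.5}
M = greedy_matching(dnf)
p_M = matching_prob(M, probs)
```

The result is maximal (no further edge can be added) but not necessarily maximum, and it depends on the scan order.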

How big a matching could we need? • Assume Pr[Xi] = Pr[Yj] = 0.5 • Pr[M] = 1 − (1 − 0.25)^|M| • A matching of size 9 implies Pr[M] > 0.9 • For any e > 0.1, M never needs more than 9 edges • Subtle: the size bound depends on k, e and Pr[Xi], not on the # of tuples • If l has a small good matching, take a to be the matching. Call this a “good enough matching”
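The size bound can be checked numerically. The sketch below assumes every edge of the matching has the same probability (0.25 here, from two independent atoms at 0.5 each); the helper name `matching_size_needed` is hypothetical.

```python
def matching_size_needed(target, p_edge):
    """Smallest m such that 1 - (1 - p_edge)**m >= target."""
    if not (0.0 < p_edge <= 1.0) or not (0.0 <= target < 1.0):
        raise ValueError("need 0 < p_edge <= 1 and 0 <= target < 1")
    m = 0
    while 1.0 - (1.0 - p_edge) ** m < target:
        m += 1
    return m

# Edges are pairs of independent atoms with probability 0.5 each.
size = matching_size_needed(target=0.9, p_edge=0.5 * 0.5)
```

The database size never enters the computation: only the edge probability and the target appear, which is the point of the "subtle" remark on the slide.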

There is not always a good-enough matching • Example: X1 ∧ (Y1 ∨ Y2 ∨ … Ym) ∨ (X2 ∧ Z) • Best matching is ≈ 0.4, but the formula is very close to 0.625! • Obs: if there is no “good-enough matching”, then the cover (the nodes in any maximal matching) must be small • Formally, {X1, X2} is a small cover • (Y1 ∨ Y2 ∨ … Ym) is a (k-1)-DNF, so approximate it: X1 ∧ APX(Y1 ∨ Y2 ∨ … Ym) • Must apx the (k-1)-DNF with a smaller e to account for correlations
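The two numbers on this slide can be reproduced under the assumption that every atom is independent with probability 0.5 (carried over from the previous slide) and that the last monomial is the conjunction X2 ∧ Z:

```latex
\Pr[l] \;=\; 1 - \Bigl(1 - \tfrac{1}{2}\bigl(1 - 2^{-m}\bigr)\Bigr)\Bigl(1 - \tfrac{1}{4}\Bigr)
\;\xrightarrow{\;m \to \infty\;}\; 1 - \tfrac{1}{2}\cdot\tfrac{3}{4} \;=\; 0.625,
\qquad
\Pr[M] \;=\; 1 - \Bigl(1 - \tfrac{1}{4}\Bigr)^{2} \;=\; 0.4375 .
```

Any matching can use X1 in at most one edge, so the best one has two edges, one of the form X1 ∧ Yi plus X2 ∧ Z, and it stays at about 0.44 no matter how large m grows.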

SL is always small • THM (SL is always small): the size of SL is constant in the data • Two cases: • Small good matching: we're done! • Small cover of important nodes: recurse on the (k-1)-DNF • Requires “non-vanishing” probs; in datasets, usually Pr > 10^-3 • Exponential in the query; similar to data complexity • Problem: maximum matching in general hypergraphs is NP-hard (apx NP-hard!) • We only need a maximal matching, so pick greedily!

Summary of Constructing SL • For SL, good lineage = big lineage • Not true in general. • Gave an algorithm that always finds small SL • Constant in the data • Exponential in almost everything else • Main trick: Don’t try to find optimal solutions, when sloppy is good enough!

Other fun results in the paper • Sufficient Lineage (SL) • Error bounds for QP • Finding influential tuples • Polynomial Lineage (PL): DNF to polynomial • Use Taylor/Fourier approximation of poly • Algos for QP, explanations and influential tuples • Leverage extensive prior art! PL smaller than SL, but not usable in pDBs (Mystiq, Trio).

Experiments • Geneontology Database • Publicly available • Predefined views • Atoms = “evidence codes” • Discuss a single view: “All proteins associated with a single protein” • 6 tables • 2 sources of evidence • 1119 tuples • 141MB • Similar results on IMDB data (not presented)

Compression Ratio v. Error • [Figure: compression ratio vs. error level e (smaller is more conservative)] • 30x compression: 141MB to 4MB • Good compression ratio even for stringent error

Effect on QP • Compute each tuple in the view • [Figure: running time in seconds (log10 scale) vs. error level e (smaller is more conservative), original lineage vs. sufficient lineage]

Which l's give the biggest gain? • Compressing a single view • [Figure: # terms for the top 500 formulae in descending size (# is rank), original lineage vs. sufficient lineage] • Win: compressing big terms

Conclusion • Discussed approximate lineage approach • Goal: Fast QP, Explanations • Sufficient Lineage • Can be used by standard QPs • Improves QP dramatically • Apx lineage is more general, e.g. Polynomial
