##### Approximate Lineage for Probabilistic Databases


**Approximate Lineage for Probabilistic Databases**
Christopher Ré and Dan Suciu, University of Washington

**Approximate Lineage in One Slide**
• Lineage (provenance): in a view, lineage is all derivations of a tuple
• In query processing (QP) for probabilistic databases, used to track correlations
• Explains query/view results
• VLDBs have lots of lineage, especially with complex queries/views
• Chokes QP
• Hard for users to understand
• Obs: lineage contains a lot of redundancy!
This work: approximate the lineage by keeping only the most important correlations.

**Overview**
• Motivation & Preliminaries
• An apx lineage approach: Sufficient Lineage
• Experiments
• Conclusions

**Inspired by the Geneontology (GO) Database**
• GO is a protein database; the setting is a standard pDB, e.g. Mystiq, Trio
• Data are from somewhere, so lineage (l) is important
• Lineage comes from a wide variety of sources, not all trusted the same: manually created, machine inferred, some with confidence, too!
[Figure: the Process (P) table, with each tuple annotated by its lineage atoms]

**Review: Lineage tracking**
Systems: PRA [Fuhr&Rolleke 97], Trio [Widom 05], Mystiq [R,Dalvi,S07]
• Lineage propagates with queries/views
• Example view: "Proteins related to the same process as 'Aac11'"
• How do we derive the lineage? V(y) :- P(x,y), P('Aac11', y), x ≠ 'Aac11'
• Lineage tracks all derivations of a view tuple over Process (P)
• Prob QP: Pr[V('AGO2')] = Pr[l1], where l1 is the lineage of that view tuple
• Big DB = Big Lineage: in GO, 1 tuple can have 10MB of lineage!
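To make "lineage tracks all derivations" concrete, here is a minimal Python sketch. It follows the English description of the view (proteins related to the same process as 'Aac11') rather than the exact column order of the rule, which is ambiguous in this transcript. The table contents, the process names, the atom names x1..x5, and the protein 'Other' are hypothetical; only 'Aac11' and 'AGO2' appear in the slides.

```python
from collections import defaultdict

# Hypothetical toy Process table: (protein, process, lineage atom).
P = [
    ("Aac11", "proc1", "x1"),
    ("Aac11", "proc2", "x2"),
    ("AGO2", "proc1", "x3"),
    ("AGO2", "proc2", "x4"),
    ("Other", "proc2", "x5"),
]

def view_lineage(table, target="Aac11"):
    """Map each protein sharing a process with `target` to its lineage.

    The lineage is a DNF stored as a set of monomials; each monomial is a
    frozenset of the atoms joined in one derivation of the view tuple.
    """
    target_rows = [(proc, atom) for prot, proc, atom in table if prot == target]
    lineage = defaultdict(set)
    for prot, proc, atom in table:
        if prot == target:
            continue  # the inequality in the rule excludes the target itself
        for target_proc, target_atom in target_rows:
            if proc == target_proc:
                # one derivation = conjunction of the two joined tuples' atoms
                lineage[prot].add(frozenset({atom, target_atom}))
    return dict(lineage)

lin = view_lineage(P)
for protein, dnf in sorted(lin.items()):
    text = " v ".join("(" + " ^ ".join(sorted(m)) + ")" for m in sorted(dnf, key=sorted))
    print(protein, "<-", text)
```

Every extra shared process adds another monomial to a tuple's DNF, which is how lineage grows with the database.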
Big Lineage chokes the engine!

**Problems with Large Lineage in pDB**
• Lineage is used to:
• Process queries (large lineage chokes QP)
• Give explanations to users (large lineage gives many redundant explanations)
• Find influential atoms (large lineage makes this a needle in a haystack)
This talk: on VLDBs, it is helpful to shrink (approximate) the lineage.

**Approximate Lineage Approach**
[Figure: Original VLDB; Level 1 database (big lineage, l); Level 2 database (small lineage, a), where a is a smaller, approximate formula within error e of l]
• All (most) querying is on the Level 2 database (using a instead of l)
• Focus is on the Level 2 database

**Overview**
• Motivation & Preliminaries
• An apx lineage approach: Sufficient Lineage
• Experiments
• Conclusions

**Sufficient lineage (SL)**
• How to represent the a's? As DNF formulae that logically imply l, so a is a lower bound of l
• How to use the a's to:
• Answer queries? Reuse existing systems!
• Provide explanations? Find influential tuples? See paper
• How to build a good a, efficiently? The remainder of this talk
Nugget: an algorithm that always finds small, good SL

**Formalizing "good a's"**
• Choosing an approximation a for a lineage function l
• An atom is a Boolean proposition. A world is a set of the true atoms.
• Formalizing this: E[l - a] ≤ e, i.e. the expectation of the difference over all worlds should be small
• Intuition: a should agree with l on most worlds
• NB: really the standard ℓ2 distance

**Illustrating Good Lineage**
• E[l - a] = E[l] - E[a] ≤ e
• Pr[l] = 0.9 * (1 - (1 - 0.8)(1 - 0.3)) = 0.9 * 0.86 = 0.774
• Pr[a] = 0.9 * 0.8 = 0.72
• e = 0.774 - 0.72 = 0.054
• Intuition: high Pr[a] means good lineage

**1st step: Lineage DNFs to "graphs"**
[Figure: graph on atoms X1 … Xn and Y1 … Ym, e.g. for (X1 ˄ Y1) ˅ (X2 ˄ Y1)]
• We can think of DNFs as graphs (a k-DNF is a k-hypergraph)
• Atoms = nodes
• Monomials = edges
• Trick: a matching is an SL formula
• Goal: given error e, find a subset of edges with error smaller than e and small size, i.e.
a best lower bound.

**How big a matching could we need?**
[Figure: graph on atoms X1 … Xn and Y1 … Ym]
• Assume Pr[Xi] = Pr[Yj] = 0.5
• Pr[M] = 1 - (1 - 0.25)^|M|
• A matching of size 9 implies Pr[M] > 0.9
• For any e > 0.1, M never needs more than 9 edges
• Subtle: the size bound depends on k, e and Pr[Xi], not on the # of tuples
• If l has a small good matching, take a to be the matching. Call this a "good-enough matching"

**There is not always a good-enough matching**
[Figure: graph with X1 joined to Y1, Y2, …, Ym, and X2 joined to Z]
• Formula: X1 ˄ APX(Y1 ˅ Y2 ˅ … ˅ Ym) ˅ (X2 ˄ Z), where (Y1 ˅ Y2 ˅ … ˅ Ym) is a (k-1)-DNF
• The best matching has probability about 0.4, but the formula is very close to 0.625!
• Formally, {X1, X2} is a small cover
• Obs: if there is no good-enough matching, then the cover (the nodes in any maximal matching) must be small
• Must apx the (k-1)-DNF with a smaller e to account for correlations

**SL is always small**
THM (SL is always small): the size of SL is constant in the data.
• Requires "non-vanishing" probs; in datasets, usually Pr > 10^-3
• Exponential in the query; similar to data complexity
Two cases:
• Small good matching: we're done!
• Small cover of important nodes: recurse on the (k-1)-DNF
Problem: maximum matching in general hypergraphs is NP-hard (even approximating it is NP-hard!). We only need a maximal matching, so pick greedily!

**Summary of Constructing SL**
• For SL, good lineage = big lineage
• Not true in general
• Gave an algorithm that always finds small SL
• Constant in the data
• Exponential in almost everything else
• Main trick: don't try to find optimal solutions when sloppy is good enough!

**Other fun results in the paper**
• Sufficient Lineage (SL)
• Error bounds for QP
• Finding influential tuples
• Polynomial Lineage (PL): DNF to polynomial
• Use Taylor/Fourier approximation of the polynomial
• Algos for QP, explanations and influential tuples
• Leverage extensive prior art!
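Looking back at the matching slides, the size bound and the greedy step can be sketched in a few lines of Python. This is an illustration under the slides' assumption Pr[atom] = 0.5 for a 2-DNF, not the paper's full recursive algorithm; the function names are made up here, and the example formula uses m = 3 for the "no good-enough matching" case.

```python
def matching_probability(size, edge_prob=0.25):
    """Pr[M] for a matching of `size` disjoint monomials, each true with
    probability edge_prob (0.25 = 0.5 * 0.5 when Pr[atom] = 0.5, k = 2)."""
    return 1.0 - (1.0 - edge_prob) ** size

def matching_size_needed(target, edge_prob=0.25):
    """Smallest |M| such that Pr[M] > target."""
    size = 0
    while matching_probability(size, edge_prob) <= target:
        size += 1
    return size

def greedy_maximal_matching(monomials):
    """Scan the monomials once, keeping each one that shares no atom with
    those already kept. The result is maximal, not necessarily maximum."""
    used, matching = set(), []
    for mono in monomials:
        if used.isdisjoint(mono):
            matching.append(mono)
            used |= set(mono)
    return matching

# DNF for (X1 ^ (Y1 v Y2 v Y3)) v (X2 ^ Z), one set of atoms per monomial.
dnf = [{"X1", "Y1"}, {"X1", "Y2"}, {"X1", "Y3"}, {"X2", "Z"}]
matching = greedy_maximal_matching(dnf)
cover = set().union(*matching)  # nodes of a maximal matching touch every edge

print(matching_size_needed(0.9))            # size needed for Pr[M] > 0.9
print(matching_probability(len(matching)))  # probability of the greedy matching
```

On this formula the greedy matching has only two edges, so its probability stays far below that of the full formula, which is the case where the algorithm falls back to the small cover and recurses.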
PL is smaller than SL, but not usable in pDBs (Mystiq, Trio).

**Overview**
• Motivation & Preliminaries
• An apx lineage approach: Sufficient Lineage
• Experiments
• Conclusions

**Experiments**
• Geneontology Database
• Publicly available
• Predefined views
• Atoms = "evidence codes"
• Discuss a single view: "All proteins associated with a single protein"
• 6 tables
• 2 sources of evidence
• 1119 tuples
• 141MB
Similar results on IMDB data (not presented).

**Compression Ratio v. Error**
[Plot: compression ratio vs. e, error level (smaller is more conservative)]
• 30x compression: 141MB to 4MB
• Good compression ratio even for stringent error

**Effect on QP**
[Plot: running time in seconds (log10 scale) vs. e, error level (smaller is more conservative), for original lineage and sufficient lineage]
• Task: compute each tuple in the view

**Which l's give the biggest gain?**
[Plot: # terms for original lineage and sufficient lineage when compressing a single view; x-axis is the top 500 formulas in descending size (# is rank)]
• Win: compressing big terms

**Conclusion**
• Discussed the approximate lineage approach
• Goal: fast QP, explanations
• Sufficient Lineage
• Can be used by standard QPs
• Improves QP dramatically
• Apx lineage is more general, e.g. Polynomial
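As a closing check of the error measure E[l - a] from the "Illustrating Good Lineage" slide, the sketch below recomputes the slide's numbers by enumerating all worlds. The formulas are an assumption reconstructed from the slide's arithmetic: l = (X ˄ Y1) ˅ (X ˄ Y2) and a = X ˄ Y1, with Pr[X] = 0.9, Pr[Y1] = 0.8, Pr[Y2] = 0.3. The atom names and the function name are hypothetical.

```python
from itertools import product

def dnf_probability(dnf, probs):
    """Pr[dnf] over independent atoms: sum the weight of every world
    (True/False assignment to the atoms) in which some monomial holds."""
    atoms = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        weight = 1.0
        for atom in atoms:
            weight *= probs[atom] if world[atom] else 1.0 - probs[atom]
        if any(all(world[x] for x in mono) for mono in dnf):
            total += weight
    return total

probs = {"X": 0.9, "Y1": 0.8, "Y2": 0.3}
lineage = [("X", "Y1"), ("X", "Y2")]  # assumed l
sufficient = [("X", "Y1")]            # assumed a, a subset of l's monomials

pr_l = dnf_probability(lineage, probs)
pr_a = dnf_probability(sufficient, probs)
# a implies l, so l - a is 0 or 1 in every world and E[l - a] = Pr[l] - Pr[a].
error = pr_l - pr_a
print(round(pr_l, 3), round(pr_a, 3), round(error, 3))
```

The enumeration visits 2^n worlds for n atoms, so it only works on toy formulas, which is one way to see why large lineage chokes QP.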