
Approximate Lineage for Probabilistic Databases

Approximate Lineage for Probabilistic Databases. Christopher Ré and Dan Suciu, University of Washington. Approximate Lineage in One Slide: lineage (provenance) is used in QP to track correlations and to explain query/view results. VLDBs have lots of lineage, which chokes QP and is hard for users to understand.


Presentation Transcript


  1. Approximate Lineage for Probabilistic Databases. Christopher Ré and Dan Suciu, University of Washington

  2. Approximate Lineage in One Slide
     • Lineage (Provenance): in a view, lineage is all derivations of a tuple
       • In QP (probabilistic databases), used to track correlations
       • Explain query/view results
     • VLDBs have lots of lineage, especially with complex queries/views
       • Chokes QP
       • Hard for users to understand
     • Obs: lineage contains a lot of redundancy!
     • This work: approximate the lineage by keeping only the most important correlations

  3. Overview • Motivation & Preliminaries • An apx lineage approach: Sufficient Lineage • Experiments • Conclusions

  4. Inspired by the Gene Ontology (GO) Database, a protein database
     • Standard pDB, e.g. Mystiq, Trio
     • Data are from somewhere: lineage (l) is important
     • Lineage comes from a wide variety of sources, not all trusted the same
       • Manually created
       • Machine inferred
     • Some with confidence, too!
     [Figure: table Process (P), with each tuple annotated by atoms]

  5. Review: Lineage tracking (PRA [Fuhr & Rolleke 97], Trio [Widom 05], Mystiq [R, Dalvi, S 07])
     • Lineage propagates with queries/views
     • Example view: “Proteins related to same process as `Aac11’”
       V(y) :- P(x,y), P(`Aac11’, y), x ≠ `Aac11’
     • How do we derive the lineage? Lineage tracks all derivations (l1 in the figure)
     • Prob QP: Pr[V(‘AGO2’)] = Pr[l1]
     • Big DB = Big Lineage: in GO, 1 tuple can have 10MB of lineage!
     • Big lineage chokes the engine!
     [Figure: table Process (P) and the derived lineage l1]

  6. Problems with Large Lineage in pDB (this talk)
     • Lineage is used to:
       • Process queries. Large: chokes QP
       • Give explanations to users. Large: many redundant explanations
       • Find influential atoms. Large: needle in a haystack
     • On VLDBs, it is helpful to shrink (approximate) the lineage

  7. Approximate Lineage Approach
     [Diagram: Original VLDB, Level 1 Database (big lineage, l), Level 2 Database (small lineage, a), with error e between l and a; a is a smaller, approximate formula]
     • All (most) querying on Level 2 database (using a instead of l)
     • Focus is on the Level 2 database

  8. Overview • Motivation & Preliminaries • An apx lineage approach: Sufficient Lineage • Experiments • Conclusions

  9. Sufficient lineage (SL)
     • Represent a's? DNF formulae that logically imply l
     • Use a's to:
       • Answer queries? Reuse existing systems! a is a lower bound of l
       • Provide explanations? See paper
       • Find influential tuples? See paper
     • Build good a, efficiently? The remainder of this talk
     • Nugget: an algorithm that always finds small, good SL

  10. Formalizing “good a's”
      • Choosing an approximation a for a lineage function l
      • An atom is a Boolean proposition. A world is a set of the true atoms.
      • Formalizing this: E[l − a] ≤ e
      • The expectation of the difference, taken over all worlds, should be small
      • Intuition: a should agree with l on most worlds
      • NB: really the standard ℓ2 distance

  11. Illustrating Good Lineage
      • E[l − a] = E[l] − E[a] ≤ e
      • Pr[l] = 0.9 * (1 − (1 − 0.8)(1 − 0.3)) = 0.9 * 0.86 = 0.774
      • Pr[a] = 0.9 * 0.8 = 0.72
      • e = 0.774 − 0.72 = 0.054
      • Intuition: Pr[a] high means good lineage
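The arithmetic on this slide can be checked by brute force. The sketch below is only an illustration, not the authors' code: it assumes independent atoms, and the atom names X, Y1, Y2 are hypothetical labels chosen so that the probabilities 0.9, 0.8 and 0.3 reproduce the slide's numbers, with l = (X ∧ Y1) ∨ (X ∧ Y2) and a = X ∧ Y1.

```python
from itertools import product


def dnf_probability(monomials, prob):
    """Pr[DNF is true] by enumerating every world over independent atoms.

    monomials: list of tuples of atom names (each tuple is a conjunction).
    prob: dict mapping atom name -> marginal probability (assumed independent).
    """
    atoms = sorted({x for m in monomials for x in m})
    total = 0.0
    for bits in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, bits))
        # The DNF holds in this world if some monomial has all atoms true.
        if any(all(world[x] for x in m) for m in monomials):
            weight = 1.0
            for x in atoms:
                weight *= prob[x] if world[x] else 1.0 - prob[x]
            total += weight
    return total


# Hypothetical atoms and probabilities matching the numbers on the slide.
prob = {"X": 0.9, "Y1": 0.8, "Y2": 0.3}
lineage = [("X", "Y1"), ("X", "Y2")]   # l
sufficient = [("X", "Y1")]             # a: a subset of l's monomials

pr_l = dnf_probability(lineage, prob)
pr_a = dnf_probability(sufficient, prob)
error = pr_l - pr_a                    # E[l - a], since a implies l
print(round(pr_l, 3), round(pr_a, 3), round(error, 3))
```

Enumeration is exponential in the number of atoms, so this is only suitable for toy formulas like the one above.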

  12. 1st step: Lineage DNFs to “graphs”
      • Example: (X1 ∧ Y1) ∨ (X2 ∧ Y1)
      • We can think of DNFs as graphs (a k-DNF corresponds to a k-hypergraph)
        • Atoms = nodes
        • Monomials = edges
      • Trick: a matching is an SL formula.
      • Goal: given error e, find a subset of edges with error smaller than e and small size, i.e. a best lower bound.
      [Figure: bipartite graph on nodes X1, ..., Xn and Y1, ..., Ym]

  13. How big a matching could we need?
      • Assume Pr[Xi] = Pr[Yj] = 0.5
      • Pr[M] = 1 − (1 − 0.25)^|M|
      • A matching of size 9 implies Pr[M] > 0.9
      • For any e > 0.1, a matching M of size at most 9 always suffices
      • Subtle: the size bound depends on k, e and Pr[Xi], not on the # of tuples
      • If l has a small good matching, take a to be that matching. Call this a “good enough matching”
      [Figure: bipartite graph on nodes X1, ..., Xn and Y1, ..., Ym]
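The size bound on this slide follows from the fact that the edges of a matching share no atoms, so they are independent events. A minimal sketch, assuming independent atoms that all have the same probability (the function names and parameters are hypothetical, introduced only for illustration):

```python
def matching_probability(size, atom_prob=0.5, k=2):
    """Pr[M] for a matching of `size` disjoint k-atom monomials.

    Each monomial is true with probability atom_prob**k, and disjoint
    monomials are independent, so Pr[M] = 1 - (1 - atom_prob**k)**size.
    """
    return 1.0 - (1.0 - atom_prob ** k) ** size


def smallest_matching_size(target, atom_prob=0.5, k=2):
    """Smallest |M| with Pr[M] >= target."""
    if not (0.0 < target < 1.0) or not (0.0 < atom_prob <= 1.0):
        raise ValueError("need 0 < target < 1 and 0 < atom_prob <= 1")
    size = 0
    while matching_probability(size, atom_prob, k) < target:
        size += 1
    return size


print(smallest_matching_size(0.9))          # slide's setting: p = 0.5, k = 2
print(round(matching_probability(9), 4))
```

Note that the answer depends only on the target, k and the atom probability, never on how many edges the full lineage has, which is the point the slide calls subtle.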

  14. There is not always a good-enough matching
      • Lineage: (X1 ∧ (Y1 ∨ Y2 ∨ … ∨ Ym)) ∨ (X2 ∧ Z)
      • Best matching is ≈ 0.4, but the formula is very close to 0.625!
      • Obs: if there is no “good-enough matching”, then the cover (the nodes in any maximal matching) must be small
      • Formally, {X1, X2} is a small cover
      • Approximate as (X1 ∧ APX(Y1 ∨ Y2 ∨ … ∨ Ym)) ∨ (X2 ∧ Z), where (Y1 ∨ Y2 ∨ … ∨ Ym) is a (k-1)-DNF
      • Must apx the (k-1)-DNF with a smaller e to account for correlations
      [Figure: graph with edges from X1 to each of Y1, ..., Ym and one edge from X2 to Z]
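The two numbers on this slide can be reproduced in closed form. This sketch makes two assumptions that are not stated on the slide itself: all atoms are independent with probability 0.5 (carried over from the previous slide), and the last disjunct is read as X2 ∧ Z, which is the reading under which 0.625 comes out. Function names are hypothetical.

```python
def formula_probability(m, p=0.5):
    """Pr[(X1 and (Y1 or ... or Ym)) or (X2 and Z)] for independent atoms."""
    star = p * (1.0 - (1.0 - p) ** m)   # X1 and at least one Yi
    edge = p * p                        # X2 and Z
    # The two disjuncts share no atoms, so they are independent.
    return 1.0 - (1.0 - star) * (1.0 - edge)


def best_matching_probability(p=0.5):
    """Every Yi edge touches X1, so a matching has at most one of them
    plus the (X2, Z) edge: two disjoint 2-atom monomials in total."""
    return 1.0 - (1.0 - p * p) ** 2


print(best_matching_probability())          # about 0.4
print(round(formula_probability(20), 4))    # approaches 0.625 as m grows
```

The gap of roughly 0.19 does not shrink as m grows, which is why a matching alone cannot serve as the approximation here and the algorithm falls back on the small cover {X1, X2}.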

  15. SL is always small
      • THM (SL is always small): the size of SL is constant in the data.
        • Requires “non-vanishing” probs. In datasets, usually, Pr > 10^-3
        • Exponential in the query. Similar to data complexity
      • Two cases:
        • Small good matching: we’re done!
        • Small cover of important nodes: recurse on the (k-1)-DNF
      • Problem: maximum matching in general hypergraphs is NP-hard (apx NP-hard!)
      • We only need a maximal matching: pick greedily!
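The greedy step can be sketched as follows. This is an illustration rather than the authors' implementation: the slide only says to pick a maximal matching greedily, so visiting monomials in descending probability is an assumed heuristic, and the function name and inputs are hypothetical.

```python
def greedy_maximal_matching(monomials, prob):
    """Greedily pick pairwise-disjoint monomials (hyperedges).

    monomials: list of tuples of atom names.
    prob: dict mapping atom name -> probability (atoms assumed independent).
    The result is maximal: every monomial left out shares an atom with
    a chosen one. It is not guaranteed to be a maximum matching.
    """
    def weight(monomial):
        w = 1.0
        for atom in monomial:
            w *= prob[atom]
        return w

    used_atoms = set()
    matching = []
    # Assumed heuristic: try the most probable monomials first.
    for monomial in sorted(monomials, key=weight, reverse=True):
        if used_atoms.isdisjoint(monomial):
            matching.append(monomial)
            used_atoms.update(monomial)
    return matching


prob = {"X1": 0.5, "X2": 0.5, "Y1": 0.5, "Y2": 0.5}
edges = [("X1", "Y1"), ("X2", "Y1"), ("X2", "Y2")]
matching = greedy_maximal_matching(edges, prob)
print(matching)
```

A single pass over the sorted monomials is enough, which avoids the NP-hard maximum-matching problem at the cost of possibly returning a smaller matching.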

  16. Summary of Constructing SL • For SL, good lineage = big lineage • Not true in general. • Gave an algorithm that always finds small SL • Constant in the data • Exponential in almost everything else • Main trick: Don’t try to find optimal solutions, when sloppy is good enough!

  17. Other fun results in the paper
      • Sufficient Lineage (SL)
        • Error bounds for QP
        • Finding influential tuples
      • Polynomial Lineage (PL): DNF to polynomial
        • Use Taylor/Fourier approximation of the polynomial
        • Algos for QP, explanations and influential tuples
        • Leverage extensive prior art!
      • PL is smaller than SL, but not usable in pDBs (Mystiq, Trio).

  18. Overview • Motivation & Preliminaries • An apx lineage approach: Sufficient Lineage • Experiments • Conclusions

  19. Experiments
      • Gene Ontology Database
        • Publicly available
        • Predefined views
        • Atoms = “evidence codes”
      • Discuss a single view: “All proteins associated with a single protein”
        • 6 tables
        • 2 sources of evidence
        • 1119 tuples
        • 141MB
      • Similar results on IMDB data, not presented

  20. Compression Ratio v. Error
      [Plot: compression ratio (y-axis) against e, error level (x-axis; smaller is more conservative)]
      • 30x compression: 141MB to 4MB
      • Good compression ratio even for stringent error

  21. Effect on QP
      • Compute each tuple in the view
      [Plot: running time in seconds, log10 scale (y-axis) against e, error level (x-axis; smaller is more conservative), for original lineage and sufficient lineage]

  22. Which l's give the biggest gain?
      [Plot: # terms (y-axis) for the top 500 formulas in descending size (x-axis; # is rank), original lineage vs. sufficient lineage, compressing a single view]
      • Win: compressing big terms

  23. Conclusion • Discussed approximate lineage approach • Goal: Fast QP, Explanations • Sufficient Lineage • Can be used by standard QPs • Improves QP dramatically • Apx lineage is more general, e.g. Polynomial
