
Lineage Processing over Correlated Probabilistic Databases

Lineage Processing over Correlated Probabilistic Databases. Bhargav Kanagal Amol Deshpande University of Maryland. Motivation: Information Extraction/Integration. Structured entities extracted from text in the internet. ...located at 52 A Goregaon West Mumbai . ADDRESS SEGMENTATION.


Presentation Transcript


  1. Lineage Processing over Correlated Probabilistic Databases. Bhargav Kanagal, Amol Deshpande, University of Maryland.

  2. Motivation: Information Extraction/Integration. Structured entities are extracted from text on the internet, e.g. "...located at 52 A Goregaon West Mumbai ...". Figure labels: ADDRESS SEGMENTATION, INFORMATION EXTRACTION, SENTIMENT ANALYSIS; tables Location, CarAds, Reputed; CORRELATIONS. [Gupta & Sarawagi 2006, Jayram et al. 2006]

  3. Why Lineage Processing? Query: list all "reputed" car sellers in "Mumbai" who offer Honda cars. SELECT SellerId FROM Location, CarAds, Reputed WHERE reputation = 'good' AND city = 'Mumbai' AND Location.SellerId = CarAds.SellerId AND CarAds.SellerId = Reputed.SellerId. We need to compute the probability of the boolean formula (the lineage) attached to each result tuple. [Das Sarma et al. 2006]

  4. Motivation: RFID based Event Monitoring. A building is instrumented with RFID readers to track assets / personnel. RFID readings are noisy: readers miss readings and add spurious readings. The readings are therefore subjected to probabilistic modeling: probabilities are associated with events, e.g. found(PC, X, 2pm), prob = 0.9, and there are spatial and temporal correlations. Query: was the PC correctly transferred from room A to the conference room? Formula: found(x,PC)∧found(z,PC)∧[found(y1,PC)∨found(y2,PC)]. [RFID Ecosystem UW, Diao et al. 2009, Letchner et al. 2009, KD 2008]

  5. PrDB System Overview. The user inserts data + correlations, e.g. insert into reputation values ('z1', 219, uncertain('Good 0.5; Bad 0.5')); and insert factor '0 0 1; 1 1 1' in address on 'y1.e', 'y2.e'; and then issues SPJ queries, inference queries, and aggregation queries. Components shown: Query Processor, Parser, INDSEP Manager, INDSEP indexes, data tables, uncertainty parameters, and a relational DBMS. [Kanagal & Deshpande SIGMOD 2009, SDG08, www.cs.umd.edu/~amol/PrDB/]

  6. Outline • Motivation & Problem definition [done] • Background • Probabilistic Databases as Junction trees • Query processing over Junction trees • INDSEP • Lineage Processing over Junction trees • Lineage Processing using INDSEP • Results

  7. Background: ProbDBs as Junction trees. Tuple uncertainty: each tuple is a random variable (1 if the tuple exists, 0 otherwise). Attribute uncertainty is converted to tuple uncertainty. Correlations: a junction tree is a concise encoding of the joint probability distribution. The database is a forest of junction trees, and query evaluation is performed directly over the junction trees.

  8. Background: Junction trees. Example: cliques p(a,b,c) and p(b,c,d), separator p(b,c). Each clique and separator stores a joint pdf (its POTENTIAL). The tree structure reflects the Markov property: given b and c, a is independent of d. Quantities illustrated: the joint distribution and the marginal p(a,d).
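For the two-clique example on this slide, the factorization implied by the Markov property can be written out. This is a sketch that assumes the tree has exactly the cliques {a,b,c} and {b,c,d} joined by the separator {b,c}, as the listed potentials suggest:

```latex
\begin{align*}
p(a,b,c,d) &= \frac{p(a,b,c)\,p(b,c,d)}{p(b,c)} \\
p(a,d)     &= \sum_{b,c} \frac{p(a,b,c)\,p(b,c,d)}{p(b,c)}
\end{align*}
```

The second line is the marginal query p(a,d): the joint is never materialized beyond the two cliques, and b, c are summed out.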

  9. Marginal Computation. Build a Steiner tree over the query variables, e.g. {b, c, n}, and send messages toward a given pivot node: keep query variables, keep correlations, remove the others. Problems: for ProbDBs with ≈ 1 million tuples this is not scalable. The span of the query can be very large, with almost the complete database accessed even for a 3-variable query. Searching for cliques is expensive: a linear scan over all the nodes is inefficient.

  10. Shortcut Potentials. How can we make marginal computation scalable? A shortcut potential is the distribution over the boundary separators that is required to completely shortcut the partition. Example: junction tree on the set of variables {c, f, g, j, k, l, m} (figure annotations: 100 ops, 50 ops). Which shortcut potentials to build?

  11. INDSEP: Overview. Obtained by hierarchical partitioning of the junction tree (nodes: Root; I1, I2, I3; P1 to P6). An index node stores: the variables of each child, {a,b,..}, {c,f,..}, {j,n..q}; the child separators, p(c), p(j); the tree induced on the children; the shortcut potentials of the children, {p(c), p(c,j), p(j)}. Actual construction: [Kanagal & Deshpande SIGMOD 2009]

  12. Computing Marginals using INDSEP. Recursion on INDSEP for the query {b, c, n}. Figure labels: Root; I1, I2, I3; P1 to P6; variable sets {b, c, n}, {b, c}, {c, j}, {j, n}, {b, c}, {n}; intermediate junction tree. [Kanagal & Deshpande SIGMOD 2009]

  13. Outline • Motivation & Problem definition [done] • Background [done] • Junction trees & Query processing over junction trees • INDSEP • Lineage Processing over Junction trees • Lineage Processing using INDSEP • Results

  14. Lineage Processing. Lineages are typically classified into 2 types: read-once, e.g. (a∧b)∨(c∧d), and non-read-once, e.g. (a∧b)∨(b∧c)∨(c∧d). The problem of lineage processing is #P-complete in general for correlated probabilistic databases, even for read-once lineages (reduction from #DNF).
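The difference between the two example lineages can be seen by counting variable occurrences. The sketch below is only a syntactic check on a DNF written as a list of terms: `repeated_variables` is a hypothetical helper, not part of PrDB, and an empty result is sufficient but not necessary for a formula to be read-once (a formula with repeats may still have an equivalent read-once form).

```python
from collections import Counter

def repeated_variables(dnf):
    """Return the variables that occur in more than one term of a DNF.

    `dnf` is a list of terms, each term an iterable of variable names.
    """
    counts = Counter(v for term in dnf for v in set(term))
    return sorted(v for v, n in counts.items() if n > 1)

read_once = [("a", "b"), ("c", "d")]                   # (a∧b)∨(c∧d)
non_read_once = [("a", "b"), ("b", "c"), ("c", "d")]   # (a∧b)∨(b∧c)∨(c∧d)

print(repeated_variables(read_once))
print(repeated_variables(non_read_once))
```

In the second formula b and c each appear in two terms, which is what prevents the terms from being treated as separate sub-formulas over disjoint variables.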

  15. Lineage Processing on Junction trees. Naïve: evaluate a marginal query over the variables in the formula, e.g. p(a, b, c, d) for (a∧b)∨(c∧d), then reduce it step by step: multiply with p(a∧b|a,b) to get p(a, b, a∧b, c, d); eliminate a, b to get p(a∧b, c, d); multiply to get p(a∧b, c, d, c∧d); eliminate to get p(a∧b, c∧d); multiply / eliminate to get p((a∧b)∨(c∧d)). This process is called simplification. Complexity: dependent on the size of the intermediate pdf. Here, it is at least (n+1) (#terms in the formula). Not scalable to large formulae.
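The multiply / eliminate steps on this slide can be traced on a toy table. This is a minimal sketch, not the PrDB implementation: the joint p(a, b, c, d) is a made-up correlated distribution, factors are plain dictionaries, and `make_factor`, `multiply`, `eliminate`, and `gate` are hypothetical helper names.

```python
from itertools import product
import operator

def make_factor(scope, fn):
    """Tabulate fn over every 0/1 assignment to the variables in scope."""
    table = {vals: fn(dict(zip(scope, vals)))
             for vals in product((0, 1), repeat=len(scope))}
    return scope, table

def multiply(f, g):
    """Pointwise product of two factors over the union of their scopes."""
    (fs, ft), (gs, gt) = f, g
    scope = fs + tuple(v for v in gs if v not in fs)
    return make_factor(
        scope,
        lambda x: ft[tuple(x[v] for v in fs)] * gt[tuple(x[v] for v in gs)])

def eliminate(f, drop):
    """Sum out the variables in `drop`."""
    fs, ft = f
    scope = tuple(v for v in fs if v not in drop)
    table = {}
    for vals, p in ft.items():
        key = tuple(val for name, val in zip(fs, vals) if name not in drop)
        table[key] = table.get(key, 0.0) + p
    return scope, table

def gate(op, x, y, out):
    """Deterministic factor p(out | x, y): 1 when out == op(x, y), else 0."""
    return make_factor((x, y, out),
                       lambda v: 1.0 if v[out] == op(v[x], v[y]) else 0.0)

# Assumed toy joint p(a, b, c, d); the weights sum to 32, so it is normalized.
joint = make_factor(
    ("a", "b", "c", "d"),
    lambda v: (1 + v["a"] * v["b"] + 2 * v["c"] * v["d"] + v["a"] * v["d"]) / 32)

def by_enumeration(joint):
    """Reference answer: add up the assignments that satisfy the formula."""
    scope, table = joint
    total = 0.0
    for vals, p in table.items():
        v = dict(zip(scope, vals))
        if (v["a"] and v["b"]) or (v["c"] and v["d"]):
            total += p
    return total

def by_simplification(joint):
    """Multiply with a gate factor, then eliminate its inputs, three times."""
    sizes = []  # number of variables in each intermediate product
    f = joint
    for g, drop in [(gate(operator.and_, "a", "b", "ab"), {"a", "b"}),
                    (gate(operator.and_, "c", "d", "cd"), {"c", "d"}),
                    (gate(operator.or_, "ab", "cd", "q"), {"ab", "cd"})]:
        f = multiply(f, g)
        sizes.append(len(f[0]))
        f = eliminate(f, drop)
    return f[1][(1,)], sizes

p_ref = by_enumeration(joint)
p_simplified, sizes = by_simplification(joint)
print(p_ref, p_simplified, sizes)
```

Both routes give the same probability for (a∧b)∨(c∧d). The `sizes` list records how many variables each intermediate product holds; the first product is over a, b, c, d plus a∧b, which is the n+1 blow-up the slide warns about.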

  16. Lineage Processing [Optimization opportunities]. Opportunity 1, EAGER: exploit conditional independence and simplify early. Query: (a∧b)∨(c∧d). Figure labels: PIVOT; p(a, c, d), p(a, d), p(a, c∧d), p(a, d). [Kanagal & Deshpande SIGMOD 2010]

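  17. Lineage Processing [Optimization opportunities]. Opportunity 2, EAGER+ORDER: distribute simplification into the product. Query: (c∧h)∨(m∧n), with factors p(f, h), p(c, f, g), p(g, m∧n). Without ordering: p(c, f, g, h) and p(g, m∧n) are multiplied into p(c, f, g, h, m∧n), then reduced to p(c, h, m∧n), p(c∧h, m∧n), and p((c∧h)∨(m∧n)); max pdf: 5. With ordering: p(f, h) and p(c, f, g) are first simplified to p(g, c∧h), then combined with p(g, m∧n); max pdf: 4. How to compute a good ordering? [Kanagal & Deshpande SIGMOD 2010]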

  18. Lineage Processing [Pivot Selection]. The pivot also influences the intermediate pdf size. For (b∧c)∨g: pivot = (ab) gives max pdf 3; pivot = (cfg) gives max pdf 4. Optimal pivot: there are only n possible choices, so estimate the pdf size for each pivot location.

  19. Outline • Motivation & Problem definition [done] • Background [done] • Junction trees & Query processing over junction trees • INDSEP • Lineage Processing over Junction trees [done] • Lineage Processing using INDSEP • Results

  20. Lineage Processing using INDSEP. Query: (b∧c)∨((d∨e)∧(n∨o)). Figure labels: Root; I1, I2, I3; P1 to P6; variable sets {b∧c, d∨e, c}, {j, n∨o}, {b, c, d, e}, {c, j}, {n, o}. The recursion is bottomed out using EAGER+ORDER. But what is the running time?

  21. Lineage Planning Phase. Query: (b∧c)∨((d∨e)∧(n∨o)). Query plan: estimate the maximum intermediate pdf size at each node (estimates in the figure: 4, 4, 7, 5, 6, 4, 4). If a node exceeds a threshold, do approximations to estimate the probability. In addition, modify the query plan for multiple lineages that share variables and for exploiting disconnections.

  22. Results. Datasets: D1, fully independent; D2, correlated; D3, highly correlated (long chains). Comparison systems: NAIVE, EAGER, EAGER+ORDER. [Plots: query processing times for different heuristics; log scale.] EAGER+ORDER is much more efficient than the others.

  23. Results. [Plots: query processing time vs lineage size; ratio vs sharing factor. Note: log scale.] Query processing time is highly dependent on the size of the lineage. Multiquery processing exploits sharing.

  24. Conclusions. Proposed a scalable system for evaluating boolean formula queries over correlated probabilistic databases. Future work: develop the approximation approaches further; envelopes of boolean formulas for upper and lower bounds. Thank you.

  25. Lineage Processing (contd.). Construct a complete graph on the factors to be multiplied. Each edge weight is the amount of simplification possible when the two nodes are multiplied, e.g. p(f, h) and p(c, f, g) simplify to p(g, c∧h), weight 4 - 2. Pick the biggest edge, merge / simplify its nodes together, and recompute the new edge weights.
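The greedy ordering can be sketched on factor scopes alone, without probability tables. The simplification rule used here is an assumption made for illustration: a merged factor may collapse the inputs of a sub-formula into one derived variable once all of those inputs are present and appear in no other factor, and may drop any variable that no other factor or pending sub-formula needs. `simplify`, `greedy_order`, and the `derived` / `keep` encodings are hypothetical names, not PrDB's API.

```python
from itertools import combinations

def simplify(scope, others, derived, keep):
    """Shrink a merged scope; `derived` maps a new variable to its inputs."""
    scope = set(scope)
    elsewhere = set().union(*others)
    changed = True
    while changed:
        changed = False
        for name, parts in derived.items():
            # Collapse a sub-formula whose inputs are all local to this factor.
            if parts <= scope and not (parts & elsewhere):
                scope = (scope - parts) | {name}
                changed = True
    # Keep variables that other factors or pending sub-formulas still need.
    needed = elsewhere | keep | set().union(*derived.values())
    return frozenset(v for v in scope if v in needed)

def greedy_order(factors, derived, keep):
    """Repeatedly merge the pair of factors with the largest edge weight."""
    factors = [frozenset(f) for f in factors]
    max_size = max(len(f) for f in factors)
    steps = []
    while len(factors) > 1:
        best = None
        for i, j in combinations(range(len(factors)), 2):
            rest = [f for k, f in enumerate(factors) if k not in (i, j)]
            merged = factors[i] | factors[j]
            reduced = simplify(merged, rest, derived, keep)
            weight = len(merged) - len(reduced)   # e.g. 4 - 2 on the slide
            if best is None or weight > best[0]:
                best = (weight, merged, reduced, rest)
        _, merged, reduced, rest = best
        max_size = max(max_size, len(merged))
        steps.append((sorted(merged), sorted(reduced)))
        factors = rest + [reduced]                # edge weights are recomputed
    return steps, max_size

# Query (c∧h)∨(m∧n) with factors p(f,h), p(c,f,g), p(g,m∧n); m∧n is atomic here.
derived = {"c&h": {"c", "h"}, "Q": {"c&h", "m&n"}}
steps, max_size = greedy_order(
    [{"f", "h"}, {"c", "f", "g"}, {"g", "m&n"}], derived, keep={"Q"})
print(steps, max_size)
```

On these factors the first merge picked is p(f, h) with p(c, f, g), and the largest intermediate scope has 4 variables, in line with the "Max pdf: 4" figure quoted for the EAGER+ORDER example.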

  26. Lineage Processing via INDSEP [Improvement 1]. Multiple lineage processing: exploit the possibility of sharing, e.g. between (m∧c)∨g and (n∧c)∨g. Figure labels: Root; I1, I2, I3; P1 to P6; variable sets {j, m}, {c, g, j}, {c, g, j}, {j, n}. Sharing is possible across multiple levels. The lineages need not even share variables, just paths.

  27. Lineage Processing via INDSEP [Improvement 2]. Extend to a forest of junction trees: real-world data sets may have independences. The index is constructed to minimize disk wastage, combining forests together. Example: (a∧o). Figure labels: Root; I1, I2, I3; P1 to P6; variable sets {a, c}, {j, o}, {c, j}, {a, c}, {j}, {o}. j and o are disconnected; a and o are disconnected. Preprocess the formula and keep variables in connected components together.

  28. Lineage Processing via INDSEP [Improvement 3]. What about complexity? Complexity is not evident from the algorithm. Figure labels: Root; I1, I2, I3; P1 to P6; variable sets {b, c, d, e}, {c, j}, {j, n∨o}, {n, o}, {b∧c, d∨e, c}, {b∧c, c}, {d∨e}, {j, n}, {o}. Compute lwidth on each intermediate junction tree to "predict" how large the intermediate cliques will be. Approximate for all portions whose estimate is more than a threshold, e.g., 10.
