1 / 36

Agrégation de documents XML probabilistes

Agrégation de documents XML probabilistes. Serge Abiteboul 1 , T.-H. Hubert Chan 2 , Evgeny Kharlamov 1,3 Werner Nutt 3 , Pierre Senellart 4 1 INRIA Saclay – Île-de-France 2 The University of Hong-Kong 3 Free University of Bozen-Bolzano 4 Télécom ParisTech. Incomplete Databases.

ariane
Download Presentation

Agrégation de documents XML probabilistes

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Agrégation de documents XML probabilistes Serge Abiteboul1, T.-H. Hubert Chan 2, Evgeny Kharlamov 1,3 Werner Nutt 3, Pierre Senellart 4 1 INRIA Saclay – Île-de-France 2 The University of Hong-Kong 3 Free University of Bozen-Bolzano 4 Télécom ParisTech

  2. Incomplete Databases An incomplete database D contains many instances D = { d1,..., dn,...} Query q(x), constant c • c is a certainanswer for q if c  q(di) for all di D • c is a possibleanswer for q if c  q(di) for some di D Many ways to represent incomplete databases Agrégation de documents XML probabilistes

  3. Probabilistic Databases Incomplete database D = { d1,..., dn } • with probabilitiesPr(di) > 0 for each instance • such that Pr(d1) + ... + Pr(dn) = 1 Query q returns constant c with probability p if p =  Pr(di) cq(di) • Mainly studied in the relational setting • Imprecise data on the Web Probabilistic XML Agrégation de documents XML probabilistes

  4. IT-personnel person person name bonus name bonus John Laptop PDA Mary 37 50 30 44 Personnel Data, Instance 1 Agrégation de documents XML probabilistes

  5. IT-personnel person person name bonus name bonus Rick PDA PDA Mary 25 50 44 Personnel Data, Instance 2 Agrégation de documents XML probabilistes

  6. Example: Personnel Queries “What are the names of the IT personnel?” “What bonuses were paid for the PDA project?” “What is the sum of bonuses paid to all employees?" Agrégation de documents XML probabilistes

  7. Personnel DB: Certain/Possible Answers “What are the names of the IT personnel?” Mary: certain Rick: possible “What bonuses were paid for the PDA project?” 44: certain 15: possible “What is the sum of bonuses paid to all employees?“ no certain answer 161, 119: possible  Aggregate queries depend on the presence of many data Agrégation de documents XML probabilistes

  8. If We Had Probabilities … … we could ask • What is the probability that the sum of bonuses = 161? • What are all possible sums of bonuses? And what is each one’s probability? • What is the expected value of the sum of bonuses? And what the variance? Distribution of sums of bonuses Moments Agrégation de documents XML probabilistes

  9. Our focus The Problem Space Probabilistic XMLDocument Models Events Distributional Nodes (MUX-DET) COUNT SUM MIN COUNTD AVG Aggregate Function Single Path Queries Tree Pattern Queries Query Language Tree Pattern Querieswith Joins Agrégation de documents XML probabilistes

  10. IT-personnel person person name bonus name bonus J J J J John Rick Laptop PDA PDA Mary J, M M M 37 50 25 50 30 44 15 Pr(J) = 0.3Pr(M) = 0.6 Probabilities of Events Probabilistic XML: Events [Abiteboul/Senellart] J: John hired for Laptop projectM: Mary worked overtime Independent Events Agrégation de documents XML probabilistes

  11. IT-personnel person person name bonus name bonus J J J J John Rick Laptop PDA PDA Mary J, M M M 37 50 25 50 30 44 15 “John was hired, Mary worked overtime” J: John hired for Laptop projectM: Mary worked overtime Pr(J) = 0.3Pr(M) = 0.6 Agrégation de documents XML probabilistes

  12. “John was hired, Mary worked overtime” IT-personnel person person name bonus name bonus John Laptop PDA Mary 37 50 30 44 J: John hired for Laptop projectM: Mary worked overtime Pr(J) = 0.3Pr(M) = 0.6 Pr(d1) = 0.3 x 0.6 Agrégation de documents XML probabilistes

  13. IT-personnel person person name bonus name bonus J J J J John Rick Laptop PDA PDA Mary J, M M M 37 50 25 50 30 44 15 “John wasn’t hired, Mary worked overtime” J: John hired for Laptop projectM: Mary worked overtime Pr(J) = 0.3Pr(M) = 0.6 Pr(d2) = 0.7 x 0.6 Agrégation de documents XML probabilistes

  14. “John wasn’t hired, Mary worked overtime” IT-personnel person person name bonus name bonus Rick PDA PDA Mary 25 50 44 J: John hired for Laptop projectM: Mary worked overtime Pr(J) = 0.3Pr(M) = 0.6 Pr(d2) = 0.7 x 0.6 Agrégation de documents XML probabilistes

  15. IT-personnel person person name name bonus bonus MUX MUX PDA Mary Laptop PDA MUX John Rick 37 50 25 50 DET 15 30 44 Probabilistic XML: MUX and DET Nodes [Nierman/ Jagadish, Kimelfeld/ Sagiv] 0.3 0.7 0.3 0.7 0.6 0.4 Children represent mutually exclusive choices,choices for different mux-nodes are independent MUX Deterministic nodes, children are combined DET Agrégation de documents XML probabilistes

  16. IT-personnel person person name name bonus bonus MUX MUX PDA John 0.3 PDA 0.7 Mary 0.3 0.7 Laptop PDA MUX John Rick 0.6 0.4 25 50 37 50 25 50 DET 15 30 44 30 44 Probabilistic XML: MUX and DET Nodes Pr = 0.3 x 0.7 x 0.6 Agrégation de documents XML probabilistes

  17. Probabilistic XML (PXML) • A PXML documentD • represents (exponentially) many document instancesd • each with a probability Pr(d) • PXML document models • CIE: long-distance dependencies • MUX-DET: only hierarchical dependencies • MUX-DET can be expressed by CIE, but not (concisely) the other way round Other models can be reduced to the ones above, or behave similarly Agrégation de documents XML probabilistes

  18. Aggregate Functions : finite bags of valuesdomain Examples: • count, countd: finitebags of anythingN • sum, avg: finite bags of rational numbersQ Similarly: min, max, parity, top K, ... Agrégation de documents XML probabilistes

  19. Aggregate Queries Q = (q) Two Layers • nonaggregate query q(x) • returns set of nodes q(d) over instance d • aggregate function  • applied to the labels of nodes in q(d) • returns single value (q(d)) Agrégation de documents XML probabilistes

  20. IT-personnel bonus * Single Path Queries Which bonuseshave been paid? Simple form of tree pattern queries Paths of node labels or *, connected by “child” and “descendant” edges Return the set of leaf nodes reachable from the root along such a path qbonus Agrégation de documents XML probabilistes

  21. Single Path Aggregate Queries: Examples • SUM(qbonus) “What is the sum of all bonuses?” • MAX(qbonus) “What is maximal bonus that was paid?” Agrégation de documents XML probabilistes

  22. Answer Distributions PXML document D, instances {d1,…, dn} SUM(qbonus) returns exactly one number for every di  SUM(qbonus) is a random variable  SUM(qbonus) induces a probability distribution over D f(s)=  Pr(di), SUM(qbonus)(di) = s the answer distribution Notation: SUM(qbonus)(D) or (q)(D) abstractly Agrégation de documents XML probabilistes

  23. Special Case: Document Aggregation D with instances {d1,…, dn}, • Applying  to a regular documentdi : (di) := ({| c | c is a value on a leaf of di|}) • Applying  to the probabilistic document D: (D)(c) =  Pr(di) (di) = c yields again a distribution (D) Agrégation de documents XML probabilistes

  24. Reduction to Document Aggregation (q)(D) = ? Step 1: Compute a smaller PXML document D' = q(D) containing only matching paths Step 2: Apply  to D' Theorem: (q)(D) = (D') Depends on document models and simple path queries Agrégation de documents XML probabilistes

  25. name name J J John Rick Mary Applying qbonus = keep only the paths that match IT-personnel qbonus( ) person person bonus bonus J J Laptop PDA PDA J, M M M 37 50 25 50 30 44 15 … analogous for MUX-DET Agrégation de documents XML probabilistes

  26. name name MUX Mary 0.3 0.7 John Rick Evaluating Single Path Queries/2 qbonus( ) IT-personnel person person bonus bonus MUX PDA 0.3 0.7 Laptop PDA MUX 0.6 0.4 37 50 25 50 DET 15 30 44 Agrégation de documents XML probabilistes

  27. Problems Investigated PXML document D, constant c • Possible Value: Decide Pr((D) = c) > 0 • Probability Computation: Compute Pr((D) = c) • Moment Computation: Compute E((D)k) E is “expected value” Agrégation de documents XML probabilistes

  28. Aggregation over CIE Agrégation de documents XML probabilistes

  29. Aggregation over CIE/2 • Possible Value: “Too much propositional logic present” • Probability Computation: cannot be easier … • Moment Computation: • Difficult for MIN, COUNTD, AVG • Easy for COUNT and SUM: “Moments are sums, moments of COUNT and SUM are sums of sums, which can be rearranged …” Agrégation de documents XML probabilistes

  30. Aggregation over MUX-DET Agrégation de documents XML probabilistes

  31. COUNT, SUM, MIN are Easy ... … because they allow for divide and conquer evaluation: SUM {| a,b,c,d |} = SUM {| a,b |} + SUM {| c,d |}  is a monoid aggregatefunction if ({| a1,..., an|} = ({| a1 |}) ... ({| an |}) for some commutative monoid (M, ) and all a1,..., an in M Examples: • count, sum, min, parity, top K • countd, avg Agrégation de documents XML probabilistes

  32. MUX p q ( ) = p( ) + q ( ) Other node ( ) = ( )  ( ) Convex Sums and Convolutions If  is a monoid function, answer distributions can be computed bottom up, using two operations: Convex Sum: depends on the monoid Convolution: Agrégation de documents XML probabilistes

  33. Convolution of Distributions (M, ) monoid (D1), (D2) distributions of subdocuments (((D1)  (D2)) (c) = (D1)(c1) (D2)(c2) c1c2 = c Agrégation de documents XML probabilistes

  34. Approximating Query Answers Over CIE, probability and moment computation can be hard How good are Monte-Carlo methods? Classical results (Hoeffding) imply: To achieve | E((D)k) – Estimate | <  with probability 1–  at most O(R2k -2 log 1/) samples are needed, where R = max |(d)|. Consequence: Given and , at most quadratically many samples are needed for E(COUNTD(D)). Agrégation de documents XML probabilistes

  35. Probabilistic Aggregation: Related Work • Tree pattern queries over MUX-DET with HAVING constraints [Cohen/Kimelfeld/Sagiv] • Conjunctive queries with HAVING constraints over relational probabilistic databases [Re/Suciu] • Work on various special topics in the relational setting • probabilistic data streams • uncertain schema mappings Agrégation de documents XML probabilistes

  36. Aggregates over PXML: Conclusion First results of an ongoing project • Map of the problem space • Largely complete investigation for single path queries: • Intractability for CIE • Hierarchical dependencies in MUX-DET can be exploited for monoid aggregation functions Some results carry over to other models, e.g., • Uncertain schema mappings (Dong/Halevy/Yu) Current work: • richer query languages, continuous distributions on leaves Agrégation de documents XML probabilistes

More Related