Aggregation of Probabilistic XML Documents

Serge Abiteboul (1), T.-H. Hubert Chan (2), Evgeny Kharlamov (1, 3), Werner Nutt (3), Pierre Senellart (4)
(1) INRIA Saclay – Île-de-France
(2) The University of Hong Kong
(3) Free University of Bozen-Bolzano
(4) Télécom ParisTech

Incomplete Databases
An incomplete database $D$ contains many instances: $D = \{d_1, \ldots, d_n, \ldots\}$
Query $q(x)$, constant $c$
• $c$ is a certain answer for $q$ if $c \in q(d_i)$ for all $d_i \in D$
• $c$ is a possible answer for $q$ if $c \in q(d_i)$ for some $d_i \in D$
Many ways to represent incomplete databases

Probabilistic Databases
Incomplete database $D = \{d_1, \ldots, d_n\}$
• with probabilities $\Pr(d_i) > 0$ for each instance
• such that $\Pr(d_1) + \cdots + \Pr(d_n) = 1$
Query $q$ returns constant $c$ with probability $p$ if $p = \sum_{i \,:\, c \in q(d_i)} \Pr(d_i)$
• Mainly studied in the relational setting
• Imprecise data on the Web → Probabilistic XML

Personnel Data, Instance 1
[Figure: XML tree. The root IT-personnel has two person children, each with a name and a bonus subtree. Labels shown: John, Laptop, PDA, Mary; bonus values 37, 50, 30, 44.]

Personnel Data, Instance 2
[Figure: XML tree. The root IT-personnel has two person children, each with a name and a bonus subtree. Labels shown: Rick, PDA, PDA, Mary; bonus values 25, 50, 44.]

Example: Personnel Queries
• “What are the names of the IT personnel?”
• “What bonuses were paid for the PDA project?”
• “What is the sum of bonuses paid to all employees?”

Personnel DB: Certain/Possible Answers
“What are the names of the IT personnel?”
• Mary: certain
• Rick: possible
“What bonuses were paid for the PDA project?”
• 44: certain
• 15: possible
“What is the sum of bonuses paid to all employees?”
• no certain answer
• 161, 119: possible
⇒ The answer to an aggregate query depends on the presence of many data items
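The definitions of certain and possible answers can be tried out on the two personnel instances. The sketch below is only illustrative: flattening each instance into (name, project, bonus) triples is an assumption made here, and the helper names are hypothetical. It covers only Instance 1 and Instance 2 as shown on the slides.

```python
# Two instances from the personnel example, flattened to
# (name, project, bonus) triples. The flattening is an assumption
# of this sketch, not part of the slides.
instance_1 = [("John", "Laptop", 37), ("John", "Laptop", 50),
              ("Mary", "PDA", 30), ("Mary", "PDA", 44)]
instance_2 = [("Rick", "PDA", 25), ("Rick", "PDA", 50),
              ("Mary", "PDA", 44)]
instances = [instance_1, instance_2]

def names(inst):
    """Names of the IT personnel in one instance."""
    return {name for name, _, _ in inst}

def pda_bonuses(inst):
    """Bonuses paid for the PDA project in one instance."""
    return {bonus for _, project, bonus in inst if project == "PDA"}

def bonus_sum(inst):
    """Sum of all bonuses, returned as a one-element answer set."""
    return {sum(bonus for _, _, bonus in inst)}

def certain(query, insts):
    """Answers returned in every instance."""
    return set.intersection(*(query(d) for d in insts))

def possible(query, insts):
    """Answers returned in at least one instance."""
    return set.union(*(query(d) for d in insts))
```

On these two instances the aggregate query has no certain answer, because each instance yields a different sum.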

If We Had Probabilities …
… we could ask
• What is the probability that the sum of bonuses = 161?
• What are all possible sums of bonuses? And what is each one’s probability? (distribution of sums of bonuses)
• What is the expected value of the sum of bonuses? And what is the variance? (moments)

The Problem Space
[Figure: diagram of the problem space along three dimensions, with one region marked “Our focus”.]
• Probabilistic XML document models: Events, Distributional Nodes (MUX-DET)
• Aggregate function: COUNT, SUM, MIN, COUNTD, AVG
• Query language: Single Path Queries, Tree Pattern Queries, Tree Pattern Queries with Joins

IT-personnel person person name bonus name bonus J J J J John Rick Laptop PDA PDA Mary J, M M M 37 50 25 50 30 44 15 Pr(J) = 0.3Pr(M) = 0.6 Probabilities of Events Probabilistic XML: Events [Abiteboul/Senellart] J: John hired for Laptop projectM: Mary worked overtime Independent Events Agrégation de documents XML probabilistes

IT-personnel person person name bonus name bonus J J J J John Rick Laptop PDA PDA Mary J, M M M 37 50 25 50 30 44 15 “John was hired, Mary worked overtime” J: John hired for Laptop projectM: Mary worked overtime Pr(J) = 0.3Pr(M) = 0.6 Agrégation de documents XML probabilistes

“John was hired, Mary worked overtime”
[Figure: the resulting instance d1. Labels shown: John, Laptop, PDA, Mary; bonus values 37, 50, 30, 44.]
J: John hired for Laptop project
M: Mary worked overtime
Pr(J) = 0.3, Pr(M) = 0.6
Pr(d1) = 0.3 × 0.6 = 0.18

IT-personnel person person name bonus name bonus J J J J John Rick Laptop PDA PDA Mary J, M M M 37 50 25 50 30 44 15 “John wasn’t hired, Mary worked overtime” J: John hired for Laptop projectM: Mary worked overtime Pr(J) = 0.3Pr(M) = 0.6 Pr(d2) = 0.7 x 0.6 Agrégation de documents XML probabilistes

“John wasn’t hired, Mary worked overtime”
[Figure: the resulting instance d2. Labels shown: Rick, PDA, PDA, Mary; bonus values 25, 50, 44.]
J: John hired for Laptop project
M: Mary worked overtime
Pr(J) = 0.3, Pr(M) = 0.6
Pr(d2) = 0.7 × 0.6 = 0.42
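Because J and M are independent, the probability of each truth assignment (world) is a product of per-event factors. The following sketch enumerates all worlds for the two events of the example; the function name and the dictionary layout are assumptions made for illustration.

```python
from itertools import product

# Event probabilities from the example.
event_prob = {"J": 0.3, "M": 0.6}

def world_probabilities(probs):
    """Map each truth assignment of the independent events to its probability.

    Returns the event order used and a dict keyed by tuples of booleans
    in that order.
    """
    events = sorted(probs)
    worlds = {}
    for values in product([True, False], repeat=len(events)):
        p = 1.0
        for event, value in zip(events, values):
            # An event contributes Pr(e) if true and 1 - Pr(e) if false.
            p *= probs[event] if value else 1.0 - probs[event]
        worlds[values] = p
    return events, worlds

events, worlds = world_probabilities(event_prob)
```

With the event order (J, M), the world (True, True) corresponds to d1 and (False, True) to d2. The number of worlds doubles with every additional event, which is why a PXML document can represent exponentially many instances.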

Probabilistic XML: MUX and DET Nodes [Nierman/Jagadish, Kimelfeld/Sagiv]
[Figure: the IT-personnel tree with MUX and DET nodes. Labels shown: John, Rick, Laptop, PDA, PDA, Mary; bonus values 37, 50, 25, 50, 15, 30, 44; MUX edge probabilities 0.3/0.7, 0.3/0.7, 0.6/0.4.]
MUX: children represent mutually exclusive choices; choices for different MUX nodes are independent
DET: deterministic nodes; children are combined

Probabilistic XML: MUX and DET Nodes
[Figure: the MUX-DET tree together with one instance obtained by choosing one child at each MUX node. Instance labels shown: John, PDA, PDA, Mary; bonus values 25, 50, 30, 44.]
Pr = 0.3 × 0.7 × 0.6 = 0.126

Probabilistic XML (PXML)
• A PXML document $D$
  • represents (exponentially) many document instances $d$
  • each with a probability $\Pr(d)$
• PXML document models
  • CIE: long-distance dependencies
  • MUX-DET: only hierarchical dependencies
• MUX-DET can be expressed by CIE, but not (concisely) the other way round
Other models can be reduced to the ones above, or behave similarly

Aggregate Functions
$\alpha$: finite bags of values $\to$ domain
Examples:
• count, countd: finite bags of anything $\to \mathbb{N}$
• sum, avg: finite bags of rational numbers $\to \mathbb{Q}$
Similarly: min, max, parity, top K, ...

Aggregate Queries
$Q = \alpha(q)$
Two layers:
• nonaggregate query $q(x)$
  • returns the set of nodes $q(d)$ over instance $d$
• aggregate function $\alpha$
  • applied to the labels of the nodes in $q(d)$
  • returns a single value $\alpha(q(d))$

Single Path Queries
Simple form of tree pattern queries
• Paths of node labels or *, connected by “child” and “descendant” edges
• Return the set of leaf nodes reachable from the root along such a path
Example $q_{\text{bonus}}$: “Which bonuses have been paid?”
[Figure: the query $q_{\text{bonus}}$, a path through the labels IT-personnel, bonus, *.]

Single Path Aggregate Queries: Examples
• SUM($q_{\text{bonus}}$): “What is the sum of all bonuses?”
• MAX($q_{\text{bonus}}$): “What is the maximal bonus that was paid?”
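A single path query can be evaluated on an ordinary document by following the path step by step and keeping the leaves that are reached. The sketch below is one possible reading: the `Node` class, the step encoding, and the choice of descendant edges for the bonus query are all assumptions, since the transcript does not preserve the edge types of the query figure. The tree is a reconstruction of Instance 1.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def descendants(node):
    """Yield all proper descendants of a node."""
    for child in node.children:
        yield child
        yield from descendants(child)

def match(node, steps):
    """Yield leaves reached from `node` by following the remaining steps."""
    if not steps:
        if not node.children:
            yield node
        return
    (axis, label), rest = steps[0], steps[1:]
    candidates = node.children if axis == "child" else descendants(node)
    for cand in candidates:
        if label == "*" or cand.label == label:
            yield from match(cand, rest)

def single_path(root, root_label, steps):
    """Evaluate a path query; each matching leaf is returned once."""
    if root.label != root_label:
        return []
    seen, result = set(), []
    for leaf in match(root, steps):
        if id(leaf) not in seen:
            seen.add(id(leaf))
            result.append(leaf)
    return result

# Assumed shape of Instance 1 of the personnel example.
instance_1 = Node("IT-personnel", [
    Node("person", [Node("name", [Node("John")]),
                    Node("bonus", [Node("Laptop", [Node("37"), Node("50")])])]),
    Node("person", [Node("name", [Node("Mary")]),
                    Node("bonus", [Node("PDA", [Node("30"), Node("44")])])]),
])

# Hypothetical encoding of the bonus query: IT-personnel //bonus //*
q_bonus = [("descendant", "bonus"), ("descendant", "*")]
bonuses = [int(n.label) for n in single_path(instance_1, "IT-personnel", q_bonus)]
total = sum(bonuses)    # SUM over the query result
largest = max(bonuses)  # MAX over the query result
```

Deduplicating by node identity matters once a path contains several descendant edges, because the same leaf can then be reached in more than one way.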

Answer Distributions
PXML document $D$, instances $\{d_1, \ldots, d_n\}$
SUM($q_{\text{bonus}}$) returns exactly one number for every $d_i$
⇒ SUM($q_{\text{bonus}}$) is a random variable
⇒ SUM($q_{\text{bonus}}$) induces a probability distribution over $D$, the answer distribution:
$f(s) = \sum_{i \,:\, \mathrm{SUM}(q_{\text{bonus}})(d_i) = s} \Pr(d_i)$
Notation: SUM($q_{\text{bonus}}$)($D$), or $\alpha(q)(D)$ abstractly
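When the instances can be listed explicitly, the answer distribution follows directly from its definition: group the instances by aggregate value and add up their probabilities. The sketch below assumes each instance is given as the bag of values selected by the query plus a probability. The first two bags are instances d1 and d2 from the slides; the two bags for the worlds where M is false are not shown on the slides and are made up for illustration.

```python
from collections import defaultdict

def answer_distribution(instances, aggregate):
    """Return {value: probability} for an aggregate over listed instances.

    `instances` is a list of (bag_of_values, probability) pairs.
    """
    dist = defaultdict(float)
    for values, prob in instances:
        dist[aggregate(values)] += prob
    return dict(dist)

instances = [
    ([37, 50, 30, 44], 0.18),  # d1: J and M
    ([25, 50, 44], 0.42),      # d2: not J, M
    ([37, 50, 15], 0.12),      # assumed bag for: J, not M
    ([25, 50, 15], 0.28),      # assumed bag for: not J, not M
]

sum_dist = answer_distribution(instances, sum)
max_dist = answer_distribution(instances, max)
```

Here SUM takes four different values, while MAX is 50 in every instance, so its distribution collapses to a single point. Listing all instances is only feasible for tiny documents, which is what motivates the bottom-up techniques later in the talk.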

Special Case: Document Aggregation
$D$ with instances $\{d_1, \ldots, d_n\}$
• Applying $\alpha$ to a regular document $d_i$: $\alpha(d_i) := \alpha(\{|\, c \mid c \text{ is a value on a leaf of } d_i \,|\})$
• Applying $\alpha$ to the probabilistic document $D$: $\alpha(D)(c) = \sum_{i \,:\, \alpha(d_i) = c} \Pr(d_i)$
yields again a distribution $\alpha(D)$

Reduction to Document Aggregation
$\alpha(q)(D) = {?}$
Step 1: Compute a smaller PXML document $D' = q(D)$ containing only matching paths
Step 2: Apply $\alpha$ to $D'$
Theorem: $\alpha(q)(D) = \alpha(D')$
Depends on document models and simple path queries

name name J J John Rick Mary Applying qbonus = keep only the paths that match IT-personnel qbonus( ) person person bonus bonus J J Laptop PDA PDA J, M M M 37 50 25 50 30 44 15 … analogous for MUX-DET Agrégation de documents XML probabilistes

Evaluating Single Path Queries/2
[Figure: $q_{\text{bonus}}$ applied to the MUX-DET IT-personnel tree. The name subtrees (John, Rick, Mary) are dropped; the bonus subtrees remain with their MUX and DET nodes. Labels shown: Laptop, PDA, PDA; bonus values 37, 50, 25, 50, 15, 30, 44; MUX edge probabilities 0.3/0.7, 0.6/0.4.]

Problems Investigated
PXML document $D$, constant $c$
• Possible Value: decide whether $\Pr(\alpha(D) = c) > 0$
• Probability Computation: compute $\Pr(\alpha(D) = c)$
• Moment Computation: compute $E(\alpha(D)^k)$, where $E$ is the expected value
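Once an answer distribution is available as an explicit table, all three problems reduce to simple lookups and sums. The sketch below shows this on a hypothetical distribution; the numbers and function names are assumptions chosen for illustration. The hard part in general is obtaining the distribution, not reading it.

```python
# Hypothetical answer distribution {value: probability}.
dist = {161: 0.18, 119: 0.42, 102: 0.12, 90: 0.28}

def possible_value(dist, c):
    """Possible Value: is Pr(alpha(D) = c) > 0?"""
    return dist.get(c, 0.0) > 0

def probability(dist, c):
    """Probability Computation: Pr(alpha(D) = c)."""
    return dist.get(c, 0.0)

def moment(dist, k):
    """Moment Computation: E(alpha(D)^k)."""
    return sum(p * value ** k for value, p in dist.items())

expected = moment(dist, 1)
variance = moment(dist, 2) - expected ** 2
```

The variance is obtained from the first two moments, which is why moment computation answers the earlier question about expected value and variance.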

Aggregation over CIE

Aggregation over CIE/2
• Possible Value: “Too much propositional logic present”
• Probability Computation: cannot be easier …
• Moment Computation:
  • Difficult for MIN, COUNTD, AVG
  • Easy for COUNT and SUM: “Moments are sums, moments of COUNT and SUM are sums of sums, which can be rearranged …”

Aggregation over MUX-DET

COUNT, SUM, MIN are Easy ...
… because they allow for divide-and-conquer evaluation:
SUM $\{|a, b, c, d|\}$ = SUM $\{|a, b|\}$ + SUM $\{|c, d|\}$
$\alpha$ is a monoid aggregate function if
$\alpha(\{|a_1, \ldots, a_n|\}) = \alpha(\{|a_1|\}) \oplus \cdots \oplus \alpha(\{|a_n|\})$
for some commutative monoid $(M, \oplus)$ and all $a_1, \ldots, a_n$ in $M$
Examples:
• Monoid functions: count, sum, min, parity, top K
• Not monoid functions: countd, avg
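The monoid property can be made concrete by describing each aggregate as a triple: how to map a single element into the monoid, how to combine two monoid values, and the identity element. The sketch below uses that encoding, which is an assumption of this example, and also shows why countd cannot be written this way.

```python
from functools import reduce

# name -> (value of a singleton bag, monoid operation, identity element)
MONOIDS = {
    "count": (lambda a: 1, lambda x, y: x + y, 0),
    "sum": (lambda a: a, lambda x, y: x + y, 0),
    "min": (lambda a: a, min, float("inf")),
}

def aggregate(name, bag):
    """Aggregate a bag by combining the values of its singletons."""
    single, combine, identity = MONOIDS[name]
    return reduce(combine, (single(a) for a in bag), identity)

def combine_parts(name, left, right):
    """Divide and conquer: aggregate two halves, then combine the results."""
    _, combine, _ = MONOIDS[name]
    return combine(aggregate(name, left), aggregate(name, right))

def countd(bag):
    """Number of distinct values; not a monoid aggregate function."""
    return len(set(bag))

# Every singleton bag has countd 1, yet these two bags differ,
# so no operation on the singleton results can produce both answers.
same_values = countd([7, 7])   # 1
mixed_values = countd([7, 8])  # 2
```

For count, sum and min, `combine_parts` on any split of a bag agrees with `aggregate` on the whole bag, which is exactly what bottom-up evaluation over a tree relies on.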

Convex Sums and Convolutions
If $\alpha$ is a monoid function, answer distributions can be computed bottom up, using two operations:
• Convex sum, at a MUX node whose children $D_1$, $D_2$ have probabilities $p$, $q$: $\alpha(D) = p\,\alpha(D_1) + q\,\alpha(D_2)$
• Convolution, at any other node with children $D_1$, $D_2$: $\alpha(D) = \alpha(D_1) * \alpha(D_2)$ (depends on the monoid)

Convolution of Distributions
$(M, \oplus)$ monoid
$\alpha(D_1)$, $\alpha(D_2)$ distributions of subdocuments
$(\alpha(D_1) * \alpha(D_2))(c) = \sum_{c_1 \oplus c_2 = c} \alpha(D_1)(c_1) \cdot \alpha(D_2)(c_2)$
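The two operations can be combined into a short recursive evaluator. The sketch below is a minimal illustration rather than the authors' implementation: the tuple encoding of trees is an assumption, MUX probabilities are assumed to sum to 1, and the toy document is only loosely modeled on the bonus subtrees, because the exact shape of the slide's figure cannot be recovered from the transcript.

```python
from collections import defaultdict

def convolve(d1, d2, op):
    """Distribution of op(c1, c2) for independent c1 ~ d1 and c2 ~ d2."""
    out = defaultdict(float)
    for c1, p1 in d1.items():
        for c2, p2 in d2.items():
            out[op(c1, c2)] += p1 * p2
    return dict(out)

def convex_sum(weighted):
    """Mix distributions; `weighted` is a list of (probability, distribution)."""
    out = defaultdict(float)
    for p, dist in weighted:
        for c, q in dist.items():
            out[c] += p * q
    return dict(out)

def evaluate(node, single, op, identity):
    """Bottom-up answer distribution of a monoid aggregate over a tree.

    Nodes are ("leaf", value), ("mux", [(prob, child), ...]) or
    ("det", [child, ...]).
    """
    kind = node[0]
    if kind == "leaf":
        return {single(node[1]): 1.0}
    if kind == "mux":
        return convex_sum([(p, evaluate(child, single, op, identity))
                           for p, child in node[1]])
    dist = {identity: 1.0}
    for child in node[1]:
        dist = convolve(dist, evaluate(child, single, op, identity), op)
    return dist

# Toy MUX-DET document (assumed structure).
doc = ("det", [
    ("mux", [(0.3, ("det", [("leaf", 37), ("leaf", 50)])),
             (0.7, ("det", [("leaf", 25), ("leaf", 50)]))]),
    ("mux", [(0.6, ("det", [("leaf", 30), ("leaf", 44)])),
             (0.4, ("leaf", 15))]),
])

sum_dist = evaluate(doc, lambda v: v, lambda x, y: x + y, 0)
min_dist = evaluate(doc, lambda v: v, min, float("inf"))
```

The cost of each convolution depends on how many distinct values the intermediate distributions hold. For MIN that number is bounded by the number of leaf values, whereas for SUM it can grow with every combination step.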

Approximating Query Answers
Over CIE, probability and moment computation can be hard
How good are Monte-Carlo methods?
Classical results (Hoeffding) imply:
To achieve $|E(\alpha(D)^k) - \text{Estimate}| < \varepsilon$ with probability $1 - \delta$, at most $O(R^{2k}\,\varepsilon^{-2} \log(1/\delta))$ samples are needed, where $R = \max |\alpha(d)|$.
Consequence: Given $\varepsilon$ and $\delta$, at most quadratically many samples are needed for $E(\mathrm{COUNTD}(D))$.
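The sample bound can be turned into a concrete number using the standard two-sided Hoeffding inequality. In the sketch below, the constant in `hoeffding_samples` comes from that inequality applied to values in the interval from -R^k to R^k; the slide only states the asymptotic form. The three-leaf document used for sampling is hypothetical.

```python
import math
import random

def hoeffding_samples(R, k, eps, delta):
    """Samples sufficient for |estimate - E(alpha(D)^k)| < eps w.p. >= 1 - delta.

    Each sample of alpha(d)**k lies in [-R**k, R**k], an interval of
    width 2 * R**k; the two-sided Hoeffding bound gives the rest.
    """
    width = 2 * R ** k
    return math.ceil(width ** 2 * math.log(2 / delta) / (2 * eps ** 2))

# Hypothetical document: (label, presence probability) for three
# independent leaves. COUNTD counts the distinct labels present.
LEAVES = [("a", 0.5), ("a", 0.5), ("b", 0.5)]

def sample_countd(rng):
    """Draw one random instance and return its COUNTD value."""
    present = {label for label, p in LEAVES if rng.random() < p}
    return len(present)

def estimate_moment(sample_value, k, n, rng):
    """Monte-Carlo estimate of the k-th moment from n samples."""
    return sum(sample_value(rng) ** k for _ in range(n)) / n

n = hoeffding_samples(R=2, k=1, eps=0.1, delta=0.05)
estimate = estimate_moment(sample_countd, 1, n, random.Random(0))
```

For this toy document the exact expectation is 0.75 + 0.5 = 1.25, since label a is present with probability 0.75 and label b with probability 0.5. For COUNTD, R is at most the number of leaves, so the bound grows quadratically in the document size for the first moment.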

Probabilistic Aggregation: Related Work
• Tree pattern queries over MUX-DET with HAVING constraints [Cohen/Kimelfeld/Sagiv]
• Conjunctive queries with HAVING constraints over relational probabilistic databases [Ré/Suciu]
• Work on various special topics in the relational setting
  • probabilistic data streams
  • uncertain schema mappings

Aggregates over PXML: Conclusion
First results of an ongoing project
• Map of the problem space
• Largely complete investigation for single path queries:
  • Intractability for CIE
  • Hierarchical dependencies in MUX-DET can be exploited for monoid aggregation functions
Some results carry over to other models, e.g.,
• Uncertain schema mappings (Dong/Halevy/Yu)
Current work:
• richer query languages, continuous distributions on leaves
