Provenance Semirings

Provenance Semirings. T.J. Green, G. Karvounarakis, V. Tannen, University of Pennsylvania. PODS 2007. Provenance: first studied in data warehousing (lineage [Cui, Widom, Wiener 2000]); scientific applications, to assess quality of data (why-provenance [Buneman, Khanna, Tan 2001]).


Presentation Transcript


  1. Provenance Semirings • T.J. Green, G. Karvounarakis, V. Tannen • University of Pennsylvania • PODS 2007

  2. Provenance • First studied in data warehousing • Lineage [Cui, Widom, Wiener 2000] • Scientific applications (to assess quality of data) • Why-provenance [Buneman, Khanna, Tan 2001] • Our interest: P2P data sharing in the ORCHESTRA system (project headed by Zack Ives) • Trust conditions based on provenance • Deletion propagation

  3. Annotated relations • Provenance: an annotation on tuples • Our observation: propagating provenance/lineage through views is similar to querying • Incomplete Databases (conditional tables) • Probabilistic Databases (independent tuple tables) • Bag Semantics Databases (tuples with multiplicities) • Hence we look at queries on relations with annotated tuples

  4. Incomplete databases: boolean C-tables • [Figure: a table R whose tuples are annotated with boolean variables; semantics: a set of instances I(R)]

  5. Imielinski & Lipski (1984): queries on C-tables • Union of conjunctive queries (UCQ): • q(x,z) :- R(x,_,z), R(_,_,z) • q(x,z) :- R(x,y,_), R(_,y,z) • [Figure: the C-table R and the annotated result q(R), with the sample valuation p=true, r=false, s=true]

  6. Why-provenance/lineage • Which input tuples contribute to the presence of a tuple in the output? [Cui, Widom, Wiener 2000] [Buneman, Khanna, Tan 2001] • [Figure: R annotated with tuple ids, and the result q(R) of the same query]

  7. C-tables vs. lineage • [Figure: c-table calculations side by side with lineage calculations] • The structure of the calculations is the same!

  8. Another analogy, with bag semantics • [Figure: R annotated with tuple multiplicities; c-table calculations side by side with multiplicity calculations for the same query q(R)] • The structure of the calculations is the same!

  9. Abstracting the structure of these calculations • [Figure: abstract calculations] • These expressions capture the abstract structure of the calculations, which encodes the logical derivation of the output tuples • We shall use these expressions as provenance

  10. Technical Development • Abstractly annotated relations (K-relations) and their relational algebra • K must be a semiring • For provenance, K is the semiring of polynomials • Datalog on K-relations • For provenance, K consists of (possibly infinite) formal power series

  11. K-relations • Annotations are elements from an algebraic structure $(K, +, \cdot, 0, 1)$ • Although the notation resembles arithmetic, these are abstract operations • If D is the domain of database values, an n-ary K-relation is a function $R : D^n \to K$, where $D^n$ is the set of all possible tuples

  12. K-relations, annotated tables • A K-relation corresponds to a table: $R : D^n \to K$ • If R(t) = k, then t "is annotated by k" • For all but finitely many tuples t, R(t) = 0 • We omit the tuples with R(t) = 0 from the table representation
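The table view of a K-relation on slide 12 can be sketched as a dictionary that stores only the finitely many tuples with a non-zero annotation. This is a minimal illustration, not the authors' implementation: the helper names `make_k_relation` and `annotation`, and the sample tuples, are hypothetical.

```python
def make_k_relation(rows, zero):
    """Build a K-relation as a dict {tuple: annotation}.

    Tuples annotated with the semiring zero are dropped, mirroring the
    convention that they are omitted from the table representation.
    """
    return {t: k for t, k in rows.items() if k != zero}


def annotation(rel, t, zero):
    """R(t): the stored annotation, or zero for every tuple not listed."""
    return rel.get(t, zero)


# Bag semantics as an example: K is the natural numbers and zero is 0.
# The tuples and multiplicities are made up for illustration.
R = make_k_relation({("a", "b"): 2, ("c", "d"): 0, ("e", "f"): 1}, zero=0)
```

Here `R` is total on all possible tuples through `annotation`, even though only two rows are stored.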

  13. Positive K-relational algebra • We define an RA+ on K-relations: • The $\cdot$ corresponds to join • The + corresponds to union and projection • 0 and 1 are used for selection predicates • Details in the paper (but recall how we evaluated the UCQ q earlier; we will see another example later)
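The correspondence on slide 13 can be sketched for positional tuples as follows. The slide defers the exact definitions to the paper, so this is an assumption-laden illustration: join is expressed here as Cartesian product followed by selection and projection, and the `Semiring` record, the function names, and the sample relation are all hypothetical.

```python
import operator
from collections import namedtuple

# A semiring is given by its two constants and two operations.
Semiring = namedtuple("Semiring", "zero one add mul")
NAT = Semiring(0, 1, operator.add, operator.mul)  # bag semantics


def _support(K, rel):
    """Keep only tuples whose annotation differs from the semiring zero."""
    return {t: k for t, k in rel.items() if k != K.zero}


def union(K, r1, r2):
    """Union adds the annotations of a tuple present in both inputs."""
    out = dict(r1)
    for t, k in r2.items():
        out[t] = K.add(out.get(t, K.zero), k)
    return _support(K, out)


def project(K, rel, cols):
    """Projection adds the annotations of all tuples that collapse together."""
    out = {}
    for t, k in rel.items():
        key = tuple(t[i] for i in cols)
        out[key] = K.add(out.get(key, K.zero), k)
    return _support(K, out)


def select(K, rel, pred):
    """Selection multiplies by 1 where the predicate holds and by 0 elsewhere."""
    return _support(K, {t: K.mul(k, K.one if pred(t) else K.zero)
                        for t, k in rel.items()})


def product(K, r1, r2):
    """Cartesian product multiplies the annotations of the combined tuples."""
    return _support(K, {t1 + t2: K.mul(k1, k2)
                        for t1, k1 in r1.items() for t2, k2 in r2.items()})


# Two-step paths over a made-up bag relation: join R with itself on the
# middle value, then keep the endpoints.
R = {("a", "b"): 2, ("b", "c"): 3}
two_step = project(NAT, select(NAT, product(NAT, R, R), lambda t: t[1] == t[2]), (0, 3))
doubled = union(NAT, R, R)
```

With multiplicities 2 and 3 on the two edges, the single two-step path receives multiplicity 6, and the union of R with itself doubles each multiplicity.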

  14. RA+ identities imply semiring structure! • Common RA+ identities • Union and join are associative, commutative • Join distributes over union • etc. (but not idempotence!) • These identities hold for RA+ on K-relations iff $(K, +, \cdot, 0, 1)$ is a commutative semiring • $(K, +, 0)$ is a commutative monoid • $(K, \cdot, 1)$ is a commutative monoid • $\cdot$ distributes over +, etc.
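The laws listed on slide 14 can be spot-checked by brute force over a small sample of elements. This is only a sanity check on finitely many values, not a proof, and the function name `semiring_law_failures` is hypothetical. The "0 annihilates" law is the standard semiring axiom that the slide's "etc." is assumed to include.

```python
from itertools import product


def semiring_law_failures(elems, add, mul, zero, one):
    """Return the names of commutative-semiring laws violated on the sample."""
    failures = set()
    for a, b, c in product(elems, repeat=3):
        checks = {
            "+ associative": add(add(a, b), c) == add(a, add(b, c)),
            "+ commutative": add(a, b) == add(b, a),
            "0 neutral for +": add(a, zero) == a,
            "* associative": mul(mul(a, b), c) == mul(a, mul(b, c)),
            "* commutative": mul(a, b) == mul(b, a),
            "1 neutral for *": mul(a, one) == a,
            "* distributes over +": mul(a, add(b, c)) == add(mul(a, b), mul(a, c)),
            "0 annihilates": mul(a, zero) == zero,
        }
        failures.update(name for name, ok in checks.items() if not ok)
    return sorted(failures)


sample = range(4)
bool_failures = semiring_law_failures([False, True],
                                      lambda a, b: a or b, lambda a, b: a and b,
                                      False, True)
nat_failures = semiring_law_failures(sample,
                                     lambda a, b: a + b, lambda a, b: a * b, 0, 1)
# A non-example: max as "+" and ordinary addition as "*", both with unit 0.
bad_failures = semiring_law_failures(sample, max, lambda a, b: a + b, 0, 0)
```

Idempotence is deliberately absent from the list: in the natural numbers 1 + 1 is 2, not 1, which is why bag semantics still qualifies.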

  15. Calculations on annotated tables are particular cases
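Slide 15's table did not survive in the transcript, so the following sketch assumes the instances suggested by slide 3: booleans for set semantics, natural numbers for bags, and sets of possible worlds for probabilistic event tables. The same sum-of-products calculation is run in each; the calculation itself (two alternatives using r twice, one using r and s) and all names are illustrative.

```python
import operator
from collections import namedtuple
from functools import reduce

Semiring = namedtuple("Semiring", "zero one add mul")

BOOL = Semiring(False, True, operator.or_, operator.and_)
NAT = Semiring(0, 1, operator.add, operator.mul)
OMEGA = frozenset({1, 2, 3, 4})  # hypothetical set of possible worlds
EVENTS = Semiring(frozenset(), OMEGA, operator.or_, operator.and_)  # union, intersection


def sum_of_products(K, alternatives):
    """Add up, over all alternatives, the product of the annotations used jointly."""
    total = K.zero
    for joint in alternatives:
        total = K.add(total, reduce(K.mul, joint, K.one))
    return total


def calculation(r, s):
    """An illustrative calculation: r*r + r*r + r*s."""
    return [[r, r], [r, r], [r, s]]


as_bool = sum_of_products(BOOL, calculation(False, True))
as_bag = sum_of_products(NAT, calculation(5, 1))
as_event = sum_of_products(EVENTS, calculation(frozenset({1, 2}), frozenset({2, 3})))
```

Only the semiring changes between the three calls; the structure of the calculation is shared.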

  16. Provenance Semirings • X = {p, r, s, …}: indeterminates (provenance "tokens" for base tuples) • $\mathbb{N}[X]$: multivariate polynomials with coefficients in $\mathbb{N}$ and indeterminates in X • $(\mathbb{N}[X], +, \cdot, 0, 1)$ is the most "general" commutative semiring: its elements abstract calculations in all semirings • $\mathbb{N}[X]$-relations are the relations with provenance! • The polynomials capture the propagation of provenance through (positive) relational algebra
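One way to make the "most general semiring" claim concrete is to represent a polynomial with natural-number coefficients as a map from monomials to coefficients, and to evaluate it in any other commutative semiring once the tokens are given values. This is a sketch under that representation choice; `var`, `poly_add`, `poly_mul` and `evaluate` are hypothetical names.

```python
import operator
from collections import Counter, namedtuple

Semiring = namedtuple("Semiring", "zero one add mul")
NAT = Semiring(0, 1, operator.add, operator.mul)
BOOL = Semiring(False, True, operator.or_, operator.and_)


def var(x):
    """The polynomial consisting of the single token x."""
    return Counter({(x,): 1})


def poly_add(f, g):
    """Add coefficients of equal monomials."""
    h = Counter(f)
    h.update(g)  # Counter.update adds counts instead of replacing them
    return h


def poly_mul(f, g):
    """Multiply term by term; a monomial is a sorted tuple of tokens."""
    h = Counter()
    for m1, c1 in f.items():
        for m2, c2 in g.items():
            h[tuple(sorted(m1 + m2))] += c1 * c2
    return h


def evaluate(f, K, valuation):
    """Interpret a polynomial in semiring K, reading coefficient n as an n-fold sum."""
    total = K.zero
    for mono, coeff in f.items():
        term = K.one
        for x in mono:
            term = K.mul(term, valuation[x])
        for _ in range(coeff):
            total = K.add(total, term)
    return total


r, s = var("r"), var("s")
f = poly_add(poly_add(poly_mul(r, r), poly_mul(r, r)), poly_mul(r, s))  # 2r^2 + rs
```

Evaluating `f` with multiplicities gives a bag count, and evaluating it with truth values gives a boolean condition, which is the sense in which the polynomial abstracts both calculations.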

  17. A provenance calculation • q(x,z) :- R(x,_,z), R(_,_,z) • q(x,z) :- R(x,y,_), R(_,y,z) • [Figure: R, the provenance polynomials of q(R), and their lineage; some output tuples have the same lineage but different provenance] • Not just why- but also how-provenance (encodes derivations)! • More informative than lineage

  18. Trust assessment • q(x,z) :- R(x,_,z), R(_,_,z) • q(x,z) :- R(x,y,_), R(_,y,z) • p: justified by Moe • r: justified by Larry • s: justified by Curly • [Figure: R and q(R), with callouts on the output annotations: "2 alternatives, both need Moe, twice"; "needs both Moe and Larry"; "one alternative needs Larry and Curly, two others only need Larry, twice"] • Which output tuples can be trusted after Larry is jailed?
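The trust question on slide 18 can be worked through by enumerating derivations of the two-rule query directly. The contents of R are not preserved in the transcript, so the three-tuple relation below is an assumption, chosen so that the resulting annotations agree with the slide's callouts; the function names are hypothetical.

```python
from collections import Counter

# Assumed base relation: each tuple carries one provenance token.
R = {("a", "b", "c"): "p",   # justified by Moe
     ("d", "b", "e"): "r",   # justified by Larry
     ("f", "g", "e"): "s"}   # justified by Curly


def how_provenance(rel):
    """Map each output tuple of q to a Counter {monomial: number of derivations}."""
    prov = {}
    for t1, k1 in rel.items():
        for t2, k2 in rel.items():
            mono = tuple(sorted((k1, k2)))
            if t1[2] == t2[2]:  # rule 1: q(x,z) :- R(x,_,z), R(_,_,z)
                prov.setdefault((t1[0], t1[2]), Counter())[mono] += 1
            if t1[1] == t2[1]:  # rule 2: q(x,z) :- R(x,y,_), R(_,y,z)
                prov.setdefault((t1[0], t2[2]), Counter())[mono] += 1
    return prov


def trusted(poly, distrusted):
    """A tuple stays trusted if some derivation uses no distrusted token."""
    return any(not (set(mono) & distrusted) for mono in poly)


prov = how_provenance(R)
trusted_after_larry = {t for t, poly in prov.items() if trusted(poly, {"r"})}
```

Under these assumptions, every derivation of (a, e), (d, c) and (d, e) involves Larry's tuple, so only (a, c) and (f, e) remain trusted. Lineage alone could not distinguish "needs r in every alternative" from "needs r in some alternative".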

  19. More Technical Development • The semiring structure on annotations works out nicely for (positive) relational algebra • What more do we need for Datalog queries? • ω-continuous semirings (so fixed points exist)! • $\mathbb{N}$ is not ω-continuous, but $\mathbb{N}^\infty \triangleq \mathbb{N} \cup \{\infty\}$ is • Here we show only what we need for Datalog provenance (formal power series)

  20. Beyond RA+: Datalog • r1: q(X, Y) :- R(X, Y) • r2: q(X, Y) :- q(X, Z), q(Z, Y) • [Figure: relation R and derivation trees for q(a, d), built with rules r1 and r2 from R(a, b), R(b, d) and R(d, d)]
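To see why the recursive rule r2 goes beyond RA+, one can iterate the immediate consequence operator under bag semantics and watch the derivation counts. The sketch below assumes R contains just the three facts visible in the figure residue, R(a, b), R(b, d) and R(d, d), each with multiplicity 1; the function name is hypothetical.

```python
def immediate_consequence(R, q):
    """One application of rules r1 and r2 with natural-number annotations."""
    new = dict(R)  # r1: q(X, Y) :- R(X, Y)
    for (x, z1), k1 in q.items():
        for (z2, y), k2 in q.items():
            if z1 == z2:  # r2: q(X, Y) :- q(X, Z), q(Z, Y)
                new[(x, y)] = new.get((x, y), 0) + k1 * k2
    return new


R = {("a", "b"): 1, ("b", "d"): 1, ("d", "d"): 1}

q = {}
history = []  # annotation of q(d, d) after each round
for _ in range(4):
    q = immediate_consequence(R, q)
    history.append(q.get(("d", "d"), 0))
```

Because of the self-loop on d, the count for q(d, d) grows with every round (1, 2, 5, 26 in the first four) and never stabilizes in the natural numbers, which is the motivation for adding ∞.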

  21. Provenance: Encoding Infinite Derivations • Polynomials do not suffice, since they are finite! • Instead, we use infinite formal power series • Nonetheless, provenance is finitely representable through a system of equations

  22. Provenance equations • r1: q(X, Y) :- R(X, Y) • r2: q(X, Y) :- q(X, Z), q(Z, Y) • [Figure: R, q(R), and the system of provenance equations] • Polynomials are the provenance of the immediate consequence operator (in RA+) • The provenances x, y, etc. are the power series that solve this system of equations (see next)

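  23. Solutions: formal power series • $x = m + np$ • $y = n$ • $z = p$ • $v = s + s^2 + 2s^3 + 5s^4 + 14s^5 + \cdots$ (coefficients have the form $\frac{(2k)!}{k!\,(k+1)!}$) • $u = r v^*$ • $w = r(m + np)(v^*)^2$ • where $v^* \triangleq 1 + v + v^2 + v^3 + \cdots$ • In general we need coefficients from $\mathbb{N}^\infty \triangleq \mathbb{N} \cup \{\infty\}$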
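The series for v on slide 23 can be reproduced numerically. The sketch assumes that v satisfies the equation v = s + v², which is what rules r1 and r2 give for q(d, d) if R(d, d) is annotated s and d has no other outgoing edge; the equation itself is not preserved in the transcript. The series is truncated at a fixed degree and solved by fixed-point iteration from 0; all function names are hypothetical.

```python
from math import factorial


def closed_form(k):
    """The coefficient formula quoted on the slide: (2k)! / (k! (k+1)!)."""
    return factorial(2 * k) // (factorial(k) * factorial(k + 1))


def mul_trunc(f, g, n):
    """Multiply two coefficient lists, discarding terms of degree above n."""
    h = [0] * (n + 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            if i + j <= n:
                h[i + j] += a * b
    return h


def solve_v(n):
    """Coefficients of v up to degree n (n >= 1), iterating v := s + v*v from 0."""
    s = [0, 1] + [0] * (n - 1)
    v = [0] * (n + 1)
    for _ in range(n):  # each round fixes at least one more coefficient
        v = [a + b for a, b in zip(s, mul_trunc(v, v, n))]
    return v


v = solve_v(5)  # v[d] is the coefficient of s**d
```

The coefficient of s^(k+1) equals `closed_form(k)`, so the first five are 1, 1, 2, 5, 14, matching the series shown on the slide.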

  24. Algorithmic results for Datalog provenance • Given $t \in q(I)$, it is decidable whether the provenance of t is a proper (infinite) power series • From CFG ambiguity, we know that testing whether all coefficients are $\le 1$ is undecidable • However, given $t \in q(I)$ and a monomial $\mu$, the coefficient of $\mu$ in the power series that is the provenance of t is computable (including when it is $\infty$)

  25. Related Work • Foundations: semirings/systems of equations/formal power series first used in CS in theory of formal languages [Chomsky, Schützenberger 1963] • Our work is related to and shares similar goals with "Debugging schema mappings with routes" [Chiticariu, Tan VLDB 2006], where "routes" are like minimal finite portions of our how-provenance • See also tutorial at SIGMOD tomorrow!

  26. Further work • Application: P2P data sharing in the ORCHESTRA system (thanks to our collaborator Zack Ives): • Need to express trust conditions based on provenance of tuples • Incremental propagation of deletions • Semiring provenance itself is incrementally maintainable • See demo of ORCHESTRA in SIGMOD on Thursday! • Future extension: full relational algebra. For difference we need semirings with "proper subtraction"