
Circuits for Datalog Provenance


Presentation Transcript


  1. Circuits for Datalog Provenance • Daniel Deutch (Tel Aviv Univ.) • Tova Milo (Tel Aviv Univ.) • Sudeepa Roy (Univ. of Washington) • Val Tannen (Univ. of Pennsylvania)

  2. A Simple Example of Data Provenance • “Boolean Provenance/Lineage” as a Boolean formula • Q is true on D ⇔ F_{Q,D} is true • Poly-size, poly-time computable (data complexity) • But Q is an RA+ query • This talk: What if Q is a Datalog program? • Boolean query Q: ∃x ∃y AsthmaPatient(x) ∧ Friend(x, y) ∧ Smoker(y) • F_{Q,D} = (x1 ∧ y1 ∧ z1) ∨ (x1 ∧ y2 ∧ z2) ∨ (x2 ∧ y3 ∧ z2) [Figure: database D, with tuples annotated x1, x2, y1, y2, y3, z1, z2]

  3. Motivation • Provenance • Reliability and repeatability • View management and deletion propagation • Trust and security management • Query answering in probabilistic databases, … • Datalog • Datalog is popular again! (two keynotes at this ICDT/EDBT) • Data extraction on the Web, declarative networking • Academic/commercial systems (Webdamlog, LogicBlox, Dedalus, Dyna) • Finding a suitable “Provenance for Datalog” is important • Both from theoretical and practical viewpoints • How do we compute, store, and interpret provenance for Datalog programs efficiently and effectively?

  4. Overview of Our Results • Can we get poly-size Boolean formulas for Datalog provenance? No, even if we allow unbounded time • Do we have a solution? Yes! Use Boolean circuits! • What about general “provenance semirings” beyond Boolean provenance [Green et al. ’07]? It depends on the semiring

  5. Outline • Background • Circuits for Boolean Provenance • Circuits for General Provenance Semirings

  6. Outline • Background • Circuits for Boolean Provenance • Circuits for General Provenance Semirings

  7. Datalog • Datalog program for Transitive Closure and Single-source Reachability • EDB (extensional database, base) relation for edges: R • IDB (intensional database, derived) relations • Transitive closure (T) • Single-source reachability from vertex ‘a’ (S) • Rules: T(x, y) :- R(x, y); T(x, y) :- R(x, z), T(z, y); S(x) :- T(a, x)
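
A minimal Python sketch of naive bottom-up evaluation for the three rules on this slide. The edge set `R` and the function name `evaluate` are illustrative assumptions, not part of the talk; the loop simply re-applies the recursive rule until no new T tuples appear.

```python
def evaluate(edges, source):
    """Naive fixpoint evaluation of T (transitive closure) and S (reachability)."""
    T = set(edges)  # T(x, y) :- R(x, y)
    while True:
        # T(x, y) :- R(x, z), T(z, y)
        new = {(x, y) for (x, z) in edges for (z2, y) in T if z == z2}
        if new <= T:          # nothing new derived: fixpoint reached
            break
        T |= new
    S = {y for (x, y) in T if x == source}  # S(x) :- T(a, x)
    return T, S

# Hypothetical EDB relation R (not taken from the slides).
R = {("a", "a"), ("a", "b"), ("b", "c")}
T, S = evaluate(R, "a")
```

The number of rounds is bounded by the number of possible T tuples, which is what makes the evaluation polynomial in the size of the database.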

  8. Boolean Provenance: PosBool(X)-Database • Tuples are annotated with variables from a set X • Here X = {x1, x2, y1, y2, …} • For n annotated tuples, 2^n possible worlds, given by assignments φ: X → {True, False} • Useful in query evaluation on incomplete or probabilistic databases [Figure: PosBool(X)-database D, with tuples annotated x1, x2, y1, y2, y3, z1, z2]

  9. RA+ over PosBool(X)-Database • Annotation propagates from input to output • Join = ∧, Projection/Union = ∨ • Output tuples are annotated by monotone Boolean formulas • F_{Q,D} is the annotation of the unique output tuple • RA+ Q: ∃x ∃y AsthmaPatient(x) ∧ Friend(x, y) ∧ Smoker(y) • F_{Q,D} = (x1 ∧ y1 ∧ z1) ∨ (x1 ∧ y2 ∧ z2) ∨ (x2 ∧ y3 ∧ z2) [Figure: PosBool(X)-database D, with tuples annotated x1, x2, y1, y2, y3, z1, z2]

  10. Two Important Properties: RA+ over PosBool(X)-Database • For every RA+ query Q, database D, and assignment φ: • (Faithful representation) Q(φ(D)) = φ[Q(D)] • (Poly-size overhead) The size of F_{Q,D} is polynomial in |D|, and F_{Q,D} can be computed in poly-time • RA+ Q: ∃x ∃y AsthmaPatient(x) ∧ Friend(x, y) ∧ Smoker(y) • F_{Q,D} = (x1 ∧ y1 ∧ z1) ∨ (x1 ∧ y2 ∧ z2) ∨ (x2 ∧ y3 ∧ z2) [Figure: PosBool(X)-database D under a sample True/False assignment to x1, x2, y1, y2, y3, z1, z2, for which both Q on the resulting world and F_{Q,D} evaluate to False]
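
The two properties can be checked by brute force on the running example. The sketch below is only illustrative: the person names are hypothetical (the database figure is not part of the transcript) and were chosen so that the annotations x1..z2 reproduce the formula F_{Q,D} from the slide. It computes the provenance as a DNF and compares it with direct evaluation of Q on every possible world.

```python
from itertools import product

# Hypothetical tuples; only the annotation names come from the talk.
ASTHMA = {"Ann": "x1", "Bob": "x2"}
FRIEND = {("Ann", "Carl"): "y1", ("Ann", "Dan"): "y2", ("Bob", "Dan"): "y3"}
SMOKER = {"Carl": "z1", "Dan": "z2"}

def provenance():
    """F_{Q,D} as a DNF: one set of variables per way of satisfying Q."""
    return {frozenset({ASTHMA[x], ann, SMOKER[y]})
            for (x, y), ann in FRIEND.items()
            if x in ASTHMA and y in SMOKER}

def query_on_world(world):
    """Evaluate Q directly on the tuples whose variables are True in `world`."""
    return any(world[ASTHMA[x]] and world[ann] and world[SMOKER[y]]
               for (x, y), ann in FRIEND.items()
               if x in ASTHMA and y in SMOKER)

def formula_on_world(dnf, world):
    """Evaluate the DNF under the same assignment."""
    return any(all(world[v] for v in clause) for clause in dnf)

VARS = sorted(set(ASTHMA.values()) | set(FRIEND.values()) | set(SMOKER.values()))
F = provenance()
# Faithfulness: the formula agrees with the query on all 2^7 worlds.
faithful = all(
    query_on_world(dict(zip(VARS, bits))) == formula_on_world(F, dict(zip(VARS, bits)))
    for bits in product([False, True], repeat=len(VARS))
)
```

The DNF has one clause per joining triple of tuples, which is where the polynomial size bound for this query comes from.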

  11. Datalog over PosBool(X) Database • T(x, y) :- R(x, y); T(x, y) :- R(x, z), T(z, y); S(x) :- T(a, x) • Semantics using derivation trees (Green et al. 2007) • Annotation of T(a, b) = ⋁_{trees τ} ⋀_{leaves t of τ} Annot(t) = q ∨ (p ∧ q) ∨ (p ∧ p ∧ q) ∨ … = q • Infinitely many trees • But always has a finite equivalent form • But not necessarily poly-size [Figure: graph with edge R(a, a) annotated p and edge R(a, b) annotated q, and derivation trees for T(a, b) that use R(a, a) zero, one, two, … times above the leaf R(a, b)]

  12. Lower Bound: Boolean Formulas for Datalog Provenance on PosBool(X) • Theorem: Given a PosBool(X)-database D and a Datalog program P, the provenance of tuples in P(D) cannot have a faithful representation using Boolean formulas of size polynomial in |D| • Proof outline: • st-connectivity on n nodes requires monotone Boolean formulas of size n^{Ω(log n)} (Karchmer-Wigderson, 1988) • Faithful representation requires, for all True/False assignments φ to X: P(φ(D)) = φ[P(D)] • Reduce to the hard instance with the right φ when P = transitive closure • Solution: Boolean circuits!

  13. Outline • Background • Circuits for Boolean Provenance or PosBool(X) • Circuits for General Provenance Semirings

  14. Boolean Circuits • Circuit is a DAG • can reuse common subexpressions • Boolean formula = tree • Leaf nodes: EDB vars in X • Internal nodes • ∧: IDB/EDB vars used in one derivation • ∨: Alternative derivations • Roots: IDB vars • T(x, y) :- R(x, y); T(x, y) :- R(x, z), T(z, y); S(x) :- T(a, x) [Figure: graph with edges R(a, a) = p and R(a, b) = q, and a circuit over the leaves X_{R(a,a)} = p and X_{R(a,b)} = q whose ∧ and ∨ gates lead to the root X_{T(a,b)}]

  15. Upper Bound: Boolean Circuits for PosBool(X) Theorem: Given any PosBool(X)-database D and datalog program P, provenance of tuples in P(D) can be faithfully represented using monotone Boolean Circuits of poly-size in |D| (and can be computed in poly-time)

  16. Proof Sketch • Two key ideas from previous work • 1. Datalog provenance can be represented by a system of equations, obtained by instantiating the variables of the Datalog program P to EDB/IDB tuples [Green et al. 2007] • EDB tuples → constants, IDB tuples → variables • Iteratively solve this system of equations • Fixpoint = provenance for all IDB tuples • 2. A system of equations with N Boolean variables can be solved in N+1 iterations [Esparza et al. 2011] • N = #IDB tuples • Build a circuit with N+1 layers from the system of equations
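
The two ideas can be sketched in a few lines of Python for a fixed truth assignment to the EDB variables. The equation system below is the grounded system of the talk's running example (edge R(a, a) annotated p, edge R(a, b) annotated q); the dictionary encoding and the name `solve` are assumptions made for illustration.

```python
# Each IDB variable maps to its alternative derivations (OR); each derivation
# is a list of atoms used jointly (AND). Atoms are EDB constants or IDB vars.
EQUATIONS = {
    "T(a,a)": [["p"], ["p", "T(a,a)"]],
    "T(a,b)": [["q"], ["p", "T(a,b)"]],
    "S(a)": [["T(a,a)"]],
    "S(b)": [["T(a,b)"]],
}

def solve(equations, edb):
    """Kleene iteration: start all IDB vars at False, apply the equations N+1 times."""
    idb = {v: False for v in equations}
    for _ in range(len(equations) + 1):   # N = number of IDB variables
        env = {**edb, **idb}
        idb = {v: any(all(env[a] for a in conj) for conj in alts)
               for v, alts in equations.items()}
    return idb

result = solve(EQUATIONS, {"p": False, "q": True})
```

Each round only reads the values of the previous round, which is the property that lets the iteration be unrolled into one circuit layer per round.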

  17. Illustration • T(x, y) :- R(x, y); T(x, y) :- R(x, z), T(z, y); S(x) :- T(a, x) • Step 1: Build the system of equations from all possible instantiations x, y, z → a, b (p and q are constants, the X_{…} are variables): • X_{T(a,a)} = p ∨ (p ∧ X_{T(a,a)}) • X_{T(a,b)} = q ∨ (p ∧ X_{T(a,b)}) • X_{S(b)} = X_{T(a,b)} • X_{S(a)} = X_{T(a,a)} • Step 2: Build a circuit with 4 + 1 layers (N = 4) [Figure: graph with edge R(a, a) annotated p and edge R(a, b) annotated q]

  18. Illustration • Multiple roots for multiple IDB vars • X_{T(a,a)} = p ∨ (p ∧ X_{T(a,a)}) • X_{T(a,b)} = q ∨ (p ∧ X_{T(a,b)}) • X_{S(b)} = X_{T(a,b)} • X_{S(a)} = X_{T(a,a)} • Assign leaf IDB vars to false [Figure: layered circuit. Level 0: the IDB nodes X_{T(a,a)},0, X_{T(a,b)},0, X_{S(a)},0, X_{S(b)},0 are set to false, next to the EDB leaves p and q. Levels 1 and 2: ∧ and ∨ gates compute X_{T(a,a)}, X_{T(a,b)}, X_{S(a)}, X_{S(b)} at each level from the level below]
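
A minimal sketch of the layered construction, assuming the same dictionary encoding of the equations as a list of alternative derivations per IDB variable. The gate representation and the names `build_circuit` and `evaluate` are hypothetical; the point is only that every layer adds one ∧ gate per derivation and one ∨ gate per IDB variable, so the total size stays polynomial.

```python
EQUATIONS = {
    "T(a,a)": [["p"], ["p", "T(a,a)"]],
    "T(a,b)": [["q"], ["p", "T(a,b)"]],
    "S(a)": [["T(a,a)"]],
    "S(b)": [["T(a,b)"]],
}

def build_circuit(equations, layers):
    """Return (gates, roots); a gate is (kind, payload) and its id is its index."""
    gates = [("const", False)]
    leaf = {}                              # EDB variable name -> gate id

    def edb_leaf(name):
        if name not in leaf:
            gates.append(("var", name))
            leaf[name] = len(gates) - 1
        return leaf[name]

    prev = {v: 0 for v in equations}       # level 0: every IDB var is False
    for _ in range(layers):
        cur = {}
        for v, alts in equations.items():
            ands = []
            for conj in alts:
                kids = [prev[a] if a in equations else edb_leaf(a) for a in conj]
                gates.append(("and", kids))
                ands.append(len(gates) - 1)
            gates.append(("or", ands))
            cur[v] = len(gates) - 1
        prev = cur
    return gates, prev

def evaluate(gates, assignment):
    """Single bottom-up pass; children always have smaller ids than parents."""
    val = []
    for kind, payload in gates:
        if kind == "const":
            val.append(payload)
        elif kind == "var":
            val.append(assignment[payload])
        elif kind == "and":
            val.append(all(val[k] for k in payload))
        else:
            val.append(any(val[k] for k in payload))
    return val

gates, roots = build_circuit(EQUATIONS, len(EQUATIONS) + 1)
values = evaluate(gates, {"p": False, "q": True})
```

Gates of one level are shared by all gates of the next level that refer to the same IDB variable, which is exactly the reuse a formula (a tree) cannot express.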

  19. Optimizations • Store only two levels of the circuit instead of N+1 levels • Evaluate iteratively • Embed circuit construction in semi-naïve evaluation • Check for new derivations, not only new IDB variables • Sound and complete • Remove self-dependency of IDB vars • works for PosBool(X) and also some other semirings… • X_{T(a,a)} = p ∨ (p ∧ X_{T(a,a)}) • X_{T(a,b)} = q ∨ (p ∧ X_{T(a,b)}) • X_{S(b)} = X_{T(a,b)} • X_{S(a)} = X_{T(a,a)}

  20. Illustration (From here…) [Figure: the unoptimized layered circuit of slide 18, with levels 0, 1, and 2; all level-0 IDB nodes are set to false, next to the EDB leaves p and q]

  21. Illustration (…To here) • With all these optimizations [Figure: two-level circuit over the leaves p and q, with bottom-level nodes X_{S(a)},bottom, X_{T(a,b)},bottom, X_{T(a,a)},bottom and top-level nodes X_{T(a,a)},top, X_{S(a)},top, X_{T(a,b)},top]

  22. Applications of PosBool(X)-Circuits • Linear-time deletion propagation (in circuit-size) • Approximation for probabilistic databases • even when only the circuit (and not the database) is available • Circuits can be computed “offline” • Only linear-time evaluation is required when needed (e.g. deletion propagation) • compared to storing and solving a system of equations iteratively, or • re-evaluating datalog program • Can use existing techniques for efficient and parallel circuit evaluation
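
Deletion propagation on a stored circuit can be sketched as a single pass over the gates. The circuit below is a hypothetical, hand-written circuit for the running example (edge R(a, a) annotated p, edge R(a, b) annotated q), not the output of the authors' construction; `propagate_deletion` is an illustrative name.

```python
# Gates in topological order: (name, kind, inputs); inputs are earlier names.
CIRCUIT = [
    ("p", "var", []),                 # annotation of R(a, a)
    ("q", "var", []),                 # annotation of R(a, b)
    ("g1", "and", ["p", "q"]),        # a derivation that uses both edges
    ("T(a,b)", "or", ["q", "g1"]),    # alternative derivations of T(a, b)
    ("S(b)", "or", ["T(a,b)"]),
]

def propagate_deletion(circuit, deleted, roots):
    """Set deleted EDB tuples to False, evaluate once, report lost IDB tuples."""
    val = {}
    for name, kind, inputs in circuit:
        if kind == "var":
            val[name] = name not in deleted
        elif kind == "and":
            val[name] = all(val[i] for i in inputs)
        else:
            val[name] = any(val[i] for i in inputs)
    return {r for r in roots if not val[r]}

lost_without_p = propagate_deletion(CIRCUIT, {"p"}, ["T(a,b)", "S(b)"])
lost_without_q = propagate_deletion(CIRCUIT, {"q"}, ["T(a,b)", "S(b)"])
```

Every gate and every wire is visited once, so the cost is linear in the size of the circuit, and neither the database nor the Datalog program is needed at this point.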

  23. Outline • Background • Circuits for Boolean Provenance or PosBool(X) • Circuits for General Provenance Semirings

  24. Commutative Semirings • (K, +_K, ·_K, 0_K, 1_K) • domain K • +_K, ·_K: associative, commutative, with neutral elements 0_K, 1_K • ·_K distributes over +_K, i.e. a ·_K (b +_K c) = a ·_K b +_K a ·_K c • 0_K annihilates every element of K, i.e. a ·_K 0_K = 0_K ·_K a = 0_K • Examples: • (B, ∨, ∧, False, True): set semantics • (N, +, ·, 0, 1): bag semantics • (N ∪ {∞}, min, +, ∞, 0): tropical semiring to compute cost (e.g. cost of a shortest path)
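
The three example semirings can be written down directly. The sketch below is illustrative only: the `Semiring` class, the function `sum_of_products`, and the edge annotations are assumptions. It combines the annotations of two alternative derivations (a two-edge path and a direct edge) in each semiring.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable
import math

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
BAG = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)

def sum_of_products(K, derivations):
    """Plus over alternative derivations of times over jointly used tuples."""
    return reduce(K.plus, (reduce(K.times, d, K.one) for d in derivations), K.zero)

# Hypothetical annotations: derivation 1 uses edges e_ab and e_bc, derivation 2 uses e_ac.
cost = sum_of_products(TROPICAL, [[2, 3], [7]])             # cheapest derivation
count = sum_of_products(BAG, [[2, 3], [1]])                 # number of derivations
exists = sum_of_products(BOOLEAN, [[True, False], [False]])
```

The same function serves all three semantics; only the operations and neutral elements change.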

  25. Provenance Semirings • Generalization of PosBool(X) • (K, +_K, ·_K, 0_K, 1_K) • Tuples are annotated with variables from X • K is of the form Prov(X) • +_K denotes alternative usage • ·_K denotes joint usage • Examples: • (PosBool(X), ∨, ∧, False, True) • Lin(X): tracks contributing tuples [Cui et al. ’00] • (Why(X), ∪, ⋓, ∅, {∅}), where ⋓ is the pairwise union of subsets: tracks contributing tuples in alternative derivations [Buneman et al. ’01]

  26. Provenance Specialization • Key property needed for applications like deletion propagation, trust management, cost computation, … • Prov(X) specializes correctly to K if any valuation v : X → K extends uniquely to a homomorphism h_v : Prov(X) → K (which correctly maps +, · of Prov(X) to those of K) • Further, some provenance semirings are “more informative” than others
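
For finite polynomials in N[X], the homomorphism h_v can be sketched as plain evaluation: replace each variable by its value under v and interpret + and · in the target semiring. The polynomial 2pq + p² and the valuations below are hypothetical, and `specialize` is an illustrative name rather than anything defined in the talk.

```python
import math

def specialize(poly, valuation, plus, times, zero, one):
    """Evaluate a polynomial from N[X] in the semiring (plus, times, zero, one)."""
    total = zero
    for monomial, coeff in poly.items():
        term = one
        for var, exp in monomial:
            for _ in range(exp):          # x^e is the e-fold product of v(x)
                term = times(term, valuation[var])
        for _ in range(coeff):            # coefficient c is a c-fold sum
            total = plus(total, term)
    return total

# 2*p*q + p^2, encoded as {monomial: coefficient}; a monomial is a tuple of (var, exponent).
POLY = {(("p", 1), ("q", 1)): 2, (("p", 2),): 1}

bag = specialize(POLY, {"p": 2, "q": 3}, lambda a, b: a + b, lambda a, b: a * b, 0, 1)
tropical = specialize(POLY, {"p": 2, "q": 3}, min, lambda a, b: a + b, math.inf, 0)
boolean = specialize(POLY, {"p": True, "q": False},
                     lambda a, b: a or b, lambda a, b: a and b, False, True)
```

Because the polynomial is computed once and evaluated many times, a single stored provenance expression can answer multiplicity, cost, and existence questions.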

  27. Provenance Semiring Hierarchy [Figure: hierarchy diagram ordered from “More informative” to “Less informative”, with a “Specializes correctly” label on the arrows; nodes: N[X], N (bag), Sorp(X) (defined later), Why(X), Tropical, PosBool(X), Lin(X), Security, Boolean (set)]

  28. Datalog Provenance for General Semirings • PosBool(X): ⋁_{trees τ} ⋀_{leaves t of τ} Annot(t) • General Prov(X): Σ_{trees τ} Π_{leaves t of τ} Annot(t), with the sum and product taken in K (+_K, ·_K) • Infinite sums should be well-defined • Need to consider “ω-continuous semirings” and “ω-continuous homomorphisms”

  29. Provenance Semiring Hierarchy • Need to add ∞: N[[X]] and N^∞ • Finite, so ω-continuous • N[[X]]: most informative provenance semiring [Green et al. ’07] [Figure: hierarchy diagram with nodes N[[X]], N[X], N (bag), Sorp(X), Why(X), Tropical, PosBool(X), Lin(X), Security, Boolean (set)]

  30. How good is N[[X]] w.r.t. Size of Datalog Provenance? • Poly-size overhead is not valid because of infinite sum • But can outputs have finite annotations (with X,  , +) that specializes correctly to semirings with finite domains? Theorem: • It is not possible to annotate with finite provenance expressions • the output of datalog programs following N[[X]] -semantics • that specialize “correctly” to the semiring Why(X) Finite annotations won’t specialize correctly to Why(X) Theorem: However, we can generate poly-size circuits in poly-time directly for Why(X) • Need more levels in the circuit from system of equations • Need a different argument for correctness

  31. Can we still have a good general semiring w.r.t. size? • We propose Sorp(X) • Most general absorptive semiring • a + a·b = a • N[X], but keep only the monomials that are not “absorbed” by the others • e.g. pq + p²q³ → pq, whereas p²q + pq² → p²q + pq² • The same algorithm, proof, and optimizations to construct poly-size circuits hold • Circuits are more general than Boolean circuits • Specializes correctly to interesting semirings • Outputs can be annotated by poly-size circuits
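
The absorption rule a + a·b = a can be sketched as a filter on monomials: a monomial is dropped when some other monomial of the polynomial divides it. The encoding of monomials as dictionaries from variable to exponent and the names `divides` and `absorb` are assumptions for illustration; coefficients are omitted because absorption also gives a + a = a.

```python
def divides(m1, m2):
    """True if monomial m1 divides m2, i.e. m2 = m1 * b for some monomial b."""
    return all(m2.get(x, 0) >= e for x, e in m1.items())

def absorb(monomials):
    """Keep only the monomials that are not absorbed by another one."""
    kept = []
    for m in monomials:
        if any(divides(k, m) for k in kept):
            continue                      # m is absorbed by a kept monomial
        # m may in turn absorb monomials kept earlier
        kept = [k for k in kept if not divides(m, k)] + [m]
    return kept

pq, p2q3 = {"p": 1, "q": 1}, {"p": 2, "q": 3}
p2q, pq2 = {"p": 2, "q": 1}, {"p": 1, "q": 2}

first = absorb([pq, p2q3])     # the slide's example pq + p^2 q^3
second = absorb([p2q, pq2])    # the slide's example p^2 q + p q^2
```

In the first example pq divides p²q³, so only pq survives; in the second neither monomial divides the other, so both are kept.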

  32. Provenance Semiring Hierarchy [Figure: hierarchy diagram with nodes N[X], N (bag), Sorp(X), Why(X), Tropical, PosBool(X), Lin(X), Security, Boolean (set)]

  33. Related Work • Data Provenance • e.g. [Cui et al. ’00, Buneman et al. ’08, Cheney et al. ’09, Benjelloun et al. ’08] • Circuits • Circuit complexity (size, depth, parallelism) has been studied for decades, e.g. [Arora-Barak ’09] (book) • Provenance for Datalog • System of equations, derivation trees, infinite sum [Grahne ’91, Green et al. ’07] • Poly-size c-tables with Boolean formulas for Datalog with contradictions [Abiteboul et al. 2014]

  34. Conclusions • Circuits to represent and store Datalog provenance • for PosBool(X) and other semirings • Semantics, algorithms, limitations, applicability • Preliminary experiments support our results • we compared circuits for deletion propagation with iteratively solving the system of equations and with re-evaluating the Datalog program from scratch • Future Work: • A complete implementation, evaluation, new applications

  35. Thank You Questions?
