Circuits for Datalog Provenance

Daniel Deutch, Tel Aviv Univ.

Tova Milo, Tel Aviv Univ.

Sudeepa Roy, Univ. of Washington

Val Tannen, Univ. of Pennsylvania


A Simple Example of Data Provenance

  • “Boolean Provenance/Lineage” as a Boolean formula

  • Q is true on D ⇔ F_{Q,D} is true

  • Poly-size, poly-time computable (data complexity)

  • But Q is an RA+ query

  • This talk: What if Q is a Datalog program?

[Figure: database D, whose tuples are annotated with the variables x1, x2, y1, y2, y3, z1, z2]

Boolean query Q: ∃x ∃y AsthmaPatient(x) ∧ Friend(x, y) ∧ Smoker(y)

F_{Q,D} = (x1 ∧ y1 ∧ z1) ∨ (x1 ∧ y2 ∧ z2) ∨ (x2 ∧ y3 ∧ z2)
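A minimal sketch of how the provenance formula of this example could be assembled. The person names are invented for illustration, and the assumption that the x, y, z variables annotate AsthmaPatient, Friend, and Smoker tuples respectively is read off the query, not stated on the slide.

```python
# Hypothetical instance: the person names are made up; only the annotation
# variables x1..z2 appear on the slide.
asthma = {"ann": "x1", "bob": "x2"}
friend = {("ann", "joe"): "y1", ("ann", "tom"): "y2", ("bob", "tom"): "y3"}
smoker = {"joe": "z1", "tom": "z2"}

def provenance():
    """Return F_{Q,D} as a list of conjunctions (a monotone DNF)."""
    clauses = []
    for (x, y), f_var in friend.items():
        if x in asthma and y in smoker:
            # One joined triple of tuples gives one conjunction of variables.
            clauses.append((asthma[x], f_var, smoker[y]))
    return clauses

def evaluate(clauses, assignment):
    """Evaluate the DNF under a True/False assignment to the variables."""
    return any(all(assignment[v] for v in clause) for clause in clauses)

F = provenance()
```

Each join result contributes one conjunction, and the projection to a Boolean answer takes their disjunction.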



Motivation

  • Provenance

    • Reliability and repeatability

    • View management and deletion propagation

    • Trust and security management

    • Query answering in probabilistic databases, …

  • Datalog

    • Datalog is popular again! (two keynotes this ICDT/EDBT)

    • Data extraction in Web, declarative networking

    • Academic/commercial systems (Webdamlog, LogicBlox, Dedalus, Dyna)

  • Finding suitable “Provenance for Datalog” is important

    • Both from theoretical and practical viewpoints

  • How do we compute, store, and interpret provenance for datalog programs efficiently and effectively?



Overview of Our Results

  • Can we get poly-size Boolean formulas for datalog provenance?

    No, even if we allow unbounded time

  • Do we have a solution?

    Yes! Use Boolean Circuits!

  • What about general “provenance semirings” beyond Boolean provenance? ref. [Green et al. ’07]

    It depends on the semiring



Outline

  • Background

  • Circuits for Boolean Provenance

  • Circuits for General Provenance Semirings



Datalog

  • Datalog program for Transitive Closure and Single-source Reachability

  • EDB (extensional database, i.e. base) relation for edges: R

  • IDB (intensional database, i.e. derived) relations

    • Transitive closure (T)

    • Single-source reachability from vertex ‘a’ (S)

T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)
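As a rough sketch, the three rules can be evaluated bottom-up by applying them until no new facts appear. The edge relation below is a made-up example, and this is plain naive evaluation without any provenance tracking.

```python
def transitive_closure(R):
    """Naive evaluation of T(x,y) :- R(x,y) and T(x,y) :- R(x,z), T(z,y)."""
    T = set(R)                                   # first rule
    while True:
        # second rule: join R(x, z) with T(z, y)
        new = {(x, y) for (x, z) in R for (z2, y) in T if z == z2}
        if new <= T:                             # fixpoint reached
            return T
        T |= new

def reachable_from(T, source):
    """S(x) :- T(source, x)."""
    return {y for (x, y) in T if x == source}

# Hypothetical EDB relation R (edges of a small graph).
R = {("a", "a"), ("a", "b"), ("b", "c")}
T = transitive_closure(R)
S = reachable_from(T, "a")
```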


Boolean Provenance: PosBool(X)-Database

  • Tuples are annotated with variables from a set X

    • Here X = {x1, x2, y1, y2, …}

  • For n annotated tuples, there are 2^n possible worlds, one per assignment

    φ : X → {True, False}

  • Useful in query evaluation on incomplete or probabilistic databases

[Figure: PosBool(X)-database D, with tuples annotated x1, x2, y1, y2, y3, z1, z2]


RA+ over PosBool(X)-Database

  • Annotation propagates from input to output

    • Join = ∧, Projection/Union = ∨

  • Output tuples are annotated by monotone Boolean formulas

    • F_{Q,D} is the annotation of the unique output tuple

[Figure: PosBool(X)-database D, with tuples annotated x1, x2, y1, y2, y3, z1, z2]

RA+ query Q: ∃x ∃y AsthmaPatient(x) ∧ Friend(x, y) ∧ Smoker(y)

F_{Q,D} = (x1 ∧ y1 ∧ z1) ∨ (x1 ∧ y2 ∧ z2) ∨ (x2 ∧ y3 ∧ z2)


Two Important Properties: RA+ over PosBool(X)-Database

For every RA+ query Q, database D, and assignment φ : X → {True, False}:

  • (Faithful Representation) Q(φ(D)) = φ[Q(D)]

  • (Poly-size overhead) The size of F_{Q,D} is polynomial in |D|, and F_{Q,D} can be computed in poly-time.

[Figure: PosBool(X)-database D under an example True/False assignment φ to the variables x1, x2, y1, y2, y3, z1, z2]

RA+ query Q: ∃x ∃y AsthmaPatient(x) ∧ Friend(x, y) ∧ Smoker(y), evaluated on φ(D): = False

F_{Q,D} = (x1 ∧ y1 ∧ z1) ∨ (x1 ∧ y2 ∧ z2) ∨ (x2 ∧ y3 ∧ z2), evaluated under φ: = False
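The faithful-representation property can be checked exhaustively on a small instance. The sketch below assumes a hypothetical database (invented person names) whose annotations match the example formula, and compares evaluating Q on each possible world against evaluating the formula under the same assignment.

```python
from itertools import product

# Hypothetical instance; only the variable names come from the example.
asthma = {"ann": "x1", "bob": "x2"}
friend = {("ann", "joe"): "y1", ("ann", "tom"): "y2", ("bob", "tom"): "y3"}
smoker = {"joe": "z1", "tom": "z2"}
variables = sorted(set(asthma.values()) | set(friend.values()) | set(smoker.values()))

def query_on_world(phi):
    """Evaluate Q on the possible world phi(D): tuples mapped to False are dropped."""
    A = {p for p, v in asthma.items() if phi[v]}
    Fr = {e for e, v in friend.items() if phi[v]}
    S = {p for p, v in smoker.items() if phi[v]}
    return any(x in A and y in S for (x, y) in Fr)

def formula(phi):
    """Evaluate the provenance formula F_{Q,D} under phi."""
    return ((phi["x1"] and phi["y1"] and phi["z1"])
            or (phi["x1"] and phi["y2"] and phi["z2"])
            or (phi["x2"] and phi["y3"] and phi["z2"]))

def faithful():
    """Check Q(phi(D)) == phi(F_{Q,D}) for all 2^n assignments phi."""
    for bits in product([False, True], repeat=len(variables)):
        phi = dict(zip(variables, bits))
        if query_on_world(phi) != formula(phi):
            return False
    return True
```

The exhaustive loop is only practical for tiny X; it illustrates the property rather than proving it in general.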


Datalog over PosBool(X)-Database

T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

[Figure: graph with nodes a and b; edge R(a, a) is annotated p and edge R(a, b) is annotated q]

  • Semantics using derivation trees (Green et al. 2007)

  • Annotation of T(a, b):

    ⋁ over derivation trees τ of T(a, b) of ( ⋀ over leaves t of τ of Annot(t) )

    = q ∨ (p ∧ q) ∨ (p ∧ p ∧ q) ∨ …

    = q

[Figure: derivation trees of T(a, b): the tree with the single leaf R(a, b); the tree with leaves R(a, a) and R(a, b); the tree with leaves R(a, a), R(a, a), and R(a, b); …]

  • Infinitely many trees

  • But the annotation always has a finite equivalent form

  • But not necessarily poly-size
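To see why the infinite disjunction collapses, one can represent a monotone DNF as a set of clauses and apply absorption. This is an illustrative sketch: the helper names are invented, and only a finite prefix of the infinitely many derivation trees is enumerated.

```python
def minimize(dnf):
    """Apply absorption: drop every clause that strictly contains another clause."""
    return {c for c in dnf if not any(other < c for other in dnf)}

def tree_clause(k):
    """Clause of the derivation tree that uses R(a, a) k times and then R(a, b)."""
    return frozenset(["p"] * k + ["q"])

# Finite prefix of q OR (p AND q) OR (p AND p AND q) OR ...
truncated = {tree_clause(k) for k in range(6)}
minimal = minimize(truncated)
```

Since p ∧ p = p in PosBool(X), every tree with k ≥ 1 yields the same clause {p, q}, which is absorbed by {q}.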


Lower Bound: Boolean Formulas for Datalog Provenance on PosBool(X)

Theorem:

Given a PosBool(X)-database D and a datalog program P, the provenance of tuples in P(D) cannot, in general, be faithfully represented by Boolean formulas of size polynomial in |D|.

Proof outline:

  • st-connectivity on n nodes requires monotone Boolean formulas of size n^Ω(log n)

    • Karchmer-Wigderson, 1988

  • Faithful representation requires: for all True/False assignments φ to X,

    • P(φ(D)) = φ[P(D)]

  • Reduce to the hard instance with the right φ when P = transitive closure

Solution: Boolean circuits!



Outline

  • Background

  • Circuits for Boolean Provenance or PosBool(X)

  • Circuits for General Provenance Semirings


Boolean Circuits

  • Circuit is a DAG

    • shares common subexpressions

    • Boolean formula = tree

  • Leaf nodes:

    • EDB vars in X

  • Internal nodes:

    • ∧ : IDB/EDB vars used jointly in one derivation

    • ∨ : alternative derivations

  • Roots:

    • IDB vars

T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

[Figure: graph with nodes a and b, edge R(a, a) annotated p and edge R(a, b) annotated q; circuit with leaves X_{R(a,a)} = p and X_{R(a,b)} = q and root X_{T(a,b)}]
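The gap between a DAG and a tree can be made concrete with a small sketch. The `Gate` class and the repeated-sharing construction below are assumptions made for illustration, not the circuit from the slide: each level reuses the previous gate twice, so the DAG grows linearly while the unfolded formula grows exponentially.

```python
class Gate:
    """A circuit node; op is "var", "and", or "or"."""
    def __init__(self, op, inputs=(), name=None):
        self.op, self.inputs, self.name = op, tuple(inputs), name

def evaluate(gate, assignment, memo=None):
    """Evaluate each distinct gate once (time linear in the DAG size)."""
    memo = {} if memo is None else memo
    if id(gate) not in memo:
        if gate.op == "var":
            memo[id(gate)] = assignment[gate.name]
        else:
            vals = [evaluate(g, assignment, memo) for g in gate.inputs]
            memo[id(gate)] = all(vals) if gate.op == "and" else any(vals)
    return memo[id(gate)]

def dag_size(gate, seen=None):
    """Number of distinct gates (shared gates counted once)."""
    seen = set() if seen is None else seen
    if id(gate) in seen:
        return 0
    seen.add(id(gate))
    return 1 + sum(dag_size(g, seen) for g in gate.inputs)

def tree_size(gate):
    """Size of the equivalent formula (shared gates duplicated)."""
    return 1 + sum(tree_size(g) for g in gate.inputs)

p, q = Gate("var", name="p"), Gate("var", name="q")
node = q
for _ in range(10):
    # The previous gate is used twice: shared in the DAG, copied in the tree.
    node = Gate("or", [Gate("and", [p, node]), Gate("and", [q, node])])
```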



Upper Bound: Boolean Circuits for PosBool(X)

Theorem:

Given any PosBool(X)-database D and datalog program P,

provenance of tuples in P(D) can be faithfully represented

using monotone Boolean Circuits of poly-size in |D|

(and can be computed in poly-time)


Proof Sketch

Two key ideas from previous work

1. Datalog provenance can be represented by a system of equations, obtained by instantiating the variables of the datalog program P to EDB/IDB tuples [Green et al. 2007]

  • EDB tuples → constants, IDB tuples → variables

  • Iteratively solve this system of equations

  • Fixpoint = provenance for all IDB tuples

2. A system of equations with N Boolean variables can be solved in N+1 iterations [Esparza et al. 2011]

  • N = #IDB tuples

  • Build a circuit with N+1 layers from the system of equations
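The two ideas can be sketched together for the transitive-closure rules: ground the rules into equations, then iterate N+1 times starting from False. The representation below (minimal DNFs as sets of clauses) and the function names are assumptions for illustration; the grounding also creates variables for underivable tuples, which simply stay False.

```python
from itertools import product

FALSE = frozenset()   # the empty DNF

def minimize(dnf):
    """Keep only clauses not strictly containing another clause (absorption)."""
    return frozenset(c for c in dnf if not any(o < c for o in dnf))

def plus(a, b):       # OR of two DNFs
    return minimize(a | b)

def times(a, b):      # AND of two DNFs
    return minimize({c1 | c2 for c1 in a for c2 in b})

def ground(edb, constants):
    """Instantiate T(x,y) :- R(x,y) and T(x,y) :- R(x,z), T(z,y)."""
    eqs = {}
    for x, y in product(constants, repeat=2):
        terms = []
        if (x, y) in edb:
            terms.append((edb[(x, y)], None))        # EDB constant alone
        for z in constants:
            if (x, z) in edb:
                terms.append((edb[(x, z)], (z, y)))  # constant AND IDB variable
        eqs[(x, y)] = terms
    return eqs

def solve(eqs):
    val = {t: FALSE for t in eqs}
    for _ in range(len(eqs) + 1):    # N + 1 rounds, N = number of IDB variables
        new = {}
        for head, terms in eqs.items():
            acc = FALSE
            for ann, body in terms:
                term = frozenset([frozenset([ann])])
                if body is not None:
                    term = times(term, val[body])
                acc = plus(acc, term)
            new[head] = acc
        val = new
    return val

edb = {("a", "a"): "p", ("a", "b"): "q"}   # R(a, a) = p, R(a, b) = q
prov = solve(ground(edb, ["a", "b"]))
```

Storing expanded DNFs as done here can blow up in general; the point of the circuit construction is to keep one gate per variable and round instead.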


Illustration

T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

Step 1: Build the system of equations from all possible instantiations: x, y, z ∈ {a, b}

X_{T(a,a)} = p ∨ (p ∧ X_{T(a,a)})

X_{T(a,b)} = q ∨ (p ∧ X_{T(a,b)})

X_{S(b)} = X_{T(a,b)}

X_{S(a)} = X_{T(a,a)}

Step 2: Build a circuit with 4 + 1 layers (N = 4) …

[Figure: graph with nodes a and b and edge annotations p and q; in the equations, p and q are constants and the X's are variables]


Illustration

X_{T(a,a)} = p ∨ (p ∧ X_{T(a,a)})

X_{T(a,b)} = q ∨ (p ∧ X_{T(a,b)})

X_{S(b)} = X_{T(a,b)}

X_{S(a)} = X_{T(a,a)}

[Figure: layered circuit with levels 0, 1, 2, …; each level i has one gate per IDB variable (X_{T(a,a)},i, X_{T(a,b)},i, X_{S(a)},i, X_{S(b)},i), fed by the EDB leaves p and q and by the gates of level i − 1]

Multiple roots for multiple IDB vars

Assign leaf IDB vars (level 0) to false
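A minimal sketch of the layered construction for these four equations. It assumes that "N+1 layers" means N+1 computed levels above the level-0 False inputs, and the gate naming scheme is invented for illustration.

```python
# Each IDB variable maps to a list of clauses; a clause lists EDB leaves
# (p, q) and/or IDB variables.
EQUATIONS = {
    "T(a,a)": [["p"], ["p", "T(a,a)"]],
    "T(a,b)": [["q"], ["p", "T(a,b)"]],
    "S(a)": [["T(a,a)"]],
    "S(b)": [["T(a,b)"]],
}

def build_circuit(equations):
    """Return (gates, top_level); gates maps a gate id to (op, input ids)."""
    n = len(equations)
    gates = {(v, 0): ("false", []) for v in equations}   # level 0: IDB vars = False
    for level in range(1, n + 2):
        for v, clauses in equations.items():
            and_ids = []
            for k, clause in enumerate(clauses):
                # IDB inputs come from the previous level; EDB leaves are shared.
                inputs = [(s, level - 1) if s in equations else s for s in clause]
                gates[(v, level, k)] = ("and", inputs)
                and_ids.append((v, level, k))
            gates[(v, level)] = ("or", and_ids)
    return gates, n + 1

def evaluate(gates, gate_id, assignment, memo):
    if gate_id not in gates:                 # EDB leaf such as "p"
        return assignment[gate_id]
    if gate_id not in memo:
        op, inputs = gates[gate_id]
        vals = [evaluate(gates, i, assignment, memo) for i in inputs]
        memo[gate_id] = {"false": False, "and": all(vals), "or": any(vals)}[op]
    return memo[gate_id]

gates, top = build_circuit(EQUATIONS)
```

The number of gates is (N+1) times the size of the equation system plus N constants, so it stays polynomial in the number of grounded rules.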



Optimizations

  • Store only two levels of circuit instead of N+1 levels

    • Evaluate iteratively

  • Embed circuit construction in semi-naïve evaluation

    • Check for new derivations, not only new IDB variables

    • Sound and Complete

  • Remove self-dependency of IDB vars

    • works for PosBool(X) and also some other semirings…

      X_{T(a,a)} = p ∨ (p ∧ X_{T(a,a)})

      X_{T(a,b)} = q ∨ (p ∧ X_{T(a,b)})

      X_{S(b)} = X_{T(a,b)}

      X_{S(a)} = X_{T(a,a)}
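The self-dependency optimization can be illustrated on these equations: a clause that mentions its own head variable is absorbed and can be dropped. The sketch below is an assumption-laden check on this one example, not a proof; it also keeps only the previous and the current level while iterating, in the spirit of the two-level optimization.

```python
from itertools import product

EQUATIONS = {
    "T(a,a)": [["p"], ["p", "T(a,a)"]],
    "T(a,b)": [["q"], ["p", "T(a,b)"]],
    "S(a)": [["T(a,a)"]],
    "S(b)": [["T(a,b)"]],
}

def drop_self_dependency(equations):
    """Remove every clause in which a variable depends directly on itself."""
    return {v: [c for c in clauses if v not in c] for v, clauses in equations.items()}

def least_fixpoint(equations, assignment):
    """Iterate N+1 times from False, storing only the previous level."""
    val = {v: False for v in equations}
    for _ in range(len(equations) + 1):
        val = {v: any(all(val[s] if s in equations else assignment[s] for s in c)
                      for c in clauses)
               for v, clauses in equations.items()}
    return val

def same_solutions(eq1, eq2, edb_vars):
    """Compare both systems under every assignment to the EDB variables."""
    for bits in product([False, True], repeat=len(edb_vars)):
        asg = dict(zip(edb_vars, bits))
        if least_fixpoint(eq1, asg) != least_fixpoint(eq2, asg):
            return False
    return True

SIMPLIFIED = drop_self_dependency(EQUATIONS)
```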


Illustration (From here…)

[Figure: the unoptimized layered circuit, with gates for X_{T(a,a)}, X_{T(a,b)}, X_{S(a)}, X_{S(b)} at levels 0, 1, 2, …, EDB leaves p and q, and level-0 IDB inputs set to false]


Illustration (…To here)

With all these optimizations

[Figure: two-level circuit. Top level: X_{T(a,a)},top, X_{T(a,b)},top, X_{S(a)},top. Bottom level: X_{T(a,a)},bottom, X_{T(a,b)},bottom, X_{S(a)},bottom. Leaves: p and q]



Applications of PosBool(X)-Circuits

  • Linear-time deletion propagation (in circuit-size)

  • Approximation for probabilistic databases

    • even when only the circuit (and not the database) is available

  • Circuits can be computed “offline”

    • Only linear-time evaluation is required when needed (e.g. deletion propagation)

      • compared to storing and solving a system of equations iteratively, or

      • re-evaluating datalog program

  • Can use existing techniques for efficient and parallel circuit evaluation
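Deletion propagation on a stored circuit amounts to setting the deleted EDB variables to False and evaluating once. The circuit below is a hypothetical one for a three-edge graph (a→b as r1, b→c as r2, a→c as r3), listed in topological order so that a single pass suffices.

```python
# Gates in topological order: (name, op, inputs). Inputs are EDB variables
# or gates defined earlier in the list. This circuit is a made-up example.
CIRCUIT = [
    ("T(a,b)", "or", ["r1"]),
    ("T(b,c)", "or", ["r2"]),
    ("r1*T(b,c)", "and", ["r1", "T(b,c)"]),
    ("T(a,c)", "or", ["r3", "r1*T(b,c)"]),
]
EDB_VARS = ["r1", "r2", "r3"]
OUTPUTS = ["T(a,b)", "T(b,c)", "T(a,c)"]

def surviving(circuit, edb_vars, outputs, deleted):
    """Return the output tuples still derivable after deleting EDB tuples."""
    value = {v: v not in deleted for v in edb_vars}   # deleted tuple = False
    for name, op, inputs in circuit:                  # one pass, linear time
        vals = [value[i] for i in inputs]
        value[name] = all(vals) if op == "and" else any(vals)
    return {o for o in outputs if value[o]}
```

No datalog re-evaluation and no equation solving is needed at deletion time; the circuit is built once, offline.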



Outline

  • Background

  • Circuits for Boolean Provenance or PosBool(X)

  • Circuits for General Provenance Semirings


Commutative Semirings

  • (K, +K, ·K, 0K, 1K)

    • domain K

    • +K, ·K : associative, commutative, with neutral elements 0K and 1K respectively

    • ·K distributes over +K, i.e. a ·K (b +K c) = a ·K b +K a ·K c

    • 0K annihilates every element of K, i.e. a ·K 0K = 0K ·K a = 0K

  • Examples:

    • (B, ∨, ∧, False, True)

      • Set semantics

    • (N, +, ·, 0, 1)

      • Bag semantics

    • (N ∪ {∞}, min, +, ∞, 0)

      • Tropical semiring to compute cost (e.g. cost of a shortest path)
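The three example semirings can be written down directly. This is a small sketch with invented names (`Semiring`, `path_annotation`); it computes, for a given list of paths, the sum in K over paths of the product in K of the edge annotations.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
COUNTING = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)

def path_annotation(K, paths):
    """Sum (in K) over paths of the product (in K) of the edge annotations."""
    total = K.zero
    for path in paths:
        prod = K.one
        for edge_annotation in path:
            prod = K.times(prod, edge_annotation)
        total = K.plus(total, prod)
    return total
```

With edge costs as annotations, the tropical semiring returns the cost of the cheapest listed path; with multiplicities, the counting semiring returns the bag multiplicity.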


Provenance Semirings

  • Generalization of PosBool(X)

  • (K, +K, ·K, 0K, 1K)

    • Tuples are annotated with variables from X

    • K is of the form Prov(X)

    • +K denotes alternative usage

    • ·K denotes joint usage

  • Examples:

    • (PosBool(X), ∨, ∧, False, True)

    • (Lin(X), ∪, ∪*, ⊥, ∅)

      • tracks contributing tuples [Cui et al. ’00]

    • (Why(X), ∪, ⋓, ∅, {∅})

      • ⋓: pairwise union of subsets; tracks contributing tuples in alternative derivations [Buneman et al. ’01]
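A short sketch of the Why(X) operations, assuming the standard set-of-sets representation (an annotation is a set of witness sets of variables). The function names are invented for illustration.

```python
def var(x):
    """Annotation of a base tuple: a single witness containing only x."""
    return frozenset([frozenset([x])])

def why_plus(a, b):
    """Alternative usage: union of the witness sets."""
    return a | b

def why_times(a, b):
    """Joint usage: pairwise union of witnesses."""
    return frozenset(s | t for s in a for t in b)

WHY_ZERO = frozenset()                # no witness
WHY_ONE = frozenset([frozenset()])    # the empty witness

x1, y1, y2, z1, z2 = (var(n) for n in ["x1", "y1", "y2", "z1", "z2"])
# (x1 * y1 * z1) + (x1 * y2 * z2)
annotation = why_plus(why_times(why_times(x1, y1), z1),
                      why_times(why_times(x1, y2), z2))
```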


Provenance Specialization

  • Key property needed for applications like deletion propagation, trust management, cost computation, …

  • Prov(X) specializes correctly to K,

    if any valuation v : X → K

    extends uniquely to a homomorphism h_v : Prov(X) → K

    (which correctly maps + and · of Prov(X) to those of K)

  • Further, some provenance semirings are “more informative” than the others
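For Prov(X) = N[X], the homomorphism h_v is just polynomial evaluation carried out with the operations of K. The sketch below uses an invented polynomial and an invented dictionary encoding (monomial as a tuple of variables, mapped to its natural-number coefficient); it does not demonstrate uniqueness, only the evaluation.

```python
import math

def specialize(poly, valuation, plus, times, zero, one):
    """Evaluate a polynomial of N[X] in a semiring K under v : X -> K."""
    total = zero
    for monomial, coeff in poly.items():
        term = one
        for x in monomial:
            term = times(term, valuation[x])
        for _ in range(coeff):       # coefficient c means a c-fold sum in K
            total = plus(total, term)
    return total

# Hypothetical provenance polynomial 2*p*q + p^2.
POLY = {("p", "q"): 2, ("p", "p"): 1}

bag = specialize(POLY, {"p": 2, "q": 3},
                 lambda a, b: a + b, lambda a, b: a * b, 0, 1)
boolean = specialize(POLY, {"p": True, "q": False},
                     lambda a, b: a or b, lambda a, b: a and b, False, True)
cost = specialize(POLY, {"p": 1, "q": 5},
                  min, lambda a, b: a + b, math.inf, 0)
```

The same stored annotation answers a bag-semantics, a set-semantics, and a cost question, depending only on the valuation and the target semiring.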


Provenance Semiring Hierarchy

[Figure: hierarchy of provenance semirings, ordered from more informative to less informative, with edges meaning “specializes correctly”. Nodes: N[X], N (bag), Sorp(X) (defined later), Why(X), Tropical, PosBool(X), Lin(X), Security, Boolean (set)]


Datalog Provenance for General Semirings

  • PosBool(X): annotation = ⋁ over derivation trees τ of ( ⋀ over leaves t of τ of Annot(t) )

  • General Prov(X): annotation = Σ (using +K) over derivation trees τ of ( Π (using ·K) over leaves t of τ of Annot(t) )

  • Infinite sums should be well-defined

  • Need to consider “ω-continuous semirings” and “ω-continuous homomorphisms”


Provenance Semiring Hierarchy

[Figure: the semiring hierarchy extended for datalog. Callouts: “Need to add N[[X]] and N∞”; “Finite, so ω-continuous”; “N[[X]]: most informative provenance semiring [Green et al. ’07]”. Nodes: N[X], N (bag), Sorp(X), Why(X), Tropical, PosBool(X), Lin(X), Security, Boolean (set)]


How good is N[[X]] w.r.t. Size of Datalog Provenance?

  • Poly-size overhead does not hold because of the infinite sums

  • But can outputs have finite annotations (with X, ·, +) that specialize correctly to semirings with finite domains?

Theorem:

It is not possible to annotate the output of datalog programs following N[[X]]-semantics with finite provenance expressions that specialize “correctly” to the semiring Why(X).

Finite annotations won’t specialize correctly to Why(X)

Theorem:

However, we can generate poly-size circuits in poly-time directly for Why(X)

  • Need more levels in the circuit built from the system of equations

  • Need a different argument for correctness


Can we still have a good general semiring w.r.t. size?

  • We propose Sorp(X)

    • Most general absorptive semiring

      • a + a·b = a

    • N[X], but keeping only the monomials that are not “absorbed” by others

      • e.g. pq + p²q³ ≡ pq

        p²q + pq² ≡ p²q + pq²

  • The same algorithm, proof, and optimizations for constructing poly-size circuits hold

    • Circuits are more general than Boolean circuits

  • Specializes correctly to interesting semirings

  • Outputs can be annotated by poly-size circuits
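The absorption rule on polynomials can be sketched as a divisibility test between monomials: a monomial is dropped when another, different monomial divides it. The encoding (a monomial as a dict from variable to positive exponent, coefficients ignored) is an assumption made for this illustration.

```python
def divides(m1, m2):
    """m1 divides m2: every exponent in m1 is <= the matching exponent in m2."""
    return all(m2.get(x, 0) >= e for x, e in m1.items())

def absorb(monomials):
    """Keep only the monomials that no other monomial absorbs (a + a*b = a)."""
    kept = []
    for m in monomials:
        absorbed = any(divides(o, m) and o != m for o in monomials)
        if not absorbed and m not in kept:
            kept.append(m)
    return kept

# pq + p^2 q^3 reduces to pq
first = absorb([{"p": 1, "q": 1}, {"p": 2, "q": 3}])
# p^2 q + p q^2 stays as is: neither monomial divides the other
second = absorb([{"p": 2, "q": 1}, {"p": 1, "q": 2}])
```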


Provenance Semiring Hierarchy

[Figure: the provenance semiring hierarchy, including Sorp(X). Nodes: N[X], N (bag), Sorp(X), Why(X), Tropical, PosBool(X), Lin(X), Security, Boolean (set)]



Related Work

  • Data Provenance

    • e.g. [Cui et al. ’00, Buneman et al. ’08, Cheney et al. ’09, Benjelloun et al. ’08]

  • Circuits

    • Circuit complexity (size, depth, parallelism) has been studied for decades, e.g. [Arora-Barak ’09] (book)

  • Provenance for Datalog

    • System of equations, derivation trees, infinite sum [Grahne’91, Green et al. ’07]

    • Poly-size c-tables with Boolean formulas for datalog with contradictions [Abiteboul et al. 2014]



Conclusions

  • Circuits to represent and store Datalog Provenance

    • for PosBool(X) and other semirings

    • Semantics, Algorithms, Limitations, Applicability

    • Preliminary experiments support our results

      • for deletion propagation, we compared circuits against iteratively solving the system of equations and against re-evaluating the datalog program from scratch

  • Future Work:

    • A complete implementation, evaluation, new applications



Thank You

Questions?

