# Circuits for Datalog Provenance




## Circuits for Datalog Provenance

Daniel Deutch

Tel Aviv Univ.

Tova Milo

Tel Aviv Univ.

Sudeepa Roy

Univ. of Washington

Val Tannen

Univ. of Pennsylvania

### A Simple Example of Data Provenance

• “Boolean Provenance/Lineage” as a Boolean formula

• $Q$ is true on $D$ $\iff$ $F_{Q,D}$ is true

• Poly-size, poly-time computable (data complexity)

• But Q is an RA+ query

• This talk: What if Q is a Datalog Program?

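[Figure: database D, with tuples annotated by the variables x1, x2, y1, y2, y3, z1, z2]

Boolean query $Q$: $\exists x\, \exists y\; \text{AsthmaPatient}(x) \land \text{Friend}(x, y) \land \text{Smoker}(y)$

$F_{Q,D} = (x_1 \land y_1 \land z_1) \lor (x_1 \land y_2 \land z_2) \lor (x_2 \land y_3 \land z_2)$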

### Motivation

• Provenance

• Reliability and repeatability

• View management and deletion propagation

• Trust and security management

• Query answering in probabilistic databases, …

• Datalog

• Datalog is popular again! (two keynotes this ICDT/EDBT)

• Data extraction in Web, declarative networking

• Academic/commercial systems (Webdamlog, LogicBlox, Dedalus, Dyna)

• Finding suitable “Provenance for Datalog” is important

• Both from theoretical and practical viewpoints

• How do we compute, store, and interpret provenance for datalog programs efficiently and effectively?

### Overview of Our Results

• Can we get poly-size Boolean formulas for datalog provenance?

No, even if we allow unbounded time

• Do we have a solution?

Yes! Use Boolean Circuits!

• What about general “provenance semirings” beyond Boolean provenance? [Green et al. ’07]

It depends on the semiring

### Outline

• Background

• Circuits for Boolean Provenance

• Circuits for General Provenance Semirings

### Datalog

• Datalog program for Transitive Closure and Single-source Reachability

• EDB (base) relation for edges: R

• IDB (derived) relations

• Transitive closure (T)

• Single-source reachability from vertex ‘a’ (S)

T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

EDB: Extensional Database

IDB: Intensional Database
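The three rules above can be sketched as a naive bottom-up evaluation. This is a minimal illustration, not the evaluation strategy of any particular Datalog engine; the edge set `R` is a hypothetical EDB instance chosen for the example.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of
         T(x, y) :- R(x, y)
         T(x, y) :- R(x, z), T(z, y)
    """
    t = set(edges)                                   # first rule
    while True:
        # second rule: join R(x, z) with the T(z, y) facts derived so far
        derived = {(x, y) for (x, z) in edges for (z2, y) in t if z == z2}
        if derived <= t:                             # fixpoint reached
            return t
        t |= derived


def reachable_from(source, t):
    """S(x) :- T(source, x)"""
    return {y for (x, y) in t if x == source}


# Hypothetical EDB relation R (not taken from the slides).
R = {("a", "b"), ("b", "c"), ("d", "a")}
T = transitive_closure(R)
S = reachable_from("a", T)
```

Here `S` holds the vertices reachable from `a`, and `T` holds all pairs connected by a path of length at least one.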

### Boolean Provenance PosBool(X)-Database

• Tuples are annotated with variables from a set X

• Here X = {x1, x2, y1, y2, ….}

• For $n$ annotated tuples, there are $2^n$ possible worlds, one per assignment $\mu : X \to \{\text{True}, \text{False}\}$

• Useful in query evaluation on incomplete or probabilistic databases

[Figure: PosBool(X)-database D, with tuples annotated by the variables x1, x2, y1, y2, y3, z1, z2]

### RA+ over PosBool(X)-Database

• Annotation propagates from input to output

• Join = $\land$, Projection/Union = $\lor$

• Output tuples are annotated by monotone Boolean formula

• FQ,D is the annotation of the unique output tuple

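[Figure: PosBool(X)-database D, with tuples annotated by the variables x1, x2, y1, y2, y3, z1, z2]

RA+ query $Q$: $\exists x\, \exists y\; \text{AsthmaPatient}(x) \land \text{Friend}(x, y) \land \text{Smoker}(y)$

$F_{Q,D} = (x_1 \land y_1 \land z_1) \lor (x_1 \land y_2 \land z_2) \lor (x_2 \land y_3 \land z_2)$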
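Annotation propagation for this query can be sketched as follows. A monotone formula is kept in DNF as a set of monomials, each monomial a frozenset of variables, so join is pairwise union of monomials and projection/union is set union. The person names (`ann`, `ben`, `carl`, `dana`) are hypothetical: the figure's tuples are not recoverable, so they were chosen only so that the variables line up with the formula above.

```python
from itertools import product


def var(name):
    """Formula consisting of the single variable `name`."""
    return {frozenset([name])}


def conj(f, g):
    """Join: AND of two DNF formulas (pairwise union of monomials)."""
    return {m1 | m2 for m1, m2 in product(f, g)}


# Hypothetical annotated database (tuple -> annotation).
asthma = {"ann": var("x1"), "ben": var("x2")}
friend = {("ann", "carl"): var("y1"),
          ("ann", "dana"): var("y2"),
          ("ben", "dana"): var("y3")}
smoker = {"carl": var("z1"), "dana": var("z2")}


def provenance():
    """Annotation of the single output tuple of the Boolean query Q."""
    result = set()                        # empty disjunction = False
    for (x, y), f_xy in friend.items():
        if x in asthma and y in smoker:
            # join multiplies annotations; projection ORs the alternatives
            result |= conj(conj(asthma[x], f_xy), smoker[y])
    return result


F = provenance()
```

Each monomial in `F` corresponds to one way of satisfying the query, which is exactly one disjunct of the formula.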

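### Two Important Properties: RA+ over PosBool(X)-Database

For every RA+ query $Q$, database $D$, and assignment $\mu : X \to \{\text{True}, \text{False}\}$:

• (Faithful Representation) $Q(\mu(D)) = \mu[Q(D)]$: evaluating $Q$ on the possible world selected by $\mu$ gives the same result as applying $\mu$ to the annotations of $Q(D)$

• (Poly-size overhead) The size of $F_{Q,D}$ is polynomial in $|D|$, and $F_{Q,D}$ can be computed in poly-time.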

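[Figure: PosBool(X)-database D under a sample True/False assignment to the variables x1, x2, y1, y2, y3, z1, z2]

RA+ query $Q$: $\exists x\, \exists y\; \text{AsthmaPatient}(x) \land \text{Friend}(x, y) \land \text{Smoker}(y)$, which evaluates to False on the world selected by this assignment

$F_{Q,D} = (x_1 \land y_1 \land z_1) \lor (x_1 \land y_2 \land z_2) \lor (x_2 \land y_3 \land z_2) = \text{False}$ under the same assignment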
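Faithfulness can be checked by brute force on a small instance: for every assignment, evaluate the query on the selected world and compare with the formula under the same assignment. The sketch below assumes a hypothetical database (the names `ann`, `ben`, `carl`, `dana` are invented) whose annotations are consistent with the formula on this slide.

```python
from itertools import product

# Hypothetical annotated database: tuple -> annotating variable.
ASTHMA = {"ann": "x1", "ben": "x2"}
FRIEND = {("ann", "carl"): "y1", ("ann", "dana"): "y2", ("ben", "dana"): "y3"}
SMOKER = {"carl": "z1", "dana": "z2"}

# F_{Q,D} as a list of conjunctions.
F = [("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x2", "y3", "z2")]
VARS = sorted(set(ASTHMA.values()) | set(FRIEND.values()) | set(SMOKER.values()))


def query_on_world(mu):
    """Evaluate Q directly on the world that keeps only tuples mapped to True."""
    asthma = {t for t, v in ASTHMA.items() if mu[v]}
    friend = {t for t, v in FRIEND.items() if mu[v]}
    smoker = {t for t, v in SMOKER.items() if mu[v]}
    return any(x in asthma and y in smoker for (x, y) in friend)


def formula_under(mu):
    """Evaluate F_{Q,D} under the assignment mu."""
    return any(all(mu[v] for v in clause) for clause in F)


def is_faithful():
    """Compare both evaluations on every one of the 2^n assignments."""
    for bits in product([False, True], repeat=len(VARS)):
        mu = dict(zip(VARS, bits))
        if query_on_world(mu) != formula_under(mu):
            return False
    return True


worlds = 2 ** len(VARS)
faithful = is_faithful()
```

The exhaustive loop is only feasible because the instance is tiny; the point of the property is that the formula answers the question for any world without re-running the query.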

### Datalog over PosBool(X) Database

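T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

[Figure: graph with EDB tuples R(a, a) annotated p and R(a, b) annotated q, and the derivation trees of T(a, b): the tree with the single leaf R(a, b), the tree with leaves R(a, a), R(a, b), the tree with leaves R(a, a), R(a, a), R(a, b), and so on]

• Semantics using Derivation Trees (Green et al. 2007)

• Annotation of T(a, b):

$$\bigvee_{\text{trees } \tau} \; \bigwedge_{\text{leaves } t \text{ of } \tau} \text{Annot}(t) \;=\; (q) \lor (p \land q) \lor (p \land p \land q) \lor \dots \;=\; q$$

• Infinitely many trees

• But always has a finite equivalent form

• But not necessarily poly-size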
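The collapse of the infinite disjunction to a finite form rests on idempotence and absorption in PosBool(X). A minimal sketch, assuming the derivation trees of T(a, b) use R(a, a) (annotated p) some number k of times followed by R(a, b) (annotated q); only the first six trees are enumerated, since larger k adds no new monomials.

```python
def minimize(monomials):
    """Absorption: drop every monomial that strictly contains another one,
    since a OR (a AND b) = a."""
    return {m for m in monomials if not any(other < m for other in monomials)}


def tree_monomial(k):
    """Conjunction of the leaves of the tree that uses R(a, a) k times."""
    leaves = ["p"] * k + ["q"]
    return frozenset(leaves)      # AND is idempotent, so repeated p collapses


trees = {tree_monomial(k) for k in range(6)}
annotation = minimize(trees)
```

After idempotence only two distinct monomials remain, `q` and `p AND q`, and absorption removes the second.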

### Lower Bound: Boolean formulas for Datalog Provenance on PosBool(X)

Theorem:

Given a PosBool(X)-database D and a datalog program P, the provenance of tuples in P(D) cannot, in general, be faithfully represented by Boolean formulas of size polynomial in |D|.

Proof outline:

• st-connectivity on $n$ nodes requires monotone Boolean formulas of size $n^{\Omega(\log n)}$

• Karchmer-Wigderson, 1988

• Faithful representation requires: for all True/False assignments $\mu$ to X,

• $P(\mu(D)) = \mu[P(D)]$

• Reduce to the hard instance with the right $\mu$ when P = transitive closure

Solution: Boolean Circuit!

### Outline

• Background

• Circuits for Boolean Provenance or PosBool(X)

• Circuits for General Provenance Semirings

### Boolean Circuits

• Circuit is a DAG

• Reuses common subexpressions

• Boolean formula = tree

• Leaf nodes:

• EDB vars in X

• Internal nodes:

• $\land$: IDB/EDB vars used in one derivation

• $\lor$: alternative derivations

• Roots:

• IDB vars

T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

[Figure: graph with edges R(a, a) annotated p and R(a, b) annotated q, next to a circuit over the leaves $X_{R(a,a)} = p$ and $X_{R(a,b)} = q$ with root $X_{T(a,b)}$]

### Upper Bound: Boolean Circuits for PosBool(X)

Theorem:

Given any PosBool(X)-database D and datalog program P, the provenance of tuples in P(D) can be faithfully represented using monotone Boolean circuits of size polynomial in |D| (and the circuits can be computed in poly-time).

### Proof Sketch

Two key ideas from previous work:

1. Datalog provenance can be represented by a system of equations, obtained by instantiating the variables of the datalog program P to EDB/IDB tuples [Green et al. 2007]

• EDB tuples are constants, IDB tuples are variables

• Iteratively solve this system of equations

• Fixpoint = provenance for all IDB tuples

2. A system of equations with N Boolean variables can be solved in N+1 iterations [Esparza et al. 2011]

• N = #IDB tuples

• Build a circuit with N+1 layers from the system of equations
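The iterative solution of such a system can be sketched directly on formulas. The sketch below keeps each value as a minimized DNF (a frozenset of monomials), starts every IDB variable at False, and applies the equations N+1 times. The system used is the one for a two-node graph with R(a, a) annotated p and R(a, b) annotated q; the representation and helper names are assumptions made for illustration.

```python
from itertools import product

FALSE = frozenset()                                  # empty disjunction


def var(name):
    return frozenset([frozenset([name])])


def minimize(f):
    """Absorption: drop monomials that strictly contain another monomial."""
    return frozenset(m for m in f if not any(o < m for o in f))


def OR(f, g):
    return minimize(f | g)


def AND(f, g):
    return minimize(frozenset(a | b for a, b in product(f, g)))


p, q = var("p"), var("q")

# One equation per IDB tuple; EDB annotations p and q act as constants.
EQUATIONS = {
    "T(a,a)": lambda v: OR(p, AND(p, v["T(a,a)"])),
    "T(a,b)": lambda v: OR(q, AND(p, v["T(a,b)"])),
    "S(a)":   lambda v: v["T(a,a)"],
    "S(b)":   lambda v: v["T(a,b)"],
}


def solve(equations):
    """Apply all equations simultaneously, N + 1 times, starting from False."""
    n = len(equations)
    values = {x: FALSE for x in equations}
    for _ in range(n + 1):
        values = {x: rhs(values) for x, rhs in equations.items()}
    return values


solution = solve(EQUATIONS)
```

On this instance the values stop changing after the second round; the remaining rounds are kept only to mirror the N+1 bound.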

### Illustration

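T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

[Figure: graph with edges R(a, a) annotated p and R(a, b) annotated q; EDB annotations are constants, IDB tuples are variables]

Step 1: Build the system of equations from all possible instantiations $x, y, z \in \{a, b\}$

$X_{T(a,a)} = p \lor (p \land X_{T(a,a)})$

$X_{T(a,b)} = q \lor (p \land X_{T(a,b)})$

$X_{S(b)} = X_{T(a,b)}$

$X_{S(a)} = X_{T(a,a)}$

Step 2: Build a circuit with 4 + 1 layers (N = 4) …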

### Illustration

Multiple roots for multiple IDB vars

$X_{T(a,a)} = p \lor (p \land X_{T(a,a)})$

$X_{T(a,b)} = q \lor (p \land X_{T(a,b)})$

$X_{S(b)} = X_{T(a,b)}$

$X_{S(a)} = X_{T(a,a)}$

[Figure: layered circuit with one gate $X_{t,i}$ per IDB tuple $t \in \{T(a,a), T(a,b), S(a), S(b)\}$ and level $i$ (levels 0, 1, and 2 shown), over the EDB leaves p and q; the level-0 IDB inputs are set to false]

Assign leaf IDB vars to false
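The layered construction can be sketched as follows: each level gets one root per IDB variable, built from the previous level's roots, with the bottom-level IDB inputs fixed to false. The encoding of equations as nested tuples, the gate-sharing table, and the function names are assumptions made for this sketch, not the authors' implementation. The sketch stacks N+1 layers on top of the false constants.

```python
# Right-hand sides as nested tuples over EDB leaves and IDB variables.
RHS = {
    "T(a,a)": ("or", ("edb", "p"), ("and", ("edb", "p"), ("idb", "T(a,a)"))),
    "T(a,b)": ("or", ("edb", "q"), ("and", ("edb", "p"), ("idb", "T(a,b)"))),
    "S(a)":   ("idb", "T(a,a)"),
    "S(b)":   ("idb", "T(a,b)"),
}


def build_circuit(rhs):
    gates, index = [], {}                 # gate list and sharing table

    def gate(desc):
        """Return the id of `desc`, creating the gate only if it is new."""
        if desc not in index:
            index[desc] = len(gates)
            gates.append(desc)
        return index[desc]

    def compile_expr(expr, prev):
        kind = expr[0]
        if kind == "edb":
            return gate(("input", expr[1]))
        if kind == "idb":
            return prev[expr[1]]          # wire to the previous level's root
        left, right = (compile_expr(e, prev) for e in expr[1:])
        return gate((kind, left, right))

    level = {x: gate(("const", False)) for x in rhs}      # IDB leaves = false
    for _ in range(len(rhs) + 1):                         # N + 1 layers
        level = {x: compile_expr(e, level) for x, e in rhs.items()}
    return gates, level


def evaluate(gates, assignment):
    """Single pass over the gates; children always precede their parents."""
    values = []
    for desc in gates:
        if desc[0] == "input":
            values.append(assignment[desc[1]])
        elif desc[0] == "const":
            values.append(desc[1])
        elif desc[0] == "and":
            values.append(values[desc[1]] and values[desc[2]])
        else:
            values.append(values[desc[1]] or values[desc[2]])
    return values


gates, roots = build_circuit(RHS)
```

Because identical gates are shared, each extra layer adds at most one gate per operator in the equations, which is what keeps the circuit polynomial in size.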

### Optimizations

• Store only two levels of circuit instead of N+1 levels

• Evaluate iteratively

• Embed circuit construction in semi-naïve evaluation

• Check for new derivations, not only new IDB variables

• Sound and Complete

• Remove self-dependency of IDB vars

• works for PosBool(X) and also some other semirings…

$X_{T(a,a)} = p \lor (p \land X_{T(a,a)})$

$X_{T(a,b)} = q \lor (p \land X_{T(a,b)})$

$X_{S(b)} = X_{T(a,b)}$

$X_{S(a)} = X_{T(a,a)}$

[Figure: the unoptimized layered circuit (levels 0, 1, and 2 shown), with one gate $X_{t,i}$ per IDB tuple $t$ and level $i$, EDB leaves p and q, and level-0 IDB inputs set to false]

### Illustration (…To here)

With all these optimizations

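[Figure: optimized two-level circuit. Top level: gates $X_{T(a,a),\text{top}}$, $X_{S(a),\text{top}}$, $X_{T(a,b),\text{top}}$. Bottom level: gates $X_{S(a),\text{bottom}}$, $X_{T(a,b),\text{bottom}}$, $X_{T(a,a),\text{bottom}}$, over the EDB leaves p and q]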

### Applications of PosBool(X)-Circuits

• Linear-time deletion propagation (in circuit-size)

• Approximation for probabilistic databases

• even when only the circuit (and not the database) is available

• Circuits can be computed “offline”

• Only linear-time evaluation is required when needed (e.g. deletion propagation)

• compared to storing and solving a system of equations iteratively, or

• re-evaluating datalog program

• Can use existing techniques for efficient and parallel circuit evaluation
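Deletion propagation on a stored circuit amounts to setting the deleted tuples' inputs to false and evaluating every gate once. The circuit below is a small hand-written, hypothetical example (EDB variables r1 to r4 and three IDB outputs), not one produced from the slides' data.

```python
# Gates in topological order: children appear before their parents.
CIRCUIT = [
    ("input", "r1"), ("input", "r2"), ("input", "r3"), ("input", "r4"),  # 0..3
    ("and", 0, 1),      # 4: r1 AND r2
    ("or", 2, 4),       # 5: T(a,c) = r3 OR (r1 AND r2)
    ("and", 5, 3),      # 6: T(a,d), reuses gate 5 as a shared subexpression
    ("and", 1, 3),      # 7: T(b,d) = r2 AND r4
]
OUTPUTS = {"T(a,c)": 5, "T(a,d)": 6, "T(b,d)": 7}


def surviving_tuples(circuit, outputs, deleted):
    """One pass over the gates: time linear in the circuit size."""
    values = []
    for gate in circuit:
        if gate[0] == "input":
            values.append(gate[1] not in deleted)   # deleted tuple -> False
        elif gate[0] == "and":
            values.append(values[gate[1]] and values[gate[2]])
        else:
            values.append(values[gate[1]] or values[gate[2]])
    return {t for t, g in outputs.items() if values[g]}


after_r2 = surviving_tuples(CIRCUIT, OUTPUTS, deleted={"r2"})
after_r4 = surviving_tuples(CIRCUIT, OUTPUTS, deleted={"r4"})
```

Neither the database nor the Datalog program is consulted here, which matches the point that the circuit alone suffices once it has been computed offline.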

### Outline

• Background

• Circuits for Boolean Provenance or PosBool(X)

• Circuits for General Provenance Semirings

### Commutative Semirings

• $(K, +_K, \cdot_K, 0_K, 1_K)$

• domain K

• $+_K$, $\cdot_K$: associative, commutative, with neutral elements $0_K$, $1_K$ respectively

• $\cdot_K$ distributes over $+_K$, i.e. $a \cdot_K (b +_K c) = a \cdot_K b +_K a \cdot_K c$

• $0_K$ annihilates any element of K, i.e. $a \cdot_K 0_K = 0_K \cdot_K a = 0_K$

Examples:

• $(\mathbb{B}, \lor, \land, \text{False}, \text{True})$

• Set semantics

• $(\mathbb{N}, +, \cdot, 0, 1)$

• Bag semantics

• $(\mathbb{N} \cup \{\infty\}, \min, +, \infty, 0)$

• Tropical semiring to compute cost (e.g. cost of a shortest path)
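The three example semirings can be written down and spot-checked against the axioms. This is a small sketch: the `Semiring` class and the `satisfies_axioms` helper are illustrative names, and checking on a few sample values is a sanity test, not a proof.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any


BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
BAG = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)


def satisfies_axioms(k, samples):
    """Check neutral elements, annihilation, and distributivity on samples."""
    for a in samples:
        if k.plus(a, k.zero) != a or k.times(a, k.one) != a:
            return False
        if k.times(a, k.zero) != k.zero:
            return False
        for b in samples:
            for c in samples:
                lhs = k.times(a, k.plus(b, c))
                rhs = k.plus(k.times(a, b), k.times(a, c))
                if lhs != rhs:
                    return False
    return True


# Two alternative paths with edge costs (1, 4) and (2, 2):
# "times" adds costs along a path, "plus" keeps the cheaper alternative.
cheapest = TROPICAL.plus(TROPICAL.times(1, 4), TROPICAL.times(2, 2))
```

In the tropical semiring the roles are shifted: the additive neutral element is infinity and the multiplicative neutral element is 0.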

### Provenance Semirings

• Generalization of PosBool(X)

• $(K, +_K, \cdot_K, 0_K, 1_K)$

• Tuples are annotated with variables from X

• K is of the form Prov(X)

• $+_K$ denotes alternative usage

• $\cdot_K$ denotes joint usage

• Examples:

• $(\text{PosBool}(X), \lor, \land, \text{False}, \text{True})$

• $(\text{Lin}(X), \cup, \cup^{*}, \bot, \emptyset)$

• tracks contributing tuples [Cui et al. ’00]

• $(\text{Why}(X), \cup, \Cup, \emptyset, \{\emptyset\})$

• $\Cup$: pairwise union of subsets, tracks contributing tuples in alternative derivations [Buneman et al. ’01]

### Provenance Specialization

• Key property needed for applications like deletion propagation, trust management, cost computation, …

• Prov(X) specializes correctly to K if any valuation $v : X \to K$ extends uniquely to a homomorphism $h_v : \text{Prov}(X) \to K$ (which correctly maps $+$, $\cdot$ of Prov(X) to those of K)
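For provenance written as a sum of products of variables, the homomorphism $h_v$ simply replaces each variable by its value and each operation by the target semiring's operation. The sketch below applies this to the provenance of the asthma query; the concrete valuations (deleted tuple, multiplicities, costs) are hypothetical values chosen for illustration.

```python
import math
from functools import reduce

# Target semirings as (plus, times, zero, one).
BOOLEAN = (lambda a, b: a or b, lambda a, b: a and b, False, True)
BAG = (lambda a, b: a + b, lambda a, b: a * b, 0, 1)
TROPICAL = (min, lambda a, b: a + b, math.inf, 0)

# Provenance of the asthma query: x1*y1*z1 + x1*y2*z2 + x2*y3*z2.
PROVENANCE = [("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x2", "y3", "z2")]
VARS = ["x1", "x2", "y1", "y2", "y3", "z1", "z2"]


def specialize(polynomial, semiring, valuation):
    """Apply h_v: map variables through `valuation`, + and * to the semiring's."""
    plus, times, zero, one = semiring
    total = zero
    for monomial in polynomial:
        term = reduce(times, (valuation[v] for v in monomial), one)
        total = plus(total, term)
    return total


# Deletion propagation: tuple x2 is deleted, all others are kept.
kept = {v: True for v in VARS}
kept["x2"] = False
still_true = specialize(PROVENANCE, BOOLEAN, kept)

# Bag semantics: tuple x1 occurs twice, all others once.
counts = {v: 1 for v in VARS}
counts["x1"] = 2
multiplicity = specialize(PROVENANCE, BAG, counts)

# Cost computation with hypothetical per-tuple costs.
costs = {"x1": 1, "x2": 0, "y1": 2, "y2": 5, "y3": 1, "z1": 0, "z2": 3}
min_cost = specialize(PROVENANCE, TROPICAL, costs)
```

The same stored expression serves all three applications; only the valuation and the target semiring change.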

### Provenance Semiring Hierarchy

[Figure: hierarchy of provenance semirings. Nodes: N[X], N (bag), Sorp(X), Why(X), Tropical, PosBool(X), Lin(X), Security, Boolean (set). Legend labels: “Less informative”, “Defined later”, “Specializes correctly”]

### Datalog Provenance for General Semirings

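PosBool(X):

$$\bigvee_{\text{trees } \tau} \; \bigwedge_{\text{leaves } t \text{ of } \tau} \text{Annot}(t)$$

General Prov(X):

$$\sum_{\text{trees } \tau} \; \prod_{\text{leaves } t \text{ of } \tau} \text{Annot}(t)$$

with the sum taken using $+_K$ and the product using $\cdot_K$

• Infinite sums should be well-defined

• Need to consider “$\omega$-continuous semirings” and “$\omega$-continuous homomorphisms”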

### Provenance Semiring Hierarchy

N[[X]] and N

Finite so

-continuous

N[X]

N[[X]] : Most informative

provenance semiring

[Green et al. ’07]

N (bag)

Sorp(X)

Why(X)

Tropical

PosBool(X)

Lin(X)

Security

Boolean (set)

### How good is N[[X]] w.r.t. Size of Datalog Provenance?

• Poly-size overhead is not valid because of the infinite sum

• But can outputs have finite annotations (with X, $\cdot$, $+$) that specialize correctly to semirings with finite domains?

Theorem:

It is not possible to annotate the output of datalog programs following N[[X]]-semantics with finite provenance expressions that specialize “correctly” to the semiring Why(X).

• Finite annotations won’t specialize correctly to Why(X)

Theorem:

However, we can generate poly-size circuits in poly-time directly for Why(X)

• Need more levels in the circuit from system of equations

• Need a different argument for correctness

### Can we still have a good general semiring w.r.t. size?

• We propose Sorp(X)

• Most general absorptive semiring

• $a + a \cdot b = a$

• Like N[X], but keeps only the monomials that are not “absorbed” by others

• e.g. $pq + p^2q^3 = pq$, while $p^2q + pq^2 = p^2q + pq^2$ (neither monomial absorbs the other)

• The same algorithm, proof, and optimizations to construct poly-size circuits hold

• Circuits are more general than Boolean circuits

• Specializes correctly to interesting semirings

• Outputs can be annotated by poly-size circuits

[Figure: the provenance semiring hierarchy. Nodes: N[X], N (bag), Sorp(X), Why(X), Tropical, PosBool(X), Lin(X), Security, Boolean (set)]
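Absorption on polynomials can be sketched by treating each monomial as a map from variables to exponents: a monomial is absorbed when another monomial divides it. The representation and the function names below are assumptions made for illustration; coefficients are not tracked, since $a + a = a$ follows from $a + a \cdot 1 = a$.

```python
def divides(m1, m2):
    """m1 divides m2 when each exponent in m1 is at most the one in m2."""
    return all(m2.get(v, 0) >= e for v, e in m1.items())


def absorb(monomials):
    """Keep only the monomials that no other monomial absorbs (a + a*b = a)."""
    unique = []
    for m in monomials:
        if m not in unique:               # a + a = a: duplicates collapse
            unique.append(m)
    return [m for m in unique
            if not any(o != m and divides(o, m) for o in unique)]


# pq + p^2 q^3: the second monomial is a multiple of the first.
first = absorb([{"p": 1, "q": 1}, {"p": 2, "q": 3}])

# p^2 q + p q^2: neither monomial divides the other.
second = absorb([{"p": 2, "q": 1}, {"p": 1, "q": 2}])
```

Unlike PosBool(X), exponents matter here, which is why the second polynomial keeps both monomials.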

### Related Work

• Data Provenance

• e.g. [Cui et al. ’00, Buneman et al. ’08, Cheney et al. ’09, Benjelloun et al. ’08]

• Circuits

• Circuit complexity (size, depth, parallelism) has been studied for decades, e.g. [Arora-Barak ’09] (book)

• Provenance for Datalog

• System of equations, derivation trees, infinite sum [Grahne’91, Green et al. ’07]

• Poly-size c-tables with Boolean formulas for datalog with contradictions [Abiteboul et al. 2014]

### Conclusions

• Circuits to represent and store Datalog Provenance

• for PosBool(X) and other semirings

• Semantics, Algorithms, Limitations, Applicability

• Preliminary experiments support our results

• we compared circuits for deletion propagation with iteratively solving system of equations and reevaluation of datalog from scratch

• Future Work:

• A complete implementation, evaluation, new applications

Thank You

Questions?