
Circuits for Datalog Provenance


Presentation Transcript

### Circuits for Datalog Provenance

Daniel Deutch (Tel Aviv Univ.)

Tova Milo (Tel Aviv Univ.)

Sudeepa Roy (Univ. of Washington)

Val Tannen (Univ. of Pennsylvania)

A Simple Example of Data Provenance

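- “Boolean Provenance/Lineage” as a Boolean formula
- Q is true on D $\Leftrightarrow$ $F_{Q,D}$ is true
- Poly-size, poly-time computable (data complexity)
- But Q is an RA+ query
- This talk: What if Q is a Datalog program?

[Figure: database D, whose tuples are annotated with the variables x1, x2, y1, y2, y3, z1, z2]

Boolean query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$

$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$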

Motivation

- Provenance
- Reliability and repeatability
- View management and deletion propagation
- Trust and security management
- Query answering in probabilistic databases, …

- Datalog
- Datalog is popular again! (two keynotes this ICDT/EDBT)
- Data extraction on the Web, declarative networking
- Academic/commercial systems (Webdamlog, LogicBlox, Dedalus, Dyna)

- Finding suitable “Provenance for Datalog” is important
- Both from theoretical and practical viewpoints

- How do we compute, store, and interpret provenance for datalog programs efficiently and effectively?

Overview of Our Results

- Can we get poly-size Boolean formulas for datalog provenance?
No, even if we allow unbounded time

- Do we have a solution?
Yes! Use Boolean Circuits!

- What about general “provenance semirings” beyond Boolean provenance? Ref. [Green et al. ’07]
It depends on the semiring

Outline

- Background
- Circuits for Boolean Provenance
- Circuits for General Provenance Semirings

Datalog

- Datalog program for Transitive Closure and Single-source Reachability
- EDB (base) relation for edges: R
- IDB (derived) relations
- Transitive closure (T)
- Single-source reachability from vertex ‘a’ (S)

T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

EDB = Extensional Database (base relations)

IDB = Intensional Database (derived relations)
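The program above can be run by naive bottom-up evaluation. The following is a minimal Python sketch, not something taken from the slides: the edge set `R` and the helper name `evaluate` are made up for illustration, and no particular evaluation strategy is prescribed by the talk at this point.

```python
def evaluate(R, source="a"):
    """Naive bottom-up evaluation of the transitive-closure program."""
    T = set(R)                                   # T(x, y) :- R(x, y)
    while True:
        # T(x, y) :- R(x, z), T(z, y)
        new = {(x, y) for (x, z) in R for (z2, y) in T if z == z2}
        if new <= T:                             # nothing new: fixpoint reached
            break
        T |= new
    S = {y for (x, y) in T if x == source}       # S(x) :- T(a, x)
    return T, S

# Hypothetical EDB relation: edges a->a, a->b, b->c.
R = {("a", "a"), ("a", "b"), ("b", "c")}
T, S = evaluate(R)
print(sorted(T))
print(sorted(S))
```

Here `T` holds the IDB relation for transitive closure and `S` the vertices reachable from `a`.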

Boolean Provenance PosBool(X)-Database

- Tuples are annotated with variables from a set X
- Here X = {x1, x2, y1, y2, ….}

- For n tuples annotated with variables from X, there are $2^n$ possible worlds, one per assignment $\nu : X \to \{\text{True}, \text{False}\}$

- Useful in query evaluation on incomplete or probabilistic databases

[Figure: PosBool(X)-database D, whose tuples are annotated with the variables x1, x2, y1, y2, y3, z1, z2]

RA+ over PosBool(X)-Database

- Annotation propagates from input to output
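- Join = $\wedge$, Projection/Union = $\vee$

- Output tuples are annotated by a monotone Boolean formula
- $F_{Q,D}$ is the annotation of the unique output tuple

[Figure: PosBool(X)-database D, whose tuples are annotated with the variables x1, x2, y1, y2, y3, z1, z2]

RA+ query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$

$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$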
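Annotation propagation for this query can be sketched in a few lines of Python. The tuple values (`ann`, `bob`, `carl`, `dana`) are hypothetical, since the database figure did not survive in the transcript; only the annotation variables and the resulting formula come from the slide. Each join contributes a conjunction, and the projection collects the alternatives into a disjunction.

```python
# Hypothetical annotated relations: tuple -> annotation variable.
asthma = {("ann",): "x1", ("bob",): "x2"}
friend = {("ann", "carl"): "y1", ("ann", "dana"): "y2", ("bob", "dana"): "y3"}
smoker = {("carl",): "z1", ("dana",): "z2"}

def provenance():
    """Return the provenance formula as a DNF: a list of conjunctions of variables."""
    dnf = []
    for (x,), ax in asthma.items():
        for (fx, fy), ay in friend.items():
            for (y,), az in smoker.items():
                if x == fx and fy == y:          # join condition
                    dnf.append((ax, ay, az))     # join = AND of the annotations
    return dnf                                   # projection = OR over the list

F = provenance()
print(" | ".join("(" + " & ".join(c) + ")" for c in F))
```

With these made-up tuples the three conjunctions match the three terms of the formula on the slide.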

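Two Important Properties: RA+ over PosBool(X)-Database

For every RA+ query Q, database D, and assignment $\nu$:

- (Faithful Representation) $Q(\nu(D)) = \nu(Q(D))$
- (Poly-size overhead) The size of $F_{Q,D}$ is polynomial in |D|, and $F_{Q,D}$ can be computed in poly-time.

[Figure: PosBool(X)-database D with a sample True/False assignment to the variables x1, x2, y1, y2, y3, z1, z2. Under this assignment, both Q evaluated on the resulting world and $F_{Q,D}$ evaluate to False.]

RA+ query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$

$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$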
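The faithful-representation property can be checked by brute force on a small instance. The sketch below assumes a hypothetical database whose tuple values are invented (the original figure is lost) and compares, for every assignment, the query evaluated on the corresponding world with the formula evaluated under that assignment.

```python
from itertools import product

# Hypothetical annotated relations: tuple -> annotation variable.
asthma = {"ann": "x1", "bob": "x2"}
friend = {("ann", "carl"): "y1", ("ann", "dana"): "y2", ("bob", "dana"): "y3"}
smoker = {"carl": "z1", "dana": "z2"}
F = [("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x2", "y3", "z2")]  # provenance DNF

def query_on_world(nu):
    """Build the world selected by nu, then evaluate Q on it directly."""
    A = {t for t, v in asthma.items() if nu[v]}
    Fr = {t for t, v in friend.items() if nu[v]}
    Sm = {t for t, v in smoker.items() if nu[v]}
    return any(x in A and y in Sm for (x, y) in Fr)

def formula_under(nu):
    """Evaluate the provenance formula under nu."""
    return any(all(nu[v] for v in conj) for conj in F)

variables = sorted(set(asthma.values()) | set(friend.values()) | set(smoker.values()))
worlds = [dict(zip(variables, bits))
          for bits in product([False, True], repeat=len(variables))]
faithful = all(query_on_world(nu) == formula_under(nu) for nu in worlds)
print(len(worlds), faithful)
```

Enumerating all worlds is only feasible for tiny examples; the point of the formula is that it answers the same question without rebuilding each world.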

Datalog over PosBool(X) Database

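T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

- Semantics using Derivation Trees (Green et al. 2007)
- Annotation of T(a, b):

[Figure: graph on vertices a and b with edge R(a, a) annotated p and edge R(a, b) annotated q, next to the derivation trees of T(a, b); each tree uses R(a, a) zero or more times and ends in the leaf R(a, b)]

$\mathrm{Annot}(T(a,b)) = \bigvee_{\text{trees } \tau} \; \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t) = q \vee (p \wedge q) \vee (p \wedge p \wedge q) \vee \cdots = q$

- Infinitely many trees
- But the annotation always has a finite equivalent form
- But not necessarily poly-size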

Lower Bound: Boolean formulas for Datalog Provenance on PosBool(X)

Theorem: Given a PosBool(X)-database D and a datalog program P, the provenance of tuples in P(D) cannot, in general, have a faithful representation using Boolean formulas of size polynomial in |D|.

Proof outline:

- st-connectivity on n nodes requires monotone Boolean formulas of size $n^{\Omega(\log n)}$
- Karchmer-Wigderson, 1988

- Faithful representation requires: for all True/False assignments $\nu$ to X,
- $P(\nu(D)) = \nu(P(D))$
- Reduce to the hard instance with the right $\nu$ when P = transitive closure

Solution: Boolean Circuit!

Outline

- Background
- Circuits for Boolean Provenance or PosBool(X)
- Circuits for General Provenance Semirings

Boolean Circuits

- Circuit is a DAG
- can reuse common subexpressions
- Boolean formula = tree

- Leaf nodes:
- EDB vars in X

- Internal nodes:
- $\wedge$: IDB/EDB vars used together in one derivation
- $\vee$: alternative derivations

- Roots:
- IDB vars

T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

[Figure: graph on vertices a and b with edges annotated p and q, and the circuit for $X_{T(a,b)}$ built over the leaves $X_{R(a,a)} = p$ and $X_{R(a,b)} = q$]

Upper Bound: Boolean Circuits for PosBool(X)

Theorem: Given any PosBool(X)-database D and datalog program P, the provenance of tuples in P(D) can be faithfully represented using monotone Boolean circuits of size polynomial in |D| (and the circuits can be computed in poly-time).

Proof Sketch

Two key ideas from previous work

1. Datalog provenance can be represented by a system of equations, obtained by instantiating the vars in the datalog program P to EDB/IDB tuples [Green et al. 2007]

- EDB tuples become constants, IDB tuples become variables
- Iteratively solve this system of equations
- Fixpoint = provenance for all IDB tuples

2. A system of equations with N Boolean variables can be solved in N+1 iterations [Esparza et al. 2011]

- N = #IDB tuples
- Build a circuit with N+1 layers from the system of equations
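The two ideas can be tried out on the running example (edges R(a, a) annotated p and R(a, b) annotated q). The Python sketch below iterates the system of equations N + 1 times starting from all-false. Representing PosBool(X) values as minimal DNFs (sets of sets of variables) is an assumption made here for readability; it is not the circuit construction itself, and such DNFs can grow large in general.

```python
def absorb(dnf):
    """Drop every conjunction that strictly contains another one (a + ab = a)."""
    return frozenset(c for c in dnf if not any(d < c for d in dnf))

def disj(f, g):
    return absorb(f | g)

def conj(f, g):
    return absorb(frozenset(c | d for c in f for d in g))

FALSE = frozenset()

def var(name):
    return frozenset([frozenset([name])])

# EDB annotations are constants; IDB tuples are the unknowns X[...].
p, q = var("p"), var("q")
equations = {
    "T(a,a)": lambda X: disj(p, conj(p, X["T(a,a)"])),
    "T(a,b)": lambda X: disj(q, conj(p, X["T(a,b)"])),
    "S(a)":   lambda X: X["T(a,a)"],
    "S(b)":   lambda X: X["T(a,b)"],
}

def solve(equations):
    X = {name: FALSE for name in equations}          # start from all-false
    for _ in range(len(equations) + 1):              # N + 1 rounds
        X = {name: rhs(X) for name, rhs in equations.items()}
    return X

solution = solve(equations)
print(solution["S(a)"] == p, solution["S(b)"] == q)
```

On this instance the fixpoint is already reached after two rounds; N + 1 is the worst-case bound quoted on the slide.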

Illustration

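T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

Step 1: Build the system of equations from all possible instantiations x, y, z $\in$ {a, b}

$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$

$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$

$X_{S(b)} = X_{T(a,b)}$

$X_{S(a)} = X_{T(a,a)}$

Step 2: Build a circuit with 4 + 1 layers (N = 4) …

[Figure: graph on vertices a and b with edge R(a, a) annotated p and edge R(a, b) annotated q; in the equations, p and q are constants and the X's are variables]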

Illustration

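Multiple roots for multiple IDB vars

$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$

$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$

$X_{S(b)} = X_{T(a,b)}$

$X_{S(a)} = X_{T(a,a)}$

[Figure: layered circuit. Level 0 holds the nodes $X_{T(a,a),0}$, $X_{T(a,b),0}$, $X_{S(a),0}$, $X_{S(b),0}$, all set to false, next to the EDB leaves p and q. Levels 1 and 2 hold the nodes $X_{T(a,a),i}$, $X_{T(a,b),i}$, $X_{S(a),i}$, $X_{S(b),i}$, each computed from the level below according to the equations.]

Assign leaf IDB vars to false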
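A layered circuit of this kind can be generated mechanically from the equations. The Python sketch below is one possible reading of the construction, under stated assumptions: equations are given as an OR of ANDs, gate names such as `T(a,b),3` are invented here, and N + 1 computed layers are stacked on top of an all-false level 0.

```python
# Each equation is an OR of ANDs; symbols are EDB leaves ("p", "q") or IDB variables.
equations = {
    "T(a,a)": [["p"], ["p", "T(a,a)"]],
    "T(a,b)": [["q"], ["p", "T(a,b)"]],
    "S(a)":   [["T(a,a)"]],
    "S(b)":   [["T(a,b)"]],
}

def build_circuit(equations):
    """Return (gates, roots); gates maps a gate name to (op, inputs)."""
    n = len(equations)
    gates = {f"{x},0": ("const", False) for x in equations}   # leaf IDB vars are false
    for level in range(1, n + 2):                              # layers 1 .. N+1
        for x, disjuncts in equations.items():
            and_gates = []
            for k, conj in enumerate(disjuncts):
                inputs = [f"{s},{level - 1}" if s in equations else s for s in conj]
                name = f"{x},{level},and{k}"
                gates[name] = ("and", inputs)
                and_gates.append(name)
            gates[f"{x},{level}"] = ("or", and_gates)
    roots = {x: f"{x},{n + 1}" for x in equations}
    return gates, roots

def evaluate(gates, roots, edb):
    """Evaluate gates in insertion order, which is topological here; one pass."""
    value = dict(edb)
    for name, (op, inputs) in gates.items():
        if op == "const":
            value[name] = inputs
        elif op == "and":
            value[name] = all(value[i] for i in inputs)
        else:
            value[name] = any(value[i] for i in inputs)
    return {x: value[g] for x, g in roots.items()}

gates, roots = build_circuit(equations)
result = evaluate(gates, roots, {"p": False, "q": True})
print(result)
```

The circuit size is the number of layers times the size of the system of equations, so it stays polynomial in the database size, and there is one root per IDB variable.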

Optimizations

- Store only two levels of circuit instead of N+1 levels
- Evaluate iteratively

- Embed circuit construction in semi-naïve evaluation
- Check for new derivations, not only new IDB variables
- Sound and Complete

- Remove self-dependency of IDB vars
- works for PosBool(X) and also some other semirings…

$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$

$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$

$X_{S(b)} = X_{T(a,b)}$

$X_{S(a)} = X_{T(a,a)}$

Illustration (From here…)

[Figure: the unoptimized layered circuit, with levels 0, 1, and 2 of the nodes $X_{T(a,a),i}$, $X_{T(a,b),i}$, $X_{S(a),i}$, $X_{S(b),i}$, the EDB leaves p and q, and the level-0 IDB vars set to false]

Illustration (…To here)

With all these optimizations

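[Figure: the optimized circuit with only two levels. Top level: $X_{T(a,a),\text{top}}$, $X_{S(a),\text{top}}$, $X_{T(a,b),\text{top}}$. Bottom level: $X_{S(a),\text{bottom}}$, $X_{T(a,b),\text{bottom}}$, $X_{T(a,a),\text{bottom}}$, together with the EDB leaves p and q.]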

Applications of PosBool(X)-Circuits

- Linear-time deletion propagation (in circuit size)
- Approximation for probabilistic databases
- even when only the circuit (and not the database) is available

- Circuits can be computed “offline”
- Only linear-time evaluation is required when needed (e.g. deletion propagation)
- compared to storing and solving a system of equations iteratively, or
- re-evaluating the datalog program
- Can use existing techniques for efficient and parallel circuit evaluation
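Deletion propagation on a stored circuit amounts to setting the deleted tuples' variables to False and evaluating each gate once. The Python sketch below uses a small hypothetical circuit (gate names `g1`, `out1`, `out2` and variables `p`, `q`, `r`, `s` are made up) to show the single linear pass.

```python
# Hypothetical circuit, listed in topological order: (gate, op, inputs).
# Gate g1 is shared by both outputs, which makes this a DAG rather than a tree.
circuit = [
    ("g1", "and", ["p", "q"]),
    ("out1", "or", ["g1", "r"]),
    ("out2", "and", ["g1", "s"]),
]

def propagate_deletion(circuit, edb_vars, deleted):
    """Set deleted EDB variables to False and evaluate every gate exactly once."""
    value = {v: v not in deleted for v in edb_vars}
    for gate, op, inputs in circuit:
        bits = [value[i] for i in inputs]
        value[gate] = all(bits) if op == "and" else any(bits)
    return value

after = propagate_deletion(circuit, ["p", "q", "r", "s"], deleted={"q"})
print(after["out1"], after["out2"])
```

In this toy case deleting the tuple annotated q removes the output tuple `out2` but keeps `out1`, which has an alternative derivation through r.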

Outline

- Background
- Circuits for Boolean Provenance or PosBool(X)
- Circuits for General Provenance Semirings

Commutative Semirings

- $(K, +_K, \cdot_K, 0_K, 1_K)$
- domain K
- $+_K$, $\cdot_K$: associative, commutative, with neutral elements $0_K$ and $1_K$ respectively
- $\cdot_K$ distributes over $+_K$, i.e. $a \cdot_K (b +_K c) = a \cdot_K b +_K a \cdot_K c$
- $0_K$ annihilates every element of K, i.e. $a \cdot_K 0_K = 0_K \cdot_K a = 0_K$

Examples:

- $(\mathbb{B}, \vee, \wedge, \text{False}, \text{True})$
- Set semantics

- $(\mathbb{N}, +, \cdot, 0, 1)$
- Bag semantics

- $(\mathbb{N} \cup \{\infty\}, \min, +, \infty, 0)$
- Tropical semiring to compute cost (e.g. cost of a shortest path)
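The three example semirings can be written down directly. The Python sketch below is illustrative only: the `Semiring` class and the helper `path_annotation` are names introduced here, and the helper computes a sum over alternatives of products over jointly used annotations.

```python
from dataclasses import dataclass
from typing import Any, Callable
import math

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
COUNTING = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)

def path_annotation(K, paths):
    """Sum (in K) over alternative paths of the product (in K) of edge annotations."""
    total = K.zero
    for edges in paths:
        prod = K.one
        for e in edges:
            prod = K.times(prod, e)
        total = K.plus(total, prod)
    return total

# Two alternative paths: one with two edges, one with a single edge.
print(path_annotation(TROPICAL, [[2, 3], [4]]))           # cheapest path cost
print(path_annotation(COUNTING, [[2, 3], [4]]))           # number of derivations
print(path_annotation(BOOLEAN, [[True, False], [True]]))  # does any path exist?
```

The same function gives shortest-path cost, derivation counts, or plain reachability depending only on which semiring is passed in.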

Provenance Semirings

- Generalization of PosBool(X)
- $(K, +_K, \cdot_K, 0_K, 1_K)$
- Tuples are annotated with variables from X
- K is of the form Prov(X)
- $+_K$ denotes alternative usage
- $\cdot_K$ denotes joint usage

- Examples:
- $(\mathrm{PosBool}(X), \vee, \wedge, \text{False}, \text{True})$
- $(\mathrm{Lin}(X), +, \cdot, \bot, \emptyset)$, where + and $\cdot$ are both set union on elements other than $\bot$
- tracks contributing tuples [Cui et al. ’00]

- $(\mathrm{Why}(X), \cup, \Cup, \emptyset, \{\emptyset\})$
- $\Cup$: pairwise union of subsets, tracks contributing tuples in alternative derivations [Buneman et al. ’01]

Provenance Specialization

- Key property needed for applications like deletion propagation, trust management, cost computation, …
- Prov(X) specializes correctly to K if any valuation $v : X \to K$ extends uniquely to a homomorphism $h_v : \mathrm{Prov}(X) \to K$ (which correctly maps + and $\cdot$ of Prov(X) to those of K)

- Further, some provenance semirings are “more informative” than others
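Specialization can be illustrated with provenance polynomials in N[X]. In the Python sketch below, the polynomial, the valuations, and the function name `specialize` are all hypothetical; the function applies the homomorphism induced by a valuation by replacing variables with their values and the polynomial's + and product with the target semiring's operations.

```python
import math

# A polynomial in N[X]: {monomial: coefficient}, monomial = tuple of (variable, exponent).
# Hypothetical provenance polynomial 2*p*q + p^2.
poly = {(("p", 1), ("q", 1)): 2, (("p", 2),): 1}

def specialize(poly, valuation, plus, times, zero, one):
    """Apply the homomorphism induced by `valuation` to a polynomial of N[X]."""
    total = zero
    for monomial, coeff in poly.items():
        term = one
        for var, exp in monomial:
            for _ in range(exp):
                term = times(term, valuation[var])
        for _ in range(coeff):            # a coefficient n means an n-fold sum in K
            total = plus(total, term)
    return total

# Bag semantics: p has multiplicity 2, q has multiplicity 3.
bag = specialize(poly, {"p": 2, "q": 3}, lambda a, b: a + b, lambda a, b: a * b, 0, 1)
# Tropical semiring: p costs 2, q costs 3.
cost = specialize(poly, {"p": 2, "q": 3}, min, lambda a, b: a + b, math.inf, 0)
# Boolean semiring: the tuple annotated q has been deleted.
kept = specialize(poly, {"p": True, "q": False},
                  lambda a, b: a or b, lambda a, b: a and b, False, True)
print(bag, cost, kept)
```

One stored provenance expression thus answers a multiplicity question, a cost question, and a deletion question without re-running the query.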

Provenance Semiring Hierarchy

[Figure: hierarchy of provenance semirings, running from more informative to less informative; an edge means “specializes correctly”. Nodes: N[X], N (bag), Sorp(X) (defined later), Why(X), Tropical, PosBool(X), Lin(X), Security, Boolean (set).]

Datalog Provenance for General Semirings

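PosBool(X): annotation of an output tuple $= \bigvee_{\text{trees } \tau} \; \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$

General Prov(X): annotation of an output tuple $= \sum_{\text{trees } \tau} \; \prod_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$, with the sum taken in $+_K$ and the product in $\cdot_K$

- Infinite sums should be well-defined
- Need to consider “$\omega$-continuous semirings” and “$\omega$-continuous homomorphisms”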

Provenance Semiring Hierarchy

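[Figure: the provenance semiring hierarchy, extended with $\mathbb{N}[[X]]$ and $\mathbb{N}^{\infty}$, which need to be added; a label marks the finite semirings as $\omega$-continuous. Other nodes: N[X], N (bag), Sorp(X), Why(X), Tropical, PosBool(X), Lin(X), Security, Boolean (set).]

N[[X]]: most informative provenance semiring [Green et al. ’07]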

How good is N[[X]] w.r.t. Size of Datalog Provenance?

- Poly-size overhead does not hold because of the infinite sums
- But can outputs have finite annotations (built from X, $\cdot$, +) that specialize correctly to semirings with finite domains?

Theorem: It is not possible to annotate the output of datalog programs, following the N[[X]] semantics, with finite provenance expressions that specialize “correctly” to the semiring Why(X).

Finite annotations won’t specialize correctly to Why(X)

Theorem:

However, we can generate poly-size circuits in poly-time directly for Why(X)

- Need more levels in the circuit from system of equations
- Need a different argument for correctness

Can we still have a good general semiring w.r.t. size?

- We propose Sorp(X)
- Most general absorptive semiring
- $a + a \cdot b = a$

- N[X], but keep only the monomials that are not “absorbed” by the others
- e.g. $pq + p^2q^3 \equiv pq$, whereas $p^2q + pq^2 \equiv p^2q + pq^2$

- The same algorithm, proof, and optimizations to construct poly-size circuits hold
- Circuits are more general than Boolean circuits

- Specializes correctly to interesting semirings
- Outputs can be annotated by poly-size circuits
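The absorption rule can be made concrete on the two examples above. The Python sketch below represents a monomial as a dict from variable to exponent and drops coefficients (absorption with b = 1 gives a + a = a); the names `divides` and `absorb` are introduced here for illustration and are not claimed to be the authors' implementation.

```python
def divides(m1, m2):
    """True if monomial m1 divides m2 (every exponent of m1 is <= that in m2)."""
    return all(m2.get(v, 0) >= e for v, e in m1.items())

def absorb(monomials):
    """Keep only monomials not absorbed by a different one, using a + a*b = a."""
    unique = []
    for m in monomials:
        if m not in unique:               # a + a = a
            unique.append(m)
    return [m for m in unique
            if not any(d != m and divides(d, m) for d in unique)]

# pq + p^2 q^3: the second monomial is a multiple of the first, so it is absorbed.
first = absorb([{"p": 1, "q": 1}, {"p": 2, "q": 3}])
# p^2 q + p q^2: neither monomial divides the other, so both stay.
second = absorb([{"p": 2, "q": 1}, {"p": 1, "q": 2}])
print(first)
print(second)
```

Unlike PosBool(X), exponents are kept, which is why the second polynomial is not simplified any further.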

Provenance Semiring Hierarchy

[Figure: the provenance semiring hierarchy with nodes N[X], N (bag), Sorp(X), Why(X), Tropical, PosBool(X), Lin(X), Security, Boolean (set)]

Related Work

- Data Provenance
- e.g. [Cui et al. ’00, Buneman et al. ’08, Cheney et al. ’09, Benjelloun et al. ’08]

- Circuits
- Circuit complexity (size, depth, parallelism) has been studied for decades, e.g. [Arora-Barak ’09] (book)

- Provenance for Datalog
- System of equations, derivation trees, infinite sum [Grahne’91, Green et al. ’07]
- Poly-size c-tables with Boolean formulas for datalog with contradictions [Abiteboul et al. 2014]

Conclusions

- Circuits to represent and store Datalog Provenance
- for PosBool(X) and other semirings
- Semantics, Algorithms, Limitations, Applicability
- Preliminary experiments support our results
- we compared circuits for deletion propagation with iteratively solving system of equations and reevaluation of datalog from scratch

- Future Work:
- A complete implementation, evaluation, new applications

Questions?
