On Provenance of Queries on Linked Web Data

Yannis Theoharis (1,2), Irini Fundulaki (2), Grigoris Karvounarakis (3,2) and Vassilis Christophides (1,2)
(1) Institute of Computer Science, FORTH; (2) Computer Science Department, University of Crete; (3) LogicBox, USA

What is “Linked Data”?
• W3C Linking Open Data:
  • publish various open datasets as RDF on the Web
  • set typed RDF links between data items from different data sources

Motivation: Linked Data Processing
• Data is:
  • fetched from heterogeneous sources
  • integrated
  • materialized in RDF
  • made available via SPARQL
• Range of computations:
  • SPARQL queries
  • Complex programs (logic or procedural)

Provenance Aware Applications
• Trust assessment: trustworthiness
• Access control: confidentiality level
• Data cleaning: validity
• Curated databases: source data origin
• All these applications need to represent and store the relation between the input and the output of data processes
• Slide callouts: “gain efficiency”, “impossible without provenance”

Data Provenance Models
• Annotation Models:
  • annotation computation is coupled with a particular application and a particular assignment of source data annotations
  • [Figure: R1 ⋈ R2 computed from R1 and R2, with tuples annotated t (trusted) or f (untrusted); callout: “query recomputation!”]
• Abstract Provenance Models:
  • abstract provenance tokens and operators are substituted by appropriate concrete tokens for a particular application and assignment
  • [Figure: R1 ⋈ R2 with result annotations such as t ∧ f and t ∧ t]

This Talk
• “Can previous work on abstract provenance models be leveraged for SPARQL?”
  • NO: due to the OPTIONAL operator (similar to the SQL left outer join)
  • YES: for the positive fragment of SPARQL (without OPTIONAL)
• We present our ongoing work on a SPARQL abstract provenance model.
• Challenge: to capture the form of negation that OPTIONAL introduces

Outline
• SPARQL algebra
• Abstract Provenance Models for Positive SPARQL
• Limitations of Previous Models
• Towards a SPARQL Provenance Model

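SPARQL (1/2)
• SPARQL: W3C Recommendation language to query RDF data.
• [Figure: SPARQL query processing. Triple patterns such as (?x, ?y, e), built from variables (?x, ?y) and constants (e), produce mappings, e.g. Ω1 = {μ1, μ2} with μ1 = {(?x,d),(?y,b)} and μ2 = {(?x,f),(?y,g)}; mappings go through Compose and Filter and are then consumed by Select or Construct/Describe]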

SPARQL (2/2)
• SPARQL algebra defines 5 operators on mapping bags
  • Unary ops: π (projection), σ (selection, also called filtering)
  • Binary ops: ∪ (union), ⋈ (join), ⟕ (optional)
• Positive SPARQL (SPARQL+): the operators π, σ, ∪ and ⋈ (no optional)
• μ and μ’ are compatible (μ ~ μ’) if they agree on their common variables
• [Figure: example bags Ω, Ω1, Ω2 and the results of σ?x=a(Ω), π?x(Ω), Ω1 ∪ Ω2, Ω1 ⋈ Ω2, Ω1 \ Ω2 and Ω1 ⟕ Ω2; joined mappings are unions of compatible mappings, e.g. μ5 = μ1 ∪ μ4 and μ6 = μ3 ∪ μ4; annotations note that ?z is unbound in μ1 and give bag cardinalities such as card(μ1) = 2, card(μ2) = 1]
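
As a rough illustration of these operators, here is a minimal Python sketch in which a mapping is a `dict` from variable names to values and a bag of mappings is a plain `list`. The function names and the contents of `omega2` are assumptions made for the example (only μ1 and μ2 are taken from the slides); optional is written as the join unioned with the difference.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on their common variables."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def select(omega, predicate):
    """Selection (filtering): keep the mappings that satisfy the predicate."""
    return [m for m in omega if predicate(m)]

def project(omega, variables):
    """Projection: restrict every mapping to the given variables."""
    return [{v: m[v] for v in variables if v in m} for m in omega]

def union(omega1, omega2):
    """Bag union: duplicates are kept."""
    return omega1 + omega2

def join(omega1, omega2):
    """Join: merge every pair of compatible mappings."""
    return [{**m1, **m2} for m1 in omega1 for m2 in omega2 if compatible(m1, m2)]

def difference(omega1, omega2):
    """Difference: mappings of omega1 compatible with no mapping of omega2."""
    return [m1 for m1 in omega1 if not any(compatible(m1, m2) for m2 in omega2)]

def optional(omega1, omega2):
    """Optional (left outer join): join plus the unmatched left mappings."""
    return union(join(omega1, omega2), difference(omega1, omega2))

# mu1 and mu2 follow the slides; mu3 and mu4 are illustrative values.
omega1 = [{"?x": "d", "?y": "b"}, {"?x": "f", "?y": "g"}]
omega2 = [{"?y": "b", "?z": "c"}, {"?y": "h", "?z": "k"}]

result = optional(omega1, omega2)
```

In this example the first mapping of `omega1` is extended with a binding for `?z`, while the second one has no compatible partner and is kept with `?z` unbound.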

Outline
• SPARQL algebra
• Abstract Provenance Models for Positive SPARQL
• Limitations of Previous Models
• Towards a SPARQL Provenance Model

Abstract Provenance Models
• [Figure: the SPARQL processing pipeline (triple patterns, mappings, Compose, Filter, Select) extended with Provenance]
• Models ordered from most informative to less informative: How, Trio, Why, Lineage
• Abstract provenance models encode the query operators at different levels of detail
• Expressiveness vs efficiency (annotation storage and computation time)

Abstract Provenance Models for SPARQL+
• Previous models are defined for positive relational algebra
• Positive relational operators are monotonic
  • The addition (removal) of a tuple can only result in additional (removed) tuples in the output
• This also holds for SPARQL+ (projection, union, join)
• Previous models suffice for SPARQL+
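
To make the claim concrete, the following Python sketch annotates mappings with abstract expressions in the style of How provenance: join combines annotations with `*`, and projection merges mappings that become equal with `+`. The tuple encoding of expressions, the tokens `c1`..`c3` and the data are assumptions for this example; it is a sketch of the general semiring idea, not the implementation of any of the models named in the talk.

```python
import operator

def compatible(m1, m2):
    """Two mappings are compatible if they agree on their common variables."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def join(omega1, omega2):
    """Join annotated bags: the result needs both inputs, hence '*'."""
    return [({**m1, **m2}, ("*", p1, p2))
            for m1, p1 in omega1 for m2, p2 in omega2 if compatible(m1, m2)]

def project(omega, variables):
    """Project and merge equal mappings: alternative derivations, hence '+'."""
    merged = {}
    for m, p in omega:
        key = tuple(sorted((v, m[v]) for v in variables if v in m))
        merged[key] = ("+", merged[key], p) if key in merged else p
    return [(dict(key), p) for key, p in merged.items()]

def evaluate(expr, assignment, plus, times):
    """Interpret an abstract expression with concrete values and operators."""
    if isinstance(expr, str):
        return assignment[expr]
    op, left, right = expr
    combine = plus if op == "+" else times
    return combine(evaluate(left, assignment, plus, times),
                   evaluate(right, assignment, plus, times))

# Illustrative annotated bags (tokens c1..c3 are hypothetical).
omega1 = [({"?x": "d", "?y": "b"}, "c1"), ({"?x": "f", "?y": "b"}, "c2")]
omega2 = [({"?y": "b", "?z": "c"}, "c3")]

projected = project(join(omega1, omega2), ["?z"])
expr = projected[0][1]  # (c1 * c3) + (c2 * c3)

# Same expression, two applications: Boolean trust and derivation counting.
trusted = evaluate(expr, {"c1": True, "c2": False, "c3": True},
                   operator.or_, operator.and_)
count = evaluate(expr, {"c1": 2, "c2": 1, "c3": 1}, operator.add, operator.mul)
```

Because only `+` and `*` are involved, switching an input token to the zero of the chosen interpretation can only remove results, which is the monotonicity property stated above.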

Outline
• SPARQL algebra
• Abstract Provenance Models for Positive SPARQL
• Limitations of Previous Models
• Towards a SPARQL Provenance Model

Boolean trust assessment (SPARQL)
• Boolean trust semantics: set semantics on trusted mappings
• Ω1 = {μ1, μ2}, Ω2 = {μ3, μ4}
• Trusted: μ1, μ2, μ3, μ4: Ω1 \ Ω2 = {μ2}, Ω1 ⟕ Ω2 = {μ5, μ2}
• Trusted: μ1, μ2, μ4: Ω1 \ Ω2 = {μ1, μ2}, Ω1 ⟕ Ω2 = {μ1, μ2}
• ⟕ and \ are not monotonic:
  • μ3 becomes untrusted ⇒ μ5 becomes untrusted and μ1 becomes trusted in Ω1 ⟕ Ω2
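
The non-monotonic behaviour can be reproduced with a small Python sketch. The mapping values for μ3 and μ4 and the helper names are assumptions made for illustration; trust is applied as on the slide, by evaluating the query over the trusted mappings only.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on their common variables."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def join(omega1, omega2):
    return [{**m1, **m2} for m1 in omega1 for m2 in omega2 if compatible(m1, m2)]

def difference(omega1, omega2):
    return [m1 for m1 in omega1 if not any(compatible(m1, m2) for m2 in omega2)]

def optional(omega1, omega2):
    return join(omega1, omega2) + difference(omega1, omega2)

def trusted_only(named_bag, trust):
    """Set semantics on trusted mappings: drop every untrusted mapping."""
    return [m for name, m in named_bag if trust[name]]

# mu1, mu2 follow the slides; mu3, mu4 are illustrative values.
omega1 = [("mu1", {"?x": "d", "?y": "b"}), ("mu2", {"?x": "f", "?y": "g"})]
omega2 = [("mu3", {"?y": "b", "?z": "c"}), ("mu4", {"?y": "h", "?z": "k"})]

def trusted_result(trust):
    return optional(trusted_only(omega1, trust), trusted_only(omega2, trust))

all_trusted = trusted_result({"mu1": True, "mu2": True, "mu3": True, "mu4": True})
mu3_untrusted = trusted_result({"mu1": True, "mu2": True, "mu3": False, "mu4": True})
```

Removing trust from μ3 makes the joined mapping (d, b, c) disappear, but it also makes (d, b, -) appear: an input was removed and a new output was produced, which a monotonic operator could never do.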

Perm
• [Figure: Ω1 = {μ1, μ2}, Ω2 = {μ3, μ4}, Ω1 ⟕ Ω2 and Ω1 \ Ω2]
• Intuitively, (f, g) is in Ω1 \ Ω2 because it is compatible with neither μ3 nor μ4
• (d, b, c) is in Ω1 ⟕ Ω2 due to the join between μ1 and μ3
• If μ3 becomes untrusted, Perm
  • infers that (d, b, c) becomes untrusted, but
  • cannot infer that (d, b, -) should become trusted

RDF Meta Knowledge & M-semirings
• [Figure: Ω1 = {μ1, μ2}, Ω2 = {μ3, μ4}, Ω1 \ Ω2 and Ω1 ⟕ Ω2 = {μ5, μ2}, with t/f trust annotations on each mapping]
• Like Perm, RDF Meta Knowledge and M-semirings infer that μ5 is untrusted but cannot infer that μ1: (d, b, -) is trusted.

Outline
• SPARQL algebra
• Abstract Provenance Models for Positive SPARQL
• Limitations of Previous Models
• Towards a SPARQL Provenance Model

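A Third Operation for Compatibility (1/2)
• Take care of compatible mappings
• Only one of μ1, μ5 can appear in the result
• Keep provenance information for both of them!
• Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) ∪ (Ω1 \ Ω2)
• A(μ1, μ3) = f, if μ1 ~ μ3 and c3 = t; t, else
• [Figure: Ω1 = {μ1, μ2} and Ω2 = {μ3, μ4} with trust annotations; in Ω1 ⟕ Ω2 = {μ5, μ1, μ2}, μ5 is annotated t ∧ t = t or t ∧ f = f depending on the annotation of μ3, μ2 is annotated t, and the annotation of μ1 is marked “?”]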

A Third Operation for Compatibility (2/2)
• A is a binary operator on mappings
• It determines whether the mapping exists in the result or not
• If yes, its provenance equals the positive provenance part, e.g. c1 for c1 * A(μ1, μ3)
• In general, for Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) ∪ (Ω1 \ Ω2) = {μ5, μ1, μ2}:
  • A(μ1, μ3) = 0, if μ1 ~ μ3 and c3 ≠ 0; 1, else
  • 0: the neutral element for +
  • 1: the neutral element for *
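
The definition of A can be tried out with a short Python sketch. Two things here are assumptions rather than statements from the slides: annotations are interpreted as the integers 0 and 1, and a mapping of Ω1 is checked against every mapping of Ω2 by multiplying one A factor per mapping (the slides only show a single factor, A(μ1, μ3)). The mapping values for μ3 and μ4 and the function names are likewise illustrative.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on their common variables."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def A(m1, m2, c2_value):
    """0 if m1 ~ m2 and the annotation of m2 is non-zero, 1 otherwise."""
    return 0 if compatible(m1, m2) and c2_value != 0 else 1

def difference_provenance(m1, c1, omega2, assignment):
    """Value of c1 * A(m1, m') taken over all annotated mappings m' of omega2.

    Multiplying one A factor per mapping of omega2 is an assumption of
    this sketch.
    """
    value = assignment[c1]
    for m2, c2 in omega2:
        value *= A(m1, m2, assignment[c2])
    return value

mu1 = {"?x": "d", "?y": "b"}
mu2 = {"?x": "f", "?y": "g"}
omega2 = [({"?y": "b", "?z": "c"}, "c3"), ({"?y": "h", "?z": "k"}, "c4")]

all_present = {"c1": 1, "c2": 1, "c3": 1, "c4": 1}
c3_removed = {"c1": 1, "c2": 1, "c3": 0, "c4": 1}

mu1_before = difference_provenance(mu1, "c1", omega2, all_present)
mu1_after = difference_provenance(mu1, "c1", omega2, c3_removed)
mu2_before = difference_provenance(mu2, "c2", omega2, all_present)
```

With every token set to 1, μ1 is blocked by the compatible μ3 and gets 0; once c3 is set to 0 the A factor becomes 1 and μ1 keeps its positive part c1, which is exactly the inference the earlier models could not make.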

SPARQL Provenance Operators
• Two types of operators:
  • on provenance tokens, i.e. + and * (for SPARQL+)
  • on mappings, i.e. A (for ⟕ and \)
• Good news: every triple of the dataset is uniquely annotated.
• Why not use annotations as mapping identifiers in A?
  • Due to the projection operator…

Enrich Tokens with Schema Information
• Use tokens (c1, c2, …) as mapping ids in A expressions
• But μ1 ≁ μ2 might hold, while π?y,?z(μ1) ~ π?y,?z(μ2)
• Tokens don’t suffice: keep token-schema pairs
• On tokens only: A(c1, c2) = 0, if μ1 ~ μ2 and c2 ≠ 0; 1, else
• On token-schema pairs: A((c1, S1), (c2, S2)) = 0, if πS1(μ1) ~ πS2(μ2) and c2 ≠ 0; 1, else
• [Figure: Ω = {μ1, μ2} and its projection π?y,?z(Ω)]
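
A small Python sketch shows why the schema has to travel with the token. Everything concrete here is an assumption for illustration: the `store` dictionary that resolves a token to the source mapping it annotates, the mapping values, and the 0/1 interpretation of annotations.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on their common variables."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def project(mapping, schema):
    """Restrict a single mapping to the variables of the schema."""
    return {v: mapping[v] for v in schema if v in mapping}

# Hypothetical store: each token identifies the source mapping it annotates.
store = {
    "c1": {"?x": "a", "?y": "b", "?z": "c"},
    "c2": {"?x": "e", "?y": "b", "?z": "c"},
}

def A(pair1, pair2, assignment):
    """A on token-schema pairs: compatibility is tested on the projections."""
    (c1, s1), (c2, s2) = pair1, pair2
    clash = compatible(project(store[c1], s1), project(store[c2], s2))
    return 0 if clash and assignment[c2] != 0 else 1

full = ("?x", "?y", "?z")
projected = ("?y", "?z")
assignment = {"c1": 1, "c2": 1}

on_full_schema = A(("c1", full), ("c2", full), assignment)
on_projected_schema = A(("c1", projected), ("c2", projected), assignment)
```

The two source mappings disagree on `?x`, so A evaluates to 1 on the full schema, but after projecting `?x` away they become compatible and A evaluates to 0. A token alone cannot distinguish these two situations.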

Towards a SPARQL Provenance Model
• Define an algebra on token-schema pairs
  • 3 operations
    • 2 for SPARQL operators
    • 1 for compatibility
• What if there is no projection (or projection is not allowed to be pushed down)?
  • annotations suffice (no need for schema information),
  • still in need of the compatibility operator
• What if there is no OPTIONAL?
  • previous models suffice, e.g. How

Future Work
• SPARQL Provenance Model
• Extend model expressiveness to capture other computations on Linked Data
• Logic explanations
• Implementation

Questions ?
