Using Datalog for Rule-Based Reasoning over Web Data: Challenges and Next Steps

Axel Polleres, Digital Enterprise Research Institute, National University of Ireland, Galway. Joint work with Aidan Hogan, Andreas Harth, Stefan Decker.

This talk is about… the Semantic Web… … in particular: • practical Web Reasoning & how/why we apply Datalog there Misquoting Jim Hendler: “A Little Datalog goes a long way”

The Web of Data Structured Knowledge on the Web… … in the order of Billions of statements … growing fast! … March 2008 March 2009

Search Engines for the Web of Data • Promise: • … query answering over RDF Web data • Typical assumptions for Search engines remain: • expected sub-second response times • obvious “garbage” should be filtered/ignored

Simplified “added value” proposition of Semantic Search… “explicit” data: RDF. “implicit” data? Via inference using OWL2, RDF Schema! Fig 1: RDF Web Dataset

Problem: Synonymous Omissions
Query: Give me books written by Monica Lam?
  amazon:MSLam foaf:made ?Book .
Data:
  amazon:MSLam foaf:made amazon:Compilers .
  dblp:M_S_Lam foaf:made dblp:SystArrayOptCompilers .

Problem: Different “Ontologies” used
Query: Give me books written by Monica Lam?
  amazon:MSLam foaf:made ?Book .
Data:
  amazon:Compilers dc:creator ex:MSLam .

Solution: Publish Complete Data?
Query: Give me books written by Monica Lam?
  amazon:MSLam foaf:made ?Book .
Data:
  amazon:Compilers dc:creator amazon:MSLam .
  amazon:MSLam foaf:made amazon:Compilers .

Solution: Ask query in all possible ways?
Query: Give me books written by Monica Lam?
  amazon:MSLam foaf:made ?Book .
  UNION
  ?Book dc:creator amazon:MSLam .
Data:
  amazon:Compilers dc:creator ex:MSLam .

Solution: Exploit OWL and RDFS…
Query: Give me books written by Monica Lam?
  amazon:MSLam foaf:made ?Book .
Data:
  amazon:Compilers dc:creator amazon:MSLam .
  dc:creator owl:inverseOf foaf:made .
  dblp:M_S_Lam foaf:made dblp:SystArrayOptCompilers .
  amazon:MSLam owl:sameAs dblp:M_S_Lam .
Inferred:
  amazon:MSLam foaf:made amazon:Compilers .
  amazon:MSLam foaf:made dblp:SystArrayOptCompilers .

Inference over OWL and RDFS…
• Two of the “mainstream” directions:
  • DL fragments of OWL/RDFS: OWL Lite, OWL DL, OWL2DL, etc.
    • Reduce Web Data to DL facts (A-Box) and terminological axioms (T-Box)
    • Use DL reasoner to answer queries
  • Datalog-reducible fragments of OWL: RDFS, DLP, pD*, OWL2RL, …
    • Encode semantics of OWL/RDFS into Datalog rules
    • Both assertional and terminological knowledge remain just facts.
    • Apply fwd- or bwd-chaining inference
  amazon:Compilers dc:creator amazon:MSLam .
  amazon:MSLam foaf:made amazon:SystArrayOptCompilers .
  amazon:MSLam owl:sameAs dblp:M_S_Lam .
  foaf:made rdfs:domain foaf:Person .
  dc:creator owl:inverseOf foaf:made .

Inference over OWL and RDFS…
• Two of the “mainstream” directions:
  • Datalog-reducible fragments of OWL: RDFS, OWL-, DLP, pD*, OWL2RL, …
Facts:
  amazon:Compilers dc:creator ex:MSLam .
  amazon:MSLam foaf:made amazon:SystArrayOptCompilers .
  amazon:MSLam owl:sameAs ex:MSLam .
  foaf:made rdfs:domain foaf:Person .
  dc:creator owl:inverseOf foaf:made .
Rules (head :- body):
  ?s rdf:type ?c . :- ?p1 rdfs:domain ?c . ?s ?p1 ?o .
  ?o ?p2 ?s . :- ?p1 owl:inverseOf ?p2 . ?s ?p1 ?o .
  ?p2 owl:inverseOf ?p1 . :- ?p1 owl:inverseOf ?p2 .
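How the three rules above fire under forward chaining can be sketched in a few lines of Python. This is only an illustration: it uses three of the slide's facts, and the function names `apply_rules` and `closure` are made up for this sketch, not part of any reasoner mentioned in the talk.

```python
# Triples are (subject, predicate, object) tuples of prefixed names.
FACTS = {
    ("amazon:Compilers", "dc:creator", "ex:MSLam"),
    ("foaf:made", "rdfs:domain", "foaf:Person"),
    ("dc:creator", "owl:inverseOf", "foaf:made"),
}


def apply_rules(triples):
    """Apply each of the three rules once; return everything they derive."""
    derived = set()
    for (p1, rel, x) in triples:
        if rel == "rdfs:domain":
            # ?s rdf:type ?c . :- ?p1 rdfs:domain ?c . ?s ?p1 ?o .
            derived |= {(s, "rdf:type", x) for (s, p, o) in triples if p == p1}
        elif rel == "owl:inverseOf":
            # ?o ?p2 ?s . :- ?p1 owl:inverseOf ?p2 . ?s ?p1 ?o .
            derived |= {(o, x, s) for (s, p, o) in triples if p == p1}
            # ?p2 owl:inverseOf ?p1 . :- ?p1 owl:inverseOf ?p2 .
            derived.add((x, "owl:inverseOf", p1))
    return derived


def closure(triples):
    """Naive forward chaining: repeat until no new triple is derived."""
    closed = set(triples)
    while True:
        new = apply_rules(closed) - closed
        if not new:
            return closed
        closed |= new


INFERRED = closure(FACTS) - FACTS
for triple in sorted(INFERRED):
    print(" ".join(triple), ".")
```

The inverse rule first yields `ex:MSLam foaf:made amazon:Compilers`, which in turn lets the domain rule type `ex:MSLam` as a `foaf:Person`; that chaining is what the fixpoint loop is for.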

Web Reasoning: Rule-Based Approach. Why do we focus on the Datalog approach? • Massive A-Box/fact base • Popular Web ontologies (T-Box) are fairly small/inexpressive… • FWD-chaining shall allow storing/indexing implicit answers for quick retrieval… Hope: Datalog/rules scale well for instance retrieval; OWL2RL is enough for most Web ontologies … but how feasible is that?

Web Reasoning: Observations & Challenges. Scalability: • Massive A-Box: tens of billions of statements (for the moment) • Near-linear scale required. Noisy data: • Inconsistencies galore • Noisy data • “Ontology hijacking”

(Accidental) Inconsistencies…
FOAF Ontology:
  foaf:Person owl:disjointWith foaf:Organization .
  foaf:homepage rdf:type owl:InverseFunctionalProperty .
W3.org:
  W3C foaf:homepage <http://www.w3.org> .
  W3C rdf:type foaf:Organization .
Source1 (faulty):
  TimBernersLee foaf:homepage <http://www.w3.org> .
  TimBernersLee rdf:type foaf:Person .
Rules:
  ?s1 owl:differentFrom ?s2 . :- ?s1 rdf:type ?c1 . ?s2 rdf:type ?c2 . ?c1 owl:disjointWith ?c2 .
  ?s1 owl:sameAs ?s2 . :- ?s1 ?p ?o . ?s2 ?p ?o . ?p rdf:type owl:InverseFunctionalProperty .
  ERROR :- ?x owl:sameAs ?y . ?x owl:differentFrom ?y .
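The clash on this slide can be reproduced with a small Python sketch that evaluates the three rules directly over the six statements. The function names and the nested-loop evaluation are assumptions made for illustration; a real reasoner would not enumerate all pairs like this.

```python
from itertools import product

TRIPLES = {
    ("foaf:Person", "owl:disjointWith", "foaf:Organization"),
    ("foaf:homepage", "rdf:type", "owl:InverseFunctionalProperty"),
    ("W3C", "foaf:homepage", "<http://www.w3.org>"),
    ("W3C", "rdf:type", "foaf:Organization"),
    ("TimBernersLee", "foaf:homepage", "<http://www.w3.org>"),
    ("TimBernersLee", "rdf:type", "foaf:Person"),
}


def different_from(triples):
    """?s1 owl:differentFrom ?s2 for members of disjoint classes."""
    disjoint = {(c1, c2) for (c1, p, c2) in triples if p == "owl:disjointWith"}
    typed = [(s, c) for (s, p, c) in triples if p == "rdf:type"]
    return {
        (s1, s2)
        for (s1, c1), (s2, c2) in product(typed, typed)
        if (c1, c2) in disjoint
    }


def same_as(triples):
    """?s1 owl:sameAs ?s2 for subjects sharing an inverse-functional value."""
    ifps = {
        s for (s, p, o) in triples
        if p == "rdf:type" and o == "owl:InverseFunctionalProperty"
    }
    return {
        (s1, s2)
        for (s1, p1, o1), (s2, p2, o2) in product(triples, triples)
        if p1 == p2 and o1 == o2 and p1 in ifps
    }


# ERROR :- ?x owl:sameAs ?y . ?x owl:differentFrom ?y .
CONFLICTS = same_as(TRIPLES) & different_from(TRIPLES)
print(CONFLICTS)
```

A single faulty homepage statement is enough to make the two resources both equal and different, which is the ERROR case of the last rule.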

Noisy Data
  foaf:mbox_sha1sum a owl:InverseFunctionalProperty .
  ?x foaf:mbox_sha1sum 08445a31a78661b5c746feff39a9db6e4e2cc5cf .
  ?s1 owl:sameAs ?s2 . :- ?s1 ?p ?o . ?s2 ?p ?o . ?p rdf:type owl:InverseFunctionalProperty .
• 10^5 ?s1/?s2 bindings in body
• 10^10 inferred pair-wise and reflexive owl:sameAs statements
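The quadratic blow-up is easy to check on a small scale. The sketch below is an illustration with synthetic data: the `ex:person…` identifiers and the value of `N` are made up. It groups subjects by their inverse-functional value and emits every pair in a group, reflexive pairs included.

```python
from collections import defaultdict

IFPS = {"foaf:mbox_sha1sum"}
SHARED_HASH = "08445a31a78661b5c746feff39a9db6e4e2cc5cf"


def same_as_pairs(triples, ifps):
    """All owl:sameAs pairs implied by shared inverse-functional values."""
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p in ifps:
            subjects_by_value[(p, o)].add(s)
    return {
        (s1, s2)
        for group in subjects_by_value.values()
        for s1 in group
        for s2 in group
    }


N = 100  # synthetic stand-in for the 10^5 subjects on the slide
TRIPLES = [(f"ex:person{i}", "foaf:mbox_sha1sum", SHARED_HASH) for i in range(N)]
PAIRS = same_as_pairs(TRIPLES, IFPS)

print(len(PAIRS))                    # N * N pairs, reflexive ones included
print((10 ** 5) ** 2 == 10 ** 10)    # the slide's numbers follow the same pattern
```

With N subjects sharing one value the rule produces N × N statements, so 10^5 bindings lead to 10^10 inferred owl:sameAs statements.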

“Ontology Hijacking”
More Noise: From http://www.eiao.net/rdf/1.0
<owl:Property rdf:about="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">
  <rdfs:label xml:lang="en">type</rdfs:label>
  <rdfs:comment xml:lang="en">Type of resource</rdfs:comment>
  <rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#testRun"/>
  <rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#pageSurvey"/>
  <rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#siteSurvey"/>
  <rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#scenario"/>
  <rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#rangeLocation"/>
  <rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#startPointer"/>
  <rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#endPointer"/>
  <rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#header"/>
  <rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#runs"/>
</owl:Property>
Ontology hijacking: A non-authoritative source trying to redefine existing properties & classes.
  rdf:type rdfs:domain eiao:testRun .
  rdf:type rdfs:domain eiao:pageSurvey .
  rdf:type rdfs:domain eiao:siteSurvey .
  rdf:type rdfs:domain eiao:scenario .
  rdf:type rdfs:domain eiao:rangeLocation .
  rdf:type rdfs:domain eiao:startPointer .
  rdf:type rdfs:domain eiao:endPointer .
  rdf:type rdfs:domain eiao:header .
  rdf:type rdfs:domain eiao:runs .

“Ontology Hijacking”
OWL 2 RL domain rule:
  ?s rdf:type ?c . :- ?p1 rdfs:domain ?c . ?s ?p1 ?o .
Adds 92 x |N| triples, where N is the set of “normal” rdf:type triples in the data!
  rdf:type rdfs:domain eiao:testRun .
  rdf:type rdfs:domain eiao:pageSurvey .
  rdf:type rdfs:domain eiao:siteSurvey .
  rdf:type rdfs:domain eiao:scenario .
  rdf:type rdfs:domain eiao:rangeLocation .
  rdf:type rdfs:domain eiao:startPointer .
  rdf:type rdfs:domain eiao:endPointer .
  rdf:type rdfs:domain eiao:header .
  rdf:type rdfs:domain eiao:runs .

SAOR: Scalable Authoritative OWL Reasoner
No systems available that can deal with that …
Goals:
Scalability
• Separate T-Box data, kept in memory
• Reduced output
• Incomplete reasoning!
Web tolerance
• Consider authority of T-Box
• Incomplete reasoning!

Scalable Reasoning: In-mem T-Box
• Main optimisation: Store T-Box in memory
  • By far the most commonly accessed segment of data for reasoning
  • Quite small (1-2%)
• e.g. from a 100M statement Web crawl:
  • A-Box: 3,753,791 × ( ?s foaf:name ?o . ) vs.
  • T-Box: <20 × ( foaf:name ?p ?o . + ?s ?p foaf:name . )

Scalable Reasoning: Scans • Scan 1: Scan input data, separate T-Box statements, load T-Box statements into memory • Scan 2: Scan all on-disk data, join with in-memory T-Box. • With in-mem T-Box, avoid A-Box joins for many *not all* rules • A-Box joins too expensive on large volumes of data

Scalable Reasoning: No A-Box Joins
• Execution of three rules:
  OWL 2 RL rule prp-spo1:
    ?x ?p2 ?y . :- ?p1 rdfs:subPropertyOf ?p2 . ?x ?p1 ?y .
  OWL 2 RL rule cax-sco:
    ?x rdf:type ?c2 . :- ?c1 rdfs:subClassOf ?c2 . ?x rdf:type ?c1 .
  OWL 2 RL rule prp-rng:
    ?y rdf:type ?c . :- ?p rdfs:range ?c . ?x ?p ?y .
ON-DISK A-BOX: … ex:me foaf:homepage ex:home . …
IN-MEM T-BOX
ON-DISK OUTPUT: … ex:me foaf:page ex:home . ex:me foaf:isPrimaryTopicOf ex:home . ex:home rdf:type foaf:Document . ex:home rdf:type wordnet:Document . …
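The two-scan idea (T-Box in memory, one pass over the A-Box, no A-Box joins) can be sketched as follows. This is a minimal illustration, not SAOR's implementation: the T-Box statements below are assumptions chosen so that the output matches the slide, and `load_tbox` and `infer` are hypothetical names.

```python
from collections import defaultdict

TBOX_PREDICATES = {"rdfs:subPropertyOf", "rdfs:subClassOf", "rdfs:range"}

# Assumed input; the slide does not list the T-Box contents.
STATEMENTS = [
    ("foaf:homepage", "rdfs:subPropertyOf", "foaf:page"),
    ("foaf:homepage", "rdfs:subPropertyOf", "foaf:isPrimaryTopicOf"),
    ("foaf:homepage", "rdfs:range", "foaf:Document"),
    ("foaf:Document", "rdfs:subClassOf", "wordnet:Document"),
    ("ex:me", "foaf:homepage", "ex:home"),
]


def load_tbox(statements):
    """Scan 1: index terminological statements by predicate, then subject."""
    tbox = defaultdict(lambda: defaultdict(set))
    for s, p, o in statements:
        if p in TBOX_PREDICATES:
            tbox[p][s].add(o)
    return tbox


def infer(triple, tbox):
    """Apply prp-spo1, prp-rng and cax-sco to ONE A-Box triple.

    Every rule joins a single A-Box triple with in-memory T-Box entries,
    so no other A-Box triple is ever consulted.
    """
    seen, todo = set(), [triple]
    while todo:
        s, p, o = todo.pop()
        out = [(s, q, o) for q in tbox["rdfs:subPropertyOf"].get(p, ())]
        out += [(o, "rdf:type", c) for c in tbox["rdfs:range"].get(p, ())]
        if p == "rdf:type":
            out += [(s, "rdf:type", c) for c in tbox["rdfs:subClassOf"].get(o, ())]
        for t in out:
            if t != triple and t not in seen:
                seen.add(t)
                todo.append(t)
    return seen


TBOX = load_tbox(STATEMENTS)
# Scan 2: stream the A-Box and emit inferences triple by triple.
OUTPUT = set()
for statement in STATEMENTS:
    if statement[1] not in TBOX_PREDICATES:
        OUTPUT |= infer(statement, TBOX)
print(sorted(OUTPUT))
```

Because each input triple is processed in isolation, the second scan can be split across files or machines without coordination.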

Scalable Reasoning: Joins
• We focus on those rules that don’t need A-Box joins:
  • [48 rules/76 OWL2RL rules]
  • Covers e.g. all of RDF Schema!
  • This fragment can easily be distributed!
• However: some rules do require A-Box joins, e.g.
    ?x owl:sameAs ?z . :- ?x owl:sameAs ?y . ?y owl:sameAs ?z .
  Handled with BW-chaining (storing pivot element lists).
    ?x1 owl:sameAs ?x2 . :- ?p a owl:InverseFunctionalProperty . ?x1 ?p ?o . ?x2 ?p ?o .
  Currently ignored (see the examples above); we are working on a statistical approach for inverse-functional properties.
• SAOR reasoning over >1B statements, as deployed in SWSE, uses no A-Box joins; we ran experiments on full OWL2RL for a smaller dataset
  • using in-memory transitivity indexes and semi-naïve evaluation for transitive properties (there are not that many)
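Semi-naïve evaluation, mentioned above for transitive properties, only joins the facts derived in the previous round instead of re-joining everything. The sketch below is a generic textbook version under assumptions: it handles a single property given as a set of (x, y) edges, uses the linear form of the transitivity rule (which yields the same closure), and is not taken from SAOR.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Semi-naive closure of one transitive property.

    Each round joins only the newly derived pairs (delta) with the
    base edges, rather than the whole closure with itself.
    """
    successors = defaultdict(set)
    for x, y in edges:
        successors[x].add(y)

    closure = set(edges)
    delta = set(edges)
    while delta:
        new = {(x, z) for (x, y) in delta for z in successors[y]} - closure
        closure |= new
        delta = new
    return closure


EDGES = {("a", "b"), ("b", "c"), ("c", "d")}
CLOSURE = transitive_closure(EDGES)
print(sorted(CLOSURE - EDGES))
```

The in-memory `successors` index plays the role of a transitivity index here; it fits in memory only because, as the slide notes, transitive properties are rare.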

Web Tolerance: Authoritative Reasoning
• We check authority (on the T-Box statements only) to make inferences!
• Document D is authoritative for class/property X iff:
  • X is not identified by a URI, OR
  • the de-referenced URI of X coincides with or redirects to D
• Borrowing from the DL idea of separating T-Box and A-Box, we enable authority checking by so-called split rules:
  • Split rule: antecedent divided into T-Box and A-Box statements.
  • Split-rule application: at least one of the A-Box/T-Box join variables needs to be spoken about authoritatively for the rule to fire.
• Example:
    ?s rdf:type ?d . :- ?c rdfs:subClassOf ?d . ?s rdf:type ?c .

Web Tolerance: Authoritative Reasoning
• Example:
  • FOAF ontology authoritative for foaf:Person ✓
  • MY spec not authoritative for foaf:Person ✘
• Only allow extension in authoritative documents
  • my:Person rdfs:subClassOf foaf:Person . (MY spec) ✓
• BUT: Reduce obscure memberships
  • foaf:Person rdfs:subClassOf my:Person . (MY spec) ✘
• ALSO: Protect specifications
  • foaf:mbox rdf:type owl:SymmetricProperty . (MY spec) ✘
• Similarly for other rules.
• In-memory T-Box only stores statements that are authoritative for rule execution.
    ?s rdf:type ?d . :- ?c rdfs:subClassOf ?d . ?s rdf:type ?c .
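The authority check for the subclass rule can be sketched as a filter over T-Box statements tagged with their source document. Everything named here is an assumption for illustration: `DEREFERENCES_TO` stands in for real HTTP dereferencing, and the document identifiers `doc:foaf-spec` and `doc:my-spec` are placeholders.

```python
# Hypothetical lookup table replacing HTTP dereferencing (incl. redirects):
# term -> document that its URI resolves to.
DEREFERENCES_TO = {
    "foaf:Person": "doc:foaf-spec",
    "my:Person": "doc:my-spec",
}

# T-Box statements as (subject, predicate, object, source document).
TBOX_QUADS = [
    ("my:Person", "rdfs:subClassOf", "foaf:Person", "doc:my-spec"),
    ("foaf:Person", "rdfs:subClassOf", "my:Person", "doc:my-spec"),
]


def is_authoritative(document, term):
    """D is authoritative for X iff X is a blank node or X dereferences to D."""
    if term.startswith("_:"):  # not identified by a URI
        return True
    return DEREFERENCES_TO.get(term) == document


def authoritative_subclass_axioms(quads):
    """Keep ?c rdfs:subClassOf ?d only if its source is authoritative for ?c.

    ?c is the variable joining the T-Box atom with the A-Box atom
    (?s rdf:type ?c) in the split rule, so it is the one that is checked.
    """
    return {
        (c, d)
        for (c, p, d, doc) in quads
        if p == "rdfs:subClassOf" and is_authoritative(doc, c)
    }


ACCEPTED = authoritative_subclass_axioms(TBOX_QUADS)
print(ACCEPTED)
```

Under these assumptions MY spec may extend FOAF by declaring its own subclass, but its attempt to place `foaf:Person` under `my:Person` never reaches the in-memory T-Box.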

Runtime…
• no A-Box joins + authoritative split rule application:
  • Linear scale for most rules
  • single machine: 1.1bn in ⇒ 1.1+1.9bn out, <10 hours
  • Can be parallelized! [Weaver, Hendler 2009], [Urbani et al. 2009] ⇒ 113 minutes
• with A-Box joins… only scales up to ~100M statements so far

We would, if we could…
• Use ranking of statements [Harth et al. ISWC2009] to rank inferences.
• Ongoing work with Piero Bonatti:
  • Rank inferences (by aggregation):
      s p o : f(v1, …, vn) :- t1:v1 … tn:vn
    Based on annotated programs [Kifer & Subrahmanian, JLP, 1992]
• Main difficulty:
  • many possible inferences for the same statement; aggregation prevents the cheap file scans we currently rely on.

Summary (So, why should you care?)
• We need to care about scale
• Throw away what we don’t need; our choices are motivated empirically:
  • T-Box separation + file scans
  • Split-rule notion + authoritativeness keep “noise explosion” low
  • … but applicable in similar domains?
• Admittedly: rather a restriction of Datalog 1.0 to scale with rules of a certain shape
• But also: More Datalog on the Semantic Web horizon!
  • W3C RIF: Web standard for rule exchange … RIF safe Core = safe Datalog with built-ins
  • W3C SPARQL 1.0 is translatable to Datalog with stratified negation [Polleres 2007, Angles and Gutierrez 2008, Ianni et al. 2009]; the additional features of SPARQL 1.1 are well investigated in Datalog!
  • Annotations/rank potentially boost accuracy of query results; other annotation domains: time, provenance, etc.

Le Fin… Techniques used in Running search engines…

Ok, here it is… 2RL Core

Evaluation: Authoritative Reasoning

Scalable Reasoning: Joins
• However: some rules do require A-Box joins
• We employ on-disk hashtables.
ON-DISK A-BOX: … ex:me foaf:homepage ex:home . … ex:moi foaf:homepage ex:home . …
IN-MEM T-BOX
ON-DISK HASHTABLE
ON-DISK OUTPUT: … ex:me owl:sameAs ex:moi . …

Scalable Reasoning: Equality
• Use canonical ‘pivot’ identifiers
• During Scan 2:
  • Maintain on-disk hashtable with equality chains
  • Re-write G2 hashtable keys to reflect new equivalences
• Scan 3: Scan input and inferred data, re-write according to owl:sameAs closure.
  ex:me owl:sameAs ex2:me . ⇒ ex:home owl:sameAs ex2:home .
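The pivot idea can be sketched with an in-memory union-find in place of the on-disk hashtable. Two choices below are assumptions made for the sketch, not statements about SAOR: the pivot is taken to be the lexicographically smallest identifier of each equivalence class, and only subjects and objects are rewritten.

```python
def pivot_map(same_as_pairs):
    """Map every identifier to the pivot of its owl:sameAs equivalence class."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            low, high = sorted((root_a, root_b))
            parent[high] = low  # smaller identifier stays the root
    return {x: find(x) for x in list(parent)}


def rewrite(triples, pivots):
    """Scan 3 analogue: replace subjects and objects by their pivots."""
    return {(pivots.get(s, s), p, pivots.get(o, o)) for (s, p, o) in triples}


SAME_AS = [("ex:me", "ex:moi"), ("ex:moi", "ex2:me")]
PIVOTS = pivot_map(SAME_AS)

DATA = {
    ("ex:me", "foaf:homepage", "ex:home"),
    ("ex:moi", "foaf:homepage", "ex:home"),
}
REWRITTEN = rewrite(DATA, PIVOTS)
print(PIVOTS)
print(REWRITTEN)
```

Rewriting to one pivot per class stores each statement once instead of once per synonym, which avoids materialising the quadratic owl:sameAs closure.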

Rules Overview
G0: 1 rule: only T-Box in antecedent (no A-Box)
G1: 17 rules: at least one T-Box statement, only one A-Box statement in antecedent (no A-Box joins)
G2: 7 rules: at least one T-Box statement, multiple A-Box statements in antecedent (A-Box joins)
G3: 4 rules: only A-Box in antecedent (no T-Box)
ANTECEDENT ⇒ CONSEQUENT
G1 (≥1 TBOX, 1 ABOX ⇒ ABOX): ?P owl:inverseOf ?Q . ?s ?P ?o . ⇒ ?o ?Q ?s .
G2 (≥1 TBOX, >1 ABOX ⇒ ABOX): ?P a :TransitiveProperty . ?x ?P ?y . ?y ?P ?z . ⇒ ?x ?P ?z .
G3 (0 TBOX, >1 ABOX ⇒ ABOX): ?x :sameAs ?y . ?x ?P ?o . ⇒ ?y ?P ?o .
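The grouping above depends only on how many T-Box and A-Box patterns a rule's antecedent contains, so it can be sketched as a small classifier. The test for what counts as a T-Box pattern (`is_tbox_pattern` and the two vocabulary sets) is a simplifying assumption for this sketch, not the definition used by SAOR.

```python
# Assumed terminological vocabulary, reduced to what the examples need.
TBOX_PREDICATES = {
    "rdfs:subClassOf", "rdfs:subPropertyOf", "rdfs:domain", "rdfs:range",
    "owl:inverseOf",
}
TBOX_CLASSES = {"owl:TransitiveProperty", "owl:InverseFunctionalProperty"}


def is_tbox_pattern(pattern):
    """A pattern is terminological if it describes a class or property."""
    s, p, o = pattern
    return p in TBOX_PREDICATES or (p == "rdf:type" and o in TBOX_CLASSES)


def rule_group(antecedent):
    """Classify a rule into G0..G3 by counting T-Box and A-Box patterns."""
    tbox = sum(1 for pattern in antecedent if is_tbox_pattern(pattern))
    abox = len(antecedent) - tbox
    if abox == 0:
        return "G0"  # only T-Box in antecedent
    if tbox == 0:
        return "G3"  # only A-Box in antecedent
    return "G1" if abox == 1 else "G2"  # G2 needs A-Box joins


RULES = {
    "inverse-symmetry": [("?P", "owl:inverseOf", "?Q")],
    "inverse": [("?P", "owl:inverseOf", "?Q"), ("?s", "?P", "?o")],
    "transitive": [
        ("?P", "rdf:type", "owl:TransitiveProperty"),
        ("?x", "?P", "?y"),
        ("?y", "?P", "?z"),
    ],
    "sameAs-replace": [("?x", "owl:sameAs", "?y"), ("?x", "?P", "?o")],
}
GROUPS = {name: rule_group(body) for name, body in RULES.items()}
print(GROUPS)
```

Only G2 and G3 rules with more than one A-Box pattern need joins over the on-disk data, which is why the G0/G1 fragment runs as plain file scans.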

Evaluation: Scalable Reasoning
• G0, G1, G2, G3: 151M out, ~16 hr (on-disk hashtables begin to struggle)
• G0, G1: 142M out, <1 hr
