Linked data integration using reasoning

Linked Data Integration (using reasoning)

Aidan Hogan

Day 3

Session 2



What is reasoning?



Reasoning: Conceptual Overview

(Loosely) Deriving novel conclusions from existing knowledge

Deductive reasoning: inferring new facts from existing rules and facts

Given rule: All Kia cars are made in Korea

Given premise (fact): Fred’s car is a Kia

Entails fact: Fred’s car was made in Korea

Inductive reasoning: learning new rules from existing facts and entailments (typically what we humans do: build imprecise rules from models)

Given model of existing facts: All Kia cars I’ve seen have four wheels

~Entails rule: All Kias have four wheels

Given fact: Fred’s car is a Kia

Entails (probabilistic fact): Fred’s car likely has four wheels

Abductive reasoning: guess a premise from a conclusion (similar in principle to a form of inductive reasoning)

Given entailment: Fred’s car is Korean and has four wheels?

Rule: Kias and Hyundais are Korean and typically have four wheels

Guessed premise: Fred’s car is a Kia or a Hyundai?



Reasoning: Clearing up terms

  • Semantics: formally defined meanings of terms

    • KiaCar ⊑ KoreanCar ⊓ FourWheels

  • Entailments: the conclusions which follow from formal semantics

  • Inference: a procedure to compute entailments

    • (or the entailments that can be computed therefrom)



Reasoning: Example Conceptual Tasks

  • Conjunctive Query Answering: generate new answers for questions

    • Include Fred as an answer to “give me friends of mine who own Korean cars”

  • Subsumption checking: identify subclass relationships

    • Is the class KiaCar a subset of KoreanCar if all Kias are manufactured in Seoul?

  • Class-Satisfiability checking: identify if a class can have a membership

    • Can there be something which is a KoreanCar and a EuropeanCar?

  • Consistency checking: identify formally conflicting information

    • Fred tells me his Kia is European; is this correct?

  • Instance checking: identify if an individual is a member of a given class

    • Is Fred’s Kia an Asian car?



RDFS/OWL reasoning?



Reasoning: RDFS and OWL (deductive)

  • Formal semantics of RDFS and OWL can be leveraged for reasoning.

  • :KiaCar rdfs:subClassOf :KoreanCar ,

  • [ owl:hasValue :Seoul ; owl:onProperty :manufacturedIn ] .

  • :FredsCar a :KiaCar .

  • Implies

  • :FredsCar a :KoreanCar ; :manufacturedIn :Seoul .



Reasoning: OWL (2)

  • Eight sub-languages of OWL!

  • Why eight?(^^)

  • Direct Semantics (based on Description Logics):

    • OWL DL (NExpTime for non-QA tasks), OWL Lite (ExpTime for non-QA)

    • OWL 2 EL, OWL 2 QL, OWL 2 RL (all PTime for most non-QA tasks)

    • OWL 2 DL (2NExpTime for non-QA)

    • Emphasis on soundness and completeness

    • Tableaux-based algorithms

      • Based on KB-satisfiability checking

    • Syntactic restrictions to preserve complexity

      • e.g., no datatype inverse-functional properties

  • RDF-Based Semantics (layered directly on top of RDFS)

    • OWL Full, OWL 2 Full

    • All tasks are undecidable!

      • No complete, correct inference procedure can exist for the reasoning tasks

      • Incomplete reasoning possible through rules…

Opinion: OWL/OWL 2 useful stuff, but an extremely complex standard!!!



RDFS and OWL 2 RL: Entailment rules

  • RDFS entailment rules provide sound, complete RDFS reasoning

  • OWL 2 RL/RDF rules provide partial support for the OWL 2 RDF-Based Semantics

  • Monotonic rules which are guarded

  • Positive subset of Datalog with a fixed ternary predicate

  • Rules have cubic complexity (with trivial exceptions aside)

    • Due to the arity of triples (3)



Rules

IF (Body/Antecedent/Condition) ⇒ THEN (Head/Consequent)

?c1 rdfs:subClassOf ?c2 . (Schema/Terminology/Ontological)

?x rdf:type ?c1 . (Instance/Assertional)

⇒ ?x rdf:type ?c2 .

  • foaf:Person rdfs:subClassOf foaf:Agent .

  • timbl:me rdf:type foaf:Person .

  • ⇒ timbl:me rdf:type foaf:Agent .
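
As a rough illustration of how such an IF ⇒ THEN rule can be executed, the following minimal sketch applies the subclass rule to the FOAF example until nothing new is derived. Representing triples as Python tuples and the function name apply_subclass_rule are assumptions made for this sketch, not part of any system described in the talk.

```python
# Triples are plain (subject, predicate, object) tuples of prefixed names.
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def apply_subclass_rule(triples):
    """Apply  ?c1 rdfs:subClassOf ?c2 . ?x rdf:type ?c1 . => ?x rdf:type ?c2 .
    repeatedly until no new triple is produced (a fixpoint).
    Returns only the newly inferred triples."""
    known = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for (s, p, o) in known if p == SUBCLASS]
        for (x, p, c) in list(known):
            if p != TYPE:
                continue
            for (c1, c2) in subclass_pairs:
                inferred = (x, TYPE, c2)
                if c == c1 and inferred not in known:
                    known.add(inferred)
                    changed = True
    return known - set(triples)

data = [
    ("foaf:Person", SUBCLASS, "foaf:Agent"),  # schema (terminological) triple
    ("timbl:me", TYPE, "foaf:Person"),        # instance (assertional) triple
]
inferred = apply_subclass_rule(data)
print(inferred)
```

The loop is deliberately naive; it only shows the body/head mechanics of a single rule.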



Rules (Inconsistencies [a.k.a. Contradictions])

IF (Body/Antecedent/Condition) ⇒ THEN (Head/Consequent)

?c1 owl:disjointWith ?c2 .

?x rdf:type ?c1 .

?x rdf:type ?c2 .

⇒ false

  • foaf:Person owl:disjointWith foaf:Organization .

  • w3c rdf:type foaf:Organization .

  • w3c rdf:type foaf:Person .

  • ⇒ false
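
A rule whose head is false does not add triples; it flags a contradiction. A minimal sketch of checking the disjointness rule over a set of tuples follows. The function name and the tuple representation are assumptions for illustration only.

```python
def find_disjointness_violations(triples):
    """Return (individual, class1, class2) for every individual typed with
    two classes that are declared owl:disjointWith each other."""
    triples = set(triples)
    disjoint = {(s, o) for (s, p, o) in triples if p == "owl:disjointWith"}
    types = {}
    for (s, p, o) in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    violations = []
    for x, classes in sorted(types.items()):
        for (c1, c2) in sorted(disjoint):
            if c1 in classes and c2 in classes:
                violations.append((x, c1, c2))
    return violations

data = [
    ("foaf:Person", "owl:disjointWith", "foaf:Organization"),
    ("w3c", "rdf:type", "foaf:Organization"),
    ("w3c", "rdf:type", "foaf:Person"),
]
violations = find_disjointness_violations(data)
print(violations)
```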



Linked Data reasoning?

…integration use-case!



Linked Data Reasoning

explicit data

implicit data

How can consumers query the implicit data?



…so what’s The Problem?…

…heterogeneity

…need to integrate data from different sources


Take Query Answering…

Gimme webpages relating to Tim Berners-Lee

Property: foaf:page

Identifier: timbl:i

timbl:i foaf:page ?pages .


Heterogeneity in schema…

webpage: properties

mo:musicBrainz, doap:homepage, mo:myspace, foaf:homepage, foaf:weblog, foaf:primaryTopic, foaf:isPrimaryTopicOf, foaf:page, foaf:topic

[Diagram: the properties above connected by rdfs:subPropertyOf and owl:inverseOf edges]



Linked Data, RDFS and OWL: Linked Vocabularies

SKOS

Image from http://blog.dbtune.org/public/.081005_lod_constellation_m.jpg; Giasson, Bergman


Heterogeneity in naming…

Tim Berners-Lee: URIs

dblp:100007

timbl:i

db:Tim-Berners_Lee

identica:45563

= owl:sameAs

fb:en.tim_berners-lee

adv:timbl


Returning to our simple query…

Gimme webpages relating to Tim Berners-Lee

timbl:i foaf:page ?pages .

Properties: mo:myspace, foaf:primaryTopic, foaf:page, foaf:topic, doap:homepage, foaf:homepage, foaf:isPrimaryTopicOf

Identifiers: identica:45563, adv:timbl, db:Tim-Berners_Lee, dblp:100007, fb:en.tim_berners-lee, timbl:i

...7 x 6 = 42 possible patterns
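
To make the 7 x 6 count concrete, here is a small sketch that enumerates the triple patterns a consumer would have to issue without reasoning. The direction handling is an assumption added for the sketch: foaf:topic and foaf:primaryTopic point from a page to the thing, so their patterns are written with the variable in subject position.

```python
from itertools import product

properties = [
    "mo:myspace", "foaf:primaryTopic", "foaf:page", "foaf:topic",
    "doap:homepage", "foaf:homepage", "foaf:isPrimaryTopicOf",
]
identifiers = [
    "identica:45563", "adv:timbl", "db:Tim-Berners_Lee",
    "dblp:100007", "fb:en.tim_berners-lee", "timbl:i",
]

# These two properties relate a document to its topic (page -> thing),
# so the page variable goes in subject position.
PAGE_TO_THING = {"foaf:topic", "foaf:primaryTopic"}

def pattern(identifier, prop):
    if prop in PAGE_TO_THING:
        return f"?pages {prop} {identifier} ."
    return f"{identifier} {prop} ?pages ."

patterns = [pattern(i, p) for i, p in product(identifiers, properties)]
print(len(patterns))
```

With reasoning over the schema mappings and owl:sameAs links, the single original pattern is meant to stand in for all of these.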



…reasoning to the rescue?



Challenges…

…what (OWL) reasoning is feasible for Linked Data?



Linked Data Reasoning: Challenges

Scalable

Expressive

Domain-Agnostic

Robust



Linked Data Reasoning: Challenges

  • Scalability

    • At least tens of billions of statements (for the moment)

      • Near linear scale!!!

  • Noisy data

    • Inconsistencies galore

    • Publishing errors



What about noise? …

…need to consider the provenance of Web data



Noisy Data: Omnipotent Being

  • Web data is noisy.

  • Proof:

  • 08445a31a78661b5c746feff39a9db6e4e2cc5cf

  • sha1-sum of ‘mailto:’

  • common value for foaf:mbox_sha1sum

    • An inverse-functional (uniquely identifying) property!!!

    • Any person who shares the same value will be considered the same

  • Q.E.D.
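
The hash can be recomputed with a few lines of standard-library Python. This sketch only recomputes the digest and compares it with the value quoted above; the two "people" are hypothetical exporter outputs used to show why a shared value merges unrelated identities.

```python
import hashlib

def mbox_sha1sum(mbox_uri):
    """SHA-1 hex digest of a mailbox URI, as used for foaf:mbox_sha1sum."""
    return hashlib.sha1(mbox_uri.encode("utf-8")).hexdigest()

# What an exporter ends up hashing when the e-mail field was left empty.
digest = mbox_sha1sum("mailto:")
quoted_on_slide = "08445a31a78661b5c746feff39a9db6e4e2cc5cf"
print(digest, digest == quoted_on_slide)

# Two hypothetical, unrelated profiles exported with an empty e-mail field
# get the same "uniquely identifying" value.
person_a = mbox_sha1sum("mailto:")
person_b = mbox_sha1sum("mailto:")
print(person_a == person_b)
```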



Noisy Data: Redefining everything

  • More proof (courtesy of http://www.eiao.net/rdf/1.0)

  • rdf:type rdf:type owl:Property .

  • rdf:type rdfs:label [email protected] .

  • rdf:type rdfs:comment “Type of resource” .

  • rdf:type rdfs:domain eiao:testRun .

  • rdf:type rdfs:domain eiao:pageSurvey .

  • rdf:type rdfs:domain eiao:siteSurvey .

  • rdf:type rdfs:domain eiao:scenario .

  • rdf:type rdfs:domain eiao:rangeLocation .

  • rdf:type rdfs:domain eiao:startPointer .

  • rdf:type rdfs:domain eiao:endPointer .

  • rdf:type rdfs:domain eiao:header .

  • rdf:type rdfs:domain eiao:runs .



Noisy Data: Inconsistency

w3c rdf:type foaf:Organization .

w3c rdf:type foaf:Person .

foaf:Person owl:disjointWith foaf:Organization .


Authoritative Reasoning

Consider source of schema data

Class/property URIs dereference to their authoritative document

FOAF spec authoritative for foaf:Person ✓

MY spec not authoritative for foaf:Person ✘

Allow “extension” in third-party documents

my:Person rdfs:subClassOf foaf:Person . (MY spec) ✓

BUT: Reduce obscure memberships

foaf:Person rdfs:subClassOf my:Person . (MY spec) ✘

ALSO: Protect specifications

foaf:knows a owl:SymmetricProperty . (MY spec) ✘
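
A very simplified sketch of the check behind these ✓/✘ marks follows. It assumes a hypothetical lookup table from namespace prefix to the document that the prefix's terms dereference to, and it assumes that a schema triple is accepted only when its source document is authoritative for the triple's subject. That condition happens to reproduce the three examples above; the published approach is stated more generally, so treat this as an approximation.

```python
# Hypothetical mapping: which document each namespace's terms dereference to.
AUTHORITATIVE_DOC = {"foaf": "FOAF spec", "my": "MY spec"}

def is_authoritative(document, term):
    """True if `term` dereferences to `document` (per the table above)."""
    prefix = term.split(":", 1)[0]
    return AUTHORITATIVE_DOC.get(prefix) == document

def accept_schema_triple(document, triple):
    """Simplifying assumption: accept a schema triple only if the document
    that states it is authoritative for the triple's subject."""
    subject, _predicate, _object = triple
    return is_authoritative(document, subject)

examples = [
    ("MY spec", ("my:Person", "rdfs:subClassOf", "foaf:Person")),      # extension
    ("MY spec", ("foaf:Person", "rdfs:subClassOf", "my:Person")),      # obscure membership
    ("MY spec", ("foaf:knows", "rdf:type", "owl:SymmetricProperty")),  # redefines FOAF
]
for doc, triple in examples:
    print(triple, "accepted" if accept_schema_triple(doc, triple) else "ignored")
```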


Noisy Data: Redefining everything

  • The rdf:type redefinitions from http://www.eiao.net/rdf/1.0 (listed in full earlier): Not Authoritative



Authoritative Reasoning: read more …w/ essential plugs

Gong Cheng, Yuzhong Qu.

"Integrating Lightweight Reasoning into Class-Based Query Refinement for Object Search." ASWC 2008.

Aidan Hogan, Andreas Harth, Axel Polleres.

"Scalable Authoritative OWL Reasoning for the Web." IJSWIS 2009.

Aidan Hogan, Jeff Z. Pan, Axel Polleres and Stefan Decker.

"SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples." ISWC 2010.

My thesis: http://aidanhogan.com/docs/thesis/



Alternative to Authoritative Reasoning?

  • Quarantined reasoning!

  • Separate and cache hierarchy of schema documents/dependencies…



Quarantined Reasoning [Delbru et al.; 2008]

A-Box / Instance Data

(e.g., a FOAF file)

T-Box / Ontology Data

(e.g., the FOAF ontology and its indirect imports)


Noisy Data: Redefining everything

  • The rdf:type redefinitions from http://www.eiao.net/rdf/1.0 (listed in full earlier): Not In Here



Quarantined Reasoning: read more

R. Delbru, A. Polleres, G. Tummarello and S. Decker.

"Context Dependent Reasoning for Semantic Documents in Sindice." 4th International Workshop on Scalable Semantic Web Knowledge Base Systems, 2008.



Resolving Inconsistency?

  • Use link analysis (PageRank) to rank documents and triples

  • Use annotated reasoning to rank inferences

  • Repair each inconsistency by removing the weakest triple

    • Read more:

    • Piero A. Bonatti, Aidan Hogan, Axel Polleres and Luigi Sauro. "Robust and Scalable Linked Data Reasoning Incorporating Provenance and Trust Annotations". In the Journal of Web Semantics (in press).
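
The repair step can be pictured with a short sketch. It assumes each inconsistency is given as the list of triples that jointly contradict, and that every triple already carries a rank; the rank values below are invented for illustration and do not come from the cited paper.

```python
def repair(inconsistencies, rank):
    """For each inconsistency (triples that jointly contradict), remove the
    lowest-ranked triple unless an earlier removal already resolved it."""
    removed = set()
    for conflict in inconsistencies:
        if removed & set(conflict):
            continue  # already repaired by an earlier removal
        weakest = min(conflict, key=lambda t: rank[t])
        removed.add(weakest)
    return removed

t_org = ("w3c", "rdf:type", "foaf:Organization")
t_person = ("w3c", "rdf:type", "foaf:Person")
t_disjoint = ("foaf:Person", "owl:disjointWith", "foaf:Organization")

# Hypothetical ranks: higher means better supported by well-linked documents.
rank = {t_org: 0.8, t_person: 0.1, t_disjoint: 0.9}

removed = repair([[t_disjoint, t_org, t_person]], rank)
print(removed)
```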



What about scale? …

…using positive (monotonic) rules.

Expressive reasoning (also) possible through tableaux, but yet to demonstrate desired scale



Materialisation

  • Forward-chaining Materialisation

    • Avoid runtime expense

      • Users taught impatience by Google

    • Pre-compute for quick retrieval

    • Web-scale systems should scale well

      • More data = more disk-space/machines

Don't materialise

too much!

One size does

not fit all!


  • INPUT:

  • Flat file of triples (quads)

  • OUTPUT:

  • Flat file of (partial) inferred triples (quads)



What rules?

  • Let’s look at a recent corpus of Linked Data and see what schema’s inside

  • (and what the rulesets support)

    • Open-domain crawl May 2010

    • 1.1 billion quadruples

    • 3.985 million sources (docs)

    • 780 pay-level domains (e.g., dbpedia.org)

    • Ran “special” PageRank over documents

    • 86 thousand docs contained some RDFS/OWL schema data (2.2% of docs... but <0.2% of triples)

    • Summated ranks of docs using each primitive


Survey of Linked Data schema: Top 15 ranks

#   Axiom                           Rank(Σ)  RDFS  Horst  O2R
1   rdfs:subClassOf                 0.295    ✓     ✓      ✓
2   rdfs:range                      0.294    ✓     ✓      ✓
3   rdfs:domain                     0.292    ✓     ✓      ✓
4   rdfs:subPropertyOf              0.090    ✓     ✓      ✓
5   owl:FunctionalProperty          0.063    ✘     ✓      ✓
6   owl:disjointWith                0.049    ✘     ✘      ✓
7   owl:inverseOf                   0.047    ✘     ✓      ✓
8   owl:unionOf                     0.035    ✘     ✘      ✓
9   owl:SymmetricProperty           0.033    ✘     ✓      ✓
10  owl:TransitiveProperty          0.030    ✘     ✓      ✓
11  owl:equivalentClass             0.021    ✘     ✓      ✓
12  owl:InverseFunctionalProperty   0.030    ✘     ✓      ✓
13  owl:equivalentProperty          0.030    ✘     ✓      ✓
14  owl:someValuesFrom              0.030    ✘     ✓      ✓
15  owl:hasValue                    0.028    ✘     ✓      ✓


Scalable Reasoning: In-mem T-Box

Main optimisation: Store T-Box in memory

T-Box: (loosely) data describing classes and properties.

Aka. schemata/vocabularies/ontologies/terminologies.

E.g.,

foaf:topic owl:inverseOf foaf:page .

sioc:UserAccount rdfs:subClassOf foaf:OnlineAccount .

Most commonly accessed data for reasoning

Quite small (~0.1% for our Linked Data corpus)

High selectivity (if you prefer)

A-Box: Lots: ?s foaf:page ?o .

vs.

T-Box: Few: foaf:page ?p ?o . + ?s ?p foaf:page .


Scalable Reasoning: Two Scans

Scan 1: Scan input data, separate T-Box statements, load T-Box statements into memory

Do T-Box level reasoning if required (semi-naïve)

Scan 2: Scan all on-disk data, join with in-memory T-Box.



Scalable Reasoning: No A-Box Joins

  • Execution of three rules:

    OWL 2 RL rule prp-inv1

    ?p1 owl:inverseOf ?p2 .

    ?x ?p1 ?y .

    ⇒ ?y ?p2 ?x .

    OWL 2 RL rule prp-rng

    ?p rdfs:range ?c .

    ?x ?p ?y .

    ⇒ ?y a ?c .

    OWL 2 RL rule prp-spo1

    ?p1 rdfs:subPropertyOf ?p2 .

    ?x ?p1 ?y .

    ⇒ ?x ?p2 ?y .

ON-DISK A-BOX

...

ex:me foaf:homepage ex:hp .

...

IN-MEM T-BOX

ON-DISK OUTPUT

...

ex:hp rdf:type foaf:Document .

ex:me foaf:page ex:hp .

ex:hp foaf:topic ex:me .

...
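
The two-scan idea with an in-memory T-Box can be sketched in a few lines. Everything here is illustrative: the three T-Box triples are assumed for the example (chosen so that the three rules fire as on the slide), triples are Python tuples, and the "on-disk" data is just a list that is iterated twice.

```python
def load_tbox(triples):
    """Scan 1: keep only schema triples, indexed by the property they describe."""
    inverse, range_, subprop = {}, {}, {}
    for s, p, o in triples:
        if p == "owl:inverseOf":
            inverse.setdefault(s, set()).add(o)
        elif p == "rdfs:range":
            range_.setdefault(s, set()).add(o)
        elif p == "rdfs:subPropertyOf":
            subprop.setdefault(s, set()).add(o)
    return inverse, range_, subprop

def infer_from(triple, tbox):
    """Scan 2, per triple: join ONE A-Box triple with the in-memory T-Box,
    re-applying the rules to each inference until nothing new appears.
    No other A-Box triple is ever consulted."""
    inverse, range_, subprop = tbox
    seen, todo = {triple}, [triple]
    while todo:
        x, p, y = todo.pop()
        new = [(y, p2, x) for p2 in inverse.get(p, ())]          # prp-inv1
        new += [(y, "rdf:type", c) for c in range_.get(p, ())]   # prp-rng
        new += [(x, p2, y) for p2 in subprop.get(p, ())]         # prp-spo1
        for t in new:
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen - {triple}

data = [
    # T-Box triples assumed for this example
    ("foaf:homepage", "rdfs:subPropertyOf", "foaf:page"),
    ("foaf:homepage", "rdfs:range", "foaf:Document"),
    ("foaf:page", "owl:inverseOf", "foaf:topic"),
    # A-Box triple from the slide
    ("ex:me", "foaf:homepage", "ex:hp"),
]

tbox = load_tbox(data)               # scan 1
output = set()
for t in data:                       # scan 2
    output |= infer_from(t, tbox)
print(sorted(output))
```

Because each input triple is processed on its own, the second scan streams through the data and its cost grows roughly with the size of the input.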



Scalable Reasoning: A-Box joins?

  • However: some rules do require A-Box joins

    • ?p a owl:TransitiveProperty . ?x ?p ?y . ?y ?p ?z .

      ⇒ ?x ?p ?z .

    • Difficult to engineer a scalable solution (which reaches a fixpoint) for Linked Data(?)

    • Can lead to quadratic inferences

  • A lot of useful reasoning still possible without A-Box joins…
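
To see where the quadratic growth comes from, consider a chain of n resources linked by one transitive property. The sketch below (a naive closure over pairs, written only for illustration) joins A-Box triples with each other and turns n - 1 input triples into n(n - 1)/2 triples.

```python
def transitive_closure(edges):
    """Naive fixpoint for  ?x ?p ?y . ?y ?p ?z . => ?x ?p ?z .
    over the (x, y) pairs of a single transitive property."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(closure):
            for (y2, z) in list(closure):
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure

n = 20
chain = [(i, i + 1) for i in range(n - 1)]   # 19 input pairs: 0->1->...->19
closed = transitive_closure(chain)
print(len(chain), len(closed))
```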


Features not requiring A-Box joins

Axiom                           Rank(Σ)  No A-Box join
rdfs:subClassOf                 0.295    ✓
rdfs:range                      0.294    ✓
rdfs:domain                     0.292    ✓
rdfs:subPropertyOf              0.090    ✓
owl:FunctionalProperty          0.063    ✘
owl:disjointWith                0.049    ✘
owl:inverseOf                   0.047    ✓
owl:unionOf                     0.035    ✓
owl:SymmetricProperty           0.033    ✓
owl:equivalentClass             0.021    ✓
owl:InverseFunctionalProperty   0.030    ✘
owl:equivalentProperty          0.030    ✓
owl:someValuesFrom              0.030    ✓/✘



Reasoning Performance (1 machine)


Scalable Distributed Reasoning

[Diagram: five machines, each holding a different slice of the A-Box (input triples such as ex:me ex:presented ex:ThisTalk). Each machine runs EXTRACT T-BOX on its slice; the extracted statements are gathered (COLLECT T-BOX) so that every machine holds the SAME T-BOX alongside its DIFF. A-BOX; each machine then writes its inferences (e.g., ex:me rdf:type ex:Awesome .) to LOCAL OUTPUT.]



Reasoning Performance: Distribution

9 machines: Total 3.35 hours



Distributed Reasoning: read more

Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker: SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples. International Semantic Web Conference (1) 2010: 337-353

Jesse Weaver, James A. Hendler: Parallel Materialization of the Finite RDFS Closure for Hundreds of Millions of Triples. International Semantic Web Conference 2009: 682-697

Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen: Scalable Distributed Reasoning Using MapReduce. International Semantic Web Conference 2009: 634-649

Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal: OWL Reasoning with WebPIE: Calculating the Closure of 100 Billion Triples. ESWC (1) 2010: 213-227

A-Box Joins



…what about owl:sameAs?


Consolidation for Linked Data



Consolidation: Baseline

timbl:i

identica:45563

dbpedia:Berners-Lee

  • Use provided owl:sameAs mappings in the data

    timbl:i owl:sameAs identica:45563 .

    dbpedia:Berners-Lee owl:sameAs identica:45563 .

  • Store “equivalences” found

    timbl:i->

    identica:45563->

    dbpedia:Berners-Lee->



Consolidation: Baseline

timbl:i

identica:45563

dbpedia:Berners-Lee

  • For each set of equivalent identifiers, choose a canonical term


Canonicalisation

Equivalent identifiers: timbl:i, identica:45563, dbpedia:Berners-Lee

Afterwards, rewrite identifiers to their canonical version:

Before:

timbl:i rdf:type foaf:Person .

identica:48404 foaf:knows identica:45563 .

dbpedia:Berners-Lee dpo:birthDate “1955-06-08”^^xsd:date .

After:

dbpedia:Berners-Lee rdf:type foaf:Person .

identica:48404 foaf:knows dbpedia:Berners-Lee .

dbpedia:Berners-Lee dpo:birthDate “1955-06-08”^^xsd:date .
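
Baseline consolidation can be sketched with a union-find structure over the owl:sameAs pairs, followed by a rewrite pass. The choice of canonical term here (the lexicographically smallest identifier in each set) is an assumption made for the sketch; it happens to select dbpedia:Berners-Lee for this example, but the talk does not say how the canonical term is picked.

```python
def consolidate(triples):
    """Group identifiers connected by owl:sameAs, pick one canonical term per
    group, and rewrite all other triples to use it."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Assumed policy: the lexicographically smallest root stays canonical.
            low, high = sorted((ra, rb))
            parent[high] = low

    for s, p, o in triples:
        if p == "owl:sameAs":
            union(s, o)

    def canon(term):
        return find(term) if term in parent else term

    return {(canon(s), p, canon(o)) for s, p, o in triples if p != "owl:sameAs"}

data = [
    ("timbl:i", "owl:sameAs", "identica:45563"),
    ("dbpedia:Berners-Lee", "owl:sameAs", "identica:45563"),
    ("timbl:i", "rdf:type", "foaf:Person"),
    ("identica:48404", "foaf:knows", "identica:45563"),
    ("dbpedia:Berners-Lee", "dpo:birthDate", '"1955-06-08"^^xsd:date'),
]
rewritten = consolidate(data)
for triple in sorted(rewritten):
    print(triple)
```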


Extended Consolidation

  • Infer owl:sameAs through reasoning (OWL 2 RL/RDF)

    • explicit owl:sameAs (again)

    • owl:InverseFunctionalProperty

    • owl:FunctionalProperty

    • owl:cardinality 1 / owl:maxCardinality 1

      foaf:homepage a owl:InverseFunctionalProperty .

      timbl:i foaf:homepage w3c:timblhomepage .

      adv:timbl foaf:homepage w3c:timblhomepage .

      ⇒ timbl:i owl:sameAs adv:timbl .

      …then apply consolidation as before
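
A minimal sketch of the inverse-functional-property case: subjects that share a value for such a property are inferred to be the same. Note that this joins A-Box triples with each other, unlike the rules applied per triple earlier. The function name and tuple representation are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def same_as_from_ifp(triples):
    """Infer owl:sameAs between subjects sharing a value for a property
    declared owl:InverseFunctionalProperty."""
    ifps = {s for s, p, o in triples
            if p == "rdf:type" and o == "owl:InverseFunctionalProperty"}
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p in ifps:
            subjects_by_value[(p, o)].add(s)
    inferred = set()
    for subjects in subjects_by_value.values():
        # owl:sameAs is symmetric, so one direction per pair is enough here.
        for a, b in combinations(sorted(subjects), 2):
            inferred.add((a, "owl:sameAs", b))
    return inferred

data = [
    ("foaf:homepage", "rdf:type", "owl:InverseFunctionalProperty"),
    ("timbl:i", "foaf:homepage", "w3c:timblhomepage"),
    ("adv:timbl", "foaf:homepage", "w3c:timblhomepage"),
]
inferred = same_as_from_ifp(data)
print(inferred)
```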



Consolidation: Results

For our Linked Data corpus:

  • ~12 million explicit owl:sameAs triples (as before)

  • ~8.7 million thru. owl:InverseFunctionalProperty

  • ~106 thousand thru. owl:FunctionalProperty

  • none thru. owl:cardinality/owl:maxCardinality

    In terms of equivalences found (baseline vs. extended):

  • ~2.8 million sets of equivalent identifiers

    • (1.31x baseline)

  • ~14.86 million identifiers involved

    • (2.58x baseline)

  • ~5.8 million URIs

    • !!(1.014x baseline)!!



Linked Data Reasoning Wrap-Up

Heterogeneity poses a significant problem for consuming Linked Data

  • Heterogeneity in schema

  • Heterogeneity in naming

    …but we can use the mappings provided by publishers to integrate heterogeneous Linked Data corpora (with a little caution)

  • Lightweight rule-based reasoning can go a long way

  • Deceit/Noise ≠ End Of World

    • Consider source of data!

  • Inconsistency ≠ End Of World

    • Useful for finding noise, in fact!

  • Explicit owl:sameAs vs. extended consolidation:

    • Extended consolidation mostly (but not entirely) for consolidating blank-nodes from older FOAF exporters



Indexing RDF

Aidan Hogan

Day 3

Session 2



…how do we index RDF for queries?



RDF Index Designs (1/4): Horizontal table-per-class

Class: Car

  • Pros:

    • Fast for certain queries, esp. “star shaped” queries

    • Little redundancy in cells

  • Cons:

    • Becomes very sparse for larger schema

      • Lots of nulls needed

    • Special handling needed for multi-valued attributes

Class: Person



RDF Index Designs (2/4): Vertical triple table

  • Pros:

    • No more nulls needed

    • Flexible for updates (even to schema)

    • Multi-valued attributes no problem

  • Cons:

    • Lots of self-joins

    • Lots of redundancy in the cells



RDF Index Designs (3/4): Vertical table per prop.

Property: model

Property: ownsCar

Class: car

Property: type

  • Pros:

    • Less redundancy

  • Cons:

    • Potentially many tables


RDF Index Designs (4/4): Hybrid

Property: seeAlso

Class: Car

  • Pros:

    • ~Depends

  • Cons:

    • Likely to be more costly to manage

Class: Person

Property: img



…high-level approaches?



Native vs. RDB-style storage

  • RDB-based indexes

    • Store data in a relational database

    • Typically B+Trees or similar RDB technology

    • Sometimes horizontal (RDB-like) schema

    • Mostly vertical (RDF-like) tables

    • 4store, AllegroGraph, Bigdata, BigOWLIM, Hexastore, Jena SDB, Mulgara, Redland, Virtuoso, etc.

  • Native RDF stores

    • Custom storage solutions

    • HPRD, Jena TDB, RDF3X, SIREn, Voldemort, YARS2

    • YARS2: Sparse indexes

    • SIREn: IR-style indexes over Lucene

  • Distinction not always clear-cut!



YARS2: Example Native Storage

  • Combination of in-memory and on-disk storage

  • Read optimised

  • Bulk-load (just sort)



…indexing patterns?



Triple stores vs. Quad stores

  • Triple stores

    • Only service simple RDF triple patterns

    • RDF-3X, SIREn, 3store, etc.

      • ?s rdf:type foaf:Person .

      • aidan ?p galway .

      • ?s ?p ?o .

  • Quad stores

    • Also service patterns involving named graphs

    • Typical for indexing data from multiple sources

    • Needed for SPARQL querying!!

      • GRAPH ?g {?s rdf:type foaf:Person}

      • GRAPH foaf.rdf {aidan ?p galway }

      • FROM graph1.rdf … WHERE { ?s ?p ?o . }

    • Virtuoso, BigOWLIM, Jena TDB/SDB, YARS2, 4store, Hexastore, etc.



Building a full Quad index

  • (subject, predicate, object, graph)

    • graph sometimes called context

  • 2^4 = 16 patterns to service!



Six prefix-indexes for quads

  • Requires six different indexes to service all 16 quad patterns

    • assuming prefix lookups
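
The claim can be checked by brute force. In the sketch below, the six orderings are one possible choice and are an assumption of this example (the slide's own table of indexes did not survive in the transcript). A pattern is served by an index when its bound positions form a prefix of that index's ordering.

```python
from itertools import combinations

POSITIONS = "SPOC"  # subject, predicate, object, context (graph)

# One possible set of six orderings (assumed for illustration).
INDEXES = ["SPOC", "POCS", "OCSP", "CSPO", "CPSO", "OSPC"]

def all_patterns():
    """Every subset of bound positions: 2^4 = 16 quad patterns."""
    for k in range(len(POSITIONS) + 1):
        for bound in combinations(POSITIONS, k):
            yield frozenset(bound)

def serving_index(bound, indexes=INDEXES):
    """Return an ordering whose first len(bound) positions are exactly the
    bound ones (so a prefix lookup answers the pattern), else None."""
    for order in indexes:
        if set(order[:len(bound)]) == set(bound):
            return order
    return None

patterns = list(all_patterns())
coverage = {
    "".join(sorted(p, key=POSITIONS.index)) or "(none bound)": serving_index(p)
    for p in patterns
}
for bound, index in coverage.items():
    print(f"{bound:12s} -> {index}")
```

Six is also the minimum under prefix lookups: there are six distinct two-position patterns, and each ordering has only one two-position prefix.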



…common index optimisations?



Object IDs

Data Table

Dictionary

  • Pros:

    • Can load more data in memory

    • Faster to compute joins

    • Smaller on-disk footprint

  • Cons:

    • Maintain a potentially massive dictionary

    • Slower to externalise streaming results
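
A minimal sketch of the object-ID (dictionary encoding) idea: each distinct term is stored once in a dictionary, and the data table holds only small integers. The class name and example triples are assumptions for illustration.

```python
class Dictionary:
    """Maps RDF terms to compact integer object IDs and back."""

    def __init__(self):
        self._id_of = {}     # term -> OID
        self._term_of = []   # OID -> term

    def encode(self, term):
        if term not in self._id_of:
            self._id_of[term] = len(self._term_of)
            self._term_of.append(term)
        return self._id_of[term]

    def decode(self, oid):
        return self._term_of[oid]

triples = [
    ("ex:aidan", "foaf:knows", "ex:fred"),
    ("ex:fred", "foaf:knows", "ex:aidan"),
    ("ex:fred", "ex:ownsCar", "ex:fredsKia"),
]

dictionary = Dictionary()
# Data table: integers only, so it is smaller and cheaper to compare in joins.
data_table = [tuple(dictionary.encode(t) for t in triple) for triple in triples]
print(data_table)

# Externalising results means one dictionary lookup per term (the listed "con").
decoded = [tuple(dictionary.decode(oid) for oid in row) for row in data_table]
```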


Join re-ordering/selection (in brief)

  • Equi-joins are commutative

    • What ordering to execute them in?

  • Choice of various techniques

    • Nested-loop join

    • Hash join

    • Index join

  • Use selectivity estimates…

    • Other techniques known from databases!

      ?person foaf:based_near dbpedia:Korea . (×4,000 matches; execute second)

      aidan foaf:knows ?person . (×40 matches; execute first)
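
A greedy ordering by estimated cardinality can be sketched as follows. The pairing of the two estimates (4,000 and 40) with the two patterns is an assumption read off the slide annotations, and the plan function is illustrative rather than any engine's actual optimiser.

```python
# Assumed estimates: number of triples expected to match each pattern.
estimated_matches = {
    ("?person", "foaf:based_near", "dbpedia:Korea"): 4000,
    ("aidan", "foaf:knows", "?person"): 40,
}

def plan(patterns, estimates):
    """Greedy plan: evaluate the pattern expected to match fewest triples
    first, so later patterns are probed with fewer bindings."""
    return sorted(patterns, key=lambda pattern: estimates[pattern])

ordered = plan(list(estimated_matches), estimated_matches)
for step, pattern in enumerate(ordered, start=1):
    print(step, " ".join(pattern), ".")

# With an index join, the first pattern drives the number of lookups:
lookups_if_selective_first = estimated_matches[ordered[0]]    # 40 probes
lookups_if_selective_last = estimated_matches[ordered[-1]]    # 4000 probes
```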



Speed-Up: Replication

  • Pros:

    • Speed-up response times

    • Better fault-tolerance

  • Cons:

    • Expensive!

    • Updates?


Scale-Up: Distribution

  • Pros:

    • Handle more data

    • Commodity hardware ~cheap

  • Cons:

    • Joins expensive to compute

    • More complex architecture and maintenance


Hash-based Distribution

Quad: kmi:tom foaf:interest wikipedia:Beer kmi:tomfoaf.rdf

Query pattern: kmi:tom ?p ?o ?c

Placement and routing: compute hash mod 4

  • Pros:

    • Can route query directly to the machine

  • Cons:

    • Load-balancing, esp. for predicates and values of rdf:type
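
A minimal sketch of hash-based placement over four machines follows. Hashing on the subject and using MD5 as the hash are assumptions for this example; a stable hash is used because Python's built-in hash() for strings differs between processes.

```python
import hashlib

MACHINES = 4

def machine_for(term, machines=MACHINES):
    """Stable hash of a term, reduced modulo the number of machines."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % machines

quad = ("kmi:tom", "foaf:interest", "wikipedia:Beer", "kmi:tomfoaf.rdf")

# Index time: place the quad by hashing its subject (assumed key).
home = machine_for(quad[0])

# Query time: the pattern  kmi:tom ?p ?o ?c  has the same bound subject,
# so it can be routed straight to the same machine.
target = machine_for("kmi:tom")
print(home, target)

# Hashing on predicate or on the class in rdf:type triples would send every
# quad with that (very common) term to one machine, hence the load-balancing con.
skewed = {machine_for("rdf:type") for _ in range(1000)}
```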




Random Distribution

kmi:tom foaf:interest wikipedia:Beer kmi:tomfoaf.rdf

random distribution

  • Pros:

    • No load-balancing issues

  • Cons:

    • At query-time, don’t know which machine to ask…


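Query Flooding

Q: ?s foaf:interest ?p ?o

[Diagram: under random distribution, Q is sent to every machine; most return nothing (-), and the machine holding the matching quad returns kmi:tom foaf:interest wikipedia:Beer kmi:tomfoaf.rdf]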



…existing engines?



More besides!!!

RDB-based indexes

  • 4store,

  • AllegroGraph,

  • Bigdata,

  • BigOWLIM,

  • Hexastore,

  • Jena SDB,

  • Mulgara,

  • Redland,

  • Virtuoso, etc.

    Native RDF stores

  • HPRD,

  • Jena TDB,

  • RDF3X,

  • SIREn,

  • Voldemort,

  • YARS2, etc.



Berlin SPARQL Benchmark (v3)

  • Benchmark of common SPARQL engines

  • Set of assorted SPARQL queries and fixed data

  • Results for query-mixes per hour:

Christian Bizer, Andreas Schultz: The Berlin SPARQL Benchmark. 

Int. J. Semantic Web Inf. Syst. 5(2): 1-24 (2009)



Indexing wrap-up

  • Lots of work in the area!

    • Native stores vs. RDB-style stores

    • Triple stores vs. Quad stores

  • Optimisations

    • OIDs

    • Replication

    • Distribution

    • Join Selection/Reordering, etc.

  • No definitive solution…



Closing Indexing Quote

“In previous papers, some of us predicted the end of ‘one size fits all’ as a commercial relational DBMS paradigm.

“These papers presented reasons and experimental evidence that showed that the major RDBMS vendors can be outperformed by 1-2 orders of magnitude by specialized engines in the data warehouse, stream processing, text, and scientific database markets.”

[Stonebraker et al.; 2007]

