
Linked Data Integration (using reasoning)

Aidan Hogan

Day 3

Session 2


What is reasoning?


Reasoning: Conceptual Overview

(Loosely) Deriving novel conclusions from existing knowledge

Deductive reasoning: inferring new facts from existing rules and facts

Given rule: All Kia cars are made in Korea;

Given premise (fact): Fred’s car is a Kia

Entails fact: Fred’s car was made in Korea

Inductive reasoning: learning new rules from existing facts and entailments (typically what we humans do: build imprecise rules from models)

Given model of existing facts: All Kia cars I’ve seen have four wheels

~Entails rule: All Kias have four wheels

Given fact: Fred’s car is a Kia

Entails (probabilistic fact): Fred’s car likely has four wheels

Abductive reasoning: guess a premise from a conclusion (similar in principle to a form of inductive reasoning)

Given entailment: Fred’s car is Korean and has four wheels?

Rule: Kias and Hyundais are Korean and typically have four wheels

Guessed premise: Fred’s car is a Kia or a Hyundai?


Reasoning: Clearing up terms

  • Semantics: formally defined meanings of terms

    • KiaCar ⊑ KoreanCar ⊓ FourWheels

  • Entailments: the conclusions which follow from formal semantics

  • Inference: a procedure to compute entailments

    • (or the entailments that can be computed therefrom)


Reasoning: Example Conceptual Tasks

  • Conjunctive Query Answering: generate new answers for questions

    • Include Fred as an answer to “give me friends of mine who own Korean cars”

  • Subsumption checking: identify subclass relationships

    • Is the class KiaCar a subset of KoreanCar if all Kias are manufactured in Seoul?

  • Class-Satisfiability checking: identify if a class can have a membership

    • Can there be something which is a KoreanCar and a EuropeanCar?

  • Consistency checking: identify formally conflicting information

    • Fred tells me his Kia is European; is this correct?

  • Instance checking: identify if an individual is a member of a given class

    • Is Fred’s Kia an Asian car?


RDFS/OWL reasoning?


Reasoning: RDFS and OWL (deductive)

  • Formal semantics of RDFS and OWL can be leveraged for reasoning.

  • Given:

    :KiaCar rdfs:subClassOf :KoreanCar ,

        [ owl:hasValue :Seoul ; owl:onProperty :manufacturedIn ] .

    :FredsCar a :KiaCar .

  • Implies:

    :FredsCar a :KoreanCar ; :manufacturedIn :Seoul .


Reasoning: OWL (2)

  • Eight sub-languages of OWL!

  • Why eight?(^^)

  • Direct Semantics (based on Description Logics):

    • OWL DL (NExpTime for non-QA tasks), OWL Lite (ExpTime for non-QA)

    • OWL 2 EL, OWL 2 QL, OWL 2 RL (All PTime for most tasks for non-QA)

    • OWL 2 DL (2NExpTime for non-QA)

    • Emphasis on soundness and completeness

    • Tableaux-based algorithms

      • Based on KB-satisfiability checking

    • Syntactic restrictions to preserve complexity

      • e.g., no datatype inverse-functional properties

  • RDF-Based Semantics (layered directly on top of RDFS)

    • OWL Full, OWL 2 Full

    • All tasks are undecidable!

      • No complete, correct inference procedure can exist for the reasoning tasks

      • Incomplete reasoning possible through rules…

Opinion: OWL/OWL 2 useful stuff, but an extremely complex standard!!!


RDFS and OWL 2 RL: Entailment rules

  • RDFS entailment rules provide sound, complete RDFS reasoning

  • OWL 2 RL/RDF rules provide partial support for the OWL 2 RDF-Based Semantics

  • Monotonic rules which are guarded

  • Positive subset of datalog with a fixed ternary predicate

  • Rules have cubic complexity (with trivial exceptions aside)

    • Due to the arity of triples (3)


Rules

IF ⇒ THEN

Body/Antecedent/Condition ⇒ Head/Consequent

?c1 rdfs:subClassOf ?c2 . (Schema/Terminology/Ontological pattern)

?x rdf:type ?c1 . (Instance/Assertional pattern)

⇒ ?x rdf:type ?c2 .

  • foaf:Person rdfs:subClassOf foaf:Agent .

  • timbl:me rdf:type foaf:Person .

  • ⇒ timbl:me rdf:type foaf:Agent .
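The subclass rule on this slide can be applied mechanically. The following is a minimal sketch, assuming triples are plain Python tuples of prefixed names; the function name `apply_subclass_rule` is hypothetical and not taken from any reasoner.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def apply_subclass_rule(triples):
    """Apply the subclass rule repeatedly until no new triple appears."""
    closure = set(triples)
    while True:
        # Schema pattern: ?c1 rdfs:subClassOf ?c2 .
        sub = {(s, o) for s, p, o in closure if p == SUBCLASS}
        # Instance pattern: ?x rdf:type ?c1 .  =>  ?x rdf:type ?c2 .
        new = {(x, RDF_TYPE, c2)
               for x, p, c1 in closure if p == RDF_TYPE
               for s, c2 in sub if s == c1}
        if new <= closure:
            return closure
        closure |= new

facts = {
    ("foaf:Person", SUBCLASS, "foaf:Agent"),
    ("timbl:me", RDF_TYPE, "foaf:Person"),
}
inferred = apply_subclass_rule(facts) - facts
print(inferred)
```

The loop runs to a fixpoint, so chains of subclass statements are followed as well.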


Rules (Inconsistencies [a.k.a. Contradictions])

IF ⇒ THEN

Body/Antecedent/Condition ⇒ Head/Consequent

?c1 owl:disjointWith ?c2 .

?x rdf:type ?c1 .

?x rdf:type ?c2 .

⇒ false

  • foaf:Person owl:disjointWith foaf:Organization .

  • w3c rdf:type foaf:Organization .

  • w3c rdf:type foaf:Person .

  • ⇒ false
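A rule with a false head detects an inconsistency rather than deriving a triple. A minimal sketch, assuming tuple-based triples and a hypothetical helper named `find_inconsistencies`:

```python
def find_inconsistencies(triples):
    """Return (instance, class1, class2) for every violated disjointness axiom."""
    disjoint = {(s, o) for s, p, o in triples if p == "owl:disjointWith"}
    types = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    return sorted((x, c1, c2)
                  for x, classes in types.items()
                  for c1, c2 in disjoint
                  if c1 in classes and c2 in classes)

data = [
    ("foaf:Person", "owl:disjointWith", "foaf:Organization"),
    ("w3c", "rdf:type", "foaf:Organization"),
    ("w3c", "rdf:type", "foaf:Person"),
]
conflicts = find_inconsistencies(data)
print(conflicts)
```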


Linked Data reasoning?

…integration use-case!


Linked Data Reasoning

explicit data

implicit data

How can consumers query the implicit data


…so what’s The Problem?…

…heterogeneity

…need to integrate data from different sources


Take Query Answering…

Gimme webpages (foaf:page) relating to Tim Berners-Lee (timbl:i)

timbl:i foaf:page ?pages .


Heterogeneity in schema…

webpage: properties (related in the original diagram by rdfs:subPropertyOf and owl:inverseOf links)

mo:musicBrainz

doap:homepage

mo:myspace

foaf:homepage

foaf:weblog

foaf:primaryTopic

foaf:isPrimaryTopicOf

foaf:page

foaf:topic


Linked Data, RDFS and OWL: Linked Vocabularies

SKOS

Image from http://blog.dbtune.org/public/.081005_lod_constellation_m.jpg; Giasson, Bergman


Heterogeneity in naming…

Tim Berners-Lee: URIs (linked in the original diagram by owl:sameAs)

dblp:100007

timbl:i

db:Tim-Berners_Lee

identica:45563

fb:en.tim_berners-lee

adv:timbl


Returning to our simple query…

Gimme webpages relating to Tim Berners-Lee

timbl:i foaf:page ?pages .

Properties: mo:myspace, foaf:primaryTopic, foaf:page, foaf:topic, doap:homepage, foaf:homepage, foaf:isPrimaryTopicOf

URIs: identica:45563, adv:timbl, db:Tim-Berners_Lee, dblp:100007, fb:en.tim_berners-lee, timbl:i

...7 x 6 = 42 possible patterns
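Without reasoning, a consumer would have to write out each of those patterns by hand. A minimal sketch of the enumeration, assuming the seven properties and six URIs named on the slides; the set `PAGE_TO_TOPIC` is an assumption that marks the two FOAF properties pointing from the page to the thing, so their patterns are written with subject and object swapped.

```python
from itertools import product

PROPERTIES = [
    "mo:myspace", "foaf:primaryTopic", "foaf:page", "foaf:topic",
    "doap:homepage", "foaf:homepage", "foaf:isPrimaryTopicOf",
]
URIS = [
    "identica:45563", "adv:timbl", "db:Tim-Berners_Lee",
    "dblp:100007", "fb:en.tim_berners-lee", "timbl:i",
]
# Assumption: these properties run from the page to its topic.
PAGE_TO_TOPIC = {"foaf:topic", "foaf:primaryTopic"}

def expand_patterns(uris, properties, var="?pages"):
    """One triple pattern per (URI, property) combination."""
    patterns = []
    for uri, prop in product(uris, properties):
        if prop in PAGE_TO_TOPIC:
            patterns.append(f"{var} {prop} {uri} .")
        else:
            patterns.append(f"{uri} {prop} {var} .")
    return patterns

patterns = expand_patterns(URIS, PROPERTIES)
print(len(patterns))
```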


…reasoning to the rescue?


Challenges…

…what (OWL) reasoning is feasible for Linked Data?


Linked Data Reasoning: Challenges

Scalable

Expressive

Domain-Agnostic

Robust


Linked Data Reasoning: Challenges

  • Scalability

    • At least tens of billions of statements (for the moment)

      • Near linear scale!!!

  • Noisy data

    • Inconsistencies galore

    • Publishing errors


What about noise? …

…need to consider the provenance of Web data


Noisy Data: Omnipotent Being

  • Web data is noisy.

  • Proof:

  • 08445a31a78661b5c746feff39a9db6e4e2cc5cf

  • sha1-sum of ‘mailto:’

  • common value for foaf:mbox_sha1sum

    • An inverse-functional (uniquely identifying) property!!!

    • Any person who shares the same value will be considered the same

  • Q.E.D.
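The digest on this slide can be checked with a few lines. A minimal sketch, assuming foaf:mbox_sha1sum values are SHA-1 hex digests of the full mailto: URI; the address alice@example.org is made up for contrast, and the comparison result is computed rather than asserted here.

```python
import hashlib

SLIDE_VALUE = "08445a31a78661b5c746feff39a9db6e4e2cc5cf"

def mbox_sha1sum(mailto_uri):
    """SHA-1 hex digest of a mailto: URI."""
    return hashlib.sha1(mailto_uri.encode("utf-8")).hexdigest()

# An exporter fed an empty e-mail address hashes the bare scheme.
empty_mbox = mbox_sha1sum("mailto:")
alice = mbox_sha1sum("mailto:alice@example.org")

print(empty_mbox)
print("matches slide value:", empty_mbox == SLIDE_VALUE)
```

Every profile exported with an empty address carries the same digest, so an inverse-functional reading of the property merges all of those people into one.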


Noisy Data: Redefining everything

  • More proof (courtesy of http://www.eiao.net/rdf/1.0)

  • rdf:type rdf:type owl:Property .

  • rdf:type rdfs:label [email protected] .

  • rdf:type rdfs:comment “Type of resource” .

  • rdf:type rdfs:domain eiao:testRun .

  • rdf:type rdfs:domain eiao:pageSurvey .

  • rdf:type rdfs:domain eiao:siteSurvey .

  • rdf:type rdfs:domain eiao:scenario .

  • rdf:type rdfs:domain eiao:rangeLocation .

  • rdf:type rdfs:domain eiao:startPointer .

  • rdf:type rdfs:domain eiao:endPointer .

  • rdf:type rdfs:domain eiao:header .

  • rdf:type rdfs:domain eiao:runs .


Noisy Data: Inconsistency

w3c rdf:type foaf:Organization .

w3c rdf:type foaf:Person .

foaf:Person owl:disjointWith foaf:Organization .


Authoritative Reasoning

Consider source of schema data

Class/property URIs dereference to their authoritative document

FOAF spec authoritative for foaf:Person ✓

MY spec not authoritative for foaf:Person ✘

Allow “extension” in third-party documents

my:Person rdfs:subClassOf foaf:Person . (MY spec) ✓

BUT: Reduce obscure memberships

foaf:Person rdfs:subClassOf my:Person . (MY spec) ✘

ALSO: Protect specifications

foaf:knows a owl:SymmetricProperty . (MY spec) ✘
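The three examples above can be reproduced with a simple check. This is a sketch under a simplifying assumption: an axiom is accepted only if its source document is the one that the axiom's subject dereferences to. The full definition in the papers cited in this deck is more nuanced, and the lookup table `DEREFERENCES_TO` is hypothetical.

```python
# Hypothetical lookup: which document each term dereferences to.
DEREFERENCES_TO = {
    "foaf:Person": "FOAF spec",
    "foaf:knows": "FOAF spec",
    "my:Person": "MY spec",
}

def is_authoritative(source, term):
    """True if `source` is the document that `term` dereferences to."""
    return DEREFERENCES_TO.get(term) == source

def accept_axiom(source, triple):
    """Simplified check: the source must be authoritative for the subject."""
    subject = triple[0]
    return is_authoritative(source, subject)

extension = accept_axiom("MY spec", ("my:Person", "rdfs:subClassOf", "foaf:Person"))
obscure = accept_axiom("MY spec", ("foaf:Person", "rdfs:subClassOf", "my:Person"))
hijack = accept_axiom("MY spec", ("foaf:knows", "rdf:type", "owl:SymmetricProperty"))
print(extension, obscure, hijack)
```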


Noisy Data: Redefining everything (revisited)

  • The rdf:type redefinitions published at http://www.eiao.net/rdf/1.0 are Not Authoritative

  • rdf:type does not dereference to that document, so authoritative reasoning ignores them


Authoritative Reasoning: read more …w/ essential plugs

Gong Cheng, Yuzhong Qu.

"Integrating Lightweight Reasoning into Class-Based Query Refinement for Object Search." ASWC 2008.

Aidan Hogan, Andreas Harth, Axel Polleres.

"Scalable Authoritative OWL Reasoning for the Web." IJSWIS 2009.

Aidan Hogan, Jeff Z. Pan, Axel Polleres and Stefan Decker.

"SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples." ISWC 2010.

My thesis: http://aidanhogan.com/docs/thesis/


Alternative to Authoritative Reasoning?

  • Quarantined reasoning!

  • Separate and cache hierarchy of schema documents/dependencies…


Quarantined Reasoning [Delbru et al.; 2008]


A-Box / Instance Data (e.g., a FOAF file)

T-Box / Ontology Data (e.g., the FOAF ontology and its indirect imports)


Noisy Data: Redefining everything (revisited)

  • The rdf:type redefinitions published at http://www.eiao.net/rdf/1.0 are Not In Here

  • They are not part of the quarantined T-Box of, e.g., a FOAF file (the FOAF ontology and its indirect imports), so they do not affect reasoning over that file


Quarantined Reasoning: read more

R. Delbru, A. Polleres, G. Tummarello and S. Decker.

"Context Dependent Reasoning for Semantic Documents in Sindice." 4th International Workshop on Scalable Semantic Web Knowledge Base Systems, 2008.


Resolving Inconsistency?

  • Use link analysis (PageRank) to rank documents and triples

  • Use annotated reasoning to rank inferences

  • Repair each inconsistency by removing the weakest triple

    • Read more:

    • Piero A. Bonatti, Aidan Hogan, Axel Polleres and Luigi Sauro. "Robust and Scalable Linked Data Reasoning Incorporating Provenance and Trust Annotations". In the Journal of Web Semantics (in press).
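The repair step can be sketched as follows, assuming each inconsistency is given as the list of triples that jointly cause it and each triple already has a rank. The rank values below are invented for illustration and are not results from the cited paper.

```python
def repair(inconsistencies, rank):
    """For each inconsistency, remove its lowest-ranked triple."""
    removed = set()
    for conflict in inconsistencies:
        if removed & set(conflict):
            continue  # already resolved by an earlier removal
        removed.add(min(conflict, key=lambda triple: rank[triple]))
    return removed

org = ("w3c", "rdf:type", "foaf:Organization")
person = ("w3c", "rdf:type", "foaf:Person")
axiom = ("foaf:Person", "owl:disjointWith", "foaf:Organization")

# Hypothetical ranks: higher means more trusted.
rank = {org: 0.8, person: 0.1, axiom: 0.9}

removed = repair([[axiom, org, person]], rank)
print(removed)
```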


What about scale? …

…using positive (monotonic) rules.

Expressive reasoning (also) possible through tableaux, but yet to demonstrate desired scale


Materialisation

  • Forward-chaining Materialisation

    • Avoid runtime expense

      • Users taught impatience by Google

    • Pre-compute for quick retrieval

    • Web-scale systems should scale well

      • More data = more disk-space/machines

Don't materialise too much!

One size does not fit all!


  • INPUT:

  • Flat file of triples (quads)

  • OUTPUT:

  • Flat file of (partial) inferred triples (quads)


What rules?

  • Let’s look at a recent corpus of Linked Data and see what schema’s inside

  • (and what the rulesets support)

    • Open-domain crawl May 2010

    • 1.1 billion quadruples

    • 3.985 million sources (docs)

    • 780 pay-level domains (e.g., dbpedia.org)

    • Ran “special” PageRank over documents

    • 86 thousand docs contained some RDFS/OWL schema data (2.2% of docs... but <0.2% of triples)

    • Summed the ranks of the docs using each primitive


Survey of Linked Data schema: Top 15 ranks

#   Axiom                           Rank (Σ)  RDFS  Horst  O2R

1   rdfs:subClassOf                 0.295     ✓     ✓      ✓

2   rdfs:range                      0.294     ✓     ✓      ✓

3   rdfs:domain                     0.292     ✓     ✓      ✓

4   rdfs:subPropertyOf              0.090     ✓     ✓      ✓

5   owl:FunctionalProperty          0.063     ✘     ✓      ✓

6   owl:disjointWith                0.049     ✘     ✘      ✓

7   owl:inverseOf                   0.047     ✘     ✓      ✓

8   owl:unionOf                     0.035     ✘     ✘      ✓

9   owl:SymmetricProperty           0.033     ✘     ✓      ✓

10  owl:TransitiveProperty          0.030     ✘     ✓      ✓

11  owl:equivalentClass             0.021     ✘     ✓      ✓

12  owl:InverseFunctionalProperty   0.030     ✘     ✓      ✓

13  owl:equivalentProperty          0.030     ✘     ✓      ✓

14  owl:someValuesFrom              0.030     ✘     ✓      ✓

15  owl:hasValue                    0.028     ✘     ✓      ✓


Scalable Reasoning: In-memory T-Box

Main optimisation: Store T-Box in memory

T-Box: (loosely) data describing classes and properties.

Aka. schemata/vocabularies/ontologies/terminologies.

E.g.,

foaf:topic owl:inverseOf foaf:page .

sioc:UserAccount rdfs:subClassOf foaf:OnlineAccount .

Most commonly accessed data for reasoning

Quite small (~0.1% for our Linked Data corpus)

High selectivity (if you prefer)

A-Box: lots of matches for ?s foaf:page ?o .

vs. T-Box: few matches for foaf:page ?p ?o . + ?s ?p foaf:page .


Scalable Reasoning: Two Scans

Scan 1: Scan the input data, separate out the T-Box statements, and load them into memory

Do T-Box level reasoning if required (semi-naïve)

Scan 2: Scan all on-disk data, join with in-memory T-Box.


Scalable Reasoning: No A-Box Joins

  • Execution of three rules:

    OWL 2 RL rule prp-inv1

    ?p1 owl:inverseOf ?p2 .

    ?x ?p1 ?y .

    ⇒ ?y ?p2 ?x .

    OWL 2 RL rule prp-rng

    ?p rdfs:range ?c .

    ?x ?p ?y .

    ⇒ ?y a ?c .

    OWL 2 RL rule prp-spo1

    ?p1 rdfs:subPropertyOf ?p2 .

    ?x ?p1 ?y .

    ⇒ ?x ?p2 ?y .

ON-DISK A-BOX

...

ex:me foaf:homepage ex:hp .

...

IN-MEM T-BOX

ON-DISK OUTPUT

...

ex:hp rdf:type foaf:Document .

ex:me foaf:page ex:hp .

ex:hp foaf:topic ex:me .

...
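Because each of the three rules matches a single A-Box triple, every input triple can be processed on its own against the in-memory T-Box. A minimal sketch follows; the three T-Box statements in `TBOX` are assumed for illustration (the slide does not list them), and `infer_from` is a hypothetical name.

```python
# Assumed T-Box, chosen so that the slide's output is reproduced.
TBOX = {
    ("foaf:homepage", "rdfs:subPropertyOf", "foaf:page"),
    ("foaf:page", "owl:inverseOf", "foaf:topic"),
    ("foaf:homepage", "rdfs:range", "foaf:Document"),
}

def index(tbox, predicate):
    """Map each subject of `predicate` to its objects."""
    idx = {}
    for s, p, o in tbox:
        if p == predicate:
            idx.setdefault(s, []).append(o)
    return idx

INV = index(TBOX, "owl:inverseOf")
RNG = index(TBOX, "rdfs:range")
SPO = index(TBOX, "rdfs:subPropertyOf")

def infer_from(triple):
    """All inferences reachable from one A-Box triple using only the T-Box."""
    seen, todo = set(), [triple]
    while todo:
        x, p, y = todo.pop()
        new = [(y, p2, x) for p2 in INV.get(p, [])]          # prp-inv1
        new += [(y, "rdf:type", c) for c in RNG.get(p, [])]  # prp-rng
        new += [(x, p2, y) for p2 in SPO.get(p, [])]         # prp-spo1
        for t in new:
            if t not in seen and t != triple:
                seen.add(t)
                todo.append(t)
    return seen

output = infer_from(("ex:me", "foaf:homepage", "ex:hp"))
print(sorted(output))
```

No inference here ever needs a second A-Box triple, which is why a single scan over the on-disk data suffices.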


Scalable Reasoning: A-Box joins?

  • However: some rules do require A-Box joins

    • ?p a owl:TransitiveProperty . ?x ?p ?y . ?y ?p ?z .

      ⇒ ?x ?p ?z .

    • Difficult to engineer a scalable solution (which reaches a fixpoint) for Linked Data(?)

    • Can lead to quadratic inferences

  • A lot of useful reasoning still possible without A-Box joins…
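To see why the transitivity rule is harder, consider a sketch of its closure for one transitive property. The rule joins two A-Box triples, must be iterated to a fixpoint, and on a simple chain produces a quadratic number of inferences. The function below is an illustrative semi-naive evaluation, not the method of any particular system.

```python
def transitive_closure(edges):
    """Semi-naive closure of (x, y) pairs for one transitive property."""
    closure = set(edges)
    delta = set(edges)
    while delta:
        by_subject, by_object = {}, {}
        for x, y in closure:
            by_subject.setdefault(x, set()).add(y)
            by_object.setdefault(y, set()).add(x)
        new = set()
        for x, y in delta:
            # Join only the newest pairs against everything known so far.
            new |= {(x, z) for z in by_subject.get(y, ())}
            new |= {(w, y) for w in by_object.get(x, ())}
        delta = new - closure
        closure |= delta
    return closure

# A chain of 10 statements over 11 nodes.
chain = [(i, i + 1) for i in range(10)]
closure = transitive_closure(chain)
print(len(chain), "input pairs ->", len(closure), "pairs in the closure")
```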


Features not requiring A-Box joins

  • rdfs:subClassOf 0.295 ✓

  • rdfs:range 0.294 ✓

  • rdfs:domain 0.292 ✓

  • rdfs:subPropertyOf 0.090 ✓

  • owl:FunctionalProperty 0.063 ✘

  • owl:disjointWith 0.049 ✘

  • owl:inverseOf 0.047 ✓

  • owl:unionOf 0.035 ✓

  • owl:SymmetricProperty 0.033 ✓

  • owl:equivalentClass 0.021 ✓

  • owl:InverseFunctionalProperty 0.030 ✘

  • owl:equivalentProperty 0.030 ✓

  • owl:someValuesFrom 0.030 ✓/✘


Reasoning Performance (1 machine)


Scalable Distributed Reasoning

[Diagram: five machines, each holding a different segment of the input data (e.g., containing ex:me ex:presented ex:ThisTalk)]

EXTRACT T-BOX: each machine extracts the T-Box statements from its own segment

COLLECT T-BOX: the extracted T-Box statements are gathered from all machines

SAME T-BOX, DIFF. A-BOX: every machine reasons with the same T-Box over its own A-Box segment

LOCAL OUTPUT: each machine writes its inferences locally, e.g.:

ex:me rdf:type ex:Awesome .


Reasoning Performance: Distribution

9 machines: Total 3.35 hours


Distributed Reasoning: read more

Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker: SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples. International Semantic Web Conference (1) 2010: 337-353

Jesse Weaver, James A. Hendler: Parallel Materialization of the Finite RDFS Closure for Hundreds of Millions of Triples. International Semantic Web Conference 2009: 682-697

Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen: Scalable Distributed Reasoning Using MapReduce. International Semantic Web Conference 2009: 634-649

Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal: OWL Reasoning with WebPIE: Calculating the Closure of 100 Billion Triples. ESWC (1) 2010: 213-227

A-Box Joins


…what about owl:sameAs?


Consolidation for Linked Data



Consolidation: Baseline

timbl:i

identica:45563

dbpedia:Berners-Lee

  • Use provided owl:sameAs mappings in the data

    timbl:i owl:sameAs identica:45563 .

    dbpedia:Berners-Lee owl:sameAs identica:45563 .

  • Store “equivalences” found

    timbl:i->

    identica:45563->

    dbpedia:Berners-Lee->


Consolidation: Baseline

timbl:i

identica:45563

dbpedia:Berners-Lee

  • For each set of equivalent identifiers, choose a canonical term


Canonicalisation

Afterwards, rewrite identifiers to their canonical version (here dbpedia:Berners-Lee, chosen from timbl:i, identica:45563 and dbpedia:Berners-Lee):

Before:

timbl:i rdf:type foaf:Person .

identica:48404 foaf:knows identica:45563 .

dbpedia:Berners-Lee dpo:birthDate “1955-06-08”^^xsd:date .

After:

dbpedia:Berners-Lee rdf:type foaf:Person .

identica:48404 foaf:knows dbpedia:Berners-Lee .

dbpedia:Berners-Lee dpo:birthDate “1955-06-08”^^xsd:date .
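The baseline can be sketched with a union-find structure over the owl:sameAs pairs. The choice of the lexicographically smallest identifier as canonical term is an assumption made here for determinism; the slides do not say how the canonical term is picked. For brevity the rewrite touches every position of a triple, which a real system would likely restrict.

```python
def equivalence_sets(same_as_pairs):
    """Group identifiers connected by owl:sameAs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    sets = {}
    for x in parent:
        sets.setdefault(find(x), set()).add(x)
    return list(sets.values())

def canonical_map(sets):
    """Assumption: the lexicographically smallest term is canonical."""
    return {term: min(s) for s in sets for term in s}

def rewrite(triples, canon):
    return [tuple(canon.get(t, t) for t in triple) for triple in triples]

pairs = [("timbl:i", "identica:45563"),
         ("dbpedia:Berners-Lee", "identica:45563")]
canon = canonical_map(equivalence_sets(pairs))
data = [
    ("timbl:i", "rdf:type", "foaf:Person"),
    ("identica:48404", "foaf:knows", "identica:45563"),
    ("dbpedia:Berners-Lee", "dpo:birthDate", '"1955-06-08"^^xsd:date'),
]
consolidated = rewrite(data, canon)
print(consolidated)
```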


Extended Consolidation

  • Infer owl:sameAs through reasoning (OWL 2 RL/RDF)

    • explicit owl:sameAs (again)

    • owl:InverseFunctionalProperty

    • owl:FunctionalProperty

    • owl:cardinality 1 / owl:maxCardinality 1

      foaf:homepage a owl:InverseFunctionalProperty .

      timbl:i foaf:homepage w3c:timblhomepage .

      adv:timbl foaf:homepage w3c:timblhomepage .

      ⇒ timbl:i owl:sameAs adv:timbl .

      …then apply consolidation as before
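The inverse-functional case can be sketched as grouping subjects that share a value for such a property. This is a minimal illustration using tuple triples; the function name `same_as_from_ifp` is hypothetical, and functional properties and cardinality restrictions are not covered.

```python
from itertools import combinations

def same_as_from_ifp(triples):
    """Infer owl:sameAs between subjects sharing an inverse-functional value."""
    ifps = {s for s, p, o in triples
            if p == "rdf:type" and o == "owl:InverseFunctionalProperty"}
    subjects = {}
    for s, p, o in triples:
        if p in ifps:
            subjects.setdefault((p, o), set()).add(s)
    return {(a, "owl:sameAs", b)
            for group in subjects.values()
            for a, b in combinations(sorted(group), 2)}

data = [
    ("foaf:homepage", "rdf:type", "owl:InverseFunctionalProperty"),
    ("timbl:i", "foaf:homepage", "w3c:timblhomepage"),
    ("adv:timbl", "foaf:homepage", "w3c:timblhomepage"),
]
inferred_same_as = same_as_from_ifp(data)
print(inferred_same_as)
```

The inferred pairs can then be consolidated in the same way as the explicit owl:sameAs statements.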


Consolidation: Results

For our Linked Data corpus:

  • ~12 million explicit owl:sameAs triples (as before)

  • ~8.7 million thru. owl:InverseFunctionalProperty

  • ~106 thousand thru. owl:FunctionalProperty

  • none thru. owl:cardinality/owl:maxCardinality

In terms of equivalences found (baseline vs. extended):

  • ~2.8 million sets of equivalent identifiers

    • (1.31x baseline)

  • ~14.86 million identifiers involved

    • (2.58x baseline)

  • ~5.8 million URIs

    • !!(1.014x baseline)!!


Linked Data Reasoning Wrap-Up

Heterogeneity poses a significant problem for consuming Linked Data

  • Heterogeneity in schema

  • Heterogeneity in naming

…but we can use the mappings provided by publishers to integrate heterogeneous Linked Data corpora (with a little caution)

  • Lightweight rule-based reasoning can go a long way

  • Deceit/Noise ≠ End Of World

    • Consider source of data!

  • Inconsistency ≠ End Of World

    • Useful for finding noise in fact!

  • Explicit owl:sameAs vs. extended consolidation:

    • Extended consolidation mostly (but not entirely) for consolidating blank-nodes from older FOAF exporters


Indexing RDF

Aidan Hogan

Day 3

Session 2


…how do we index RDF for queries?


RDF Index Designs (1/4): Horizontal table-per-class

Class: Car

  • Pros:

    • Fast for certain queries, esp. “star shaped” queries

    • Little redundancy in cells

  • Cons:

    • Becomes very sparse for larger schema

      • Lots of nulls needed

    • Special handling needed for multi-valued attributes

Class: Person


RDF Index Designs (2/4): Vertical triple table

  • Pros:

    • No more nulls needed

    • Flexible for updates (even to schema)

    • Multi-valued attributes no problem

  • Cons:

    • Lots of self-joins

    • Lots of redundancy in the cells


RDF Index Designs (3/4): Vertical table per prop.

Property: model

Property:ownsCar

Class: car

Property: type

  • Pros:

    • Less redundancy

  • Cons:

    • Potentially many tables


RDF Index Designs (4/4): Hybrid

Property: seeAlso

Class: Car

  • Pros:

    • ~Depends

  • Cons:

    • Likely to be more costly to manage

Class: Person

Property: img


…high-level approaches?


Native vs. RDB-style storage

  • RDB-based indexes

    • Store data in a relational database

    • Typically B+Trees or similar RDB technology

    • Sometimes horizontal (RDB-like) schema

    • Mostly vertical (RDF-like) tables

    • 4store, AllegroGraph, Bigdata, BigOWLIM, Hexastore, Jena SDB, Mulgara, Redland, Virtuoso, etc.

  • Native RDF stores

    • Custom storage solutions

    • HPRD, Jena TDB, RDF3X, SIREn, Voldemort, YARS2

    • YARS2: Sparse indexes

    • SIREn: IR-style indexes over Lucene

  • Distinction not always clear-cut!


YARS2: Example Native Storage

  • Combination of in-memory and on-disk storage

  • Read optimised

  • Bulk-load (just sort)


…indexing patterns?


Triple stores vs. Quad stores

  • Triple stores

    • Only service simple RDF triple patterns

    • RDF-3X, SIREn, 3store, etc.

      • ?s rdf:type foaf:Person .

      • aidan ?p galway .

      • ?s ?p ?o .

  • Quad stores

    • Also service patterns involving named graphs

    • Typical for indexing data from multiple sources

    • Needed for SPARQL querying!!

      • GRAPH ?g {?s rdf:type foaf:Person}

      • GRAPH foaf.rdf {aidan ?p galway }

      • FROM graph1.rdf … WHERE { ?s ?p ?o . }

    • Virtuoso, BigOWLIM, Jena TDB/SDB, YARS2, 4store, Hexastore, etc.


Building a full Quad index

  • (subject, predicate, object, graph)

    • graph sometimes called context

  • 2^4 = 16 patterns to service!


Six prefix-indexes for quads

  • Requires six different indexes to service all 16 quad patterns

    • assuming prefix lookups
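The count of six can be checked by brute force. The sketch below assumes one possible set of six orderings (S = subject, P = predicate, O = object, C = context); it is not necessarily the set shown on the original slide. A pattern is serviceable by prefix lookup if its bound positions form a prefix of some ordering.

```python
from itertools import combinations

# Assumed set of six orderings; other valid choices exist.
INDEX_ORDERS = ["SPOC", "POCS", "OCSP", "CSPO", "CPSO", "OSPC"]

def index_for(bound, orders=INDEX_ORDERS):
    """Return an ordering whose prefix is exactly the bound positions."""
    bound = frozenset(bound)
    for order in orders:
        if frozenset(order[:len(bound)]) == bound:
            return order
    return None

# All 2^4 = 16 combinations of bound positions.
ALL_PATTERNS = [frozenset(c) for n in range(5) for c in combinations("SPOC", n)]
coverage = {"".join(sorted(p)): index_for(p) for p in ALL_PATTERNS}

for bound, order in coverage.items():
    print(f"bound={bound or '-':4} -> {order}")
```

Six is also a lower bound: there are six ways to bind exactly two of the four positions, and each ordering has only one two-element prefix.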


…common index optimisations?


Object IDs

Data Table

Dictionary

  • Pros:

    • Can load more data in memory

    • Faster to compute joins

    • Smaller on-disk footprint

  • Cons:

    • Maintain a potentially massive dictionary

    • Slower to externalise streaming results
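A minimal sketch of the object-ID idea: each term is stored once in a dictionary, and the data table holds only integers. The class name `Dictionary` and its methods are hypothetical.

```python
class Dictionary:
    """Bidirectional mapping between RDF terms and integer object IDs."""

    def __init__(self):
        self.term_to_id = {}
        self.id_to_term = []

    def encode(self, term):
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode(self, oid):
        return self.id_to_term[oid]

dictionary = Dictionary()
triples = [
    ("ex:me", "foaf:knows", "ex:you"),
    ("ex:you", "foaf:knows", "ex:me"),
]
# Data table: integer tuples, compact and cheap to compare in joins.
table = [tuple(dictionary.encode(t) for t in triple) for triple in triples]
# Results must be decoded again before they are returned to the user.
decoded = [tuple(dictionary.decode(i) for i in row) for row in table]
print(table)
```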


Join re-ordering/selection (in brief)

  • Equi-joins are commutative

    • What ordering to execute them in?

  • Choice of various techniques

    • Nested-loop join

    • Hash join

    • Index join

  • Use selectivity estimates…

    • Other techniques known from databases!

      ?person foaf:based_near dbpedia:Korea . (x4,000 matches; step 2)

      aidan foaf:knows ?person . (x40 matches; step 1)
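A sketch of selectivity-based ordering on toy data: the pattern with the fewest matches is evaluated first, so fewer intermediate bindings are carried into the next pattern. The exact match count stands in for a selectivity estimate, and all ex: identifiers and the data are invented for illustration.

```python
def matches(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, else None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended

def evaluate(patterns, data):
    """Order patterns by match count, then join them left to right."""
    ordered = sorted(
        patterns,
        key=lambda p: sum(matches(p, t, {}) is not None for t in data))
    bindings = [{}]
    for pattern in ordered:
        bindings = [b2 for b in bindings for t in data
                    if (b2 := matches(pattern, t, b)) is not None]
    return ordered, bindings

data = [
    ("aidan", "foaf:knows", "ex:p1"),
    ("aidan", "foaf:knows", "ex:p2"),
    ("ex:p1", "foaf:based_near", "dbpedia:Korea"),
    ("ex:p3", "foaf:based_near", "dbpedia:Korea"),
    ("ex:p4", "foaf:based_near", "dbpedia:Korea"),
    ("ex:p5", "foaf:based_near", "dbpedia:Korea"),
]
query = [
    ("?person", "foaf:based_near", "dbpedia:Korea"),
    ("aidan", "foaf:knows", "?person"),
]
ordered, bindings = evaluate(query, data)
print(ordered[0], bindings)
```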


Speed-Up: Replication

  • Pros:

    • Speed-up response times

    • Better fault-tolerance

  • Cons:

    • Expensive!

    • Updates?



Scale-Up: Distribution

  • Pros:

    • Handle more data

    • Commodity hardware ~cheap

  • Cons:

    • Joins expensive to compute

    • More complex architecture and maintenance



Hash-based Distribution

kmi:tom foaf:interest wikipedia:Beer kmi:tomfoaf.rdf

compute hash mod 4

Query pattern: kmi:tom ?p ?o ?c

  • Pros:

    • Can route query directly to the machine

  • Cons:

    • Load-balancing, esp. for predicates and values of rdf:type
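A sketch of hash-based placement over four machines. Hashing on the subject is an assumption (the slide does not say which element is hashed), and SHA-1 is used instead of Python's built-in hash() so that placement is stable across runs.

```python
import hashlib

NUM_MACHINES = 4

def machine_for(term, n=NUM_MACHINES):
    """Stable hash of a term, reduced modulo the number of machines."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n

def place(quad):
    """Assumption: quads are placed by the hash of their subject."""
    return machine_for(quad[0])

def route(pattern):
    """Machines to ask: one if the subject is bound, otherwise all."""
    subject = pattern[0]
    if subject.startswith("?"):
        return list(range(NUM_MACHINES))
    return [machine_for(subject)]

quad = ("kmi:tom", "foaf:interest", "wikipedia:Beer", "kmi:tomfoaf.rdf")
print("stored on machine", place(quad))
print("query routed to", route(("kmi:tom", "?p", "?o", "?c")))
```

Placing by predicate or by the value of rdf:type instead would send very many quads to the same machine, which is the load-balancing concern noted above.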


Random Distribution

kmi:tom foaf:interest wikipedia:Beer kmi:tomfoaf.rdf

random distribution

  • Pros:

    • No load-balancing issues

  • Cons:

    • At query-time, don’t know which machine to ask…


Query Flooding

Q: ?s foaf:interest ?p ?o

random distribution

kmi:tom foaf:interest wikipedia:Beer kmi:tomfoaf.rdf
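With random placement there is nothing to route on, so the query goes to every machine and the answers are merged. A minimal sketch; the seed, the second quad and the function names are illustrative assumptions.

```python
import random

def distribute_randomly(quads, n, seed=0):
    """Assign each quad to one of n machines at random."""
    rng = random.Random(seed)
    machines = [[] for _ in range(n)]
    for quad in quads:
        machines[rng.randrange(n)].append(quad)
    return machines

def flood(pattern, machines):
    """Send the pattern to every machine and merge the matches."""
    def ok(quad):
        return all(t.startswith("?") or t == v for t, v in zip(pattern, quad))
    return [quad for machine in machines for quad in machine if ok(quad)]

quads = [
    ("kmi:tom", "foaf:interest", "wikipedia:Beer", "kmi:tomfoaf.rdf"),
    ("kmi:tom", "foaf:knows", "ex:someone", "kmi:tomfoaf.rdf"),
]
machines = distribute_randomly(quads, 4)
results = flood(("?s", "foaf:interest", "?o", "?c"), machines)
print(results)
```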


…existing engines?


More besides!!!

RDB-based indexes

  • 4store,

  • AllegroGraph,

  • Bigdata,

  • BigOWLIM,

  • Hexastore,

  • Jena SDB,

  • Mulgara,

  • Redland,

  • Virtuoso, etc.

Native RDF stores

  • HPRD,

  • Jena TDB,

  • RDF3X,

  • SIREn,

  • Voldemort,

  • YARS2, etc.


Berlin SPARQL Benchmark (v3)

  • Benchmark of common SPARQL engines

  • Set of assorted SPARQL queries and fixed data

  • Results for query-mixes per hour:

Christian Bizer, Andreas Schultz: The Berlin SPARQL Benchmark. 

Int. J. Semantic Web Inf. Syst. 5(2): 1-24 (2009)


Indexing wrap-up

  • Lots of work in the area!

    • Native stores vs. RDB-style stores

    • Triple stores vs. Quad stores

  • Optimisations

    • OIDs

    • Replication

    • Distribution

    • Join Selection/Reordering, etc.

  • No definitive solution…


Closing Indexing Quote

“In previous papers, some of us predicted the end of ‘one size fits all’ as a commercial relational DBMS paradigm.

“These papers presented reasons and experimental evidence that showed that the major RDBMS vendors can be outperformed by 1-2 orders of magnitude by specialized engines in the data warehouse, stream processing, text, and scientific database markets.”

[Stonebraker et al.; 2007]

