Linked Data Integration (using reasoning)

Aidan Hogan

Day 3

Session 2

What is reasoning?

Reasoning: Conceptual Overview

(Loosely) Deriving novel conclusions from existing knowledge

Deductive reasoning: inferring new facts from existing rules and facts

Given rule: All Kia cars are made in Korea;

Given premise (fact): Fred’s car is a Kia

Entails fact: Fred’s car was made in Korea

Inductive reasoning: learning new rules from existing facts and entailments (typically what we humans do: build imprecise rules from models)

Given model of existing facts: All Kia cars I’ve seen have four wheels

~Entails rule: All Kias have four wheels

Given fact: Fred’s car is a Kia

Entails (probabilistic fact): Fred’s car likely has four wheels

Abductive reasoning: guess a premise from a conclusion (similar in principle to a form of inductive reasoning)

Given entailment: Fred’s car is Korean and has four wheels?

Rule: Kias and Hyundais are Korean and typically have four wheels

Guessed premise: Fred’s car is a Kia or a Hyundai?

Reasoning: Clearing up terms
  • Semantics: formally defined meanings of terms
    • KiaCar ⊑ KoreanCar ⊓ FourWheels
  • Entailments: the conclusions which follow from formal semantics
  • Inference: a procedure to compute entailments
    • (or the entailments that can be computed therefrom)
Reasoning: Example Conceptual Tasks
  • Conjunctive Query Answering: generate new answers for questions
    • Include Fred as an answer to “give me friends of mine who own Korean cars”
  • Subsumption checking: identify subclass relationships
    • Is the class KiaCar a subset of KoreanCar if all Kias are manufactured in Seoul?
  • Class-Satisfiability checking: identify if a class can have a membership
    • Can there be something which is a KoreanCar and a EuropeanCar?
  • Consistency checking: identify formally conflicting information
    • Fred tells me his Kia is European; is this correct?
  • Instance checking: identify if an individual is a member of a given class
    • Is Fred’s Kia an Asian car?
Reasoning: RDFS and OWL (deductive)
  • Formal semantics of RDFS and OWL can be leveraged for reasoning.
  • :KiaCar rdfs:subClassOf :KoreanCar ,
  •     [ owl:hasValue :Seoul ; owl:onProperty :manufacturedIn ] .
  • :FredsCar a :KiaCar .
  • Implies
  • :FredsCar a :KoreanCar ; :manufacturedIn :Seoul .
Reasoning: OWL (2)
  • Eight sub-languages of OWL!
  • Why eight?(^^)
  • Direct Semantics (based on Description Logics):
    • OWL DL (NExpTime for non-QA tasks), OWL Lite (ExpTime for non-QA tasks)
    • OWL 2 EL, OWL 2 QL, OWL 2 RL (all PTime for most non-QA tasks)
    • OWL 2 DL (2NExpTime for non-QA tasks)
    • Emphasis on soundness and completeness
    • Tableaux-based algorithms
      • Based on KB-satisfiability checking
    • Syntactic restrictions to preserve complexity
      • e.g., no datatype inverse-functional properties
  • RDF-Based Semantics (layered directly on top of RDFS)
    • OWL Full, OWL 2 Full
    • All tasks are undecidable!
      • No complete, correct inference procedure can exist for the reasoning tasks
      • Incomplete reasoning possible through rules…

Opinion: OWL/OWL 2 useful stuff, but an extremely complex standard!!!

RDFS and OWL 2 RL: Entailment rules
  • RDFS entailment rules provide sound, complete RDFS reasoning
  • OWL 2 RL/RDF rules provide partial support for the OWL 2 RDF-Based Semantics
  • Monotonic rules which are guarded
  • Positive subset of datalog with a fixed ternary predicate
  • Rules have cubic complexity (with trivial exceptions aside)
    • Due to the arity of triples (3)
Rules

IF ⇒ THEN

Body/Antecedent/Condition:

?c1 rdfs:subClassOf ?c2 . (Schema/Terminology/Ontological)

?x rdf:type ?c1 . (Instance/Assertional)

Head/Consequent:

⇒ ?x rdf:type ?c2 .

  • foaf:Person rdfs:subClassOf foaf:Agent .
  • timbl:me rdf:type foaf:Person .
  • ⇒ timbl:me rdf:type foaf:Agent .
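To make the IF ⇒ THEN reading concrete, here is a minimal Python sketch of the subclass rule applied until nothing new is derived. Representing triples as tuples of prefixed names and the function name `apply_subclass_rule` are assumptions made for illustration; this is not the reasoner used in the talk.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def apply_subclass_rule(triples):
    """Apply '?c1 subClassOf ?c2 . ?x type ?c1 . => ?x type ?c2 .' to a fixpoint."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        # Schema (terminological) part of the rule body.
        schema = [(s, o) for (s, p, o) in closed if p == SUBCLASS]
        # Instance (assertional) part of the rule body.
        for (x, p, c) in list(closed):
            if p != TYPE:
                continue
            for (c1, c2) in schema:
                if c == c1 and (x, TYPE, c2) not in closed:
                    closed.add((x, TYPE, c2))
                    changed = True
    return closed

data = {
    ("foaf:Person", SUBCLASS, "foaf:Agent"),
    ("timbl:me", TYPE, "foaf:Person"),
}
inferred = apply_subclass_rule(data) - data
print(inferred)
```

The only triple derived from the slide's two input triples is the head of the rule, timbl:me rdf:type foaf:Agent.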

Rules (Inconsistencies [a.k.a. Contradictions])

IF ⇒ THEN

Body/Antecedent/Condition:

?c1 owl:disjointWith ?c2 .

?x rdf:type ?c1 .

?x rdf:type ?c2 .

Head/Consequent:

⇒ false

  • foaf:Person owl:disjointWith foaf:Organization .
  • w3c rdf:type foaf:Organization .
  • w3c rdf:type foaf:Person .
  • ⇒ false
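A rule with a false head does not add triples; it flags a contradiction. The sketch below, under the same tuple-based triple representation (an assumption for illustration), reports every individual typed with two disjoint classes. The function name `find_inconsistencies` is hypothetical.

```python
def find_inconsistencies(triples):
    """Return (individual, class1, class2) for each disjointness violation."""
    disjoint = [(s, o) for (s, p, o) in triples if p == "owl:disjointWith"]
    types = {}
    for (s, p, o) in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    violations = []
    for x, classes in sorted(types.items()):
        for (c1, c2) in disjoint:
            # Both body patterns match: the rule fires and derives 'false'.
            if c1 in classes and c2 in classes:
                violations.append((x, c1, c2))
    return violations

data = [
    ("foaf:Person", "owl:disjointWith", "foaf:Organization"),
    ("w3c", "rdf:type", "foaf:Organization"),
    ("w3c", "rdf:type", "foaf:Person"),
]
violations = find_inconsistencies(data)
print(violations)
```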
Linked Data reasoning?

…integration use-case!

Linked Data Reasoning

explicit data

implicit data

How can consumers query the implicit data?

…so what’s The Problem?…

…heterogeneity

…need to integrate data from different sources

Take Query Answering…

“Gimme webpages relating to Tim Berners-Lee”

timbl:i foaf:page ?pages .

Heterogeneity in schema…

webpage: properties

[Diagram: the properties below, linked by edges labelled rdfs:subPropertyOf and owl:inverseOf]

  • mo:musicBrainz
  • doap:homepage
  • mo:myspace
  • foaf:homepage
  • foaf:weblog
  • foaf:primaryTopic
  • foaf:isPrimaryTopicOf
  • foaf:page
  • foaf:topic

Linked Data, RDFS and OWL: Linked Vocabularies

SKOS

Image from http://blog.dbtune.org/public/.081005_lod_constellation_m.jpg; Giasson, Bergman

Heterogeneity in naming…

Tim Berners-Lee: URIs

[Diagram: the URIs below, linked by owl:sameAs edges]

  • dblp:100007
  • timbl:i
  • db:Tim-Berners_Lee
  • identica:45563
  • fb:en.tim_berners-lee
  • adv:timbl

Returning to our simple query…

“Gimme webpages relating to Tim Berners-Lee”

timbl:i foaf:page ?pages .

Properties: mo:myspace, foaf:primaryTopic, foaf:page, foaf:topic, doap:homepage, foaf:homepage, foaf:isPrimaryTopicOf

URIs: identica:45563, adv:timbl, db:Tim-Berners_Lee, dblp:100007, fb:en.tim_berners-lee, timbl:i

...7 x 6 = 42 possible patterns
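The 7 x 6 count can be reproduced by enumerating every property/URI combination. This is only a sketch: the choice of which properties are treated as inverses (pointing from the page to the person) is an assumption, based on foaf:topic being the inverse of foaf:page and foaf:primaryTopic the inverse of foaf:isPrimaryTopicOf; `expand_query` is a hypothetical helper.

```python
from itertools import product

PROPERTIES = [
    "mo:myspace", "foaf:primaryTopic", "foaf:page", "foaf:topic",
    "doap:homepage", "foaf:homepage", "foaf:isPrimaryTopicOf",
]
# Assumed to point from the page to the person, so the pattern is flipped.
INVERSE = {"foaf:primaryTopic", "foaf:topic"}
URIS = [
    "identica:45563", "adv:timbl", "db:Tim-Berners_Lee",
    "dblp:100007", "fb:en.tim_berners-lee", "timbl:i",
]

def expand_query(uris, properties):
    """Build one triple pattern per (URI, property) combination."""
    patterns = []
    for uri, prop in product(uris, properties):
        if prop in INVERSE:
            patterns.append(f"?pages {prop} {uri} .")
        else:
            patterns.append(f"{uri} {prop} ?pages .")
    return patterns

patterns = expand_query(URIS, PROPERTIES)
print(len(patterns))
```

Without reasoning, a consumer would have to issue all of these patterns by hand to get the answers that the single pattern timbl:i foaf:page ?pages . is meant to return.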

Challenges…

…what (OWL) reasoning is feasible for Linked Data?

Linked Data Reasoning: Challenges

Scalable

Expressive

Domain-Agnostic

Robust

Linked Data Reasoning: Challenges
  • Scalability
    • At least tens of billions of statements (for the moment)
      • Near linear scale!!!
  • Noisy data
    • Inconsistencies galore
    • Publishing errors
What about noise? …

…need to consider the provenance of Web data

Noisy Data: Omnipotent Being
  • Web data is noisy.
  • Proof:
  • 08445a31a78661b5c746feff39a9db6e4e2cc5cf
  • sha1-sum of ‘mailto:’
  • common value for foaf:mbox_sha1sum
    • An inverse-functional (uniquely identifying) property!!!
    • Any person who shares the same value will be considered the same
  • Q.E.D.
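The effect can be sketched in a few lines of Python. The people and mailboxes below are hypothetical; the point is only that every exporter that hashes an empty mailbox produces the same foaf:mbox_sha1sum value, so an inverse-functional reading merges unrelated people.

```python
import hashlib
from collections import defaultdict

def mbox_sha1sum(mbox_uri):
    """SHA-1 hex digest of a mailbox URI, as used for foaf:mbox_sha1sum."""
    return hashlib.sha1(mbox_uri.encode("utf-8")).hexdigest()

# Hypothetical data: two exporters left the mailbox empty.
people = {
    "ex:alice": "mailto:",
    "ex:bob": "mailto:",
    "ex:carol": "mailto:carol@example.org",
}

# Treating the hash as uniquely identifying groups people by its value.
groups = defaultdict(set)
for person, mbox in people.items():
    groups[mbox_sha1sum(mbox)].add(person)

merged = [sorted(group) for group in groups.values() if len(group) > 1]
print(mbox_sha1sum("mailto:"))
print(merged)
```

The first printed value can be compared against the digest quoted on the slide.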
Noisy Data: Redefining everything
  • More proof (courtesy of http://www.eiao.net/rdf/1.0)
  • rdf:type rdf:type owl:Property .
  • rdf:type rdfs:label “type”@en .
  • rdf:type rdfs:comment “Type of resource” .
  • rdf:type rdfs:domain eiao:testRun .
  • rdf:type rdfs:domain eiao:pageSurvey .
  • rdf:type rdfs:domain eiao:siteSurvey .
  • rdf:type rdfs:domain eiao:scenario .
  • rdf:type rdfs:domain eiao:rangeLocation .
  • rdf:type rdfs:domain eiao:startPointer .
  • rdf:type rdfs:domain eiao:endPointer .
  • rdf:type rdfs:domain eiao:header .
  • rdf:type rdfs:domain eiao:runs .
Noisy Data: Inconsistency

w3c rdf:type foaf:Organization .

w3c rdf:type foaf:Person .

foaf:Person owl:disjointWith foaf:Organization .

Consider source of schema data

Class/property URIs dereference to their authoritative document

FOAF spec authoritative for foaf:Person ✓

MY spec not authoritative for foaf:Person ✘

Allow “extension” in third-party documents

my:Person rdfs:subClassOf foaf:Person . (MY spec) ✓

BUT: Reduce obscure memberships

foaf:Person rdfs:subClassOf my:Person . (MY spec) ✘

ALSO: Protect specifications

foaf:knows a owl:SymmetricProperty . (MY spec) ✘

Authoritative Reasoning
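A deliberately simplified reading of the three ✓/✘ examples above: accept a schema triple only if the document serving it is the one that the subject term dereferences to. The `DEREF` table and the function name are hypothetical, and the published authoritative-reasoning rules are more detailed than this subject-only check.

```python
# Hypothetical map from a class/property URI to the document it dereferences to.
DEREF = {
    "foaf:Person": "FOAF spec",
    "foaf:knows": "FOAF spec",
    "my:Person": "MY spec",
}

def is_authoritative(triple, source_doc, deref=DEREF):
    """Simplified check: the source must be authoritative for the triple's subject."""
    subject, _predicate, _obj = triple
    return deref.get(subject) == source_doc

# Extension in a third-party document is allowed.
ok = is_authoritative(("my:Person", "rdfs:subClassOf", "foaf:Person"), "MY spec")
# Making every foaf:Person a my:Person from outside FOAF is not.
obscure = is_authoritative(("foaf:Person", "rdfs:subClassOf", "my:Person"), "MY spec")
# Redefining foaf:knows from outside FOAF is not.
hijack = is_authoritative(("foaf:knows", "rdf:type", "owl:SymmetricProperty"), "MY spec")
print(ok, obscure, hijack)
```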

Noisy Data: Redefining everything
  • Same rdf:type redefinitions as before (courtesy of http://www.eiao.net/rdf/1.0)

Not Authoritative

Authoritative Reasoning: read more …w/ essential plugs

Gong Cheng, Yuzhong Qu.

"Integrating Lightweight Reasoning into Class-Based Query Refinement for Object Search." ASWC 2008.

Aidan Hogan, Andreas Harth, Axel Polleres.

"Scalable Authoritative OWL Reasoning for the Web." IJSWIS 2009.

Aidan Hogan, Jeff Z. Pan, Axel Polleres and Stefan Decker.

"SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples." ISWC 2010.

My thesis: http://aidanhogan.com/docs/thesis/

Alternative to Authoritative Reasoning?
  • Quarantined reasoning!
  • Separate and cache hierarchy of schema documents/dependencies…
Quarantined Reasoning [Delbru et al.; 2008]

A-Box / Instance Data

(e.g., a FOAF file)

T-Box / Ontology Data

(e.g., the FOAF ontology and its indirect imports)

Noisy Data: Redefining everything
  • Same rdf:type redefinitions as before (courtesy of http://www.eiao.net/rdf/1.0)

Not In Here

Quarantined Reasoning: read more

R. Delbru, A. Polleres, G. Tummarello and S. Decker.

"Context Dependent Reasoning for Semantic Documents in Sindice." 4th International Workshop on Scalable Semantic Web Knowledge Base Systems, 2008.

Resolving Inconsistency?
  • Use link analysis (PageRank) to rank documents and triples
  • Use annotated reasoning to rank inferences
  • Repair each inconsistency by removing the weakest triple
    • Read more:
    • Piero A. Bonatti, Aidan Hogan, Axel Polleres and Luigi Sauro. "Robust and Scalable Linked Data Reasoning Incorporating Provenance and Trust Annotations". In the Journal of Web Semantics (in press).
What about scale? …

…using positive (monotonic) rules.

Expressive reasoning (also) possible through tableaux, but yet to demonstrate desired scale

Materialisation
  • Forward-chaining Materialisation
    • Avoid runtime expense
      • Users taught impatience by Google
    • Pre-compute for quick retrieval
    • Web-scale systems should scale well
      • More data = more disk-space/machines

Don't materialise too much!

One size does not fit all!

  • INPUT:
  • Flat file of triples (quads)
  • OUTPUT:
  • Flat file of (partial) inferred triples (quads)
What rules?
  • Let’s look at a recent corpus of Linked Data and see what schema’s inside
  • (and what the rulesets support)
    • Open-domain crawl May 2010
    • 1.1 billion quadruples
    • 3.985 million sources (docs)
    • 780 pay-level domains (e.g., dbpedia.org)
    • Ran “special” PageRank over documents
    • 86 thousand docs contained some RDFS/OWL schema data (2.2% of docs... but <0.2% of triples)
    • Summed the ranks of the docs using each primitive
Survey of Linked Data schema: Top 15 ranks

#   Axiom                          Rank(Σ)  RDFS  Horst  O2R
1   rdfs:subClassOf                0.295    ✓     ✓      ✓
2   rdfs:range                     0.294    ✓     ✓      ✓
3   rdfs:domain                    0.292    ✓     ✓      ✓
4   rdfs:subPropertyOf             0.090    ✓     ✓      ✓
5   owl:FunctionalProperty         0.063    ✘     ✓      ✓
6   owl:disjointWith               0.049    ✘     ✘      ✓
7   owl:inverseOf                  0.047    ✘     ✓      ✓
8   owl:unionOf                    0.035    ✘     ✘      ✓
9   owl:SymmetricProperty          0.033    ✘     ✓      ✓
10  owl:TransitiveProperty         0.030    ✘     ✓      ✓
11  owl:equivalentClass            0.021    ✘     ✓      ✓
12  owl:InverseFunctionalProperty  0.030    ✘     ✓      ✓
13  owl:equivalentProperty         0.030    ✘     ✓      ✓
14  owl:someValuesFrom             0.030    ✘     ✓      ✓
15  owl:hasValue                   0.028    ✘     ✓      ✓
Scalable Reasoning: In-mem T-Box

Main optimisation: Store T-Box in memory

T-Box: (loosely) data describing classes and properties.

Aka. schemata/vocabularies/ontologies/terminologies.

E.g.,

foaf:topic owl:inverseOf foaf:page .

sioc:UserAccount rdfs:subClassOf foaf:OnlineAccount .

Most commonly accessed data for reasoning

Quite small (~0.1% for our Linked Data corpus)

High selectivity (if you prefer)

A-Box: Lots: ?s foaf:page ?o . vs.

T-Box: Few: foaf:page ?p ?o . + ?s ?p foaf:page .

Scalable Reasoning: Two Scans

Scan 1: Scan the input data, separate out the T-Box statements, and load them into memory

Do T-Box level reasoning if required (semi-naïve)

Scan 2: Scan all on-disk data, join with in-memory T-Box.

Scalable Reasoning: No A-Box Joins

  • Execution of three rules:

IN-MEM T-BOX

OWL 2 RL rule prp-inv1

?p1 owl:inverseOf ?p2 .

?x ?p1 ?y .

⇒ ?y ?p2 ?x .

OWL 2 RL rule prp-rng

?p rdfs:range ?c .

?x ?p ?y .

⇒ ?y a ?c .

OWL 2 RL rule prp-spo1

?p1 rdfs:subPropertyOf ?p2 .

?x ?p1 ?y .

⇒ ?x ?p2 ?y .

ON-DISK A-BOX

...

ex:me foaf:homepage ex:hp .

...

ON-DISK OUTPUT

...

ex:hp rdf:type foaf:Document .

ex:me foaf:page ex:hp .

ex:hp foaf:topic ex:me .

...
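The two scans and the three rules can be sketched together as follows. The T-Box triples used here are an assumption (the slide only shows the A-Box input and the output), chosen so that the example reproduces the slide's three output triples; function names are hypothetical. Each A-Box triple is joined only with the in-memory T-Box, never with another A-Box triple.

```python
TBOX_PREDICATES = {"owl:inverseOf", "rdfs:range", "rdfs:subPropertyOf"}

def scan_tbox(triples):
    """Scan 1: pull schema triples into in-memory lookup tables."""
    tbox = {p: {} for p in TBOX_PREDICATES}
    for s, p, o in triples:
        if p in TBOX_PREDICATES:
            tbox[p].setdefault(s, set()).add(o)
    return tbox

def infer(triple, tbox):
    """Apply prp-inv1, prp-rng and prp-spo1 to a single A-Box triple."""
    x, p, y = triple
    out = set()
    for p2 in tbox["owl:inverseOf"].get(p, ()):
        out.add((y, p2, x))
    for c in tbox["rdfs:range"].get(p, ()):
        out.add((y, "rdf:type", c))
    for p2 in tbox["rdfs:subPropertyOf"].get(p, ()):
        out.add((x, p2, y))
    return out

def scan_abox(triples, tbox):
    """Scan 2: stream the A-Box; inferred triples are re-fed to the rules."""
    output = set()
    for triple in triples:
        if triple[1] in TBOX_PREDICATES:
            continue
        todo = [triple]
        while todo:
            for new in infer(todo.pop(), tbox):
                if new not in output and new != triple:
                    output.add(new)
                    todo.append(new)
    return output

data = [
    # Assumed T-Box (not shown on the slide).
    ("foaf:homepage", "rdfs:subPropertyOf", "foaf:page"),
    ("foaf:page", "rdfs:range", "foaf:Document"),
    ("foaf:page", "owl:inverseOf", "foaf:topic"),
    # A-Box triple from the slide.
    ("ex:me", "foaf:homepage", "ex:hp"),
]
output = scan_abox(data, scan_tbox(data))
print(sorted(output))
```

Because every inference depends on exactly one A-Box triple, the second scan can stream the data from disk and needs no index over the instance data.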

Scalable Reasoning: A-Box joins?
  • However: some rules do require A-Box joins
    • ?p a owl:TransitiveProperty . ?x ?p ?y . ?y ?p ?z . ⇒ ?x ?p ?z .

    • Difficult to engineer a scalable solution (which reaches a fixpoint) for Linked Data(?)
    • Can lead to quadratic inferences
  • A lot of useful reasoning still possible without A-Box joins…
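To illustrate why transitivity is harder, the following sketch computes a naive fixpoint for a single transitive property over a chain of nodes. The node names and helper functions are hypothetical. Each round joins instance triples with other instance triples, and the output grows quadratically in the length of the chain.

```python
def transitive_closure(edges):
    """Naive fixpoint for: ?x ?p ?y . ?y ?p ?z . => ?x ?p ?z . (one fixed ?p)."""
    closure = set(edges)
    while True:
        # A-Box join: pair every edge with every edge that continues it.
        new = {(x, z)
               for (x, y1) in closure
               for (y2, z) in closure
               if y1 == y2} - closure
        if not new:
            return closure
        closure |= new

def chain(n):
    """n nodes linked in a line: n0 -> n1 -> ... -> n(n-1)."""
    return {(f"ex:n{i}", f"ex:n{i + 1}") for i in range(n - 1)}

for n in (5, 10, 20):
    print(n, len(chain(n)), len(transitive_closure(chain(n))))
```

A chain of n nodes has n - 1 input edges but n(n - 1)/2 edges in its closure.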
Features not requiring A-Box joins

Axiom                          Rank(Σ)  No A-Box join needed
rdfs:subClassOf                0.295    ✓
rdfs:range                     0.294    ✓
rdfs:domain                    0.292    ✓
rdfs:subPropertyOf             0.090    ✓
owl:FunctionalProperty         0.063    ✘
owl:disjointWith               0.049    ✘
owl:inverseOf                  0.047    ✓
owl:unionOf                    0.035    ✓
owl:SymmetricProperty          0.033    ✓
owl:equivalentClass            0.021    ✓
owl:InverseFunctionalProperty  0.030    ✘
owl:equivalentProperty         0.030    ✓
owl:someValuesFrom             0.030    ✓/✘
Scalable Distributed Reasoning

[Animation over five machines, each holding a different slice of the input (e.g., ex:me ex:presented ex:ThisTalk): EXTRACT T-BOX on each machine, then COLLECT T-BOX; every machine then holds the SAME T-BOX but a DIFF. A-BOX, and writes its own LOCAL OUTPUT (e.g., ex:me rdf:type ex:Awesome .)]
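The extract/collect/local-output steps can be mimicked with plain lists standing in for machines. This is a sketch under assumptions: the rdfs:domain axiom is hypothetical (the slides do not show which T-Box triple yields ex:Awesome), only one rule is applied, and ex:you / ex:attended are invented filler data.

```python
RDFS_DOMAIN = "rdfs:domain"

def extract_tbox(chunk):
    """Each machine pulls the schema triples out of its own slice."""
    return {t for t in chunk if t[1] == RDFS_DOMAIN}

def reason_locally(chunk, tbox):
    """Each machine joins its own A-Box slice with the shared T-Box."""
    domains = {}
    for prop, _, cls in tbox:
        domains.setdefault(prop, set()).add(cls)
    out = set()
    for s, p, o in chunk:
        for cls in domains.get(p, ()):
            out.add((s, "rdf:type", cls))
    return out

# Three hypothetical machines, each with a different slice of the input.
chunks = [
    [("ex:presented", RDFS_DOMAIN, "ex:Awesome")],
    [("ex:me", "ex:presented", "ex:ThisTalk")],
    [("ex:you", "ex:attended", "ex:ThisTalk")],
]

# COLLECT T-BOX: union of what every machine extracted, sent back to all.
shared_tbox = set().union(*(extract_tbox(chunk) for chunk in chunks))
# LOCAL OUTPUT: no machine needs to see another machine's A-Box.
local_outputs = [reason_locally(chunk, shared_tbox) for chunk in chunks]
print(local_outputs)
```

Since no rule here needs an A-Box join, the only coordination between machines is the exchange of the (small) T-Box.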

Reasoning Performance: Distribution

9 machines: Total 3.35 hours

Distributed Reasoning: read more

Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker: SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples. International Semantic Web Conference (1) 2010: 337-353

Jesse Weaver, James A. Hendler: Parallel Materialization of the Finite RDFS Closure for Hundreds of Millions of Triples. International Semantic Web Conference 2009: 682-697

Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen: Scalable Distributed Reasoning Using MapReduce. International Semantic Web Conference 2009: 634-649

Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal: OWL Reasoning with WebPIE: Calculating the Closure of 100 Billion Triples. ESWC (1) 2010: 213-227

A-Box Joins

Consolidation: Baseline

timbl:i

identica:45563

dbpedia:Berners-Lee

  • Use provided owl:sameAs mappings in the data

timbl:i owl:sameAs identica:45563 .

dbpedia:Berners-Lee owl:sameAs identica:45563 .

  • Store “equivalences” found

timbl:i ->

identica:45563 ->

dbpedia:Berners-Lee ->

  • For each set of equivalent identifiers, choose a canonical term
Canonicalisation

Afterwards, rewrite identifiers to their canonical version:

timbl:i rdf:type foaf:Person .

identica:48404 foaf:knows identica:45563 .

dbpedia:Berners-Lee dpo:birthDate “1955-06-08”^^xsd:date .

becomes

dbpedia:Berners-Lee rdf:type foaf:Person .

identica:48404 foaf:knows dbpedia:Berners-Lee .

dbpedia:Berners-Lee dpo:birthDate “1955-06-08”^^xsd:date .
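A minimal sketch of baseline consolidation using a union-find structure. Picking the lexicographically lowest identifier as the canonical term is an assumption made for illustration (it happens to select dbpedia:Berners-Lee, as in the example); the slides do not say how the canonical term is chosen. Function names are hypothetical.

```python
def find(parent, x):
    """Return the representative of x's equivalence class (with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def canonical_map(same_as_pairs):
    """Map every identifier to the canonical term of its equivalence class."""
    parent = {}
    for a, b in same_as_pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            # Assumption: the lexicographically lowest term becomes canonical.
            parent[max(ra, rb)] = min(ra, rb)
    return {term: find(parent, term) for term in list(parent)}

def rewrite(triples, canon):
    """Rewrite subjects and objects to their canonical identifiers."""
    return [(canon.get(s, s), p, canon.get(o, o)) for (s, p, o) in triples]

same_as = [
    ("timbl:i", "identica:45563"),
    ("dbpedia:Berners-Lee", "identica:45563"),
]
data = [
    ("timbl:i", "rdf:type", "foaf:Person"),
    ("identica:48404", "foaf:knows", "identica:45563"),
    ("dbpedia:Berners-Lee", "dpo:birthDate", '"1955-06-08"^^xsd:date'),
]
canon = canonical_map(same_as)
rewritten = rewrite(data, canon)
print(rewritten)
```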

Extended Consolidation
  • Infer owl:sameAs through reasoning (OWL 2 RL/RDF)
    • explicit owl:sameAs (again)
    • owl:InverseFunctionalProperty
    • owl:FunctionalProperty
    • owl:cardinality 1 / owl:maxCardinality 1

foaf:homepage a owl:InverseFunctionalProperty .

timbl:i foaf:homepage w3c:timblhomepage .

adv:timbl foaf:homepage w3c:timblhomepage .

⇒ timbl:i owl:sameAs adv:timbl .

…then apply consolidation as before
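The inverse-functional case can be sketched as a group-by on (property, value). This is an illustration only: it covers owl:InverseFunctionalProperty and none of the other bullets, writes rdf:type where the slide uses the Turtle shorthand a, and uses a hypothetical function name. Since owl:sameAs is symmetric, the direction of the emitted triple is arbitrary.

```python
from collections import defaultdict

def same_as_from_ifps(triples):
    """Subjects sharing a value for an inverse-functional property are the same."""
    ifps = {s for (s, p, o) in triples
            if p == "rdf:type" and o == "owl:InverseFunctionalProperty"}
    subjects = defaultdict(set)
    for s, p, o in triples:
        if p in ifps:
            subjects[(p, o)].add(s)
    inferred = set()
    for group in subjects.values():
        ordered = sorted(group)
        # Link every member to the first one; consolidation closes the rest.
        for other in ordered[1:]:
            inferred.add((ordered[0], "owl:sameAs", other))
    return inferred

data = [
    ("foaf:homepage", "rdf:type", "owl:InverseFunctionalProperty"),
    ("timbl:i", "foaf:homepage", "w3c:timblhomepage"),
    ("adv:timbl", "foaf:homepage", "w3c:timblhomepage"),
]
inferred = same_as_from_ifps(data)
print(inferred)
```

Note that this grouping is an A-Box join: two instance triples are needed to fire the rule.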

Consolidation: Results

For our Linked Data corpus:

  • ~12 million explicit owl:sameAs triples (as before)
  • ~8.7 million thru. owl:InverseFunctionalProperty
  • ~106 thousand thru. owl:FunctionalProperty
  • none thru. owl:cardinality/owl:maxCardinality

In terms of equivalences found (baseline vs. extended):

  • ~2.8 million sets of equivalent identifiers
    • (1.31x baseline)
  • ~14.86 million identifiers involved
    • (2.58x baseline)
  • ~5.8 million URIs
    • !!(1.014x baseline)!!
Linked Data Reasoning Wrap-Up

Heterogeneity poses a significant problem for consuming Linked Data

  • Heterogeneity in schema
  • Heterogeneity in naming

…but we can use the mappings provided by publishers to integrate heterogeneous Linked Data corpora (with a little caution)

  • Lightweight rule-based reasoning can go a long way
  • Deceit/Noise ≠ End Of World
    • Consider source of data!
  • Inconsistency ≠ End Of World
    • Useful for finding noise in fact!
  • Explicit owl:sameAs vs. extended consolidation:
    • Extended consolidation mostly (but not entirely) for consolidating blank-nodes from older FOAF exporters
Indexing RDF

Aidan Hogan

Day 3

Session 2

RDF Index Designs (1/4): Horizontal table-per-class

[Example tables: Class: Car; Class: Person]

  • Pros:
    • Fast for certain queries, esp. “star shaped” queries
    • Little redundancy in cells
  • Cons:
    • Becomes very sparse for larger schema
      • Lots of nulls needed
    • Special handling needed for multi-valued attributes

RDF Index Designs (2/4): Vertical triple table
  • Pros:
    • No more nulls needed
    • Flexible for updates (even to schema)
    • Multi-valued attributes no problem
  • Cons:
    • Lots of self-joins
    • Lots of redundancy in the cells
RDF Index Designs (3/4): Vertical table per prop.

[Example tables: Property: model; Property: ownsCar; Class: car; Property: type]

  • Pros:
    • Less redundancy
  • Cons:
    • Potentially many tables
RDF Index Designs (4/4): Hybrid

[Example tables: Property: seeAlso; Class: Car; Class: Person; Property: img]

  • Pros:
    • ~Depends
  • Cons:
    • Likely to be more costly to manage

Native vs. RDB-style storage
  • RDB-based indexes
    • Store data in a relational database
    • Typically B+Trees or similar RDB technology
    • Sometimes horizontal (RDB-like) schema
    • Mostly vertical (RDF-like) tables
    • 4store, AllegroGraph, Bigdata, BigOWLIM, Hexastore, Jena SDB, Mulgara, Redland, Virtuoso, etc.
  • Native RDF stores
    • Custom storage solutions
    • HPRD, Jena TDB, RDF3X, SIREn, Voldemort, YARS2
    • YARS2: Sparse indexes
    • SIREn: IR-style indexes over Lucene
  • Distinction not always clear-cut!
YARS2: Example Native Storage
  • Combination of in-memory and on-disk storage
  • Read optimised
  • Bulk-load (just sort)
Triple stores vs. Quad stores
  • Triple stores
    • Only service simple RDF triple patterns
    • RDF-3X, SIREn, 3store, etc.
      • ?s rdf:type foaf:Person .
      • aidan ?p galway .
      • ?s ?p ?o .
  • Quad stores
    • Also service patterns involving named graphs
    • Typical for indexing data from multiple sources
    • Needed for SPARQL querying!!
      • GRAPH ?g {?s rdf:type foaf:Person}
      • GRAPH foaf.rdf {aidan ?p galway }
      • FROM graph1.rdf … WHERE { ?s ?p ?o . }
    • Virtuoso, BigOWLIM, Jena TDB/SDB, YARS2, 4store, Hexastore, etc.
Building a full Quad index
  • (subject, predicate, object, graph)
    • graph sometimes called context
  • 2^4 = 16 patterns to service!
Six prefix-indexes for quads
  • Requires six different indexes to service all 16 quad patterns
    • assuming prefix lookups
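The claim can be checked mechanically. The six orderings below are one possible choice and an assumption (the slide does not list them); the sketch verifies that every one of the 2^4 = 16 bound/unbound combinations is a prefix of at least one ordering, and that dropping any single index leaves some pattern unserved.

```python
from itertools import combinations

# One possible choice of six orderings (an assumption; not listed on the slide).
INDEXES = ["spoc", "pocs", "ocsp", "cspo", "cpso", "ospc"]

def index_for(bound, indexes=INDEXES):
    """Return an index whose leading positions are exactly the bound ones."""
    for order in indexes:
        if set(order[:len(bound)]) == set(bound):
            return order
    return None

# All 16 subsets of {s, p, o, c}: each subset is the set of bound positions.
ALL_PATTERNS = [frozenset(c) for n in range(5) for c in combinations("spoc", n)]

covered = all(index_for(p) is not None for p in ALL_PATTERNS)
# Removing any one index leaves at least one pattern without a prefix match.
minimal = all(
    any(index_for(p, INDEXES[:i] + INDEXES[i + 1:]) is None for p in ALL_PATTERNS)
    for i in range(len(INDEXES))
)
print(len(ALL_PATTERNS), covered, minimal)
```

Six is also a lower bound for complete orderings: there are six ways to bind exactly two of the four positions, and each ordering has only one two-element prefix.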
Object IDs

Data Table

Dictionary

  • Pros:
    • Can load more data in memory
    • Faster to compute joins
    • Smaller on-disk footprint
  • Cons:
    • Maintain a potentially massive dictionary
    • Slower to externalise streaming results
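A small sketch of the object-ID idea: a dictionary maps each RDF term to an integer once, and the data table stores only integers. The class and method names are hypothetical and the dictionary lives in memory, which sidesteps the "potentially massive dictionary" cost noted above.

```python
class Dictionary:
    """Two-way mapping between RDF terms and compact integer object IDs."""

    def __init__(self):
        self.term_to_id = {}
        self.id_to_term = []

    def encode(self, term):
        # Assign the next free ID the first time a term is seen.
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode(self, oid):
        return self.id_to_term[oid]

quads = [
    ("kmi:tom", "foaf:interest", "wikipedia:Beer", "kmi:tomfoaf.rdf"),
    ("kmi:tom", "rdf:type", "foaf:Person", "kmi:tomfoaf.rdf"),
]
dictionary = Dictionary()
# Data table: the same quads with every term replaced by its ID.
table = [tuple(dictionary.encode(term) for term in quad) for quad in quads]
# Externalising results means decoding IDs back into terms.
decoded = [tuple(dictionary.decode(oid) for oid in row) for row in table]
print(table)
```

Joins compare small integers rather than long strings, while the decode step is the extra cost paid when streaming results out.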
Join re-ordering/selection (in brief)

[Figure annotations: x4,000 (2); x40 (1)]

  • Equi-joins are commutative
    • What ordering to execute them in?
  • Choice of various techniques
    • Nested-loop join
    • Hash join
    • Index join
  • Use selectivity estimates…
    • Other techniques known from databases!

?person foaf:based_near dbpedia:Korea .

aidan foaf:knows ?person .
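A sketch of selectivity-based ordering: execute the pattern expected to match the fewest triples first, so that later patterns are evaluated with ?person already bound. Which of the two patterns the figures 40 and 4,000 belong to is an assumption here, as are the helper names.

```python
def order_by_selectivity(patterns, estimate):
    """Order triple patterns by ascending estimated result size."""
    return sorted(patterns, key=estimate)

# Assumed pairing of the estimates with the two patterns.
ESTIMATES = {
    "?person foaf:based_near dbpedia:Korea .": 4000,
    "aidan foaf:knows ?person .": 40,
}

plan = order_by_selectivity(list(ESTIMATES), ESTIMATES.get)
for step, pattern in enumerate(plan, start=1):
    print(step, pattern, ESTIMATES[pattern])
```

With this plan, an index join would probe the second pattern at most 40 times (once per binding of ?person) instead of scanning all 4,000 matches.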

Speed-Up: Replication
  • Pros:
    • Speed-up response times
    • Better fault-tolerance
  • Cons:
    • Expensive!
    • Updates?

Scale-Up: Distribution
  • Pros:
    • Handle more data
    • Commodity hardware ~cheap
  • Cons:
    • Joins expensive to compute
    • More complex architecture and maintenance

Hash-based Distribution

kmi:tom ?p ?o ?c

kmi:tom foaf:interest wikipedia:Beer kmi:tomfoaf.rdf

compute hash mod 4

  • Pros:
    • Can route query directly to the machine
  • Cons:
    • Load-balancing, esp. for predicates and values of rdf:type

kmi:tom foaf:interest wikipedia:Beer kmi:tomfoaf.rdf
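A sketch of hash-based placement over four machines. Hashing on the subject and using CRC32 as the hash function are assumptions for illustration (Python's built-in hash() for strings is randomised per process, so it would not route consistently across machines).

```python
import zlib

NUM_MACHINES = 4

def machine_for(subject, num_machines=NUM_MACHINES):
    """Compute hash mod N for the subject to pick a machine."""
    return zlib.crc32(subject.encode("utf-8")) % num_machines

def distribute(quads, num_machines=NUM_MACHINES):
    """Place every quad on the machine chosen by its subject."""
    machines = [[] for _ in range(num_machines)]
    for quad in quads:
        machines[machine_for(quad[0], num_machines)].append(quad)
    return machines

quads = [
    ("kmi:tom", "foaf:interest", "wikipedia:Beer", "kmi:tomfoaf.rdf"),
    ("kmi:tom", "rdf:type", "foaf:Person", "kmi:tomfoaf.rdf"),
]
machines = distribute(quads)

# The lookup kmi:tom ?p ?o ?c is routed to exactly one machine.
target = machine_for("kmi:tom")
answers = [q for q in machines[target] if q[0] == "kmi:tom"]
print(target, answers)
```

Hashing on predicates or on values of rdf:type instead would send very common terms to a single machine, which is the load-balancing problem listed under Cons.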

Random Distribution

kmi:tom foaf:interest wikipedia:Beer kmi:tomfoaf.rdf

random distribution

  • Pros:
    • No load-balancing issues
  • Cons:
    • At query-time, don’t know which machine to ask…
Query Flooding

?s foaf:interest ?p ?o

Q

random distribution

kmi:tom foaf:interest wikipedia:Beer kmi:tomfoaf.rdf

More besides!!!

RDB-based indexes

  • 4store,
  • AllegroGraph,
  • Bigdata,
  • BigOWLIM,
  • Hexastore,
  • Jena SDB,
  • Mulgara,
  • Redland,
  • Virtuoso, etc.

Native RDF stores

  • HPRD,
  • Jena TDB,
  • RDF3X,
  • SIREn,
  • Voldemort,
  • YARS2, etc.
Berlin SPARQL Benchmark (v3)
  • Benchmark of common SPARQL engines
  • Set of assorted SPARQL queries and fixed data
  • Results for query-mixes per hour:

Christian Bizer, Andreas Schultz: The Berlin SPARQL Benchmark. Int. J. Semantic Web Inf. Syst. 5(2): 1-24 (2009)

Indexing wrap-up
  • Lots of work in the area!
    • Native stores vs. RDB-style stores
    • Triple stores vs. Quad stores
  • Optimisations
    • OIDs
    • Replication
    • Distribution
    • Join Selection/Reordering, etc.
  • No definitive solution…
Closing Indexing Quote

“In previous papers, some of us predicted the end of ‘one size fits all’ as a commercial relational DBMS paradigm.

“These papers presented reasons and experimental evidence that showed that the major RDBMS vendors can be outperformed by 1-2 orders of magnitude by specialized engines in the data warehouse, stream processing, text, and scientific database markets.”

[Stonebraker et al.; 2007]
