Towards Benefit-Based RDF Source Selection for SPARQL Queries

Katja Hose, Ralf Schenkel
MPI Informatik, Saarland University

Context: Linked Data Cloud
  • General Knowledge Bases: DBpedia, Freebase, YAGO
  • Domain-specific knowledge: Biology, Geo, Government, Publications, Movies, Songs, …
  • Linked Open Data as large integrated knowledge base

More than 31 billion triples in the LOD cloud, 325 sources

DBpedia: 3.6 million entities, 1.2 billion triples

Ralf Schenkel

SPARQL: Querying Semantic Data

SPARQL example (simplified – no prefixes etc.):

SELECT ?a WHERE
{ ?a dc:authorOf ?p .
  ?p dc:publishedAt sigmod:2012/SWIM . }

Source selection problem: Which of the 325 sources to query?

  • For each triple pattern, select all sources that have answers
  • 2 major solution branches:
    • Use ASK queries
    • Use source statistics

Existing Approaches: ASK, VoID
  • Ask each source if results exist for the triple pattern

ASK {?p dc:publishedAt sigmod:2012/SWIM}

  • Send query to all relevant sources

VoID:

  • Summary for each source:

:DBLP_L3S void:propertyPartition [ void:property dc:publishedAt; void:triples 400000 ];

  • Less precise than ASK, but cheaper
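The difference between the two branches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sources are modelled as in-memory lists of triples rather than SPARQL endpoints, and the source names, sample triples, and helper functions (`ask`, `void_stats`, `select_by_ask`, `select_by_void`) are hypothetical.

```python
# Toy model (assumption): each source is a list of (subject, predicate, object)
# triples. None in a pattern plays the role of a SPARQL variable.
SOURCES = {
    "DBLP_L3S": [("p1", "dc:publishedAt", "sigmod:2012/SWIM"),
                 ("p2", "dc:publishedAt", "vldb:2011")],
    "ACM":      [("p3", "dc:publishedAt", "vldb:2011")],
    "IMDB":     [("m1", "imdb:producedBy", "x1")],
}

def matches(triple, pattern):
    """True if every bound position of the pattern equals the triple's value."""
    return all(p is None or p == t for t, p in zip(triple, pattern))

def ask(source, pattern):
    """Boolean ASK: does the source hold at least one matching triple?"""
    return any(matches(t, pattern) for t in SOURCES[source])

def void_stats(source):
    """VoID-like statistics: number of triples per property."""
    counts = {}
    for _, prop, _ in SOURCES[source]:
        counts[prop] = counts.get(prop, 0) + 1
    return counts

def select_by_ask(pattern):
    """One ASK request per source; keeps sources with at least one answer."""
    return [s for s in SOURCES if ask(s, pattern)]

def select_by_void(pattern):
    """No request per query; keeps sources that hold the property at all."""
    prop = pattern[1]
    return [s for s in SOURCES if void_stats(s).get(prop, 0) > 0]

pattern = (None, "dc:publishedAt", "sigmod:2012/SWIM")
ask_selected = select_by_ask(pattern)
void_selected = select_by_void(pattern)
```

In this toy data the property-level statistics also select the hypothetical source "ACM", which holds the property but no matching triple, while ASK selects only "DBLP_L3S". That is the precision versus cost trade-off named above.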

More on Database Techniques for LOD: Tutorial by A. Harth, K. Hose, R. Schenkel on Thursday 10:30-12:00

Focus of this Talk: Source Overlap

Many sources contain the same facts

Many duplicate results

Many unnecessary requests

Obvious problem: overlapping sources

Example for Overlapping Sources
  • 6 results overall
  • 2 sources enough to retrieve all results
  • Source 1 alone is "optimal" if
    • only one access possible
    • or 5 results are enough

[Figure: overlapping result sets of Source 1, Source 2, and Source 3]

Our contribution: Determine the "optimal" set of sources without seeing the results

Problem Definition

Given a SPARQL query with triple patterns P and possible sources S, compute a query plan qp ⊆ P × S (which pattern is executed at which source) such that

  • all results are retrieved with a minimal number of requests to sources (minimal exact plan)
  • as many results as possible are retrieved with |qp| ≤ max (maximize recall)
  • as few requests as possible are performed to retrieve at least r results (minimal approximate plan)
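To make the three plan variants concrete, here is a brute-force sketch for a single triple pattern, so a plan reduces to a set of sources. It assumes the exact result sets are known, which is precisely what the approach in this talk avoids; the result sets and function names are hypothetical and chosen to match the overlap example (6 results overall, 5 of them at one source).

```python
from itertools import combinations

# Hypothetical result sets of one triple pattern at three sources.
RESULTS = {
    "source1": {"a", "b", "c", "d", "e"},
    "source2": {"e", "f"},
    "source3": {"d", "e", "f"},
}

def covered(sources):
    """Union of the results returned by the given sources."""
    return set().union(*(RESULTS[s] for s in sources))

def plans_by_size():
    """All source subsets, smallest first (ties in alphabetical order)."""
    names = sorted(RESULTS)
    for size in range(len(names) + 1):
        for combo in combinations(names, size):
            yield combo

def minimal_exact_plan():
    """Fewest requests that still retrieve every result."""
    everything = covered(RESULTS)
    return next(p for p in plans_by_size() if covered(p) == everything)

def max_recall_plan(max_requests):
    """Most results retrievable with at most max_requests requests."""
    candidates = [p for p in plans_by_size() if len(p) <= max_requests]
    return max(candidates, key=lambda p: len(covered(p)))

def minimal_approximate_plan(r):
    """Fewest requests that retrieve at least r results, or None."""
    return next((p for p in plans_by_size() if len(covered(p)) >= r), None)
```

Enumerating all subsets is exponential in the number of sources, so this only serves to pin down the definitions.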

High-Level Solution Overview
  • Extend ASK operation to provide concise yet expressive summary of result bindings of each variable (instead of boolean yes/no)
  • Estimate source overlap with summaries
  • Select sources based on benefit

Functional properties of summaries for sets:

    • Size of set (number of distinct elements)
    • Size of union of two sets
    • Size of intersection of two sets
    • Summary smaller than the data
    • Data must not be reproducible from the summary

Examples: Bloom Filters, KMV synopsis, …
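As an illustration of how a summary can support these size estimates, the following is a minimal KMV (k minimum values) sketch. The slides do not say which KMV variant or parameters were used; the estimator (k - 1) / U_k, the synopsis size `K`, the SHA-256 based hash, and all function names are assumptions of this sketch.

```python
import hashlib

K = 16               # synopsis size (assumption; chosen small for readability)
HASH_SPACE = 2 ** 64

def h(value):
    """Deterministic 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

def kmv(values, k=K):
    """Synopsis: the k smallest distinct hash values of the set."""
    return sorted({h(v) for v in values})[:k]

def union(a, b, k=K):
    """Synopsis of the union: k smallest values of the merged synopses."""
    return sorted(set(a) | set(b))[:k]

def estimate(synopsis, k=K):
    """Estimated number of distinct elements in the summarized set."""
    if len(synopsis) < k:          # fewer than k hashes: the count is exact
        return float(len(synopsis))
    # The k-th smallest hash, scaled to (0, 1), sits near k / n.
    return (k - 1) * HASH_SPACE / synopsis[-1]

def estimate_intersection(a, b, k=K):
    """Overlap via inclusion-exclusion on the three size estimates."""
    return estimate(a, k) + estimate(b, k) - estimate(union(a, b, k), k)
```

The synopsis holds only hash values, so it is smaller than the data for large sets, and the original elements cannot be read back from it directly.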

Bloom Filters
  • Represent each element of the set by k bits in a bit vector, determined by hash functions
  • Summary of union/intersection by union/intersection of bit vectors
  • Estimate the number of elements in the underlying set from a vector with t 1-bits

Example (k=2): {dblp:swim12/p1, dblp:swim12/p2}

[Figure: hash1 and hash2 map each of the two elements to positions in the bit vector]

Bit vector: 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
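The Bloom filter operations can be sketched as follows. The size estimate uses the commonly cited formula n ≈ -(m/k) · ln(1 - t/m) for a vector of length m with t 1-bits; the formula shown on the original slide did not survive extraction, so this may differ from the one the authors used. The vector length `M`, the salted SHA-256 hashing, and all function names are assumptions of this sketch.

```python
import hashlib
import math

M = 64   # bit vector length (assumption)
K = 2    # bits set per element, as in the k=2 example

def positions(element, m=M, k=K):
    """k bit positions for an element, derived from salted SHA-256 hashes."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{element}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]

def bloom(elements, m=M, k=K):
    """Bit vector stored as a Python int, with bit p set for every position p."""
    bits = 0
    for e in elements:
        for p in positions(e, m, k):
            bits |= 1 << p
    return bits

def estimate_size(bits, m=M, k=K):
    """Estimate the number of inserted elements from the number of 1-bits."""
    t = bin(bits).count("1")
    if t >= m:
        return float("inf")   # a saturated filter carries no size information
    return -(m / k) * math.log(1 - t / m)

a = bloom(["dblp:swim12/p1", "dblp:swim12/p2"])
b = bloom(["dblp:swim12/p2", "dblp:swim12/p3"])
union_ab = a | b   # exactly the filter of the union of both sets
overlap = estimate_size(a) + estimate_size(b) - estimate_size(union_ab)
```

The bitwise OR is exactly the filter of the union. The bitwise AND can contain bits that no common element set, so this sketch estimates the overlap by inclusion-exclusion instead; that is a choice made here, not something the slides specify.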

Source Selection for Single Triple Pattern
  • Benefit of a source: number of new results it can contribute
  • Incremental selection algorithm:
    • Maintain summary for union of results from sources already selected
    • Estimate source benefit from summary
    • Select source with highest benefit
  • Stop when target (# results or # requests) reached
  • Finally: Evaluate triple pattern at all selected sources; select more sources if too few results
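The incremental selection loop can be sketched as below. To keep the example short, exact result sets stand in for the summaries (an assumption; the approach itself works on summaries such as Bloom filters and never sees the results during selection). The function name, parameter names, and sample data are hypothetical.

```python
def select_sources(results, max_requests=None, min_results=None):
    """Greedy source selection for one triple pattern.

    results maps a source name to its set of result bindings. Exact sets
    are used here in place of summaries (assumption of this sketch).
    """
    selected, seen = [], set()
    remaining = dict(results)
    while remaining:
        # Stop when the request budget or the result target is reached.
        if max_requests is not None and len(selected) >= max_requests:
            break
        if min_results is not None and len(seen) >= min_results:
            break
        # Benefit = number of results not yet covered by the selected sources.
        benefit = {s: len(r - seen) for s, r in remaining.items()}
        best = max(sorted(benefit), key=benefit.get)  # ties: alphabetical
        if benefit[best] == 0:
            break   # no remaining source contributes anything new
        selected.append(best)
        seen |= remaining.pop(best)
    return selected, seen

# Hypothetical data: 6 results overall, 5 of them at source1.
RESULTS = {
    "source1": {"a", "b", "c", "d", "e"},
    "source2": {"e", "f"},
    "source3": {"d", "e", "f"},
}
plan, found = select_sources(RESULTS)
```

With these sets the first pick is source1; afterwards source2 and source3 each have benefit 1, so one of them completes the plan. The final step of the list above (evaluating the pattern and selecting more sources if too few results come back) is not part of this sketch.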

Example (Single Triple Pattern)

[Figure: Bloom filter bit vectors of Source 1, Source 2, and Source 3, and of the current result summary]

1. ASK each source

2. Select source with highest number of results

3. Stop if stopping condition is met (recall or number of results)

4. Compute benefit for each remaining source

Source 2: 2 - (estimated overlap with current result summary) = 1

Source 3: 3 - (estimated overlap with current result summary) = 1

5. Select source with highest benefit

6. Continue with step 3

Star-Shaped Queries

Multiple triple patterns with a single identical variable

  • Not enough to consider each triple pattern separately
  • Need to focus on the intersection of the result sets
  • Extended incremental algorithm:
    • Init: Pick one source for each triple pattern with most results
    • Benefit of evaluating a triple pattern at a source: number of new results in the intersection
    • Estimated by intersection of per-pattern summaries (union of summaries from each selected source)

?x imdb:gender "female" .
?x imdb:bornIn dbpedia:Germany .
?x imdb:actedIn imdb:Titanic .
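For a star-shaped query such as the one above, the benefit computation can be sketched as follows. Exact binding sets again stand in for the per-pattern summaries (an assumption), and the pattern keys, source names, bindings, and function names are all hypothetical.

```python
# Hypothetical bindings of ?x per triple pattern and source.
RESULTS = {
    "gender":  {"s1": {"x1", "x2"}, "s2": {"x3"}},
    "bornIn":  {"s1": {"x2"},       "s2": {"x3", "x1"}},
    "actedIn": {"s1": {"x1", "x3"}, "s2": {"x2"}},
}

def initial_selection(results):
    """Init: for each pattern, pick the source with the most results."""
    return {
        p: [max(sorted(by_source), key=lambda s: len(by_source[s]))]
        for p, by_source in results.items()
    }

def star_result(selected, results):
    """Intersect, over all patterns, the union of the selected sources' bindings."""
    per_pattern = [
        set().union(*(results[p][s] for s in sources))
        for p, sources in selected.items()
    ]
    return set.intersection(*per_pattern)

def benefit(pattern, source, selected, results):
    """New bindings in the intersection if pattern is also evaluated at source."""
    extended = {p: list(s) for p, s in selected.items()}
    extended[pattern].append(source)
    return len(star_result(extended, results)) - len(star_result(selected, results))

selected = initial_selection(RESULTS)
```

Here the initial selection yields one binding in the intersection; evaluating the first pattern at "s2" as well adds a second one, whereas evaluating the second pattern at "s1" adds nothing, even though "s1" does return a result for that pattern on its own.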

Complex Queries

Queries with more than one variable and more than one triple pattern

  • Summaries not applicable for whole query:
    • no connection of summaries for variables ?m and ?p
    • Do new bindings for ?p join with existing bindings for ?m ?
  • But: separate source selection for each pattern possible
  • Plus: excluding join candidates at execution time reduces effort for nested-loop joins
  • Run the full query at sources if no cross-joins are possible

imdb:Tom_Cruise imdb:actedIn ?m .
?m imdb:producedBy ?p .

best: 3 local joins

naive: 3x3 joins

improved: 6 joins

Experimental Evaluation: Setup
  • RDF dataset from the first 100,000 IMDB movies and their actors and directors
  • Generate overlapping partitions
    • For movies based on genre (28 partitions)
    • For persons based on birthplace and birthdate (22 partitions)
  • Queries:
    • 20 single triple patterns
    • 20 star-shaped queries
  • Consider minimal exact plan
  • Bloom filters of different sizes, KMV synopsis

Triple Pattern Queries

Far fewer requests while retrieving (almost) all results

Star-Shaped Queries

Good efficiency, but effectiveness sometimes suboptimal

Conclusions and Future Work
  • Benefit-aware query routing can improve query performance for Linked Data
  • Additional benefit for join processing

Future Work:

  • Integration of sameAs links
  • More general notions of benefit:
    • Transfer time
    • Access cost
    • Data quality
