Towards Benefit-Based RDF Source Selection for SPARQL Queries

Katja Hose, Ralf Schenkel (MPI Informatik, Saarland University)

Context: Linked Data Cloud
• General knowledge bases: DBpedia, Freebase, YAGO
• Domain-specific knowledge: Biology, Geo, Government, Publications, Movies, Songs, …
• Linked Open Data as a large integrated knowledge base
  • > 31 billion triples in the LOD cloud, 325 sources
  • DBpedia: 3.6 million entities, 1.2 billion triples

SPARQL: Querying Semantic Data
SPARQL example (simplified, no prefixes etc.):
  SELECT ?a WHERE {
    ?a dc:authorOf ?p .
    ?p dc:publishedAt sigmod:2012/SWIM .
  }
Source selection problem: Which of the 325 sources to query?
• For each triple pattern, select all sources that have answers
• 2 major solution branches:
  • Use ASK queries
  • Use source statistics

Existing Approaches: ASK, VoID
ASK:
• Ask each source whether results exist for the triple pattern
  ASK { ?p dc:publishedAt sigmod:2012/SWIM }
• Send the query to all relevant sources
VoID:
• Summary for each source:
  :DBLP_L3S void:propertyPartition [ void:property dc:publishedAt; void:triples 400000 ];
• Less precise than ASK, but cheaper
More on Database Techniques for LOD: Tutorial by A. Harth, K. Hose, R. Schenkel on Thursday 10:30-12:00
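The difference between the two branches can be illustrated with a minimal in-memory sketch. It is not the talk's implementation: the sources are plain Python sets instead of SPARQL endpoints, and the names `ask`, `select_sources_by_ask`, and `select_sources_by_void` are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
# A triple pattern; None marks a variable position (assumption of this sketch).
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def ask(triples: Iterable[Triple], pattern: Pattern) -> bool:
    """Emulate an ASK query: True if at least one triple matches the pattern."""
    return any(
        all(p is None or p == t for p, t in zip(pattern, triple))
        for triple in triples
    )


def select_sources_by_ask(
    sources: Dict[str, Set[Triple]], pattern: Pattern
) -> List[str]:
    """Keep every source that answers the ASK query with 'yes'."""
    return [name for name, triples in sources.items() if ask(triples, pattern)]


def select_sources_by_void(
    stats: Dict[str, Dict[str, int]], predicate: str
) -> List[str]:
    """VoID-style selection: keep sources whose statistics list the predicate."""
    return [name for name, props in stats.items() if props.get(predicate, 0) > 0]


# Hypothetical toy sources; only the predicate names come from the slides.
sources = {
    "A": {("p1", "dc:publishedAt", "sigmod:2012/SWIM")},
    "B": {("p9", "dc:publishedAt", "vldb:2011")},
}
stats = {"A": {"dc:publishedAt": 1}, "B": {"dc:publishedAt": 1}}
pattern = (None, "dc:publishedAt", "sigmod:2012/SWIM")
print(select_sources_by_ask(sources, pattern))          # only source A matches
print(select_sources_by_void(stats, "dc:publishedAt"))  # A and B both qualify
```

In this toy data the per-predicate statistics cannot tell that source B holds no triple for the requested workshop, which is the sense in which statistics are less precise than ASK.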

Focus of this Talk: Source Overlap
• Many sources contain the same facts
• Many duplicate results
• Many unnecessary requests
Obvious problem: overlapping sources

Example for Overlapping Sources
[Figure: overlapping result sets of Source 1, Source 2, and Source 3]
• 6 results overall
• 2 sources enough to retrieve all results
• Source 1 alone is "optimal" if
  • only one access is possible
  • or 5 results are enough
Our contribution: Determine the "optimal" set of sources without seeing the results

Problem Definition
Given a SPARQL query with triple patterns P and possible sources S, compute a query plan qp ⊆ P × S (which pattern is executed at which source) such that one of the following holds:
• all results are retrieved with a minimal number of requests to sources (minimal exact plan)
• as many results as possible are retrieved with |qp| ≤ max (maximize recall)
• as few requests as possible are performed to retrieve at least r results (minimal approximate plan)

High-Level Solution Overview
• Extend the ASK operation to provide a concise yet expressive summary of the result bindings of each variable (instead of a boolean yes/no)
• Estimate source overlap with summaries
• Select sources based on benefit
Functional properties of summaries for sets:
• Size of set (number of distinct elements)
• Size of union of two sets
• Size of intersection of two sets
• Summary smaller than the data
• Data not reproducible from the summary
Examples: Bloom filters, KMV synopsis, …
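As an illustration of the listed properties, here is a minimal sketch of a KMV (k minimum values) synopsis. The slides do not show how the synopsis is built, so everything below is an assumption: the SHA-1 based hash, the function names, the textbook estimator (k - 1) / U_k, and the inclusion-exclusion estimate for the intersection, which is a simplification and not necessarily the estimator used in the talk.

```python
import hashlib
from typing import Iterable, List


def _unit_hash(item: str) -> float:
    """Map an item deterministically to a pseudo-random value in (0, 1]."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def kmv(items: Iterable[str], k: int = 64) -> List[float]:
    """Synopsis: the k smallest distinct hash values of the set, sorted."""
    return sorted({_unit_hash(x) for x in items})[:k]


def kmv_union(a: List[float], b: List[float], k: int = 64) -> List[float]:
    """Synopsis of the union, computed from the two synopses alone."""
    return sorted(set(a) | set(b))[:k]


def kmv_size(synopsis: List[float], k: int = 64) -> float:
    """Estimate the number of distinct elements behind a synopsis."""
    if len(synopsis) < k:
        return float(len(synopsis))  # fewer than k values: the count is exact
    return (k - 1) / synopsis[k - 1]  # (k - 1) / k-th smallest hash value


def kmv_intersection_size(a: List[float], b: List[float], k: int = 64) -> float:
    """Inclusion-exclusion estimate: |A| + |B| - |A union B| (simplification)."""
    return kmv_size(a, k) + kmv_size(b, k) - kmv_size(kmv_union(a, b, k), k)
```

The synopsis holds at most k numbers regardless of the set size, and the original elements cannot be read back from the hash values.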

Bloom Filters
• Represent each element of the set by k bits in a bit vector, determined by hash functions
• Summary of union/intersection by union/intersection of bit vectors
• Estimation for the number of elements in the underlying set of a vector with t 1-bits
Example (k=2): {dblp:swim12/p1, dblp:swim12/p2}
[Figure: hash1 and hash2 map each element to a position in the bit vector]
0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
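A minimal Bloom filter sketch along these lines follows. The slide does not give the estimation formula; the code assumes the standard estimate n ≈ -(m / k) · ln(1 - t / m) for a vector of m bits with t bits set. The class name, the SHA-1 based hash functions, and the default sizes are likewise assumptions of this sketch.

```python
import hashlib
import math
from typing import Iterable, Iterator


class BloomFilter:
    """Bit vector of m bits, stored as a Python int; k hash functions."""

    def __init__(self, m: int = 1024, k: int = 2, bits: int = 0) -> None:
        self.m, self.k, self.bits = m, k, bits

    def _positions(self, item: str) -> Iterator[int]:
        # k hash functions derived from SHA-1 with different prefixes.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    @classmethod
    def of(cls, items: Iterable[str], m: int = 1024, k: int = 2) -> "BloomFilter":
        bf = cls(m, k)
        for item in items:
            bf.add(item)
        return bf

    def union(self, other: "BloomFilter") -> "BloomFilter":
        # Bitwise OR: identical to the filter built from the union of the sets.
        return BloomFilter(self.m, self.k, self.bits | other.bits)

    def intersection(self, other: "BloomFilter") -> "BloomFilter":
        # Bitwise AND: approximates the filter of the intersection
        # (it may contain additional 1-bits).
        return BloomFilter(self.m, self.k, self.bits & other.bits)

    def estimate_size(self) -> float:
        """Estimate the number of elements from the number t of 1-bits."""
        t = bin(self.bits).count("1")
        if t >= self.m:
            return float("inf")  # saturated vector: no meaningful estimate
        return -(self.m / self.k) * math.log(1 - t / self.m)


a = BloomFilter.of({"dblp:swim12/p1", "dblp:swim12/p2"})
b = BloomFilter.of({"dblp:swim12/p2", "dblp:swim12/p3"})
print(a.union(b).estimate_size(), a.intersection(b).estimate_size())
```

Because the AND of two vectors can keep bits that no common element set, the intersection estimate tends to be slightly too high; larger vectors reduce this effect.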

Source Selection for Single Triple Pattern
• Benefit of a source: number of new results it can contribute
• Incremental selection algorithm:
  • Maintain summary for union of results from sources already selected
  • Estimate source benefit from summary
  • Select source with highest benefit
  • Stop when target (# results or # requests) reached
• Finally: Evaluate triple pattern at all selected sources; select more sources if too few results
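The incremental selection above can be sketched as a greedy loop. To keep the example deterministic, exact result sets stand in for the summaries (an assumption of this sketch; with real summaries the benefit would only be estimated). The function name `select_sources` and the result identifiers r1 to r6 are hypothetical; the result counts per source follow the overlap example in the slides.

```python
from typing import Dict, List, Optional, Set


def select_sources(
    results: Dict[str, Set[str]],
    min_results: Optional[int] = None,
    max_requests: Optional[int] = None,
) -> List[str]:
    """Greedily pick the source with the highest benefit until a target is met."""
    selected: List[str] = []
    covered: Set[str] = set()  # stands in for the summary of the union so far
    remaining = dict(results)
    while remaining:
        if max_requests is not None and len(selected) >= max_requests:
            break  # request budget exhausted
        if min_results is not None and len(covered) >= min_results:
            break  # enough results expected
        # Benefit of a source: number of results not yet covered.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # no remaining source contributes anything new
        covered |= remaining.pop(best)
        selected.append(best)
    return selected


# 6 results overall; Source 1 holds 5, Source 2 holds 2, Source 3 holds 3.
results = {
    "Source 1": {"r1", "r2", "r3", "r4", "r5"},
    "Source 2": {"r5", "r6"},
    "Source 3": {"r3", "r4", "r6"},
}
print(select_sources(results))                  # two sources cover all results
print(select_sources(results, max_requests=1))  # only one access possible
print(select_sources(results, min_results=5))   # 5 results are enough
```

With equal benefit for Source 2 and Source 3 after the first pick, this sketch simply takes the first one in dictionary order; any tie-breaking rule would do.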

Example (Single Triple Pattern)
[Figure: Bloom-filter bit vectors for Source 1, Source 2, Source 3, and the current result summary]
1. ASK each source
2. Select source with highest number of results
3. Stop if stopping condition is met (recall or number of results)
4. Compute benefit for each remaining source
   Source 2: 2 - [overlap with current result summary] = 1
   Source 3: 3 - [overlap with current result summary] = 1
5. Select source with highest benefit
6. Continue with step 3

Star-Shaped Queries
Multiple triple patterns with a single identical variable
• Not enough to consider each triple pattern separately
• Need to focus on the intersection of the result sets
• Extended incremental algorithm:
  • Init: For each triple pattern, pick the source with the most results
  • Benefit of evaluating a triple pattern at a source: number of new results in the intersection
  • Estimated by intersection of per-pattern summaries (each the union of the summaries from the selected sources)
Example:
  ?x imdb:gender "female" .
  ?x imdb:bornIn dbpedia:Germany .
  ?x imdb:actedIn imdb:Titanic .
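The extended algorithm for star-shaped queries can be sketched in the same greedy style. As before, exact binding sets replace the summaries, and the name `select_star`, the pattern labels, the source names, and the bindings x1 to x4 are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# bindings[pattern][source] = bindings of the shared variable ?x
Bindings = Dict[str, Dict[str, Set[str]]]


def intersection_size(per_pattern: Dict[str, Set[str]]) -> int:
    """Number of ?x bindings that satisfy every triple pattern."""
    return len(set.intersection(*per_pattern.values()))


def select_star(bindings: Bindings, max_requests: int) -> List[Tuple[str, str]]:
    plan: List[Tuple[str, str]] = []
    union: Dict[str, Set[str]] = {}
    # Init: for each triple pattern pick the source with the most results.
    for pattern, by_source in bindings.items():
        best = max(by_source, key=lambda s: len(by_source[s]))
        plan.append((pattern, best))
        union[pattern] = set(by_source[best])
    while len(plan) < max_requests:
        current = intersection_size(union)
        best_gain, best_pair = 0, None  # type: int, Optional[Tuple[str, str]]
        for pattern, by_source in bindings.items():
            for source, values in by_source.items():
                if (pattern, source) in plan:
                    continue
                # Benefit: new results in the intersection after this request.
                trial = dict(union)
                trial[pattern] = union[pattern] | values
                gain = intersection_size(trial) - current
                if gain > best_gain:
                    best_gain, best_pair = gain, (pattern, source)
        if best_pair is None:
            break  # no single request adds results to the intersection
        union[best_pair[0]] |= bindings[best_pair[0]][best_pair[1]]
        plan.append(best_pair)
    return plan


bindings = {
    "gender": {"A": {"x1", "x2", "x3"}, "B": {"x4"}},
    "bornIn": {"A": {"x1"}, "B": {"x2", "x4"}},
    "actedIn": {"A": {"x1", "x2"}, "C": {"x4"}},
}
print(select_star(bindings, max_requests=5))
```

In this toy data the binding x4 is never found, because it would need two further requests at once and neither of them increases the intersection on its own.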

Complex Queries
Queries with >1 variable and >1 triple pattern
• Summaries not applicable for whole query:
  • no connection of summaries for variables ?m and ?p
  • Do new bindings for ?p join with existing bindings for ?m?
• But: separate source selection for each pattern possible
• Plus: excluding join candidates at execution time reduces effort for nested-loop joins
• Run full query at sources if no cross-joins possible
Example:
  imdb:Tom_Cruise imdb:actedIn ?m .
  ?m imdb:producedBy ?p .
Join effort: naive: 3x3 joins; improved: 6 joins; best: 3 local joins

Experimental Evaluation: Setup
• RDF dataset from the first 100,000 IMDB movies and their actors and directors
• Generate overlapping partitions
  • For movies based on genre (28 partitions)
  • For persons based on birthplace and birthdate (22 partitions)
• Queries:
  • 20 single triple patterns
  • 20 star-shaped queries
• Consider minimal exact plan
• Bloom filters of different sizes, KMV synopsis

Triple Pattern Queries
Far fewer requests while retrieving (almost) all results

Star-Shaped Queries
Good efficiency, but effectiveness sometimes suboptimal

Conclusions and Future Work
• Benefit-aware query routing can improve query performance for Linked Data
• Additional benefit for join processing
Future Work:
• Integration of sameAs links
• More general notions of benefit:
  • Transfer time
  • Access cost
  • Data quality
