
Towards Benefit-Based RDF Source Selection for SPARQL Queries

Katja Hose, Ralf Schenkel (MPI Informatik, Saarland University)


Presentation Transcript


  1. Katja Hose, Ralf Schenkel (MPI Informatik, Saarland University): Towards Benefit-Based RDF Source Selection for SPARQL Queries

  2. Context: Linked Data Cloud
     • General Knowledge Bases: DBPedia, Freebase, YAGO
     • Domain-specific knowledge: Biology, Geo, Government, Publications, Movies, Songs, …
     • Linked Open Data as large integrated knowledge base
       • More than 31 billion triples in the LOD cloud, 325 sources
       • DBPedia: 3.6 million entities, 1.2 billion triples

  3. SPARQL: Querying Semantic Data
     SPARQL example (simplified, no prefixes etc.):
       SELECT ?a WHERE { ?a dc:authorOf ?p . ?p dc:publishedAt sigmod:2012/SWIM . }
     Source selection problem: Which of the 325 sources to query?
     • For each triple pattern, select all sources that have answers
     • 2 major solution branches:
       • Use ASK queries
       • Use source statistics

  4. Existing Approaches: ASK, VoID
     ASK:
     • Ask each source whether results exist for the triple pattern:
       ASK { ?p dc:publishedAt sigmod:2012/SWIM }
     • Send the query to all relevant sources
     VoID:
     • Summary for each source:
       :DBLP_L3S void:propertyPartition [ void:property dc:publishedAt; void:triples 400000 ];
     • Less precise than ASK, but cheaper
     More on Database Techniques for LOD: Tutorial by A. Harth, K. Hose, R. Schenkel on Thursday 10:30-12:00
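The difference in precision between the two branches can be illustrated with a minimal sketch. It is not the authors' implementation: the in-memory `SOURCES`, the helper names, and the toy triples are all assumptions made for illustration, and a real setup would send ASK queries to SPARQL endpoints and read published VoID descriptions instead.

```python
from typing import Optional

# A pattern position set to None plays the role of a SPARQL variable.
Triple = tuple[str, str, str]
Pattern = tuple[Optional[str], Optional[str], Optional[str]]


def matches(triple: Triple, pattern: Pattern) -> bool:
    """True if every bound position of the pattern equals the triple's value."""
    return all(p is None or p == t for t, p in zip(triple, pattern))


def ask(triples: set[Triple], pattern: Pattern) -> bool:
    """Stand-in for an ASK query: does the source hold at least one match?"""
    return any(matches(t, pattern) for t in triples)


def void_summary(triples: set[Triple]) -> dict[str, int]:
    """Stand-in for a VoID property partition: triple count per property."""
    counts: dict[str, int] = {}
    for _, prop, _ in triples:
        counts[prop] = counts.get(prop, 0) + 1
    return counts


def select_by_ask(sources: dict[str, set[Triple]], pattern: Pattern) -> list[str]:
    return [name for name, triples in sources.items() if ask(triples, pattern)]


def select_by_void(summaries: dict[str, dict[str, int]], pattern: Pattern) -> list[str]:
    # Only the property is checked, so subject and object constraints are ignored.
    prop = pattern[1]
    return [name for name, s in summaries.items() if prop is None or s.get(prop, 0) > 0]


# Hypothetical toy sources.
SOURCES: dict[str, set[Triple]] = {
    "dblp": {("paper1", "dc:publishedAt", "sigmod:2012/SWIM"),
             ("alice", "dc:authorOf", "paper1")},
    "acm": {("paper2", "dc:publishedAt", "vldb:2011")},
    "geo": {("berlin", "geo:locatedIn", "germany")},
}
PATTERN: Pattern = (None, "dc:publishedAt", "sigmod:2012/SWIM")

ASK_SELECTED = select_by_ask(SOURCES, PATTERN)
VOID_SELECTED = select_by_void({n: void_summary(t) for n, t in SOURCES.items()}, PATTERN)
```

In this toy setup the property-level statistics also select the `acm` source, which uses `dc:publishedAt` but has no triple for the requested venue; ASK avoids that false positive at the price of one request per source.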

  5. Focus of this Talk: Source Overlap
     • Many sources contain the same facts
     • Many duplicate results
     • Many unnecessary requests
     Obvious problem: overlapping sources

  6. Example for Overlapping Sources
     [Figure: overlapping result sets of Source 1, Source 2, and Source 3]
     • 6 results overall
     • 2 sources are enough to retrieve all results
     • Source 1 alone is "optimal" if
       • only one access is possible
       • or 5 results are enough
     Our contribution: Determine the "optimal" set of sources without seeing the results

  7. Problem Definition
     Given a SPARQL query with triple patterns P and possible sources S, compute a query plan qp ⊆ P × S (which pattern is executed at which source) such that
     • all results are retrieved with a minimal number of requests to sources (minimal exact plan)
     • as many results as possible are retrieved with |qp| ≤ max (maximize recall)
     • as few requests as possible are performed to retrieve at least r results (minimal approximate plan)
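One possible way to write the three variants as optimization problems is sketched below. The notation R(qp) for the set of results obtained by executing plan qp is an assumption introduced here and does not appear on the slide; each request is taken to be one (pattern, source) pair, so |qp| counts requests.

```latex
\begin{align*}
\text{minimal exact plan:} \quad
  & \min_{qp \subseteq P \times S} |qp|
  && \text{s.t. } R(qp) = R(P \times S) \\
\text{maximize recall:} \quad
  & \max_{qp \subseteq P \times S} |R(qp)|
  && \text{s.t. } |qp| \le \mathit{max} \\
\text{minimal approximate plan:} \quad
  & \min_{qp \subseteq P \times S} |qp|
  && \text{s.t. } |R(qp)| \ge r
\end{align*}
```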

  8. High-Level Solution Overview
     • Extend the ASK operation to provide a concise yet expressive summary of the result bindings of each variable (instead of a boolean yes/no)
     • Estimate source overlap with summaries
     • Select sources based on benefit
     Functional properties of summaries for sets:
     • Size of set (number of distinct elements)
     • Size of union of two sets
     • Size of intersection of two sets
     • Summary smaller than the data
     • Data not reproducible from the summary
     Examples: Bloom filters, KMV synopses, …
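As an illustration of a summary with these properties, the following is a minimal sketch of a KMV (k minimum values) synopsis. It uses the textbook estimators (distinct count from the k-th smallest hash value, intersection via the Jaccard fraction observed in the merged synopsis); whether the talk used exactly these estimators is an assumption, and the class and function names are made up for this example.

```python
import hashlib


def unit_hash(item: str) -> float:
    """Map an item to a pseudo-random value in [0, 1) via SHA-1."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMV:
    """Keeps only the k smallest hash values of a set."""

    def __init__(self, k: int, values=()):
        self.k = k
        self.values = sorted(set(values))[:k]

    @classmethod
    def from_items(cls, items, k: int = 64) -> "KMV":
        return cls(k, (unit_hash(x) for x in items))

    def size(self) -> float:
        """Estimated number of distinct elements."""
        if len(self.values) < self.k:
            return float(len(self.values))   # fewer than k values: count is exact
        return (self.k - 1) / self.values[-1]

    def union(self, other: "KMV") -> "KMV":
        """Synopsis of the union: merge and keep the k smallest values."""
        return KMV(min(self.k, other.k), self.values + other.values)

    def intersection_size(self, other: "KMV") -> float:
        """Estimated intersection size: Jaccard fraction times union size."""
        merged = self.union(other)
        if not merged.values:
            return 0.0
        mine, theirs = set(self.values), set(other.values)
        shared = sum(1 for v in merged.values if v in mine and v in theirs)
        return shared / len(merged.values) * merged.size()


# Hypothetical binding sets: p0..p9 and p5..p14 share five elements.
a = KMV.from_items(f"p{i}" for i in range(10))
b = KMV.from_items(f"p{i}" for i in range(5, 15))
big = KMV.from_items((f"e{i}" for i in range(10000)), k=256)
```

For sets smaller than k the synopsis is exact, so it only becomes smaller than the data (and lossy) once a source returns more than k bindings.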

  9. Bloom Filters
     • Represent each element of the set by k bits in a bit vector, determined by hash functions
     • Summary of union/intersection by union/intersection of bit vectors
     • Estimation for the number of elements in the underlying set of a vector with t 1-bits
     Example (k=2): {dblp:swim12/p1, dblp:swim12/p2}, hashed by hash1 and hash2 into the bit vector
       0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
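The three bullets can be sketched as follows. The estimation formula itself is not preserved in the transcript; the code assumes the standard estimator n ≈ -(m/k) · ln(1 - t/m) for a vector of m bits with t bits set. The vector length, the SHA-1 based hash functions, and the class name are likewise illustrative assumptions.

```python
import hashlib
import math


class BloomFilter:
    """Bit vector of m bits stored in a Python int; k hash functions per element."""

    def __init__(self, m: int = 4096, k: int = 2, bits: int = 0):
        self.m, self.k, self.bits = m, k, bits

    def _positions(self, item: str):
        # Derive k hash functions by salting SHA-1 with the function index.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        return BloomFilter(self.m, self.k, self.bits | other.bits)

    def intersection(self, other: "BloomFilter") -> "BloomFilter":
        # Bitwise AND may keep a few accidental common bits, so the
        # estimate derived from it tends to be slightly too high.
        return BloomFilter(self.m, self.k, self.bits & other.bits)

    def estimate_size(self) -> float:
        """Estimate the set size from the number t of 1-bits (assumed formula)."""
        t = bin(self.bits).count("1")
        if t >= self.m:
            return float("inf")
        return -(self.m / self.k) * math.log(1 - t / self.m)


def build(items) -> BloomFilter:
    bf = BloomFilter()
    for item in items:
        bf.add(item)
    return bf


# Hypothetical binding sets: 100 elements each, 50 of them shared.
a = build(f"p{i}" for i in range(100))
b = build(f"p{i}" for i in range(50, 150))
both = build(f"p{i}" for i in range(150))
```

Because the union filter is just the bitwise OR, a client can maintain a running summary of everything selected so far without contacting the sources again.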

  10. Source Selection for Single Triple Pattern
      • Benefit of a source: number of new results it can contribute
      • Incremental selection algorithm:
        • Maintain a summary for the union of results from the sources already selected
        • Estimate source benefit from the summary
        • Select the source with the highest benefit
        • Stop when the target (# results or # requests) is reached
      • Finally: Evaluate the triple pattern at all selected sources; select more sources if too few results
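The incremental algorithm can be sketched as a greedy loop. To keep the example short and exact, plain Python sets stand in for the summaries, which is a simplification: with Bloom filters or KMV synopses the benefit would be an estimate (size of the union summary minus size of the current summary) rather than an exact count. The function name, parameters, and the three toy sources are assumptions chosen to mirror the overlap example (6 results overall, Source 1 holding 5 of them).

```python
from typing import Optional


def select_sources(
    summaries: dict[str, set[str]],
    max_requests: Optional[int] = None,
    min_results: Optional[int] = None,
) -> tuple[list[str], int]:
    """Greedily pick the source with the highest benefit until a target is met.

    Returns the selected source names and the number of covered results.
    """
    selected: list[str] = []
    covered: set[str] = set()          # summary of the union of selected sources
    remaining = dict(summaries)

    while remaining:
        if max_requests is not None and len(selected) >= max_requests:
            break
        if min_results is not None and len(covered) >= min_results:
            break
        # Benefit = number of results the source adds to what is already covered.
        name, benefit = max(
            ((n, len(s - covered)) for n, s in remaining.items()),
            key=lambda pair: pair[1],
        )
        if benefit == 0:               # nothing new anywhere: plan is complete
            break
        selected.append(name)
        covered |= remaining.pop(name)

    return selected, len(covered)


# Hypothetical result sets with overlap.
SOURCES = {
    "source1": {"r1", "r2", "r3", "r4", "r5"},
    "source2": {"r5", "r6"},
    "source3": {"r4", "r5", "r6"},
}

exact_plan = select_sources(SOURCES)                      # all results
one_request = select_sources(SOURCES, max_requests=1)     # maximize recall
five_results = select_sources(SOURCES, min_results=5)     # approximate plan
```

After `source1` is chosen, `source2` and `source3` each contribute one new result, so either completes the plan; the sketch breaks the tie by dictionary order.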

  11. Example (Single Triple Pattern)
      [Figure: Bloom-filter bit vectors and result counts for Source 1, Source 2, and Source 3, together with the current result summary]
      1. ASK each source
      2. Select the source with the highest number of results
      3. Stop if the stopping condition is met (recall or number of results)
      4. Compute the benefit for each remaining source
         Source 2: 2 - (estimated overlap) = 1
         Source 3: 3 - (estimated overlap) = 1
      5. Select the source with the highest benefit
      6. Continue with step 3

  12. Star-Shaped Queries
      Multiple triple patterns with a single identical variable
      Example: ?x imdb:gender "female" . ?x imdb:bornIn dbpedia:Germany . ?x imdb:actedIn imdb:Titanic .
      • Not enough to consider each triple pattern separately
      • Need to focus on the intersection of the result sets
      • Extended incremental algorithm:
        • Init: For each triple pattern, pick the one source with the most results
        • Benefit of evaluating a triple pattern at a source: number of new results in the intersection
        • Estimated by the intersection of per-pattern summaries (each the union of the summaries from the sources selected for that pattern)
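A minimal sketch of the extended algorithm follows, again with exact sets standing in for the per-pattern summaries (an assumption made for readability; the talk estimates the intersection from Bloom filters or KMV synopses). The data layout `bindings[pattern][source]`, the function names, and the toy data are hypothetical.

```python
def intersection_size(per_pattern: dict[str, set[str]]) -> int:
    """Number of ?x bindings that satisfy every pattern."""
    sets = list(per_pattern.values())
    return len(set.intersection(*sets)) if sets else 0


def select_star(
    bindings: dict[str, dict[str, set[str]]], max_requests: int
) -> tuple[list[tuple[str, str]], int]:
    """Greedy (pattern, source) selection for a star-shaped query."""
    plan: list[tuple[str, str]] = []
    covered: dict[str, set[str]] = {}

    # Init: for each pattern, pick the source with the most results.
    for pattern, by_source in bindings.items():
        best = max(by_source, key=lambda s: len(by_source[s]))
        plan.append((pattern, best))
        covered[pattern] = set(by_source[best])

    while len(plan) < max_requests:
        base = intersection_size(covered)
        best_gain, best_pair = 0, None
        for pattern, by_source in bindings.items():
            for source, xs in by_source.items():
                if (pattern, source) in plan:
                    continue
                # Benefit: growth of the intersection if this request is added.
                trial = dict(covered)
                trial[pattern] = covered[pattern] | xs
                gain = intersection_size(trial) - base
                if gain > best_gain:
                    best_gain, best_pair = gain, (pattern, source)
        if best_pair is None:          # no single request adds a new result
            break
        pattern, source = best_pair
        plan.append(best_pair)
        covered[pattern] |= bindings[pattern][source]

    return plan, intersection_size(covered)


# Hypothetical ?x bindings per pattern and source.
BINDINGS = {
    "gender":  {"A": {"x1", "x2", "x3"}, "B": {"x4"}},
    "bornIn":  {"A": {"x1", "x4"},       "B": {"x2"}},
    "actedIn": {"A": {"x1"},             "B": {"x2", "x4", "x5"}},
}

plan4, size4 = select_star(BINDINGS, max_requests=4)
plan6, size6 = select_star(BINDINGS, max_requests=6)
```

In this toy data the initial per-pattern choices have an empty intersection even though each source returns results, which is why the patterns cannot be handled independently. A known limitation of this greedy sketch is that it stops when no single request increases the intersection, even if two requests together would.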

  13. Complex Queries
      Queries with >1 variable and >1 triple pattern
      Example: imdb:Tom_Cruise imdb:actedIn ?m . ?m imdb:producedBy ?p .
      • Summaries are not applicable to the whole query:
        • no connection between the summaries for variables ?m and ?p
        • Do new bindings for ?p join with existing bindings for ?m?
      • But: separate source selection for each pattern is possible
      • Plus:
        • excluding join candidates at execution time reduces the effort for nested-loop joins
        • run the full query at sources if no cross-joins are possible
      [Figure: join alternatives; best: 3 local joins, naive: 3x3 joins, improved: 6 joins]

  14. Experimental Evaluation: Setup
      • RDF dataset from the first 100,000 IMDB movies and their actors and directors
      • Generate overlapping partitions
        • For movies based on genre (28 partitions)
        • For persons based on birthplace and birthdate (22 partitions)
      • Queries:
        • 20 single triple patterns
        • 20 star-shaped queries
      • Consider minimal exact plan
      • Bloom filters of different sizes, KMV synopsis

  15. Triple Pattern Queries
      Far fewer requests while retrieving (almost) all results

  16. Star-Shaped Queries
      Good efficiency, but effectiveness sometimes suboptimal

  17. Conclusions and Future Work
      • Benefit-aware query routing can improve query performance for Linked Data
      • Additional benefit for join processing
      Future Work:
      • Integration of sameAs links
      • More general notions of benefit:
        • Transfer time
        • Access cost
        • Data quality
