Learn about DARQ, an engine for federated SPARQL queries, enabling integrated access to multiple RDF data sources. Understand its processes, service descriptions, query planning, optimization, and execution. Explore statistical information utilization for query optimization.
Querying Distributed RDF Data Sources with SPARQL Presented by Bastian Quilitz and Ulf Leser Humboldt-Universität zu Berlin ESWC 2008 2009-07-23 Summarized by Jaeseok Myung Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea
Introduction • SPARQL has to deal with thousands of RDF data sets • on a local machine • on multiple, distributed machines • Integrated access to multiple RDF data sources is a key challenge for many semantic web applications • Current SPARQL implementations load all RDF graphs onto the local machine • This usually incurs a large overhead in network traffic Center for E-Business Technology
Introduction • DARQ, an engine for federated SPARQL queries • Provides transparent query access to multiple SPARQL services • Distributed ARQ, an extension to ARQ (Jena) • Available under the GPL license at http://darq.sf.net/ • [Figure: scope of this presentation, with labels "Building Sub-queries", "Metadata for each DS", "Data Source", "Do not care"]
Preliminaries • A SPARQL query Q is defined as Q = (E, DS, R) • E : the algebra expression of the SPARQL query • DS : an RDF data source • R : the query type (SELECT, CONSTRUCT, DESCRIBE, ASK) • The algebra expression E consists of • Graph Patterns • Triple Pattern : (s, p, o) • Basic Graph Pattern (BGP) : a set of triple patterns • Filtered BGP : a BGP with constraints • Solution Modifiers • Such as PROJECTION, DISTINCT, LIMIT or ORDER BY
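The definition Q = (E, DS, R) can be pictured as a small data model. The following is only a sketch under stated assumptions: the class and field names (`FilteredBGP`, `Query`, `variables`) and the endpoint URL are hypothetical and are not taken from DARQ or ARQ. It encodes the FOAF example query used in this deck.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# A triple pattern (s, p, o); terms starting with "?" are variables.
TriplePattern = Tuple[str, str, str]


@dataclass
class FilteredBGP:
    """A basic graph pattern (set of triple patterns) plus value constraints."""
    triples: List[TriplePattern]
    filters: List[str] = field(default_factory=list)


@dataclass
class Query:
    """Q = (E, DS, R); the field names are illustrative, not DARQ's."""
    pattern: FilteredBGP      # E: graph pattern
    projection: List[str]     # E: solution modifier PROJECTION
    order_by: List[str]       # E: solution modifier ORDER BY
    limit: Optional[int]      # E: solution modifier LIMIT
    dataset: str              # DS: the RDF data source
    query_type: str           # R: SELECT, CONSTRUCT, DESCRIBE or ASK


def variables(bgp: FilteredBGP) -> List[str]:
    """Return the sorted variable names used in the triple patterns."""
    return sorted({t for tp in bgp.triples for t in tp if t.startswith("?")})


example = Query(
    pattern=FilteredBGP(
        triples=[("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", "?mbox")],
        filters=['regex(?name, "^Tim")', 'regex(?mbox, "w3c")'],
    ),
    projection=["?name", "?mbox"],
    order_by=["?name"],
    limit=5,
    dataset="http://example.org/sparql",  # hypothetical endpoint
    query_type="SELECT",
)
```

Keeping the filters next to the triple patterns mirrors the "filtered BGP" notion, which is the unit that later becomes a sub-query.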
An Example SPARQL Query
SELECT ?name ?mbox
WHERE {
  ?x foaf:name ?name .
  ?x foaf:mbox ?mbox .
  FILTER (regex(?name, "^Tim") && regex(?mbox, "w3c"))
}
ORDER BY ?name
LIMIT 5
Annotated parts: query type (SELECT), projection (?name ?mbox), triple patterns (TP), basic graph pattern (BGP), filtered BGP (FBGP), solution modifiers (ORDER BY, LIMIT)
Query Processing • A query is processed in 4 stages: • Parsing : converts the query string into a tree model of SPARQL. The DARQ query engine reuses the parser shipped with ARQ • Query Planning : the query engine decomposes the original query into multiple sub-queries according to the information in the service descriptions; each sub-query can be answered by one known data source • Query Optimization : the query optimizer takes the sub-queries and rewrites them into a cheaper query execution plan • Query Execution : the query execution plan is executed. The sub-queries are sent to the data sources and the results are integrated
Service Descriptions • Information about each data source is needed • To find the relevant data sources for the different triple patterns • To decompose the query into sub-queries • Service descriptions • Describe which data is available from a data source • Allow limitations on access patterns • Include statistical information used for query optimization • Are represented in RDF
Service Descriptions • Data Description • A service description defines capabilities, which indicate whether data is available or not • Ex) sd:capability [ sd:predicate rdf:type ]; • The definition of capabilities is based on predicates • DARQ currently only supports queries with bound predicates • Limitation on Access Pattern • DARQ supports limitations on access patterns • Ex) sd:requiredBindings [ sd:subjectBinding foaf:name ]; • Ex) sd:requiredBindings [ sd:objectBinding foaf:name ];
Service Descriptions • Statistical Information • Helps the query optimizer find a cost-effective query plan • Includes • N_s : the total number of triples in the data source • Optional information for each predicate • n_D(p) : the number of triples with predicate p in data source D • ssel_D(p) : the selectivity of a triple pattern with predicate p when the subject is bound (default = 1 / n_D(p)) • osel_D(p) : the selectivity of a triple pattern with predicate p when the object is bound (default = 1) • These statistics are simple, so every data source can provide them • More precise statistics would be preferable but cannot be assumed to be available
Service Descriptions • The data source defined in the example can answer queries for foaf:name, foaf:mbox and foaf:weblog • Objects of triples with predicate foaf:name must always start with a letter from A to R • In total it stores 112 triples • The data source has limitations on access patterns, i.e. a query must contain a triple pattern with predicate foaf:name or foaf:mbox and a bound object
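The properties listed on this slide can be mirrored in a small in-memory model, which makes the capability check and the access-pattern check concrete. This is a minimal sketch, not DARQ's implementation: the class `ServiceDescription`, its field and method names, the regular expression encoding of the "A to R" constraint, and the endpoint URL are all assumptions made for illustration. Literals are written as plain strings and variables start with "?".

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

TriplePattern = Tuple[str, str, str]


def is_bound(term: str) -> bool:
    """A term is bound unless it is a variable such as ?x."""
    return not term.startswith("?")


@dataclass
class ServiceDescription:
    url: str
    total_triples: int
    # predicate -> optional regex that bound objects must match
    capabilities: Dict[str, Optional[str]]
    # predicates of which at least one must occur with a bound object
    required_object_bindings: List[str]

    def can_answer(self, tp: TriplePattern) -> bool:
        """Capability check: predicate known and object constraint not violated."""
        _, p, o = tp
        if p not in self.capabilities:
            return False
        constraint = self.capabilities[p]
        if constraint is not None and is_bound(o):
            return re.match(constraint, o) is not None
        return True

    def satisfies_access_pattern(self, bgp: List[TriplePattern]) -> bool:
        """Access-pattern check over all triple patterns sent to this source."""
        if not self.required_object_bindings:
            return True
        return any(p in self.required_object_bindings and is_bound(o)
                   for _, p, o in bgp)


# The data source described on the slide (URL is hypothetical).
example_sd = ServiceDescription(
    url="http://example.org/sparql",
    total_triples=112,
    capabilities={"foaf:name": "^[A-R]", "foaf:mbox": None, "foaf:weblog": None},
    required_object_bindings=["foaf:name", "foaf:mbox"],
)
```

With this model, a pattern such as `("?x", "foaf:name", "Tim")` is rejected because "Tim" does not start with a letter from A to R, and a query that only contains `("?x", "foaf:name", "?n")` violates the access-pattern limitation because no object is bound.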
Query Planning • Query planning is based on the information provided by the service descriptions • It has two stages • Source Selection: determines which data sources are relevant to a given query • The algorithm matches each triple pattern against the capabilities of the data sources • Ex) sd:capability [ sd:predicate rdf:type ]; • SELECT ?x WHERE { ?x rdf:type foaf:Person } • As a result, every triple pattern in a BGP has a set of corresponding data sources • The results of source selection are used to build sub-queries, each of which can be answered by a single data source • Building Sub-Queries • Each data source has a sub-query • Each sub-query has a filtered BGP
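The two planning stages can be sketched in a few lines. This is an illustration under assumptions rather than DARQ's code: the function names `select_sources` and `build_subqueries` and the data source names are hypothetical, capabilities are reduced to a set of predicates per source, and the handling of a triple pattern matched by several sources (one separate single-pattern sub-query per source) is an assumption that the slide does not spell out.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

TriplePattern = Tuple[str, str, str]


def select_sources(bgp: List[TriplePattern],
                   capabilities: Dict[str, Set[str]]) -> Dict[TriplePattern, List[str]]:
    """Stage 1: match each triple pattern's predicate against the capabilities."""
    selection = {}
    for tp in bgp:
        _, p, _ = tp
        if p.startswith("?"):
            raise ValueError(f"unbound predicate is not supported: {tp}")
        sources = sorted(s for s, preds in capabilities.items() if p in preds)
        if not sources:
            raise LookupError(f"no data source can answer {tp}")
        selection[tp] = sources
    return selection


def build_subqueries(selection: Dict[TriplePattern, List[str]]
                     ) -> List[Tuple[str, List[TriplePattern]]]:
    """Stage 2: group patterns into (data source, triple patterns) sub-queries."""
    grouped = defaultdict(list)   # patterns with exactly one matching source
    separate = []                 # assumption: ambiguous patterns stay separate
    for tp, sources in selection.items():
        if len(sources) == 1:
            grouped[sources[0]].append(tp)
        else:
            separate.extend((s, [tp]) for s in sources)
    return sorted(grouped.items()) + separate


caps = {"ds1": {"foaf:name", "foaf:mbox"}, "ds2": {"rdf:type"}}
bgp = [("?x", "rdf:type", "foaf:Person"),
       ("?x", "foaf:name", "?name"),
       ("?x", "foaf:mbox", "?mbox")]
plan = build_subqueries(select_sources(bgp, caps))
```

Here `plan` contains one sub-query for `ds1` with the two FOAF patterns and one for `ds2` with the `rdf:type` pattern. The explicit error for a variable predicate reflects the restriction that DARQ only supports bound predicates.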
Query Planning
SELECT ?name ?mbox
WHERE {
  ?x foaf:name ?name .
  ?x foaf:mbox ?mbox .
  FILTER (regex(?name, "^Tim") && regex(?mbox, "w3c"))
}
ORDER BY ?name
LIMIT 5
[Figure: DARQ splits the query into the triple patterns (?x foaf:name ?name) and (?x foaf:mbox ?mbox) and matches them against the capabilities of the data sources. Capabilities shown: sd:predicate foaf:name; sd:predicate foaf:mbox; sd:predicate foaf:name and sd:predicate foaf:mbox together. Sample data shown: (Person, name, "TBL"), (Person, mbox, "T@x.y"), (Person, name, "ABC"), (Person, mbox, "A@b.c")]
Query Optimization - Logical • Rule-based Query Rewriting • Based on [Perez, J. et al., ISWC 2006] • Reduces the number of BGPs and variables • Moves value constraints (filters) into sub-queries
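Moving value constraints into sub-queries can be illustrated as follows. This is a simplified sketch, not the rewriting rules of the cited work: the function `push_filters` is hypothetical, filters are plain strings that have already been split at top-level `&&`, and the rule applied is the assumption that a filter may move into a sub-query when every variable it mentions occurs in that sub-query's triple patterns.

```python
import re
from typing import Dict, List, Set, Tuple

TriplePattern = Tuple[str, str, str]


def vars_of_triples(triples: List[TriplePattern]) -> Set[str]:
    """Variables bound by a list of triple patterns."""
    return {t for tp in triples for t in tp if t.startswith("?")}


def vars_of_filter(expr: str) -> Set[str]:
    """Variables mentioned in a filter string.

    Naive: a '?' inside a string literal (e.g. a regex pattern) would be
    misread as a variable, which is acceptable for this illustration only.
    """
    return set(re.findall(r"\?\w+", expr))


def push_filters(subqueries: Dict[str, List[TriplePattern]],
                 filters: List[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Assign each filter to every sub-query that binds all of its variables.

    Returns (filters per sub-query, filters that must stay in the federator).
    """
    pushed: Dict[str, List[str]] = {name: [] for name in subqueries}
    remaining: List[str] = []
    for f in filters:
        needed = vars_of_filter(f)
        targets = [name for name, tps in subqueries.items()
                   if needed <= vars_of_triples(tps)]
        if targets:
            for name in targets:
                pushed[name].append(f)
        else:
            remaining.append(f)
    return pushed, remaining


subqueries = {
    "ds1": [("?x", "foaf:name", "?name")],
    "ds2": [("?x", "foaf:mbox", "?mbox")],
}
filters = ['regex(?name, "^Tim")', 'regex(?mbox, "w3c")', "?name != ?mbox"]
pushed, remaining = push_filters(subqueries, filters)
```

In this example each `regex` constraint travels with the sub-query that binds its variable, so the data source can discard non-matching results before they cross the network, while `?name != ?mbox` spans both sub-queries and has to be evaluated after the join.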
Query Optimization - Physical • Physical optimization is cost-based and relies on estimating the size of intermediate results • The result size estimation is based on the statistics provided in the service descriptions • Estimates cover joins, single triple patterns, and multiple triple patterns (BGPs) • An example of a single triple pattern
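For a single triple pattern, a result size estimate can be derived from the per-predicate statistics n_D(p), ssel_D(p) and osel_D(p) with their stated defaults. The sketch below is one plausible reading, not necessarily the exact cost function of the paper: the names `PredicateStats` and `estimate_result_size` are hypothetical, the sample numbers are made up, and the value returned when both subject and object are bound is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

TriplePattern = Tuple[str, str, str]


@dataclass
class PredicateStats:
    n: int                         # n_D(p): triples with predicate p in source D
    ssel: Optional[float] = None   # ssel_D(p); default 1 / n_D(p)
    osel: Optional[float] = None   # osel_D(p); default 1

    def subject_selectivity(self) -> float:
        if self.ssel is not None:
            return self.ssel
        return 1.0 / self.n if self.n else 0.0

    def object_selectivity(self) -> float:
        return self.osel if self.osel is not None else 1.0


def estimate_result_size(tp: TriplePattern, stats: PredicateStats) -> float:
    """Estimated number of results for one triple pattern with a bound predicate."""
    s, _, o = tp
    s_bound = not s.startswith("?")
    o_bound = not o.startswith("?")
    if s_bound and o_bound:
        # Assumption: a fully bound pattern matches at most one triple.
        return min(1.0, float(stats.n))
    if s_bound:
        return stats.n * stats.subject_selectivity()
    if o_bound:
        return stats.n * stats.object_selectivity()
    return float(stats.n)


# Hypothetical statistics for foaf:name in some data source.
name_stats = PredicateStats(n=50, osel=0.1)
unbound = estimate_result_size(("?x", "foaf:name", "?name"), name_stats)
by_object = estimate_result_size(("?x", "foaf:name", "Tim"), name_stats)
by_subject = estimate_result_size(("ex:tim", "foaf:name", "?name"), name_stats)
```

With the default ssel_D(p) = 1 / n_D(p), a pattern with a bound subject is estimated at about one result, whereas the default osel_D(p) = 1 is deliberately pessimistic and estimates n_D(p) results unless the data source supplies a better value.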
Evaluation • Dataset : a subset of DBpedia, 31.5 million triples in total • Contains RDF data extracted from Wikipedia • http://dbpedia.org
Evaluation • 2 physical machines, 5 logical SPARQL endpoints
Evaluation • Query optimization yielded significant improvements • My opinion • The experiment doesn't count the loading time • A comparison with other systems is needed • http://esw.w3.org/topic/LargeTripleStores
Conclusion • DARQ offers a single interface for querying multiple, distributed SPARQL endpoints • Uses the SPARQL standard => flexible • Uses service descriptions • Data sources can be added and/or removed dynamically • A query can be federated and optimized with statistical information • Limitations • Predicates must be bound (a pattern such as (s, ?p, o) is not allowed) • CONSTRUCT, DESCRIBE, ASK are not supported • GRAPH, UNION, OPTIONAL are not supported
Paper Evaluation • Pros • Good idea • Distributed SPARQL processing is a relatively new research field • Defines service descriptions • Deals with all aspects of a query engine • Implementation is available • My Comments • Too simple, and still slow • Many limitations