
Design and Evaluation of an IR-Benchmark for SPARQL Fulltext Queries



  1. Design and Evaluation of an IR-Benchmark for SPARQL Fulltext Queries
     Master’s Thesis in Computer Science by Arunav Mishra
     Supervised by Prof. Martin Theobald
     Reviewed by Prof. Gerhard Weikum

  2. Overview
    • Motivation
    • Contributions
    • Data collection and Query Benchmark
    • SPAR-Key: Rewriting SPARQL-FT to SQL
    • Results
    • Summary

  3. Motivation
    Keyword-based document retrieval (fulltext):
    • Returns ranked documents based on a notion of relevance.
    • But: offers no precise answer to the user.
    • Fails to combine information that is located across two or more documents.
      Example (fulltext search on Wikipedia): "What is common between Nelson Mandela and Mother Teresa?"
    • Challenging to interpret and disambiguate the user’s exact intent.
      Example (fulltext search on Wikipedia): "This is directed by Tim Burton"

  4. Motivation
    Knowledge retrieval on structured (semantic) data:
    • Queries must be formulated in a query language like SPARQL, with complex join logic.
    • Highly structured queries perform basic graph-pattern matching and thus return results with high precision.
    • But: difficult for a user (or any automated system) without knowledge of the underlying data schema.
      For example: "Mountain range" = MountainRange or mountainRange; "Unitary state" = Unitary_state or Unitary_State.

  5. Desiderata
    Combine both retrieval techniques, which:
    • Adds flexibility and incorporates the notion of relevance.
    • Captures, interprets and disambiguates user intention.
    • Answers Jeopardy-style NL questions.

  6. Contributions
    • Unique entity-centric data collection combining structured and unstructured data.
    • Query benchmark of 90 Jeopardy-style NL questions, manually translated into SPARQL-FT queries.
    • Query engine to process SPARQL-FT queries over the data collection.
    • We organized and participated in the INEX’12 LOD track.

  7. Overview
    • Motivation
    • Contributions
    • Data collection and Query Benchmark
    • SPAR-Key: Rewriting SPARQL-FT to SQL
    • Results
    • Summary

  8. INEX’12 - Wikipedia-LOD Collection
    [Figure: a Wikipedia entity in the collection]

  9. INEX’12 - Jeopardy Task (Benchmark)
    90 Jeopardy-style natural-language questions, manually translated: SPARQL + Fulltext = SPARQL-FT queries.
    • SPARQL-FT: extends SPARQL with the FTContains operator.
    • FTContains(<entity>, "<keyword(s)>"):
      • Argument 1: <entity> occurs as a subject or object in the SPARQL part of the query.
      • Argument 2: the set of keywords can be used for a fulltext search on the unstructured data of the collection.
      • Binds an entity to a set of keywords.
      • Represents a fulltext condition.

  10. Two Classes of Benchmark Queries
    Type 1: 50 Jeopardy-style NL questions target a single entity (one answer).
    "Middle name of "Naked and the Dead" author Mailer or first name of "Lucky Jim" author Amis"

      SELECT ?s WHERE {
        ?x <http://dbpedia.org/property/author> ?s .
        FILTER FTContains (?x, "Lucky Jim") .
      }

    Type 2: 40 NL questions target a ranked list of one or more entities.
    "These are famous couples of actors acting in crime movies"

      SELECT DISTINCT ?s ?o WHERE {
        ?s <http://dbpedia.org/property/partner> ?o .
        ?m1 <http://dbpedia.org/ontology/starring> ?s .
        ?m1 <http://dbpedia.org/ontology/starring> ?o .
        FILTER FTContains (?m1, "crime movie") .
      }

  11. Overview
    • Motivation
    • Contributions
    • Data collection and Query Benchmark
    • SPAR-Key: Rewriting SPARQL-FT to SQL
    • Results
    • Summary

  12. Storing RDF and Textual Data
    [Figures: schema of the DBpediaCore table, schema of the Keywords table, and the indices over both tables]
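    Since those schema figures are not reproduced here, the following is a minimal DDL sketch inferred from the SQL queries later in the deck (the column names SUBJECT, PREDICATE, OBJECT and ENTITY_ID, TERM, SCORE appear there); the exact Oracle types, index names, and any additional columns in the thesis may differ:

      -- One row per RDF triple of the structured data (inferred sketch).
      CREATE TABLE DBPEDIACORE (
        SUBJECT   VARCHAR2(1000) NOT NULL,
        PREDICATE VARCHAR2(1000) NOT NULL,
        OBJECT    VARCHAR2(1000) NOT NULL
      );

      -- One row per (entity, stemmed term) posting with a precomputed
      -- relevance score (Okapi BM25, see slide 13).
      CREATE TABLE KEYWORDS (
        ENTITY_ID VARCHAR2(1000) NOT NULL,
        TERM      VARCHAR2(100)  NOT NULL,
        SCORE     NUMBER         NOT NULL
      );

      -- Illustrative indices supporting the join patterns of the
      -- rewritten queries; the thesis's actual index choices may differ.
      CREATE INDEX IDX_CORE_SPO ON DBPEDIACORE (SUBJECT, PREDICATE, OBJECT);
      CREATE INDEX IDX_CORE_POS ON DBPEDIACORE (PREDICATE, OBJECT, SUBJECT);
      CREATE INDEX IDX_KEY_TERM ON KEYWORDS (TERM, ENTITY_ID);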

  13. TopX Indexer
    TopX 2.0 indexer to create per-term inverted lists from the plain (unstructured) text content of the Wikipedia documents:
    • White-space tokenizer.
    • Porter stemmer.
    • Stopword removal.
    • Okapi BM25 scoring (formula below).
    • Bulk loader to create the KEYWORDS table.
    • Wikipedia_ID → DBpedia_Entity mapping, via the page_ids_en.nt file.
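    For reference, the textbook Okapi BM25 weight of a term t in the text of an entity d is given below; the thesis may use a TopX-specific variant, so treat this as an illustration of what the precomputed per-(entity, term) SCORE values represent, not as the exact scoring function used. Here tf is the term frequency, df the document frequency, N the collection size, |d| the document length, avgdl the average document length, and k_1, b are free parameters (typically k_1 ≈ 1.2–2.0, b ≈ 0.75):

      \mathrm{BM25}(t, d) = \log\frac{N - df(t) + 0.5}{df(t) + 0.5} \cdot
        \frac{tf(t, d)\,(k_1 + 1)}{tf(t, d) + k_1\left(1 - b + b\,\frac{|d|}{avgdl}\right)}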

  14. Query Processing
    "This mountain range is bordered by another mountain range and is a popular sightseeing and sports destination."

      SELECT ?p WHERE {
        ?m <http://dbpedia.org/ontology/border> ?p .
        ?p <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/MountainRange> .
        FILTER FTContains(?p, "popular sightseeing sports destination") .
      }

  15. SQL Conjunctive Query
    Consider the example:

      SELECT ?p WHERE {
        ?m <http://dbpedia.org/ontology/border> ?p .
        ?p <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/MountainRange> .
        FILTER FTContains(?p, "popular sightseeing and sports destination") .
      }

    It is rewritten into the following conjunctive SQL query (?p corresponds to T1.OBJECT = T2.SUBJECT; the keyword terms are Porter-stemmed):

      SELECT DISTINCT T2.SUBJECT AS SUB,
             (K0.SCORE + K1.SCORE + K2.SCORE + K3.SCORE) AS SCORE
      FROM   DBPEDIACORE T2, DBPEDIACORE T1,
             KEYWORDS K0, KEYWORDS K1, KEYWORDS K2, KEYWORDS K3
      WHERE  T1.OBJECT = T2.SUBJECT
        AND  T1.PREDICATE = 'http://dbpedia.org/ontology/border'
        AND  T2.OBJECT = 'http://dbpedia.org/ontology/MountainRange'
        AND  T2.PREDICATE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'
        AND  K0.TERM = 'sport'
        AND  K1.TERM = 'popular'
        AND  K2.TERM = 'destin'
        AND  K3.TERM = 'sightse'
        AND  T1.OBJECT = K0.ENTITY_ID
        AND  T1.OBJECT = K1.ENTITY_ID
        AND  T1.OBJECT = K2.ENTITY_ID
        AND  T1.OBJECT = K3.ENTITY_ID
        AND  T2.SUBJECT = K0.ENTITY_ID
        AND  T2.SUBJECT = K1.ENTITY_ID
        AND  T2.SUBJECT = K2.ENTITY_ID
        AND  T2.SUBJECT = K3.ENTITY_ID

    Joining every keyword list via an inner join enforces conjunctive (AND) semantics over the terms; later slides relax this with full outer joins.

  16. Keyword Scores to SPO Triples

      SELECT ?p WHERE {
        ?m <http://dbpedia.org/ontology/border> ?p .
        ?p <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/MountainRange> .
        FILTER FTContains(?p, "popular sightseeing sports destination") .
      }

    [Figure: join of DBpediaCore T1 with Keywords K1 on T1.OBJECT = K1.ENTITY_ID, yielding the scored result]
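    The join in the figure corresponds to the following SQL fragment; this is an illustrative sketch for a single keyword of the fulltext condition (the TERM value is a Porter-stemmed form, as produced by the indexer), not the complete rewriting:

      -- Attach the keyword score of one stemmed term to each matching
      -- 'border' triple: the FTContains condition binds ?p, which is the
      -- OBJECT of this triple pattern, so the join is on T1.OBJECT.
      SELECT T1.SUBJECT, T1.PREDICATE, T1.OBJECT, K1.SCORE
      FROM   DBPEDIACORE T1
             INNER JOIN KEYWORDS K1 ON T1.OBJECT = K1.ENTITY_ID
      WHERE  T1.PREDICATE = 'http://dbpedia.org/ontology/border'
        AND  K1.TERM = 'popular';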

  17. Bottlenecks and Solutions
    [Table: bottlenecks and their solutions]

  18. Oracle Optimizer Hints: ORDERED
    The /*+ORDERED*/ hint instructs Oracle to join the tables in the order in which they appear in the FROM clause, rather than letting the optimizer choose a join order. Excerpt (the elided nesting is shown in full on slide 37):

      SELECT /*+ORDERED*/ DISTINCT ENTITY_ID, MAX(SCORE) AS SCORE
      FROM …
        FROM  (SELECT DISTINCT ENTITY_ID, SCORE AS SCORES
               FROM KEYWORDS WHERE TERM = 'sightse') K1
        FULL OUTER JOIN
              (SELECT DISTINCT ENTITY_ID, SCORE AS SCORES
               FROM KEYWORDS WHERE TERM = 'destin') K2
          ON  K1.ENTITY_ID = K2.ENTITY_ID
        FULL OUTER JOIN
              (SELECT DISTINCT ENTITY_ID, SCORE AS SCORES
               FROM KEYWORDS WHERE TERM = 'popular') K3
          ON  K1.ENTITY_ID = K3.ENTITY_ID
        FULL OUTER JOIN
              (SELECT DISTINCT ENTITY_ID, SCORE AS SCORES
               FROM KEYWORDS WHERE TERM = 'sport') K4
          ON  K1.ENTITY_ID = K4.ENTITY_ID
        …

    Execution time: reduced from 2.237 seconds to 0.975 seconds.

  19. Class Selection & Query Structure Exploitation
    • RDF data is already classified.
    • The domain and range of each property mark the required classes.
    • For example:

        SELECT ?s ?c WHERE {
          ?s rdf:type <http://dbpedia.org/ontology/Station> .
          ?s <http://dbpedia.org/ontology/location> ?c .
          FILTER FTContains (?s, "most beautiful railway station") .
        }

    • The entities in the marked classes are considered as candidate entities, as sketched below.
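    A minimal sketch of class selection for this example, assuming the DBPEDIACORE/KEYWORDS schema from slide 12; the candidate set for ?s is restricted to entities of class dbo:Station before the keyword scores are aggregated. The stemmed TERM values and the sum aggregation are shown for illustration only; the thesis's actual score combination may differ:

      -- Score only entities whose rdf:type is dbo:Station; other entities
      -- can never satisfy the query, so their postings are never touched.
      SELECT K.ENTITY_ID, SUM(K.SCORE) AS TOTAL_SCORE
      FROM   KEYWORDS K
             INNER JOIN DBPEDIACORE C ON C.SUBJECT = K.ENTITY_ID
      WHERE  C.PREDICATE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'
        AND  C.OBJECT = 'http://dbpedia.org/ontology/Station'
        AND  K.TERM IN ('beauti', 'railwai', 'station')
      GROUP BY K.ENTITY_ID
      ORDER BY TOTAL_SCORE DESC;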

  20. URI Search
    • Titles (the entity URIs) best summarize the content and contain unique contextual information.
    • In addition, the entity description in a fulltext condition generally uses surface forms.
    • Helps disambiguate entities and capture false negatives.
    • Create an additional URI index on the tokens of the entity URIs (a sketch follows).
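    A hypothetical sketch of such an index; the table and index names here are illustrative assumptions, not from the thesis. Each entity URI's local name is tokenized with the same pipeline as the document text and stored one row per (entity, token), so a fulltext condition can also match against an entity's title:

      -- Hypothetical URI-token table; names are illustrative assumptions.
      CREATE TABLE URITOKENS (
        ENTITY_ID VARCHAR2(1000) NOT NULL,  -- the full entity URI
        TOKEN     VARCHAR2(100)  NOT NULL   -- one token of its local name
      );
      CREATE INDEX IDX_URI_TOKEN ON URITOKENS (TOKEN, ENTITY_ID);

      -- E.g. http://dbpedia.org/resource/Red_River_of_the_North would
      -- yield the tokens 'red', 'river', 'north' after stopword removal.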

  21. SPAR-Key Ultimatum

      SELECT ?p WHERE {
        ?m <http://dbpedia.org/ontology/border> ?p .
        ?p <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/MountainRange> .
        FILTER FTContains(?p, "popular sightseeing sports destination") .
      }

    [Figure: SPAR-Key Ultimatum query plan. The select query combines class selection over TAB1 and TAB2 (class constraints c1 and c2), the DBPEDIACORE instances T1 and T2, a URI search (KEYSEARCH), and KEYSFINAL, which merges the Keywords instances K1 (term "sightseeing"), K2 ("destination"), K3 ("popular") and K4 ("sports"), each under class constraint c1. Triple patterns are combined via inner joins, keyword lists via outer joins.]

  22. Overview
    • Motivation
    • Contributions
    • Data collection and Query Benchmark
    • SPAR-Key: Rewriting SPARQL-FT to SQL
    • Results
    • Summary

  23. Experimental Setup
    • DBpedia v3.7 (created in July 2011).
    • YAGO2 core and full dump (created in January 2012).
    [Tables: DBpediaCore table statistics, Keywords table statistics, data collection statistics]

  24. Translation Algorithms

  25. Official INEX Results

  26. Gold Result Set
    For example, <topic id="2012305">:
    "North Dakota’s highest point is White Butte; its lowest is on this river of another colour."
    • Correct answer: "Red River of the North"
    • "Red River of the North" maps to http://dbpedia.org/resource/Red_River_of_the_North.
    • However, an ad-hoc-search-style relevance assessment marks the following entities as relevant:
      • http://dbpedia.org/resource/geography_of_north_dakota
      • http://dbpedia.org/resource/pembina_county,_north_dakota
      • http://dbpedia.org/resource/north_dakota
      • http://dbpedia.org/resource/portal:north_dakota/dyk/2007july
      • http://dbpedia.org/resource/portal:north_dakota/dyk/2007june
      • http://dbpedia.org/resource/1997_red_river_flood_in_the_united_states

  27. Re-evaluations

  28.–30. Results
    [Figures: result charts]

  31. Overview
    • Motivation
    • Contributions
    • Data collection and Query Benchmark
    • SPAR-Key: Rewriting SPARQL-FT to SQL
    • Results
    • Summary

  32. Summary
    • A first-shot approach to processing and evaluating SPARQL queries with fulltext conditions.
    • Implemented on top of a relational database as a proof of concept to retrieve answers to the NL questions (no publicly available engine exists).
    • Improved result quality compared to fulltext query processing engines.
    • Future work: improve results by utilizing dictionaries and automatic query expansion.
    • The long-term goal is to provide a custom engine that processes these queries.

  33. Demonstration

  34. Thank You

  35. Optimizer Hints, Outer Joins & Materialized Temporary Tables: SPAR-Key Supremacy

      SELECT ?p WHERE {
        ?m <http://dbpedia.org/ontology/border> ?p .
        ?p <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/MountainRange> .
        FILTER FTContains(?p, "popular sightseeing and sports destination") .
      }

    [Figure: SPAR-Key Supremacy query plan. The select query joins two materializations of DBPEDIACORE, TAB1 and TAB2: instance T1 with a constraint on predicate and object, and instance T2 with a constraint on predicate. The KEYWORDS materialization KEYS0 combines the Keywords instances K1–K4 via outer joins; triple patterns are combined via inner joins.]

  36. Query Structure Exploitation
    Exploiting the structure of the query for class identification.
    Star structure:

      SELECT ?c WHERE {
        <museum> <located> ?a .
        ?b <capital> ?a .
        ?b ?predicate ?c .
        FILTER FTContains(?a, "…") .
      }

    [Figure: star-structure graph centered on ?a (a City, in which the museum is <located> and which is the <capital> of Country ?b) carrying the fulltext condition, and a chain-structure graph for the Station example of slide 19 (?s with <type> Station and location ?c).]

  37. Backup: Creating a temporary table for the fulltext condition

      CREATE TABLE KEYS0 AS
      SELECT /*+ORDERED*/ * FROM (
        SELECT /*+ORDERED*/ * FROM (
          SELECT /*+ORDERED*/ DISTINCT ENTITY_ID, MAX(SCORESS) AS SCORE
          FROM (
            SELECT DISTINCT K1.ENTITY_ID AS ENTITY_ID,
                   (NVL(K1.SCORES,0) + 1 + NVL(K2.SCORES,0) +
                    NVL(K3.SCORES,0) + NVL(K4.SCORES,0)) AS SCORESS
            FROM  (SELECT DISTINCT ENTITY_ID, SCORE AS SCORES
                   FROM KEYWORDS WHERE TERM = 'sightse') K1
            FULL OUTER JOIN
                  (SELECT DISTINCT ENTITY_ID, SCORE AS SCORES
                   FROM KEYWORDS WHERE TERM = 'destin') K2
              ON  K1.ENTITY_ID = K2.ENTITY_ID
            FULL OUTER JOIN
                  (SELECT DISTINCT ENTITY_ID, SCORE AS SCORES
                   FROM KEYWORDS WHERE TERM = 'popular') K3
              ON  K1.ENTITY_ID = K3.ENTITY_ID
            FULL OUTER JOIN
                  (SELECT DISTINCT ENTITY_ID, SCORE AS SCORES
                   FROM KEYWORDS WHERE TERM = 'sport') K4
              ON  K1.ENTITY_ID = K4.ENTITY_ID
            ORDER BY SCORESS DESC
          )
          GROUP BY ENTITY_ID
          ORDER BY SCORE DESC
        )
      )

    NVL(…, 0) defaults a missing term score to 0, so an entity matching only a subset of the keywords is retained with a partial score.

  38. Temporary tables for triple patterns

      CREATE TABLE TAB2 AS
      SELECT SUBJECT, PREDICATE, OBJECT, NVL(KEYS0.SCORE, 0) AS SCORE
      FROM  (SELECT * FROM dbpediacore3 t1
             WHERE t1.PREDICATE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'
               AND t1.OBJECT = 'http://dbpedia.org/ontology/MountainRange') TEMP
      INNER JOIN KEYS0 ON TEMP.SUBJECT = KEYS0.ENTITY_ID

      CREATE TABLE TAB1 AS
      SELECT SUBJECT, PREDICATE, OBJECT, NVL(KEYS0.SCORE, 0) AS SCORE
      FROM  (SELECT * FROM dbpediacore3 t2
             WHERE t2.PREDICATE = 'http://dbpedia.org/ontology/border') TEMP
      INNER JOIN KEYS0 ON TEMP.OBJECT = KEYS0.ENTITY_ID

  39. Final Select Query

      SELECT /*+ORDERED*/ P, MAX(SCORE) AS SCORE_MAX
      FROM (
        SELECT /*+ORDERED*/ DISTINCT TAB2.SUBJECT AS P,
               (NVL(TAB2.SCORE, 0) + NVL(TAB1.SCORE, 0)) AS SCORE
        FROM  TAB2, TAB1
        WHERE TAB1.OBJECT = TAB2.SUBJECT
      )
      GROUP BY P
      ORDER BY SCORE_MAX DESC

    Dropping the intermediate tables:

      DROP TABLE KEYS0
      DROP TABLE TAB2
      DROP TABLE TAB1

    Together, slides 37–39 implement the full rewriting: materialize the keyword scores (KEYS0), materialize each triple pattern joined with its scores (TAB1, TAB2), join and aggregate, then drop the temporaries.
