Graphinder Semantic Search Relational Keyword Search over Data Graphs

Thanh Tran, Lei Zhang, Veli Bicer, Yongtao Ma
Researcher: www.sites.google.com/site/kimducthanh
Co-Founder: www.graphinder.com

Agenda
• Introduction
• Graphinder: Overview
• Keyword Query Translation
• Keyword Query Result Ranking
• Keyword Query Rewriting
  • Suggesting correct and meaningful queries
  • Auto-complete as user types

Introduction

Motivation: lots of structured data

Semantic Search: use information about entities and relationships explicitly given in structured data to provide relevant answers for complex questions asked using intuitive interfaces
Example queries: “singles written by freddie, who is member of the band queen” and “single written by freddie queen”
MusicBrainz: <x, type, Single> <Freddie Mercury, writer, x> <Freddie Mercury, type, Artist> <Freddie Mercury, member, Queen> <Queen, type, Band>
DBpedia: <x, type, Single> <x, writtenBy, Freddy>
Links: <Freddy, same-as, Freddy Mercury>
[Figure: example data graph with classes Single, Person, Artist; entities Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar; edges member, producer, writer, formed in (1971), marital status (single)]

Entity Semantic Search: find relevant entity, return structured data summary, facts, related entities

Relational Semantic Search: find relevant entities involved in a relationship, return entity summaries…

Semantic Search Problem: understand user inputs as entities and relationships and find relevant answers
Example queries: “single written by freddie queen” and “singles written by freddie, who is member of the band queen”
Query Translation: What are possible connections (schema-level) between recognized entities and relationships?
1) <x, type, Single> <Freddie Mercury, writer, x> <Freddie Mercury, member, Queen>
2) …
Query Answering: What are actual connections (data-level) between recognized entities and relationships?
1) <Liar Liar, type, Single> <Freddie Mercury, writer, Liar Liar> <Freddie Mercury, member, Queen>
2) …
[Figure: example data graph with classes Person, Single, Artist; entities Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar; edges member, producer, writer, formed in (1971), marital status (single)]

Relational Semantic Search at Facebook: recognizes entities and relationships via LMs, uses a manually specified template (grammar) to find possible connections between them, and computes answers via the resulting translated queries
Grammar: set of production rules, capturing all possible connections, i.e. the search space of all parse trees
[start] → [users]
[users] → my friends | friends(x, me)
[…] → is member of [bands] | member(x, $1)
[bands] → {band} | $1
…
Example: “my friends, who is member of queen”
[start] | my friends, who is member of [id:Queen1] | friends(x,me), member(x,Queen1)
[user-head] | my friends | friends(x,me)
[user-filter] | who is member of [id:1] | member(x,Queen1)
[who] | who | -
[member-vp] | is member of [id:1] | member(x,Queen1)
{band} | [id:Queen1] | Queen1
[member-of-v] | is member of | member()
Grammar-based Query Translation: which combination of production rules results in a parse tree that connects the recognized entities and relationships (here: member, queen, friends)?

Overview

Graphinder Semantic Search: a translation-based approach for relational keyword search over data graphs
- Entity + Relationships
- Multi-source
- Domain-independent
- Low manual effort
[Figure: Sem. Auto-completion and Query Translation over the example data graph with classes Single, Person, Artist; entities Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar; edges member, producer, writer, formed in (1971), marital status (single)]

Graphinder: selected publications
• On-demand, domain-independent, relational keyword search over data graphs
  • Structure index for data graphs (TKDE13b)
  • Top-k exploration of translation candidates (ICDE09)
  • Index-based materialization of graphs (CIKM11a)
  • Ranking results using structured relevance model (SRM) (CIKM11b)
• Multi-source
  • Deduplication using inferred type information: TYPifier (ICDE13), TYPimatch (WSDM13)
  • On-the-fly deduplication using SRM (WWW11)
  • Ranking with deduplication (ISWC13)
  • Routing keyword queries to relevant data graphs (TKDE13a)
  • Hermes: keyword search over heterogeneous data graphs (SIGMOD09)
• Semantic auto-completion
  • Computing valid query rewrites for given keywords (VLDB14)

Query Translation

0) Query Translation: constructing pseudo schema graph representing all possible connections between data elements
• Structure index for data graph: nodes are groups of data elements that share the same structure pattern
• Parameters: structure pattern with edge labels L and paths of maximum length n
• Pseudo schema
  • Node groups all instances that have the same set of properties
  • Structure pattern: all properties, i.e. all outgoing paths with n = 1, L = all edge labels
• Algorithm:
  • Start with one single partition/node representing all instances
  • Split until all nodes are “stable”, i.e., all contained instances share the same structure pattern
[Figure: example data graph (Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar, single) and the derived pseudo schema with nodes Person, Value2, Single, Thing12, Artist and edges writer, marital status, member, producer]
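The pseudo schema construction can be illustrated with a minimal sketch. It assumes the simplest setting named on the slide (outgoing paths only, n = 1, L = all edge labels), in which splitting until stable is equivalent to grouping instances by their set of outgoing edge labels. The triples and the function name `build_pseudo_schema` are hypothetical and only loosely follow the example graph.

```python
from collections import defaultdict

# Toy triples loosely following the slide's example graph (illustrative only).
TRIPLES = [
    ("Freddie Mercury", "member", "Queen"),
    ("Freddie Mercury", "writer", "Liar"),
    ("Brian May", "member", "Queen"),
    ("Queen", "producer", "Liar"),
    ("Queen Elizabeth 1", "marital status", "single"),
]


def build_pseudo_schema(triples):
    """Group instances by their set of outgoing edge labels (n = 1, L = all labels)."""
    props = defaultdict(set)
    for s, p, o in triples:
        props[s].add(p)
        props[o]  # touch the object so nodes without outgoing edges get an empty pattern
    groups = defaultdict(set)
    for node, pattern in props.items():
        groups[frozenset(pattern)].add(node)
    # A schema edge connects two groups whenever some of their instances are connected.
    edges = {(frozenset(props[s]), p, frozenset(props[o])) for s, p, o in triples}
    return dict(groups), edges


groups, edges = build_pseudo_schema(TRIPLES)
for pattern, members in groups.items():
    print(sorted(pattern), "->", sorted(members))
```

Each group plays the role of one pseudo schema node; a general structure index with longer paths or restricted labels would need the iterative partition refinement described on the slide.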

1) Query Translation: constructing search space representing all possible interpretations of query keywords
Query: “written by freddie queen single”
Keyword Interpretation: use inverted index and LM-based ranking function to return relevant schema and data elements
Data Index: Queen Elizabeth 1, Freddie Mercury, single, Queen
Schema Index: Single, writer
• Search Space Construction: augment pseudo schema with query-specific keyword matching elements
• All possible connections of predicates applicable to recognized query keywords
[Figure: augmented pseudo schema with nodes Person, Literal, Band, Single, Artist; keyword elements Queen Elizabeth 1, Freddie Mercury, Queen, single; edges writer, marital status, member, producer; following steps: Top-k Subgraph Exploration, Result Retrieval & Ranking]
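Keyword interpretation can be sketched as a lookup in two inverted indexes, one over data labels and one over schema labels. The sketch below is an assumption-laden stand-in: the ranking is a simple label-coverage heuristic rather than the LM-based function mentioned on the slide, and `build_index` and `interpret` are hypothetical names.

```python
from collections import defaultdict

# Labels taken from the slide's Data Index and Schema Index boxes.
DATA_LABELS = ["Queen Elizabeth 1", "Freddie Mercury", "Queen", "single"]
SCHEMA_LABELS = ["Single", "writer"]


def build_index(labels):
    """Inverted index: lower-cased token -> set of labels containing it."""
    index = defaultdict(set)
    for label in labels:
        for token in label.lower().split():
            index[token].add(label)
    return index


def interpret(keyword, index):
    """Return matching elements, best first.

    Score = share of the label covered by the keyword (a stand-in for an LM score).
    """
    hits = index.get(keyword.lower(), set())
    return sorted(hits, key=lambda label: (-1.0 / len(label.split()), label))


data_index = build_index(DATA_LABELS)
schema_index = build_index(SCHEMA_LABELS)

for kw in "written by freddie queen single".split():
    print(kw, interpret(kw, data_index), interpret(kw, schema_index))
```

The keyword "queen" matches both Queen and Queen Elizabeth 1, and "single" matches both a data value and the class Single, which is the kind of ambiguity the search space has to keep.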

2) Query Translation: score-directed algorithm for finding top-k subgraphs connecting keyword matching elements
Query: “written by freddie queen single”
Translated query: <x, type, Single> <Queen, producer, x> <Freddie Mercury, writer, x> <Queen, type, Band> <Freddie Mercury, type, Artist>
• Algorithm: score-directed top-k Steiner graph search
  • Start: explore all distinct paths starting from keyword elements
  • Every iteration
    • One-step expansion of the current path with the highest score
    • When a connecting element is found, merge paths and add the resulting graph to the candidate list
  • Top-k termination: lowest score of the candidate list > highest possible score that can be achieved with paths in the queues yet to be explored
  • Termination: all paths of maximum length d have been explored
  • Final step: mapping rules to translate Steiner graph to structured query
[Figure: search space with nodes Person, Literal, Band, Single, Artist; keyword elements Queen Elizabeth 1, Freddie Mercury, Queen, single; edges writer, marital status, member, producer]
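The exploration loop can be sketched as follows. This is a simplified illustration, not the published algorithm: it assumes additive edge costs (lower is better) instead of scores, explores edges in both directions, scores a candidate by the sum of its path costs, and omits the final mapping to a structured query. The function name and the toy edge list are hypothetical.

```python
import heapq
from itertools import product


def top_k_subgraphs(edges, keyword_nodes, k=1, max_len=3):
    """Return up to k (cost, edge set) pairs connecting all keyword nodes; lower cost is better."""
    adj = {}
    for u, v, cost in edges:  # edges are explored in both directions
        adj.setdefault(u, []).append((v, cost))
        adj.setdefault(v, []).append((u, cost))

    # Queue entries: (path cost, index of the originating keyword, path as node tuple).
    queue = [(0.0, i, (n,)) for i, n in enumerate(keyword_nodes)]
    heapq.heapify(queue)
    reached = {}  # node -> {origin index: [(cost, path), ...]}
    best = {}     # candidate edge set -> lowest cost found so far

    while queue:
        cost, origin, path = heapq.heappop(queue)
        # Top-k termination: any future candidate costs at least `cost`.
        if len(best) >= k and sorted(best.values())[k - 1] <= cost:
            break
        node = path[-1]
        by_origin = reached.setdefault(node, {})
        by_origin.setdefault(origin, []).append((cost, path))
        if len(by_origin) == len(keyword_nodes):
            # `node` is a connecting element: merge the new path with paths of all other origins.
            others = [by_origin[o] for o in by_origin if o != origin]
            for combo in product(*others):
                paths = (path,) + tuple(p for _, p in combo)
                graph = frozenset(frozenset(e) for p in paths for e in zip(p, p[1:]))
                total = cost + sum(c for c, _ in combo)
                best[graph] = min(best.get(graph, float("inf")), total)
        if len(path) - 1 < max_len:  # one-step expansion, bounded by the maximum path length
            for nxt, w in adj.get(node, []):
                if nxt not in path:
                    heapq.heappush(queue, (cost + w, origin, path + (nxt,)))

    return sorted(((c, g) for g, c in best.items()), key=lambda cg: cg[0])[:k]


# Toy search space: pseudo schema nodes plus keyword elements, unit edge costs.
EDGES = [
    ("Freddie Mercury", "Artist", 1.0),
    ("Artist", "Single", 1.0),
    ("Artist", "Band", 1.0),
    ("Queen", "Band", 1.0),
    ("Band", "Single", 1.0),
    ("Queen Elizabeth 1", "Person", 1.0),
    ("Person", "single", 1.0),
]
result = top_k_subgraphs(EDGES, ["Single", "Freddie Mercury", "Queen"], k=1)
print(result[0][0], sorted(sorted(e) for e in result[0][1]))
```

Because every newly merged candidate contains the path just popped, the cost of that path is a lower bound for all candidates still to come, which is what makes the early termination test safe in this sketch.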

Result Ranking

Ranking Using Structured LMs: keyword queries are short and ambiguous, while structured data provide rich structure information; ranking is therefore based on LMs capturing both content and structure
• Structured LMs for structured results r
• Structured LM for queries using structured pseudo-relevant feedback results FR (relevance model)
• Compute distance between query and result LMs

Relevance Models
Query: “freddie queen”
• Term probabilities of the query model are based on the pseudo-relevant feedback documents F
• Ranking behaves like similarity search between pseudo-relevant feedback documents and corpus documents
[Figure: query, feedback documents F and candidate documents, each shown as a bag of terms (Mercury, Brian May, Protest, Raid, Clash, Bank, West)]

Structured Relevance Models
Query: “queen single”
• Term probabilities of the query model are based on pseudo-relevant structured data
• Ranking behaves like similarity search between pseudo-relevant structured results and structured result candidates
[Figure: query, structured feedback results F and structured candidate results, each shown as structured data over terms (Mercury, Brian May, Protest, Raid, Clash, Bank, West)]

Ranking: construct an edge-specific query model for each unique edge e from the feedback resources FR, an edge-specific model for every candidate r, and finally compute the distance between them
The query model for edge e combines, for all resources r in FR:
• the probability of observing term v in the value of property e of resource r
• the importance of resource r w.r.t. the query
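A minimal sketch of this ranking scheme is given below. It assumes that the edge-specific query model is the importance-weighted sum of the feedback resources' term distributions, that resource models are plain maximum-likelihood estimates with a small floor instead of proper smoothing, and that the distance is cross-entropy summed over edges. These choices, the function names and the toy resources are illustrative assumptions, not necessarily the formulation used in the cited work.

```python
import math
from collections import Counter


def edge_lm(resource, edge):
    """Maximum-likelihood term distribution of one property value of a resource."""
    counts = Counter(resource.get(edge, "").lower().split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def query_model(feedback, importance, edge):
    """Edge-specific query model: importance-weighted mix of feedback resource models."""
    model = Counter()
    for resource, weight in zip(feedback, importance):
        for term, p in edge_lm(resource, edge).items():
            model[term] += weight * p
    return model


def score(candidate, feedback, importance, eps=1e-6):
    """Negative cross-entropy between query and candidate models, summed over edges."""
    edges = {e for r in feedback for e in r}
    total = 0.0
    for edge in edges:
        cand = edge_lm(candidate, edge)
        for term, p in query_model(feedback, importance, edge).items():
            total += p * math.log(cand.get(term, eps))  # eps is a crude smoothing floor
    return total


# Hypothetical feedback resource and candidates, described as property -> value text.
FEEDBACK = [{"name": "Freddie Mercury", "member": "Queen"}]
IMPORTANCE = [1.0]
CANDIDATE_A = {"name": "Freddie Mercury", "member": "Queen"}
CANDIDATE_B = {"name": "Queen Elizabeth", "marital status": "single"}

for name, cand in [("A", CANDIDATE_A), ("B", CANDIDATE_B)]:
    print(name, round(score(cand, FEEDBACK, IMPORTANCE), 3))
```

Candidate B mentions "queen" but in the wrong property, so it scores lower than candidate A; this is the effect of keeping one model per edge rather than a single bag of words.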

Query Rewriting

Query Rewriting: find syntactically and semantically valid rewrites to suggest as user types
Input: single from freddy mercury que
Token rewriting via syntactic distance: 1) single from freddie mercury queen …
Token rewriting via semantic distance: single writer freddie mercury queen …
Query segmentation: single writer “freddie mercury” queen …
• Keyword Interpretation:
  • Imprecise / fuzzy matching
  • Match every keyword
• Keyword / Key Phrase Interpretation:
  • Precise matching
  • Match keyword and key phrases
• Benefits:
  • Higher selectivity of query terms (quality)
  • Reduced number of query terms (efficiency)
  • Better search experience…
Challenges: many rewrite candidates, some are semantically not “valid” in the relational setting, e.g. single (marital status) writer “freddie mercury” queen (the queen of UK)
[Figure: Data Index (Queen Elizabeth 1, Freddie Mercury, single, Queen) and Schema Index (Single, writer) feeding Search Space Construction, followed by Result Retrieval & Ranking]

Probabilistic Model for Query Rewriting: the rank of a query rewrite (suggestion) S is based on the probability of observing S in the data, given the query
Based on Bayes' Theorem:
• The probability that users write spelling errors / semantically related queries is independent of data D
• Constant given query Q and data D
Input: single writer freddy mercury que
1) single writer freddie mercury queen
2) single writer freddrick mercury monarch
3) song writer freddrick mercury head of state
Query Segmentation: S is ranked high when the probability that S can be observed in the data D is high
Token Rewriting: S is ranked high when the probability that query Q can be observed in S is high
[Figure: example data graph with classes Single, Person, Artist; entities Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar; edges member, producer, writer, formed in (1971), marital status (single)]
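The decomposition described by the two annotations can be written out as follows. This is a reconstruction consistent with the slide text, assuming the ranking quantity is P(S | Q, D); the exact notation of the original formula is not preserved here.

```latex
\begin{align*}
P(S \mid Q, D)
  &= \frac{P(Q \mid S, D)\, P(S \mid D)}{P(Q \mid D)}
     && \text{Bayes' theorem} \\
  &= \frac{P(Q \mid S)\, P(S \mid D)}{P(Q \mid D)}
     && \text{errors / related queries are independent of } D \\
  &\propto \underbrace{P(Q \mid S)}_{\text{token rewriting}} \;
           \underbrace{P(S \mid D)}_{\text{query segmentation}}
     && P(Q \mid D) \text{ is constant given } Q \text{ and } D
\end{align*}
```

Under this reading, a rewrite ranks high only if it is both close to what the user typed and likely to be observed in the data.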

Token Rewriting
Split: | Concatenate: +
P(q|t): is high when q is syntactically and semantically close to t
Input: single writer freddy mercury que
Rewritten and segmented: single | writer | freddie + mercury | queen
1) single writer “freddie mercury” queen
2) single writer “freddrick mercury” monarch
3) single writer “freddrick mercury” head of state
• Modeling token rewriting P(Q|S)
• Independence assumption
• Modeling syntactic and semantic differences
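Under the independence assumption, P(Q|S) factorizes into per-token terms P(q|t). The sketch below illustrates only the syntactic part: it assumes an unnormalized score exp(-edit distance) as a stand-in for P(q|t) and leaves semantic distance out. The function names and the `decay` parameter are hypothetical.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def p_token(q, t, decay=1.0):
    """Unnormalized closeness of query token q to rewrite token t (syntactic only)."""
    return math.exp(-decay * edit_distance(q, t))


def p_query_given_rewrite(query_tokens, rewrite_tokens):
    """Independence assumption: multiply the per-token scores of aligned tokens."""
    if len(query_tokens) != len(rewrite_tokens):
        raise ValueError("this sketch assumes a one-to-one token alignment")
    return math.prod(p_token(q, t) for q, t in zip(query_tokens, rewrite_tokens))


Q = "single writer freddy mercury que".split()
S1 = "single writer freddie mercury queen".split()
S2 = "single writer freddrick mercury monarch".split()
print(p_query_given_rewrite(Q, S1), p_query_given_rewrite(Q, S2))
```

With a purely syntactic score, "monarch" is far from "que"; a semantic distance component would be needed for rewrites such as 2) and 3) to receive non-negligible probability.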

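Query Segmentation
Input: single writer freddie mercury que (for each next token: α = concatenate? α = split?)
Modeling query segmentation P(S|D): Nth order Markov assumption
where $P_D(\alpha_i t_{i+1} \mid t_1 \alpha_1 t_2 \ldots \alpha_{i-1} t_i)$ stands for $P(\alpha_i t_{i+1} \mid t_1 \alpha_1 t_2 \ldots \alpha_{i-1} t_i, D)$.
[Figure: prefix “single writer freddie” matched against the example data graph with classes Single, Person, Artist; entities Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar; edges member, producer, writer, formed in (1971), marital status (single)]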
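The factorization behind this notation can be sketched as below. It is a reconstruction under assumptions: m denotes the number of tokens, the first token gets its own factor, and the Nth order Markov assumption is read as conditioning on only the last N tokens and the actions between them.

```latex
\begin{align*}
P(S \mid D)
  &= P_D(t_1) \prod_{i=1}^{m-1}
     P_D(\alpha_i t_{i+1} \mid t_1 \alpha_1 t_2 \ldots \alpha_{i-1} t_i)
     && \text{chain rule} \\
  &\approx P_D(t_1) \prod_{i=1}^{m-1}
     P_D(\alpha_i t_{i+1} \mid t_{i-N+1} \alpha_{i-N+1} \ldots \alpha_{i-1} t_i)
     && N\text{th order Markov assumption}
\end{align*}
```

Here each action $\alpha_i$ is either concatenate or split, and indices below 1 are simply dropped from the context.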

Estimating Probability of Segmentation
Maximum likelihood estimation (MLE), where $C(t_i \ldots t_j)$ denotes the count of occurrences of the token sequence $t_i \ldots t_j$
Segmentation in structured data setting
• Concatenate two segments $s_i$ and $s_j$ when they co-occur in the data
• Split when $s_i$ and $s_j$ are connected ($s_i ↭ s_j$), i.e., when the two data elements $n_i$ and $n_j$ mentioning $s_i$ and $s_j$ are connected in the data
Example: single writer freddie mercury queen (α = concatenate? α = split?)
[Figure: prefix “single writer freddie” matched against the example data graph with classes Single, Person, Artist; entities Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar; edges member, producer, writer, formed in (1971), marital status (single)]

Estimating Probability of Segmentation
Two cases: (1) $l(s_i) \geq N$; (2) $l(s_i) < N$
Case 1: previous segment $s_i$ has length equal to or more than context N
Example: freddie j. mercury queen
(1) When the previously induced segment $s_i$ has length equal to or more than N, i.e. $l(s_i) \geq N$, it suffices to focus on $s_i(N)$ to predict the next action $\alpha_i$ on $t_{i+1}$
Estimation of probability: $C(st)$ denotes the count of co-occurrences of the sequence $st$ in D and $C(s ↭ t)$ is the count of all occurrences of token t connected to segment s
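The two counts can be illustrated with a small sketch. It assumes a toy set of labeled data elements and connections, and it assumes one plausible estimate in which the two counts are normalized against each other to obtain the probability of concatenating versus splitting; the actual estimator on the slide is not preserved, so this normalization, the data and all function names are hypothetical.

```python
# Toy data elements (id -> label) and connections between them (illustrative only).
ELEMENTS = {
    "n1": "freddie mercury",
    "n2": "queen",
    "n3": "queen elizabeth 1",
    "n4": "single",
    "n5": "liar",
}
CONNECTIONS = [("n1", "n2"), ("n1", "n5"), ("n2", "n5"), ("n3", "n4")]


def mentions(label, seq):
    """True if the token list `seq` occurs contiguously in the label."""
    tokens = label.split()
    return any(tokens[i:i + len(seq)] == seq for i in range(len(tokens) - len(seq) + 1))


def count_cooccur(segment, token):
    """C(st): number of elements whose label contains segment followed by token."""
    return sum(mentions(label, segment + [token]) for label in ELEMENTS.values())


def count_connected(segment, token):
    """C(s <-> t): number of connections between an element mentioning s and one mentioning t."""
    total = 0
    for a, b in CONNECTIONS:
        la, lb = ELEMENTS[a], ELEMENTS[b]
        if (mentions(la, segment) and mentions(lb, [token])) or \
           (mentions(lb, segment) and mentions(la, [token])):
            total += 1
    return total


def action_probs(segment, token):
    """Assumed estimate: normalize the two counts over the two competing actions."""
    concat, split = count_cooccur(segment, token), count_connected(segment, token)
    total = concat + split
    if total == 0:
        return {"concatenate": 0.0, "split": 0.0}
    return {"concatenate": concat / total, "split": split / total}


print(action_probs(["freddie"], "mercury"))           # co-occur in one label
print(action_probs(["freddie", "mercury"], "queen"))  # mentioned by connected elements
```

In this toy data, "freddie" and "mercury" only ever co-occur inside one label, while "freddie mercury" and "queen" are mentioned by two connected elements, so the sketch favors concatenation in the first case and a split in the second.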

Estimating Probability of Segmentation
Case 2: previous segment $s_i$ has length less than context N
Example: single writer freddie mercury
(2) When the previous segment $s_i$ has length less than N, i.e. $l(s_i) < N$, the action $\alpha_i$ on the next token $t_{i+1}$ depends on $s_i$ and $P_i(N)$, the set of segments that precede $s_i$ and that, together with $s_i$, contain at most N tokens in total
Estimation of probability: $C(P ↭ s)$ denotes the count of all occurrences of the segment s connected to all segments in P

Experimental results & Conclusions

Graphinder, a relational keyword search approach for suggesting query completions, translating queries and ranking results
• Keyword translation performance
  • Query translation and index-based approaches at least one order of magnitude faster than online in-memory search (bidirectional)
  • Query translation comparable with index-based approaches, but less space
• Keyword translation result quality
  • According to recent benchmark, our ranking consistently outperforms all existing ranking systems in precision, recall and MAP (10% - 30% improvement)
• Effect of query rewriting
  • Better user experience
  • Improves efficiency by reducing number of query terms
  • Improves quality / selectivity of query terms
  • …depends on complexity of queries and underlying keyword search engine
• Tight integration of query suggestion and translation
• From research prototypes to Graphinder, a powerful, flexible, low upfront-cost semantic search system

Tran Duc Thanh tran.du.th@gmail.com http://sites.google.com/site/kimducthanh/ Thanks!

References (1)
[VLDB14] Yongtao Ma, Thanh Tran. Probabilistic Query Rewriting for Efficient and Effective Keyword Search on Graph Data. In International Conference on Very Large Data Bases (VLDB'14). Hangzhou, China, September 2014.
[ISWC13] Daniel Herzig, Roi Blanco, Peter Mika, Thanh Tran. Federated Entity Search Using On-the-Fly Consolidation. In International Semantic Web Conference (ISWC'13). Sydney, Australia, October 2013.
[ICDE13] Yongtao Ma, Thanh Tran. TYPifier: Inferring the Type Semantics of Structured Data. In International Conference on Data Engineering (ICDE'13). Brisbane, Australia, April 2013.
[WSDM13] Yongtao Ma, Thanh Tran. TYPiMatch: Type-specific Unsupervised Learning of Keys and Key Values for Heterogeneous Web Data Integration. In International Conference on Web Search and Data Mining (WSDM'13). Rome, Italy, February 2013.
[TKDE12a] Thanh Tran, Günter Ladwig, Sebastian Rudolph. Managing Structured and Semi-structured RDF Data Using Structure Indexes. In Transactions on Knowledge and Data Engineering journal.
[TKDE12b] Thanh Tran, Lei Zhang. Keyword Query Routing. In Transactions on Knowledge and Data Engineering journal.

References (2)
[WWW12] Daniel Herzig, Thanh Tran. Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration. In Proceedings of 21st International World Wide Web Conference (WWW'12). Lyon, France, April 2012.
[CIKM11a] Günter Ladwig, Thanh Tran. Index Structures and Top-k Join Algorithms for Native Keyword Search Databases. In Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM'11). Glasgow, UK, October 2011.
[CIKM11b] Veli Bicer, Thanh Tran. Ranking Support for Keyword Search on Structured Data using Relevance Models. In Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM'11). Glasgow, UK, October 2011.
[SIGIR11] Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh Tran Duc. Repeatable and Reliable Search System Evaluation using Crowdsourcing. In Proceedings of 34th Annual International ACM SIGIR Conference (SIGIR'11). Beijing, China, July 2011.
[ICDE09] Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano. Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF. In Proceedings of the 25th International Conference on Data Engineering (ICDE'09). Shanghai, China, March 2009.
[SIGMOD09] Haofen Wang, Thomas Penin, Kaifeng Xu, Junquan Chen, Xinruo Sun, Linyun Fu, Yong Yu, Thanh Tran, Peter Haase, Rudi Studer. Hermes: A Travel through Semantics in the Data Web. In Proceedings of SIGMOD Conference 2009. Providence, USA, June-July 2009.

Backup
