Keyword Search on Graph-Structured Data S. Sudarshan, IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi K., Arvind Hulgeri, Bhavana Dalvi and Meghana Kshirsagar Jan 2009
Outline • Motivation and Graph Data Model • Query/Answer models • Tree answer model • Proximity queries • Graph Search Algorithms • Backward Expanding Search • Bidirectional Search • Search on external memory graphs • Conclusion
Keyword Search on Semi-Structured Data • Keyword search of documents on the Web has been enormously successful • Much data is resident in databases • Organizational, government, scientific, medical data • Deep web • Goal: querying of data from multiple data sources, with different data models • Often with no schema or partially defined schema
Keyword Search on Structured/Semi-Structured Data • [Figure: BANKS example — paper node "Focused Crawling" with writes edges to author nodes Soumen C. and Byron Dom] • Key differences from IR/Web Search: • Normalization (implicit/explicit) splits related data across multiple tuples • To answer a keyword query we need to find a (closely) connected set of entities that together match all given keywords • e.g. "soumen crawling" or "soumen byron"
Graph Data Model • Lowest common denominator across many data models • Relational • node = tuple, edge = foreign key • XML • Node = element, edge = containment/idref/keyref • HTML • node = page, edge = hyperlink • Documents • node = document, edge = links inferred by data extraction • Knowledge representation • node = entity, edge = relationship • Network data • e.g. social networks, communication networks
Graph Data Model (Cont) • Nodes can have • labels • E.g. relation name, or XML tag • textual or structured (attribute-value) data • Edges can have labels
Outline • Motivation and Graph Data Model • Query/Answer models • Tree answer model • Proximity queries • Graph Search Algorithms • Backward Expanding Search • Bidirectional Search • Search on external memory graphs • Conclusion
Query/Answer Models • Basic query model: • Keywords match node text/labels • Can extend query model with attribute specification, path specifications • e.g. paper(year < 2005, title:xquery), • Alternative answer models • tree connecting nodes matching query keywords • nodes in proximity to (near) query keywords
Tree Answer Model • Answer: rooted, directed tree connecting keyword nodes • In general, a Steiner tree • Multiple answers possible • Answer relevance computed from • answer edge score combined with • answer node score • [Figure: answer tree for query "Soumen Byron" — paper "Focused Crawling" at the root, with writes edges to author nodes Soumen C. and Byron Dom]
Answer Ranking • Naïve model: answers ranked by number of edges • Problem: • Some tuples are connected to many other tuples • E.g. highly cited papers, popular web sites • Highly connected tuples create misleading shortcuts • cf. six degrees of separation • Solution: use directed edges with edge weights • allow the answer tree to use edge u→v if the original graph has v→u, but at a higher cost
Edge Weight Model 1 • Forward edge weight (edge present in data) • Defaults to 1; can be based on schema • Lower weight ⇒ closer connection • Create an extra backward edge v→u for each edge u→v present in the data • Edge weight: log(1 + #edges pointing to v) • Overall answer-tree edge score E_A = 1 / (Σ edge weights) • Higher score ⇒ better result
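As a concrete sketch of Edge Weight Model 1, the backward-edge weights and the answer-tree edge score can be computed as below. This is a minimal illustration; the function names and the plain-dict graph representation are my own, not from the BANKS papers.

```python
import math
from collections import defaultdict

def build_weighted_edges(edges, forward_weight=1.0):
    """Forward edges keep a default weight of 1; each backward edge v->u
    is penalized by log(1 + in-degree of v), so highly connected nodes
    (e.g. highly cited papers) are expensive to traverse in reverse."""
    indeg = defaultdict(int)
    for _, v in edges:
        indeg[v] += 1
    weighted = {}
    for u, v in edges:
        weighted[(u, v)] = forward_weight           # forward edge, from the data
        weighted[(v, u)] = math.log(1 + indeg[v])   # extra backward edge
    return weighted

def edge_score(tree_edges, weighted):
    """Answer-tree edge score E_A = 1 / (sum of edge weights); higher is better."""
    return 1.0 / sum(weighted[e] for e in tree_edges)
```

A tree using only forward edges of weight 1 thus scores 1/(#edges), matching the naïve edge-count ranking; backward edges into popular nodes lower the score.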
Edge Weight Model 2 • Probabilistic edge scoring model • Edge traversal probability (from a given node): • Forward: 1/out-degree • Backward: 1/in-degree • Can be weighted by edge type • Path weight = probability of following each edge in the path • Edge score = log(edge traversal probability) • Answer-tree edge score E_A = (harmonic) mean of path weights from root to each leaf • Note: • other edge weight models are possible • our search algorithms are independent of how edge weights are computed
Node Weight • Node prestige based on in-degree • More incoming edges ⇒ higher prestige • PageRank-style transfer of prestige • Node weight computed using a biased random-walk model • Node weight: function of node prestige, plus optional criteria such as TF/IDF • Answer-tree node score N_A = root node weight + Σ leaf node weights
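The PageRank-style prestige transfer can be sketched with a standard power iteration, as below. This is an illustrative stand-in: the actual biased random walk and edge-type weighting used by BANKS differ in detail.

```python
def prestige(adj, damping=0.85, iters=50):
    """PageRank-style prestige over a directed graph.
    adj: dict node -> list of successors. A node's rank is split among
    its out-neighbours each step, so nodes with many in-edges accumulate
    prestige; damping models the random-restart of the walk."""
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    out = {u: adj.get(u, []) for u in nodes}   # ensure every node has an entry
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - damping) / n for u in nodes}
        for u, succs in out.items():
            if succs:
                share = damping * rank[u] / len(succs)
                for v in succs:
                    nxt[v] += share
            else:
                for v in nodes:  # dangling node: spread its rank uniformly
                    nxt[v] += damping * rank[u] / n
        rank = nxt
    return rank
```

A node cited by two others (e.g. a paper with two in-citations) ends up with strictly higher prestige than its citers, which is the "more incoming edges ⇒ higher prestige" behaviour above.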
Overall Tree Answer Score • Overall score of answer tree A • combine tree and node scores • for details, and recall/precision metrics see BANKS papers in ICDE 2002 and VLDB 2005 • Anecdotal results on DBLP Bibliography • “Transaction”: Jim Gray’s classic paper and textbook at the top because of prestige (# of citations) • “soumen sudarshan”: several coauthored papers, links via common co-authors • “goldman shivakumar hector”: The VLDB 98 proximity search paper, followed by citation/co-author connections
Answer Models • Tree Answer Model • Proximity (near query) model
Proximity Queries • Node weight by proximity • author (near olap) (on DBLP) • faculty (near earthquake) (on IITB thesis database) • Node prestige is higher if the node is close to multiple nodes matching the near keywords • Example applications • Finding experts in a particular area • [Figure: author nodes such as Widom and Raghu linked to OLAP papers, e.g. "OLAP over uncertain ..", "Computing sparse cubes…", "Overview of OLAP…", "Allocation in OLAP …"]
Proximity via Spreading Activation • Idea: • Each “near” keyword has activation of 1 • Divided among nodes matching keyword, proportional to their node prestige • Each node • keeps fraction 1-μ of its received activation and • spreads fraction μ amongst its neighbors • Combine activation ai received from neighbors • a = 1 – Π(1-ai) (belief function) • Graph may have cycles • Iterate till convergence
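The iteration above can be sketched as follows. This is a deliberately simplified version: the function name, the plain-dict graph, and the exact way seed activation is re-injected each round are illustrative, not taken from the BANKS implementation.

```python
def spread_activation(adj, seeds, mu=0.5, iters=30):
    """Spreading activation for 'near' queries.
    adj: node -> neighbour list; seeds: node -> initial activation
    (the near-keyword's activation of 1, already divided among matching
    nodes in proportion to prestige). Each node spreads fraction mu of
    its activation equally to neighbours; contributions a_i at a node
    are combined with the belief function a = 1 - prod(1 - a_i).
    Iterated a fixed number of rounds since the graph may have cycles."""
    act = {u: seeds.get(u, 0.0) for u in adj}
    for _ in range(iters):
        contrib = {u: [seeds.get(u, 0.0)] for u in adj}  # re-inject seed activation
        for u, nbrs in adj.items():
            for v in nbrs:
                contrib[v].append(mu * act[u] / len(nbrs))
        new = {}
        for u in adj:
            prod = 1.0
            for a in contrib[u]:
                prod *= (1.0 - a)
            new[u] = 1.0 - prod                          # belief combination
        act = new
    return act
```

Note how the belief function rewards multiple independent paths: a node receiving two contributions of 0.3 scores 1 − 0.7² = 0.51, more than either path alone.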
Example Answers • Anecdotal results on DBLP Bibliography • author (near recovery): Dave Lomet, C. Mohan, etc. • sudarshan (near change): Sudarshan Chawathe • sudarshan (near query): S. Sudarshan • Queries can combine proximity scores with tree scores • hector sudarshan (near query) vs. hector sudarshan • author (near transactions) data integration
Related Work • Proximity Search • Goldman, Shivakumar, Venkatasubramanian and Garcia-Molina [VLDB98] • Considers only shortest path from each node, aggregates across nodes • Our version aggregates evidence from alternative paths • E.g. author (near “Surajit Chaudhuri”) • Object Rank [VLDB04] • Similar idea to ours, precomputed
Related Work • Keyword querying on relational databases • DBXplorer (Microsoft, ICDE02), Discover (UCSD, VLDB02, VLDB03) • Use SQL generation; not applicable to arbitrary graphs • ranking based only on #nodes/edges • Keyword querying on XML: tree model • XRank (Cornell, SIGMOD03), proximity in XML (AT&T Research, VLDB03), Schema-Free XQuery (Michigan, VLDB04) • tree model is too limited • Keyword querying on XML: graph model • XKeyword (UCSD, ICDE03, VLDB03), SphereSearch (Max Planck, VLDB05) • ranking based only on #nodes/edges
Outline • Motivation and Graph Data Model • Query/Answer models • Tree answer model • Proximity queries • Graph Search Algorithms • Backward Expanding Search • Bidirectional Search • Search on external memory graphs • Conclusion
Finding Answer Trees • Backward Expanding Search algorithm (Bhalotia et al., ICDE02): • Intuition: find vertices from which a forward path exists to at least one node from each keyword's node set S_i • Run concurrent single-source shortest-path algorithms, one from each node matching a keyword • Create an iterator for each node matching a keyword • Traverse the graph edges in the reverse direction • Output the next nearest node on each get-next() call • Do best-first search across iterators • Output an answer when its root has been reached from each keyword • Answer heap to collect and output results in score order
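The algorithm can be sketched as concurrent Dijkstra iterators over reversed edges, advanced best-first. This is a simplified single-iterator-per-keyword version (as in the SI-Bkwd variant discussed later), and ranking candidate roots by summed distance is a stand-in for the real edge-score model; names are illustrative.

```python
import heapq

def backward_expanding_search(radj, keyword_nodes, k=10):
    """radj: reversed adjacency, radj[v] = [(u, w)] for each data edge
    u->v with weight w, so walking radj follows edges in reverse.
    keyword_nodes: one set of matching nodes per keyword.
    Returns up to k (total_distance, root) pairs: roots reached from
    every keyword, i.e. roots with a forward path to each keyword."""
    nkw = len(keyword_nodes)
    dist = [dict() for _ in range(nkw)]      # settled distances per keyword
    heaps = []
    for seeds in keyword_nodes:              # one Dijkstra frontier per keyword
        h = [(0.0, s) for s in seeds]
        heapq.heapify(h)
        heaps.append(h)
    answers = []
    while any(heaps) and len(answers) < k:
        # best-first: advance the iterator whose frontier node is nearest
        i = min((j for j in range(nkw) if heaps[j]),
                key=lambda j: heaps[j][0][0])
        d, u = heapq.heappop(heaps[i])
        if u in dist[i]:
            continue
        dist[i][u] = d
        if all(u in dist[j] for j in range(nkw)):   # root reached from all keywords
            answers.append((sum(dist[j][u] for j in range(nkw)), u))
        for v, w in radj.get(u, []):
            if v not in dist[i]:
                heapq.heappush(heaps[i], (d + w, v))
    answers.sort()
    return answers
```

For a paper node with writes edges to two matching authors, both reversed searches meet at the paper, which is reported as an answer root.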
Backward Expanding Search • Query: soumen byron • [Figure: backward search from keyword nodes Soumen C. and Byron Dom along writes edges, meeting at the paper node "Focused Crawling", which becomes the answer root]
Backward Exp. Search Limitations • Too many iterators (one per keyword node) • Solution: single iterator per keyword (SI-Bkwd search) • tracks shortest path from node to keyword • Changes answer set slightly • Different justifications for same root may be lost • Not a big problem in practice • Nodes explored for different keywords can vary greatly • E.g. “mining” or “query” vs “knuth” • High fan-out when traversing backwards from some nodes • Connection with join ordering • Similar to traversing backwards from all relations that have selections
Bidirectional Search: Intuition • First-cut solution: • Don't expand backward if a keyword matches many nodes • Instead, explore forward from the other keywords • Problems • Doesn't deal with high fan-out during search • What should the cutoff for not expanding be? • Better solution [Kacholia et al., VLDB 2005]: • Perform forward search from all nodes reached • Prioritize expansion of nodes based on • path weight (as in backward expanding search) + • spreading activation • to penalize frequent keywords and bushy trees
Bidirectional Search: Example • Query: harper divesh olap • [Figure: example graph with nodes matching OLAP, Divesh and Harper]
Bidirectional Search (1) • Spreading activation to prioritize backward search • (Different from spreading activation for near queries) • Lower weight edges get higher share of activation • Nodes prioritized by sum of activations • Single forward iterator
Bidirectional Search (2) • Forward search iterator • Forward search from all nodes reached by backward search • Track the best forward path to each keyword • Initially infinite cost • Whenever this changes, propagate the cost change to all affected ancestors • [Figure: per-node (distance to k1, distance to k2) pairs, e.g. (2,∞) and (∞,2) updated to (2,2) as forward paths to k2 are found]
Bidirectional Search (3) • On each path-length update (due to backward or forward search) • Check if the node can reach all keywords • If so, add it to the output heap • When to output nodes from the heap • For each keyword K_i, track M_i • M_i: minimum path length to K_i among all yet-to-be-explored nodes in the backward search tree • Edge score bounds: • What is the best possible edge score of a future answer? • Bounds similar to the NRA algorithm (Fagin) • Cheaper bounds (e.g. 1/max(M_i)) or heuristics (e.g. 1/sum(M_i)) can be used • Output an answer if its score is > the overall score upper bound for future answers
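The output test can be sketched in a few lines using the cheap 1/max(M_i) bound from the slide; the function name is illustrative.

```python
def can_output(top_answer_score, M):
    """Decide whether the best heaped answer may be emitted.
    M: list of M_i values, the minimum path length to keyword K_i among
    yet-to-be-explored nodes. Any future answer's tree must include a
    path of length >= max(M_i), so 1/max(M_i) upper-bounds its edge
    score; an answer scoring above that bound can be output safely."""
    future_bound = 1.0 / max(M)
    return top_answer_score > future_bound
```

Using 1/sum(M_i) instead gives a smaller (heuristic) bound that outputs answers earlier, at some risk to exact score order.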
Performance • Worst case complexity: polynomial in size of graph • But for typical (average) case, even linear is too expensive • Intuition: typical query should access only small part of graph • Studied experimentally • Datasets: DBLP, IMDB, US Patent • Queries: manually created • Typical cases • < 1 second to generate answer • 10K-100K nodes explored
Performance Results • Two versions of backward search: • Iterator per node (MI-Bkwd) vs. iterator per keyword (SI-Bkwd) • Origin size: number of nodes matching keywords • [Chart: time ratio MI/SI across queries] • Very minor loss in recall
Performance Results • SI-Bkwd versus Bidirectional search • Bidirectional search gain increases with origin size, # keywords
Related Work (1) • Publish-as-document approach • Gather related data into a (virtual) document and index the document (Su/Widom, IDEAS05) • Positives • Avoids run-time graph search • Works well for a class of applications • E.g. bibliographic data ⇒ a DBLP page per author • Negatives • Not all connections can be captured • Duplication of data across multiple documents • High index space overhead
Related Work (2) • DPBF (Ding et al., ICDE07) • dynamic programming technique • exact for the top-1 answer, heuristic for top-k • BLINKS (He et al., SIGMOD07) • Round-robin expansion across iterators • Optimal within a factor of m, for m keywords • Forward index: node-to-keyword distance • Used instead of searching forward • single-level index: impractically large space • bi-level index: larger than main memory ⇒ IO efficiency?
Outline • Motivation and Graph Data Model • Query/Answer models • Tree answer model • Proximity queries • Graph Search Algorithms • Backward Expanding Search • Bidirectional Search • Search on external memory graphs • Conclusion
External Memory Graph Search • Graph representation is quite efficient • Requires < 20 bytes per node/edge • Problem: what if graph size > memory? • Alternative 1: virtual memory • ⇒ thrashing • Alternative 2 (for relational data): SQL • not good for top-k answer generation across multiple SQL queries • Alternative 3: use a compressed graph representation to reduce IO • [Dalvi et al., VLDB 2008]
Multi-granular Graph • Dumb algorithm • search on supernode graph • get k*F answers, expand their supernodes into memory, search on resultant graph • no guarantees on answers • Better idea: use multi-granular graph • Supernode graph in memory • Some nodes expanded • Expanded nodes are part of cache • Algorithms on multi-granular graph (coming up)
Expanding Nodes • Key idea: the edge weight of an answer containing a supernode is a lower bound on the edge weight of any corresponding real answer, so its edge score upper-bounds the real answers' scores
Iterative Search • Iterative search on multi-granular graph • Repeat • search on current multi-granular graph using any search algorithm, to find top results • expand super nodes in top results • Until top K answers are all pure • Guarantees finding top-K answers • Very good IO efficiency • But high CPU cost due to repeated work • Details: nodes expanded above never evicted from “virtual memory” cache
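The repeat-until loop above can be sketched as follows; all names (mg_graph, search, expand, is_super) are illustrative placeholders for the multi-granular graph API, not from the paper.

```python
def iterative_search(mg_graph, query, k, search, expand):
    """Iterative search on a multi-granular graph.
    search(graph, query, k) -> top-k answers, each an iterable of nodes;
    an answer is 'pure' if it contains no supernodes.
    expand(graph, supernodes) -> graph with those supernodes replaced by
    their inner nodes (loaded into the cache and, per the slides, never
    evicted). Terminates when the top-k answers are all pure, which
    guarantees the true top-k at the cost of repeated search work."""
    while True:
        answers = search(mg_graph, query, k)
        supernodes = {n for a in answers for n in a if mg_graph.is_super(n)}
        if not supernodes:
            return answers          # all top-k answers are pure
        mg_graph = expand(mg_graph, supernodes)
```

The high CPU cost noted on the slide is visible here: each iteration reruns the search from scratch on the refined graph, which the incremental variant below avoids.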
Incremental Search • Idea: when node expanded, incrementally update state of search algorithm to reflect change in multi-granular graph • Run search algorithm until top K answers are all pure • Currently implemented for backward search • Modifies the state of the Dijkstra shortest path algorithm used by backward search • One shortest path search iterator per keyword • SPI tree: shortest path iterator tree
Incremental Search (1) • [Figure: SPI tree for keyword k1]
External Memory Search: Performance • [Table: benchmark queries used in the evaluation]
External Memory Search: Performance • Supernode graph very effective at minimizing IO • Cache misses with incremental often < # nodes matching keywords • Iterative has high CPU cost • VM (backward search with cache as virtual memory) has high IO cost • Incremental combines low IO cost with low CPU cost
Conclusions • Keyword search on graphs continues to grow in importance • E.g. graph representation of extracted knowledge in YAGO/NAGA (Max Planck) • Ranking is critical • Edge and node weights, spreading activation • Efficient graph search is important • In-memory and external-memory
Ongoing/Future Work • External memory graph search • Compression ratios for supernode graph for DBLP/IMDB: factor of 5 to 10 • Ongoing work on graph clustering shows good results • Graph search in a parallel cluster • Goal: search integrated WWW/Wikipedia graph • New search algorithms • Integration with existing applications • To provide more natural display of results, hiding schema details • Authorization