
Keyword Search on Graph-Structured Data



Presentation Transcript


  1. Keyword Search on Graph-Structured Data S. Sudarshan, IIT Bombay Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi K., Arvind Hulgeri, Bhavana Dalvi and Meghana Kshirsagar Jan 2009

  2. Outline • Motivation and Graph Data Model • Query/Answer models • Tree answer model • Proximity queries • Graph Search Algorithms • Backward Expanding Search • Bidirectional Search • Search on external memory graphs • Conclusion

  3. Keyword Search on Semi-Structured Data • Keyword search of documents on the Web has been enormously successful • Much data is resident in databases • Organizational, government, scientific, medical data • Deep web • Goal: querying of data from multiple data sources, with different data models • Often with no schema or partially defined schema

  4. Keyword Search on Structured/Semi-Structured Data [Figure: BANKS screenshot — a paper node "Focused Crawling" connected via writes edges to author nodes Soumen C. and Byron Dom] • Key differences from IR/Web Search: • Normalization (implicit/explicit) splits related data across multiple tuples • To answer a keyword query we need to find a (closely) connected set of entities that together match all given keywords • E.g. soumen crawling or soumen byron

  5. Graph Data Model • Lowest common denominator across many data models • Relational • node = tuple, edge = foreign key • XML • Node = element, edge = containment/idref/keyref • HTML • node = page, edge = hyperlink • Documents • node = document, edge = links inferred by data extraction • Knowledge representation • node = entity, edge = relationship • Network data • e.g. social networks, communication networks
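A minimal Python sketch of this mapping for the relational case, using the paper/author running example; the identifiers (a1, p1, w1) and the dictionary-based graph are illustrative only, not the BANKS data structures:

```python
from collections import defaultdict

# Toy relational instance: every tuple becomes a node, every foreign key an edge.
authors = {"a1": "Soumen C.", "a2": "Byron Dom"}
papers  = {"p1": "Focused Crawling"}
writes  = [("w1", "a1", "p1"), ("w2", "a2", "p1")]   # (tuple id, author fk, paper fk)

nodes = {}                 # node id -> text/label used for keyword matching
edges = defaultdict(set)   # directed adjacency: node id -> successor node ids

nodes.update(authors)
nodes.update(papers)
for wid, aid, pid in writes:
    nodes[wid] = "writes"          # link tuples carry just their relation name
    edges[wid].add(aid)            # foreign key: writes tuple -> author tuple
    edges[wid].add(pid)            # foreign key: writes tuple -> paper tuple

print(nodes)
print(dict(edges))
```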

  6. Graph Data Model (Cont) • Nodes can have • labels • E.g. relation name, or XML tag • textual or structured (attribute-value) data • Edges can have labels

  7. Outline • Motivation and Graph Data Model • Query/Answer models • Tree answer model • Proximity queries • Graph Search Algorithms • Backward Expanding Search • Bidirectional Search • Search on external memory graphs • Conclusion

  8. Query/Answer Models • Basic query model: • Keywords match node text/labels • Can extend query model with attribute specification, path specifications • e.g. paper(year < 2005, title:xquery), • Alternative answer models • tree connecting nodes matching query keywords • nodes in proximity to (near) query keywords

  9. Tree Answer Model • Answer: Rooted, directed tree connecting keyword nodes • In general, a Steiner tree • Multiple answers possible • Answer relevance computed from • answer edge score combined with • answer node score [Figure: for the query "Soumen Byron", an answer tree rooted at the paper "Focused Crawling", with writes/author edges reaching Soumen C. and Byron Dom]

  10. Answer Ranking • Naïve model: answers ranked by number of edges • Problem: • Some tuples are connected to many other tuples • E.g. highly cited papers, popular web sites • Highly connected tuples create misleading shortcuts • six degrees of separation • Solution: use directed edges with edge weights • allow answer tree to have edge u→v if original graph has v→u, but at higher cost

  11. Edge Weight Model 1 • Forward edge weight (edge present in data) • Default to 1, can be based on schema • Lower weight → closer connection • Create extra backward edges v→u for each edge u→v present in data • Edge weight: log(1 + #edges pointing to v) • Overall answer-tree edge score E_A = 1 / (Σ edge weights) • Higher score → better result
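A small worked example of this weighting, as a hedged sketch (the function names and the indegree value are illustrative, not taken from the BANKS code):

```python
import math

def forward_weight(u, v):
    return 1.0                                   # default; could depend on the schema

def backward_weight(v, u, indegree_v):
    return math.log(1 + indegree_v)              # costlier to traverse "against" popular nodes

def answer_edge_score(edge_weights):
    return 1.0 / sum(edge_weights)               # E_A = 1 / (sum of edge weights)

# Example: a tree with two forward edges and one backward edge whose target
# has indegree 100 (say, a highly cited paper).
weights = [forward_weight("w1", "a1"),
           forward_weight("w2", "a2"),
           backward_weight("p1", "w1", indegree_v=100)]
print(round(answer_edge_score(weights), 3))      # lower total weight => higher score
```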

  12. Edge Weight Model 2 • Probabilistic edge scoring model • Edge traversal probability (from a given node): • Forward: 1/out-degree • Backward: 1/in-degree • Can be weighted by edge type • Path weight = Π (probability of following each edge in path) • Edge score = log(edge traversal probability) • Answer-tree edge score E_A = (harmonic) mean of path weights from root to each leaf • Note: • other edge weight models possible • our search algorithms are independent of how edge weights are computed
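A corresponding sketch of the probabilistic model, again with illustrative names and degrees; it computes each root-to-leaf path weight as a product of traversal probabilities and E_A as their harmonic mean:

```python
from math import prod   # Python 3.8+

def traversal_prob(direction, out_degree, in_degree):
    # forward edge: 1/out-degree of the source; backward edge: 1/in-degree of the target
    return 1.0 / out_degree if direction == "forward" else 1.0 / in_degree

def path_weight(edge_probs):
    return prod(edge_probs)                      # product over the edges of one root-to-leaf path

def answer_edge_score(path_weights):
    n = len(path_weights)                        # E_A = harmonic mean of the path weights
    return n / sum(1.0 / w for w in path_weights)

p1 = path_weight([traversal_prob("backward", 3, 4), traversal_prob("forward", 2, 5)])
p2 = path_weight([traversal_prob("forward", 3, 2)])
print(round(answer_edge_score([p1, p2]), 3))
```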

  13. Node Weight • Node prestige based on indegree • More incoming edges → higher prestige • PageRank-style transfer of prestige • Node weight computed using biased random walk model • Node weight: function of node prestige, other optional criteria such as TF/IDF • Answer-tree node score N_A = root node weight + Σ leaf node weights
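The prestige computation can be sketched as a plain PageRank-style iteration; the biased random walk of the actual model would additionally restart preferentially at chosen nodes. The graph, damping factor and iteration count below are illustrative assumptions:

```python
def node_prestige(edges, nodes, damping=0.85, iters=50):
    """edges: node -> set of successor nodes; returns a prestige value per node."""
    nodes = list(nodes)
    prestige = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        nxt = {v: (1 - damping) / len(nodes) for v in nodes}
        for u in nodes:
            succs = edges.get(u, set())
            if not succs:
                continue
            share = damping * prestige[u] / len(succs)
            for v in succs:
                nxt[v] += share              # incoming edges transfer prestige to v
        prestige = nxt
    return prestige

graph = {"a1": {"p1"}, "a2": {"p1"}, "p1": set()}
print(node_prestige(graph, graph))           # the much-pointed-to node p1 gets the highest prestige
```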

  14. Overall Tree Answer Score • Overall score of answer tree A • combine tree and node scores • for details, and recall/precision metrics see BANKS papers in ICDE 2002 and VLDB 2005 • Anecdotal results on DBLP Bibliography • “Transaction”: Jim Gray’s classic paper and textbook at the top because of prestige (# of citations) • “soumen sudarshan”: several coauthored papers, links via common co-authors • “goldman shivakumar hector”: The VLDB 98 proximity search paper, followed by citation/co-author connections

  15. Answer Models • Tree Answer Model • Proximity (near query) model

  16. Proximity Queries • Node weight by proximity • author (near olap) (on DBLP) • faculty (near earthquake) (on IITB thesis database) • Node prestige is higher if close to multiple nodes matching near keywords • Example applications • Finding experts on a particular area [Figure: author nodes such as Widom and Raghu linked to OLAP papers, e.g. "OLAP over uncertain ..", "Computing sparse cubes…", "Overview of OLAP…", "Allocation in OLAP …"]

  17. Proximity via Spreading Activation • Idea: • Each “near” keyword has activation of 1 • Divided among nodes matching keyword, proportional to their node prestige • Each node • keeps fraction 1-μ of its received activation and • spreads fraction μ amongst its neighbors • Combine activation ai received from neighbors • a = 1 – Π(1-ai) (belief function) • Graph may have cycles • Iterate till convergence
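One possible rendering of this procedure in Python, assuming an undirected neighbour map and a single near-keyword; the example graph, mu and iteration count are illustrative, not the BANKS defaults:

```python
from math import prod   # Python 3.8+

def spread_activation(neighbours, seeds, mu=0.5, iters=30):
    """neighbours: node -> set of adjacent nodes; seeds: initial activation of the
    nodes matching the near-keyword (a total of 1, split e.g. by node prestige)."""
    act = {v: seeds.get(v, 0.0) for v in neighbours}
    for _ in range(iters):                                  # iterate towards convergence
        contrib = {v: [] for v in neighbours}               # activations arriving at each node
        for u, nbrs in neighbours.items():
            contrib[u].append((1 - mu) * act[u])            # keep fraction 1 - mu ...
            for v in nbrs:
                contrib[v].append(mu * act[u] / len(nbrs))  # ... spread fraction mu to neighbours
        # combine arriving activations with the belief function a = 1 - prod(1 - a_i)
        act = {v: 1 - prod(1 - a for a in parts) for v, parts in contrib.items()}
    return act

graph = {"olap_paper": {"widom"}, "widom": {"olap_paper", "other_paper"}, "other_paper": {"widom"}}
scores = spread_activation(graph, seeds={"olap_paper": 1.0})
print(sorted(scores.items(), key=lambda kv: -kv[1]))        # widom scores above other_paper
```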

  18. Example Answers • Anecdotal results on DBLP Bibliography • author (near recovery): Dave Lomet, C. Mohan, etc. • sudarshan (near change): Sudarshan Chawathe • sudarshan (near query): S. Sudarshan • Queries can combine proximity scores with tree scores • hector sudarshan (near query) vs. hector sudarshan • author (near transactions) data integration

  19. Related Work • Proximity Search • Goldman, Shivakumar, Venkatasubramanian and Garcia-Molina [VLDB98] • Considers only shortest path from each node, aggregates across nodes • Our version aggregates evidence from alternative paths • E.g. author (near "Surajit Chaudhuri") • ObjectRank [VLDB04] • Similar idea to ours, precomputed

  20. Related Work • Keyword querying on relational databases • DBXplorer (Microsoft, ICDE02), DISCOVER (UCSD, VLDB02, VLDB03) • Use SQL generation, not applicable to arbitrary graphs • ranking based only on #nodes/edges • Keyword querying on XML: tree model • XRank (Cornell, SIGMOD03), proximity in XML (AT&T Research, VLDB03), Schema-Free XQuery (Michigan, VLDB04) • Tree model is too limited • Keyword querying on XML: graph model • XKeyword (UCSD, ICDE03, VLDB03), SphereSearch (Max Planck, VLDB05) • ranking based only on #nodes/edges

  21. Outline • Motivation and Graph Data Model • Query/Answer models • Tree answer model • Proximity queries • Graph Search Algorithms • Backward Expanding Search • Bidirectional Search • Search on external memory graphs • Conclusion

  22. Finding Answer Trees • Backward Expanding Search Algorithm (Bhalotia et al, ICDE02): • Intuition: find vertices from which a forward path exists to at least one node from each keyword node set Si • Run concurrent single-source shortest path algorithms from each node matching a keyword • Create an iterator for each node matching a keyword • Traverse the graph edges in reverse direction • Output next nearest node on each get-next() call • Do best-first search across iterators • Output an answer when its root has been reached from each keyword • Answer heap to collect and output results in score order
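A compressed sketch of the algorithm (illustrative, not the BANKS implementation): one Dijkstra-style frontier per keyword shares a single best-first heap, and a node is reported as an answer root once it has been reached from every keyword. The final answer heap that re-sorts results by the full node+edge score, and the re-examination of roots whose distances later improve, are omitted:

```python
import heapq

def backward_expanding_search(rev_adj, keyword_nodes, max_answers=10):
    """rev_adj: node -> list of (predecessor, edge weight); keyword_nodes: one set per keyword."""
    k = len(keyword_nodes)
    dist = [dict() for _ in range(k)]               # best distance from each node to keyword i
    heap = []                                       # one heap = best-first search across iterators
    for i, origin in enumerate(keyword_nodes):
        for v in origin:
            dist[i][v] = 0.0
            heapq.heappush(heap, (0.0, i, v))
    answers, seen_roots = [], set()
    while heap and len(answers) < max_answers:
        d, i, v = heapq.heappop(heap)
        if d > dist[i].get(v, float("inf")):
            continue                                # stale heap entry
        if v not in seen_roots and all(v in dist[j] for j in range(k)):
            seen_roots.add(v)                       # v reached from every keyword: answer root
            answers.append((v, sum(dist[j][v] for j in range(k))))
        for u, w in rev_adj.get(v, []):             # expand backwards along incoming edges
            if d + w < dist[i].get(u, float("inf")):
                dist[i][u] = d + w
                heapq.heappush(heap, (d + w, i, u))
    return answers
```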

  23. Backward Expanding Search • Query: soumen byron [Figure: backward expansion from the nodes matching "soumen" and "byron" (Soumen C., Byron Dom), meeting at the root — the paper "Focused Crawling" — via writes/author edges]

  24. Backward Exp. Search Limitations • Too many iterators (one per keyword node) • Solution: single iterator per keyword (SI-Bkwd search) • tracks shortest path from node to keyword • Changes answer set slightly • Different justifications for same root may be lost • Not a big problem in practice • Nodes explored for different keywords can vary greatly • E.g. “mining” or “query” vs “knuth” • High fan-out when traversing backwards from some nodes • Connection with join ordering • Similar to traversing backwards from all relations that have selections

  25. Bidirectional Search: Motivation

  26. Bidirectional Search: Intuition • First-cut solution: • Don't expand backward if keyword matches many nodes • Instead explore forward from other keywords • Problems • Doesn't deal with high fan-out during search • What should the cutoff for not expanding be? • Better solution: [Kacholia et al, VLDB 2005] • Perform forward search from all nodes reached • Prioritize expansion of nodes based on • path weight (as in backward expanding search) + • spreading activation • to penalize frequent keywords and bushy trees

  27. Bidirectional Search: Example • Query: harper divesh olap [Figure: graph fragment with nodes matching olap, divesh and harper]

  28. Bidirectional Search (1) • Spreading activation to prioritize backward search • (Different from spreading activation for near queries) • Lower weight edges get higher share of activation • Nodes prioritized by sum of activations • Single forward iterator

  29. Bidirectional Search (2) • Forward search iterator • Forward search from all nodes reached by backward search • Track best forward path to each keyword • Initially infinite cost • Whenever this changes, propagate cost change to all affected ancestors [Figure: nodes annotated with (distance to k1, distance to k2) pairs, e.g. (2,∞), (2,2), (∞,∞), (∞,2), updated as forward paths are found]

  30. Bidirectional Search (3) • On each path length update (due to backward or forward search) • Check if node can reach all keywords • If so, add it to output heap • When to output nodes from heap • For each keyword Ki, track Mi • Mi: minimum path length to Ki among all yet-to-be-explored nodes in backward search tree • Edge score bounds: • What is the best possible edge score of a future answer? • Bounds similar to NRA algorithm (Fagin) • Cheaper bounds (e.g. 1/Max(Mi)) or heuristics (e.g. 1/Sum(Mi)) can be used • Output answer if its score is > overall score upper bound for future answers
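The output test can be sketched as below; the 1/Max(Mi) bound and 1/Sum(Mi) heuristic follow the bullets above, while the function names are illustrative:

```python
def future_score_bound(min_pending_dist, heuristic=False):
    """min_pending_dist: M_i = minimum path length to keyword K_i over all
    yet-to-be-explored nodes in the backward search tree."""
    if heuristic:
        return 1.0 / sum(min_pending_dist)   # tighter 1/Sum(M_i) heuristic (not a safe bound)
    return 1.0 / max(min_pending_dist)       # safe 1/Max(M_i) bound

def emit_ready(answer_scores, min_pending_dist):
    bound = future_score_bound(min_pending_dist)
    return [s for s in answer_scores if s > bound]   # these cannot be beaten by future answers

print(emit_ready([0.8, 0.3, 0.1], min_pending_dist=[2.0, 4.0]))   # bound = 0.25 -> [0.8, 0.3]
```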

  31. Performance • Worst case complexity: polynomial in size of graph • But for typical (average) case, even linear is too expensive • Intuition: typical query should access only small part of graph • Studied experimentally • Datasets: DBLP, IMDB, US Patent • Queries: manually created • Typical cases • < 1 second to generate answer • 10K-100K nodes explored

  32. Performance Results • Two versions of backward search: • Iterator per node (MI-Bkwd) vs iterator per keyword (SI-Bkwd) • Origin size: number of nodes matching keywords [Chart: MI/SI time ratio vs. origin size] • Very minor loss in recall

  33. Performance Results • SI-Bkwd versus Bidirectional search • Bidirectional search gain increases with origin size, # keywords

  34. Related Work (1) • Publish-as-document approach • Gather related data into a (virtual) document and index the document (Su/Widom, IDEAS05) • Positives • Avoids run-time graph search • Works well for a class of applications • E.g. bibliographic data → DBLP page per author • Negatives • Not all connections can be captured • Duplication of data across multiple documents • High index space overhead

  35. Related Work (2) • DPBF (Ding et al., ICDE07) • dynamic programming technique • exact for top-1 answer, heuristic for top-k • BLINKS (He et al., SIGMOD07) • Round-robin expansion across iterators • Optimal within a factor of m, with m keywords • Forward index: node-to-keyword distance • Used instead of searching forward • single-level index: impractically large space • bi-level index: larger than main memory → IO efficiency?

  36. Outline • Motivation and Graph Data Model • Query/Answer models • Tree answer model • Proximity queries • Graph Search Algorithms • Backward Expanding Search • Bidirectional Search • Search on external memory graphs • Conclusion

  37. External Memory Graph Search • Graph representation quite efficient • Requires < 20 bytes per node/edge • Problem: what if graph size > memory? • Alternative 1: Virtual Memory • thrashing • Alternative 2 (for relational data): SQL • not good for top-K answer generation across multiple SQL queries • Alternative 3: use compressed graph representation to reduce IO • [Dalvi et al, VLDB 2008]

  38. Supernodes and Superedges

  39. Multi-granular Graph • Dumb algorithm • search on supernode graph • get k*F answers, expand their supernodes into memory, search on resultant graph • no guarantees on answers • Better idea: use multi-granular graph • Supernode graph in memory • Some nodes expanded • Expanded nodes are part of cache • Algorithms on multi-granular graph (coming up)

  40. Multi-granular Graph

  41. Expanding Nodes • Key idea: Edge score of answer containing a supernode is lower bound on actual edge score of any corresponding real answer

  42. Iterative Search • Iterative search on multi-granular graph • Repeat • search on current multi-granular graph using any search algorithm, to find top results • expand super nodes in top results • Until top K answers are all pure • Guarantees finding top-K answers • Very good IO efficiency • But high CPU cost due to repeated work • Details: nodes expanded above never evicted from “virtual memory” cache
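A sketch of this loop, with search_top_k, expand and is_supernode standing in for the in-memory search algorithm, the cache's expand operation and the supernode test (all hypothetical names, not from the actual system):

```python
def iterative_search(mg_graph, query, k, search_top_k, expand, is_supernode):
    """Repeat in-memory search and supernode expansion until the top-k answers are pure."""
    while True:
        results = search_top_k(mg_graph, query, k)      # search the current multi-granular graph
        impure = {n for r in results for n in r.nodes if is_supernode(n)}
        if not impure:
            return results                              # all top-k answers are pure: done
        for sn in impure:
            expand(mg_graph, sn)                        # pull the supernode's contents into the cache
        # expanded nodes stay cached ("virtual memory") and are not evicted across iterations
```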

  43. Incremental Search • Idea: when node expanded, incrementally update state of search algorithm to reflect change in multi-granular graph • Run search algorithm until top K answers are all pure • Currently implemented for backward search • Modifies the state of the Dijkstra shortest path algorithm used by backward search • One shortest path search iterator per keyword • SPI tree: shortest path iterator tree

  44. Incremental Search (1) SPI tree for k1

  45. Incremental Search (2)

  46. Incremental Search (3)

  47. External Memory Search: Performance [Table: benchmark queries]

  48. External Memory Search: Performance • Supernode graph very effective at minimizing IO • Cache misses with incremental often < # nodes matching keywords • Iterative has high CPU cost • VM (backward search with cache as virtual memory) has high IO cost • Incremental combines low IO cost with low CPU cost

  49. Conclusions • Keyword search on graphs continues to grow in importance • E.g. graph representation of extracted knowledge in YAGO/NAGA (Max Planck) • Ranking is critical • Edge and node weights, spreading activation • Efficient graph search is important • In-memory and external-memory

  50. Ongoing/Future Work • External memory graph search • Compression ratios for supernode graph for DBLP/IMDB: factor of 5 to 10 • Ongoing work on graph clustering shows good results • Graph search in a parallel cluster • Goal: search integrated WWW/Wikipedia graph • New search algorithms • Integration with existing applications • To provide more natural display of results, hiding schema details • Authorization
