XRANK


  1. XRANK • XRANK: Ranked Keyword Search over XML Documents • Ece AKSU, Gökay Burak AKKUŞ

  2. This Paper... • Describes the architecture, implementation and evaluation of the XRANK system • The contributions of the paper are: • (a) the problem definition and system architecture • (b) an algorithm for computing the ranking of XML elements • (c) new inverted list index structures and associated query processing algorithms • (d) an experimental evaluation of XRANK

  3. Overview • Problem: Efficiently producing ranked results for keyword search queries over hierarchical XML documents. • New challenges • Query results can be deeply nested XML elements. • Ranking is at the granularity of an XML element (not the document). • Keyword proximity is more complex.

  4. Overview - 2 • This paper presents the XRANK system to handle these features of XML keyword search. • XRANK offers both space and performance benefits. • XRANK generalizes a hyperlink-based HTML search engine such as Google. • XRANK can be used to query both HTML and XML documents.

  5. Keyword Search Querying - 1 • Advantage: keyword search querying is simple • Users do not have to learn a complex query language • Users can issue queries without any prior knowledge of the structure of the underlying data • Consequence: the interface is flexible • Queries may not always be precise and can return a large number of results

  6. Keyword Search Querying - 2 • An important requirement for keyword search is to rank the query results so that the most relevant results appear first. • Certain limitations of the HTML data model make such systems ineffective in many domains. • HTML is a presentation language • HTML cannot capture much semantics

  7. Keyword Search Querying - 3 • The XML data model addresses this limitation by allowing for extensible element tags. (Example: Figure 1)

  9. Querying XML Documents • One approach is a sophisticated query language such as XQuery • Effective in some cases • Users have to learn a complex query language and understand the schema of the underlying XML • An alternative approach is XRANK • Retains the simple keyword search query interface • Exploits XML's tagged and nested structure during query processing

  10. New Challenges • Keyword searching over XML introduces many new challenges. 1. The result of a keyword search query can be a deeply nested XML element. • Return the 'deepest' node 2. Ranking is not solely based on hyperlinks. • The semantics of containment links (relating parent and child elements) is very different from that of hyperlinks (such as IDREFs and XLinks)

  11. New Challenges 3. The notion of proximity among keywords is more complex • In HTML, proximity among keywords translates directly to the distance between keywords in a document. • For XML there is a 2-dimensional proximity metric: • Keyword distance • Ancestor distance

  12. XML Data Model • XML is a hierarchical format for data representation and exchange. • An XML document consists of a root element, nested sub-elements, attributes and values. • XML supports intra-document and inter-document references.

  13. XML Data Model-2 • Intra-document references are represented using IDREFs. • Inter-document references are represented using XLink. • Both IDREFs and XLinks are referred to as hyperlinks.

  14. Definitions • A collection of hyperlinked XML documents can be defined as a directed graph G = (N, CE, HE) • N: the set of nodes, N = NE ∪ NV • NE: the set of elements • NV: the set of values • CE: the set of containment edges relating nodes • HE: the set of hyperlink edges relating nodes

  15. Definitions - 2 • The edge (u, v) ∈ CE iff v is a value/nested sub-element of u. • The edge (u, v) ∈ HE iff u contains a hyperlink reference to v. • An element u is a sub-element of an element v if (v, u) ∈ CE. • An element u is the parent of node v if (u, v) ∈ CE. • The predicate contains*(v, k) is true if the node v directly or indirectly contains the keyword k.
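The definitions on slides 14 and 15 can be illustrated with a minimal Python sketch. The class name `XMLGraph`, its field names, the node names and the whitespace-based keyword matching are all assumptions made for illustration; the slides only define the sets N, CE, HE and the predicate contains*.

```python
from dataclasses import dataclass, field


@dataclass
class XMLGraph:
    """Hypothetical container for G = (N, CE, HE), with N = NE ∪ NV."""
    elements: set = field(default_factory=set)  # NE: element nodes
    values: dict = field(default_factory=dict)  # NV: value node -> its text
    ce: set = field(default_factory=set)        # containment edges (u, v)
    he: set = field(default_factory=set)        # hyperlink edges (u, v)

    def children(self, u):
        # All v such that (u, v) is a containment edge.
        return [v for (parent, v) in self.ce if parent == u]

    def contains_star(self, v, k):
        """True if node v directly or indirectly contains keyword k."""
        if v in self.values:
            # Assumption: a keyword is a whitespace-separated token.
            return k.lower() in self.values[v].lower().split()
        return any(self.contains_star(c, k) for c in self.children(v))


# A tiny, made-up document collection.
g = XMLGraph()
g.elements |= {"workshop", "paper", "title", "cite", "other_paper"}
g.values.update({"t1": "XRANK ranked keyword search", "c1": "XQL and proximal nodes"})
g.ce |= {("workshop", "paper"), ("paper", "title"), ("paper", "cite"),
         ("title", "t1"), ("cite", "c1")}
g.he.add(("cite", "other_paper"))  # hyperlinks are not followed by contains*
```

Here `g.contains_star("workshop", "keyword")` holds because the keyword is reached through containment edges only, while `g.contains_star("cite", "keyword")` does not.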

  16. Keyword Query Results • There are two possible semantics for keyword search queries: • Conjunctive keyword query semantics • Elements that contain all of the query keywords are returned. • Disjunctive keyword query semantics • Elements that contain at least one of the query keywords are returned. • This paper focuses on conjunctive keyword query semantics.

  17. Keyword Query Results - 2 • Q = {k1, …, kn}. • R0 = {v | v ∈ NE ∧ ∀k ∈ Q (contains*(v, k))} is the set of elements that directly or indirectly contain all of the query keywords. • Result(Q) = {v | ∀k ∈ Q ∃c ∈ N ((v, c) ∈ CE ∧ c ∉ R0 ∧ contains*(c, k))} • This ensures that only the most specific results are returned. • It also ensures that an element that has multiple independent occurrences of the query keywords is returned. • CE edges are considered for the result set; HE edges are considered for ranking.
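As a rough illustration of the two set definitions above, the sketch below computes R0 and Result(Q) by brute force over a small, made-up containment tree. The node names, the dictionary-based tree encoding and the helper names are assumptions; this is a direct transcription of the definitions, not the query algorithm used by XRANK.

```python
def contains_star(node, keyword, children, text):
    """contains*(node, keyword): keyword appears in node or any descendant."""
    if node in text:  # value node
        return keyword in text[node].lower().split()
    return any(contains_star(c, keyword, children, text)
               for c in children.get(node, []))


def compute_r0(query, elements, children, text):
    # Elements that directly or indirectly contain every query keyword.
    return {v for v in elements
            if all(contains_star(v, k, children, text) for k in query)}


def compute_result(query, elements, children, text):
    r0 = compute_r0(query, elements, children, text)
    # v qualifies if every keyword is reachable through some child that is
    # not itself in R0 (value nodes are never in R0, since R0 ⊆ NE).
    return {v for v in elements
            if all(any(c not in r0 and contains_star(c, k, children, text)
                       for c in children.get(v, []))
                   for k in query)}


# Hypothetical data: one workshop with two papers.
children = {
    "workshop": ["paper1", "paper2"],
    "paper1": ["title1", "body1"], "title1": ["t1"], "body1": ["b1"],
    "paper2": ["title2", "body2"], "title2": ["t2"], "body2": ["b2"],
}
text = {"t1": "xml search engines", "b1": "ranking",
        "t2": "xml storage", "b2": "keyword search"}
elements = set(children)
query = ["xml", "search"]

r0 = compute_r0(query, elements, children, text)
result = compute_result(query, elements, children, text)
```

In this example `paper1` and `workshop` are in R0 but not in the result, because their keyword occurrences are already covered by a more specific result; `paper2` is returned because its two keywords come from different children.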

  18. Keyword Query Results - 3 • XML elements provide more context information • This also poses interesting user-interface challenges. • One solution is to allow the user to navigate up to the ancestors of the query result. • Another solution is to predefine a set of "answer nodes" AN; this may require knowledge of the domain and the underlying XML schema. • XRANK supports both.

  19. Ranking Keyword Query Results • Desired properties of the ranking function: 1) Result specificity: more specific results should be ranked higher than less specific results; this is one dimension of result proximity. 2) Keyword proximity: another dimension of result proximity. 3) Hyperlink awareness: the ranking should use the hyperlinked structure of XML documents.

  20. Ranking Function: Definition • ElemRank is defined at the granularity of an element and takes the nested structure of XML into account. • Similar to Google's PageRank • Q = (k1, k2, …, kn) • R = Result(Q) • A result element v1 ∈ R • First define the ranking of v1 with respect to one query keyword ki, r(v1, ki), before defining the overall rank, rank(v1, Q).

  21. Ranking with respect to one keyword • There exists a sub-element/value node v2 of v1 such that v2 ∉ R0 and contains*(v2, ki). • There is a sequence of containment edges in CE of the form (v1, v2), (v2, v3), …, (vt, vt+1) such that vt+1 is a value node that directly contains the keyword ki.

  22. Ranking with respect to one keyword • r(v1, ki) does not depend on the ElemRank of the result node v1, except when v1 = vt, for 2 reasons: 1. Less specific results should indeed get lower ranks. 2. ElemRank(vt) is in fact related to ElemRank(v1) due to certain properties of containment edges. • For multiple occurrences of ki in v1, the ranks are combined with an aggregation function f • f = max

  23. Overall Ranking • The overall ranking is the sum of the ranks with respect to each query keyword, multiplied by a measure of keyword proximity p(v1, k1, k2, …, kn).
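The formula for r(v1, ki) did not survive in this transcript. The sketch below assumes the form recalled from the XRANK paper, r(v1, ki) = ElemRank(vt) × decay^(t-1), where t - 1 is the number of containment edges between the result v1 and the element vt that directly contains the keyword. The default decay of 0.75, the function names and the caller-supplied proximity value are illustrative assumptions only.

```python
def keyword_rank(occurrences, decay=0.75):
    """Rank of a result with respect to one keyword.

    occurrences: list of (elem_rank_of_vt, t) pairs, one per occurrence of
    the keyword under the result element. Assumed formula (not preserved on
    the slides): ElemRank(vt) * decay ** (t - 1). Multiple occurrences are
    combined with f = max, as stated on slide 22.
    """
    return max(elem_rank * decay ** (t - 1) for elem_rank, t in occurrences)


def overall_rank(per_keyword_occurrences, proximity, decay=0.75):
    """Sum of per-keyword ranks, scaled by a keyword proximity measure.

    proximity is supplied by the caller here; how p(v1, k1, ..., kn) is
    computed is outside the scope of this sketch.
    """
    total = sum(keyword_rank(occ, decay) for occ in per_keyword_occurrences)
    return total * proximity


# Hypothetical ElemRank values: keyword 1 occurs directly in the result
# (t = 1); keyword 2 occurs two levels down (t = 3) and also directly.
rank = overall_rank([[(0.4, 1)], [(0.4, 3), (0.2, 1)]], proximity=0.5, decay=0.5)
```

With decay = 0.5 the deep occurrence contributes 0.4 × 0.25 = 0.1, so the direct occurrence (0.2) wins under f = max, which matches the intent that less specific matches are penalized.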

  24. XRANK System Architecture

  25. XRANK System Architecture-2 • ElemRank Computation Module • Computes the ElemRanks of XML elements • The ElemRanks are combined with ancestor information to generate an index structure called HDIL • The Query Evaluator Module • Evaluates queries using HDIL • Returns ranked results

  26. ElemRank Computation Module • ElemRank is a measure of the objective importance of an XML element and is based on the hyperlinked structure of XML documents. • The PageRank function is the sum of 2 probabilities • Visiting v at random (probability 1 - d, with d = 0.85) • Visiting v by navigating hyperlinks (probability d)
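For reference, the standard PageRank recurrence that ElemRank builds on can be sketched as a simple power iteration: p(v) = (1 - d)/N + d × Σ p(u)/Nh(u) over all u that link to v, where Nh(u) is the number of outgoing hyperlinks of u. The function name, the fixed iteration count and the toy link graph are assumptions; the sketch also assumes every node has at least one outgoing link, so no dangling-node handling is included.

```python
def pagerank(nodes, links, d=0.85, iterations=50):
    """Plain PageRank by power iteration.

    nodes: list of node ids; links: list of (u, v) hyperlink edges.
    Assumes every node has at least one outgoing link.
    """
    n = len(nodes)
    out_degree = {u: 0 for u in nodes}
    for u, _ in links:
        out_degree[u] += 1

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Random-jump term: probability (1 - d) spread evenly over all nodes.
        new_rank = {v: (1 - d) / n for v in nodes}
        # Navigation term: each node passes d * rank evenly along its links.
        for u, v in links:
            new_rank[v] += d * rank[u] / out_degree[u]
        rank = new_rank
    return rank


ranks = pagerank(["a", "b", "c"], [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
```

Rank only flows in the direction of the links here, which is the unidirectional behaviour that the ElemRank refinements on the following slides address.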

  27. ElemRank Computation Module • PageRank is unidirectional • Forward ElemRank propagation • Paper → section • Reverse ElemRank propagation • Paper → workshop

  28. Refinements of PageRank • Bi-directional transfer of ElemRanks • Discrimination between containment and hyperlink edges • Aggregate ElemRanks for reverse containment relationships

  29. Bi-directional Transfer of ElemRanks • A simple solution is to add reverse containment edges. • Drawback: this does not distinguish between containment and hyperlink edges.

  30. Discrimination between containment and hyperlink edges • Drawback: this refinement weights forward and reverse containment relationships similarly.

  31. Aggregate ElemRanks for reverse containment relationships

  32. XRANK System Efficiently Evaluating XML Keyword Search Queries

  33. Efficiently Evaluating XML Keyword Search Queries • Naïve Approach • Dewey Inverted List (DIL) • Ranked Dewey Inverted List (RDIL) • Hybrid Dewey Inverted List (HDIL)

  34. Naïve Approach • Main Difference between XML and HTML keyword search: • The granularity of query results • XML keyword search returns elements • HTML keyword search returns documents • One way to do XML keyword search • Treat each element as a document

  35. Problems of Naïve Approach • Space Overhead • Spurious Query Results • Inaccurate ranking of results

  36. Space Overhead • An inverted list contains, for each keyword, the list of documents that contain the keyword • For XML documents, this becomes the list of elements • This causes a large space overhead, because each inverted list contains • Every XML element that directly contains the keyword • All ancestors of each such element, redundantly

  37. Spurious Query Results • The naïve approach ignores ancestor-descendant relationships. • All elements treated as independent documents • Results will not correspond to the desired semantics for XML keyword search

  38. Inaccurate Ranking of Results • Existing approaches do not take result specificity into account when ranking results.

  39. Dewey Inverted List (DIL) • Naïve approach has drawbacks: • Decouples representation of ancestors and descendants. • Dewey encoding of Element IDs jointly captures ancestor and descendant information.

  41. DIL • An interesting feature: • ID of an ancestor is a prefix of the ID of a descendant. • Ancestor-descendant relationships are implicitly captured in the Dewey ID.
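The prefix property can be made concrete with a short sketch. Representing a Dewey ID as a tuple of integers parsed from a dotted string such as "5.0.3" is an assumption chosen for illustration, as are the helper names.

```python
def parse_dewey(dotted):
    """Turn a dotted Dewey ID such as '5.0.3' into a tuple of ints."""
    return tuple(int(part) for part in dotted.split("."))


def is_ancestor(a, d):
    """True if a is a proper ancestor of d, i.e. a is a strict prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a


def longest_common_prefix(a, b):
    """Dewey ID of the deepest common ancestor (or self) of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


section = parse_dewey("5.0.3")
paragraph = parse_dewey("5.0.3.0.1")
```

No tree traversal is needed to test ancestry: `is_ancestor(section, paragraph)` is just a tuple-prefix comparison.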

  42. DIL Data Structure • The inverted list for a keyword k contains the Dewey IDs of all the XML elements that directly contain the keyword k. • For multiple documents: • First component of each Dewey ID is the document ID

  43. DIL Data Structure -2 • Besides the Dewey ID, an entry in DIL contains: • The ElemRank of the corresponding XML element • The list of all positions where the keyword k appears in that element • Entries are sorted by Dewey ID • The size of DIL is smaller than that of the naïve approach, because ancestors are not stored explicitly.
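A minimal sketch of building such an index follows. The input format (a list of `(dewey_id, elem_rank, text)` triples for elements that directly contain text), the ElemRank values, and the use of within-element word offsets as positions are all assumptions made for illustration.

```python
from collections import defaultdict


def build_dil(elements):
    """Build keyword -> list of (dewey_id, elem_rank, positions) entries.

    elements: iterable of (dewey_id, elem_rank, text), where text is the
    text directly contained in the element. Only these elements are
    indexed; ancestors are implied by Dewey ID prefixes.
    """
    index = defaultdict(list)
    for dewey_id, elem_rank, text in elements:
        positions = defaultdict(list)
        for pos, word in enumerate(text.lower().split()):
            positions[word].append(pos)
        for word, pos_list in positions.items():
            index[word].append((dewey_id, elem_rank, pos_list))
    # Entries in every inverted list are kept sorted by Dewey ID.
    for entries in index.values():
        entries.sort(key=lambda entry: entry[0])
    return dict(index)


# Hypothetical elements; the first Dewey component is the document ID.
dil = build_dil([
    ((5, 0, 3, 0, 1), 0.2, "ranked keyword search"),
    ((5, 0, 0), 0.5, "XRANK ranked search"),
    ((6, 0), 0.1, "XML search"),
])
```

Each keyword occurrence is stored once, under the element that directly contains it, which is where the space saving over the naïve per-element index comes from.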

  45. DIL Query Processing • An algorithm that works in a single pass over the query keyword inverted lists. • The key idea: • Merge the query keyword inverted lists • Simultaneously compute the longest common prefix of the Dewey IDs in different lists.
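The single-pass idea can be sketched with a stack that mirrors the current Dewey ID. This is a simplified illustration under stated assumptions, not the paper's exact algorithm: it tracks only which keywords each element contains (no ranks or position lists), and the function name and input format are hypothetical.

```python
import heapq


def dil_query(inverted_lists):
    """Single merge pass over Dewey-sorted inverted lists.

    inverted_lists: one sorted list of Dewey-ID tuples per query keyword.
    Returns the Dewey IDs of the most specific elements that contain
    every keyword (conjunctive semantics).
    """
    n = len(inverted_lists)
    # Tag each ID with the index of the keyword list it came from.
    tagged = [[(dewey, i) for dewey in lst] for i, lst in enumerate(inverted_lists)]
    path = []     # Dewey components of the element on top of the stack
    stack = []    # per component: [keyword indexes seen, contains-all flag]
    results = []

    def pop():
        seen, contains_all = stack.pop()
        dewey = tuple(path)
        path.pop()
        if len(seen) == n:          # every keyword found: emit a result
            results.append(dewey)
            contains_all = True
        if stack:
            if contains_all:        # keywords are already accounted for below
                stack[-1][1] = True
            else:                   # pass partial matches up to the parent
                stack[-1][0] |= seen

    for dewey, kw in heapq.merge(*tagged):
        # Longest common prefix between the stack and the next Dewey ID.
        common = 0
        while common < min(len(path), len(dewey)) and path[common] == dewey[common]:
            common += 1
        while len(path) > common:
            pop()
        for component in dewey[common:]:
            path.append(component)
            stack.append([set(), False])
        stack[-1][0].add(kw)
    while path:
        pop()
    return results


# Keyword 0 occurs in elements 0.0.0 and 0.1.0; keyword 1 in 0.0.0 and 0.1.1.
hits = dil_query([[(0, 0, 0), (0, 1, 0)], [(0, 0, 0), (0, 1, 1)]])
```

In the example, element 0.0.0 contains both keywords directly, and element 0.1 contains them through two different children; the root 0 is not returned because both of its children already account for every keyword.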

  48. Ranked Dewey Inverted List (RDIL) • “If inverted lists are long (due to common keywords or large document collections) even the cost of a single scan of the inverted list can be expensive, especially if the users want only the top few results.”

  49. RDIL -2 • One solution: • Order the inverted lists by the ElemRank instead of by the Dewey ID. • Higher ranked results will appear first in the inverted list. • Threshold Algorithm.

  50. RDIL Data Structure • RDIL is similar to DIL except that: • Inverted lists are ordered by ElemRank, • Each inverted list has a B+-tree index of the Dewey ID field.
