
XML Information Retrieval

XML Information Retrieval. Hui Fang, Department of Computer Science, University of Illinois at Urbana-Champaign. Some slides are borrowed from Norbert Fuhr’s XML Tutorial. Outline: XML basics; Research Topics; XML IR (Tasks, Retrieval methods); Clustering XML documents. XML standards.



  1. XML Information Retrieval • Hui Fang, Department of Computer Science, University of Illinois at Urbana-Champaign • Some slides are borrowed from Norbert Fuhr’s XML Tutorial.

  2. Outline • XML basics • Research Topics • XML IR • Tasks • Retrieval methods • Clustering XML documents

  3. XML standards

  4. Basic XML • Hierarchical document format for information exchange on the WWW • Self-describing data (tags) • Nested element structure having a root • Element data can have • Attributes • Sub-elements (Slides from Jayavel Shanmugasundaram)

  5. Example XML document (the slide labels one element and one attribute) <?xml version="1.0" encoding="ISO-8859-1" ?> <!--Edited with XML Spy v4.2 --> <book> <title>Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW</title> <author id="rbelew"> <name> <firstname>Richard</firstname> <lastname>Belew</lastname> </name> <address> <city>San Diego</city> <zip>92093</zip> </address> </author> </book>

  6. Tree structure of XML documents • book • title: Finding… • author (id="rbelew") • name • first name: Richard • last name: Belew • address • city: San Diego • zip code: 92093
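The tree view above can be reproduced with a few lines of Python. This is a minimal sketch using the standard library's `xml.etree.ElementTree`; the shortened title and the helper name `outline` are assumptions made for illustration and are not part of the slides.

```python
import xml.etree.ElementTree as ET

# Shortened version of the book example from the slides.
BOOK_XML = """<book>
  <title>Finding Out About</title>
  <author id="rbelew">
    <name><firstname>Richard</firstname><lastname>Belew</lastname></name>
    <address><city>San Diego</city><zip>92093</zip></address>
  </author>
</book>"""


def outline(elem, depth=0):
    """Return one indented line per element: tag, attributes, own text."""
    text = (elem.text or "").strip()
    attrs = "".join(f' {key}="{value}"' for key, value in elem.attrib.items())
    line = "  " * depth + elem.tag + attrs + (f": {text}" if text else "")
    lines = [line]
    for child in elem:  # children appear in document order
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(BOOK_XML)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one level of nesting, so the root element `book` is printed flush left and `city` three levels deep.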

  7. Basic XML standard does not deal with … • Standardization of element names → XML namespaces • Structure of element content → XML DTDs • Data types of element content → XML schema

  8. XML namespace • Provide a method to avoid element name conflicts • Two documents use <table> with different meanings: <table> <tr> <td>Apples</td> <td>Bananas</td> </tr> </table> and <table> <name>GPA Table</name> <width>80</width> <length>120</length> </table>

  9. XML namespace (Cont.) • Provide a method to avoid element name conflicts • With prefixes bound to namespaces, the two tables no longer clash: <h:table xmlns:h="http://www.w3.org/TR/html4/"> <h:tr> <h:td>Apples</h:td> <h:td>Bananas</h:td> </h:tr> </h:table> and <f:table xmlns:f="http://www.w3schools.com/gpa"> <f:name>GPA Table</f:name> <f:width>80</f:width> <f:length>120</f:length> </f:table>
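To see how a parser keeps the two `table` elements apart, here is a small sketch with Python's `xml.etree.ElementTree`. The wrapping `<root>` element is an assumption added so that both tables fit into one well-formed document; the namespace URIs are the ones on the slide.

```python
import xml.etree.ElementTree as ET

# Both tables in one document; <root> is a hypothetical wrapper element.
DOC = """<root xmlns:h="http://www.w3.org/TR/html4/"
      xmlns:f="http://www.w3schools.com/gpa">
  <h:table><h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr></h:table>
  <f:table><f:name>GPA Table</f:name><f:width>80</f:width></f:table>
</root>"""

# Prefix-to-URI mapping used when searching; the prefixes are local names only.
NS = {
    "h": "http://www.w3.org/TR/html4/",
    "f": "http://www.w3schools.com/gpa",
}

root = ET.fromstring(DOC)
html_tables = root.findall("h:table", NS)
gpa_tables = root.findall("f:table", NS)
cells = [td.text for td in root.findall(".//h:td", NS)]
gpa_name = gpa_tables[0].find("f:name", NS).text

# Internally each tag is stored with its full namespace URI.
print(html_tables[0].tag)
print(gpa_tables[0].tag)
```

The parser expands each prefix to its URI, so the two elements carry different qualified names even though both are written as `table`.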

  10. XML Document Type Definition • Define the document structure with a list of legal elements • Document: <?xml version="1.0"?> <!DOCTYPE note SYSTEM "note.dtd"> <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Have a rest!</body> </note> • DTD (note.dtd): <!ELEMENT note (to,from,heading,body)> <!ELEMENT to (#PCDATA)> <!ELEMENT from (#PCDATA)> <!ELEMENT heading (#PCDATA)> <!ELEMENT body (#PCDATA)>

  11. Research Topics related to XML

  12. Research Topics • IR areas • Retrieval Models • Query Languages • … • DB areas • Query Languages • System architecture • Apply relational DB technology to XML data • Streaming XML • XML Query Processing • XML indexing and compression • …

  13. XML IR

  14. INEX: Initiative for the Evaluation of XML Retrieval • Documents: 12,107 articles in XML format • Queries: 30 Content-only; 30 Content and structure • Relevance Assessments: by participating groups • Participants: 36 active groups in 2003

  15. CO search task • Document as hierarchical structure of nested elements • Type of elements is not considered • Query refers to content only • Query syntax as in standard text retrieval • Task: Find the smallest subtree (element) satisfying the query
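The CO task of finding the smallest subtree that satisfies a query can be sketched as a recursive search. This is a simplified illustration that assumes Boolean matching (every query term must occur in the subtree); the systems discussed later rank elements instead. The sample document and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<article>
  <sec id="s1"><p>augmented reality</p><p>medicine and surgery</p></sec>
  <sec id="s2"><p>virtual reality games</p></sec>
</article>"""


def terms_of(elem):
    """Lower-cased terms found anywhere in the element's subtree."""
    return set(" ".join(elem.itertext()).lower().split())


def smallest_matches(elem, query_terms):
    """Return the deepest elements whose subtree contains every query term."""
    if not query_terms <= terms_of(elem):
        return []  # this subtree cannot satisfy the query
    hits = []
    for child in elem:
        hits.extend(smallest_matches(child, query_terms))
    # If no child satisfies the query alone, this element is the smallest one.
    return hits or [elem]


root = ET.fromstring(DOC)
result = smallest_matches(root, {"augmented", "medicine"})
print([(e.tag, e.get("id")) for e in result])
```

Neither paragraph in `s1` contains both terms, so the section itself is returned rather than a paragraph or the whole article.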

  16. Example of CO Topic <INEX-Topic topic-id="45" query-type="CO" ct-no="056"> <Title> <cw>augmented reality and medicine</cw> </Title> <Description> How virtual (or augmented) reality can contribute to improve the medical and surgical practice. </Description> <Narrative> In order to be considered relevant, a document/component must include considerations about applications of computer graphics and especially augmented (or virtual) reality to medicine (including surgery). </Narrative> <Keywords> Augmented virtual reality medicine surgery improve computer assisted aided image </Keywords> </INEX-Topic>

  17. CAS search task • Queries contain explicit references to the XML structure, by restricting • The context of interest • <te>: target element • The context of certain search concepts • (<cw>, <ce>) pairs

  18. Example of CAS topic <INEX-Topic topic-id="09" query-type="CAS" ct-no="048"> <Title> <te>article</te> <cw>non-monotonic reasoning</cw> <ce>bdy/sec</ce> <cw>1999 2000</cw> <ce>hdr//yr</ce> <cw>-calendar</cw> <ce>tig/atl</ce> <cw>belief revision</cw> </Title> <Narrative> Retrieve all articles from the years 1999-2000 that deal with works on non-monotonic reasoning. Do not retrieve CfPs/calendar entries. </Narrative> <Keywords> non-monotonic reasoning belief revision </Keywords> </INEX-Topic>

  19. XML Retrieval Methods • XIRQL • XML query languages with IR-related features • Language models • JuruXML

  20. XIRQL (I) • CO approaches: • Split document text into disjoint nodes • Index nodes separately • Aggregate indexing weights for higher-level elements (subtrees)

  21. Index nodes as units for term weighting • Application of known indexing functions (e.g. tf*idf) • [Figure: tree of a document with class="H.3.3", containing author (John Smith), title (XML Retrieval), and two chapters; headings include Introduction, XML Query Lang. XQL, Examples, and Syntax; text fragments "This..." and "We describe syntax of XQL"; index nodes numbered 1 to 5]

  22. Index nodes for relevance-oriented search • Q1: syntax ∧ example • Q2: XQL • [Figure: the document tree of the previous slide (document class="H.3.3", author John Smith, title, chapters, sections, headings) with index nodes numbered 1 to 5]

  23. Combining weights … by disjunction • Tree: chapter (0.3 XQL) containing section1 and section2, with section-level weights 0.5 example, 0.8 XQL, 0.7 syntax • Q1: syntax ∧ example: chapter-level weights 0.7 syntax and 0.5 example, combined as 0.7*0.5=0.35 • Q2: XQL: chapter-level weight 0.8+0.3-0.8*0.3=0.86 • Need to return most specific element satisfying the query!

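  24. Combining weights … with augmentation weight • Tree: chapter (0.3 XQL) containing section1 and section2, with section-level weights 0.5 example, 0.8 XQL, 0.7 syntax • Augmentation weight 0.6 on each of the two section-to-chapter edges • Propagated weights at the chapter: 0.30 example, 0.42 syntax, 0.48 XQL • Q2: XQL: chapter-level weight 0.48+0.3-0.48*0.3=0.64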
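The arithmetic on the two "Combining weights" slides can be checked with a short sketch. It assumes, as the slide formulas suggest, that weights are treated as probabilities of independent events: a disjunction combines as a + b - a*b and a conjunction as a product. The function names `p_or` and `p_and` are made up for this illustration.

```python
def p_or(*weights):
    """Probabilistic disjunction; for two weights this is a + b - a*b."""
    result = 0.0
    for w in weights:
        result = result + w - result * w
    return result


def p_and(*weights):
    """Probabilistic conjunction of independent events: plain product."""
    result = 1.0
    for w in weights:
        result *= w
    return result


AUGMENTATION = 0.6  # down-weights evidence propagated from a section to the chapter

# Without augmentation: section weights are propagated unchanged.
xql_plain = p_or(0.8, 0.3)             # Q2 at chapter level
q1_plain = p_and(0.7, 0.5)             # Q1 (syntax AND example) at chapter level

# With augmentation: each propagated weight is multiplied by 0.6 first.
xql_augmented = p_or(AUGMENTATION * 0.8, 0.3)
example_augmented = AUGMENTATION * 0.5
syntax_augmented = AUGMENTATION * 0.7

print(round(xql_plain, 2), round(q1_plain, 2), round(xql_augmented, 2))
```

With augmentation the chapter scores 0.64 for XQL, below the 0.8 of the section that actually contains the term, so the more specific element ranks first.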

  25. XIRQL(II) • CAS approaches • Extension of XQL by • Weighting and ranking • Data types with vague predicates • Structural relativism

  26. XQL Expressions • Path condition • search for single elements: heading • parent-child: chapter/heading • ancestor-descendant: chapter//section • document root: /book/* • Filter wrt. structure: //chapter[heading] • Filter wrt. content: /document[@class="H.3.3" $and$ author="John Smith"]
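The path conditions above have close counterparts in the limited XPath subset of Python's `xml.etree.ElementTree`. The sketch below is an approximation, not an XQL implementation: XQL syntax such as `$and$` is not supported by ElementTree, so the content filter is written as ordinary Python. The sample document is a rough reconstruction of the tree figure from the XIRQL slides and should be read as an assumption.

```python
import xml.etree.ElementTree as ET

DOC = """<document class="H.3.3">
  <author>John Smith</author>
  <title>XML Retrieval</title>
  <chapter><heading>Introduction</heading><p>This ...</p></chapter>
  <chapter><heading>XML Query Lang. XQL</heading>
    <section><heading>Examples</heading></section>
    <section><heading>Syntax</heading><p>We describe syntax of XQL</p></section>
  </chapter>
</document>"""

root = ET.fromstring(DOC)

# heading: single elements anywhere in the tree
all_headings = [h.text for h in root.iter("heading")]

# chapter/heading: parent-child
chapter_headings = [h.text for h in root.findall("chapter/heading")]

# chapter//section: ancestor-descendant
sections = root.findall("chapter//section")

# //chapter[heading]: structural filter (chapters that have a heading child)
chapters_with_heading = root.findall(".//chapter[heading]")

# /document[@class="H.3.3" $and$ author="John Smith"]: content filter,
# expressed in Python because ElementTree has no boolean "and" in predicates.
matches_filter = (
    root.tag == "document"
    and root.get("class") == "H.3.3"
    and root.findtext("author") == "John Smith"
)

print(all_headings, chapter_headings, len(sections), matches_filter)
```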

  27. Data types with vague predicates • Compares two values of a specific data type • E.g. near, broader, narrower • Returns (probabilistic) matching value • E.g. “Search for an artist named Ulbrich, living in Frankfurt, Germany about 100 years ago” • Candidate: Ernst Olbrich, Darmstadt, 1899 • P(Olbrich ≈ Ulbrich)=0.8 (phonetic similarity) • P(1899 ≈ 1903)=0.9 (numeric similarity) • P(Darmstadt ≈ Frankfurt)=0.7 (geographic distance)
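The slide gives one matching value per condition but does not say how they are combined into a score for the candidate. One plausible reading, assumed here, is that the three conditions are joined by a conjunction of independent events, so the values multiply. The dictionary keys are labels chosen for this sketch.

```python
from math import prod

# Matching values from the slide for the candidate "Ernst Olbrich, Darmstadt, 1899".
match_probabilities = {
    "name (phonetic similarity)": 0.8,
    "year (numeric similarity)": 0.9,
    "place (geographic distance)": 0.7,
}

# Assumption: conditions are independent and all must hold (conjunction).
candidate_score = prod(match_probabilities.values())
print(round(candidate_score, 3))
```

Under this assumption the candidate is retrieved with a score of about 0.5 instead of being rejected outright, which is the point of vague predicates.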

  28. Semantic Relativism • Drop distinction attribute/element: ~author searches for attribute or element • Generalize to data types: #personname searches for attribute/elements of specific data type

  29. Language models • Generate language models for each node in the tree • Combine the children language models using linear interpolation • Use EM approach to train the linear interpolation parameters

  30. Element-specific language models---CO Approaches

  31. Higher level nodes: mixture of language models • Query: dog and cat • [Figure: a parent node mixing two child language models with weights 0.5 and 0.5]
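The mixture idea can be illustrated with unigram models. This is a minimal sketch under several assumptions: maximum-likelihood estimates without smoothing, fixed interpolation weights of 0.5 (the slides train them with EM, which is not shown here), and invented child texts.

```python
from collections import Counter


def mle(text):
    """Maximum-likelihood unigram model of a text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def mix(models, lambdas):
    """Linear interpolation: P(w) = sum_i lambda_i * P_i(w)."""
    vocabulary = set().union(*models)
    return {
        word: sum(lam * model.get(word, 0.0) for model, lam in zip(models, lambdas))
        for word in vocabulary
    }


def query_likelihood(model, query):
    """Probability that the model generates every query term."""
    probability = 1.0
    for word in query.lower().split():
        probability *= model.get(word, 0.0)
    return probability


title_model = mle("dog training")      # P(dog) = 0.5
body_model = mle("cat and dog food")   # P(cat) = P(dog) = 0.25
parent_model = mix([title_model, body_model], [0.5, 0.5])

score_title = query_likelihood(title_model, "dog cat")
score_parent = query_likelihood(parent_model, "dog cat")
print(score_title, score_parent)
```

The title alone cannot generate "cat", so its score is zero, while the parent node inherits evidence from both children and receives a non-zero score.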

  32. Type-specific language models--- CAS approaches

  33. • “Return components of type x that have a component y containing the query term w” • e.g. return documents where the title contains the word “bird” • e.g. return documents where the body’s first section contains the word “dog” • [Figure: language-model tree with four weights of 0.5]

  34. JuruXML • Element-specific indexing + vector space model: • Transform query into set of (term, path) conditions • Vague matching of path conditions • Modified cosine similarity as retrieval function

  35. JuruXML(1)---Transform Query

  36. JuruXML(2)---Vague matching of path conditions

  37. JuruXML(3)---Retrieval function • Standard cosine similarity • wQ(ti): query term weight of term ti • wD(ti): indexing weight of term ti in the document • Modified cosine similarity • wQ(ti, ciQ): query term weight of pair (ti, ciQ) • wD(ti, ciD): indexing weight of pair (ti, ciD) in the document
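The formulas themselves did not survive in this transcript, so the following is a sketch of one plausible form rather than the exact JuruXML function. The standard cosine is the usual normalized dot product. The modified version sums over query and document pairs that share a term and scales each product by a context resemblance value; the `context_resemblance` function below is a hypothetical stand-in for the vague path matching.

```python
import math


def norm(weights):
    """Euclidean length of a weight vector stored as a dict."""
    return math.sqrt(sum(w * w for w in weights.values()))


def cosine(wq, wd):
    """Standard cosine: weights are keyed by term."""
    denominator = norm(wq) * norm(wd)
    dot = sum(w * wd.get(term, 0.0) for term, w in wq.items())
    return dot / denominator if denominator else 0.0


def context_resemblance(query_context, doc_context):
    """Hypothetical vague path match: 1.0 if equal, 0.5 if the document
    path ends with the query path, 0.0 otherwise."""
    if query_context == doc_context:
        return 1.0
    return 0.5 if doc_context.endswith(query_context) else 0.0


def modified_cosine(wq, wd, cr=context_resemblance):
    """Weights are keyed by (term, context) pairs; matching terms contribute
    in proportion to how well their contexts resemble each other."""
    denominator = norm(wq) * norm(wd)
    total = 0.0
    for (q_term, q_context), q_weight in wq.items():
        for (d_term, d_context), d_weight in wd.items():
            if q_term == d_term:
                total += q_weight * d_weight * cr(q_context, d_context)
    return total / denominator if denominator else 0.0


query = {("xml", "/title"): 1.0}
exact_doc = {("xml", "/title"): 1.0}
nested_doc = {("xml", "/article/title"): 1.0}
other_doc = {("xml", "/article/abstract"): 1.0}

print(modified_cosine(query, exact_doc),
      modified_cosine(query, nested_doc),
      modified_cosine(query, other_doc))
```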

  38. JuruXML(4)---Alternative approach (Merging contexts) • For each query term (ti,ciQ) treat all matched document terms (ti,cjD) equally from the user perspective. • Define a weight function w(ciQ) • E.g.

  39. Clustering XML documents

  40. Document similarity • Document representation: document → N-dimensional vector • N = # document features • Feature sets • Text only • Tags only • Text + Tags • Feature weighting in the document vector • Similarity measure: vector similarity • E.g. cosine measure
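The three feature sets can be compared on a toy pair of documents. This sketch assumes raw occurrence counts as feature weights and marks tag features with angle brackets so they cannot collide with text terms; both choices are illustrative and not taken from the slides.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def features(xml_text, mode="text+tags"):
    """Build a sparse feature vector from text terms, tag names, or both."""
    root = ET.fromstring(xml_text)
    vector = Counter()
    if mode in ("tags", "text+tags"):
        vector.update(f"<{elem.tag}>" for elem in root.iter())
    if mode in ("text", "text+tags"):
        vector.update(" ".join(root.itertext()).lower().split())
    return vector


def cosine(a, b):
    """Cosine measure between two sparse vectors."""
    dot = sum(weight * b.get(feature, 0) for feature, weight in a.items())
    length_a = math.sqrt(sum(w * w for w in a.values()))
    length_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (length_a * length_b) if length_a and length_b else 0.0


DOC_A = "<book><title>xml retrieval</title></book>"
DOC_B = "<book><title>cooking pasta</title></book>"

similarity = {
    mode: cosine(features(DOC_A, mode), features(DOC_B, mode))
    for mode in ("text", "tags", "text+tags")
}
print(similarity)
```

The two documents share their structure but not their topic: they look identical with tags only, unrelated with text only, and half similar when both feature sets are given equal weight.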

  41. Clustering methods • Hierarchical clustering: • Main weakness: quadratic complexity • Partitional clustering: • K-means • Linear time complexity • Simplicity of its algorithm

  42. K-Means clustering algorithm
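The slide body is missing from this transcript, so here is a minimal sketch of the textbook K-means loop on small numeric vectors. It is not necessarily the variant used in the original slides. For reproducibility the first k points serve as initial centroids, which is an assumption; random initialization is more common.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain K-means; returns (assignment, centroids)."""
    centroids = list(points[:k])  # assumption: seed with the first k points
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, k=2)
print(assignment, centroids)
```

Each iteration touches every point once per centroid, which is where the linear time complexity mentioned on the previous slide comes from.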

  43. Measuring clustering quality • External quality: comparison of clusters with external classification • Entropy: distribution of classes within clusters • Purity: largest class in a cluster / cluster size • Internal quality: calculate average inter- and intra-cluster similarities • cohesiveness (overall similarity)
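The two external measures can be computed from the class labels of the documents in each cluster. This sketch assumes base-2 logarithms for entropy and a size-weighted average over clusters for the overall value; the slide does not state either choice, and the sample labels are invented.

```python
import math
from collections import Counter


def purity(cluster_labels):
    """Size of the largest class in the cluster divided by the cluster size."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)


def entropy(cluster_labels):
    """Entropy (in bits) of the class distribution within the cluster."""
    size = len(cluster_labels)
    value = 0.0
    for count in Counter(cluster_labels).values():
        p = count / size
        value -= p * math.log2(p)
    return value


def weighted_average(metric, clusters):
    """Average a per-cluster metric, weighting each cluster by its size."""
    total = sum(len(cluster) for cluster in clusters)
    return sum(len(cluster) / total * metric(cluster) for cluster in clusters)


# Each inner list holds the true class of every document in one cluster.
clusters = [["db", "db", "db", "ir"], ["ir", "ir"]]

overall_purity = weighted_average(purity, clusters)
overall_entropy = weighted_average(entropy, clusters)
print(round(overall_purity, 3), round(overall_entropy, 3))
```

Higher purity and lower entropy both indicate that clusters line up with the external classification.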

  44. Discussion • Text alone gives the best results • Text + tags: problem with weighting of tags vs. terms

  45. Conclusion • XML basics • XML Retrieval Tasks and methods • Clustering XML documents

  46. Bayesian Networks

  47. Context-dependent Retrieval • The score of an element is given by its RSV (Retrieval Status Value). • The RSV of a node depends on the RSVs of the nodes in its context (parent nodes). • Elements with the highest values are then presented to the user.

  48. Bayesian Networks

  49. Bayesian Networks(Cont.)
