XML Full-Text Search: Challenges and Opportunities

    Presentation Transcript
    1. XML Full-Text Search: Challenges and Opportunities
       Sihem Amer-Yahia, AT&T Labs – Research
       Jayavel Shanmugasundaram, Cornell University
       VLDB Tutorial on XML Full-Text Search

    2. Outline
       • Motivation
       • Full-Text Search Languages
       • Scoring
       • Query Processing
       • Open Issues

    3. Motivation
       • XML is able to represent a mix of structured and text information.
       • XML applications: digital libraries, content management.
       • XML repositories: IEEE INEX collection, LexisNexis, the Library of Congress collection.

    4. XML in Library of Congress
       http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml
       <bill bill-stage="Introduced-in-House">
         <congress>109th CONGRESS</congress>
         <session>1st Session</session>
         <legis-num>H. R. 2739</legis-num>
         <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
         <action>
           <action-date date="20050526">May 26, 2005</action-date>
           <action-desc>
             <sponsor name-id="T000266">Mr. Tierney</sponsor> (for himself,
             <cosponsor name-id="M001143">Ms. McCollum of Minnesota</cosponsor>,
             <cosponsor name-id="M000725">Mr. George Miller of California</cosponsor>)
             introduced the following bill; which was referred to the
             <committee-name committee-id="HED00">Committee on Education and the Workforce</committee-name>
           </action-desc>
         </action>
         …

    5. THOMAS: Library of Congress

    6. INEX Data
       <article>
         <fno>K0271</fno>
         <doi>10.1041/K0271s-2004</doi>
         <fm>
           <hdr>
             <hdr1><ti>IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING</ti>
               <crt> <issn>1041-4347</issn>/04/$20.00 &copy; 2004 IEEE Published by the IEEE Computer Society</crt></hdr1>
             <hdr2><obi><volno>Vol. 16</volno>, <issno>No. 2</issno></obi>
               <pdt><mo>FEBRUARY</mo><yr>2004</yr></pdt>
               <pp>pp. 271-288</pp></hdr2>
           </hdr>
           <tig><atl>A Graph-Based Approach for Timing Analysis and Refinement of OPS5 Knowledge-Based Systems</atl><pn>pp. 271-288</pn><ref rid="K02711aff" type="aff">*</ref></tig>
           <au sequence="first"><fnm>Albert Mo Kim</fnm><snm> <ref aid="K0271a1" type="prb">Cheng</ref></snm><role>Senior Member</role><aff><onm>IEEE</onm></aff></au>
           <au sequence="additional"><fnm>Hsiu-yen</fnm><snm> Tsai</snm></au>
           <abs><p><b>Abstract</b>&mdash;This paper examines the problem of predicting the timing behavior of knowledge-based systems for real-…

    7. Example INEX Query
       <inex_topic topic_id="275" query_type="CAS">
         <castitle>//article[about(.//abs, "data mining")]//sec[about(., "frequent itemsets")]</castitle>
         <description>sections about frequent itemsets from articles with abstract about data mining</description>
         <narrative>To be relevant, a component has to be a section about "frequent itemsets". For example, it could be about algorithms for finding frequent itemsets, or uses of frequent itemsets to generate rules. Also, the article must have an abstract about "data mining". I need this information for a paper that I am writing. It is a survey of different algorithms for finding frequent itemsets. The paper will also have a section on why we would want to find frequent itemsets.</narrative>
       </inex_topic>

    8. Challenges in XML FT Search
       • Searching over Semi-Structured Data
         • Users may specify a search context and a return context.
       • Expressive Power and Extensibility
         • Users should be able to express complex full-text searches and combine them with structural searches.
       • Scores and Ranking
         • Users may specify a scoring condition, possibly over both full-text and structured predicates, and obtain top-k results based on query relevance scores.
       • The language should allow for an efficient implementation.

    9. XML FT Search Definition
       • Context expression (XML elements searched):
         • pre-defined XML nodes.
         • XPath/XQuery queries.
       • Return expression (XML fragments returned):
         • pre-defined meaningful XML fragments.
         • XPath/XQuery to build answers.
       • Search expression (FT search conditions):
         • Boolean keyword search.
         • proximity distance, scoping, thesaurus, stop words, stemming.
       • Score expression:
         • system-defined scoring function.
         • user-defined scoring function.
         • query-dependent keyword weights.

    10. Outline
        • Motivation
        • Full-Text Search Languages
        • Scoring
        • Query Processing
        • Open Issues

    11. Four Classes of Languages
        • Keyword search (INEX Content-Only Queries)
          • "book xml"
        • Tag + Keyword search
          • book: xml
        • Path Expression + Keyword search
          • /book[./title about "xml db"]
        • XQuery + Complex full-text search
          • for $b in /book let score $s := $b ftcontains "xml" && "db" distance 5

    12. Outline
        • Motivation
        • Full-Text Search Languages
          • Simple Keyword Search
          • Tags + Keyword Search
          • Path Expressions + Keyword Search
          • XQuery + Complex Full-Text Search
        • Scoring
        • Query Processing
        • Open Issues

    13. XRank [Guo et al., SIGMOD 2003]
        <workshop date="28 July 2000">
          <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
          <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
          <proceedings>
            <paper id="1">
              <title> XQL and Proximal Nodes </title>
              <author> Ricardo Baeza-Yates </author>
              <author> Gonzalo Navarro </author>
              <abstract> We consider the recently proposed language … </abstract>
              <section name="Introduction">
                Searching on structured text is becoming more important with XML …
                <subsection name="Related Work"> The XQL language … </subsection>
              </section>
              …
              <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
            </paper>
            …

    14. XRank [Guo et al., SIGMOD 2003]
        <workshop date="28 July 2000">
          <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
          <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
          <proceedings>
            <paper id="1">
              <title> XQL and Proximal Nodes </title>
              <author> Ricardo Baeza-Yates </author>
              <author> Gonzalo Navarro </author>
              <abstract> We consider the recently proposed language … </abstract>
              <section name="Introduction">
                Searching on structured text is becoming more important with XML …
                <subsection name="Related Work"> The XQL language … </subsection>
              </section>
              …
              <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
            </paper>
            …
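
The keyword-search idea behind the XRank slides can be sketched as follows. This is a simplified illustration, not the XRank algorithm itself: it assumes a result is any element whose text (including descendants) contains every keyword while no child element does, and it ignores ranking. The document is a shortened copy of the slide's example; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened version of the workshop document shown on the slides.
DOC = """
<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <proceedings>
    <paper id="1">
      <title>XQL and Proximal Nodes</title>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">The XQL language</subsection>
      </section>
    </paper>
  </proceedings>
</workshop>
"""


def words(elem):
    """Lower-cased words in the element's text, descendants included."""
    return set("".join(elem.itertext()).lower().split())


def most_specific_matches(root, keywords):
    """Elements that contain all keywords while none of their children do."""
    wanted = {k.lower() for k in keywords}
    hits = []
    for elem in root.iter():
        if wanted <= words(elem) and not any(wanted <= words(c) for c in elem):
            hits.append(elem)
    return hits


root = ET.fromstring(DOC)
for hit in most_specific_matches(root, ["XQL", "language"]):
    print(hit.tag, hit.attrib)
```

For the query "XQL language" only the subsection is returned, because its ancestors also contain both keywords but are less specific.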

    15. XIRQL [Fuhr & Großjohann, SIGIR 2001]
        <workshop date="28 July 2000">
          <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
          <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
          <proceedings>
            <paper id="1">
              <title> XQL and Proximal Nodes </title>
              <author> Ricardo Baeza-Yates </author>
              <author> Gonzalo Navarro </author>
              <abstract> We consider the recently proposed language … </abstract>
              <section name="Introduction">
                Searching on structured text is becoming more important with XML …
                <em>The XQL language </em>
              </section>
              …
              <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
            </paper>
            …
        Index Node

    16. Similar Notion of Results
        • Nearest Concept Queries
          • [Schmidt et al., ICDE 2002]
        • XKSearch
          • [Xu & Papakonstantinou, SIGMOD 2005]

    17. Outline
        • Motivation
        • Full-Text Search Languages
          • Simple Keyword Search
          • Tags + Keyword Search
          • Path Expressions + Keyword Search
          • XQuery + Complex Full-Text Search
        • Scoring
        • Query Processing
        • Open Issues

    18. XSearch [Cohen et al., VLDB 2003]
        <workshop date="28 July 2000">
          <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
          <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
          <proceedings>
            <paper id="1">
              <title> XQL and Proximal Nodes </title>
              <author> Ricardo Baeza-Yates </author>
              <author> Gonzalo Navarro </author>
              <abstract> We consider the recently proposed language … </abstract>
              <section name="Introduction">
                Searching on structured text is becoming more important with XML …
              …
            </paper>
            <paper id="2">
              <title> XML Indexing </title>
              …
        <paper id="2">: Not a “meaningful” result

    19. Outline
        • Motivation
        • Full-Text Search Languages
          • Simple Keyword Search
          • Tags + Keyword Search
          • Path Expressions + Keyword Search
          • XQuery + Complex Full-Text Search
        • Scoring
        • Query Processing
        • Open Issues

    20. XPath [W3C 2005]
        • fn:contains($e, string) returns true iff $e contains string
          //section[fn:contains(./title, "XML Indexing")]
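
To illustrate why substring containment alone is a weak full-text primitive, the sketch below contrasts a substring test in the spirit of fn:contains with a token-based test. The two-section document and both helper functions are hypothetical, and the token-based variant is only one possible reading of "full-text" matching.

```python
import xml.etree.ElementTree as ET


def fn_contains(text, needle):
    """Plain substring test, in the spirit of fn:contains."""
    return needle in text


def contains_all_words(text, phrase):
    """Token-based alternative: every word of the phrase occurs, ignoring case."""
    tokens = set(text.lower().split())
    return all(word in tokens for word in phrase.lower().split())


# Hypothetical document with two section titles.
doc = ET.fromstring(
    "<doc>"
    "<section><title>XML Indexing</title></section>"
    "<section><title>Indexing of XML data</title></section>"
    "</doc>"
)
titles = [section.findtext("title") for section in doc.iter("section")]

substring_hits = [t for t in titles if fn_contains(t, "XML Indexing")]
token_hits = [t for t in titles if contains_all_words(t, "XML Indexing")]
print(substring_hits)
print(token_hits)
```

The substring test misses the second title because the words appear in a different order, while the token-based test matches both.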

    21. XIRQL [Fuhr & Großjohann, SIGIR 2001]
        • Weighted extension to XQL (precursor to XPath)
          //section[0.6 · .//* $cw$ "XQL" + 0.4 · .//section $cw$ "syntax"]

    22. XXL [Theobald & Weikum, EDBT 2002]
        • Introduces similarity operator ~
          Select Z
          From http://www.myzoos.edu/zoos.html
          Where zoos.#.zoo As Z
            and Z.animals.(animal)?.specimen as A
            and A.species ~ "lion"
            and A.birthplace.#.country as B
            and A.region ~ B.content

    23. NEXI [Trotman & Sigurbjörnsson, INEX 2004]
        • Narrowed Extended XPath I
        • INEX Content-and-Structure (CAS) Queries
          //article[about(.//title, apple) and about(.//sec, computer)]

    24. Outline
        • Motivation
        • Full-Text Search Languages
          • Simple Keyword Search
          • Tags + Keyword Search
          • Path Expressions + Keyword Search
          • XQuery + Complex Full-Text Search
        • Scoring
        • Query Processing
        • Open Issues

    25. Schema-Free XQuery [Li, Yu, Jagadish, VLDB 2003]
        • Meaningful least common ancestor (mlcas)
          for $a in doc("bib.xml")//author,
              $b in doc("bib.xml")//title,
              $c in doc("bib.xml")//year
          where $a/text() = "Mary" and exists mlcas($a,$b,$c)
          return <result> {$b,$c} </result>

    26. XQuery Full-Text [W3C 2005]
        • Two new XQuery constructs
        • FTContainsExpr
          • Expresses “Boolean” full-text search predicates
          • Seamlessly composes with other XQuery expressions
        • FTScoreClause
          • Extension to FLWOR expression
          • Can score FTContainsExpr and other expressions

    27. FTContainsExpr
        //book ftcontains "Usability" && "testing" distance 5
        //book[./content ftcontains "Usability" with stems]/title
        //book ftcontains /article[author="Dawkins"]/title
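
A rough sketch of what the first FTContainsExpr example asks for: both words must occur, and some pair of occurrences must be close together. The exact definition of "distance 5" is not given on the slide, so the code assumes it means the two token positions differ by at most 5; the tokenizer and function names are hypothetical.

```python
import re


def tokenize(text):
    """Lower-cased word tokens in document order."""
    return re.findall(r"\w+", text.lower())


def within_distance(text, word_a, word_b, max_distance):
    """True if some occurrence of word_a is at most max_distance tokens from word_b.

    Assumption: distance is the absolute difference of token positions.
    """
    tokens = tokenize(text)
    pos_a = [i for i, tok in enumerate(tokens) if tok == word_a.lower()]
    pos_b = [i for i, tok in enumerate(tokens) if tok == word_b.lower()]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


near = "Usability studies and testing of web sites"
far = "Usability is discussed in one chapter of the book and testing in another"
print(within_distance(near, "Usability", "testing", 5))
print(within_distance(far, "Usability", "testing", 5))
```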

    28. FTScore Clause
        In any order
          FOR $v [SCORE $s]? IN [FUZZY] Expr
          LET …
          WHERE …
          ORDER BY …
          RETURN
        Example
          FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing"]
          ORDER BY $s
          RETURN <result score={$s}> $b </result>

    29. FTScore Clause
        In any order
          FOR $v [SCORE $s]? IN [FUZZY] Expr
          LET …
          WHERE …
          ORDER BY …
          RETURN
        Example
          FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing" and ./price < 10.00]
          ORDER BY $s
          RETURN $b

    30. FTScore Clause
        In any order
          FOR $v [SCORE $s]? IN [FUZZY] Expr
          LET …
          WHERE …
          ORDER BY …
          RETURN
        Example
          FOR $b SCORE $s in FUZZY /pub/book[. ftcontains "Usability" && "testing"]
          ORDER BY $s
          RETURN $b

    31. XQuery Full-Text Evolution
        • 2002: Quark Full-Text Language (Cornell)
        • 2003: IBM, Microsoft, Oracle proposals; TeXQuery (Cornell, AT&T Labs)
        • 2004: XQuery Full-Text
        • 2005: XQuery Full-Text (Second Draft)

    32. Outline
        • Motivation
        • Full-Text Search Languages
        • Scoring
        • Query Processing
        • Open Issues

    33. Full-Text Scoring
        • Score value should reflect relevance of answer to user query. Higher scores imply a higher degree of relevance.
        • Queries return document fragments. Granularity of returned results affects scoring.
        • For queries containing conditions on structure, structural conditions may affect scoring.
        • Existing proposals extend common scoring methods: probabilistic or vector-based similarity.

    34. Granularity of Results
        • Keyword queries
          • compute possibly different scores for LCAs.
        • Tag + Keyword queries
          • compute scores based on tags and keywords.
        • Path Expression + Keyword queries
          • compute scores based on paths and keywords.
        • XQuery + Complex full-text queries
          • compute scores for (newly constructed) XML fragments satisfying XQuery (structural, full-text and scalar conditions).

    35. Outline
        • Motivation
        • Full-Text Search Languages
        • Scoring
          • Simple Keyword Search
          • Tags + Keyword Search
          • Path Expressions + Keyword Search
          • XQuery + Complex Full-Text Search
        • Query Processing
        • Open Issues

    36. Granularity of Results
        • Document as hierarchical structure of elements as opposed to flat document.
          • XXL [Theobald & Weikum, EDBT 2002]
          • XIRQL [Fuhr & Großjohann, SIGIR 2001]
          • XRANK [Guo et al., SIGMOD 2003]
        • Propagate keyword weights along document structure.

    37. XML Data Model
        [Figure: tree of the example <workshop> document (date, <title>, <editors>, <proceedings>, <paper>, <title>, <author>, …). Legend: containment edge, hyperlink edge.]

    38. XXL [Theobald & Weikum, EDBT 2002]
        • Compute similar terms with relevance score r1 using an ontology.
        • Compute tf*idf of each term for a given element content with relevance score r2.
        • Relevance of an element content for a term is r1*r2.
        • r1 and r2 are computed as a weighted distance in an ontology graph.
        • Probabilities of conjunctions are multiplied (independence assumption) along elements of the same path to compute the path score.
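
The arithmetic on the XXL slide can be sketched in a few lines. The numbers are made up for illustration, and the function names are hypothetical; only the two rules (relevance is r1*r2, path score is a product under the independence assumption) come from the slide.

```python
from math import prod


def element_relevance(r1, r2):
    """Relevance of an element's content for a term: r1 * r2.

    r1: ontology-based similarity of the matched term to the query term.
    r2: tf*idf-based relevance of the term for the element content.
    """
    return r1 * r2


def path_score(relevances):
    """Score of a path: product of the relevances of its elements
    (independence assumption)."""
    return prod(relevances)


# Illustrative values only, not taken from the tutorial.
species = element_relevance(r1=0.9, r2=0.5)
region = element_relevance(r1=1.0, r2=0.8)
print(round(path_score([species, region]), 3))
```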

    39. Probabilistic Scoring: XIRQL [Fuhr & Großjohann, SIGIR 2001]
        • Extension of XPath.
        • Weighting and ranking:
          • weighting of query terms:
            • P(wsum((0.6,a), (0.4,b))) = 0.6 · P(a) + 0.4 · P(b)
          • probabilistic interpretation of Boolean connectors:
            • P(a && b) = P(a) · P(b)

    40. XIRQL Example
        • Query:
          • “Search for an artist named Ulbrich, living in Frankfurt, Germany about 100 years ago”
        • Data:
          • “Ernst Olbrich, Darmstadt, 1899”
        • Weights and ranking:
          • P(Olbrich p Ulbrich)=0.8 (phonetic similarity)
          • P(1899 n 1903)=0.9 (numeric similarity)
          • P(Darmstadt g Frankfurt)=0.7 (geographic distance)
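
The two XIRQL combination rules can be written down directly. As an illustration, the sketch also combines the three similarity probabilities of the Ulbrich example as a conjunction; the slide does not state that combined value, so treating the three conditions as one independent conjunction is an assumption here. Function names are hypothetical.

```python
def wsum(weighted_probs):
    """Weighted sum of probabilities: sum of weight * P(term)."""
    return sum(weight * p for weight, p in weighted_probs)


def p_and(*probs):
    """Conjunction under independence: product of the probabilities."""
    result = 1.0
    for p in probs:
        result *= p
    return result


# Weighted query from the XIRQL rule, with made-up term probabilities.
print(round(wsum([(0.6, 0.5), (0.4, 1.0)]), 3))

# Assumed conjunction of the phonetic, numeric and geographic matches.
match = p_and(0.8, 0.9, 0.7)
print(round(match, 3))
```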

    41. PageRank [Brin & Page 1998]
        • d: Probability of following hyperlink
        • 1-d: Probability of random jump
        [Figure: node w with three hyperlink edges, each labeled d/3.]
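
The random-surfer model on the PageRank slide can be sketched as a simple power iteration. The value d = 0.85, the iteration count, the even redistribution for pages without outgoing links, and the four-node graph are assumptions chosen for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Power iteration for PageRank.

    links maps every node to the list of nodes it links to; every target
    must also appear as a key. With probability d the surfer follows a
    hyperlink, with probability 1 - d it jumps to a random node.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - d) / n for v in nodes}
        for u, targets in links.items():
            if targets:
                share = d * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Assumption: a node without links spreads its rank evenly.
                for v in nodes:
                    new_rank[v] += d * rank[u] / n
        rank = new_rank
    return rank


# Node w has three outgoing hyperlinks, so each receives d/3 of w's rank.
links = {"w": ["a", "b", "c"], "a": ["w"], "b": ["w"], "c": ["w"]}
ranks = pagerank(links)
for node in links:
    print(node, round(ranks[node], 3))
```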

    42. ElemRank [Guo et al., SIGMOD 2003]
        • d1: Probability of following hyperlink
        • d2: Probability of visiting a subelement
        • d3: Probability of visiting parent
        • 1-d1-d2-d3: Probability of random jump
        [Figure: node w with three hyperlink edges labeled d1/3, two containment edges labeled d2/2, and an edge to its parent labeled d3.]
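
The ElemRank slide extends the random-surfer model with containment edges. The sketch below is a simplified reading of the slide, not the published formula: the random jump is spread uniformly over all elements, probability mass that cannot be followed (for example d1 at an element without hyperlinks) is simply dropped, and the parameter values are illustrative. All names are hypothetical.

```python
def elemrank(children, hyperlinks, d1=0.35, d2=0.25, d3=0.25, iterations=100):
    """Iterative ElemRank-style scores over an element tree with hyperlinks.

    children maps every element to its list of subelements.
    hyperlinks maps an element to the elements it links to (may be partial).
    d1, d2, d3: probabilities of following a hyperlink, visiting a
    subelement, and visiting the parent; the rest is a random jump.
    """
    nodes = list(children)
    parent = {c: p for p, subs in children.items() for c in subs}
    n = len(nodes)
    jump = 1.0 - d1 - d2 - d3
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: jump / n for v in nodes}
        for u in nodes:
            targets = hyperlinks.get(u, [])
            for v in targets:
                new_rank[v] += d1 * rank[u] / len(targets)
            for v in children[u]:
                new_rank[v] += d2 * rank[u] / len(children[u])
            if u in parent:
                new_rank[parent[u]] += d3 * rank[u]
        rank = new_rank
    return rank


# Tiny hypothetical document: w contains a and b, and a links to b.
children = {"w": ["a", "b"], "a": [], "b": []}
hyperlinks = {"a": ["b"]}
scores = elemrank(children, hyperlinks)
for node in children:
    print(node, round(scores[node], 3))
```

Because mass is dropped in this simplification, the scores do not sum to one and are only meaningful relative to each other.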

    43. Outline
        • Motivation
        • Full-Text Search Languages
        • Scoring
          • Simple Keyword Search
          • Tags + Keyword Search
          • Path Expressions + Keyword Search
          • XQuery + Complex Full-Text Search
        • Query Processing
        • Open Issues

    44. XSearch [Cohen et al., VLDB 2003]
        • tf*ilf to compute weight of keyword for a leaf element.
        • A vector is associated with each non-leaf element.
        • sim(Q,N): sum of the cosine distances between the vectors associated with nodes in N and vectors associated with terms matched in Q.
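
A minimal sketch of tf*ilf weighting and cosine similarity over leaf elements. The slide does not spell out the formulas, so the code assumes tf is the keyword count normalized by the most frequent keyword in the leaf, and ilf (inverse leaf frequency) is log(1 + N / number of leaves containing the keyword). The leaf texts and function names are hypothetical.

```python
import math
from collections import Counter


def tf_ilf(leaf_texts):
    """One sparse vector (keyword -> weight) per leaf element."""
    bags = [Counter(text.lower().split()) for text in leaf_texts]
    n = len(bags)
    leaf_freq = Counter(keyword for bag in bags for keyword in bag)
    vectors = []
    for bag in bags:
        if not bag:
            vectors.append({})
            continue
        max_occ = max(bag.values())
        vectors.append({
            keyword: (occ / max_occ) * math.log(1 + n / leaf_freq[keyword])
            for keyword, occ in bag.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(weight * v.get(keyword, 0.0) for keyword, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = tf_ilf(["xml indexing", "xml query languages", "proximal nodes"])
print(round(cosine(vectors[0], vectors[1]), 3))
```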

    45. Outline
        • Motivation
        • Full-Text Search Languages
        • Scoring
          • Simple Keyword Search
          • Tags + Keyword Search
          • Path Expressions + Keyword Search
          • XQuery + Complex Full-Text Search
        • Query Processing
        • Open Issues

    46. Vector-based Scoring: JuruXML [Mass et al., INEX 2002]
        • Transform query into (term, path) conditions:
          article/bm/bib/bibl/bb[about(., hypercube mesh torus nonnumerical database)]
        • (term, path) pairs:
          hypercube, article/bm/bib/bibl/bb
          mesh, article/bm/bib/bibl/bb
          torus, article/bm/bib/bibl/bb
          nonnumerical, article/bm/bib/bibl/bb
          database, article/bm/bib/bibl/bb
        • Modified cosine similarity as retrieval function for vague matching of path conditions.
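
The query transformation on this slide can be sketched with a small parser. It only handles the single-predicate shape shown on the slide, path[about(., terms)], and the function name and regular expression are hypothetical.

```python
import re

# Matches queries of the form  path[about(., term term ...)]
QUERY_SHAPE = re.compile(
    r"\s*(?P<path>[^\[\]]+)\[about\(\.,\s*(?P<terms>[^)]*)\)\]\s*"
)


def term_path_pairs(query):
    """Split a path[about(., terms)] query into (term, path) pairs."""
    match = QUERY_SHAPE.fullmatch(query)
    if match is None:
        raise ValueError(f"unsupported query shape: {query!r}")
    path = match.group("path").strip()
    return [(term, path) for term in match.group("terms").split()]


query = "article/bm/bib/bibl/bb[about(., hypercube mesh torus nonnumerical database)]"
for term, path in term_path_pairs(query):
    print(f"{term}, {path}")
```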

    47. JuruXML Vague Path Matching
        • Modified vector-based cosine similarity
        Example of length normalization:
          cr(article/bibl, article/bm/bib/bibl/bb) = 3/6 = 0.5
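
The length-normalization example can be reproduced with the sketch below. The formula itself is not shown in the transcript; the code assumes cr(q, d) = (1 + |q|) / (1 + |d|) when the query path q is a subsequence of the document path d, and 0 otherwise, which is consistent with the 3/6 on the slide. Function names are hypothetical.

```python
def is_subsequence(short, long):
    """True if the steps of short occur in long in the same order."""
    remaining = iter(long)
    return all(step in remaining for step in short)


def context_resemblance(query_path, doc_path):
    """Assumed cr: (1 + |query|) / (1 + |doc|) for subsequence matches, else 0."""
    query_steps = query_path.strip("/").split("/")
    doc_steps = doc_path.strip("/").split("/")
    if not is_subsequence(query_steps, doc_steps):
        return 0.0
    return (1 + len(query_steps)) / (1 + len(doc_steps))


print(context_resemblance("article/bibl", "article/bm/bib/bibl/bb"))  # 0.5
```

Shorter document paths that still embed the query path score higher, and an exact path match scores 1.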

    48. Query Relaxation on Structure
        • Schlieder, EDBT 2002
        • Delobel & Rousset, 2002
        • Amer-Yahia et al., VLDB 2005

    49. XML Query Relaxation [Amer-Yahia et al., EDBT 2002]; FlexPath [Amer-Yahia et al., SIGMOD 2004]
        • Tree pattern relaxations:
          • Leaf node deletion
          • Edge generalization
          • Subtree promotion
        [Figure: a query tree pattern (book, info, edition paperback, author Dickens) next to data trees with varying structure (edition?, edition (paperback), author Charles Dickens, author C. Dickens).]
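
Two of the three relaxations named on the slide can be sketched over a tiny tree-pattern representation. The representation (label, axis, children) is hypothetical, the query shape is a guess at the slide's figure, and subtree promotion is left out to keep the sketch short.

```python
# A pattern node is (label, axis, children); axis is "child" or "descendant"
# for the edge from its parent, and None for the root.


def leaf_deletions(node):
    """All patterns obtained by deleting exactly one leaf node."""
    label, axis, children = node
    results = []
    for i, child in enumerate(children):
        before, after = children[:i], children[i + 1:]
        if not child[2]:  # the child is a leaf: drop it
            results.append((label, axis, before + after))
        for relaxed in leaf_deletions(child):
            results.append((label, axis, before + (relaxed,) + after))
    return results


def edge_generalizations(node):
    """All patterns obtained by turning one child edge into a descendant edge."""
    label, axis, children = node
    results = []
    for i, child in enumerate(children):
        before, after = children[:i], children[i + 1:]
        if child[1] == "child":
            generalized = (child[0], "descendant", child[2])
            results.append((label, axis, before + (generalized,) + after))
        for relaxed in edge_generalizations(child):
            results.append((label, axis, before + (relaxed,) + after))
    return results


# Assumed shape of the slide's query: book / info / (edition, author).
query = ("book", None, (
    ("info", "child", (
        ("edition", "child", ()),
        ("author", "child", ()),
    )),
))
print(len(leaf_deletions(query)), len(edge_generalizations(query)))
```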

    50. Adaptation of tf.idf to XML: Whirlpool [Marian et al., ICDE 2005]