XML Full-Text Search: Challenges and Opportunities

Sihem Amer-Yahia

AT&T Labs – Research

Jayavel Shanmugasundaram

Cornell University

VLDB Tutorial on XML Full-Text Search


Outline

  • Motivation

  • Full-Text Search Languages

  • Scoring

  • Query Processing

  • Open Issues

Motivation

  • XML is able to represent a mix of structured and text information.

  • XML applications: digital libraries, content management.

  • XML repositories: IEEE INEX collection, LexisNexis, the Library of Congress collection.

XML in Library of Congress

http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml

<bill bill-stage="Introduced-in-House">

<congress>109th CONGRESS</congress> <session>1st Session</session>

<legis-num>H. R. 2739</legis-num>

<current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>

<action>

<action-date date="20050526">May 26, 2005</action-date>

<action-desc><sponsor name-id="T000266">Mr. Tierney</sponsor> (for himself, <cosponsor name-id="M001143">Ms. McCollum of Minnesota</cosponsor>, <cosponsor name-id="M000725">Mr. George Miller of California</cosponsor>) introduced the following bill; which was referred to the <committee-name committee-id="HED00">Committee on Education and the Workforce</committee-name>

</action-desc>

</action>

THOMAS: Library of Congress

INEX Data

<article> <fno>K0271</fno> <doi>10.1041/K0271s-2004</doi>

<fm> <hdr><hdr1><ti>IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING</ti> <crt>

<issn>1041-4347</issn>/04/$20.00 &copy; 2004 IEEE Published by the IEEE Computer Society</crt></hdr1><hdr2><obi><volno>Vol. 16</volno>, <issno>No. 2</issno></obi> <pdt><mo>FEBRUARY</mo><yr>2004</yr></pdt>

<pp>pp. 271-288</pp></hdr2> </hdr> <tig><atl>A Graph-Based Approach for Timing Analysis and Refinement of OPS5 Knowledge-Based Systems</atl><pn>pp. 271-288</pn><ref rid="K02711aff" type="aff">*</ref></tig>

<au sequence="first"><fnm>Albert Mo Kim</fnm><snm> <ref aid="K0271a1" type="prb">Cheng</ref></snm><role>Senior Member</role><aff><onm>IEEE</onm></aff></au><au sequence="additional"><fnm>Hsiu-yen</fnm><snm> Tsai</snm></au>

<abs><p><b>Abstract</b>&mdash;This paper examines the problem of predicting the timing behavior of knowledge-based systems for real-…

Example INEX Query

<inex_topic topic_id="275" query_type="CAS">

<castitle>//article[about(.//abs, "data mining")]//sec[about(., "frequent itemsets")]</castitle>

<description>sections about frequent itemsets from articles with abstract about data mining</description>

<narrative>To be relevant, a component has to be a section about "frequent itemsets". For example, it could be about algorithms for finding frequent itemsets, or uses of frequent itemsets to generate rules. Also, the article must have an abstract about "data mining". I need this information for a paper that I am writing. It is a survey of different algorithms for finding frequent itemsets. The paper will also have a section on why we would want to find frequent itemsets.</narrative>

</inex_topic>

Challenges in XML FT Search

  • Searching over Semi-Structured Data

    • Users may specify a search context and return context.

  • Expressive Power and Extensibility

    • Users should be able to express complex full-text searches and combine them with structural searches.

  • Scores and Ranking

    • Users may specify a scoring condition, possibly over both full-text and structured predicates and obtain top-k results based on query relevance scores.

    • The language should allow for an efficient implementation.

XML FT Search Definition

  • Context expression: XML elements searched:

    • pre-defined XML nodes.

    • XPath/XQuery queries.

  • Return expression: XML fragments returned:

    • pre-defined meaningful XML fragments.

    • XPath/XQuery to build answers.

  • Search expression: FT search conditions:

    • Boolean keyword search.

    • proximity distance, scoping, thesaurus, stop words, stemming.

  • Score expression:

    • system-defined scoring function.

    • user-defined scoring function.

    • query-dependent keyword weights.

Outline

  • Motivation

  • Full-Text Search Languages

  • Scoring

  • Query Processing

  • Open Issues

Four Classes of Languages

  • Keyword search (INEX Content-Only Queries)

    • “book xml”

  • Tag + Keyword search

    • book: xml

  • Path Expression + Keyword search

    • /book[./title about “xml db”]

  • XQuery + Complex full-text search

    • for $b in /book let score $s := $b ftcontains “xml” && “db” distance 5

Outline

  • Motivation

  • Full-Text Search Languages

    • Simple Keyword Search

    • Tags + Keyword Search

    • Path Expressions + Keyword Search

    • XQuery + Complex Full-Text Search

  • Scoring

  • Query Processing

  • Open Issues

XRank [Guo et al., SIGMOD 2003]

<workshop date="28 July 2000">

<title> XML and Information Retrieval: A SIGIR 2000 Workshop </title> <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>

<proceedings>

<paper id="1">

<title> XQL and Proximal Nodes </title>

<author> Ricardo Baeza-Yates </author>

<author> Gonzalo Navarro </author>

<abstract> We consider the recently proposed language … </abstract>

<section name="Introduction">

Searching on structured text is becoming more important with XML …

<subsection name="Related Work">

The XQL language …

</subsection>

</section>

<cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>

</paper>

XIRQL [Fuhr & Großjohann, SIGIR 2001]

<workshop date="28 July 2000">

<title> XML and Information Retrieval: A SIGIR 2000 Workshop </title> <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>

<proceedings>

<paper id="1">

<title> XQL and Proximal Nodes </title>

<author> Ricardo Baeza-Yates </author>

<author> Gonzalo Navarro </author>

<abstract> We consider the recently proposed language … </abstract>

<section name="Introduction">

Searching on structured text is becoming more important with XML …

<em>The XQL language </em>

</section>

<cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>

</paper>

Index Node

Similar Notion of Results

  • Nearest Concept Queries

    • [Schmidt et al., ICDE 2002]

  • XKSearch

    • [Xu & Papakonstantinou, SIGMOD 2005]

Outline

  • Motivation

  • Full-Text Search Languages

    • Simple Keyword Search

    • Tags + Keyword Search

    • Path Expressions + Keyword Search

    • XQuery + Complex Full-Text Search

  • Scoring

  • Query Processing

  • Open Issues

XSearch [Cohen et al., VLDB 2003]

<workshop date="28 July 2000">

<title> XML and Information Retrieval: A SIGIR 2000 Workshop </title> <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>

<proceedings>

<paper id="1">

<title> XQL and Proximal Nodes </title>

<author> Ricardo Baeza-Yates </author>

<author> Gonzalo Navarro </author>

<abstract> We consider the recently proposed language … </abstract>

<section name="Introduction">

Searching on structured text is becoming more important with XML …

</paper> <paper id="2"> <title> XML Indexing </title> … <paper id="2">

Not a “meaningful” result

Outline

  • Motivation

  • Full-Text Search Languages

    • Simple Keyword Search

    • Tags + Keyword Search

    • Path Expressions + Keyword Search

    • XQuery + Complex Full-Text Search

  • Scoring

  • Query Processing

  • Open Issues

XPath [W3C 2005]

  • fn:contains($e, string) returns true iff $e contains string

//section[fn:contains(./title, “XML Indexing”)]

XIRQL [Fuhr & Großjohann, SIGIR 2001]

  • Weighted extension to XQL (precursor to XPath)

//section[0.6 · .//* $cw$ “XQL” + 0.4 · .//section $cw$ “syntax”]

XXL [Theobald & Weikum, EDBT 2002]

  • Introduces similarity operator ~

Select Z

From http://www.myzoos.edu/zoos.html

Where zoos.#.zoo As Z and

Z.animals.(animal)?.specimen as A and

A.species ~ “lion” and

A.birthplace.#.country as B and

A.region ~ B.content

NEXI [Trotman & Sigurbjörnsson, INEX 2004]

  • Narrowed Extended XPath I

  • INEX Content-and-Structure (CAS) Queries

//article[about(.//title, apple) and

about(.//sec, computer)]

Outline

  • Motivation

  • Full-Text Search Languages

    • Simple Keyword Search

    • Tags + Keyword Search

    • Path Expressions + Keyword Search

    • XQuery + Complex Full-Text Search

  • Scoring

  • Query Processing

  • Open Issues

Schema-Free XQuery [Li, Yu, Jagadish, VLDB 2003]

  • Meaningful least common ancestor (mlcas)

for $a in doc(“bib.xml”)//author,

$b in doc(“bib.xml”)//title,

$c in doc(“bib.xml”)//year

where $a/text() = “Mary” and

exists mlcas($a,$b,$c)

return <result> {$b,$c} </result>

XQuery Full-Text [W3C 2005]

  • Two new XQuery constructs

  • FTContainsExpr

    • Expresses “Boolean” full-text search predicates

    • Seamlessly composes with other XQuery expressions

  • FTScoreClause

    • Extension to FLWOR expression

    • Can score FTContainsExpr and other expressions

FTContainsExpr

//book ftcontains “Usability” && “testing” distance 5

//book[./content ftcontains “Usability” with stems]/title

//book ftcontains /article[author=“Dawkins”]/title

FTScore Clause

In any order

FOR $v [SCORE $s]? IN [FUZZY] Expr

LET …

WHERE …

ORDER BY …

RETURN

Example

FOR $b SCORE $s in

/pub/book[. ftcontains “Usability” && “testing”]

ORDER BY $s

RETURN <result score={$s}> $b </result>

FTScore Clause

In any order

FOR $v [SCORE $s]? IN [FUZZY] Expr

LET …

WHERE …

ORDER BY …

RETURN

Example

FOR $b SCORE $s in

/pub/book[. ftcontains “Usability” && “testing”

and ./price < 10.00]

ORDER BY $s

RETURN $b

FTScore Clause

In any order

FOR $v [SCORE $s]? IN [FUZZY] Expr

LET …

WHERE …

ORDER BY …

RETURN

Example

FOR $b SCORE $s in FUZZY

/pub/book[. ftcontains “Usability” && “testing”]

ORDER BY $s

RETURN $b

XQuery Full-Text Evolution

  • 2002: Quark Full-Text Language (Cornell)

  • 2003: IBM, Microsoft, Oracle proposals; TeXQuery (Cornell, AT&T Labs)

  • 2004: XQuery Full-Text

  • 2005: XQuery Full-Text (Second Draft)

Outline

  • Motivation

  • Full-Text Search Languages

  • Scoring

  • Query Processing

  • Open Issues

Full-Text Scoring

  • Score value should reflect relevance of answer to user query. Higher scores imply a higher degree of relevance.

  • Queries return document fragments. Granularity of returned results affects scoring.

  • For queries containing conditions on structure, structural conditions may affect scoring.

  • Existing proposals extend common scoring methods: probabilistic or vector-based similarity.

Granularity of Results

  • Keyword queries

    • compute possibly different scores for LCAs.

  • Tag + Keyword queries

    • compute scores based on tags and keywords.

  • Path Expression + Keyword queries

    • compute scores based on paths and keywords.

  • XQuery + Complex full-text queries

    • compute scores for (newly constructed) XML fragments satisfying XQuery (structural, full-text and scalar conditions).

Outline

  • Motivation

  • Full-Text Search Languages

  • Scoring

    • Simple Keyword Search

    • Tags + Keyword Search

    • Path Expressions + Keyword Search

    • XQuery + Complex Full-Text Search

  • Query Processing

  • Open Issues

Granularity of Results

  • Document as hierarchical structure of elements as opposed to flat document.

    • XXL [Theobald & Weikum, EDBT 2002]

    • XIRQL [Fuhr & Großjohann, SIGIR 2001]

    • XRANK [Guo et al., SIGMOD 2003]

  • Propagate keyword weights along document structure.

XML Data Model

[Figure: tree view of the workshop document. <workshop> has children date (28 July …), <title> (XML and …), <editors> (David Carmel …) and <proceedings>; <proceedings> contains <paper> elements, one shown with <title> (XQL and …) and <author> (Ricardo …). Legend: containment edge, hyperlink edge.]


XXL [Theobald & Weikum, EDBT 2002]

  • Compute similar terms with relevance score r1 using an ontology.

  • Compute tf*idf of each term for a given element content with relevance score r2.

  • Relevance of an element content for a term is r1*r2.

  • r1 and r2 are computed as a weighted distance in an ontology graph.

  • Probabilities of conjunctions multiplied (independence assumption) along elements of same path to compute path score.
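
The combination rules on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and all numeric values are hypothetical, and the ontology lookup and tf*idf computation that would produce r1 and r2 are not modeled.

```python
from math import prod


def element_relevance(r1: float, r2: float) -> float:
    """Relevance of an element's content for a term: r1 * r2.

    r1 is the ontology-based similarity score of the matched term,
    r2 is the tf*idf-based score of that term in the element content.
    """
    return r1 * r2


def path_score(element_scores: list[float]) -> float:
    """Multiply the scores of the conditions along one path.

    Treating the conditions as independent events lets the
    probabilities of a conjunction be multiplied.
    """
    return prod(element_scores)


# Hypothetical values, chosen only for illustration.
species_score = element_relevance(r1=0.9, r2=0.5)  # a term similar to "lion"
region_score = element_relevance(r1=1.0, r2=0.8)   # an exact term match
score = path_score([species_score, region_score])
```

With these made-up inputs the two conditions contribute 0.45 and 0.8, so the path score is their product.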

Probabilistic Scoring: XIRQL [Fuhr & Großjohann, SIGIR 2001]

  • Extension of XPath.

  • Weighting and ranking:

    • weighting of query terms:

      • P(wsum((0.6,a), (0.4,b))) = 0.6 · P(a) + 0.4 · P(b)

    • probabilistic interpretation of Boolean connectors:

      • P(a && b) = P(a) · P(b)

XIRQL Example

  • Query:

    • “Search for an artist named Ulbrich, living in Frankfurt, Germany about 100 years ago”

  • Data:

    • “Ernst Olbrich, Darmstadt, 1899”

  • Weights and ranking:

    • P(Olbrich p Ulbrich)=0.8 (phonetic similarity)

    • P(1899 n 1903)=0.9 (numeric similarity)

    • P(Darmstadt g Frankfurt)=0.7 (geographic distance)
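
As a worked illustration, the three vague-predicate probabilities can be combined with the XIRQL weighting rules (weighted sum, and product for conjunction). The slide does not say how the three predicates are combined in this query, so the conjunctive combination below is an assumption, as are the function names and the 0.6/0.4 weights reused from the wsum rule.

```python
def wsum(weighted_probs):
    """Weighted sum of probabilities: sum of w_i * P(x_i)."""
    return sum(weight * prob for weight, prob in weighted_probs)


def p_and(*probs):
    """Probabilistic AND under the independence assumption: product of P(x_i)."""
    result = 1.0
    for prob in probs:
        result *= prob
    return result


# Probabilities taken from the example above.
name_match = 0.8   # P(Olbrich p Ulbrich), phonetic similarity
year_match = 0.9   # P(1899 n 1903), numeric similarity
place_match = 0.7  # P(Darmstadt g Frankfurt), geographic distance

# Assumption: the query requires all three conditions (conjunction).
conjunctive_score = p_and(name_match, year_match, place_match)

# Assumption: hypothetical weights, only to show the wsum rule in use.
weighted_score = wsum([(0.6, name_match), (0.4, place_match)])
```

Under the conjunctive reading the record scores 0.8 · 0.9 · 0.7 = 0.504, so a near miss on every predicate still yields a ranked answer rather than an empty result.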

PageRank [Brin & Page 1998]

  • d: Probability of following hyperlink

  • 1-d: Probability of random jump

[Figure: a node with three outgoing hyperlink edges, each labeled d/3.]
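
The random-surfer model can be sketched as a simple power iteration. This is a minimal illustration, not the production algorithm: the graph, the function name, the damping value d = 0.85 and the fixed iteration count are assumptions, and nodes without outgoing links are handled by spreading their mass evenly, which is one common convention.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank for a small graph.

    links maps every node to the list of nodes it links to; every
    link target must also appear as a key. d is the probability of
    following a hyperlink, 1 - d the probability of a random jump.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Random jump: every node receives an equal share of 1 - d.
        new_rank = {v: (1.0 - d) / n for v in nodes}
        for u in nodes:
            targets = links[u]
            if targets:
                # Follow a hyperlink: d is split evenly over out-links.
                share = d * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Assumption: a node without out-links spreads its mass evenly.
                for v in nodes:
                    new_rank[v] += d * rank[u] / n
        rank = new_rank
    return rank


# Node "a" has three out-links, so each receives d/3 of a's rank.
graph = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
ranks = pagerank(graph)
```

The ranks always sum to 1, and in this toy graph "a" ends up with the highest rank because all other nodes link to it.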

ElemRank [Guo et al. SIGMOD 2003]

  • d1: Probability of following hyperlink

  • d2: Probability of visiting a subelement

  • d3: Probability of visiting parent

  • 1-d1-d2-d3: Probability of random jump

[Figure: an element with three hyperlink edges (d1/3 each), two containment edges to subelements (d2/2 each) and one edge to its parent (d3).]

Outline

  • Motivation

  • Full-Text Search Languages

  • Scoring

    • Simple Keyword Search

    • Tags + Keyword Search

    • Path Expressions + Keyword Search

    • XQuery + Complex Full-Text Search

  • Query Processing

  • Open Issues

XSearch [Cohen et al., VLDB 2003]

  • tf*ilf to compute weight of keyword for a leaf element.

  • A vector is associated with each non-leaf element.

  • sim(Q,N): sum of the cosine distances between the vectors associated with nodes in N and vectors associated with terms matched in Q.

Outline

  • Motivation

  • Full-Text Search Languages

  • Scoring

    • Simple Keyword Search

    • Tags + Keyword Search

    • Path Expressions + Keyword Search

    • XQuery + Complex Full-Text Search

  • Query Processing

  • Open Issues

Vector-based Scoring: JuruXML [Mass et al., INEX 2002]

  • Transform query into (term,path) conditions:

    article/bm/bib/bibl/bb[about(., hypercube mesh torus nonnumerical database)]

  • (term,path)-pairs:

    hypercube, article/bm/bib/bibl/bb

    mesh, article/bm/bib/bibl/bb

    torus, article/bm/bib/bibl/bb

    nonnumerical, article/bm/bib/bibl/bb

    database, article/bm/bib/bibl/bb

  • Modified cosine similarity as retrieval function for vague matching of path conditions.

JuruXML Vague Path Matching

  • Modified vector-based cosine similarity

Example of length normalization:

cr (article/bibl, article/bm/bib/bibl/bb) = 3/6 = 0.5
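
The formula behind this example did not survive in the transcript. The sketch below assumes the usual context-resemblance definition, cr(q, d) = (1 + |q|) / (1 + |d|) when the query path q can be obtained from the document path d by dropping tags, and 0 otherwise. That assumption reproduces the 3/6 value above; the function names are hypothetical.

```python
def is_subsequence(short, long):
    """True if the tags of `short` occur in `long` in the same order."""
    remaining = iter(long)
    return all(tag in remaining for tag in short)


def context_resemblance(query_path, doc_path):
    """Assumed definition: (1 + |q|) / (1 + |d|) if q is a subsequence of d, else 0."""
    q = query_path.split("/")
    d = doc_path.split("/")
    if not is_subsequence(q, d):
        return 0.0
    return (1 + len(q)) / (1 + len(d))


cr = context_resemblance("article/bibl", "article/bm/bib/bibl/bb")
no_match = context_resemblance("article/sec", "article/bm/bib/bibl/bb")
```

Here the query path has 2 tags and the document path has 5, giving (1 + 2) / (1 + 5) = 0.5, while a path that cannot be matched gets 0.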

Query Relaxation on Structure

  • Schlieder, EDBT 2002

  • Delobel & Rousset, 2002

  • Amer-Yahia et al, VLDB 2005

XML Query Relaxation [Amer-Yahia et al EDBT 2002]

FlexPath [Amer-Yahia et al SIGMOD 2004]

  • Tree pattern relaxations:

    • Leaf node deletion

    • Edge generalization

    • Subtree promotion

[Figure: a query tree pattern (book with child info, which has children edition "paperback" and author "Dickens") shown next to relaxed patterns (for example with edition marked optional) and data trees over book, info, edition and author whose author values are "Dickens", "Charles Dickens" and "C. Dickens".]

Adaptation of tf.idf to XML: Whirlpool [Marian et al ICDE 2005]

A Family of XML Scoring Methods [Amer-Yahia et al VLDB 2005]

  • Twig scoring: high quality, expensive computation

  • Path scoring

  • Binary scoring: low quality, fast computation

[Figure: the query book/info with edition (paperback) and author (Dickens), shown as a whole twig, decomposed into paths, and decomposed into binary predicates for the three scoring methods.]

Outline

  • Motivation

  • Full-Text Search Languages

  • Scoring

    • Simple Keyword Search

    • Tags + Keyword Search

    • Path Expressions + Keyword Search

    • XQuery + Complex Full-Text Search

  • Query Processing

  • Open Issues

XIRQL + Relaxation

  • XIRQL proposes vague predicates, but it is not clear how to combine them with all of XQuery.

  • How to relax all of XQuery, including structural and scalar predicates, remains an open issue.

Outline

  • Motivation

  • Full-Text Search Languages

  • Scoring

  • Query Processing

    • Simple Keyword Search

    • Tags + Keyword Search

    • Path Expressions + Keyword Search

    • XQuery + Complex Full-Text Search

  • Open Issues

Main Issue

  • Given: Query keywords

  • Compute: Least Common Ancestors (LCAs) that contain query keywords, in ranked order

Naïve Method

Naïve inverted lists:

  • Ricardo 1 ; 5 ; 6 ; 8

  • XQL 1 ; 5 ; 6 ; 7

[Figure: the workshop tree with elements numbered 1 to 8: <workshop> (1), date (2), <title> (3), <editors> (4), <proceedings> (5), <paper> (6), <title> (7), <author> (8).]

Problems:

  • Space Overhead

  • Spurious Results

Main issue: Decouples representation of ancestors and descendants

Dewey Encoding of IDs [1850s]

  • <workshop> 0

    • date 0.0 (28 July …)

    • <title> 0.1 (XML and …)

    • <editors> 0.2 (David Carmel …)

    • <proceedings> 0.3

      • <paper> 0.3.0

        • <title> 0.3.0.0 (XQL and …)

        • <author> 0.3.0.1 (Ricardo …)

      • <paper> 0.3.1
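
The numbering scheme can be sketched as a recursive walk that appends the child position to the parent's ID. The tuple-based tree representation and the function names are assumptions made for this illustration.

```python
def assign_dewey_ids(tree, prefix="0"):
    """Return (dewey_id, label) pairs for a tree given as (label, children)."""
    label, children = tree
    ids = [(prefix, label)]
    for position, child in enumerate(children):
        # A child's ID is its parent's ID plus its position among the siblings.
        ids.extend(assign_dewey_ids(child, f"{prefix}.{position}"))
    return ids


def is_ancestor(ancestor_id, descendant_id):
    """An element is an ancestor exactly when its ID is a proper prefix."""
    return descendant_id.startswith(ancestor_id + ".")


workshop = ("workshop", [
    ("date", []),
    ("title", []),
    ("editors", []),
    ("proceedings", [
        ("paper", [("title", []), ("author", [])]),
        ("paper", []),
    ]),
])
ids = assign_dewey_ids(workshop)
```

Because the ID of an element embeds the IDs of all its ancestors, an ancestor test needs only a prefix comparison and no access to the tree itself.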

XRank: Dewey Inverted List (DIL)

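Each inverted list entry holds a Dewey Id, a score and a position list; each keyword's list is sorted by Dewey Id.

  • XQL

    • Dewey Id 5.0.3.0.0, Score 85, Position List 32

    • Dewey Id 8.0.3.8.3, Score 38, Position List 89, 91

  • Ricardo

    • Dewey Id 5.0.3.0.1, Score 82, Position List 38

    • Dewey Id 8.2.1.4.2, Score 99, Position List 52

Store IDs of elements that directly contain keyword

  • Avoids space overhead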

DIL: Query Processing

  • Merge query keyword inverted lists in Dewey ID Order

    • Entries with common prefixes are processed together

  • Compute Longest Common Prefix of Dewey IDs during the merge

    • Longest common prefix ensures most specific results

    • Also suppresses spurious results

  • Keep top-m results seen so far in output heap

    • Calculate rank using two-dimensional proximity metric

    • Output contents of output heap after scanning inverted lists

  • Algorithm works in a single scan over inverted lists
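
The role of the longest common prefix can be illustrated with a small sketch. It is deliberately simplified and is not the single-scan merge described above: it compares every pair of entries from two keyword lists, keeps only candidates that have no more specific candidate below them, and ignores scores and ranking. The function names are hypothetical; the Dewey IDs are the ones from the DIL example.

```python
from itertools import product


def parse(dewey_id):
    """Turn '5.0.3.0.1' into (5, 0, 3, 0, 1) so components compare numerically."""
    return tuple(int(part) for part in dewey_id.split("."))


def longest_common_prefix(a, b):
    """Dewey ID of the deepest common ancestor of a and b (empty if none)."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return a[:length]


def most_specific_results(list_a, list_b):
    """Deepest elements that contain one entry from each list (simplified)."""
    candidates = {
        longest_common_prefix(parse(a), parse(b))
        for a, b in product(list_a, list_b)
    }
    candidates.discard(())  # entries from different documents share nothing
    # Drop a candidate when another candidate lies below it in the tree.
    specific = [
        c for c in candidates
        if not any(o != c and o[:len(c)] == c for o in candidates)
    ]
    return [".".join(map(str, c)) for c in sorted(specific)]


ricardo = ["5.0.3.0.1", "8.2.1.4.2"]
xql = ["5.0.3.0.0", "8.0.3.8.3"]
results = most_specific_results(ricardo, xql)
```

For these lists the two keywords meet at element 5.0.3.0 in document 5 and only at the root of document 8.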

XRank: Ranked Dewey Inverted List (RDIL)

[Figure: for each keyword (XQL, … other keywords), an inverted list sorted by score together with a B+-tree on Dewey Id over the same entries.]

RDIL: Algorithm

  • An element may be ranked highly in one list and low in another list

    • B+-tree helps search for low ranked element

  • When to stop scanning inverted lists?

    • Based on Threshold Algorithm [Fagin et al., 2002], which periodically calculates a threshold

    • Can stop if we have sufficient results above the threshold

    • Extension to most specific results
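
The stopping rule can be sketched with a plain Threshold Algorithm over score-sorted lists. This is a generic illustration rather than RDIL itself: it uses a sum as the (monotone) combining function, a dictionary lookup stands in for the B+-tree probe, elements missing from a list contribute 0, and the extension to most specific results is left out. All names and scores are hypothetical.

```python
import heapq


def threshold_algorithm(lists, k):
    """Return the top-k (element, total score) pairs.

    lists holds one list per keyword of (element_id, score) pairs,
    each sorted by score in descending order.
    """
    # Random access by element ID (the role the B+-tree plays in RDIL).
    lookup = [dict(entries) for entries in lists]
    seen = {}
    max_depth = max(len(entries) for entries in lists)
    for depth in range(max_depth):
        last_scores = []
        for entries in lists:
            if depth < len(entries):
                element, score = entries[depth]
                last_scores.append(score)
                if element not in seen:
                    # Look the element up in every list to get its total score.
                    seen[element] = sum(t.get(element, 0.0) for t in lookup)
            else:
                last_scores.append(0.0)
        # No unseen element can score higher than this threshold.
        threshold = sum(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda item: item[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda item: item[1])


xql = [("e1", 0.9), ("e2", 0.8), ("e3", 0.1)]
ricardo = [("e2", 0.7), ("e3", 0.6), ("e1", 0.2)]
best = threshold_algorithm([xql, ricardo], k=1)
```

In this example the scan stops after two entries per list: e2 has a total of about 1.5, which is at least the threshold 0.8 + 0.6 = 1.4, so the third entries are never read.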

RDIL: Query Processing

[Figure: worked example with score-sorted inverted lists for Ricardo and XQL, a B+-tree on Dewey Id for each keyword, a temp heap and an output heap. Labels shown: P: 9.0.4.2.0; R; Rank(9.0.4); threshold = Score(P)+Max-Score; threshold = Score(P)+Score(R). Dewey IDs shown in the lists: 9.0.4.1.2, 8.2.1.4.2, 9.0.5.6, 10.8.3, 9.0.4.2.0.]

ID Order vs. Rank Order

  • Approaches that combine benefits

  • Long ID inverted list, short score inverted list

    • HDIL (Guo et al., SIGMOD 2003)

  • Chunk inverted list based on score, organize by ID within chunk

    • FlexPath (Amer-Yahia et al., SIGMOD 2004)

    • SVR (Guo et al., ICDE 2005)

Outline

  • Motivation

  • Full-Text Search Languages

  • Scoring

  • Query Processing

    • Simple Keyword Search

    • Tags + Keyword Search

    • Path Expressions + Keyword Search

    • XQuery + Complex Full-Text Search

  • Open Issues

XSearch Technique

  • Given: An interconnection relationship R between nodes (semantic relationship)

    • R is reflexive and symmetric

  • Node interconnection index

    • Given two nodes n and n’ in a document d, find if (n,n’) are in R*

  • Use dynamic programming to compute closure

    • Online vs. offline
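
One dynamic-programming way to compute the closure R* is Warshall's algorithm, sketched below. This is a generic textbook technique and not necessarily the index construction used in XSearch; the node names and the example relation are hypothetical.

```python
def closure(nodes, related_pairs):
    """Reflexive, symmetric, transitive closure of a relation over `nodes`."""
    pairs = set(related_pairs)
    # Start from R: reflexive and symmetric by construction.
    reach = {
        a: {b: a == b or (a, b) in pairs or (b, a) in pairs for b in nodes}
        for a in nodes
    }
    # Warshall's recurrence: allow paths through each intermediate node k in turn.
    for k in nodes:
        for i in nodes:
            if reach[i][k]:
                for j in nodes:
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


nodes = ["n1", "n2", "n3", "n4"]
index = closure(nodes, [("n1", "n2"), ("n2", "n3")])
```

Computing the table offline makes each lookup of (n, n') a constant-time check at query time, at the cost of quadratic space per document; computing it online trades that space for work during the query.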

Outline

  • Motivation

  • Full-Text Search Languages

  • Scoring

  • Query Processing

    • Simple Keyword Search

    • Tags + Keyword Search

    • Path Expressions + Keyword Search

    • XQuery + Complex Full-Text Search

  • Open Issues

XXL Indexing

  • Element Path Index (EPI)

    • Evaluates simple path expressions

  • Element Content Index (ECI)

    • Traditional inverted list (but replicates nested elements)

  • Ontology Index (OI)

    • Lookup similar concepts (for evaluating ~e)

    • Returned in ranked order

Myaeng et al. [SIGIR 1994]

[Figure: an inverted list entry for the keyword XQL with fields Document ID (5) and Element ID (85), followed by (Element Tag, Probability) pairs: act 0.3, play 0.2, plays 0.1.]

Integrating Structure and IL [Kaushik et al., SIGMOD 2004]

[Figure: a numbered structure index over a book document (nodes book, info, edition, author, title) next to an inverted list for the keyword XQL. Entry fields shown: Document ID, Start ID, End ID, Depth, Index ID, Score. A B+ tree is built over the list.]

Outline

  • Motivation

  • Full-Text Search Languages

  • Scoring

  • Query Processing

    • Simple Keyword Search

    • Tags + Keyword Search

    • Path Expressions + Keyword Search

    • XQuery + Complex Full-Text Search

  • Open Issues

Scoring Functions Critical for Top-k Query Processing

  • Top-k answer quality depends on scoring function.

  • Efficient top-k query processing requires scoring function to be:

    • Monotone.

    • Fast to compute.

Structural Join Relaxation

//book[./info[./author ftcontains “Dickens”]

[./edition ftcontains “paperback”]]

  • Predicates in the exact query plan:

    • contains(edition, “paperback”)

    • contains(author, “Dickens”)

    • pc(info, edition)

    • pc(info, author)

    • pc(book, info)

  • Predicates in the relaxed query plan:

    • contains(edition, “paperback”)

    • contains(author, “Dickens”)

    • pc(info, edition) or ad(book, edition)

    • pc(info, author)

    • pc(book, info) or ad(book, info)

Quark/GalaTex Architecture

[Figure: XML documents (.xml) pass through preprocessing and inverted lists generation. The XQFT parser translates a full-text query into an equivalent XQuery query, which the Quark/Galax XQuery engine evaluates. The full-text primitives (FTWord, FTWindow, FTTimes, etc.) access the inverted lists through an API on positions.]

Outline

  • Motivation

  • Full-Text Search Languages

  • Scoring

  • Query Processing

  • Open Issues

System Architecture

Integration Layer

XQuery Engine

IR Engine

System Architecture

XQuery + IR Engine

Quark/GalaTex use this architecture

Structural Relaxation

FOR $b SCORE $s in FUZZY

/pub/book[. ftcontains “Usability” with stems]

ORDER BY $s

RETURN $b

Search Over Views

  • Data Source 1: <books> <book> … </book> <book> … </book> … </books>

  • Data Source 2: <reviews> <review> … </review> <review> … </review> … </reviews>

  • Integrated View: <book> <reviews> … </reviews> </book> …

Other Open Issues

  • Experimental evaluation of scoring functions and ranking algorithms for XML (INEX).

  • Search over a mix of HTML and XML.

  • Joint scoring on full-text and scalar predicates.

  • Score-aware algebra for XML for the joint optimization of queries on both structure and text.
