Indexing mixed types for approximate retrieval

Liang Jin * UC Irvine Nick Koudas University of Toronto Chen Li * UC Irvine Anthony K.H. Tung National University of Singapore. Indexing Mixed Types for Approximate Retrieval. VLDB’2005 * Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586.


Liang Jin* UC Irvine

Nick Koudas University of Toronto

Chen Li* UC Irvine

Anthony K.H. Tung National University of Singapore

Indexing Mixed Types for Approximate Retrieval

VLDB’2005

* Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586


Queries with Mixed-Type Predicates

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND |year – 1980| <= 5;

  • SIMILARTO:

    • a domain-specific function

    • returns a similarity value between two strings

  • Example: edit distance ed(Tom Hanks, Ton Hank) = 2
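The edit distance used in the example can be computed with the standard dynamic program. The following is a minimal sketch; the function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # prints 2
```

The two edits are one substitution (m to n) and one deletion (the final s).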


Why fuzzy predicates?

Star values in Relation R and Relation S:

  • Keanu Reeves / Keanu Reeves

  • Samuel Jackson / Samuel L. Jackson

  • Schwarzenegger / Schwarzenegger

  • Errors in queries

    • User doesn’t remember a string exactly

    • User types a wrong string


Problem Formulation

Given: A query with fuzzy predicates on strings and range predicates on numeric attributes, on a single relation

Goal: Answer the query efficiently

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND |year – 1980| <= 5;


Rest of the talk

  • Motivation: supporting queries with mixed-type predicates

  • Our approach: MAT tree

  • Construction and maintenance of MAT tree

  • Experiments


Assumptions

  • One fuzzy string predicate (edit distance)

  • One numeric predicate

Query:

(Qs, δs, Qn, δn)

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND |year – 1980| <= 5;

(’Schwarrzenger’, 2, 1980, 5)


Intuition of MAT (Mixed-attribute-type) Tree

  • “2 > 1 + 1”

    • One integrated indexing structure is better than two independent indexing structures on two attributes

  • Indexing numeric attributes: B-tree or R-tree

  • Indexing strings as a tree to support fuzzy predicates?

MAT tree


Answering a query (Qs, δs, Qn, δn)

  • Traverse the MAT-tree top-down

  • At each node, prune by checking:

    • If [Qn – δn, Qn + δn] overlaps the node's numeric range.

    • If minEditDistance(Qs, Tn) <= δs.
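The traversal with both pruning checks can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Node` layout and the helpers `leaf`, `inner`, and `search` are hypothetical, the sample records are made up, and each node keeps a plain set of strings as a stand-in for the compressed trie Tn, so the minimum edit distance is found by brute force.

```python
from dataclasses import dataclass, field


def edit_distance(a, b):
    """Standard Levenshtein distance (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


@dataclass
class Node:
    lo: int                 # smallest numeric value in the subtree
    hi: int                 # largest numeric value in the subtree
    strings: frozenset      # assumption: stand-in for the compressed trie Tn
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)  # (string, number); leaves only


def leaf(records):
    nums = [n for _, n in records]
    return Node(min(nums), max(nums), frozenset(s for s, _ in records),
                records=list(records))


def inner(children):
    return Node(min(c.lo for c in children), max(c.hi for c in children),
                frozenset().union(*(c.strings for c in children)),
                children=list(children))


def search(node, qs, ds, qn, dn):
    """Answer the query (Qs, δs, Qn, δn) by a top-down traversal."""
    # Numeric pruning: skip the subtree if the ranges do not overlap.
    if node.hi < qn - dn or node.lo > qn + dn:
        return []
    # String pruning: skip if no string below can be within distance ds.
    if min(edit_distance(qs, s) for s in node.strings) > ds:
        return []
    if not node.children:  # leaf: verify each record against both predicates
        return [(s, n) for s, n in node.records
                if abs(n - qn) <= dn and edit_distance(qs, s) <= ds]
    results = []
    for child in node.children:
        results.extend(search(child, qs, ds, qn, dn))
    return results


# Made-up sample data, for illustration only.
root = inner([
    leaf([("Tom Hanks", 1956), ("Ton Hank", 1958)]),
    leaf([("Keanu Reeves", 1964), ("Samuel L. Jackson", 1948)]),
])
print(search(root, "Tom Hank", 1, 1956, 3))
```

In this example the second leaf passes the numeric check but is pruned by the string check, so its records are never examined.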


Challenge

  • How to represent strings so that they fit into a limited space and still support fuzzy-predicate pruning

Limited space (disk based)


Existing Approaches to Indexing Strings as Trees

  • M-tree:

    • Edit distance: metric space

  • Q-tree

    • Utilize the q-gram property of strings.

    • See our paper for details



Compressing a trie

compression

  • Select k representative nodes (centers).

  • Each center is in the format of <alphabet,height>.

  • A compressed trie accepts more strings than the original trie


Minimum edit distance between a string and a trie

minEditDistance(Qs, Tn)?

  • Convert a trie to an automaton.

  • Compute the min distance between a string and an automaton [Myers and Miller, 1989]

  • Early termination possible
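To illustrate the idea of early termination, here is a minimal sketch that computes the minimum edit distance between a query and an uncompressed trie by carrying one dynamic-programming row per trie node. It is a simpler stand-in, not the automaton algorithm of Myers and Miller cited above, and the names `TrieNode`, `build_trie`, and the `bound` parameter are assumptions made for this example.

```python
import math


class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.terminal = False  # True if a stored string ends here


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root


def min_edit_distance(query, root, bound=math.inf):
    """Smallest edit distance between query and any string in the trie.

    Returns None if no stored string is within `bound`.
    """
    best = math.inf
    # row[j] = edit distance between the current trie prefix and query[:j]
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if node.terminal:
            best = min(best, row[-1])
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                new_row.append(min(
                    new_row[j - 1] + 1,                 # insert query[j-1]
                    row[j] + 1,                         # delete ch
                    row[j - 1] + (query[j - 1] != ch),  # substitute or match
                ))
            # Early termination: the row minimum never decreases further
            # down, so skip subtrees that cannot meet the bound or cannot
            # improve on the best distance found so far.
            if min(new_row) <= bound and min(new_row) < best:
                stack.append((child, new_row))
    return best if best <= bound else None


trie = build_trie(["Tom Hanks", "Keanu Reeves"])
print(min_edit_distance("Ton Hank", trie))
print(min_edit_distance("Ton Hank", trie, bound=1))
```

Passing δs as `bound` turns the computation into the pruning test used during traversal: a result of None means the subtree can be skipped.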


Compressed trie → Automaton

  • Each node is a state.

  • Each edge becomes a transition between two states.

  • For compressed node <Σ, L>, expand it to L levels. At each level, all characters in Σ become single states and are connected to a common tail ε.

Convert a compressed node <{a,b,c},2> into automaton nodes.
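The expansion of a compressed node can be sketched as below. The slide does not spell out every detail, so this code makes assumptions that should be read as such: states are named (level, character), a hypothetical START state stands for the node's predecessor, every expanded state has an ε-move to a single TAIL state, and START itself has no ε-move to TAIL (so at least one character is consumed).

```python
def expand_compressed_node(alphabet, levels):
    """Expand <alphabet, levels> into a list of automaton transitions.

    Each transition is (source, label, target); label None is an ε-move.
    """
    symbols = sorted(alphabet)
    transitions = []
    previous = ["START"]  # assumption: stands for the predecessor state
    for level in range(1, levels + 1):
        current = [(level, ch) for ch in symbols]
        for src in previous:
            for state in current:
                transitions.append((src, state[1], state))
        for state in current:
            transitions.append((state, None, "TAIL"))  # common tail
        previous = current
    return transitions


def accepted_strings(transitions):
    """Enumerate the strings accepted on paths from START to TAIL."""
    accepted = set()
    stack = [("START", "")]
    while stack:
        state, text = stack.pop()
        for src, label, dst in transitions:
            if src != state:
                continue
            if label is None:
                accepted.add(text)  # ε-move into TAIL accepts text so far
            else:
                stack.append((dst, text + label))
    return accepted


transitions = expand_compressed_node({"a", "b", "c"}, 2)
print(len(transitions), len(accepted_strings(transitions)))
```

Under these assumptions the node <{a,b,c},2> accepts every string of length 1 or 2 over {a, b, c}, which shows how compression makes the trie accept strings that were never inserted.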


Outline

  • Motivation: supporting queries with mixed-type predicates

  • Our approach: MAT tree

  • Construction and maintenance of MAT tree

  • Experiments


Constructing MAT-tree

  • Option 1: insert records one by one.

  • Option 2:

    • bulk-load records

    • construct the MAT-tree bottom-up


Compressing a trie

  • Important:

    • Accurately represent strings in a limited space.

    • Minimize “information loss”.

    • Maintain the pruning power during a traversal.

  • Three methods:

    • (1) Reducing # of accepted strings

    • (2) Keeping accepted strings “clustered”

    • (3) Combining of (1) and (2)


Method (1): Reducing # of accepted strings

  • Intuition:

    • reducing this # makes the compressed trie more accurate

  • Goodness function: # of accepted strings

  • Algorithm: “Randomized”

    • Randomly select k initial centers

    • Randomly select one of the centers

    • Randomly select an unselected node

    • Swap them if it can improve the goodness function

    • Do certain # of iterations
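The randomized swap procedure can be sketched as a small local search. This is an illustration only: the function name `randomized_centers`, its parameters, and the toy cost function in the usage lines are assumptions. In the slides, the goodness function is the number of strings accepted by the compressed trie, which the caller would supply as `cost` (lower is better).

```python
import random


def randomized_centers(nodes, k, cost, iterations=1000, seed=0):
    """Pick k centers from nodes by random swaps that lower cost(centers)."""
    rng = random.Random(seed)
    nodes = list(nodes)
    centers = set(rng.sample(nodes, k))  # randomly select k initial centers
    best = cost(centers)
    for _ in range(iterations):
        inside = [n for n in nodes if n in centers]
        outside = [n for n in nodes if n not in centers]
        if not outside:
            break  # every node is already a center; nothing to swap
        old = rng.choice(inside)   # randomly select one of the centers
        new = rng.choice(outside)  # randomly select an unselected node
        candidate = (centers - {old}) | {new}
        candidate_cost = cost(candidate)
        if candidate_cost < best:  # swap only if the goodness improves
            centers, best = candidate, candidate_cost
    return centers, best


# Toy usage: nodes are integers and the cost is their sum (an assumption
# standing in for "number of accepted strings").
centers, best = randomized_centers(range(10), 3, cost=sum, iterations=200)
print(sorted(centers), best)
```

Because a swap is accepted only when it improves the goodness function, the cost never increases from one iteration to the next.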


Method (2): Keeping accepted strings clustered

  • Intuition:

    • keeping the accepted strings similar to the original ones by letting them share common prefix.

    • Place k centers as close to the root as possible.

  • Algorithm: “BreadthFirst”


Method (3): Combining (1) and (2)

  • Intuition:

    • minimize the number of accepted strings and, at the same time, maintain their similarity to the originals.

  • Algorithm: “BottomUp”

    • Keep shrinking the trie bottom up until we have k nodes

    • Compress a node that minimizes # of additional strings


Dynamic maintenance

Insertion (s, n)

  • Search the index for (s, n). If it’s not in the index, identify the correct leaf node.

  • If no overflow:

    • update the “MBR” of the leaf node and, recursively, of its ancestors if necessary.

  • If overflow:

    • Split the leaf node

    • Construct two compressed tries

    • Cascade the split to the ancestors if necessary.

Deletion and update are handled similarly.


Outline

  • Motivation: supporting queries with mixed-type predicates

  • Our approach: MAT tree

  • Construction and maintenance of MAT tree

  • Experiments


Setting

  • Data

    • IMDB: 100K movie star records (Name and YOB).

    • Customers: 50K records (Name and YOB)

  • Test bed

    • PC: 2.4 GHz P4, 1.2 GB memory, Windows XP

    • Visual C++ compiler

  • Results on the two data sets are similar; we report the results for IMDB.


Implemented approaches

  • B-tree

  • Q-tree

  • B-tree & Q-tree

  • BQ-tree

  • BM-tree

  • Sequential scan

    “BBQ-tree”? 


“2 > 1 + 1”

An integrated indexing structure is better than

two separate indexing structures

δs=3, δn=4







Number of centers

  • Increasing the number of centers may not reduce the running time: pruning power versus computational cost

  • For BottomUp and BreadthFirst (compared to Randomized):

    • Centers are close to the root, so early termination is more likely


Conclusion

  • MAT-tree: an efficient indexing structure for queries with mixed-type predicates

  • Can be efficiently constructed and maintained

  • Future work: develop a uniform framework to support different kinds of similarity functions

The Flamingo Project: http://www.ics.uci.edu/~flamingo/

Q&A?

