Indexing Mixed Types for Approximate Retrieval

Liang Jin * UC Irvine Nick Koudas University of Toronto Chen Li * UC Irvine Anthony K.H. Tung National University of Singapore. Indexing Mixed Types for Approximate Retrieval. VLDB’2005 * Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586.


Liang Jin* UC Irvine

Nick Koudas University of Toronto

Chen Li* UC Irvine

Anthony K.H. Tung National University of Singapore

Indexing Mixed Types for Approximate Retrieval

VLDB’2005

* Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586

Queries with Mixed-Type Predicates

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND |year – 1980| <= 5;

  • SIMILARTO:
    • a domain-specific function
    • returns a similarity value between two strings
  • Example: edit distance ed(Tom Hanks, Ton Hank) = 2
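
To make the edit-distance example concrete, here is a minimal sketch of the textbook dynamic-programming (Levenshtein) computation. The function name `edit_distance` is chosen for illustration and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]

print(edit_distance("Tom Hanks", "Ton Hank"))  # prints 2
```

The two edits here are the substitution m → n and the deletion of the trailing s.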
Why fuzzy predicates?

  • Errors in databases
    • Data is not clean
    • Especially true in data integration and cleansing

Relation S (Star)      Relation R (Star)
Keanu Reeves           Keanu Reeves
Samuel Jackson         Samuel L. Jackson
Schwarzenegger         Schwarzenegger

  • Errors in queries
    • User doesn’t remember a string exactly
    • User types a wrong string
Problem Formulation

Given: A query with fuzzy predicates on strings and range predicates on numeric attributes on a single relation

Goal: Answer the query efficiently

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND |year – 1980| <= 5;

Rest of the talk
  • Motivation: supporting queries with mixed-type predicates
  • Our approach: MAT tree
  • Construction and maintenance of MAT tree
  • Experiments
Assumptions
  • One fuzzy string predicate (edit distance)
  • One numeric predicate

Query:

(Qs, δs, Qn, δn)

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND |year – 1980| <= 5;

(’Schwarrzenger’, 2, 1980, 5)

Intuition of MAT (Mixed-attribute-type) Tree
  • “2 > 1 + 1”
    • One integrated indexing structure is better than two independent indexing structures on two attributes
  • Indexing numeric attributes: B-tree or R-tree
  • Indexing strings as a tree to support fuzzy predicates?

MAT tree

Answering a query (Qs, δs, Qn, δn)
  • Traverse the MAT-tree top-down
  • At each node, prune the subtree unless both checks pass:
    • [Qn – δn, Qn + δn] overlaps the node's numeric range.
    • minEditDistance(Qs, Tn) <= δs, where Tn is the node's trie.
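
The traversal above can be sketched as follows. This is an illustrative toy, not the authors' implementation: the `Node` layout is assumed, and a plain set of strings stands in for the node's compressed trie Tn, so `min_edit_distance` is a brute-force placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

@dataclass
class Node:  # hypothetical node layout, for illustration only
    lo: int             # smallest numeric value under this node
    hi: int             # largest numeric value under this node
    strings: Set[str]   # stand-in for the node's trie Tn (assumed non-empty)
    children: List["Node"] = field(default_factory=list)
    records: List[Tuple[str, int]] = field(default_factory=list)  # leaves only

def min_edit_distance(qs: str, node: Node) -> int:
    # Placeholder: brute force over the strings summarised by the node.
    return min(edit_distance(qs, s) for s in node.strings)

def search(node: Node, qs: str, ds: int, qn: int, dn: int):
    # Numeric pruning: [qn - dn, qn + dn] must overlap [lo, hi].
    if node.hi < qn - dn or node.lo > qn + dn:
        return []
    # String pruning: some string under the node must be within ds.
    if min_edit_distance(qs, node) > ds:
        return []
    if not node.children:  # leaf: verify each record exactly
        return [(s, n) for s, n in node.records
                if abs(n - qn) <= dn and edit_distance(s, qs) <= ds]
    results = []
    for child in node.children:
        results.extend(search(child, qs, ds, qn, dn))
    return results

leaf1 = Node(1950, 1960, {"Tom Hanks", "Ton Hank"},
             records=[("Tom Hanks", 1956), ("Ton Hank", 1958)])
leaf2 = Node(1961, 1970, {"Keanu Reeves"},
             records=[("Keanu Reeves", 1964)])
root = Node(1950, 1970, leaf1.strings | leaf2.strings,
            children=[leaf1, leaf2])
print(search(root, "Tom Hanks", 2, 1956, 5))
```

In this toy tree the second leaf passes the numeric check but is pruned by the string check, so its records are never examined.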
Challenge
  • How to represent strings to fit into a limited space and support fuzzy-predicate pruning

Limited space (disk based)

Existing Approaches to Indexing Strings as Trees
  • M-tree:
    • Edit distance: metric space
  • Q-tree
    • Utilize the q-gram property of strings.
    • See our paper for details
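
The slides do not spell out the q-gram property, so the sketch below shows the standard count filter from the approximate string matching literature rather than the Q-tree itself: one edit operation changes at most q q-grams, so two strings within edit distance k must share at least max(|a|, |b|) - q + 1 - k·q q-grams. All function names are illustrative.

```python
from collections import Counter

def qgrams(s: str, q: int) -> Counter:
    """Bag (multiset) of all length-q substrings of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def common_qgrams(a: str, b: str, q: int) -> int:
    # Counter & Counter keeps the minimum count of each shared q-gram.
    return sum((qgrams(a, q) & qgrams(b, q)).values())

def may_be_within(a: str, b: str, k: int, q: int = 2) -> bool:
    """False means ed(a, b) > k for certain; True means the pair
    still has to be verified with a real edit-distance computation."""
    bound = max(len(a), len(b)) - q + 1 - k * q
    return common_qgrams(a, b, q) >= bound

print(may_be_within("Tom Hanks", "Ton Hank", k=2))  # True: cannot be ruled out
print(may_be_within("Tom Hanks", "Ton Hank", k=1))  # False: distance exceeds 1
```

The filter is one-sided: it never discards a true match, but a pair that passes may still fail the exact check.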
Compressing a trie

compression

  • Select k representative nodes (centers).
  • Each center is stored in the format <alphabet, height>.
  • A compressed trie accepts more strings than the original trie
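
As a rough illustration of the <alphabet, height> format, the sketch below summarises a subtrie by the set of characters it contains and its depth. This is one plausible reading of the slide, not a confirmed description of the authors' method; the nested-dict trie and the function names are assumptions, and end-of-word markers are omitted for brevity.

```python
from typing import Dict, Iterable, Set, Tuple

def build_trie(words: Iterable[str]) -> Dict:
    """Trie as nested dicts: character -> child subtrie."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root

def compress(subtrie: Dict) -> Tuple[Set[str], int]:
    """Summarise everything below a node as <alphabet, height>."""
    alphabet: Set[str] = set()
    height = 0
    for ch, child in subtrie.items():
        child_alphabet, child_height = compress(child)
        alphabet |= {ch} | child_alphabet
        height = max(height, 1 + child_height)
    return alphabet, height

trie = build_trie(["ab", "ac", "abc"])
print(compress(trie["a"]))  # alphabet {'b', 'c'} and height 2
```

The summary keeps only which characters occur and how deep the subtrie is, which is why the compressed form accepts strings that were not in the original set.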
Minimum edit distance between a string and a trie

minEditDistance(Qs, Tn)?

  • Convert a trie to an automaton.
  • Compute the min distance between a string and an automaton [Myers and Miller, 1989]
  • Early termination possible
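
The slides cite an automaton-based algorithm for this step. The sketch below swaps in a simpler, widely used alternative for an uncompressed trie: carry one row of the edit-distance table down each trie path and stop descending as soon as every entry in the row is too large. Class and function names are illustrative.

```python
from typing import Dict, Iterable, Optional

class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False

def build_trie(words: Iterable[str]) -> TrieNode:
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def min_edit_distance(query: str, root: TrieNode, threshold: int) -> Optional[int]:
    """Smallest edit distance between query and any word in the trie,
    or None if every word is farther than threshold."""
    best = threshold + 1
    # row[j] = edit distance between the current trie path and query[:j]
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if node.is_word:
            best = min(best, row[-1])
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j, qc in enumerate(query, 1):
                new_row.append(min(row[j] + 1,          # extra trie character
                                   new_row[j - 1] + 1,  # extra query character
                                   row[j - 1] + (qc != ch)))
            # Early termination: the row minimum never decreases deeper
            # in the trie, so this branch cannot beat the current best.
            if min(new_row) < best:
                stack.append((child, new_row))
    return best if best <= threshold else None

trie = build_trie(["Tom Hanks", "Keanu Reeves", "Samuel L. Jackson"])
print(min_edit_distance("Ton Hank", trie, threshold=2))  # prints 2
```

A tighter threshold prunes more branches, which is the same trade-off the early-termination bullet refers to.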
Compressed trie → Automaton
  • Each node is a state.
  • Each edge becomes a transition between two states.
  • For compressed node <Σ, L>, expand it to L levels. At each level, all characters in Σ become single states and are connected to a common tail ε.

Convert a compressed node <{a,b,c},2> into automaton nodes.
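
One way to read the expansion rule is sketched below: each of the L levels gets one state per character of Σ, every state of one level connects to every state of the next, and each state has an ε-transition to a shared tail state. The state naming (`"start"`, `"tail"`, `(level, char)`) and the exact wiring are assumptions made for illustration; the paper may define them differently.

```python
from typing import List, Set, Tuple

def expand_compressed_node(alphabet: Set[str], levels: int):
    """Expand <alphabet, levels> into labelled and epsilon transitions."""
    chars = sorted(alphabet)
    transitions: List[Tuple[object, str, object]] = []  # (source, char, target)
    epsilons: List[Tuple[object, str]] = []             # (source, "tail")
    previous: List[object] = ["start"]
    for level in range(1, levels + 1):
        current = [(level, ch) for ch in chars]
        for src in previous:
            for dst in current:
                transitions.append((src, dst[1], dst))
        # Every state of this level may jump to the common tail.
        epsilons.extend((state, "tail") for state in current)
        previous = current
    return transitions, epsilons

transitions, epsilons = expand_compressed_node({"a", "b", "c"}, 2)
print(len(transitions), len(epsilons))  # prints 12 6
```

For <{a,b,c},2> this gives 3 transitions into the first level, 9 between the two levels, and 6 ε-transitions to the tail.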

Outline
  • Motivation: supporting queries with mixed-type predicates
  • Our approach: MAT tree
  • Construction and maintenance of MAT tree
  • Experiments
Constructing MAT-tree
  • Option 1: insert records one by one.
  • Option 2:
    • bulk-load records
    • construct the MAT-tree bottom-up
Compressing a trie
  • Important:
    • Accurately represent strings in a limited space.
    • Minimize “information loss”.
    • Maintain the pruning power during a traversal.
  • Three methods:
    • (1) Reducing # of accepted strings
    • (2) Keeping accepted strings “clustered”
    • (3) Combining (1) and (2)
Method (1): Reducing # of accepted strings
  • Intuition:
    • reducing this # makes the compressed trie more accurate
  • Goodness function: # of accepted strings
  • Algorithm: “Randomized”
    • Randomly select k initial centers
    • Randomly select one of the centers
    • Randomly select an unselected node
    • Swap them if the swap improves the goodness function
    • Repeat for a fixed number of iterations
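
The randomized selection can be sketched as a simple swap-based local search. The goodness function is passed in as a parameter because the slides only name it (# of accepted strings, lower is better); the toy `sum` used in the usage lines is purely a stand-in, and the function name, defaults, and seed handling are assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple

def randomized_centers(nodes: Sequence, k: int,
                       goodness: Callable[[List], float],
                       iterations: int = 100,
                       seed: int = 0) -> Tuple[List, float]:
    """Pick k centers by random swaps, keeping a swap only if it
    lowers goodness(centers)."""
    rng = random.Random(seed)
    centers = rng.sample(list(nodes), k)     # random initial centers
    best = goodness(centers)
    for _ in range(iterations):
        unselected = [n for n in nodes if n not in centers]
        if not unselected:
            break
        i = rng.randrange(k)                 # random center to replace
        candidate = rng.choice(unselected)   # random unselected node
        trial = centers[:i] + [candidate] + centers[i + 1:]
        score = goodness(trial)
        if score < best:                     # keep only improving swaps
            centers, best = trial, score
    return centers, best

# Toy stand-in: nodes are numbers and goodness is their sum.
centers, score = randomized_centers(list(range(10)), 3, sum, iterations=200)
print(sorted(centers), score)
```

Because only improving swaps are kept, the score never gets worse than that of the initial random choice, but the result is not guaranteed to be optimal.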
Method (2): Keeping accepted strings clustered
  • Intuition:
    • keeping the accepted strings similar to the original ones by letting them share a common prefix.
    • Place k centers as close to the root as possible.
  • Algorithm: “BreadthFirst”
Method (3): Combining (1) and (2)
  • Intuition:
    • minimize the number of accepted strings, and at the same time maintain their similarity to the originals.
  • Algorithm: “BottomUp”
    • Keep shrinking the trie bottom up until we have k nodes
    • Compress a node that minimizes # of additional strings
Dynamic maintenance

Insertion (s, n)

  • Search the index for (s, n). If it’s not in the index, identify the correct leaf node.
  • If no overflow:
    • update the “MBR” of the leaf node and its ancestors recursively if necessary.
  • If overflow:
    • Split the leaf node and
    • Construct two compressed tries
    • Cascade the split to the ancestors if necessary.

Deletion and Update are handled similarly

Outline
  • Motivation: supporting queries with mixed-type predicates
  • Our approach: MAT tree
  • Construction and maintenance of MAT tree
  • Experiments
Setting
  • Data
    • IMDB: 100K movie star records (Name and YOB).
    • Customers: 50K records (Name and YOB)
  • Test bed
    • PC: 2.4 GHz Pentium 4, 1.2 GB memory, Windows XP
    • Visual C++ compiler
  • Results on the two datasets are similar; we report the results for IMDB.
Implemented approaches
  • B-tree
  • Q-tree
  • B-tree & Q-tree
  • BQ-tree
  • BM-tree
  • Sequential scan

“BBQ-tree”? 

“2 > 1 + 1”

An integrated indexing structure is better than two separate indexing structures

δs=3, δn=4

Number of centers
  • Increasing the number of centers may not reduce the running time: more centers give more pruning power but also cost more computation
  • BottomUp and BreadthFirst (compared to Randomized) place centers close to the root, so early termination is more likely
Conclusion
  • MAT-tree: an efficient indexing structure for queries with mixed-type predicates
  • Can be efficiently constructed and maintained
  • Future work: develop a uniform framework to support different kinds of similarity functions

The Flamingo Project: http://www.ics.uci.edu/~flamingo/

Q&A?
