
Indexing Mixed Types for Approximate Retrieval

Liang Jin* (UC Irvine), Nick Koudas (University of Toronto), Chen Li* (UC Irvine), Anthony K.H. Tung (National University of Singapore). * Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586.


Presentation Transcript


  1. Indexing Mixed Types for Approximate Retrieval • Liang Jin* (UC Irvine) • Nick Koudas (University of Toronto) • Chen Li* (UC Irvine) • Anthony K.H. Tung (National University of Singapore) • * Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586

  2. Queries with Mixed-Type Predicates SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1980| <= 5; • SIMILARTO: • a domain-specific function • returns a similarity value between two strings • Example: edit distance ed(Tom Hanks, Ton Hank) = 2
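The edit distance used in the example above can be sketched as follows. This is a minimal illustration using the standard Levenshtein dynamic program (insertions, deletions, and substitutions each cost 1); the function name `edit_distance` is an assumption, not something taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, as on the slide
```

The two edits are the substitution m → n and the deletion of the trailing s.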

  3. Why fuzzy predicates? • Errors in databases: • Data is not clean • Especially true in data integration and cleansing • Example: Relations S and R each have a Star column (Keanu Reeves, Schwarzenegger, …), but one stores “Samuel Jackson” where the other stores “Samuel L. Jackson” • Errors in queries: • User doesn’t remember a string exactly • User types a wrong string

  4. Problem Formulation • Given: a query with fuzzy predicates on strings and range predicates on numeric attributes, on a single relation • Goal: answer the query efficiently SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1980| <= 5;

  5. Rest of the talk • Motivation: supporting queries with mixed-type predicates • Our approach: MAT tree • Construction and maintenance of MAT tree • Experiments

  6. Assumptions • One fuzzy string predicate (edit distance) • One numeric predicate Query: (Qs, δs, Qn, δn) SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1980| <= 5; (’Schwarrzenger’, 2, 1980, 5)

  7. Intuition of MAT (Mixed-attribute-type) Tree • “2 > 1 + 1” • One integrated indexing structure is better than • two independent indexing structures on two attributes • Indexing numeric attributes: B-tree or R-tree • Indexing strings as a tree to support fuzzy predicates? MAT tree

  8. Answering a query (Qs, δs, Qn, δn) • Traverse the MAT-tree top-down • At each node, prune the subtree unless both checks pass: • [Qn – δn, Qn + δn] overlaps the node’s numeric range. • minEditDistance(Qs, Tn) <= δs, where Tn is the node’s trie.
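The traversal above can be sketched as follows. This is an illustration under loud assumptions: the `Node` class, its field names, and `search` are hypothetical, and each node keeps a plain set of strings as a stand-in for the compressed trie, so the string check here is exact rather than the approximate bound a real MAT-tree node would give.

```python
from dataclasses import dataclass, field


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


@dataclass
class Node:  # hypothetical node layout, not the paper's page format
    lo: int                     # smallest numeric value in the subtree
    hi: int                     # largest numeric value in the subtree
    strings: set                # stand-in for the node's compressed trie
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)  # (string, number), leaves only


def search(node, qs, ds, qn, dn):
    """Return all (string, number) records matching (Qs, δs, Qn, δn)."""
    # Numeric pruning: the query interval must overlap [lo, hi].
    if node.hi < qn - dn or node.lo > qn + dn:
        return []
    # String pruning: some string under this node must be within δs.
    if min((edit_distance(qs, s) for s in node.strings),
           default=float("inf")) > ds:
        return []
    if not node.children:  # leaf: verify each record exactly
        return [(s, n) for s, n in node.records
                if abs(n - qn) <= dn and edit_distance(qs, s) <= ds]
    results = []
    for child in node.children:
        results.extend(search(child, qs, ds, qn, dn))
    return results


leaf_a = Node(lo=1982, hi=1984, strings={"Schwarzenegger", "Stallone"},
              records=[("Schwarzenegger", 1984), ("Stallone", 1982)])
leaf_b = Node(lo=1958, hi=1994, strings={"Tom Hanks", "Tom Cruise"},
              records=[("Tom Hanks", 1994), ("Tom Cruise", 1986),
                       ("Tom Hanks", 1958)])
root = Node(lo=1958, hi=1994, strings=leaf_a.strings | leaf_b.strings,
            children=[leaf_a, leaf_b])

print(search(root, "Tom Hank", 1, 1990, 5))  # [('Tom Hanks', 1994)]
```

In this toy tree, `leaf_a` is discarded by the numeric check alone, so none of its strings are ever compared against the query.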

  9. Challenge • How to represent strings to fit into a limited space • and support fuzzy-predicate pruning Limited space (disk based)

  10. Existing Approaches to Indexing Strings as Trees • M-tree: • Edit distance: metric space • Q-tree • Utilize the q-gram property of strings. • See our paper for details

  11. Representing strings as a trie

  12. Compressing a trie • Select k representative nodes (centers). • Each center is in the format of <alphabet, height>. • A compressed trie represents more strings

  13. Minimum edit distance between a string and a trie: minEditDistance(Qs, Tn)? • Convert a trie to an automaton. • Compute the min distance between a string and an automaton [Myers and Miller, 1989] • Early termination possible
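As a rough illustration of computing the minimum edit distance between a string and a trie with early termination, the sketch below walks an uncompressed trie and carries one dynamic-programming row per trie node. This is a simpler, well-known technique, not the Myers and Miller automaton algorithm cited on the slide; `build_trie`, `min_edit_distance`, the nested-dict trie layout, and the `bound` parameter are all assumptions made for the example.

```python
END = ""  # marker key: a stored string ends at this trie node


def build_trie(words):
    """Build a trie as nested dicts keyed by character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def min_edit_distance(query, trie, bound=float("inf")):
    """Smallest edit distance between query and any string in the trie.

    Branches whose DP row already exceeds `bound` are skipped, so a
    result greater than `bound` (possibly infinity) means "no match".
    """
    best = float("inf")

    def visit(node, row):
        nonlocal best
        if END in node:
            best = min(best, row[-1])
        for ch, child in node.items():
            if ch == END:
                continue
            # Extend the DP by one trie character.
            new = [row[0] + 1]
            for j, qc in enumerate(query, start=1):
                new.append(min(new[j - 1] + 1,              # skip a query char
                               row[j] + 1,                  # skip the trie char
                               row[j - 1] + (qc != ch)))    # match/substitute
            # Early termination: min(new) is a lower bound on every
            # distance reachable below this child.
            if min(new) <= bound and min(new) < best:
                visit(child, new)

    visit(trie, list(range(len(query) + 1)))
    return best


trie = build_trie(["Tom Hanks", "Tom Cruise"])
print(min_edit_distance("Ton Hank", trie))  # 2
```

Passing the query threshold δs as `bound` lets the walk abandon a branch as soon as no string below it can qualify.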

  14. Compressed trie → Automaton • Each node is a state. • Each edge becomes a transition between two states. • For compressed node <Σ, L>, expand it to L levels. At each level, all characters in Σ become single states and are connected to a common tail ε. • Example: convert a compressed node <{a,b,c},2> into automaton nodes.

  15. Outline • Motivation: supporting queries with mixed-type predicates • Our approach: MAT tree • Construction and maintenance of MAT tree • Experiments

  16. Constructing MAT-tree • Option 1: insert records one by one. • Option 2: • bulk-load records • construct the MAT-tree bottom-up

  17. Compressing a trie • Important: • Accurately represent strings in a limited space. • Minimize “information loss”. • Maintain the pruning power during a traversal. • Three methods: • (1) Reducing # of accepted strings • (2) Keeping accepted strings “clustered” • (3) Combining (1) and (2)

  18. Method (1): Reducing # of accepted strings • Intuition: • reducing this # makes the compressed trie more accurate • Goodness function: # of accepted strings • Algorithm: “Randomized” • Randomly select k initial centers • Randomly select one of the centers • Randomly select an unselected node • Swap them if the swap improves the goodness function • Repeat for a fixed number of iterations
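The “Randomized” steps above amount to a random-swap local search, which can be sketched generically. Everything named here is an assumption for illustration: `randomized_centers`, its parameters, and the idea of passing the goodness function in as a callable. In the MAT-tree setting the goodness function would count the strings accepted by the compressed trie; the toy usage below substitutes a simple sum.

```python
import random


def randomized_centers(nodes, k, goodness, iterations=1000, seed=0):
    """Pick k centers by random swaps, keeping a swap only if it lowers
    goodness(centers). Lower goodness is better (fewer accepted strings).
    Assumes the nodes are distinct and k <= len(nodes)."""
    rng = random.Random(seed)
    nodes = list(nodes)
    centers = rng.sample(nodes, k)             # k random initial centers
    others = [n for n in nodes if n not in centers]
    score = goodness(centers)
    for _ in range(iterations):
        if not others:
            break
        i = rng.randrange(k)                   # a random current center
        j = rng.randrange(len(others))         # a random unselected node
        candidate = centers[:i] + [others[j]] + centers[i + 1:]
        candidate_score = goodness(candidate)
        if candidate_score < score:            # keep only improving swaps
            others[j] = centers[i]
            centers, score = candidate, candidate_score
    return centers, score


# Toy usage: "nodes" are integers and goodness is their sum.
centers, score = randomized_centers(range(10), k=3, goodness=sum,
                                    iterations=2000)
print(sorted(centers), score)
```

Because only improving swaps are kept, the result is a local optimum that depends on the seed and the iteration budget.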

  19. Method (2): Keeping accepted strings clustered • Intuition: • keep the accepted strings similar to the original ones by letting them share common prefixes. • Place k centers as close to the root as possible. • Algorithm: “BreadthFirst”

  20. Method (3): Combining (1) and (2) • Intuition: • minimize the number of accepted strings, and at the same time maintain their similarity to the originals. • Algorithm: “Bottomup” • Keep shrinking the trie bottom-up until we have k nodes • At each step, compress the node that adds the fewest additional strings

  21. Dynamic maintenance • Insertion (s, n): • Search the index for (s, n). If it’s not in the index, identify the correct leaf node. • If no overflow: • update the “MBR” of the leaf node and its ancestors recursively if necessary. • If overflow: • Split the leaf node and • Construct two compressed tries • Cascade the split to the ancestors if necessary. • Deletion and update are handled similarly

  22. Outline • Motivation: supporting queries with mixed-type predicates • Our approach: MAT tree • Construction and maintenance of MAT tree • Experiments

  23. Setting • Data • IMDB: 100K movie star records (Name and YOB). • Customers: 50K records (Name and YOB) • Test bed • PC: 2.4 GHz P4, 1.2 GB memory, Windows XP • Visual C++ compiler • Results on the two data sets are similar; we report those for IMDB.

  24. Implemented approaches • B-tree • Q-tree • B-tree & Q-tree • BQ-tree • BM-tree • Sequential scan “BBQ-tree”? 

  25. “2 > 1 + 1” An integrated indexing structure is better than two separate indexing structures δs=3, δn=4

  26. Scalability

  27. Effect of numeric threshold δn

  28. Effect of string threshold δs

  29. Dynamic Maintenance: time

  30. Dynamic maintenance: MAT quality

  31. Number of centers • Increasing the number of centers may not reduce the running time: pruning power versus computational cost • For BottomUp and BreadthFirst (compared to Randomized): • Centers are close to the root, so early termination is more likely

  32. Conclusion • MAT-tree: an efficient indexing structure for queries with mixed-type predicates • Can be efficiently constructed and maintained • Future work: develop a uniform framework to support different kinds of similarity functions • The Flamingo Project: http://www.ics.uci.edu/~flamingo/ • Q&A?

  33. Backup Slides

  34. Constructing MAT-tree • Option 1: inserting records one by one. • Option 2: bulk-loading data records and constructing the MAT-tree in a bottom-up fashion. • Records are sorted based on one attribute. • Fill pages with records until full. • Calculate the numeric range and the compressed trie for each leaf node. • Merge leaf nodes into internal nodes recursively according to the desired fanout, until a single root is formed.
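The bulk-loading steps of Option 2 can be sketched as below. This is a simplified illustration, not the authors' implementation: `Node`, `bulk_load`, `page_size`, and `fanout` are hypothetical names, records are sorted on the numeric attribute (the slide only says "one attribute"), and each node stores a plain set of strings where the MAT-tree would store a compressed trie.

```python
from dataclasses import dataclass, field


@dataclass
class Node:  # hypothetical node layout
    lo: int                     # numeric range covered by the subtree
    hi: int
    strings: frozenset          # stand-in for the compressed trie
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)


def bulk_load(records, page_size=4, fanout=4):
    """Build a tree bottom-up from (string, number) records."""
    if not records or page_size < 1 or fanout < 2:
        raise ValueError("need records, page_size >= 1, fanout >= 2")
    # 1. Sort records on one attribute (here: the numeric one).
    ordered = sorted(records, key=lambda r: r[1])
    # 2. Fill leaf pages and compute each leaf's range and string summary.
    level = []
    for i in range(0, len(ordered), page_size):
        page = ordered[i:i + page_size]
        level.append(Node(lo=page[0][1], hi=page[-1][1],
                          strings=frozenset(s for s, _ in page),
                          records=page))
    # 3. Merge nodes into parents, `fanout` at a time, until one root remains.
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            parents.append(Node(
                lo=min(c.lo for c in group),
                hi=max(c.hi for c in group),
                strings=frozenset().union(*(c.strings for c in group)),
                children=group))
        level = parents
    return level[0]


stars = [("Tom Hanks", 1956), ("Keanu Reeves", 1964),
         ("Samuel L. Jackson", 1948), ("Schwarzenegger", 1947),
         ("Tom Cruise", 1962)]
root = bulk_load(stars, page_size=2, fanout=2)
print(root.lo, root.hi, len(root.strings))  # 1947 1964 5
```

With five records, `page_size=2`, and `fanout=2`, the build produces three leaves, then two internal nodes, then the root.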

  35. Example – Customer Service Call Center • Customer calls in • Serve the customer • Issue a fuzzy query: Name LIKE “Tom Hanks” AND YOB CLOSE TO 1958 • Return result • In this example, the underlying system should be able to support fuzzy queries on both the string and numeric attributes!

  36. Scalability test (IO)
