Hinrich Schütze and Christina Lioma Lecture 3: Dictionaries and tolerant retrieval



  1. Hinrich Schütze and Christina Lioma Lecture 3: Dictionaries and tolerant retrieval

  2. Outline • Recap • Dictionaries • Wildcard queries • Edit distance • Spelling correction • Soundex

  3. Sec. 3.1 Dictionary data structures for inverted indexes • The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?

  4. Sec. 3.1 A naïve dictionary • An array of structs, one per term: char[20] (the term, 20 bytes), int (document frequency, 4/8 bytes), Postings* (pointer to the postings list, 4/8 bytes) • How do we store a dictionary in memory efficiently? • How do we quickly look up elements at query time?

  5. Sec. 3.1 Dictionary data structures • Two main choices: • Hash table • Tree • Some IR systems use hashes, some trees

  6. Sec. 3.1 Hashes • Each vocabulary term is hashed to an integer • (We assume you’ve seen hashtables before) • Pros: • Lookup is faster than for a tree: O(1) • Cons: • No easy way to find minor variants: • judgment/judgement • No prefix search [tolerant retrieval] • If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

  7. Sec. 3.1 Trees • Simplest: binary tree • More usual: B-trees • Trees require a standard ordering of characters and hence strings … but we standardly have one • Pros: • Solves the prefix problem (terms starting with hyp) • Cons: • Slower: O(log M) [and this requires balanced tree] • Rebalancing binary trees is expensive • But B-trees mitigate the rebalancing problem

  8. Binary tree (figure not captured in the transcript)

  9. Sec. 3.1 Tree: B-tree • Definition: Every internal node has a number of children in the interval [a, b] where a, b are appropriate natural numbers, e.g., [2, 4]. (Figure: a B-tree root whose children cover the key ranges a-hu, hy-m, and n-z.)

  10. Outline • Recap • Dictionaries • Wildcard queries • Edit distance • Spelling correction • Soundex

  11. Sec. 3.2 Wild-card queries: * • mon*: find all docs containing any word beginning “mon”. • Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo • *mon: find words ending in “mon”: harder • Maintain an additional B-tree for terms backwards. Can retrieve all words in range: nom ≤ w < non. Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?
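The prefix-range retrieval above can be sketched with a sorted list standing in for the B-tree's leaf level; `prefix_range` and the toy vocabulary are illustrative, not from the slides:

```python
import bisect

def prefix_range(sorted_vocab, prefix):
    """All terms w in the sorted vocabulary with prefix <= w < bumped prefix."""
    lo = bisect.bisect_left(sorted_vocab, prefix)
    # Bump the last character: "mon" -> "moo", the exclusive upper bound
    hi_key = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(sorted_vocab, hi_key)
    return sorted_vocab[lo:hi]

vocab = sorted(["man", "monday", "money", "month", "moon"])
print(prefix_range(vocab, "mon"))  # ['monday', 'money', 'month']
```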

  12. How to handle * in the middle of a term • Example: m*nchen • We could look up m* and *nchen in the B-tree and intersect the two term sets. • Expensive • Alternative: permuterm index • Basic idea: Rotate every wildcard query, so that the * occurs at the end. • Store each of these rotations in the dictionary, say, in a B-tree

  13. Sec. 3.2 B-trees handle *’s at the end of a query term • How can we handle *’s in the middle of query term? • co*tion • We could look up co* AND *tion in a B-tree and intersect the two term sets • Expensive • The solution: transform wild-card queries so that the *’s occur at the end • This gives rise to the Permuterm Index.

  14. Permuterm index • For term HELLO: add hello$, ello$h, llo$he, lo$hel, o$hell, and $hello to the B-tree, where $ is a special symbol

  15. Permuterm → term mapping (figure not captured in the transcript)

  16. Permuterm index • For HELLO, we’ve stored: hello$, ello$h, llo$he, lo$hel, o$hell, and $hello • Queries • For X, look up X$ • For X*, look up $X* • For *X, look up X$* • For *X*, look up X* • For X*Y, look up Y$X* • Example: For hel*o, look up o$hel* • Permuterm index would better be called a permuterm tree. • But permuterm index is the more common name.
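The rotation scheme can be sketched as follows (a minimal illustration; the function names are mine, and `rotate_query` assumes exactly one `*` in the query):

```python
def permuterm_rotations(term):
    """All rotations of term + '$' -- the keys stored for this term."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def rotate_query(query):
    """Rotate a wildcard query (one '*') so the '*' comes last."""
    s = query + "$"
    i = s.index("*")
    return s[i + 1:] + s[:i + 1]   # look this up as a prefix lookup on the tree

print(permuterm_rotations("hello"))
print(rotate_query("hel*o"))  # o$hel*
print(rotate_query("m*"))     # $m*
```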

  17. Processing a lookup in the permuterm index • Rotate query wildcard to the right • Use B-tree lookup as before • Problem: Permuterm more than quadruples the size of the dictionary compared to a regular B-tree. (empirical number)

  18. k-gram indexes • More space-efficient than permuterm index • Enumerate all character k-grams (sequences of k characters) occurring in a term • 2-grams are called bigrams. • Example: from April is the cruelest month we get the bigrams: $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt th h$ • $ is a special word boundary symbol, as before. • Maintain an inverted index from bigrams to the terms that contain the bigram
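Building the bigram-to-term inverted index can be sketched as follows (illustrative names, not from the slides):

```python
from collections import defaultdict

def kgrams(term, k=2):
    """The set of k-grams of '$' + term + '$'."""
    s = "$" + term + "$"
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def build_kgram_index(vocab, k=2):
    """Inverted index from k-grams to the terms containing them."""
    index = defaultdict(set)
    for term in vocab:
        for g in kgrams(term, k):
            index[g].add(term)
    return index

idx = build_kgram_index(["monday", "month", "moon"])
print(sorted(idx["mo"]))  # ['monday', 'month', 'moon']
print(sorted(idx["th"]))  # ['month']
```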

  19. Postings list in a 3-gram inverted index (figure not captured in the transcript)

  20. k-gram (bigram, trigram, . . . ) indexes • Note that we now have two different types of inverted indexes • The term-document inverted index for finding documents based on a query consisting of terms • The k-gram index for finding terms based on a query consisting of k-grams

  21. Processing wildcarded terms in a bigram index • Query mon* can now be run as: $m AND mo AND on • Gets us all terms with the prefix mon . . . • . . . but also many “false positives” like MOON. • We must postfilter these terms against the query. • Surviving terms are then looked up in the term-document inverted index. • k-gram index vs. permuterm index • k-gram index is more space efficient. • Permuterm index doesn’t require postfiltering.

  22. Sec. 3.2.2 Processing wild-card queries • As before, we must execute a Boolean query for each enumerated, filtered term. • Wild-cards can result in expensive query execution (very large disjunctions…) • pyth* AND prog* • If you encourage “laziness” people will respond! • Which web search engines allow wildcard queries? (Mock search box: “Type your search terms, use ‘*’ if you need to. E.g., Alex* will match Alexander.”)

  23. Outline • Recap • Dictionaries • Wildcard queries • Edit distance • Spelling correction • Soundex

  24. Distance between misspelled word and “correct” word • We will study several alternatives. • weighted edit distance • Edit distance and Levenshtein distance • k-gram overlap
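As a preview of the k-gram overlap alternative, one common way to score it is a Jaccard coefficient over the two terms' k-gram sets; this sketch and its names are mine, not from the slides:

```python
def jaccard_kgram(term1, term2, k=2):
    """k-gram overlap of two terms, scored as a Jaccard coefficient."""
    def grams(t):
        s = "$" + t + "$"
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    g1, g2 = grams(term1), grams(term2)
    return len(g1 & g2) / len(g1 | g2)

# 5 shared bigrams (em mb be er r$) out of 13 distinct ones
print(round(jaccard_kgram("november", "december"), 2))  # 0.38
```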

  25. Weighted edit distance • As above, but weight of an operation depends on the characters involved. • Meant to capture keyboard errors, e.g., m more likely to be mistyped as n than as q. • Therefore, replacing m by n is a smaller edit distance than by q.
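A sketch of the idea, assuming a toy substitution-cost table (the 0.5 cost for confusing m and n is invented purely for illustration):

```python
# Hypothetical cost table: confusing m and n (adjacent keys) is cheap.
SUB_COST = {("m", "n"): 0.5}

def sub_cost(a, b):
    if a == b:
        return 0.0
    return SUB_COST.get((a, b), SUB_COST.get((b, a), 1.0))

def weighted_edit_distance(s1, s2, ins_del=1.0):
    """Edit distance where substitution cost depends on the characters."""
    m, n = len(s1), len(s2)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * ins_del
    for j in range(1, n + 1):
        d[0][j] = j * ins_del
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + ins_del,                      # delete
                          d[i][j - 1] + ins_del,                      # insert
                          d[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]))
    return d[m][n]

print(weighted_edit_distance("man", "nan"))  # 0.5: m -> n is a cheap mistake
print(weighted_edit_distance("man", "qan"))  # 1.0: m -> q costs full price
```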

  26. Edit distance • The edit distance between string s1 and string s2 is the minimum number of basic operations that convert s1 to s2. • Levenshtein distance: The admissible basic operations are insert, delete, and replace • Levenshtein distance dog-do: 1 • Levenshtein distance cat-cart: 1 • Levenshtein distance cat-cut: 1 • Levenshtein distance cat-act: 2
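The standard dynamic-programming computation of Levenshtein distance, matching the slide's examples (a sketch; variable names are mine):

```python
def levenshtein(s1, s2):
    """Minimum number of inserts, deletes, and replaces turning s1 into s2."""
    m, n = len(s1), len(s2)
    # d[i][j] = distance between the prefixes s1[:i] and s2[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # delete s1[i-1]
                          d[i][j - 1] + 1,           # insert s2[j-1]
                          d[i - 1][j - 1] + cost)    # replace (or copy)
    return d[m][n]

print(levenshtein("dog", "do"), levenshtein("cat", "cart"),
      levenshtein("cat", "cut"), levenshtein("cat", "act"))  # 1 1 1 2
```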

  27.–31. Levenshtein distance: Algorithm (the dynamic-programming pseudocode, built up over five slides; figures not captured in the transcript)

  32. Each cell of Levenshtein matrix (figure not captured in the transcript)

  33. Levenshtein distance: Example (figure not captured in the transcript)

  34. Exercise • Compute Levenshtein distance matrix for OSLO – SNOW • What are the Levenshtein editing operations that transform cat into catcat?

  35.–41. (Worked Levenshtein matrix for OSLO – SNOW, built up cell by cell; figures not captured in the transcript)

  42. How do I read out the editing operations that transform OSLO into SNOW?

  43.–50. (Tracing the editing operations back through the matrix; figures not captured in the transcript)
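Reading out the operations amounts to tracing an optimal path back from the bottom-right cell of the matrix; a sketch (the operation labels and function name are my own):

```python
def edit_ops(s1, s2):
    """One optimal operation sequence, read off the Levenshtein matrix."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    # Walk back from the bottom-right cell, preferring the diagonal move.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and d[i][j] == d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1])):
            if s1[i - 1] == s2[j - 1]:
                ops.append("copy " + s1[i - 1])
            else:
                ops.append("replace %s -> %s" % (s1[i - 1], s2[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append("delete " + s1[i - 1])
            i -= 1
        else:
            ops.append("insert " + s2[j - 1])
            j -= 1
    return list(reversed(ops))

print(edit_ops("oslo", "snow"))
```

The number of non-copy operations in the returned sequence equals the Levenshtein distance (3 for OSLO – SNOW).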
