Zoekmachines. Gertjan van Noord, 2014. Lecture 3: tolerant retrieval.

Presentation Transcript


  1. Zoekmachines • Gertjan van Noord 2014 • Lecture 3: tolerant retrieval

  2. Tolerant retrieval: overview. Methods to handle imprecise queries: • wildcard queries • typos • alternative spellings. Building alternative indexes. Finding the most similar terms.

  3. Sec. 3.2 Wild-card queries: * • mon*: find docs containing any word beginning with "mon". • *mon: find words ending in "mon": harder. • mo*n: find words that start with "mo" and end with "n". • m*o*n: find words that start with "m", end with "n", and have an "o" somewhere in between.

  4. Wildcard queries. Two steps in retrieval for wildcard queries: • Find all terms that match the wildcard pattern • Find all docs containing any of these terms. Three ways to do this: B-trees, permuterm index, k-gram index.

  5. Dictionary structures. Hash: very efficient (lookup and construction), but cannot be used to find terms that are "close" to the key. Binary tree and B-tree (and tries): data structures that keep data sorted (and balanced). Efficient search, but construction is more costly. Words with the same prefix are close together in the sorted order → can be used for robust retrieval.

  6. Sec. 3.2 Wild-card queries: * • mon*: easy with a binary tree (or B-tree) lexicon: retrieve all terms in the range mon ≤ w < moo. • *mon: maintain an additional B-tree for terms written backwards and retrieve all words in the range nom ≤ w < non. • m*n, m*o*n: combine B-tree and reverse B-tree. Expensive! Solution: the permuterm index.
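The range lookups on this slide can be sketched in Python. This is a minimal illustration, not the lecture's implementation: a sorted list searched with `bisect` stands in for the B-tree, and the small `lexicon` and the function names are made up for the example.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Return all terms w with prefix <= w < upper, e.g. mon <= w < moo."""
    if not prefix:
        return list(sorted_terms)
    # Upper bound: increment the last character ("mon" -> "moo").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

# Hypothetical toy lexicon; the sorted lists stand in for the two B-trees.
lexicon = sorted(["moan", "mon", "monday", "money", "month",
                  "moon", "salmon", "sermon"])
reversed_lexicon = sorted(t[::-1] for t in lexicon)

def suffix_lookup(suffix):
    """*mon: look up the reversed suffix ("nom") in the reversed lexicon."""
    hits = prefix_range(reversed_lexicon, suffix[::-1])
    return sorted(t[::-1] for t in hits)

def prefix_suffix_lookup(prefix, suffix):
    """mo*n: intersect the prefix matches with the suffix matches."""
    both = set(prefix_range(lexicon, prefix)) & set(suffix_lookup(suffix))
    # Prefix and suffix must not overlap inside the term.
    return sorted(t for t in both if len(t) >= len(prefix) + len(suffix))

print(prefix_range(lexicon, "mon"))
print(suffix_lookup("mon"))
print(prefix_suffix_lookup("mo", "n"))
```

The last function shows why the combined case is expensive: both range scans can return many terms before the intersection throws most of them away.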

  7. Permuterm index and queries. Permuterm index: add an end symbol: cat$; index all permuterms (in a structure like a B-tree): cat$, at$c, t$ca, $cat. Wildcard query processing: add $, rotate (if needed) until * is at the end. Examples: queries that can find (among others) cat: c*t, c*at, ca*, ca*t, *t, *at. Permuterm form?

  8. Sec. 3.2.1 Permuterm index. For term hello, index under: hello$, ello$h, llo$he, lo$hel, o$hell, $hello, where $ is a special symbol. Queries: • X: lookup on X$ • X*: lookup on $X* • *X: lookup on X$* • *X*: lookup on X* • X*Y: lookup on Y$X* • X*Y*Z: ???? Exercise! Example: query = hel*o, X = hel, Y = o, lookup o$hel*.
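A minimal Python sketch of the permuterm lookup described above. It assumes queries with at most one `*`; a sorted list of (rotation, term) pairs stands in for the B-tree, and the function names and sample vocabulary are invented for the illustration.

```python
from bisect import bisect_left

def rotations(term):
    """All rotations of term + '$': hello -> hello$, ello$h, ..., $hello."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(terms):
    """Sorted (rotation, term) pairs; a stand-in for a B-tree."""
    return sorted((rot, t) for t in terms for rot in rotations(t))

def lookup(index, query):
    """Look up a query containing at most one '*' (assumption of this sketch)."""
    q = query + "$"
    if "*" in q:
        head, tail = q.split("*")       # "hel*o$" -> "hel", "o$"
        key, exact = tail + head, False  # rotate so '*' is at the end: o$hel*
    else:
        key, exact = q, True             # X -> exact lookup on X$
    i = bisect_left(index, (key,))       # first rotation >= key
    hits = set()
    while i < len(index) and index[i][0].startswith(key):
        if not exact or index[i][0] == key:
            hits.add(index[i][1])
        i += 1
    return sorted(hits)

# Hypothetical vocabulary for the example.
index = build_permuterm(["hello", "help", "halo", "hero", "cat", "cut", "coat"])
print(lookup(index, "hel*o"))   # lookup on o$hel*
print(lookup(index, "c*t"))     # lookup on t$c*
print(lookup(index, "hel*"))    # lookup on $hel*
```

Every term of length n contributes n+1 rotations, so the index is several times larger than the plain dictionary; that is the price for turning every single-wildcard query into one prefix lookup.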

  9. K-gram index. k-gram index (example k=3): to each dictionary term add a start and an end symbol: $kitten$. From this string, list all trigrams: kitten: $ki kit itt tte ten en$. Make an inverted index of trigrams: $ki -> (kinkiten, kitchen, kitten, ...). How can we find kitten?

  10. An alternative: K-gram indexes. Index for dictionary lookup, not for document retrieval! Posting lists point from k-gram to vocabulary terms. k-gram: group of k consecutive items (context-dependent: characters, syllables, words, ...): bigram (digram), trigram, …

  11. K-gram index and queries. Part of a 3-gram inverted index: $ki -> kinkiten kitchen kitten; en$ -> kinkiten kitchen kitten; che -> kitchen; ink -> kinkiten; itt -> kitten; kit -> kinkiten kitchen kitten. Wildcard query processing: kit*en becomes $kit*en$, i.e. $ki AND kit AND en$. kinkiten??? It contains all three trigrams but does not match kit*en: postprocessing needed!
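The trigram lookup and the postprocessing step can be sketched as follows. This is an illustrative Python sketch under simple assumptions: the index is an in-memory dict of sets, the postfilter uses a regular expression, and all function names are hypothetical.

```python
import re
from collections import defaultdict

def kgrams(s, k=3):
    """Set of all k-grams of the string s (empty if s is shorter than k)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def build_kgram_index(terms, k=3):
    """Map each k-gram of $term$ to the set of terms containing it."""
    index = defaultdict(set)
    for t in terms:
        for g in kgrams("$" + t + "$", k):
            index[g].add(t)
    return index

def wildcard_lookup(index, query, k=3):
    """Return (candidates, matches) for a wildcard query such as kit*en."""
    grams = set()
    for part in ("$" + query + "$").split("*"):
        grams |= kgrams(part, k)          # $kit -> $ki, kit ; en$ -> en$
    if not grams:
        raise ValueError("query too short for a k-gram lookup")
    # Boolean AND over the postings of all query k-grams.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Postprocessing: keep only terms that really match the wildcard pattern.
    pattern = re.compile(".*".join(re.escape(p) for p in query.split("*")))
    matches = sorted(t for t in candidates if pattern.fullmatch(t))
    return sorted(candidates), matches

index = build_kgram_index(["kinkiten", "kitchen", "kitten"])
candidates, matches = wildcard_lookup(index, "kit*en")
print(candidates)  # includes the false positive kinkiten
print(matches)
```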

  12. Sec. 3.2 Query processing • At this point, we have an enumeration of all terms in the dictionary that match the wild-card query. • We still have to look up the postings for each enumerated term. • E.g., consider the query: se*ate AND fil*er • This may result in the execution of many Boolean AND queries.
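As a rough Python illustration of this step, the sketch below expands each wildcard into dictionary terms and then combines the postings. The toy `postings` dictionary and document IDs are invented, and `fnmatchcase` is used only as a shortcut for the term enumeration that one of the indexes above would normally provide.

```python
from fnmatch import fnmatchcase

# Hypothetical postings: term -> set of document IDs.
postings = {
    "senate": {1, 4},
    "sedate": {2},
    "separate": {3, 4},
    "filter": {2, 4},
    "filler": {1, 5},
    "film": {3},
}

def expand(pattern, dictionary):
    """Enumerate all dictionary terms matching the wildcard pattern."""
    return sorted(t for t in dictionary if fnmatchcase(t, pattern))

def wildcard_and(pattern1, pattern2, postings):
    """Evaluate (t1 OR t2 OR ...) AND (u1 OR u2 OR ...)."""
    docs1 = set().union(*(postings[t] for t in expand(pattern1, postings)))
    docs2 = set().union(*(postings[t] for t in expand(pattern2, postings)))
    return sorted(docs1 & docs2)

print(expand("se*ate", postings))
print(expand("fil*er", postings))
print(wildcard_and("se*ate", "fil*er", postings))
```

Here three terms match se*ate and two match fil*er, so a naive evaluation would run 3 × 2 = 6 term-pair AND queries; taking the union per wildcard first gives the same documents with a single intersection.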

  13. Spell correction. When? If a query word (or word combination) is quite rare or not in the dictionary at all. Approach: • Find similar term(s) • Calculate their similarity to the query term • Choose the most frequent ones

  14. Finding similar words and calculating their similarity. Use a k-gram index of words and calculate the Jaccard coefficient to find the terms most similar to the query term: |A ∩ B| / |A ∪ B|, i.e. relative similarity: size of the set of elements (k-grams) in common divided by size of the set of all elements. SET: no duplicates!
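A small Python sketch of the Jaccard coefficient on k-gram sets. It assumes bigrams (k=2) and the `$` padding used for the k-gram index earlier in the lecture; the candidate vocabulary and the helper `most_similar` are made up for the example.

```python
def kgrams(term, k=2):
    """Set of k-grams of $term$ (a set, so duplicates are dropped)."""
    s = "$" + term + "$"
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard(a, b, k=2):
    """|A ∩ B| / |A ∪ B| for the k-gram sets A and B of the two terms."""
    A, B = kgrams(a, k), kgrams(b, k)
    return len(A & B) / len(A | B)

def most_similar(query, vocabulary, n=2, k=2):
    """Rank vocabulary terms by Jaccard similarity to the query term."""
    ranked = sorted(vocabulary, key=lambda t: jaccard(query, t, k), reverse=True)
    return ranked[:n]

# bord:  $b bo or rd d$       (5 bigrams)
# board: $b bo oa ar rd d$    (6 bigrams), 4 in common, 7 in total
print(jaccard("bord", "board"))
print(most_similar("bord", ["board", "border", "lord", "boardroom"]))
```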

  15. Even more precise: then use Levenshtein distance to select, among those candidates, the terms with the least edit distance to the query term. Demo: http://www.miislita.com/searchito/levenshtein-edit-distance.html

  16. 26-01-12 Levenshtein distance: minimal edit distance. Cell m(i, j) of the distance matrix is computed from its neighbours m(i, j-1), m(i-1, j-1) and m(i-1, j).
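The matrix computation can be sketched in Python as below. This is the standard dynamic-programming formulation with unit costs for insertion, deletion and substitution, which is an assumption here since the slide does not state the costs; `m` follows the slide's notation.

```python
def levenshtein(s, t):
    """Minimal number of insertions, deletions and substitutions turning s into t."""
    rows, cols = len(s) + 1, len(t) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i              # delete all i characters of s
    for j in range(cols):
        m[0][j] = j              # insert all j characters of t
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            m[i][j] = min(
                m[i][j - 1] + 1,         # insertion
                m[i - 1][j] + 1,         # deletion
                m[i - 1][j - 1] + cost,  # match or substitution
            )
    return m[-1][-1]

print(levenshtein("kitten", "sitting"))
```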

  17. Phonetic similarity To calculate which (English) written words are most similar in pronunciation, the SOUNDEX algorithm gives a (rather rough) measure. Demo: http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#SoundExConverter
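A compact Python sketch of Soundex follows. It implements the common "American Soundex" variant (first letter plus three digits, with h and w not separating equal codes); other variants differ in details, so treat it as one possible reading rather than the version used in the linked demo.

```python
# Letter groups of American Soundex; vowels and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the 4-character Soundex code of an (English) word."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                 # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code           # new consonant group
        prev = code                  # a vowel resets prev to ""
    return (result + "000")[:4]      # pad or truncate to 4 characters

for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
    print(name, soundex(name))
```

Robert and Rupert receive the same code, which is the kind of rough pronunciation match the slide refers to.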
