
Information Retrieval



Presentation Transcript


  1. Information Retrieval January 28, 2005 Handout #4

  2. Course Information • Instructor: Dragomir R. Radev (radev@si.umich.edu) • Office: 3080, West Hall Connector • Phone: (734) 615-5225 • Office hours: M 11-12 & Th 12-1 or via email • Course page: http://tangra.si.umich.edu/~radev/650/ • Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

  3. Arithmetic coding

  4. Arithmetic coding • Uses probabilities • Achieves about 2.5 bits per character – close to optimal • (Rissanen and Langdon 1979, Witten, Neal, and Cleary 1987)

  5. Exercise • Assuming the alphabet consists of a, b, and c, develop arithmetic encodings for the following strings: aaa aab aba baa abc cab cba bac
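
A minimal sketch for working through this exercise, assuming a fixed model with equal probabilities of 1/3 for a, b, and c (the exercise does not specify a model, so this is only one possible choice). The sketch stops at the final interval; a real arithmetic coder would then emit enough bits to identify a number inside that interval.

    # Toy arithmetic encoder for the 3-letter alphabet {a, b, c}.
    # Assumes equal symbol probabilities (1/3 each) -- an illustrative choice.
    PROBS = {"a": (0.0, 1/3), "b": (1/3, 2/3), "c": (2/3, 1.0)}  # cumulative ranges

    def encode(message):
        """Narrow the interval [low, high) once per symbol; any number in the
        final interval identifies the message (given its length and the model)."""
        low, high = 0.0, 1.0
        for ch in message:
            lo, hi = PROBS[ch]
            width = high - low
            low, high = low + width * lo, low + width * hi
        return low, high

    for s in ["aaa", "aab", "aba", "baa", "abc", "cab", "cba", "bac"]:
        low, high = encode(s)
        print(f"{s}: interval [{low:.5f}, {high:.5f})")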

  6. Stemming

  7. Goals • Motivation: • Computer, computers, computerize, computational, computerization • User, users, using, used • Representing related words as one token • Simplify matching • Reduce storage and computation • Also known as: term conflation

  8. Methods • Manual (tables) • Achievement → achiev • Achiever → achiev • Etc. • Affix removal (Harman 1991, Frakes 1992) • If a word ends in “ies” but not “eies” or “aies” then “ies” → “y” • If a word ends in “es” but not “aes”, “ees”, or “oes”, then “es” → “e” • If a word ends in “s” but not “us” or “ss” then “s” → NULL • (apply only the first applicable rule)
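
A small sketch of the three affix-removal rules above, applying only the first rule whose condition matches (the function name strip_s_suffix is illustrative, not from the handout).

    def strip_s_suffix(word):
        # Apply only the first applicable rule, as the slide specifies.
        if word.endswith("ies") and not word.endswith(("eies", "aies")):
            return word[:-3] + "y"
        if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
            return word[:-1]          # "es" -> "e": drop only the final "s"
        if word.endswith("s") and not word.endswith(("us", "ss")):
            return word[:-1]          # "s" -> NULL
        return word

    print(strip_s_suffix("ponies"))   # pony
    print(strip_s_suffix("boxes"))    # boxe
    print(strip_s_suffix("cats"))     # cat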

  9. Porter’s algorithm (Porter 1980) • Home page: • http://www.tartarus.org/~martin/PorterStemmer • Reading assignment: • http://www.tartarus.org/~martin/PorterStemmer/def.txt • Consonant-vowel sequences: • CVCV ... C • CVCV ... V • VCVC ... C • VCVC ... V • Shorthand: [C]VCVC ... [V]

  10. Porter’s algorithm (cont’d) • [C](VC){m}[V] • {m} indicates repetition • Examples: • m=0 TR, EE, TREE, Y, BY • m=1 TROUBLE, OATS, TREES, IVY • m=2 TROUBLES, PRIVATE, OATEN • Conditions: • *S - the stem ends with S (and similarly for the other letters). • *v* - the stem contains a vowel. • *d - the stem ends with a double consonant (e.g. -TT, -SS). • *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).
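
A sketch of how the measure m can be computed from the [C](VC){m}[V] form: collapse the word into its C/V shape and count the VC transitions. The function names are mine; Porter's definition of a consonant (y counts as a vowel when it follows a consonant) is followed.

    def measure(word):
        def is_consonant(w, i):
            c = w[i]
            if c in "aeiou":
                return False
            if c == "y":   # y is a consonant only at the start or after a vowel
                return i == 0 or not is_consonant(w, i - 1)
            return True
        # Collapse runs of consonants/vowels into single C/V symbols.
        form = ""
        for i in range(len(word)):
            symbol = "C" if is_consonant(word, i) else "V"
            if not form or form[-1] != symbol:
                form += symbol
        return form.count("VC")

    for w in ["tree", "trouble", "oats", "private", "oaten"]:
        print(w, measure(w))   # tree 0, trouble 1, oats 1, private 2, oaten 2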

  11. Step 1a
SSES -> SS   caresses -> caress
IES  -> I    ponies -> poni, ties -> ti
SS   -> SS   caress -> caress
S    ->      cats -> cat
Step 1b
(m>0) EED -> EE   feed -> feed, agreed -> agree
(*v*) ED  ->      plastered -> plaster, bled -> bled
(*v*) ING ->      motoring -> motor, sing -> sing
Step 1b1 (if the second or third of the rules in Step 1b is successful, the following is done)
AT -> ATE   conflat(ed) -> conflate
BL -> BLE   troubl(ed) -> trouble
IZ -> IZE   siz(ed) -> size
(*d and not (*L or *S or *Z)) -> single letter   hopp(ing) -> hop, tann(ed) -> tan, fall(ing) -> fall, hiss(ing) -> hiss, fizz(ed) -> fizz
(m=1 and *o) -> E   fail(ing) -> fail, fil(ing) -> file
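
Within a step, Porter applies the rule whose suffix is the longest match. A sketch for Step 1a, where trying the suffixes in the order listed achieves that (the function name step_1a is illustrative):

    STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

    def step_1a(word):
        # First (longest) matching suffix wins.
        for suffix, replacement in STEP_1A:
            if word.endswith(suffix):
                return word[: len(word) - len(suffix)] + replacement
        return word

    for w in ["caresses", "ponies", "ties", "caress", "cats"]:
        print(w, "->", step_1a(w))   # caress, poni, ti, caress, cat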

  12. Step 1c
(*v*) Y -> I   happy -> happi, sky -> sky
Step 2
(m>0) ATIONAL -> ATE    relational -> relate
(m>0) TIONAL  -> TION   conditional -> condition, rational -> rational
(m>0) ENCI    -> ENCE   valenci -> valence
(m>0) ANCI    -> ANCE   hesitanci -> hesitance
(m>0) IZER    -> IZE    digitizer -> digitize
(m>0) ABLI    -> ABLE   conformabli -> conformable
(m>0) ALLI    -> AL     radicalli -> radical
(m>0) ENTLI   -> ENT    differentli -> different
(m>0) ELI     -> E      vileli -> vile
(m>0) OUSLI   -> OUS    analogousli -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION   -> ATE    predication -> predicate
(m>0) ATOR    -> ATE    operator -> operate
(m>0) ALISM   -> AL     feudalism -> feudal
(m>0) IVENESS -> IVE    decisiveness -> decisive
(m>0) FULNESS -> FUL    hopefulness -> hopeful
(m>0) OUSNESS -> OUS    callousness -> callous
(m>0) ALITI   -> AL     formaliti -> formal
(m>0) IVITI   -> IVE    sensitiviti -> sensitive
(m>0) BILITI  -> BLE    sensibiliti -> sensible

  13. Step 3
(m>0) ICATE -> IC   triplicate -> triplic
(m>0) ATIVE ->      formative -> form
(m>0) ALIZE -> AL   formalize -> formal
(m>0) ICITI -> IC   electriciti -> electric
(m>0) ICAL  -> IC   electrical -> electric
(m>0) FUL   ->      hopeful -> hope
(m>0) NESS  ->      goodness -> good
Step 4
(m>1) AL    ->   revival -> reviv
(m>1) ANCE  ->   allowance -> allow
(m>1) ENCE  ->   inference -> infer
(m>1) ER    ->   airliner -> airlin
(m>1) IC    ->   gyroscopic -> gyroscop
(m>1) ABLE  ->   adjustable -> adjust
(m>1) IBLE  ->   defensible -> defens
(m>1) ANT   ->   irritant -> irrit
(m>1) EMENT ->   replacement -> replac
(m>1) MENT  ->   adjustment -> adjust
(m>1) ENT   ->   dependent -> depend
(m>1 and (*S or *T)) ION ->   adoption -> adopt
(m>1) OU    ->   homologou -> homolog
(m>1) ISM   ->   communism -> commun
(m>1) ATE   ->   activate -> activ
(m>1) ITI   ->   angulariti -> angular
(m>1) OUS   ->   homologous -> homolog
(m>1) IVE   ->   effective -> effect
(m>1) IZE   ->   bowdlerize -> bowdler

  14. Step 5a
(m>1) E ->              probate -> probat, rate -> rate
(m=1 and not *o) E ->   cease -> ceas
Step 5b
(m>1 and *d and *L) -> single letter   controll -> control, roll -> roll

  15. Porter’s algorithm (cont’d) Example: the word “duplicatable”
duplicat    (rule 4)
duplicate   (rule 1b1)
duplic      (rule 3)
The application of another rule in step 4, removing “ic”, cannot be applied since only one rule from each step is allowed to be applied.
% cd /clair4/class/ir-w03/tf-idf
% ./stem.pl computers
computers comput
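
The course script stem.pl is shown above. As an alternative sketch, assuming the NLTK package is available (NLTK is not part of the original handout), an off-the-shelf Porter stemmer can be called directly:

    # Requires NLTK to be installed; output may differ slightly from the
    # 1980 algorithm, since NLTK ships minor extensions.
    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    for w in ["computers", "duplicatable", "relational", "hopefulness"]:
        print(w, "->", stemmer.stem(w))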

  16. Porter’s algorithm

  17. Stemming • Not always appropriate (e.g., proper names, titles) • The same applies to casing (e.g., CAT vs. cat)

  18. String matching

  19. String matching methods • Index-based • Full or approximate • E.g., theater = theatre

  20. Index-based matching • Inverted files • Position-based inverted files • Block-based inverted files
Example text: This is a text. A text has many words. Words are made from letters.
Word start positions: 1 6 9 11 17 19 24 28 33 40 46 50 55 60
Inverted file entries:  Text: 11, 19   Words: 33, 40   From: 55
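
A sketch of a position-based inverted file for the example sentence: each term maps to the character offsets (1-based, as on the slide) where it starts. Names and the tokenization regex are illustrative.

    import re
    from collections import defaultdict

    text = "This is a text. A text has many words. Words are made from letters."

    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        term = match.group().lower()
        index[term].append(match.start() + 1)   # slide counts positions from 1

    print(index["text"])   # [11, 19]
    print(index["words"])  # [33, 40]
    print(index["from"])   # [55]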

  21. Inverted index (trie)
l -> letters: 60
m -> a -> d -> made: 50
     a -> n -> many: 28
t -> text: 11, 19
w -> words: 33, 40
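
The slide's trie figure did not survive extraction; above is a textual rendering, and below is a sketch of the same idea as a plain character-level trie (nested dicts), rather than the compressed trie drawn on the slide. The "$" key and the function name are my own conventions.

    def build_trie(postings):
        root = {}
        for term, positions in postings.items():
            node = root
            for ch in term:
                node = node.setdefault(ch, {})
            node["$"] = positions   # position list stored at the end of the term
        return root

    postings = {"letters": [60], "made": [50], "many": [28],
                "text": [11, 19], "words": [33, 40]}
    trie = build_trie(postings)
    print(trie["m"]["a"]["d"]["e"]["$"])   # [50]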

  22. Sequential searching • No indexing structure given • Given: database d and search pattern p. • Example: find “words” in the earlier example • Brute force method • try all possible starting positions • O(n) positions in the database and O(m) characters in the pattern so the total worst-case runtime is O(mn) • Typical runtime is actually O(n) given that mismatches are easy to notice
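
A sketch of the brute-force method just described: try every starting position in the database d and compare the pattern p character by character (O(mn) worst case, typically O(n)). The function name is illustrative.

    def brute_force_search(d, p):
        n, m = len(d), len(p)
        matches = []
        for i in range(n - m + 1):             # O(n) candidate positions
            j = 0
            while j < m and d[i + j] == p[j]:  # O(m) comparisons, usually far fewer
                j += 1
            if j == m:
                matches.append(i)
        return matches

    text = "This is a text. A text has many words. Words are made from letters."
    print(brute_force_search(text.lower(), "words"))   # [32, 39] (0-based offsets)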

  23. Knuth-Morris-Pratt • Average runtime similar to BF • Worst case runtime is linear: O(n) • Idea: reuse knowledge • Need preprocessing of the pattern

  24. Knuth-Morris-Pratt (cont’d) • Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm)
database: ABC ABC ABC ABDAB ABCDABCDABDE
pattern:  ABCDABD
Partial-match table:
index   0   1   2   3   4   5   6   7
char    A   B   C   D   A   B   D   –
pos    -1   0   0   0   0   1   2   0
1234567
ABCDABD
ABCDABD
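
A sketch (the function name kmp_table is mine) of the preprocessing step: it computes the partial-match ("failure") table and reproduces the pos row shown above for the pattern ABCDABD.

    def kmp_table(p):
        # t[i] = length of the longest proper border of p[:i]; t[0] = -1 is a sentinel.
        t = [-1] + [0] * len(p)
        k = -1
        for i in range(len(p)):
            while k >= 0 and p[k] != p[i]:
                k = t[k]
            k += 1
            t[i + 1] = k
        return t

    print(kmp_table("ABCDABD"))   # [-1, 0, 0, 0, 0, 1, 2, 0]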

  25. Knuth-Morris-Pratt (cont’d)
[Figure: five snapshots of the search. The pattern ABCDABD is slid along the text ABC ABC ABC ABDAB ABCDABCDABDE; in each snapshot a ^ marks the current comparison position, and on a mismatch the pattern is shifted according to the partial-match table.]

  26. Boyer-Moore • Used in text editors • Demos • http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM.html • http://www.blarg.com/~doyle/pages/bmi.html

  27. Other methods • The Soundex algorithm (Odell and Russell) • Uses: • spelling correction • hash function • non-recoverable

  28. Word similarity • Hamming distance - when words are of the same length • Levenshtein distance - number of edits (insertions, deletions, replacements) • color --> colour (1) • survey --> surgery (2) • com puter --> computer ? • Longest common subsequence (LCS) • lcs (survey, surgery) = surey
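
A sketch of Levenshtein distance by dynamic programming, matching the edit counts on the slide: d[i][j] is the edit distance between the first i letters of a and the first j letters of b. The function name is illustrative.

    def levenshtein(a, b):
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i                      # i deletions
        for j in range(len(b) + 1):
            d[0][j] = j                      # j insertions
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # replacement (or match)
        return d[len(a)][len(b)]

    print(levenshtein("color", "colour"))    # 1
    print(levenshtein("survey", "surgery"))  # 2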

  29. The Soundex algorithm
1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions
2. Assign the following numbers to the remaining letters after the first:
   b, f, p, v : 1
   c, g, j, k, q, s, x, z : 2
   d, t : 3
   l : 4
   m, n : 5
   r : 6

  30. The Soundex algorithm
3. If two or more letters with the same code were adjacent in the original name, omit all but the first
4. Convert to the form “LDDD” by adding terminal zeros or by dropping rightmost digits
Examples: Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300 (the same codes as Ellery, Ghosh, Heilbronn, Kant, and Ladd, respectively)
Some problems: Rogers and Rodgers, Sinclair and StClair
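
A sketch of the Soundex procedure as described on these two slides (one common variant; small details such as the treatment of h and w differ between published versions). Names in the code are my own.

    CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6"}

    def soundex(name):
        name = name.lower()
        first = name[0].upper()
        # Code every letter; a, e, h, i, o, u, w, y get "" (they are dropped).
        coded = [CODES.get(ch, "") for ch in name]
        # Step 3: collapse adjacent identical codes, then discard the blanks.
        collapsed = []
        for c in coded:
            if not collapsed or c != collapsed[-1]:
                collapsed.append(c)
        digits = [c for c in collapsed[1:] if c]
        # Step 4: pad with zeros or truncate to the form LDDD.
        return first + "".join(digits + ["0", "0", "0"])[:3]

    for name in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd", "Ellery", "Rodgers"]:
        print(name, soundex(name))   # E460 G200 H416 K530 L300 E460 R326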
