
This Class


quade

Presentation Transcript


  1. This Class • How stemming is used in IR • Stemming algorithms • Frakes: Chapter 8 • Kowalski: pages 67-76

  2. Stemming algorithms • Affix removing stemmers • Dictionary lookup stemmers • n-gram stemmers • Successor variety stemmers

  3. Stemming • Conflation - combining morphological term variants • Done manually or automatically • Automatic algorithms called stemmers

  4. Stemming algorithms • Conflation methods: manual or automatic • Automatic methods: affix removal, successor variety, dictionary lookup, n-grams • Affix removal: longest match or simple removal

  5. Stemming is used to: • Enhance query formulation (and improve recall) by providing term variants • Reduce the size of index files by combining term variants into a single index term

  6. Stemming during indexing • Index terms are stemmed words • Saves dictionary space • One inverted index list for all variants • Saves inverted index file space when position information in document not included • Query terms are also stemmed

  7. Index is not stemmed • In this case the index contains words • No compression is achieved • No information is lost • Enables wild card searches • Enables long phrase searches when position information included

  8. Providing term variants during search • A stemming algorithm generates term variants • Term variants are added to the query automatically (query expansion), or • The user is provided with term variants and decides which ones to include

  9. Example • A user searching for "system users" is provided in the CATALOG system with term variants for "users" and "system"

  10. Example (cont.) • Search term: users • Term (occurrences): 1. user (15), 2. users (1), 3. used (3), 4. using (2) • User selects variants to include in query

  11. Stemmer correctness • A stemmer can be incorrect by either • Under-stemming or by • Over-stemming • Over-stemming can reduce precision • Under-stemming can reduce recall

  12. Over-stemming • Terms with different meanings are conflated • "considerate", "consider", and "consideration" should not be stemmed to "con", together with "contra", "contact", etc.

  13. Under-stemming • Prevents related terms from being conflated • Under-stemming "consideration" to "considerat" prevents conflating it with "consider"

  14. Evaluating stemmers • In information retrieval stemmers are evaluated by their: • effect on retrieval and • compression rate, and • not linguistic correctness

  15. Evaluating stemmers • Studies have shown that stemming has a positive effect on retrieval. • Performance of algorithms comparable • Results vary between test collections

  16. Affix removal stemmers • Remove • suffixes and/or • prefixes from terms • leaving a stem

  17. Affix removal stemmers • In English stemmers are suffix removers • In other languages, for example Hebrew, both prefix and suffix are removed

  18. Affix removal stemmers • Most affix removal stemmers in use are: • iterative - for example, "consideration" is stemmed first to "considerat", then to "consider" • longest match stemmers using a set of stemming rules.

  19. A simple stemmer • Harman experimented • concluded minimal stemming helpful • Her simple stemmer changes: • Plural to singular • Third person to first person

  20. A simple stemmer • Algorithm changes: • "skies" to "sky", ies->y • "retrieves" to "retrieve", es->e, and • "doors" to "door", s->NULL • (leaves "corpus" or "wellness") • "…ies" to "…y"

  21. A simple stemmer 1. word ends in "ies" but not "eies" or "aies": change end to "y" 2. word ends in "es" but not "aes", "ees" or "oes": change to "e" 3. word ends in "s" but not "us" or "ss": remove "s"
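The three rules of the simple stemmer can be sketched in a few lines of Python. This is a minimal sketch under two assumptions: the exception endings are eies/aies, aes/ees/oes and us/ss, and a rule counts as applicable only when both its ending and its exceptions are satisfied, so a word that hits an exception falls through to the next rule. The function name `s_stem` is illustrative.

```python
def s_stem(word: str) -> str:
    """Plural-removal sketch: only the first applicable rule is used."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


for term in ("skies", "retrieves", "doors", "corpus", "wellness"):
    print(term, "->", s_stem(term))
```

The loop prints sky, retrieve and door for the first three terms and leaves "corpus" and "wellness" unchanged, matching the examples on the previous slide.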

  22. The Paice/Husk stemmer • Uses a table of rules grouped into sections • Section for each last letter of a suffix (rules for forms ending in a, then b, etc.) • A form is any word or part of a word considered for stemming

  23. The Paice/Husk stemmer • Each rule specifies a deletion or a replacement of an ending • The order of the rules in each section is important. • Rules tried until one can be applied, and the current form is updated

  24. Rule structure • Each rule contains 5 parts (2 are optional): • An ending (one or more characters in reverse order) • An optional "intact" flag "*" denoting form not yet stemmed

  25. Rule structure • A digit (>=0) specifying the number of characters to remove • An optional string to append (after removal) • A rule ending with ">" denotes that stemming should continue; "." terminates the stemming process
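The five-part rule format can be decoded mechanically. The following is a minimal sketch, assuming lowercase ASCII rule strings and a single-digit removal count; the names `Rule` and `parse_rule` are illustrative and not part of any published implementation.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    ending: str    # ending in normal (left-to-right) order
    intact: bool   # True if the rule applies only to a not-yet-stemmed form
    remove: int    # number of trailing characters to delete
    append: str    # string appended after the deletion (may be empty)
    cont: bool     # True for ">" (continue), False for "." (terminate)


# reversed ending, optional "*", one digit, optional append string, ">" or "."
_RULE = re.compile(r"^([a-z]+)(\*?)(\d)([a-z]*)([>.])$")


def parse_rule(text: str) -> Rule:
    m = _RULE.match(text)
    if m is None:
        raise ValueError(f"malformed rule: {text!r}")
    rev, star, digit, append, symbol = m.groups()
    return Rule(rev[::-1], star == "*", int(digit), append, symbol == ">")


print(parse_rule("sei3y>"))  # ending "ies", remove 3, append "y", continue
print(parse_rule("mu*2."))   # ending "um", intact only, remove 2, terminate
```

Storing the ending in reversed order lets a table-driven implementation pick the rule section from the last letter of the current form.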

  26. Examples of rules • "sei3y>" • if form ends in "ies" then replace the last 3 letters by "y" and continue stemming ("…ries" becomes "…ry")

  27. Examples of rules • "mu*2." • if form ends with "um" and word is intact, remove the last 2 letters and terminate stemming. • "maximum" is stemmed to "maxim" but "presum" from "presumably" remains unchanged

  28. Examples of rules • "ylp0." - if word terminates in "ply", terminate. The next rule "yl2>" then does not remove "ly" from "multiply" • "nois4j>" causes "sion" to be replaced by "j" • "j" acts as dummy ending • "provision" is converted to "provij" and then to "provid"

  29. Acceptability conditions • Rule not applied unless conditions satisfied • Attempt to prevent over-stemming • Without them "rent", "rant", "rice", "rate", "ration", "river" reduce to "r" • There are 2 rules:

  30. Acceptability conditions • If a form starts with a vowel then at least 2 letters must remain (owed/owing->ow but not ear->e) • If a form starts with a consonant then at least 3 letters must remain, and at least one must be a vowel or "y" (saying->say, crying->cry, but not string->str, meant->me, or cement->ce)
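The two acceptability conditions translate directly into a predicate on the candidate stem. This is a small sketch, assuming lowercase input and treating a, e, i, o, u as the vowels; the function name `acceptable` is illustrative.

```python
VOWELS = set("aeiou")


def acceptable(stem: str) -> bool:
    """Return True if `stem` may be kept as the result of applying a rule."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        # Vowel-initial forms: at least 2 letters must remain.
        return len(stem) >= 2
    # Consonant-initial forms: at least 3 letters, one of them a vowel or "y".
    return len(stem) >= 3 and any(c in VOWELS or c == "y" for c in stem)


for candidate in ("ow", "e", "say", "cry", "str", "me", "ce"):
    print(candidate, acceptable(candidate))
```

The candidates "ow", "say" and "cry" pass, while "e", "str", "me" and "ce" are rejected, in line with the examples on the slide.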

  31. Acceptability conditions • These rules cause errors in the stemming of some short-rooted words • (doing, dying, being). • These could be dealt with separately with a table lookup

  32. Example with Paice stemming • "separately" - use "y" section • mismatch ylb1>, yli3y>, ylp0. • match yl2>. Form becomes "separate" • use rule "e1>" in "e" section • form changes to "separat" - use "t" section • mismatch with "tacilp4y.", match with "ta2>", change form to "separ" • use "r" section, match with "ra2." So "sep"
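The trace above can be reproduced with a small table-driven loop. The sketch below is not a full Paice/Husk implementation. It assumes only the rules named in the trace (ylb1>, yli3y>, ylp0., yl2>, e1>, tacilp4y., ta2>, ra2.), written out as tuples with the ending already un-reversed, plus the two acceptability conditions. The names `RULES`, `acceptable` and `paice_stem` are illustrative.

```python
VOWELS = set("aeiou")

# (ending, intact_only, remove, append, continue), grouped by the last letter
# of the ending and kept in table order. Only a subset of the real rule table.
RULES = {
    "y": [("bly", False, 1, "", True),       # ylb1>
          ("ily", False, 3, "y", True),      # yli3y>
          ("ply", False, 0, "", False),      # ylp0.
          ("ly", False, 2, "", True)],       # yl2>
    "e": [("e", False, 1, "", True)],        # e1>
    "t": [("plicat", False, 4, "y", False),  # tacilp4y.
          ("at", False, 2, "", True)],       # ta2>
    "r": [("ar", False, 2, "", False)],      # ra2.
}


def acceptable(stem: str) -> bool:
    if not stem:
        return False
    if stem[0] in VOWELS:
        return len(stem) >= 2
    return len(stem) >= 3 and any(c in VOWELS or c == "y" for c in stem)


def paice_stem(word: str) -> str:
    form, intact = word.lower(), True
    while form:
        for ending, intact_only, remove, append, cont in RULES.get(form[-1], []):
            if not form.endswith(ending) or (intact_only and not intact):
                continue  # rule does not match, try the next one
            candidate = form[:len(form) - remove] + append
            if not acceptable(candidate):
                continue  # rule matched but would over-stem
            form, intact = candidate, False
            if cont:
                break     # ">": look up the section for the new last letter
            return form   # ".": terminate
        else:
            return form   # no rule in the section could be applied
    return form


print(paice_stem("separately"))  # sep
print(paice_stem("multiply"))    # multiply (protected by ylp0.)
```

With the full rule table the "r" section contains further rules ahead of "ra2.", so results for other words will differ from this subset.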

  33. Other examples

  34. n-grams • Fixed length consecutive series of n characters • Bigrams: • Sea colony -> (se ea co ol lo on ny) • Trigrams: • Sea colony -> (sea col olo lon ony), or -> (#se sea ea# #co col olo lon ony ny#)
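The bigram and trigram lists above can be generated word by word. This is a minimal sketch, assuming n-grams never span a word boundary and that padding, when used, adds one "#" on each side of a word; the function name `word_ngrams` is illustrative.

```python
def word_ngrams(text: str, n: int, pad: str = "") -> list[str]:
    """Return the n-grams of each whitespace-separated word, in order."""
    grams = []
    for word in text.lower().split():
        padded = f"{pad}{word}{pad}"
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


print(word_ngrams("Sea colony", 2))       # se ea co ol lo on ny
print(word_ngrams("Sea colony", 3))       # sea col olo lon ony
print(word_ngrams("Sea colony", 3, "#"))  # #se sea ea# #co col olo lon ony ny#
```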

  35. Usage of n-grams • Used in World War II by cryptographers • Spell checking • Text compression • Signature files • Stemming

  36. n-gram temmers • Adamson and Borcham (1974) • Method for grouping term variants • Language independent

  37. n-gram temmers • Each term transformed to n-gram • A similarity value is generated between any pair of terms in database, resulting in a similarity matrix

  38. n-gram temmers • A clustering method (single link) groups highly similar terms into clusters • Most matrix elements had value 0. • Used a cutoff value of 0.6 for their clustering algorithm

  39. Dice Coefficient • Many formulas for computing set similarity • Dice coefficient: S = 2|A ∩ B| / (|A| + |B|) • 0 ≤ S ≤ 1 • S = 1 if A = B, S = 0 if A ∩ B = ∅

  40. Sets of Unique Bigrams • Let A and B denote the sets of unique bigrams associated with two terms, and let C = A ∩ B • statistics -> (st ta at ti is st ti ic cs) • Set of unique bigrams for statistics: A = {at cs ic is st ta ti}, |A| = 7

  41. n-gram temmers • statistical= (st ta at ti is st ti ic ca al) • Set of unique bigrams for statistical B= {al at ca ic is st ta ti}, |B|=8 • C={at ic is ta st ti}, |C|=6 • S=2|C|/(|A|+|B|)=2x6/(7+8)=.8

  42. Table lookup method • Ideally, a table is constructed with stem for every word • Stemming - look up word find stem • There is no such data for English • Systems use a combination of dictionary lookup and conflation rules

  43. Dictionary lookup method • INQUERY uses Kstem • Kstem is a morphological analyzer that conflates word variants to root form

  44. Dictionary lookup method • Tries to avoid collapsing words with different meaning to same root • The original word or a stemmed version is looked up in a dictionary and replaced by the best stem

  45. Successor variety stemmer • Based on work in structural linguistics (Hafer and Weiss) • Performed less well than affix removing stemmers • Given a set of words, the successor variety (SV) of a string is the number of different characters that follow it in words in the set

  46. Successor variety stemmers • Terms: {able, axle, accident, ape, about, apply, application, applies} • The SV of "ap" is 2: "ap" is followed by "e" in "ape" and by "p" in "apply", "application" and "applies" • The SV of "a" is 4: "a" is followed in the set by "b", "x", "c" and "p"
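The successor variety of a prefix can be computed by collecting the distinct characters that follow it. This is a minimal sketch. Whether the end of a word counts as a successor is a convention, exposed here as the hypothetical flag `count_end`, which is off by default.

```python
def successor_variety(prefix: str, words, count_end: bool = False) -> int:
    """Number of distinct characters following `prefix` in `words`."""
    successors = set()
    for w in words:
        if not w.startswith(prefix):
            continue
        if len(w) > len(prefix):
            successors.add(w[len(prefix)])
        elif count_end:
            successors.add("")  # end-of-word marker
    return len(successors)


TERMS = {"able", "axle", "accident", "ape", "about",
         "apply", "application", "applies"}
print(successor_variety("a", TERMS))   # 4  (b, x, c, p)
print(successor_variety("ap", TERMS))  # 2  (e, p)
```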

  47. SVs for "apply" and "applies" • * denotes a break point at a peak

  48. SV for "application"

  49. Segmenting words • 4 ways: • Cut-off SV is reached • SV peaks • A substring of a word is equal to another word in the set: "readable" breaks into "read" and "able" • Entropy based method

  50. Selecting a stem • First segment is selected if it occurs in at most 12 words, • Otherwise the second segment is selected (3 segments are unlikely)
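One of the four segmentation methods, breaking where a leading substring is itself a word in the set, can be combined with the stem selection rule in a short sketch. Several choices here are assumptions: the break is placed after the shortest proper prefix that is a word in the set, "occurs in" is read as "is a prefix of", and the function names and word list are illustrative.

```python
def complete_word_segments(word: str, words) -> list[str]:
    """Split `word` after its shortest proper prefix that is itself in `words`."""
    for i in range(1, len(word)):
        if word[:i] in words:
            return [word[:i], word[i:]]
    return [word]  # no break point found


def select_stem(segments: list[str], words, max_words: int = 12) -> str:
    """Pick the first segment unless it starts more than `max_words` words."""
    first = segments[0]
    occurrences = sum(1 for w in words if w.startswith(first))
    if len(segments) == 1 or occurrences <= max_words:
        return first
    return segments[1]


WORDS = {"read", "reads", "reading", "readable", "able"}
segments = complete_word_segments("readable", WORDS)
print(segments)                      # ['read', 'able']
print(select_stem(segments, WORDS))  # read
```

Lowering `max_words` below the number of words that start with "read" makes the function return the second segment instead, which is the case the selection rule is meant to handle.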
