
Automatic Indexing and Stemming in Information Retrieval

Learn about the use of stop lists, stemming algorithms, and manual vs automatic indexing in information retrieval. Discover the advantages and disadvantages of each approach.


Presentation Transcript


  1. CS533 Information Retrieval • Dr. Michal Cutler • Lecture #4 • February 3, 1999

  2. This Class • Automatic indexing • Stop lists • How stemming is used in IR • Stemming algorithms • Frakes: Chapter 8

  3. Disadvantages of Manual Indexing • Human effort • Controlled vocabulary per collection • Subjective (the index terms chosen by different indexers intersect by only about 40%)

  4. Advantages of Manual Indexing • Human experts who use indexing aids such as “scope notes” describing allowable vocabulary and usage achieve good indexing uniformity

  5. Which is better? • Salton claims the results of automatic indexing are comparable to manual • Blair claims manual is better • Often, manual indexing is not a practical option

  6. Automatic indexing • At this stage single words • Consider: • the usage of stop lists • which tokens to include • stemming algorithms

  7. Stop lists • A stop list is a list of terms which are not included in an index • Inverted lists may be saved and used for identifying phrases

  8. Why use stop words? • Luhn (1957) observed that many of the most frequently occurring words are worthless as index terms • The 10 most frequently occurring terms account for 20-30% of the word occurrences • Eliminating stop words saves index space and computation time

  9. Stop lists • Traditionally most frequently occurring English words. • Among the top 200 are words such as “time” “war” “home” etc.

  10. Stop list for collection • A stop list can also be specific to a collection, for example “computer, machine, program, source, language” in a computer science collection

  11. Stop lists • Commercial systems use only a few stop words • ORBIT uses only 8, “and, an, by, from, of, the, with” • Lists of stop words appear in the literature (Frakes)
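
A minimal sketch of stop-word removal, assuming whitespace tokenization and case-insensitive matching. The set holds the ORBIT words exactly as quoted on the slide (the slide says 8 but lists 7); the names `ORBIT_STOP_WORDS` and `remove_stop_words` are illustrative only.

```python
# Stop words quoted on the slide for ORBIT (only 7 are listed there).
ORBIT_STOP_WORDS = {"and", "an", "by", "from", "of", "the", "with"}


def remove_stop_words(tokens, stop_words=ORBIT_STOP_WORDS):
    """Return the tokens that are not on the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Assumption: tokens come from a plain whitespace split.
tokens = "the usage of stop lists by an indexing system".split()
index_terms = remove_stop_words(tokens)
print(index_terms)
```

A set gives constant-time membership tests, which matters because the check runs once per token in the collection.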

  12. Which tokens to include? • English words (not stop words) • Include numbers? • How to deal with hyphens? • Case sensitive?

  13. Include numbers? • Numbers - not good discriminators • Important in some contexts • Usually systems allow tokens to include digits but not to begin with one • So B6 (vitamin) but not 6
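
The "digits allowed, but not as the first character" rule can be sketched with a regular expression. This is one possible reading of the slide, assuming ASCII letters only and optional case folding; `TOKEN_RE` and `tokenize` are hypothetical names.

```python
import re

# A token starts with a letter and may continue with letters or digits.
# The \b anchors keep "6abc" from yielding the partial token "abc".
TOKEN_RE = re.compile(r"\b[A-Za-z][A-Za-z0-9]*\b")


def tokenize(text, fold_case=True):
    """Extract tokens that may contain digits but never begin with one."""
    tokens = TOKEN_RE.findall(text)
    return [t.lower() for t in tokens] if fold_case else tokens


print(tokenize("Take 6 tablets of vitamin B6 in 1999"))
```

With this pattern “B6” is kept as a token while “6” and “1999” are dropped.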

  14. Include hyphens? • Break into distinct terms or • Keep as a single term with hyphen • Chemical Abstracts Service - hyphenated words kept as a single term • LEXIS/NEXIS - break apart into two terms if they occur in a title or abstract

  15. Punctuation and case • Punctuation is sometimes important, for example “command.com” and “OS/2” • Case - convert to lower case or not

  16. Commercial systems • Commercial systems prefer to enhance recall • Usually case insensitive • Index numbers • Very few stop words

  17. Recognizing names • People’s names - “Bill Clinton” • Company names - IBM & big blue • Places • New York City, NYC, the big apple

  18. Stemming algorithms • Affix removing stemmers • Dictionary lookup stemmers • n-gram stemmers • Successor variety stemmers

  19. Stemming algorithms • Conflation methods: manual or automatic • Automatic (stemmers): affix removal, successor variety, dictionary lookup, n-grams • Affix removal: longest match or simple removal

  20. Stemming • Conflation - combining non-identical words which refer to the same principal concept • Done manually or automatically • Automatic algorithms are called stemmers

  21. Stemming is used to: • Enhance query formulation (and improve recall) by providing term variants • Reduce size of index files by combining term variants into single index term

  22. Stemming during indexing • Index terms are stemmed words • Saves dictionary space • One inverted index list for all variants • Saves inverted index file space when position information in document not included • Query terms are also stemmed
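
A small sketch of stemming at indexing time: one posting list per stem, with query terms stemmed the same way. The stemmer here (`toy_stem`, which strips one trailing "s") is a placeholder assumption, not an algorithm from the lecture, and positions are not stored.

```python
from collections import defaultdict


def toy_stem(word):
    """Placeholder stemmer (assumption): strip one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_index(docs, stem=toy_stem):
    """Map each stemmed term to the sorted ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[stem(word)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query, stem=toy_stem):
    """Stem the query terms and intersect their posting lists."""
    result = None
    for word in query.lower().split():
        ids = set(index.get(stem(word), []))
        result = ids if result is None else result & ids
    return sorted(result or [])


docs = {1: "system users", 2: "user manual", 3: "operating systems"}
index = build_index(docs)
print(index["user"], index["system"], search(index, "system users"))
```

Because “users” and “user” share one entry, the dictionary is smaller, but the index can no longer distinguish the two forms.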

  23. Index is not stemmed • In this case the index contains words • No compression is achieved • No information is lost • Enables wild card searches • Enables long phrases searches when position information included

  24. Providing term variants during search • A stemming algorithm generates term variants • Term variants added to query automatically (query expansion) or • The user is provided with term variants and decides which ones to include

  25. Example • A user searching for “system users” is provided in the CATALOG system with term variants for “users” and “system”

  26. Example (cont.) Search term: users
      Term       Occurrences
      1. user    15
      2. users   1
      3. used    3
      4. using   2
      • User selects variants to include in query

  27. Stemmer correctness • A stemmer can be incorrect by either • Under-stemming or by • Over-stemming • Over-stemming can reduce precision • Under-stemming can affect recall

  28. Over-stemming • Terms with different meanings are conflated • “considerate”, “consider” and “consideration” should not be stemmed to “con”, which would conflate them with “contra”, “contact”, etc.

  29. Under-Stemming • Prevents related terms from being conflated • Under-stemming “consideration” to “considerat” prevents conflating it with “consider”

  30. Evaluating stemmers • In information retrieval stemmers are evaluated by their: • effect on retrieval and • compression rate, and • not linguistic correctness

  31. Evaluating stemmers • Studies have shown that stemming has a positive effect on retrieval. • Performance of algorithms comparable • Results vary between test collections

  32. Affix removal stemmers • Remove • suffixes and/or • prefixes from terms • leaving a stem

  33. Affix removal stemmers • In English, stemmers are suffix removers • In other languages, for example Hebrew, both prefixes and suffixes are removed • Keshehalachnu --> halach • Nelechna --> halach

  34. Affix removal stemmers • Most affix removal stemmers in use are: • iterative - for example, “consideration” stemmed first to “considerat” then to “consider” • longest match stemmers using a set of stemming rules arranged on a ‘longest match’ principle (Lovins)

  35. A simple stemmer • Harman experimented • concluded minimal stemming helpful • Her simple stemmer changes: • Plural to singular • Third person to first person

  36. A simple stemmer (Harman)
      if word ends in “ies” but not “eies” or “aies” then “ies” -> “y”;
      else if word ends in “es” but not “aes”, “ees” or “oes” then “es” -> “e”;
      else if word ends in “s” but not “us” or “ss” then “s” -> NULL
      endif

  37. A simple stemmer • Algorithm changes: • “skies” to “sky”, • “retrieves” to “retrieve”, and • “doors” to “door” but not “corpus” or “wellness” • “dies” to “dy”?
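
The three rules of Harman's simple stemmer, as given on the slides, translate fairly directly into code. The sketch below assumes lower-case input and applies only the first rule that fits; the function name `s_stem` is illustrative.

```python
def s_stem(word):
    """Apply the first matching plural-removal rule from the slide."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"   # "ies" -> "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"   # "es" -> "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]         # "s" -> NULL
    return word


for w in ["skies", "retrieves", "doors", "corpus", "wellness", "dies"]:
    print(w, "->", s_stem(w))
```

The last word shows the weakness raised on the slide: “dies” matches the first rule and becomes “dy”.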

  38. The Paice/Husk stemmer • Uses a table of rules grouped into sections • Section for each last letter of a suffix (rules for forms ending in a, then b, etc.) • A form is any word or part of a word considered for stemming

  39. The Paice/Husk stemmer • Each rule specifies a deletion or a replacement of an ending • The order of the rules in each section is important. • Rules tried until one can be applied, and the current form is updated

  40. Rule structure • Each rule contains 5 parts (2 are optional): • An ending (one or more characters in reverse order) • An optional “intact” flag “*” denoting form not yet stemmed

  41. Rule structure • A digit (>=0) specifying the number of characters to remove • An optional string to append (after removal) • A final symbol: “>” denotes that stemming should continue, “.” terminates the stemming process
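
To make the five-part rule format concrete, here is a sketch of a parser for rule strings such as “sei3y>”. It assumes lower-case letters and a single removal digit, as the slides describe; the `Rule` tuple, its field names, and `parse_rule` are hypothetical.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    ending: str        # ending to match, restored to normal reading order
    intact_only: bool  # True if the rule carries the "*" intact flag
    remove: int        # number of characters to delete from the form
    append: str        # string to append after removal (may be empty)
    continues: bool    # True for ">", False for "."


# reversed ending, optional "*", one digit, optional append string, ">" or "."
RULE_RE = re.compile(r"([a-z]+)(\*?)(\d)([a-z]*)([>.])")


def parse_rule(text):
    """Split a rule string into its five parts."""
    m = RULE_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid rule: {text!r}")
    rev, star, digit, append, final = m.groups()
    return Rule(rev[::-1], star == "*", int(digit), append, final == ">")


print(parse_rule("sei3y>"))
print(parse_rule("mu*2."))
```

Storing the ending in reverse lets a table be scanned from the last letter of the form backwards, which is why the rule text reads “sei” for the ending “ies”.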

  42. Examples of rules • “sei3y>” • if form ends in “ies” then replace the last 3 letters by “y” and continue stemming (“tries” becomes “try”)

  43. Examples of rules • “mu*2.” • if form ends with “um” and word is intact remove 2 last letters and terminate stemming. • “maximum” is stemmed to “maxim”, but “presum” from “presumably” remains unchanged

  44. Examples of rules • “ylp0.” - if word terminates in “ply” terminate. Next rule “yl2>” does not remove “ly” from “multiply” • “nois4j>” causes “sion” to be replaced by “j”. • “j” acts as dummy ending • “provision” converted to “provij” and then to “provid”

  45. The Algorithm
      terminate := false;
      while not terminate and there is a section for last letter of form do
          use last letter of form to select a section in the table of rules;
          applied := false;
          while more rules in section and not applied do
              {Check the applicability of the current rule}
              if form ending does not match the current rule’s ending then try next rule;
              if form matches ending but intact flag is on and the form is not intact then try next rule;
              if acceptability conditions are not satisfied then try next rule;
              apply rule to form (delete and append as rule specifies);
              applied := true;
              if rule ends in “.” then terminate := true;
          endwhile
          if not applied then terminate := true;
      endwhile

  46. Acceptability conditions • Attempt to prevent over-stemming • Without them “rent”, “rant”, “rice”, “rate”, “ration” “river” reduce to “r”. • There are 2 rules:

  47. Acceptability conditions • If a form starts with a vowel then at least 2 letters must remain (owed/owing->ow but not ear->e) • If a form starts with a consonant then at least 3 letters must remain after stemming, and at least one of them must be a vowel or “y” (saying->say, crying->cry, but not string->str, meant->me, or cement->ce)
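
The two acceptability conditions can be written as a single predicate on the candidate stem. This sketch assumes lower-case input and treats only a, e, i, o, u as vowels; `is_acceptable` is an illustrative name.

```python
VOWELS = set("aeiou")


def is_acceptable(stem):
    """Check the two acceptability conditions on a candidate stem."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        # Vowel-initial: at least 2 letters must remain.
        return len(stem) >= 2
    # Consonant-initial: at least 3 letters, one of them a vowel or "y".
    return len(stem) >= 3 and any(c in VOWELS or c == "y" for c in stem)


for candidate in ["ow", "e", "say", "cry", "str", "me", "ce"]:
    print(candidate, is_acceptable(candidate))
```

A rule whose result fails this check is skipped, so “string” keeps its ending rather than shrinking to “str”.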

  48. Acceptability conditions • These rules cause errors in the stemming of some short-rooted words • (doing, dying, being). • These could be dealt with separately with a table lookup

  49. Stemming “separately” • Use “y” section. Mismatch “ylb1>”, “yli3y>”, “ylp0.”. Match “yl2>”. Form becomes “separate” • Use “e1>” in “e” section. Form changes to “separat” • Use “t” section. Mismatch with “tacilp4y.”. Match with “ta2>”. Form changes to “separ” • Use “r” section. Match with “ra2.”. Form changes to “sep” and stemming terminates
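
The trace above can be reproduced with a compact sketch of the stemming loop. The rule table holds only rules quoted on the slides, so it is a tiny excerpt and not the full Paice/Husk table. Two assumptions are made here: acceptability is checked on the form after both removal and append, and a form stops being intact once any rule has been applied. All function names are illustrative.

```python
VOWELS = set("aeiou")

# Rules quoted on the slides, grouped by the last letter of the ending.
RULES = {
    "e": ["e1>"],
    "m": ["mu*2."],
    "r": ["ra2."],
    "s": ["sei3y>"],
    "t": ["tacilp4y.", "ta2>"],
    "y": ["ylb1>", "yli3y>", "ylp0.", "yl2>"],
}


def parse(rule):
    """Return (ending, intact_only, remove, append, continues)."""
    i = 0
    while rule[i].isalpha():
        i += 1
    ending = rule[:i][::-1]          # endings are stored reversed
    intact_only = rule[i] == "*"
    if intact_only:
        i += 1
    return ending, intact_only, int(rule[i]), rule[i + 1:-1], rule[-1] == ">"


def acceptable(form):
    """Acceptability conditions from the slides."""
    if form[0] in VOWELS:
        return len(form) >= 2
    return len(form) >= 3 and any(c in VOWELS or c == "y" for c in form)


def paice_husk_stem(word):
    form, intact = word, True
    while form and form[-1] in RULES:
        applied = False
        for rule in RULES[form[-1]]:
            ending, intact_only, remove, append, continues = parse(rule)
            if not form.endswith(ending):
                continue             # ending does not match
            if intact_only and not intact:
                continue             # rule only applies to unstemmed forms
            candidate = form[:len(form) - remove] + append
            if not candidate or not acceptable(candidate):
                continue             # would over-stem
            form, intact, applied = candidate, False, True
            break
        if not applied or not continues:
            break                    # no rule fired, or rule ended in "."
    return form


print(paice_husk_stem("separately"))
```

With this excerpt of the table, “separately” passes through “separate”, “separat” and “separ” before ending at “sep”, matching the slide.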

  50. Other examples
