
Stemming


waite


  1. Stemming: Chapter 8

  2. Content • INTRODUCTION • Example of Stemmer Use in Searching • TYPES OF STEMMING ALGORITHMS • Affix Removal Stemmers • STEMMING TO COMPRESS INVERTED FILES

  3. 1. INTRODUCTION • One technique for improving IR performance is to provide searchers with ways of finding morphological variants of search terms. If, for example, a searcher enters the term stemming as part of a query, it is likely that he or she will also be interested in such variants as stemmed and stem. • We use the term conflation, meaning the act of fusing or combining, as the general term for the process of matching morphological term variants. • Conflation can be either manual, using some kind of regular expressions, or automatic, via programs called stemmers. • Stemming is also used in IR to reduce the size of index files. Since a single stem typically corresponds to several full terms, storing stems instead of terms can achieve compression factors of over 50 percent.
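
As a rough illustration of manual conflation, the searcher can write a regular expression that covers the variants of interest. The vocabulary and the pattern below are hypothetical and chosen only for this sketch.

```python
import re

# Hypothetical index vocabulary; the terms are illustrative only.
vocabulary = ["stem", "stemmed", "stemming", "stems", "system", "steam"]

# Manual conflation: one hand-written pattern covering the wanted variants.
pattern = re.compile(r"stem(s|med|ming)?")

# fullmatch requires the whole term to fit the pattern, so "system" is rejected.
matches = [term for term in vocabulary if pattern.fullmatch(term)]
print(matches)  # ['stem', 'stemmed', 'stemming', 'stems']
```

An automatic stemmer replaces the hand-written pattern with rules that are applied to every term.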

  4. Contd..

  5. Contd.. • Figure 8.1 shows a taxonomy for stemming algorithms. There are four automatic approaches. Affix removal algorithms remove suffixes and/or prefixes from terms leaving a stem. These algorithms sometimes also transform the resultant stem. • The name stemmer derives from this method, which is the most common. Successor variety stemmers use the frequencies of letter sequences in a body of text as the basis of stemming. • The n-gram method conflates terms based on the number of digrams or n-grams they share. Terms and their corresponding stems can also be stored in a table. Stemming is then done via lookups in the table. These methods are described below.

  6. Contd..

  7. 2. Example of Stemmer Use in Searching • To illustrate how a stemmer is used in searching, consider the following example from the CATALOG system (Frakes 1984, 1986). In CATALOG, terms are stemmed at search time rather than at indexing time. CATALOG prompts for queries with the string "Look for:". At the prompt, the user types in one or more terms of interest. • For example, entering "system users" at the "Look for:" prompt will cause CATALOG to attempt to find documents about system users. CATALOG takes each term in the query and tries to determine which other terms in the database might have the same stem. If any possibly related terms are found, CATALOG presents them to the user for selection. In the case of the query term "users," for example, CATALOG might respond as follows:

  8. Contd.. • The user selects the terms he or she wants by entering their numbers. This method of using a stemmer in a search session provides a naive system user with the advantages of term conflation while requiring little knowledge of the system or of searching techniques. • It also allows experienced searchers to focus their attention on other search problems. Since stemming may not always be appropriate, the stemmer can be turned off by the user. Having a user select the terms from the set found by the stemmer also reduces the likelihood of false matches.

  9. 3. TYPES OF STEMMING ALGORITHMS • There are several approaches to stemming. One way to do stemming is to store a table of all index terms and their stems. Terms from queries and indexes could then be stemmed via table lookup. • Using a B-tree or hash table, such lookups would be very fast. There are problems with this approach. The first is that no such table exists for English. Even if one did, many terms found in databases would not be represented, since they are domain dependent, that is, not standard English.
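
A minimal sketch of table-lookup stemming, using a Python dict as the hash table. The table entries and the function name `lookup_stem` are hypothetical and not taken from the slides.

```python
# Hypothetical term-to-stem table; the entries are illustrative only.
STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}


def lookup_stem(term):
    """Return the stored stem, or None when the term is not in the table."""
    return STEM_TABLE.get(term.lower())


print(lookup_stem("Engineered"))  # engineer
print(lookup_stem("nanotube"))    # None: a domain term missing from the table
```

The second call shows the coverage problem: a term absent from the table gets no stem at all.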

  10. Contd.. • For these terms, some other stemming method would be required. Another problem is the storage overhead for such a table, though trading size for time is sometimes warranted. Storing precomputed data, as opposed to computing the data values on the fly, is useful when the computations are frequent and/or expensive. • Bentley (1982), for example, reports cases such as chess computations where storing precomputed results gives significant performance improvements. n-gram stemmers

  11. Contd.. • Adamson and Boreham (1974) reported a method of conflating terms called the shared digram method. A digram is a pair of consecutive letters. Since trigrams, or n-grams in general, could also be used, we call it the n-gram method. • Calling this a "stemming method" is a bit confusing, since no stem is produced. In this approach, association measures are calculated between pairs of terms based on shared unique digrams. For example, the terms statistics and statistical can be broken into digrams as follows.

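  12. Contd.. • statistics => st ta at ti is st ti ic cs; unique digrams = at cs ic is st ta ti • statistical => st ta at ti is st ti ic ca al; unique digrams = al at ca ic is st ta ti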

  13. Contd.. • Thus, "statistics" has nine digrams, seven of which are unique, and "statistical" has ten digrams, eight of which are unique. The two words share six unique digrams: at, ic, is, st, ta, ti. • Once the unique digrams for the word pair have been identified and counted, a similarity measure based on them is computed. The similarity measure used was Dice's coefficient, which is defined as S = 2C/(A + B). • Here A is the number of unique digrams in the first word, B the number of unique digrams in the second, and C the number of unique digrams shared by the two words.

  14. Contd.. • For the example above, Dice's coefficient would equal (2 x 6)/(7 + 8) = 0.80. Such similarity measures are determined for all pairs of terms in the database, forming a similarity matrix. Since Dice's coefficient is symmetric (Sij = Sji), a lower triangular similarity matrix can be used. • With a cut-off value of S = 0.6, Adamson and Boreham found ten out of eleven clusters correct for matching document titles.
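
The digram counts and Dice's coefficient for the statistics/statistical example can be checked with a short sketch. The function names are hypothetical, and storing the lower triangle in a dict keyed by index pairs is an assumption made for brevity.

```python
def unique_digrams(word):
    """Return the set of distinct pairs of consecutive letters in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_coefficient(first, second):
    """Dice's coefficient 2C / (A + B) over the unique digrams of two words."""
    a = unique_digrams(first)
    b = unique_digrams(second)
    if not a and not b:
        return 0.0  # words shorter than two letters have no digrams
    return 2 * len(a & b) / (len(a) + len(b))


def similarity_lower_triangle(terms):
    """Compute S only for pairs with j < i, since S is symmetric."""
    return {
        (i, j): dice_coefficient(terms[i], terms[j])
        for i in range(len(terms))
        for j in range(i)
    }


print(sorted(unique_digrams("statistics")))
# ['at', 'cs', 'ic', 'is', 'st', 'ta', 'ti']
print(round(dice_coefficient("statistics", "statistical"), 2))  # 0.8
```

Pairs whose value in the lower triangle meets the chosen cut-off (0.6 in the reported experiment) would then be grouped into the same cluster.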

  15. 4. Affix Removal Stemmers • Affix removal algorithms remove suffixes and/or prefixes from terms leaving a stem. These algorithms sometimes also transform the resultant stem. A simple example of an affix removal stemmer is one that removes the plurals from terms. A set of rules for such a stemmer is as follows (Harman 1991). • Most stemmers currently in use are iterative longest match stemmers, a kind of affix removal stemmer first developed by Lovins (1968). An iterative longest match stemmer removes the longest possible string of characters from a word according to a set of rules. This process is repeated until no more characters can be removed. • Even after all removable characters have been stripped, stems may not be correctly conflated. The word "skies," for example, may have been reduced to the stem "ski," which will not match "sky." There are two techniques to handle this: recoding or partial matching.
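
The rule list for the plural-removal stemmer did not survive in this transcript. The sketch below is therefore an assumption: three ordered suffix rules modeled on the commonly described "S" stemmer, applied once per word with only the first matching rule used. The function name `remove_plural` is hypothetical.

```python
def remove_plural(word):
    """Apply the first matching plural-removal rule (assumed rule set)."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees", or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


for term in ["skies", "users", "databases", "corpus", "glass"]:
    print(term, "->", remove_plural(term))
```

Under these assumed rules "skies" becomes "sky", so this particular word needs no recoding step; a longest match stemmer that strips "es" outright would instead produce the "ski" stem mentioned above.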

  16. Contd..

  17. 5. STEMMING TO COMPRESS INVERTED FILES • Since a stem is usually shorter than the words to which it corresponds, storing stems instead of full words can decrease the size of index files. Lennon et al. (1981) report the following compression percentages for various stemmers and databases. For instance, the index file for the Cranfield collection was 32.1 percent smaller after it was stemmed using the INSPEC stemmer.
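
The effect can be illustrated on a toy vocabulary. How Lennon et al. measured compression is not stated in this transcript, so the measure below (reduction in total characters when distinct stems replace distinct terms) is an assumption, and the mapping `TOY_STEMS` is hypothetical data standing in for a real stemmer.

```python
# Hypothetical term-to-stem mapping standing in for a real stemmer.
TOY_STEMS = {
    "stem": "stem", "stems": "stem", "stemmed": "stem", "stemming": "stem",
    "user": "user", "users": "user",
}


def compression_percent(terms, stem):
    """Percent reduction in characters when distinct stems replace distinct terms."""
    distinct_terms = set(terms)
    distinct_stems = {stem(t) for t in distinct_terms}
    term_chars = sum(len(t) for t in distinct_terms)
    stem_chars = sum(len(s) for s in distinct_stems)
    if term_chars == 0:
        return 0.0  # nothing to compress
    return 100.0 * (term_chars - stem_chars) / term_chars


result = compression_percent(list(TOY_STEMS), TOY_STEMS.get)
print(round(result, 1))  # 75.8 (33 characters of terms, 8 characters of stems)
```

A toy vocabulary built entirely from variants overstates the gain; real collections contain many terms with no variants, and this sketch ignores the postings stored alongside each entry.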

  18. Contd..

  19. Contd.. Thank You
