Discussion Class 3
The Porter Stemmer
Discussion Classes
Format:
Questions.
Ask a member of the class to answer.
Provide opportunity for others to comment.
When answering:
Stand up.
Give your name. Make sure that the TA hears it.
Speak clearly so that all the class can hear.
Do not be shy about presenting partial answers.
Differing viewpoints are welcome.
Who wrote this paper? When? For what audience?
Define the terms: stem, suffix, prefix, conflation
What makes a good stemming algorithm? How would you measure it?
Porter proposes a criterion for removing suffixes. What is it? Do you agree with it?
 Earlier system       Present system
precision  recall    precision  recall
        0   57.24            0   58.60
       10   56.85           10   58.13
       20   52.85           20   53.92
       30   42.61           30   43.51
       40   42.20           40   39.39
       50   39.06           50   38.85
       60   32.86           60   33.18
       70   31.64           70   31.19
       80   27.15           80   27.52
       90   24.59           90   25.85
      100   24.59          100   25.85
Explain the data in this table.
The paper calls this, "the standard recall cutoff method". Have you any comments?
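As a starting point for discussion, here is a minimal sketch of one common way to obtain a precision figure at each recall level 0, 10, ..., 100 for a single query: take the best precision reached at or beyond that recall level (interpolated precision). Whether this is exactly the procedure behind the paper's table is an assumption; the function name, document identifiers, and relevance judgments are hypothetical.

```python
def interpolated_precision(ranking, relevant, levels=range(0, 101, 10)):
    """Precision (%) at each recall level (%) for one ranked result list.

    ranking  -- document ids in the order the system returned them
    relevant -- set of document ids judged relevant for the query
    """
    # Record (recall, precision) each time a relevant document is retrieved.
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            recall = 100.0 * hits / len(relevant)
            precision = 100.0 * hits / rank
            points.append((recall, precision))

    # Interpolate: best precision observed at or beyond each recall level.
    result = {}
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        result[level] = max(candidates) if candidates else 0.0
    return result


# Hypothetical query: five documents returned, two of them relevant.
table = interpolated_precision(["d1", "d2", "d3", "d4", "d5"], {"d1", "d3"})
for level, precision in table.items():
    print(f"{level:>3} {precision:6.2f}")
```

A table like the one above would then come from averaging such per-query figures over all test queries, once for each system being compared.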
The following diagram illustrates the various categories of stemmer. Porter's algorithm follows the path Automatic (stemmers), Affix removal (shown in red on the original diagram). What do these terms mean?
Manual
Automatic (stemmers)
    Affix removal
    Successor variety
    Table lookup
    n-gram
The paper gives the following example of Step 1a. Explain what this step does.
Suffix    Replacement    Examples
sses      ss             caresses -> caress
ies       i              ponies -> poni
                         ties -> ti
ss        ss             caress -> caress
s         null           cats -> cat
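The Step 1a table can be read as an ordered list of suffix rules in which only the longest matching suffix is applied. A minimal sketch, assuming lowercase input (the function name `step_1a` is ours, not the paper's):

```python
def step_1a(word):
    """Apply the Step 1a rules; the longest matching suffix wins."""
    rules = [
        ("sses", "ss"),  # caresses -> caress
        ("ies", "i"),    # ponies -> poni
        ("ss", "ss"),    # caress -> caress (blocks the plain "s" rule)
        ("s", ""),       # cats -> cat
    ]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word  # no rule applies


for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step_1a(w))
```

Note the role of the "ss -> ss" rule: it changes nothing itself, but because it matches first it stops the final "s" rule from turning "caress" into "cares".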
Conditions    Suffix    Replacement    Examples
(m > 0)       eed       ee             feed -> feed
                                       agreed -> agree
(*v*)         ed        null           plastered -> plaster
                                       bled -> bled
(*v*)         ing       null           motoring -> motor
                                       sing -> sing
(a) Explain this table
(b) How does this table apply to: "exceeding", "ringed"?
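To check answers by hand, the three rules in this table can be sketched as follows. The sketch assumes lowercase input and covers only the rules shown; the paper follows a successful "ed" or "ing" removal with further tidy-up rules, which are omitted here. The helper names `cv_pattern`, `measure`, and `step_1b_first_part` are ours.

```python
def cv_pattern(stem):
    """Map each letter to C (consonant) or V (vowel), with Porter's rule for y."""
    pattern = []
    for i, ch in enumerate(stem):
        if ch in "aeiou":
            pattern.append("V")
        elif ch == "y" and i > 0 and pattern[i - 1] == "C":
            pattern.append("V")  # y after a consonant counts as a vowel
        else:
            pattern.append("C")
    return "".join(pattern)


def measure(stem):
    """m: the number of vowel-to-consonant transitions in the stem."""
    return cv_pattern(stem).count("VC")


def step_1b_first_part(word):
    """Apply the eed / ed / ing rules from the table above."""
    rules = [
        ("eed", "ee", lambda stem: measure(stem) > 0),     # (m > 0)
        ("ed", "", lambda stem: "V" in cv_pattern(stem)),  # (*v*)
        ("ing", "", lambda stem: "V" in cv_pattern(stem)), # (*v*)
    ]
    for suffix, replacement, condition in rules:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Only the longest matching suffix is considered,
            # even when its condition fails (feed stays feed).
            return stem + replacement if condition(stem) else word
    return word


for w in ["feed", "agreed", "plastered", "bled", "motoring", "sing"]:
    print(w, "->", step_1b_first_part(w))
```

Running the function on the words in part (b) is a quick way to test a hand-derived answer.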
Step 5a is defined as follows. What does this do and why?
Conditions            Suffix    Replacement    Examples
(m > 1)               e         null           probate -> probat
                                               rate -> rate
(m = 1 and not *o)    e         null           cease -> ceas
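A minimal sketch of Step 5a, assuming lowercase input. In the paper, the condition *o means that the stem ends consonant-vowel-consonant and the final consonant is not w, x, or y. The names `cv_pattern` and `step_5a` are ours.

```python
def cv_pattern(stem):
    """Map each letter to C (consonant) or V (vowel), with Porter's rule for y."""
    pattern = []
    for i, ch in enumerate(stem):
        if ch in "aeiou":
            pattern.append("V")
        elif ch == "y" and i > 0 and pattern[i - 1] == "C":
            pattern.append("V")  # y after a consonant counts as a vowel
        else:
            pattern.append("C")
    return "".join(pattern)


def step_5a(word):
    """Drop a final e when the remaining stem is long enough."""
    if not word.endswith("e"):
        return word
    stem = word[:-1]
    pattern = cv_pattern(stem)
    m = pattern.count("VC")  # measure of the stem
    # *o: stem ends consonant-vowel-consonant, last letter not w, x, or y
    star_o = pattern.endswith("CVC") and stem[-1] not in "wxy"
    if m > 1 or (m == 1 and not star_o):
        return stem
    return word


for w in ["probate", "rate", "cease"]:
    print(w, "->", step_5a(w))
```

The *o test is what keeps short words such as "rate" intact while still allowing "cease" to lose its final e.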
Discuss the following:
"The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure, m. There is no linguistic basis for this approach. It was merely observed that m could be used quite effectively to help decide whether or not it was wise to take off a suffix."
(a) What is m?
(b) Why is it a reasonable measure?
(c) What anomalies does it produce?
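For part (a), recall that the paper writes every word in the form [C](VC)^m[V], where C is a run of consonants and V a run of vowels. Under that definition m is simply the number of places where a vowel is followed by a consonant. A minimal sketch, assuming lowercase input (the names `cv_pattern` and `measure` are ours):

```python
def cv_pattern(stem):
    """Map each letter to C (consonant) or V (vowel), with Porter's rule for y."""
    pattern = []
    for i, ch in enumerate(stem):
        if ch in "aeiou":
            pattern.append("V")
        elif ch == "y" and i > 0 and pattern[i - 1] == "C":
            pattern.append("V")  # y after a consonant counts as a vowel
        else:
            pattern.append("C")
    return "".join(pattern)


def measure(stem):
    """m in [C](VC)^m[V]: count the vowel-to-consonant transitions."""
    return cv_pattern(stem).count("VC")


# Examples of m = 0, 1, 2 as listed in the paper.
for w in ["tree", "by", "trouble", "oats", "ivy", "private", "orrery"]:
    print(w, cv_pattern(w), measure(w))
```

Trying the function on short and long words of your own choosing is one way to look for the anomalies asked about in part (c).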
(a) In Web search engines, the tendency is not to use stemming. Why? (There are several answers.)
(b) Does your answer to part (a) mean that stemming is no longer useful?