

  1. EE3J2 Data Mining, Lecture 3: Zipf’s Law, Stemming & Stop Lists. Martin Russell

  2. Objectives • Understand Zipf’s law • Understand utility of Zipf’s law for IR • Understand motivation and methods of Stemming • Understand definition and use of Stop Lists

  3. Zipf’s Law • George Kingsley Zipf (1902–1950) • For each word w, let F(w) be the number of times w occurs in the corpus • Sort the words according to frequency • The words’ rank-frequency distribution will be fitted closely by the function: F(w) ≈ C / (r(w) + B)^α, where r(w) is the rank of w and B, C, α are constants with α close to 1

  4. Word Frequency Plot: “Alice in Wonderland” • [Figure: log–log plot of word frequency against rank, comparing Zipf’s law with the actual statistics from “Alice in Wonderland”] • Different words: 2,787; total words: 26,395

  5. Explanations of Zipf’s Law (1) • Linguistic / psycholinguistic explanation • Mathematical explanation

  6. Principle of Least Effort • Basically, when a person applies a tool to a job, he or she tries to minimise the effort needed to achieve an acceptable goal • Example – speech communication • Goal is to exchange information • Speaker would prefer minimum effort (minimum articulation, short words, small vocabulary, etc.) • Listener would prefer careful articulation, detailed & specific vocabulary • Feedback from the listener indicates success or failure of communication • e.g. the Lombard effect

  7. Speech Communication • So, if the talker and listener are familiar friends from the same linguistic background, and the environment is quiet, minimal articulation is needed, so minimal articulation is used • But if the talker and listener are, say, both non-native users of the language and it is noisy, then careful articulation and choice of vocabulary are important

  8. ‘Text communication’ • For an author of text there is typically no immediate feedback • The ‘Principle of Least Effort’ suggests that the author will use a basic ‘working’ vocabulary where possible, with uncommon, special words for particular tasks • Also, no matter how creative the author is, he or she will need to use lots of “the”s and “and”s, because correct grammar requires this • This appears to be consistent with Zipf’s law

  9. Mathematical explanation • Monkey Text • Imagine a typewriter with just 5 keys: a, b, c, d and <space> • Suppose that a monkey sits at the typewriter and presses each key with equal probability p: p = p(a) = p(b) = p(c) = p(d) = p(<space>) = 1/5 • As before, we’ll say that a word is a sequence of characters bordered by <space>s

  10. ‘Monkey text’ continued • Probability of a particular 1-character ‘word’ x is: p(x) × p(<space>) = 1/25 • There are 4 one-character ‘words’ • Probability of a particular 2-character ‘word’ xy is: p(x) × p(y) × p(<space>) = 1/125 • There are 4 × 4 = 16 two-character ‘words’, etc.

  11. Graphical ‘tree’ representation • [Figure: a tree of depth 0 to 3 in which each branch appends one character from {A, …, Z, <sp>}; paths from the root correspond to character sequences with probabilities p(A), p(AA), p(AZ), p(AAA), …]

  12. Zipf’s Law and ‘Monkey Text’ • [Figure: rank–frequency statistics of monkey text compared with Zipf’s law]
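To see this empirically, here is a minimal C sketch (an illustration added for this transcript, not part of the original slides) that simulates the five-key monkey typewriter and prints the top of the resulting rank–frequency table; plotted on log–log axes the counts follow the stepped curve that approximates Zipf’s law.

```c
/* monkey.c -- simulate 'monkey text' over {a,b,c,d,<space>} and
   print the top of the rank-frequency table (illustrative sketch). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXWORDS 50000
#define MAXLEN   32

static char words[MAXWORDS][MAXLEN];
static int  counts[MAXWORDS];
static int  order[MAXWORDS];
static int  nwords = 0;

/* record one occurrence of word w (linear search is fine for a demo) */
static void add_word(const char *w)
{
    for (int i = 0; i < nwords; i++)
        if (strcmp(words[i], w) == 0) { counts[i]++; return; }
    if (nwords < MAXWORDS) { strcpy(words[nwords], w); counts[nwords++] = 1; }
}

/* order indices by descending count */
static int by_count(const void *a, const void *b)
{
    return counts[*(const int *)b] - counts[*(const int *)a];
}

int main(void)
{
    const char keys[] = "abcd ";   /* 4 letters + space, each with probability 1/5 */
    char buf[MAXLEN];
    int len = 0;

    srand(1);                                /* fixed seed for repeatability */
    for (long i = 0; i < 200000; i++) {      /* 200,000 key presses */
        char c = keys[rand() % 5];
        if (c == ' ') {                      /* a space ends the current word */
            if (len > 0) { buf[len] = '\0'; add_word(buf); len = 0; }
        } else if (len < MAXLEN - 1) {
            buf[len++] = c;
        }
    }

    for (int i = 0; i < nwords; i++) order[i] = i;
    qsort(order, nwords, sizeof order[0], by_count);

    for (int r = 0; r < 20 && r < nwords; r++)    /* top 20 ranks */
        printf("%3d  %-8s %d\n", r + 1, words[order[r]], counts[order[r]]);
    return 0;
}
```

Running it shows the four one-character words clustered at the top with roughly equal counts, then the sixteen two-character words, and so on: exactly the step pattern analysed on the following slides.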

  13. Observations • This analysis isn’t quite right (although the basic conclusion is correct)! • Zipf’s Law applies to the probability distribution of words in a text • The probabilities that we have calculated only form a distribution (i.e. sum to one) if all sequences are considered • But we are only interested in sequences that begin and end with a space • Therefore we need to normalise the subset of probabilities corresponding to the sequences of interest • This is dealt with in Belew, pages 150–152 (but beware, there are some errors!)

  14. More formal analysis • Suppose the alphabet has M characters, plus a space character <sp>, each pressed with probability p = p(A) = … = p(Z) = p(<sp>) = 1/(M+1) • So, the probability of a particular ‘word’ w_k of length k is (remember the spaces before and after the word): p(w_k) = p^(k+2)

  15. Calculating word probability • Given an infinitely long text, the probability p_k that w_k occurs in the text must be proportional to p(w_k): p_k = c·p(w_k) = c/(M+1)^(k+2) • The number of words of length k is M^k, so the probability of any word of length k occurring is M^k·p_k • It must be the case that the sum of these probabilities over all k is 1, and we can use this to find c: Σ_k M^k·p_k = 1
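The closing equation of this slide did not survive the transcript; reconstructing it from the definitions above (and assuming ‘words’ have length k ≥ 1), the normalisation is a geometric series:

$$\sum_{k=1}^{\infty} M^k p_k \;=\; \frac{c}{(M+1)^2}\sum_{k=1}^{\infty}\left(\frac{M}{M+1}\right)^{k} \;=\; \frac{c}{(M+1)^2}\cdot M \;=\; 1 \quad\Longrightarrow\quad c \;=\; \frac{(M+1)^2}{M}.$$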

  16. Calculating word rank • In general, the number of words of length k is M^k • Therefore, the number of words of length k or less is: M + M^2 + … + M^k = M(M^k − 1)/(M − 1) • Our word w_k occurs less frequently than shorter words and more frequently than longer words. In other words, if r_k is the rank of w_k: M(M^(k−1) − 1)/(M − 1) < r_k ≤ M(M^k − 1)/(M − 1)

  17. Calculating word rank (continued) • Therefore, on average: r_k ≈ ½ [ M(M^(k−1) − 1)/(M − 1) + M(M^k − 1)/(M − 1) ], the midpoint of the two bounds • So, we have: r_k ≈ M^k up to constant factors, i.e. k ≈ log_M r_k

  18. Relating probability and rank • By combining these we get an expression for p_k in terms of r_k and M, which is what you need to compare with the Zipf curve: p_k ≈ C/(r_k + B)^α • This is the same form as Zipf’s law. The values of B and C depend on M and α • The value of α depends on M (see Belew, pages 150–152, for details) • For M = 26: α = 1.012, C = 0.02, B = 0.54
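As a sanity check on the quoted exponent (a reconstruction, using the crude approximation r_k ≈ M^k from the previous slide rather than Belew’s more careful rank estimate):

$$p_k \;=\; \frac{c}{(M+1)^{k+2}} \;=\; \frac{c}{(M+1)^{2}}\,(M+1)^{-\log_M r_k} \;=\; \frac{c}{(M+1)^{2}}\; r_k^{-\alpha}, \qquad \alpha = \frac{\ln(M+1)}{\ln M}.$$

For M = 26 this gives α = ln 27 / ln 26 ≈ 1.012, matching the value on the slide; the constants B and C absorb the corrections from the exact rank estimate.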

  19. Conclusions (Zipf’s Law) • Zipf’s law appears to reflect simple character statistics, rather than meaning • Of limited direct relevance for IR • Potentially useful for identifying keywords

  20. ‘Resolving Power’ of words • [Figure: the Zipf rank–frequency curve overlaid with the ‘resolving power’ of words, which peaks at intermediate rank; words above an upper cutoff are too common, and words below a lower cutoff are too rare]

  21. Stemming (morphology) • Remove surface markings from words to reveal their basic form: • forms → form • forming → form • formed → form • former → form • If a query and document contain different forms of the same word, we want to know this

  22. Stemming • Of course, not all words obey such simple rules: • running → run • runs → run • women → woman • leaves → leaf • ferries → ferry • alumni → alumnus • data → datum • crises → crisis [Belew, chapter 2]

  23. Stemming • Linguists distinguish between different types of morphology: • Minor changes, such as plurals and tense • Major changes, e.g. incentive → incentivize, which change the grammatical category of a word • A common solution is to identify sub-patterns of letters within words and devise rules for dealing with these patterns

  24. Stemming • Example rules [Belew, p 45]: • (.*)SSES → \1SS: any string ending in SSES is stemmed by replacing SSES with SS • (.[AEIOU].*)ED → \1: any string containing a vowel and ending in ED is stemmed by removing the ED
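As a concrete illustration, the two rules above can be coded as plain suffix tests in C. This is a minimal sketch written for this transcript (the test words are invented for the demo); it is not the Porter stemmer, which applies its rules in several ordered steps.

```c
/* stem_rules.c -- the two example rules from Belew, p 45:
   (.*)SSES -> \1SS   and   (.[AEIOU].*)ED -> \1          */
#include <stdio.h>
#include <string.h>

/* do the first n characters of s contain a vowel? */
static int has_vowel(const char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strchr("AEIOU", s[i])) return 1;
    return 0;
}

static void stem(char *w)
{
    size_t n = strlen(w);

    /* Rule 1: a word ending in SSES loses its final ES */
    if (n >= 4 && strcmp(w + n - 4, "SSES") == 0) {
        w[n - 2] = '\0';
        return;
    }
    /* Rule 2: drop a final ED if the remaining stem contains a vowel */
    if (n > 2 && strcmp(w + n - 2, "ED") == 0 && has_vowel(w, n - 2))
        w[n - 2] = '\0';
}

int main(void)
{
    const char *tests[] = { "CARESSES", "FORMED", "PASSES", "BLED" };
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        char w[32];
        strcpy(w, tests[i]);
        stem(w);
        printf("%-10s -> %s\n", tests[i], w);   /* e.g. FORMED -> FORM */
    }
    return 0;
}
```

Note how the vowel condition in Rule 2 blocks BLED → BL, since the candidate stem contains no vowel.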

  25. Stemmers • A stemmer is a piece of software which implements a stemming algorithm • The Porter stemmer is a standard stemmer which is available as a free download (see the EE3J2 webpage for the URL) • The Porter stemmer implements a set of about 60 rules • Use of a stemmer typically reduces vocabulary size by 10% to 50%

  26. Example • Applying the Porter stemmer to the ‘Jane Eyre’ and ‘Alice in Wonderland’ texts reduces vocabulary size by 34% and 22% respectively • [Figure: vocabulary sizes before and after stemming for the two texts]

  27. Example • Examples of results of the Porter stemmer: • form → form • former → former • formed → form • forming → form • formal → formal • formality → formal • formalism → formal • formica → formica • formic → formic • formant → formant • format → format • formation → format

  28. Example: first paragraph of ‘Alice in Wonderland’ • Before: Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’ • After: alic wa begin to get veri tire of sit by her sister on the bank, and of have noth to do: onc or twice she had peep into the book her sister wa read, but it had no pictur or convers in it, ‘and what is the us of a book,’ thought alic ‘without pictur or convers?’

  29. Noise Words • There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further out-door exercise was now out of the question • Noise words: • Vital to understand the grammatical structure of a text • Of little use in the ‘bundle of words’ approach

  30. Stop Lists • In Information Retrieval, these words are often referred to as Stop Words • Rather than detecting stop words using rules, or some other form of analysis, the stop words are simply specified to the system in a list: the Stop List • Stop Lists typically consist of the most common words from some large corpus • There are lots of candidate stop lists online

  31. Example 1: Short Stop List (50 wds) • the of and to a in that is was he for it with as his on be at by i this had not are but from or have an they which you were her all she there would their we him been has when who will more if out so

  32. Example 2: 300 Word Stop List • the of and to a in that is was he for it with as his on be at by i this had not are but from or have an they which one you were her all she there would their we him been has when who will more no if out so said what up its about into than them can only other held keep sure probably free real seems behind cannot miss political air question making office brought whose special heard major problems ago became federal moment study available known result street economic boy ………. • 300 most common words from the Brown Corpus

  33. Alice vs Brown: Most Frequent Words • Most frequent words in ‘Alice in Wonderland’: the and to a she it of said i alice in you was that as her at on all with had but for so be not very what this they little he out is one down up his if about then no know them like were again herself went would do have when could or there thought off how me • 50 most frequent words in the Brown Corpus: the of and to a in that is was he for it with as his on be at by i this had not are but from or have an they which you were her all she there would their we him been has when who will more if out so

  34. stop.c • C program on the course website • Reads in a stop list file (a text file, one word per line) • Stores the stop words in char **stopList • Reads the text file one word at a time • Compares each word with each stop word • Prints out words not in the stop list • Usage: stop stopListFile textFile > opFile
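The real stop.c is on the course website; purely to illustrate the steps just listed, here is a minimal self-contained sketch of the same logic (written for this transcript, not the course program itself):

```c
/* stopfilter.c -- print the words of a text that are not in a stop list.
   Illustrative sketch of the logic of stop.c; usage:
       stopfilter stopListFile textFile > opFile          */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXSTOP 1000
#define MAXWORD 256

int main(int argc, char *argv[])
{
    char *stopList[MAXSTOP];          /* the stop words, one string each */
    int   nStop = 0;
    char  word[MAXWORD];
    FILE *fp;

    if (argc != 3) {
        fprintf(stderr, "usage: %s stopListFile textFile\n", argv[0]);
        return 1;
    }

    /* read the stop list file, one word per line */
    if ((fp = fopen(argv[1], "r")) == NULL) { perror(argv[1]); return 1; }
    while (nStop < MAXSTOP && fscanf(fp, "%255s", word) == 1) {
        stopList[nStop] = malloc(strlen(word) + 1);
        strcpy(stopList[nStop++], word);
    }
    fclose(fp);

    /* read the text one word at a time; print words not in the list */
    if ((fp = fopen(argv[2], "r")) == NULL) { perror(argv[2]); return 1; }
    while (fscanf(fp, "%255s", word) == 1) {
        int isStop = 0;
        for (int i = 0; i < nStop; i++)
            if (strcmp(word, stopList[i]) == 0) { isStop = 1; break; }
        if (!isStop) printf("%s\n", word);
    }
    fclose(fp);
    return 0;
}
```

Note that fscanf keeps punctuation attached to words; a fuller tokeniser would strip it before comparison.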

  35. Examples • Original first paragraph: Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’ • With the 50-word stop list removed: alice beginning get very tired sitting sister bank having nothing do once twice peeped into book sister reading no pictures conversations what use book thought alice without pictures conversation • With the Brown stop list removed: alice beginning tired sitting sister bank twice peeped book sister reading pictures conversations book alice pictures conversation

  36. Summary • [Diagram: the text query and the document each pass through Stemming and then Stop Word Removal before being matched]

  37. Homework • Download the Porter stemmer from the web • See the URL on the course web page • Compile and run it under your favourite OS • Try it out on some words and text corpora • Download stop.c from the web • Download some stop lists • Compile and run stop.c under your favourite OS • Try it out on some stop lists and text corpora • How can you make stop.c run on stemmed text?
