
What can text statistics reveal? {week 05a}

Presentation Transcript


  1. The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval. David Goldschmidt, Ph.D. What can text statistics reveal? {week 05a}. From Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0.

  2. Text transformation • How do we best convert documents to their index terms? • How do we make acquired documents searchable?

  3. Find/Replace • The simplest approach is find, which requires no text transformation • Useful in user applications, but not in search (why?) • An optional transformation is handled during the find operation: case sensitivity

  4. Text statistics (i) • English documents are predictable: • The top two most frequently occurring words, “the” and “of”, account for 10% of word occurrences • The top six most frequently occurring words account for 20% of word occurrences • The top fifty most frequently occurring words account for 50% of word occurrences • Given all unique words in a (large) document, approximately 50% occur only once
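
A quick way to check these coverage figures on any English corpus is to count word frequencies and sum the counts of the top-k words. A minimal sketch in Python; the crude regex tokenizer and the idea of reading a single text string are illustrative assumptions, not anything prescribed by the slides:

```python
from collections import Counter
import re

def coverage_of_top_k(text, k):
    """Return the fraction of all word occurrences accounted for by the k most frequent words."""
    words = re.findall(r"[a-z]+", text.lower())  # crude tokenizer, for illustration only
    counts = Counter(words)
    total = sum(counts.values())
    return sum(freq for _, freq in counts.most_common(k)) / total

# On a large English collection the slide suggests roughly:
#   coverage_of_top_k(text, 2)  ~ 0.10
#   coverage_of_top_k(text, 6)  ~ 0.20
#   coverage_of_top_k(text, 50) ~ 0.50
```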

  5. Text statistics (ii) George Kingsley Zipf (1902–1950) • Zipf’s law: • Rank words in order of decreasing frequency • The rank (r) of a word times its frequency (f) is approximately equal to a constant (k): r × f ≈ k • In other words, the frequency of the rth most common word is inversely proportional to r

  6. Text statistics (iii) • The probability of occurrence (Pr) of a word is the word frequency divided by the total number of words in the document • Revise Zipf’s law as: r × Pr ≈ c, where for English c ≈ 0.1
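
To see the r × Pr ≈ c pattern directly, rank words by frequency and print rank times probability for the top ranks. A minimal sketch, using the same illustrative tokenizer assumption as above:

```python
from collections import Counter
import re

def zipf_table(text, top_n=50):
    """Print rank, word, and r * Pr for the top_n words; Zipf's law predicts
    a roughly constant product (about 0.1 for English)."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    for rank, (word, freq) in enumerate(Counter(words).most_common(top_n), start=1):
        print(f"{rank:3d}  {word:<15s}  r * Pr = {rank * freq / total:.4f}")
```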

  7. Text statistics (iv) • Verify Zipf’s law using the AP89 dataset: • Collection of Associated Press (AP) news stories from 1989 (available at http://trec.nist.gov): Total documents: 84,678; Total word occurrences: 39,749,179; Vocabulary size: 198,763; Words occurring > 1000 times: 4,169; Words occurring once: 70,064

  8. Text statistics (v) • Top 50 words of AP89

  9. Vocabulary growth (i) • As the corpus grows, so does vocabulary size • Fewer new words appear when the corpus is already large • The relationship between corpus size (n) and vocabulary size (v) was defined empirically by Heaps (1978) and called Heaps’ law: v = k × n^β • Constants k and β vary • Typically 10 ≤ k ≤ 100 and β ≈ 0.5
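
Heaps’ law can be sketched both as the formula and as an observed curve: scan the corpus once, record vocabulary size at regular intervals, and compare against k × n^β. The defaults below (k = 30, β = 0.5) are only illustrative values from the typical ranges on the slide, not fitted constants:

```python
import re

def heaps_estimate(n, k=30.0, beta=0.5):
    """Heaps' law prediction of vocabulary size v for a corpus of n word occurrences."""
    return k * n ** beta

def vocabulary_growth(text, step=10_000):
    """Yield (corpus size, observed vocabulary size) pairs while scanning the text."""
    words = re.findall(r"[a-z]+", text.lower())
    seen = set()
    for i, word in enumerate(words, start=1):
        seen.add(word)
        if i % step == 0:
            yield i, len(seen)
```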

  10. Vocabulary growth (ii) • Note the values of k and β

  11. Vocabulary growth (iii) Web pages crawled from .gov in early 2004

  12. Estimating result set size (i) • Word occurrence statistics can be used to estimate the result set size of a user query • Aside from stop words, how many pages contain all of the query terms? • To figure this out, first assume that words occur independently of one another • Also assume that the search engine knows N, the number of documents it indexes

  13. Estimating result set size (ii) • Given three query terms a, b, and c • The probability of a document containing all three is the product of the individual probabilities for each query term: P(a ∩ b ∩ c) = P(a) × P(b) × P(c) • P(a ∩ b ∩ c) is the joint probability of events a, b, and c occurring

  14. Estimating result set size (iii) • We assume the search engine knows the number of documents that a word occurs in • Call these na, nb, and nc • Note that the book uses fa, fb, and fc • Estimate individual query term probabilities: P(a) = na / N, P(b) = nb / N, P(c) = nc / N

  15. Estimating result set size (iv) • Given P(a), P(b), and P(c), we estimate the result set size as: nabc = N × (na / N) × (nb / N) × (nc / N) = (na × nb × nc) / N² • This estimate sounds good, but it is lacking due to our query term independence assumption
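
As a concrete sketch of the independence-based estimate, the function below multiplies N by each term's n_i / N. The three document counts in the comment are made-up numbers for illustration; only N = 25,205,179 (GOV2, from the next slide) comes from the slides:

```python
def estimate_result_set_size(doc_counts, N):
    """Estimate the number of documents containing all query terms, assuming
    term independence: n = N * (n_a / N) * (n_b / N) * ... """
    estimate = float(N)
    for n_i in doc_counts:
        estimate *= n_i / N
    return estimate

# Hypothetical document counts for three terms in a GOV2-sized index:
# estimate_result_set_size([2_000_000, 1_000_000, 500_000], 25_205_179) ≈ 1,574
```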

  16. Estimating result set size (v) • Using the GOV2 dataset with N = 25,205,179 • Poor results, because of the query term independence assumption • Could use word co-occurrence data...

  17. Estimating result set size (vi) • Extrapolate based on the size of the current result set: • The current result set is the subset of documents that have been ranked thus far • Let C be the number of documents found thus far containing all the query words • Let s be the proportion of documents ranked thus far, taken as a fraction of the document count of the least frequently occurring term • Estimate the result set size via nabc = C / s

  18. Estimating result set size (vii) • Given the example query: tropical fish aquarium • The least frequently occurring term is aquarium (which occurs in 26,480 documents) • After ranking 3,000 documents, 258 documents contain all three query terms • Thus, nabc = C / s = 258 / (3,000 ÷ 26,480) ≈ 2,277 • After processing 20% of the documents, the estimate is 1,778 • Which overshoots the actual value of 1,529
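
The extrapolation estimate nabc = C / s from the previous two slides is easy to reproduce; the sketch below simply plugs in the tropical fish aquarium numbers from this slide:

```python
def extrapolate_result_set_size(found_so_far, docs_ranked, least_frequent_term_docs):
    """Estimate the final result set size as C / s, where s is the fraction of the
    least frequently occurring term's documents that have been ranked so far."""
    s = docs_ranked / least_frequent_term_docs
    return found_so_far / s

# After ranking 3,000 of aquarium's 26,480 documents, 258 contain all three terms:
print(round(extrapolate_result_set_size(258, 3_000, 26_480)))  # about 2,277
```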

  19. What next? • Read and study Chapter 4 • Do Exercises 4.1, 4.2, and 4.3 • Start thinking about how to write code to implement the stopping and stemming techniques of Chapter 4
