
What can text statistics reveal? {week 05a}

Presentation Transcript


  1. The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval. David Goldschmidt, Ph.D. What can text statistics reveal? {week 05a}. From Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0.

  2. Text transformation • How do we best convert documents to their index terms? • How do we make acquired documents searchable?

  3. Find/Replace • The simplest approach is find, which requires no text transformation • Useful in user applications, but not in search (why?) • An optional transformation is handled during the find operation: case sensitivity

  4. Text statistics (i) • English documents are predictable: • The top two most frequently occurring words, “the” and “of”, account for 10% of word occurrences • The top six most frequently occurring words account for 20% of word occurrences • The top fifty most frequently occurring words account for 50% of word occurrences • Given all unique words in a (large) document, approximately 50% occur only once
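
A quick way to check these coverage figures on any English corpus is to count word frequencies and sum the counts of the top-k words. A minimal sketch in Python; the crude regex tokenizer and the idea of reading a single text string are illustrative assumptions, not anything prescribed by the slides:

```python
from collections import Counter
import re

def coverage_of_top_k(text, k):
    """Return the fraction of all word occurrences accounted for by the k most frequent words."""
    words = re.findall(r"[a-z]+", text.lower())  # crude tokenizer, for illustration only
    counts = Counter(words)
    total = sum(counts.values())
    return sum(freq for _, freq in counts.most_common(k)) / total

# On a large English collection the slide suggests roughly:
#   coverage_of_top_k(text, 2)  ~ 0.10
#   coverage_of_top_k(text, 6)  ~ 0.20
#   coverage_of_top_k(text, 50) ~ 0.50
```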

  5. Text statistics (ii) George Kingsley Zipf (1902–1950) • Zipf’s law: • Rank words in order of decreasing frequency • The rank (r) of a word times its frequency (f) is approximately equal to a constant (k): r × f ≈ k • In other words, the frequency of the rth most common word is inversely proportional to r

  6. Text statistics (iii) • The probability of occurrence (Pr) of a word is the word frequency divided by the total number of words in the document • Revise Zipf’s law as: r × Pr ≈ c, where for English c ≈ 0.1
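
To see the r × Pr ≈ c pattern directly, rank words by frequency and print rank times probability for the top ranks. A minimal sketch, using the same illustrative tokenizer assumption as above:

```python
from collections import Counter
import re

def zipf_table(text, top_n=50):
    """Print rank, word, and r * Pr for the top_n words; Zipf's law predicts
    a roughly constant product (about 0.1 for English)."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    for rank, (word, freq) in enumerate(Counter(words).most_common(top_n), start=1):
        print(f"{rank:3d}  {word:<15s}  r * Pr = {rank * freq / total:.4f}")
```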

  7. Text statistics (iv) • Verify Zipf’s law using the AP89 dataset: • Collection of Associated Press (AP) news stories from 1989 (available at http://trec.nist.gov): Total documents: 84,678; Total word occurrences: 39,749,179; Vocabulary size: 198,763; Words occurring > 1000 times: 4,169; Words occurring once: 70,064

  8. Text statistics (v) • Top 50 words of AP89

  9. Vocabulary growth (i) • As the corpus grows, so does vocabulary size • Fewer new words appear when the corpus is already large • The relationship between corpus size (n) and vocabulary size (v) was defined empirically by Heaps (1978) and called Heaps’ law: v = k × n^β • Constants k and β vary • Typically 10 ≤ k ≤ 100 and β ≈ 0.5
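
Heaps’ law can be sketched both as the formula and as an observed curve: scan the corpus once, record vocabulary size at regular intervals, and compare against k × n^β. The defaults below (k = 30, β = 0.5) are only illustrative values from the typical ranges on the slide, not fitted constants:

```python
import re

def heaps_estimate(n, k=30.0, beta=0.5):
    """Heaps' law prediction of vocabulary size v for a corpus of n word occurrences."""
    return k * n ** beta

def vocabulary_growth(text, step=10_000):
    """Yield (corpus size, observed vocabulary size) pairs while scanning the text."""
    words = re.findall(r"[a-z]+", text.lower())
    seen = set()
    for i, word in enumerate(words, start=1):
        seen.add(word)
        if i % step == 0:
            yield i, len(seen)
```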

  10. Vocabulary growth (ii) • Note the values of k and β

  11. Vocabulary growth (iii) Web pages crawled from .gov in early 2004

  12. Estimating result set size (i) • Word occurrence statistics can be used to estimate the result set size of a user query • Aside from stop words, how many pages contain all of the query terms? • To figure this out, first assume that words occur independently of one another • Also assume that the search engine knows N, the number of documents it indexes

  13. Estimating result set size (ii) • Given three query terms a, b, and c • The probability of a document containing all three is the product of the individual probabilities for each query term: P(a ∩ b ∩ c) = P(a) × P(b) × P(c) • P(a ∩ b ∩ c) is the joint probability of events a, b, and c occurring

  14. Estimating result set size (iii) • We assume the search engine knows the number of documents that a word occurs in • Call these na, nb, and nc • Note that the book uses fa, fb, and fc • Estimate individual query term probabilities: P(a) = na / N, P(b) = nb / N, P(c) = nc / N

  15. Estimating result set size (iv) • Given P(a), P(b), and P(c), we estimate the result set size as: nabc = N × (na / N) × (nb / N) × (nc / N) = (na × nb × nc) / N² • This estimate sounds good, but it is lacking due to our query term independence assumption
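
As a concrete sketch of the independence-based estimate, the function below multiplies N by each term's n_i / N. The three document counts in the comment are made-up numbers for illustration; only N = 25,205,179 (GOV2, from the next slide) comes from the slides:

```python
def estimate_result_set_size(doc_counts, N):
    """Estimate the number of documents containing all query terms, assuming
    term independence: n = N * (n_a / N) * (n_b / N) * ... """
    estimate = float(N)
    for n_i in doc_counts:
        estimate *= n_i / N
    return estimate

# Hypothetical document counts for three terms in a GOV2-sized index:
# estimate_result_set_size([2_000_000, 1_000_000, 500_000], 25_205_179) ≈ 1,574
```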

  16. Estimating result set size (v) • Using the GOV2 dataset with N = 25,205,179 • Poor results, because of the query term independence assumption • Could use word co-occurrence data...

  17. Estimating result set size (vi) • Extrapolate based on the size of the current result set: • The current result set is the subset of documents that have been ranked thus far • Let C be the number of documents found thus far containing all the query words • Let s be the proportion of documents ranked thus far, taken as a fraction of the document count of the least frequently occurring term • Estimate the result set size via nabc = C / s

  18. Estimating result set size (vii) • Given the example query: tropical fish aquarium • The least frequently occurring term is aquarium (which occurs in 26,480 documents) • After ranking 3,000 documents, 258 documents contain all three query terms • Thus, nabc = C / s = 258 / (3,000 ÷ 26,480) ≈ 2,277 • After processing 20% of the documents, the estimate is 1,778 • Which overshoots the actual value of 1,529
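
The extrapolation estimate nabc = C / s from the previous two slides is easy to reproduce; the sketch below simply plugs in the tropical fish aquarium numbers from this slide:

```python
def extrapolate_result_set_size(found_so_far, docs_ranked, least_frequent_term_docs):
    """Estimate the final result set size as C / s, where s is the fraction of the
    least frequently occurring term's documents that have been ranked so far."""
    s = docs_ranked / least_frequent_term_docs
    return found_so_far / s

# After ranking 3,000 of aquarium's 26,480 documents, 258 contain all three terms:
print(round(extrapolate_result_set_size(258, 3_000, 26_480)))  # about 2,277
```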

  19. What next? • Read and study Chapter 4 • Do Exercises 4.1, 4.2, and 4.3 • Start thinking about how to write code to implement the stopping and stemming techniques of Chapter 4
