The College of Saint Rose

CSC 460 / CIS 560 – Search and Information Retrieval

David Goldschmidt, Ph.D.

What can text statistics reveal? {week 05a}

from Search Engines: Information Retrieval in Practice, 1st edition

by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

Text transformation

How do we best convert documents to their index terms?

How do we make acquired documents searchable?
Find & replace
  • Simplest approach is find, which requires no text transformation
    • Useful in user applications, but not in search (why?)
    • Optional transformation handled during the find operation: case sensitivity
Text statistics (i)
  • English documents are predictable:
    • Top two most frequently occurring words are “the” and “of” (10% of word occurrences)
    • Top six most frequently occurring wordsaccount for 20% of word occurrences
    • Top fifty most frequently occurring words account for 50% of word occurrences
    • Given all unique words in a (large) document, approximately 50% occur only once
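These regularities are easy to check on any corpus. A minimal sketch (the function name, naive whitespace tokenization, and return format are my own, for illustration) using Python's `collections.Counter`:

```python
from collections import Counter

def word_stats(text):
    """Compute basic word-occurrence statistics for a text."""
    words = text.lower().split()   # naive tokenization for illustration
    counts = Counter(words)
    singletons = sum(1 for n in counts.values() if n == 1)
    return {
        "total_words": len(words),
        "vocabulary_size": len(counts),
        # fraction of all occurrences covered by the 6 most common words
        "top6_fraction": sum(n for _, n in counts.most_common(6)) / len(words),
        # fraction of unique words occurring exactly once (~50% in large corpora)
        "singleton_fraction": singletons / len(counts),
    }
```

On a large English corpus, `top6_fraction` should approach 0.2 and `singleton_fraction` should approach 0.5, per the statistics above.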
Text statistics (ii)

George Kingsley Zipf


  • Zipf’s law:
    • Rank words in order of decreasing frequency
    • The rank (r) of a word times its frequency (f) is approximately equal to a constant (k)

r × f ≈ k

    • In other words, the frequency of the rth most common word is inversely proportional to r
Text statistics (iii)
  • The probability of occurrence (Pr) of a word is the word frequency divided by the total number of words in the document
  • Revise Zipf’s law as: r × Pr ≈ c

for English, c ≈ 0.1
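Either form of Zipf's law can be verified directly: rank the words of a corpus by decreasing frequency and compute r × Pr for each. A rough sketch (my own function name; naive whitespace tokenization):

```python
from collections import Counter

def zipf_products(text, top=10):
    """For the `top` most frequent words, return (rank, word, r * Pr),
    where Pr = frequency / total words; Zipf's law predicts r * Pr ~ c."""
    words = text.lower().split()
    total = len(words)
    ranked = Counter(words).most_common(top)
    return [(r, w, r * f / total) for r, (w, f) in enumerate(ranked, start=1)]
```

On a large English corpus, the third column should hover near c ≈ 0.1 for the high-frequency words.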

Text statistics (iv)
  • Verify Zipf’s law using the AP89 dataset:
    • Collection of Associated Press (AP) news stories from 1989 (available at

Total documents: 84,678

Total word occurrences: 39,749,179

Vocabulary size: 198,763

Words occurring > 1000 times: 4,169

Words occurring once: 70,064

Text statistics (v)
  • Top 50 words of AP89
Vocabulary growth (i)
  • As the corpus grows, so does vocabulary size
    • Fewer new words when corpus is already large
  • The relationship between corpus size (n) and vocabulary size (v) was defined empirically by Heaps (1978) and is called Heaps’ law:

v = k × n^β

    • Constants k and β vary
    • Typically 10 ≤ k ≤ 100 and β ≈ 0.5
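Heaps' law is straightforward to apply. A minimal sketch (the default constants below are hypothetical values within the typical ranges just given):

```python
def heaps_vocabulary(n, k=60, beta=0.5):
    """Estimate vocabulary size via Heaps' law: v = k * n**beta.
    k and beta are corpus-dependent; typically 10 <= k <= 100, beta ~ 0.5."""
    return k * n ** beta
```

For example, with k = 10 and β = 0.5, a corpus of one million word occurrences yields an estimated vocabulary of 10 × 1,000 = 10,000 distinct words.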
Vocabulary growth (ii)

note values of k and β

Vocabulary growth (iii)

Web pages crawled from .gov in early 2004

Estimating result set size (i)
  • Word occurrence statistics can be used to estimate the result set size of a user query
    • Aside from stop words, how many pages contain all of the query terms?
      • To figure this out, first assume that words occur independently of one another
      • Also assume that the search engine knows N, the number of documents it indexes
Estimating result set size (ii)
  • Given three query terms a, b, and c
    • Probability of a document containing all three is the product of individual probabilities for each query term:

P(a ∩ b ∩ c) = P(a) × P(b) × P(c)

    • P(a ∩ b ∩ c) is the joint probability of events a, b, and c occurring
Estimating result set size (iii)
  • We assume the search engine knows the number of documents that a word occurs in
    • Call these na, nb, and nc
      • Note that the book uses fa, fb, and fc
  • Estimate individual query term probabilities:
    • P(a) = na / N
    • P(b) = nb / N
    • P(c) = nc / N
Estimating result set size (iv)
  • Given P(a), P(b), and P(c), we estimate the result set size as:

nabc = N × (na / N) × (nb / N) × (nc / N)

nabc = (na × nb × nc) / N²

    • This estimate sounds reasonable, but falls short because of our query term independence assumption
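The independence estimate generalizes to any number of query terms. A minimal sketch (my own function name, for illustration):

```python
def estimate_result_size(N, doc_counts):
    """Estimate how many of N indexed documents contain ALL query terms,
    assuming terms occur independently: n = N * product(n_i / N), where
    n_i is each term's document count.  For three terms this reduces to
    (na * nb * nc) / N**2."""
    estimate = N
    for n_i in doc_counts:
        estimate *= n_i / N
    return estimate
```

For instance, with N = 1,000 documents and three terms each occurring in 100 documents, the estimate is 1,000 × 0.1 × 0.1 × 0.1 = 1 document.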
Estimating result set size (v)
  • Using the GOV2 dataset with N = 25,205,179
    • Poor results, because of the query term independence assumption
    • Could use word co-occurrence data...
Estimating result set size (vi)
  • Extrapolate based on the sizeof the current result set:
    • The current result set is the subset of documents that have been ranked thus far
    • Let C be the number of documents found thus far containing all the query words
    • Let s be the proportion of the total documents ranked (use least frequently occurring term)
    • Estimate result set size via nabc = C / s
Estimating result set size (vii)
  • Given example query: tropical fish aquarium
    • Least frequently occurring term is aquarium (which occurs in 26,480 documents)
    • After ranking 3,000 documents, 258 documents contain all three query terms
    • Thus, nabc = C / s = 258 / (3,000 ÷ 26,480) = 2,277
    • After processing 20% of the documents, the estimate is 1,778
      • Which overshoots actual value of 1,529
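The extrapolation above can be sketched as a small function; plugging in the slide's numbers reproduces the 2,277 estimate (function name is my own):

```python
def extrapolate_result_size(C, ranked, n_least):
    """Extrapolate total result set size from a partial ranking:
    C = documents matching all query terms so far, after processing
    `ranked` of the n_least documents containing the least frequently
    occurring query term; s = ranked / n_least, estimate = C / s."""
    s = ranked / n_least   # proportion of candidate documents seen so far
    return C / s

# Slide's example: 258 matches after ranking 3,000 of aquarium's 26,480 docs
print(round(extrapolate_result_size(258, 3000, 26480)))   # prints 2277
```

Unlike the independence-based formula, this estimate improves as more documents are ranked, though as the slide notes it can still overshoot.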
What next?
  • Read and study Chapter 4
  • Do Exercises 4.1, 4.2, and 4.3
  • Start thinking about how to write code to implement the stopping and stemming techniques of Chapter 4