
What can text statistics reveal? {week 05a}



The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. What can text statistics reveal? {week 05a} from Search Engines: Information Retrieval in Practice, 1st edition


The College of Saint Rose

CSC 460 / CIS 560 – Search and Information Retrieval

David Goldschmidt, Ph.D.

What can text statistics reveal? {week 05a}

from Search Engines: Information Retrieval in Practice, 1st edition

by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0


Text transformation

How do we best convert documents to their index terms?

How do we make acquired documents searchable?


Find/Replace

  • Simplest approach is find, which requires no text transformation

    • Useful in user applications, but not in search (why?)

    • Optional transformation handled during the find operation: case sensitivity
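
As a rough illustration of a find operation with optional case folding, here is a minimal Python sketch. The function name `find_all` and its parameters are hypothetical, chosen only for this example; the point is that the text is scanned as-is and case is handled at find time rather than in a stored index.

```python
def find_all(text: str, pattern: str, case_sensitive: bool = True) -> list[int]:
    """Return the start offsets of every occurrence of pattern in text."""
    if not pattern:
        return []
    if not case_sensitive:
        # Fold case at find time; the stored text itself is never transformed.
        # Assumes lowercasing keeps string length unchanged (true for ASCII).
        text, pattern = text.lower(), pattern.lower()
    offsets = []
    start = text.find(pattern)
    while start != -1:
        offsets.append(start)
        # Continue scanning one character past the previous match.
        start = text.find(pattern, start + 1)
    return offsets


if __name__ == "__main__":
    sample = "The cat and THE hat"
    print(find_all(sample, "the"))
    print(find_all(sample, "the", case_sensitive=False))
```

Each call rescans the whole text, which is acceptable for a single open document but is worth keeping in mind when thinking about the "why?" above.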


Text statistics (i)

  • English documents are predictable:

    • Top two most frequently occurring words are “the” and “of” (10% of word occurrences)

    • Top six most frequently occurring words account for 20% of word occurrences

    • Top fifty most frequently occurring words account for 50% of word occurrences

    • Given all unique words in a (large) document, approximately 50% occur only once
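
These statistics can be measured on any text. The sketch below is one possible way to do so in Python; the tokenizer (lowercase runs of letters, digits, and apostrophes) and the function names `tokenize` and `text_statistics` are assumptions made for this example, not part of the course material.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Very simple tokenizer: lowercase runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def text_statistics(text: str, top_k: int = 2) -> dict:
    """Report top-k coverage and the share of vocabulary occurring once."""
    counts = Counter(tokenize(text))
    if not counts:
        raise ValueError("text contains no tokens")
    total = sum(counts.values())
    top = counts.most_common(top_k)
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "total_occurrences": total,
        "vocabulary_size": len(counts),
        "top_words": top,
        # Fraction of all word occurrences covered by the top_k words.
        "top_k_coverage": sum(c for _, c in top) / total,
        # Fraction of distinct words that appear exactly once.
        "share_occurring_once": singletons / len(counts),
    }


if __name__ == "__main__":
    stats = text_statistics("the cat sat on the mat and the dog sat")
    for key, value in stats.items():
        print(key, value)
```

On a large English corpus, calling `text_statistics(corpus, top_k=50)` would be one way to compare against the percentages listed above; a toy sentence like the one in the demo is far too small to show them.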


Text statistics (ii)

George Kingsley Zipf

(1902-1950)

  • Zipf’s law:

    • Rank words in order of decreasing frequency

    • The rank (r) of a word times its frequency (f) is approximately equal to a constant (k)

      r × f = k

    • In other words, the frequency of the rth most common word is inversely proportional to r


Text statistics (iii)

  • The probability of occurrence (P_r) of a word is the word frequency divided by the total number of words in the document

  • Revise Zipf’s law as: r × P_r = c

For English, c ≈ 0.1
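
One way to check Zipf's law on a token list is to rank words by frequency and look at r × P_r for each rank; if the law holds, the product should stay roughly constant. The helper `zipf_table` below is a hypothetical name for this sketch, and the demo data is synthetic, constructed so that frequency is exactly proportional to 1/r.

```python
from collections import Counter


def zipf_table(tokens, top_n=10):
    """Return rows of (rank, word, frequency, P_r, rank * P_r)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    rows = []
    # most_common yields words in order of decreasing frequency.
    for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
        p_r = freq / total  # probability of occurrence of this word
        rows.append((rank, word, freq, p_r, rank * p_r))
    return rows


if __name__ == "__main__":
    # Synthetic tokens with frequencies 60, 30, 20, 15 (proportional to 1/r).
    tokens = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15
    for rank, word, freq, p_r, product in zipf_table(tokens):
        print(f"{rank:>2} {word} f={freq:<3} P_r={p_r:.3f} r*P_r={product:.3f}")
```

The constant in the synthetic demo is not 0.1, because the toy vocabulary has only four words; real English text is needed to compare against c ≈ 0.1.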


Text statistics (iv)

  • Verify Zipf’s law using the AP89 dataset:

    • Collection of Associated Press (AP) news stories from 1989 (available at http://trec.nist.gov):

      Total documents                   84,678
      Total word occurrences        39,749,179
      Vocabulary size                  198,763
      Words occurring > 1,000 times      4,169
      Words occurring once              70,064
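
A few ratios can be derived directly from the AP89 figures above. The short Python sketch below only does arithmetic on the numbers listed; the variable names are chosen for this example.

```python
# AP89 collection statistics, copied from the figures above.
total_documents = 84_678
total_occurrences = 39_749_179
vocabulary_size = 198_763
words_over_1000 = 4_169
words_once = 70_064

# Average number of word occurrences per news story.
avg_words_per_document = total_occurrences / total_documents

# Share of the vocabulary that occurs exactly once.
share_once = words_once / vocabulary_size

# Share of the vocabulary that occurs more than 1,000 times.
share_over_1000 = words_over_1000 / vocabulary_size

print(f"Average words per document: {avg_words_per_document:.1f}")
print(f"Vocabulary occurring once:  {share_once:.1%}")
print(f"Vocabulary occurring >1000: {share_over_1000:.1%}")
```

For this collection, roughly 35% of the vocabulary occurs only once, which can be compared with the approximate 50% figure quoted earlier for a large document.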


Text statistics (v)

  • Top 50 words of AP89


Vocabulary growth (i)

  • As the corpus grows, so does vocabulary size

    • Fewer new words when corpus is already large

  • The relationship between corpus size (n) and vocabulary size (v) was defined empirically by Heaps (1978) and is called Heaps’ law:

    v = k × n^β

    • Constants k and β vary

    • Typically 10 ≤ k ≤ 100 and β ≈ 0.5
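
Heaps' law can be sketched in a few lines of Python. The function names below are hypothetical, and the value k = 50 used in the demo is only an illustrative choice inside the typical range given above, not a fitted constant for any real collection. The second function measures actual vocabulary growth over a token stream so that it can be compared with the prediction.

```python
def heaps_vocabulary(n: int, k: float, beta: float = 0.5) -> float:
    """Predicted vocabulary size v = k * n**beta for a corpus of n words."""
    return k * n ** beta


def vocabulary_growth(tokens):
    """Return (words seen so far, distinct words so far) after each token."""
    seen = set()
    growth = []
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        growth.append((position, len(seen)))
    return growth


if __name__ == "__main__":
    # Illustrative constants only: k = 50, beta = 0.5.
    for n in (10_000, 1_000_000, 100_000_000):
        print(n, round(heaps_vocabulary(n, k=50, beta=0.5)))
    print(vocabulary_growth(["a", "b", "a", "c"]))
```

With β = 0.5, multiplying the corpus size by 100 multiplies the predicted vocabulary by only 10, which matches the observation that fewer new words appear once the corpus is already large.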


Vocabulary growth (ii)

note values of k and β


Vocabulary growth (iii)

Web pages crawled from .gov in early 2004


Estimating result set size (i)

  • Word occurrence statistics can be used to estimate result set size of a user query

    • Aside from stop words, how many pages contain all of the query terms?

      • To figure this out, first assume that words occur independently of one another

      • Also assume that the search engine knows N, the number of documents it indexes


Estimating result set size (ii)

  • Given three query terms a, b, and c

    • Probability of a document containing all three is the product of individual probabilities for each query term:

      P(a ∩ b ∩ c) = P(a) × P(b) × P(c)

    • P(a ∩ b ∩ c) is the joint probability of events a, b, and c occurring


Estimating result set size (iii)

  • We assume the search engine knows the number of documents that a word occurs in

    • Call these n_a, n_b, and n_c

      • Note that the book uses f_a, f_b, and f_c

  • Estimate individual query term probabilities:

    • P(a) = n_a / N,  P(b) = n_b / N,  P(c) = n_c / N


Estimating result set size (iv)

  • Given P(a), P(b), and P(c), we estimate the result set size as:

    n_abc = N × (n_a / N) × (n_b / N) × (n_c / N)

    n_abc = (n_a × n_b × n_c) / N²

    • This estimate looks reasonable, but it is unreliable because of the query term independence assumption
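
The independence-based estimate can be written as a small Python function. This is a minimal sketch: the name `estimate_result_size` is hypothetical, and the numbers in the demo (N = 1,000 and document frequencies of 100, 200, and 50) are made-up toy values, not taken from any real collection.

```python
from math import prod


def estimate_result_size(doc_freqs, num_docs):
    """Estimate how many documents contain all terms, assuming independence.

    doc_freqs: number of documents each query term occurs in.
    num_docs:  N, the total number of indexed documents.
    """
    if num_docs <= 0:
        raise ValueError("num_docs must be positive")
    # N times the product of the per-term probabilities n_term / N.
    return num_docs * prod(n / num_docs for n in doc_freqs)


if __name__ == "__main__":
    # Toy values: three terms occurring in 100, 200, and 50 of 1,000 documents.
    print(estimate_result_size([100, 200, 50], 1000))
```

Because the function accepts any number of document frequencies, the same code covers two-term and three-term queries; for three terms it reduces to (n_a × n_b × n_c) / N².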


Estimating result set size (v)

  • Using the GOV2 dataset with N = 25,205,179

    • Poor results, because of the query term independence assumption

    • Could use word co-occurrence data...


Estimating result set size (vi)

  • Extrapolate based on the size of the current result set:

    • The current result set is the subset of documents that have been ranked thus far

    • Let C be the number of documents found thus far containing all the query words

    • Let s be the proportion of the total documents ranked thus far (measured against the documents containing the least frequently occurring query term)

    • Estimate result set size via n_abc = C / s


Estimating result set size (vii)

  • Given example query: tropical fish aquarium

    • Least frequently occurring term is aquarium (which occurs in 26,480 documents)

    • After ranking 3,000 documents, 258 documents contain all three query terms

    • Thus, n_abc = C / s = 258 / (3,000 ÷ 26,480) ≈ 2,277

    • After processing 20% of the documents, the estimate is 1,778

      • This overshoots the actual value of 1,529
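
The extrapolation in the example above can be reproduced with a short Python sketch. The function and parameter names are hypothetical; the input values are the ones given for the query tropical fish aquarium.

```python
def extrapolate_result_size(matches_so_far, docs_processed, rarest_term_doc_freq):
    """Estimate the final result set size as C / s.

    matches_so_far:       C, documents found so far containing all query terms.
    docs_processed:       documents ranked so far.
    rarest_term_doc_freq: documents containing the least frequent query term.
    """
    if docs_processed <= 0 or rarest_term_doc_freq <= 0:
        raise ValueError("document counts must be positive")
    # s is the proportion of the rarest term's documents processed so far.
    s = docs_processed / rarest_term_doc_freq
    return matches_so_far / s


if __name__ == "__main__":
    # 258 matches after 3,000 of the 26,480 documents containing "aquarium".
    estimate = extrapolate_result_size(258, 3_000, 26_480)
    print(round(estimate))  # 2277
```

Calling the function again as more documents are ranked yields a running estimate that can be refined while the query is still being processed.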


What next?

  • Read and study Chapter 4

  • Do Exercises 4.1, 4.2, and 4.3

  • Start thinking about how to write code to implement the stopping and stemming techniques of Chapter 4

