
The College of Saint Rose

### What can text statistics reveal? {week 05a}

CSC 460 / CIS 560 – Search and Information Retrieval

David Goldschmidt, Ph.D.

from Search Engines: Information Retrieval in Practice, 1st edition

by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

Text transformation

How do we best convert documents to their index terms?

How do we make acquired documents searchable?

Find/Replace

- Simplest approach is find, which requires no text transformation
- Useful in user applications, but not in search (why?)
- Optional transformation handled during the find operation: case sensitivity
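As a minimal sketch of the find operation described above, the function below scans the raw text and treats case folding as an option applied at query time. The name `find_all` and its parameters are illustrative, not taken from the course material.

```python
def find_all(text: str, pattern: str, case_sensitive: bool = True) -> list[int]:
    """Return the start offset of every occurrence of pattern in text."""
    if not pattern:
        return []
    if not case_sensitive:
        # Fold case at query time; the stored text is never rewritten.
        # Simplification: lower() can change string length for a few
        # non-English characters, which would shift the reported offsets.
        text, pattern = text.lower(), pattern.lower()
    offsets = []
    start = text.find(pattern)
    while start != -1:
        offsets.append(start)
        start = text.find(pattern, start + 1)
    return offsets


print(find_all("The cat and the hat", "the"))
print(find_all("The cat and the hat", "the", case_sensitive=False))
```

Every call rescans the whole text, which hints at why plain find does not scale to a search engine's collection.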

Text statistics (i)

- English documents are predictable:
- Top two most frequently occurring words are “the” and “of” (10% of word occurrences)
- Top six most frequently occurring words account for 20% of word occurrences
- Top fifty most frequently occurring words account for 50% of word occurrences
- Given all unique words in a (large) document, approximately 50% occur only once
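These proportions can be measured on any text. The sketch below assumes a very simple tokenizer (lowercased runs of letters, digits, and apostrophes); the helper names are illustrative, and the sample string is toy input rather than real data.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: a word is a lowercased run of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def top_k_coverage(counts: Counter, k: int) -> float:
    """Fraction of all word occurrences covered by the k most frequent words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(freq for _, freq in counts.most_common(k)) / total


def singleton_fraction(counts: Counter) -> float:
    """Fraction of unique words that occur exactly once."""
    if not counts:
        return 0.0
    return sum(1 for freq in counts.values() if freq == 1) / len(counts)


sample = "the cat and the dog and the bird"  # toy text, not real data
counts = Counter(tokenize(sample))
print(top_k_coverage(counts, 2))   # 0.625
print(singleton_fraction(counts))  # 0.6
```

On a large English collection, `top_k_coverage(counts, 50)` would be the figure to compare against the 50% claim above.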

Text statistics (ii)

George Kingsley Zipf

(1902-1950)

- Zipf’s law:
- Rank words in order of decreasing frequency
- The rank (r) of a word times its frequency (f) is approximately equal to a constant (k)

$r \times f = k$

- In other words, the frequency of the rth most common word is inversely proportional to r
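One way to inspect Zipf's law on a word-count table is to list rank, frequency, and their product side by side. This is a sketch only: `rank_frequency` is an illustrative name, and the counts are made up so that they follow the law exactly.

```python
from collections import Counter


def rank_frequency(counts: Counter) -> list[tuple[int, str, int, int]]:
    """Return (rank, word, frequency, rank * frequency) by decreasing frequency."""
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(r, word, f, r * f) for r, (word, f) in enumerate(ranked, start=1)]


# Made-up counts chosen to be perfectly Zipfian (k = 60).
toy_counts = Counter({"the": 60, "of": 30, "and": 20, "to": 15})
for rank, word, freq, product in rank_frequency(toy_counts):
    print(rank, word, freq, product)
```

With real text the last column is only roughly constant, so it is more informative to look at how far it drifts across ranks than to expect a single value.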

Text statistics (iii)

- The probability of occurrence ($P_r$) of a word is the word frequency divided by the total number of words in the document
- Revise Zipf’s law as: $r \times P_r = c$

for English, $c \approx 0.1$
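The step from the frequency form of Zipf's law to the probability form can be written out as follows, assuming $T$ (a symbol introduced here, not used on the slides) denotes the total number of word occurrences:

$$
P_r = \frac{f}{T}
\quad\Longrightarrow\quad
r \times P_r = \frac{r \times f}{T} = \frac{k}{T} = c
$$

In this reading, $c$ is simply $k$ rescaled by the size of the text, which is what allows a single value such as $c \approx 0.1$ to be quoted for a language rather than for one collection.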

Text statistics (iv)

- Verify Zipf’s law using the AP89 dataset:
- Collection of Associated Press (AP) news stories from 1989 (available at http://trec.nist.gov):

| Statistic | Value |
|---|---|
| Total documents | 84,678 |
| Total word occurrences | 39,749,179 |
| Vocabulary size | 198,763 |
| Words occurring > 1000 times | 4,169 |
| Words occurring once | 70,064 |
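A few derived quantities follow directly from the AP89 figures listed above. The sketch below computes the share of the vocabulary that occurs only once and the occurrence count Zipf's law would predict at a given rank, assuming $c = 0.1$; the function name `zipf_expected_count` is illustrative.

```python
# AP89 statistics as listed on the slide.
TOTAL_WORD_OCCURRENCES = 39_749_179
VOCABULARY_SIZE = 198_763
WORDS_OCCURRING_ONCE = 70_064


def zipf_expected_count(rank: int, total_occurrences: int, c: float = 0.1) -> float:
    """Occurrences predicted by r * P_r = c for the word at the given rank."""
    if rank < 1:
        raise ValueError("rank starts at 1")
    return c * total_occurrences / rank


once_fraction = WORDS_OCCURRING_ONCE / VOCABULARY_SIZE
print(f"vocabulary occurring once: {once_fraction:.1%}")
for rank in (1, 10, 100):
    predicted = zipf_expected_count(rank, TOTAL_WORD_OCCURRENCES)
    print(f"rank {rank}: about {predicted:,.0f} predicted occurrences")
```

Comparing these predictions with the observed counts of the top-ranked words is one way to carry out the verification the slide asks for.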

Text statistics (v)

- Top 50 words of AP89

Vocabulary growth (i)

- As the corpus grows, so does vocabulary size
- Fewer new words when corpus is already large
- The relationship between corpus size (n) and vocabulary size (v) was defined empirically by Heaps (1978) and is called Heaps' law:

$v = k \times n^{\beta}$

- Constants k and β vary
- Typically 10 ≤ k ≤ 100 and β ≈ 0.5
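To see how slowly the vocabulary grows under Heaps' law, the formula can be evaluated for a few corpus sizes. The values `k = 50` and `beta = 0.5` below are assumptions picked from the typical ranges quoted above, not parameters fitted to any real collection.

```python
def heaps_vocabulary_size(n: int, k: float, beta: float) -> float:
    """Predicted vocabulary size v = k * n**beta for a corpus of n words."""
    if n < 0:
        raise ValueError("corpus size must be non-negative")
    return k * n ** beta


# Assumed parameters within the typical ranges (10 <= k <= 100, beta ~ 0.5).
K, BETA = 50, 0.5
for n in (10_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,}  predicted v = {heaps_vocabulary_size(n, K, BETA):,.0f}")
```

With `beta = 0.5`, a corpus 100 times larger yields a vocabulary only 10 times larger, which matches the observation that fewer new words appear once the corpus is already large.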

Vocabulary growth (ii)

note values of k and β

Vocabulary growth (iii)

Web pages crawled from .gov in early 2004

Estimating result set size (i)

- Word occurrence statistics can be used to estimate result set size of a user query
- Aside from stop words, how many pages contain all of the query terms?
- To figure this out, first assume that words occur independently of one another
- Also assume that the search engine knows N, the number of documents it indexes

Estimating result set size (ii)

- Given three query terms a, b, and c
- Probability of a document containing all three is the product of individual probabilities for each query term:

$P(a \cap b \cap c) = P(a) \times P(b) \times P(c)$

- $P(a \cap b \cap c)$ is the joint probability of events a, b, and c occurring

Estimating result set size (iii)

- We assume the search engine knows the number of documents that a word occurs in
- Call these $n_a$, $n_b$, and $n_c$
- Note that the book uses $f_a$, $f_b$, and $f_c$
- Estimate individual query term probabilities:
- $P(a) = n_a / N$, $P(b) = n_b / N$, $P(c) = n_c / N$

Estimating result set size (iv)

- Given P(a), P(b), and P(c), we estimate the result set size as:

$n_{abc} = N \times (n_a / N) \times (n_b / N) \times (n_c / N)$

$n_{abc} = (n_a \times n_b \times n_c) / N^2$

- This estimate sounds reasonable, but it is unreliable because query terms do not actually occur independently of one another
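The independence-based estimate can be sketched in a few lines. The function name and the document counts below are hypothetical values chosen for illustration; they are not taken from the slides or from any real collection.

```python
from math import prod


def estimate_result_size(doc_freqs: list[int], n_docs: int) -> float:
    """Estimate how many documents contain every term, assuming independence.

    doc_freqs holds the number of documents each query term occurs in
    (n_a, n_b, n_c, ...); n_docs is N, the number of indexed documents.
    """
    if n_docs <= 0:
        raise ValueError("n_docs must be positive")
    # N times the product of the per-term probabilities n_t / N.
    return n_docs * prod(df / n_docs for df in doc_freqs)


# Hypothetical collection of 1,000 documents and three query terms.
print(estimate_result_size([100, 50, 20], n_docs=1_000))
```

Because related terms tend to appear together in the same documents, multiplying independent probabilities can be expected to underestimate the true count for queries whose terms are topically linked.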

Estimating result set size (v)

- Using the GOV2 dataset with N = 25,205,179
- Poor results, because of the query term independence assumption
- Could use word co-occurrence data...

Estimating result set size (vi)

- Extrapolate based on the size of the current result set:
- The current result set is the subset of documents that have been ranked thus far
- Let C be the number of documents found thus far containing all the query words
- Let s be the proportion of documents ranked thus far, measured against the number of documents containing the least frequently occurring query term
- Estimate result set size via $n_{abc} = C / s$

Estimating result set size (vii)

- Given example query: tropical fish aquarium
- Least frequently occurring term is aquarium (which occurs in 26,480 documents)
- After ranking 3,000 documents, 258 documents contain all three query terms
- Thus, $n_{abc} = C / s = 258 / (3{,}000 / 26{,}480) \approx 2{,}277$
- After processing 20% of the documents, the estimate is 1,778, which still overshoots the actual value of 1,529
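The extrapolation in this example can be reproduced with a short function. This is a sketch of the arithmetic only; the function and parameter names are illustrative.

```python
def extrapolate_result_size(matches_so_far: int,
                            docs_ranked: int,
                            rarest_term_doc_freq: int) -> float:
    """Estimate the final result set size as C / s.

    s is the proportion of documents ranked so far, measured against the
    number of documents that contain the least frequent query term.
    """
    if docs_ranked <= 0 or rarest_term_doc_freq <= 0:
        raise ValueError("document counts must be positive")
    s = docs_ranked / rarest_term_doc_freq
    return matches_so_far / s


# Figures from the "tropical fish aquarium" example above.
estimate = extrapolate_result_size(258, 3_000, 26_480)
print(round(estimate))  # 2277
```

Calling the function again as ranking proceeds refines the estimate, in line with the improvement the example reports after 20% of the documents have been processed.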

What next?

- Read and study Chapter 4
- Do Exercises 4.1, 4.2, and 4.3
- Start thinking about how to write code to implement the stopping & stemming techniques of Ch.4
