Discussion Class 6




### Discussion Class 6

Ranking Algorithms

### Discussion Classes

Format:

Question

Ask a member of the class to answer

Provide opportunity for others to comment

When answering:

Give your name. Make sure that the TA hears it.

Stand up

Speak clearly so that all the class can hear

### Question 1: Inverted Document Frequency (IDF)

In class, I first introduced Salton's original term weighting, known as Inverted Document Frequency:

$w_{ik} = f_{ik} / d_k$

The reading gives Sparck Jones's term weighting, Inverted Document Frequency (IDF):

$\mathrm{IDF}_i = \log_2 (N / n_i) + 1$

or

$\mathrm{IDF}_i = \log_2 (\mathrm{maxn} / n_i) + 1$

What is the relationship between these alternatives?

### Question 1 (continued): Definitions of Terms

$w_{ik}$: weight given to term $k$ in document $i$

$f_{ik}$: frequency with which term $k$ appears in document $i$

$d_k$: number of documents that contain term $k$

$N$: number of documents in the collection

$n_i$: total number of occurrences of term $i$ in the collection

$\mathrm{maxn}$: maximum frequency of any term in the collection
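The three weightings above can be compared numerically. The following is a minimal Python sketch using the symbols as defined on this slide; the function names are hypothetical and chosen only for illustration.

```python
import math


def salton_weight(f_ik: int, d_k: int) -> float:
    """w_ik = f_ik / d_k: frequency of term k in document i,
    divided by the number of documents that contain term k."""
    return f_ik / d_k


def idf_collection_size(N: int, n_i: int) -> float:
    """IDF_i = log2(N / n_i) + 1, with N the number of documents."""
    return math.log2(N / n_i) + 1


def idf_max_frequency(maxn: int, n_i: int) -> float:
    """IDF_i = log2(maxn / n_i) + 1, with maxn the maximum frequency
    of any term in the collection."""
    return math.log2(maxn / n_i) + 1
```

For a fixed collection, the two IDF variants differ only by the constant $\log_2 (N / \mathrm{maxn})$, since $\log_2 (N / n_i) - \log_2 (\mathrm{maxn} / n_i) = \log_2 (N / \mathrm{maxn})$.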

### Question 2: Within-Document Frequency

(a) Why does term weighting using within-document frequency improve ranking?

(b) Why is it necessary to normalize within-document frequency?

(c) Explain Croft's normalization:

$\mathrm{cfreq}_{ij} = K + (1 - K)\, \mathrm{freq}_{ij} / \mathrm{maxfreq}_j$

(d) How does Salton and Buckley's recommended term weighting fit with Croft's normalization?
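As a starting point for parts (c) and (d), here is a minimal Python sketch of the normalization in (c). The function name and the default value of `K` are assumptions made for illustration, and the sketch assumes the term actually occurs in the document.

```python
def croft_normalized_freq(freq_ij: int, maxfreq_j: int, K: float = 0.5) -> float:
    """cfreq_ij = K + (1 - K) * freq_ij / maxfreq_j.

    freq_ij   : frequency of term i in document j
    maxfreq_j : maximum frequency of any term in document j
    K         : constant between 0 and 1 (0.5 is only an illustrative default)
    """
    if maxfreq_j <= 0:
        raise ValueError("maxfreq_j must be positive")
    return K + (1 - K) * freq_ij / maxfreq_j
```

For a term that occurs in the document, the result lies between `K` and 1, so `K` sets how much within-document frequency can influence the weight. With `K = 0.5` the expression reduces to `0.5 + 0.5 * freq / maxfreq`.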

### Question 3: Salton/Buckley Recommendation

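$$\mathrm{similarity}(Q, D) = \frac{\sum_{i=1}^{t} (w_{iq} \times w_{ij})}{\sqrt{\sum_{i=1}^{t} w_{iq}^2 \times \sum_{i=1}^{t} w_{ij}^2}}$$

where

$$w_{iq} = \left(0.5 + \frac{0.5\, \mathrm{freq}_{iq}}{\mathrm{maxfreq}_q}\right) \times \mathrm{IDF}_i$$

and $w_{ij} = \mathrm{freq}_{ij} \times \mathrm{IDF}_i$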

$\mathrm{freq}_{iq}$ = frequency of term $i$ in query $q$

$\mathrm{maxfreq}_q$ = maximum frequency of any term in query $q$

$\mathrm{IDF}_i$ = IDF of term $i$ in entire collection

$\mathrm{freq}_{ij}$ = frequency of term $i$ in document $j$
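The similarity measure above is a cosine between the query and document weight vectors. The following Python sketch computes it from term-frequency dictionaries. The function names and the dictionary-based representation are assumptions for illustration, and the sketch assumes that only terms actually present in the query or document receive a weight.

```python
import math


def query_weight(freq_iq: int, maxfreq_q: int, idf_i: float) -> float:
    """w_iq = (0.5 + 0.5 * freq_iq / maxfreq_q) * IDF_i."""
    return (0.5 + 0.5 * freq_iq / maxfreq_q) * idf_i


def doc_weight(freq_ij: int, idf_i: float) -> float:
    """w_ij = freq_ij * IDF_i."""
    return freq_ij * idf_i


def similarity(query_freqs: dict, doc_freqs: dict, idf: dict) -> float:
    """Cosine similarity between query and document weight vectors.

    query_freqs and doc_freqs map each term to its frequency;
    idf maps each term to its collection-wide IDF value.
    """
    if not query_freqs or not doc_freqs:
        return 0.0
    maxfreq_q = max(query_freqs.values())
    w_q = {t: query_weight(f, maxfreq_q, idf[t]) for t, f in query_freqs.items()}
    w_d = {t: doc_weight(f, idf[t]) for t, f in doc_freqs.items()}
    # Only terms shared by query and document contribute to the numerator.
    numerator = sum(w_q[t] * w_d[t] for t in w_q if t in w_d)
    denominator = math.sqrt(
        sum(w * w for w in w_q.values()) * sum(w * w for w in w_d.values())
    )
    return numerator / denominator if denominator else 0.0
```

For example, `similarity({"a": 1}, {"a": 2}, {"a": 1.0})` compares a one-term query with a document containing only that term, so the two vectors point in the same direction.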

### Question 4: Zipf's Law

"... significant performance improvement using ... the inverted document frequency ... that is based on Zipf's distribution ..."

What has Zipf's law to do with IDF?
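One way to approach the question is to substitute an idealized Zipf distribution into the IDF formula. The sketch below assumes that the term of rank $r$ has collection frequency $n_r \approx C / r$ for some constant $C$; both $C$ and the rank index $r$ are introduced here for illustration only.

$$n_r \approx \frac{C}{r} \quad\Longrightarrow\quad \mathrm{IDF}_r = \log_2 \frac{N}{n_r} + 1 \approx \log_2 r + \log_2 \frac{N}{C} + 1$$

Under this assumption, IDF grows with the logarithm of a term's rank: the few very frequent terms receive low weights, and the long tail of rare terms receives high weights.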

### Question 4: Probabilistic Models

The section on probabilistic models is rather unsatisfactory because it relies on a mathematical foundation that has been left out.

Can you summarize the basic ideas?

### Question 5: TF.IDF compared with Google PageRank

(a) TF.IDF and PageRank are based on fundamentally different considerations. What are the fundamental differences?

(b) Under which circumstances would you expect each to excel?
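To make the contrast in (a) concrete: TF.IDF is computed from term statistics and depends on the query, while PageRank is computed from the link graph alone. The following is a minimal power-iteration sketch of PageRank in Python, not Google's implementation. The function name, the damping value of 0.85, the fixed iteration count, and the even redistribution of rank from pages with no outlinks are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Every link target is assumed to also appear as a key in links.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same base share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Page with no outlinks: spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

Note that the function takes no query and no page text: a page's score depends only on which pages link to it.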