IR Homework #2

By J. H. Wang

Mar. 25, 2008

Programming Exercise #2: Term Weighting
  • Goal: to assign a TF-IDF weight to each index term in the inverted files
  • Input: inverted index files
    • (the output of HW#1)
  • Output: term weighting files
    • (exact format to be described later)
Input: Inverted Index
  • Two files
    • Vocabulary file: a sorted list of words (one word per line)
    • Occurrences file: for each word, a list of its occurrences in the original text (a parsing sketch follows this list)
      • [word#] [term freq.] [ (doc#, char#) pairs]
      • 1 7 (1, 12) (1, 28) (3, 31) (8, 39) (8, 65) (10, 16) (11, 91)
      • 2 2 (3, 44) (8, 72)
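
A minimal parsing sketch for this occurrences format, assuming whitespace-separated header fields and parenthesized (doc#, char#) pairs exactly as in the example above; the file layout beyond the example, and all names here, are assumptions:

    import re

    def read_occurrences(path):
        # Parse an occurrences file into {word_id: [(doc_id, char_pos), ...]}.
        # Assumes each line looks like:  1 7 (1, 12) (1, 28) ...
        postings = {}
        pair = re.compile(r"\((\d+),\s*(\d+)\)")
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                head = line.split("(", 1)[0].split()
                word_id, term_freq = int(head[0]), int(head[1])
                occs = [(int(d), int(c)) for d, c in pair.findall(line)]
                assert len(occs) == term_freq  # pairs must match the stated freq
                postings[word_id] = occs
        return postings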
TF-IDF Weighting
  • Term-document matrix (N × M)
    • Row i contains the TF-IDF weights wij of term ti across all documents dj
      • 0.3 0.7 0.0 0.2 0.9 0.0 0.0 …
      • 0.1 0.1 0.9 0.0 0.4 0.1 0.0 …
    • N: # of terms, M: # of documents
      • Ex: 20k words * 400 docs = 8M entries! But many of them are 0’s!
    • Sparse matrix → how can we store it efficiently? (see the sketch below)
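
One common answer to the sparse-storage question (a sketch, not the required design) is to keep only the nonzero weights per term as (doc#, weight) lists, which is exactly the shape the output format on the next slide asks for:

    # Sparse term-document matrix: {term_id: [(doc_id, weight), ...]}
    # Only nonzero entries are stored, so 20k x 400 mostly-zero cells
    # shrink to one list entry per (term, doc) pair that actually occurs.
    sparse = {
        1: [(1, 0.3), (2, 0.7), (4, 0.2), (5, 0.9)],
        2: [(1, 0.1), (2, 0.1), (3, 0.9), (5, 0.4), (6, 0.1)],
    }

    def weight(term_id, doc_id):
        # Look up wij, defaulting to 0.0 for absent (term, doc) pairs.
        return dict(sparse.get(term_id, [])).get(doc_id, 0.0)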
Output Format
  • wij = tfij × log(N / dfi), where N here denotes the total number of documents (M in the matrix above) and dfi is the document frequency of term ti
    • Only entries with nonzero tfij are kept
      • Similar to the occurrences file
    • For each word, a list of its nonzero entries in the term-document matrix (a sketch of computing these lines follows the example)
      • [word#] [doc freq.] [ (doc# (j), wij) pairs]
      • 1 4 (1, 0.3) (2, 0.7) (4, 0.2) (5, 0.9)
      • 2 5 (1, 0.1) (2, 0.1) (3, 0.9) (5, 0.4) (6, 0.1)
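
An illustrative end-to-end sketch under the assumptions above: tfij is the number of occurrences of term i in document j, dfi is the number of distinct documents containing term i, and N is the total document count. The slides do not specify the log base; natural log is used here, and the function and file names are hypothetical:

    import math

    def tfidf_lines(postings, num_docs):
        # postings: {word_id: [(doc_id, char_pos), ...]} as produced by
        # read_occurrences above. Yields lines "[word#] [doc freq.] (doc#, wij) ...".
        for word_id in sorted(postings):
            tf = {}  # doc_id -> tf_ij
            for doc_id, _char in postings[word_id]:
                tf[doc_id] = tf.get(doc_id, 0) + 1
            df = len(tf)  # df_i: number of distinct documents containing the term
            idf = math.log(num_docs / df)
            pairs = " ".join(f"({d}, {tf[d] * idf:.1f})" for d in sorted(tf))
            yield f"{word_id} {df} {pairs}"

    # Usage: write the weighting file from the parsed occurrences file.
    # with open("weights.txt", "w", encoding="utf-8") as out:
    #     for line in tfidf_lines(read_occurrences("occurrences.txt"), num_docs=10):
    #         out.write(line + "\n")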
Implementation Issues
  • You will need both the TF (term frequency) and DF (document frequency) factors for each term
  • Term frequencies and document frequencies can be computed in the same pass that builds the index (see the sketch below)
    • That is, you can fold HW#2 into HW#1 if necessary
  • You may want to remove stopwords to further reduce the number of rows in the matrix
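
A minimal single-pass sketch of this idea; tokenize is a hypothetical helper yielding (token, char_offset) pairs, not something specified by the assignment:

    from collections import defaultdict

    def index_with_counts(docs, tokenize):
        # Build occurrences while accumulating df in the same pass.
        # docs: {doc_id: text}; tokenize(text) yields (token, char_offset) pairs.
        # Returns (occurrences, df): occurrences[token] = [(doc#, char#), ...],
        # df[token] = number of distinct documents containing the token.
        occurrences = defaultdict(list)
        df = defaultdict(int)
        for doc_id, text in docs.items():
            seen = set()
            for token, offset in tokenize(text):
                occurrences[token].append((doc_id, offset))
                if token not in seen:  # count each document once per term
                    seen.add(token)
                    df[token] += 1
        return occurrences, df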
Optional Features
  • Optional functionality
    • Other weighting schemes, such as probabilistic weighting
    • Stopword removal
    • Dimension-reduction strategies, such as Latent Semantic Indexing (via SVD)
    • It should be possible to turn each of them off by a parameter trigger (a command-line sketch follows)
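
One plausible way to wire up such parameter triggers, sketched with Python's argparse; the flag names are invented for illustration:

    import argparse

    parser = argparse.ArgumentParser(description="HW#2 term weighting")
    parser.add_argument("--stopwords", action="store_true",
                        help="enable stopword removal (off by default)")
    parser.add_argument("--weighting", choices=["tfidf", "probabilistic"],
                        default="tfidf", help="term weighting scheme")
    parser.add_argument("--lsi-dims", type=int, default=0,
                        help="LSI/SVD dimensions; 0 disables dimension reduction")
    args = parser.parse_args()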
Submission
  • Your submission should include
    • The source code (and optionally your executable file)
    • A one-page description that includes the following
      • Major features of your work (ex: high efficiency, low storage, support for multiple formats, …)
      • Major difficulties encountered
      • Special requirements for the execution environment (ex: Java Runtime Environment)
      • For team work, the name and the responsible part of each member should be clearly identified
  • Due: three weeks (Apr. 16, 2008)
Evaluation
  • The TF-IDF weighting files generated by your program will be checked for correctness
  • Optional features such as probabilistic weighting and Latent Semantic Indexing will be considered for bonus points
  • You might be required to give a demo if the TA is unable to run your submitted program