# IR Homework #2






### IR Homework #2

By J. H. Wang

Mar. 25, 2008

Programming Exercise #2: Term Weighting
• Goal: assign TF-IDF weights to each index term in the inverted files
• Input: inverted index files (the output of HW#1)
• Output: term weighting files (exact format described later)
Input: Inverted Index
• Two files
• Vocabulary file: a sorted list of words (each word on a separate line)
• Occurrences file: for each word, a list of occurrences in the original text
• [word#] [term freq.] [ (doc#, char#) pairs]
• 1 7 (1, 12) (1, 28) (3, 31) (8, 39) (8, 65) (10, 16) (11, 91)
• 2 2 (3, 44) (8, 72)
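A line of the occurrences file above can be parsed by stripping the punctuation and reading the remaining integers. A minimal Python sketch (the function names are illustrative, not prescribed by the assignment):

```python
from collections import defaultdict

def parse_occurrences_line(line):
    """Parse one line: [word#] [term freq.] [(doc#, char#) pairs]."""
    # Strip parentheses and commas so the line splits into plain integers.
    tokens = line.replace("(", " ").replace(")", " ").replace(",", " ").split()
    word_id, term_freq = int(tokens[0]), int(tokens[1])
    pairs = [(int(tokens[i]), int(tokens[i + 1])) for i in range(2, len(tokens), 2)]
    return word_id, term_freq, pairs

def term_doc_frequencies(pairs):
    """Count the term's occurrences per document (its tf in each doc)."""
    tf = defaultdict(int)
    for doc_id, _char_pos in pairs:
        tf[doc_id] += 1
    return dict(tf)
```

For the first example line, the per-document counts come out as two occurrences in doc 1, one in doc 3, two in doc 8, and one each in docs 10 and 11 — exactly the grouping needed later for TF-IDF.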
TF-IDF Weighting
• Term-document matrix (N×M)
• Each row i contains the TF-IDF weights wij of term ti; each column j corresponds to document dj
• Example row fragments: 0.3 0.7 0.0 0.2 0.9 0.0 0.0 / 0.1 0.1 0.9 0.0 0.4 0.1 0.0 …
• N: # of terms, M: # of documents
• Ex: 20k words × 400 docs = 8M entries, but many of them are 0's!
• Sparse matrix → how to store it in an efficient way?
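One simple answer to the storage question (a sketch, not a required design) is to keep only the nonzero entries in a dictionary keyed by term id, mapping each document id to its weight:

```python
class SparseTermDocMatrix:
    """Store only the nonzero w_ij entries as {term_id: {doc_id: weight}}."""

    def __init__(self):
        self.rows = {}

    def set(self, term_id, doc_id, weight):
        # Zero entries are simply never stored.
        if weight != 0.0:
            self.rows.setdefault(term_id, {})[doc_id] = weight

    def get(self, term_id, doc_id):
        # Missing entries read back as 0.0, matching the dense-matrix view.
        return self.rows.get(term_id, {}).get(doc_id, 0.0)

    def nnz(self):
        """Number of stored (nonzero) entries."""
        return sum(len(row) for row in self.rows.values())
```

With 8M potential entries but mostly zeros, the memory used is proportional to the number of nonzero weights rather than N×M. More compact schemes (e.g. CSR arrays) exist, but this dictionary-of-dictionaries form maps directly onto the output file format below.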
Output Format
• wij = tfij * log (M/dfi), where M is the total # of documents and dfi is the document frequency of term ti
• We only keep entries with nonzero tfij
• Similar to occurrences file
• For each word, a list of nonzero entries in the term-document matrix
• [word#] [doc freq.] [ (doc# (j), wij) pairs]
• 1 4 (1, 0.3) (2, 0.7) (4, 0.2) (5, 0.9)
• 2 5 (1, 0.1) (2, 0.1) (3, 0.9) (5, 0.4) (6, 0.1)
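Putting the formula and the output format together, a hypothetical helper could emit one output line per word. This sketch assumes log base 10 (the slides do not fix the base) and three decimal places for the weights (the exact precision is not specified either):

```python
import math

def tfidf_output_line(word_id, doc_tfs, num_docs):
    """doc_tfs: {doc_id: tf_ij}. Returns '[word#] [doc freq.] (doc#, w_ij) ...'."""
    df = len(doc_tfs)  # document frequency = number of docs containing the term
    idf = math.log10(num_docs / df)
    pairs = " ".join(
        f"({doc_id}, {tf * idf:.3f})" for doc_id, tf in sorted(doc_tfs.items())
    )
    return f"{word_id} {df} {pairs}"
```

For example, a term with tf = 2 in doc 1 and tf = 1 in doc 3, in a 10-document collection, gets idf = log10(10/2) ≈ 0.699, so its line reads `1 2 (1, 1.398) (3, 0.699)`.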
Implementation Issues
• You will need both TF (term frequency) and DF (document frequency) factors for each term
• You can calculate the term frequencies and document frequencies at the same time as you build the index
• That is, you may combine HW#2 into HW#1 if necessary
• You may want to remove stopwords to further reduce the number of rows in the matrix
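The single-pass idea above can be sketched as follows; the tokenized-document input and the stopword list are illustrative assumptions, not part of the assignment:

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "and"}  # illustrative, not an official list

def build_tf_df(docs, remove_stopwords=True):
    """docs: {doc_id: list of tokens}. Returns (tf, df) in a single pass."""
    tf = defaultdict(lambda: defaultdict(int))  # tf[term][doc_id]
    df = defaultdict(int)                       # df[term]
    for doc_id, tokens in docs.items():
        for tok in tokens:
            if remove_stopwords and tok in STOPWORDS:
                continue
            if tf[tok][doc_id] == 0:
                df[tok] += 1  # first occurrence of this term in this doc
            tf[tok][doc_id] += 1
    return tf, df
```

Incrementing df only on a term's first occurrence in a document is what lets both factors fall out of the same loop that builds the index.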
Optional Features
• Optional functionalities
• Other weighting schemes, such as probabilistic weighting
• Stopword removal
• Dimension reduction strategies, such as Latent Semantic Indexing (or SVD)
• Each optional feature should be switchable on/off via a parameter
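One conventional way to provide such parameter switches is command-line flags; a sketch using Python's argparse (the flag names are hypothetical, not mandated by the assignment):

```python
import argparse

def make_parser():
    """Build a CLI parser with on/off switches for the optional features."""
    p = argparse.ArgumentParser(description="HW#2 term weighting")
    p.add_argument("--stopwords", action="store_true",
                   help="enable stopword removal")
    p.add_argument("--weighting", choices=["tfidf", "prob"], default="tfidf",
                   help="weighting scheme (TF-IDF or probabilistic)")
    p.add_argument("--lsi", action="store_true",
                   help="apply Latent Semantic Indexing (SVD)")
    return p
```

Running the program with no flags then yields plain TF-IDF with every optional feature off, which matches the required baseline behavior.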
Submission