
IR Homework #2


Presentation Transcript


  1. IR Homework #2
  By J. H. Wang, Mar. 25, 2008

  2. Programming Exercise #2: Term Weighting
  • Goal: assign a TF-IDF weight to each index term in the inverted files
  • Input: inverted index files (the output of HW#1)
  • Output: term weighting files (exact format described on slide 5)

  3. Input: Inverted Index
  • Two files:
  • Vocabulary file: a sorted list of words (one word per line)
  • Occurrences file: for each word, a list of its occurrences in the original text (a parsing sketch follows this slide)
    [word#] [term freq.] [(doc#, char#) pairs]
    1 7 (1, 12) (1, 28) (3, 31) (8, 39) (8, 65) (10, 16) (11, 91)
    2 2 (3, 44) (8, 72)
    …
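For concreteness, here is a minimal Python sketch of how an occurrences file in the format above could be parsed; the function name, the file-handling details, and the whitespace tolerance are assumptions, not part of the assignment spec.

    import re

    def load_occurrences(path):
        # Each line: "[word#] [term freq.] [(doc#, char#) pairs]"
        postings = {}  # word# -> list of (doc#, char#) tuples
        with open(path) as f:
            for line in f:
                fields = line.split()
                if not fields:
                    continue  # skip blank lines
                word_id, term_freq = int(fields[0]), int(fields[1])
                # Extract every "(doc#, char#)" pair on the line.
                pairs = re.findall(r"\((\d+),\s*(\d+)\)", line)
                postings[word_id] = [(int(d), int(c)) for d, c in pairs]
                assert len(postings[word_id]) == term_freq  # sanity check
        return postings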

  4. TF-IDF Weighting
  • Term-document matrix (N×M): each row i contains the TF-IDF weights wij for term ti across documents dj
  • N: # of terms, M: # of documents
    Row 1: 0.3 0.7 0.0 0.2 0.9 0.0 0.0
    Row 2: 0.1 0.1 0.9 0.0 0.4 0.1 0.0
    …
  • Ex: 20k words × 400 docs = 8M entries! But many of them are 0's!
  • Sparse matrix → how to store it efficiently? (see the sketch below)
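One common answer to the storage question, sketched here under the assumption that the postings table from the previous sketch is available: store each matrix row as a small dictionary holding only its nonzero columns, which mirrors the occurrences-file layout exactly.

    from collections import defaultdict

    def build_sparse_tf(postings):
        # postings: word# -> list of (doc#, char#) occurrences.
        # Returns word# -> {doc#: tf}, storing only nonzero entries
        # instead of the full N*M dense matrix.
        tf = {}
        for word_id, occurrences in postings.items():
            row = defaultdict(int)
            for doc_id, _char_pos in occurrences:
                row[doc_id] += 1  # each occurrence adds 1 to tf
            tf[word_id] = dict(row)
        return tf

For the 20k-word, 400-document example, this keeps only the nonzero fraction of the 8M cells in memory.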

  5. Output Format
  • wij = tfij × log(M/dfi), where M is the total number of documents and dfi is the document frequency of term ti (a weighting sketch follows this slide)
  • We only keep entries with nonzero tfij
  • Similar to the occurrences file: for each word, a list of the nonzero entries in its row of the term-document matrix
    [word#] [doc freq.] [(doc# (j), wij) pairs]
    1 4 (1, 0.3) (2, 0.7) (4, 0.2) (5, 0.9)
    2 5 (1, 0.1) (2, 0.1) (3, 0.9) (5, 0.4) (6, 0.1)
    …
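A sketch of the weighting step itself: it computes wij = tfij × log(M/dfi) from the sparse rows above and writes lines in this slide's output format. The base-10 logarithm and the three-decimal rounding are assumptions, since the slides fix neither.

    import math

    def write_weights(tf, num_docs, out_path):
        # tf: word# -> {doc#: tfij}; num_docs is M.
        with open(out_path, "w") as out:
            for word_id in sorted(tf):
                row = tf[word_id]
                df = len(row)  # # of documents containing the term
                idf = math.log10(num_docs / df)
                pairs = " ".join(f"({doc}, {freq * idf:.3f})"
                                 for doc, freq in sorted(row.items()))
                out.write(f"{word_id} {df} {pairs}\n")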

  6. Implementation Issues
  • You will need both the TF (term frequency) and DF (document frequency) factors for each term
  • You can calculate term frequencies and document frequencies in the same pass in which you build the index; that is, you may fold HW#2 into HW#1 if necessary (a one-pass sketch follows this slide)
  • You may want to remove stopwords to further reduce the number of rows in the matrix
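The one-pass idea can be illustrated like this; the whitespace tokenizer is a placeholder assumption, since real tokenization was HW#1's job.

    from collections import defaultdict

    def index_one_pass(docs):
        # docs: iterable of (doc#, text) pairs.
        tf = defaultdict(lambda: defaultdict(int))  # term -> doc# -> count
        df = defaultdict(int)                       # term -> # of docs with term
        for doc_id, text in docs:
            for term in text.lower().split():  # placeholder tokenizer
                if tf[term][doc_id] == 0:
                    df[term] += 1  # first time this term is seen in this doc
                tf[term][doc_id] += 1
        return tf, df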

  7. Optional Features
  • Optional functionalities:
  • Other weighting schemes, such as probabilistic weighting
  • Stopword removal
  • Dimension-reduction strategies, such as Latent Semantic Indexing (via SVD)
  • Each optional feature should be switchable on and off by a command-line parameter (a flag sketch follows this slide)
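One way to satisfy the parameter-trigger requirement is a set of command-line flags, sketched here with Python's argparse; the flag names are illustrative, not prescribed by the assignment.

    import argparse

    parser = argparse.ArgumentParser(description="HW#2 term weighting")
    parser.add_argument("--weighting", choices=["tfidf", "prob"], default="tfidf",
                        help="weighting scheme; 'prob' is the optional one")
    parser.add_argument("--stopwords", action="store_true",
                        help="enable stopword removal")
    parser.add_argument("--lsi", type=int, metavar="K", default=0,
                        help="reduce to K latent dimensions via SVD (0 = off)")
    args = parser.parse_args()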

  8. Submission
  • Your submission should include:
  • The source code (and optionally your executable file)
  • A one-page description that covers:
    • Major features of your work (e.g., high efficiency, low storage, support for multiple input formats, …)
    • Major difficulties encountered
    • Special requirements for the execution environment (e.g., Java Runtime Environment)
  • For team work, the names and responsible parts of each member must be clearly identified
  • Due: in three weeks (Apr. 16, 2008)

  9. Evaluation
  • The TF-IDF weighting files generated by your program will be checked for correctness
  • Optional features such as probabilistic weighting and Latent Semantic Indexing will earn bonus points
  • You may be asked to give a demo if the TA is unable to run your submitted program

  10. Questions?
