IR Homework #2. By J. H. Wang, Mar. 25, 2008. Programming Exercise #2: Term Weighting. Goal: to assign TF-IDF weights to each index term in the inverted files. Input: inverted index files (the output of HW#1). Output: term weighting files (exact format to be described later).

IR Homework #2

By J. H. Wang

Mar. 25, 2008


Programming Exercise #2: Term Weighting

  • Goal: to assign TF-IDF weights to each index term in the inverted files

  • Input: inverted index files

    • (the output of HW#1)

  • Output: term weighting files

    • (exact format to be described later)


Input: Inverted Index

  • Two files

    • Vocabulary file: a sorted list of words (each word in a separate line)

    • Occurrences file: for each word, a list of occurrences in the original text

      • [word#] [term freq.] [ (doc#, char#) pairs]

      • 1 7 (1, 12) (1, 28) (3, 31) (8, 39) (8, 65) (10, 16) (11, 91)

      • 2 2 (3, 44) (8, 72)
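
A line of the occurrences file could be read back roughly as follows. This is only a sketch: it assumes the fields are whitespace-separated exactly as in the two sample lines above, and the names `parse_occurrence_line` and `per_document_tf` are hypothetical helpers, not part of the assignment.

```python
import re
from collections import Counter

# Matches one "(doc#, char#)" pair, e.g. "(8, 39)".
PAIR_RE = re.compile(r"\((\d+),\s*(\d+)\)")

def parse_occurrence_line(line):
    """Parse '[word#] [term freq.] [(doc#, char#) pairs]' into a tuple."""
    # Everything before the first "(" holds the two integer header fields.
    head, _, rest = line.partition("(")
    word_id, term_freq = (int(x) for x in head.split())
    # Restore the "(" consumed by partition, then extract every pair.
    pairs = [(int(d), int(c)) for d, c in PAIR_RE.findall("(" + rest)]
    return word_id, term_freq, pairs

def per_document_tf(pairs):
    """Count how many times the word occurs in each document."""
    return Counter(doc for doc, _ in pairs)

word_id, term_freq, pairs = parse_occurrence_line(
    "1 7 (1, 12) (1, 28) (3, 31) (8, 39) (8, 65) (10, 16) (11, 91)"
)
tf = per_document_tf(pairs)
```

For the first sample line, the per-document counts give the term frequency in each document, and the number of distinct documents in `tf` is the document frequency needed for weighting.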


TF-IDF Weighting

  • Term-document matrix (N*M)

    • Each row i contains the TF-IDF term weights w_ij for term t_i in document d_j

      • 0.3 0.7 0.0 0.2 0.9 0.0 0.0 0.1 0.1 0.9 0.0 0.4 0.1 0.0…

    • N: # of terms, M: # of documents

      • Ex: 20k words * 400 docs = 8M entries!  But many of them are 0’s!

    • Sparse matrix → how to store it efficiently?
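
One possible answer is to keep only the nonzero entries of each row in a dictionary keyed by document number. The sketch below assumes that the sample values above form two rows of seven documents each (an interpretation that matches the output-format example in these slides); `to_sparse` is a hypothetical helper name.

```python
def to_sparse(dense_rows):
    """Keep only nonzero entries as {term#: {doc#: weight}}, both 1-based."""
    sparse = {}
    for i, row in enumerate(dense_rows, start=1):
        entries = {j: w for j, w in enumerate(row, start=1) if w != 0.0}
        if entries:  # terms with no nonzero weight are dropped entirely
            sparse[i] = entries
    return sparse

# Assumed layout of the sample values: 2 terms x 7 documents.
dense = [
    [0.3, 0.7, 0.0, 0.2, 0.9, 0.0, 0.0],
    [0.1, 0.1, 0.9, 0.0, 0.4, 0.1, 0.0],
]
sparse = to_sparse(dense)
stored = sum(len(row) for row in sparse.values())
```

Here 9 of the 14 entries are stored. With 20k words and 400 documents the saving is typically much larger, because most terms occur in only a few documents.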


Output Format

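  • w_ij = tf_ij * log(N / df_i), where N here denotes the total number of documents, as in the standard IDF definition (written M in the term-document matrix above), and df_i is the number of documents containing term t_i

    • We only keep entries with nonzero tf_ij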

      • Similar to occurrences file

    • For each word, a list of nonzero entries in the term-document matrix

      • [word#] [doc freq.] [ (doc# (j), w_ij) pairs]

      • 1 4 (1, 0.3) (2, 0.7) (4, 0.2) (5, 0.9)

      • 2 5 (1, 0.1) (2, 0.1) (3, 0.9) (5, 0.4) (6, 0.1)
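
The weighting formula and the line format above can be sketched together as follows. Several things here are assumptions rather than requirements from the slides: the natural logarithm (the base is not specified), three decimal places in the output, a collection size of 11 documents (the highest document number in the input example), and the helper names `tfidf_row` and `format_row`.

```python
import math

def tfidf_row(tf_by_doc, num_docs):
    """Return {doc#: tf * log(num_docs / df)} for one term, sorted by doc#."""
    df = len(tf_by_doc)               # documents that contain the term
    idf = math.log(num_docs / df)     # natural log is an assumption
    return {doc: tf * idf for doc, tf in sorted(tf_by_doc.items())}

def format_row(word_id, weights, digits=3):
    """Render '[word#] [doc freq.] (doc#, w_ij) ...' for one term."""
    pairs = " ".join(f"({doc}, {w:.{digits}f})" for doc, w in weights.items())
    return f"{word_id} {len(weights)} {pairs}"

# Word 2 of the input example occurs once in document 3 and once in document 8.
weights = tfidf_row({3: 1, 8: 1}, num_docs=11)
line = format_row(2, weights)
```

A term that occurs in every document gets log(1) = 0 for all its weights; whether such rows should still be written out is not stated in the slides.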


Implementation Issues

  • You will need both TF (term frequency) and DF (document frequency) factors for each term

  • You can calculate the term frequencies and document frequencies at the same time when you build the index

    • That is, you can combine HW#2 into HW#1 if necessary

  • You may want to remove stopwords to further reduce the number of rows in the matrix
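
As a rough illustration of collecting both factors in a single pass, with optional stopword removal: the whitespace tokenizer, the toy documents, the stopword list, and the name `count_tf_df` are all placeholders, not the tokenization or data used in HW#1.

```python
from collections import Counter, defaultdict

def count_tf_df(docs, stopwords=frozenset()):
    """Return (tf, df): tf[term][doc#] = count, df[term] = number of docs."""
    tf = defaultdict(Counter)
    for doc_id, text in docs.items():
        for token in text.lower().split():   # naive whitespace tokenizer
            if token not in stopwords:
                tf[token][doc_id] += 1
    # Document frequency falls out of the same structure: one entry per doc.
    df = {term: len(per_doc) for term, per_doc in tf.items()}
    return tf, df

# Toy documents, for illustration only.
docs = {1: "the cat sat on the mat", 2: "the dog sat", 3: "a cat and a dog"}
tf, df = count_tf_df(docs, stopwords={"the", "a", "on", "and"})
```

Passing an empty stopword set keeps every token, so the same function covers both settings.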


Optional Features

  • Optional functionalities

    • Other weighting schemes, such as: probabilistic weighting

    • Stopword removal

    • Dimension reduction strategies, such as Latent Semantic Indexing (or SVD)

    • Each optional feature should be switchable on or off through a parameter
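
One way to expose such switches is through command-line flags. The flag names below (`--weighting`, `--remove-stopwords`, `--lsi-dims`) are purely illustrative assumptions; the assignment does not prescribe any particular interface.

```python
import argparse

def build_parser():
    """Command-line switches for the optional features (hypothetical names)."""
    parser = argparse.ArgumentParser(description="Term weighting (sketch)")
    parser.add_argument("--weighting", choices=["tfidf", "probabilistic"],
                        default="tfidf", help="weighting scheme to apply")
    parser.add_argument("--remove-stopwords", action="store_true",
                        help="drop stopwords before weighting (off by default)")
    parser.add_argument("--lsi-dims", type=int, default=0,
                        help="target dimensions for LSI; 0 disables it")
    return parser

# With no arguments, every optional feature is off.
defaults = build_parser().parse_args([])
custom = build_parser().parse_args(["--remove-stopwords", "--lsi-dims", "50"])
```

With defaults that leave every option off, the program produces plain TF-IDF output unless a feature is requested explicitly.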


Submission

  • Your submission should include

    • The source code (and optionally your executable file)

    • A one-page description that includes the following

      • Major features of your work (e.g., high efficiency, low storage, support for multiple formats, …)

      • Major difficulties encountered

      • Special requirements for the execution environment (e.g., Java Runtime Environment)

      • For team work, the name of each member and the parts each was responsible for, clearly identified

  • Due: three weeks (Apr. 16, 2008)


Evaluation

  • The TF-IDF weighting files generated by your program will be checked for correctness

  • Optional features such as probabilistic weighting and latent semantic indexing will count as a bonus

  • You may be required to give a demo if the TA is unable to run the submitted program


Questions?

