
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'IR Homework #1' - duc

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
ir homework 1

IR Homework #1

By J. H. Wang

Mar. 5, 2008

Programming Exercise #1: Indexing
  • Goal: to build an index for a text collection using inverted files
  • Input: a set of documents concatenated into a single large file
    • (to be described later)
  • Output: inverted index files
    • (exact format to be described later)
Input: the Test Collection
  • Test collections held at University of Glasgow:
    • LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI, each in different formats
      • Ex: The Time Collection: 423 documents (1.5MB)
    • You have to do some preprocessing for different test collections
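
Because the input is one large file, a first preprocessing step is to split it back into documents while remembering where each one starts in the file. The following is a minimal sketch, assuming each document begins with a header line that starts with a marker such as `*TEXT` and that the file ends with a `*STOP` line. Both markers and the function name `split_documents` are assumptions for illustration; check them against the actual collection file.

```python
def split_documents(text, marker="*TEXT", stop="*STOP"):
    """Split a concatenated collection into documents.

    Returns a list of (doc_number, body_start, body) tuples, where
    body_start is the character offset of the body within the whole file.
    The marker and stop strings are assumed, not taken from the assignment.
    """
    spans = []        # [body_start, body_end] for each finished document
    current = None    # span of the document currently being read
    pos = 0           # offset of the current line within the file
    for line in text.splitlines(keepends=True):
        if line.startswith(stop):
            break
        if line.startswith(marker):
            if current is not None:
                spans.append(current)
            body_start = pos + len(line)   # body begins after the header line
            current = [body_start, body_start]
        elif current is not None:
            current[1] = pos + len(line)   # extend body to the end of this line
        pos += len(line)
    if current is not None:
        spans.append(current)
    return [(n, start, text[start:end])
            for n, (start, end) in enumerate(spans, start=1)]
```

Documents are numbered from 1 in order of appearance, which matches the doc# values in the sample output but is still a choice made for this sketch.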
Output: Inverted Index
  • Two files
    • Vocabulary file: a sorted list of words (each word on a separate line)
    • Occurrences file: for each word, a list of occurrences in the original text
      • [word#] [term freq.] [ (doc#, char#) pairs]
      • 1 7 (1, 12) (1, 28) (3, 31) (8, 39) (8, 65) (10, 16) (11, 91)
      • 2 2 (3, 44) (8, 72)
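
The two output files can be produced from a flat list of word occurrences. The sketch below assumes that word# is the 1-based line number of the word in the vocabulary file and that the term frequency is the total number of occurrences, which is consistent with the sample lines above but is not stated explicitly. The function names `build_index` and `write_index` are hypothetical.

```python
from collections import defaultdict


def build_index(occurrences):
    """Build the vocabulary and occurrences lines.

    occurrences: iterable of (word, doc_no, char_no) tuples.
    Returns (vocab_lines, occ_lines), both lists of strings.
    """
    postings = defaultdict(list)
    for word, doc_no, char_no in occurrences:
        postings[word].append((doc_no, char_no))

    vocab_lines = sorted(postings)            # sorted list of distinct words
    occ_lines = []
    for word_no, word in enumerate(vocab_lines, start=1):
        pairs = sorted(postings[word])        # order by doc#, then char#
        pair_text = " ".join(f"({d}, {c})" for d, c in pairs)
        occ_lines.append(f"{word_no} {len(pairs)} {pair_text}")
    return vocab_lines, occ_lines


def write_index(vocab_lines, occ_lines, vocab_path, occ_path):
    """Write one word per line and one occurrence record per line."""
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab_lines) + "\n")
    with open(occ_path, "w", encoding="utf-8") as f:
        f.write("\n".join(occ_lines) + "\n")
```

For example, the occurrences `("b", 3, 44)`, `("a", 1, 12)`, `("b", 8, 72)` give the vocabulary `a`, `b` and the occurrence lines `1 1 (1, 12)` and `2 2 (3, 44) (8, 72)`.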
Implementation Issues
  • Note: char# means the character position in the FILE (not within the document)
    • This makes the later steps after indexing easier to implement
  • Document preprocessing should be handled with care
    • Digits, hyphens, punctuation marks, …
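
One possible tokenization policy that records file-level character positions is sketched below. It assumes that a word is a maximal run of ASCII letters or digits, that hyphens and punctuation act as separators, that words are lowercased, and that positions are counted from 0; none of these choices is prescribed by the assignment. The name `tokenize` and its parameters are hypothetical.

```python
import re

# A word is a run of letters or digits; everything else separates words.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(body, doc_no, file_offset):
    """Yield (word, doc_no, char_no) for each word in one document.

    file_offset is the position of the document body within the whole
    file, so char_no refers to the file and not to the document.
    """
    for match in TOKEN_RE.finditer(body):
        yield (match.group().lower(), doc_no, file_offset + match.start())
```

Under this policy "Well-known" becomes the two words "well" and "known", and "U.S." becomes "u" and "s". A different policy (for example, keeping hyphenated words whole) only requires changing the regular expression.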
Implementation Issues
  • You may use a separate in-memory data structure (e.g., a trie, which looks up a word in time proportional to its length) to hold the vocabulary and occurrences and speed up indexing, but the output must be in the designated format
  • Optional functionality
    • Stopword removal
    • Stemming
    • It must be possible to turn each of them off with a parameter
  • Your submission *should* include
    • The source code (and optionally your executable file)
    • A one-page description that includes the following
      • Major features in your work (ex: high efficiency, low storage, able to deal with multiple formats, …)
      • Major difficulties encountered
      • Special requirements for execution environments (ex: Java Runtime Environment)
      • For team work, the name of each member and the parts he or she was responsible for should be clearly identified
  • Due: in two weeks (Mar. 19, 2008)
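
The optional stopword removal and stemming described above can be switched on and off with parameters. The sketch below assumes command-line flags named `--remove-stopwords` and `--stem`, both off by default; the flag names, the tiny stopword list, and the suffix stripper are illustrative assumptions. In particular, `crude_stem` is not a real stemming algorithm such as Porter's and only stands in for one.

```python
import argparse

# Tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def crude_stem(word):
    """Strip a few common suffixes (a placeholder, not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def filter_terms(words, remove_stopwords=False, stem=False):
    """Apply the optional steps to a list of lowercased words."""
    result = []
    for word in words:
        if remove_stopwords and word in STOPWORDS:
            continue
        result.append(crude_stem(word) if stem else word)
    return result


def parse_args(argv=None):
    """Parse the (hypothetical) command-line switches."""
    parser = argparse.ArgumentParser(description="Inverted-file indexer")
    parser.add_argument("--remove-stopwords", action="store_true",
                        help="drop words found in the stopword list")
    parser.add_argument("--stem", action="store_true",
                        help="reduce words to a crude stem")
    return parser.parse_args(argv)
```

If stopwords are removed, the character positions of the remaining words should still be computed before filtering, so that char# continues to refer to positions in the original file.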
Submission Instructions
  • Programs or homework in electronic files must be submitted directly to the TA by e-mail as follows
    • Before submission, pack everything (source code and documentation) into one compressed file, for example, 9659xxxx-HW1.ZIP
      • Remember to specify your name and student ID in the files and documentation
    • E-mail of TA: [email protected]
  • You will get a confirmation e-mail from the TA after receiving your submission
    • If you cannot successfully e-mail your work, please contact the TA or the instructor
  • Minimum requirement: the Time Collection as provided on the Web page will be used as input, and the inverted index generated by your program will be checked for correctness
  • Optional features such as stemming and stopword removal will count as a bonus
  • You may be required to give a demo if the TA is unable to run your submitted program