IR Homework #1. By J. H. Wang Mar. 5, 2008. Programming Exercise #1: Indexing. Goal: to build an index for a text collection using inverted files Input : a set of documents concatenated into a single large file (to be described later) Output : inverted index files

IR Homework #1

By J. H. Wang

Mar. 5, 2008


Programming Exercise #1: Indexing

  • Goal: to build an index for a text collection using inverted files

  • Input: a set of documents concatenated into a single large file

    • (to be described later)

  • Output: inverted index files

    • (exact format to be described later)


Input: the Test Collection

  • Test collections held at University of Glasgow: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/

    • LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI, each in different formats

      • Ex: The Time Collection: 423 documents (1.5MB)

    • You have to do some preprocessing for different test collections
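A minimal sketch of the preprocessing step, in Python. It assumes that each document in the concatenated file opens with a header line beginning with `*TEXT`; that marker, and the function name `split_documents`, are assumptions for illustration and should be checked against the collection you actually download. Other collections would need a different pattern.

```python
import re
from typing import Iterator, Tuple

# Assumed delimiter: a line starting with "*TEXT" opens each document.
# Adjust this pattern to the real format of your test collection.
DOC_START = re.compile(r"^\*TEXT\b.*$", re.MULTILINE)


def split_documents(text: str) -> Iterator[Tuple[int, int, str]]:
    """Yield (doc_number, body_offset_in_file, body) for each document.

    doc_number counts from 1; body_offset_in_file is the index in `text`
    where the document body (everything after the header line) begins.
    """
    headers = list(DOC_START.finditer(text))
    for i, header in enumerate(headers):
        body_start = header.end()
        # The body runs up to the next header, or to the end of the file.
        if i + 1 < len(headers):
            body_end = headers[i + 1].start()
        else:
            body_end = len(text)
        yield i + 1, body_start, text[body_start:body_end]
```

Keeping the body offset alongside each document makes it straightforward to report positions relative to the whole file later on. Any trailer line after the last document would need separate handling.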


Output: Inverted Index

  • Two files

    • Vocabulary file: a sorted list of words (each word on a separate line)

    • Occurrences file: for each word, a list of occurrences in the original text

      • [word#] [term freq.] [ (doc#, char#) pairs]

      • 1 7 (1, 12) (1, 28) (3, 31) (8, 39) (8, 65) (10, 16) (11, 91)

      • 2 2 (3, 44) (8, 72)
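As a rough illustration of the two-file layout above, the following Python sketch groups occurrences by word and formats them as `[word#] [term freq.] [(doc#, char#) pairs]`. The function names and file paths are hypothetical, and it assumes the term frequency is the total number of occurrences, which matches the sample lines (7 pairs after `1 7`, 2 pairs after `2 2`).

```python
from collections import defaultdict


def build_index(occurrences):
    """Group (word, doc#, char#) triples into {word: [(doc#, char#), ...]}."""
    index = defaultdict(list)
    for word, doc_no, char_no in occurrences:
        index[word].append((doc_no, char_no))
    return index


def format_index(index):
    """Return (vocabulary, occurrence_lines) in the layout shown above.

    word# is the 1-based line number of the word in the sorted vocabulary.
    """
    vocab = sorted(index)
    lines = []
    for word_no, word in enumerate(vocab, start=1):
        postings = sorted(index[word])
        pairs = " ".join(f"({d}, {c})" for d, c in postings)
        lines.append(f"{word_no} {len(postings)} {pairs}")
    return vocab, lines


def write_index(index, vocab_path, occ_path):
    """Write the vocabulary file and the occurrences file."""
    vocab, lines = format_index(index)
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
    with open(occ_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

For example, `format_index(build_index([("b", 3, 44), ("a", 1, 12), ("a", 1, 28), ("b", 8, 72)]))` gives the vocabulary `["a", "b"]` and the lines `1 2 (1, 12) (1, 28)` and `2 2 (3, 44) (8, 72)`.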


Implementation Issues

  • Note: char# means the character position in the FILE (not the document)

    • This makes later steps after indexing easier to implement

  • Document preprocessing should be handled with care

    • Digits, hyphens, punctuation marks, …
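One possible way to handle these points is sketched below in Python. The token rule (runs of letters or digits, optionally joined by single internal hyphens, lowercased) is only an assumed policy, not one prescribed by the assignment, and `tokenize` and `doc_offset` are hypothetical names. The key idea is that each position is the document's starting offset in the file plus the match position inside the document, so char# refers to the FILE.

```python
import re

# Assumed token rule: letters/digits, optionally joined by internal hyphens.
# Punctuation such as "." and "," acts as a separator.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(doc_text, doc_no, doc_offset):
    """Yield (word, doc#, char#) triples for one document.

    doc_offset is where this document's text starts in the whole file,
    so char# is a position in the file, not in the document.
    """
    for match in TOKEN.finditer(doc_text):
        yield match.group().lower(), doc_no, doc_offset + match.start()
```

With this rule, `U.S.` splits into `u` and `s`, while `state-of-the-art` stays a single token; a different policy may suit a given collection better. Reading the collection once into a single string keeps the offsets consistent across documents.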


Implementation Issues (continued)

  • You may use a separate in-memory data structure (e.g. a trie, which gives efficient lookups) to store the vocabulary and occurrences and speed up indexing, but the output must be in the designated format

  • Optional functionality

    • Stopword removal

    • Stemming

    • It must be possible to turn each of them off with a parameter
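The optional features could be wired to parameters roughly as follows. This is a sketch under stated assumptions: the flag names `--keep-stopwords` and `--no-stem` are made up for illustration, the stopword list is a tiny placeholder, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import argparse

# Tiny illustrative stopword list; a real run would load a fuller one.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def naive_stem(word):
    """Crude suffix stripping; NOT a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(words, remove_stopwords=True, stem=True):
    """Apply the optional steps to a stream of lowercase words."""
    for word in words:
        if remove_stopwords and word in STOPWORDS:
            continue
        yield naive_stem(word) if stem else word


def parse_args(argv=None):
    """Both features are on by default; each flag turns one of them off."""
    parser = argparse.ArgumentParser(description="Build an inverted index.")
    parser.add_argument("--keep-stopwords", dest="remove_stopwords",
                        action="store_false",
                        help="turn off stopword removal")
    parser.add_argument("--no-stem", dest="stem", action="store_false",
                        help="turn off stemming")
    return parser.parse_args(argv)
```

A caller would then pass `args.remove_stopwords` and `args.stem` to `normalize`. Note that stemming changes the word stored in the vocabulary, not the recorded character position.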


Submission

  • Your submission *should* include

    • The source code (and optionally your executable file)

    • A one-page description that includes the following

      • Major features in your work (ex: high efficiency, low storage, able to deal with multiple formats, …)

      • Major difficulties encountered

      • Special requirements for execution environments (ex: Java Runtime Environment)

      • For team work, the name of each member and the parts they were responsible for should be clearly identified

  • Due: in two weeks (Mar. 19, 2008)


Submission Instructions

  • Programs or homework in electronic files must be submitted directly to the TA by e-mail as follows

    • Before submission: pack everything into one single compressed file (including source code and documentation), for example, 9659xxxx-HW1.ZIP

      • Remember to specify your name and student ID in the files and documentation

    • E-mail of TA: [email protected]

  • You will get a confirmation e-mail from the TA after the TA receives your submission

    • If you cannot successfully e-mail your work, please contact the TA or the instructor


Evaluation

  • Minimum requirement: the Time Collection as provided on the Web page will be used as input, and the inverted index generated by your program will be checked for correctness

  • Optional features such as stemming and stopword removal will count as a bonus

  • You may be required to give a demo if the TA is unable to run the submitted program
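Before submitting, a simple self-check along these lines may help catch position errors. The sketch below is only one possible check, not the TA's actual procedure. It assumes that char# is a 0-based index into the file text (the slides do not say whether counting starts at 0 or 1), that stemming is turned off so vocabulary words match the text, and that words were lowercased during indexing. `check_index` is a hypothetical name.

```python
import re

# Matches one "(doc#, char#)" pair in an occurrences line.
PAIR = re.compile(r"\((\d+),\s*(\d+)\)")


def check_index(text, vocab, occ_lines):
    """Return a list of error messages; an empty list means all checks passed.

    text is the whole collection file as one string, vocab is the sorted
    word list, and occ_lines are the lines of the occurrences file.
    """
    errors = []
    for line in occ_lines:
        fields = line.split(maxsplit=2)
        word_no, freq = int(fields[0]), int(fields[1])
        pairs = PAIR.findall(fields[2]) if len(fields) > 2 else []
        word = vocab[word_no - 1]
        if freq != len(pairs):
            errors.append(f"{word}: freq {freq} but {len(pairs)} pairs")
        for doc_no, char_no in pairs:
            pos = int(char_no)
            # The text at the recorded file position should be the word.
            if text[pos:pos + len(word)].lower() != word:
                errors.append(f"{word}: not found at char {pos} (doc {doc_no})")
    return errors
```

If the index uses 1-based positions instead, subtract one from `pos` before slicing.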


Questions?

