
IR Project



Presentation Transcript


  1. IR Project • 黃楹芸 90522017 • 孫怡明 90522026

  2. Reference Collections • The TREC Collection • Built under the TIPSTER program • Documents from all sub-collections are tagged with SGML to allow easy parsing. • FBIS (Foreign Broadcast Information Service) • Size: 470 MB • Number: 130,471 docs • Words/Doc. (median): 322 • Words/Doc. (mean): 543.6

  3. Document Parsing: sample document

  4. Document Parsing • Process each document: • Extract the document ID • Segment the text into tokens • In our case, split the text on whitespace and newlines • Case conversion (make all tokens lowercase) • Discard stopwords and other non-content words (e.g. numbers) • Word stemming • Count term frequencies, record positions • Update indices • Write the index out to file in alphabetical order, from a to z
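The parsing steps on this slide can be sketched in Java roughly as follows. This is a minimal illustration, not the project's actual code: the class name `DocumentParser`, the tiny stopword list, and the rule "keep only purely alphabetic tokens" are assumptions, the document ID is assumed to sit in a TREC-style `<DOCNO>` tag, and stemming is left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class DocumentParser {
    // Tiny stopword list for illustration (assumption; the slides do not give one).
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "in", "is"));

    // Assumes the document ID is wrapped in <DOCNO> ... </DOCNO>.
    private static final Pattern DOCNO =
            Pattern.compile("<DOCNO>\\s*(.*?)\\s*</DOCNO>", Pattern.DOTALL);

    // Returns the document ID, or null if no <DOCNO> tag is found.
    static String extractDocId(String sgml) {
        Matcher m = DOCNO.matcher(sgml);
        return m.find() ? m.group(1) : null;
    }

    // Maps each kept term to its 1-based token positions in the text.
    // The term frequency of a term is the size of its position list.
    static Map<String, List<Integer>> termPositions(String text) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        int position = 0;
        for (String raw : text.trim().split("\\s+")) {   // split on whitespace and newlines
            position++;                                   // every raw token occupies a position
            String token = raw.toLowerCase(Locale.ROOT)   // case conversion
                    .replaceAll("^\\p{Punct}+|\\p{Punct}+$", ""); // trim surrounding punctuation
            if (!token.matches("[a-z]+")) {
                continue;                                 // drops numbers, empty and mixed tokens
            }
            if (STOPWORDS.contains(token)) {
                continue;                                 // discard stopwords
            }
            positions.computeIfAbsent(token, t -> new ArrayList<>()).add(position);
        }
        return positions;
    }
}
```

For the text `The troops mobilize; the Troops move in 1994.` this keeps `troops` at positions 2 and 5, `mobilize` at 3 and `move` at 6; the stopwords and the number are discarded. A stemmer would be applied to each kept token just before it is recorded.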

  5. Project Introduction • Platform • a. CPU: Celeron 450 MHz • b. RAM: 256 MB • c. Operating system: Windows 2000 Server • d. Processing program: Java + JDBC • e. Data storage: SQL Server 2000 • Indexing method used • Inverted indexing
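As a rough picture of what an inverted index holds, the sketch below maps each term to the documents that contain it, with the positions of the term in each document. It is an in-memory illustration only: the project stores its index in SQL Server 2000 through JDBC, and the class name `InvertedIndex`, its methods, and the dump format are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory inverted index; not the project's database schema.
final class InvertedIndex {
    // term -> (docId -> positions). TreeMap keeps terms in alphabetical order,
    // so iterating the map yields the a-to-z order used when writing the index out.
    private final Map<String, Map<String, List<Integer>>> postings = new TreeMap<>();

    // Records one occurrence of a term in a document.
    void add(String term, String docId, int position) {
        postings.computeIfAbsent(term, t -> new TreeMap<>())
                .computeIfAbsent(docId, d -> new ArrayList<>())
                .add(position);
    }

    // Returns docId -> positions for a term, or an empty map if the term is unknown.
    Map<String, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyMap());
    }

    // Serializes the index, one term per line: term<TAB>docId:termFrequency ...
    String dump() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Map<String, List<Integer>>> entry : postings.entrySet()) {
            out.append(entry.getKey());
            for (Map.Entry<String, List<Integer>> posting : entry.getValue().entrySet()) {
                out.append('\t').append(posting.getKey())
                   .append(':').append(posting.getValue().size());
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

A query for a single term is then one `lookup` call; the number of keys in the returned map is the number of matching documents. In a database-backed version the same term, document ID, and position triples would presumably be rows in a table rather than nested maps.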

  6. System Architecture

  7. Implementation • Our User Interface • http://140.115.156.81/IR/ • Indexing Time • 120 to 140 sec per file • Total: ~16 hours • Searching Time • "information": 13,999 records, ~15 sec • "mobilize": 866 records, ~3 sec • Index File Size • 850 MB
