
Information Retrieval and Extraction 2010 Term Project – Modern Web Search Advisor: 陳信希 TA: 許名宏 & 王界人



  1. Information Retrieval and Extraction
     2010 Term Project – Modern Web Search
     Advisor: 陳信希
     TA: 許名宏 & 王界人

  2. Overview (in English)
     • Goal
       • Use advanced approaches to enhance Okapi-BM25
     • Group
       • 1–3 person(s) per group; email the name list to the TA
     • Approach
       • No limitations; any resource on the Web may be used
     • Date of system demo and report submission
       • 6/24 Thursday (provisional)
     • Grading criteria
       • Originality and reasonableness of your approach
       • Effort for implementation, per person
       • Retrieval performance (training & testing)
       • Completeness of the report (division of work, result analysis)

  3. Overview (originally in Chinese)
     • Project goal
       • Use advanced IR techniques to improve the performance of Okapi-BM25
     • Groups
       • 1–3 people per group; the group leader should e-mail the member list (student IDs and names) to the TA
     • Approach
       • No limitations; any toolkit or resource on the Web may be used
     • Demo and report submission
       • 6/25 Friday
     • Grading criteria
       • Originality and reasonableness of the approach
       • Effort of implementation, per person
       • Retrieval performance (training and testing)
       • Completeness of the report, division of work, and analysis of retrieval results

  4. Content of Report
     • Detailed description of your approach
     • Parameter settings (if parametric)
     • System performance on the training topics
       • The baseline (Okapi-BM25) performance
       • The performance of your approach
     • Division of the work
     • What you have learned
     • Others (optional)

  5. Baseline Implementation: Okapi-BM25
     • Parametric probabilistic model
     • Parameter setting
       • k1 = 1.2, k2 = 0, k3 = 0, b = 0.75, R = r = 0 (initial guess)
     • Stemming: Porter's stemmer
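With the slide's settings (k2 = 0, k3 = 0, R = r = 0), the query-side factor of Okapi BM25 reduces to 1 and the term weight reduces to log((N − n + 0.5) / (n + 0.5)). A minimal sketch of that baseline scorer follows; all function and parameter names here are illustrative, not part of the project materials:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs,
               k1=1.2, b=0.75):
    """Okapi BM25 with k2 = 0, k3 = 0, R = r = 0: the query factor
    is 1, and the relevance weight is log((N - n + 0.5) / (n + 0.5))."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)      # term frequency in the document
        n = df.get(term, 0)           # document frequency of the term
        if tf == 0 or n == 0:
            continue
        idf = math.log((n_docs - n + 0.5) / (n + 0.5))
        # document-length normalization
        K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
        score += idf * (k1 + 1) * tf / (K + tf)
    return score
```

Documents are then ranked per topic by this score in descending order.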

  6. Possible Approaches
     • Pseudo relevance feedback (PRF)
       • Supported by the Lemur API
       • Simple and effective, but no originality
     • Query expansion
       • Using external resources, e.g. WordNet, Wikipedia, query logs (AOL), etc.
     • Word sense disambiguation in documents/queries
     • Combining results from two or more IR systems
     • Latent semantic indexing (LSI)
     • Others
       • Learning to rank, clustering/classification, …
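The simplest of these, pseudo relevance feedback, assumes the top-ranked documents from an initial retrieval are relevant and adds their most frequent unseen terms to the query. A toy sketch (function names and the term-selection heuristic are ours; Lemur's built-in PRF is more elaborate):

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, doc_terms, top_docs=5, add_terms=3):
    """Pseudo relevance feedback: count terms in the top-ranked
    documents and append the most frequent terms not already in
    the query, then re-run retrieval with the expanded query."""
    counts = Counter()
    for docno in ranked_docs[:top_docs]:
        counts.update(doc_terms[docno])
    new_terms = [t for t, _ in counts.most_common()
                 if t not in query_terms][:add_terms]
    return list(query_terms) + new_terms
```

In practice term selection is usually weighted (e.g. by tf-idf in the feedback set) rather than by raw frequency.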

  7. Experimental Dataset
     • A partial collection of TREC WT10g
       • ~10k documents
       • Link information is provided
     • 30 topics for system development (training)
     • Another 20 topics for the demo (testing)

  8. Topic Example
     <top>
     <num> Number: 476
     <title> Jennifer Aniston
     <desc> Description:
     Find documents that identify movies and/or television programs that Jennifer Aniston has appeared in.
     <narr> Narrative:
     Relevant documents include movies and/or television programs that Jennifer Aniston has appeared in.
     </top>
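The topic tags above are SGML-style and unclosed, so each field runs until the next tag. A small regex-based parser sketch (illustrative, assuming the layout shown above):

```python
import re

def parse_topics(text):
    """Parse TREC-style topics: fields are introduced by <num>,
    <title>, <desc>, <narr> and end at the next tag."""
    topics = []
    for block in re.findall(r"<top>(.*?)</top>", text, re.S):
        # strip the "Number:" / "Description:" / "Narrative:" labels
        fields = dict(re.findall(
            r"<(num|title|desc|narr)>\s*"
            r"(?:Number:|Description:|Narrative:)?\s*(.*?)\s*(?=<|$)",
            block, re.S))
        topics.append(fields)
    return topics
```

Per the project summary, only the <title> terms are used as the baseline query.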

  9. Document Example
     <DOC>
     <DOCNO>WTX010-B01-2</DOCNO>
     <DOCOLDNO>IA011-000115-B026-169</DOCOLDNO>
     <DOCHDR>
     http://www.lpitr.state.sc.us:80/reports/jsrf14.htm 167.7.18.68 19970216181104 text/html 264
     HTTP/1.0 200 OK
     Date: Sunday, 16-Feb-97 18:19:32 GMT
     Server: NCSA/SMI-1.0
     MIME-version: 1.0
     Content-type: text/html
     Last-modified: Friday, 02-Feb-96 19:51:15 GMT
     Content-length: 82
     </DOCHDR>
     <sup>1</sup> Mr. Delleney did not participate in deliberation of this candidate.
     </DOC>

  10. Link Information
     • For approaches using PageRank/HITS
     • In-links
       • A line "A B C" means B and C contain links to A
       • e.g. WTX010-B01-118 WTX010-B01-114 WTX010-B01-121
     • Out-links
       • A line "A B C" means A contains links pointing to B and C
       • e.g. WTX010-B01-127 WTX010-B01-89 WTX010-B01-119
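Given the out-link format above (first token is the source document, the rest are its targets), a PageRank power-iteration sketch could look like the following; the function names and the uniform handling of dangling nodes are our assumptions, not part of the project materials:

```python
def parse_out_links(lines):
    """Each out-link line 'A B C' lists source A then its targets."""
    return {toks[0]: toks[1:] for line in lines if (toks := line.split())}

def pagerank(out_links, d=0.85, iters=50):
    """Plain power-iteration PageRank with damping factor d;
    dangling-node mass is redistributed uniformly."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for src, targets in out_links.items():
            if targets:
                share = d * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        dangling = sum(rank[n] for n in nodes if not out_links.get(n))
        for n in nodes:
            new[n] += d * dangling / len(nodes)
        rank = new
    return rank
```

The resulting scores can be combined with BM25 scores, e.g. by a weighted linear interpolation.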

  11. Evaluation
     • Evaluate the top 100 retrieved documents
     • Evaluation metrics
       • Mean average precision (MAP)
       • P@20
     • Use the program "trec_eval" to evaluate system performance
     • Usage of trec_eval
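trec_eval's basic invocation is `trec_eval <qrels-file> <results-file>`. A thin wrapper sketch (the helper name is ours, and the binary must be on PATH; the file names match the dataset description below):

```python
import shutil
import subprocess

def run_trec_eval(qrels_path, results_path, binary="trec_eval"):
    """Run 'trec_eval <qrels> <results>' and return its textual
    report, which includes MAP and precision at fixed cutoffs."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} is not on PATH")
    cmd = [binary, qrels_path, results_path]
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout
```

For example: `run_trec_eval("qrels_training_topics.txt", "results.txt")`.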

  12. Example Result for Evaluation
     (topic-num) (dummy) (docno) (rank) (score) (run-tag)
     465 Q0 WTX017-B13-74 1 5 test
     465 Q0 WTX017-B38-11 2 4.5 test
     465 Q0 WTX017-B38-41 3 4.3 test
     465 Q0 WTX017-B38-42 4 4.2 test
     465 Q0 WTX017-B40-46 5 4.1 test
     465 Q0 WTX018-B44-359 6 3.5 test
     465 Q0 WTX018-B44-300 7 3 test
     465 Q0 WTX012-B01-121 8 2.5 test
     465 Q0 WTX019-B37-27 9 2 test
     465 Q0 WTX019-B37-31 10 1.9 test
     474 Q0 WTX012-B01-151 1 9 test
     474 Q0 WTX017-B38-46 2 8 test
     474 Q0 WTX018-B44-35 3 7 test
     474 Q0 WTX013-B03-335 4 6 test
     474 Q0 WTX018-B44-30 5 5 test
     474 Q0 WTX015-B25-285 6 4 test
     474 Q0 WTX019-B37-27 7 3 test
     474 Q0 WTX014-B39-281 8 2 test
     474 Q0 WTX018-B14-294 9 1.5 test
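Writing this six-column format (topic-num, the literal dummy "Q0", docno, rank, score, run-tag) from per-topic rankings is mechanical; a sketch with illustrative names:

```python
def write_run_file(path, results, run_tag="test"):
    """Write ranked results in trec_eval's run format.
    `results` maps topic number -> list of (docno, score),
    already sorted by descending score."""
    with open(path, "w") as f:
        for topic, ranked in results.items():
            for rank, (docno, score) in enumerate(ranked, start=1):
                f.write(f"{topic} Q0 {docno} {rank} {score} {run_tag}\n")
```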

  13. Example of Relevance Judgments
     (topic-num) (dummy) (docno) (relevance)
     465 0 WTX017-B13-74 1
     465 0 WTX017-B38-46 1
     465 0 WTX018-B44-359 1
     465 0 WTX019-B37-27 2
     474 0 WTX012-B01-151 1
     474 0 WTX013-B03-335 1
     474 0 WTX014-B39-281 1
     474 0 WTX015-B25-285 1
     474 0 WTX018-B20-109 2
     474 0 WTX018-B14-294 1
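trec_eval computes MAP and P@20 from these judgments, but a small reference implementation clarifies what they measure; this is an illustrative sketch that treats any judgment with relevance > 0 (values 1 and 2 above) as relevant:

```python
def average_precision(ranked, relevant, cutoff=100):
    """Average precision over the top `cutoff` retrieved docs:
    precision at each relevant hit, divided by the total number
    of relevant docs for the topic. MAP is the mean over topics."""
    hits, total = 0, 0.0
    for i, docno in enumerate(ranked[:cutoff], start=1):
        if docno in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def precision_at(ranked, relevant, k=20):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k
```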

  14. Summary of What to Do
     • Implement Okapi-BM25 (baseline)
       • With the fixed settings
     • Evaluate the baseline approach with the training topics
       • Using the terms in <title> as the query
     • Survey or design your enhanced approach
     • Evaluate and optimize your approach with the training topics
     • Submit the report and demo with the testing topics
       • Evaluate Okapi-BM25 and your approach with the testing topics

  15. Dataset Description (1/2)
     • "training_topics.txt" (file)
       • 30 topics for system development
     • "qrels_training_topics.txt" (file)
       • Relevance judgments for the training topics
     • "documents" (directory)
       • Contains 10 .rar files of raw documents
     • "in_links.txt" (file)
       • In-link information
     • "out_links.txt" (file)
       • Out-link information

  16. Dataset Description (2/2)
     • "trec_eval.exe" (file)
       • Binary evaluation program
     • "trec_eval.8.1.rar" (file)
       • Source of trec_eval, for building on UNIX
