Information Retrieval at NLC - PowerPoint PPT Presentation

Presentation Transcript

  1. Information Retrieval at NLC Jianfeng Gao NLC Group, Microsoft Research China

  2. Outline • People • Projects • Systems • Research

  3. People • Jianfeng Gao, Microsoft Research, China • Guihong Cao, Tianjin University, China • Hongzhao He, Tianjin University, China • Min Zhang, Tsinghua University, China • Jian-Yun Nie, Université de Montréal • Stephen Robertson, Microsoft Research, Cambridge • Stephen Walker, Microsoft Research, Cambridge

  4. Systems • SMART (Master: Hongzhao) • Traditional IR system – VSM, TF-IDF • Holds collections of more than 500 MB • Runs on Linux • Okapi (Master: Guihong) • Modern IR system – probabilistic model, BM25 • Holds collections of more than 10 GB • Runs on Windows 2000
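The BM25 weighting used by the Okapi system can be sketched as follows. This is a generic textbook form of BM25, not code from the systems described here; the parameter defaults k1 = 1.2 and b = 0.75 are common choices, not values stated in the slides:

```python
import math

def bm25_score(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 weight for one term in one document.

    tf: term frequency in the document
    df: number of documents containing the term
    doc_len / avg_doc_len: document-length normalization
    n_docs: number of documents in the collection
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)
```

A document's score for a query is the sum of this weight over the query terms; the saturation in tf is what distinguishes BM25 from plain TF-IDF.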

  5. Projects • CLIR – TREC-9 ( Japanese NTCIR-3) • System: SMART • Focus: • Chinese Indexing Unit [Gao et al, 00] [Gao&He, 01] • Query translation [Gao et al, 01] • Web Retrieval – TREC-10 • System: Okapi • Focus: • Blind Feedback … [Zhang et al, 01] • Link-based retrieval (anchor text)… [Craswell et al, 01]

  6. Research • Best indexing unit for Chinese IR • Query translation • Using link information for web retrieval • Blind feedback for web retrieval • Improving the effectiveness of IR with clustering and fusion

  7. Best indexing unit for Chinese IR • Motivation • What is the best unit of indexing in Chinese IR – word, n-gram, or a combination? • Does the accuracy of word segmentation have a significant impact on IR performance? • Experiment 1 – indexing units • Experiment 2 – the impact of word segmentation

  8. Experiment 1 – settings • System – SMART (modified version) • Corpus – TREC-5&6 Chinese collection • Experiments • Impact of the dictionary – longest matching with a small dict. and with a large dict. • Combining the first method with single characters • Using full segmentation • Using bi-grams and uni-grams (characters) • Combining words with bi-grams and characters • Unknown-word detection using NLPWin
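The character n-gram units compared in Experiment 1 can be sketched as below. The `index_units` combination (characters plus bigrams) is illustrative of one of the schemes tested; the actual SMART modifications are not shown in the slides:

```python
def char_ngrams(text, n):
    """All overlapping character n-grams of a string (no segmentation)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def index_units(text):
    """One combined indexing scheme: single characters plus bigrams."""
    return char_ngrams(text, 1) + char_ngrams(text, 2)
```

For Chinese text, `char_ngrams(sentence, 2)` yields the overlapping character bigrams that serve as segmentation-free indexing units.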

  9. Experiment 1 – results • Word + character + (bigram) + unknown words

  10. Experiment 2 – settings • System • SMART system • Songrou’s segmentation & evaluation system • Corpus • (1) TREC-5&6 Chinese collection • (2) Songrou’s corpus • 12rst.txt 181 KB • 12rst.src 250 KB (standard segmentation of 12rst.txt made by linguists) • (3) Sampling from Songrou’s corpus • test.txt 20 KB (random sampling from 12rst.txt) • standard.src 28 KB (standard segmentation corresponding to test.txt)

  11. Experiment 2 – results Notes A: 1 Baseline; 2 Disambiguation; 3 Number; 4 Proper noun; 5 Suffix Notes B: Feedback parameters are (10, 500, 0.5, 0.5) and (100, 500, 0.5, 0.5)

  12. Query Translation • Motivation – problems of simple lexicon-based approaches • The lexicon is incomplete • It is difficult to select correct translations • Solution – an improved lexicon-based approach • Term disambiguation using co-occurrence statistics • Phrase detection and translation using a language model (LM) • Translation-coverage enhancement using a translation model (TM)

  13. Term disambiguation • Assumption – correct translation words tend to co-occur in Chinese text • A greedy algorithm: • for English terms Te = (e1…en), • find their Chinese translations Tc = (c1…cn), such that Tc = argmax SIM(c1, …, cn) • Term-similarity matrix – trained on a Chinese corpus
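A minimal sketch of this step, assuming a left-to-right greedy order (the slides specify only a greedy argmax of SIM(c1, …, cn), so the traversal order and the pairwise-sum form of SIM are assumptions here). `sim` stands in for a lookup into the term-similarity matrix trained on the Chinese corpus:

```python
def disambiguate(candidates, sim):
    """Greedily pick one Chinese translation per English term so that
    the chosen translations are maximally similar to each other.

    candidates: list of lists; candidates[i] holds the translation
                options for the i-th English term
    sim:        function (c1, c2) -> co-occurrence similarity score
    """
    chosen = []
    for options in candidates:
        # Pick the option most similar to the translations chosen so
        # far; for the first term every option scores 0, so the first
        # listed option is kept.
        best = max(options,
                   key=lambda c: sum(sim(c, prev) for prev in chosen))
        chosen.append(best)
    return chosen
```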

  14. Phrase detection and translation • Multi-word phrases are detected by a base-NP detector • Translation patterns (PATTe), e.g. • <NOUN1 NOUN2> → <NOUN1 NOUN2> • <NOUN1 of NOUN2> → <NOUN2 NOUN1> • Phrase translation: • Tc = argmax P(OTc|PATTe)P(Tc) • P(OTc|PATTe): prob. of the translation pattern • P(Tc): prob. of the phrase under a Chinese LM
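For the two-noun patterns shown, the argmax reduces to choosing between keeping and swapping the translated pair. A sketch under that assumption; `pattern_probs` and `lm_prob` are hypothetical stand-ins for the trained pattern probabilities P(OTc|PATTe) and the Chinese LM P(Tc):

```python
def translate_phrase(words_c, pattern_probs, lm_prob):
    """Pick the ordering of a translated noun pair that maximizes
    P(O_Tc | PATT_e) * P(Tc).

    words_c:       (c1, c2) Chinese translations of the two nouns
    pattern_probs: {"keep": P(c1 c2 | pattern), "swap": P(c2 c1 | pattern)}
    lm_prob:       function (phrase tuple) -> P(phrase) under a Chinese LM
    """
    c1, c2 = words_c
    keep, swap = (c1, c2), (c2, c1)
    score_keep = pattern_probs["keep"] * lm_prob(keep)
    score_swap = pattern_probs["swap"] * lm_prob(swap)
    return keep if score_keep >= score_swap else swap
```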

  15. Using a translation model (TM) • Enhances the coverage of the lexicon • Using the TM: • Tc = argmax P(Te|Tc)SIM(Tc) • Parallel texts mined from the Web for TM training
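The selection rule Tc = argmax P(Te|Tc)SIM(Tc) can be written directly; `tm_prob` and `sim` are hypothetical interfaces to the trained translation model and the coherence score:

```python
def tm_select(candidates, te, tm_prob, sim):
    """Choose the translation set Tc maximizing P(Te | Tc) * SIM(Tc):
    translation-model fidelity weighted by target-side coherence.

    candidates: iterable of candidate translation tuples Tc
    tm_prob:    function (te, tc) -> P(Te | Tc) from the translation model
    sim:        function tc -> coherence score SIM(Tc)
    """
    return max(candidates, key=lambda tc: tm_prob(te, tc) * sim(tc))
```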

  16. Experiments on TREC-5&6 • Monolingual • Simple translation: lexicon lookup • Best-sense translation: simple translation plus manual selection of the correct sense • Improved translation (our method) • Machine translation: using the IBM MT system

  17. Summary of Experiments

      #  Translation Method           Avg. P.   % of Mono. IR
      1  Monolingual                  0.5150    –
      2  Simple translation (m-mode)  0.2722    52.85%
      3  Simple translation (u-mode)  0.3041    59.05%
      4  Best-sense translation       0.3762    73.05%
      5  Improved translation         0.3883    75.40%
      6  Machine translation          0.3891    75.55%
      7  Combined (5 + 6)             0.4400    85.44%

  18. Using link information for web retrieval • Motivation • The effectiveness of link-based retrieval • Evaluation on the TREC web collection • Link-based web retrieval – the state of the art • Recommendation – high in-degree is better • Topic locality – connected pages are similar • Anchor description – a page is represented by the anchor text of links pointing to it • Link-based retrieval in TREC – no good results so far
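The anchor-description idea can be sketched as building a surrogate document for each page from the anchor text of its in-links; this is a generic illustration of the technique, not the Okapi implementation:

```python
def anchor_descriptions(links):
    """Build an anchor-text surrogate document for each target page by
    concatenating the anchor text of every link pointing at it.

    links: iterable of (source_url, target_url, anchor_text) triples
    """
    docs = {}
    for _src, target, anchor in links:
        docs.setdefault(target, []).append(anchor)
    return {t: " ".join(parts) for t, parts in docs.items()}
```

The surrogate documents are then indexed and ranked like ordinary text, either alone or combined with the content index.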

  19. Experiments on TREC-9 • Baseline – content-based IR • Anchor description • Used alone – much worse than the baseline • Combined with content description – trivial improvement • Re-ranking – trivial improvement • Spreading – no positive effect

  20. Summary of Experiments

  21. Blind feedback for web retrieval • Motivation • Web queries are short • The web collection is huge and highly mixed • Blind feedback – refine web queries • Using the global web collection • Using a local web collection • Using another well-organized collection, e.g. Encarta
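The core of blind (pseudo-relevance) feedback can be sketched as below. This is a deliberately simplified term-frequency expansion over the top-ranked documents; the slides describe a 2-stage feedback process with specific parameter settings that are not reproduced here:

```python
from collections import Counter

def expand_query(query_terms, top_docs, n_expand=5):
    """Blind feedback: assume the top-ranked documents are relevant
    and add their most frequent new terms to the query.

    query_terms: original query, a list of tokens
    top_docs:    top-ranked documents, each a list of tokens
    n_expand:    number of expansion terms to add
    """
    counts = Counter()
    for doc in top_docs:
        counts.update(doc)
    qset = set(query_terms)
    # Keep the most frequent terms not already in the query.
    expansion = [t for t, _ in counts.most_common()
                 if t not in qset][:n_expand]
    return list(query_terms) + expansion
```

The expanded query is then re-run against the collection; the slides vary which collection (global web, local web, Encarta) supplies `top_docs`.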

  22. Experiments on TREC-9 • Baseline – 2-stage pseudo-relevance feedback (PFB) using global web collection • Local context analysis [Xu et al., 96] – 2-stage PFB using local web collection retrieved by the first stage • 2-stage PFB using Encarta collection in the first stage

  23. Summary of Experiments

  24. Improving the effectiveness of IR with clustering and fusion • Clustering hypothesis – documents that are relevant to the same query are more similar to each other than to non-relevant documents, and can be clustered together. • Fusion hypothesis – different ranked lists usually have a high overlap of relevant documents and a low overlap of non-relevant documents.
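One standard way to exploit the fusion hypothesis is CombSUM-style score combination; the slides do not name a specific combination method, so this is an illustrative choice:

```python
def comb_sum(ranked_lists):
    """Fuse several runs by summing each document's scores across them.
    Documents retrieved by several runs (per the fusion hypothesis,
    likely the relevant ones) accumulate score and rise to the top.

    ranked_lists: list of dicts mapping doc_id -> retrieval score
    Returns doc_ids sorted by fused score, best first.
    """
    fused = {}
    for run in ranked_lists:
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + score
    return sorted(fused, key=fused.get, reverse=True)
```

In practice the per-run scores are normalized to a common range before summing, so that no single run dominates the fusion.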

  25. Thanks !