Information Retrieval at NLC

Presentation Transcript


  1. Information Retrieval at NLC Jianfeng Gao NLC Group, Microsoft Research China

  2. Outline • People • Projects • Systems • Research

  3. People • Jianfeng Gao, Microsoft Research, China • Guihong Cao, Tianjin University, China • Hongzhao He, Tianjin University, China • Min Zhang, Tsinghua University, China • Jian-Yun Nie, Université de Montréal • Stephen Robertson, Microsoft Research, Cambridge • Stephen Walker, Microsoft Research, Cambridge

  4. Systems • SMART (Master: Hongzhao) • Traditional IR system – VSM, TF-IDF • Holds a collection of more than 500 MB • Runs on Linux • Okapi (Master: Guihong) • Modern IR system – probabilistic model, BM25 • Holds a collection of more than 10 GB • Runs on Windows 2000
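
  The Okapi side of this slide is built around the BM25 ranking function. As a rough illustration of that model (a minimal sketch, not Okapi's actual code; the parameter values k1 = 1.2 and b = 0.75 are common defaults assumed here, not values from the slides), a BM25 scorer might look like this:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with Okapi BM25.

    query_terms : list of query tokens
    doc_terms   : list of tokens in the document
    doc_freq    : dict mapping term -> number of documents containing it
    num_docs    : total number of documents in the collection
    avg_doc_len : average document length in tokens
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        # Robertson/Sparck Jones style idf weight
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        # term-frequency saturation with document-length normalisation
        denom = tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf[term] * (k1 + 1) / denom
    return score
```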

  5. Projects • CLIR – TREC-9 (Japanese NTCIR-3) • System: SMART • Focus: • Chinese indexing units [Gao et al., 00] [Gao & He, 01] • Query translation [Gao et al., 01] • Web Retrieval – TREC-10 • System: Okapi • Focus: • Blind feedback … [Zhang et al., 01] • Link-based retrieval (anchor text) … [Craswell et al., 01]

  6. Research • Best indexing unit for Chinese IR • Query translation • Using link information for web retrieval • Blind feedback for web retrieval • Improving the effectiveness of IR with clustering and fusion

  7. Best indexing unit for Chinese IR • Motivation • What is the basic indexing unit for Chinese IR – word, n-gram, or a combination? • Does the accuracy of word segmentation have a significant impact on IR performance? • Experiment 1 – indexing units • Experiment 2 – the impact of word segmentation

  8. Experiment 1 – settings • System – SMART (modified version) • Corpus – TREC-5&6 Chinese collection • Experiments • Impact of the dictionary – longest matching with a small dictionary and with a large dictionary • Combining the first method with single characters • Using full segmentation • Using bi-grams and uni-grams (single characters) • Combining words with bi-grams and characters • Unknown-word detection using NLPWin
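
  As a rough sketch of what "combining words with bi-grams and characters" can mean in practice (an illustrative reconstruction, not the modified SMART indexer; `segment_words` is a placeholder for the dictionary-based longest-matching segmenter), one might emit all three kinds of tokens for each Chinese string:

```python
def index_units(text, segment_words):
    """Produce combined indexing units for a Chinese string.

    text          : Chinese text with no whitespace
    segment_words : callable splitting the text into dictionary words,
                    e.g. longest matching against a word list (placeholder)
    Returns dictionary words + single characters + character bi-grams.
    """
    words = list(segment_words(text))                         # dictionary words
    chars = list(text)                                        # uni-grams (characters)
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]   # character bi-grams
    return words + chars + bigrams
```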

  9. Experiment 1 – results • Word + character + (bi-gram) + unknown words

  10. Experiment 2 – settings • System • SMART system • Songrou's segmentation & evaluation system • Corpus • (1) TREC-5&6 Chinese collection • (2) Songrou's corpus • 12rst.txt, 181 KB • 12rst.src, 250 KB (standard segmentation of 12rst.txt made by linguists) • (3) Sampling from Songrou's corpus • test.txt, 20 KB (random sampling from 12rst.txt) • standard.src, 28 KB (standard segmentation corresponding to test.txt)

  11. Experiment 2 – results • Notes A: 1 Baseline; 2 Disambiguation; 3 Number; 4 Proper noun; 5 Suffix • Notes B: Feedback parameters are (10, 500, 0.5, 0.5) and (100, 500, 0.5, 0.5)

  12. Query Translation • Motivation – problems of simple lexicon-based approaches • The lexicon is incomplete • It is difficult to select correct translations • Solution – an improved lexicon-based approach • Term disambiguation using co-occurrence statistics • Phrase detection and translation using a language model (LM) • Translation coverage enhancement using a translation model (TM)

  13. Term disambiguation • Assumption – correct translation words tend to co-occur in Chinese text • A greedy algorithm: • for English terms Te = (e1 … en), • find their Chinese translations Tc = (c1 … cn) such that Tc = argmax SIM(c1, …, cn) • Term-similarity matrix – trained on a Chinese corpus
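
  The greedy algorithm on this slide can be sketched roughly as follows; the ordering heuristic (fixing the least ambiguous terms first) and the tie-breaking are assumptions of this sketch, and `sim` stands in for a lookup in the term-similarity matrix trained on the Chinese corpus:

```python
def disambiguate(candidates, sim):
    """Greedy term disambiguation for query translation.

    candidates : list of lists; candidates[i] holds the Chinese candidate
                 translations of the i-th English query term
    sim        : (c1, c2) -> co-occurrence similarity, looked up in a
                 term-similarity matrix trained on a Chinese corpus
    Returns one chosen Chinese translation per English term.
    """
    # heuristic (an assumption of this sketch): fix the least ambiguous
    # terms first, then choose each remaining translation to maximise its
    # total similarity to the translations already fixed
    order = sorted(range(len(candidates)), key=lambda i: len(candidates[i]))
    chosen, result = [], [None] * len(candidates)
    for i in order:
        # for the very first term all scores are zero, so the first
        # candidate is kept by default
        best = max(candidates[i],
                   key=lambda c: sum(sim(c, p) for p in chosen))
        chosen.append(best)
        result[i] = best
    return result
```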

  14. Phrase detection and translation • Multi-word phrases are detected by a base-NP detector • Translation patterns (PATTe), e.g. • <NOUN1 NOUN2> → <NOUN1 NOUN2> • <NOUN1 of NOUN2> → <NOUN2 NOUN1> • Phrase translation: • Tc = argmax P(OTc|PATTe) P(Tc) • P(OTc|PATTe): probability of the translation pattern • P(Tc): probability of the phrase under the Chinese LM
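
  To make the phrase-translation formula concrete, a minimal sketch (with placeholder inputs, not the actual NLC pattern table or language model) could score each candidate Chinese phrase by the product of the pattern probability and the Chinese LM probability:

```python
def best_phrase_translation(candidates, pattern_prob, lm_prob):
    """Pick the Chinese phrase Tc maximising P(O_Tc | PATT_e) * P(Tc).

    candidates   : list of (chinese_tokens, reordering_pattern) pairs
    pattern_prob : dict mapping a reordering pattern -> P(O_Tc | PATT_e)
    lm_prob      : callable giving the Chinese LM probability of a phrase
    (all three inputs are placeholders in this sketch)
    """
    def score(cand):
        tokens, pattern = cand
        return pattern_prob.get(pattern, 0.0) * lm_prob(tokens)
    return max(candidates, key=score)
```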

  15. Using a translation model (TM) • Enhance the coverage of the lexicon • Using the TM: • Tc = argmax P(Te|Tc) SIM(Tc) • Mining parallel texts from the Web for TM training
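
  Similarly, the TM-based selection Tc = argmax P(Te|Tc) SIM(Tc) can be sketched as a re-ranking of candidate translations; `tm_prob` and `sim_score` below are placeholders for the translation model trained on parallel Web texts and the co-occurrence similarity from the previous slide:

```python
def best_tm_translation(english_terms, candidates, tm_prob, sim_score):
    """Choose Tc = argmax P(Te | Tc) * SIM(Tc).

    english_terms : the English query terms Te
    candidates    : candidate Chinese translations Tc (lists of terms)
    tm_prob       : callable (Te, Tc) -> translation probability from a TM
                    trained on parallel texts mined from the Web
    sim_score     : callable Tc -> co-occurrence similarity SIM(Tc)
    (tm_prob and sim_score are placeholders in this sketch)
    """
    return max(candidates,
               key=lambda tc: tm_prob(english_terms, tc) * sim_score(tc))
```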

  16. Experiments on TREC-5&6 • Monolingual • Simple translation: lexicon lookup • Best-sense translation: simple translation (2) plus manual selection of translations • Improved translation (our method) • Machine translation: using the IBM MT system

  17. Summary of Experiments

     #  Translation method              Avg. P.   % of Mono. IR
     1  Monolingual                     0.5150    –
     2  Simple translation (m-mode)     0.2722    52.85%
     3  Simple translation (u-mode)     0.3041    59.05%
     4  Best-sense translation          0.3762    73.05%
     5  Improved translation            0.3883    75.40%
     6  Machine translation             0.3891    75.55%
     7  Improved + machine (5 + 6)      0.4400    85.44%

  18. Using link information for web retrieval • Motivation • The effectiveness of link-based retrieval • Evaluation on the TREC web collection • Link-based web retrieval – the state of the art • Recommendation – high in-degree is better • Topic locality – connected pages are similar • Anchor description – a page is represented by its anchor text • Link-based retrieval in TREC – no good results so far
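
  A minimal sketch of the "anchor description" idea (an illustration only, not the TREC-10 system): represent each target page by the anchor texts of links pointing at it, and keep the in-degree as the simple recommendation signal mentioned above.

```python
from collections import defaultdict

def build_anchor_descriptions(links):
    """Build anchor-text surrogate documents for link-based retrieval.

    links : iterable of (source_url, target_url, anchor_text) triples
    Returns a dict mapping target_url -> anchor-text tokens, plus an
    in-degree count per target (a simple recommendation signal).
    """
    descriptions = defaultdict(list)
    in_degree = defaultdict(int)
    for _source, target, anchor_text in links:
        descriptions[target].extend(anchor_text.lower().split())
        in_degree[target] += 1
    return descriptions, in_degree
```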

  19. Experiments on TREC-9 • Baseline – content-based IR • Anchor description • Used alone – much worse than the baseline • Combined with the content description – trivial improvement • Re-ranking – trivial improvement • Spreading – no positive effect

  20. Summary of Experiments

  21. Blind feedback for web retrieval • Motivation • Web queries are short • The web collection is huge and highly mixed • Blind feedback – refine web queries • Using the global web collection • Using a local web collection • Using another well-organized collection, e.g. Encarta

  22. Experiments on TREC-9 • Baseline – 2-stage pseudo-relevance feedback (PRF) using the global web collection • Local context analysis [Xu et al., 96] – 2-stage PRF using the local web collection retrieved in the first stage • 2-stage PRF using the Encarta collection in the first stage
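
  A minimal sketch of the 2-stage pseudo-relevance feedback procedure referred to here (the term-selection heuristic and the cut-offs below are assumptions, not the Okapi settings): retrieve with the original query, treat the top-ranked documents as relevant, add their most frequent terms, and retrieve again.

```python
from collections import Counter

def blind_feedback(query_terms, retrieve, doc_tokens,
                   top_docs=10, expand_terms=20):
    """Two-stage pseudo-relevance feedback.

    query_terms  : tokens of the original (short) web query
    retrieve     : callable (terms) -> ranked list of document ids
    doc_tokens   : callable (doc_id) -> tokens of that document
    top_docs     : first-stage documents assumed to be relevant
    expand_terms : number of expansion terms to add to the query
    """
    # stage 1: initial retrieval with the original query
    first_pass = retrieve(query_terms)[:top_docs]

    # collect candidate expansion terms from the assumed-relevant documents
    counts = Counter()
    for doc_id in first_pass:
        counts.update(t for t in doc_tokens(doc_id) if t not in query_terms)
    expansion = [t for t, _ in counts.most_common(expand_terms)]

    # stage 2: retrieve again with the expanded query
    return retrieve(query_terms + expansion)
```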

  23. Summary of Experiments • ???

  24. Improving the effectiveness of IR with clustering and fusion • Clustering hypothesis – Documents that are relevant to the same query are more similar to each other than to non-relevant documents, and can be clustered together. • Fusion hypothesis – Different ranked lists usually have a high overlap of relevant documents and a low overlap of non-relevant documents.
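
  As one concrete, deliberately simple way to act on the fusion hypothesis (a generic CombSUM-style sketch, not necessarily the fusion method used at NLC), ranked lists from different retrieval runs can be merged by summing normalised scores:

```python
def fuse_rankings(runs):
    """Merge ranked lists by summing min-max normalised scores (CombSUM-like).

    runs : list of non-empty dicts, each mapping doc_id -> retrieval score
    Returns doc_ids sorted by the fused score, best first.
    """
    fused = {}
    for run in runs:
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0
        for doc_id, score in run.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + (score - lo) / span
    return sorted(fused, key=fused.get, reverse=True)
```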

  25. Thanks!
