A Brief Survey on Cross-language Information Retrieval (CLIR) - Text Retrieval Perspective

by Ying Alvarado (24401693). CSE 8337, Lecturer: Dr. Margaret Dunham. April 26, 2007.


Presentation Transcript


  1. A Brief Survey on Cross-language Information Retrieval (CLIR) - Text Retrieval Perspective by Ying Alvarado (24401693) CSE 8337 Lecturer : Dr. Margaret Dunham April 26, 2007

  2. Outline • Introduction • Concept • Why important • Approach • CLIR problems • Resource • Approaches • Example Techniques • A CLIR application system • CLIR effectiveness • CLIR future tasks • CLIR communities • References

  3. Cross Language IR • Definition: Users enter their query in one language and the system retrieves relevant documents in other languages. • For example, a user may pose their query in English but retrieve relevant documents written in French. • Example CLIR applications • Cross-Language retrieval from texts • Cross-Language retrieval from audio and images In this presentation, we focus on text IR only! [1] Wikipedia, http://en.wikipedia.org/wiki/Cross-language_information_retrieval [2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005

  4. Monolingual vs. Bilingual vs. Multilingual • Monolingual IR: documents and user requests are in the same language. [Diagram: IR system with Request (L1), Documents (L1), Results (L1)] • Cross-language IR: documents and user requests are in different languages (bilingual IR). [Diagram: Cross-language IR (CLIR) system with Request (L1, source language), Documents (L2, target language), Results (L2)] [2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005

  5. Monolingual vs. Bilingual vs. Multilingual (cont.) • Multilingual IR: the collection contains documents in different languages, and search requests may be in any language, e.g. the Web. [Diagram: Multilingual IR (MLIR) system with Request (L?), Documents (L2), Documents (L3), Documents (L4), Results (L2, L3 or L4)]

  6. Why CLIR? Mar. 10, 2007 [3] Internet World Stats, http://www.internetworldstats.com/stats7.htm

  7. Why CLIR? (cont.) • A collection may contain documents in many different languages, e.g. the Web. It would be impractical to form a query in each language. • The documents may be expressed in more than one language. For example, • Technical documents in which English jargon appears intermixed with narrative text in another language. • Academic works which cite the titles of documents in different languages. • The user is not sufficiently fluent to express a query in a language, but is able to make use of the documents that are identified. • The user is monolingual and wants to query in their native language, because they • can judge relevance even if results are not translated • have access to document translation [2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005 [4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996

  8. CLIR problems • Handling non-ASCII character sets • Untranslatable search keys (out-of-vocabulary, OOV), e.g. compound words, proper names, special terms • Multi-word concepts, e.g. phrases and idioms • Ambiguity, e.g. homonymy and polysemy • Word inflections, e.g. plurals and gender [2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005 [5] Ari Pirkola, et al. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, Vol. 4. 2001

  9. Resources for Translation • Ontology • Representation of concepts and relationships • Thesaurus • A listing of words with similar, related, or opposite meanings • It does not include the definitions of words • Bilingual dictionary • A list of words together with additional word-specific information • Bilingual controlled vocabulary • A carefully selected list of words and phrases, which are used to tag units of information (document or work) so that they may be more easily retrieved by a search • Corpora • The document collection itself [4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996 [6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006 [1] Wikipedia. Related pages. [7] Metamodel.com. What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model? http://www.metamodel.com/article.php?story=20030115211223271. 2004

  10. An example of controlled vocabulary • The hierarchical relationships: Women’s Pants: BT (broader term) Pants; NT (narrower term) Casual Pants; NT Dress Pants; NT Sports Pants • The equivalence relationship [14] Boxes and Arrows, http://www.boxesandarrows.com/view/what_is_a_controlled_vocabulary
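
A controlled-vocabulary entry like the one on this slide can be held in a small lookup structure. The sketch below is only an illustration: the `VOCAB` layout and the `expand_with_narrower` helper are assumptions made here, and only the "Women's Pants" entry comes from the slide.

```python
# Toy controlled vocabulary. BT = broader term, NT = narrower term.
# The dictionary layout is a hypothetical choice for this sketch.
VOCAB = {
    "Women's Pants": {
        "BT": ["Pants"],
        "NT": ["Casual Pants", "Dress Pants", "Sports Pants"],
    },
}


def expand_with_narrower(term, vocab):
    """Return the term followed by its narrower terms (NT), if it has any."""
    entry = vocab.get(term, {})
    return [term] + list(entry.get("NT", []))


if __name__ == "__main__":
    print(expand_with_narrower("Women's Pants", VOCAB))
```

Expanding a tag with its narrower terms is one plausible use of the hierarchy at search time; a term that is not in the vocabulary is returned unchanged.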

  11. What to translate? • Document translation • Text translation • E.g., translate entire document collection into English → search collection in English • Vector translation • Query translation • E.g., translate English query into Chinese query → search Chinese document collection [6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006

  12. Tradeoffs • Document Translation • Documents can be translated and stored offline • Dependent on a high-quality automatic machine translation (MT) system • Does not easily deal with changing document sets • Query Translation • Often easier • Disambiguation of query terms may be difficult with short queries [4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996 [6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006

  13. Approaches to query translation • Knowledge-based: Several aspects of domain knowledge are manually encoded into a lexicon. • Ontology-based (concept driven) • Thesaurus-based • Dictionary-based Lexicons are expensive to construct and lag behind the common use of terminology. • Corpus-based: directly exploit statistical information about term usage in a corpus; automatically construct a lexicon. • Parallel corpora: document pairs, sentence pairs, term pairs • Comparable corpora: document pairs, similar content • Unaligned corpora: documents from the same domain, not translations of one another, not linked in any other way [4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996 [8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001 [9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
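
To make the dictionary-based approach concrete, here is a minimal sketch of query translation by dictionary lookup. The toy English-to-French dictionary and the function name `translate_query` are assumptions for illustration. The sketch keeps every translation candidate (no disambiguation) and passes out-of-vocabulary terms through untranslated, which mirrors two of the CLIR problems listed on slide 8.

```python
# Hypothetical toy English-to-French dictionary; entries are illustrative only.
EN_FR = {
    "bank": ["banque", "rive"],
    "river": ["rivière", "fleuve"],
}


def translate_query(query_terms, bilingual_dict):
    """Translate each query term by lookup.

    Returns (translated_terms, untranslated_terms). All candidate
    translations are kept, so ambiguous terms add several target terms.
    """
    translated = []
    untranslated = []
    for term in query_terms:
        candidates = bilingual_dict.get(term.lower())
        if candidates:
            translated.extend(candidates)
        else:
            # Out-of-vocabulary term (e.g. a proper name): keep as-is.
            untranslated.append(term)
    return translated, untranslated


if __name__ == "__main__":
    print(translate_query(["river", "bank", "Zurich"], EN_FR))
```

With a short query such as "bank", nothing in the query itself says whether "banque" or "rive" is meant, which is why disambiguation is harder for query translation than for document translation.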

  14. Applying monolingual IR techniques • Query expansion • Relevance feedback • Stemming • Latent semantic analysis • Parsing • Part of speech tagging …… [4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996

  15. Multilingual Thesauri • Three construction techniques • Build it from scratch • Translate an existing thesaurus • Merge monolingual thesauri • For example EuroWordNet • 7 languages • Built from existing lexical resources • Has the same structure as Princeton WordNet [8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001 [9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007

  16. Pseudo-Relevance Feedback • Also called blind feedback • Assume that the top n documents in the result set actually are relevant. • Enter query terms in French • Find top French documents in parallel corpus • Construct a query from English translations • Perform a monolingual free text search [Diagram: French Query Terms → French Text Retrieval System (Parallel Corpus) → Top ranked French Documents → English Translations → AltaVista → English Web Pages] [9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
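
The four steps on this slide can be sketched in a few lines. This is a simplified illustration under stated assumptions: the parallel corpus is a list of (French text, English text) pairs, French documents are ranked by a plain count of query-term occurrences rather than by a real retrieval model, and the names `cross_language_prf`, `top_n`, and `num_terms` are invented for this sketch. The final monolingual search is left out.

```python
from collections import Counter

# Hypothetical toy parallel corpus of (French, English) document pairs.
PARALLEL_CORPUS = [
    ("le chat noir dort", "the black cat sleeps"),
    ("le chat mange", "the cat eats"),
    ("la voiture rouge", "the red car"),
]


def cross_language_prf(query_terms, parallel_corpus, top_n=2, num_terms=3,
                       stopwords=frozenset()):
    """Build an English query from the translations of top French documents."""
    query = {t.lower() for t in query_terms}

    # Step 1-2: score each French document by query-term occurrences.
    scored = []
    for idx, (fr_doc, _en_doc) in enumerate(parallel_corpus):
        score = sum(1 for tok in fr_doc.lower().split() if tok in query)
        if score > 0:
            scored.append((score, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))

    # Step 3: assume the top_n documents are relevant and count the
    # terms of their English translations.
    counts = Counter()
    for _score, idx in scored[:top_n]:
        en_doc = parallel_corpus[idx][1]
        counts.update(tok for tok in en_doc.lower().split()
                      if tok not in stopwords)

    # The most frequent English terms form the new query.
    return [term for term, _count in counts.most_common(num_terms)]


if __name__ == "__main__":
    print(cross_language_prf(["chat"], PARALLEL_CORPUS, num_terms=1,
                             stopwords={"the"}))
```

The returned terms would then be submitted to an ordinary English search engine, which is the role AltaVista plays in the slide's diagram.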

  17. Different level alignment in parallel corpora • Document alignment • Already exists • Collected from existing corpora • Examine document external features • Examine document internal features • Sentence alignment • Easily constructed from aligned documents • Match pattern of relative sentence lengths • Good first step for term alignment • Term alignment • Using co-occurrence-based translation [9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007

  18. Example of term alignment • Chinese: CSE8337是一门关于信息存储和检索的课程。 • English: CSE8337 is a class about information storage and retrieval.

  19. Co-occurrence-based translation • Align terms using co-occurrence statistics • It is assumed that the correct translations of query terms tend to co-occur in target language documents • How often does a term pair occur in sentence pairs? • Weighted by relative position in the sentences • Retain term pairs that occur unusually often [9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
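
A minimal sketch of co-occurrence counting over aligned sentence pairs follows. The slide does not name a scoring formula, so the Dice coefficient is swapped in here as one common way to measure "occurs unusually often"; the positional weighting mentioned on the slide is omitted. The toy sentence pairs and function names are assumptions for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy sentence-aligned corpus: (French tokens, English tokens).
SENTENCE_PAIRS = [
    (["chat", "noir"], ["black", "cat"]),
    (["chat", "blanc"], ["white", "cat"]),
    (["chien", "noir"], ["black", "dog"]),
]


def cooccurrence_scores(sentence_pairs):
    """Score every (source, target) term pair with the Dice coefficient.

    Dice = 2 * (pairs containing both) / (pairs with source + pairs with target)
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in sentence_pairs:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        pair_count.update(product(src_set, tgt_set))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


def best_translation(term, scores):
    """Return the highest-scoring target term for a source term, or None."""
    candidates = {t: v for (s, t), v in scores.items() if s == term}
    return max(candidates, key=candidates.get) if candidates else None


if __name__ == "__main__":
    scores = cooccurrence_scores(SENTENCE_PAIRS)
    print(best_translation("chat", scores))
```

In this toy corpus "chat" and "cat" appear together in both sentence pairs that contain either word, so their Dice score is 1.0, while "chat" and "black" share only one of their sentence pairs.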

  20. Exploiting Unaligned Corpora • Example approach: category-based translation • Extract a large number of terms from unaligned corpora of the first and second languages • Assign a category to each extracted term by accessing monolingual thesauri of the first and second languages • Estimate category-to-category translation probabilities • Estimate term-to-term translation probabilities using said category-to-category translation probabilities [15] David Hull, Terminology translation for unaligned comparable corpora using category based translation probabilities. United States Patent 6885985. Filing date: Dec 18, 2000. Issue date: Apr 26, 2005
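
The last step, going from category-level to term-level probabilities, can be illustrated with a deliberately simple sketch. It is not the method of the cited patent: it assumes each term has exactly one category, takes the category-to-category probabilities as given, and normalizes them over the candidate target terms. All names, categories, and numbers below are hypothetical.

```python
# Hypothetical category assignments from two monolingual thesauri.
SRC_CATEGORY = {"bank": "FINANCE"}
TGT_CATEGORY = {"banque": "FINANCE_FR", "rive": "NATURE_FR"}

# Hypothetical category-to-category translation probabilities.
CATEGORY_PROBS = {
    ("FINANCE", "FINANCE_FR"): 0.75,
    ("FINANCE", "NATURE_FR"): 0.25,
}


def term_translation_probs(src_term, src_category, tgt_category,
                           category_probs):
    """Derive term-to-term probabilities from category-level ones.

    Each target term receives the probability of its category given the
    source term's category; the scores are then normalized to sum to 1.
    """
    src_cat = src_category[src_term]
    scores = {}
    for tgt_term, tgt_cat in tgt_category.items():
        p = category_probs.get((src_cat, tgt_cat), 0.0)
        if p > 0.0:
            scores[tgt_term] = p
    total = sum(scores.values())
    return {term: p / total for term, p in scores.items()} if total else {}


if __name__ == "__main__":
    print(term_translation_probs("bank", SRC_CATEGORY, TGT_CATEGORY,
                                 CATEGORY_PROBS))
```

The point of the category layer is that it needs no aligned text: the two thesauri supply the categories, and only the category-level probabilities have to be estimated from the unaligned corpora.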

  21. In Summary [Taxonomy diagram of Cross-Language Text Retrieval; node labels: Query Translation, Document Translation, Text Translation, Vector Translation, Controlled Vocabulary, Free Text, Knowledge-based, Corpus-based, Ontology-based, Thesaurus-based, Dictionary-based, Term-aligned, Sentence-aligned, Document-aligned, Unaligned, Parallel, Comparable] [8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001

  22. An experimental system: Automatic construction of parallel English-Chinese corpus for CLIR • A parallel text mining system: PTMiner • Finds parallel text from the Web • Parallel Text Mining Algorithm • Search for candidate sites - Using existing Web search engines, search for the candidate sites that may contain parallel pages (by using anchor text); • File name fetching - For each candidate site, fetch the URLs of Web pages that are indexed by the search engines; • Host crawling - Starting from the URLs collected in the previous step, search through each candidate site separately for more URLs; • Pair scan - From the obtained URLs of each site, scan for possible parallel pairs (by analyzing document external features); • Download and verify - Download the parallel pages, determine file size, language and character set, text length, HTML structure, and filter out non-parallel pairs. [10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000
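
The verification step compares several features of a candidate pair, one of which is HTML structure. The sketch below shows one plausible way to compare structure, using Python's standard `html.parser` and `difflib` modules; it is an assumption made for illustration and not PTMiner's actual filter. The threshold `min_structure_similarity` is a hypothetical parameter.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collect the sequence of opening tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names in document order."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def looks_parallel(html_a, html_b, min_structure_similarity=0.8):
    """Keep a pair only if the two pages have similar tag sequences."""
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b))
    return matcher.ratio() >= min_structure_similarity


if __name__ == "__main__":
    en = "<html><body><h1>Title</h1><p>Hello</p><p>World</p></body></html>"
    zh = "<html><body><h1>标题</h1><p>你好</p><p>世界</p></body></html>"
    print(looks_parallel(en, zh))
```

Translated pages on the same site often share a template, so their tag sequences tend to match closely even though the text differs; a fuller filter would also compare text length, language, and character set as the slide lists.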

  23. The workflow of the mining process • Sample anchor texts: “english version” [“in english”, ……] • Sample document external features: “file-ch.html” vs. “file-en.html” “…/chinese/…/file.html” vs. “…/english/…file.html” • Sample document internal features: Character set, HTML structure [10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000
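
The external-feature examples above suggest a simple pair scan: rewrite the language marker in one URL and check whether the result is also among the collected URLs. The sketch below assumes only the two marker patterns shown on the slide ("-ch"/"-en" before the extension, and a "/chinese/" vs. "/english/" path segment); the pattern list, the function name, and the example.org URLs are hypothetical.

```python
import re

# (pattern matching the Chinese marker, replacement giving the English one).
# Only these two conventions appear on the slide; a real scan would need more.
_MARKERS = [
    (re.compile(r"-ch(?=\.html?$)"), "-en"),
    (re.compile(r"/chinese/"), "/english/"),
]


def candidate_pairs(urls):
    """Return (chinese_url, english_url) pairs whose names differ only
    by a known language marker."""
    url_set = set(urls)
    pairs = []
    for url in urls:
        for pattern, replacement in _MARKERS:
            if pattern.search(url):
                counterpart = pattern.sub(replacement, url, count=1)
                if counterpart in url_set:
                    pairs.append((url, counterpart))
                    break
    return pairs


if __name__ == "__main__":
    urls = [
        "http://example.org/a/file-ch.html",
        "http://example.org/a/file-en.html",
        "http://example.org/chinese/news/file.html",
        "http://example.org/english/news/file.html",
        "http://example.org/other.html",
    ]
    for pair in candidate_pairs(urls):
        print(pair)
```

Because this check uses URL names only, it is cheap enough to run before any page is downloaded; the pairs it proposes are still only candidates until the content-based verification step confirms them.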

  24. An alignment example [10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000

  25. Part of the lexicons • t: true • f: false Other techniques and tools used: • Encoding scheme transformation (for Chinese) • Sentence level segmentation • Chinese word segmentation • English expression extraction • SILC: language and encoding identification system [10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000

  26. Results • 14820 pairs of texts (lexicon) • C-E has a precision of 77% • E-C has a precision of 81.5% • CLIR results • Test corpus: TREC5 and TREC6 Chinese track [10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000

  27. Does CLIR work? • Best systems at TREC-6 (1997): • English-French: 49% of highest French monolingual • English-German: 64% of highest German monolingual • Best systems at CLEF (2002): • English-French: 83% of highest French monolingual • English-German: 86% of highest German monolingual • Best systems at CLEF (2006): • English-French: 93.82% of best French monolingual • English-Portuguese: 90.91% of best Portuguese monolingual [2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005 [16] Giorgio M. Di Nunzio, CLEF 2006: Ad Hoc Track Overview. 2006

  28. Future tasks • Extend study scope: • Web pages, medical literature, USENET newsgroup articles, records of legislative and legal proceedings… • Lower cost, improve efficiency • Pay more attention to indexing-time optimizations to improve query-time efficiency • Consider the user’s perspective • Improve the utility of ranked lists • Define suitable criteria for the construction of a valid multilingual Web corpus • Get resources for resource-poor languages [11] D.W. Oard, When You Come to a Fork in the Road, Take It: Multiple Futures for CLIR Research. SIGIR 2002 CLIR [12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR

  29. CLIR Communities • TREC Cross Language Track, which currently focuses on the Arabic language • Cross-Language Evaluation Forum (CLEF), a spinoff from TREC covering many European languages • NTCIR Asian Language Evaluation, covering Chinese, Japanese and Korean [12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR

  30. CLEF In CLEF 2006, eight tracks were offered to evaluate the performance of systems: • multilingual document retrieval on news collections (Ad-hoc) • cross-language structured scientific data (Domain-specific) • interactive cross-language retrieval • multiple language question answering • cross-language retrieval on image collections • cross-language speech retrieval • multilingual web retrieval • cross-language geographic retrieval. [13] Carol Peters, Cross-Language Evaluation Forum - CLEF 2006. D-Lib Magazine October 2006

  31. References
  [1] Wikipedia, http://en.wikipedia.org/wiki/Cross-language_information_retrieval
  [2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
  [3] Internet World Stats, http://www.internetworldstats.com/stats7.htm
  [4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996
  [5] Ari Pirkola, et al. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, Vol. 4. 2001
  [6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006
  [7] Metamodel.com. What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model? http://www.metamodel.com/article.php?story=20030115211223271. 2004
  [8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001
  [9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
  [10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000
  [11] D.W. Oard, When You Come to a Fork in the Road, Take It: Multiple Futures for CLIR Research. SIGIR 2002 CLIR
  [12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR
  [13] Carol Peters, Cross-Language Evaluation Forum - CLEF 2006. D-Lib Magazine October 2006
  [14] Boxes and Arrows, http://www.boxesandarrows.com/view/what_is_a_controlled_vocabulary
  [15] David Hull, Terminology translation for unaligned comparable corpora using category based translation probabilities. United States Patent 6885985. Filing date: Dec 18, 2000. Issue date: Apr 26, 2005
  [16] Giorgio M. Di Nunzio, CLEF 2006: Ad Hoc Track Overview. 2006

  32. Thank you!
