
Introduction to Cross-Language Information Retrieval

Introduction to Cross-Language Information Retrieval. Hsin-Hsi Chen (陳信希), Department of Computer Science and Information Engineering, National Taiwan University. Outline: Multilingual Environments; What is Cross-Language Information Retrieval?; Major Problems in CLIR; Major Approaches in CLIR; Case Study: CLIR in NPDM; Summary.


Presentation Transcript


  1. Introduction to Cross-Language Information Retrieval Hsin-Hsi Chen (陳信希) Department of Computer Science and Information Engineering National Taiwan University

  2. Outline • Multilingual Environments • What is Cross-Language Information Retrieval? • Major Problems in CLIR • Major Approaches in CLIR • Case Study: CLIR in NPDM • Summary

  3. Multilingual Collections • There are 6,703 languages listed in the Ethnologue • Digital libraries • OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages • World Wide Web • Around 40% of Internet users do not speak English; however, 80% of Web sites are still in English

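  4. Speaker Populations of Languages in the Real World (http://www.g11n.com/faq.htm) [Chart; languages shown: Spanish, Bengali, Arabic, Chinese, English, Japanese, Portuguese, Hindi, Russian]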

  5. [Chart; languages shown: Dutch, Portuguese, Italian, Spanish, Korean, Swedish, Chinese, French, German, Japanese] (Statistics from Euro-Marketing Associates, 1998)

  6. Share of Chinese speakers (6.1%) < share of French speakers (8.8%) (1998) (Statistics from Euro-Marketing Associates, 1999) http://www.glreach.com/globstats/

  7. Language Populations on the Internet

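  8. Internet Content (Network Wizards Jan 99 Internet Domain Survey) [Chart; values shown: 33,878; 1,687; 1,684; 654; 546; 546; 473; 458; 432; languages shown: English, Spanish, Swedish, Japanese, French, Chinese, Finnish, Dutch, German] • 40% of Internet users do not understand English, but 80% of Internet content is in English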

  9. (Source: http://www.emarketer.com)

  10. What is Cross-Language Information Retrieval? • Definition: Select information in one language based on queries in another. • Terminologies • Cross-Language Information Retrieval (ACM SIGIR 96 Workshop on Cross-Linguistic Information Retrieval) • Translingual Information Retrieval (Defense Advanced Research Projects Agency - DARPA)

  11. Generalization: Multi- & Cross-Lingual Information Access

  12. MLIR Applications • Multilingual information access in a multilingual country, organization, enterprise, etc. • Cross-language information retrieval for users who read a second language (large passive vocabulary) but are not able to formulate good queries (small active vocabulary). • Monolingual users may retrieve images by taking advantage of multilingual captions. • Monolingual users may retrieve documents and have them translated (automatically or manually) into their language.

  13. Why is Cross-Language Information Retrieval Important? • More information workers with less time require fast access to global resources • global B2B interactions (virtual enterprises) • global B2C interactions (online trading, travelling) • time-critical information (translation comes too late)

  14. History • 1970 Salton runs retrieval experiments with a small English/German dictionary • 1972 Pevzner shows for English and Russian that a controlled thesaurus can be used effectively for query term translation • 1978 ISO Standard 5964 for developing multilingual thesauri (revised in 1985) • 1990 Latent Semantic Indexing (LSI) applied to CLIR

  15. History (Continued) • 1994 1st PhD thesis on CLIR by Khaled Radwan • 1996 Similarity thesaurus applied to CLIR (ETH Zurich) • 1996 Dictionary-based retrieval applied to CLIR (UMass & Xerox Grenoble) • 1997 Generalized Vector Space Model (GVSM) applied to CLIR (CMU)

  16. History (Continued) • 1997 CLIR (Cross-Language Information Retrieval) track starts within TREC • 1998 NTCIR starts in Japan • 1999 TIDES (Translingual Information Detection, Extraction, and Summarization) starts in the U.S. • 2000 CLEF starts in Europe

  17. An Architecture of Multilingual Information Access

  18. Major Problems of CLIR • Queries and documents are in different languages. • translation • Words in a query may be ambiguous. • disambiguation • Queries are usually short. • expansion

  19. Major Problems of CLIR (Continued) • Queries may have to be segmented. • segmentation • A document may be written in several languages. • language identification

  20. Enhancing Traditional Information Retrieval Systems • Which part(s) should be modified for CLIR? [Diagram: Documents → (1) → Document Representation → (2) → Comparison ← (4) ← Query Representation ← (3) ← Queries]

  21. Enhancing Traditional Information Retrieval Systems (Continued) • (1): text translation • (2): vector translation • (3): query translation • (4): term vector translation • (1) and (2), (3) and (4): interlingual form

  22. What are the Problems? • Ambiguous terms (e.g., performance) • Multiword phrases may correspond to single-word phrases (e.g., South Africa => 南非, Südafrika) • Coverage of the vocabulary • There is not a one-to-one mapping between two languages • Translating queries automatically (lack of syntax) • Translating documents automatically (performance, …) • Computing mixed result lists

  23. Cross-Language Information Retrieval

  24. Query Translation Based CLIR [Diagram: English Query → Translation Device → Chinese Query → Monolingual Chinese Retrieval System → Retrieved Chinese Documents]

  25. Translating the 400 Million non-English Pages of the WWW • ... would take 100,000 days (300 years) on one fast PC. Or, 1 month on 3,600 PCs.
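The figures on this slide can be checked with a little arithmetic. The sketch below uses only the slide's own numbers. The perfect parallel speed-up across PCs and the derived time per page are assumptions of this illustration, not claims from the presentation.

```python
# Back-of-the-envelope check of the slide's figures (inputs taken from the slide).
PAGES = 400_000_000   # non-English Web pages
PC_DAYS = 100_000     # total machine time on one fast PC, in days
PCS = 3_600           # number of PCs working in parallel

years_on_one_pc = PC_DAYS / 365.25
days_on_cluster = PC_DAYS / PCS              # assumes perfect parallel speed-up
seconds_per_page = PC_DAYS * 86_400 / PAGES  # implied translation time per page

print(f"{years_on_one_pc:.0f} years on one PC")
print(f"{days_on_cluster:.1f} days on {PCS} PCs")
print(f"{seconds_per_page:.1f} s per page")
```

So the slide's "300 years" is a rounded-up value of roughly 274 years, and "1 month" corresponds to about 28 days.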

  26. Knowledge-Based • Examples • Subject Thesaurus • Hierarchical and associative relations. • Unique term assigned to each node. • Concept List • Term space partitioned into concept spaces. • Term List • List of cross-language synonyms. • Lexicon • Machine readable syntax and/or semantics.

  27. Ontology-Based Approaches • Exploit complex knowledge representations e.g., EuroWordNet • A Proposal for Conceptual Indexing using EuroWordNet

  28. Dictionary-Based Approaches • Exploit machine-readable dictionaries. • Problems • translation ambiguity + target polysemy • coverage (unknown words, abbreviations, ...)

  29. Dictionary-Based Approaches(Continued) • Issue 1: selection strategy • Select all. • Select N randomly. • Select best N. • Issue 2: which level • word • phrase

  30. Selection Strategy: Select All • Hull and Grefenstette 1996 • Take the concatenation of all term translations. E: politically motivated civil disturbances. F: troubles civils a caractere politique. Dictionary entries: trouble - turmoil, discord, trouble, unrest, disturbance, disorder; civil - civil, civilian, courteous; caractere - character, nature; politique - political, diplomatic, politician, policy • Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8% of the original English performance. • Errors: multi-word expressions and ambiguity
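The select-all strategy can be sketched in a few lines of Python. The dictionary below is copied from the slide's example entries. The function name `translate_select_all`, the assumption that the query has already been lemmatized and stripped of stop words, and the choice to keep unknown terms untranslated are all illustrative assumptions, not part of Hull and Grefenstette's system.

```python
from typing import Dict, List

# Toy French-to-English dictionary copied from the slide's example entries.
DICTIONARY: Dict[str, List[str]] = {
    "trouble": ["turmoil", "discord", "trouble", "unrest", "disturbance", "disorder"],
    "civil": ["civil", "civilian", "courteous"],
    "caractere": ["character", "nature"],
    "politique": ["political", "diplomatic", "politician", "policy"],
}


def translate_select_all(terms: List[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace every source term by all of its dictionary translations.

    Terms missing from the dictionary are kept untranslated (an assumption;
    a real system might drop or transliterate them instead).
    """
    target: List[str] = []
    for term in terms:
        target.extend(dictionary.get(term, [term]))
    return target


# Assumes the French query was already lemmatized and stop words were removed.
query = ["trouble", "civil", "caractere", "politique"]
translated = translate_select_all(query, DICTIONARY)
print(" ".join(translated))
```

The four-term query grows to fifteen target terms, which illustrates how ambiguity adds extraneous terms to the translated query.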

  31. Selection Strategy: Select All (Continued) • Davis 1997 (TREC5) • Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary. • Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12% of monolingual performance

  32. Evaluation Method • Average Precision (5-, 9-, 11-points) • Model [Diagram of three runs, each using a Mono IR Engine over the TREC Spanish Corpus: (a) Spanish Query → Mono IR Engine; (b) English Query → Bilingual Dictionary → Spanish Equivalents → Mono IR Engine; (c) English Query → POS → Bilingual Dictionary → Spanish Equivalents by POS → Mono IR Engine]
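As an illustration of the evaluation measure, here is a minimal sketch of 11-point interpolated average precision for a single query. It assumes the standard recall levels 0.0, 0.1, ..., 1.0. The slides do not say which recall levels the 5- and 9-point variants use, so those are not implemented. The function name and the toy ranking are hypothetical.

```python
from typing import List, Sequence, Tuple


def eleven_point_average_precision(ranked_relevance: Sequence[bool],
                                   total_relevant: int) -> float:
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")

    # (recall, precision) after each position of the ranked list.
    curve: List[Tuple[float, float]] = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        curve.append((hits / total_relevant, hits / rank))

    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= level.
        candidates = [p for r, p in curve if r >= level - 1e-12]
        total += max(candidates, default=0.0)
    return total / 11


# Toy ranking: relevant documents at ranks 1 and 3, two relevant in total.
score = eleven_point_average_precision([True, False, True, False, False], 2)
print(round(score, 4))
```

In a TREC-style evaluation this per-query value would then be averaged over all queries.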

  33. Selection Strategy: Select N • Simple word-by-word translation • Each query term is replaced by the word or group of words given for the first sense of the term’s definition. • 50-60% drop in performance (average precision)

  34. Selection Strategy: Select N (Continued) • Word/phrase translation • Take at most three translations of each word, one from each of the first three senses. Take the phrase translation if it appears in the dictionary. • 30-50% worse than good translation • Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements. • WBW (0.0244) vs. phrasal (0.0148, -39.3%) vs. good phrasal (0.0610, +150.3%)
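The word/phrase strategy described above can be sketched as follows. The data layout (one list of translations per sense), the restriction to two-word phrases, the function name, and the tiny English-to-German entries are all assumptions made for illustration. They are not taken from the systems evaluated on the slide.

```python
from typing import Dict, List, Tuple

Senses = List[List[str]]  # one list of translations per dictionary sense

# Hypothetical toy English-to-German entries, for illustration only.
WORDS: Dict[str, Senses] = {
    "bank": [["Bank", "Geldinstitut"], ["Ufer"], ["Reihe"], ["Damm"]],
}
PHRASES: Dict[Tuple[str, str], str] = {
    ("south", "africa"): "Südafrika",
}


def translate_select_n(terms: List[str],
                       word_dict: Dict[str, Senses],
                       phrase_dict: Dict[Tuple[str, str], str],
                       max_senses: int = 3) -> List[str]:
    """Prefer a phrase translation; otherwise take one translation from
    each of the first `max_senses` senses of the word."""
    out: List[str] = []
    i = 0
    while i < len(terms):
        pair = tuple(terms[i:i + 2])
        if len(pair) == 2 and pair in phrase_dict:
            out.append(phrase_dict[pair])  # two-word phrase found
            i += 2
            continue
        senses = word_dict.get(terms[i])
        if senses is None:
            out.append(terms[i])           # unknown word: keep as-is (assumption)
        else:
            out.extend(sense[0] for sense in senses[:max_senses] if sense)
        i += 1
    return out


print(translate_select_n(["south", "africa", "bank"], WORDS, PHRASES))
```

Setting `max_senses=1` gives a rough approximation of the simple first-sense, word-by-word strategy from the previous slide.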

  35. Selection Strategy: Select Best N • Hayashi, Kikui and Susaki 1997 • Search for a dictionary entry corresponding to the longest sequence of words from left to right • Choose the most frequently used word (or phrase) in a text corpus collected from the WWW • No results reported for this query translation approach • Davis 1997 (TREC5) • POS disambiguation • Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): near 67.3% of monolingual performance
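The two steps attributed to Hayashi, Kikui and Susaki (longest left-to-right dictionary match, then the most frequent candidate in a target-language corpus) can be sketched like this. The dictionary entries, the frequency counts, the `max_len` limit, and the function name are made-up assumptions for illustration. The original system's details are not given on the slide.

```python
from typing import Dict, List, Tuple

# Hypothetical toy English-to-German dictionary keyed by word sequences.
DICTIONARY: Dict[Tuple[str, ...], List[str]] = {
    ("south", "africa"): ["Südafrika"],
    ("south",): ["Süden"],
    ("africa",): ["Afrika"],
    ("bank",): ["Ufer", "Bank"],
}
# Made-up target-corpus frequencies used to rank the candidates.
TARGET_FREQ: Dict[str, int] = {"Bank": 120, "Ufer": 15, "Südafrika": 40}


def translate_best(terms: List[str],
                   dictionary: Dict[Tuple[str, ...], List[str]],
                   target_freq: Dict[str, int],
                   max_len: int = 3) -> List[str]:
    """Greedy longest match from left to right, then pick the candidate
    translation with the highest target-corpus frequency."""
    out: List[str] = []
    i = 0
    while i < len(terms):
        for n in range(min(max_len, len(terms) - i), 0, -1):
            key = tuple(terms[i:i + n])
            if key in dictionary:
                best = max(dictionary[key], key=lambda t: target_freq.get(t, 0))
                out.append(best)
                i += n
                break
        else:
            out.append(terms[i])  # no entry at all: keep the word (assumption)
            i += 1
    return out


print(translate_best(["south", "africa", "bank", "xyz"], DICTIONARY, TARGET_FREQ))
```

Here the phrase entry wins over the single-word entries for "south" and "africa", and "Bank" is preferred over "Ufer" because of its higher toy frequency.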

  36. Corpus-Based Approaches • Categorization • Term-Aligned • Sentence-Aligned • Document-Aligned (Parallel, Comparable) • Unaligned • Usage • Setup Thesaurus • Vector Mapping

  37. Term-Aligned Corpora • Fine-grained alignment in parallel corpora • Oard 1996 • Term alignment is a challenging problem. [Diagram labels: English Query, Parallel Bilingual Corpus, Machine Translation System, Spanish Query, Translation Tables, Co-occurrence Statistics]

  38. Sentence-Aligned Corpora • Davis & Dunning 1996 (TREC4) • High-frequency Terms

  39. Brief Summary • Dictionary-based methods • Specialized vocabulary not in the dictionaries will not be translated. • Ambiguities will add extraneous terms to the query. • Parallel/comparable corpora-based methods • Parallel corpora are not always available. • Available corpora tend to be relatively small or to cover only a small number of subjects. • Performance depends on how well the corpora are aligned.

  40. Brief Summary (Continued) • Dictionaries are very useful. • Achieve about 50% of monolingual performance on their own • Parallel corpora have limitations. • Domain shifts • Term alignment accuracy • Dictionaries and corpora are complementary. • Dictionaries provide broad and shallow coverage. • Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.

  41. Hybrid Methods • What knowledge can be employed? • lexical knowledge • corpus knowledge • ...

  42. Hybrid Methods (Continued) • Query Expansion • Issue 1: context • Pseudo relevance feedback (local feedback): A query is modified by the addition of terms found in the top retrieved documents. • Local context analysis: Queries are expanded by the addition of the top-ranked concepts from the top passages.
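Pseudo relevance feedback as described above can be sketched as follows. Ranking candidate terms by raw frequency in the top documents is a simplifying assumption of this sketch. Real systems typically use weighted term scores, and the function name, parameters, and toy documents are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def expand_query(query: Sequence[str],
                 ranked_docs: Sequence[Sequence[str]],
                 top_docs: int = 2,
                 extra_terms: int = 2) -> List[str]:
    """Add the most frequent new terms from the top retrieved documents."""
    counts: Counter = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(term for term in doc if term not in query)
    # Sort by descending count, then alphabetically so ties are deterministic.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:extra_terms]
    return list(query) + [term for term, _ in best]


# Toy ranked result list (already tokenized); only the top two documents are used.
docs = [
    ["civil", "unrest", "protest", "police"],
    ["unrest", "protest", "strike"],
    ["football", "match"],
]
expanded = expand_query(["civil", "unrest"], docs)
print(expanded)
```

The same function can be applied to the source-language query before translation or to the target-language query after translation, depending on which collection the top documents come from.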

  43. Hybrid Methods (Continued) • Issue 2: when • before query translation • after query translation

  44. Hybrid Methods (Continued) • Ballesteros & Croft 1997 [Diagram: Original Spanish TREC Queries → human translation → English (BASE) Queries. Pre-translation path: query expansion → English Queries → automatic dictionary translation → Spanish Queries. Post-translation path: automatic dictionary translation → Spanish Queries → query expansion → Spanish Queries. The Spanish queries are then run on INQUERY.]

  45. Hybrid Methods (Continued) • Performance Evaluation • Pre-translation: MRD (0.0823) vs. LF (0.1099, +33.5%) vs. LCA10 (0.1139, +38.5%) • Post-translation: MRD (0.0823) vs. LF (0.0916, +11.3%) vs. LCA20 (0.1022, +24.1%) • Combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242, +51.0%) vs. LCA20 (0.1358, +65.0%) • 32% below a monolingual baseline

  46. Cross-Language Evaluation Forum • A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST) • Extension of the CLIR track at TREC (1997-1999)

  47. Main Goals • Promote research in cross-language system development for European languages by providing an appropriate infrastructure for: • CLIR system evaluation, testing and tuning • Comparison and discussion of results

  48. CLEF 2000 Task Description • Four evaluation tracks in CLEF 2000 • multilingual information retrieval • bilingual information retrieval • monolingual (non-English) information retrieval • domain-specific IR

  49. Case Study: CLIR for NPDM

  50. 3M in Digital Libraries/Museums • Multi-media • Selecting suitable media to represent contents • Multi-linguality • Decreasing the language barriers • Multi-culture • Integrating multiple cultures
