
Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval. Doctorate Course Web Information Retrieval Speaker Gaia Trecarichi. Goal. Outline. What is Multilingual Information Retrieval (MLIR). Basic Approaches to MLIR. Resource Requirements for MLIR.



Presentation Transcript


  1. Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval • Doctorate Course: Web Information Retrieval • Speaker: Gaia Trecarichi

  2. Outline • Goal • What is Multilingual Information Retrieval (MLIR) • Basic Approaches to MLIR • Resource Requirements for MLIR • Xerox Experimental Approach • Experimental Results • Detailed Query Analysis • Sample Query Profile • Conclusions and Future Extensions

  3. Goal • IS: to explore the most important factors in making MLIR effective • IS NOT: to build a fully functional MLIR system (too much time and too many resources needed)

  4. 5 Definitions for MLIR • IR in any language other than English • IR on a parallel document collection or on a multilingual document collection where the search space is restricted to the query language • IR on a monolingual document collection which can be queried in multiple languages • IR on a multilingual document collection, where queries can retrieve documents in multiple languages • IR on multilingual documents, i.e. more than one language can be present in the individual documents

  5. Basic Approaches to MLIR • IR systems rank documents according to statistical similarity measures based on the co-occurrence of terms in queries and documents • Mechanisms for query or document translation: query translation is easier but doesn't provide much context; document translation could be better but is costly (time, storage resources) • Techniques for the problem of interlingual term correspondence: Text Translation, Term Vector Translation, Latent Semantic Coindexing

  6. Text Translation • High-end approach to MLIR (NLP and text generation techniques) • Direct mapping of the query from the source language into one or more target languages using an MT system • Direct resolution of ambiguity using structural information from the source-language text • PRO: extensive body of research on MT; commercial products available • CON: low performance of current MT systems [Radwan, 1994]

  7. Term Vector Translation • Direct mapping of each word in the query written in the source language into all of its possible definitions in the target languages • Uses transfer dictionaries or a parallel aligned corpus for the direct mapping • Vector space models can be used as retrieval strategies • Open issues related to term weighting strategies: Should each term be weighted according to the number of translations? Should more common translations be weighted proportionally higher? What resources do we use to obtain this information?
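
The first weighting question on this slide can be made concrete with a small sketch. It is only an illustration, not the method used in the talk: the function name, the `split_weight` flag, and the choice to keep untranslatable roots unchanged are assumptions, and the two dictionary entries are borrowed from the sample query profile later in the deck.

```python
from collections import defaultdict

# Toy transfer dictionary (source root -> target translations).
# The two entries mirror the sample query profile; everything else is hypothetical.
TOY_DICT = {
    "intention": ["intention", "benefit"],
    "constitution": ["formation", "settlement", "constitution"],
}


def translate_terms(roots, dictionary, split_weight=False):
    """Map each source root to all of its translations.

    split_weight=False: every translation gets weight 1.0, so a source
    term with many translations dominates the translated query.
    split_weight=True: each source term's unit weight is shared equally
    among its translations.
    """
    weights = defaultdict(float)
    for root in roots:
        # Assumption of this sketch: a root with no entry is kept as-is.
        translations = dictionary.get(root, [root])
        share = 1.0 / len(translations) if split_weight else 1.0
        for target in translations:
            weights[target] += share
    return dict(weights)
```

With `split_weight=True` every source term contributes the same total mass to the translated query regardless of how many translations it has, which is one possible answer to the first question; weighting common translations higher would need frequency information that a plain dictionary does not carry.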

  8. Latent Semantic Coindexing • Indirect derivation of the query translation using a training corpus • Creates a reduced-dimension semantic space in which related terms are near each other • Uses Singular Value Decomposition of a parallel document collection to obtain term vector representations • Term vector representations are comparable across all the languages of the collection (documents are represented as language-independent numerical vectors) • A query can retrieve a relevant document even if they have no words in common
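
The reduced-dimension space can be written down in the usual LSI notation. The symbols are not in the slides and are assumptions here: X is the term-by-document matrix built from the parallel training collection (each training document containing the text in both languages), k is the number of retained dimensions, and q is a query expressed as a term vector.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q
```

Under this reading, the rows of V_k are the language-independent document vectors, and the folded-in query \hat{q} is compared with them (for example by cosine similarity), which is why a query and a document can match without sharing any word.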

  9. LSI vs Standard Vector Model • Standard Vector Model: treats words as if they are independent; represents documents as linear combinations of orthogonal terms • LSI: term-term inter-relationships are automatically modeled and used to improve retrieval by numerically analysing existing texts (no need for external dictionaries, thesauri or knowledge bases); represents terms as continuous values on each of the k orthogonal indexing dimensions

  10. Resource Requirements • Support for the character set of each language • Facilities for automatic language recognition • Morphological analyzer (PoS recognition, stemming algorithms, inflectional analyzers), crucial for finding term entries in bilingual dictionaries; e.g. the German word Weingärtnergenossenschaften is analyzed as the feminine plural noun Wein#Gärtner#Genosse(n)#schaft • Resources for query translation: Machine Translation system, transfer dictionaries, parallel texts and/or monolingual domain-specific corpora

  11. Resources for Query Translation • MT system: for direct query translation • Transfer dictionaries (bilingual thesauri): for direct term vector translation; extracted from bilingual general dictionaries, which include a lot of "noise" vocabulary • Parallel texts: to extract relationships between terms for term vector translation or to obtain an indirect query translation (e.g. LSI) • Domain-specific monolingual corpora: source of terminology to be used when parallel texts are not available

  12. Transfer Dictionaries vs Parallel Texts • Transfer Dictionaries • Conversion from bilingual dictionaries is a non-trivial effort • Translation probabilities are not available • Most technical terminology is missing • Provide broad but shallow coverage of the language • Parallel Corpora • Needed in large quantity to train statistical models of great sophistication • Generate term translation vectors with probabilities [Brown, 1993] • Provide narrow but deep coverage (probabilities are domain specific)

  13. Xerox Experimental Approach 1: Evaluation in Multilingual IR • Uses queries with known relevance judgments • Starts with queries, documents, and relevance judgments in a single language • Human translators translate the queries into another language • The MLIR system retranslates the translated queries • Retrieval results for the retranslated queries are compared with those for the original queries to gauge the relative performance of the MLIR system

  14. Xerox Experimental Approach 2: Experimental Setting • Translated French queries and English documents • TIPSTER text collection and queries 51-100 from the TREC experiments [Harman, 1995] • Term vector translation model • Bilingual transfer dictionary used to generate the model • Short versions of the queries (average length of 7 words) • Conversion of an online bilingual French => English dictionary to a word-based transfer dictionary suitable for text retrieval

  15. Xerox Experimental Approach 3 • MLIR process: the query is morphologically analyzed and each term is replaced by its inflectional root; each root is looked up in the bilingual transfer dictionary, and the translated query is built by concatenating all term translations; the translated query is sent to a traditional monolingual IR system • Notes: specialized term weighting and resolution of translation ambiguity are ignored; the vector space model is used to measure similarity between the query and each document
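
The three-step process on this slide can be sketched in a few lines. This is a toy stand-in rather than the Xerox system: the lookup table `ROOTS` replaces a real morphological analyzer, `TRANSFER` holds a handful of entries taken from the sample query profile, raw term counts replace whatever term weighting the monolingual IR system applies, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stand-ins for the morphological analyzer and the
# bilingual transfer dictionary.
ROOTS = {"intentions": "intention"}
TRANSFER = {
    "intention": ["intention", "benefit"],
    "interpretation": ["interpretation"],
    "constitution": ["formation", "settlement", "constitution"],
}


def translate_query(query, roots, transfer):
    """Steps 1 and 2: reduce each token to its root, then concatenate
    all translations of every root (no weighting, no disambiguation)."""
    translated = []
    for token in query.lower().split():
        root = roots.get(token, token)
        # Assumption of this sketch: roots with no entry are dropped.
        translated.extend(transfer.get(root, []))
    return translated


def cosine(terms_a, terms_b):
    """Cosine similarity between two bags of words (raw term counts)."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(count * vb[term] for term, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_terms, docs):
    """Step 3: score every document against the translated query and
    return document ids from most to least similar."""
    scored = [(cosine(query_terms, text.lower().split()), doc_id)
              for doc_id, text in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

Because every translation is concatenated with equal weight, extraneous senses such as "benefit" or "settlement" enter the query alongside the intended ones, which is the ambiguity effect examined in the failure analysis.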

  16. Experimental Results • Comparison of the original English queries with three retranslations generated by different versions of the transfer dictionary (TD) • Three transfer dictionary versions: automatic word-based, manual word-based, and manual multi-word • Average precision at 5, 10, 15, and 20 documents retrieved for the original English queries and for the translations given by the different TDs
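
The measure named on this slide can be sketched as follows, assuming it means precision at a fixed cutoff averaged over all queries (one plausible reading of the slide, not something it states explicitly). The function names and the input layout are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def average_precision_at_k(runs, k):
    """Mean of precision@k over (ranking, relevant-set) pairs, one per query."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(precision_at_k(ranking, relevant, k)
               for ranking, relevant in runs) / len(runs)
```

Calling `average_precision_at_k(runs, k)` for k in (5, 10, 15, 20), once for the original queries and once for each retranslated set, would give the kind of comparison described above.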

  17. Detailed Query Analysis 1 • [Table: comparison of the performance of the translated (Tr) and original (Orig) English queries; values given are the number of queries in each category] • Performance improves as more manual effort is applied to the dictionary construction process • Some queries perform much better in their translated versions

  18. Detailed Query Analysis 2: Detailed Failure Analysis • Carried out on the 17 worst-performing queries when using the word-based dictionary • 9 queries lost information as a result of the failure to translate multi-word expressions correctly, 8 had problems due to ambiguity in translation (i.e. extraneous definitions added to the query), and 4 suffered from a loss in retranslation (meaning decays with repeated translations) • Recognizing and translating multi-word expressions is crucial to success in MLIR (in contrast to monolingual IR) • Individual components of phrases often have very different meanings in translation, so the entire sense of the phrase is often lost
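
The multi-word problem can be illustrated with a greedy longest-match lookup that tries phrase entries before falling back to single words. This is a sketch under stated assumptions, not the dictionary format used in the experiments: the function name, the `max_len` parameter, and the French example (which does not come from the slides) are all hypothetical.

```python
# Hypothetical toy dictionaries: single words and multi-word expressions.
WORDS = {
    "pomme": ["apple"],
    "de": ["of", "from"],
    "terre": ["earth", "land", "ground"],
}
MWES = {"pomme de terre": ["potato"]}


def translate_with_mwe(tokens, word_dict, mwe_dict, max_len=3):
    """Translate tokens, preferring the longest matching multi-word entry."""
    out = []
    i = 0
    while i < len(tokens):
        # Try phrases of decreasing length starting at position i.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in mwe_dict:
                out.extend(mwe_dict[phrase])
                i += n
                break
        else:
            # No phrase matched: fall back to word-by-word translation.
            out.extend(word_dict.get(tokens[i], []))
            i += 1
    return out
```

With the phrase entry present, "pomme de terre" yields only "potato"; without it, the word-based fallback produces six unrelated terms and the sense of the phrase is lost, which mirrors the failure mode reported on this slide.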

  19. Sample Query Profile 1 • English: original intent or interpretation of amendments to the U.S. Constitution • French: l’intention première ou une interpretation d’un amendment de la constitution des USA • Term vector retranslation: • intention - intention benefit • premier - first initial bottom early front top leading basic primary original • interpretation - interpretation • amendment - amendment enrichment enriching agent • constitution - formation settlement constitution • USA - USA

  20. Sample Query Profile 2 • The decay in performance of query 76 from the original English (orig Eng) to the translated English (trans Eng) due to translation ambiguity (TA) and loss in retranslation (LR)

  21. Conclusions and Future Extensions • Conclusions: two primary sources of error in the current MLIR system, namely missing translations of multi-word expressions and unresolved ambiguity in word-based translation; additional loss-in-retranslation errors are due to the experimental design and cannot be avoided (i.e. the ambiguity introduced by the human translator) • Future extensions: improving automatically generated transfer dictionaries; extracting MWEs (gathering terminology lists from various specialized domains, performing terminology extraction from corpora); resolving ambiguity (using target-language texts, term weighting strategies, user-interactive tools); using models other than the vector space model (e.g. a weighted Boolean model)

  22. THANK YOU!
