Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval


Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval. Doctorate Course Web Information Retrieval Speaker Gaia Trecarichi. Goal. Outline. What is Multilingual Information Retrieval (MLIR). Basic Approaches to MLIR. Resource Requirements for MLIR.


Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval

Doctorate Course

Web Information Retrieval


Gaia Trecarichi




Outline

  • What is Multilingual Information Retrieval (MLIR)
  • Basic Approaches to MLIR
  • Resource Requirements for MLIR
  • Xerox Experimental Approach
  • Experimental Results
  • Detailed Query Analysis
  • Sample Query Profile
  • Conclusions and Future Extensions



Goal

  • To explore the most important factors in making MLIR effective
  • Not to build a fully functional MLIR system (too much time and resources needed)
5 Definitions for MLIR

  • IR in any language other than English
  • IR on a parallel document collection or on a multilingual document collection where the search space is restricted to the query language
  • IR on a monolingual document collection which can be queried in multiple languages
  • IR on a multilingual document collection, where queries can retrieve documents in multiple languages
  • IR on multilingual documents, i.e. more than one language can be present in the individual documents

Basic Approaches to MLIR

  • IR systems rank documents according to statistical similarity measures based on the co-occurrence of terms in queries and documents
  • Mechanism for query or document translation
    • Query translation is easier but doesn’t provide much context
    • Document translation could be better but is costly (time, storage resources)
  • Techniques for the problem of interlingual term correspondence
    • Text Translation
    • Term Vector Translation
    • Latent Semantic Coindexing
Text Translation

  • High-end approach to MLIR (NLP and text generation techniques)
  • Direct mapping of the query from the source language into one or more target languages by using an MT system
  • Direct resolution of ambiguity by using structural information from the source language text
  • PRO
    • Extensive body of research on MT
    • Commercial products available
  • CONS
    • Low performance of current MT systems [Radwan, 1994]
Term Vector Translation

  • Direct mapping of each word in the query written in the source language into all of its possible definitions in the target languages
  • Uses transfer dictionaries or a parallel aligned corpus for the direct mapping
  • Vector Space Models can be used as retrieval strategies
  • Issues related to term weighting strategies
    • Should each term be weighted according to the number of translations?
    • Should more common translations be weighted proportionally higher?
    • What resources do we use to obtain this information?
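
The weighting questions above can be made concrete with a small sketch. It maps each source term to all of its dictionary translations and compares two possible schemes: giving every translation weight 1, or splitting a unit weight evenly over a term's translations. The dictionary entries, the function name `translate_term_vector`, and the `normalize` flag are illustrative assumptions, not part of the presented system.

```python
from collections import defaultdict

# Toy French -> English transfer dictionary (illustrative entries only).
TRANSFER_DICT = {
    "intention": ["intention", "benefit"],
    "interpretation": ["interpretation"],
    "constitution": ["formation", "settlement", "constitution"],
}

def translate_term_vector(source_terms, dictionary, normalize=False):
    """Map source-language terms to a weighted target-language term vector.

    normalize=False: every translation gets weight 1 (plain concatenation).
    normalize=True:  each source term spreads a total weight of 1 evenly
                     across its translations.
    Terms missing from the dictionary are passed through unchanged.
    """
    vector = defaultdict(float)
    for term in source_terms:
        translations = dictionary.get(term, [term])
        weight = 1.0 / len(translations) if normalize else 1.0
        for target in translations:
            vector[target] += weight
    return dict(vector)

query = ["intention", "interpretation", "constitution"]
flat = translate_term_vector(query, TRANSFER_DICT)
split = translate_term_vector(query, TRANSFER_DICT, normalize=True)
```

With the flat scheme an ambiguous term such as "constitution" contributes three times as much weight to the query as an unambiguous one; the split scheme keeps every source term at the same total weight. Weighting common translations higher would need frequency information that a plain dictionary does not provide.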
Latent Semantic Coindexing

  • Indirect derivation of the query translation by using a training corpus
  • Creates a reduced-dimension semantic space in which related terms are near each other
  • Uses Singular Value Decomposition of a parallel document collection to obtain term vector representations
  • Term vector representations are comparable across all the languages of the collection (documents are represented as language-independent numerical vectors)
  • A query can retrieve a relevant document even if they have no words in common
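
A minimal sketch of the idea, assuming NumPy is available and using a tiny invented parallel collection: each training document is an English text concatenated with its French translation, the term-document matrix is decomposed with SVD, and monolingual texts are then folded into the reduced space. The training texts, the choice `k = 2`, and the helper names are assumptions made for illustration only.

```python
import numpy as np

# Each training "document" is a text concatenated with its translation.
training_docs = [
    "wine vineyard vin vignoble",
    "wine grape vin raisin",
    "law court loi tribunal",
    "law amendment loi amendement",
]

vocab = sorted({w for doc in training_docs for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(text):
    """Term-count vector over the training vocabulary (unknown words ignored)."""
    vec = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            vec[index[w]] += 1.0
    return vec

# Term-document matrix (terms x documents) and its truncated SVD.
A = np.column_stack([bag_of_words(d) for d in training_docs])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

def fold_in(text):
    """Project a monolingual query or document into the k-dimensional space."""
    return bag_of_words(text) @ U_k / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A French query against English documents that share no words with it.
sim_wine = cosine(fold_in("vin"), fold_in("wine vineyard"))
sim_law = cosine(fold_in("vin"), fold_in("law court"))
```

In this toy setting the French query "vin" ends up close to the English wine document and far from the legal one, although it shares no surface term with either.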
LSI vs Standard Vector Model

  • Standard Vector Model
    • Treats words as if they are independent
    • Represents documents as linear combinations of orthogonal terms
  • LSI
    • Term-term inter-relationships are automatically modeled and used to improve retrieval by numerically analysing existing texts (no need for external dictionaries, thesauri or knowledge bases)
    • Represents terms as continuous values on each of the k orthogonal indexing dimensions
Resource Requirements

  • Support for the character set of each language is needed
  • Facilities for automatic language recognition
  • Morphological Analyzer (PoS recognition, stemming algorithms, inflectional analyzers)
    • Ex: the German word Weingärtnergenossenschaften is analyzed as the feminine plural noun Wein#Gärtner#Genosse(n)#schaft
    • Crucial to find term entries in bilingual dictionaries
  • Resources for query translation
    • Machine Translation System
    • Transfer Dictionaries
    • Parallel texts and/or monolingual domain-specific corpora
Resources for Query Translation

  • MT System
    • For direct query translation
  • Transfer dictionaries (Bilingual Thesauri)
    • For direct term vector translation
    • Extracted from bilingual general dictionaries which include lots of “noise” vocabulary
  • Parallel Texts
    • To extract relationships between terms for term vector translation or to get indirect query translation (e.g. LSI)
  • Domain-specific monolingual corpora
    • Source of terminology to be used when parallel texts are not available
Transfer Dictionaries vs Parallel Texts

  • Transfer Dictionaries
    • Conversion from bilingual dictionaries is a non-trivial effort
    • Translation probabilities are not available
    • Most technical terminology is missing
    • Provide broad but shallow coverage of the language
  • Parallel Corpora
    • Needed in large quantity to train statistical models of great sophistication
    • Generate term translation vectors with probabilities [Brown, 1993]
    • Provide narrow but deep coverage (probabilities are domain specific)
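
To illustrate how translation probabilities can be estimated from parallel text, here is a sketch of expectation-maximization in the style of IBM Model 1, without a NULL word. Whether this matches the exact model intended by the citation above is an assumption; the three sentence pairs and the function name `train_model1` are invented for the example.

```python
from collections import defaultdict

def train_model1(sentence_pairs, iterations=10):
    """Estimate t(target_word | source_word) by EM over aligned sentence pairs.

    sentence_pairs: list of (source_tokens, target_tokens).
    Returns a dict keyed by (target_word, source_word).
    """
    source_vocab = {s for src, _ in sentence_pairs for s in src}
    target_vocab = {t for _, tgt in sentence_pairs for t in tgt}
    # Start from a uniform distribution over target words.
    t = {(tw, sw): 1.0 / len(target_vocab)
         for sw in source_vocab for tw in target_vocab}
    for _ in range(iterations):
        count = defaultdict(float)   # expected (target, source) co-translations
        total = defaultdict(float)   # expected count per source word
        for src, tgt in sentence_pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # Re-estimate probabilities from the expected counts.
        for (tw, sw) in t:
            t[(tw, sw)] = count[(tw, sw)] / total[sw]
    return t

pairs = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("une fleur".split(), "a flower".split()),
]
model = train_model1(pairs)
```

Even on three sentences the estimates drift toward the intended pairings, but the resulting probabilities only reflect the training domain, which is the "narrow but deep" trade-off noted above.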
Xerox Experimental Approach 1

  • Evaluation in Multilingual IR
    • Uses queries with known relevance judgments
    • Start with queries, documents, and relevance judgments in a single language
    • The queries are translated into another language by human translators
    • The translated queries are retranslated by the MLIR system
    • Retrieval results for the retranslated queries are compared to those for the original queries to get a good sense of the relative performance of the MLIR system
Xerox Experimental Approach 2

  • Experimental Setting
    • Translated French queries and English documents
    • TIPSTER text collection and queries 51-100 from TREC experiments [Harman, 1995]
    • Term vector translation model
    • Bilingual Transfer Dictionary to generate the model
    • Short version of queries (average length of 7 words)
    • Conversion of an online bilingual French => English dictionary to a WORD-BASED transfer dictionary suitable for text retrieval
Xerox Experimental Approach 3

  • MLIR Process
    • The query is morphologically analyzed and each term is replaced by its inflectional root
    • Each root is looked up in the bilingual transfer dictionary, and the translated query is built by concatenating all term translations
    • The translated query is sent to a traditional monolingual IR system
  • Notes
    • Specialized term weighting and resolving ambiguity in translation are ignored
    • Vector Space Model is used to measure similarity between query and each document
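
The three process steps can be sketched end to end as follows. Everything concrete here is a stand-in: `ROOTS` is a lookup table playing the role of the morphological analyzer, `TRANSFER_DICT` and `DOCUMENTS` are invented, and raw term counts with cosine similarity stand in for whatever weighting the monolingual IR system actually applied.

```python
import math
from collections import Counter

# Hypothetical inflected form -> root table (stands in for a morphological analyzer).
ROOTS = {"lois": "loi", "vins": "vin"}
# Hypothetical word-based French -> English transfer dictionary.
TRANSFER_DICT = {"loi": ["law", "act", "statute"], "vin": ["wine"]}
# Toy English document collection.
DOCUMENTS = {
    "d1": "new wine law passed",
    "d2": "the act of reading",
    "d3": "football results",
}

def analyze(query):
    """Step 1: replace each query term by its inflectional root."""
    return [ROOTS.get(w, w) for w in query.lower().split()]

def translate(roots):
    """Step 2: concatenate all dictionary translations of every root."""
    translated = []
    for root in roots:
        translated.extend(TRANSFER_DICT.get(root, [root]))
    return translated

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def search(query):
    """Step 3: rank documents against the translated query."""
    q = Counter(translate(analyze(query)))
    scores = {doc_id: cosine(q, Counter(text.split()))
              for doc_id, text in DOCUMENTS.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = search("lois vins")
```

Document d2 receives a non-zero score only because "act" is one of the translations of "loi", which shows how unresolved ambiguity adds noise when all translations are concatenated.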
Experimental Results

  • Comparing the original English queries to three retranslations generated by different versions of the transfer dictionary
  • Three transfer dictionary versions: automatic word-based, manual word-based, and manual multi-word transfer dictionary
  • Average precision at 5, 10, 15, and 20 documents retrieved for the original English queries and the translations given by the different transfer dictionaries (TD)
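
For reference, precision at a cutoff k is the fraction of the top k retrieved documents that are relevant, and the figure reported per cutoff is its mean over all queries. A minimal sketch of that computation follows; the function names and the two toy runs are assumptions and do not reproduce the reported results.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant.

    Divides by k even when fewer than k documents were retrieved.
    """
    top_k = ranked_doc_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def mean_precision_at_cutoffs(runs, cutoffs=(5, 10, 15, 20)):
    """Average precision@k over queries, for each cutoff k.

    runs: list of (ranked_doc_ids, relevant_ids), one entry per query.
    """
    return {
        k: sum(precision_at_k(ranking, relevant, k)
               for ranking, relevant in runs) / len(runs)
        for k in cutoffs
    }

# Two toy queries over the same ranked list d1..d20.
ranked = [f"d{i}" for i in range(1, 21)]
runs = [
    (ranked, {"d1", "d3", "d6", "d20"}),
    (ranked, {"d2"}),
]
averages = mean_precision_at_cutoffs(runs)
```

Running the same function on the original and on each set of retranslated queries gives directly comparable numbers per cutoff.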
Detailed Query Analysis 1

  • Comparison of the performance of the translated (Tr) and original (Orig) English queries. Values given are the number of queries in each category
  • Performance improves as more manual effort is applied to the dictionary construction process
  • Some queries perform much better in their translated versions
Detailed Query Analysis 2

  • Detailed Failure Analysis
    • Carried out on the worst 17 queries when using the word-based dictionary
    • 9 queries lost information as a result of the failure to translate multi-word expressions correctly, 8 had problems due to ambiguity in translation (i.e. extraneous definitions added to query), and 4 suffered from a loss in retranslation (meaning decays with repeated translations); the counts sum to more than 17, so some queries fall into more than one category
  • Recognizing and translating multi-word expressions is crucial to success in MLIR (in contrast to monolingual IR)
    • Individual components of phrases often have very different meanings in translation, so the entire sense of the phrase is often lost
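
One common way to handle multi-word expressions in a dictionary lookup is to try the longest matching phrase first and fall back to word-by-word translation. The sketch below assumes a separate phrase dictionary keyed by token tuples; the entries and the `max_len` limit are invented for illustration and are not taken from the dictionaries used in the experiments.

```python
# Hypothetical word-based and multi-word French -> English entries.
WORD_DICT = {
    "pomme": ["apple"],
    "de": ["of", "from"],
    "terre": ["earth", "land", "ground"],
}
PHRASE_DICT = {("pomme", "de", "terre"): ["potato"]}

def translate(tokens, word_dict, phrase_dict, max_len=4):
    """Translate tokens, preferring the longest matching multi-word entry."""
    out = []
    i = 0
    while i < len(tokens):
        # Try phrases of decreasing length starting at position i.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            key = tuple(tokens[i:i + n])
            if key in phrase_dict:
                out.extend(phrase_dict[key])
                i += n
                break
        else:
            # No phrase matched: fall back to single-word lookup.
            out.extend(word_dict.get(tokens[i], [tokens[i]]))
            i += 1
    return out

tokens = ["pomme", "de", "terre"]
word_based = translate(tokens, WORD_DICT, {})
phrase_aware = translate(tokens, WORD_DICT, PHRASE_DICT)
```

The word-based result contains six loosely related terms and never the intended one, while the phrase-aware lookup yields the single correct translation.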
Sample Query Profile 1

  • English: original intent or interpretation of amendments to the U.S. Constitution
  • French: l’intention première ou une interprétation d’un amendement de la constitution des USA
  • Term vector retranslation
    • intention - intention benefit
    • premier - first initial bottom early front top leading basic primary original
    • interpretation - interpretation
    • amendement - amendment enrichment enriching agent
    • constitution - formation settlement constitution
    • USA - USA

Sample Query Profile 2

  • The decay in performance of query 76 from the original English (orig Eng) to the translated English (trans Eng) is due to translation ambiguity (TA) and loss in retranslation (LR)
Future Extensions

  • Two primary sources of error in the current MLIR system
    • Missing translations of multi-word expressions and unresolved ambiguity in word-based translation
  • Additional loss-in-retranslation errors are due to the experimental design and cannot be avoided (i.e. the ambiguity introduced by the human translator)
  • Improving automatically generated transfer dictionaries
    • Extracting MWE (gathering terminology lists from various specialized domains, performing terminology extraction from corpora)
    • Resolving ambiguity (using target language texts, term weighting strategies, user interactive tools)
  • Using models other than the vector space model (e.g. weighted boolean model)
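
As one possible reading of "using target language texts" to resolve ambiguity, the sketch below keeps, for each source term, the candidate translation that co-occurs most often in a target-language corpus with the candidate translations of the other query terms. This heuristic, the tiny corpus, and the function names are assumptions for illustration; the presentation does not specify a method.

```python
def cooccurrence(a, b, corpus):
    """Number of target-language documents containing both words."""
    return sum(1 for doc in corpus if a in doc and b in doc)

def disambiguate(candidates, corpus):
    """Pick one translation per source term by target-corpus co-occurrence.

    candidates: dict mapping source term -> list of candidate translations.
    corpus: list of sets of target-language words (one set per document).
    Ties are broken in favor of the first-listed candidate.
    """
    chosen = {}
    for term, options in candidates.items():
        others = [c for t, opts in candidates.items() if t != term for c in opts]
        chosen[term] = max(
            options,
            key=lambda c: sum(cooccurrence(c, o, corpus) for o in others),
        )
    return chosen

candidates = {
    "amendement": ["amendment", "enrichment"],
    "constitution": ["formation", "settlement", "constitution"],
}
corpus = [
    set("the amendment to the constitution was ratified".split()),
    set("soil enrichment improves crops".split()),
    set("cloud formation over the settlement".split()),
    set("a constitution amendment debate".split()),
]
selected = disambiguate(candidates, corpus)
```

On this toy corpus the legal senses reinforce each other and the agricultural and geographic senses are dropped; a real system would need a much larger corpus and a fallback for terms with no co-occurrence evidence.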