Iterative Translation Disambiguation forCross Language Information Retrieval

Christof Monz and Bonnie J. Dorr

Institute for Advanced Computer Studies

University of Maryland

SIGIR 2005

INTRODUCTION

  • Query translation requires access to some form of translation dictionary:
    • use a machine translation system to translate the entire query into the target language;
    • use a dictionary to produce a number of target-language translations for words or phrases in the source language;
    • use a parallel corpus to estimate the probability that w in the source language translates into w′ in the target language
  • Our approach does not require a parallel corpus to induce translation probabilities; it needs only:
    • a machine-readable dictionary (without any rankings or frequency statistics)
    • a monolingual corpus in the target language
TRANSLATION SELECTION
  • Translation ambiguity is very common
  • One option is to apply word-sense disambiguation, but:
    • for most languages the appropriate resources do not exist
    • word-sense disambiguation is a non-trivial enterprise in itself
  • Our approach uses co-occurrences between terms to model context for the problem of word selection
  • Ex.: source term s1 has the translation candidates t1,1, t1,2, t1,3 [the slide shows a diagram]
  • Computing co-occurrence statistics for a larger number of terms induces a data-sparseness issue; remedies include:
    • using very large corpora (e.g., the Web)
    • applying smoothing techniques
ITERATIVE DISAMBIGUATION
  • Only pairs of terms are examined, in order to gather partial evidence for the likelihood of a translation in a given context
  • Assume that ti,1 co-occurs more frequently with tj,1 than any other pair of candidate translations for si and sj
  • On the other hand, assume that ti,1 and tj,1 do not co-occur with tk,1 at all, but ti,2 and tj,2 do
  • Which pair should be preferred: ti,1 and tj,1, or ti,2 and tj,2?
  • Associate with each translation candidate a weight, where t is a translation candidate for si
  • Each term weight is recomputed from two inputs: the current weights of the terms that link to it, and the link weights WL(t,t′) between t and t′
  • Term weights are normalized after each iteration
  • The iteration stops when the change in term weights falls below some threshold (WT = the vector of all term weights; vk = the k-th element of the vector)
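The re-weighting loop described above can be sketched as follows. This is a minimal illustration, not the authors' code: the uniform initialization, the additive update from link-weighted neighbor weights, and the convergence test are assumptions reconstructed from the slide text.

```python
def iterative_disambiguation(candidates, link_weight, max_iters=20, eps=1e-4):
    """Iteratively re-weight translation candidates.

    candidates:  dict mapping each source term s to its list of
                 target-language candidates tr(s).
    link_weight: function (t, t') -> association strength between two
                 target terms (e.g., Dice or log-likelihood ratio).
    """
    # Assumed uniform initial weights: w0(t|s) = 1 / |tr(s)|
    w = {s: {t: 1.0 / len(ts) for t in ts} for s, ts in candidates.items()}

    for _ in range(max_iters):
        new_w = {}
        for s, ts in candidates.items():
            new_w[s] = {}
            for t in ts:
                # Partial evidence from the candidates of the *other*
                # source terms, scaled by their current weights and the
                # link weight WL(t, t').
                evidence = sum(
                    link_weight(t, t2) * w[s2][t2]
                    for s2, ts2 in candidates.items() if s2 != s
                    for t2 in ts2
                )
                new_w[s][t] = w[s][t] + evidence
            # Normalize so the weights of the candidates of s sum to 1.
            total = sum(new_w[s].values())
            if total > 0:
                new_w[s] = {t: v / total for t, v in new_w[s].items()}
        # Stop when the largest change in any term weight is below eps.
        delta = max(abs(new_w[s][t] - w[s][t])
                    for s in candidates for t in candidates[s])
        w = new_w
        if delta < eps:
            break
    return w
```

On a toy two-term query where only one candidate pair ever co-occurs, the weights of that pair grow toward 1 while the competing candidates decay, which is exactly the preference behavior the slides describe.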
  • There are a number of ways to compute the association strength between two terms:
    • mutual information (MI)
    • the Dice coefficient
    • the log-likelihood ratio
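All three measures can be computed from simple co-occurrence counts. The sketch below uses the standard textbook formulations (pointwise MI, Dice, and Dunning's log-likelihood ratio over a 2×2 contingency table); the slides do not give the exact variants used, so treat the details as assumptions.

```python
import math

def dice(n_ab, n_a, n_b):
    """Dice coefficient: 2*n(a,b) / (n(a) + n(b))."""
    return 2.0 * n_ab / (n_a + n_b) if (n_a + n_b) else 0.0

def pmi(n_ab, n_a, n_b, n):
    """Pointwise mutual information: log( P(a,b) / (P(a) * P(b)) )."""
    if n_ab == 0:
        return 0.0
    return math.log((n_ab * n) / (n_a * n_b))

def log_likelihood_ratio(n_ab, n_a, n_b, n):
    """Dunning's log-likelihood ratio for the 2x2 table built from
    co-occurrence counts; n is the total number of observations."""
    def ll(k, n_, p):
        # Binomial log-likelihood of k successes in n_ trials; the
        # guard returns the correct 0 limit when p is degenerate.
        if p <= 0 or p >= 1:
            return 0.0
        return k * math.log(p) + (n_ - k) * math.log(1 - p)
    k1, n1 = n_ab, n_a            # occurrences of b given a
    k2, n2 = n_b - n_ab, n - n_a  # occurrences of b given not-a
    p, p1, p2 = n_b / n, k1 / n1, k2 / n2
    return 2.0 * (ll(k1, n1, p1) + ll(k2, n2, p2)
                  - ll(k1, n1, p) - ll(k2, n2, p))
```

For a strongly associated pair the log-likelihood ratio is large and positive, while for a pair whose co-occurrence count matches the independence expectation it is (numerically) zero.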
EXPERIMENTAL SETUP
  • Test data
    • CLEF 2003 English-to-German bilingual data
    • Contains 60 topics, four of which were removed by the CLEF organizers because they have no relevant documents
    • Each topic has a title, a description, and a narrative field; for our experiments, we used only the title field to formulate the queries
  • Morphological normalization
    • Since the dictionary only contains base forms, the words in the topics must be mapped to their respective base forms as well
    • Compounds are very frequent in German
    • Instead of decompounding, we use character 5-grams, an approach that yields almost the same retrieval performance as decompounding
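Character 5-gram splitting, used above as a substitute for decompounding, can be sketched in a few lines (an illustration of the general technique, not the exact tokenizer from the paper):

```python
def char_ngrams(word, n=5):
    """Split a word into overlapping character n-grams.
    Words shorter than n characters are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# A German compound shares 5-grams with its parts, so a query term
# like "sturm" can match a document containing "schneesturm".
print(char_ngrams("schneesturm"))  # the last 5-gram is "sturm"
```

This is why 5-grams approximate decompounding: the grams of a compound overlap with the grams of its constituents, so no compound-splitting dictionary is needed.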
  • Example topics and intermediate results of the query formulation process [examples shown on slides]
  • Retrieval model: the Lnu.ltc weighting scheme
  • We used sl = 0.1; pv = the average number of unique words per document; uwd = the number of unique words in document d; w(i) = the weight of term i
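For concreteness, here is a sketch of the Lnu document-term weight, assuming the standard pivoted unique-length normalization of Singhal et al.; the variable names follow the slide (sl, pv, uwd), but the exact formula used in the paper may differ in detail.

```python
import math

def lnu_weight(tf, mean_tf, uw_d, pivot, slope=0.1):
    """Lnu document term weight (pivoted unique-length normalization):
      L: (1 + log(tf)) / (1 + log(mean_tf))
      n: no idf component
      u: divide by (1 - slope)*pivot + slope*uw_d

    tf      = frequency of the term in the document
    mean_tf = average term frequency in the document
    uw_d    = number of unique words in document d   (uwd)
    pivot   = average number of unique words per doc (pv)
    slope   = the sl parameter (0.1 in the experiments)
    """
    if tf == 0:
        return 0.0
    l = (1.0 + math.log(tf)) / (1.0 + math.log(mean_tf))
    u = (1.0 - slope) * pivot + slope * uw_d
    return l / u
```

The pivoted normalization means that, for the same term frequency, a document with many unique words receives a smaller weight, which counteracts the length bias of plain cosine normalization.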
EXPERIMENTAL RESULTS
  • Individual average precision decreases for a number of queries
    • 6% of all English query terms were not in the dictionary
    • Unknown words are treated as proper names, and the original word from the source language is included in the target-language query
    • Ex.: the word "Women" is falsely considered a proper noun; although faulty translations of this type affect both the baseline system and the run using term weights, the latter is affected more severely
RELATED WORK
  • Pirkola's approach does not consider disambiguation at all
  • Jang's approach uses MI to re-compute translation probabilities for cross-language retrieval
    • it only considers mutual information between consecutive terms in the query
    • it does not compute the translation probabilities in an iterative fashion
  • Adriani's approach is similar to Jang's
    • it does not benefit from using multiple iterations
  • Gao et al. use a decaying mutual-information score in combination with syntactic dependency relations
    • we did not consider distances between words
  • Maeda et al. compare a number of co-occurrence statistics with respect to their usefulness for improving retrieval effectiveness
    • they consider all pairs of possible translations of words in the query
    • they use co-occurrence information to select translations of words from the topic for query formulation, instead of re-weighting them
  • Kikui's approach
    • only needs a dictionary and monolingual resources in the target language
    • computes the coherence between all possible combinations of translation candidates of the source terms
CONCLUSIONS
  • We introduced a new algorithm for computing topic-dependent translation probabilities for cross-language information retrieval
  • We experimented with different term-association measures; experimental results show that the log-likelihood ratio has the strongest positive impact on retrieval effectiveness
  • An important advantage of our approach is that it only requires a bilingual dictionary and a monolingual corpus
  • An open issue is the handling of query terms that are not covered by the bilingual dictionary