A merging strategy proposal: The 2-step retrieval status value method

A merging strategy proposal: The 2-step retrieval status value method • Fernando Martínez-Santiago · L. Alfonso Ureña-López · Maite Martín-Valdivia • Department of Computer Science, University of Jaén, Jaén, Spain • Inf Retrieval (2006) 9: 71–93

Merging problem • [Figure: a query is run against the collections for Language 1, Language 2 and Language 3, giving one result list per language: d11 d12 d13 …, d21 d22 d23 …, d31 d32 d33 …] • A merge strategy combines these into a single result list, e.g. d31 d32 d21 d11 d12 d23 d13 …

Traditional solution • Round-Robin • Language 1 list: d11 d12 d13 … • Language 2 list: d21 d22 d23 … • Language 3 list: d31 d32 d33 … • Merge: d11 d21 d31 d12 d22 d32 … • Raw scoring • Normalized scoring • 1) • 2)
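The three traditional strategies listed above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and because the two normalization formulas on the slide were not recovered, the sketch assumes min-max normalization, which may differ from the formulas the authors used.

```python
from itertools import zip_longest


def round_robin(ranked_lists):
    """Interleave ranked lists: rank 1 of every list, then rank 2, and so on."""
    merged = []
    for tier in zip_longest(*ranked_lists):
        merged.extend(doc for doc in tier if doc is not None)
    return merged


def raw_score_merge(scored_lists):
    """Pool all (doc, score) pairs and sort by the original, unmodified score."""
    pooled = [pair for lst in scored_lists for pair in lst]
    return [doc for doc, _ in sorted(pooled, key=lambda p: p[1], reverse=True)]


def min_max_normalize(scored):
    """ASSUMPTION: min-max scaling to [0, 1]; not taken from the slides."""
    if not scored:
        return []
    scores = [s for _, s in scored]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return [(doc, (s - lo) / span) for doc, s in scored]


def normalized_merge(scored_lists):
    """Normalize each list separately, then merge by normalized score."""
    return raw_score_merge([min_max_normalize(lst) for lst in scored_lists])


lists = [["d11", "d12", "d13"], ["d21", "d22", "d23"], ["d31", "d32", "d33"]]
print(round_robin(lists))
```

Raw scoring only makes sense when the scores of the monolingual collections are comparable, which is the weakness that normalization and the later methods try to address.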

Traditional solution • Logistic regression (Calvé and Savoy (2000), Savoy (2003a)) • LVQ neural networks (Martín et al. 2003)

2-step retrieval status value method • Step 1: • Translating the query and searching each monolingual collection produces two results: • a set of concepts T′, each consisting of a query term together with its corresponding translations • a multilingual collection D′, the union of the 1000 documents retrieved for each language.

2-step retrieval status value method • Step 2: • Re-index D′, considering solely the T′ vocabulary. • Given a concept, its document frequency is obtained by grouping together the document frequencies of the terms that make up the concept.

2-step retrieval status value method • For example: • The Spanish word “casa” translates to the English words “house” and “home”. • Given a document, the term frequency is calculated as usual, and the document frequency is the sum of the document frequencies of “casa”, “house” and “home”.
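The concept statistics used in the second step can be illustrated with a small Python sketch. The function name and the toy documents are hypothetical, and the sketch assumes that a concept's term frequency in a document counts occurrences of any of its member terms; the document frequency follows the slide and sums the per-term document frequencies.

```python
from collections import Counter


def concept_statistics(docs, concept):
    """Compute per-document tf and collection-wide df for one concept.

    docs: dict mapping doc_id -> list of tokens (the merged collection D').
    concept: set of terms, i.e. a query term plus its translations.
    """
    # Document frequency of each member term inside D'.
    df_terms = Counter()
    for tokens in docs.values():
        for term in concept & set(tokens):
            df_terms[term] += 1
    # As on the slide: the concept df is the SUM of the member-term dfs,
    # so a document containing two member terms contributes twice.
    concept_df = sum(df_terms.values())
    # ASSUMPTION: concept tf = occurrences of any member term in the document.
    concept_tf = {
        doc_id: sum(1 for t in tokens if t in concept)
        for doc_id, tokens in docs.items()
    }
    return concept_tf, concept_df


toy_docs = {
    "es1": ["la", "casa", "casa"],
    "en1": ["my", "house", "home"],
    "en2": ["home", "alone"],
}
tf, df = concept_statistics(toy_docs, {"casa", "house", "home"})
print(tf, df)
```

With these statistics every document in D′ can be rescored against the same concept vocabulary, so the resulting scores are comparable across languages.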

Mixed 2-step RSV • Non-aligned words • Raw mixed 2-step RSV method • For a given term τ_ij (term j in monolingual collection i), the document frequency value will be: • the same as in the 2-step method, if τ_ij is aligned; • the initial weight from the first step of the method, if the translation of τ_ij into the other languages is unknown. • RSV_i = α · RSV_i^align + (1 − α) · RSV_i^nonalign • α = 0.75
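The raw mixed combination is a plain linear interpolation of the two scores. A minimal sketch, with a hypothetical function name and made-up example scores:

```python
def mixed_rsv(rsv_align, rsv_nonalign, alpha=0.75):
    """Combine the aligned and non-aligned scores of one document.

    alpha weights the score computed from aligned terms; the slide uses 0.75.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * rsv_align + (1.0 - alpha) * rsv_nonalign


# Made-up scores, for illustration only.
print(mixed_rsv(8.0, 4.0))  # 0.75 * 8.0 + 0.25 * 4.0 = 7.0
```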

Mixed 2-step RSV • Normalized mixed 2-step RSV method • α = 0.75

Mixed 2-step RSV • Learning-based algorithms • Logistic regression • α, β1, β2 and β3 must be estimated using the iteratively re-weighted least squares method • LVQ neural network (Martín et al. 2003)

Use machine translation to align words • P_en = “Pesticides in baby food” • Unigrams P_en = {Pesticides, baby, food} • Bigrams P_en = {Pesticides baby, baby food} • The expression submitted for translation is: • EXP_en = {Pesticides in baby food} {Pesticides, baby, food} {Pesticides baby, baby food} • Then we have: • P_sp = {Pesticidas alimento niños} • Unigrams P_sp = {Pesticidas, bebé, alimento} (Unigrams P_sp is the translation of Unigrams P_en) • Bigrams P_sp = {Pesticidas bebés, alimento niños} (Bigrams P_sp is the translation of Bigrams P_en)

Use machine translation to align words • For each word_i^sp ∈ Unigrams P_sp do • (a) if word_i^sp ∈ P_sp, then remove word_i^sp from P_sp and add (word_i^sp, word_i^en) to the set of aligned words ALIGNED • Thus, we obtain: • P_sp = {niños} • ALIGNED = {(pesticidas, pesticides), (alimento, food)}

Use machine translation to align words • For each bigram_i^sp ∈ Bigrams P_sp, with words (word_1^sp, word_2^sp) and English counterpart (word_1^en, word_2^en): • (a) if (word_1^sp, word_1^en) ∈ ALIGNED (word_1^sp is aligned with word_1^en) and word_2^sp ∈ P_sp, then remove word_2^sp from P_sp and add (word_2^sp, word_2^en) to the ALIGNED set. • (b) if (word_1^sp, word_2^en) ∈ ALIGNED and word_2^sp ∈ P_sp, then remove word_2^sp from P_sp and add (word_2^sp, word_1^en) to the ALIGNED set.

Use machine translation to align words • (c) if (word_2^sp, word_1^en) ∈ ALIGNED and word_1^sp ∈ P_sp, then remove word_1^sp from P_sp and add (word_1^sp, word_2^en) to the ALIGNED set. • (d) if (word_2^sp, word_2^en) ∈ ALIGNED and word_1^sp ∈ P_sp, then remove word_1^sp from P_sp and add (word_1^sp, word_1^en) to the ALIGNED set. • P_sp = ∅ • ALIGNED = {(pesticidas, pesticides), (alimento, food), (niños, baby)}
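The unigram pass and the four bigram rules (a) to (d) can be sketched in Python as follows. The function and parameter names are hypothetical, the sketch assumes that the i-th translated unigram or bigram corresponds to the i-th source unigram or bigram, and it stores ALIGNED as a dictionary (one English word per Spanish word) as a simplification.

```python
def align(p_sp, unigrams_en, unigrams_sp, bigrams_en, bigrams_sp):
    """Return (ALIGNED as a set of (sp, en) pairs, unaligned words left in P_sp)."""
    remaining = {w.lower() for w in p_sp}
    aligned = {}  # Spanish word -> English word

    # Unigram pass: a translated unigram that occurs in P_sp is aligned
    # with the source unigram it was translated from.
    for sp, en in zip(unigrams_sp, unigrams_en):
        sp, en = sp.lower(), en.lower()
        if sp in remaining:
            remaining.discard(sp)
            aligned[sp] = en

    # Bigram pass: if one word of a bigram is already aligned, the other
    # word is aligned with the remaining English word of the bigram.
    for (sp1, sp2), (en1, en2) in zip(bigrams_sp, bigrams_en):
        sp1, sp2, en1, en2 = (w.lower() for w in (sp1, sp2, en1, en2))
        if aligned.get(sp1) == en1 and sp2 in remaining:    # rule (a)
            new = (sp2, en2)
        elif aligned.get(sp1) == en2 and sp2 in remaining:  # rule (b)
            new = (sp2, en1)
        elif aligned.get(sp2) == en1 and sp1 in remaining:  # rule (c)
            new = (sp1, en2)
        elif aligned.get(sp2) == en2 and sp1 in remaining:  # rule (d)
            new = (sp1, en1)
        else:
            continue
        remaining.discard(new[0])
        aligned[new[0]] = new[1]

    return set(aligned.items()), remaining


pairs, left = align(
    p_sp=["Pesticidas", "alimento", "niños"],
    unigrams_en=["Pesticides", "baby", "food"],
    unigrams_sp=["Pesticidas", "bebé", "alimento"],
    bigrams_en=[("Pesticides", "baby"), ("baby", "food")],
    bigrams_sp=[("Pesticidas", "bebés"), ("alimento", "niños")],
)
print(pairs, left)
```

On the slide's example, “niños” is aligned with “baby” through rule (b), because “alimento” is already aligned with “food”, the second word of the English bigram.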

Method conclusion • Fully aligned words • 2-step method • Partially aligned words • Raw mixed 2-step RSV • Normalized mixed 2-step RSV • Logistic regression mixed 2-step RSV • Neural network mixed 2-step RSV • Algorithm to align phrases and their translations

Experiment • Documents • CLEF 2003 has two tasks, CLEF 2003-8 and CLEF 2003-4. CLEF 2003-4 is limited to four languages (English, French, German and Spanish) • Queries (Title + Description)

Experiment • The collections are indexed with the Zprise IR system, using the OKAPI probabilistic model (fixed at b = 0.75 and k1 = 1.2) • Translation strategies • Machine Readable Dictionary (MRD, Babylon) • picking either the first translation available (under the heading “Babylon 1”) or the first two terms (under the label “Babylon 2”) • Machine Translation (MT, Babelfish) • Mixed MT and MRD • combining the Babelfish and Babylon 1 translations.
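To show where the parameters b and k1 enter the OKAPI model, here is a sketch of a common textbook form of the BM25 term weight. The exact variant implemented by Zprise is not given in the slides, so the formula, the idf definition and the function name are assumptions.

```python
import math


def bm25_term_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 weight of one term in one document (assumed variant).

    tf: term frequency in the document; df: documents containing the term;
    n_docs: collection size; doc_len / avg_doc_len: document length ratio.
    """
    # Robertson/Sparck Jones idf; negative for terms in over half the documents.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly long documents are penalized.
    length_norm = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    # k1 controls how quickly the term-frequency contribution saturates.
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


print(bm25_term_weight(tf=1, df=2, n_docs=10, doc_len=100, avg_doc_len=100))
```

For a document of average length and tf = 1 the term-frequency factor equals 1, so the weight reduces to the idf.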

Experiment 1 – multilingual results with fully aligned queries

Experiment 1 – analysis of failures • Too many documents from the Spanish collection for this query

Experiment 1 – analysis of failures

Experiment 2 – multilingual results with partially aligned queries • Based on the MRD translation approach

Experiment 2 – multilingual results with partially aligned queries • Based on the MT translation approach • with the CLEF 2001–2002 test collection and the CLEF 2001 + CLEF 2002 + CLEF 2003 query set (160 queries, five languages: EN, SP, DE, FR, IT)

Conclusion • Future work • Dealing with translation probabilities. • Testing the method with other translation strategies, such as the Multilingual Similarity Thesaurus. • n-gram indexing. • Continuing to study strategies for dealing with queries that mix aligned and non-aligned terms: integrating both sorts of terms by means of Bayesian networks.
