A merging strategy proposal: The 2-step retrieval status value method

A merging strategy proposal: The 2-step retrieval status value method. Fernando Martínez-Santiago · L. Alfonso Ureña-López · Maite Martín-Valdivia. Department of Computer Science, University of Jaén, Jaén, Spain. Inf Retrieval (2006) 9: 71–93.

Presentation Transcript


  1. A merging strategy proposal: The 2-step retrieval status value method • Fernando Martínez-Santiago · L. Alfonso Ureña-López · Maite Martín-Valdivia • Department of Computer Science, University of Jaén, Jaén, Spain • Inf Retrieval (2006) 9: 71–93

  2. Merging problem • A query is run against each monolingual collection (Language 1, Language 2, Language 3), giving one result list per language: d11 d12 d13 …; d21 d22 d23 …; d31 d32 d33 … • A merge strategy combines them into a single result list, e.g. d31 d32 d21 d11 d12 d23 d13 …

  3. Traditional solution • Round-robin • Language 1 list: d11 d12 d13 … • Language 2 list: d21 d22 d23 … • Language 3 list: d31 d32 d33 … • Merge → d11 d21 d31 d12 d22 d32 … • Raw scoring • Normalized scoring • 1) • 2)
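The round-robin and raw-scoring baselines named on this slide can be sketched as follows. This is a minimal illustration, not code from the paper: the function names and the (document, score) pair layout are assumptions, and the two normalized-scoring variants are left out because their formulas did not survive in the transcript.

```python
from itertools import zip_longest


def round_robin(result_lists):
    """Interleave ranked lists: first of each list, then second of each, and so on."""
    merged = []
    for same_rank in zip_longest(*result_lists):
        # zip_longest pads exhausted lists with None; skip the padding.
        merged.extend(doc for doc in same_rank if doc is not None)
    return merged


def raw_score_merge(scored_lists):
    """Pool (doc, score) pairs from all lists and sort by the raw score."""
    pooled = [pair for scored in scored_lists for pair in scored]
    return [doc for doc, _ in sorted(pooled, key=lambda p: p[1], reverse=True)]


merged = round_robin([["d11", "d12", "d13"], ["d21", "d22"], ["d31"]])
by_score = raw_score_merge([[("d11", 0.9), ("d12", 0.4)], [("d21", 0.7)]])
```

Raw scoring only makes sense when scores from different collections are comparable, which is the weakness the 2-step method addresses.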

  4. Traditional solution • Logistic regression (Calvé and Savoy (2000), Savoy (2003a)) • LVQ neural networks (Martín et al. 2003)

  5. 2-step retrieval status value method • Step 1: • Translating the query and searching each monolingual collection produces two results: • A concept vocabulary T′, where each concept consists of a query term together with its corresponding translations • A multilingual collection D′, formed as the union of the 1000 documents retrieved for each language.

  6. 2-step retrieval status value method • Step 2: • Re-index D′, considering only the T′ vocabulary. • Given a concept, its document frequency is obtained by grouping together the document frequencies of the terms that make up the concept.

  7. 2-step retrieval status value method • For example: • The Spanish word "casa" translates into the English words "house" and "home". • For a given document, term frequency is calculated as usual; the document frequency of the concept is the sum of the document frequencies of "casa", "house" and "home".
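The concept-level statistics described in steps 1 and 2 can be sketched as below. This is an illustration under assumptions, not the authors' implementation: documents are plain token lists, the function name `concept_statistics` is hypothetical, and the concept's term frequency is taken to be the number of occurrences of any of its terms in the document.

```python
from collections import Counter


def concept_statistics(docs, concept_terms):
    """Return (tf per document, df) for one concept over the collection D'.

    docs: {doc_id: list of tokens}; concept_terms: a term plus its translations.
    """
    # Per-term document frequency: number of documents containing the term.
    term_df = Counter()
    for tokens in docs.values():
        term_df.update(set(tokens))
    # Concept df is the sum of the dfs of its terms, as stated on the slide.
    concept_df = sum(term_df[t] for t in concept_terms)
    # Concept tf in a document: occurrences of any term of the concept.
    concept_tf = {
        doc_id: sum(tokens.count(t) for t in concept_terms)
        for doc_id, tokens in docs.items()
    }
    return concept_tf, concept_df


docs = {
    "es1": ["la", "casa", "casa"],
    "en1": ["my", "house"],
    "en2": ["home", "sweet", "home"],
    "en3": ["a", "garden"],
}
tf, df = concept_statistics(docs, {"casa", "house", "home"})
```

Because every language now shares one vocabulary and one set of document frequencies, the scores computed in the second step are directly comparable across languages.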

  8. Mixed 2-step RSV • Non-aligned words • Raw mixed 2-step RSV method • For a given term τ_ij (term j in monolingual collection i), the document frequency value is: • the one given by the 2-step method, if τ_ij is aligned; • the initial weight from the first step of the method, if the translation of τ_ij into the other languages is unknown. • RSV_i = α · RSV_i^align + (1 − α) · RSV_i^nonalign • α = 0.75
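The raw mixed combination is a plain linear interpolation, which a short sketch makes concrete. The function and argument names are illustrative assumptions; only the formula and the value α = 0.75 come from the slide.

```python
def raw_mixed_rsv(rsv_align, rsv_nonalign, alpha=0.75):
    """RSV_i = alpha * RSV_i^align + (1 - alpha) * RSV_i^nonalign."""
    return alpha * rsv_align + (1 - alpha) * rsv_nonalign


# Hypothetical scores for one document: 2.0 from aligned terms, 1.0 from non-aligned terms.
score = raw_mixed_rsv(2.0, 1.0)
```

With α = 0.75 the aligned part of the query contributes three times the weight of the non-aligned part.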

  9. Mixed 2-step RSV • Normalized mixed 2-step RSV method • α = 0.75

  10. Mixed 2-step RSV • Learning-based algorithms • Logistic regression • α, β1, β2 and β3 must be estimated using the iteratively re-weighted least squares method • LVQ neural network (Martín et al. 2003)

  11. Use machine translation to align words • P_en = "Pesticides in baby food" • Unigrams P_en = {Pesticides, baby, food} • Bigrams P_en = {Pesticides baby, baby food} • The expression to translate is: • EXP_en = {Pesticides in baby food} {Pesticides, baby, food} {Pesticides baby, baby food} • Then we have: • P_sp = {Pesticidas alimento niños} • Unigrams P_sp = {Pesticidas, bebé, alimento} (the translation of Unigrams P_en) • Bigrams P_sp = {Pesticidas bebés, alimento niños} (the translation of Bigrams P_en)

  12. Use machine translation to align words • For each word_i^sp ∈ Unigrams P_sp do: • (a) if word_i^sp ∈ P_sp, then remove word_i^sp from P_sp and add (word_i^sp, word_i^en) to the set of aligned words ALIGNED • Thus, we obtain: • P_sp = {niños} • ALIGNED = {(pesticidas, pesticides), (alimento, food)}

  13. Use machine translation to align words • For each bigram bigram_i^sp ∈ Bigrams P_sp, made up of (word_1^sp, word_2^sp) and translating the English bigram (word_1^en, word_2^en): • (a) if (word_1^sp, word_1^en) ∈ ALIGNED (word_1^sp is aligned with word_1^en) and word_2^sp ∈ P_sp, then remove word_2^sp from P_sp and add (word_2^sp, word_2^en) to the ALIGNED set. • (b) if (word_1^sp, word_2^en) ∈ ALIGNED and word_2^sp ∈ P_sp, then remove word_2^sp from P_sp and add (word_2^sp, word_1^en) to the ALIGNED set.

  14. Use machine translation to align words • (c) if (word_2^sp, word_1^en) ∈ ALIGNED and word_1^sp ∈ P_sp, then remove word_1^sp from P_sp and add (word_1^sp, word_2^en) to the ALIGNED set. • (d) if (word_2^sp, word_2^en) ∈ ALIGNED and word_1^sp ∈ P_sp, then remove word_1^sp from P_sp and add (word_1^sp, word_1^en) to the ALIGNED set. • Thus, we obtain: • P_sp = ∅ • ALIGNED = {(pesticidas, pesticides), (alimento, food), (niños, baby)}
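The unigram and bigram alignment rules on the four slides above can be sketched in a few lines. This is an illustrative reading of the rules, not the authors' code: the function name `align`, the lowercased token lists, and the pairing of each translated unigram or bigram with its English source are assumptions about how the translated expression would be represented.

```python
def align(phrase_sp, unigram_pairs, bigram_pairs):
    """Align the words of a translated phrase with the source words.

    phrase_sp: tokens of the translated phrase (P_sp).
    unigram_pairs: [(word_sp, word_en)] from translating each unigram.
    bigram_pairs: [((sp1, sp2), (en1, en2))] from translating each bigram.
    Returns (ALIGNED pairs, words of P_sp left unaligned).
    """
    remaining = list(phrase_sp)
    aligned = []
    # Unigram pass: a translated unigram found in P_sp is aligned directly.
    for sp, en in unigram_pairs:
        if sp in remaining:
            remaining.remove(sp)
            aligned.append((sp, en))
    # Bigram pass: if one word of the bigram is already aligned, the other
    # word is aligned with the other English word (rules a, b, c, d).
    for (sp1, sp2), (en1, en2) in bigram_pairs:
        rules = (
            (sp1, en1, sp2, en2),  # (a)
            (sp1, en2, sp2, en1),  # (b)
            (sp2, en1, sp1, en2),  # (c)
            (sp2, en2, sp1, en1),  # (d)
        )
        for known_sp, known_en, other_sp, other_en in rules:
            if (known_sp, known_en) in aligned and other_sp in remaining:
                remaining.remove(other_sp)
                aligned.append((other_sp, other_en))
                break
    return aligned, remaining


aligned, remaining = align(
    ["pesticidas", "alimento", "niños"],
    [("pesticidas", "pesticides"), ("bebé", "baby"), ("alimento", "food")],
    [(("pesticidas", "bebés"), ("pesticides", "baby")),
     (("alimento", "niños"), ("baby", "food"))],
)
```

On the slide's example, the unigram pass aligns "pesticidas" and "alimento", and rule (b) then aligns "niños" with "baby" through the bigram "alimento niños", leaving no unaligned words.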

  15. Method conclusion • Fully aligned words • 2-step method • Partially aligned words • Raw mixed 2-step RSV • Normalized mixed 2-step RSV • Logistic regression mixed 2-step RSV • Neural network mixed 2-step RSV • Algorithm to align phrases and their translations

  16. Experiment • Documents • CLEF 2003 has two tasks, CLEF 2003-8 and CLEF 2003-4. CLEF 2003-4 is limited to four languages (English, French, German and Spanish) • Queries (Title + Description)

  17. Experiment • The collections are indexed with the Zprise IR system, using the OKAPI probabilistic model (fixed at b = 0.75 and k1 = 1.2) • Translation strategies • Machine Readable Dictionary (MDR, Babylon) • picking the first translation available (reported as "Babylon 1") or the first two (reported as "Babylon 2") • Machine Translation (MT, Babelfish) • Mixed MT and MDR • combining the Babelfish and Babylon 1 translations.
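To show what the parameters b and k1 control, here is a sketch of a term weight in the commonly cited Okapi BM25 form. It is an assumption that this textbook form matches the variant configured in the experiments; only the values b = 0.75 and k1 = 1.2 come from the slide, and the function name is illustrative.

```python
import math


def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 weight of one term in one document (assumed form)."""
    # Robertson/Sparck Jones inverse document frequency.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly document length is normalized.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences saturate.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# For a document of average length and tf = 1, the weight reduces to the idf.
w = bm25_weight(tf=1, df=10, n_docs=1000, doc_len=100, avg_doc_len=100)
```

In the 2-step method the same kind of weighting would be applied at concept level, using the grouped document frequency in place of the per-term one.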

  18. Experiment 1 – multilingual results with fully aligned queries

  19. Experiment 1 – multilingual results with fully aligned queries

  20. Experiment 1 – analysis of failures • Too many documents from the Spanish collection for this query

  21. Experiment 1 – analysis of failures

  22. Experiment 2 – multilingual results with partially aligned queries • Based on the MDR translation approach

  23. Experiment 2 – multilingual results with partially aligned queries • Based on the MDR translation approach

  24. Experiment 2 – multilingual results with partially aligned queries • Based on the MT translation approach • with the CLEF 2001–2002 test collection and the CLEF 2001 + CLEF 2002 + CLEF 2003 query set (160 queries, five languages: EN, SP, DE, FR, IT)

  25. Conclusion • Future work • Dealing with translation probabilities. • Testing the method with other translation strategies, such as the Multilingual Similarity Thesaurus. • n-gram indexing. • Continuing to study strategies for dealing with aligned and non-aligned query terms: the integration of both sorts of terms by means of Bayesian networks.
