Generalising lexical translation strategies for MT using comparable corpora
Bogdan Babych, Serge Sharoff, Anthony Hartley
Centre for Translation Studies, University of Leeds, Leeds, UK
{b.babych,s.sharoff,a.hartley}@leeds.ac.uk
Overview
• Indirect translation equivalents in MT: current limitations
• Increasing the range of translation equivalents used by MT
• Equivalent-oriented vs. strategy-oriented approaches
• Methodology for discovering translation strategies using comparable corpora
• Applications for terminology research
• Conclusions and future work
LREC 2008 Generalising Lexical Translation Strategies for MT
Indirect equivalents in MT
Data-driven MT (statistical & example-based)
• Reusing equivalents learnt from parallel corpora
• Problem: Lack of generalisation
• Equivalents expressed as word patterns
• Do not generalise beyond lemmas
• Cannot generate indirect equivalents for ‘unseen’ expressions
• Difficult to maintain many specific patterns
• Fundamental limits on the range of translation solutions generated by MT
Indirect equivalents: Change of perspective
Problems for MT: non-fluent translations & mistranslations
• Ru: Из кризисов такого рода как парламентский можно выходить за счет демократических методов.
• lit.: 'From crises of such type as parliamentary it is possible to go out by means of democratic methods'
• RBMT: Such as parliamentary it is possible to leave crises due to democratic methods.
• SMT: This kind of crisis as a parliamentary, can go through democratic methods.
• HT: We can escape crises like these through democratic means
From equivalents to lexical translation strategies
• Indirect equivalents = ‘creative’ solutions to non-trivial problems
• Parallel corpora: too small, sparse and specialised
• The same problem often solved idiosyncratically: no clear statistical model
• Set of ‘indirect’ translation problems is open
• Our solution: higher order model
• Generalising classes of equivalents as strategies
• By similarity of usage in comparable corpora
• Equivalents to unseen expressions are generated from discovered strategies
Current methodology
• One fixed strategy: rephrasing words using similarity of ‘collocation vectors’ ~ near-synonyms
• Generator of equivalents from ASSIST project
• выходить из кризиса (go out of crisis) ~ {to approach, to face, to get over} crisis
• Выходить (go out) .sim задходить (come) .dict + collocations of (crisis) to approach
• No other strategies yet implemented
• Transposition (change of syntactic perspective)
• Modulation (change of lexical perspective) …
• Further goal: to find ~ escape from crisis … via …
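
The idea of finding near-synonyms by similarity of collocation vectors can be illustrated with a minimal sketch. It assumes sparse count vectors stored as dictionaries and cosine as the similarity measure; the slides do not name the measure actually used in ASSIST, and the counts and the `nearest` helper below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical collocation counts (word -> {collocate: frequency}).
# Real vectors would be extracted from large monolingual corpora.
VECTORS = {
    "escape": {"crisis": 4, "danger": 6, "prison": 5},
    "survive": {"crisis": 5, "danger": 3, "war": 4},
    "publish": {"book": 9, "report": 7},
}


def nearest(word, vectors, top_n=2):
    """Rank all other words by similarity of their collocation vectors."""
    target = vectors[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [(other, round(score, 3)) for score, other in scored[:top_n]]


if __name__ == "__main__":
    print(nearest("escape", VECTORS))
```

Words that share collocates such as "crisis" and "danger" rank above words with no shared collocates, which is the property the re-phrasing strategy relies on.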
Strategy evaluation
• Coverage of problems vs. coverage of solutions
• Several strategies cover the same problem (variation)
• Ru: Механизм принятия решений будет публичным. (lit.: 'The mechanism of making decisions will be public')
• публичный механизм (‘public mechanism’)
• Public process / … a greater public interaction (Current re-phrasing strategy)
• The answer will come from the people. (Change-of-perspective strategy)
• It is harder to match solutions: diversity of strategies
Coverage of translation problems by re-phrasing strategy
• Characterising linguistic productivity of the strategy
• Experiment: 12 translators suggest indirect solutions to the same set of problems
• 36 translation problems (25 Ru & 11 En)
• 210 different human solutions (5.83 solutions / problem)
• Task of the system: to generate a possible solution for each problem
Coverage of translation problems by re-phrasing strategy
• For 75% of problems: at least 1 match by re-phrasing strategy
• Average coverage of a set of human solutions: 34.7%
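
The two figures above can be read as two different measures. The sketch below shows one way they might be computed, assuming exact string matching between system output and human solutions and a per-problem (macro) average; the slides do not specify either detail, and the data and the `coverage` function are hypothetical.

```python
def coverage(human, generated):
    """Return (problem coverage, average solution coverage).

    human:     dict mapping each problem to a non-empty set of human solutions
    generated: dict mapping each problem to the set of system solutions
    """
    matched_problems = 0
    per_problem = []
    for problem, solutions in human.items():
        # Exact-match assumption: a solution counts only if strings are equal.
        hits = solutions & generated.get(problem, set())
        if hits:
            matched_problems += 1
        per_problem.append(len(hits) / len(solutions))
    problem_coverage = matched_problems / len(human)
    solution_coverage = sum(per_problem) / len(per_problem)
    return problem_coverage, solution_coverage


# Hypothetical toy data, not the experimental data from the slides.
HUMAN = {
    "p1": {"face a crisis", "get over a crisis"},
    "p2": {"public process"},
    "p3": {"the answer will come from the people"},
}
GENERATED = {
    "p1": {"face a crisis", "approach a crisis"},
    "p2": {"public process"},
    "p3": {"public mechanism"},
}

if __name__ == "__main__":
    print(coverage(HUMAN, GENERATED))
```

The first value counts a problem as covered when at least one human solution is matched; the second averages, over problems, the share of human solutions that the system reproduces.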
Coverage of translation solutions by re-phrasing strategy
• Comparing coverage of indirect equivalents by:
• (1) bilingual dictionary solutions (Oxford Russian)
• (2) solutions extracted from word alignment in parallel corpus:
• Training Set: Ru-En news, 700k wd.
• Test Set: Euronews Ru-En interviews, 100k wd.
• (3) strategy-based (i.e. re-phrasing) solutions:
• Collocation vectors from monolingual corpora (BNC, RNC) ~100M
• Filtered by co-occurrence in news corpora ~200M
Coverage of solutions by re-phrasing strategy
• Task of the system: to generate an exact solution for each problem
Coverage of solutions by re-phrasing strategy
Conclusions
• Learning individual equivalents is not efficient
• Low coverage of unseen problems
• Lower generalisation of idiosyncratic alignments
• Re-phrasing strategy: productive but not sufficient
On-going project: beyond re-phrasing strategy
Modelling transposition and modulation strategies
• Learning strategies from parallel data
• Aligning ‘indirect’ solutions (discontinuous MWEs)
• выходить из кризиса (go out of crisis) <~> escape crisis
• Generalising equivalents with similarity classes
• Covering unseen expressions:
• {Выходить / выводить…} из {конфликта / застоя / депрессии…} (go out / lead out from crisis, stagnation, depression) <~> to escape conflict / controversy, to flee difficulty, to survive disaster / tragedy …
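
Generalising one aligned pair into similarity classes, then using the classes to cover unseen expressions, might look roughly like the sketch below. It is only an illustration under stated assumptions: the class members, the noun dictionary and the `ATTESTED` filter are hypothetical stand-ins for similarity classes and collocation data that would be derived from comparable corpora.

```python
# Hypothetical source-side similarity class for the verb in the aligned pair.
SOURCE_VERBS = {"выходить", "выводить"}

# Hypothetical dictionary for nouns similar to "кризиса" (crisis).
NOUN_DICT = {
    "кризиса": "crisis",
    "конфликта": "conflict",
    "застоя": "stagnation",
    "депрессии": "depression",
}

# Hypothetical target-side similarity class for "escape".
TARGET_VERBS = {"escape", "flee", "survive"}

# Hypothetical verb-noun collocations attested in a target-language corpus.
ATTESTED = {
    ("escape", "crisis"),
    ("escape", "conflict"),
    ("survive", "depression"),
}


def generate(verb, noun):
    """Propose indirect equivalents for 'verb из noun' if the strategy applies."""
    if verb not in SOURCE_VERBS or noun not in NOUN_DICT:
        return []  # the expression falls outside the generalised pattern
    target_noun = NOUN_DICT[noun]
    # Keep only candidates that are attested target-language collocations.
    return sorted(
        f"{target_verb} {target_noun}"
        for target_verb in TARGET_VERBS
        if (target_verb, target_noun) in ATTESTED
    )


if __name__ == "__main__":
    print(generate("выводить", "конфликта"))
```

The filter against attested collocations is one possible way to keep generated candidates fluent; the slides do not say how candidates are filtered in this on-going work.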
MT-oriented evaluation
Improvements for incomprehensible translations and mistranslations:
• Es: Es verdad que empezamos vacilantes pero era lógico. (lit.: started hesitant)
• HT: Of course we had our doubts to begin with but that's normal
• SMT: It is true that we started to waver but was logical (unacceptable literal translation)
• empezar vacilante ~ begin doubt (modulation)
• Indirect solutions: we had our fears / doubts to start with; we began with fear / scepticism / worries...; we were not convinced then; after our early scepticism; we were soon / gradually / quickly convinced
Application to terminological research
• Terminological equivalents are usually direct
• Rarely change lexical or syntactic perspective
• Standard fixed equivalents preferred
• Distributional similarity framework
• Yields a network of related terms (not paraphrases)
• Useful for automating terminological research
• Prototype terminological workbench for translators
• English-French corpora in a specialised domain (2M words in total); Giza alignments; termbanks
• Translators explore systems of related terms
Terminological interface for translators
• French term plan and the English term plain
Conclusions and future work
• Making testable predictions for indirect equivalents
• Model for re-phrasing, transposition & modulation strategies
• Match human translators’ solutions for unseen phrases
• Future work
• Automatic identification of phrases which need non-literal translation
• Building fluent equivalents around solutions
• Integrating strategy-based generator into SMT decoder
• Evaluation of the improvement in coverage
• Evaluation of the productivity / reusability of strategies