
Patent documentation - comparison of two MT strategies


Presentation Transcript


  1. Patent documentation - comparison of two MT strategies
  Lene Offersgaard, Claus Povlsen
  Center for Sprogteknologi, University of Copenhagen
  loff@cst.dk, claus@cst.dk
  MT Summit, Sep 2007

  2. A comparison of two different MT strategies
  • RBMT and SMT, similarities and differences, in a patent documentation context
  • What requirements should be met in order to develop an SMT production system within the area of patent documentation?
  • The two strategies:
    • PaTrans: a transfer- and rule-based translation system, used for the last 15 years at Lingtech A/S (Ørsnes, 1996)
    • SpaTrans: an SMT system based on the Pharaoh framework (Koehn, 2004)
  • Investigations supported by the Danish Research Council
  • Subdomain: chemical patents

  3. A comparison of two different MT strategies - 2
  • PaTrans: transfer- and rule-based
    • En-Da, linguistic development
    • Grammatical coverage tailored to the patent text type
    • Tools for terminology selection and coding
    • Handling of formulas and references
  • SpaTrans: an SMT system based on the Pharaoh framework
    • En-Da, research version
    • Word and grammatical coverage determined by the training corpus
    • No terminology handling yet
    • Simple handling of formulas and references

  4. Translation Workflow
  [Workflow diagram: an English patent is preprocessed, translated and postprocessed into a Danish patent, which then goes to proofreading. PaTrans draws on linguistic resources (lexicon, grammar, term bases) in the PaTrans engine; SpaTrans draws on statistical resources (SRILM language model, phrase table) in the Pharaoh decoder.]
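To make the workflow on this slide concrete, here is a minimal sketch of how such a two-engine pipeline could be wired together. The function names, the sentence splitting and the engine stubs are hypothetical placeholders, not the actual PaTrans or SpaTrans interfaces.

```python
# Illustrative sketch of the slide's translation workflow; the function and
# engine names are hypothetical stand-ins, not the real PaTrans/SpaTrans APIs.

def preprocess(text: str) -> str:
    """Normalise formulas and references before translation (placeholder)."""
    return text.strip()

def translate_rbmt(sentence: str) -> str:
    """Stand-in for the PaTrans engine (lexicon, grammar, term bases)."""
    return f"[RBMT] {sentence}"

def translate_smt(sentence: str) -> str:
    """Stand-in for the Pharaoh decoder (phrase table + SRILM language model)."""
    return f"[SMT] {sentence}"

def postprocess(text: str) -> str:
    """Reinsert formulas/references and tidy the output (placeholder)."""
    return text

def translate_patent(english_patent: str, engine: str = "smt") -> str:
    translate = translate_smt if engine == "smt" else translate_rbmt
    sentences = preprocess(english_patent).split(". ")
    danish = ". ".join(translate(s) for s in sentences)
    return postprocess(danish)  # the result then goes to human proofreading

if __name__ == "__main__":
    print(translate_patent("The compound is mixed with water. The mixture is heated."))
```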

  5. BLEU Evaluation
  • Reference translations are two post-edited PaTrans translations
  • The PaTrans system is favoured: term bases, wording and sentence structure
  • Some SpaTrans errors are caused by incomplete treatment of formulas and references
  • BLEU differs for the two patents
  • Very promising results for the SpaTrans system
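For readers unfamiliar with the metric, the sketch below shows the standard corpus-level BLEU computation (modified n-gram precision plus brevity penalty) used in evaluations like this one. The scores in the paper would have been produced with a standard evaluation tool; the single-sentence example data here is purely illustrative.

```python
# Minimal BLEU-4 sketch with one reference per hypothesis (an assumption for
# illustration; the slide's setup uses post-edited PaTrans output as reference).
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU for parallel lists of token lists."""
    match = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n   # candidate n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
            match[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            total[n - 1] += sum(hyp_ngrams.values())
    if min(match) == 0:
        return 0.0
    precision = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    brevity = min(0.0, 1.0 - ref_len / hyp_len)  # brevity penalty in log space
    return math.exp(precision + brevity)

hyp = ["kontrol af det fulde spektrum".split()]
ref = ["kontrol af det fulde spektrum".split()]
print(f"BLEU = {bleu(hyp, ref):.3f}")  # 1.000 for a perfect match
```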

  6. Human evaluation of the SMT system
  • Limited resources for manual evaluation
  • Proofreaders have post-edited SMT output and focused on:
    • Post-editing time
    • Quality of output: intelligibility (understandable?), fidelity (same meaning?), fluency (fluent Danish?)
  • Conclusions:
    • Usable translation quality
    • Both intelligibility and fidelity scores are best without reordering
    • Annoying agreement errors
    • New terms have to be easy to include in the SMT system

  7. SpaTrans translation results
  • A dominant error pattern is the frequent occurrence of agreement errors in nominal phrases
  • Example - gender disagreement (lit.: "... control of the full spectrum"):
    • SpaTrans output: ... kontrol af den[DET_common_sing] fulde spektrum[N_neuter_sing]
    • Corrected output: ... kontrol af det[DET_neuter_sing] fulde spektrum[N_neuter_sing]

  8. SpaTrans translation results - 2
  • Number disagreement (lit.: "... the active ingredients"):
    • SpaTrans output: ... den[DET_common_sing] aktive bestanddele[N_common_plur]
    • Corrected output: ... de[DET_common_plur] aktive bestanddele[N_common_plur]
  • Definiteness disagreement (lit.: "... this constant erosion"):
    • SpaTrans output: ... denne[DET_definite] konstant[ADJ_indefinite] erosion
    • Corrected output: ... denne[DET_definite] konstante[ADJ_definite] erosion
  • Let's give linguistic information a try!
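The agreement patterns on slides 7 and 8 are regular enough to state mechanically. The sketch below flags determiner-noun gender and number clashes; the tiny lexicon and feature names are hypothetical stand-ins for a real Danish morphological resource, not the tagsets used in the experiments.

```python
# Illustrative agreement check over the gender/number features shown in the
# examples above; the miniature "lexicon" is a made-up stand-in.
LEXICON = {
    "den":         {"pos": "DET", "gender": "common", "number": "sing"},
    "det":         {"pos": "DET", "gender": "neuter", "number": "sing"},
    "de":          {"pos": "DET", "gender": "common", "number": "plur"},
    "spektrum":    {"pos": "N",   "gender": "neuter", "number": "sing"},
    "bestanddele": {"pos": "N",   "gender": "common", "number": "plur"},
}

def agreement_errors(tokens):
    """Report determiner-noun gender/number clashes inside a simple NP."""
    errors, det = [], None
    for tok in tokens:
        feats = LEXICON.get(tok)
        if feats is None:
            continue
        if feats["pos"] == "DET":
            det = (tok, feats)
        elif feats["pos"] == "N" and det is not None:
            for feature in ("gender", "number"):
                if det[1][feature] != feats[feature]:
                    errors.append(f"{det[0]} ... {tok}: {feature} disagreement")
            det = None
    return errors

print(agreement_errors("kontrol af den fulde spektrum".split()))
# ['den ... spektrum: gender disagreement']
print(agreement_errors("den aktive bestanddele".split()))
# ['den ... bestanddele: number disagreement']
```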

  9. Adding linguistic information to SMT: MOSES
  • MOSES:
    • Open-source system replacing Pharaoh (Koehn et al., 2007)
    • State-of-the-art phrase-based approach
    • Uses factored translation models
  • Comparison of SpaTrans with the Pharaoh and Moses decoders:
    • Reuse of statistical resources
    • Pharaoh parameters for the monotonic setup optimised on development tests

  10. Adding linguistic information using MOSES
  • Using factored translation models
    • Makes it possible to build translation models based on surface forms, part of speech, morphology, etc.
  • We use:
    • Translation models: word -> word and POS+morphology -> POS+morphology
    • A generation model determines the output surface form from the target-side factors
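As an illustration of what factored training data looks like, the sketch below writes tokens in the pipe-separated word|factor|factor format that Moses reads; the POS and morphology tags are invented for the example rather than taken from the tagsets actually used here.

```python
# Sketch of preparing factored training data in Moses' word|factor format;
# the tag inventory below is illustrative only.
def to_factored_line(tagged_tokens):
    """tagged_tokens: list of (surface, pos, morphology) triples."""
    return " ".join(f"{word}|{pos}|{morph}" for word, pos, morph in tagged_tokens)

danish = [("det", "DET", "neuter.sing.def"),
          ("fulde", "ADJ", "def"),
          ("spektrum", "N", "neuter.sing")]
print(to_factored_line(danish))
# det|DET|neuter.sing.def fulde|ADJ|def spektrum|N|neuter.sing
```

In the setup on this slide, translation maps the word factor to the word factor and the POS+morphology factor to the POS+morphology factor, and a generation step then chooses an output surface form consistent with the target-side tags, which is how the agreement errors of slides 7-8 get repaired.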

  11. Adding POS tags and morphology
  • POS tagging of the training material:
    • Brill tagger used
    • Different tagsets for Danish and English text
  • Experiments with language model (LM) order 3 vs. 5:
    • Results not significant: test patent A +0.1% BLEU, test patent B -0.1% BLEU
    • Perhaps the training material is too small for LM order experiments
  • Training parameters kept: phrase length 3, LM order 3
  • No tuning of parameters, just training
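To illustrate the LM-order question on this slide, here is a toy count-based n-gram model over POS tags with add-one smoothing, compared at orders 3 and 5 on a held-out tag sequence. The tag data is invented for the example and says nothing about which order actually wins on patent text.

```python
# Toy n-gram language model over POS tags, compared at order 3 and order 5.
import math
from collections import Counter

def train(sequences, order):
    counts, contexts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] * (order - 1) + seq + ["</s>"]
        vocab.update(padded)
        for i in range(order - 1, len(padded)):
            gram = tuple(padded[i - order + 1:i + 1])
            counts[gram] += 1
            contexts[gram[:-1]] += 1
    return counts, contexts, len(vocab)

def perplexity(seq, model, order):
    counts, contexts, v = model
    padded = ["<s>"] * (order - 1) + seq + ["</s>"]
    log_prob, n = 0.0, 0
    for i in range(order - 1, len(padded)):
        gram = tuple(padded[i - order + 1:i + 1])
        p = (counts[gram] + 1) / (contexts[gram[:-1]] + v)  # add-one smoothing
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

train_tags = [["DET", "ADJ", "N", "V", "DET", "N"], ["DET", "N", "V", "ADJ", "N"]]
held_out = ["DET", "ADJ", "N", "V", "N"]
for order in (3, 5):
    model = train(train_tags, order)
    print(f"order {order}: perplexity {perplexity(held_out, model, order):.2f}")
```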

  12. Results adding POS tags - by inspection
  • With inclusion of morpho-syntactic information:
    • (lit.: "... control of the full spectrum") ... kontrol af det fulde spektrum (gender agreement)
    • (lit.: "... the active ingredients") ... de aktive bestanddele (number agreement)
    • (lit.: "... this constant erosion") ... denne konstante erosion (definiteness agreement)

  13. Results using POS tags - BLEU
  • BLEU is not designed to test linguistic improvement; anyway: significant improvement!

  14. Conclusions - MOSES
  • En-Da patents: best results with no reordering
  • Agreement errors can be reduced by applying factored training using POS + morphology
  • Experiments using a language model order > 3 for POS tags might give even better results

  15. Conclusions - SMT test results for patent text
  • Usable:
    • Translation quality comparable with RBMT systems in production
    • Low-cost development for a new domain
    • Possible to have SMT systems tailored to different patent domains - if training data are available
  • Patent texts always contain new terms/concepts; therefore new terms have to be handled in SMT production systems
  • Agreement errors can be reduced by applying factored training with POS information - BLEU score improved!

  16. Acknowledgements
  • Thanks!
  • The work was partly financed by the Danish Research Council.
  • Special thanks to Lingtech A/S and Ploughmann & Vingtoft for providing us with training material and proofread patents.
