
Transforming Parallel Corpora to Translation Memory

Steve Legrand, IPN, 29th Sept. 2006. Parallel text, or bitext, is an aligned translation of text from one language to another. Practical uses in NLP: word sense disambiguation, automatic translation, and translation memories.

Presentation Transcript


  1. Transforming Parallel Corpora to Translation Memory • Steve Legrand, IPN • 29th Sept. 2006

  2. Parallel text or bitext • An aligned translation of text from one language to another. • Practical uses in NLP: • Word sense disambiguation • Automatic translation • Translation memories

  3. Translation Memory • Helps the translator by using already translated text segments as cues for translating new text segments. • The correspondence (match) level at which a stored segment is offered can usually be set (e.g., 56%). • Automatic translation can be combined with translation memories → post-editing of automatic translation for translation memory uses.
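
A minimal sketch of how a match level such as 56% might act as a cut-off when looking up a new segment. The `MEMORY` dictionary, the `best_match` function and the use of Python's `difflib` similarity ratio are assumptions made for illustration; real TM applications use their own matching methods.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory translation memory: source segment -> stored translation.
MEMORY = {
    "The wolf ran across the snow.": "El lobo corrió por la nieve.",
    "The dogs were tired.": "Los perros estaban cansados.",
}


def best_match(segment, memory, threshold=0.56):
    """Return (score, source, translation) for the closest stored segment,
    or None if no stored segment reaches the threshold."""
    best = None
    for source, translation in memory.items():
        # ratio() is a character-based similarity between 0.0 and 1.0.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, translation)
    if best is not None and best[0] >= threshold:
        return best
    return None


if __name__ == "__main__":
    # A near match is offered as a cue; the translator still edits the result.
    print(best_match("The wolf ran across the ice.", MEMORY))
```

Raising the threshold yields fewer but closer suggestions; lowering it yields more suggestions that need heavier editing.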

  4. Translation memory format (.tmx) • .tmx (Translation Memory eXchange) is a standardized format for application interoperability. • tu: translation unit, the parent element of everything to be translated. It can carry a unique identifier (tuid). • tuv: translation unit variant, the element that carries the language code of one version of the text (xml:lang). • seg: segment, the element that contains the text itself.

  5. TMX Example
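
The slide's example image is not part of this transcript. As a stand-in, here is a minimal sketch that builds a small TMX document with the tu, tuv and seg elements described on the previous slide. The function name `build_tmx`, the header values and the sample pair are assumptions for illustration, not the content of the original slide.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this qualified name out as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="es"):
    """Build a TMX element tree from (source, target) segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    # Header values are placeholders chosen for this sketch.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "none",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for i, (src, tgt) in enumerate(pairs, start=1):
        tu = ET.SubElement(body, "tu", tuid=str(i))      # one translation unit
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})  # one language variant
            ET.SubElement(tuv, "seg").text = text             # the segment text
    return tmx


if __name__ == "__main__":
    tree = build_tmx([("White Fang", "Colmillo Blanco")])
    print(ET.tostring(tree, encoding="unicode"))
```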

  6. Poor man’s guide to translation memories • Trados is the best known, and probably one of the best, commercial TM applications available. • There are cheaper single-user versions, but even so the price is often prohibitive. • To avoid excessive costs, one could: • Use demo versions of commercial software • Use Open Source products.

  7. OmegaT • Open Source translation memory application • Needs the Java Runtime Environment • Needs OpenOffice to convert .doc files to the .odt (open standard) or .sxw format • Creates .tmx files • .tmx files can also be exported from other applications

  8. Parallel corpora → tmx • To be able to use a parallel corpus as a translation memory, we first need to convert it to the tmx format. • We can either use an existing parallel corpus or create our own. • There are many open web resources for creating our own parallel corpora.

  9. Using open parallel corpora resources – English source • Jack London published about 40 books in English. • Almost all his English-language works are publicly available at Project Gutenberg: http://www.gutenberg.org/wiki/Main_Page

  10. Using open parallel corpora resources – Spanish source(s) • Among the many sources of Spanish translations of Jack London’s books is: http://apuntes.rincondelvago.com/trabajos_global/literatura/

  11. Aligning parallel texts • For example: download “White Fang” by Jack London from Project Gutenberg and its translation “Colmillo Blanco” from rincondelvago. • Use bitext2tmx (a free, open source application) for alignment.
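
To give a feel for what alignment involves, here is a deliberately naive sketch that pairs paragraphs of the two texts in order and flags pairs whose lengths differ sharply. This is not how bitext2tmx works internally; the function names and the `max_ratio` cut-off are assumptions for illustration, and real texts still need manual correction in an aligner.

```python
import re


def split_paragraphs(text):
    """Split on blank lines and collapse whitespace inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    return [" ".join(b.split()) for b in blocks if b.strip()]


def pair_paragraphs(source_text, target_text, max_ratio=2.0):
    """Pair paragraphs by position.

    Returns (pairs, count_difference), where each pair is
    (source, target, suspicious). A pair is marked suspicious when one
    side is more than max_ratio times longer than the other, which often
    means the two texts have drifted out of step at that point.
    """
    src = split_paragraphs(source_text)
    tgt = split_paragraphs(target_text)
    pairs = []
    for s, t in zip(src, tgt):
        ratio = max(len(s), len(t)) / min(len(s), len(t))
        pairs.append((s, t, ratio > max_ratio))
    return pairs, len(src) - len(tgt)


if __name__ == "__main__":
    english = "First paragraph.\n\nSecond paragraph, a little longer."
    spanish = "Primer párrafo.\n\nSegundo párrafo, un poco más largo."
    for source, target, suspicious in pair_paragraphs(english, spanish)[0]:
        print(suspicious, "|", source, "|", target)
```

A non-zero count difference or any flagged pair is a sign that segments must be split, merged or moved by hand before the result is exported.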

  12. bitext2tmx aligner: configuration

  13. bitext2tmx aligner: text alignment

  14. bitext2tmx aligner: producing a tmx file

  15. The tmx file produced by bitext2tmx can be added to OmegaT’s tm directory, where it is used as part of the translation memory.

  16. Other tools with OmegaT • .tmx files can be cleaned with tmxcleaner • .tmx files can be merged with tmxmerger • .tmx files can be validated with tmxvalidator • (These tools can be downloaded from the OmegaT site.) • It is important at least to validate the files before adding them to OmegaT’s translation memory.
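
To illustrate the kind of problem worth catching before a file goes into the tm directory, here is a rough sanity check written from scratch. It does not reproduce what tmxvalidator does, and the function name `check_tmx` and the particular checks chosen are assumptions for illustration; for real work, use the validator itself.

```python
import xml.etree.ElementTree as ET

# How ElementTree reports the xml:lang attribute after parsing.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def check_tmx(xml_text):
    """Return a list of problems found; an empty list means these checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    if root.tag != "tmx":
        return ["root element is not <tmx>"]
    body = root.find("body")
    if body is None:
        return ["missing <body> element"]
    problems = []
    for n, tu in enumerate(body.findall("tu"), start=1):
        tuvs = tu.findall("tuv")
        if len(tuvs) < 2:
            # A single variant gives the translator nothing to reuse.
            problems.append(f"tu #{n}: fewer than two <tuv> elements")
        for tuv in tuvs:
            if not tuv.get(XML_LANG):
                problems.append(f"tu #{n}: <tuv> without xml:lang")
            seg = tuv.find("seg")
            if seg is None or not "".join(seg.itertext()).strip():
                problems.append(f"tu #{n}: <tuv> with missing or empty <seg>")
    return problems


if __name__ == "__main__":
    sample = (
        '<tmx version="1.4"><header/><body><tu>'
        '<tuv xml:lang="en"><seg>White Fang</seg></tuv>'
        '<tuv xml:lang="es"><seg></seg></tuv>'
        '</tu></body></tmx>'
    )
    for problem in check_tmx(sample):
        print(problem)
```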

  17. Current work: • Using these Open Source resources to translate a book from English to Spanish with the applied linguistics students at Colima University, with IPN backing. • Ready by the middle of November. • Linguistica Computacional

  18. Save your money. Use Open Source!
