
Georgios Kontonatsios, georgios.kontonatsios@cs.man.ac.uk, 14th October 2014



Presentation Transcript


  1. A hybrid approach to compiling bilingual dictionaries of medical terms from parallel corpora Georgios Kontonatsios georgios.kontonatsios@cs.man.ac.uk 14th October 2014

  2. Overview
     Background
     • Parallel Corpus
     • Problem
     • Motivation
     Methods
     • Random Forest Classifier
     • Statistical Phrase Alignment
     • Hybrid Approach
     Experiments
     • English-Greek & English-Romanian
     • Error Analysis
     Conclusions
     • Discussion
     • Future Work

  3. Background: Parallel Corpus
     “A parallel corpus is a collection of documents in a source language paired with their direct translation in a target language”
     English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
     Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού

  4. Background: Parallel Corpus
     English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
     Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου του μαστού
     • 1) Useful for SMT
     • 2) Relatively scarce resources
       - Koehn (2005) trained 110 SMT systems (11 languages) in three weeks
       - Available for finance, law, medicine, etc.
     • 3) Excellent resources for mining bilingual terminologies
       - Exact translations => no missing translations of terms
       - Sentence aligned => limited search space of candidate translations
       - Same size => term frequencies are comparable

  5. Background: Problem
     [Diagram: Parallel Corpus, Term Alignment, Dictionary of MWT]
     English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
     Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού
     Aligned term pair: metastatic breast cancer / μεταστατικού καρκίνου μαστού

  6. Background: Biomedical Domain
     Existing resources in the biomedical domain remain incomplete
     UMLS
     • A multilingual terminological resource (more than 20 languages)
     • Indexes ~7.6M English terms
     Goal: expand UMLS for English-Greek and English-Romanian (~6.3M missing translations)

  7. Methodology: Term Alignment Pipeline
     [Diagram: Parallel Corpus, Link to UMLS, MetaMap, Term Alignment]
     English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
     C0278488, Neoplastic Process
     C0278488, Neoplastic Process
     Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού

  8. Methodology: Term Alignment Algorithms
     Random Forest Classifier (EACL 2014, EMNLP 2014)
     • Supervised machine learning method
     • Exploits internal structure of terms (character n-gram feature representation)
     • Requires positive and negative instances for training
     • Out-of-domain seed dictionary (i.e. BabelNet)
     Statistical Phrase Alignment (Koehn et al., 2003)
     • Unsupervised approach
     • Part of Moses SMT (Koehn et al., 2007) (out-of-the-box solution)
     • Exploits co-occurrences of source and target terms
     • Works well for frequently occurring terms
     • Performance decreases for rare terms
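A minimal sketch of what a character n-gram feature representation for a candidate term pair could look like. The slides only name the representation; the n-gram range (2 to 4), the lowercasing, the `src:`/`tgt:` prefixes, and the function names `char_ngrams` and `pair_features` are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def char_ngrams(term, n_min=2, n_max=4):
    """Count the character n-grams of a term for n in [n_min, n_max] (assumed range)."""
    text = term.lower()
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams


def pair_features(source_term, target_term):
    """Build one sparse feature dict for a (source, target) term pair.

    Source and target n-grams are kept apart with hypothetical prefixes so a
    classifier can learn associations between source-side and target-side n-grams.
    """
    features = {}
    for gram, count in char_ngrams(source_term).items():
        features["src:" + gram] = count
    for gram, count in char_ngrams(target_term).items():
        features["tgt:" + gram] = count
    return features


print(pair_features("metastatic", "μεταστατικού"))
```

A dict like this could be vectorised and passed to any random forest implementation, with pairs from a seed dictionary as positive instances and mismatched pairs as negative instances.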

  9. Methodology: Hybrid Approach
     • For a source term s to be translated, RF and SPA each suggest N ranked candidate translations
     • SPA ranks candidates by translation probability; RF ranks candidates by classification margin
     Example: type 2 diabetes mellitus
     - Candidates (SPA, RF): του σακχαρώδη διαβήτη τύπου 2; σακχαρώδη διαβήτη τύπου 2; σακχαρώδους διαβήτη τύπου 2; διαβήτη τύπου 2; διαβήτη τύπου 2 και καρδιακή; σακχαρώδη διαβήτη τύπου 2

  10. Methodology: Hybrid Approach
     • Dictionaries containing N candidate translations have a limited number of applications (e.g., SMT)
     • To enrich existing terminologies, human curators need to post-edit the output of term alignment methods
     • Objective is to improve the precision of the top-ranked candidates (precision@N=1)
     • Voting: intersection of RF and SPA, ranking candidates according to translation probability by SPA
     Example: type 2 diabetes mellitus
     - Candidates (SPA, RF): του σακχαρώδη διαβήτη τύπου 2; σακχαρώδη διαβήτη τύπου 2; σακχαρώδους διαβήτη τύπου 2; διαβήτη τύπου 2; διαβήτη τύπου 2 και καρδιακή; σακχαρώδη διαβήτη τύπου 2
     - Voting: σακχαρώδη διαβήτη τύπου 2
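The voting step can be sketched as follows: keep only candidates proposed by both RF and SPA, then order them by the SPA translation probability. The function name `vote`, the split of the slide's six candidates into an SPA list and an RF list, and all scores are assumptions for illustration; the transcript does not preserve which candidate came from which method or any score.

```python
def vote(rf_candidates, spa_candidates):
    """Intersect RF and SPA candidates and rank by SPA translation probability.

    rf_candidates:  list of (translation, classification_margin)
    spa_candidates: list of (translation, translation_probability)
    """
    rf_translations = {translation for translation, _margin in rf_candidates}
    shared = [(t, p) for t, p in spa_candidates if t in rf_translations]
    return sorted(shared, key=lambda pair: pair[1], reverse=True)


# Illustrative scores only (not from the slides); candidate split is assumed.
spa = [
    ("του σακχαρώδη διαβήτη τύπου 2", 0.40),
    ("σακχαρώδη διαβήτη τύπου 2", 0.35),
    ("σακχαρώδους διαβήτη τύπου 2", 0.10),
]
rf = [
    ("διαβήτη τύπου 2", 0.9),
    ("διαβήτη τύπου 2 και καρδιακή", 0.6),
    ("σακχαρώδη διαβήτη τύπου 2", 0.5),
]

voted = vote(rf, spa)
print(voted)
```

When the two candidate lists do not overlap, `vote` returns an empty list, which is one way the low recall noted in the discussion could arise.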

  11. Experiments: Corpora
     • EMEA (Tiedemann, 2009), a biomedical parallel corpus from the European Medicines Agency
     • 1.5K sentence-aligned documents in 22 languages
     • Drug usage guidelines
     en-el
     - 372K sentences
     - 17,907 unique English MWTs
     en-ro
     - 321K sentences
     - 16,625 unique English MWTs

  12. Experiments: Evaluation
     • Randomly sampled 1,000 English MWTs
     • For each English MWT, we selected the top 20 translation candidates
     [Table: RF, SPA and Voting for en-el and en-ro]
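A minimal sketch of a top-k precision measure for this kind of evaluation. The slides mention precision@N=1 and top-20 precision but do not define them; the definition used here (the share of evaluated source terms with at least one correct translation among the first k candidates) and the names `top_k_precision`, `ranked`, and `gold` are assumptions, and the data is a toy example.

```python
def top_k_precision(ranked, gold, k):
    """Fraction of source terms with a correct translation in their top-k candidates.

    ranked: dict mapping a source term to its ranked list of candidate translations
    gold:   dict mapping a source term to the set of translations judged correct
    """
    if not ranked:
        return 0.0
    hits = sum(
        1
        for term, candidates in ranked.items()
        if any(c in gold.get(term, ()) for c in candidates[:k])
    )
    return hits / len(ranked)


# Toy data for illustration only.
ranked = {"term a": ["x", "y", "z"], "term b": ["p", "q"]}
gold = {"term a": {"y"}, "term b": {"r"}}

for k in (1, 2, 20):
    print(k, top_k_precision(ranked, gold, k))
```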

  13. Experiments: Results English-Greek dataset

  14. Experiments: Results English-Romanian dataset

  15. Experiments: Results English-Greek dataset

  16. Experiments: Results English-Romanian dataset

  17. Error Analysis
     RF
     • Partial matches
       - urea cycle disorder: διαταραχών του κύκλου της ουρίας, glossed (disorder) (cycle) (urea)
     • Discontinuous translations
       - metabolic diseases: boli ereditare de metabolism, glossed (diseases) (metabolic) (hereditary)
     SPA
     • Statistically-based tool
       - Performance largely affected by term frequency
     • Top-20 precision on terms having varying frequency

  18. Error Analysis: Performance decreases for lower-frequency terms (English-Greek dataset)

  19. Error Analysis English-Romanian dataset

  20. Discussion
     • Hybrid approach
       - Compilation of bilingual terminologies from parallel corpora
       - Enrich UMLS with two under-resourced languages
     • Observations:
       - Substantially improves top-1 precision of RF and SPA
       - Outperforms SPA when translating low-frequency terms
       - Low recall

  21. Future Work
     • Investigate integration of bilingual terminologies with SMT
     [Diagram: Parallel corpus, SPA, Phrase table, LM, RF, SMT; annotations: Lower top-1 precision; Poor performance for low-frequency terms]

  22. Questions?
