
Mutual bilingual terminology extraction

Mutual bilingual terminology extraction. Le An Ha*, Gabriela Fernandez**, Ruslan Mitkov*, Gloria Corpas*** * University of Wolverhampton ** Universidad de Sevilla *** Universidad de Malaga E-mail: *{l.a.ha,r.mitkov}@wlv.ac.uk, **gfernan@us.es, ***gcorpas@ya.com. Introduction.


Presentation Transcript


  1. Mutual bilingual terminology extraction Le An Ha*, Gabriela Fernandez**, Ruslan Mitkov*, Gloria Corpas*** * University of Wolverhampton ** Universidad de Sevilla *** Universidad de Malaga E-mail: *{l.a.ha,r.mitkov}@wlv.ac.uk, **gfernan@us.es, ***gcorpas@ya.com

  2. Introduction
     • Terms and Terminology
       • Terms: linguistic units which have specialised use.
       • Terminology: the system of terms in a subject field.
     • Terminology is vital for specialised communication, in both monolingual and multilingual contexts.

  3. Monolingual and multilingual terminology processing
     • Monolingual terminology processing
       • Three steps: extraction, validation, and organisation.
       • Automatic extraction approaches: linguistic (may produce noise), statistical (may overlook important but low-frequency terms), and hybrid.
     • Bilingual/multilingual term extraction
       • The same three steps as in monolingual terminology processing: extraction, validation, and organisation.
       • Relies on parallel corpora aligned at a certain level.
       • Different models are used to align term candidates.
       • Alignment is treated as an independent step.

  4. Our approach: mutual bilingual term extraction
     • Alignment plays an active role in term extraction.
     • Automatic alignment is used to propagate the strengths of terminology extraction from one language into the other.
     • Relies on the availability of parallel corpora aligned at sentence level.

  5. Mutual term extraction: three steps
     • 1: lists of term candidates are extracted for the source and target languages;
     • 2: term candidates from the target language are aligned to those in the source language;
     • 3: if a term candidate in the target language is aligned to a term candidate in the source language, its term score is increased, i.e. the candidate is promoted.
     • Steps 1-3 can be repeated many times.

  6. Monolingual term extraction
     • Lexical-syntactic-statistical approach
     • Lexical-syntactic POS patterns
       • English: [AN]*(NP)?[AN]*N
       • Spanish: N[NA]*(PN)?[NA]*
     • Statistical measures
       • Different measures tested
       • Frequency is chosen
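The POS patterns on this slide can be applied as ordinary regular expressions once each token's tag is reduced to a single letter. The following is a minimal sketch, not the authors' implementation. It assumes that A, N and P stand for adjective, noun and preposition (the slide does not define the letters), that any other tag is written as "x", and that the input is already POS-tagged. The name `extract_candidates` is hypothetical.

```python
import re
from collections import Counter

# Patterns from the slide, applied to a string of one-letter POS tags.
# Assumed reading of the letters: A = adjective, N = noun, P = preposition.
PATTERNS = {
    "en": re.compile(r"[AN]*(NP)?[AN]*N"),
    "es": re.compile(r"N[NA]*(PN)?[NA]*"),
}

def extract_candidates(tagged_sentences, lang):
    """Count term candidates in sentences given as lists of (word, tag) pairs.

    Each tag must be exactly one character so that positions in the tag
    string line up with positions in the sentence.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        tags = "".join(tag for _, tag in sentence)
        # finditer yields leftmost, greedy, non-overlapping matches only,
        # so shorter candidates nested inside a longer one are not counted.
        for match in PATTERNS[lang].finditer(tags):
            words = [w for w, _ in sentence[match.start():match.end()]]
            counts[" ".join(words).lower()] += 1
    return counts
```

Because frequency is the chosen statistical measure, the returned counts can serve directly as initial term scores.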

  7. Term alignment
     • Contingency table-based method: log-likelihood is used to estimate how likely it is that a term candidate in the source language is translated by a given term candidate in the target language.
     • The table is built using a parallel corpus aligned at sentence level.
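As an illustration of the contingency table-based method, the sketch below builds a 2x2 table from sentence-aligned data and computes the standard log-likelihood ratio statistic (G²). The slides do not give the exact formula used, so this formulation is an assumption. The names `contingency` and `log_likelihood` are hypothetical, as is the representation of each aligned sentence pair as two sets of term candidates.

```python
import math

def contingency(pairs, src, tgt):
    """Build the 2x2 table (a, b, c, d) for one source/target candidate pair.

    `pairs` is a list of (source_candidates, target_candidates) sets,
    one entry per aligned sentence pair.
    """
    a = b = c = d = 0
    for src_terms, tgt_terms in pairs:
        has_src, has_tgt = src in src_terms, tgt in tgt_terms
        if has_src and has_tgt:
            a += 1   # both candidates occur
        elif has_src:
            b += 1   # source candidate only
        elif has_tgt:
            c += 1   # target candidate only
        else:
            d += 1   # neither occurs
    return a, b, c, d

def log_likelihood(a, b, c, d):
    """G2 = 2 * sum(observed * ln(observed / expected)) over the four cells."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
```

G² is large for any strong association, negative as well as positive, so a practical aligner would typically also check that the two candidates co-occur more often than expected (a·d > b·c).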

  8. Contingency table for “lymph node” and “ganglio linfático”

  9. Boosting algorithms
     • Hypothesis: the term score of a term candidate in one language can be used to improve the term score of its aligned candidate in the other language, and vice versa, via boosting processes.
     • Given that:
       AL(T1,T2): alignment score of the two term candidates T1 and T2;
       TCs[T]: term score of the candidate T in the source language;
       TCt[T]: term score of the candidate T in the target language;
       BT(TC1,TC2): boosting function, i.e. how the term score of the aligned term affects the target term score. Example: simple addition, BT(TC1,TC2)=TC1+TC2.

  10. Boosting algorithms (cont.)
     • Single boosting: the boosting process is performed on the target language only:
         Foreach term candidate Tt in the target language
           Ts=argmax(AL(Tt,Ti));
           TCt[Tt]=BT(TCs[Ts],TCt[Tt]);
     • Double boosting: the boosting process is performed on both source and target languages:
         Foreach term candidate Ts in the source language
           Tt=argmax(AL(Ts,Ti));
           TCs[Ts]=BT(TCs[Ts],TCt[Tt]);
         Foreach term candidate Tt in the target language
           Ts=argmax(AL(Tt,Ti));
           TCt[Tt]=BT(TCs[Ts],TCt[Tt]);
     • Recursive boosting: the boosting process is repeated for both languages until the term candidate lists have stabilised.
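The three boosting variants can be sketched in Python as follows. This is one reading of the pseudocode, with several assumptions that the slides do not state: alignment scores are stored in a dict keyed by (source, target) pairs, candidates with no alignment entry are left unchanged, and "stabilised" is taken to mean that the ranked order of both candidate lists stops changing (with simple addition the raw scores keep growing). All function names and the `max_rounds` cap are hypothetical.

```python
def add(tc1, tc2):
    """Boosting function BT from the slides: simple addition."""
    return tc1 + tc2

def flip(al):
    """Re-key alignment scores from (source, target) to (target, source)."""
    return {(t, s): score for (s, t), score in al.items()}

def boost(scores, other_scores, al, bt=add):
    """Boost each term in `scores` with the score of its best-aligned term.

    `al` maps (term, other_term) to an alignment score. Terms with no
    alignment entry keep their score (an assumption, see lead-in).
    """
    boosted = dict(scores)
    for term in scores:
        aligned = [(al[(term, o)], o) for o in other_scores if (term, o) in al]
        if aligned:
            _, best = max(aligned)          # argmax over the alignment score
            boosted[term] = bt(other_scores[best], scores[term])
    return boosted

def single_boost(tcs, tct, al):
    """Boost the target-language scores only."""
    return tcs, boost(tct, tcs, flip(al))

def double_boost(tcs, tct, al):
    """Boost source scores first, then target scores using the new source scores."""
    tcs = boost(tcs, tct, al)
    tct = boost(tct, tcs, flip(al))
    return tcs, tct

def ranking(scores):
    """Terms ordered by descending score, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))

def recursive_boost(tcs, tct, al, max_rounds=20):
    """Repeat double boosting until both rankings stop changing."""
    for _ in range(max_rounds):
        new_tcs, new_tct = double_boost(tcs, tct, al)
        stable = ranking(new_tcs) == ranking(tcs) and ranking(new_tct) == ranking(tct)
        tcs, tct = new_tcs, new_tct
        if stable:
            break
    return tcs, tct
```

With frequency counts as initial scores, a candidate that has a well-scored partner in the other language moves up its list, while unaligned candidates keep their original score.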

  11. Parameters
     • Factors affecting the outcome of the proposed algorithms: the alignment function AL, the mechanism used to calculate the initial term scores TCs and TCt, and the boosting function BT.
     • Different combinations of these functions were tested.
     • The best term score function is frequency, and the best boosting function is simple addition.
     • For future work, we propose several probabilistic models that provide a better probabilistic foundation for the boosting function.

  12. Evaluation: data, gold standard, and evaluation metrics
     • Data
       • MedlinePlus parallel texts (English/Spanish) on the topic of cancer
       • 9,250 segments for each language
       • 31,498 English words, 30,344 Spanish words
       • Aligned with Trados WinAlign, manually corrected
     • Gold standard
       • 389 English terms, 442 Spanish terms, and 357 term pairs were validated and used as a gold standard.
     • Evaluation metrics
       • F-measure
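For reference, the F-measure of a candidate list against a gold standard can be computed as sketched below. The slides do not say which weighting was used, so the balanced form (beta = 1) is an assumption. The helper `f_at_n`, which scores only the n highest-ranked candidates, is a hypothetical illustration of evaluating at a fixed number of candidates.

```python
def f_measure(candidates, gold, beta=1.0):
    """F-measure of a set of extracted candidates against gold-standard terms."""
    candidates, gold = set(candidates), set(gold)
    correct = len(candidates & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(candidates)   # share of candidates that are terms
    recall = correct / len(gold)            # share of gold terms that were found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def f_at_n(scores, gold, n):
    """F-measure of the n highest-scoring candidates in a {term: score} dict."""
    top = sorted(scores, key=lambda t: (-scores[t], t))[:n]
    return f_measure(top, gold)
```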

  13. Evaluation: results
     • Alignment accuracy
       • In total, the algorithm suggests 472 translation pairs, of which 374 are confirmed as correct translations. This puts the accuracy of the alignment at about 0.8 (374/472 ≈ 0.79).
     • Term extraction performance: improved by 10 to 25%

  14. Results (cont.) [Figure: F-measure (y-axis, 0.5 to 0.75) plotted against the number of candidates (x-axis, 400 to 800). Series: English TF, Spanish TF, English TF (Boosted), Spanish TF (Boosted), English converge boosted, Spanish converge boosted.]

  15. Conclusion and future directions
     • A promising approach, but
     • More research will be needed
       • A better mathematical foundation:
         • Probabilistic models
       • More experiments
         • Other domains and language pairs
           • Legal
           • English-Hindi

  16. Thank you very much Questions? Comments? Criticisms?
