
En -> Cz MT system based on TR

Zdeněk Žabokrtský, IFAL, Charles University in Prague


Presentation Transcript


  1. En->Cz MT system based on TR. Zdeněk Žabokrtský, IFAL, Charles University in Prague

  2. Goals • primary goal • to build a high-quality linguistically motivated MT system using the PDT layered framework • secondary goal • to create a system for testing the true usefulness of various NLP tools within a real-life application

  3. MT pyramid in terms of PDT [Figure: MT pyramid with the w-layer, m-layer, a-layer, and t-layer; analysis on the source-language side, synthesis on the target-language side, and transfer between them, marked "?"]

  4. Building the first prototype... • chosen direction: English -> Czech • main design decisions: • several well-defined, linguistically relevant intermediate levels • modularity - decompose the task into many isolated subtasks • neutral w.r.t. chosen methodology (e.g. rules vs. statistics) • available resources • experience (and software tools) from PDT and PCEDT • freely available NLP tools for analysis on the English side • an existing module for sentence synthesis on the Czech side

  5. MT pyramid in the prototype [Figure: pyramid from input text through the src-m-layer, src-p-layer, src-a-layer, and src-t-layer to the trg-t-layer and output text]

  6. Data representation • different types of structures associated with each source sentence • they should be stored simultaneously and interlinked, instead of being rewritten • new data format supported by TrEd • tree bundles (instead of single trees) for each sentence • simplified addition of new attributes [Figure: example structures for one sentence, with nodes labeled VP, PP, "for", and "John"]
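
The bundle idea on this slide can be illustrated with a small sketch. It is written in Python for brevity and is not the TrEd data format: the class names `Node` and `Bundle`, the `links` and `attrs` fields, and the node ids are all hypothetical. It only shows several trees for one sentence being stored side by side and interlinked by node id, rather than one tree being rewritten into the next.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One node of one tree; `links` points to nodes in other trees by id."""
    node_id: str
    label: str
    children: List["Node"] = field(default_factory=list)
    links: Dict[str, str] = field(default_factory=dict)  # layer name -> node id
    attrs: Dict[str, str] = field(default_factory=dict)  # open-ended attributes


@dataclass
class Bundle:
    """All structures built for one source sentence, kept side by side."""
    sentence: str
    trees: Dict[str, Node] = field(default_factory=dict)  # layer name -> root

    def add_tree(self, layer: str, root: Node) -> None:
        self.trees[layer] = root

    def find(self, layer: str, node_id: str) -> Optional[Node]:
        """Depth-first lookup of a node by id within one layer."""
        stack = [self.trees[layer]] if layer in self.trees else []
        while stack:
            node = stack.pop()
            if node.node_id == node_id:
                return node
            stack.extend(node.children)
        return None


# Hypothetical example: a phrase-structure tree and a dependency tree
# for the same fragment, both kept in one bundle.
bundle = Bundle("for John")
p_root = Node("p-1", "PP", [Node("p-2", "for"), Node("p-3", "John")])
a_root = Node("a-1", "John", [Node("a-2", "for")], links={"src-p": "p-3"})
bundle.add_tree("src-p", p_root)
bundle.add_tree("src-a", a_root)

# Follow the link from the a-tree node back to its p-tree counterpart.
linked = bundle.find("src-p", a_root.links["src-p"])
```

Because attributes live in an open dictionary, a later module can attach a new attribute to a node without any change to the structure definition.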

  7. Translation scenarios • translation scenario – a chain of translation modules • modules implemented as (or wrapped by) btred/ntred macros (Perl) • well-defined phases, so that the modules can be easily substituted [Figure: three example scenarios, labeled Scenario 1, Scenario 2, and Scenario 3; their contents were not preserved in the transcript]
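
A scenario as a chain of substitutable modules might look roughly like the sketch below. It is written in Python rather than as btred/ntred macros, and every name in it (`run_scenario`, `tokenize`, `lowercase`, `uppercase`, the dictionary used as a bundle) is a hypothetical placeholder, not part of the system described here.

```python
from typing import Callable, Dict, List

Bundle = Dict[str, object]           # hypothetical stand-in for a tree bundle
Module = Callable[[Bundle], Bundle]  # a module reads a bundle and extends it


def tokenize(bundle: Bundle) -> Bundle:
    """Toy module: whitespace tokenization of the input text."""
    bundle["tokens"] = str(bundle["text"]).split()
    return bundle


def lowercase(bundle: Bundle) -> Bundle:
    """Toy module for a later phase: normalize tokens to lower case."""
    bundle["tokens"] = [t.lower() for t in bundle["tokens"]]
    return bundle


def uppercase(bundle: Bundle) -> Bundle:
    """Alternative module for the same phase as `lowercase`."""
    bundle["tokens"] = [t.upper() for t in bundle["tokens"]]
    return bundle


def run_scenario(scenario: List[Module], bundle: Bundle) -> Bundle:
    """Apply the modules in order; each one sees the output of the previous."""
    for module in scenario:
        bundle = module(bundle)
    return bundle


# Two scenarios that differ in a single phase: one module swapped for another.
scenario_a = [tokenize, lowercase]
scenario_b = [tokenize, uppercase]

result_a = run_scenario(scenario_a, {"text": "John sleeps"})
result_b = run_scenario(scenario_b, {"text": "John sleeps"})
```

The point of the design is that swapping one module only requires that the replacement read and write the same bundle fields as the module it replaces.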

  8. Input text -> src-m-data 1) segment the input text into sentences (Lingua::EN::Tagger from CPAN) 2) create an empty tree bundle for each sentence 3) tokenize+tag the sentences (Lingua::EN::Tagger from CPAN) 4) lemmatize each token with Helmut Schmid's TreeTagger

  9. src-m-data -> src-p-data 5) phrase-structure parsing (Lingua::CollinsParser from CPAN) 6) add p-node identifiers

  10. src-p-data -> src-a-data 7) mark phrase heads (Collins’s heads + minor arrangements) 8) run constituency -> dependency transformation 9) assign (selected) analytical functions 10) mark subject nodes 11) add a-node identifiers
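
Steps 7 and 8 can be sketched as follows. This is a generic head-based constituency-to-dependency conversion in Python, assuming heads have already been marked on the phrase-structure tree; it is not the module used in the prototype. The names `PNode`, `lexical_head`, and `to_dependencies` are hypothetical, and word forms serve as node identifiers only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PNode:
    """Phrase-structure node; terminals carry a word, phrases carry children."""
    label: str
    children: List["PNode"] = field(default_factory=list)
    is_head: bool = False       # True if this node is the head child of its parent
    word: Optional[str] = None  # set only on terminals


def lexical_head(node: PNode) -> str:
    """Follow head-marked children down to the terminal heading this phrase."""
    if node.word is not None:
        return node.word
    for child in node.children:
        if child.is_head:
            return lexical_head(child)
    raise ValueError(f"no head child marked under {node.label}")


def to_dependencies(node: PNode,
                    edges: Optional[List[Tuple[str, str]]] = None
                    ) -> List[Tuple[str, str]]:
    """Collect (dependent, governor) pairs: the lexical head of each
    non-head child depends on the lexical head of its parent phrase."""
    if edges is None:
        edges = []
    if node.word is not None:
        return edges
    governor = lexical_head(node)
    for child in node.children:
        if not child.is_head:
            edges.append((lexical_head(child), governor))
        to_dependencies(child, edges)
    return edges


# Hypothetical input: (S (NP John) (VP* (VBD* saw) (NP Mary))), * = head child.
john = PNode("NP", [PNode("NNP", word="John", is_head=True)])
mary = PNode("NP", [PNode("NNP", word="Mary", is_head=True)])
vp = PNode("VP", [PNode("VBD", word="saw", is_head=True), mary], is_head=True)
sentence = PNode("S", [john, vp])

edges = to_dependencies(sentence)
```

In a real pipeline the nodes would be referenced by their identifiers rather than by word form, since a sentence can contain the same word more than once.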

  11. src-a-data -> src-t-data 12) determine the t-tree topology (collapsing function-word subtrees) 13) label t-nodes with t-lemmas 14) assign coordination/apposition functors 15) mark t-nodes corresponding to finite clauses 16) assign (some of) the remaining functors 17) fill the nodetype attribute 18) detect grammatical co-reference in relative clauses 19) determine the semantic part of speech 20) fill grammateme attributes (number, tense, degree...) 21) detect the sentence modality

  12. src-t-data -> trg-t-data 22) clone the source-language t-tree 23) translate t-lemmas using a simple 1:1 probabilistic lexicon 24) set the gender attribute according to the noun lemma 25) set the aspect attribute according to the verb lemma 26) apply specific conversion rules (e.g. for indefinite pronouns)
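
Steps 22 and 23 can be illustrated with a minimal Python sketch. It assumes that a "1:1 probabilistic lexicon" maps each source lemma to candidate target lemmas with probabilities and that the most probable candidate is chosen. The lexicon entries and probabilities below are made up for illustration, the names `TNode`, `best_translation`, and `transfer` are hypothetical, and leaving unknown lemmas untranslated is an assumption of this sketch.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TNode:
    """Minimal stand-in for a t-node: a t-lemma plus child nodes."""
    t_lemma: str
    children: List["TNode"] = field(default_factory=list)


# Made-up toy lexicon: source lemma -> {target lemma: probability}.
LEXICON: Dict[str, Dict[str, float]] = {
    "girl": {"dívka": 0.7, "holka": 0.3},
    "die": {"zemřít": 0.9, "umřít": 0.1},
}


def best_translation(lemma: str, lexicon: Dict[str, Dict[str, float]]) -> str:
    """Pick the most probable target lemma; keep the source lemma if unknown."""
    candidates = lexicon.get(lemma)
    if not candidates:
        return lemma
    return max(candidates, key=candidates.get)


def transfer(src_root: TNode, lexicon: Dict[str, Dict[str, float]]) -> TNode:
    """Clone the source tree, then replace each t-lemma node by node.
    The tree topology is copied unchanged; the source tree is not modified."""
    trg_root = copy.deepcopy(src_root)
    stack = [trg_root]
    while stack:
        node = stack.pop()
        node.t_lemma = best_translation(node.t_lemma, lexicon)
        stack.extend(node.children)
    return trg_root


src = TNode("die", [TNode("girl"), TNode("flu")])
trg = transfer(src, LEXICON)
```

Cloning before translating keeps the source-language tree available in the bundle, so a wrong target lemma can still be compared against the lemma it came from.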

  13. trg-t-data -> output sentence 27) for prepositional groups, guess the target-language surface form 28) run Jan Ptáček’s sentence generator

  14. Translation sample • A Turkish girl has died from bird flu, days after her brother and sister died from the disease. The girl, 11, who lived on a poultry farm in eastern Turkey's Van province, was being treated in hospital after her siblings became infected with bird flu. The cases are the first human deaths from bird flu outside Asia, where the virus has killed more than 70 people. The hospital in Van is treating 15 others, three of whom are in a critical condition, according to a doctor there. The latest victim, Hulya Kocyigit, died early on Friday at the hospital. • Turecká ďouka zemřela z ptačí chřipky dny after, že její bratr a sestra zemřeli z nemoci. Ďouka 11, kdo žilo v drůbeží farmě ve van provincii východního Turecka, jsoucno zacházet v nemocnici, že její sourozenci slušeli nakažený s ptačí chřipkou. Případy jsou přední lidské smrti z ptačí chřipky mimo Asii, kde virus zabilo than 70 lid. Nemocnice ve Van zachází 15 zbývajících, whom three of v kritické podmínce souzvuk lékaře tam. Nejpozdnější oběť Kocyigit Hulya zemřela brzy v pátku v nemocnici.

  15. Final remarks • Indeed, we have just started (<1000 Perl LOCs, <50 development hours) and the performance is limited at this moment... • However, the system works and can be tested and gradually improved. • Every translation error can be traced back to its source. • Any part of the system can be easily “unplugged” and substituted with a better module.
