Machine Translation using Tectogrammatics Zdeněk Žabokrtský, IFAL, Charles University in Prague
Overview • Part I - theoretical background • Part II - TectoMT system
MT pyramid (in terms of PDT) • Key question in MT: optimal level of abstraction? • Our answer: somewhere around tectogrammatics • high generalization over different language characteristics, but still computationally (and mentally!) tractable
Basic facts about "Tecto" • introduced by Petr Sgall in the 1960s • implemented in the Prague Dependency Treebank (PDT) 2.0 • each sentence represented as a deep-syntactic dependency tree • functional words accompanying an autosemantic word "collapse" with it into a single t-node, labeled with the autosemantic t-lemma • added t-nodes (e.g. because of pro-drop) • semantically indispensable syntactic and morphological categories rendered by a complex system of t-node attributes (functors + subfunctors, grammatemes for tense, number, degree of comparison, etc.)
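To make the t-node attribute system concrete, here is a minimal sketch of how a t-node could be modeled. The attribute names (t-lemma, functor, grammatemes) follow the slide; the class layout and the toy values are illustrative assumptions, not the actual PDT/TectoMT data format. All code sketches in this deck use Python for brevity, even though TectoMT itself is written in Perl.

```python
from dataclasses import dataclass, field
from typing import Optional, List, Dict

# Minimal illustrative model of a t-node; attribute names follow PDT terminology
# (t-lemma, functor, grammatemes), but the class layout and the toy values below
# are assumptions for illustration, not the actual PDT/TectoMT data format.
@dataclass
class TNode:
    t_lemma: str                                   # autosemantic lemma the node is labeled with
    functor: Optional[str] = None                  # e.g. PRED, ACT, PAT (possibly with a subfunctor)
    grammatemes: Dict[str, str] = field(default_factory=dict)  # tense, number, degree, ...
    children: List["TNode"] = field(default_factory=list)      # dependent t-nodes

    def add_child(self, child: "TNode") -> "TNode":
        self.children.append(child)
        return child

# "It will be delivered to Mr. Green's assistants ...": functional words such as
# "will", "be", "to" and the possessive "'s" collapse into the t-nodes of their
# autosemantic heads instead of receiving nodes of their own.
deliver = TNode("deliver", functor="PRED", grammatemes={"tense": "post"})
deliver.add_child(TNode("assistant", functor="ADDR", grammatemes={"number": "pl"}))
```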
SMT and limits of growth • current state-of-the-art approaches to MT • n-grams + large parallel (and also monolingual) corpora + huuuuge computational power • n-grams are very greedy! • availability (or even existence!) of more data? • example: Czech-English parallel data • ~1 MW - easy (just download and align some tens of e-books) • ~10 MW - doable (parallel corpus Czeng) • ~100 MW - not now, but maybe in a couple of years... • ~1 GW - ? • ~10 GW (~ 100 000 books) - Was it ever translated???
How could tecto help SMT? • n-gram view: • manifestations of lexemes are mixed with manifestations of the language means expressing the relations between the lexemes and of other grammar rules • inflectional endings, agglutinative affixes, functional words, word order, punctuation, orthographic rules ... • It will be delivered to Mr. Green's assistants at the nearest meeting. • training data sparsity • how could tecto ideas help? • within each sentence, clear separation of meaningful "signs" from "signs" which are only imposed by grammar (e.g. by agreement) • clear separation of lexical, syntactical and morphological meaning components • modularization of the translation task -> potential for better structuring of statistical models -> more effective exploitation of the limited training data
"Semitecto" • abstract sentence representation, tailored for MT purposes • motivation: • not to make decisions which are not really necessary for the MT process (such as distinguishing between many types of temporal and directional semantic complementations) • given the target-language "semitecto" tree, we want the sentence generation to be deterministic • slightly "below" tecto (w.r.t. the abstraction axis): • adopting the idea of separating lexical, syntactical and morphological meaning components; adopting the t-tree topology principles • adopting many t-node attributes (especially grammatemes, coreference, etc.) • but (almost) no functors, no subfunctors, no WSD, no pointers to valency dictionary, no tfa... • closer to the surface-syntax • main innovation: concept of formemes
Formemes • formeme = the morphosyntactic means of expressing the dependency relation • n:v+6 (in Czech) = a semantic noun expressed on the surface as a prepositional group in the locative with the preposition "v" • v:that+fin/a (in English) = a semantic verb expressed in the active voice as the head of a subordinate clause introduced by the subordinating conjunction "that" • obviously, the sets of formeme values are specific to each of the four semantic parts of speech • in fact, formemes are edge labels that partially substitute for functors • what is NOT captured by formemes: • morphological categories imposed by grammar rules (esp. by agreement), such as gender, number and case for adjectives in attributive positions • morphological categories already represented by grammatemes, such as degree of comparison for adjectives, tense for verbs, number for nouns
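To make the notation concrete, the sketch below splits formeme strings such as n:v+6 or v:that+fin/a into a semantic part of speech, an optional preposition/conjunction marker, the surface form or case, and the voice suffix. The decomposition is my own reading of the examples above, not an official TectoMT routine.

```python
# Illustrative sketch: decompose formeme strings like "n:v+6" or "v:that+fin/a".
# The splitting rules are inferred from the examples on the slides, not taken
# from any official TectoMT code.
def parse_formeme(formeme: str) -> dict:
    sempos, _, rest = formeme.partition(":")   # "n" / "v+6",  "v" / "that+fin/a"
    form, _, voice = rest.partition("/")       # verbal formemes carry "/a" or "/p"
    marker, _, shape = form.rpartition("+")    # preposition or conjunction vs. case/form
    return {
        "sempos": sempos,                      # semantic part of speech (n, v, adj, adv)
        "marker": marker or None,              # e.g. "v", "that", "aby"; None if absent
        "shape": shape,                        # e.g. "6" (locative), "fin", "inf", "attr"
        "voice": voice or None,                # "a" = active, "p" = passive (verbs only)
    }

print(parse_formeme("n:v+6"))        # {'sempos': 'n', 'marker': 'v', 'shape': '6', 'voice': None}
print(parse_formeme("v:that+fin/a")) # {'sempos': 'v', 'marker': 'that', 'shape': 'fin', 'voice': 'a'}
print(parse_formeme("adj:attr"))     # {'sempos': 'adj', 'marker': None, 'shape': 'attr', 'voice': None}
```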
Formemes in the tree • Example: It is extremely important that Iraq held elections to a constitutional assembly.
Some more examples of proposed formemes • English • 661 adj:attr • 568 n:attr • 456 n:subj • 413 n:obj • 370 v:fin/a • 273 n:of+X • 238 adv: • 160 n:poss • 160 n:in+X • 146 v:to+inf/a • 92 adj:compl • 91 n:to+X • ... • 62 v:rc/a • ... • 51 v:that+fin/a • ... • 39 v:ger/a • Czech • 968 adj:attr • 604 n:1 • 552 n:2 • 497 v:fin/a • 308 n:4 • 260 adv: • 169 n:v+6 • 133 adj:compl • 117 v:inf • 104 n:poss • 86 n:7 • 82 v:že+fin/a • 77 v:rc/a • 63 n:s+7 • 53 n:k+3 • 53 n:attr • 50 n:na+6 • 47 n:na+4 • 42 v:aby+fin/a
Three-way transfer • translation process: (I have been asked by him to come -> Požádal mě, abych přišel) • 1. source-language sentence analysis up to the "semitecto" layer • 2. transfer of • lexemes (ask -> požádat, come -> přijít) • formemes (v:fin/p -> v:fin/a, v:to+inf -> v:aby+fin/a) • grammatemes (tense=past -> past, ∅ -> verbmod=cdn) • 3. target-language sentence synthesis from the "semitecto" layer
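As a rough sketch of what the transfer step could look like in code, the fragment below maps a source semitecto node to a target one factor by factor. The toy dictionaries stand in for real translation models, and none of the names correspond to TectoMT's actual API.

```python
# Illustrative sketch of the three-way transfer: lexemes, formemes and
# grammatemes are translated independently. The toy dictionaries stand in
# for real (probabilistic) translation models; names are not TectoMT's API.
LEXEME_MAP     = {"ask": "požádat", "come": "přijít"}
FORMEME_MAP    = {"v:fin/p": "v:fin/a", "v:to+inf": "v:aby+fin/a"}
GRAMMATEME_MAP = {("tense", "past"): ("tense", "past")}

def transfer_node(node: dict) -> dict:
    """Map one source-language semitecto node to a target-language node."""
    target = {
        "lemma":   LEXEME_MAP.get(node["lemma"], node["lemma"]),
        "formeme": FORMEME_MAP.get(node["formeme"], node["formeme"]),
        "grammatemes": dict(
            GRAMMATEME_MAP.get((k, v), (k, v)) for k, v in node["grammatemes"].items()
        ),
    }
    # Target-side grammar may require categories absent on the source side,
    # e.g. verbmod=cdn for the Czech "aby" clause.
    if target["formeme"] == "v:aby+fin/a":
        target["grammatemes"]["verbmod"] = "cdn"
    return target

# "to come" as a dependent of "ask":
print(transfer_node({"lemma": "come", "formeme": "v:to+inf", "grammatemes": {}}))
# -> {'lemma': 'přijít', 'formeme': 'v:aby+fin/a', 'grammatemes': {'verbmod': 'cdn'}}
```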
Adding statistics... • translation model (e.g. from the parallel corpus CzEng, 30 MW): P(lT | lS), P(fT | fS) • "binode" language model (e.g. from the partially parsed Czech National Corpus, 100 MW): P(lgov, ldep, f) • (diagram: source-language tree transferred into target-language tree)
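One plausible way to combine the two components when choosing among candidate lemma/formeme pairs for a tree edge is a simple log-linear score, sketched below. The probability tables, the floor value and the weights are illustrative assumptions, not the actual TectoMT decoder.

```python
import math

# Illustrative sketch: score one candidate (target lemma, target formeme) for a
# dependency edge by combining the translation model P(lT|lS), P(fT|fS) with the
# "binode" language model P(l_gov, l_dep, f). Tables, floor and weights are
# assumptions for illustration, not the actual TectoMT decoder.
P_LEMMA   = {("come", "přijít"): 0.7, ("ask", "požádat"): 0.6}
P_FORMEME = {("v:to+inf", "v:aby+fin/a"): 0.4, ("v:to+inf", "v:inf"): 0.5}
P_BINODE  = {("požádat", "přijít", "v:aby+fin/a"): 1e-4,
             ("požádat", "přijít", "v:inf"): 2e-5}

FLOOR = 1e-9   # crude stand-in for smoothing of unseen events

def score(src_lemma, src_formeme, tgt_lemma, tgt_formeme, tgt_gov, w_tm=1.0, w_lm=1.0):
    tm = (P_LEMMA.get((src_lemma, tgt_lemma), FLOOR)
          * P_FORMEME.get((src_formeme, tgt_formeme), FLOOR))
    lm = P_BINODE.get((tgt_gov, tgt_lemma, tgt_formeme), FLOOR)
    return w_tm * math.log(tm) + w_lm * math.log(lm)

# Which realization of the dependent "come" under governor "požádat" scores better?
for f in ("v:aby+fin/a", "v:inf"):
    print(f, round(score("come", "v:to+inf", "přijít", f, tgt_gov="požádat"), 2))
```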
Goals • primary goal • to build a high-quality linguistically motivated MT system using the PDT layered framework, starting with English -> Czech direction • secondary goals • to create a system for testing the true usefulness of various NLP tools within a real-life application • to exploit the abstraction power of tectogrammatics • to supply data and technology for other projects
MT triangle (diagram): interlingua, tectogrammatical, surface-syntactic, morphological and raw-text levels, spanning source and target language. Main design decisions • Linux + Perl • set of well-defined, linguistically relevant levels of language representation • neutral w.r.t. the chosen methodology (e.g. rules vs. statistics) • in-house OO architecture as the backbone, but easy incorporation of external tools (parsers, taggers, lemmatizers, etc.) • accent on modularity: a translation scenario is a sequence of translation blocks (modules corresponding to individual NLP subtasks), as sketched below
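The "translation scenario as a sequence of blocks" idea can be sketched as a plain pipeline over a shared document structure. The block names and the tiny runner below are illustrative stand-ins, not the real TectoMT (Perl) block API.

```python
# Illustrative sketch of a translation scenario as a sequence of blocks.
# Block names and this runner are assumptions for illustration; the real
# TectoMT system implements blocks in Perl with its own API.
from typing import Callable, List

Block = Callable[[dict], dict]   # a block reads and enriches a shared document structure

def tokenize(doc):      doc["tokens"] = doc["text"].split(); return doc
def tag_and_parse(doc): doc["a_tree"] = ["(stub analytical tree)"]; return doc   # e.g. external tagger + parser
def build_ttree(doc):   doc["t_tree"] = ["(stub semitecto tree)"]; return doc
def transfer(doc):      doc["t_tree_target"] = ["(stub transferred tree)"]; return doc
def synthesize(doc):    doc["translation"] = "(stub target sentence)"; return doc

SCENARIO: List[Block] = [tokenize, tag_and_parse, build_ttree, transfer, synthesize]

def run_scenario(text: str) -> dict:
    doc = {"text": text}
    for block in SCENARIO:          # each block corresponds to one NLP subtask
        doc = block(doc)
    return doc

print(run_scenario("It is extremely important that Iraq held elections.")["translation"])
```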
TectoMT - Example of analysis (1) • Sample sentence: It is extremely important that Iraq held elections to a constitutional assembly.
TectoMT - example of analysis (2) • phrase-structure tree:
TectoMT - example of analysis (3) • analytical tree
TectoMT - example of analysis (4) • tectogrammatical tree (with formemes)
Heuristic alignment • Sentence pair: • It is extremely important that Iraq held elections to a constitutional assembly. • Je nesmírně důležité, že v Iráku proběhly volby do ústavního shromáždění.
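The slides do not spell out the alignment heuristic itself; purely as an illustration of what a node-level heuristic might look like, the sketch below greedily links nodes of the two trees using a toy bilingual lexicon plus a small bonus for matching semantic parts of speech. This is a stand-in, not the actual TectoMT alignment procedure.

```python
# Illustration only: one possible greedy node-alignment heuristic using a toy
# bilingual lexicon and a bonus for compatible formemes. This is NOT the actual
# TectoMT alignment procedure, which the slides do not describe in detail.
LEXICON = {("důležitý", "important"), ("volby", "election"), ("Irák", "Iraq"),
           ("proběhnout", "hold"), ("shromáždění", "assembly"), ("ústavní", "constitutional")}

def align(cs_nodes, en_nodes):
    """cs_nodes/en_nodes: lists of (lemma, formeme) pairs; returns aligned index pairs."""
    links, used = [], set()
    for i, (cs_lemma, cs_formeme) in enumerate(cs_nodes):
        best, best_score = None, 0
        for j, (en_lemma, en_formeme) in enumerate(en_nodes):
            if j in used:
                continue
            score = 2 * ((cs_lemma, en_lemma) in LEXICON)                      # lexicon hit is the main signal
            score += (cs_formeme.split(":")[0] == en_formeme.split(":")[0])    # same sempos bonus
            if score > best_score:
                best, best_score = j, score
        if best is not None:
            links.append((i, best))
            used.add(best)
    return links

cs = [("důležitý", "adj:compl"), ("volby", "n:1"), ("ústavní", "adj:attr")]
en = [("important", "adj:compl"), ("election", "n:obj"), ("constitutional", "adj:attr")]
print(align(cs, en))   # -> [(0, 0), (1, 1), (2, 2)]
```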
Formeme pairs extracted from parallel aligned trees • 593 adj:attr adj:attr • 290 v:fin/a v:fin/a • 282 n:1 n:subj • 214 adj:attr n:attr • 165 n:2 n:of+X • 152 adv: adv: • 149 n:4 n:obj • 102 n:2 n:attr • 86 n:v+6 n:in+X • 79 n:poss n:poss • 73 n:1 n:obj • 61 n:2 n:obj • 51 v:inf v:to+inf/a • 50 adj:compl adj:compl • 39 n:2 n: • 34 n:4 n:subj • 34 n:attr n:attr • 32 v:že+fin/a v:that+fin/a • 32 n:2 n:poss • 27 n:4 n:attr • 27 n:2 n:subj • 26 adj:attr n:poss • 25 v:rc/a v:rc/a • 20 v:aby+fin/a v:to+inf/a
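Counts like these translate directly into the formeme translation probabilities P(fT | fS) used in the transfer step. The sketch below uses a small subset of the pairs above, regrouped so that the English formeme is the conditioning side (matching the English -> Czech direction), with plain relative-frequency estimation as an assumed choice.

```python
from collections import Counter, defaultdict

# Relative-frequency estimates of P(fT | fS) from formeme pair counts, using a
# small subset of the figures on this slide; pairs are keyed as
# (English formeme, Czech formeme) and no smoothing is applied (an assumption).
pair_counts = Counter({
    ("n:obj", "n:4"): 149, ("n:obj", "n:1"): 73, ("n:obj", "n:2"): 61,
    ("v:to+inf/a", "v:inf"): 51, ("v:to+inf/a", "v:aby+fin/a"): 20,
    ("v:that+fin/a", "v:že+fin/a"): 32,
})

totals = defaultdict(int)
for (f_src, _), count in pair_counts.items():
    totals[f_src] += count

P = {(f_src, f_tgt): count / totals[f_src] for (f_src, f_tgt), count in pair_counts.items()}

print(round(P[("n:obj", "n:4")], 3))               # ~0.527 with this subset
print(round(P[("v:to+inf/a", "v:aby+fin/a")], 3))  # ~0.282 with this subset
```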