
Idiom Processing within the EBMT System METIS-II

Dimitra Anastasiou, dimitra@d-anastasiou.com. Institut für Angewandte Informationsforschung (IAI), Saarland University, Germany. School of Computing, Dublin City University, Dublin, 15th October 2008.


Presentation Transcript


  1. Dimitra Anastasiou, dimitra@d-anastasiou.com. Institut für Angewandte Informationsforschung (IAI), Saarland University, Germany. School of Computing, Dublin City University, Dublin, 15th October 2008. Idiom Processing within the EBMT System METIS-II

  2. Aim-Methods Aim • Enhancement of translation quality of idiomatic expressions (idiomatic VPs in particular) within the German-to-English EBMT system METIS-II Resources • Bilingual idiom dictionary • Monolingual corpus • Syntactic rules according to the German topological field model

  3. Outlook • EBMT: statistical or rule-based MT? • Interpretation of idioms • Topological field model • Treatment of idioms by MT • METIS-II idiom resources • Translation process of METIS-II • Evaluation of METIS-II

  4. EBMT: Statistical or Rule-Based MT? Two tendencies of EBMT: • Combinations of EBMT with rule-based MT (RBMT) as hybrid systems [Sumita et al., 1990]; • Pure EBMT systems [Sato & Nagao, 1990]. EBMT lies between RBMT and statistical MT (SMT) [Carl & Way, 2005]. Reason: The transfer between SL and TL is always guided by translation examples, even if the replacement and/or modification of the sub-sequences are completely rule- or data-based.

  5. Outlook • EBMT: statistical or rule-based MT? • Interpretation of idioms • Identification by MT • Semantics • Syntax • Grammatical and lexical variants • Topological field model • Treatment of idioms by MT • METIS-II idiom resources • Translation process of METIS-II • Evaluation of METIS-II

  6. Interpretation of Idioms • Diverse terms (and accordingly definitions): idiom, semi-idiom, (cranberry) collocation, idiomatic/figurative/fixed/periphrastic phrase/expression, phraseologism, (dead) metaphor, etc. • Irregularity of idioms depends on: • Fixedness of constituents [Moon, 1998; Trawinski et al., 2008]; • Degree of compositionality; • Syntactic opaqueness: kick the bucket – die [Jackendoff, 1997; Gazdar et al., 1985]; • Poetic marking of the form, e.g. klipp und klar (clear as daylight), mit Rat und Tat (help and advice)

  7. Idiom Identification by MT jmdn. mit Argusaugen beobachten (so.-with-Argus eyes-observe): watch so. like a hawk. Er beobachtete den Mann, der die Bank betrat, mit Argusaugen. (He was watching the man, who entered the bank, like a hawk.) To identify the idiom, the MT system must recognize: • The contiguous parts of the idiom (mit Argusaugen); • The discontinuous parts of the idiom (beobachten) in any of its inflected forms; • The syntactic requirements of the idiom; • The clause boundaries (the idiom usually stays within one clause). • More information can be found in Volk (1998).

  8. Semantics (Degree of Compositionality) 1) Non-compositional: cranberry/unique constituents, e.g. as happy as a sandboy, on tenterhooks • A recent study on cranberry expressions in English and German is that of Trawinski et al. (2008); 2) Partially compositional: light-verb constructions (SVCs), e.g. außer Betrieb gehen – go out of service, außer Betrieb sein – be out of order • A recent study on German PP-verb SVCs is that of Krenn (2008); 3) Strictly compositional: collocations, e.g. Maßnahmen ergreifen – take measures

  9. Syntax • Syntactic categories of idioms • Realization of idioms with respect to syntactic gaps: • Continuous (without gaps) • Discontinuous (with gaps)

  10. Syntactic Categories • Noun phrase (NP): pink slip • Prepositional phrase (PP): by hook or crook • Combination NP-PP: danger for life and health • Adjective: prim and proper • Verbal phrase (iVP) • NP-Verb: kick the bucket • PP-Verb: fall on deaf ears • NP-PP-Verb: throw out the baby with the bath water • Proverb: less is sometimes more • Saying: gimme a break

  11. Grammatical Variants (1) • Number: pull up stakes / *pull up stake. Exception: keep tabs on sb/sth / keep a tab on sb/sth • Case: auf die Strasse gehen / *auf der Strasse gehen (take to the streets) • Determiner: play a role / *play the role • Possessive: in Verbindung treten / *in Pos.Pron. Verbindung treten (contact); Pos.Pron. Ohr leihen / *das Ohr leihen (listen)

  12. Grammatical Variants (2) • Negation: eine Rolle spielen (play a role) / keine Rolle spielen (any-role-play) / *nicht eine Rolle spielen (not-a-role-play); auf keinen grünen Zweig kommen (never get anywhere) / nicht/nie auf einen grünen Zweig kommen • Passivization • The more syntactically opaque an idiom is, the less likely it is to undergo passivization. Opaque: [kick] [the bucket] – die; *The bucket was kicked by him (only literal meaning). Transparent: [spill] [the beans] – [tell] [a secret]; The beans were spilled by him

  13. Lexical Variants • Substitution: kick the bucket / *kick the pail; hit the sack / *hit the hay • Modifiers • Adjective: keep tabs on / keep close tabs on • Adverb: noch grün hinter den Ohren sein / noch absolut grün hinter den Ohren sein (be half-baked)

  14. Outlook • EBMT: statistical or rule-based MT? • Interpretation of idioms • Topological field model • Realization of idioms • Discontinuous patterns • Treatment of idioms by MT • METIS-II idiom resources • Translation process of METIS-II • Evaluation of METIS-II

  15. Topological Field Model for German German clauses are divided into five fields; each field can be occupied by a certain number and kind of constituents [Drach, 1963; DUDEN, 1998; Dürscheid, 2000]: • pre-field (PF): only 1 constituent!; • left bracket (LB): finite verb (e.g. modal/auxiliary verb); • middle field (MF): many constituents and in free order; • right bracket (RB): non-finite verb (infinitive/participle form); • post-field (PF): subclause(s).

  16. Realization of Idioms • Continuous form: (iNP_MF | iPP_MF | [iNP_MF iPP_MF]) iV_RB. Er will nicht bei den Argumenten ständig den Bock (iNP_MF) zum Gärtner (iPP_MF) machen (iV_RB)! He-wants-not-during-the-arguments-always-the-bock-to-the-gardener-make! He does not always want to set the fox to keep the geese during the argumentation! • Discontinuous form: iV_LB (Adverb)*_MF (iNP_MF | iPP_MF | [iNP_MF iPP_MF]). Er macht (iV_LB) oft (Adverb) den Bock (iNP_MF) zum Gärtner (iPP_MF). He-makes-often-the-bock-to-the-gardener. He often sets the fox to keep the geese.

  17. Discontinuous patterns Den Bock zum Gärtner machen (set the fox to keep the geese) • Er macht (iV_LB) oft (Adverb) den Bock (iNP_MF) zum Gärtner (iPP_MF). • Er hat den Bock (iNP_MF) zum Gärtner (iPP_MF) oft gemacht (iV_RB). • ?Den Bock (iNP_PF) zum Gärtner (iPP_PF) hat er oft gemacht (iV_RB). • ?Den Bock (iNP_PF) hat er oft zum Gärtner (iPP_MF) gemacht (iV_RB).

  18. Outlook • EBMT: statistical or rule-based MT? • Interpretation of idioms • Topological field model • Treatment of idioms by MT • Idioms suitable for EBMT • METIS-II idiom resources • Translation process of METIS-II • Evaluation of METIS-II

  19. Treatment of Idioms by MT • Bar-Hillel (1952): “The only way for a machine to treat idioms is - not to have idioms!” • Power Translator Pro user manual (2000) warns the user to avoid inputting sentences containing idioms! • Power Translator Pro, SYSTRAN, T1 Langenscheidt cannot identify discontinuous idioms.

  20. Idioms suitable for EBMT • Idiomatic expressions are not suitable for rule-based MT (RBMT), but are suitable for EBMT. “Translation of an idiomatic expression can only be used to translate the same idiomatic expression; it cannot be used to translate a similar expression.” (Sumita et al., 1990: 210). • By contrast, Nomiyama (1992) emphasizes the disadvantage of EBMT’s using only thesauri to define a general semantic distance, resulting in over-generalization, which is a major problem in translating idiomatic expressions. • Related work: Santos (1990), Wehrli (1998), Ryu et al. (1999), and Gangadharaiah & Balakrishnan (2006).

  21. Outlook • EBMT: statistical or rule-based MT? • Interpretation of idioms • Topological field model • Treatment of idioms by MT • METIS-II idiom resources • Idiom lexicon • German corpus (annotation), (statistical analysis) • Syntactic rules • Translation process of METIS-II • Evaluation of METIS-II

  22. Idiom Resources • Bilingual idiom dictionary of 871 entries • Monolingual German corpus of 486 sentences • Syntactic rules according to the German topological field model

  23. METIS-II Project • Hybrid MT system (EBMT, RBMT, SMT); • Time span: 2004-2007; • SLs: Dutch, German, Greek, Spanish; • TL: British English; • Based on pattern matching; • Sources: • Huge monolingual TL corpus (BNC); • Bilingual dictionaries; • Tokenizer, PoS tagger, chunker, lemmatizer; • Manually constructed matching rules.

  24. Idiom Dictionary 871 entries: • 826 with equal PoS: 598 verbs/VPs, 163 interjections, 37 NPs, 28 PPs; • 45 with different PoS (verb/VP-interjection). Entry example: {de=den_Bock_zum_Gärtner_machen, mde={c=verb}, en=set_the_fox_to_keep_the_geese, men={c=verb}}.
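The entry example on the slide can be read into a small data structure. The following is only a sketch under assumptions: the entry syntax is inferred from the single example shown, and `parse_entry` is a hypothetical helper, not part of METIS-II.

```python
import re

# The one entry shown on the slide; the syntax is inferred from this example only.
ENTRY = ("{de=den_Bock_zum_Gärtner_machen, mde={c=verb}, "
         "en=set_the_fox_to_keep_the_geese, men={c=verb}}.")

# A value is either a nested {key=value} feature block or a bare token string.
FIELD = re.compile(r"(\w+)=(?:\{([^{}]*)\}|([^,{}]+))")


def parse_entry(text):
    """Parse one dictionary entry into a plain dict (hypothetical helper)."""
    body = text.strip().rstrip(".")
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("entry must be wrapped in braces")
    entry = {}
    for key, nested, plain in FIELD.findall(body[1:-1]):
        if nested:
            # Feature block such as c=verb becomes a small dict.
            entry[key] = dict(p.strip().split("=", 1) for p in nested.split(","))
        else:
            # Words of the idiom are joined by underscores in the entry.
            entry[key] = plain.strip().split("_")
    return entry


entry = parse_entry(ENTRY)
print(entry["de"], entry["mde"])
```

Splitting the German and English sides into word lists keeps the idiom constituents separately addressable, which is what a constituent-by-constituent matcher would need.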

  25. Idiom Corpus Three corpus resources: • Europarl (EP); • DWDS (Digital lexicon of the German language in the 20th century); • Mixture of data sets (MDS): manually constructed (IAI), real examples (Internet). MWE counts: • 80 MWEs: 63 cont. (79%), 17 disc. (21%); • 131 MWEs: 91 cont. (69%), 40 disc. (31%); • 275 MWEs: 205 cont. (75%), 70 disc. (25%).

  26. Annotation of Idioms in the German Corpus Continuous form: Er will nicht bei den Argumenten ständig <MWE id=1> den Bock zum Gärtner machen </MWE id=1>. He does not always want to set the fox to keep the geese during the argumentation. Discontinuous form: Er <MWE id=1> macht </MWE id=1> oft <MWE id=1> den Bock zum Gärtner </MWE id=1>. He often sets the fox to keep the geese.
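The annotation scheme above lends itself to a simple reading routine: one tagged span per id means a continuous idiom, several spans with the same id mean a discontinuous one. The sketch below assumes exactly the tag syntax shown on the slide; `idiom_spans` and `classify` are hypothetical names.

```python
import re
from collections import defaultdict

# Matches one tagged span; the closing tag repeats the id, as on the slide.
MWE_SPAN = re.compile(r"<MWE id=(\d+)>\s*(.*?)\s*</MWE id=\1>")


def idiom_spans(sentence):
    """Collect the tagged text spans of each annotated idiom, keyed by id."""
    spans = defaultdict(list)
    for mwe_id, text in MWE_SPAN.findall(sentence):
        spans[mwe_id].append(text)
    return dict(spans)


def classify(sentence):
    """Label each annotated idiom as continuous (one span) or discontinuous."""
    return {
        mwe_id: "continuous" if len(parts) == 1 else "discontinuous"
        for mwe_id, parts in idiom_spans(sentence).items()
    }


cont = ("Er will nicht bei den Argumenten ständig "
        "<MWE id=1> den Bock zum Gärtner machen </MWE id=1>.")
disc = ("Er <MWE id=1> macht </MWE id=1> oft "
        "<MWE id=1> den Bock zum Gärtner </MWE id=1>.")
print(classify(cont), classify(disc))
```

Counting spans per id in this way would be one route to continuous/discontinuous proportions like those reported for the corpus.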

  27. Statistical Analysis of iVPs’ Syntactic Patterns

  28. Syntactic Rule for Continuous Idioms Er will nicht bei den Argumenten ständig den Bock zum Gärtner machen! En Bloc Pattern = A: match=yes, last idiom’s word=no, [den Bock, zum Gärtner] B: match=yes, last idiom’s word=yes, [machen] C: mark_as_continuous_iVP. where A: idiom constituents from the first up to the one before the last; B: last idiom constituent; C: command to identify/match as continuous. • No alien element between A and B!
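The en bloc pattern can be approximated by a sliding-window comparison. This is a minimal sketch, not the METIS-II rule formalism: it assumes whitespace tokenization and matching on surface forms (the real system works on analysed input), and `find_continuous` is a hypothetical name.

```python
def find_continuous(tokens, idiom):
    """Return (start, end) of the first en bloc occurrence of `idiom`.

    Parts A and B of the rule are checked together: all idiom words must
    follow each other directly, so no alien element may intervene.
    """
    n = len(idiom)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == idiom:
            return start, start + n  # part C: mark as continuous iVP
    return None


idiom = "den Bock zum Gärtner machen".split()
sentence = "Er will nicht bei den Argumenten ständig den Bock zum Gärtner machen".split()
print(find_continuous(sentence, idiom))

# With the finite verb moved to the left bracket the en bloc pattern fails.
print(find_continuous("Er macht oft den Bock zum Gärtner".split(), idiom))
```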

  29. Syntactic Rule for Discontinuous Idioms Er macht (iV_LB) oft (Adverb) den Bock (iNP_MF) zum Gärtner (iPP_MF). Discontinuous Pattern_LBMF = A: match=yes, field=LB, c=verb, [macht] B: [match=no, field=MF]*, [oft] C: match=yes, field=MF, [den Bock, zum Gärtner] D: mark_as_discontinuous_iVP. where A: idiom’s verb in the left bracket; B: arbitrarily many elements; C: matched idiom’s constituents; D: command to identify/match as discontinuous. • Alien element(s) between A and C!
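The LB-MF pattern can be sketched in the same spirit. Everything here is an assumption made for illustration: the `Token` class, the hand-assigned field labels and lemmas, and the function `match_lbmf` are hypothetical, and the field labels would in practice come from the chunker and the topological analysis.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str   # surface form
    lemma: str  # base form
    field: str  # topological field: "PF", "LB", "MF" or "RB"
    cat: str    # coarse category, e.g. "verb"


def match_lbmf(tokens, idiom_verb, idiom_rest):
    """Match A (verb in LB), B (alien MF elements), C (rest of idiom in MF)."""
    # A: the idiom's verb, in any inflected form, sits in the left bracket.
    for i, tok in enumerate(tokens):
        if tok.field == "LB" and tok.cat == "verb" and tok.lemma == idiom_verb:
            break
    else:
        return None
    n = len(idiom_rest)
    j = i + 1
    while j + n <= len(tokens):
        window = tokens[j:j + n]
        # C: remaining idiom constituents, en bloc, inside the middle field.
        if all(t.field == "MF" for t in window) and [t.form for t in window] == idiom_rest:
            aliens = [t.form for t in tokens[i + 1:j]]
            return {"verb": i, "rest": (j, j + n), "aliens": aliens}  # D
        # B: skip one alien element, which must itself be in the middle field.
        if tokens[j].field != "MF":
            return None
        j += 1
    return None


sentence = [
    Token("Er", "er", "PF", "pron"),
    Token("macht", "machen", "LB", "verb"),
    Token("oft", "oft", "MF", "adv"),
    Token("den", "der", "MF", "det"),
    Token("Bock", "Bock", "MF", "noun"),
    Token("zum", "zu", "MF", "prep"),
    Token("Gärtner", "Gärtner", "MF", "noun"),
]
print(match_lbmf(sentence, "machen", ["den", "Bock", "zum", "Gärtner"]))
```

The sketch accepts zero or more alien elements, following the `*` in part B of the rule.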

  30. Outlook • History of EBMT • Interpretation of idioms • Topological field model • Treatment of idioms by MT • METIS-II idiom resources • Translation process of METIS-II • METIS-II Idiom Matching Process • Evaluation of METIS-II

  31. METIS-II Translation Process • SL analysis (tokenization, PoS-tagging, lemmatization, and chunking or shallow parsing); • SL-to-TL matching using: i) the bilingual idiom dictionary; ii) the syntactic matching rules. • TL generation (the main TL resource, BNC, is used as a data-set of examples). The token generator is described in Carl & Schütz (2005).

  32. METIS-II Idiom Matching Process Users • Store an idiom in the bilingual dictionary; • Load the syntactic matching rules; • Enter an input sentence/corpus. System • The system reads the sentence word by word; • If the idiom is continuous and in the same form as stored in the dictionary, it is translated directly and correctly; • If the idiom is discontinuous, the system reads the syntactic matching rules (rule by rule) until it finds the appropriate one, which is then applied.

  33. Outlook • History of EBMT • Interpretation of idioms • Topological field model • Treatment of idioms by MT • METIS-II idiom resources • Translation process of METIS-II • Evaluation of METIS-II • For continuous idioms • For discontinuous idioms

  34. Evaluation of METIS-II • Hit: correct matching/correct translation • Miss: no matching/reuse of German input • Noise: false matching/literal translation • Measures: precision, recall, F-score
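The three measures can be computed from the hit, miss and noise counts. The formulas did not survive in this transcript, so the sketch below assumes the standard definitions (precision = hit / (hit + noise), recall = hit / (hit + miss), F-score as their harmonic mean); whether the talk used exactly these is an assumption. The counts in the usage line are made up for illustration and are not results from the evaluation.

```python
def evaluate(hit, miss, noise):
    """Return (precision, recall, f_score) under the standard definitions.

    hit   = correct matching / correct translation
    miss  = no matching / reuse of German input
    noise = false matching / literal translation
    """
    precision = hit / (hit + noise) if hit + noise else 0.0
    recall = hit / (hit + miss) if hit + miss else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return precision, recall, f_score


# Illustrative counts only, not figures from the METIS-II evaluation.
print(evaluate(hit=9, miss=1, noise=3))
```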

  35. Evaluation Results for Continuous iVPs

  36. Evaluation Results for Discontinuous iVPs

  37. Conclusion • Continuous idioms: more than 95% recall and precision. • Discontinuous idioms: in almost all cases more than 90% recall, and more than 80% precision. • The evaluation figures for continuous idioms of all techniques are higher than those for the discontinuous idioms. • This is attributed to the fact that discontinuous idioms are more difficult to identify because their constituents are spread across the sentence.

  38. Thank you for your attention! Dimitra Anastasiou www.d-anastasiou.com dimitra@d-anastasiou.de

  39. References (1) • Bar-Hillel, Y., (1952), “The Treatment of ‘idioms’ by a Translating Machine”, presented at the Conference on Mechanical Translation at Massachusetts Institute of Technology, June 1952. • Brown, R. D., (1999), “Adding Linguistic Knowledge to a Lexical Example-based Translation System”, in: 8th TMI 1999, Chester, England, 22-32. • Carl, M.; Schütz, J., (2005), “A Reversible Lemmatizer/Token-generator for English”, in: EBMT Workshop 2005, MT Summit X, Phuket, Thailand. • Drach, E., (1963), Grundgedanken der deutschen Satzlehre, Darmstadt: Wissenschaftliche Buchgesellschaft. • DUDEN Redaktion, (1998), Grammatik der deutschen Gegenwartssprache, Mannheim. • Dürscheid, C., (2000), Syntax: Grundlagen und Theorien, Wiesbaden. • Gangadharaiah, R.; Balakrishnan, N., (2006), “Application of Linguistic Rules to Generalized Example Based Machine Translation for Indian Languages”, in: Proceedings of the First National Symposium on Modeling and Shallow Parsing of Indian Languages (MSPIL), Mumbai, India. • Gazdar, G.; Klein, E.; Pullum, G.; Sag, I., (1985), Generalized Phrase Structure Grammar, Oxford: Basil Blackwell. • Jackendoff, R., (1997), The Architecture of the Language Faculty, Cambridge, Mass.: MIT Press. • Krenn, B., (2008), “Description of evaluation resource – German PP-verb data”, in: MWE Workshop 2009, at LREC Conference, 7-11.

  40. References (2) • Moon, R., (1998), Fixed Expressions and Idioms in English: A Corpus-based Approach, Oxford, England: Clarendon Press. • Ryu, B. R.; Kim, Y. K.; Yuh, S. H.; Park, S. K., (1999), “FromTo K/E: A Korean English Machine Translation system based on idiom recognition and fail softening”, in: MT Summit VII, Singapore, 469-475. • Santos, D., (1990), “Lexical gaps and idioms in Machine Translation”, in: Karlgren, H. (Ed.), 13th COLING 1990, Helsinki, Finland, 330-335. • Sumita, E.; Iida, H.; Kohyama, H., (1990), “Translating with Examples: A New Approach to Machine Translation”, in: 3rd TMI 1990, Texas, USA, 203-212. • Trawinski, B.; Sailer, M.; Soehn, J. P.; Lemnitzer, L.; Richter, F., (2008), “Cranberry Expressions in English and German”, in: MWE Workshop 2009, at LREC Conference, 35-39. • Volk, M., (1998), “The Automatic Translation of Idioms. Machine Translation vs. Translation Memory Systems”, in: Weber, N. (Ed.), Machine Translation: Theory, Applications, and Evaluation. An assessment of the state of the art, St. Augustin: Gardez-Verlag. • Wehrli, E., (1998), “Translating Idioms”, in: 17th COLING 1998, Vol. 2, 1388-1392.
