
Handling texts and corpuses in Ariane-G5, a complete environment for multilingual MT

ACIDCA ’2000, Monastir, 21-24/3/2000. Christian Boitet, GETA, CLIPS, IMAG, Grenoble.


Presentation Transcript


  1. Handling texts and corpuses in Ariane-G5, a complete environment for multilingual MT ACIDCA ’2000, Monastir, 21-24/3/2000 Christian Boitet GETA, CLIPS, IMAG, Grenoble

  2. Outline • Introduction • Multilingual MT-R (for revisors): linguistic methodology & basic software • Goals and linguistic methodology • Ariane-G5, an MT shell for building multilingual MT-R systems • What has been and is done with Ariane-G5: MT-R, MT-A (for authors), MT of speech • Representation of input documents • Structuration of corpuses • Functionalities during processing

  3. MULTILINGUAL MT-R: GOALS AND LINGUISTIC METHODOLOGY • Produce RAW translation GOOD ENOUGH to be revised • Specialize to SUBLANGUAGES and use • MULTILEVEL TRANSFER (semantic + traces) • HEURISTIC PROGRAMMING

  4. MULTILINGUAL MT-R: BASIC DIAGRAM

  5. Ariane-G5 (1978-99) : structure

  6. relative to “variants” => Ariane-G5: 2 specialized DB • DB of lingware components • Declaration of variables (= typed attributes), templates… • Dictionaries • Grammars (rules = transitions of abstract automata) • DB of texts • Corpuses • Source texts • Intermediate results • Translations (± revisions)

  7. What has been and is done with Ariane-G5: • MT-R (for revisors) • Large, operational systems: RU—>FR, FR—>EN • Prototypes: EN—>MY, TH, FR • Lots of mockups • MT-A (for authors) • LIDIA mockups: FR—>DE, EN, RU (adding CH) • MT of speech (for task-oriented dialogues) • CSTAR demo system (EN, DE, KR, IT, FR, JP)

  8. MT-R examples of translation (1) • French-English in aeronautics (before human revision)

  9. MT-R examples of translation (2)

  10. MT-A example of a disambiguation dialogue • Le capitaine a rapporté des tasses et des assiettes bleues • —> The captain has brought back blue bowls and plates / bowls and blue plates • Question 1: O des tasses bleues et des assiettes bleues (blue bowls and blue plates) O des assiettes bleues et des tasses (blue plates, and bowls) • Question 2: O capitaine de marine (navy captain) O capitaine d’aviation (air force captain) O capitaine d’artillerie (artillery captain) O capitaine d’infanterie (infantry captain) O capitaine de cavalerie (cavalry captain) O …

  11. Interaction in source for “quality MT for all” • Example scenario: multilingual e-mail (UNL) • [Diagram, steps numbered 1 to 10: e-mail tool, e-mail server, enconversion server, analysis server, interactive disambiguation server, nicknames + language preferences, deconversion servers (decoding servers), addressees’ e-mail servers]

  12. Other future possibility: production of multilingual “self-explaining documents”

  13. Speech translation: advantages of an Interchange Format (IF) • N target languages for the cost of one analysis • Translating into one’s language from N source languages with one generation • Using the same generation to “backgenerate” IF • [Diagram labels: Analysis into IF, Backgeneration]
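The cost argument on this slide can be made concrete with a small count. The sketch below assumes a direct design needs one module per ordered language pair, while an IF design needs one analyzer and one generator per language; both function names are illustrative and do not come from the talk.

```python
def direct_modules(n_languages: int) -> int:
    """Modules needed when every ordered (source, target) pair has its own translator."""
    return n_languages * (n_languages - 1)


def interchange_modules(n_languages: int) -> int:
    """Modules needed with one analyzer into IF and one generator from IF per language."""
    return 2 * n_languages
```

For the six languages listed for the CSTAR demo system, this gives 30 direct modules against 12 IF modules.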

  14. Interchange Format: example • la semaine du 12 nous avons des chambres simples et doubles disponibles (the week of the 12th we have single and double rooms available) • give-information+availability+room(room-type=(single ; double), time=(week, md12)) • Dialogue act: give-information • Concepts: +availability+room • Arguments: (room-type=(single ; double), time=(week, md12))
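The three-part reading of the IF expression (dialogue act, concepts, arguments) can be sketched as a small parser. This is only an illustration based on the single example on the slide; the class name `InterchangeFormat` and the functions `split_top_level` and `parse_if` are hypothetical, and the real IF grammar is certainly richer.

```python
from dataclasses import dataclass


@dataclass
class InterchangeFormat:
    dialogue_act: str
    concepts: list
    arguments: dict


def split_top_level(text: str, sep: str = ",") -> list:
    """Split on `sep`, ignoring separators nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_if(expr: str) -> InterchangeFormat:
    """Parse 'act+concept+...(name=value, ...)' into its three parts."""
    head, _, rest = expr.partition("(")
    act, *concepts = head.strip().split("+")
    body = rest.rstrip()
    if body.endswith(")"):
        body = body[:-1]  # drop the parenthesis that closes the argument list
    arguments = {}
    for item in split_top_level(body):
        name, _, value = item.partition("=")
        arguments[name.strip()] = value.strip()
    return InterchangeFormat(act, concepts, arguments)
```

Argument values are kept as raw strings here; a fuller sketch would parse `(single ; double)` into a list.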

  15. Interface of the CLIPS++ CSTAR-II demonstrator • [Diagram labels: Recognition, IF, Generation, Backgeneration (to check “understanding”)]

  16. Contrôle, IFF Synthèse VC IU FIF Reco Ethernet Montpellier RNIS Hardware architecture of the CLIPS++ CSTAR-II demonstrator Grenoble

  17. Steps in translating a text • Build its hierarchical structure • Chapters, sections, paragraphs, [sentences] • Segment into translation units • According to current length parameter [min..max] • Translate each segment • Adding segment results to text results for desired phases • Revise (manually) the whole translations, keep the revisions
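The segmentation step with a [min..max] length parameter might look like the sketch below. It assumes lengths are counted in characters, that sentences are already split, and that units are built greedily; none of these details are stated on the slide, and `segment_units` is a hypothetical name.

```python
def segment_units(sentences, min_len, max_len):
    """Group consecutive sentences into translation units.

    A unit is closed as soon as it reaches min_len characters. A sentence
    that would push the current unit past max_len starts a new unit.
    A single sentence longer than max_len is kept whole.
    """
    units, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_len:
            units.append(current)
            current = sentence
        else:
            current = candidate
        if len(current) >= min_len:
            units.append(current)
            current = ""
    if current:
        units.append(current)
    return units
```

With `min_len=8` and `max_len=15`, the sentences `["A b.", "C d e.", "F.", "G h i j k l."]` are grouped into two units, each pairing a short sentence with its successor.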

  18. Representations of input documents • 3 main questions: • how to represent the writing system, • separate formatting tags from the text or not, • how to handle non-textual elements (figures, icons, or formulas) contained in utterances • Transliterations of textual elements • Keeping formatting tags in the texts • Non-textual elements

  19. Transliterations of textual elements • Facilitate string-matching operations • Diminish the size of dictionaries • Represent diacritics • Make some processing easier for some tools • kataba —> ktb$aaa, katub —> ktb$au- or ktb$-ua
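The root-plus-pattern transliteration in the last bullet can be sketched as follows. The sketch assumes a Latin transcription with only the short vowels a, i, u, that every vowel follows a consonant, and that a missing vowel is written `-`. It reproduces `ktb$aaa` and the first of the two variants given for katub; the function name is hypothetical.

```python
VOWELS = set("aiu")  # assumption: short vowels only, in a Latin transcription


def split_root_pattern(word: str) -> str:
    """Return '<consonantal root>$<vowel pattern>', with '-' for a missing vowel."""
    root, pattern = [], []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch in VOWELS:
            raise ValueError(f"vowel {ch!r} at position {i} has no preceding consonant")
        root.append(ch)
        if i + 1 < len(word) and word[i + 1] in VOWELS:
            pattern.append(word[i + 1])  # vowel carried by this consonant
            i += 2
        else:
            pattern.append("-")  # consonant without a following vowel
            i += 1
    return "".join(root) + "$" + "".join(pattern)
```

Because all forms of a root share the prefix `ktb$`, a dictionary lookup can be keyed on the root alone, which is one way to read the "diminish the size of dictionaries" bullet.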

  20. Transliterations of textual elements (2) • Represent writing systems using non-Roman characters • "мать" (mother) —> "MATQ" and not "MAT6" • 今日は京都へ行きます。 (Today theme Kyoto dest go.) —> KYOU <kj k1=kon k2=nichi> WA <hg ha> KYOUTO <kj k1=higashi k2=toukyo-no-tou> E <hg he> IKI <kj k1=iku> MASU.
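A table-driven transliteration of this kind can be sketched as below. Only the four letter mappings attested by the "MATQ" example are included; any further entries would be assumptions, so unknown characters raise an error rather than being guessed. The names are hypothetical.

```python
# Only the mappings attested on the slide ("мать" -> "MATQ").
CYRILLIC_TO_ASCII = {"м": "M", "а": "A", "т": "T", "ь": "Q"}
ASCII_TO_CYRILLIC = {v: k for k, v in CYRILLIC_TO_ASCII.items()}


def transliterate(word: str) -> str:
    """Map each Cyrillic letter to its ASCII code; fail on unmapped letters."""
    out = []
    for ch in word.lower():
        if ch not in CYRILLIC_TO_ASCII:
            raise KeyError(f"no transliteration defined for {ch!r}")
        out.append(CYRILLIC_TO_ASCII[ch])
    return "".join(out)


def back_transliterate(code: str) -> str:
    """Inverse mapping; works because every ASCII code is used only once."""
    return "".join(ASCII_TO_CYRILLIC[ch] for ch in code)
```

Keeping the table one-to-one is what makes the transliteration reversible, so translations and revisions can be converted back to the original script.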

  21. Keeping formatting tags in the texts • If the translation units get larger, almost all tags become “inside tags” • Tags often have a linguistic role
      Rendered text: For example, a sentence may contain • a bullet list • or a numbered list which are normally linguistically homogeneous.
      Tagged source: <P>For example, a sentence may contain</P> <UL> <LI>a bullet list <LI>or a numbered list </UL> <P>which are normally linguistically homogeneous. </P>
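One way to keep tags inside the text is to treat them as tokens of their own, so that later phases can see them. The sketch below is a minimal illustration using a regular expression; it assumes simple `<...>` tags with no nested angle brackets, and `tokenize_with_tags` is a hypothetical name, not an Ariane-G5 function.

```python
import re

# A tag is '<', then anything except angle brackets, then '>'.
TAG = re.compile(r"(<[^<>]+>)")


def tokenize_with_tags(fragment: str) -> list:
    """Split a tagged fragment into ('tag', ...) and ('text', ...) tokens, in order."""
    tokens = []
    for piece in TAG.split(fragment):  # the capturing group keeps the tags
        piece = piece.strip()
        if not piece:
            continue
        kind = "tag" if TAG.fullmatch(piece) else "text"
        tokens.append((kind, piece))
    return tokens
```

Applied to the tagged source on this slide, the list items stay between their `<LI>` tokens, which is the information a grammar would need to treat them as homogeneous.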

  22. Non-textual elements • Formulas, figures, icons, brand names, anchors, links… • are often best replaced by tags or special occurrences • The situation may be recursive (text inside figures)
      *IF x2+5y>3 , x+y IS CONVENIENT .
      *IF <relation 1> , <entity 2> IS CONVENIENT .
      *IF $$R-1 , $$E-2 IS CONVENIENT .
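Replacing non-textual elements by special occurrences, and restoring them after translation, can be sketched as follows. The `$$R-1` / `$$E-2` form is taken from the slide; the idea that the caller already knows which substrings are formulas, and the names `mask_elements` and `unmask`, are assumptions made for the illustration.

```python
def mask_elements(text: str, elements: list) -> tuple:
    """Replace each (literal, kind) pair by '$$<kind>-<n>', numbering in order.

    Returns the masked text and a table mapping placeholders to literals.
    """
    table = {}
    for index, (literal, kind) in enumerate(elements, start=1):
        if literal not in text:
            raise ValueError(f"{literal!r} not found in text")
        placeholder = f"$${kind}-{index}"
        text = text.replace(literal, placeholder, 1)
        table[placeholder] = literal
    return text, table


def unmask(text: str, table: dict) -> str:
    """Put the original elements back; higher numbers first so that
    '$$R-10' is restored before '$$R-1' can match its prefix."""
    for placeholder, literal in reversed(list(table.items())):
        text = text.replace(placeholder, literal, 1)
    return text
```

The translation phases then see only the placeholders, and the table travels with the text so that the formulas can be reinserted into the output.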

  23. Structuration of corpuses • Motivations for corpuses • Segmentation and structuration • Representation of texts, intermediate results, translations and revisions

  24. Motivations for corpuses • Corpus = collection of texts sharing • some factual characteristics: • natural language • transliteration and method for handling formatting information and non-textual elements • segmentation method • structuration method • some management information: • source (journal/volume, book/chapter…) • usage destination (send back, postedit, tests…)

  25. Segmentation and structuration • "segmentation" • = input texts —> words, sentences… • best done by the morphological analyzer • & units of translation • "structuration" • = segmentation —> higher level units • paragraphs, sections, etc. • + production of a corresponding tree structure • In Ariane-G5, up to 7 hierarchical separators • for a given corpus
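Building a tree from hierarchical separators can be sketched with a recursive split. The limit of 7 levels comes from the slide; the separator strings themselves and the name `build_structure` are hypothetical, since the slide does not say what the separators look like.

```python
MAX_LEVELS = 7  # limit stated for a corpus in Ariane-G5


def build_structure(text: str, separators: list):
    """Split `text` on separators ordered from highest to lowest level.

    Returns nested lists whose leaves are the lowest-level units.
    """
    if len(separators) > MAX_LEVELS:
        raise ValueError(f"at most {MAX_LEVELS} hierarchical separators are allowed")
    if not separators:
        return text.strip()
    first, rest = separators[0], separators[1:]
    return [build_structure(part, rest) for part in text.split(first) if part.strip()]
```

For example, with `"##"` as a section separator and a blank line as a paragraph separator, `"s1. s2.\n\ns3.##s4."` yields two sections, the first containing two paragraphs.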

  26. Representation of texts, intermediate results, translations and revisions • Corpus = list of text files + descriptor • Text = (transliterated) text + descriptor • (+ non-textual elements replaced by tags or spec.occs) • Intermediate result = list of decorated trees • + descriptor (lingware variant + interval processed) • Translation = (transliterated) text + descriptor • (transliterated form may reduce morph. gen. size) • Revision = (transliterated) text + descriptor • (usually another, more natural transliteration)
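The objects listed on this slide can be pictured as plain records. The sketch below uses Python dataclasses; all class and field names are hypothetical and only mirror the bullets (content plus descriptor, and for intermediate results the lingware variant and the interval processed).

```python
from dataclasses import dataclass, field


@dataclass
class DecoratedTree:
    label: str
    decoration: dict = field(default_factory=dict)  # attribute/value pairs on the node
    children: list = field(default_factory=list)


@dataclass
class IntermediateResult:
    trees: list            # list of DecoratedTree, one per translation unit
    lingware_variant: str  # which lingware produced these trees
    interval: tuple        # (first unit, last unit) processed, inclusive


@dataclass
class TextEntry:
    content: str           # transliterated text, non-textual elements already replaced
    descriptor: dict = field(default_factory=dict)


@dataclass
class Corpus:
    descriptor: dict       # language, transliteration, segmentation method, ...
    texts: list = field(default_factory=list)


def covers(result: IntermediateResult, unit_index: int) -> bool:
    """True if the stored result includes the given translation unit."""
    first, last = result.interval
    return first <= unit_index <= last
```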

  27. Functionalities during processing • Ensuring coherence between lingware and results • Stopping & restarting processing of a text • Reusing intermediate results • recovery from interruptions • debugging • multitarget translation (analysis ≈ 2/3 of translation time)
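Reuse of intermediate results, together with the coherence check against the lingware, can be sketched as a small store that recomputes a result only when the lingware version has changed. This is an illustration under assumed names (`ResultStore`, `analyze`, `translate`, and the version strings); it does not describe how Ariane-G5 implements the feature.

```python
class ResultStore:
    """Keeps one result per (unit, phase), tagged with the lingware version used."""

    def __init__(self):
        self._results = {}

    def get_or_compute(self, unit_id, phase, lingware_version, compute):
        key = (unit_id, phase)
        cached = self._results.get(key)
        if cached is not None and cached[0] == lingware_version:
            return cached[1]  # still coherent with the current lingware
        value = compute()     # missing or stale: run the phase again
        self._results[key] = (lingware_version, value)
        return value


calls = {"analysis": 0}  # counts how often the expensive phase really runs


def analyze(segment: str) -> str:
    calls["analysis"] += 1
    return f"tree({segment})"


def translate(store, unit_id, segment, target, version="v1"):
    """Analysis is shared across targets; only the target-specific part differs."""
    tree = store.get_or_compute(unit_id, "analysis", version, lambda: analyze(segment))
    return f"{target}:{tree}"
```

Translating the same unit into two targets runs the analysis once. Since the slide puts analysis at about 2/3 of translation time, this sharing is where the multitarget saving comes from. The same store also supports restarting after an interruption, because finished units are not recomputed.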

  28. Conclusion and perspectives (1) • Text & corpus handling in complete MT systems is quite complex and interesting… • handling texts and corpuses is not a straightforward problem, • and it raises many interesting technological and scientific issues

  29. Conclusion and perspectives (2) • but more is coming: • Synergy MT systems <—> TA (Translation Aids) • unification of the representations of texts in both worlds: • • MT: revised texts structured as input texts, • => the text data base will become a kind of multilevel translation memory (texts, translations/revisions, intermediate results) • • TA: translation memories from "bags" to structured translation memories (keeping the sequential context) • both: multiple-layer translation memories • • lemmatized forms • • "concrete" syntactic trees & "abstract" logico-semantic trees • • formatting tags

  30. Conclusion and perspectives (3) • Structuration may be used to « distribute the work » to MT and TA by segmenting according to the « best engine » • some sublanguages are good for MT, bad for TA • weather bulletins • others are good for TA, bad for MT • weather related warnings, slightly modified versions of already translated documents • and others are best kept for specialists • Fine-tune legal sentences
