
Handling texts and corpuses in Ariane-G5, a complete environment for multilingual MT

ACIDCA ’2000, Monastir, 21-24/3/2000

Christian Boitet

GETA, CLIPS, IMAG, Grenoble


Outline

  • Introduction

  • Multilingual MT-R (for revisors): linguistic methodology & basic software

    • Goals and linguistic methodology

    • Ariane-G5, an MT shell for building multilingual MT-R systems

    • What has been and is done with Ariane-G5: MT-R, MT-A (for authors), MT of speech

  • Representation of input documents

  • Structuration of corpuses

  • Functionalities during processing


    MULTILINGUAL MT-R: GOALS AND LINGUISTIC METHODOLOGY

    • Produce RAW translation GOOD ENOUGH to be revised

    • Specialize to SUBLANGUAGES and use

    • MULTILEVEL TRANSFER (semantic + traces)

    • HEURISTIC PROGRAMMING


    MULTILINGUAL MT-R: BASIC DIAGRAM

    [diagram]

    Ariane-G5: 2 specialized DB (relative to “variants”)

    • DB of lingware components

      • Declaration of variables (= typed attributes), templates…

      • Dictionaries

      • Grammars (rules = transitions of abstract automata)

  • DB of texts

    • Corpuses

    • Source texts

    • Intermediate results

    • Translations (± revisions)


    What has been and is done with Ariane-G5:

    • MT-R (for revisors)

      • Large, operational systems: RU—>FR, FR—>EN

      • Prototypes: EN—>MY, TH, FR

      • Lots of mockups

    • MT-A (for authors)

      • LIDIA mockups: FR—>DE, EN, RU (adding CH)

    • MT of speech (for task-oriented dialogues)

      • CSTAR demo system (EN, DE, KR, IT, FR, JP)


    MT-R examples of translation (1)

    • French to English in aeronautics (before human revision)



    MT-A example of a disambiguation dialogue

    • Le capitaine a rapporté des tasses et des assiettes bleues

    • —> The captain has brought back blue bowls and plates / bowls and blue plates

    Question 1

    O des tasses bleues et des assiettes bleues (blue cups and blue plates)

    O des assiettes bleues et des tasses (blue plates and cups)

    Question 2

    O capitaine de marine (navy)

    O capitaine d’aviation (air force)

    O capitaine d’artillerie (artillery)

    O capitaine d’infanterie (infantry)

    O capitaine de cavalerie (cavalry)

    O …


    Interaction in source for “quality MT for all”

    • Example scenario: multilingual e-mail (UNL)

    [Diagram: a 10-step flow linking the e-mail tool (nicknames + language preferences), the e-mail server, an analysis server, an interactive disambiguation server, an enconversion server, several deconversion servers, and the addressees’ e-mail servers.]


    Speech Translation: advantages of an Interchange Format

    • N target languages for the cost of one analysis

    • Translating into one’s language from N source languages with one generation

    • Using the same generation to “backgenerate”

    [Diagram: analysis into IF; back-generation from IF]


    Interchange Format: example

    • la semaine du 12 nous avons des chambres simples et doubles disponibles (for the week of the 12th we have single and double rooms available)

      • give-information+availability+room(room-type=(single ; double), time=(week, md12))

        • give-information (dialogue act)

        • +availability+room (concepts)

        • (room-type=(single ; double), time=(week, md12)) (arguments)
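The three-layer decomposition above (dialogue act, concept chain, argument list) can be sketched with a small splitter. This is a simplified illustration of the surface shape of such IF expressions, not a full parser of the real CSTAR IF syntax; `split_if` is a hypothetical helper name.

```python
import re

def split_if(expr):
    """Split a CSTAR-style IF expression into (dialogue act, concepts, arguments).
    A simplified sketch: the real IF syntax is richer than this."""
    m = re.match(r'([^+(]+)((?:\+[^+(]+)*)(?:\((.*)\))?$', expr)
    act = m.group(1)                                    # e.g. "give-information"
    concepts = [c for c in m.group(2).split('+') if c]  # e.g. ["availability", "room"]
    args = m.group(3) or ''                             # argument list, left unparsed here
    return act, concepts, args

act, concepts, args = split_if(
    "give-information+availability+room(room-type=(single ; double), time=(week, md12))")
# act == "give-information", concepts == ["availability", "room"]
```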


    Interface of CLIPS++ CSTAR-II demonstrator

    [Diagram: speech recognition —> IF —> generation; back-generation (to check the “comprehension”)]


    Hardware architecture of the CLIPS++ CSTAR-II demonstrator

    [Diagram: modules (Reco = recognition, IU, FIF, IF, VC, control, synthesis) connected over Ethernet in Grenoble, linked to Montpellier via RNIS (ISDN).]


    Steps in translating a text

    • Build its hierarchical structure

      • Chapters, sections, paragraphs, [sentences]

    • Segment into translation units

      • According to current length parameter [min..max]

    • Translate each segment

      • Adding segment results to text results for desired phases

    • Revise (manually) the whole translation, keep the revisions
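The segmentation and per-unit translation steps above can be sketched as follows. `translate_segment` is a hypothetical stand-in for the Ariane-G5 translation phases, and only the upper length bound of the [min..max] policy is modeled.

```python
def segment(sentences, max_len):
    """Group sentences into translation units whose total size stays within
    the current length parameter (a sketch of the [min..max] policy)."""
    units, current, size = [], [], 0
    for s in sentences:
        if current and size + len(s) > max_len:
            units.append(current)        # flush the current unit
            current, size = [], 0
        current.append(s)
        size += len(s)
    if current:
        units.append(current)            # flush the remainder
    return units

def translate_text(sentences, translate_segment, max_len=200):
    """Translate unit by unit, adding each segment result to the text results."""
    return [translate_segment(u) for u in segment(sentences, max_len)]
```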


    Representations of input documents

    • 3 main questions:

      • how to represent the writing system,

      • separate formatting tags from the text or not,

      • how to handle non-textual elements (figures, icons, or formulas) contained in utterances

  • Transliterations of textual elements

  • Keeping formatting tags in the texts

  • Non-textual elements


    Transliterations of textual elements

    • Facilitate string-matching operations

    • Diminish the size of dictionaries

    • Represent diacritics

    • Make some processing easier for some tools

      • kataba —> ktb$aaa, katub —> ktb$au- or ktb$-ua


    Transliterations of textual elements (2)

    • Represent writing systems using non-Roman characters

      • "мать" (mother) —> "MATQ" and not "MAT6"

    今日は京都へ行きます。 (Today theme Kyoto dest go.) —>

    KYOU <kj k1=kon k2=nichi> WA <hg ha> KYOUTO <kj k1=higashi k2=toukyo-no-tou> E <hg he> IKI <kj k1=iku> MASU.
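A reversible transliteration table in the spirit of the "мать" —> "MATQ" example can be sketched as below. The letter assignments are illustrative (chosen only to reproduce that one example), not GETA's actual table.

```python
# Toy Cyrillic-to-ASCII table: the soft sign maps to a letter (Q), not a digit,
# so the result stays within ordinary alphabetic characters.
CYR2LAT = {'м': 'M', 'а': 'A', 'т': 'T', 'ь': 'Q'}
LAT2CYR = {v: k for k, v in CYR2LAT.items()}   # invertible because values are unique

def translit(word, table):
    """Apply a character-for-character transliteration table."""
    return ''.join(table[ch] for ch in word)
```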


    Keeping formatting tags in the texts

    • If the translation units get larger, almost all tags become “inside tags”

    • Tags often have a linguistic role

    For example, a sentence may contain

    • a bullet list

    • or a numbered list

    which are normally linguistically homogeneous.

    <P>For example, a sentence may contain</P>

    <UL>

    <LI>a bullet list

    <LI>or a numbered list

    </UL>

    <P>which are normally linguistically homogeneous. </P>
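One way to keep such "inside tags" available to the linguistic processing is to tokenize them as single tokens, so a tag like <LI> can play its role as an enumeration marker. This is a minimal sketch, not Ariane-G5's actual tag handling.

```python
import re

def tokenize_with_tags(text):
    """Tokenize while keeping formatting tags as single tokens:
    matches a tag, a word, or a single punctuation character."""
    return re.findall(r'</?\w+>|\w+|[^\w\s<]', text)

tokens = tokenize_with_tags("a sentence may contain <UL><LI>a bullet list</UL>")
# tags survive as tokens: '<UL>', '<LI>', '</UL>'
```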


    Non-textual elements

    • Formulas, figures, icons, brand names, anchors, links…

      • are often best replaced by tags or special occurrences

    • The situation may be recursive (text inside figures)

    *IF x2+5y>3 , x+y IS CONVENIENT .

    *IF <relation 1> , <entity 2> IS CONVENIENT .

    *IF $$R-1 , $$E-2 IS CONVENIENT .
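The placeholder substitution shown above can be sketched as follows. The kind letters mirror the slide's R(elation)/E(ntity) tags; detecting the spans themselves is assumed done, and `replace_non_textual` is a hypothetical helper name.

```python
def replace_non_textual(text, spans):
    """Replace pre-identified non-textual spans (formulas, figures, icons...)
    by numbered placeholders, keeping a table for restoration after translation.
    `spans` is a list of (kind, substring) pairs."""
    table = {}
    for i, (kind, sub) in enumerate(spans, 1):
        placeholder = f'$${kind}-{i}'
        table[placeholder] = sub
        text = text.replace(sub, placeholder, 1)   # replace first occurrence only
    return text, table

text, table = replace_non_textual(
    "*IF x2+5y>3 , x+y IS CONVENIENT .",
    [("R", "x2+5y>3"), ("E", "x+y")])
# text == "*IF $$R-1 , $$E-2 IS CONVENIENT ."
```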


    Structuration of corpuses

    • Motivations for corpuses

    • Segmentation and structuration

    • Representation of texts, intermediate results, translations and revisions


    Motivations for corpuses

    • Corpus = collection of texts sharing

      • some factual characteristics:

        • natural language

        • transliteration and method for handling formatting information and non-textual elements

        • segmentation method

        • structuration method

      • some management information:

        • source (journal/volume, book/chapter…)

        • usage destination (send back, postedit, tests…)


    Segmentation and structuration

    • "segmentation"

      • = input texts —> words, sentences… & units of translation

        • best done by the morphological analyzer

  • "structuration"

    • = segmentation —> higher level units

      • paragraphs, sections, etc.

  • + production of a corresponding tree structure

  • In Ariane-G5, up to 7 hierarchical separators

    • for a given corpus
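Structuration by a ranked list of hierarchical separators can be sketched as a recursive split producing a tree. The separator strings below are illustrative, not Ariane-G5's actual separators.

```python
def structure(text, separators):
    """Recursively split `text` by a ranked list of hierarchical separators
    (Ariane-G5 allows up to 7 per corpus), yielding a nested list = the tree.
    Leaves are the lowest-level text units."""
    if not separators:
        return text
    return [structure(part, separators[1:]) for part in text.split(separators[0])]

tree = structure("A;;B##C", ["##", ";;"])
# tree == [["A", "B"], ["C"]]  -- "##" splits sections, ";;" splits paragraphs
```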


    Representation of texts, intermediate results, translations and revisions

    • Corpus = list of text files + descriptor

    • Text = (transliterated) text + descriptor

      • (+ non-textual elements replaced by tags or special occurrences)

    • Intermediate result = list of decorated trees

      • + descriptor (lingware variant + interval processed)

    • Translation = (transliterated) text + descriptor

      • (transliterated form may reduce morph. gen. size)

    • Revision = (transliterated) text + descriptor

      • (usually another, more natural transliteration)
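The records listed above can be sketched as data structures. The field names here are assumptions for illustration, not Ariane-G5's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    language: str
    transliteration: str
    source: str = ""        # journal/volume, book/chapter...
    usage: str = ""         # send back, postedit, tests...

@dataclass
class Text:
    body: str               # (transliterated) text, non-textual parts replaced
    descriptor: Descriptor

@dataclass
class IntermediateResult:
    trees: list             # list of decorated trees
    lingware_variant: str
    interval: tuple         # interval of the text processed

@dataclass
class Corpus:               # a corpus = list of text files + descriptor
    descriptor: Descriptor
    texts: list = field(default_factory=list)
```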


    Functionalities during processing

    • Ensuring coherence between lingware and results

    • Stopping & restarting processing of a text

    • Reusing intermediate results

      • recovery from interruptions

      • debugging

      • multitarget translation (analysis ≈ 2/3 of translation time)
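The multitarget case above can be sketched as running the expensive analysis once and reusing the stored tree for every target language. `analyze` and `transfer_and_generate` are hypothetical stand-ins for the corresponding Ariane-G5 phases.

```python
def multitarget_translate(text, analyze, transfer_and_generate, targets):
    """Analyze once (~2/3 of total translation time), then reuse the
    intermediate result for each target language."""
    tree = analyze(text)    # intermediate result, kept in the text DB
    return {lang: transfer_and_generate(tree, lang) for lang in targets}
```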


    Conclusion and perspectives (1)

    • Text & corpus handling in complete MT systems is quite complex and interesting…

      • handling texts and corpuses is not a straightforward problem,

      • suggests many interesting technological and scientific issues


    Conclusion and perspectives (2)

    • but more is coming:

      • Synergy MT systems <—> TA (Translation Aids)

        • unification of the representations of texts in both worlds:

          • MT: revised texts structured as input texts,

          • => the text data base will become a kind of multilevel translation memory (texts, translations/revisions, intermediate results)

          • TA: translation memories from "bags" to structured translation memories (keeping the sequential context)

        • both: multiple-layer translation memories

          • lemmatized forms

          • "concrete" syntactic trees & "abstract" logico-semantic trees

          • formatting tags


    Conclusion and perspectives (3)

    • Structuration may be used to « distribute the work » to MT and TA by segmenting according to the « best engine »

      • some sublanguages are good for MT, bad for TA

        • weather bulletins

      • others are good for TA, bad for MT

        • weather related warnings, slightly modified versions of already translated documents

      • and others are best kept for specialists

        • finely-tuned legal sentences

