Eliciting a corpus of word-aligned phrases for MT

Lori Levin, Alon Lavie, Erik Peterson

Language Technologies Institute

Carnegie Mellon University


Introduction

  • Problem: Building Machine Translation systems for languages with scarce resources:

    • Not enough data for Statistical MT and Example-Based MT

    • Not enough human linguistic expertise for writing rules

  • Approach:

    • Elicit high quality, word-aligned data from bilingual speakers

    • Learn transfer rules from the elicited data

Modules of the AVENUE/MilliRADD rule learning system and MT system

[Diagram components: Word-aligned elicited data; Learning Module; Transfer Rules; Run Time Transfer System; Translation Lexicon; Word-to-Word Translation Probabilities; English Language Model]

Example transfer rule shown in the diagram:

{PP,4894}
;;Score:0.0470
PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1)(X1::Y2))


  • Demo of elicitation interface

  • Description of elicitation corpus

  • Overview of automated rule learning

Demo of Elicitation Tool

  • Speaker needs to be bilingual and literate: no other knowledge necessary

  • Mappings between words and phrases: Many-to-many, one-to-none, many-to-none, etc.

  • Create phrasal mappings

  • Fonts and character sets:

    • Including Hindi, Chinese, and Arabic

  • Add morpheme boundaries to target language

  • Add alternate translations

  • Notes and context
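The mapping types listed above can be pictured as links between word positions. The following is a minimal sketch, assuming an alignment is stored as (source index, target index) pairs; the function names and the sample alignment are hypothetical and do not describe the elicitation tool's actual data format.

```python
def alignment_units(n_src, n_tgt, links):
    """Group linked source/target word positions into connected units."""
    units, seen_tgt = [], set()
    seen_src = set()
    for start in range(n_src):
        if start in seen_src:
            continue
        src_group, tgt_group = {start}, set()
        frontier = [("s", start)]
        while frontier:
            side, i = frontier.pop()
            for s, t in links:
                if side == "s" and s == i and t not in tgt_group:
                    tgt_group.add(t)
                    frontier.append(("t", t))
                elif side == "t" and t == i and s not in src_group:
                    src_group.add(s)
                    frontier.append(("s", s))
        seen_src |= src_group
        seen_tgt |= tgt_group
        units.append((sorted(src_group), sorted(tgt_group)))
    # Target words that no source word links to.
    for t in range(n_tgt):
        if t not in seen_tgt:
            units.append(([], [t]))
    return units


def mapping_type(unit):
    """Label a unit as one-to-one, many-to-one, one-to-none, and so on."""
    def size(n):
        return "none" if n == 0 else "one" if n == 1 else "many"
    src, tgt = unit
    return f"{size(len(src))}-to-{size(len(tgt))}"


# Illustrative only: three source words all linked to a single target word.
units = alignment_units(3, 1, [(0, 0), (1, 0), (2, 0)])
print([mapping_type(u) for u in units])
```

Storing links as index pairs keeps unaligned words visible: a source word with no link shows up as its own one-to-none unit.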

Testing of Elicitation Tool

  • DARPA Hindi Surprise Language Exercise

  • Around 10 Hindi speakers

  • Around 17,000 phrases translated and aligned

    • Elicitation corpus

    • NPs and PPs from Treebanked Brown Corpus

Elicitation Corpus: Basic Principles

  • Minimal pairs

  • Syntactic compositionality

  • Special semantic/pragmatic constructions

  • Navigation based on language typology and universals

  • Challenges

Elicitation Corpus: Minimal Pairs

Mapudungun (M): Spoken by around one million people in Chile and Argentina.

Eng: I fell.
Sp: Caí
M: Tranün

Eng: You (John) fell.
Sp: Tú (Juan) caíste
M: Eymi tranimi (Kuan)

Eng: You (Mary) fell.
Sp: Tú (María) caíste
M: Eymi tranimi (Maria)

Eng: I am falling.
Sp: Estoy cayendo
M: Tranmeken

Eng: You (John) are falling.
Sp: Tú (Juan) estás cayendo
M: Eimi (Kuan) tranmekeymi

Using feature vectors to detect minimal pairs

  • np1:(subj-of cl1).pro-pers.hum.2.sg.masc.no-clusn.no-def.no-alien

  • cl1:(subj np1).intr-ag.past.complete

    • Eng: You (John) fell.

      Sp: Tú (Juan) caíste

      M: Eymi tranimi (Kuan)

  • np1:(subj-of cl1).pro-pers.hum.2.sg.fem.no-clusn.no-def.no-alien

  • cl1:(subj np1).intr-ag.past.complete

    • Eng: You (Mary) fell.

      Sp: Tú (María) caíste

      M: Eymi tranimi (Maria)

Feature vectors can be extracted from the output of a parser for English or Spanish. (Except for features that English and Spanish do not have…)
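One way to picture the comparison is to split each vector on dots and count the positions where two sentences disagree. This is a sketch under the assumption that features appear in a fixed order within each vector; the function names are hypothetical and this is not the project's own detection code.

```python
def parse_features(spec):
    """Split 'np1:(subj-of cl1).pro-pers...' into (label, [features])."""
    label, _, body = spec.partition(":")
    return label.strip(), [f.strip() for f in body.split(".") if f.strip()]


def differing_features(sentence_a, sentence_b):
    """Return (label, value_a, value_b) triples, or None if not comparable."""
    if len(sentence_a) != len(sentence_b):
        return None
    diffs = []
    for spec_a, spec_b in zip(sentence_a, sentence_b):
        label_a, feats_a = parse_features(spec_a)
        label_b, feats_b = parse_features(spec_b)
        if label_a != label_b or len(feats_a) != len(feats_b):
            return None
        diffs += [(label_a, a, b) for a, b in zip(feats_a, feats_b) if a != b]
    return diffs


def is_minimal_pair(sentence_a, sentence_b):
    """Two sentences form a minimal pair if exactly one feature differs."""
    diffs = differing_features(sentence_a, sentence_b)
    return diffs is not None and len(diffs) == 1


john_fell = [
    "np1:(subj-of cl1).pro-pers.hum.2.sg.masc.no-clusn.no-def.no-alien",
    "cl1:(subj np1).intr-ag.past.complete",
]
mary_fell = [
    "np1:(subj-of cl1).pro-pers.hum.2.sg.fem.no-clusn.no-def.no-alien",
    "cl1:(subj np1).intr-ag.past.complete",
]
print(differing_features(john_fell, mary_fell))
print(is_minimal_pair(john_fell, mary_fell))
```

For the two vectors from the slide, the only difference is masc versus fem on np1, so the pair isolates gender.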

Syntactic Compositionality

  • The tree

  • The tree fell.

  • I think that the tree fell.

  • We learn rules for smaller phrases

    • E.g., NP

  • Their root nodes become non-terminals in the rules for larger phrases.

    • E.g., S containing an NP

  • Meaning of a phrase is predictable from the meanings of the parts.

Special Semantic and Pragmatic Constructions

  • Meaning may not be compositional

    • Not predictable from the meanings of the parts

  • May not follow normal rules of grammar.

    • Suggestion: Why not go?

  • Word-for-word translation may not work.

  • Tend to be sources of MT mismatches

    • Comparative:

      • English: Hotel A is [closer than Hotel B]

      • Japanese: Hoteru A wa [Hoteru B yori] [tikai desu]

        Gloss: Hotel A TOP Hotel B than close is

      • “Closer than Hotel B” is a constituent in English, but “Hoteru B yori tikai” is not a constituent in Japanese.

Examples of Semantic/Pragmatic Categories

  • Speech Acts: requests, suggestions, etc.

  • Comparatives and Equatives

  • Modality: possibility, probability, ability, obligation, uncertainty, evidentiality

  • Correlatives: (the more the merrier)

  • Causatives

  • Etc.

A Challenge: Combinatorics

  • Person (1, 2, 3, 4)

  • Number (sg, pl, du, paucal)

  • Gender/Noun Class (?)

  • Animacy (animate/inanimate)

  • Definiteness (definite/indefinite)

  • Proximity (near, far, very far, etc.)

  • Inclusion/exclusion

  • Multiply with: tenses and aspects (complete, incomplete, real, unreal, iterative, habitual, present, past, recent past, future, recent future, non-past, non-future, etc.)

  • Multiply with verb class: agentive intransitive, non-agentive intransitive, transitive, ditransitive, etc.

  • (Case marking and agreement may vary with verb tense, verb class, animacy, definiteness, and whether or not object outranks subject in person or animacy.)

Solutions to Combinatorics

  • Generate paradigms of feature vectors, and then automatically generate sentences to match each feature vector.

  • Use known universals to eliminate features: e.g., languages without plurals don’t have duals.
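The two ideas above can be sketched together: prune feature values with a universal, then enumerate every remaining combination. The feature inventory, the `has_plural` flag, and the function name are hypothetical, and the step that generates a sentence for each vector is not shown.

```python
from itertools import product

# Hypothetical, deliberately small feature inventory.
FEATURES = {
    "person": ["1", "2", "3"],
    "number": ["sg", "pl", "du"],
    "tense": ["past", "present"],
}


def generate_paradigm(features, language_facts):
    """Return one dict per feature combination that survives pruning."""
    values = {name: list(vals) for name, vals in features.items()}
    # Universal from the slide: languages without plurals don't have duals,
    # so a language without plurals needs neither "pl" nor "du".
    if not language_facts.get("has_plural", True):
        values["number"] = [v for v in values["number"] if v not in ("pl", "du")]
    names = list(values)
    return [dict(zip(names, combo))
            for combo in product(*(values[n] for n in names))]


print(len(generate_paradigm(FEATURES, {"has_plural": True})))
print(len(generate_paradigm(FEATURES, {"has_plural": False})))
```

Even in this toy inventory one universal cuts the paradigm from 18 vectors to 6; the saving grows multiplicatively as more features are added.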

Other Challenges of Computer-Based Elicitation

  • Inconsistency of human translation and alignment

  • Bias toward word order of the elicitation language

    • Need to provide discourse context for given and new information

  • How to elicit things that aren’t grammaticalized in the elicitation language:

    • Evidential: I see that it is raining/Apparently it is raining/It must be raining.

      • Context: You are inside the house. Your friend comes in wet.

Transfer Rule Formalism

[Slide callouts: type information; part-of-speech/constituent information; x-side constraints; y-side constraints; e.g. ((Y1 AGR) = (X1 AGR))]

;SL: the man, TL: der Mann
NP::NP [DET N] -> [DET N]
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X2 AGR) = *3-SING)
((X2 COUNT) = +)
((Y1 AGR) = *3-SING)
((Y1 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y1 GENDER))


Rule Learning - Overview

  • Goal: Acquire Syntactic Transfer Rules

  • Use available knowledge from the source side (grammatical structure)

  • Three steps:

    • Flat Seed Generation: first guesses at transfer rules; flat syntactic structure

    • Compositionality: use previously learned rules to add hierarchical structure

    • Seeded Version Space Learning: refine rules by learning appropriate feature constraints
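As an illustration of the first step only, a flat seed rule can be written down directly from one word-aligned phrase pair: the two tag sequences plus the alignment links. This is a guess at the shape of such a rule, not the project's learner; the function name is hypothetical, and the alignment assumed for "the man" / "der Mann" is supplied here for the example.

```python
def flat_seed_rule(phrase_type, src_tags, tgt_tags, links):
    """Build a flat rule string from tag sequences and 1-based (x, y) links."""
    # List alignments in target order.
    ordered = sorted(links, key=lambda link: link[1])
    alignments = "".join(f"(X{x}::Y{y})" for x, y in ordered)
    return (f"{phrase_type}::{phrase_type} "
            f"[{' '.join(src_tags)}] -> [{' '.join(tgt_tags)}] "
            f"({alignments})")


# Assumed alignment: the-der, man-Mann.
print(flat_seed_rule("NP", ["DET", "N"], ["DET", "N"], [(1, 1), (2, 2)]))
```

The later steps would then add hierarchy (compositionality) and feature constraints (seeded version space learning) to a seed like this one; neither is sketched here.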

Manual Transfer Rules: Example

;; passive of 43 (7b)
VP::VP : [V V V] -> [Aux V]
((x1 form) = root)
((x2 type) =c light)
((x2 form) = part)
((x2 aspect) = perf)
((x3 lexwx) = 'jAnA')
((x3 form) = part)
((x3 aspect) = perf)
(x0 = x1)
((y1 lex) = be)
((y1 tense) = past)
((y1 agr num) = (x3 agr num))
((y1 agr pers) = (x3 agr pers))
((y2 form) = part)


Manual Transfer Rules: Example

[Tree diagrams. Source side: PP NP1 over NP P Adj N over "N1 ke eka aXyAya". Target side: NP1 PP over Adj N P NP over "one chapter of N1".]

; NP1 ke NP2 -> NP2 of NP1
; Ex: jIvana ke eka aXyAya
;     life of (one) chapter
; ==> a chapter of life

NP::NP : [PP NP1] -> [NP1 PP]
; ((x2 lexwx) = 'kA')

NP::NP : [NP1] -> [NP1]

PP::PP : [NP Postp] -> [Prep NP]
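The reordering in these rules can be sketched as a recursive walk over a source tree. This is a toy illustration, not the run-time transfer system: the tree encoding, the `REORDER` table, and the function name are hypothetical, feature constraints are ignored, and the word glosses are taken from the example comment above.

```python
# Target order of children, given as source child indices, for each rule.
REORDER = {
    ("NP", ("PP", "NP1")): (1, 0),    # NP::NP : [PP NP1] -> [NP1 PP]
    ("PP", ("NP", "Postp")): (1, 0),  # PP::PP : [NP Postp] -> [Prep NP]
}

# Word glosses from the slide's example.
LEXICON = {"jIvana": "life", "ke": "of", "eka": "one", "aXyAya": "chapter"}

# Source tree for "jIvana ke eka aXyAya" as (label, children) tuples.
TREE = ("NP", [
    ("PP", [("NP", ["jIvana"]), ("Postp", ["ke"])]),
    ("NP1", [("Adj", ["eka"]), ("N", ["aXyAya"])]),
])


def transfer(node):
    """Return the target words for a source subtree."""
    if isinstance(node, str):
        return [LEXICON.get(node, node)]
    label, children = node
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    # Keep source order when no rule matches this node.
    order = REORDER.get((label, child_labels), range(len(children)))
    words = []
    for i in order:
        words.extend(transfer(children[i]))
    return words


print(" ".join(transfer(TREE)))
```

Both rules apply here: the NP rule moves NP1 ahead of the PP, and the PP rule turns the postposition into a preposition, giving "one chapter of life".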