Eliciting a corpus of word-aligned phrases for MT

Lori Levin, Alon Lavie, Erik Peterson

Language Technologies Institute

Carnegie Mellon University



Introduction

  • Problem: Building Machine Translation systems for languages with scarce resources:

    • Not enough data for Statistical MT and Example-Based MT

    • Not enough human linguistic expertise for writing rules

  • Approach:

    • Elicit high quality, word-aligned data from bilingual speakers

    • Learn transfer rules from the elicited data

Modules of the AVENUE/MilliRADD rule learning system and MT system

  • Word-aligned elicited data

  • English Language Model

  • Learning Module

  • Run Time Transfer System

  • Word-to-Word Translation Probabilities

  • Transfer Rules, for example:

    {PP,4894}
    ;;Score:0.0470
    PP::PP [NP POSTP] -> [PREP NP]
    ((X2::Y1)(X1::Y2))

  • Translation Lexicon



  • Demo of elicitation interface

  • Description of elicitation corpus

  • Overview of automated rule learning

Demo of Elicitation Tool

  • Speaker needs to be bilingual and literate: no other knowledge necessary

  • Mappings between words and phrases: Many-to-many, one-to-none, many-to-none, etc.

  • Create phrasal mappings

  • Fonts and character sets:

    • Including Hindi, Chinese, and Arabic

  • Add morpheme boundaries to target language

  • Add alternate translations

  • Notes and context
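The word and phrase mappings listed above can be pictured as a set of index links between the two sides of a phrase pair. The following is a minimal sketch only: the class name `AlignedPhrase`, its methods, and the specific links chosen for the example are assumptions made for illustration, not the data format of the elicitation tool.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedPhrase:
    """A phrase pair with word-level links (hypothetical structure)."""
    source: list
    target: list
    links: set = field(default_factory=set)  # pairs of (source index, target index)

    def link(self, source_indices, target_indices):
        # Linking several words on either side yields many-to-one,
        # one-to-many, or many-to-many mappings.
        for i in source_indices:
            for j in target_indices:
                self.links.add((i, j))

    def unaligned_source(self):
        # Source words mapped to nothing (the "one-to-none" case).
        linked = {i for i, _ in self.links}
        return [i for i in range(len(self.source)) if i not in linked]

    def unaligned_target(self):
        linked = {j for _, j in self.links}
        return [j for j in range(len(self.target)) if j not in linked]


# Illustrative alignment (an assumption): all three English words map to
# the single Mapudungun word.
pair = AlignedPhrase(["I", "am", "falling"], ["Tranmeken"])
pair.link([0, 1, 2], [0])
print(sorted(pair.links))
print(pair.unaligned_source(), pair.unaligned_target())
```

Storing links as index pairs keeps all mapping types in one representation; a word with no link on the other side is simply absent from the set.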

English-Chinese Example

English-Hindi Example

Spanish-Mapudungun Example

English-Arabic Example

Testing of Elicitation Tool

  • DARPA Hindi Surprise Language Exercise

  • Around 10 Hindi speakers

  • Around 17,000 phrases translated and aligned

    • Elicitation corpus

    • NPs and PPs from Treebanked Brown Corpus

Elicitation Corpus: Basic Principles

  • Minimal pairs

  • Syntactic compositionality

  • Special semantic/pragmatic constructions

  • Navigation based on language typology and universals

  • Challenges

Elicitation Corpus: Minimal Pairs

Eng: I fell.
Sp: Caí
M: Tranün

Eng: You (John) fell.
Sp: Tú (Juan) caíste
M: Eymi tranimi (Kuan)

Eng: You (Mary) fell.
Sp: Tú (María) caíste
M: Eymi tranimi (Maria)

Eng: I am falling.
Sp: Estoy cayendo
M: Tranmeken

Eng: You (John) are falling.
Sp: Tú (Juan) estás cayendo
M: Eimi (Kuan) tranmekeymi

Mapudungun: Spoken by around one million people in Chile and Argentina.

Using feature vectors to detect minimal pairs

  • np1:(subj-of cl1).pro-pers.hum.2.sg.masc.no-clusn.no-def.no-alien

  • cl1:(subj np1).intr-ag.past.complete

    • Eng: You (John) fell.
      Sp: Tú (Juan) caíste
      M: Eymi tranimi (Kuan)

  • np1:(subj-of cl1).pro-pers.hum.2.sg.fem.no-clusn.no-def.no-alien

  • cl1:(subj np1).intr-ag.past.complete

    • Eng: You (Mary) fell.
      Sp: Tú (María) caíste
      M: Eymi tranimi (Maria)

Feature vectors can be extracted from the output of a parser for English or Spanish. (Except for features that English and Spanish do not have…)
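One way to read the feature vectors above is as dot-separated value lists keyed by a constituent label; two sentences then form a minimal pair when exactly one value differs. The sketch below works under that assumption. The function names `parse_vectors`, `differences`, and `is_minimal_pair` are hypothetical, and the comparison is positional, so it presumes both sentences use the same labels with the same number of features.

```python
def parse_vectors(lines):
    """Map each label (e.g. 'np1') to its list of feature values."""
    vectors = {}
    for line in lines:
        label, _, features = line.partition(":")
        vectors[label.strip()] = [f.strip() for f in features.split(".")]
    return vectors


def differences(lines_a, lines_b):
    """Return (label, value_a, value_b) for every feature that differs."""
    a, b = parse_vectors(lines_a), parse_vectors(lines_b)
    diffs = []
    for label in a:
        for value_a, value_b in zip(a[label], b[label]):
            if value_a != value_b:
                diffs.append((label, value_a, value_b))
    return diffs


def is_minimal_pair(lines_a, lines_b):
    return len(differences(lines_a, lines_b)) == 1


john_fell = [
    "np1:(subj-of cl1).pro-pers.hum.2.sg.masc.no-clusn.no-def.no-alien",
    "cl1:(subj np1).intr-ag.past.complete",
]
mary_fell = [
    "np1:(subj-of cl1).pro-pers.hum.2.sg.fem.no-clusn.no-def.no-alien",
    "cl1:(subj np1).intr-ag.past.complete",
]

print(differences(john_fell, mary_fell))      # [('np1', 'masc', 'fem')]
print(is_minimal_pair(john_fell, mary_fell))  # True
```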

Syntactic Compositionality

  • The tree

  • The tree fell.

  • I think that the tree fell.

  • We learn rules for smaller phrases

    • E.g., NP

  • Their root nodes become non-terminals in the rules for larger phrases.

    • E.g., S containing an NP

  • Meaning of a phrase is predictable from the meanings of the parts.
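The idea that the root of a smaller rule becomes a non-terminal in a larger rule can be illustrated on part-of-speech sequences. This is a rough sketch, not the learning module itself: the function `compose` and the tags `DET`, `N`, `V`, `PRO`, and `COMP` are assumptions chosen for the three example phrases.

```python
def compose(flat_sequence, learned_rules):
    """Replace each span matching a learned rule's source side with its root label."""
    result = list(flat_sequence)
    for root, pattern in learned_rules:
        n = len(pattern)
        i = 0
        while i + n <= len(result):
            if result[i:i + n] == pattern:
                result[i:i + n] = [root]  # the span collapses into one non-terminal
            i += 1
    return result


# Assume a rule for "the tree" (NP -> DET N) has already been learned.
learned = [("NP", ["DET", "N"])]

print(compose(["DET", "N", "V"], learned))                 # "The tree fell."
print(compose(["PRO", "V", "COMP", "DET", "N", "V"], learned))  # "I think that the tree fell."
```

With the NP rule in place, the flat sequence for "The tree fell." reduces to `['NP', 'V']`, so the sentence-level rule only has to mention the NP node rather than its internal words.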

Special Semantic and Pragmatic Constructions

  • Meaning may not be compositional

    • Not predictable from the meanings of the parts

  • May not follow normal rules of grammar.

    • Suggestion: Why not go?

  • Word-for-word translation may not work.

  • Tend to be sources of MT mismatches

    • Comparative:

      • English: Hotel A is [closer than Hotel B]

      • Japanese: Hoteru A wa [Hoteru B yori] [tikai desu]
        Hotel A TOP Hotel B than close is

      • “Closer than Hotel B” is a constituent in English, but “Hoteru B yori tikai” is not a constituent in Japanese.

Examples of Semantic/Pragmatic Categories

  • Speech Acts: requests, suggestions, etc.

  • Comparatives and Equatives

  • Modality: possibility, probability, ability, obligation, uncertainty, evidentiality

  • Correlatives (e.g., the more the merrier)

  • Causatives

  • Etc.

A Challenge: Combinatorics

  • Person (1, 2, 3, 4)

  • Number (sg, pl, du, paucal)

  • Gender/Noun Class (?)

  • Animacy (animate/inanimate)

  • Definiteness (definite/indefinite)

  • Proximity (near, far, very far, etc.)

  • Inclusion/exclusion

  • Multiply with tenses and aspects: complete, incomplete, real, unreal, iterative, habitual, present, past, recent past, future, recent future, non-past, non-future, etc.

  • Multiply with verb class: agentive intransitive, non-agentive intransitive, transitive, ditransitive, etc.

  • (Case marking and agreement may vary with verb tense, verb class, animacy, definiteness, and whether or not object outranks subject in person or animacy.)
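To get a feel for the scale, the values listed on this slide can simply be multiplied out. The count below is a rough illustration under several assumptions: gender/noun class is left out because the slide gives no values for it, proximity is cut off at the three values named, inclusion/exclusion is treated as two values, and the tense/aspect list is treated as one flat set of alternatives even though real languages combine these categories differently.

```python
from math import prod

noun_features = {
    "person": ["1", "2", "3", "4"],
    "number": ["sg", "pl", "du", "paucal"],
    "animacy": ["animate", "inanimate"],
    "definiteness": ["definite", "indefinite"],
    "proximity": ["near", "far", "very far"],
    "inclusion": ["inclusive", "exclusive"],  # assumed value names
}

tense_aspect = [
    "complete", "incomplete", "real", "unreal", "iterative", "habitual",
    "present", "past", "recent past", "future", "recent future",
    "non-past", "non-future",
]

verb_class = [
    "agentive intransitive", "non-agentive intransitive",
    "transitive", "ditransitive",
]

noun_combinations = prod(len(values) for values in noun_features.values())
total = noun_combinations * len(tense_aspect) * len(verb_class)

print(noun_combinations)  # 384
print(total)              # 19968
```

Even with these simplifications, a single noun phrase slot combined with tense/aspect and verb class yields close to twenty thousand candidate sentences, which is far more than a bilingual speaker can be asked to translate.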

Solutions to Combinatorics

  • Generate paradigms of feature vectors, and then automatically generate sentences to match each feature vector.

  • Use known universals to eliminate features: e.g., languages without plurals don’t have duals.
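Both ideas can be sketched together: enumerate a paradigm as the cross product of feature values, then shrink the feature set using the universal quoted above. The helper names `paradigm` and `prune_number` and the small feature set are hypothetical, and the step of generating a sentence for each vector is not shown.

```python
from itertools import product


def paradigm(features):
    """Enumerate every feature vector as a dict of feature -> value."""
    names = list(features)
    return [dict(zip(names, values)) for values in product(*features.values())]


def prune_number(features, has_plural):
    """Apply the universal: a language without plurals has no duals either."""
    if has_plural:
        return features
    pruned = dict(features)
    pruned["number"] = [v for v in features["number"] if v not in ("pl", "du")]
    return pruned


features = {
    "person": ["1", "2", "3"],
    "number": ["sg", "pl", "du"],
}

full = paradigm(features)
reduced = paradigm(prune_number(features, has_plural=False))

print(len(full))     # 9
print(len(reduced))  # 3
print(reduced[0])    # {'person': '1', 'number': 'sg'}
```

In practice the `has_plural` answer would come from an earlier elicitation step, so that later parts of the corpus are only presented when they can be relevant to the language.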

Other Challenges of Computer-Based Elicitation

  • Inconsistency of human translation and alignment

  • Bias toward word order of the elicitation language

    • Need to provide discourse context for given and new information

  • How to elicit things that aren’t grammaticalized in the elicitation language:

    • Evidential: I see that it is raining / Apparently it is raining / It must be raining.

      • Context: You are inside the house. Your friend comes in wet.

Transfer Rule Formalism

  • Type information

  • Part-of-speech/constituent information

  • x-side constraints

  • y-side constraints

  • e.g. ((Y1 AGR) = (X1 AGR))

;SL: the man, TL: der Mann
NP::NP [DET N] -> [DET N]
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X2 AGR) = *3-SING)
((X2 COUNT) = +)
((Y1 AGR) = *3-SING)
((Y1 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y1 GENDER))
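As an illustration of how the parts of such a rule fit together, the sketch below splits the rule text into its type, constituent sequences, and constraints, and labels each constraint by the side it mentions. The `TransferRule` class, the `parse_rule` and `side` helpers, and the regular expressions are assumptions written for this example; they are not the parser used by the transfer system.

```python
import re
from dataclasses import dataclass


@dataclass
class TransferRule:
    source_type: str
    target_type: str
    source_rhs: list
    target_rhs: list
    constraints: list


# Header such as "NP::NP [DET N] -> [DET N]" (an optional ":" may follow the type).
HEADER = re.compile(r"^(\w+)::(\w+)\s*:?\s*\[([^\]]*)\]\s*->\s*\[([^\]]*)\]$")


def parse_rule(text):
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith(";")]  # drop comments
    match = HEADER.match(lines[0])
    if match is None:
        raise ValueError("unrecognized rule header: " + lines[0])
    return TransferRule(match.group(1), match.group(2),
                        match.group(3).split(), match.group(4).split(),
                        lines[1:])


def side(constraint):
    """Return 'x', 'y', or 'xy' depending on which variables appear."""
    letters = {v.lower() for v in re.findall(r"\(([XYxy])\d+", constraint)}
    return "".join(sorted(letters))


rule = parse_rule("""
;SL: the man, TL: der Mann
NP::NP [DET N] -> [DET N]
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X2 AGR) = *3-SING)
((X2 COUNT) = +)
((Y1 AGR) = *3-SING)
((Y1 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y1 GENDER))
""")

print(rule.source_type, rule.source_rhs, rule.target_rhs)
print([side(c) for c in rule.constraints])
print(side("((Y1 AGR) = (X1 AGR))"))  # xy
```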


Rule Learning - Overview

  • Goal: Acquire Syntactic Transfer Rules

  • Use available knowledge from the source side (grammatical structure)

  • Three steps:

    • Flat Seed Generation: first guesses at transfer rules; flat syntactic structure

    • Compositionality: use previously learned rules to add hierarchical structure

    • Seeded Version Space Learning: refine rules by learning appropriate feature constraints
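The first of the three steps can be pictured as turning one word-aligned example directly into a flat rule. The sketch below assumes that part-of-speech tags are available for both sides and that the alignment is given as 1-based index pairs; the function name `flat_seed_rule` and the output layout are illustrative guesses modeled on the rule notation used in these slides, not the actual learning module.

```python
def flat_seed_rule(phrase_type, source_pos, target_pos, alignment):
    """Build a flat rule string from one aligned example.

    alignment: set of (source index, target index) pairs, 1-based.
    """
    header = "{0}::{0} [{1}] -> [{2}]".format(
        phrase_type, " ".join(source_pos), " ".join(target_pos))
    links = "".join("(X{}::Y{})".format(i, j) for i, j in sorted(alignment))
    return header + "\n(" + links + ")"


# "the man" -> "der Mann": determiner aligns to determiner, noun to noun.
seed = flat_seed_rule("NP", ["DET", "N"], ["DET", "N"], {(1, 1), (2, 2)})
print(seed)
```

The printed seed has no feature constraints yet; in the outline above those are added later, in the version space learning step.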

Flat Seed Rule Generation

Version Space Learning

Examples of Learned Rules

Manual Transfer Rules: Example

;; passive of 43 (7b)
VP::VP : [V V V] -> [Aux V]
((x1 form) = root)
((x2 type) =c light)
((x2 form) = part)
((x2 aspect) = perf)
((x3 lexwx) = 'jAnA')
((x3 form) = part)
((x3 aspect) = perf)
(x0 = x1)
((y1 lex) = be)
((y1 tense) = past)
((y1 agr num) = (x3 agr num))
((y1 agr pers) = (x3 agr pers))
((y2 form) = part)


Manual Transfer Rules: Example

Source tree: [PP [NP N1] [P ke]] [NP1 [Adj eka] [N aXyAya]]
Target tree: [NP1 [Adj one] [N chapter]] [PP [P of] [NP N1]]

; NP1 ke NP2 -> NP2 of NP1
; Ex: jIvana ke eka aXyAya
;     life of (one) chapter
;     ==> a chapter of life

NP::NP : [PP NP1] -> [NP1 PP]
; ((x2 lexwx) = 'kA')

NP::NP : [NP1] -> [NP1]

PP::PP : [NP Postp] -> [Prep NP]
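The reordering in this example can be traced with a small tree walk. This is only a sketch under stated assumptions: the tuple-based tree encoding, the `RULES` and `LEXICON` tables, and the `transfer` and `words` functions are invented for illustration; the lexicon entries come from the gloss in the example, and the child order for the NP rule (first and second constituents swap) is assumed from the constituent names.

```python
# Each entry maps (parent label, source child labels) to
# (target child labels, order), where order[k] is the index of the
# source child that fills target position k.
RULES = {
    ("NP", ("PP", "NP1")): (("NP1", "PP"), (1, 0)),
    ("PP", ("NP", "Postp")): (("Prep", "NP"), (1, 0)),
}

# Word translations taken from the gloss in the example.
LEXICON = {"jIvana": "life", "ke": "of", "eka": "one", "aXyAya": "chapter"}


def transfer(node):
    """Translate leaves and reorder children wherever a rule matches."""
    label, body = node
    if isinstance(body, str):  # leaf: (part of speech, word)
        return (label, LEXICON[body])
    children = [transfer(child) for child in body]
    rule = RULES.get((label, tuple(child[0] for child in body)))
    if rule is None:
        return (label, children)
    target_labels, order = rule
    return (label, [(new_label, children[i][1])
                    for new_label, i in zip(target_labels, order)])


def words(node):
    label, body = node
    if isinstance(body, str):
        return [body]
    return [w for child in body for w in words(child)]


source_tree = ("NP", [
    ("PP", [("NP", [("N", "jIvana")]), ("Postp", "ke")]),
    ("NP1", [("Adj", "eka"), ("N", "aXyAya")]),
])

print(" ".join(words(transfer(source_tree))))  # one chapter of life
```

The postposition rule moves "of" in front of "life", and the NP rule moves the head phrase "one chapter" in front of the resulting prepositional phrase.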




