Generation-Heavy Hybrid Machine Translation
Nizar Habash, Postdoctoral Researcher, Center for Computational Learning Systems, Columbia University
Columbia University NLP Colloquium, October 28, 2004


The Intuition: Generation-Heavy Machine Translation

[Figure labels: Español, عربي, Dictionary, gist, English]

Introduction: Research Contributions
  • A general, reusable, and extensible Machine Translation (MT) model that transcends the need for large amounts of deep symmetric knowledge
  • Development of reusable large-scale resources for English
  • A large-scale Spanish-English MT system, Matador, which is more robust across genres and produces more grammatical output than simple statistical or symbolic techniques
Roadmap
  • Introduction
  • Generation-Heavy Machine Translation
  • Evaluation
  • Conclusion
  • Future Work
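Introduction: MT Pyramid

[Figure: the MT pyramid. Analysis climbs from source word through source syntax to source meaning; Generation descends from target meaning through target syntax to target word. Gisting operates at the word level, Transfer at the syntax level, and Interlingua at the meaning level.]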
Introduction: MT Pyramid

[Figure: the same pyramid labeled with the resources used at each level: Dictionaries/Parallel Corpora (word), Transfer Lexicons (syntax), Interlingual Lexicons (meaning).]
Introduction: MT Pyramid

[Figure: the pyramid with Gisting linking source word to target word and Transfer linking source syntax to target syntax, below the source and target meaning level.]
Introduction: Why gisting is not enough

SP: Sobre la base de dichas experiencias se estableció en 1988 una metodología.

Gist: Envelope her basis out speak experiences them settle at 1988 one methodology.

EN: On the basis of these experiences, a methodology was arrived at in 1988.
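
The word-for-word output above is what a pure dictionary lookup produces. The following is a minimal sketch, assuming a hypothetical one-sense-per-word lexicon (`TOY_LEXICON`) whose entries are simply read off the slide's example; it is not the gisting system evaluated in the talk.

```python
# HYPOTHETICAL toy lexicon: one English gloss per Spanish word,
# read off the gisting example on the slide.
TOY_LEXICON = {
    "sobre": "envelope", "la": "her", "base": "basis", "de": "out",
    "dichas": "speak", "experiencias": "experiences", "se": "them",
    "estableció": "settle", "en": "at", "una": "one",
    "metodología": "methodology",
}

def gist(sentence):
    """Translate word by word, ignoring context, word order and morphology."""
    tokens = sentence.rstrip(".").split()
    # Unknown tokens (here the year) are passed through unchanged.
    glosses = [TOY_LEXICON.get(token.lower(), token) for token in tokens]
    text = " ".join(glosses)
    return text[0].upper() + text[1:] + "."

print(gist("Sobre la base de dichas experiencias se estableció en 1988 una metodología."))
```

Each word is looked up in isolation, so nothing can choose "on" over "envelope" for "sobre" or rebuild the passive; that context is what the later stages supply.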

Introduction: Translation Divergences
  • Divergences occur in 35% of sentences in the TREC El Norte Corpus (Dorr et al 2002)
  • Divergence Types
    • Categorial (X tener hambre → X be hungry)
    • Conflational (X dar puñaladas a Z → X stab Z)
    • Structural (X entrar en Y → X enter Y)
    • Head Swapping (X cruzar Y nadando → X swim across Y)
    • Thematic (X gustar a Y → Y like X)
Roadmap
  • Introduction
  • Generation-Heavy Machine Translation
  • Evaluation
  • Conclusion
  • Future Work
Generation-Heavy Hybrid Machine Translation
  • Problem: asymmetric resources
    • High-quality, broad-coverage semantic resources for the target language
    • Low-quality resources for the source language
    • Low-quality (many-to-many) translation lexicon
  • Thesis: we can approximate interlingual MT without the use of symmetric interlingual resources
Relevant Background Work
  • Hybrid Natural Language Generation
    • Constrained Overgeneration → Statistical Ranking
    • Nitrogen (Langkilde and Knight 1998), Halogen (Langkilde 2002)
    • FERGUS (Rambow and Bangalore 2000)
  • Lexical Conceptual Structure (LCS) based MT
    • (Jackendoff 1983), (Dorr 1993)

Generation-Heavy Hybrid Machine Translation

[Figure: the GHMT pipeline of Analysis, Translation, and Generation. The Generation stage is broken down into Theta Linking, Expansion, Assignment, Pruning, Linearization, and Ranking.]

Matador: Spanish-English GHMT

[Figure: Spanish input passes through Analysis, Translation, and Generation to produce English. The Generation stage is EXERGE (Expansive Rich Generation for English), broken down into Theta Linking, Expansion, Assignment, Pruning, Linearization, and Ranking.]

GHMT: Analysis
  • Source language syntactic dependency
  • Example: Yo le di puñaladas a Juan.
  • Features of representation
    • Approximation of predicate-argument structure
    • Long-distance dependencies

dar
  :subj Yo
  :obj puñalada
  :mod a
    :obj Juan
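
The dependency analysis of the example sentence can be held in a small tree structure. This is only an illustrative sketch: the `Node` class and `triples` helper are hypothetical names, and the tree mirrors the one drawn on the slide (dar with Yo as :subj, puñalada as :obj, and a as :mod governing Juan).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One lexeme in a syntactic dependency tree (hypothetical helper class)."""
    lemma: str
    children: list = field(default_factory=list)  # list of (relation, Node)

    def add(self, relation, child):
        self.children.append((relation, child))
        return self  # returning self allows chained construction

def triples(node):
    """Yield (head, relation, dependent) for every edge, depth-first."""
    for relation, child in node.children:
        yield (node.lemma, relation, child.lemma)
        yield from triples(child)

# Tree for "Yo le di puñaladas a Juan." as drawn on the slide.
tree = (Node("dar")
        .add(":subj", Node("Yo"))
        .add(":obj", Node("puñalada"))
        .add(":mod", Node("a").add(":obj", Node("Juan"))))

for head, relation, dependent in triples(tree):
    print(head, relation, dependent)
```

The head-relation-dependent triples are the approximation of predicate-argument structure that the bullets refer to: the arguments of dar are reachable directly from the verb regardless of surface word order.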
GHMT: Translation
  • Lexical transfer but NO structural change

dar → ADMINISTER, CONFER, DELIVER, EXTEND, GIVE, GRANT, HAND, LAND, RENDER
  :subj Yo → I, MY, MINE
  :obj puñalada → STAB, KNIFE_WOUND
  :mod a → AT, BY, INTO, THROUGH, TO
    :obj Juan → JOHN

  • Translation Lexicon

(tener V) ((have V) (own V) (possess V) (be V))
(deber V) ((owe V) (should AUX) (must AUX))
(soler V) ((tend V) (usually AV))

GHMT: Thematic Linking
  • Syntactic Dependency → Thematic Dependency
  • Which divergence

[Figure: the translated syntactic dependency tree next to its thematic counterpart, in which the verb candidates are narrowed to EXTEND, GIVE, GRANT, RENDER with Agent (I, MY, MINE), Theme (STAB, KNIFE_WOUND), and Goal (JOHN).]
GHMT: Thematic Linking Resources
  • Word Class Lexicon

:NUMBER "V.13.1.a.ii" :NAME "Give - No Exchange" :POS V
:THETA_ROLES (((ag obl) (th obl) (goal obl to))
              ((ag obl) (goal obl) (th obl)))
:LCS_PRIMS (cause go)
:WORDS (feed give pass pay peddle refund render repay serve)

  • Syntactic-Thematic Linking Map

(:subj ag instr th exp loc src goal perc mod-poss poss)
(:obj2 goal src th perc ben)
(across goal loc)
(in loc mod-poss perc goal poss prop)
(to prop goal ben info th exp perc pred loc time)

GHMT: Thematic Linking
  • Syntactic Dependency → Thematic Dependency

((ADMINISTER V.13.2 ((AG OBL) (TH OBL) (GOAL OPT TO)))
(CONFER V.37.6.b ((EXP OBL)))
(DELIVER V.11.1 ((AG OBL) (GOAL OBL) (TH OBL) (SRC OPT FROM)))
(EXTEND V.47.1 ((TH OBL) (MOD-LOC OPT . T)))
(EXTEND V.13.3 ((AG OBL) (TH OBL) (GOAL OPT TO)))
(EXTEND V.13.3 ((AG OBL) (GOAL OBL) (TH OBL)))
(EXTEND V.13.2 ((AG OBL) (TH OBL) (GOAL OPT TO)))
(GIVE V.13.1.a.ii ((AG OBL) (TH OBL) (GOAL OBL TO)))
(GIVE V.13.1.a.ii ((AG OBL) (GOAL OBL) (TH OBL)))
(GRANT V.29.5.e ((AG OBL) (INFO OBL THAT)))
(GRANT V.29.5.d ((AG OBL) (TH OBL) (PROP OBL TO)))
(GRANT V.13.3 ((AG OBL) (TH OBL) (GOAL OPT TO)))
(GRANT V.13.3 ((AG OBL) (GOAL OBL) (TH OBL)))
(HAND V.11.1 ((AG OBL) (TH OBL) (GOAL OPT TO) (SRC OPT FROM)))
(HAND V.11.1 ((AG OBL) (GOAL OBL) (TH OBL) (SRC OPT FROM)))
(LAND V.9.10 ((AG OBL) (TH OBL)))
(RENDER V.13.1.a.ii ((AG OBL) (TH OBL) (GOAL OBL TO)))
(RENDER V.13.1.a.ii ((AG OBL) (GOAL OBL) (TH OBL)))
(RENDER V.10.6.a ((AG OBL) (TH OBL) (MOD-POSS OPT OF)))
(RENDER V.10.6.a.LOCATIVE ((AG OPT) (SRC OBL) (TH OPT OF))))
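
To make the idea of matching syntactic arguments against theta grids concrete, here is a deliberately simplified sketch. It is not the talk's linking algorithm, whose selection criteria the slides do not spell out. Assumptions: the `:obj` entry of the linking map is invented (the slide only lists `:obj2`), only three of the grids above are used, the modifier is taken to be "to", and a grid is accepted only if every argument receives a distinct role and every obligatory role is filled.

```python
from itertools import permutations

# Which thematic roles a syntactic relation or preposition may express.
LINKING_MAP = {
    ":subj": {"ag", "instr", "th", "exp", "loc", "src", "goal",
              "perc", "mod-poss", "poss"},                      # from the slide
    "to": {"prop", "goal", "ben", "info", "th", "exp", "perc",
           "pred", "loc", "time"},                              # from the slide
    ":obj": {"th", "goal", "src", "perc", "ben"},               # ASSUMPTION
}

# Theta grids copied from the slide: (role, obligatory?, preposition or None).
GRIDS = {
    ("GIVE", "V.13.1.a.ii"): [("ag", True, None), ("th", True, None),
                              ("goal", True, "to")],
    ("CONFER", "V.37.6.b"): [("exp", True, None)],
    ("LAND", "V.9.10"): [("ag", True, None), ("th", True, None)],
}

def fits(marker, role, prep):
    """A bare role needs a :relation; a prepositional role needs that preposition."""
    required = None if marker.startswith(":") else marker
    return prep == required and role in LINKING_MAP.get(marker, set())

def link(args, grid):
    """Return {role: filler} if the arguments fit the grid, else None.

    args is a list of (marker, filler), e.g. (":subj", "I") or ("to", "JOHN").
    """
    if len(args) > len(grid):
        return None
    for roles in permutations(grid, len(args)):
        used = {role for role, _, _ in roles}
        if any(obligatory and role not in used for role, obligatory, _ in grid):
            continue  # an obligatory role would stay unfilled
        if all(fits(marker, role, prep)
               for (marker, _), (role, _, prep) in zip(args, roles)):
            return {role: filler
                    for (_, filler), (role, _, _) in zip(args, roles)}
    return None

args = [(":subj", "I"), (":obj", "STAB"), ("to", "JOHN")]
for (verb, verb_class), grid in GRIDS.items():
    print(verb, verb_class, link(args, grid))
```

Under these assumptions only the GIVE grid links, as Agent I, Theme STAB, Goal JOHN; CONFER and LAND are rejected because they have too few roles for three arguments.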


Interlingua Approximation through Expansion Operations

[Figure: four expansion operations. Categorial Variation: development_N ↔ develop_V. Node Conflation / Inflation: put_V butter_N ↔ butter_V. Relation Conflation / Inflation and Relation Variation are illustrated with the trees enter (subj John, obj room), enter (subj John, in room), and go (subj John, in room).]
Interlingua Approximation: 2nd Degree Expansion

[Figure: cross (subj John, obj river, mod swimming) → Relation Inflation → go (subj John, across river, mod swimming) → Node Conflation → swim (subj John, across river).]

GHMT: Structural Expansion
  • Conflation Example

[Figure: two thematic structures side by side: GIVE_V (Agent I, Theme STAB_N, Goal JOHN) and STAB_V (Agent I, Goal JOHN).]

GHMT: Structural Expansion
  • Conflation and Inflation
  • Structural Expansion Resources
    • Word Class Lexicon

:NUMBER "V.42.2" :NAME "Poison Verbs" :POS V
:THETA_ROLES (((ag obl) (goal obl)))
:LCS_PRIMS (cause go)
:WORDS (crucify electrocute garrotte hang knife poison shoot smother stab strangle)

    • Categorial Variation Database (Habash and Dorr 2003)

(:V (hunger) :N (hunger hungriness) :AJ (hungry))
(:V (validate) :N (validation validity) :AJ (valid))
(:V (cross) :N (crossing cross) :P (across))
(:V (stab) :N (stab))
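
The two resources above can be combined into a rough conflation test. This is a sketch under stated assumptions, not the system's actual procedure: the acceptance rule (the theme noun has a verb variant whose class shares the LCS primitives of the original verb and whose roles are exactly the remaining arguments) is inferred from the slides, and the function names are hypothetical.

```python
# Word classes copied from the slides, reduced to LCS primitives and role names.
WORD_CLASSES = {
    "give": {"class": "V.13.1.a.ii", "lcs_prims": ("cause", "go"),
             "roles": {"ag", "th", "goal"}},
    "stab": {"class": "V.42.2", "lcs_prims": ("cause", "go"),
             "roles": {"ag", "goal"}},
}

# Categorial variation clusters copied from the slide.
CATVAR = [
    {"V": ["hunger"], "N": ["hunger", "hungriness"], "AJ": ["hungry"]},
    {"V": ["validate"], "N": ["validation", "validity"], "AJ": ["valid"]},
    {"V": ["cross"], "N": ["crossing", "cross"], "P": ["across"]},
    {"V": ["stab"], "N": ["stab"]},
]

def variants(word, from_pos, to_pos):
    """Return the to_pos members of the cluster containing word as from_pos."""
    for cluster in CATVAR:
        if word in cluster.get(from_pos, []):
            return cluster.get(to_pos, [])
    return []

def conflate(verb, args):
    """Try to fold the theme noun into the verb; args maps role -> filler."""
    theme = args.get("th")
    rest = {role: filler for role, filler in args.items() if role != "th"}
    for candidate in variants(theme, "N", "V"):
        entry = WORD_CLASSES.get(candidate)
        if (entry
                and entry["lcs_prims"] == WORD_CLASSES[verb]["lcs_prims"]
                and set(rest) == entry["roles"]):
            return candidate, rest
    return None

print(conflate("give", {"ag": "I", "th": "stab", "goal": "John"}))
```

For the running example this turns give (Agent I, Theme stab, Goal John) into stab (Agent I, Goal John), while a theme with no verb variant, such as knife_wound, is left alone.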

GHMT: Structural Expansion
  • Conflation Example

[Figure: animation of the conflation. GIVE_V (Agent I, Theme STAB_N, Goal JOHN) is shown with STAB_V, the verb variant of its theme. Both verbs are marked [CAUSE GO], and STAB_V has unfilled Agent and Goal slots (*) lined up with the Agent and Goal of GIVE_V.]
GHMT: Syntactic Assignment
  • Thematic → Syntactic Mapping

[Figure: the thematic structures GIVE_V (Agent, Theme, Goal) and STAB_V (Agent, Goal) are mapped to syntactic structures: GIVE_V with Subject (I, MY, …), IObject (JOHN), and Object (STAB_N, KNIFE_WOUND_N); STAB_V with Subject (I, MY, …) and Object (JOHN); and GIVE_V with Subject (I, MY, …), Object (STAB_N, KNIFE_WOUND_N), and Mod (TO, AT, …) governing Object (JOHN).]
GHMT: Structural N-gram Pruning
  • Statistical lexical selection

[Figure: the three syntactic structures before and after pruning. Candidate sets shrink to a single lexeme: I, MY, … to I; STAB_N, KNIFE_WOUND_N to STAB_N; TO, AT, … to TO.]
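
Statistical lexical selection of this kind can be sketched as ranking each dependent candidate by how often it co-occurs with its head. The counts below are HYPOTHETICAL numbers invented for the illustration, and `prune` is a hypothetical helper; the talk's actual model and its statistics are not given on the slides.

```python
# HYPOTHETICAL structural bigram counts: (head lexeme, dependent lexeme) -> count.
COUNTS = {
    ("give", "I"): 120, ("give", "my"): 2, ("give", "mine"): 1,
    ("give", "stab"): 9, ("give", "knife_wound"): 1,
    ("give", "to"): 300, ("give", "at"): 40, ("give", "by"): 25,
}

def prune(head, candidates, keep=1):
    """Keep the `keep` candidates seen most often as dependents of `head`."""
    ranked = sorted(candidates,
                    key=lambda dep: COUNTS.get((head, dep), 0),
                    reverse=True)
    return ranked[:keep]

slots = {
    "Subject": ["I", "my", "mine"],
    "Object": ["stab", "knife_wound"],
    "Mod": ["to", "at", "by"],
}
pruned = {slot: prune("give", words) for slot, words in slots.items()}
print(pruned)
```

Pruning before linearization shrinks the number of word sequences the ranker has to score.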
GHMT: Target Statistical Resources
  • Structural N-gram Model
    • Long-distance
    • Lexemes
  • Surface N-gram Model
    • Local
    • Surface-forms

[Figure: "every cloud has a silver lining" shown twice, once over lexemes (have) for the structural model and once over surface forms (has) for the surface model.]
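
The contrast between the two models can be seen by extracting bigrams both ways from the slide's example sentence. This is a sketch: the dependency structure in `TREE` is an assumption made for the illustration, and the helper names are hypothetical.

```python
# ASSUMED dependency tree over lexemes: head -> dependents.
TREE = {
    "have": ["cloud", "lining"],
    "cloud": ["every"],
    "lining": ["a", "silver"],
}
SURFACE = "every cloud has a silver lining".split()

def structural_bigrams(tree):
    """(head lexeme, dependent lexeme) pairs: can span long distances."""
    return [(head, dep) for head, deps in tree.items() for dep in deps]

def surface_bigrams(tokens):
    """Adjacent surface-form pairs: strictly local."""
    return list(zip(tokens, tokens[1:]))

print(structural_bigrams(TREE))
print(surface_bigrams(SURFACE))
```

The structural pair (have, lining) relates the verb to an object two words away and uses the lexeme have, whereas the surface model only sees neighbours such as (cloud, has) and the inflected form has.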
GHMT: Linearization & Ranking
  • Oxygen Linearization (Habash 2000)
  • Halogen Statistical Ranking (Langkilde 2002)

---------------------------------------------------------
I stabbed John . [-1.670270]
I gave a stab at John . [-2.175831]
I gave the stab at John . [-3.969686]
I gave an stab at John . [-4.489933]
I gave a stab by John . [-4.803054]
I gave a stab to John . [-5.045810]
I gave a stab into John . [-5.810673]
I gave a stab through John . [-5.836419]
I gave a knife wound by John . [-6.041891]

Roadmap
  • Introduction
  • Generation-Heavy Machine Translation
  • Evaluation
    • Overall Evaluation
    • Component Evaluation
  • Conclusion
  • Future Work
Overall Evaluation: Systems

(Resnik 1997)

(Brown et al 1990) (Al-Onaizan et al 1999) (Germann and Marcu 2000)

Overall Evaluation: Bleu Metric
  • Bleu
    • BiLingual Evaluation Understudy (Papineni et al 2001)
    • Modified n-gram precision with length penalty
    • Quick, inexpensive and language independent
    • Correlates highly with human evaluation
    • Bias against synonyms and inflectional variations
Overall Evaluation: Results
  • Systran is overall best
  • Gist is overall worst
  • Matador is more robust than IBM4
  • Matador is more grammatical than IBM4
  • Matador has less information loss than IBM4
Overall Evaluation: Grammaticality
  • Example
    • SP: Ademàs dijo que solamente una inyecciòn masiva de capital extranjero ...
    • EN: Further, he said that only a massive injection of foreign capital ...
    • IBM4: further stated that only a massive inyecciòn of capital abroad ...
    • MTDR: Also he spoke only a massive injection of foreign capital ...
  • Parsed all sentences (Spanish, English reference and English output)
    • Can we find main verb?
    • Pro Drop Restoration
Overall Evaluation: Loss of Information
  • Example
    • SP: El daño causado al pueblo de Sudáfrica jamás debe subestimarse.
    • EN: The damage caused to the people of his country should never be underestimated.
    • IBM4: the damage * the people of south * must never underestimated .
    • MTDR: Never the causado damage to the people of South Africa should be underestimated.
Component Evaluation
  • Conducted several component evaluations
    • Parser
      • ~75% correct (labeled dependency links)
    • Categorial Variation Database
      • 81% Precision-Recall
    • Structural Expansion
    • Structural N-grams
Component Evaluation: Structural Expansion
  • Insignificant increase in Bleu score
  • 40% of divergences pragmatic
  • LCS lexicon coverage issues
  • Minimal handling of nominal divergences
  • Over-expansion
    • SP: Además, destruyó totalmente sus cultivos de subsistencia …
    • EN: It had totally destroyed Samoa's staple crops ...
    • MTDR: Furthermore, it totaled their cultivations of subsistence …
    • SP: Dicha adición se publica sólo en años impares.
    • EN: That addendum is issued in odd-numbered years only.
    • MTDR: concerned addendum is excluded in odd years.
Component Evaluation: Structural N-grams
  • 60% speed-up with no effect on quality
Roadmap
  • Introduction
  • Generation-Heavy Machine Translation
  • Evaluation
  • Conclusion
  • Future Work
Conclusion: Research Contributions
  • A general reusable and extensible MT model that transcends the need for large amounts of symmetric knowledge
  • A systematic non-interlingual/non-transfer framework for handling translation divergences
  • Extending the concept of symbolic overgeneration to include conflation and head-swapping of structural variations.
  • A model for language-independent syntactic-to-thematic linking
Conclusion: Research Contributions
  • Development of reusable large-scale modules and resources: Exerge, Categorial Variation Database, etc.
  • A large-scale Spanish-English GHMT implementation
  • An evaluation of Matador against four models of machine translation found it to be robust across genre and to produce more grammatical output.
Ongoing Work
  • Retargetability to new languages
    • Chinese, Arabic
  • Extending system to use bi-texts
    • Phrase dictionary
    • Weighted translation pairs
  • Generation-Heavy parsing
    • Small dependency grammar for foreign language
    • English structural n-grams to rank parses
  • Extending system with new optional modules
    • Cross-lingual headline generation
      • DepTrimmer (work with Bonnie Dorr), extending Trimmer (Dorr, et al. 2003) to dependency representation

Future Work
  • Categorial Variation Database
    • Improving word-cluster correctness
  • Structural Expansion
    • Extending to nominal divergences
    • Improving thematic linking with a statistical model
  • Structural N-grams
    • Enriching with syntactic/thematic relations
Thank you!

Questions?

Overall Evaluation: Bleu Metric

Test Sentence

colorless green ideas sleep furiously

Gold Standard References

all dull jade ideas sleep irately

drab emerald concepts sleep furiously

colorless immature thoughts nap angrily

Overall Evaluation: Bleu Metric

Test Sentence

colorless green ideas sleep furiously

Gold Standard References

all dull jade ideas sleep irately

drab emerald concepts sleep furiously

colorless immature thoughts nap angrily

Unigram precision = 4/5

Overall Evaluation: Bleu Metric

Test Sentence

colorless green ideas sleep furiously

Gold Standard References

all dull jade ideas sleep irately

drab emerald concepts sleep furiously

colorless immature thoughts nap angrily

Unigram precision = 4 / 5 = 0.8

Bigram precision = 2 / 4 = 0.5

Bleu Score = $(a_1 a_2 \cdots a_n)^{1/n} = (0.8 \times 0.5)^{1/2} = 0.6325 \rightarrow 63.25$
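
The worked example can be checked with a few lines of code. This is a minimal sketch of modified n-gram precision and its geometric mean as used on the slide; like the slide's arithmetic it leaves out the length penalty, and the function names are hypothetical.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clip each candidate n-gram count by its maximum count in any reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())

def bleu(candidate, references, max_n=2):
    """Geometric mean of the modified precisions (no length penalty here)."""
    precisions = []
    for n in range(1, max_n + 1):
        matched, total = modified_precision(candidate, references, n)
        precisions.append(matched / total)
    if min(precisions) == 0:
        return 0.0
    return exp(sum(log(p) for p in precisions) / max_n)

test = "colorless green ideas sleep furiously".split()
refs = [
    "all dull jade ideas sleep irately".split(),
    "drab emerald concepts sleep furiously".split(),
    "colorless immature thoughts nap angrily".split(),
]

print(modified_precision(test, refs, 1))  # (4, 5)
print(modified_precision(test, refs, 2))  # (2, 4)
print(round(bleu(test, refs), 4))         # 0.6325
```

The only unmatched unigram is "green", and the two matched bigrams are "ideas sleep" and "sleep furiously", which gives the 0.8 and 0.5 used above.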

Overall Evaluation
  • Investigating BLEU’s bias towards inflectional variants
    • SP: Los programas de ajuste estructural se han aplicado rigurosamente.
    • EN: Structural adjustment programmes had been rigorously implemented.
    • IBM4: structural adjustment programmes have been applied strictly.
    • MTDR: programmes of structural adjustment have been added rigurosament.