Morphological analysis for phrase based statistical machine translation


Morphological Analysis for Phrase-Based Statistical Machine Translation

LUONG Minh Thang

Supervisor: Dr. KAN Min Yen

National University of Singapore

Web IR / NLP Group (WING)


Modern Machine Translation (MT)

  • State-of-the-art systems: phrase-to-phrase translation with data-intensive techniques
  • But still, they
    • treat words as distinct entities
    • do not understand the internal structure of words
  • We investigate incorporating word-structure knowledge (morphology) and adopt a language-independent approach

Machine translation: understand word structure


Issues we address

  • Morphologically-aware system
    • Out-of-vocabulary problem: seen “car” before, but not “cars” (“cars” has two morphemes, “car” + “s”)
    • Derive word structure from raw data only: a language-general approach
  • Translation into highly-inflected languages
    • English-Finnish case study
    • Understand the language’s characteristics
    •  Suggestion of a self-correcting model

Finnish example (auto = “car”):
  auto/si: your car
  auto/i/si: your cars
  auto/i/ssa/si: in your cars
  auto/i/ssa/si/ko: in your cars?


What have others done?

  • Most prior work addresses the translation direction from highly- to less-inflected languages
    • Arabic-English, German-English, Finnish-English
  • Only a few works tackle the reverse direction, which is considered more challenging
    • English-Turkish: (Durgar El-Kahlout & Oflazer, 2007)
    • English-Russian, English-Arabic: (Toutanova et al., 2008)
    • These employ feature-rich approaches using abundant annotated data & language-specific tools
  • We also look at the reverse direction, English-Finnish, but stick to our language-general approach!


Agenda

  • Baseline statistical MT
    • Terminology
  • Our morphologically-aware SMT system
    • Baseline + morphological layers
  • Finnish study – morphological aspects
    • Suggestion of a self-correcting model
  • Experiments & results


Baseline statistical MT (SMT) – overview

  • We construct our baseline using Moses (Koehn et al., 2007), a state-of-the-art open-source SMT toolkit

[Diagram: monolingual/parallel training data → training → language model, translation model, reordering model; test data (source language) → decoding → output translation (target language) → evaluating → BLEU score]


Baseline statistical MT – Terminology

  • Parallel data: pairs of sentences in both languages
    (implies alignment correspondence)
  • Monolingual data: from one language only

[Diagram: source sentence → target sentence, with a reordering effect]

  • Distortion limit parameter controls reordering: how far translated words may move relative to the source word
  • We test the effect of this parameter later


Automatic evaluation in SMT

  • Human judgment is expensive & labor-intensive
  • Automatically evaluate using reference translation(s)

Input: Mary did not slap the green witch
  ↓ Baseline SMT system
Output: Maria daba una bofetada a verde bruja
Ref:    Maria no daba una bofetada a la bruja verde
  ↓ Evaluating
BLEU score


Automatic evaluation in SMT – BLEU score

  • Match unigrams, bigrams, trigrams, and up to N-grams

Ref:    Maria no daba una bofetada a la bruja verde
Output: Maria daba una bofetada a verde bruja

  • p1 (unigram matches) = 7
  • p2 (bigram matches) = 4
  • p3 (trigram matches) = 2
  • p4 (4-gram matches) = 1

BLEU score = length_ratio × exp((log p1 + … + log p4) / 4)
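The computation above can be sketched as follows. This is a simplified sentence-level illustration (clipped n-gram precisions combined by a geometric mean, times a brevity penalty standing in for length_ratio), not the exact scorer used in the experiments.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(output, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions for n = 1..max_n, times a brevity penalty."""
    out, ref = output.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        out_counts = Counter(ngrams(out, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each output n-gram count by its count in the reference
        matches = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        if matches == 0:
            return 0.0  # one zero precision zeroes the geometric mean
        log_prec_sum += math.log(matches / max(len(out) - n + 1, 1))
    # Brevity penalty: penalize outputs shorter than the reference
    bp = 1.0 if len(out) >= len(ref) else math.exp(1 - len(ref) / len(out))
    return bp * math.exp(log_prec_sum / max_n)
```

For the example above, the output earns partial credit at every n-gram order, so its score falls strictly between 0 and 1.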



Baseline SMT – Shortcomings?

  • Only deals well with language pairs of similar morphological complexity
  • Suffers from the data sparseness problem in highly-inflected languages

[Table: type/token statistics from the 714K-sentence corpus]
  Type: number of different words (vocabulary size)
  Token: total number of words


Why are highly-inflected languages hard?

  • Huge vocabulary size
    • The Finnish vocabulary is ~ 6 times the size of the English vocabulary
  • Prefixes/suffixes can be freely concatenated to form new words

    Finnish: oppositio/kansa/n/edusta/ja
    (opposition/people/of/represent/-ative)
    = opposition member of parliament

    Turkish: uygarlaştıramadıklarımızdanmışsınızcasına (this is a word!!!)
    (uygar/laş/tır/ama/dık/lar/ımız/dan/mış/sınız/casına)
    = (behaving) as if you are among those whom we could not cause to become civilized

     Make our system morphologically-aware to address these issues




Morpheme pre- & post-processing modules

[Diagram: parallel and monolingual training data feed the language, translation & reordering model training; test data passes through morpheme pre-processing (e.g. “autot” → “auto + t”, “cars” → “car + s”) before decoding, and morpheme post-processing rebuilds the final translation.]


Incorporating morphological layers

[Diagram: our morphologically-aware SMT adds a morpheme pre-processing step on the parallel/monolingual training data (before language, translation & reordering model training) and on the test data (before decoding), plus a morpheme post-processing step after decoding that yields the final translation E.]


Preprocessing – morpheme segmentation (MS)

  • We perform MS to address the data sparseness problem
    • “cars” might not appear in the training data, but “car” & “s” do
  • (Durgar El-Kahlout & Oflazer, 2007) & (Toutanova et al., 2008) also perform MS, but use morphological analyzers that
    • are customized for a specific language
    • utilize richly annotated data
  • We use an unsupervised morpheme segmentation tool, Morfessor, which requires only unannotated monolingual data.


Morpheme segmentation – Morfessor

  • Morfessor segments words in an unsupervised manner
    straight/STM + forward/STM + ness/SUF
  • 3 tags: PRE (prefix), STM (stem), & SUF (suffix)

[Table: type/token statistics from the 714K-sentence corpus, before and after segmentation]
   Reduces the data sparseness problem
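Applying such segmentations in pre-processing might look like the following; a minimal sketch, assuming a segmentation lexicon already produced by Morfessor (the lexicon contents and function name here are illustrative).

```python
def preprocess(sentence, seg_lexicon):
    """Replace each word by its tagged morphemes. A '+' is appended to
    every morpheme except the last to mark word-internal boundaries.
    seg_lexicon maps a word to a list of (morpheme, tag) pairs, as
    produced by an unsupervised segmenter such as Morfessor."""
    out = []
    for word in sentence.split():
        morphs = seg_lexicon.get(word, [(word, "STM")])  # unknown words kept whole
        for i, (m, tag) in enumerate(morphs):
            marker = "+" if i < len(morphs) - 1 else ""
            out.append(f"{m}/{tag}{marker}")
    return " ".join(out)

lexicon = {"cars": [("car", "STM"), ("s", "SUF")]}
preprocess("the cars", lexicon)  # 'the/STM car/STM+ s/SUF'
```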



Post-processing – morpheme concatenation

  • The output after decoding is a sequence of morphemes
    Pitäkää mme se omassa täsmällis essä tehtävä ssä ä n
     How do we put them back into words?
  • During translation, keep the tag info & the “+” sign (to indicate word-internal morpheme boundaries)
  • Use the word structure: WORD = ( PRE* STM SUF* )+

Example:
  After pre-processing/decoding: Pitäkää/STM+ mme/SUF  se/STM  omassa/STM  täsmällis/STM+ essä/STM  tehtävä/STM+ ssä/SUF+ ä/SUF+ n/SUF
  After post-processing: Pitäkäämme  se  omassa  täsmällisessä  tehtävässään
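The concatenation rule can be sketched as a small post-processing function; a minimal sketch assuming the decoder emits 'morpheme/TAG' tokens with a trailing '+' marking word-internal boundaries (the function name is illustrative).

```python
import re

def concatenate(decoded):
    """Rebuild words from a decoded morpheme stream. Each token is
    'morpheme/TAG', optionally ending in '+', which marks that the
    next token continues the same word (WORD = (PRE* STM SUF*)+)."""
    words, current = [], []
    for token in decoded.split():
        joined = token.endswith("+")
        morph = re.sub(r"/(PRE|STM|SUF)\+?$", "", token)  # strip tag and marker
        current.append(morph)
        if not joined:
            words.append("".join(current))
            current = []
    if current:  # trailing '+' with no continuation: join what we have
        words.append("".join(current))
    return " ".join(words)

concatenate("Pitäkää/STM+ mme/SUF se/STM")  # 'Pitäkäämme se'
```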





Finnish study – two distinct characteristics

  • More case endings than typical Indo-European languages
    • These normally correspond to prepositions or postpositions
    • E.g.: auto/sta “out of the car”, auto/on “into the car”
  • Endings are used where Indo-European languages have function words
    • Finnish possessive suffixes = English possessive pronouns
    • E.g.: auto/si “your car”, auto/mme “our car”


Structure of a nominal – a word followed by many suffixes

  • Structure: Nominal + number + case + possessive + particle


Structure of a finite verb form – Finnish suffixes ~ English function words

  • Structure: Verb + tense/mood + personal ending + particle


Potential challenges of highly-inflected languages to the system

  • A word might be followed by several suffixes
    • The system might get the stem right, but miss a suffix
  • Correct translation: my cars → auto/i/ni (i: plural, ni: my)
  • Intuition: use “my” and “s” on the source side to help

Source: ………. my/STM car/STM+ s/SUF ……….
Output: ..…………..auto/STM+ i/SUF…………....
How do we self-correct this suffix to i/ni?


Preliminary self-correcting model

  • Suffixes in a highly-inflected language ~ function words in a low-inflected language
    • Besides prefixes & suffixes, make use of source function words
  • Model as a sequence labeling task: the labels are the suffixes

[Diagram: a label chain Suffix(t-1), Suffix(t), Suffix(t+1) over stems Stem(t-1), Stem(t), Stem(t+1); for source “my/STM car/STM+ s/SUF” and output “auto/STM+ i/SUF”, the features func=“my”, suf=“s”, Stem(t)=“auto” are used to predict the correct suffix “i/ni”.]
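The sequence-labeling setup could extract features like these; a minimal sketch in which the feature names (src_func, src_suf, etc.) are illustrative assumptions, and a real system would feed them to a standard sequence labeler such as a CRF.

```python
def suffix_features(stems, t, src_function_words, src_suffixes):
    """Features for predicting the suffix label of target stem t:
    context stems plus the aligned source function word and source
    suffix. All feature names here are illustrative, not the exact
    feature set proposed in the thesis."""
    feats = {
        "stem": stems[t],
        "prev_stem": stems[t - 1] if t > 0 else "<s>",
        "next_stem": stems[t + 1] if t + 1 < len(stems) else "</s>",
    }
    if src_function_words.get(t):
        feats["src_func"] = src_function_words[t]  # e.g. "my"
    if src_suffixes.get(t):
        feats["src_suf"] = src_suffixes[t]         # e.g. "s"
    return feats

# "my cars" -> "auto + ?": features for predicting the suffix of "auto"
suffix_features(["auto"], 0, {0: "my"}, {0: "s"})
```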





Datasets from the European Parliament corpus

  • Four datasets of various sizes
    • Selected by first picking a keyword for each dataset, then extracting all sentences containing the keyword and its morphological variants
  • Modest in size compared to the 714K sentences of the full corpus
  • We chose these datasets to:
    • reduce running time
    • simulate the real situation of scarce resources


Experiments – Out-of-vocabulary (OOV) rates

  • OOV rate = number of untranslated words / total number of words
  • Reduction rate = (baseline OOV rate - our OOV rate) / baseline OOV rate

[Chart: reduction rates range from 10.33% to 34.74%, with the largest effect when data is limited]
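The two rates above can be computed directly; a minimal sketch (function and variable names are illustrative).

```python
def oov_rate(output_tokens, vocab):
    """Fraction of tokens the system could not translate,
    i.e. tokens outside the training vocabulary."""
    untranslated = sum(1 for w in output_tokens if w not in vocab)
    return untranslated / len(output_tokens)

def reduction_rate(baseline_oov, our_oov):
    """Relative OOV reduction of our system over the baseline."""
    return (baseline_oov - our_oov) / baseline_oov

# e.g. baseline OOV 10%, our OOV 6.5% -> 35% relative reduction
reduction_rate(0.10, 0.065)
```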



Overall results with BLEU score

  • Use the BLEU score metric, judged at
    • word level: the unit in an N-gram is a word
    • morpheme level: the unit in an N-gram is a morpheme

Word BLEU: our SMT is as competitive as the baseline SMT
Morpheme BLEU: our SMT shows better morpheme coverage


Overall results – distortion limit tuning

  • The distortion limit controls reordering
  • It has an influential effect on performance (Virpioja et al., 2007)

The baseline SMT is best at distortion limit 6; our SMT is best at 9
Our SMT is better in both word and morpheme BLEU


Error analysis

  • We are interested in how many times the system gets the stem right but misses the suffixes

 This shows a real need for the self-correcting model


Even further analysis – new results after the thesis!

  • Our datasets are specialized around their keywords
  • Results are more conclusive if we look at translations of phrases containing the dataset keywords

Conclusion: our SMT performs better on both tasks, getting the stems and the suffixes right.


References

  • Koehn, P., et al., 2007. Moses: open source toolkit for statistical machine translation.
  • Durgar El-Kahlout, İ. & Oflazer, K., 2007. Exploring different representational units in English-to-Turkish statistical machine translation.
  • Virpioja, S., et al., 2007. Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner.
  • Toutanova, K., et al., 2008. Applying morphology generation models to machine translation.



Q & A?

  • Thank you



Baseline statistical MT

[Diagram: detailed training pipeline]
  • Translation model training: EM algorithm, symmetrized word alignments (GIZA++ tool) → phrase tables
  • Language model training: N-gram extraction (SRILM) → language model
  • Tuning on development data: P(E|F) ∝ Σ λi · fi(E, F); learn the λi by minimum error rate training → λ*i
  • Decoding of test data F: E = argmax_E Σ λ*i · fi(E, F) (beam search, Moses toolkit) → final translation E
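The argmax over weighted feature functions can be illustrated with a toy scorer; a minimal sketch with made-up feature values, where a real decoder (e.g. Moses) searches the candidate space with beam search rather than enumerating it.

```python
def score(features, weights):
    """Log-linear model score: the sum of weighted feature functions."""
    return sum(weights[name] * value for name, value in features.items())

def decode(candidates, weights):
    """Pick the highest-scoring candidate translation E for an input F.
    The candidate set and feature values below are illustrative."""
    return max(candidates, key=lambda e: score(candidates[e], weights))

weights = {"tm": 0.6, "lm": 0.4}  # tuned lambda* values (illustrative)
candidates = {
    "autot": {"tm": -1.0, "lm": -2.0},  # scores 0.6*-1.0 + 0.4*-2.0 = -1.4
    "auto":  {"tm": -1.5, "lm": -1.0},  # scores 0.6*-1.5 + 0.4*-1.0 = -1.3
}
decode(candidates, weights)  # 'auto', the higher-scoring candidate
```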



Standard SMT system – translation model

  • Learns how to translate a source phrase into a target phrase
  • Outputs a phrase table

Parallel train data → translation model training →
  car industry in europe ||| euroopan autoteollisuus
  car industry in the ||| autoteollisuuden
  car industry in ||| autoteollisuuden
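Loading such a table can be sketched as follows; a minimal sketch assuming Moses-style 'source ||| target' lines (the helper name is illustrative, and real phrase tables also carry probability fields after the target phrase).

```python
def load_phrase_table(lines):
    """Parse Moses-style 'source ||| target' phrase-table lines into a
    dict mapping each source phrase to its candidate translations."""
    table = {}
    for line in lines:
        src, tgt = [p.strip() for p in line.split("|||")[:2]]
        table.setdefault(src, []).append(tgt)
    return table

lines = [
    "car industry in europe ||| euroopan autoteollisuus",
    "car industry in ||| autoteollisuuden",
]
load_phrase_table(lines)["car industry in europe"]  # ['euroopan autoteollisuus']
```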



Standard SMT system – language model

  • Constraints on which sequences of words can go together
  • Outputs an N-gram table

Target train data → language model training →
  -2.882216 commission 's argument 0
  -3.182358 commission 's arguments 0
  -3.620942 commission 's assertion 0
  -3.11402 commission 's assessment 0
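The entries above follow the SRILM/ARPA convention of a log10 probability, the n-gram, and a backoff weight; parsing them might look like this (a sketch, not SRILM's actual reader).

```python
def parse_ngram_table(lines):
    """Parse SRILM/ARPA-style entries: 'log10prob  n-gram  backoff'.
    Maps each n-gram (as a token tuple) to its log10 probability."""
    table = {}
    for line in lines:
        parts = line.split()
        table[tuple(parts[1:-1])] = float(parts[0])  # drop the backoff field
    return table

table = parse_ngram_table(["-2.882216 commission 's argument 0"])
table[("commission", "'s", "argument")]  # -2.882216
```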



Standard SMT system – tuning

  • Determine the weights used to combine the different models, e.g. the translation and language models.

Parallel development data → tuning: P(E|F) ∝ Σ λi · fi(E, F); learn the λi
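Tuning can be illustrated as picking the weights that maximize an evaluation metric on the development set; a toy grid-search sketch (real systems use minimum error rate training rather than grid search, and all names and values here are illustrative).

```python
def tune(dev_candidates, references, metric, weight_grid):
    """Toy tuning: try each weight vector on the development set and
    keep the one that maximizes the evaluation metric."""
    best_w, best_score = None, float("-inf")
    for w in weight_grid:
        # Decode each dev sentence: pick the candidate whose weighted
        # feature score is highest under the current weights
        outputs = [
            max(cands, key=lambda e: sum(wi * fi for wi, fi in zip(w, cands[e])))
            for cands in dev_candidates
        ]
        s = metric(outputs, references)
        if s > best_score:
            best_w, best_score = w, s
    return best_w

# One dev sentence with two candidates and two features each
dev = [{"good": (1.0, 0.0), "bad": (0.0, 1.0)}]
exact_match = lambda outs, refs: sum(o == r for o, r in zip(outs, refs))
tune(dev, ["good"], exact_match, [(1.0, 0.0), (0.0, 1.0)])  # (1.0, 0.0)
```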



Standard SMT system – Decoding

  • Uses the phrase table from the translation model, the N-gram table from the language model, and the combination parameters from tuning.
  • Generates, for each input sentence F, a set of candidate translations, and picks the highest-scoring one.

Test data F → decoding → final translation E

