
EBMT



1. EBMT
• Example Based Machine Translation, as used in the Pangloss system at Carnegie Mellon University
• Dave Inman

2. Outline
• EBMT in outline.
• What data do we need?
• How do we create a lexicon?
• Indexing the corpus.
• Finding chunks to translate.
• Matching a chunk against the target.
• Quality of translation.
• Speed of translation.
• Good and bad points.
• Conclusions.

3. EBMT in outline – Corpus
• Corpus:
  S1: The cat eats a fish. → Le chat mange un poisson.
  S2: A dog eats a cat. → Un chien mange un chat.
  …
  S99,999,999: …
• Index (a sketch of building one follows):
  the: S1
  cat: S1
  eats: S1
  …
  dog: S2
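A rough sketch of the corpus and its inverted index; the sentence pairs and variable names are illustrative, not Pangloss's actual data structures:

```python
from collections import defaultdict

# Illustrative parallel corpus: sentence ID -> (source, target).
corpus = {
    "S1": ("The cat eats a fish .", "Le chat mange un poisson ."),
    "S2": ("A dog eats a cat .", "Un chien mange un chat ."),
}

def build_index(corpus):
    """Map each source-language word to the IDs of sentences containing it."""
    index = defaultdict(set)
    for sid, (source, _target) in corpus.items():
        for word in source.lower().split():
            index[word].add(sid)
    return index

index = build_index(corpus)
print(sorted(index["cat"]))   # ['S1', 'S2']
print(sorted(index["dog"]))   # ['S2']
```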

4. EBMT in outline – find chunks
• A source language sentence is input:
  The cat eats a dog.
• Chunks of this sentence are matched against the corpus (see the sketch below):
  The cat : S1
  The cat eats : S1
  The cat eats a : S1
  a dog : S2
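A minimal sketch of the chunk lookup against an inverted index like the one above. The two-word minimum anticipates slide 13; all data and names are illustrative:

```python
# Tiny source-language index in the style of the earlier sketch.
index = {"the": {"S1"}, "cat": {"S1", "S2"}, "eats": {"S1", "S2"},
         "a": {"S1", "S2"}, "fish": {"S1"}, "dog": {"S2"}}

def chunks(sentence, min_len=2):
    """Enumerate contiguous word sequences of at least min_len words."""
    words = sentence.lower().split()
    for i in range(len(words)):
        for j in range(i + min_len, len(words) + 1):
            yield words[i:j]

def match_chunks(sentence, index):
    """Find chunks whose words all occur in some common corpus sentence."""
    matches = {}
    for chunk in chunks(sentence):
        # Cheap pre-filter: sentences containing every word of the chunk.
        # Word adjacency and order still have to be verified afterwards.
        common = set.intersection(*(index.get(w, set()) for w in chunk))
        if common:
            matches[" ".join(chunk)] = common
    return matches

print(match_chunks("The cat eats a dog", index))
# includes {'the cat': {'S1'}, 'the cat eats': {'S1'}, ..., 'a dog': {'S2'}}
```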

5. How does EBMT work in outline – Corpus
• 1. The target language sentences are retrieved for each chunk:
  The cat eats : S1
• Corpus:
  S1: The cat eats a fish. → Le chat mange un poisson.
• 2. The chunks are aligned with the target sentences (hard!):
  The cat eats → Le chat mange

6. How does EBMT work in outline – Corpus
• Chunks are scored to find the good matches:
  The cat eats → Le chat mange (score 78%)
  The cat eats → Le chat dort (score 43%)
  …
  a dog → un chien (score 67%)
  a dog → le chien (score 56%)
  a dog → un arbre (score 22%)
• The best translated chunks are put together to make the final translation (sketched below):
  The cat eats → Le chat mange
  a dog → un chien
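A toy sketch of picking the best-scoring translation per chunk and concatenating them. The candidate lists and scores are the illustrative figures from the slide, not the output of a real scoring function:

```python
candidates = {
    "the cat eats": [("le chat mange", 0.78), ("le chat dort", 0.43)],
    "a dog":        [("un chien", 0.67), ("le chien", 0.56), ("un arbre", 0.22)],
}

def assemble(chunk_order, candidates):
    """Pick the highest-scoring target segment for each chunk, in order."""
    best = [max(candidates[c], key=lambda pair: pair[1])[0] for c in chunk_order]
    return " ".join(best)

print(assemble(["the cat eats", "a dog"], candidates))
# -> "le chat mange un chien"
```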

7. What data do we need?
• A large corpus of parallel sentences… if possible in the same domain as the translations.
• A bilingual dictionary… but we can induce this from the corpus.
• A target language root/synonym list… so we can see similarity between words and inflected forms (e.g. verbs).
• Classes of words easily translated… such as numbers, towns, weekdays.

8. How to create a lexicon
• Take each sentence pair in the corpus.
• For each word in the source sentence, pair it with every word in the target sentence and increment that pair's frequency count.
• Repeat for as many sentences as possible.
• Use a threshold on the counts to separate likely translations from chance co-occurrences (see the sketch below).
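A minimal sketch of lexicon induction by co-occurrence counting, assuming whitespace-tokenised sentence pairs. The data and the threshold value are illustrative, not Pangloss's actual figures:

```python
from collections import Counter, defaultdict

pairs = [
    ("the cat eats a fish", "le chat mange un poisson"),
    ("a dog eats a cat", "un chien mange un chat"),
]

cooc = defaultdict(Counter)
for source, target in pairs:
    for s_word in source.split():
        for t_word in target.split():
            cooc[s_word][t_word] += 1   # every source word "votes" for every co-occurring target word

print(cooc["cat"].most_common(3))
# [('chat', 2), ('mange', 2), ('un', 2)]
# Over millions of sentence pairs the true translation ("chat") pulls far
# ahead of incidental co-occurrences; a frequency threshold then separates
# likely translations from noise.
THRESHOLD = 2   # illustrative; a real threshold depends on corpus size
lexicon = {s: [t for t, n in c.items() if n >= THRESHOLD] for s, c in cooc.items()}
```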

9. How to create a lexicon… example
• The cat eats a fish. → Le chat mange un poisson.

10. Create a lexicon… after many sentences
• the:
  le, 956
  la, 925
  un, 235
  ------ Threshold ----------
  chat, 47
  mange, 33
  poisson, 28
  …
  arbre, 18

11. Create a lexicon… after many sentences
• cat:
  chat, 963
  ------ Threshold ----------
  le, 604
  la, 485
  un, 305
  mange, 33
  poisson, 28
  …
  arbre, 47

12. Indexing the corpus
• For speed, the corpus is indexed on the source language sentences.
• Each word in each source language sentence is stored with information about the target sentence.
• Words can be added to the corpus and the index easily updated.
• Tokens are used for common classes of words (e.g. numbers); this makes matching more effective (see the sketch below).
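A sketch of replacing easily-translated word classes with class tokens before indexing, so that e.g. "Flight 741" and "Flight 802" match the same corpus entry. The token names and the choice of classes are illustrative assumptions:

```python
import re

WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday", "friday",
            "saturday", "sunday"}

def tokenize_classes(sentence):
    """Substitute class tokens for numbers and weekdays (illustrative classes)."""
    out = []
    for word in sentence.lower().split():
        if re.fullmatch(r"\d+([.,]\d+)*", word):
            out.append("<NUM>")       # any number matches any other number
        elif word in WEEKDAYS:
            out.append("<WEEKDAY>")
        else:
            out.append(word)
    return " ".join(out)

print(tokenize_classes("The flight leaves at 9 on Monday"))
# the flight leaves at <NUM> on <WEEKDAY>
```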

13. Finding chunks to translate
• Look up each word of the source sentence in the index.
• Look for chunks of the source sentence (at least 2 adjacent words) which match the corpus.
• Keep only the last few matches against the corpus (translation memory); a sketch follows.
• Pangloss uses the last 5 matches for any chunk.
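One way to keep only the most recent matches per chunk, as in a translation memory. The `maxlen=5` mirrors the "last 5 matches" heuristic above, but the data structure itself is an assumption, not Pangloss's implementation:

```python
from collections import deque

recent_matches = {}

def record_match(chunk, sentence_id, limit=5):
    """Remember the latest matches for a chunk; older ones drop off."""
    matches = recent_matches.setdefault(chunk, deque(maxlen=limit))
    matches.append(sentence_id)   # the oldest entry is discarded automatically

for sid in ["S1", "S7", "S12", "S40", "S77", "S93"]:
    record_match("the cat eats", sid)
print(list(recent_matches["the cat eats"]))   # ['S7', 'S12', 'S40', 'S77', 'S93']
```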

14. Matching a chunk against the target
• For each source chunk found previously, retrieve the target sentences from the corpus (using the index).
• Try to find the translation for the source chunk from these sentences.
• This is the hard bit!
• Look for the minimum and maximum segments in the target sentences which could correspond to the source chunk, then score each of these segments (see the sketch below).
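A sketch of enumerating candidate target segments for a source chunk. The bounds used here (chunk length ± 1 word) are an illustrative stand-in for the "minimum and maximum segments" mentioned above:

```python
def candidate_segments(target_sentence, chunk_len, slack=1):
    """Yield contiguous target windows that could translate the chunk."""
    words = target_sentence.split()
    lo = max(1, chunk_len - slack)
    hi = min(len(words), chunk_len + slack)
    for length in range(lo, hi + 1):
        for start in range(len(words) - length + 1):
            yield words[start:start + length]

for seg in candidate_segments("le chat mange un poisson", chunk_len=3):
    print(" ".join(seg))
# prints every 2-, 3- and 4-word window, e.g. "le chat mange"
```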

15. Scoring a segment…
• Unmatched words: higher priority is given to sentences containing all the words in the input chunk.
• Noise: higher priority is given to corpus sentences which have fewer extra words.
• Order: higher priority is given to sentences containing the input words in an order closer to their order in the input chunk.
• Morphology: higher priority is given to sentences in which words match exactly rather than as morphological variants.
• A toy scorer combining these criteria is sketched below.
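A toy scorer combining the four criteria. The weights and the crude suffix-stripping "morphology" are illustrative assumptions, not the Pangloss formula:

```python
def stem(word):
    """Very crude stemmer standing in for a real root/synonym list."""
    return word[:-1] if word.endswith("s") else word

def score(chunk, sentence):
    chunk_w, sent_w = chunk.lower().split(), sentence.lower().split()
    exact = sum(w in sent_w for w in chunk_w)
    morph = sum(w not in sent_w and stem(w) in map(stem, sent_w) for w in chunk_w)
    unmatched = len(chunk_w) - exact - morph          # criterion 1: missing words
    noise = len(sent_w) - exact - morph               # criterion 2: extra words
    # Criterion 3: do the exactly-matched words keep their relative order?
    positions = [sent_w.index(w) for w in chunk_w if w in sent_w]
    in_order = positions == sorted(positions)
    return (exact + 0.5 * morph                       # criterion 4: exact beats variant
            - 1.0 * unmatched - 0.2 * noise
            + (0.5 if in_order else 0.0))

print(score("the cat eats", "the cat eats a fish"))   # high: all words, in order
print(score("the cat eats", "a dog eats a cat"))      # lower: order broken, noisier
```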

16. Whole sentence match…
• If we are lucky, the whole sentence will be found in the corpus!
• In that case the target sentence is used directly, with no alignment step needed.
• Useful if translation memory is available (sentences recently translated are added to the corpus); see the sketch below.
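A sketch of the whole-sentence shortcut: if the input matches a corpus source sentence exactly, its stored target is returned directly, and finished translations are added back, which is the translation-memory idea. The table and normalisation are illustrative assumptions:

```python
memory = {
    "the cat eats a fish": "le chat mange un poisson",   # illustrative entry
}

def translate(sentence, memory):
    key = sentence.lower().rstrip(".").strip()
    if key in memory:
        return memory[key]            # exact hit: no chunking or alignment
    result = "<fall back to chunk-based translation>"
    memory[key] = result              # remember this sentence for next time
    return result

print(translate("The cat eats a fish.", memory))   # le chat mange un poisson
```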

17. Quality of translation
• Pangloss was tested on source sentences from a different domain than the examples in the corpus.
• Pangloss "covered" about 70% of the input sentences.
• This means a match was found against the corpus…
• …but not necessarily a good match.
• Others report that around 60% of the translation can be understood by a native speaker; Systran manages about 70%.

18. Speed of translation
• Translations are much faster than with Systran.
• Simple sentences are translated in seconds.
• The corpus can be extended (translation memory) at about 6 MBytes per minute (Sun SPARCstation).
• A 270 MByte corpus takes 45 minutes to index.

19. Good points
• Fast.
• Easy to add a new language pair.
• No need to analyse languages (much).
• Can induce a dictionary from the corpus.
• Allows easy implementation of translation memory.
• Graceful degradation as the size of the corpus is reduced.

20. Bad points
• Quality is second best at present.
• Depends on a large corpus of parallel, well translated sentences.
• 30% of the source has no coverage (no translation).
• Matching of words is brittle: we can see a match where Pangloss cannot.
• The domain of the corpus should match the domain to be translated, so that chunks match.

21. Conclusions
• An alternative to Systran.
• Faster.
• Lower quality.
• Quick to develop for a new language pair, if a corpus exists!
• Needs no linguistics.
• Might improve as bigger corpora become available?
