
EBMT



1. EBMT
• Example Based Machine Translation, as used in the Pangloss system at Carnegie Mellon University
• Dave Inman

2. Outline
• EBMT in outline.
• What data do we need?
• How do we create a lexicon?
• Indexing the corpus.
• Finding chunks to translate.
• Matching a chunk against the target.
• Quality of translation.
• Speed of translation.
• Good and bad points.
• Conclusions.

3. EBMT in outline – Corpus
• Corpus:
  S1: The cat eats a fish. → Le chat mange un poisson.
  S2: A dog eats a cat. → Un chien mange un chat.
  …
  S99,999,999: …
• Index (a sketch of building one follows):
  the: S1
  cat: S1
  eats: S1
  …
  dog: S2
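A rough sketch of the corpus and its inverted index; the sentence pairs and variable names are illustrative, not Pangloss's actual data structures:

```python
from collections import defaultdict

# Illustrative parallel corpus: sentence ID -> (source, target).
corpus = {
    "S1": ("The cat eats a fish .", "Le chat mange un poisson ."),
    "S2": ("A dog eats a cat .", "Un chien mange un chat ."),
}

def build_index(corpus):
    """Map each source-language word to the IDs of sentences containing it."""
    index = defaultdict(set)
    for sid, (source, _target) in corpus.items():
        for word in source.lower().split():
            index[word].add(sid)
    return index

index = build_index(corpus)
print(sorted(index["cat"]))   # ['S1', 'S2']
print(sorted(index["dog"]))   # ['S2']
```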

4. EBMT in outline – find chunks
• A source language sentence is input:
  The cat eats a dog.
• Chunks of this sentence are matched against the corpus (see the sketch below):
  The cat : S1
  The cat eats : S1
  The cat eats a : S1
  a dog : S2
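A minimal sketch of the chunk lookup against an inverted index like the one above. The two-word minimum anticipates slide 13; all data and names are illustrative:

```python
# Tiny source-language index in the style of the earlier sketch.
index = {"the": {"S1"}, "cat": {"S1", "S2"}, "eats": {"S1", "S2"},
         "a": {"S1", "S2"}, "fish": {"S1"}, "dog": {"S2"}}

def chunks(sentence, min_len=2):
    """Enumerate contiguous word sequences of at least min_len words."""
    words = sentence.lower().split()
    for i in range(len(words)):
        for j in range(i + min_len, len(words) + 1):
            yield words[i:j]

def match_chunks(sentence, index):
    """Find chunks whose words all occur in some common corpus sentence."""
    matches = {}
    for chunk in chunks(sentence):
        # Cheap pre-filter: sentences containing every word of the chunk.
        # Word adjacency and order still have to be verified afterwards.
        common = set.intersection(*(index.get(w, set()) for w in chunk))
        if common:
            matches[" ".join(chunk)] = common
    return matches

print(match_chunks("The cat eats a dog", index))
# includes {'the cat': {'S1'}, 'the cat eats': {'S1'}, ..., 'a dog': {'S2'}}
```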

5. How does EBMT work in outline – Corpus
• 1. The target language sentences are retrieved for each chunk:
  The cat eats : S1
• Corpus:
  S1: The cat eats a fish. → Le chat mange un poisson.
• 2. The chunks are aligned with the target sentences (hard!):
  The cat eats → Le chat mange

6. How does EBMT work in outline – Corpus
• Chunks are scored to find the good matches:
  The cat eats → Le chat mange (score 78%)
  The cat eats → Le chat dort (score 43%)
  …
  a dog → un chien (score 67%)
  a dog → le chien (score 56%)
  a dog → un arbre (score 22%)
• The best translated chunks are put together to make the final translation (sketched below):
  The cat eats → Le chat mange
  a dog → un chien
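A toy sketch of picking the best-scoring translation per chunk and concatenating them. The candidate lists and scores are the illustrative figures from the slide, not the output of a real scoring function:

```python
candidates = {
    "the cat eats": [("le chat mange", 0.78), ("le chat dort", 0.43)],
    "a dog":        [("un chien", 0.67), ("le chien", 0.56), ("un arbre", 0.22)],
}

def assemble(chunk_order, candidates):
    """Pick the highest-scoring target segment for each chunk, in order."""
    best = [max(candidates[c], key=lambda pair: pair[1])[0] for c in chunk_order]
    return " ".join(best)

print(assemble(["the cat eats", "a dog"], candidates))
# -> "le chat mange un chien"
```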

7. What data do we need?
• A large corpus of parallel sentences… if possible in the same domain as the translations.
• A bilingual dictionary… but we can induce this from the corpus.
• A target language root/synonym list… so we can see similarity between words and inflected forms (e.g. verbs).
• Classes of words easily translated… such as numbers, towns, weekdays.

8. How to create a lexicon
• Take each sentence pair in the corpus.
• For each word in the source sentence, pair it with every word in the target sentence and increment that pair's frequency count.
• Repeat for as many sentences as possible.
• Use a threshold on the counts to separate likely translations from chance co-occurrences (see the sketch below).
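A minimal sketch of lexicon induction by co-occurrence counting, assuming whitespace-tokenised sentence pairs. The data and the threshold value are illustrative, not Pangloss's actual figures:

```python
from collections import Counter, defaultdict

pairs = [
    ("the cat eats a fish", "le chat mange un poisson"),
    ("a dog eats a cat", "un chien mange un chat"),
]

cooc = defaultdict(Counter)
for source, target in pairs:
    for s_word in source.split():
        for t_word in target.split():
            cooc[s_word][t_word] += 1   # every source word "votes" for every co-occurring target word

print(cooc["cat"].most_common(3))
# [('chat', 2), ('mange', 2), ('un', 2)]
# Over millions of sentence pairs the true translation ("chat") pulls far
# ahead of incidental co-occurrences; a frequency threshold then separates
# likely translations from noise.
THRESHOLD = 2   # illustrative; a real threshold depends on corpus size
lexicon = {s: [t for t, n in c.items() if n >= THRESHOLD] for s, c in cooc.items()}
```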

9. How to create a lexicon… example
• The cat eats a fish. → Le chat mange un poisson.

10. Create a lexicon… after many sentences
• the:
  le, 956
  la, 925
  un, 235
  ------ Threshold ----------
  chat, 47
  mange, 33
  poisson, 28
  …
  arbre, 18

11. Create a lexicon… after many sentences
• cat:
  chat, 963
  ------ Threshold ----------
  le, 604
  la, 485
  un, 305
  mange, 33
  poisson, 28
  …
  arbre, 47

12. Indexing the corpus
• For speed, the corpus is indexed on the source language sentences.
• Each word in each source language sentence is stored with information about the target sentence.
• Words can be added to the corpus and the index easily updated.
• Tokens are used for common classes of words (e.g. numbers); this makes matching more effective (see the sketch below).
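A sketch of replacing easily-translated word classes with class tokens before indexing, so that e.g. "Flight 741" and "Flight 802" match the same corpus entry. The token names and the choice of classes are illustrative assumptions:

```python
import re

WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday", "friday",
            "saturday", "sunday"}

def tokenize_classes(sentence):
    """Substitute class tokens for numbers and weekdays (illustrative classes)."""
    out = []
    for word in sentence.lower().split():
        if re.fullmatch(r"\d+([.,]\d+)*", word):
            out.append("<NUM>")       # any number matches any other number
        elif word in WEEKDAYS:
            out.append("<WEEKDAY>")
        else:
            out.append(word)
    return " ".join(out)

print(tokenize_classes("The flight leaves at 9 on Monday"))
# the flight leaves at <NUM> on <WEEKDAY>
```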

13. Finding chunks to translate
• Look up each word of the source sentence in the index.
• Look for chunks of the source sentence (at least 2 adjacent words) which match the corpus.
• Keep only the last few matches against the corpus (translation memory); a sketch follows.
• Pangloss uses the last 5 matches for any chunk.
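One way to keep only the most recent matches per chunk, as in a translation memory. The `maxlen=5` mirrors the "last 5 matches" heuristic above, but the data structure itself is an assumption, not Pangloss's implementation:

```python
from collections import deque

recent_matches = {}

def record_match(chunk, sentence_id, limit=5):
    """Remember the latest matches for a chunk; older ones drop off."""
    matches = recent_matches.setdefault(chunk, deque(maxlen=limit))
    matches.append(sentence_id)   # the oldest entry is discarded automatically

for sid in ["S1", "S7", "S12", "S40", "S77", "S93"]:
    record_match("the cat eats", sid)
print(list(recent_matches["the cat eats"]))   # ['S7', 'S12', 'S40', 'S77', 'S93']
```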

14. Matching a chunk against the target
• For each source chunk found previously, retrieve the target sentences from the corpus (using the index).
• Try to find the translation for the source chunk from these sentences.
• This is the hard bit!
• Look for the minimum and maximum segments in the target sentences which could correspond to the source chunk, then score each of these segments (see the sketch below).
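A sketch of enumerating candidate target segments for a source chunk. The bounds used here (chunk length ± 1 word) are an illustrative stand-in for the "minimum and maximum segments" mentioned above:

```python
def candidate_segments(target_sentence, chunk_len, slack=1):
    """Yield contiguous target windows that could translate the chunk."""
    words = target_sentence.split()
    lo = max(1, chunk_len - slack)
    hi = min(len(words), chunk_len + slack)
    for length in range(lo, hi + 1):
        for start in range(len(words) - length + 1):
            yield words[start:start + length]

for seg in candidate_segments("le chat mange un poisson", chunk_len=3):
    print(" ".join(seg))
# prints every 2-, 3- and 4-word window, e.g. "le chat mange"
```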

15. Scoring a segment…
• Unmatched words: higher priority is given to sentences containing all the words in the input chunk.
• Noise: higher priority is given to corpus sentences which have fewer extra words.
• Order: higher priority is given to sentences containing the input words in an order closer to their order in the input chunk.
• Morphology: higher priority is given to sentences in which words match exactly rather than as morphological variants.
• A toy scorer combining these criteria is sketched below.
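A toy scorer combining the four criteria. The weights and the crude suffix-stripping "morphology" are illustrative assumptions, not the Pangloss formula:

```python
def stem(word):
    """Very crude stemmer standing in for a real root/synonym list."""
    return word[:-1] if word.endswith("s") else word

def score(chunk, sentence):
    chunk_w, sent_w = chunk.lower().split(), sentence.lower().split()
    exact = sum(w in sent_w for w in chunk_w)
    morph = sum(w not in sent_w and stem(w) in map(stem, sent_w) for w in chunk_w)
    unmatched = len(chunk_w) - exact - morph          # criterion 1: missing words
    noise = len(sent_w) - exact - morph               # criterion 2: extra words
    # Criterion 3: do the exactly-matched words keep their relative order?
    positions = [sent_w.index(w) for w in chunk_w if w in sent_w]
    in_order = positions == sorted(positions)
    return (exact + 0.5 * morph                       # criterion 4: exact beats variant
            - 1.0 * unmatched - 0.2 * noise
            + (0.5 if in_order else 0.0))

print(score("the cat eats", "the cat eats a fish"))   # high: all words, in order
print(score("the cat eats", "a dog eats a cat"))      # lower: order broken, noisier
```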

16. Whole sentence match…
• If we are lucky, the whole sentence will be found in the corpus!
• In that case the target sentence is used directly, with no alignment step needed.
• Useful if translation memory is available (sentences recently translated are added to the corpus); see the sketch below.
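A sketch of the whole-sentence shortcut: if the input matches a corpus source sentence exactly, its stored target is returned directly, and finished translations are added back, which is the translation-memory idea. The table and normalisation are illustrative assumptions:

```python
memory = {
    "the cat eats a fish": "le chat mange un poisson",   # illustrative entry
}

def translate(sentence, memory):
    key = sentence.lower().rstrip(".").strip()
    if key in memory:
        return memory[key]            # exact hit: no chunking or alignment
    result = "<fall back to chunk-based translation>"
    memory[key] = result              # remember this sentence for next time
    return result

print(translate("The cat eats a fish.", memory))   # le chat mange un poisson
```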

17. Quality of translation
• Pangloss was tested on source sentences from a different domain than the examples in the corpus.
• Pangloss "covered" about 70% of the input sentences.
• This means a match was found against the corpus…
• …but not necessarily a good match.
• Others report that around 60% of the translation can be understood by a native speaker; Systran manages about 70%.

18. Speed of translation
• Translations are much faster than with Systran.
• Simple sentences are translated in seconds.
• The corpus can be extended (translation memory) at about 6 MBytes per minute (Sun SPARCstation).
• A 270 MByte corpus takes 45 minutes to index.

19. Good points
• Fast.
• Easy to add a new language pair.
• No need to analyse languages (much).
• Can induce a dictionary from the corpus.
• Allows easy implementation of translation memory.
• Graceful degradation as the size of the corpus is reduced.

20. Bad points
• Quality is second best at present.
• Depends on a large corpus of parallel, well translated sentences.
• 30% of the source has no coverage (no translation).
• Matching of words is brittle: we can see a match where Pangloss cannot.
• The domain of the corpus should match the domain to be translated, so that chunks match.

21. Conclusions
• An alternative to Systran.
• Faster.
• Lower quality.
• Quick to develop for a new language pair, if a corpus exists!
• Needs no linguistics.
• Might improve as bigger corpora become available?
