
Approaches to Machine Translation



Presentation Transcript


  1. Approaches to Machine Translation
CSC 5930 Machine Translation
Fall 2012
Dr. Tom Way

  2. How do humans do translation?
• Learn a foreign language (the training stage):
   • Memorize word translations → translation lexicon
   • Learn some patterns → templates, transfer rules
   • Exercise → reinforced learning? reranking?
      • Passive activity: read, listen
      • Active activity: write, speak
• Translate (the decoding stage):
   • Understand the sentence → parsing, semantic analysis?
   • Clarify or ask for help (optional) → interactive MT?
   • Translate the sentence → word-level? phrase-level? generate from meaning?

  3. What kinds of resources are available to MT?
• Translation lexicon:
   • Bilingual dictionary
• Templates, transfer rules:
   • Grammar books
• Parallel data, comparable data
• Thesaurus, WordNet, FrameNet, …
• NLP tools: tokenizer, morphological analyzer, parser, …
→ More resources exist for major languages, fewer for "minor" languages.

  4. Major approaches
• Transfer-based
• Interlingua
• Example-based (EBMT)
• Statistical MT (SMT)
• Hybrid approach

  5. The MT triangle
[Figure: the Vauquois triangle. Analysis climbs from the source word up toward meaning (interlingua); synthesis descends from meaning to the target word. Word-based SMT and EBMT sit at the bottom (direct word-to-word transfer), phrase-based SMT and EBMT slightly higher, transfer-based MT at the syntactic level, and interlingua at the apex.]

  6. Transfer-based MT
• Analysis, transfer, generation:
   1. Parse the source sentence
   2. Transform the parse tree with transfer rules
   3. Translate source words
   4. Get the target sentence from the tree
• Resources required:
   • A source-language parser
   • A translation lexicon
   • A set of transfer rules
• An example: "Mary bought a book yesterday." (see the sketch below)
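To make the pipeline concrete, here is a minimal sketch of analysis, transfer, and generation for the example sentence, translating into a Japanese-like SOV order. The parse tree, the single transfer rule, and the lexicon are all toy assumptions invented for illustration, not part of any real system.

```python
# Analysis: assume the source parser produced this tree for
# "Mary bought a book yesterday."
source_tree = ("S",
               ("NP", "Mary"),
               ("VP",
                ("V", "bought"),
                ("NP", "a book"),
                ("ADV", "yesterday")))

# Transfer rule: English VP (V NP ADV) -> target VP (ADV NP V),
# i.e., reorder toward an SOV language such as Japanese.
def transfer(node):
    if isinstance(node, str):
        return node
    label, *children = node
    children = [transfer(c) for c in children]
    if label == "VP" and [c[0] for c in children] == ["V", "NP", "ADV"]:
        children = [children[2], children[1], children[0]]
    return (label, *children)

# Word-to-word translation with a toy lexicon.
LEXICON = {"Mary": "メアリーは", "bought": "買った",
           "a book": "本を", "yesterday": "昨日"}

def translate_leaves(node):
    if isinstance(node, str):
        return LEXICON.get(node, node)
    label, *children = node
    return (label, *[translate_leaves(c) for c in children])

# Generation: read the target sentence off the transformed tree.
def linearize(node):
    if isinstance(node, str):
        return [node]
    _, *children = node
    return [w for c in children for w in linearize(c)]

print(" ".join(linearize(translate_leaves(transfer(source_tree)))))
# -> メアリーは 昨日 本を 買った
```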

  7. Transfer-based MT (cont)
• Parsing: linguistically motivated grammar or formal grammar?
• Transfer:
   • Context-free rules? A path on a dependency tree?
   • Apply at most one rule at each level?
   • How are rules created?
• Translating words: word-to-word translation?
• Generation: using an LM or other additional knowledge?
• How can the needed resources be created automatically?

  8. Interlingua
• For n languages, pairwise transfer needs n(n-1) directional MT systems.
• Interlingua uses a language-independent representation instead.
• Conceptually, interlingua is elegant: we only need n analyzers and n generators. (A toy sketch follows this slide.)
• Resources needed:
   • A language-independent representation
   • Sophisticated analyzers
   • Sophisticated generators
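As a toy illustration, a single language-independent frame can feed one generator per language. The frame format, the concept labels, and both generators are hypothetical inventions for this sketch, not a real interlingua; the last lines just check the n(n-1) vs. 2n arithmetic from the slide.

```python
# Hypothetical interlingua frame for "Mary bought a book yesterday."
frame = {"predicate": "BUY", "agent": "MARY",
         "theme": "BOOK", "time": "YESTERDAY"}

def generate_english(f):
    words = {"BUY": "bought", "MARY": "Mary",
             "BOOK": "a book", "YESTERDAY": "yesterday"}
    return f"{words[f['agent']]} {words[f['predicate']]} {words[f['theme']]} {words[f['time']]}"

def generate_spanish(f):
    words = {"BUY": "compró", "MARY": "Mary",
             "BOOK": "un libro", "YESTERDAY": "ayer"}
    return f"{words[f['agent']]} {words[f['predicate']]} {words[f['theme']]} {words[f['time']]}"

print(generate_english(frame))   # Mary bought a book yesterday
print(generate_spanish(frame))   # Mary compró un libro ayer

# Pairwise transfer needs n*(n-1) directional systems; interlingua
# needs only n analyzers + n generators:
n = 10
print(n * (n - 1), "transfer systems vs", 2 * n, "interlingua components")
# -> 90 transfer systems vs 20 interlingua components
```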

  9. Interlingua (cont)
• Questions:
   • Does a language-independent meaning representation really exist? If so, what does it look like?
   • It requires deep analysis: how do we build such an analyzer (e.g., for semantic analysis)?
   • It requires non-trivial generation: how is that done?
   • It forces disambiguation at various levels: lexical, syntactic, semantic, and discourse.
   • It cannot take advantage of similarities between a particular language pair.

  10. Example-based MT
• Basic idea: translate a sentence by using the closest match in parallel data.
• First proposed by Nagao (1981).
• Example (see the sketch below):
   • Training data:
      • w1 w2 w3 w4 → w1' w2' w3' w4'
      • w5 w6 w7 → w5' w6' w7'
      • w8 w9 → w8' w9'
   • Test sentence:
      • w1 w2 w6 w7 w9 → w1' w2' w6' w7' w9'
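A minimal sketch of the lexical (shallow) flavor of this idea on the slide's toy data. For simplicity it assumes the training pairs are word-aligned by position; a real EBMT system needs an actual word aligner and has to stitch together fragments from several partial matches.

```python
pairs = [
    ("w1 w2 w3 w4", "w1' w2' w3' w4'"),
    ("w5 w6 w7",    "w5' w6' w7'"),
    ("w8 w9",       "w8' w9'"),
]

# Positional word alignment over the training data (a toy assumption).
lexicon = {}
for src, tgt in pairs:
    for s, t in zip(src.split(), tgt.split()):
        lexicon[s] = t

# Cover the test sentence with aligned fragments.
test = "w1 w2 w6 w7 w9"
print(" ".join(lexicon[w] for w in test.split()))
# -> w1' w2' w6' w7' w9'
```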

  11. EBMT (cont)
• Types of EBMT:
   • Lexical (shallow)
   • Morphological / POS analysis
   • Parse-tree based (deep)
• Types of data required by EBMT systems:
   • Parallel text
   • Bilingual dictionary
   • Thesaurus for computing semantic similarity
   • Syntactic parser, dependency parser, etc.

  12. EBMT (cont)
• Word alignment: using a dictionary and heuristics → exact match
• Generalization:
   • Clusters: dates, numbers, colors, shapes, etc.
   • Clusters can be built by hand or learned automatically.
• Example (see the sketch below):
   • Exact match: 12 players met in Paris last Tuesday → 12 Spieler trafen sich letzten Dienstag in Paris
   • Templates: $num players met in $city $time → $num Spieler trafen sich $time in $city
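A minimal sketch of template matching on the slide's English-to-German example. The regex patterns standing in for the $num, $city, and $time clusters, and the tiny time-expression lexicon, are assumptions for illustration; real clusters would be hand-built or learned.

```python
import re

# Toy clusters: $num = digits, $city = one word, $time = "last <day>".
SRC_TEMPLATE = re.compile(
    r"(?P<num>\d+) players met in (?P<city>\w+) (?P<time>last \w+)")
TGT_TEMPLATE = "{num} Spieler trafen sich {time} in {city}"

# Translations for the $time cluster (toy lexicon).
TIME_DE = {"last Tuesday": "letzten Dienstag"}

def translate(sentence):
    m = SRC_TEMPLATE.match(sentence)
    if not m:
        return None   # fall back to exact match or other templates
    slots = m.groupdict()
    slots["time"] = TIME_DE.get(slots["time"], slots["time"])
    return TGT_TEMPLATE.format(**slots)

print(translate("12 players met in Paris last Tuesday"))
# -> 12 Spieler trafen sich letzten Dienstag in Paris
```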

  13. Statistical MT
• Basic idea: learn all the parameters from parallel data.
• Major types:
   • Word-based (see the toy example below)
   • Phrase-based
• Strengths:
   • Easy to build; requires little hand-coded linguistic knowledge
   • Good performance when a large amount of training data is available
• Weaknesses:
   • How to express linguistic generalizations?
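To show what "learning all the parameters from parallel data" looks like at the word level, here is a toy run of EM for IBM Model 1, a standard word-based SMT model, on an invented two-sentence corpus. This only learns the translation table t(f|e); a full system also needs a language model and a decoder.

```python
from collections import defaultdict

corpus = [(["the", "house"], ["das", "haus"]),
          (["the", "book"],  ["das", "buch"])]

# Uniform initialization of t(f|e); any positive constant works,
# since the E-step renormalizes.
t = defaultdict(lambda: 0.25)

for _ in range(10):                     # EM iterations
    count = defaultdict(float)          # expected counts c(f, e)
    total = defaultdict(float)          # expected counts c(e)
    for es, fs in corpus:
        for f in fs:                    # E-step: soft alignments
            z = sum(t[(f, e)] for e in es)
            for e in es:
                p = t[(f, e)] / z
                count[(f, e)] += p
                total[e] += p
    for (f, e), c in count.items():     # M-step: renormalize
        t[(f, e)] = c / total[e]

# Both probabilities rise toward 1.0 as EM resolves the ambiguity:
# "the" explains "das" in both sentences, freeing "house" for "haus".
print(t[("das", "the")], t[("haus", "house")])
```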

  14. Comparison of resource requirements
[Table comparing the resource requirements of the approaches above; not preserved in this transcript.]

  15. Hybrid MT
• Basic idea: combine the strengths of different approaches:
   • Syntax-based: generalization at the syntactic level
   • Interlingua: conceptually elegant
   • EBMT: memorizes translations of n-grams; generalizes at various levels
   • SMT: fully automatic; uses an LM; optimizes objective functions
• Types of hybrid MT:
   • Borrowing concepts/methods:
      • SMT from EBMT: phrase-based SMT; alignment templates
      • EBMT from SMT: automatically learned translation lexicon
      • Transfer-based from SMT: automatically learned translation lexicon and transfer rules; using an LM
      • …
   • Using two MTs in a pipeline:
      • Using transfer-based MT as a preprocessor for SMT
   • Using multiple MTs in parallel, then adding a re-ranker (see the sketch below)
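A minimal sketch of the last idea: run several systems in parallel and re-rank their outputs with a language model. The bigram probabilities and the candidate outputs are invented for illustration; a real re-ranker would use stronger LMs plus system confidence features.

```python
import math

# Toy bigram probabilities; anything unseen gets a small floor value.
BIGRAMS = {("mary", "bought"): 0.4, ("bought", "a"): 0.5,
           ("a", "book"): 0.6, ("book", "yesterday"): 0.2}
FLOOR = 1e-4

def lm_score(sentence):
    words = sentence.split()
    return sum(math.log(BIGRAMS.get(pair, FLOOR))
               for pair in zip(words, words[1:]))

candidates = [
    "mary a book bought yesterday",   # e.g., from a transfer system
    "mary bought a book yesterday",   # e.g., from an SMT system
]
print(max(candidates, key=lm_score))
# -> mary bought a book yesterday
```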
