Issues in Arabic MT

Presentation Transcript


  1. Issues in Arabic MT. Alex Fraser, USC/ISI

  2. ISI Arabic System for 2003 TIDES Evaluation
     • Alignment Template Approach (Och and others at RWTH Aachen)
     • Maximum BLEU training (Och, ACL 2003)
     • Customization of Training for Arabic System
     • Model and Search are not Arabic dependent (currently)
     • Top Scoring System

  3. Character Encoding and Normalization
     • Arabic UTF-8 reduced to CP-1256 character set (8-bit MS-Windows encoding)
     • Handle non-Arabic characters that look similar
     • Numbers
     • Normalization is important
     • Strip Kashida, vowels, Shadda
     • Normalize Alef variants, Alef Maqsura/Yeh, Heh/Teh Marbuta, Hamza variants
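
The normalization steps on this slide could be sketched roughly as follows. This is a minimal illustration in Python, not the ISI system's code: the names `normalize_arabic` and `fits_cp1256` are hypothetical, the direction of each merge (for example Teh Marbuta to Heh rather than the reverse) is an assumption, and the two look-alike characters and the digit mapping are only examples of what "similar characters" and "Numbers" might cover.

```python
import re

# Tatweel/Kashida (U+0640) plus the combining marks U+064B..U+0652
# (Fathatan through Sukun). This range includes Shadda (U+0651).
STRIP = re.compile("[\u0640\u064B-\u0652]")

# Illustrative mapping only; which form each variant collapses to is an assumption.
_MAP = {
    "\u0622": "\u0627",  # Alef with Madda above -> bare Alef
    "\u0623": "\u0627",  # Alef with Hamza above -> bare Alef
    "\u0625": "\u0627",  # Alef with Hamza below -> bare Alef
    "\u0649": "\u064A",  # Alef Maqsura -> Yeh
    "\u0629": "\u0647",  # Teh Marbuta -> Heh
    "\u0624": "\u0621",  # Waw with Hamza above -> bare Hamza
    "\u0626": "\u0621",  # Yeh with Hamza above -> bare Hamza
    "\u06CC": "\u064A",  # Farsi Yeh (look-alike) -> Arabic Yeh
    "\u06A9": "\u0643",  # Keheh (look-alike) -> Arabic Kaf
}
# Arabic-Indic digits U+0660..U+0669 -> ASCII digits 0..9
_MAP.update({chr(0x0660 + i): str(i) for i in range(10)})
CHAR_MAP = str.maketrans(_MAP)


def normalize_arabic(text):
    """Strip Kashida and diacritics, then collapse letter variants."""
    return STRIP.sub("", text).translate(CHAR_MAP)


def fits_cp1256(text):
    """Return True if every character can be encoded in CP-1256."""
    try:
        text.encode("cp1256")
        return True
    except UnicodeEncodeError:
        return False
```

Running `fits_cp1256` on the normalized text is one way to find characters that still fall outside the 8-bit set and need separate handling.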

  4. Morphology
     • Simple morphological segmentation did not improve performance at large training sizes
     • MT Extensions to UMass Light Stemming for IR (Larkey et al., SIGIR 2002)
     • Modified Buckwalter Stemmer (LDC), conservative stems (Xu, Fraser, Weischedel, SIGIR 2002)
     • Space-separated Arabic strings are already translated as consecutive-word phrases with baseline system
     • Used Buckwalter Stemmer and Gloss for unknown words

  5. Training on long sentences
     • Realignment of sentences of length > 45 tokens on chunk level
     • Virtually all data can be used for training (93M words English, 82M words Arabic)
     • English chunks are projected to Arabic
     • IBM Model 1 Viterbi word alignment is used to project high precision chunk breaks from English to Arabic
     • Dynamic programming search for best chunk projection
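
One simple way to picture high-precision break projection is shown below. It is a sketch under stated assumptions and does not reproduce the dynamic programming search named on the slide: the function `project_breaks` is hypothetical, alignment links are assumed to be 0-based `(english_index, arabic_index)` pairs, and a break is projected only when no alignment link crosses it.

```python
def project_breaks(links, english_breaks):
    """Project English chunk breaks onto the Arabic side of a sentence pair.

    links: iterable of (english_index, arabic_index) word-alignment links.
    english_breaks: English token positions b where a new chunk starts.
    Returns a dict mapping each projectable English break to the Arabic
    position where the corresponding Arabic chunk starts.
    """
    links = list(links)
    projected = {}
    for b in english_breaks:
        left = [a for e, a in links if e < b]    # Arabic words linked before the break
        right = [a for e, a in links if e >= b]  # Arabic words linked after the break
        if not left or not right:
            continue  # one side has no aligned words: no evidence for a break
        start, limit = max(left) + 1, min(right)
        if start <= limit:
            # No link crosses the break. Unaligned Arabic words between the
            # two sides are attached to the right-hand chunk (an arbitrary choice).
            projected[b] = start
    return projected


# Usage: the first two words are reordered, so a break at English
# position 1 is rejected, while the break at position 2 projects cleanly.
example_links = [(0, 1), (1, 0), (2, 2), (3, 3)]
example_result = project_breaks(example_links, [1, 2])
```

Rejecting any break with a crossing link trades recall for precision, which matters for Arabic because verb-subject reordering often produces crossing links near the start of a clause.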

  6. Error Analysis
     • Verbal movement and form
     • VSO ordering
     • Tense
     • NP structure
     • Missing 'to be' in present tense
     • Also causes spurious 'to be'
     • PRO
     • These are all syntactic problems
     • Also Important: Named Entities, Unknown Words

  7. Future
     • More parallel data – 1 billion words
     • More in-domain data
     • More test sets
     • Named Entity list
     • Research on Syntax
