1 / 12

Duluth Word Alignment System

Duluth Word Alignment System. Bridget Thomson McInnes Ted Pedersen University of Minnesota Duluth Computer Science Department. 31 May 2003. Duluth Word Alignment System. Perl implementation of IBM Model 2 Learns a probabilistic model from sentence aligned parallel corpora

beate
Download Presentation

Duluth Word Alignment System

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Duluth Word Alignment System Bridget Thomson McInnes Ted Pedersen University of Minnesota Duluth Computer Science Department 31 May 2003

  2. Duluth Word Alignment System • Perl implementation of IBM Model 2 • Learns a probabilistic model from sentence aligned parallel corpora • The parallel text consists of a source language text and its translation into some target language • Determines the word alignments of the sentence pairs • Missing data problem • No examples of word alignments in the training data • Use the Expectation Maximization (EM) Algorithm

  3. IBM Model 2 • Takes into account • The probability of the two words being translations of each other • how likely it is for words at particular positions in a sentence pair to be alignments of each other • Example 2 1 3 1 2 3

  4. Distortion Factor • Distortion Factor • How far away from the original (source) position can the word move • Example: Source sentence : Target sentence :

  5. Types of Alignments • Sure and Probable alignments • Sure : Alignment judged to be very likely • Probable : Alignment judged to be less certain • Our system does not make this distinction, we take the highest alignment regardless of the value • No-null and Null alignments • Our system does not include null alignments • Null alignments : source words that do not align to any word in the target sentence • One-to-One and One-to-Many alignments • Our system includes one-to-many as well as one to one alignments

  6. Alignments One to One One to Many Many to One S1 S2 S3 S4 S5 T1 T2 T3 T4 T5

  7. Data • English – French • Trained • 5% subset of the Aligned Hansards of the 36th Parliament of Canada • Approximately 50,000 out of the 1,200,000 given sentence pairs • Mixture of the House and Senate debates • We wanted to train the model on comparable size data sets • Tested • 447 manually word aligned sentence pairs • Romanian – English • Trained on all available training data (49,284 sentence pairs) • Tested • 248 manually word aligned sentence pairs

  8. Results

  9. Precision and Recall Results • Precision of the two language pairs were similar • This may reflect the fact that we used approximately the same amount of training data for each of the models • The recall for the English-French data was low • This system does not find alignments in which many English words align to one French word. • This reduced the number of alignment made by the system in comparison to the number of alignments in the gold standard

  10. Distortion Results • The precision and recall were not significantly affected by the distortion factor • Distortion factor of 0 resulted in lower precision and recall than a distortion factor of 2, 4 or 6 • Distortion factor of 2, 4, 6 resulted in approximately the same precision and recall values for each of the different language sets • The distortion factor of 4 and 6 do not contain any more information than a distortion factor of 2 • suggests that word movement is limited

  11. Conclusions of Training Data • Small amount of training data • wanted to compare the Romanian English and the English French results • Although the data for Romanian English was different than the data for English French the results were comparable • would like to increase the training data to determine the how much of an improvement of the results could be obtained

  12. Conclusions • Considering modifying the existing Perl implementation to allow for this • Database approach • Berkeley DB • NDBM • re-implementing the algorithm in Perl Data Language • Perl module that is optimized for matrix and scientific computing

More Related