
Hindi SLE Debriefing AVENUE Transfer System

Hindi SLE Debriefing: AVENUE Transfer System. July 3, 2003. Summary of our final Hindi-to-English transfer system: an overview of our lexical resources and how they were used in the system, grammar development, transfer system runtime configuration, and dev-test evaluation results.


Presentation Transcript


  1. Hindi SLE Debriefing: AVENUE Transfer System, July 3, 2003

  2. Summary of our Final Hindi-to-English Transfer System
     • Overview of our Lexical Resources and how they were used in the system
     • Grammar Development
     • Transfer System Runtime Configuration
     • Dev-test Evaluation Results
     • Observations and Lessons Learned

  3. Elicited Data Collection
     • Goal: Acquire high-quality word-aligned Hindi-English data to support system development, especially grammar development and automatic grammar learning
     • We recruited a sizeable team of bilingual speakers – Rachel…
     • “Original” Elicitation Corpus was translated into Hindi
     • Corpus of phrases extracted from the Brown Corpus (NPs and PPs) was broken into files and assigned to translators, here and in India
     • Resulting in a total of 17589 word-aligned translated phrases

  4. Summary of Lexical Resources
     • Manual: manually written phrase transfer rules (72)
     • Postpos: manually written postpos rules (105)
     • Bigram: translations of the 500 most frequent bigrams in Hindi (from Ralf)
     • Elicited: elicited data from controlled corpus and Brown, w-to-w and p-to-p, total of 84619 lexical and phrase rules
     • LDC: “master” bilingual dict from LDC, frequency sorted; Richard and Shobha manually cleaned up the top 12% of entries; total of 87902 rules
     • NE: Named Entity lists from LDC website and from Fei, total of 1237 + 2109 = 3346 rules
     • IBM: statistical w-to-w and p-to-p lexicon from IBM, sorted by translation prob, 81664 rules
     • JOY: SMT system w-to-w and p-to-p lexicon, sorted by translation prob, 189583 rules
     • TOTAL: 447791 rules
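The total on this slide can be checked with a few lines of Python. This is only an arithmetic sketch. It assumes the 500 bigram translations count as 500 rules, which the slide does not state outright but which is what makes the figures add up to the stated total.

```python
# Rule counts per resource, copied from the slide.
# Assumption: the Bigram resource contributes one rule per bigram (500).
rule_counts = {
    "Manual": 72,
    "Postpos": 105,
    "Bigram": 500,
    "Elicited": 84619,
    "LDC": 87902,
    "NE": 1237 + 2109,
    "IBM": 81664,
    "JOY": 189583,
}

total = sum(rule_counts.values())
print("TOTAL:", total)

# Share of the lexicon contributed by each resource, largest first.
for name, count in sorted(rule_counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:9s} {count:7d}  {count / total:6.1%}")
```

Under that assumption, the statistical resources (IBM and JOY) account for well over half of all rules.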

  5. Ordering of Lexical Resources
     • Corresponds to three passes of system:
        • Phrase-to-phrase (used in first pass)
        • POS-tagged w-to-w pass (morph, enhanced, sorted, can feed into grammar)
        • LEX-tagged w-to-w pass (full forms, can only be used for w-to-w, no grammar)

  6. Ordering of Lexical Resources
     • Man rules (p-to-p, w-to-w)
     • Postpos (w-to-w)
     • Bigrams (p-to-p)
     • LDC (w-to-w, enhanced, sorted)
     • Etposrules (w-to-w, enhanced, sorted)
     • NE (p-to-p, w-to-w)
     • Etlexrules (w-to-w, sorted)
     • Etphraserules (p-to-p)
     • IBM (p-to-p, w-to-w, sorted)
     • JOY (p-to-p, w-to-w, sorted)
     Cleaned up and duplicates removed
     Total Rules in Global Lexicon: xxx
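One way to picture "ordered, cleaned up, duplicates removed" is a priority-ordered merge in which the first resource to supply an entry keeps it. The sketch below is an illustration only: the function name `build_global_lexicon`, the `(source, target)` entry format, and the toy entries are all assumptions, not the project's actual lexicon format or tooling.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical entry format: (hindi_side, english_side).
Entry = Tuple[str, str]


def build_global_lexicon(
    resources: List[Tuple[str, Iterable[Entry]]]
) -> List[Tuple[str, str, str]]:
    """Concatenate resources in priority order, keeping the first copy of each entry."""
    seen: Set[Entry] = set()
    merged: List[Tuple[str, str, str]] = []
    for name, entries in resources:
        for src, tgt in entries:
            if (src, tgt) in seen:
                continue  # duplicate of an entry from a higher-priority resource
            seen.add((src, tgt))
            merged.append((src, tgt, name))
    return merged


# Toy data for illustration only.
resources = [
    ("Manual", [("senA", "army")]),
    ("LDC", [("senA", "army"), ("senA", "military")]),
]
lexicon = build_global_lexicon(resources)
for src, tgt, name in lexicon:
    print(f"{src} -> {tgt}  [{name}]")
```

Keeping the source name on every entry is what would make per-source debug output possible later.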

  7. Grammar Development
     • Grammar covers mostly VPs (verb complexes)
     • 73 grammar rules, covering all tenses, active and passive, subjunctive
     • Also experimented with simple NP and PP rules (movement of postpos in Hindi to prep in English); these hurt performance
     • Problems in grammar testing and debugging – Ari…

  8. Example Grammar Rule

     ;; SIMPLE PRESENT AND PAST (depends on the tense of the Aux)
     ; Ex: (tu) bolta hE -> (I) (usually) speak
     ; Ex: (maiM) sotA hUM -> (I) sleep (now)
     ; Ex: (maiM) sotA thA -> (I) slept (used to sleep)
     {VP,5}
     VP::VP : [V Aux] -> [V]
     (
       (X1::Y1)
       ((x1 form) = part)
       ((x1 aspect) = imperf)
       ((x2 lexwx) = 'honA')
       ((x2 tense) = (*NOT* fut))
       ((x2 tense) = (*NOT* subj))
       ((x0 tense) = (x2 tense))
       ((x0 agr num) = (x2 agr num))
       ((x0 agr pers) = (x2 agr pers))
       (x0 = x1)
       ((y1 tense) = (x0 tense))
       ((y1 agr num) = (x0 agr num)) ; not always agrees, try commenting
       ((y1 agr pers) = (x0 agr pers))
     )

  9. Transfer Runtime System
     • Three passes:
        • Pass1: match against p-to-p entries, halt if match found (ver2 allows continuing)
        • Pass2: morph-analyze word and match against all w-to-w resources, halt if match found
        • Pass3: match original word against all w-to-w resources; provides only w-to-w output, no feeding into grammar rules
     • Selection of best set of arcs: greedy left-to-right search that prefers longer input segments
     • Unk word policy: replace with English “the”
     • Post-processing:
        • Remove be/give at eos if preceded by a verb
        • Replace all remaining “be” with “is”
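The arc selection, unknown-word policy, and post-processing steps can be sketched roughly as follows. This is a heavily simplified illustration, not the AVENUE runtime: it collapses the three passes and the arc lattice into a single dictionary lookup, and the names `greedy_select`, `postprocess`, `max_len`, and the `verbs` set are all hypothetical. The toy lexicon is for illustration only.

```python
def greedy_select(tokens, lexicon, max_len=4):
    """Greedy left-to-right cover that prefers the longest matching input segment."""
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest segment starting at position i first.
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            segment = " ".join(tokens[i:i + span])
            if segment in lexicon:
                output.append(lexicon[segment])
                i += span
                break
        else:
            # Unknown-word policy from the slide: emit English "the".
            output.append("the")
            i += 1
    return output


def postprocess(words, verbs):
    """Drop a sentence-final be/give that follows a verb, then map remaining 'be' to 'is'."""
    words = list(words)
    end = len(words) - 1
    if end >= 0 and words[end] == ".":
        end -= 1  # look at the last word before the final period
    if end >= 1 and words[end] in ("be", "give") and words[end - 1] in verbs:
        del words[end]
    return ["is" if w == "be" else w for w in words]


# Toy lexicon for illustration only; "ne" is deliberately left unknown.
toy_lexicon = {"senA": "army", "kahA": "said", "hE": "be", ".": "."}
raw = greedy_select("senA ne kahA hE .".split(), toy_lexicon)
print(postprocess(raw, verbs={"said"}))
```

A greedy longest-match search is cheap and deterministic, but it cannot back off from a long phrasal match to let a grammar rule apply, which is consistent with the blocking problem noted later in the deck.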

  10. Development Testing
      • Three dev-test sets:
         • India Today: 59 sentences, single ref
         • Full ISI: 358 sents, newswire, single ref
         • Small ISI: first 25 sentences of Full ISI
      • Full ISI was the most meaningful test set; India Today (IT) was tested on earlier on, and to ensure no over-fitting

  11. Final Performance: ISI-Full
      • Lexicon xferdict.0630-al-4

  12. Debug Output with Sources

      amerikI senA ne kahA hE ki irAka kI galiyoM meM cAro waraPa vyApwa aparAXa ko niyaMwriwa karane ke lie uMhoMne irAkiyoM ko senA ke kAma meM Ane vAle haWiyAra sOMpane ke lie 2 sapwAha kA samaya xiyA hE .

      <AMERICAN---ETLEX> <ARMY---SORTED> <SAID---ETLEX> <BE---SORTED> <THAT ---POSTPOS> <IRAQ---IBM> <OF---MANUALERIK> <LANES---SORTED> <IN---MANUALERIK> <ROUND---IBM> <SIDE---SORTED> <PERVADING---IBM> <TO THE CRIME---ETPHRASE> <CONTROLLED---IBM> <TO MAKE---ETLEX> <THE> <IRAQIS---MANUALARI> <TO---MANUALERIK> <ARMY---SORTED> <WORK---JOY> <CAME TO---JOY> <ONES---SORTED> <WEAPONS---SORTED> <CHARGE---SORTED> <FOR---BIGR> <2> <WEEK---SORTED> <OF---MANUALERIK> <TIME---SORTED> <THEY HAVE---JOY> <.>

  13. Histogram of Source Information
      SORTED total = 2425
      IBM total = 447
      JOY total = 483
      MANUAL ERIK total = 619
      MANUAL ARI total = 139
      BIGR total = 196
      POSTPOS total = 510
      TIMEEXP total = 4
      ETLEX total = 0
      ETPHRASE total = 0
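A histogram like this could be tallied from debug output of the `<WORD---SOURCE>` form shown on the previous slide. The following is a minimal sketch that assumes that tag format. The regular expression and the function name `source_histogram` are illustrative and are not part of the original tooling. Untagged tokens such as `<THE>` or `<2>` are simply skipped.

```python
import re
from collections import Counter

# Matches <OUTPUT---SOURCE>, tolerating stray spaces around the separator.
TAG = re.compile(r"<([^<>]*?)\s*---\s*([A-Z]+)>")


def source_histogram(debug_line):
    """Count how many output segments each lexical source contributed."""
    return Counter(source for _words, source in TAG.findall(debug_line))


# A short excerpt of the debug line from the previous slide.
sample = (
    "<AMERICAN---ETLEX> <ARMY---SORTED> <SAID---ETLEX> <BE---SORTED> "
    "<THAT ---POSTPOS> <THE> <2> <.>"
)
hist = source_histogram(sample)
for source, count in hist.most_common():
    print(f"{source} total = {count}")
```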

  14. Things we Tried at Last Minute
      • Allowing the second pass to take place even if phrases matched in the first pass – no improvement in score
      • Throwing in NP rules and solving the lost unigrams by a clever final pass that replaces the choices for words – hurt score slightly…

  15. Real Eval Set Transfer Run
      • Eval set consisted of 450 sentences from a variety of newswire sources
      • Suspicion of some sents drawn from dev data!
      • We submitted XFER-ONLY and XFER-ONLY+CASE
      • Aggregate stats from our run:
         • Coverage: 88.3%
         • Compounds matched: 2279 (token)
         • Went thru Morph and matched: 6256/9605
         • Unknown Hindi words: 1122
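The aggregate figures appear to be mutually consistent if coverage is read as the share of tokens that were not unknown. That reading is an inference, not something the slide states: the sketch below assumes 9605 (the denominator of the morph figure) is the total token count.

```python
total_tokens = 9605   # assumed total: denominator of "6256/9605" on the slide
unknown = 1122        # unknown Hindi words
morph_matched = 6256  # went through morph and matched

coverage = 1 - unknown / total_tokens
morph_rate = morph_matched / total_tokens

print(f"coverage   = {coverage:.1%}")
print(f"morph rate = {morph_rate:.1%}")
```

Under that assumption, the computed coverage rounds to the 88.3% reported on the slide.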

  16. Limited Resource Scenario
      • The “rules of the game” were skewed against us in this evaluation:
         • 1.5 million words of parallel text
         • Noisy statistical lexical resources
         • We don’t have a strong statistical selection model
      • How do we do in the minority language scenario, with our limited resources?
      • Kathrin ran a test with a lexicon constructed just from Man rules, bigrams, postpos, LDC dict and Elicited data
      • We will also test EBMT and SMT under the same scenario!

  17. Results: ISI-Full

  18. Observations and Lessons
      • Serious grammar development occurred very late in the process (last few days)
      • Very hard time getting grammar to start pulling performance numbers up
      • Grammar rules are often blocked from applying because of phrasal matches
      • Rather hard to find cases where they were supposed to apply and didn’t
      • NP/PP rules did not help, partly because NP boundaries were not adequately found
      • Strange phenomenon of losing unigrams when NP rules apply; need to debug this thoroughly

  19. Things we should find out
      • What sources did output come from in real eval test run? Get histogram…
      • What is the marginal contribution of various resources to our performance?
      • Conduct runs with individual resources omitted, in particular, without:
         • Our elicited data
         • The IBM data
         • The “Joy” data
         • The LDC Lexicon
         • Without the phrase-to-phrase pass
      • Can more grammar development help?
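The proposed runs amount to a leave-one-out ablation over the lexical resources. A minimal sketch of how the configurations could be enumerated is given below. The resource names follow the earlier slides, while `leave_one_out` and `marginal_contribution` are hypothetical helpers, and no scores are supplied here because none of these runs are reported in the deck.

```python
RESOURCES = ["Manual", "Postpos", "Bigram", "LDC", "Elicited", "NE", "IBM", "JOY"]


def leave_one_out(resources):
    """Yield (omitted_resource, remaining_resources) for each ablation run."""
    for omitted in resources:
        yield omitted, [r for r in resources if r != omitted]


def marginal_contribution(baseline, ablated_scores):
    """Score drop caused by omitting each resource (positive means the resource helped)."""
    return {name: baseline - score for name, score in ablated_scores.items()}


for omitted, remaining in leave_one_out(RESOURCES):
    print(f"run without {omitted}: {', '.join(remaining)}")
```

Each configuration would be fed to the same dev-test evaluation (for example Full ISI), and the score differences passed to `marginal_contribution`.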

  20. Further Work on our Hindi System
      • Our Hindi system is an ideal platform for advanced thesis-related research work from now on
      • Eval test set will remain unseen test data for future experimentation (ref translations will be available soon)
      • Low-pace further system development throughout July (grammar, bug fixes)
      • Worthwhile new results to be reported at the August PI meeting
