Not So Surprising Anymore: Hindi from TIDES to FIRE

Presentation Transcript


  1. Not So Surprising Anymore: Hindi from TIDES to FIRE. Douglas W. Oard and Tan Xu, University of Maryland, USA. http://terpconnect.umd.edu/~oard. Slides from: Leah Larkey, Mike Maxwell, Franz Josef Och, David Yarowsky. Ideas from: just about all of “Team TIDES”.

  2. A Very Brief History of NLP • 1966: ALPAC • Refocus investment on enabling technologies • 1990: IBM’s Candide MT system • Credible data-driven approaches • 1999: TIDES • Translation, Detection, Extraction, Summarization

  3. Surprise Language Framework • English-only Users / Docs in language X • Zero-resource start (treasure hunt) • Sharply time constrained (29 days) • Character-coded text • Research-oriented • Intense team-based collaboration

  4. Schedule • Announce: Cebuano Mar 5 / Hindi Jun 1 • Test Data: Hindi Jun 27 • Stop Work: Cebuano Mar 14 / Hindi Jun 30 • Newsletter: Cebuano April / Hindi August • Talks: Cebuano May 30 (HLT) / Hindi Aug 5 (TIDES PI) • Papers: Hindi October (TALIP)

  5. 300-Language Survey

  6. Five evaluated tasks • Automatic CLIR (English queries) • Topic tracking (English examples, event-based) • Machine translation into English • English “Headline” generation • Entity tagging (five MUC types) • Several useful components • POS tags, morphology, time expressions, parsing • Several demonstration systems • Interactive CLIR (two systems) • Cross-language QA (English Q, Translated A) • Machine translation (+ Translation elicitation) • Cross-document entity tracking

  7. 16 Participating Teams • Cebuano + Hindi: USC-ISI, Maryland, NYU, Johns Hopkins, Sheffield, U Penn-LDC, CMU, UC Berkeley, MITRE • Hindi Only: U Mass, Alias-i, BBN, IBM, CUNY-Queens, K-A-T (Colorado), Navy-SPAWAR

  8. Innovation Cycle [diagram: a cycle over time in which a strategy push and resource harvesting (people, corpora, Web pages, books, lexicons) feed coordination (organize, talk, capture, process knowledge) and systems (translation, detection, extraction, summarization), producing research results]

  9. 10-Day Cebuano Pre-Exercise

  10. Hindi Participants

  11. Hindi Resources • Much more data available than for Cebuano • Data collected by all project participants • Web pages, News, Handbooks, Manually created, … • Dictionaries • Major problems: • Many non-standard encodings • Often no converters available • Available converters often did not work properly • Huge effort: data conversion and cleaning • Resulting bilingual corpus: 4.2 million words

  12. Hindi Translation Elicitation Server, Johns Hopkins University (David Yarowsky) • People voluntarily translated large numbers of Hindi news sentences for nightly prizes at a novel Johns Hopkins University website • Performance is measured by BLEU score on 20% randomly interspersed test sentences • Provides an immediate way to rank and reward quality translations and exclude junk • Result: 300,000 words of perfectly sentence-aligned bitext (exactly on genre) for 1-2 cents/word within ~5 days • Much cheaper than 25 cents/word for translation services or 5 cents/word for a prior MT group's recruitment of local students • Observed exponential growth in usage (before prizes ended): viral advertising via family, friends, newsgroups, … • $0 in recruitment, advertising, and administrative costs • Nightly incentive rewards given automatically via amazon.com gift certificates to email addresses (any $ amount, no fee) • No hiring overhead: rewards given only for proven high-quality work already performed (prizes, not salary) • Immediate positive feedback encourages continued use • Direct, immediate access to a worldwide labor market fluent in the source language [sample interface screenshot: English translations typed beside the Hindi sentences; user choice of 2-3 encoding alternatives]

  13. MT Challenges • Lexicon coverage • Hindi morphology • Transliteration of names • Hindi word order: SOV vs. SVO • Training data inconsistencies and misalignments • Incomplete tuning cycle: the same data and model would give better results with better-tuned model parameters

  14. Example Translation • Indonesian City of Bali in October last year in the bomb blast in the case of imam accused India of the sea on Monday began to be averted. The attack on getting and its plan to make the charges and decide if it were found guilty, he death sentence of May. Indonesia of the police said that the imam sea bomb blasts in his hand claim to be accepted. A night Club and time in the bomb blast in more than 200 people were killed and several injured were in which most foreign nationals. …

  15. MT Results Overview: Hindi • Result in NIST evaluation: 7.43 cased NIST score (7.80 uncased)

  16. Comparison to other languages • Note: different (news) test corpora, so NIST scores are not comparable across languages

  17. Hindi Week 1: Porting • Monday • 2,973 BBC documents (UTF-8) • Batch CLIR (no stemming; 2/3 of known items at rank 1) • Tuesday • MIRACLE (“ITRANS”, gloss) • Stemmer (implemented from a paper) • Wednesday • BBC CLIR collection (19 topics, known-item) • Friday • Parallel text (Bible: 900k words, Web: 4k words) • Devanagari OCR system

  18. Hindi Weeks 2/3/4: Exploration • N-grams (trigrams best for UTF-8; see the sketch below) • Relative Average Term Frequency (Kwok) • Scanned bilingual dictionary (Oxford) • More topics for test collection (29) • Weighted structured queries (IBM lexicon) • Alternative stemmers (U Mass, Berkeley) • Blind relevance feedback • Transliteration • Noun phrase translation • MIRACLE integration (ISI MT, BBN headlines)
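The character n-gram approach in the first bullet is simple enough to sketch; this helper is illustrative, not code from the project:

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Overlapping character n-grams (trigrams by default) as indexing terms.

    Operating on Unicode code points means UTF-8 Devanagari text needs
    no language-specific tokenizer or stemmer. Illustrative sketch only.
    """
    text = text.replace(" ", "_")  # mark word boundaries
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("surprise"))  # ['sur', 'urp', 'rpr', 'pri', 'ris', 'ise']
```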

  19. Resource Timeline (contributors: University of Maryland, LDC, ISI, JHU, others)
  • June 24: Second version of BBN/UMD topic lists; scored lexicon
  • June 23: BBN/UMD topic lists; ISI EMILLE word alignment
  • June 20: OCRed Oxford Hindi-English Dictionary in XML, version 3.0; complete machine-translated BBC collection; cleaned complete machine-translated BBC collection
  • June 19: BBN revised Hindi stemmer in UTF-8 hex; OCRed Oxford Hindi-English Dictionary in XML; scored translation lexicon, version 2.1
  • June 18: ISI EMILLE word alignment; OCRed Oxford Hindi-English Dictionary
  • June 17: Word alignment of LDC parallel texts; scored translation lexicon, version 2
  • June 16: Automatically word-aligned Hindi Bible; small BBC word alignment in UTF-8; master dictionary by source, version 0.7 (IIIT part only); LDC sentence-aligned parallel text collections
  • June 13: Berkeley probabilistic dictionaries; ISI probabilistic lexicon
  • June 12: Second version of small BBC word alignment
  • June 11: EMILLE corpus, version 0.1; English-Hindi CLIR test collection (29 queries)
  • June 10: Hindi stemmer in UTF-8 hex
  • June 9: Small BBC word alignment; fourth version of CLIR system; OCR system
  • June 6: Full Hindi OCR system; small Web parallel corpus; ITRANS Hindi Bible; master dictionary, version 0.7 (plus cleaned version); scored translation lexicon; third version of CLIR system; expanded coverage of gloss translation
  • June 5: First version of Internet Archive (IA) Web parallel text; second version of CLIR system; relevance judgments for 19 queries
  • June 4: Hindi stemmer
  • June 3: English-Hindi CLIR test collection (19 queries); initial BLEU test collection; human-translated BBC news documents; word-aligned Bible; transliterated Hindi Bible
  • June 2: First version of CLIR system; cleaned BBC-Hindi news collection in UTF-8; first MT system; IIIT Shabdanjali dictionary in UTF-8 and in ISCII; converter from UTF-8 Devanagari to hexadecimal and to ITRANS; Hindi Bible; original BBC-Hindi news collection in HTML; Hindi morphological analysers; English-Hindi dictionary with POS tags; paper: “A Lightweight Stemmer for Hindi”; converter between ISCII and UTF-8

  20. Formative Evaluation

  21. Lessons Learned • We learned more from 2 languages than 1 • Simple techniques worked for Cebuano • Hindi needed more (encoding, MT, transliteration) • Usable systems can be built in a month • Parallel text for MT is the pacing item • Broad collaboration yielded useful insights

  22. Our FIRE-2008 Goals • Evaluate Surprise Language resources • IBM and LDC translation lexicons • Berkeley Stemmer • Compare CLIR techniques • Probabilistic Structured Queries (PSQ) • Derived Aggregated Meaning Matching (DAMM)

  23. Comparing Test Collections

  24. Monolingual Baselines [chart: our FIRE-2008 training collection (TDN queries) vs. the 2003 Surprise Language collection (TDNS queries); 15 Surprise Language topics]

  25. A Ranking Function: Okapi BM25
  $$\mathrm{score}(q,d)=\sum_{t\in q}\log\frac{N-df_t+0.5}{df_t+0.5}\cdot\frac{(k_1+1)\,tf_{t,d}}{k_1\left((1-b)+b\,\frac{len(d)}{avglen}\right)+tf_{t,d}}\cdot\frac{(k_3+1)\,tf_{t,q}}{k_3+tf_{t,q}}$$
  where $t$ is a query term, $q$ the query, $d$ the document, $tf_{t,d}$ and $tf_{t,q}$ the term frequency in the document and in the query, $df_t$ the document frequency, $N$ the collection size, $len(d)$ the document length, and $avglen$ the average document length.
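A minimal Python sketch of this scoring function; the k1, b, and k3 defaults are common textbook values, not parameters reported in the talk:

```python
import math

def bm25(query_tf, doc_tf, df, N, doc_len, avg_doc_len,
         k1=1.2, b=0.75, k3=7.0):  # defaults assumed, not from the talk
    """Okapi BM25 score of one document for one query.

    query_tf: {term: frequency in the query}; doc_tf: {term: frequency in doc};
    df: {term: document frequency}; N: number of documents in the collection.
    """
    score = 0.0
    for term, tf_q in query_tf.items():
        tf_d = doc_tf.get(term, 0)
        if tf_d == 0:
            continue  # a term absent from this document contributes nothing
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5))
        tf_part = (k1 + 1) * tf_d / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_d)
        qtf_part = (k3 + 1) * tf_q / (k3 + tf_q)
        score += idf * tf_part * qtf_part
    return score
```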

  26. Estimating TF and DF for Query Terms • English term e1 has translations f1, f2, f3, f4 with probabilities 0.4, 0.3, 0.2, 0.1 • Term frequencies 20, 5, 2, 50 → expected TF = 0.4*20 + 0.3*5 + 0.2*2 + 0.1*50 = 14.9 • Document frequencies 50, 40, 30, 200 → expected DF = 0.4*50 + 0.3*40 + 0.2*30 + 0.1*200 = 58
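The arithmetic above, as a sketch of how a CLIR system can synthesize TF and DF for an English query term from its translations' statistics (variable names are illustrative):

```python
def expected_stats(trans_probs, tf, df):
    """Probability-weighted TF and DF for an English query term,
    aggregated over its translations (PSQ-style statistics).

    trans_probs: {translation: p(translation | english_term)}
    tf, df: per-translation term frequency and document frequency
    """
    exp_tf = sum(p * tf[f] for f, p in trans_probs.items())
    exp_df = sum(p * df[f] for f, p in trans_probs.items())
    return exp_tf, exp_df

# The example from the slide:
probs = {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}
tf = {"f1": 20, "f2": 5, "f3": 2, "f4": 50}
df = {"f1": 50, "f2": 40, "f3": 30, "f4": 200}
print(expected_stats(probs, tf, df))  # ≈ (14.9, 58.0)
```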

  27. Bidirectional Translation: “wonders of ancient world” (CLEF Topic 151)
  Unidirectional: se//0.31 demande//0.24 demander//0.08 peut//0.07 merveilles//0.04 question//0.02 savoir//0.02 on//0.02 bien//0.01 merveille//0.01 pourrait//0.01 si//0.01 sur//0.01 me//0.01 t//0.01 emerveille//0.01 ambition//0.01 merveilleusement//0.01 veritablement//0.01 cinq//0.01 hier//0.01
  Bidirectional: merveilles//0.92 merveille//0.03 emerveille//0.03 merveilleusement//0.02
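One common way to get the cleaner bidirectional distribution is to multiply the two unidirectional lexicon probabilities p(f|e) and p(e|f) and renormalize; this sketch shows that idea, which may differ in detail from the combination actually used:

```python
def bidirectional(p_f_given_e, p_e_given_f, e):
    """Combine unidirectional lexicons by multiplying p(f|e) * p(e|f)
    and renormalizing over the surviving translations.

    One common combination; the talk's exact method may differ. Noisy
    candidates such as 'se' (from 'se demander', to wonder) are suppressed
    because their p(e|f) mass is spread across many English words.
    """
    raw = {f: p * p_e_given_f.get(f, {}).get(e, 0.0)
           for f, p in p_f_given_e.get(e, {}).items()}
    total = sum(raw.values())
    return {f: p / total for f, p in raw.items()} if total > 0 else {}
```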

  28. Surprise Language Translation Lexicons [chart: 40% p(h|e), 60% p(e|h)]

  29. Synonym Sets as Models of Term Meaning [diagram: English and Chinese terms grouped into synonym sets, e.g., George W. Bush / 乔治·布什, shrubbery / 草丛, lawn / 草坪, marijuana / 大麻; ambiguous words such as “bush” and “grass” link to several sets with weighted edges (e.g., 0.7/0.3 and 0.6/0.4), giving each term a probability distribution over meanings]
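A sketch of the data structure this diagram implies: each term maps to a distribution over shared synonym-set IDs, and two terms match in meaning to the degree their distributions overlap (set names are illustrative; the weights are the ones shown on the slide):

```python
# term -> {synset_id: probability}; synset names are illustrative
synsets = {
    "bush":  {"BUSH_NAME": 0.7, "SHRUB": 0.3},
    "布什":   {"BUSH_NAME": 1.0},
    "grass": {"SHRUB": 0.6, "MARIJUANA": 0.4},
    "大麻":   {"MARIJUANA": 1.0},
}

def meaning_match(term_a, term_b):
    """Probability that two terms share a meaning: the overlap of
    their synonym-set distributions."""
    a, b = synsets[term_a], synsets[term_b]
    return sum(p * b.get(s, 0.0) for s, p in a.items())

print(meaning_match("bush", "布什"))   # 0.7
print(meaning_match("grass", "大麻"))  # 0.4
```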

  30. “Meaning Matching” Variants [diagram: the variants differ in whether the query side, the document side, or both are mapped into meaning space, with the mapped side shown in parentheses: (Q)(D), (Q)D, Q(D), QD, …]

  31. Pruning Translations • Translations ranked by probability: f1(0.32) f2(0.21) f3(0.11) f4(0.09) f5(0.08) f6(0.05) f7(0.04) f8(0.03) f9(0.03) f10(0.02) f11(0.01) f12(0.01) • Translations retained at each cumulative probability threshold: 0.0-0.3 → f1; 0.4-0.5 → f1-f2; 0.6 → f1-f3; 0.7 → f1-f4; 0.8 → f1-f5; 0.9 → f1-f7; 1.0 → f1-f12
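A sketch of the pruning rule shown above: keep the most probable translations until their cumulative probability first reaches the threshold:

```python
def prune(trans_probs, threshold):
    """Keep the most probable translations until their cumulative
    probability first reaches the threshold."""
    kept, cum = [], 0.0
    for term, p in sorted(trans_probs.items(), key=lambda kv: -kv[1]):
        kept.append(term)
        cum += p
        if cum >= threshold - 1e-9:  # epsilon guards against float rounding
            break
    return kept

probs = {"f1": 0.32, "f2": 0.21, "f3": 0.11, "f4": 0.09, "f5": 0.08,
         "f6": 0.05, "f7": 0.04, "f8": 0.03, "f9": 0.03, "f10": 0.02,
         "f11": 0.01, "f12": 0.01}
print(prune(probs, 0.5))  # ['f1', 'f2']
print(prune(probs, 0.9))  # ['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7']
```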

  32. Comparing PSQ and DAMM 15 Surprise Language topics, TDN queries

  33. 1/3 of Topics Improve w/DAMM 15 Surprise Language topics, TDN queries

  34. Official CLIR Results 50 FIRE-2008 topics, TDN queries

  35. Comparing Stemmers [per-topic scatter plot with regions labeled “YASS Stemmer Better” and “Berkeley Stemmer Better”] 50 FIRE-2008 topics, TDN queries

  36. Best (Overall) CLIR Run [per-topic scatter plot comparing clir-EH-umd-man2 with the median, regions labeled “clir-EH-umd-man2 Better” and “Median Better”] 41 FIRE-2008 topics with ≥ 5 relevant documents, TDN queries

  37. Cross-Language “Retrieval” [flowchart: Query → Query Translation → Translated Query → Search → Ranked List]
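The flowchart above, glued together as a minimal end-to-end sketch; `lexicon` and `search` stand in for whatever translation resource and monolingual engine are available (both illustrative):

```python
def clir(english_query, lexicon, search):
    """Query-translation CLIR: replace each English query term with its
    Hindi translations from the lexicon, then run a monolingual search
    over the Hindi collection and return its ranked list.

    lexicon: {english_term: [hindi_translations]}; search: callable that
    takes a list of Hindi terms and returns a ranked document list.
    """
    translated_query = [hindi for term in english_query.lower().split()
                        for hindi in lexicon.get(term, [])]
    return search(translated_query)
```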

  38. Interactive Translingual Search [flowchart: Query Formulation (supported by English definitions) → Query Translation → Translated Query → Search → Ranked List → Selection (via translated “headlines”) → Document Examination (via MT) → Document Use, with Query Reformulation feeding back into Query Formulation]

  39. UMass Interactive Hindi CLIR

  40. MIRACLE Design Goals • Value-added interactive search • Regardless of available resources • Maximize the value of minimal resources • Bilingual term list + Comparable English text • Leverage other available resources • Parallel text, morphology, MT, summarization

  41. Summary • Larger Hindi test collection • Prerequisite for insightful failure analysis • Surprise Language resources were useful • Translation lexicons • Berkeley stemmer (combine with YASS?) • DAMM is robust with weaker resources

  42. Looking Forward • Shared resources • Test collections • Translation lexicons (or parallel corpora) • Stemmers • System infrastructure • Indian-language (IL) variants of Indri/Terrier/Zettair/Lucene • Community-based cycle of innovation • Students are our most important “result”

  43. For More Information • Team TIDES newsletter • http://language.cnri.reston.va.us/TeamTIDES.html • Cebuano: April 2003 • Hindi: October 2003 • Papers • NAACL/HLT 2003 • MT Summit 2003 • ACM TALIP special issues (Jun/Sep 2003)
