
Topic Tracking at Maryland: Lessons from the Johns Hopkins Mandarin-English Information (MEI) Project

Gina-Anne Levow and Douglas W. Oard, Institute for Advanced Computer Studies, University of Maryland, College Park


Presentation Transcript


  1. Topic Tracking at Maryland: Lessons from the Johns Hopkins Mandarin-English Information (MEI) Project
     Gina-Anne Levow and Douglas W. Oard
     Institute for Advanced Computer Studies, University of Maryland, College Park
     TDT-2000 Workshop

  2. Roadmap
     • MEI Overview (6 weeks in 5 minutes)
     • MEI Results
     • Adapting MEI to TDT
     • TDT Results
     • Conclusions

  3. The MEI Team
     • Senior Members
       • Helen Meng, Chinese University of Hong Kong
       • Erika Grams, Advanced Analytic Tools
       • Sanjeev Khudanpur, Johns Hopkins University
       • Gina-Anne Levow, University of Maryland
       • Douglas Oard, University of Maryland
       • Patrick Schone, US Department of Defense
       • Hsin-Min Wang, Academia Sinica, Taiwan
     • Students
       • Berlin Chen, National Taiwan University
       • Wai-Kit Lo, Chinese University of Hong Kong
       • Karen Tang, Princeton University
       • Jianqiang Wang, University of Maryland

  4. MEI: The Challenges
     Different Problems
     • Speech Recognition
       • Tokenization
       • Lexicon coverage
       • Selection among alternatives
     • Translation
       • Tokenization
       • Lexicon coverage
       • Selection among alternatives

  5. Term Granularity Options
     • English Phrases
     • English Words
     • Mandarin Characters
     • Mandarin Words
     • Mandarin Syllables

  6. MEI Evaluation Collections
     • Development Collection: TDT-2 (Jan 98 to Jun 98)
       • 17 topics, variable number of exemplars
       • 2265 manually segmented stories
     • Evaluation Collection: TDT-3 (Oct 98 to Dec 98)
       • 56 topics, variable number of exemplars
       • 3371 manually segmented stories
     • English text topic exemplars: Associated Press, New York Times
     • Mandarin audio broadcast news: Voice of America (Mar 98 to Jun 98 marked on the timeline)

  7. [Figure: MEI system diagram; only the box labels survive]
     • Query side: English Exemplar (“President Bill Clinton and…”), Named Entity Tagging, Term Selection, Term Translation, Bilingual Term List, Query Construction
     • Document side: Mandarin Audio, Speech Recognition, Story Boundaries, Document Construction
     • Retrieval and scoring: Mandarin IR System, Ranked List, Relevance Judgments (“000100010000010100”), Evaluation, Mean Uninterpolated Average Precision
     • Sources named in the figure: LDC, CETA, BBN, U Mass, Cornell, Dragon

  8. Query Translation
     • Dictionary inversion for phrase translation
       • “Wall Street”, “best interests”, “human rights”
     • Lemmatize remaining words if necessary
       • e.g., “televised” translates as “television”
     • Filtering for query term selection
       • Compared to an English background model
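The phrase-translation step on this slide can be illustrated with a small sketch. It assumes the bilingual term list maps Mandarin terms to English glosses, so that inverting it yields an English-to-Mandarin lookup, and it uses greedy longest-match to prefer phrases over single words. The function names, the toy entries, and the `max_len` parameter are hypothetical and are not taken from the MEI system.

```python
from collections import defaultdict


def invert_term_list(zh_to_en):
    """Invert a Mandarin-to-English term list into an English-to-Mandarin one."""
    en_to_zh = defaultdict(list)
    for zh_term, en_glosses in zh_to_en.items():
        for gloss in en_glosses:
            key = gloss.lower()
            if zh_term not in en_to_zh[key]:
                en_to_zh[key].append(zh_term)
    return dict(en_to_zh)


def translate_phrases(tokens, en_to_zh, max_len=3):
    """Greedy longest-match lookup: try the longest phrase first, then shorter ones."""
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in en_to_zh:
                out.append((phrase, en_to_zh[phrase]))
                i += n
                break
        else:
            # No entry of any length starts here: keep the word, mark it untranslated.
            out.append((tokens[i].lower(), []))
            i += 1
    return out


# Toy term list (hypothetical entries, for illustration only).
toy_term_list = {
    "华尔街": ["Wall Street"],
    "人权": ["human rights"],
    "街": ["street"],
    "墙": ["wall"],
}
inverted = invert_term_list(toy_term_list)
result = translate_phrases(["Wall", "Street", "human", "rights", "today"], inverted)
```

With the toy list, "Wall Street" is matched as one phrase instead of being split into "wall" and "street", and "today" comes back with an empty translation list, which is where lemmatization or another fallback would apply.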

  9. Evaluation Measure
     Able to characterize variation across exemplars!
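The measure named in the system diagram is mean uninterpolated average precision. A minimal sketch of the standard definition follows, assuming each exemplar query yields one ranked list that is scored separately and then averaged; the function names and the toy story identifiers are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision for one ranked list.

    Precision is taken at the rank of each relevant story that was retrieved,
    then divided by the total number of relevant stories, so relevant stories
    that were never retrieved contribute zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Average the per-query values; `runs` is a list of (ranked_ids, relevant_ids)."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example: relevant stories at ranks 1 and 3.
ap = average_precision(["a", "b", "c", "d"], {"a", "c"})
```

Because a value is produced per exemplar query before averaging, the spread of those per-query values can be inspected as well as their mean.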

  10. Balanced Translation Works Well
     • Pirkola’s structured queries
       • Treat translation alternatives as synonyms
       • Inquery #syn() operator
     • Balanced translation
       • Distribute probability mass over translation alternatives
       • Inquery #sum() operator
     TDT-2, phrase-based translation, word-based retrieval
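The difference between the two strategies can be sketched with a toy tf-idf weight. This is not Inquery's belief function, and the functions below are hypothetical stand-ins for the #syn() and #sum() operators: the synonym version pools the alternatives into one pseudo-term before weighting, while the balanced version weights each alternative separately and averages, so a query term with many translations counts no more than a term with one.

```python
import math


def idf(df, n_docs):
    """A simple smoothed idf (an assumption; any monotone idf would do here)."""
    return math.log((n_docs + 1) / (df + 0.5))


def syn_weight(doc, alternatives, collection):
    """Structured query: treat all alternatives as a single synonym class."""
    tf = sum(doc.count(t) for t in alternatives)
    if tf == 0:
        return 0.0
    df = sum(1 for d in collection if any(t in d for t in alternatives))
    return tf * idf(df, len(collection))


def balanced_weight(doc, alternatives, collection):
    """Balanced translation: average the weights of the individual alternatives."""
    if not alternatives:
        return 0.0
    total = 0.0
    for t in alternatives:
        df = sum(1 for d in collection if t in d)
        total += doc.count(t) * idf(df, len(collection))
    return total / len(alternatives)


# Toy collection of tokenized documents (placeholder tokens).
collection = [["a", "c"], ["b"], ["c"]]
doc = collection[0]
single = balanced_weight(doc, ["a"], collection)
split = balanced_weight(doc, ["a", "zz"], collection)
```

In the toy run, adding a second alternative that never occurs halves the balanced weight, which is the "distribute probability mass" behavior described on the slide.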

  11. Phrase Translation Beats Words
     • Phrases beat words
     • Three sources
       • Translation lexicon
       • Named entities
       • Numeric expressions
     Condition: TDT-2, 12 exemplars, word-based retrieval

  12. Character Bigram Indexing Wins
     • Character bigrams are best
     • Syllable bigrams do poorly
     TDT-2, single NYT exemplar, manual translation
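For readers unfamiliar with the indexing unit, here is a minimal sketch of character bigram extraction. It assumes overlapping bigrams over the character sequence with whitespace removed; the slide does not say whether the bigrams overlap, so that detail is an assumption.

```python
def char_bigrams(text):
    """Return overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # A lone character is kept as its own indexing unit.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


bigrams = char_bigrams("克林顿总统")
```

Bigrams need no word segmenter, so a segmentation error in either the query or the recognized speech does not prevent a match on the characters the two share.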

  13. Untranslatable Terms

     Terms           Total     OOV
     # (by token)    87,004    3,028
     # (by type)     12,402    1,122

     Term         Occurrences
     suharto      97
     netanyahu    88
     starr        62
     arafat       50
     bjp          45
     vajpayee     44
     estrada      44
     …
     hsu          19
     zemin        7
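The out-of-vocabulary rates implied by the counts on this slide can be worked out directly. The short sketch below only divides the slide's own numbers; the dictionary layout and function name are illustrative.

```python
# (total, OOV) counts copied from the slide.
counts = {
    "token": (87004, 3028),
    "type": (12402, 1122),
}


def oov_rate(total, oov):
    """Fraction of terms that have no entry in the translation lexicon."""
    return oov / total


rates = {kind: oov_rate(total, oov) for kind, (total, oov) in counts.items()}
```

This gives roughly 3.5% of tokens and roughly 9% of types, so the untranslatable terms are a small share of running text but a larger share of the vocabulary, and the listed examples are mostly proper names.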

  14. Cross-Language Phonetic Matching
     • Small improvement
       • Not statistically significant
     • Character bigrams are best
     • Form a unified index
       • Character and syllable bigrams
     • Translate words if possible
       • Then form character bigrams
     • Otherwise translate syllables
       • Then form syllable bigrams
     TDT-2, phrase-based translation
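The fallback order on this slide can be sketched as query-side logic. The sketch assumes two lookup tables, a word lexicon and a table of phonetic (syllable) renderings for names, and it tags each bigram with its type so that both kinds can share one index. The table contents, the tag prefixes, and the function names are hypothetical; the actual cross-language phonetic matching was more involved than a table lookup.

```python
def overlapping_bigrams(units):
    """Pair each unit with its successor."""
    return [(units[i], units[i + 1]) for i in range(len(units) - 1)]


def query_terms(english_term, word_lexicon, syllable_lexicon):
    """Translate a word if possible, otherwise fall back to its syllables."""
    if english_term in word_lexicon:
        chars = list(word_lexicon[english_term])
        return ["char:" + a + b for a, b in overlapping_bigrams(chars)]
    if english_term in syllable_lexicon:
        syllables = syllable_lexicon[english_term]
        return ["syl:" + a + "-" + b for a, b in overlapping_bigrams(syllables)]
    return []  # neither translation route is available


# Toy lexicons (hypothetical entries, for illustration only).
word_lexicon = {"president": "总统"}
syllable_lexicon = {"suharto": ["su", "ha", "tuo"]}

translated = query_terms("president", word_lexicon, syllable_lexicon)
phonetic = query_terms("suharto", word_lexicon, syllable_lexicon)
```

Prefixing the terms keeps character and syllable bigrams from colliding in the unified index while still letting a single query mix both kinds.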

  15. MEI: Comparing Collections

  16. MEI Conclusions
     • ASR: Words
     • Translation: Phrases, Words, Lemmas, Syllables
     • Indexing: Character Bigrams

  17. TDT-2000: What’s New Since ’99?
     • Key ideas from MEI:
       • Dictionary inversion for phrase translation
       • Balanced translation
       • Post-translation resegmentation
     • Adaptation to TDT:
       • Exploit negative exemplars
       • Improved Mandarin topic normalization
       • Round-robin balanced translation

  18. [Figure: TDT-2000 system diagram; only the box labels survive]
     • Query side: English Exemplars (“President Bill Clinton and…”), Training Epoch, Term Selection, Term Translation, Bilingual Term List, Query Construction
     • Document side: Mandarin Audio, Speech Recognition, Story Boundaries, Document Construction
     • Retrieval and scoring: IDF Computation, PRISE, Ranked List, Score Normalization, Scores
     • Sources named in the figure: LDC, LDC/CETA, NIST, Dragon

  19. Topic Tracking Improvements
     • Improved filtering for query term selection
       • First compare to background model
       • Augment by comparison to negative exemplars
     • Mandarin topic normalization (unofficial)
       • Language-specific strategy
         • Mandarin: Best single training epoch score
         • English: Average of exemplar scores
       • Recomputed Mandarin source normalization
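The two-stage term filter can be illustrated with a rough sketch. The slide does not give the selection statistic, so a plain relative-frequency ratio is swapped in here as a stand-in: a term is kept if it is over-represented in the positive exemplars relative to the background, and, when negative exemplars are supplied, also relative to them. The threshold, the smoothing floor, and all names are assumptions and not the authors' method.

```python
from collections import Counter


def rel_freq(tokens):
    """Relative frequency of each token in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def select_terms(positive, background, negative=None, ratio=2.0):
    """Keep terms whose exemplar frequency beats the comparison sets by `ratio`."""
    pos = rel_freq(positive)
    bg = rel_freq(background)
    neg = rel_freq(negative) if negative else {}
    floor = 1.0 / (len(background) + 1)  # avoids division by zero for unseen terms
    selected = []
    for term, p in pos.items():
        if p / max(bg.get(term, 0.0), floor) < ratio:
            continue  # not distinctive against the background model
        if neg and p / max(neg.get(term, 0.0), floor) < ratio:
            continue  # also prominent in off-topic (negative) exemplars
        selected.append(term)
    return sorted(selected)


# Toy data (hypothetical, for illustration only).
positive = ["suharto", "resigns", "the", "suharto", "jakarta", "the"]
background = ["the"] * 50 + ["of"] * 30 + ["resigns"] + ["market"] * 19
negative = ["jakarta", "market", "the", "jakarta"]

without_negatives = select_terms(positive, background)
with_negatives = select_terms(positive, background, negative)
```

In the toy run the background comparison removes "the", and the negative exemplars additionally remove "jakarta", which appears in both on-topic and off-topic stories.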

  20. Effect of Negative Exemplars
     [Figure: text-only DET plots, 1st 60 topics (self-scored)]
     • Mandarin Text: Nn=0 and Nn=2
     • English Text: Nn=0 and Nn=2

  21. Indexing Character Bigrams
     [Figure: Mandarin speech only, 1st 60 topics (unofficial renormalization)]
     • Conditions compared: Words, Character Bigrams

  22. Round Robin 8-Best Translation
     [Figure: Mandarin text, 1st 60 topics (self-scored)]
     • TDT-1999: 2-best translation
     • TDT-2000: Round-robin 8 best

  23. Conclusions
     • Top-8 round robin translation to Mandarin wins
       • Slightly outperforms top-2 translation to English
     • Query translation is more efficient
       • Better suited to a stream of stories
     • Match term extent to purpose
       • ASR, translation, indexing

  24. Closing Thoughts
     • Thanks to Jon and LDC!
     • Normalization limits our insight
       • Need some way to see past it
       • Availability of TDT-3 ground truth?
