
Coping with Surprise: Multiple CMU MT Approaches

Alon Lavie

Lori Levin, Jaime Carbonell,

Alex Waibel, Stephan Vogel,

Ralf Brown, Robert Frederking

Language Technologies Institute

Carnegie Mellon University

Joint work with:

Katharina Probst, Erik Peterson, Joy Zhang,

Fei Huang, Alicia Tribble, Ariadna Font-Llitjos,

Rachel Reynolds, Richard Cohen

Main Hindi SLE Efforts
  • Data Collection
    • Elicited Data Collection
    • Data from contacts in India
    • Web Crawling
  • Language Processing Utilities
    • Morphology
    • Encoding identification and conversion
  • MT system development
    • XFER system
    • SMT system
    • EBMT system

TIDES PI Meeting/ SLE

Elicited Data Collection
  • Goal: Acquire high quality word aligned Hindi-English data to support XFER system development (grammar learning)
  • Recruited team of ~20 bilingual speakers at CMU and in India
  • Extracted a corpus of phrases (NPs and PPs) from Brown Corpus section of Penn TreeBank
  • Controlled Elicitation Corpus (typologically diverse, limited vocabulary) also translated into Hindi
  • Resulting in a total of 17589 word-aligned translated phrases (~50KB)

The CMU Elicitation Tool

Elicited Data Collection
  • Problems and issues:
    • English  Hindi direction allowed us to use the Penn TreeBank to extract accurate phrases
    • However, bilingual informants not accustomed to type Hindi  typos
    • Limits utility of the data, less effect on accuracy
    • Using the WSJ portion of the PennTB may have been a better fit for genre

Main CMU Contributions to SLE Shared Resources
  • Elicited Data Corpus (~50KB)
  • Indian Government Parallel Text ERDC.tgz (338 MB)
  • CMU Phrase Lexicon Joyphrase.gz (3.5 MB)
  • Cleaned IBM lexicon ibmlex-cleaned.txt.gz (1.5 MB)
  • CMU Aligned Sentences CMU-aligned-sentences.tar.gz (1.3 MB)
  • CMU Phrases and sentences CMU-phrases+sentences.zip (468 KB)
  • Bilingual Named Entity List IndiaTodayLPNETranslists.tar.gz (54KB)

Web Crawling:

  • Most sites with possible parallel texts had Hindi in proprietary encodings
  • Osho http://www.osho.com/Content.cfm?Language=Hindi

Hindi Morphological Analyzer
  • http://www.iiit.net/ltrc/morph/index.htm
  • High quality and high coverage morphological analyzer from IIIT
    • Input: full inflected forms (RomanWX encoding)
    • Output: root form + collection of features
  • Installing as a local server required some effort, e.g. UTF-8 → RomanWX conversion
  • Used primarily in our XFER system

Other Hindi Processing Utilities
  • Encoding identification and conversion tools
    • Built two automatic encoding identifiers, used for web data collection
    • Located and installed encoding converters from a variety of encodings
    • Most widely used was UTF-8 to RomanWX
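
One simple way to identify an encoding can be sketched as follows. This is a minimal illustration only, not the CMU identifiers themselves (whose internals the slides do not describe): it assumes a page counts as "UTF-8 Hindi" if its bytes decode as UTF-8 and most non-space characters fall in the Unicode Devanagari block. The function name and the 0.5 threshold are hypothetical.

```python
def looks_like_utf8_devanagari(raw: bytes, min_ratio: float = 0.5) -> bool:
    """Heuristic check: does `raw` decode as UTF-8 and consist mostly of
    Devanagari code points (U+0900..U+097F)?

    Both the function name and the default threshold are illustrative
    assumptions, not part of any CMU tool.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: probably a proprietary or legacy encoding.
        return False

    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False

    devanagari = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return devanagari / len(chars) >= min_ratio
```

Pages that fail such a check would then be handed to a classifier for the proprietary encodings, or set aside for manual inspection.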

XFER System for Hindi
  • Three passes:
    • match against phrase-to-phrase entries (full-forms, no morphology)
    • morphologically analyze input words and match against lexicon
      • matches feed into manual and learned transfer rules
    • match original word against lexicon - provides word-to-word translation as fall-back for input not otherwise covered
  • Simple decoding: greedy left-to-right search that prefers longer input segments: NIST 5.35
  • “Strong” decoding with lattices+LM: NIST 5.47
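
The "simple decoding" strategy can be illustrated with a short sketch. It assumes a flat phrase table mapping source token tuples to a single target string, and it passes uncovered tokens through unchanged. The function name, the table layout, and the `max_len` limit are hypothetical and are not taken from the XFER system.

```python
def greedy_decode(tokens, phrase_table, max_len=4):
    """Greedy left-to-right decoding that prefers longer input segments.

    tokens:       list of source-language tokens
    phrase_table: dict mapping tuples of source tokens to a target string
                  (a toy stand-in for lexicon and transfer-rule output)
    max_len:      longest source segment to try (illustrative limit)
    """
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest segment starting at position i first.
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + span])
            if key in phrase_table:
                output.append(phrase_table[key])
                i += span
                break
        else:
            # Nothing matched: pass the token through untranslated.
            output.append(tokens[i])
            i += 1
    return " ".join(output)


# Toy example with placeholder tokens rather than real Hindi.
table = {("a", "b"): "AB", ("a",): "A", ("c",): "C"}
print(greedy_decode(["a", "b", "c", "d"], table))  # AB C d
```

Because each choice is final, this search cannot recover from a locally attractive but globally poor segment, which is one motivation for the lattice-plus-LM decoding on the slide.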

Examples of Learned Rules

SMT System for Hindi
  • Resources
    • Trained on commonly available bilingual corpora
    • Used bilingual Hindi-English dictionary
    • Named Entities
    • 70 million word English LM
  • CMU SMT System
    • Tuned on ISI devtest data
    • Monotone decoding, as reordering did not result in improvement on this test set
    • Mixed casing based on Named Entities and simple rules
  • NIST score: 6.74
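
Casing restoration "based on Named Entities and simple rules" might look roughly like the sketch below. The slides do not list the actual rules, so everything here is an assumption: a single-token named-entity dictionary, capitalization of sentence-initial words, and an uppercase pronoun "I". The function name is hypothetical.

```python
def restore_case(tokens, named_entities):
    """Restore mixed casing on lowercased English MT output.

    tokens:         list of lowercased output tokens
    named_entities: dict mapping a lowercased token to its cased form
                    (assumed single-token entries for simplicity)
    """
    out = []
    sentence_start = True
    for tok in tokens:
        if tok in named_entities:
            # Rule 1: named entities take their listed casing.
            out.append(named_entities[tok])
        elif sentence_start and tok[:1].isalpha():
            # Rule 2: capitalize the first word of each sentence.
            out.append(tok.capitalize())
        elif tok == "i":
            # Rule 3: the pronoun "I" is always uppercase.
            out.append("I")
        else:
            out.append(tok)
        # The next token starts a sentence if this one ends one.
        sentence_start = tok in {".", "!", "?"}
    return out


tokens = "the minister visited delhi . he left india .".split()
print(" ".join(restore_case(tokens, {"delhi": "Delhi", "india": "India"})))
# The minister visited Delhi . He left India .
```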

EBMT System for Hindi
  • Training data: same as SMT + a few hand-written equivalent class generalizations
  • English LM built from APW portion of GigaWord Corpus (600M words)
  • Encoding variation: raw training data in a variety of different encodings → all converted to UTF-8 (already supported by EBMT)
  • Preprocessing of example phrases to improve word matching:
    • Match Hindi possessive with English ‘s
  • NIST Score: 5.98

A Truly Limited Data Scenario for Hindi-to-English
  • Put together a scenario with very miserly data resources:
    • Elicited Data corpus: 17589 phrases
    • Cleaned portion (top 12%) of LDC dictionary: ~2725 Hindi words (23612 translation pairs)
    • Manually acquired resources during the SLE:
      • 500 manual bigram translations
      • 72 manually written phrase transfer rules
      • 105 manually written postposition rules
      • 48 manually written time expression rules
  • No additional parallel text!!
  • Results presented tomorrow…

Other CMU Contributions to SLE Shared Resources

FOUND RESOURCES not on LDC Website:

[From TidesSLList Archive website]

  • Vogel email 6/2
    • Hindi Language Resources: http://www.cs.colostate.edu/~malaiya/hindilinks.html
    • General Information on Hindi Script: http://www.latrobe.edu.au/indiangallery/devanagari.htm
    • Dictionaries at: http://www.iiit.net/ltrc/Dictionaries/Dict_Frame.html
    • English to Hindi dictionary in different formats: http://sanskrit.gde.to/hindi/
    • A small English to Urdu dictionary: http://www.cs.wisc.edu/~navin/india/urdu.dictionary
    • The Bible at: http://www.gospelcom.net/ibs/bibles/
    • The Emille Project: http://www.emille.lancs.ac.uk/home.htm
    • [Hardcopy phrasebook references]
    • A Monthly Newsletter of Vigyan Prasar
    • http://www.vigyanprasar.com/dream/index.asp
    • Morphological Analyser: http://www.iiit.net/ltrc/morph/index.htm

Other CMU Contributions to SLE Shared Resources

FOUND RESOURCES not on LDC Website: (cont.)

[From TidesSLList Archive website]

  • Tribble email, via Vogel 6/2 Possible parallel websites:
    • http://www.bbc.co.uk (English)
    • http://www.bbc.co.uk/urdu/ (Hindi)
    • http://sify.com/news_info/news/
    • http://sify.com/hindi/
    • http://in.rediff.com/index.html (English)
    • http://www.rediff.com/hindi/index.html (Hindi)
    • http://www.indiatoday.com/itoday/index.html
    • http://www.indiatodayhindi.com
  • Vogel email 6/2
    • http://us.rediff.com/index.html
    • http://www.rediff.com/hindi/index.html [Already listed]
    • http://www.niharonline.com/
    • http://www.niharonline.com/hindi/index.html
    • http://www.boloji.com/hindi/index.html
    • http://www.boloji.com/hindi/hindi/index.htm
    • The Gita Supersite http://www.gitasupersite.iitk.ac.in/
    • Press Information Bureau, Government of India
      • English: http://pib.nic.in/
      • Hindi: http://pib.nic.in/urdu/hindimain.html

Other CMU Contributions to SLE Shared Resources

FOUND RESOURCES not on LDC Website: (cont.)

[From TidesSLList Archive website]

  • 6/20 Parallel Hindi/English webpages:
    • GAIL (Natural Gas Co.) http://gail.nic.in/ UTF-8. [Found by CMU undergrad Web team] [Mike Maxwell, LDC, found it at the same time.]

SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE:

[From TidesSLList Archive website:]

  • Frederking email 6/3 [announced], 6/4 [provided]
    • Ralf Brown's idenc encoding classifier
  • Frederking email 6/5
    • PDF extractions from LanguageWeaver URLs:
      • http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/English/
      • http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/Hindi/
  • Frederking email 6/5
    • Richard Wang's Perl ident.pl encoding classifier and ISCII-UTF8.pl converter
  • Frederking email 6/11
    • Erik Peterson here has put together a Perl wrapper for the IIIT Morphology package, so that the input can be UTF-8: http://progress.is.cs.cmu.edu/surprise/morph_wrapper.tar.gz

Other CMU Contributions to SLE Shared Resources

SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE: (cont.)

[From TidesSLList Archive website:]

  • Levin email 6/13
    • Directory of Elicited Word-Aligned English-Hindi Translated Phrases: http://progress.is.cs.cmu.edu/surprise/Elicited-Data/
  • Frederking email 6/20
    • Undecoded but believed to be parallel webpages: http://progress.is.cs.cmu.edu/surprise/merged_urls.txt
    • PDF extractions from same: http://progress.is.cs.cmu.edu/surprise/merged_urls/
  • Frederking email 6/24
    • Several individual parallel webpages; sites may have more:
      • www.commerce.nic.in/setup.htm
      • www.commerce.nic.in/hindi/setup.html
      • mohfw.nic.in/kk/95/books1.htm
      • mohfw.nic.in/oph.htm
      • wwww.mp.nic.in
