Enhanced Infrastructure for Creation & Collection of Translation Resources




Zhiyi Song, Stephanie Strassel (speaker), Gary Krug, Kazuaki Maeda

Introduction
  • LDC develops large-scale parallel text corpora for sponsored research programs
    • Manual creation of parallel text by human translators
    • Harvesting and aligning potential parallel documents from known repositories and the web
  • Recent expansion in scope and variety
    • Requiring improvements in quality, efficiency and cost-effectiveness
Context for Resource Creation
  • Previous focus primarily on Chinese and Arabic newswire (NW)
  • Current focus on "unstructured" data
    • Broadcast News (BN) and Broadcast Conversation (BC)
    • Weblogs, Newsgroups (WB)
    • Handwritten document images of many types (VAR)
  • New linguistic varieties
    • Eight language pairs in the LCTL program
    • Colloquial Arabic varieties for some projects
  • New evaluation requirements
    • Multiple human translations, adjudication of multiple translations
    • Translation alternatives for ambiguous source text
    • Translation post-editing
Manual Translation Pipeline

[Figure: flowchart of the manual translation pipeline. Recovered box labels: data pool; select audio; select text; selected web data; transcription and segmentation; segment into sentence units; source text; convert to translator-friendly format; translation; validate; QC; translated text; convert to release format; release; package.]


Manual Translation
  • Commercial agencies vetted and trained by LDC
  • Required to use LDC's project-specific guidelines
    • Accuracy and fidelity over fluency
    • General principles, language-specific requirements
    • Rules for named entities, disfluencies, emoticons, etc.
    • Requirements for formatting and validation
    • Multiple examples of preferred translation
  • Separate guidelines for specialized tasks
    • Post-editing machine translation output
    • Translation alternatives
    • Translation of novel single sentences
    • Translation of handwritten document images
Translation QC
  • All translations undergo additional QC at LDC
    • Typically 10% of training data, 100% of evaluation data reviewed
  • Standardized QC rating system deducts points for each type of error
    • QC report including score, examples sent to translators
    • Failing score requires re-translation of full data set
  • QC process facilitated by customized TransQC GUI
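
The review sampling and point-deduction scoring described above might be sketched as follows. Only the 10% and 100% review rates come from the slides. The error categories, penalty weights, starting score of 100, and pass threshold are hypothetical placeholders, since the slides do not give LDC's actual rubric.

```python
import random

# Review rates stated on the slide: 10% of training data, 100% of evaluation data.
REVIEW_RATE = {"training": 0.10, "evaluation": 1.00}

# Hypothetical error types and penalties; the real LDC rubric is not in the slides.
PENALTIES = {"mistranslation": 4.0, "omission": 3.0, "grammar": 1.0, "formatting": 0.5}
PASS_THRESHOLD = 80.0  # hypothetical pass mark


def select_for_review(segments, partition, seed=0):
    """Return sorted indices of the segments a reviewer should check."""
    if not segments:
        return []
    k = max(1, round(len(segments) * REVIEW_RATE[partition]))
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return sorted(rng.sample(range(len(segments)), k))


def qc_score(error_counts):
    """Start from 100 and deduct a fixed penalty per error, floored at 0."""
    deduction = sum(PENALTIES[kind] * n for kind, n in error_counts.items())
    return max(100.0 - deduction, 0.0)


def qc_report(error_counts):
    """Bundle the score and pass/fail decision for return to the translator."""
    score = qc_score(error_counts)
    return {"score": score, "passed": score >= PASS_THRESHOLD, "errors": dict(error_counts)}
```

Under these assumed weights, a delivery with two mistranslations and three grammar errors loses 11 points, and a report with `passed` set to false would correspond to the re-translation case on the slide.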
Translation Project Management
  • Translation database is core management tool
    • Document ID, language, genre, token count, LDC file server path
    • Data set information including project, phase, partition, restrictions
    • Translator assignment, due date, status, QC score, payment info
  • Backend to LDC Translator Extranet
    • Translators access and submit assignments, validate submissions, view QC reports, generate invoices, check payment status
  • Queries support status tracking but also assignment generation, data selection, cross-project coordination
    • What translation assignments are pending delivery this week?
    • What is the average QC score for this translator on Chinese BC?
    • List Arabic NW files from 2007 that have never been released as GALE training data and are not part of any project's eval set
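
To make the example queries concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, column names, translator names, paths, and sample rows are all assumptions for illustration. The slides list the kinds of fields stored but not the actual schema or database engine.

```python
import sqlite3

# Hypothetical schema built from the field names listed on the slide.
SCHEMA = """
CREATE TABLE documents (
    doc_id TEXT PRIMARY KEY,
    language TEXT, genre TEXT, token_count INTEGER, path TEXT
);
CREATE TABLE assignments (
    doc_id TEXT REFERENCES documents(doc_id),
    translator TEXT, due_date TEXT, status TEXT, qc_score REAL
);
"""


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?, ?)", [
        ("doc001", "Chinese", "BC", 1200, "/demo/doc001"),
        ("doc002", "Chinese", "BC", 800, "/demo/doc002"),
        ("doc003", "Arabic", "NW", 500, "/demo/doc003"),
    ])
    conn.executemany("INSERT INTO assignments VALUES (?, ?, ?, ?, ?)", [
        ("doc001", "agency_a", "2010-05-03", "complete", 90.0),
        ("doc002", "agency_a", "2010-05-05", "complete", 80.0),
        ("doc003", "agency_b", "2010-05-06", "pending", None),
    ])
    return conn


def average_qc(conn, translator, language, genre):
    """Average QC score for one translator on one language and genre."""
    row = conn.execute(
        """SELECT AVG(a.qc_score)
           FROM assignments a JOIN documents d ON a.doc_id = d.doc_id
           WHERE a.translator = ? AND d.language = ? AND d.genre = ?""",
        (translator, language, genre),
    ).fetchone()
    return row[0]  # None when no scored assignment matches


def pending_between(conn, start, end):
    """Document IDs of pending assignments due in an ISO-date range."""
    rows = conn.execute(
        """SELECT doc_id FROM assignments
           WHERE status = 'pending' AND due_date BETWEEN ? AND ?
           ORDER BY doc_id""",
        (start, end),
    ).fetchall()
    return [r[0] for r in rows]
```

Storing due dates as ISO 8601 strings lets a plain `BETWEEN` comparison stand in for the "pending delivery this week" query.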
Parallel Text Harvesting
  • Manual translation supplemented by harvesting and alignment of potential parallel text
    • Harvest text from multilingual sites
      • E.g. newswire providers
    • Standardize markup format
    • Use BITS document mapping module to find likely parallel documents
    • Use Champollion to find sentence alignments
  • High yields in GALE program
    • 82,000 Arabic-English document pairs
    • 67,000 Chinese-English document pairs
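
The document-mapping step can be illustrated with a toy example. This sketch is not BITS or Champollion and does not reproduce their algorithms. It simply pairs each source document with the target document whose vocabulary best overlaps its lexicon-translated tokens. The function names, the Jaccard similarity measure, and the 0.3 threshold are all assumptions chosen for illustration.

```python
def translate_tokens(tokens, lexicon):
    """Map source tokens to target-language words via a bilingual lexicon."""
    return {lexicon[t] for t in tokens if t in lexicon}


def overlap_score(src_tokens, tgt_tokens, lexicon):
    """Jaccard overlap between translated source tokens and target tokens."""
    translated = translate_tokens(src_tokens, lexicon)
    target = set(tgt_tokens)
    if not translated or not target:
        return 0.0
    return len(translated & target) / len(translated | target)


def map_documents(src_docs, tgt_docs, lexicon, threshold=0.3):
    """Pair each source document with its best-scoring target document.

    Returns (source_id, target_id, score) tuples for pairs whose score
    reaches the (hypothetical) threshold.
    """
    pairs = []
    for src_id, src_text in src_docs.items():
        best_id, best = None, 0.0
        for tgt_id, tgt_text in tgt_docs.items():
            score = overlap_score(
                src_text.lower().split(), tgt_text.lower().split(), lexicon
            )
            if score > best:
                best_id, best = tgt_id, score
        if best_id is not None and best >= threshold:
            pairs.append((src_id, best_id, best))
    return pairs
```

A production system would also need tokenization suited to Arabic and Chinese, and a sentence aligner would then run on each candidate pair that the mapping step returns.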
Conclusion
  • Robust, flexible translation infrastructure to support multiple, distinct, concurrent projects
  • Much of this infrastructure freely available from LDC
    • Task specifications, guidelines available for all projects
      • http://projects.ldc.upenn.edu/gale/Translation/
    • QCTrans GUI slated for free, open-source distribution
  • Many resulting parallel text corpora already in LDC Catalog
  • Newly emerging data sets to be added over time
Acknowledgements
  • This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this paper does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.