Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005

Christina Bennett

Language Technologies Institute

Carnegie Mellon University

Student Research Seminar

September 23, 2005

What is corpus-based speech synthesis?

(Slide diagram: Transcript + Voice talent speech = Corpus; the Speech Synthesizer combines the Corpus with New text to produce New speech.)
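
To make the diagram concrete, here is a minimal, hypothetical Python sketch of the unit-selection idea behind most corpus-based synthesizers: the recorded corpus is segmented into labelled units, and new speech is produced by selecting and concatenating the units that best match the new text. Every name and cost here (Unit, select_units, the pitch-based costs) is an illustrative assumption, not the method of any particular Blizzard system.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    """One candidate unit from the recorded corpus (e.g. a diphone)."""
    label: str     # linguistic identity, e.g. "ax-b"
    pitch: float   # mean F0 of the unit, in Hz
    samples: list  # the unit's audio samples (placeholder)

def target_cost(unit: Unit, wanted_label: str, wanted_pitch: float) -> float:
    """How well a candidate matches the target specification."""
    mismatch = 0.0 if unit.label == wanted_label else 10.0
    return mismatch + abs(unit.pitch - wanted_pitch) / 50.0

def join_cost(prev: Unit, cur: Unit) -> float:
    """How smoothly two consecutive units concatenate (pitch continuity only)."""
    return abs(prev.pitch - cur.pitch) / 50.0

def select_units(targets, corpus):
    """Greedy left-to-right unit selection; real systems search over all
    candidate sequences (e.g. with Viterbi) rather than committing greedily."""
    chosen = []
    for wanted_label, wanted_pitch in targets:
        candidates = [u for u in corpus if u.label == wanted_label] or corpus
        best = min(
            candidates,
            key=lambda u: target_cost(u, wanted_label, wanted_pitch)
            + (join_cost(chosen[-1], u) if chosen else 0.0),
        )
        chosen.append(best)
    return chosen  # concatenating chosen[i].samples yields the "new speech"

# Toy usage: two units share a label; the one closer in pitch to the target wins.
corpus = [Unit("ax-b", 110.0, []), Unit("ax-b", 150.0, []), Unit("b-aa", 120.0, [])]
print([u.pitch for u in select_units([("ax-b", 115.0), ("b-aa", 118.0)], corpus)])
# [110.0, 120.0]
```
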
Need for Speech Synthesis Evaluation

Motivation

  • Determine effectiveness of our “improvements”
  • Closer comparison of various corpus-based techniques
  • Learn about users' preferences
  • Healthy competition promotes progress and brings attention to the field
Blizzard Challenge Goals

Motivation

  • Compare methods across systems
  • Remove effects of different data by providing & requiring same data to be used
  • Establish a standard for repeatable evaluations in the field
  • [My goal:] Bring the need for improved speech synthesis evaluation to the forefront of the community (positioning CMU as a leader in this regard)
Blizzard Challenge: Overview

Challenge

  • Released first voices and solicited participation in 2004
  • Additional voices and test sentences released Jan. 2005
  • 1–2 weeks allowed to build voices & synthesize sentences
    • 1000 samples from each system (50 sentences × 5 tests × 4 voices)

Evaluation Methods

Challenge

  • Mean Opinion Score (MOS)
    • Evaluate sample on a numerical scale
  • Modified Rhyme Test (MRT)
    • Intelligibility test with the tested word embedded in a carrier phrase
  • Semantically Unpredictable Sentences (SUS)
    • Intelligibility test that keeps listeners from using semantic context to predict words
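
As a rough illustration of how these two kinds of scores are computed, here is a small Python sketch (not the Blizzard scoring code; function names and the edit-distance formulation are generic assumptions): MOS is simply an average of 1–5 ratings, while the type-in tests are scored as word error rate against the reference sentence.

```python
def mean_opinion_score(ratings):
    """Average a list of 1-5 listener ratings for one system/voice."""
    return sum(ratings) / len(ratings)

def word_error_rate(reference: str, typed: str) -> float:
    """WER for one type-in response: word-level edit distance divided by
    the number of reference words."""
    ref, hyp = reference.lower().split(), typed.lower().split()
    # Standard dynamic-programming edit distance (substitution, insertion,
    # and deletion all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# A listener rated one sample 4, and typed a near-miss for another.
print(mean_opinion_score([4, 3, 5, 4]))                            # 4.0
print(round(word_error_rate("the old cat reads a tall tree",
                            "the old cat reeds a tall tree"), 2))  # 0.14
```
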
Challenge setup: Tests

Challenge

  • 5 tests from 5 genres
    • 3 MOS tests (1 to 5 scale)
      • News, prose, conversation
    • 2 “type what you hear” tests (toy examples of both formats are sketched after this list)
      • MRT – “Now we will say ___ again”
      • SUS – ‘det-adj-noun-verb-det-adj-noun’
  • 50 sentences collected from each system, 20 selected for use in testing
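
As promised above, a toy Python generator for the two type-in item formats, following the carrier-phrase and det-adj-noun-verb-det-adj-noun patterns on this slide; the word lists are invented and are not the Blizzard 2005 materials.

```python
import random

# Invented word lists -- the real MRT word sets and SUS vocabulary differ.
MRT_WORDS = ["bat", "bad", "back", "bath", "ban", "bass"]
DET = ["the", "a"]
ADJ = ["green", "sudden", "hollow"]
NOUN = ["table", "storm", "letter"]
VERB = ["eats", "paints", "follows"]

def mrt_item(rng: random.Random) -> str:
    """MRT: the tested word sits inside a fixed carrier phrase."""
    return f"Now we will say {rng.choice(MRT_WORDS)} again"

def sus_item(rng: random.Random) -> str:
    """SUS: det-adj-noun-verb-det-adj-noun, grammatical but semantically
    unpredictable, so listeners cannot guess words from context."""
    slots = (DET, ADJ, NOUN, VERB, DET, ADJ, NOUN)
    return " ".join(rng.choice(words) for words in slots)

rng = random.Random(0)
print(mrt_item(rng))   # e.g. "Now we will say bath again"
print(sus_item(rng))   # e.g. "a hollow letter paints the green storm"
```
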
Challenge setup: Systems

Challenge

  • 6 systems: (random ID A-F)
    • CMU
    • Delaware
    • Edinburgh (UK)
    • IBM
    • MIT
    • Nitech (Japan)
  • Plus 1: “Team Recording Booth” (ID X)
    • Natural examples from the 4 voice talents
Challenge setup: Voices

Challenge

  • CMU ARCTIC databases
  • American English; 2 male, 2 female
    • 2 from initial release
      • bdl (m)
      • slt (f)
    • 2 new DBs released for quick build
      • rms (m)
      • clb (f)
Challenge setup: Listeners

Challenge

  • Three listener groups:
    • S – speech synthesis experts (50)
      • 10 requested from each participating site
    • V – volunteers (60, 97 registered*)
      • Anyone online
    • U – native US English-speaking undergraduates (58, 67 registered*)
      • Solicited and paid for participation

*as of 4/14/05

Challenge setup: Interface

Challenge

  • Entirely online

http://www.speech.cs.cmu.edu/blizzard/register-R.html

http://www.speech.cs.cmu.edu/blizzard/login.html

  • Register/login with email address
  • Keeps track of progress through tests
  • Can stop and return to tests later
  • Feedback questionnaire at end of tests
Voice results: Listener preference

Results

  • slt is most liked, followed by rms
    • Type S:
      • slt - 43.48% of votes cast; rms - 36.96%
    • Type V:
      • slt - 50% of votes cast; rms - 28.26%
    • Type U:
      • slt - 47.27% of votes cast; rms - 34.55%
  • But, preference does not necessarily match test performance…
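
The percentages above are just each voice's share of the preference votes cast within a listener group. A trivial sketch of that tally, with made-up ballot counts chosen to reproduce the Type S figures (the real per-listener votes are not in this transcript):

```python
from collections import Counter

def vote_shares(votes):
    """Percentage of preference votes cast for each voice."""
    counts = Counter(votes)
    total = sum(counts.values())
    return {voice: round(100 * n / total, 2) for voice, n in counts.items()}

# Hypothetical Type S ballots: each entry is one listener's preferred voice.
type_s_votes = ["slt"] * 20 + ["rms"] * 17 + ["clb"] * 5 + ["bdl"] * 4
print(vote_shares(type_s_votes))
# {'slt': 43.48, 'rms': 36.96, 'clb': 10.87, 'bdl': 8.7}
```
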
Voice results: Test performance

Results

(Charts: per-test performance, one slide per voice, for the female voices slt and clb and the male voices rms and bdl.)
Voice results: Natural examples

Results

What makes natural rms different?

Voice results: By system

Results

  • Only system B was consistent across listener types (slt best MOS, rms best WER)
  • Most others showed group trends (with the exception of B above and F*), i.e.
    • S: rms always best WER, often best MOS
    • V: slt usually best MOS, clb usually best WER
    • U: clb usually best MOS and always best WER

Again, people clearly don't prefer the voices they most easily understand

Lessons learned: Listeners

Lessons

  • Reasons to exclude listener data:
    • Incomplete test, failure to follow directions, inability to respond (type-in), unusable responses
  • Type-in tests very hard to process automatically (a normalization sketch follows this list):
    • Homophones, misspellings/typos, dialectal differences, “smart” listeners
  • Group differences:
    • V most variable, U most controlled, S least problematic but not representative
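
The homophone, typo, and dialect problems above amount to a normalization pass before any WER scoring. A hedged sketch of what such a cleanup might look like; the homophone table and spelling fixes are illustrative assumptions, not the rules actually applied to the Blizzard responses:

```python
import re

# Illustrative mappings only -- the real cleanup was hand-tuned per test.
HOMOPHONES = {"reeds": "reads", "blew": "blue", "sea": "see"}
SPELLING_FIXES = {"recieve": "receive", "teh": "the"}

def normalize_response(text: str) -> list[str]:
    """Lower-case, strip punctuation, and map known homophones/typos so a
    typed-in response can be scored against the reference sentence."""
    words = re.findall(r"[a-z']+", text.lower())
    fixed = [SPELLING_FIXES.get(w, w) for w in words]
    return [HOMOPHONES.get(w, w) for w in fixed]

print(normalize_response("Teh old cat reeds a tall tree!"))
# ['the', 'old', 'cat', 'reads', 'a', 'tall', 'tree']
```
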
Lessons learned: Test design

Lessons

  • Feedback re tests:
    • MOS: Give examples to calibrate scale (ordering schema); use multiple scales (lay-people?)
    • Type-in: Warn about SUS; hard to remember SUS; words too unusual/hard to spell
  • Uncontrollable user test setup
  • Pros & cons of having natural examples in the mix
    • Analyzing user response (+), differences in delivery style (-), availability of voice talent (?)
Goals Revisited

Lessons

  • One methodology clearly outshone the rest
  • All systems used the same data, allowing for an actual comparison of systems
  • A standard for repeatable evaluations in the field was established
  • [My goal:] Brought attention to the need for better speech synthesis evaluation (while positioning CMU as the experts)
For the Future

Future

  • (Bi-)Annual Blizzard Challenge
    • Introduced at Interspeech 2005 special session
  • Improve test design to make post-evaluation analysis easier
  • Encourage more sites to submit their systems!
  • More data resources (problematic for the commercial entities)
  • Expand types of systems accepted (& therefore test types)
    • e.g. voice conversion