Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Kei Hashimoto, Shinji Takaki, Keiichiro Oura, and Keiichi Tokuda

Nagoya Institute of Technology

2 September, 2011

Background
  • HMM-based speech synthesis
    • Quality of synthesized speech depends on acoustic models
    • Model estimation is one of the most important problems
  • An appropriate training algorithm is required
    • Deterministic annealing EM (DAEM) algorithm
      • To overcome the local maxima problem
    • Step-wise model selection
      • To perform the joint optimization of model structures and state sequences
Outline
  • HMM-based speech synthesis system
  • Deterministic annealing EM (DAEM) algorithm
  • Step-wise model selection
  • Experiments
  • Conclusion & future work
Overview of HMM-based system
  • Training part
    • Speech signals from the speech database go through excitation parameter extraction and spectral parameter extraction
    • HMMs are trained from the extracted parameters and the labels
    • Output: context-dependent HMMs & duration models
  • Synthesis part
    • Text analysis converts the input TEXT into a label
    • Parameter generation from the HMMs produces excitation parameters and spectral parameters
    • Excitation generation feeds the synthesis filter, which outputs the synthesized speech
Base techniques
  • Hidden semi-Markov model (HSMM)
    • HMM with explicit state duration probability distributions
    • Estimate state output and duration probability distributions
  • STRAIGHT
    • A high-quality speech vocoding method
    • Spectrum, F0, and aperiodicity measures
  • Parameter generation considering global variance (GV)
    • Calculate GV features from speech regions only, excluding silences and pauses
    • Context-dependent GV models
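
The GV computation restricted to speech regions can be sketched as follows. This is a minimal illustration, not the NIT implementation: it assumes GV means the per-dimension variance of one utterance's parameter trajectory, and the function name `global_variance`, the boolean speech mask, and the toy data are all hypothetical.

```python
import numpy as np

def global_variance(features, is_speech):
    """Per-dimension variance of one utterance, computed over speech frames only.

    features:  (T, D) array of frame-level parameters (assumed layout)
    is_speech: (T,) boolean mask, False for silence and pause frames
    """
    features = np.asarray(features, dtype=float)
    mask = np.asarray(is_speech, dtype=bool)
    speech_frames = features[mask]
    if speech_frames.shape[0] == 0:
        raise ValueError("utterance contains no speech frames")
    # Population variance along the time axis, one value per feature dimension.
    return speech_frames.var(axis=0)

# Toy utterance: 6 frames, 2 dimensions; the first and last frames are silence.
feats = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 2.0],
                  [1.0, 4.0], [3.0, 4.0], [0.0, 0.0]])
mask = np.array([False, True, True, True, True, False])
gv = global_variance(feats, mask)
gv_with_silence = feats.var(axis=0)  # for comparison only
```

Including the silence frames changes the variance, which is one plausible reason for excluding them when estimating GV models.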
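EM algorithm
  • Maximum likelihood (ML) criterion
    • $\hat{\lambda} = \arg\max_{\lambda} P(O \mid \lambda) = \arg\max_{\lambda} \sum_{q} P(O, q \mid \lambda)$
    • $\lambda$: model parameters, $O$: training data, $q$: HMM state sequence
  • Expectation Maximization (EM) algorithm
    • E-step: $Q(\lambda^{(k)}, \lambda) = \sum_{q} P(q \mid O, \lambda^{(k)}) \log P(O, q \mid \lambda)$
    • M-step: $\lambda^{(k+1)} = \arg\max_{\lambda} Q(\lambda^{(k)}, \lambda)$
  • Suffers from the local maxima problem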

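DAEM algorithm
  • Posterior probability
    • $f(q \mid O, \lambda) = \dfrac{P(O, q \mid \lambda)^{\beta}}{\sum_{q'} P(O, q' \mid \lambda)^{\beta}}$
    • $\beta$: temperature parameter, with model parameters $\lambda$, training data $O$, and HMM state sequence $q$
  • Model update process
    • E-step: $Q_{\beta}(\lambda^{(k)}, \lambda) = \sum_{q} f(q \mid O, \lambda^{(k)}) \log P(O, q \mid \lambda)$
    • M-step: $\lambda^{(k+1)} = \arg\max_{\lambda} Q_{\beta}(\lambda^{(k)}, \lambda)$
    • Increase the temperature parameter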
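
The update loop can be illustrated on a much simpler model than an HMM. The sketch below swaps in a 1-D mixture of two Gaussians with equal weights and a fixed, known variance, and estimates only the means; that simplification, the linear temperature schedule, and the name `daem_means` are assumptions made for illustration. The counts of 10 temperature updates and 5 EM steps per temperature follow the training settings given later in the slides.

```python
import numpy as np

def daem_means(x, init_means, n_temperatures=10, em_steps=5, variance=1.0):
    """DAEM for the means of an equal-weight, fixed-variance 1-D Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    means = np.asarray(init_means, dtype=float).copy()
    for i in range(1, n_temperatures + 1):
        beta = i / n_temperatures  # assumed schedule: 0.1, 0.2, ..., 1.0
        for _ in range(em_steps):
            # E-step: responsibilities proportional to N(x | mean_k, variance)^beta.
            # The Gaussian normalizer is shared by all components and cancels.
            log_prob = -0.5 * (x[:, None] - means[None, :]) ** 2 / variance
            scaled = beta * log_prob
            scaled -= scaled.max(axis=1, keepdims=True)  # numerical stability
            resp = np.exp(scaled)
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: responsibility-weighted mean for each component.
            means = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return means

# Two clusters centred at -5 and +5; both initial means start inside one cluster.
x = np.concatenate([np.linspace(-6.0, -4.0, 50), np.linspace(4.0, 6.0, 50)])
means = daem_means(x, init_means=[4.0, 6.0])
```

At the final temperature (beta = 1) the E-step is the ordinary EM posterior, so the last five iterations are standard EM steps.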

Optimization of state sequence
  • Likelihood function in the DAEM algorithm
    • Figure: trellis of state output and state transition probabilities over time, shown at three stages
    • All state sequences have uniform probability
    • Change from uniform to sharp
    • Estimate reliable acoustic models
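
The change from a uniform to a sharp distribution over state sequences can be reproduced with a few lines. This is only a sketch: the four joint probabilities are hypothetical, and real systems compute these quantities with the forward-backward algorithm rather than by enumerating sequences.

```python
import numpy as np

def tempered_posterior(log_joint, beta):
    """Posterior over candidate state sequences, proportional to P(O, q)^beta."""
    scaled = beta * np.asarray(log_joint, dtype=float)
    scaled -= scaled.max()  # numerical stability before exponentiating
    p = np.exp(scaled)
    return p / p.sum()

# Hypothetical joint probabilities P(O, q) for four candidate state sequences.
log_joint = np.log([0.02, 0.10, 0.60, 0.28])
for beta in (0.0, 0.5, 1.0):
    print(beta, np.round(tempered_posterior(log_joint, beta), 3))
```

With beta = 0 every sequence receives the same probability; as beta approaches 1 the distribution concentrates on the most likely sequences.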

Problem of context clustering
  • Context-dependent model
    • Appropriate model structures are required
  • Decision tree based context clustering
    • Assumption: state occupancies are not changed
      • State occupancies depend on model structures
      • State sequences and model structures should be optimized simultaneously

  • Example decision-tree questions: Vowel? /a/? Silence?
Step-wise model selection
  • Gradually change the size of the decision trees
    • Perform joint optimization of model structures and state sequences
  • Minimum Description Length (MDL) criterion, defined in terms of
    • Tuning parameter
    • Amount of training data assigned to the root node
    • Number of nodes
    • Dimension of the feature vector
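
How the four quantities above can interact is sketched below, under stated assumptions rather than as the criterion from the slides: each node is modelled by one diagonal-covariance Gaussian, a split adds a penalty of (tuning parameter) x (feature dimension) x log(amount of data at the root), and hard frame counts stand in for state occupancies. The function names are hypothetical.

```python
import numpy as np

def gaussian_neg_log_likelihood(frames):
    """Negative ML log likelihood of frames under one diagonal-covariance Gaussian."""
    n, k = frames.shape
    var = frames.var(axis=0) + 1e-8  # small floor to avoid log(0)
    return 0.5 * n * (k * (1.0 + np.log(2.0 * np.pi)) + np.log(var).sum())

def mdl_accepts_split(parent, yes, no, root_count, alpha=1.0):
    """Accept a split when the likelihood gain exceeds the MDL penalty."""
    k = parent.shape[1]
    gain = (gaussian_neg_log_likelihood(parent)
            - gaussian_neg_log_likelihood(yes)
            - gaussian_neg_log_likelihood(no))
    penalty = alpha * k * np.log(root_count)  # assumed penalty for one extra leaf
    return bool(gain > penalty)

# A question that separates two distinct clusters of 1-D frames.
yes = np.linspace(-1.0, 1.0, 50)[:, None]
no = np.linspace(9.0, 11.0, 50)[:, None]
parent = np.vstack([yes, no])
useful_alpha4 = mdl_accepts_split(parent, yes, no, root_count=100, alpha=4.0)
useful_alpha100 = mdl_accepts_split(parent, yes, no, root_count=100, alpha=100.0)

# A question that splits one homogeneous cluster into two look-alike halves.
cluster = np.linspace(-1.0, 1.0, 100)[:, None]
useless = mdl_accepts_split(cluster, cluster[::2], cluster[1::2], root_count=100)
```

Under this form of the penalty, a larger tuning parameter rejects more splits and yields smaller trees, so decreasing it lets the trees grow step by step.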

Model training process
  1. Estimate monophone models (DAEM)
    • Number of temperature parameter updates: 10
    • Number of EM steps at each temperature: 5
  2. Select decision trees by the MDL criterion using the tuning parameter
  3. Estimate context-dependent models (EM)
    • Number of EM steps: 5
  4. Decrease the tuning parameter
    • The tuning parameter decreases as 4, 2, 1
  5. Repeat from step 2
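
The control flow of this procedure can be sketched as a driver loop. The three callbacks (`daem_step`, `select_trees`, `em_step`) are hypothetical placeholders for the actual estimation routines, and the linear temperature schedule is an assumption; only the iteration counts and the 4, 2, 1 sequence come from the slide.

```python
def run_training(daem_step, select_trees, em_step,
                 n_temperatures=10, daem_em_steps=5,
                 tuning_parameters=(4, 2, 1), cd_em_steps=5):
    """Drive the training procedure; the callbacks perform the real work."""
    # Monophone models: DAEM with 5 EM steps at each of 10 temperatures.
    for i in range(1, n_temperatures + 1):
        beta = i / n_temperatures  # assumed linear schedule ending at 1.0
        for _ in range(daem_em_steps):
            daem_step(beta)
    # Alternate tree selection and EM while the tuning parameter decreases.
    for alpha in tuning_parameters:
        select_trees(alpha)            # MDL-based decision-tree selection
        for _ in range(cd_em_steps):
            em_step()                  # re-estimate context-dependent models

# Record the order of operations with stub callbacks.
log = []
run_training(lambda beta: log.append(("daem", beta)),
             lambda alpha: log.append(("trees", alpha)),
             lambda: log.append(("em",)))
```

Re-running tree selection after each block of EM steps is what lets model structures and state sequences influence each other instead of being fixed in a single pass.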
Likelihood & model structure
  • Average log likelihood of monophone model
  • Number of leaf nodes
    • Phone set: Unilex (58 phonemes)
    • Number of leaf nodes (full-context): 6,175,466
Experimental results
  • Compared with the benchmark HMM-based system
    • The NIT system achieved the same performance
    • High intelligibility
  • Compared with the benchmark unit-selection system
    • Worse in speaker similarity
    • Better in intelligibility
Speech samples
  • Generates highly intelligible speech
  • Includes voiced/unvoiced errors
  • Feature extraction and excitation need improvement
Conclusion
  • NIT HMM-based speech synthesis system
    • DAEM algorithm
      • Overcome the local maxima problem
    • Step-wise model selection
      • Perform joint optimization of state sequences and model structures
    • Generates highly intelligible speech
  • Future work
    • Improve feature extraction and excitation
    • Investigate the schedule of temperature parameters and step-wise model selection