
Cairo University Faculty of Computers and Information

HMM Based Speech Synthesis. Presented by Ossama Abdel-Hamid Mohamed, Cairo University, Faculty of Computers and Information. Agenda: Speech Synthesis; HMM Based Speech Synthesis; Proposed System; Challenges.

Presentation Transcript


  1. Cairo University, Faculty of Computers and Information. HMM Based Speech Synthesis. Presented by Ossama Abdel-Hamid Mohamed

  2. Agenda • Speech Synthesis • HMM Based Speech Synthesis • Proposed System • Challenges

  3. Speech Synthesis • What is speech synthesis? • Generating human-like speech using computers. • Applications • Text to speech. • Conversation systems. • Speech-to-speech translation. • Concept to speech. • Systems built since the late 1970s. • MITalk 1979 • Klattalk 1980

  4. Speech Synthesis, Cont. • Challenges: • Intelligibility. • Naturalness. • Pleasantness. • Emotions.

  5. Speech Synthesis, Techniques • Formant based: rule based; difficult to make; machine-like. • Concatenative: instance based; based on a corpus; better quality; not flexible. • HMM based: statistical; based on a corpus; newest technique; more flexible.

  6. Agenda • Speech Synthesis • HMM Based Speech Synthesis • Proposed System • Challenges

  7. HMM Based Speech Synthesis Overview • HMMs have been used successfully in speech recognition. • In recognition: • In speech synthesis:
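
The formulas for this slide did not survive in the transcript. As a hedged reconstruction of the usual textbook formulation (the symbols $O$, $w$, and $\lambda_w$ are assumptions, not taken from the slide): recognition searches for the label sequence that best explains a given observation sequence, while synthesis searches for the observation sequence that is most likely under the model of a given label sequence.

```latex
\begin{align}
\text{Recognition:}\quad \hat{w} &= \arg\max_{w} \; P(O \mid \lambda_w)\, P(w) \\
\text{Synthesis:}\quad   \hat{O} &= \arg\max_{O} \; P(O \mid \lambda_w)
\end{align}
```

Here $O$ is the sequence of acoustic feature vectors, $w$ is the word or label sequence, and $\lambda_w$ is the HMM built for $w$.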

  8. HMM Based Speech Synthesis Overview, Cont. • Include delta and acceleration features to get smooth output.
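
To illustrate why delta and acceleration features give smooth output, here is a minimal sketch of maximum-likelihood parameter generation for a one-dimensional feature. It assumes diagonal covariances and simple three-point delta and acceleration windows; the window coefficients, the function names, and the use of NumPy are illustrative assumptions, not details from the slides.

```python
import numpy as np

def build_window_matrix(T):
    """Stack static, delta, and acceleration windows into a (3T x T) matrix W."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        # Clamp neighbours at the sequence boundaries.
        prev_t, next_t = max(t - 1, 0), min(t + 1, T - 1)
        W[3 * t, t] += 1.0             # static: c[t]
        W[3 * t + 1, next_t] += 0.5    # delta: 0.5 * (c[t+1] - c[t-1])
        W[3 * t + 1, prev_t] -= 0.5
        W[3 * t + 2, prev_t] += 1.0    # acceleration: c[t-1] - 2 c[t] + c[t+1]
        W[3 * t + 2, t] -= 2.0
        W[3 * t + 2, next_t] += 1.0
    return W

def generate_trajectory(means, variances):
    """Solve (W' S^-1 W) c = W' S^-1 mu for the static trajectory c.

    means, variances: arrays of shape (T, 3) with the per-frame Gaussian
    statistics of the static, delta, and acceleration features.
    """
    T = means.shape[0]
    W = build_window_matrix(T)
    precision = 1.0 / variances.reshape(-1)   # diagonal of S^-1
    mu = means.reshape(-1)
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * mu)
    return np.linalg.solve(A, b)
```

If the static means jump from 0 to 1 between two states while the delta and acceleration means are 0, the solved trajectory rises gradually instead of stepping. Making the delta and acceleration variances very large removes the constraint, and the trajectory falls back to the stepwise static means.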

  9. The Overall System • Training part: Speech Database -> F0 Extraction (f0) and Mel-Cepstral Analysis (mel-cepstrum); Text -> Text Analysis -> labels and context features; f0, mel-cepstrum, and labels -> HMM Training -> Models. • Synthesis part: Text -> Text Analysis -> labels and context features -> Parameters Generation (using the Models) -> mel-cepstrum and f0; f0 -> Excitation (pulse or noise) -> MLSA filter (driven by the mel-cepstrum) -> Speech.

  10. The Overall System (diagram repeated from slide 9, with annotations) • Modeled using MSD-HMM. • 25 mel-cepstral coefficients.

  11. The Overall System (diagram repeated from slide 9, with annotations on the models) • Context-dependent models. • Each model has 5 states.

  12. The Overall System (diagram repeated from slide 9)

  13. The Overall System (diagram repeated from slide 9, with an annotation on the excitation) • Each frame is either voiced or unvoiced.
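
The pulse-or-noise excitation in the diagram can be sketched as follows. This is a minimal illustration, assuming that an F0 value of 0 marks an unvoiced frame; the frame length, sample rate, and pulse scaling are assumptions chosen for the example, not values from the slides.

```python
import random

def make_excitation(f0_per_frame, frame_len=80, sample_rate=16000, seed=0):
    """Return a pulse train in voiced frames and white noise in unvoiced frames.

    f0_per_frame: F0 in Hz for each frame; 0 marks an unvoiced frame.
    """
    rng = random.Random(seed)
    excitation = []
    samples_to_next_pulse = 0.0
    for f0 in f0_per_frame:
        for _ in range(frame_len):
            if f0 > 0:
                period = sample_rate / f0
                if samples_to_next_pulse <= 0:
                    # Scale each pulse so the average power is about 1.
                    excitation.append(period ** 0.5)
                    samples_to_next_pulse += period
                else:
                    excitation.append(0.0)
                samples_to_next_pulse -= 1
            else:
                # Unvoiced frame: unit-variance Gaussian noise.
                excitation.append(rng.gauss(0.0, 1.0))
                samples_to_next_pulse = 0.0
    return excitation
```

For example, `make_excitation([200, 200, 0])` yields two voiced frames with one pulse every 80 samples, followed by one frame of noise. The hard switch between the two sources is exactly the per-frame voiced/unvoiced decision noted on this slide.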

  14. The Overall System (diagram repeated from slide 9)

  15. Advantages • Its voice characteristics can be easily modified. • It can be applied to various languages with little modification. • A variety of speaking styles or emotional speech can be synthesized using a small amount of speech data. • Techniques developed in ASR can be easily applied. • Its footprint is relatively small. • An HMM based TTS system produced the best results in the Blizzard Challenge.

  16. Agenda • Speech Synthesis • HMM Based Speech Synthesis • Proposed System • Challenges

  17. Problems we tried to solve • Marking each frame as either voiced or unvoiced degrades quality, because there are some unvoiced components in most voiced speech parts, and there are mixed-excitation phonemes. • The speech signal analysis/synthesis techniques and parameters used also degrade quality.

  18. Multi-Band Excitation • In MBE (Multi-Band Excitation), speech is divided into a number of frequency bands, and voicing is estimated in each band (17 bands were used).

  19. Mixed Excitation • In synthesis, periodic and noise excitations are mixed according to the voicing parameters.
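
A band-wise mix of periodic and noise excitation can be sketched as below for a single frame. This is only an illustration under stated assumptions: equal-width bands, voicing values between 0 and 1, and a linear weighting done in the FFT domain. The slides do not specify the band layout or the mixing rule, and the function name is hypothetical.

```python
import numpy as np

def mixed_excitation_frame(f0, band_voicing, frame_len=512,
                           sample_rate=16000, seed=0):
    """Mix a pulse train and white noise band by band for one frame.

    band_voicing: one value in [0, 1] per equal-width frequency band
    (1 = fully periodic, 0 = fully noise).
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(frame_len)

    period = int(round(sample_rate / f0))
    pulses = np.zeros(frame_len)
    pulses[::period] = np.sqrt(period)   # roughly unit average power

    pulse_spec = np.fft.rfft(pulses)
    noise_spec = np.fft.rfft(noise)

    # Map every FFT bin to the index of the band it falls in.
    n_bins = pulse_spec.shape[0]
    n_bands = len(band_voicing)
    band_of_bin = np.arange(n_bins) * n_bands // n_bins
    weights = np.asarray(band_voicing, dtype=float)[band_of_bin]

    # Voiced bands keep the periodic part, unvoiced bands keep the noise.
    mixed_spec = weights * pulse_spec + (1.0 - weights) * noise_spec
    return np.fft.irfft(mixed_spec, n=frame_len)
```

With 17 voicing values, as on the previous slide, setting the low bands to 1 and the high bands to 0 gives a frame that is periodic at low frequencies and noisy at high frequencies, which a single voiced/unvoiced flag cannot represent.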

  20. Spectral Envelope Estimation • Find values for a fixed number of samples. • Use a sinusoidal model for synthesis.
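
As a rough sketch of sinusoidal synthesis from a sampled spectral envelope, the code below sums one cosine per harmonic of F0 and reads each amplitude off the envelope. It assumes the envelope samples are evenly spaced from 0 Hz to the Nyquist frequency, uses linear interpolation, and sets all phases to zero; these choices and the function names are assumptions for illustration.

```python
import math

def envelope_at(freq_hz, envelope_samples, sample_rate=16000):
    """Linearly interpolate envelope samples spaced evenly from 0 Hz to Nyquist."""
    nyquist = sample_rate / 2.0
    rel = min(max(freq_hz / nyquist, 0.0), 1.0)
    pos = rel * (len(envelope_samples) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(envelope_samples) - 1)
    frac = pos - lo
    return (1.0 - frac) * envelope_samples[lo] + frac * envelope_samples[hi]

def synthesize_voiced_frame(f0, envelope_samples, n_samples=160,
                            sample_rate=16000):
    """Sum one zero-phase cosine per harmonic of f0 below the Nyquist frequency."""
    nyquist = sample_rate / 2.0
    n_harmonics = int(math.ceil(nyquist / f0)) - 1
    amps = [envelope_at(k * f0, envelope_samples, sample_rate)
            for k in range(1, n_harmonics + 1)]
    frame = []
    for n in range(n_samples):
        t = n / sample_rate
        value = 0.0
        for k, amp in enumerate(amps, start=1):
            value += amp * math.cos(2.0 * math.pi * k * f0 * t)
        frame.append(value)
    return frame
```

Because the envelope is stored as a fixed number of samples, the same representation works for any F0: the harmonic frequencies change, but the envelope lookup does not.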

  21. Modified System • Training part: Speech Database -> F0 Extraction (f0), Bands Voicing Detection (bands voicing), and Spectral Envelope Analysis (spectral envelope samples); Text -> Text Analysis -> labels and context features; all of these -> HMM Training -> Models. • Synthesis part: Text -> Text Analysis -> labels and context features -> Parameters Generation (using the Models) -> bands voicing, spectral envelope samples, and f0; Noise + STFT filter -> unvoiced speech; Harmonics Synthesis -> voiced speech; Bands Mixing -> Speech.

  22. Result • MOS scores

  23. Agenda • Speech Synthesis • HMM Based Speech Synthesis • Proposed System • Challenges

  24. Other Challenges • Speech is overly smoothed: use global variance. • Modeling accuracy (the system uses the same modeling as recognition): hidden semi-Markov models (duration), trajectory HMMs, minimum generation error training. • More state clusters and use of acoustic context (under research).

  25. More State Clusters (diagram labels: Previous, Current, Next, …) • Instead of computing one Gaussian per state, we store all occurrences and record the context of each occurrence. • At synthesis, we get the best sequence using dynamic programming.
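
The dynamic-programming search over stored occurrences could look like the sketch below. The slide does not define the costs, so `target_cost` (how well an occurrence's recorded context fits) and `join_cost` (how well two consecutive occurrences connect) are hypothetical callbacks supplied by the caller.

```python
def best_sequence(candidates, target_cost, join_cost):
    """Pick one stored occurrence per state with minimal total cost.

    candidates: list with one entry per state, each a list of occurrences.
    target_cost(state_index, cand): cost of using cand for that state.
    join_cost(prev_cand, cand): cost of placing cand after prev_cand.
    """
    # best[i][j] = (cheapest cost of a path ending in candidates[i][j], back pointer)
    best = [[(target_cost(0, c), None) for c in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for c in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, c), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, c), back))
        best.append(row)

    # Trace back from the cheapest final occurrence.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1], total
```

The search costs on the order of the number of states times the square of the number of occurrences per state, so in practice the candidate lists would likely need pruning.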

  26. Thank You
