
Voice Conversion



Presentation Transcript


  1. Voice Conversion Dr. Elizabeth Godoy Speech Processing Guest Lecture December 11, 2012

  2. E.Godoy Biography
  • United States (Rhode Island): Native
  • Hometown: Middletown, RI
  • Undergrad & Masters at MIT (Boston area)
  • France (Lannion): 2007-2011
  • Worked on my PhD at Orange Labs
  • Iraklio: Current
  • Work at FORTH with Prof. Stylianou

  3. Professional Background
  • B.S. & M.Eng. from MIT, Electrical Engineering
  • Specialty in Signal Processing
  • Underwater acoustics: target physics, environmental modeling, torpedo homing
  • Antenna beamforming (Masters): wireless networks
  • PhD in Signal Processing at Orange Labs
  • Speech Processing: Voice Conversion
  • Speech Synthesis Team (Text-to-Speech)
  • Focus on Spectral Envelope Transformation
  • Post-Doctoral Research at FORTH
  • LISTA: Speech in Noise & Intelligibility
  • Analyses of Human Speaking Styles (e.g. Lombard, Clear)
  • Speech Modifications to Improve Intelligibility

  4. Today’s Lecture: Voice Conversion
  • Introduction to Voice Conversion
  • Speech Synthesis Context (TTS)
  • Overview of Voice Conversion
  • Spectral Envelope Transformation in VC
  • Standard: Gaussian Mixture Model
  • Proposed: Dynamic Frequency Warping + Amplitude Scaling
  • Conversion Results
  • Objective Metrics & Subjective Evaluations
  • Sound Samples
  • Summary & Conclusions

  5. Today’s Lecture: Voice Conversion
  • Introduction to Voice Conversion
  • Speech Synthesis Context (TTS)
  • Overview of Voice Conversion

  6. Voice Conversion (VC)
  • Transform the speech of a (source) speaker so that it sounds like the speech of a different (target) speaker.
  (Cartoon: the source speaker says "This is awesome!"; after conversion, the target speaker hears "This is awesome!" and reacts: "He sounds like me!")

  7. Context: Speech Synthesis
  • Increase in applications using speech technologies
  • Cell phones, GPS, video gaming, customer service apps…
  • Information communicated through speech!
  • Text-to-Speech (TTS) Synthesis
  • Generate speech from a given text
  (Cartoon examples of synthesized speech: "Insert your card.", "Turn left!", "This is Abraham Lincoln speaking…", "Next Stop: Lannion")

  8. Text-to-Speech (TTS) Systems
  • TTS Approaches
  • Concatenative: speech synthesized from recorded segments
  • Unit-Selection: parts of speech chosen from corpora & strung together
  → High-quality synthesis, but need to record & process corpora
  • Parametric: speech generated from model parameters
  • HMM-based: speaker models built from speech using linguistic info
  → Limited quality due to simplified speech modeling & statistical averaging
  • Concatenative or Parametric?

  9. Text-to-Speech (TTS) Example (audio sample)

  10. Voice Conversion: TTS Motivation
  • Concatenative speech synthesis
  • High-quality speech
  • But, need to record & process a large corpus for each voice
  • Voice Conversion
  • Create different voices by speech-to-speech transformation
  • Focus on acoustics of voices

  11. What gives a voice an identity?
  • “Voice”: notion of identity (voice rather than speech)
  • Characterize speech based on different levels
  • Segmental
  • Pitch – fundamental frequency
  • Timbre – distinguishes between different types of sounds
  • Supra-Segmental
  • Prosody – intonation & rhythm of speech

  12. Goals of Voice Conversion
  • Synthesize High-Quality Speech
  • Maintain quality of source speech (limit degradations)
  • Capture Target Speaker Identity
  • Requires learning mappings between source & target features
  → Difficult task: significant modifications of the source speech are needed, which risk severely degrading speech quality…

  13. Stages of Voice Conversion
  • Key Parameter: the spectral envelope (related to timbre)
  • Three stages: 1) Analysis, 2) Learning, 3) Transformation
  (Diagram:)
  • ANALYSIS: parameter extraction from the source & target speech corpora
  • LEARNING: generate acoustic feature spaces; establish mappings between source & target parameters
  • TRANSFORMATION: classify source parameters in the feature space; apply the transformation function; synthesis → converted speech
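The three stages above can be sketched as a minimal pipeline skeleton. All function names and the toy features below are illustrative, not from the lecture:

```python
# Minimal skeleton of the three VC stages: analysis, learning, transformation.
# The "learning" here is a toy lookup table; real systems fit a statistical model.

def extract_parameters(frames):
    """Analysis: one spectral-envelope feature vector per frame."""
    return [tuple(f) for f in frames]

def learn_mapping(source_feats, target_feats):
    """Learning: associate aligned source/target feature pairs."""
    return dict(zip(source_feats, target_feats))

def transform(mapping, source_feats):
    """Transformation: map each source vector to its learned target."""
    return [mapping[x] for x in source_feats]

# Toy run: two "frames" with 2-D features
src = extract_parameters([[1.0, 2.0], [3.0, 4.0]])
tgt = extract_parameters([[1.5, 2.5], [3.5, 4.5]])
mapping = learn_mapping(src, tgt)
converted = transform(mapping, src)
```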

  14. Today’s Lecture: Voice Conversion
  • Introduction to Voice Conversion
  • Speech Synthesis Context (TTS)
  • Overview of Voice Conversion
  • Spectral Envelope Transformation in VC
  • Standard: Gaussian Mixture Model
  • Proposed: Dynamic Frequency Warping + Amplitude Scaling

  15. The Spectral Envelope
  • Spectral Envelope: curve approximating the DFT magnitude
  • Related to voice timbre; plays a key role in many speech applications:
  • Coding, Recognition, Synthesis, Voice transformation/conversion
  • Voice Conversion: important for both speech quality and voice identity

  16. Spectral Envelope Parameterization
  • Two common methods:
  1) Cepstrum
  – Discrete Cepstral Coefficients
  – Mel-Frequency Cepstral Coefficients (MFCC) → change the frequency scale to reflect bands of human hearing
  2) Linear Prediction (LP)
  – Line Spectral Frequencies (LSF)
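A minimal sketch of cepstral envelope estimation: take the log DFT magnitude, go to the cepstral (quefrency) domain, keep only the low-quefrency coefficients, and return to frequency. The liftering order and FFT size below are illustrative:

```python
import numpy as np

def cepstral_envelope(frame, n_coeffs=20, n_fft=512):
    """Smoothed log spectral envelope via low-quefrency liftering."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) + 1e-10  # avoid log(0)
    log_spec = np.log(spec)
    cep = np.fft.irfft(log_spec)            # real cepstrum
    cep[n_coeffs:-n_coeffs] = 0.0           # keep low quefrencies (and their mirror)
    return np.fft.rfft(cep).real            # back to a smoothed log envelope

# Toy frame: a mix of two sinusoids
t = np.arange(400)
frame = np.sin(2 * np.pi * 0.05 * t) + 0.5 * np.sin(2 * np.pi * 0.12 * t)
env = cepstral_envelope(frame)              # one value per rfft bin
```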

  17. Standard Voice Conversion
  • Parallel corpora: source & target utter the same sentences
  • Parameters are spectral features (e.g. vectors of cepstral coefficients)
  • Alignment of speech frames in time
  → Standard: Gaussian Mixture Model
  (Diagram: extract source & target parameters from the parallel corpora → align source & target frames → LEARNING → TRANSFORMATION → synthesis → converted speech)
  • Focus: Learning & Transforming the Spectral Envelope
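The time-alignment step is typically done with dynamic time warping over per-frame feature vectors; a minimal sketch with toy 1-D features and a Euclidean frame cost (not the lecture's exact implementation):

```python
import numpy as np

def dtw_align(X, Y):
    """Return (i, j) index pairs aligning frames of X to frames of Y."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)     # accumulated cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

X = np.array([[0.0], [1.0], [2.0], [3.0]])          # "source" frames
Y = np.array([[0.0], [1.0], [1.0], [2.0], [3.0]])   # "target" frames
pairs = dtw_align(X, Y)
```

The path visits every frame of both sequences, so one source frame can align to several target frames, which is exactly where the one-to-many problem discussed later comes from.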

  18. Today’s Lecture: Voice Conversion
  • Introduction to Voice Conversion
  • Speech Synthesis Context (TTS)
  • Overview of Voice Conversion
  • Spectral Envelope Transformation in VC
  • Standard: Gaussian Mixture Model
  • Formulation
  • Limitations
  • Acoustic mappings between source & target parameters
  • Over-smoothing of the spectral envelope
  • Proposed: Dynamic Frequency Warping + Amplitude Scaling

  19. Gaussian Mixture Model for VC
  • Origins:
  • Evolved from "fuzzy" Vector Quantization (i.e. VQ with "soft" classification)
  • Originally proposed by [Stylianou et al.; 98]
  • Joint learning of the GMM (most common) by [Kain et al.; 98]
  • Underlying principle:
  • Exploit joint statistics exhibited by aligned source & target frames
  • Methodology:
  • Represent distributions of spectral feature vectors as a mixture of Q Gaussians
  • Transformation function then based on the MMSE criterion

  20. GMM-based Spectral Transformation
  1) Align N spectral feature vectors in time (discrete cepstral coefficients): joint vectors z_n = [x_n; y_n]
  2) Represent the PDF of the vectors as a mixture of Q multivariate Gaussians:
  p(z) = Σ_q α_q N(z; μ_q^z, Σ_q^z)
  • Learn via Expectation Maximization (EM) on Z
  3) Transform source vectors using a weighted mixture of the Maximum Likelihood (ML) estimator for each component:
  ŷ = Σ_q p_q(x) [ μ_q^y + Σ_q^yx (Σ_q^xx)^-1 (x - μ_q^x) ]
  • p_q(x): probability that the source frame belongs to the acoustic class described by component q (calculated in Decoding)

  21. GMM-Transformation Steps
  1) Source frame x_n → want to estimate the target vector ŷ_n
  2) Classify → calculate p_q(x_n): probability that the source frame belongs to the acoustic class described by component q (Decoding step)
  3) Apply the transformation function: a weighted sum of the ML estimator for each class
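These steps can be illustrated with a toy 1-D, two-component joint GMM. All parameter values below are hand-set for illustration, not learned from data:

```python
import numpy as np

# Toy joint-GMM conversion: posterior-weighted sum of per-class estimators.
weights = np.array([0.5, 0.5])   # mixture priors alpha_q
mu_x    = np.array([0.0, 4.0])   # source means
mu_y    = np.array([1.0, 6.0])   # target means
var_x   = np.array([1.0, 1.0])   # source variances (Sigma_xx)
cov_yx  = np.array([0.8, 0.8])   # source-target cross-covariances (Sigma_yx)

def posterior(x):
    """Step 2 (Decoding): p_q(x) for each component q."""
    lik = weights * np.exp(-0.5 * (x - mu_x) ** 2 / var_x) / np.sqrt(2 * np.pi * var_x)
    return lik / lik.sum()

def convert(x):
    """Step 3: weighted sum of the per-class estimators."""
    p = posterior(x)
    est = mu_y + cov_yx / var_x * (x - mu_x)   # per-class estimator
    return float(np.dot(p, est))

y0 = convert(0.0)   # frame near class 1: pulled toward mu_y[0] = 1.0
y1 = convert(4.0)   # frame near class 2: pulled toward mu_y[1] = 6.0
```

Note how a frame sitting exactly on a class's source mean is mapped to that class's target mean; frames between classes get a blend, which already hints at the over-smoothing discussed next.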

  22. Acoustically Aligning Source & Target Speech
  • First question: are the acoustic events from the source & target speech appropriately associated?
  • Time alignment → acoustic alignment???
  • One-to-Many Problem
  • Occurs when a single acoustic event of the source is aligned to multiple acoustic events of the target [Mouchtaris et al.; 07]
  • Problem: cannot distinguish distinct target events given only source information
  • Example: the source has one single event A_s while the target has two distinct events B_t, C_t → joint frames [A_s; B_t] and [A_s; C_t] in the acoustic space

  23. Cluster by Phoneme (“Phonetic GMM”)
  • Motivation: eliminate mixtures to alleviate the one-to-many problem!
  • Introduce contextual information
  • Phoneme (unit of speech describing a linguistic sound): /a/, /e/, /n/, etc.
  • Formulation: cluster frames according to phoneme label
  • Each Gaussian class q then corresponds to a phoneme
  • Outcome: error indeed decreases using classification by phoneme
  • Specifically, errors from one-to-many mappings are reduced!

  24. Still GMM Limitations for VC
  • Unfortunately, converted speech quality is poor (original vs. converted)
  • What about the joint statistics in the GMM?
  • Difference between GMM and VQ-type (no joint statistics) transformation?
  → No significant difference! (originally shown in [Chen et al.; 03])

  25. "Over-Smoothing" in GMM-based VC
  • The Over-Smoothing Problem (explains the poor quality!)
  • Transformed spectral envelopes using the GMM are "overly smooth"
  • Loss of spectral details → converted speech sounds "muffled", with a "loss of presence"
  • Cause (goes back to those joint statistics…):
  • Low inter-speaker parameter correlation (weak statistical link exhibited by aligned source & target frames), so transformed envelopes collapse toward the class mean
  → Frame-level alignment of source & target parameters is not effective!

  26. “Spectrogram” Example (GMM over-smoothing)
  (Figure: spectral envelopes across a sentence.)

  27. Today’s Lecture: Voice Conversion
  • Introduction to Voice Conversion
  • Speech Synthesis Context (TTS)
  • Overview of Voice Conversion
  • Spectral Envelope Transformation in VC
  • Standard: Gaussian Mixture Model
  • Proposed: Dynamic Frequency Warping + Amplitude Scaling
  • Related work
  • DFWA description

  28. Dynamic Frequency Warping (DFW)
  • Dynamic Frequency Warping [Valbret; 92]
  • Warp the source spectral envelope in frequency to resemble that of the target
  • Alternative to the GMM → no transformation with joint statistics!
  • Maintains spectral details → higher-quality speech (+)
  • Spectral amplitude not adjusted explicitly → poor identity transformation (-)

  29. Spectral Envelopes of a Frame: DFW (figure)

  30. Hybrid DFW-GMM Approaches
  • Goal: adjust spectral amplitude after DFW
  • Combine DFW- and GMM-transformed envelopes
  • Rely on arbitrary smoothing factors [Toda; 01], [Erro; 10]
  • Impose a compromise between:
  • maintaining spectral details (via DFW)
  • respecting average trends (via GMM)

  31. Proposed Approach: DFWA
  • Observation: we do not need the GMM to adjust amplitude!
  • Alternatively, we can use a simple amplitude correction term
  → Dynamic Frequency Warping with Amplitude Scaling (DFWA)

  32. Levels of Source-Target Parameter Mappings
  • (1) Frame-level: Standard (GMM)
  • (3) Class-level: Global (DFWA)

  33. Recall Standard Voice Conversion
  • Parallel corpora: source & target utter the same sentences
  • Alignment of individual source & target frames in time
  • Mappings between acoustic spaces determined at the frame level
  (Diagram: extract source & target parameters from the parallel corpora → align source & target frames → LEARNING → TRANSFORMATION → synthesis → converted speech)

  34. DFW with Amplitude Scaling (DFWA)
  • Associate source & target parameters based on class statistics
  → Unlike typical VC, no frame alignment is required in DFWA!
  • Steps: define acoustic classes → DFW → amplitude scaling
  (Diagram: extract source & target parameters from the corpora → clustering → DFW → amplitude scaling → synthesis → converted speech)

  35. Defining Acoustic Classes
  • Acoustic classes built using clustering, which serves two purposes:
  • Classify individual source or target frames in the acoustic space
  • Associate source & target classes
  • Choice of clustering approach depends on the available information:
  • Acoustic information:
  • With aligned frames: joint clustering (e.g. joint GMM or VQ)
  • Without aligned frames: independent clustering & associating classes (e.g. with closest means)
  • Contextual information: use symbolic information (e.g. phoneme labels)
  • Outcome: q = 1, 2, …, Q acoustic classes in a one-to-one correspondence

  36. DFW Estimation
  • Compares distributions of observed source & target spectral envelope peak frequencies (e.g. formants)
  • Global vision of spectral peak behavior (for each class)
  • Only peak locations (not amplitudes) are considered
  • DFW function: piecewise linear
  • Intervals defined by aligning maxima in the spectral peak frequency distributions
  • Dijkstra algorithm: minimize the sum of absolute differences between the target & warped source distributions

  37. DFW Estimation
  • Clear global trends (peak locations)
  • Avoid sporadic frame-to-frame comparisons
  • The Dijkstra algorithm selects the anchor pairs
  • DFW statistically aligns the most probable spectral events (peak locations)
  • Amplitude scaling then adjusts the differences between the target & warped source spectral envelopes
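Once estimated, a piecewise-linear warp like this can be applied to an envelope by interpolation. A sketch with hypothetical anchor frequencies (in DFWA the anchors come from aligning the peak-frequency distributions, which is omitted here):

```python
import numpy as np

def warp_envelope(env, src_anchors, tgt_anchors):
    """Piecewise-linear frequency warp: move energy at src_anchors to tgt_anchors."""
    n = len(env)
    freqs = np.arange(n)
    # For each output bin, find which source-axis frequency it should read from
    read_from = np.interp(freqs, tgt_anchors, src_anchors)
    return np.interp(read_from, freqs, env)

env = np.zeros(100)
env[30] = 1.0                                   # a single "formant" at bin 30
warped = warp_envelope(env, [0, 30, 99], [0, 40, 99])   # shift it to bin 40
```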

  38. Spectral Envelopes of a Frame (figure)

  39. Amplitude Scaling
  • Goal of amplitude scaling: adapt the frequency-warped source envelopes to better resemble those of the target
  • Estimation uses statistics of the target and warped source data
  • For class q, the amplitude scaling function is defined by the difference in average log spectra
  • Comparison between source & target data strictly at the acoustic class level
  • No arbitrary weighting or smoothing!
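A minimal sketch of the class-level correction, assuming the envelopes are stored as log-spectral vectors (the numbers below are toy values):

```python
import numpy as np

# Log envelopes of the frames belonging to one acoustic class q:
warped_src = np.array([[0.0, 2.0, 1.0],
                       [2.0, 4.0, 3.0]])   # warped source frames
target     = np.array([[1.0, 3.0, 0.0],
                       [3.0, 5.0, 2.0]])   # target frames

# Class-q amplitude scaling: difference of average log spectra, per bin
scaling = target.mean(axis=0) - warped_src.mean(axis=0)

# Apply the same correction to every warped source frame of the class
converted = warped_src + scaling
```

After the correction, the class-average converted envelope matches the class-average target envelope exactly, while each frame keeps its own deviation from the mean.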

  40. DFWA Transformation
  • Transformation (for source frame n in class q), in the discrete cepstral domain (log spectrum parameters):
  • converted envelope = average target envelope + (warped source frame - average warped source envelope)
  • Average target envelope respected ("unbiased estimator") → captures timbre!
  • Spectral details maintained: difference between the frame realization & the class average → ensures quality!
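The slide's equation was an image in the original deck; a plausible reconstruction consistent with the bullet text (the symbols are assumptions: tilde denotes a frequency-warped source cepstrum, bars denote class-q averages):

```latex
\hat{c}_n \;=\; \underbrace{\bar{c}^{\,y}_{q}}_{\text{average target envelope}}
\;+\; \underbrace{\bigl(\tilde{c}^{\,x}_{n} - \bar{\tilde{c}}^{\,x}_{q}\bigr)}_{\text{frame detail about the class average}}
```

The first term carries the target's average timbre for the class; the second preserves the frame-level spectral details of the (warped) source.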

  41. Spectral Envelopes of a Frame (figure)

  42. Spectral Envelopes across a Sentence (figure, with zoom)

  43. Spectral Envelopes across a Sound
  • DFWA looks like natural speech
  • We can see the warping appropriately shifting source events
  • Overall, DFWA more closely resembles the target
  • Important: examining the spectral envelope's evolution in time is very informative!
  • The poor quality of the GMM is visible right away (less evident within a single frame)

  44. Today’s Lecture: Voice Conversion
  • Introduction to Voice Conversion
  • Speech Synthesis Context (TTS)
  • Overview of Voice Conversion
  • Spectral Envelope Transformation in VC
  • Standard: Gaussian Mixture Model
  • Proposed: Dynamic Frequency Warping + Amplitude Scaling
  • Conversion Results
  • Objective Metrics & Subjective Evaluations
  • Sound Samples

  45. Formal Evaluations
  • Speech Corpora & Analysis
  • CMU ARCTIC, US English speakers (2 male, 2 female)
  • Parallel annotated corpora, speech sampled at 16 kHz
  • 200 sentences for learning, 100 sentences for testing
  • Speech analysis & synthesis: Harmonic model
  • All spectral envelopes → discrete cepstral coefficients (order 40)
  • Spectral envelope transformation methods:
  • Phonetic & Traditional GMMs (diagonal covariance matrices)
  • DFWA (classification by phoneme)
  • DFWE (hybrid DFW-GMM with energy correction, E)
  • source (no transformation)

  46. Objective Metrics for Evaluation
  • Mean Squared Error (standard) alone is not sufficient!
  • Does not adequately indicate converted speech quality
  • Does not indicate variance in the transformed data
  • Variance Ratio (VR)
  • Average ratio of the transformed-to-target data variance
  • More global indication of behavior:
  • Is the transformed data varying from the mean?
  • Does the transformed data behave like the target data?
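The two metrics can be sketched as follows; the synthetic data illustrates why a low MSE can hide collapsed variance (the data and shrinkage factor are made up for the demonstration):

```python
import numpy as np

def mse(converted, target):
    """Standard mean squared error over all frames and dimensions."""
    return float(np.mean((converted - target) ** 2))

def variance_ratio(converted, target):
    """Per-dimension transformed-to-target variance ratio, averaged."""
    return float(np.mean(converted.var(axis=0) / target.var(axis=0)))

rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(1000, 4))         # "target" features

shrunk = 0.5 * target                                  # over-smoothed toward the mean
noisy  = target + rng.normal(0.0, 0.5, target.shape)   # full variance, but noisy

# Both estimates have comparable MSE (~0.25), yet VR exposes the
# collapsed variance of `shrunk` (0.25) versus `noisy` (above 1).
vr_shrunk = variance_ratio(shrunk, target)
vr_noisy = variance_ratio(noisy, target)
```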

  47. Objective Results
  • GMM-based methods yield the lowest MSE but no variance
  • Confirms over-smoothing
  • DFWA yields higher MSE but more natural variations in speech
  • The VR of DFWA comes directly from variations in the warped source envelopes (not model adaptations!)
  • Hybrid (DFWE) falls in between DFWA & the GMM (as expected)
  • Compared to DFWA, the VR is notably decreased after the energy correction filter
  • Different indications from MSE & VR: both important, complementary

  48. DFWA for VC with Nonparallel Corpora
  • Parallel corpora: a big constraint!
  • Limits the data that can be used in VC
  • DFWA for nonparallel corpora
  • Nonparallel learning data: 200 disjoint sentences
  • DFWA is equivalent for parallel or nonparallel corpora!

  49. Implementation in a VC System
  • Converted speech synthesis
  • Pitch modification using TD-PSOLA (Time-Domain Pitch-Synchronous Overlap-Add)
  • Pitch modification factor determined by the classic linear transformation
  • Frame synthesis
  • Voiced frames:
  • harmonic amplitudes from the transformed spectral envelopes
  • harmonic phases from nearest-neighbor source sampling
  • Unvoiced frames: white noise through an AR filter
  (Diagram: source speech → speech analysis (harmonic) → spectral envelope transformation → speech frame synthesis (pitch modification) → TD-PSOLA (pitch modification) → converted speech)
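The "classic linear transformation" for pitch is commonly a mean/variance normalization in the log-f0 domain; the slide does not give the exact formula, so the sketch below (with illustrative speaker statistics) is an assumption:

```python
import math

# Illustrative log-f0 statistics for a source and a target speaker
src_mean, src_std = math.log(120.0), 0.20
tgt_mean, tgt_std = math.log(210.0), 0.25

def convert_f0(f0):
    """Map a source f0 (Hz) so its log-f0 matches the target's mean/std."""
    z = (math.log(f0) - src_mean) / src_std
    return math.exp(tgt_mean + tgt_std * z)

f0_new = convert_f0(120.0)   # the source mean maps to the target mean
factor = f0_new / 120.0      # per-frame pitch-modification factor for TD-PSOLA
```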

  50. Subjective Testing
  • Listeners evaluate:
  • Quality of the speech
  • Similarity of the voice to the target speaker
  • Mean Opinion Score (MOS)
  • MOS scale from 1-5
