Voice Conversion – Part I

By: Rafi Lemansky

Noam Elron

Under the supervision of: Dr. Yizhar Lavner

Contents

  • Review – Project Aims.
  • Abstract & Conversion Scheme.

Building Blocks

  • The HNM Parametric Model for Speech.
  • Prosodic Modifications using the HNM.
  • Fuzzy Vector Quantization.
  • Phoneme Separation & Alignment.
  • Integration of the System.
Project Aims

[Diagram: Input voice → Voice Conversion → Output voice]

Converting speech into another speaker’s voice, based on offline training.

The emphasis is on:

  • Good output voice quality
  • Robustness
  • Computational Complexity
Conversion Scheme - Training

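Block diagram:

Source branch:

  • Training signal (source)
  • Parametric representation
  • Coarse sentence segmentation and silence removal
  • Segmentation into speech events
  • Codebook construction

Histograms of additional characteristics

Alignment between the branches:

  • Coarse alignment: “word” against “word”
  • Fine alignment: event against event

Target branch:

  • Training signal (target)
  • Parametric representation
  • Coarse sentence segmentation and silence removal
  • Segmentation into speech events
  • Creation of an induced clustering of target events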

Training Output:

Quantized vector space for each speaker’s “speech events” with 1-to-1 transformation table + histograms of personal characteristics (pitch, length of phonemes, etc.)

Conversion Scheme - Conversion


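Block diagram:

  • Source speaker
  • Parametric representation (HNM)
  • Segmentation into speech events
  • Selection of target events and additional characteristics
  • Modification of temporal characteristics
  • Synthesis (target speaker)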

The Benefits of a Parametric Representation of Speech

Allows for relatively simple:

  • Comparison between speech events.
  • Manipulation of recorded speech events.

The requirements of a parameterization scheme are:

  • Quality of synthesis.
  • Low computational needs.
  • Low bit rate (BPS).
The Harmonics Noise Model

During pitch interval (t_i, t_{i+1}):

The Harmonics Noise Model


  • Analysis is based on pairs of pitch-frame intervals, using harmonics and pseudo-harmonics.
  • The signal is divided into a low-frequency pseudo-harmonic part and a mainly high-frequency noise part.
  • The noise part is modeled (for pairs of pitch cycles) using LPC.
  • Both parts are analyzed, and later reconstructed, entirely in the time domain.
  • Parametric time stretch and pitch shift are possible (crucial for voice conversion).

Pseudo-Harmonic Analysis
  • Determine the pitch and divide the signal into pitch frames and unvoiced sections.
  • Split unvoiced sections into frames of at most 10 ms.
  • For voiced frames: determine K(t_i), the number of harmonics in the cycle.
  • Using the pseudo-inverse method, minimize:
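
The pseudo-inverse step can be illustrated with a minimal sketch. It assumes a plain harmonic model (cosine and sine terms at integer multiples of the pitch) rather than the slides' pseudo-harmonic criterion, which is not reproduced in this transcript. The function name `fit_harmonics` and the test signal are hypothetical.

```python
import numpy as np

def fit_harmonics(frame, fs, f0, num_harmonics):
    """Least-squares fit of cosine/sine amplitudes at multiples of f0.

    Assumption: a purely harmonic model; the slides' pseudo-harmonic
    error criterion is not available and is not reproduced here.
    """
    t = np.arange(len(frame)) / fs
    k = np.arange(1, num_harmonics + 1)
    phase = 2.0 * np.pi * f0 * np.outer(t, k)          # shape (samples, harmonics)
    basis = np.hstack([np.cos(phase), np.sin(phase)])  # model matrix
    coeffs = np.linalg.pinv(basis) @ frame             # pseudo-inverse solution
    residual = frame - basis @ coeffs                  # what the harmonics miss
    return coeffs[:num_harmonics], coeffs[num_harmonics:], residual

# Synthetic voiced frame: four pitch periods of a 200 Hz signal at 8 kHz.
fs, f0 = 8000, 200.0
t = np.arange(160) / fs
frame = 1.0 * np.cos(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
cos_amp, sin_amp, residual = fit_harmonics(frame, fs, f0, num_harmonics=3)
```

In an HNM-style analysis the residual of such a fit is what would be handed to the noise model.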
Pseudo-Harmonic Analysis – Determining K(t_i)
  • For every harmonic-centered segment

check whether the local maximum is high, sharp and centered enough.

  • If it is, mark a harmonic.
  • The highest harmonic is at
Pseudo-Harmonic Analysis – Determining K(t_i)

  • Take an analysis frame.
  • Concatenate m frames (to highlight periodicity).
  • Zero-pad (to increase spectral resolution).
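
The three steps above can be sketched as follows. The Hann window, the parameter names `repeats` and `pad_factor`, and the test tone are assumptions made for illustration; the slides do not state them.

```python
import numpy as np

def periodicity_spectrum(frame, repeats=4, pad_factor=8):
    """Magnitude spectrum of a frame tiled `repeats` times and zero-padded."""
    tiled = np.tile(frame, repeats)             # concatenation highlights periodicity
    n_fft = pad_factor * len(tiled)             # zero padding refines the frequency grid
    windowed = tiled * np.hanning(len(tiled))   # window choice is an assumption
    return np.abs(np.fft.rfft(windowed, n=n_fft)), n_fft

fs = 8000
frame = np.cos(2 * np.pi * 200.0 * np.arange(40) / fs)  # one 5 ms pitch period
spectrum, n_fft = periodicity_spectrum(frame)
peak_hz = np.argmax(spectrum) * fs / n_fft               # location of the strongest peak
```

Peaks of this spectrum near multiples of the pitch are the candidates that the "high, sharp and centered" test would then accept or reject.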

Noise Analysis

For each pitch frame, find A and w so that:

where

  • A(t_i, z) is a normalized all-pole filter,
  • n(t) is normalized white Gaussian noise (WGN),
  • w(t_i) is the local energy of the noise.

is estimation of Spectral Probability Density
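
One common way to obtain a normalized all-pole filter and a noise energy for a frame is the autocorrelation method of LPC. The sketch below uses the Levinson-Durbin recursion; it is an assumption that the project used this particular estimator, and the name `lpc_autocorrelation` and the AR(1) test signal are hypothetical.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.

    Returns (a, err): a[0] == 1 and a holds the coefficients of the
    all-pole denominator A(z); err is the prediction-error power.
    """
    n = len(frame)
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)]) / n
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]          # update earlier coefficients
        a[i] = k
        err *= 1.0 - k * k                           # shrink the residual power
    return a, err

# Synthetic "noise part": white Gaussian noise through a one-pole filter.
rng = np.random.default_rng(0)
excitation = rng.standard_normal(20000)
x = np.zeros_like(excitation)
for i in range(1, len(x)):
    x[i] = 0.9 * x[i - 1] + excitation[i]
a, err = lpc_autocorrelation(x, order=2)
```

Here `a` plays the role of A(t_i, z) and `err` that of the local noise energy w(t_i) for a single frame.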

Advantages of the HNM

Whereas LPC synthesis is based on models of speech production, HNM utilizes the characteristics of human hearing.


Disadvantages:

  • Computation time
  • High bit rate (BPS)

HNM Results – “Oak is Strong…”

Fs = 8192 Hz, Bit Depth = 8

[Figure panels: Original, Harmonic, Noise, Reconstructed]


The Phoneme EE from “is”

Note: Different scale

More Results

[Figure panels: Original, Harmonic, Noise, Reconstructed]

Fs = 16000 Hz, Bit Depth = 16

More Results

Fs = 8000 Hz, Bit Depth = 8

[Figure panels: Original, Harmonic, Noise, Reconstructed]

Fs = 16000 Hz, Bit Depth = 16

Integration of the HNM into the Conversion Scheme (reminder)


  • Implementing pitch changes and time stretches.
  • Creating an HNM codebook for the target speaker’s phonemes.
  • Constructing a single signal from concatenated phonemes.

Partial HNM Synthesis I

Pseudo-harmonics are not well understood, and therefore do not lend themselves to manipulation.

For the remainder of the project, synthesis will use only real harmonics.

Although the MSE for “harmonic-only analysis” is smaller, speech quality is lower than with regular pseudo-harmonic analysis.

Partial HNM Synthesis II


[Figure panels: Full HNM synthesis; Pseudo-Harmonic Analysis / Harmonic Synthesis; Harmonic Analysis / Harmonic Synthesis]

Hypothesis: with pseudo-harmonic analysis, the a coefficients give a better approximation of the local harmonics, because the b coefficients represent changes in the local harmonics and time-comb errors.

Recalculating the Time Comb

Time stretching requires only the addition (or removal) of synthesis frames, and the assignment of parameters to the new frames.

For a Time Stretch Ratio Contour β(t), there is an integral equation for the creation of the time-comb.

Recalculating the Time Comb

We define three time axes:

  • The Analysis (original) time axis tA.
  • The Synthesis (final) time axis tS.
  • The Virtual time axis tV, which has the length of tA, but uneven “time density”.
Recalculating the Time Comb

Under the assumption that β(t) and P(t) are piecewise constant (constant for each pitch frame of the analysis time axis), the integral equation can be solved numerically using an iterative method.

*** Solved for given

Reassigning Parameters

Each virtual comb point receives the parameters of the nearest analysis point.

There is a 1-to-1 transition between virtual and synthesis points.
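
A minimal sketch of one way to build the three time axes for a pure time stretch. It assumes P(t) is the pitch period, that the analysis-to-synthesis mapping is the running integral of β(t), and that synthesis marks are placed one local pitch period apart. This forward march is an illustrative substitute for the slides' iterative solver, and `recalc_time_comb` is a hypothetical name.

```python
import numpy as np

def recalc_time_comb(t_analysis, beta):
    """Time-stretch a pitch-mark comb with one stretch ratio per analysis frame.

    t_analysis: increasing analysis pitch marks in seconds (length N + 1).
    beta: stretch ratio for each of the N analysis frames.
    Returns (t_synthesis, t_virtual, nearest_analysis_index).
    """
    t_analysis = np.asarray(t_analysis, dtype=float)
    periods = np.diff(t_analysis)                     # P(t), constant per frame
    # Stretched position of every analysis mark: running integral of beta.
    stretched = np.concatenate(
        [[0.0], np.cumsum(np.asarray(beta, dtype=float) * periods)])
    t_syn, t_virt = [0.0], [t_analysis[0]]
    while True:
        frame = np.searchsorted(t_analysis, t_virt[-1], side="right") - 1
        frame = min(int(frame), len(periods) - 1)
        nxt = t_syn[-1] + periods[frame]              # advance one local pitch period
        if nxt > stretched[-1] + 1e-9:                # past the end of the stretched axis
            break
        t_syn.append(nxt)
        # Virtual point: where this synthesis mark falls on the analysis axis.
        t_virt.append(float(np.interp(nxt, stretched, t_analysis)))
    t_virt = np.array(t_virt)
    # Each virtual point takes the parameters of the nearest analysis point.
    nearest = np.abs(t_virt[:, None] - t_analysis[None, :]).argmin(axis=1)
    return np.array(t_syn), t_virt, nearest

marks = np.arange(11) * 0.01                          # 100 Hz pitch, 0.1 s of speech
t_syn, t_virt, nearest = recalc_time_comb(marks, np.full(10, 2.0))
```

With a constant ratio of 2 the synthesis axis is twice as long, the pitch period is unchanged, and the virtual points are twice as dense as the analysis marks.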

Time Stretch Results

[Figure panels: Original, TS ratio 0.6, TS ratio 1.3, TS ratio 1.8]

Pitch Shifting

Also requires recalculation of the time comb.

For a Pitch Shift Ratio Contour α(t) (again assumed piecewise constant), the “Recomb Equation” can be rewritten:

Assignment of parameters to synthesis frames is similar.

Resampling the Spectral Envelope I - Theory
  • The harmonic part of speech is quasi-periodic, so it can be written as a function convolved with an impulse train.
  • Therefore, its spectral representation is sampled, where the sampling frequency is dependent on the pitch.
  • We assume that the HNM harmonic coefficients are a good approximation of the spectral envelope.
Resampling the Spectral Envelope II - Practice
  • We want to evaluate the Real-Cepstrum Coefficients so that the function

will “follow the contour” of the spectral envelope as closely as possible.

  • We then resample function (*) in the new frequency locations.
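
A minimal sketch of the fit-and-resample idea, assuming the envelope function has the usual real-cepstrum form log|E(f)| = c_0 + 2 Σ c_m cos(2π m f / Fs) and that the fit is an unweighted least-squares (pseudo-inverse) solution. The function names, the cepstral order and the test envelope are hypothetical.

```python
import numpy as np

def cepstral_design(freqs_hz, fs, order):
    """Matrix mapping real-cepstrum coefficients to log-amplitudes at freqs_hz."""
    m = np.arange(order + 1)
    design = 2.0 * np.cos(2.0 * np.pi * np.outer(freqs_hz, m) / fs)
    design[:, 0] = 1.0                                # c_0 is not doubled
    return design

def resample_envelope(amps, f0_old, f0_new, fs, order=12):
    """Fit a cepstral envelope to harmonic amplitudes, then sample it at a new pitch."""
    old_freqs = f0_old * np.arange(1, len(amps) + 1)
    cepstrum = np.linalg.pinv(cepstral_design(old_freqs, fs, order)) @ np.log(amps)
    num_new = int(old_freqs[-1] // f0_new)            # stay inside the fitted band
    new_freqs = f0_new * np.arange(1, num_new + 1)
    new_amps = np.exp(cepstral_design(new_freqs, fs, order) @ cepstrum)
    return new_amps, new_freqs

# Harmonic amplitudes sampled from a known smooth envelope at 200 Hz pitch.
fs = 8000
old_freqs = 200.0 * np.arange(1, 20)
amps = np.exp(1.0 + np.cos(2 * np.pi * old_freqs / fs))
new_amps, new_freqs = resample_envelope(amps, 200.0, 250.0, fs, order=4)
```

Because the test envelope lies inside the model, the resampled amplitudes match the envelope at the new harmonic frequencies.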
Resampling the Spectral Envelope II - Practice
  • “Following the contour” is defined as minimizing the error measure (pseudo-inv’):



Resampling the Spectral Envelope III – Bark Scale
  • The sensitivity of the human ear is logarithmic in both pitch and amplitude.
  • The described resampling method is logarithmic in amplitude but linear in pitch.
  • Therefore:

Redistribute the harmonics that are to be resampled on a quasi-logarithmic scale.
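
As an illustration of a quasi-logarithmic frequency axis, the sketch below uses the Zwicker-Terhardt approximation of the Bark scale and picks frequencies that are equally spaced in Bark. Whether the project used this exact approximation or this way of redistributing the points is an assumption; the function names are hypothetical.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker-Terhardt approximation of the Bark scale."""
    f_hz = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def bark_spaced_frequencies(f_min, f_max, count):
    """Frequencies between f_min and f_max that are equally spaced in Bark."""
    grid = np.linspace(f_min, f_max, 4096)
    targets = np.linspace(hz_to_bark(f_min), hz_to_bark(f_max), count)
    # hz_to_bark is increasing, so interpolation gives a numeric inverse.
    return np.interp(targets, hz_to_bark(grid), grid)

freqs = bark_spaced_frequencies(100.0, 4000.0, 16)
```

The resulting points are dense at low frequencies and sparse at high frequencies, which is the weighting the slide argues for.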

Resampling the Spectral Envelope IV - Demonstration

Original O from “strong”

After Pitch Shifting with α=0.805

Pitch Shift Results using HNM

[Figure panels: Original, PS ratio 0.6, PS ratio 1.3]

More Prosodic Changes

[Figure panels: Original, Manipulated]

VQ within Voice Conversion

Flexibility vs. Robustness

  • We reduce the number of training events used for conversion and instead convert local average to local average.
K Means Algorithm
  • Classifies N given vectors into K clusters, and for each cluster determines the “center of mass” (centroid), which represents the whole cluster.
  • In each iteration the centroids are relocated. The new location of each centroid is the mean of the vectors currently in its cluster. (Each vector belongs entirely to one cluster: that of the nearest centroid.)
  • Relatively robust to different initial conditions (initial centroids).
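
A minimal sketch of the K-means iteration described above. The initial centroids are passed in explicitly so the example is deterministic; the function name, the iteration limit and the toy data are hypothetical.

```python
import numpy as np

def k_means(vectors, initial_centroids, iterations=50):
    """Plain K-means: assign to the nearest centroid, then recompute the means."""
    centroids = np.asarray(initial_centroids, dtype=float).copy()

    def assign(cents):
        dists = np.linalg.norm(vectors[:, None, :] - cents[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(iterations):
        labels = assign(centroids)
        new_centroids = centroids.copy()
        for c in range(len(centroids)):
            members = vectors[labels == c]
            if len(members):                     # keep the old centroid if a cluster empties
                new_centroids[c] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                                # centroids stopped moving
        centroids = new_centroids
    return centroids, assign(centroids)

vectors = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                    [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
centroids, labels = k_means(vectors, initial_centroids=vectors[[0, 7]])
```

In a codebook for speech events, each row of `centroids` would be one codeword and `labels` the codeword index of each training event.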
Fuzzy K Means Algorithm
  • A variation on regular K means.
  • Differences are:

* The decision on which vector belongs to each cluster uses “Fuzzy KNN”.

* The averaging used in order to recalculate the centroid is weighted.

K Nearest Neighbours
  • Given N vectors, classified into K clusters, to which cluster should a new vector p_{N+1} be added?
  • The algorithm:

* P is the set of K vectors nearest to p_{N+1}.

* p_{N+1} belongs to the cluster with the most vectors in P.

  • There is a “tie-breaker” mechanism.
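
A minimal sketch of the nearest-neighbour vote. The slides do not describe their tie-breaker, so the rule used here (prefer the tied cluster whose voters are closest on average) is an assumption, as are the function name and the toy data.

```python
import numpy as np

def knn_classify(vectors, labels, new_vector, num_neighbours):
    """Majority vote among the nearest labelled vectors."""
    dists = np.linalg.norm(vectors - new_vector, axis=1)
    nearest = np.argsort(dists, kind="stable")[:num_neighbours]
    votes = np.bincount(labels[nearest])
    tied = np.flatnonzero(votes == votes.max())
    if len(tied) == 1:
        return int(tied[0])
    # Tie-breaker (assumed): the tied cluster whose voters are closest on average.
    mean_dist = [dists[nearest][labels[nearest] == c].mean() for c in tied]
    return int(tied[int(np.argmin(mean_dist))])

vectors = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
clear_case = knn_classify(vectors, labels, np.array([0.2, 0.2]), num_neighbours=3)
tied_case = knn_classify(vectors, labels, np.array([2.6, 3.0]), num_neighbours=2)
```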
Fuzzy KNN
  • Uses the K nearest neighbours, to calculate a probability distribution function of the new point’s affinity to each cluster.
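
A minimal sketch of a fuzzy nearest-neighbour membership, using inverse-distance weights with a fuzzifier m in the style commonly attributed to Keller et al. The slides do not say which weighting was used, so this choice, the function name and the example are assumptions.

```python
import numpy as np

def fuzzy_knn_memberships(vectors, labels, new_vector, num_neighbours,
                          num_clusters, m=2.0):
    """Membership of new_vector in each cluster, from distance-weighted votes."""
    dists = np.linalg.norm(vectors - new_vector, axis=1)
    nearest = np.argsort(dists, kind="stable")[:num_neighbours]
    # Closer neighbours get larger weights; the floor avoids division by zero.
    weights = 1.0 / np.maximum(dists[nearest], 1e-12) ** (2.0 / (m - 1.0))
    memberships = np.bincount(labels[nearest], weights=weights,
                              minlength=num_clusters)
    return memberships / memberships.sum()       # normalise so memberships sum to 1

vectors = np.array([[0.0, 0.0], [2.0, 0.0]])
labels = np.array([0, 1])
u = fuzzy_knn_memberships(vectors, labels, np.array([0.5, 0.0]),
                          num_neighbours=2, num_clusters=2)
```

In a fuzzy K-means loop, memberships like `u` would serve as the weights when each centroid is recomputed.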
Incorporation of VQ in Voice Conversion - Summary

Steps to be taken:

  • Represent speech events as vectors.
  • Find an effective distance measure. (See Phoneme Separation.)

Questions to be answered:

  • What is the optimal number of clusters? (Dependent on event type.)
  • Will fuzzy KNN prove an effective method for increasing the robustness of the conversion?

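Block diagram:

  • Training signal (source): parametric representation, codebook construction
  • Segmentation into speech events and silence removal
  • Alignment: event against event
  • Training signal (target): parametric representation, creation of an induced clustering of target events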

The Need for Automatic Phoneme Separation

Regardless of the type of speech event used for conversion, separation into the simplest kind of event will enable extraction of larger ones.

  • A large training bank cannot be separated manually.
  • Use of complex speech events requires a larger training bank.
  • Option for online conversion.
Split Points – Redefinition of the Problem

The problem of separating N phonemes is equivalent to the problem of placing N-1 “Split Points” in the signal.

Split points are characterized by changes in:

  • Energy
  • Voiced/Unvoiced sound
The HNM Distance Measure
  • The signal is coded in two additive parts. Therefore the distance is made up of two additive parts:

Where the weight is determined using the local energy of the two parts

* Parameter p will be determined during optimization of the machine.

Harmonic Distance Measure Assumptions
  • We need to measure distances between spectral envelopes.
  • Distance measurement between orthogonal coordinates is usually Euclidean.
  • Harmonic coefficients a are orthogonal coordinates. The distance between them “should” be Euclidean.

*** We want only “envelope shape” distance, disregarding energy. Hence the discussion is limited to normalized harmonic coefficients.

Harmonic Distance Measure Problem
  • Due to pitch differences between frames a1 and a2, they are the coefficients of different members of the orthonormal family.

The Euclidean distance between them may be large even for identical spectral envelopes!

Harmonic Distance Measure Solution
  • Harmonic coefficients a1 and a2 should undergo Pitch Shifting to a common pitch before distance calculation.
  • This involves converting harmonic coefficients ai into cepstral coefficients ci and back.
  • Parseval’s Theorem

Since cepstral coefficients are orthonormal coordinates, Euclidean distance between them is equal to the distance between the harmonic coefficients.

Harmonic Distance Measure Summary

The harmonic distance is defined as the distance between the normalized cepstral coefficients, calculated using the bark scale.

  • The distance between two unvoiced frames is zero.
  • The distance between a voiced and an unvoiced frame is defined as the maximum in the calculated batch.

Splitting Algorithm
  • Mark Silences (parameters are nil).
  • Mark Voiced/Unvoiced changes.
  • Mark Energy ratio larger than threshold.

“Cleaning”, Coarse Alignment.

  • Mark HNM distance larger than threshold.
  • Fine segmentation.

“Cleaning”, Fine Alignment
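
The coarse marking stage (silences, voiced/unvoiced changes, energy ratio) can be sketched as below. The per-frame inputs, the threshold values and the function name are hypothetical; the HNM-distance stage and the “cleaning” steps are not covered.

```python
import numpy as np

def coarse_split_points(energy, voiced, ratio_threshold=4.0, silence_floor=1e-6):
    """Frame indices at which a new segment starts.

    energy: per-frame energy; voiced: per-frame voicing flags.
    The thresholds are illustrative values, not taken from the slides.
    """
    energy = np.asarray(energy, dtype=float)
    voiced = np.asarray(voiced, dtype=bool)
    silent = energy < silence_floor
    e = np.maximum(energy, silence_floor)            # avoid dividing by zero
    ratio = np.maximum(e[1:] / e[:-1], e[:-1] / e[1:])
    change = ((voiced[1:] != voiced[:-1])            # voiced/unvoiced change
              | (silent[1:] != silent[:-1])          # silence boundary
              | (ratio > ratio_threshold))           # energy jump
    return np.flatnonzero(change) + 1                # first frame after each split

energy = [0.0, 0.0, 1.0, 1.1, 1.0, 0.1, 0.1]
voiced = [False, False, True, True, True, False, False]
splits = coarse_split_points(energy, voiced)
```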

Fine Segmentation Algorithms

[Figure labels: Segmentation 4, Segmentation 5]
  • Adaptive Threshold
  • Parameter Drift Identification
  • Higher Order Distances
Alignment - DTW

Dynamic Time Warping is an application of the principles of Dynamic Programming, used for aligning two signals which supposedly hold similar information at different densities.

Our alignment mechanism is currently undergoing experimentation.
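
For reference, a minimal sketch of textbook DTW between two sequences of feature vectors, with a Euclidean local cost and the three standard step directions. It is not the project's alignment mechanism, which the slides describe as still experimental; the function name and the toy sequences are hypothetical.

```python
import numpy as np

def dtw_path(seq_a, seq_b):
    """Classic DTW: returns (total cost, list of aligned index pairs)."""
    cost = np.linalg.norm(seq_a[:, None, :] - seq_b[None, :, :], axis=2)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)            # accumulated-cost table
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return float(acc[n, m]), path[::-1]

seq_a = np.array([[0.0], [1.0], [2.0], [3.0]])
seq_b = np.array([[0.0], [1.0], [1.0], [2.0], [3.0]])   # same content, one frame held longer
total, path = dtw_path(seq_a, seq_b)
```

With HNM frames as the feature vectors, the local cost would be replaced by the HNM distance measure.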

Alignment Scheme

Block diagram:

  • Coarse sentence segmentation and silence removal (for each of the two signals)
  • Segmentation into speech events (for each of the two signals)
  • Coarse alignment: “word” against “word”
  • Fine alignment: event against event
Alignment Scheme - Alternative

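Block diagram:

  • Coarse sentence segmentation and silence removal (for each of the two signals)
  • Segmentation into speech events (one signal only)
  • Coarse alignment: “word” against “word”
  • Full alignment: parametric frame against parametric frame
  • Induced segmentation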

Completion of the Building Blocks
  • Perfecting Phoneme Separation
  • Integrating alignment & separation in one of the above schemes.
  • Creating a system to handle the statistics of prosodic characteristics collected during training and to enable their use during conversion.
Food for Thought – Selection of Target Speech Event
  • Which speech events (phonemes, diphones, etc.) will produce the best conversion?
  • After the type of event is decided on, how will it be represented in a vector space?
  • What will be an effective distance measure?
  • What is the optimal number of clusters? (Dependent on event type.)
  • Will fuzzy KNN prove an effective method for increasing the robustness of the conversion?
Food for Thought II – Prosodic Characteristics
  • What method will be used to determine the prosodic characteristics of the output signal? (Based on the input signal and on statistics of the training database.)
  • On which prosodic characteristics should statistics be gathered? (The alternative being transferring them directly from the source sentence.)
  • What should be the “spatial resolution” of the statistics? (How many clusters should each histogram encompass?)