- By **laken**

## PowerPoint Slideshow about 'Voice Conversion – Part I' - laken


Presentation Transcript

Contents

- Review – Project Aims.
- Abstract & Conversion Scheme.

Building Blocks

- The HNM Parametric Model for Speech.
- Prosodic Modifications using the HNM.
- Fuzzy Vector Quantization.
- Phoneme Separation & Alignment.
- Integration of the System.

Project Aims

Output voice

Input voice

Voice Conversion

Converting speech into another speaker’s voice, based on offline training.

The emphasis is on:

- Good output voice quality
- Robustness
- Computational Complexity

Conversion Scheme - Training

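Parametric representation (HNM)

Coarse sentence segmentation and silence removal

Segmentation into speech events

Codebook construction (VQ)

Training signal - source

Histograms of additional characteristics

Coarse alignment – “word” against “word”

Fine alignment – event against event

Parametric representation (HNM)

Coarse sentence segmentation and silence removal

Segmentation into speech events

Creation of an induced clustering of target events

Training signal - target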

Training Output:

Quantized vector space for each speaker’s “speech events” with 1-to-1 transformation table + histograms of personal characteristics (pitch, length of phonemes, etc.)

Conversion Scheme - Conversion

Conversion

Selection of target events and additional characteristics

Parametric representation (HNM)

Segmentation into speech events

Source speaker

Concatenation

Modification of temporal characteristics

Synthesis - target speaker

The Benefits of a Parametric Representation of Speech

Allows for relatively simple:

- Comparison between speech events.
- Manipulation of recorded speech events.

The requirements from a parameterization scheme are:

- Quality of synthesis.
- Low computational needs.
- Low BPS.

The HARMONICS + NOISE MODEL

During Pitch Interval (ti, ti+1):

The HARMONICS + NOISE MODEL

Characteristics:

- Analysis based on pairs of pitch frame intervals, using harmonics and pseudo-harmonics.
- Division into low frequency Pseudo-Harmonic part, and mainly high frequency Noise.
- Noise part is modeled (for pairs of pitch cycles) using LPC.
- Both parts are analyzed and later reconstructed, entirely in the time domain.

- Parametric time stretch and pitch shift are possible (crucial for voice conversion).

Pseudo-Harmonic Analysis

- Determine pitch and divide into pitch-frames/unvoiced sections.
- Split unvoiced sections into frames of max. length (10ms).
- For voiced frames: determine K(ti), the number of harmonics in the cycle.
- Using Pseudo-inverse method, minimize:
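The error criterion itself did not survive in this transcript. Purely as an illustration of a pseudo-inverse fit, the sketch below assumes a simpler, purely harmonic model (a sum of cosines and sines at multiples of the pitch) with an unweighted squared error on one frame. The function name `fit_harmonics` and the coefficient names `c_cos` and `c_sin` are hypothetical and are not the a and b coefficients of the presentation.

```python
import numpy as np

def fit_harmonics(frame, f0, fs, n_harm):
    """Least-squares fit of n_harm harmonics of f0 to one frame (illustrative helper)."""
    t = (np.arange(len(frame)) - len(frame) / 2) / fs    # time axis centered on the frame
    k = np.arange(1, n_harm + 1)
    phase = 2 * np.pi * f0 * np.outer(t, k)
    basis = np.hstack([np.cos(phase), np.sin(phase)])    # columns: cosine terms, then sine terms
    coeffs = np.linalg.pinv(basis) @ frame               # pseudo-inverse solution
    return coeffs[:n_harm], coeffs[n_harm:], basis @ coeffs

# Usage on a synthetic frame made of two known harmonics
fs, f0 = 8000, 200.0
t = (np.arange(160) - 80) / fs
frame = 0.8 * np.cos(2 * np.pi * f0 * t) + 0.3 * np.sin(2 * np.pi * 2 * f0 * t)
c_cos, c_sin, approx = fit_harmonics(frame, f0, fs, n_harm=3)
```

The pseudo-inverse gives the minimum-squared-error coefficients directly, without an iterative search.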

Pseudo-Harmonic Analysis – Determining K(ti)

- For every harmonic-centered segment, check whether the local maximum is high, sharp, and centered enough.
- If it is, mark a harmonic.
- The highest harmonic is at

Pseudo-Harmonic Analysis – Determining K(ti)

Take analysis frame.

Concatenate m frames (to highlight periodicity).

Zero pad (increase spectral resolution).
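The three steps above (take a frame, concatenate, zero pad) can be sketched as follows. This is a rough illustration and not the presentation's procedure: the thresholds `min_ratio` and `max_offset`, the defaults for `m` and `pad_factor`, and the function name are all assumptions, and the "sharp" criterion is omitted for brevity.

```python
import numpy as np

def harmonic_peaks(frame, f0, fs, m=4, pad_factor=8, min_ratio=0.3, max_offset=0.1):
    """Return the harmonic numbers whose spectral peak is high and centered enough."""
    repeated = np.tile(frame, m)                     # concatenate m copies of the frame
    n_fft = pad_factor * len(repeated)               # zero padding refines the frequency grid
    spectrum = np.abs(np.fft.rfft(repeated, n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    found = []
    k = 1
    while (k + 0.5) * f0 < fs / 2:
        # Segment centered on the k-th multiple of the pitch
        seg = (freqs >= (k - 0.5) * f0) & (freqs < (k + 0.5) * f0)
        peak_idx = np.argmax(spectrum[seg])
        peak_freq = freqs[seg][peak_idx]
        high = spectrum[seg][peak_idx] > min_ratio * spectrum.max()
        centered = abs(peak_freq - k * f0) < max_offset * f0
        if high and centered:
            found.append(k)
        k += 1
    return found
```

Under these assumptions, the largest entry of the returned list would play the role of K(ti).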

Noise Analysis

For each pitch frame find A and w so that:

where A(ti,z) is a normalized all-pole filter

n(t) is normalized WGN

w(ti) is the local energy of the noise.

is estimation of Spectral Probability Density
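As a hedged sketch of how an all-pole filter and a noise energy can be estimated for one frame, the code below uses the standard autocorrelation method with the Levinson-Durbin recursion. Whether the presentation used this exact variant is not stated; the function name and the choice to return the residual energy as the gain term are assumptions.

```python
import numpy as np

def lpc_autocorr(x, order):
    """All-pole fit by the autocorrelation method (Levinson-Durbin recursion).

    Returns a, the coefficients of A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order,
    and err, the prediction residual energy.
    """
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                               # reflection coefficient
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err
```

Filtering normalized white noise through 1/A(z) and scaling by a gain derived from `err` would then give a noise part with a similar spectral shape and energy.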

Advantages of the HNM

Whereas LPC synthesis is based on models of speech production, HNM utilizes the characteristics of human hearing.

Disadvantages

Computation time

High BPS

Fs = 8192 Hz, Bit Depth = 8

HNM Results – “Oak is Strong…”

[Figure panels: Original, Harmonic, Noise, Reconstructed]

Normalized

The Phoneme EE from “is”

Note: Different scale

Fs = 8000 Hz, Bit Depth = 8

More Results

[Figure panels: Original, Harmonic, Noise, Reconstructed]

Fs = 16000 Hz, Bit Depth = 16

Integration of the HNM into the Conversion Scheme (reminder)

Stages:

- Implementing pitch changes and time stretches.

- Creating an HNM codebook for the target speaker's phonemes.

- Constructing a single signal from concatenated phonemes.

Partial HNM Synthesis I

Pseudo-harmonics are not well understood and therefore resist manipulation.

For the remainder of the project, synthesis will use only real harmonics.

Although the MSE for “harmonic only analysis” is smaller, speech quality is lower than with regular pseudo-harmonic analysis.

Partial HNM Synthesis II

Original

Full HNM synthesis

Pseudo-Harmonic Analysis / Harmonic Synthesis

Harmonic Analysis / Harmonic Synthesis

Hypothesis: with pseudo-harmonic analysis, the a coefficients give a better approximation of the local harmonics, because the b coefficients represent changes in the local harmonics and time-comb errors.

Recalculating the Time Comb

Time stretching requires only the addition (or removal) of synthesis frames, and the assignment of parameters to the new frames.

For a Time Stretch Ratio Contour β(t), there is an integral equation for the creation of the time-comb.

Recalculating the Time Comb

We define three time axes:

- The Analysis (original) time axis tA.
- The Synthesis (final) time axis tS.
- The Virtual time axis tV, which has the length of tA, but uneven “time density”.

Recalculating the Time Comb

Under the assumption

β(t) and P(t) are piecewise constant – (constant for each pitch frame of the analysis time axis)

the integral equation can be solved numerically using an iterative method.

*** Solved for given

Reassigning Parameters

Each virtual comb point receives the parameters of the nearest analysis point.

There is a 1-to-1 transition between virtual and synthesis points.
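The integral equation is not reproduced in this transcript, so the following is only one plausible reading of the iterative construction under the piecewise-constant assumption: each synthesis step keeps the local pitch period, while the virtual axis advances by that period divided by β. The function name, the convention that β > 1 lengthens the signal, and the stopping rule are assumptions.

```python
import numpy as np

def stretch_time_comb(analysis_marks, beta):
    """Build a synthesis time comb from analysis pitch marks.

    analysis_marks: N+1 pitch-mark times on the analysis axis (seconds).
    beta: N stretch ratios, one per analysis pitch frame.
    Returns the synthesis marks and, for each, the index of the nearest analysis mark.
    """
    analysis_marks = np.asarray(analysis_marks, dtype=float)
    beta = np.asarray(beta, dtype=float)
    periods = np.diff(analysis_marks)
    t_v = analysis_marks[0]                          # position on the virtual axis
    t_s = analysis_marks[0]                          # position on the synthesis axis
    synth, source = [t_s], [0]
    while True:
        frame = np.searchsorted(analysis_marks, t_v, side="right") - 1
        frame = min(max(frame, 0), len(periods) - 1)
        t_v += periods[frame] / beta[frame]          # virtual axis moves slowly when beta > 1
        if t_v > analysis_marks[-1] + 1e-12:
            break
        t_s += periods[frame]                        # synthesis keeps the local pitch period
        synth.append(t_s)
        # Each virtual point takes the parameters of the nearest analysis point
        source.append(int(np.argmin(np.abs(analysis_marks - t_v))))
    return np.array(synth), np.array(source)
```

With β = 2 everywhere, a 40 ms analysis comb yields an 80 ms synthesis comb in which each analysis frame is used roughly twice.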

Pitch Shifting

Also requires recalculation of the time comb.

For a Pitch Shift Ratio Contour α(t) (again assumed piecewise constant), the “Recomb Equation” can be rewritten:

Assignment of parameters to synthesis frames is similar.

Resampling the Spectral Envelope I - Theory

- The harmonic part of speech is quasi-periodic, so it can be written as a function convolved with an impulse train.
- Therefore, its spectral representation is sampled, where the sampling frequency depends on the pitch.
- CHANGING THE PITCH REQUIRES RESAMPLING THE SPECTRAL ENVELOPE.
- We assume that the HNM harmonic coefficients are a good approximation of the spectral envelope.

Resampling the Spectral Envelope II - Practice

- We want to evaluate the Real-Cepstrum Coefficients so that the function

will “follow the contour” of the spectral envelope as closely as possible.

- We then resample function (*) in the new frequency locations.

Resampling the Spectral Envelope II - Practice

- “Following the contour” is defined as minimizing the error measure (pseudo-inv’):

λ = 0

λ = 5·10⁻⁴
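The envelope function and the error measure did not survive in this transcript. As a sketch only, the code below assumes the common regularized discrete-cepstrum form: the log amplitude is modeled as c0 + 2 Σ c_n cos(nω), and λ weights a smoothness penalty that grows with n². The penalty matrix, the order, and the function names are assumptions.

```python
import numpy as np

def cepstral_basis(freqs, fs, order):
    """Matrix M such that log-envelope = M @ c at the given frequencies."""
    w = 2 * np.pi * np.asarray(freqs, dtype=float) / fs
    M = np.cos(np.outer(w, np.arange(order + 1)))
    M[:, 1:] *= 2.0
    return M

def fit_cepstrum(freqs, amps, fs, order=20, lam=5e-4):
    """Regularized pseudo-inverse fit of real cepstrum coefficients to harmonic amplitudes."""
    M = cepstral_basis(freqs, fs, order)
    n = np.arange(order + 1, dtype=float)
    R = np.diag(8 * np.pi ** 2 * n ** 2)             # assumed smoothness penalty
    target = np.log(np.asarray(amps, dtype=float))
    return np.linalg.solve(M.T @ M + lam * R, M.T @ target)

def eval_envelope(c, freqs, fs):
    """Resample the fitted envelope at new frequency locations."""
    return np.exp(cepstral_basis(freqs, fs, len(c) - 1) @ c)
```

For a pitch shift with ratio α, one would fit at the original harmonic frequencies and evaluate at α times those frequencies. With λ = 0 and more coefficients than harmonics the system is singular, which is one practical reason for a small positive λ.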

Resampling the Spectral Envelope III – Bark Scale

- The sensitivity of the human ear is logarithmic in both pitch and amplitude.
- The described resampling method is logarithmic in amplitude but linear in pitch.
- Therefore:

Redistribute the harmonics that are to be resampled on a quasi-logarithmic scale.
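One way to redistribute harmonic frequencies on a quasi-logarithmic axis is sketched below, using the Zwicker-Terhardt approximation of the Bark scale. The slides do not say which Bark formula was used, so the formula choice and the helper `warp_to_bark`, which rescales the result back to the range 0 to fs/2, are assumptions.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker-Terhardt approximation of the Bark scale."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def warp_to_bark(freqs_hz, fs):
    """Map frequencies onto a Bark-spaced axis that still spans 0 to fs/2."""
    nyquist = fs / 2.0
    return nyquist * hz_to_bark(freqs_hz) / hz_to_bark(nyquist)
```

After warping, low harmonics are spread further apart and high harmonics are compressed, so an envelope fit on the warped axis spends more of its resolution on the low frequencies.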

Resampling the Spectral Envelope IV - Demonstration

Original O from “strong”

After Pitch Shifting with α=0.805

Pitch Shift Results using HNM

Original PS ratio 0.6 PS ratio 1.3

More Prosodic Changes

Original Manipulated

VQ within Voice Conversion

Flexibility vs. Robustness

- VQ is SURRENDERING INFORMATION AND RELYING ON LOCAL AVERAGES.
- We lower the number of training events used for conversion, and instead, convert local average to local average.

K Means Algorithm

- Classifies N given vectors into K clusters, and for each cluster determines the “center of mass” (centroid), which represents the whole cluster.
- In each iteration relocation of the centroids occurs. The new location is the mean of the vectors currently in the appropriate cluster. (Each vector belongs entirely to one cluster – the nearest centroid.)
- Relatively robust to different initial conditions (initial centroids).
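A minimal sketch of the K-means iteration described above, assuming Euclidean distance and initial centroids drawn at random from the data (the slides specify neither):

```python
import numpy as np

def k_means(vectors, k, n_iter=50, seed=0):
    """Plain K-means: returns the centroids and the cluster label of each vector."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    labels = np.zeros(len(vectors), dtype=int)
    for _ in range(n_iter):
        # Each vector belongs entirely to the cluster of its nearest centroid
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = vectors[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)   # relocate centroid to the cluster mean
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

An empty cluster keeps its previous centroid here, which is one of several possible conventions.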

Fuzzy K Means Algorithm

- A variation on regular K means.
- Differences are:

* The decision on which vector belongs to each cluster uses “Fuzzy KNN”.

* The averaging used in order to recalculate the centroid is weighted.

K Nearest Neighbours

- Given N vectors, classified into K clusters, to which cluster should a new vector pN+1 be added?
- The algorithm:
  * P is the set of K vectors nearest to pN+1.
  * pN+1 belongs to the cluster with the most vectors in P.
- There is a “Tie-breaker” apparatus.
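The voting rule can be sketched as below. The slides do not describe the tie-breaker, so the one used here (among the tied clusters, take the cluster of the closest neighbour) is an assumption, as are the function and parameter names.

```python
import numpy as np
from collections import Counter

def knn_classify(train_vectors, train_labels, new_vector, n_neighbors=3):
    """Assign new_vector to the cluster most common among its nearest neighbours."""
    train_vectors = np.asarray(train_vectors, dtype=float)
    dists = np.linalg.norm(train_vectors - np.asarray(new_vector, dtype=float), axis=1)
    nearest = np.argsort(dists)[:n_neighbors]        # indices sorted from closest to farthest
    votes = Counter(int(train_labels[i]) for i in nearest)
    best = max(votes.values())
    tied = [label for label, count in votes.items() if count == best]
    if len(tied) == 1:
        return tied[0]
    # Assumed tie-breaker: cluster of the closest neighbour among the tied clusters
    for i in nearest:
        if int(train_labels[i]) in tied:
            return int(train_labels[i])
```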

Fuzzy KNN

- Uses the K nearest neighbours to calculate a probability distribution function of the new point’s affinity to each cluster.
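A sketch of one common fuzzy KNN formulation, in which each neighbour votes with a weight that falls off with distance and the votes are normalized into memberships. The inverse-distance weighting with fuzzifier `m`, the crisp training labels, and the names are assumptions; the slides do not give the formula.

```python
import numpy as np

def fuzzy_knn(train_vectors, train_labels, new_vector, n_clusters,
              n_neighbors=3, m=2.0, eps=1e-12):
    """Return the membership of new_vector in each cluster (values sum to 1)."""
    train_vectors = np.asarray(train_vectors, dtype=float)
    dists = np.linalg.norm(train_vectors - np.asarray(new_vector, dtype=float), axis=1)
    nearest = np.argsort(dists)[:n_neighbors]
    # Closer neighbours receive larger weights; eps avoids division by zero
    weights = 1.0 / (dists[nearest] ** (2.0 / (m - 1.0)) + eps)
    membership = np.zeros(n_clusters)
    for idx, weight in zip(nearest, weights):
        membership[int(train_labels[idx])] += weight
    return membership / membership.sum()

# Usage: two of the three neighbours are in cluster 0, but the closest is in cluster 1
u = fuzzy_knn([[0, 0], [1, 0], [4, 0]], [0, 0, 1], [3, 0], n_clusters=2)
```

In this example a plain majority vote over the three neighbours would pick cluster 0, while the fuzzy memberships favour cluster 1 because its single neighbour is much closer.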

Incorporation of VQ in Voice Conversion - Summary

Steps to be taken:

- Represent speech events as vectors.
- Find an effective distance measure. (See Phoneme Separation.)

Questions to be answered:

- What is the optimal number of clusters? (Dependent on event type.)
- Will fuzzy KNN prove an effective method for increasing the robustness of the conversion?

Parametric representation (HNM)

Codebook construction (VQ)

Training signal - source

Alignment – event against event

Segmentation into speech events and silence removal

Parametric representation (HNM)

Creation of an induced clustering of target events

Training signal - target

The Need for Automatic Phoneme Separation

Regardless of the type of speech event used for conversion, separation into the simplest kind of event will enable extraction of larger ones.

- A large training bank cannot be separated manually.
- Use of complex speech events requires a larger training bank.
- Option for online conversion.

Split Points – Redefinition of the Problem

The problem of separating N phonemes is equivalent to the problem of placing N-1 “Split Points” in the signal.

Split points are characterized by changes in:

- Energy
- Voiced/Unvoiced sound
- SPECTRAL ENVELOPE

The HNM Distance Measure

- The signal is coded in two additive parts. Therefore the distance is made up of two additive parts:

Where the weight is determined using the local energy of the two parts

* Parameter p will be determined during optimization of the machine.

Harmonic Distance Measure Assumptions

- We need to measure distances between spectral envelopes.
- Distance measurement between orthogonal coordinates is usually Euclidean.
- Harmonic coefficients a are orthogonal coordinates. The distance between them “should” be Euclidean.

*** We want only “envelope shape” distance, disregarding energy. Hence the discussion is limited to normalized harmonic coefficients.

Harmonic Distance Measure Problem

- Due to pitch differences between frames a1 and a2, they are the coefficients of different members of the orthonormal family.

[Figure: harmonic index axes (1, 2, 3, ...) for two frames with different pitch]

The Euclidean distance between them may be large even for identical spectral envelopes!

Harmonic Distance Measure Solution

- Harmonic coefficients a1 and a2 should undergo Pitch Shifting to a common pitch before distance calculation.
- This involves converting harmonic coefficients ai into cepstral coefficients ci and back.
- Parseval’s Theorem

Since cepstral coefficients are orthonormal coordinates, Euclidean distance between them is equal to the distance between the harmonic coefficients.

Harmonic Distance Measure Summary

The harmonic distance is defined as the distance between the normalized cepstral coefficients, calculated using the Bark scale.

- The distance between two unvoiced frames is zero.
- The distance between a voiced and an unvoiced frame is defined as the maximum in the calculated batch.
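The three rules can be sketched for consecutive frames as follows. The code assumes the cepstral vectors are already normalized and Bark-warped, and it reads "the calculated batch" as all voiced-to-voiced distances in the array passed in; both are interpretations, and the function name is hypothetical.

```python
import numpy as np

def harmonic_distances(cepstra, voiced):
    """Distances between consecutive frames.

    cepstra: (T, n) array of normalized cepstral vectors.
    voiced: (T,) boolean array, True for voiced frames.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    voiced = np.asarray(voiced, dtype=bool)
    d = np.zeros(len(voiced) - 1)                    # unvoiced-to-unvoiced pairs stay at zero
    both_voiced = voiced[:-1] & voiced[1:]
    mixed = voiced[:-1] ^ voiced[1:]
    diff = cepstra[1:] - cepstra[:-1]
    d[both_voiced] = np.linalg.norm(diff[both_voiced], axis=1)
    if both_voiced.any():
        d[mixed] = d[both_voiced].max()              # voiced/unvoiced change: batch maximum
    return d
```

If the batch holds no voiced-to-voiced pair, the mixed pairs are left at zero in this sketch.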

Splitting Algorithm

- Mark Silences (parameters are nil).
- Mark Voiced/Unvoiced changes.
- Mark Energy ratio larger than threshold.

“Cleaning”, Coarse Alignment.

- Mark HNM distance larger than threshold.
- Fine segmentation.

“Cleaning”, Fine Alignment

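[Figure: distance d(t) plotted against the adaptive threshold Dth - (t - ti)·Dinc, with the initial threshold Dth and split points 4 and 5 marked]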

Fine Segmentation Algorithms

- Adaptive Threshold
- Parameter Drift Identification
- Higher Order Distances
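A sketch of the adaptive-threshold idea, read from the figure labels Dth and Dth - (t - ti)·Dinc: the threshold starts at Dth after each split point ti and drops linearly until the distance d(t) exceeds it. Measuring time in frames and resetting the threshold at every split are assumptions.

```python
def adaptive_threshold_splits(d, d_th, d_inc):
    """Return the frame indices where d exceeds the decaying threshold."""
    splits = []
    last = 0                                         # index of the previous split point
    for t, value in enumerate(d):
        threshold = d_th - (t - last) * d_inc        # threshold relaxes as the segment grows
        if value > threshold:
            splits.append(t)
            last = t                                 # restart the threshold at d_th
    return splits
```

The longer a segment runs without a split, the smaller the distance needed to trigger one, which discourages very long segments.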

Alignment - DTW

Dynamic Time Warping is an application of the principles of Dynamic Programming, used for aligning two signals that are assumed to hold similar information at different densities.

Our alignment mechanism is currently undergoing experimentation.
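For reference, a textbook DTW sketch with the three basic steps (diagonal, vertical, horizontal) and Euclidean frame distance. It illustrates the general technique only; the alignment mechanism of the project, which the slide says is still under experimentation, may differ.

```python
import numpy as np

def dtw(a, b):
    """Align sequences a and b; returns the total cost and the warping path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.atleast_1d(a[i - 1]) - np.atleast_1d(b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor
    path = [(n - 1, m - 1)]
    i, j = n, m
    while (i, j) != (1, 1):
        steps = [(cost[i - 1, j - 1], i - 1, j - 1),
                 (cost[i - 1, j], i - 1, j),
                 (cost[i, j - 1], i, j - 1)]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    return cost[n, m], path[::-1]
```

Each pair in the path matches a frame of the first sequence with a frame of the second, so a repeated index marks a frame that was stretched to cover several frames of the other signal.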

Alignment Scheme

Coarse sentence segmentation and silence removal

Segmentation into speech events

Coarse alignment – “word” against “word”

Fine alignment – event against event

Coarse sentence segmentation and silence removal

Segmentation into speech events

Alignment Scheme - Alternative

Coarse sentence segmentation and silence removal

Segmentation into speech events

Coarse alignment – “word” against “word”

Full alignment – parametric frame against parametric frame

Induced segmentation

Coarse sentence segmentation and silence removal

Completion of the Building Blocks

- Perfecting Phoneme Separation
- Integrating alignment & separation in one of the above schemes.
- Creating a system to handle the statistics of prosodic characteristics collected during training and to enable their use during conversion.

Food for Thought – Selection of Target Speech Event

- Which speech events (phonemes, diphones, etc.) will produce the best conversion?
- After the type of event is decided on, how will it be represented in a vector space?
- What will be an effective distance measure?
- What is the optimal number of clusters? (Dependent on event type.)
- Will fuzzy KNN prove an effective method for increasing the robustness of the conversion?

Food for Thought II – Prosodic Characteristics

- What method will be used to determine the prosodic characteristics of the output signal? (Based on the input signal and on statistics of the training database.)
- On which prosodic characteristics should statistics be gathered? (The alternative being transferring them directly from the source sentence.)
- What should be the “spatial resolution” of the statistics? (How many clusters should each histogram encompass?)
