Voice Conversion – Part I

By: Rafi Lemansky, Noam Elron
Under the supervision of: Dr. Yizhar Lavner

Contents
• Review: Project Aims.
• Abstract & Conversion Scheme.
Building Blocks
• The HNM Parametric Model for Speech.
• Prosodic Modifications using the HNM.
• Fuzzy Vector Quantization.
• Phoneme Separation & Alignment.
• Integration of the System.

Project Aims
Input voice → Voice Conversion → Output voice
Converting speech into another speaker’s voice, based on offline training.
The emphasis is on:
• Good output voice quality
• Robustness
• Computational complexity

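Conversion Scheme - Training
Source branch: Training signal (source) → Parametric representation (HNM) → Coarse sentence segmentation and silence removal → Segmentation into speech events → Building the VQ codebook
Target branch: Training signal (target) → Parametric representation (HNM) → Coarse sentence segmentation and silence removal → Segmentation into speech events → Creating an induced clustering of target events
Alignment between the branches: Coarse alignment ("word" against "word") → Fine alignment (event against event)
Additional block: Histograms of additional characteristics
Training Output: Quantized vector space for each speaker’s “speech events” with a 1-to-1 transformation table, plus histograms of personal characteristics (pitch, length of phonemes, etc.)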

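Conversion Scheme - Conversion
Source speaker → Parametric representation (HNM) → Segmentation into speech events → Conversion: selection of target events and additional characteristics → Concatenation → Modification of temporal characteristics → Synthesis (target speaker)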

The Benefits of a Parametric Representation of Speech
Allows for relatively simple:
• Comparison between speech events.
• Manipulation of recorded speech events.
The requirements of a parameterization scheme are:
• Quality of synthesis.
• Low computational needs.
• Low bit rate (BPS).

-1-The Harmonic plus Noise Model

The HARMONICS + NOISE MODEL During Pitch Interval (t_i, t_{i+1}):

The HARMONICS + NOISE MODEL Characteristics: • Analysis based on pairs of pitch frame intervals, using harmonics and pseudo-harmonics. • Division into low frequency Pseudo-Harmonic part, and mainly high frequency Noise. • Noise part is modeled (for pairs of pitch cycles) using LPC. • Both parts are analyzed and later reconstructed, entirely in the time domain. - Parametric time stretch and pitch shift are possible (crucial for voice conversion).

Pseudo-Harmonic Analysis
• Determine the pitch and divide the signal into pitch frames and unvoiced sections.
• Split unvoiced sections into frames with a maximum length of 10 ms.
• For voiced frames: determine K(t_i), the number of harmonics in the cycle.
• Using the pseudo-inverse method, minimize:
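As a rough illustration of the pseudo-inverse step, the sketch below fits cosine and sine amplitudes at multiples of a known pitch to a single frame by least squares. It is a simplified stand-in rather than the cost function used in this project: the helper name `fit_harmonics`, the plain cos/sin basis (no pseudo-harmonic slope terms) and the synthetic frame are all assumptions.

```python
import numpy as np

def fit_harmonics(frame, f0, fs, num_harmonics):
    """Least-squares fit of cos/sin amplitudes at multiples of f0 (hypothetical helper)."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    t = (np.arange(n) - n // 2) / fs                  # time axis centred on the frame
    k = np.arange(1, num_harmonics + 1)               # harmonic indices 1..K
    phase = 2.0 * np.pi * f0 * np.outer(t, k)         # shape (n, K)
    basis = np.hstack([np.cos(phase), np.sin(phase)]) # shape (n, 2K)
    coeffs = np.linalg.pinv(basis) @ frame            # minimises ||frame - basis @ coeffs||^2
    residual = frame - basis @ coeffs
    return coeffs[:num_harmonics], coeffs[num_harmonics:], residual

# Synthetic frame (assumed values): 200 Hz pitch sampled at 8 kHz.
fs, f0 = 8000.0, 200.0
t = (np.arange(160) - 80) / fs
frame = 0.8 * np.cos(2 * np.pi * f0 * t) + 0.3 * np.sin(2 * np.pi * 2 * f0 * t)
cos_amp, sin_amp, residual = fit_harmonics(frame, f0, fs, num_harmonics=3)
```

The residual left after the fit is what a noise model would then have to describe.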

Pseudo-Harmonic Analysis – Determining K(t_i)
• For every harmonic-centered segment, check whether the local maximum is high, sharp and centered enough.
• If it is, mark a harmonic.
• The highest harmonic is at

Pseudo-Harmonic Analysis – Determining K(t_i)
• Take the analysis frame.
• Concatenate m copies of the frame (to highlight periodicity).
• Zero-pad (to sample the spectrum more densely).
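The frame preparation above can be sketched as follows. This is a minimal illustration, assuming a NumPy FFT and a synthetic one-period frame; the function name `periodicity_spectrum` and all numeric values are hypothetical, and the peak test for marking harmonics is not included.

```python
import numpy as np

def periodicity_spectrum(frame, repeats, nfft, fs):
    """Magnitude spectrum of a frame repeated `repeats` times and zero-padded to nfft."""
    repeated = np.tile(np.asarray(frame, dtype=float), repeats)  # emphasise periodicity
    magnitude = np.abs(np.fft.rfft(repeated, n=nfft))            # n > len pads with zeros
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return freqs, magnitude

# One 100 Hz pitch period at 8 kHz (assumed test signal with two harmonics).
fs = 8000.0
n = np.arange(80)
frame = np.cos(2 * np.pi * 100 * n / fs) + 0.5 * np.cos(2 * np.pi * 200 * n / fs)
freqs, magnitude = periodicity_spectrum(frame, repeats=4, nfft=2048, fs=fs)
peak_hz = freqs[np.argmax(magnitude)]   # strongest peak, expected near the fundamental
```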

Noise Analysis
For each pitch frame find A and w so that:
where
• A(t_i, z) is a normalized all-pole filter,
• n(t) is normalized white Gaussian noise (WGN),
• w(t_i) is the local energy of the noise.
is an estimation of the Spectral Probability Density
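One standard way to obtain an all-pole filter for a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below assumes that method; the slides say only that LPC is used, so the function names and the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p are assumptions.

```python
def autocorrelation(frame, max_lag):
    """Biased autocorrelation r[0..max_lag] of one frame."""
    return [
        sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        for lag in range(max_lag + 1)
    ]

def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns ([1, a_1, ..., a_p], prediction error)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                        # reflection coefficient
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k                    # remaining prediction-error energy
    return a, err

# Autocorrelation of an ideal first-order process with pole at 0.5 (assumed example).
coeffs, error = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

The remaining prediction-error energy could serve as a gain term comparable to w(t_i), although the normalisation used in the project is not stated here.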

Advantages of the HNM
• Whereas LPC synthesis is based on models of speech production, HNM utilizes the characteristics of human hearing.
Disadvantages
• Computation time
• High BPS

HNM Results – “Oak is Strong…” (Fs = 8192 Hz, bit depth = 8)
Panels: Original, Harmonic, Noise, Reconstructed (normalized)

The Phoneme EE from “is” Note: Different scale

The Phoneme S from “also”

More Results (Fs = 16000 Hz, bit depth = 16)
Panels: Original, Harmonic, Noise, Reconstructed

More Results (Fs = 8000 Hz, bit depth = 8; Fs = 16000 Hz, bit depth = 16)
Panels: Original, Harmonic, Noise, Reconstructed

Integration of the HNM into the Conversion Scheme (reminder)
Stages:
- Implementing pitch changes and time stretches.
- Creating an HNM codebook for the target speaker’s phonemes.
- Constructing a single signal from concatenated phonemes.

Partial HNM Synthesis I
• Pseudo-harmonics are not well understood and therefore do not lend themselves to manipulation.
• For the remainder of the project, synthesis will use only real harmonics.
• Although the MSE for “harmonic-only analysis” is smaller, its speech quality is lower than with regular pseudo-harmonic analysis.

Partial HNM Synthesis II
Examples compared: Original; Full HNM synthesis; Pseudo-Harmonic Analysis / Harmonic Synthesis; Harmonic Analysis / Harmonic Synthesis
Hypothesis: with pseudo-harmonic analysis, the a coefficients give a better approximation of the local harmonics, because the b coefficients represent changes in the local harmonics and time-comb errors.

-2-Prosodic Modifications using HNM

Recalculating the Time Comb Time stretching requires only the addition (or removal) of synthesis frames, and the assignment of parameters to the new frames. For a Time Stretch Ratio Contour β(t), there is an integral equation for the creation of the time-comb.

Recalculating the Time Comb We define three time axes: • The Analysis (original) time axis tA. • The Synthesis (final) time axis tS. • The Virtual time axis tV, which has the length of tA, but uneven “time density”.

Recalculating the Time Comb Under the assumption β(t) and P(t) are piecewise constant – (constant for each pitch frame of the analysis time axis) the integral equation can be solved numerically using an iterative method. *** Solved for given

Reassigning Parameters Each virtual comb point receives the parameters of the nearest analysis point. There is a 1-to-1 transition between virtual and synthesis points.
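A minimal sketch of the time-comb recalculation and parameter reassignment for time stretching follows, assuming piecewise-constant β per analysis frame. The integral equation itself is not reproduced on these slides, so this follows one common formulation: synthesis marks are spaced by the local pitch period, each is mapped back to a virtual time, and it takes the parameters of the nearest analysis mark. The function name and the sample-based units are assumptions.

```python
from bisect import bisect_right

def stretch_time_comb(analysis_marks, beta):
    """analysis_marks: increasing pitch marks; beta[i]: stretch ratio on frame i (> 0)."""
    periods = [b - a for a, b in zip(analysis_marks, analysis_marks[1:])]
    # Position of every analysis mark on the synthesis axis: running sum of beta_i * P_i.
    warped = [0.0]
    for period, ratio in zip(periods, beta):
        warped.append(warped[-1] + ratio * period)

    synthesis_marks, virtual_marks, nearest_analysis = [], [], []
    t_s = 0.0
    while t_s <= warped[-1]:
        # Analysis frame whose stretched interval contains t_s.
        i = min(bisect_right(warped, t_s) - 1, len(periods) - 1)
        t_v = analysis_marks[i] + (t_s - warped[i]) / beta[i]   # virtual time
        left_gap = t_v - analysis_marks[i]
        right_gap = analysis_marks[i + 1] - t_v
        nearest_analysis.append(i if left_gap <= right_gap else i + 1)
        synthesis_marks.append(t_s)
        virtual_marks.append(t_v)
        t_s += periods[i]            # keep the local pitch period unchanged
    return synthesis_marks, virtual_marks, nearest_analysis

# Six pitch marks 80 samples apart, stretched by a factor of 2 (assumed example).
marks = [0, 80, 160, 240, 320, 400]
syn, virt, nearest = stretch_time_comb(marks, beta=[2.0] * 5)
```

With a ratio of 2 every analysis frame is used twice; with a ratio of 0.5 every second frame is skipped.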

Time Stretch Results
Examples: Original; TS ratio 0.6; TS ratio 1.3; TS ratio 1.8

Pitch Shifting Also requires recalculation of the time comb. For a Pitch Shift Ratio Contour α(t) (again assumed piecewise constant), the “Recomb Equation” can be rewritten: Assignment of parameters to synthesis frames is similar.

Resampling the Spectral Envelope I - Theory
• The harmonic part of speech is quasi-periodic → it can be written as a function convolved with an impulse train.
• Therefore, its spectral representation is sampled, where the sampling frequency is dependent on the pitch.
• CHANGING THE PITCH REQUIRES RESAMPLING THE SPECTRAL ENVELOPE.
• We assume that the HNM harmonic coefficients are a good approximation of the spectral envelope.

Resampling the Spectral Envelope II - Practice • We want to evaluate the Real-Cepstrum Coefficients so that the function will “follow the contour” of the spectral envelope as closely as possible. • We then resample function (*) in the new frequency locations.

Resampling the Spectral Envelope II - Practice
• “Following the contour” is defined as minimizing the error measure (pseudo-inverse):
Panels: λ = 0 and λ = 5·10^-4
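The sketch below shows one way such a fit and resampling could look, assuming the usual discrete-cepstrum form log|H(f)| = c_0 + 2·Σ c_m·cos(2πmf/Fs) and a simple m² smoothness penalty weighted by λ. Both the envelope form and the penalty are assumptions (the slide's own function and error measure are not reproduced here), as are all function names and test values.

```python
import numpy as np

def cepstral_basis(freqs_hz, fs, order):
    """Rows: harmonic frequencies; columns: 1, 2cos(2*pi*f/fs), ..., 2cos(2*pi*order*f/fs)."""
    m = np.arange(order + 1)
    basis = np.cos(2.0 * np.pi * np.outer(np.asarray(freqs_hz, dtype=float) / fs, m))
    basis[:, 1:] *= 2.0
    return basis

def fit_cepstrum(freqs_hz, amplitudes, fs, order, lam=0.0):
    """Regularised least-squares fit of real-cepstrum coefficients to log-amplitudes."""
    basis = cepstral_basis(freqs_hz, fs, order)
    log_amp = np.log(np.asarray(amplitudes, dtype=float))
    penalty = np.diag(np.arange(order + 1, dtype=float) ** 2)   # assumed smoothness term
    return np.linalg.solve(basis.T @ basis + lam * penalty, basis.T @ log_amp)

def envelope(cepstrum, freqs_hz, fs):
    """Evaluate the fitted envelope at arbitrary frequencies."""
    return np.exp(cepstral_basis(freqs_hz, fs, len(cepstrum) - 1) @ cepstrum)

# Harmonics of a 200 Hz pitch, resampled for a pitch shift ratio of 0.805 (assumed data).
fs = 8000.0
old_harmonics = 200.0 * np.arange(1, 20)
new_harmonics = 200.0 * 0.805 * np.arange(1, 25)
true_cepstrum = np.array([-1.0, 0.5, -0.2])
old_amplitudes = envelope(true_cepstrum, old_harmonics, fs)
fitted = fit_cepstrum(old_harmonics, old_amplitudes, fs, order=2, lam=5e-4)
new_amplitudes = envelope(fitted, new_harmonics, fs)
```

A larger λ trades fidelity at the measured harmonics for a smoother envelope between them.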

Resampling the Spectral Envelope III – Bark Scale
• The sensitivity of the human ear is logarithmic in both pitch and amplitude.
• The described resampling method is logarithmic in amplitude but linear in pitch.
• Therefore: redistribute the harmonics that are to be resampled on a quasi-logarithmic scale.
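For reference, a commonly used analytic approximation of the Bark scale (attributed to Zwicker and Terhardt) is sketched below. The slides do not say which Bark formula or redistribution rule was used, so this only illustrates how equally spaced harmonics bunch together on a quasi-logarithmic axis; the function name is hypothetical.

```python
import math

def hz_to_bark(freq_hz):
    """Approximate critical-band rate (Bark) for a frequency in Hz."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)

# Harmonics of a 200 Hz pitch: equal spacing in Hz, shrinking spacing in Bark.
harmonics_hz = [200.0 * k for k in range(1, 20)]
harmonics_bark = [hz_to_bark(f) for f in harmonics_hz]
bark_steps = [b - a for a, b in zip(harmonics_bark, harmonics_bark[1:])]
```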

Resampling the Spectral Envelope IV - Demonstration
Panels: Original O from “strong”; After Pitch Shifting with α = 0.805

Pitch Shift Results using HNM
Examples: Original; PS ratio 0.6; PS ratio 1.3

More Prosodic Changes Original Manipulated

-3-Vector Quantization

VQ within Voice Conversion Flexibility vs. Robustness • VQ is SURRENDERING INFORMATION AND RELYING ON LOCAL AVERAGES. • We lower the number of training events used for conversion, and instead, convert local average to local average.

K Means Algorithm • Classifies N given vectors into K clusters, and for each cluster determines the “center of mass” (centroid), which represents the whole cluster. • In each iteration relocation of the centroids occurs. The new location is the mean of the vectors currently in the appropriate cluster. (Each vector belongs entirely to one cluster – the nearest centroid.) • Relatively robust to different initial conditions (initial centroids).
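The iteration described above can be sketched in plain Python as follows. The initial centroids are passed in explicitly, and the toy 2-D data and function names are assumptions for illustration only.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))

def nearest_centroid(vector, centroids):
    """Index of the centroid closest to `vector`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(vector, centroids[j]))

def kmeans(vectors, initial_centroids, iterations=20):
    centroids = list(initial_centroids)
    for _ in range(iterations):
        # Each vector belongs entirely to its nearest centroid.
        labels = [nearest_centroid(v, centroids) for v in vectors]
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:              # an empty cluster keeps its previous centroid
                centroids[j] = mean_vector(members)
    labels = [nearest_centroid(v, centroids) for v in vectors]
    return centroids, labels

# Two well-separated groups; both initial centroids start inside the first group.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(data, initial_centroids=[(0, 0), (1, 0)])
```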

Fuzzy K Means Algorithm • A variation on regular K means. • Differences are: * The decision on which vector belongs to each cluster uses “Fuzzy KNN”. * The averaging used in order to recalculate the centroid is weighted.
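To illustrate weighted centroid updates, the sketch below implements the textbook fuzzy c-means iteration, in which memberships come from inverse distances to the centroids. This is a swapped-in standard technique, not the "Fuzzy KNN" membership rule named on the slide; the fuzziness exponent of 2 and all names are assumptions.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def memberships(vector, centroids, fuzziness):
    """Soft membership of one vector in every cluster (sums to 1)."""
    d2 = [sq_dist(vector, c) for c in centroids]
    if min(d2) == 0.0:               # vector sits exactly on a centroid
        hits = [1.0 if d == 0.0 else 0.0 for d in d2]
        return [h / sum(hits) for h in hits]
    weights = [(1.0 / d) ** (1.0 / (fuzziness - 1.0)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]

def fuzzy_kmeans(vectors, initial_centroids, fuzziness=2.0, iterations=50):
    centroids = list(initial_centroids)
    for _ in range(iterations):
        u = [memberships(v, centroids, fuzziness) for v in vectors]
        for j in range(len(centroids)):
            w = [row[j] ** fuzziness for row in u]
            # Weighted mean: every vector contributes according to its membership.
            centroids[j] = tuple(
                sum(wi * x for wi, x in zip(w, column)) / sum(w)
                for column in zip(*vectors)
            )
    return centroids, [memberships(v, centroids, fuzziness) for v in vectors]

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, soft_labels = fuzzy_kmeans(data, initial_centroids=[(0, 0), (10, 10)])
```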

Clustering Demo – K Means

Clustering Demo – Fuzzy K Means

K Nearest Neighbours
• Given N vectors, classified into clusters, to which cluster should a new vector p_{N+1} be added?
• The algorithm:
* P is the set of the K vectors nearest to p_{N+1} (K here is the number of neighbours, which need not equal the number of clusters).
* p_{N+1} belongs to the cluster with the most vectors in P.
• There is a “tie-breaker” mechanism.
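A minimal majority-vote sketch is given below. The slide does not say how ties are broken, so the rule used here (the tied cluster that owns the closest neighbour wins) is an assumption, as are the function name and the toy data.

```python
from collections import Counter

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def knn_classify(train, labels, query, k):
    """Assign `query` to the cluster holding most of its k nearest training vectors."""
    order = sorted(range(len(train)), key=lambda i: sq_dist(train[i], query))[:k]
    votes = Counter(labels[i] for i in order)
    top = max(votes.values())
    tied = {label for label, count in votes.items() if count == top}
    # Assumed tie-breaker: among tied clusters, take the one with the closest neighbour.
    return next(labels[i] for i in order if labels[i] in tied)

train = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
clear_case = knn_classify(train, labels, query=(1, 1), k=3)
tied_case = knn_classify(train, labels, query=(5.5, 5.5), k=4)   # 2-2 vote
```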

Fuzzy KNN • Uses the K nearest neighbours, to calculate a probability distribution function of the new point’s affinity to each cluster.
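One common way to turn the K nearest neighbours into a soft assignment is inverse-distance weighting of their labels. The sketch below assumes that weighting with a fuzziness exponent of 2; the slide does not give the formula, so the weighting rule, the function name and the toy data are all assumptions.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def fuzzy_knn(train, labels, query, k, num_clusters, fuzziness=2.0):
    """Affinity of `query` to each cluster, normalised to sum to 1."""
    order = sorted(range(len(train)), key=lambda i: sq_dist(train[i], query))[:k]
    exponent = 1.0 / (fuzziness - 1.0)
    affinity = [0.0] * num_clusters
    for i in order:
        d2 = max(sq_dist(train[i], query), 1e-12)   # guard against zero distance
        affinity[labels[i]] += (1.0 / d2) ** exponent
    total = sum(affinity)
    return [a / total for a in affinity]

train = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
inside = fuzzy_knn(train, labels, query=(1, 1), k=3, num_clusters=2)
between = fuzzy_knn(train, labels, query=(5.5, 5.5), k=4, num_clusters=2)
```

A point between the clusters receives a split affinity instead of a hard label, which is the property the conversion stage would exploit.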

Incorporation of VQ in Voice Conversion - Summary
Steps to be taken:
• Represent speech events as vectors.
• Find an effective distance measure. (See Phoneme Separation.)
Questions to be answered:
• What is the optimal number of clusters? (Dependent on event type.)
• Will fuzzy KNN prove an effective method for increasing the robustness of the conversion?
Diagram blocks: Training signal (source); Parametric representation (HNM); Building the VQ codebook; Alignment (event against event); Segmentation into speech events and silence removal; Training signal (target); Parametric representation (HNM); Creating an induced clustering of target events

-4-Automatic Phoneme Separation and Alignment & HNM Distance

The Need for Automatic Phoneme Separation Regardless of the type of speech event used for conversion, separation into the simplest kind of event will enable extraction of larger ones. • A large training bank cannot be separated manually. • Use of complex speech events requires a larger training bank. • Option for online conversion.

Split Points: Redefinition of the Problem
The problem of separating N phonemes is equivalent to the problem of placing N-1 “Split Points” in the signal.
Split points are characterized by changes in:
• Energy
• Voiced/Unvoiced sound
• SPECTRAL ENVELOPE
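As a toy illustration of the first cue only, the sketch below marks candidate split points where the frame energy jumps by more than a threshold. It is not the separation method used in the project: the frame length, the 10 dB threshold and the function names are assumptions, and the voicing and spectral-envelope cues are left out.

```python
import math

def frame_energies_db(signal, frame_len):
    """Mean-square energy in dB for consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(mean_square + 1e-12))  # floor avoids log(0)
    return energies

def energy_split_points(signal, frame_len, jump_db=10.0):
    """Sample indices of frame boundaries where the energy changes by more than jump_db."""
    energies = frame_energies_db(signal, frame_len)
    return [
        i * frame_len
        for i in range(1, len(energies))
        if abs(energies[i] - energies[i - 1]) > jump_db
    ]

# A quiet segment followed by a loud one (assumed test signal).
signal = [0.01 * (-1) ** n for n in range(100)] + [(-1.0) ** n for n in range(100)]
candidates = energy_split_points(signal, frame_len=20)
```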

Current Results with Automatic Phoneme Separation

The HNM Distance Measure
• The signal is coded in two additive parts. Therefore the distance is made up of two additive parts:
where the weight is determined using the local energy of the two parts.
* Parameter p will be determined during optimization of the machine.
