
An Evaluation of Many-to-One Voice Conversion Algorithms with Pre-Stored Speaker Data Sets

Daisuke Tani, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari and Kiyohiro Shikano. Nara Institute of Science and Technology (NAIST), Japan. August 23rd, 2007.


Presentation Transcript


  1. An Evaluation of Many-to-One Voice Conversion Algorithms with Pre-Stored Speaker Data Sets. Daisuke Tani, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari and Kiyohiro Shikano. Nara Institute of Science and Technology (NAIST), Japan. August 23rd, 2007.

  2. Contents: Many-to-One VC framework; Many-to-One VC algorithms; Experimental evaluations; Conclusion.

  3. Conventional Voice Conversion (VC). Training of the conversion model has some limitations: it uses parallel data in which the source and target speakers say the same sentences, it needs around 50 utterance pairs, and it converts only the trained source speaker. We would like to make VC more flexible: using arbitrary utterances, using only a few utterances, and converting arbitrary source speakers. [Diagram: source and target speakers utter the same sentences; the parallel data is used to train the conversion model.]

  4. Many-to-One VC (M-to-O VC) [T. Toda et al.]: convert arbitrary source speakers into the target speaker. An initial model is trained with multiple parallel data sets between pre-stored source speakers and the target speaker, and the model parameters are then adapted to an arbitrary source speaker. Applications: voice changer into movie stars' voices, speech translation systems, etc.

  5. Contents: Many-to-One VC framework; Many-to-One VC algorithms; Experimental evaluations; Conclusion.

  6. M-to-O VC Algorithms: (1) based on a source-independent GMM (SI-GMM) [T. Toda et al.]; (2) based on speaker selection (new algorithm); (3) based on eigenvoice conversion (EVC) [T. Toda et al.]; (4) based on EVC with speaker adaptive training (SAT) (new algorithm).

  7. Algorithm 1: M-to-O VC based on a Source-Independent GMM (SI-GMM) [T. Toda et al.]. We train a single conversion model for arbitrary source speakers. [Figure: feature space with the 1st, 2nd and 3rd mixture components covering the phonemes /a/, /i/ and /o/ of speakers A (red), B (blue) and C (green).] The parameters of the i-th mixture component of the SI-GMM are all tied parameters: the mixture weight, the mean vector (consisting of the source mean vector and the target mean vector), and the covariance matrix.
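
For reference, a minimal sketch of the joint source-target GMM that these parameters describe, written in my own notation (the symbols below are not taken from the slides):

```latex
P(\mathbf{z}_t \mid \lambda)
  = \sum_{i=1}^{M} w_i \,
    \mathcal{N}\!\left(\mathbf{z}_t;\ \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i\right),
\qquad
\mathbf{z}_t = \begin{bmatrix}\mathbf{x}_t\\ \mathbf{y}_t\end{bmatrix},
\quad
\boldsymbol{\mu}_i = \begin{bmatrix}\boldsymbol{\mu}_i^{(X)}\\ \boldsymbol{\mu}_i^{(Y)}\end{bmatrix},
\quad
\boldsymbol{\Sigma}_i =
  \begin{bmatrix}
    \boldsymbol{\Sigma}_i^{(XX)} & \boldsymbol{\Sigma}_i^{(XY)}\\
    \boldsymbol{\Sigma}_i^{(YX)} & \boldsymbol{\Sigma}_i^{(YY)}
  \end{bmatrix}
```

Here x_t is the source feature vector, y_t the time-aligned target feature vector, w_i the mixture weight, the two halves of the mean vector are the source and target mean vectors, and the covariance matrix couples the two; these correspond to the tied parameters listed on the slide.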

  8. Previous Training of the SI-GMM. In the previous training process, the SI-GMM is trained using all parallel data sets between the multiple pre-stored source speakers and the target speaker. The SI-GMM then converts an arbitrary source speaker's voice without any adaptation process.
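
As a rough illustration of this training step (not the authors' implementation), one could pool the time-aligned joint source-target features of all pre-stored source speakers and fit a single GMM; the function and variable names below are hypothetical:

```python
# Illustrative sketch only: trains a joint (source||target) GMM on features
# pooled over all pre-stored source speakers, i.e. an SI-GMM in the sense of
# this slide. Feature extraction and DTW alignment are assumed to be done
# elsewhere.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_si_gmm(parallel_sets, n_mix=128, seed=0):
    """parallel_sets: list of (X_s, Y_s) pairs, one per pre-stored source
    speaker, where X_s and Y_s are time-aligned (frames x dim) arrays of
    source and target spectral features."""
    joint = np.vstack([np.hstack([X, Y]) for X, Y in parallel_sets])
    gmm = GaussianMixture(n_components=n_mix, covariance_type="full",
                          random_state=seed)
    gmm.fit(joint)          # EM over the pooled joint feature vectors
    return gmm
```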

  9. Problem of the SI-GMM. The phonemic space of one speaker often overlaps with that of another speaker, so the SI-GMM might cause conversion errors. [Figure: the phonemes /a/, /i/ and /o/ of speakers A (red), B (blue) and C (green) overlap across the 1st, 2nd and 3rd mixture components.]

  10. Algorithm 2: M-to-O VC based on Speaker Selection (speaker selection: [S. Yoshizawa et al., 2001]). We train the conversion model using only those pre-stored source speakers whose voice characteristics are similar to those of the given source speaker. [Figure: in the feature space, the given source speaker (black) lies close to speakers A (red) and C (green), so speakers A and C are selected.]

  11. Process of Speaker Selection. Previous training process: (1) train the SI-GMM from the multiple pre-stored source speakers and the target speaker; (2) train speaker-dependent GMMs (SD-GMMs). Adaptation process: (3) calculate the likelihood of the adaptation data of the given source speaker for each SD-GMM; (4) sort the likelihoods; (5) select the N-best parallel data sets based on the likelihoods; (6) train the conversion model on the selected pre-stored source speakers and the target speaker (a code sketch follows).
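
A minimal sketch of steps 3 to 6, assuming SD-GMMs have already been trained for every pre-stored source speaker; all names are hypothetical and the actual implementation may differ:

```python
# Likelihood-based N-best speaker selection (steps 3-6 of the slide).
# sd_gmms: dict {speaker_id: GaussianMixture} trained on each pre-stored
# source speaker's source features.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_speakers(sd_gmms, X_adapt, n_best=27):
    """Rank pre-stored source speakers by the average log-likelihood of the
    adaptation frames X_adapt (frames x dim) and return the N best IDs."""
    scores = {spk: gmm.score(X_adapt) for spk, gmm in sd_gmms.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)   # steps 3-4
    return ranked[:n_best]                                  # step 5

def train_selected_model(parallel_sets, selected, n_mix=128):
    """Train the conversion model only on the selected speakers' parallel
    data. parallel_sets: dict {speaker_id: (X_s, Y_s)} of time-aligned
    joint source/target features."""
    joint = np.vstack([np.hstack([X, Y])
                       for spk, (X, Y) in parallel_sets.items()
                       if spk in selected])
    return GaussianMixture(n_components=n_mix,
                           covariance_type="full").fit(joint)  # step 6
```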

  12. Problem of Speaker Selection. The resulting conversion model merely covers the selected pre-stored source speakers, so it is not necessarily suitable for the given source speaker. [Figure: the model trained by speaker selection (speakers A and C selected) differs from the desired conversion model centred on the given source speaker (black).]

  13. Algorithm 3: M-to-O VC based on Eigenvoice Conversion (EVC) [T. Toda et al.]. The conversion model is adapted to the source speaker by adjusting the weights on the individual eigenvoices (the 1st through (S-1)-th eigenvectors) in an unsupervised manner.

  14. Eigenvoice GMM (EV-GMM). In the i-th mixture component, the source mean vector is represented as the bias vector (average voice) plus a weighted sum of representative vectors (eigenvoices). The weight vector is the only free parameter and can be estimated from adaptation data; the mixture weight, the bias vector, the representative vectors, the target mean vector and the covariance matrix are tied parameters (see the sketch below).
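
As a minimal sketch, again in my own notation rather than the slide's, the adapted source mean of the i-th mixture component is

```latex
\boldsymbol{\mu}_i^{(X)}(\mathbf{w})
  = \mathbf{B}_i \mathbf{w} + \mathbf{b}_i^{(0)},
\qquad
\mathbf{B}_i = \left[\mathbf{b}_i^{(1)}, \ldots, \mathbf{b}_i^{(J)}\right]
```

where b_i^(0) is the bias vector (average voice), the J columns of B_i are the representative vectors (eigenvoices), and w is the weight vector, the only free parameter, estimated from the adaptation data.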

  15. Previous Training of the EV-GMM. Previous training process: (1) train the SI-GMM from the multiple pre-stored source speakers and the target speaker; (2) train the SD-GMMs; (3) construct supervectors from the SD-GMMs; (4) estimate the bias vectors and representative vectors from the supervectors; (5) construct the EV-GMM from the bias vectors and representative vectors (a sketch follows).
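
The slides do not spell out how the bias and representative vectors are estimated; eigenvoice approaches typically use PCA over the supervectors, which is what this hypothetical sketch assumes:

```python
# Sketch of steps 3-4: build supervectors from the SD-GMM source means and
# estimate the bias and representative vectors by PCA (assumption; the exact
# estimation procedure is not given on the slide).
import numpy as np
from sklearn.decomposition import PCA

def build_eigenvoices(sd_source_means, n_repr=159):
    """sd_source_means: array (S speakers, M mixtures, D dims) holding each
    SD-GMM's source mean vectors, mixture-aligned across speakers."""
    S, M, D = sd_source_means.shape
    supervectors = sd_source_means.reshape(S, M * D)        # step 3
    pca = PCA(n_components=n_repr)
    pca.fit(supervectors)                                   # step 4
    bias = pca.mean_.reshape(M, D)                          # bias vectors
    repr_vecs = pca.components_.T.reshape(M, D, n_repr)     # B_i per mixture
    return bias, repr_vecs
```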

  16. Problem of EVC. The tied parameters of the EV-GMM come from the SI-GMM, so they are not suitable for the given source speaker; e.g., the source covariance values are much larger than those of the desired conversion model. [Figure: the adapted EV-GMM still has overly broad components compared with the desired conversion model for the given source speaker (black).]

  17. Algorithm 4: M-to-O VC based on EVC with Speaker Adaptive Training (SAT) (SAT: [T. Anastasakos et al., 1996]). The EV-GMM is trained in advance so that its adaptation performance is improved. The training criterion is the total likelihood over all pre-stored source speakers, i.e. the product of the likelihoods of the adapted EV-GMM for each pre-stored source speaker (see the criterion below). [Figure: the EV-GMM adapted after SAT fits the given source speaker (black) more tightly than the EV-GMM adapted without SAT.]
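
A sketch of that criterion in my own notation (not from the slide): the tied parameters and the per-speaker weight vectors are estimated jointly as

```latex
\left\{\hat{\lambda},\ \hat{\mathbf{w}}^{(1)}, \ldots, \hat{\mathbf{w}}^{(S)}\right\}
  = \operatorname*{arg\,max}_{\lambda,\ \mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(S)}}
    \prod_{s=1}^{S}
    P\!\left(\mathbf{X}^{(s)}, \mathbf{Y}^{(s)} \,\middle|\, \lambda, \mathbf{w}^{(s)}\right)
```

where the tied parameters are denoted by lambda, w^(s) is the weight vector adapting the EV-GMM to the s-th pre-stored source speaker, and (X^(s), Y^(s)) is that speaker's parallel data with the target speaker.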

  18. SAT for the EV-GMM. Starting from the previously trained EV-GMM: (1) train the speaker-dependent parameters (the weight vectors) for each pre-stored source speaker; (2) train the tied parameters (GMM weights, bias vectors, representative vectors, target mean vectors and covariance matrices) of the canonical EV-GMM; (3) iterate steps 1 and 2 (a sketch of the loop follows).
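
Purely structural sketch of the alternating loop; the two update callables stand in for the actual re-estimation formulas, which are not given on the slide:

```python
# Alternating SAT loop (steps 1-3 of the slide). estimate_weight_vector and
# update_tied_parameters are hypothetical placeholders for the EM-style
# re-estimation of the speaker-dependent and tied parameters.
def sat_train(ev_gmm, parallel_sets, estimate_weight_vector,
              update_tied_parameters, n_iter=5):
    """parallel_sets: {speaker_id: (X_s, Y_s)} time-aligned parallel data."""
    for _ in range(n_iter):                                          # step 3
        weights = {spk: estimate_weight_vector(ev_gmm, X, Y)         # step 1
                   for spk, (X, Y) in parallel_sets.items()}
        ev_gmm = update_tied_parameters(ev_gmm, parallel_sets, weights)  # step 2
    return ev_gmm, weights
```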

  19. Comparison of M-to-O VC Algorithms

| Algorithm | Source mean vectors | Tied parameters |
|---|---|---|
| Based on SI-GMM | Not adapted | Not adapted |
| Based on speaker selection | Roughly adapted | Roughly adapted |
| Based on EVC | Adapted | Not adapted |
| Based on EVC with SAT | Adapted | Previously optimized |

  20. Contents: Many-to-One VC framework; Many-to-One VC algorithms; Experimental evaluations; Conclusion.

  21. Experimental Conditions. Training stage: 1 male target speaker; 160 pre-stored source speakers (80 males and 80 females); 50 sentences uttered by each speaker. Adaptation stage: 10 source speakers (5 males and 5 females). Number of mixtures: 128; number of representative vectors: 159; number of selected speakers: 27.

  22. Experimental Conditions (cont'd). Objective evaluation: test data of 21 sentences; objective measure, spectral distortion; number of adaptation sentences varied from 1/32 to 32. Subjective evaluation: preference test on the speech quality of the converted voices; 6 subjects (each subject evaluated 120 sample pairs); 2 adaptation sentences.
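
The slide only names "spectral distortion" as the objective measure without defining it; mel-cepstral distortion is a common choice for GMM-based VC and is sketched here under that assumption:

```python
# Mel-cepstral distortion (MCD) between converted and target speech, a common
# spectral-distortion measure; the paper's exact definition may differ.
import numpy as np

def mel_cepstral_distortion(mc_converted, mc_target):
    """mc_*: time-aligned (frames x order) mel-cepstra with the 0th (energy)
    coefficient already excluded. Returns the mean distortion in dB."""
    diff = mc_converted - mc_target
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```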

  23. Result of Objective Evaluation. The adaptation techniques improve the conversion accuracy, and SAT brings further improvements. EVC and EVC with SAT cause large distortions when the amount of adaptation data is very limited, whereas speaker selection is effective even with a very limited amount of adaptation data. [Figure: spectral distortion versus the number of adaptation sentences; lower is better.]

  24. Result of Subjective Evaluation. Every adaptation technique improves the quality of the converted speech.

  25. Contents: Many-to-One VC framework; Many-to-One VC algorithms; Experimental evaluations; Conclusion.

  26. Conclusions. We conducted an experimental evaluation of many-to-one VC algorithms: based on the SI-GMM [T. Toda et al.], based on EVC [T. Toda et al.], and two new methods, based on speaker selection and based on EVC with SAT. The objective and subjective evaluations showed that the adaptation process results in a better conversion model than the SI-GMM, and that the algorithm based on speaker selection works well with a very small amount of adaptation data.

  27. Thank you for your attention! Any questions?
