HMM-Based Pseudo-Clean Speech Synthesis for SPLICE Algorithm

HMM-BASED PSEUDO-CLEAN SPEECH SYNTHESIS FOR SPLICE ALGORITHMJun Du, Yu Hu, Li-Rong Dai, Ren-Hua Wang Wen-Yi Chu Department of Computer Science & Information Engineering National Taiwan Normal University

Outline • Introduction • Review of SPLICE • Our modifications • Experiments and results • Conclusions

Introduction • The motivation of our approach is to relax the constraint of recorded stereo-data from a new viewpoint: pseudo-clean features generated by exploiting HMM-based synthesis method is used to replace the ideal clean features from one of the stereo channels in SPLICE. • Moreover, a simple ML-based bias adaptation algorithm to handle the mismatch between training and testing, which yields consistent improvements of different testing sets on aurora2, is proposed. • As a extension, this method of generating the pseudo clean features can be used in any algorithms like SPLICE where the stereo-data is needed.

Review of SPLICE(1/2) • Two assumptions of speech modeling and degradation • The first assumption is that the noisy speech cepstral vector follows the distribution of mixture of Gaussians: • The second assumption is that conditional distribution for clean vector given the noisy speech vector in each component is Gaussian whose mean vector is a linear transformation of with the bias vector as follows:

Review of SPLICE(2/2) • SPLICE training • Environment selection where Y is the sequence of the noisy speech vectors in the current utterance, p(E) is set equally for all E. • Cepstral Enhancement

State-level force-alignment • So first, using multi-training speech features, we can get ML-trained HMMs. • Then state-level force-alignment of multi-training features is performed.

HMM-based speech synthesis(1/2) • Multi-training HMMs are denoted as λ, and the speech parameter vector sequence to be determined is described as: • , where dynamic features and static features should satisfy some constraints: is the sequence of static cepstral vectors, the transformation matrix is decided by the relation between static and dynamic features. • From the state-level force-alignment, the state sequence for all frames can be given:

HMM-based speech synthesis(2/2) • We calculate the mean and covariance of state as follows: • For given λ and S, our target is to maximize likelihood function with respect to O. The objective function can be written as: (1) where the constant K is independent of O, by optimizing Eq. 1, we obtain a set of equations:

Post-processing • As the post processing, gain normalization should be taken based on the observation that there are big differences among the dynamic range of clean speech features, noisy speech features and synthesized speech features, respectively. where is the final synthesized pseudo-clean features applied to SPLICE algorithm.

ML-based bias adaptation • In order to handle the mismatch between training and testing, a simple ML-based bias adaptation algorithm is proposed. where is the adapted bias vector for the mixture , is the global bias shift to describe the mismatch between training and testing. • can be iteratively estimated using the maximum likelihood criterion: where the subscript denotes the dimensional index, is the feature vector of the current utterance at frame .

Experimental and results • CMN is applied before SPLICE algorithm. • The bias parameters are trained using the pseudo clean and real multi-style data for each of 17 noise conditions . • The noisy speech model consists of a mixutre of 256 Gaussians with diagonal covariance matrices.

Conclusuons • First, we remove the constraint of stereo-data by exploiting HMM-based synthesis method to generate the pseudo-clean speech parameters. • Then a simple ML-based bias adaptation algorithm to handle the mismatch between training and testing, which yields consistent improvements of different testing sets on aurora2, is proposed. • In our future work, we will further study the noise robustness of the HMM-based speech parameters generation, which can be combined with other robust techniques for noisy speech recognition.

HMM-Based Pseudo-Clean Speech Synthesis for SPLICE Algorithm