Intra-Class Variability Modeling for Speech Processing

Presentation Transcript


  1. Intra-Class Variability Modeling for Speech Processing Dr. Hagai Aronowitz IBM Haifa Research Lab Presentation is available online at: http://aronowitzh.googlepages.com/

  2. Speech Classification Proposed framework • Given labeled training segments from class + and class –, classify unlabeled test segments • Classification framework • Represent speech segments in segment-space • Learn a classifier in segment-space • SVMs • NNs • Bayesian classifiers • …

  3. 1 Introduction to GMM based classification 2 Mapping speech segments into segment space 3 Intra-class variability modeling 4 Speaker diarization 5 Summary Outline Intra-Class Variability Modeling for Speech Processing

  4. Text-Independent Speaker Recognition: GMM-Based Algorithm [Reynolds 1995] GMM based speaker recognition • Train a universal background model (UBM) GMM using EM • For every target speaker S: train a GMM GS by applying MAP-adaptation, and use it to estimate Pr(yt|S) • Assuming frame independence: Pr(Y|S) = ∏t Pr(yt|S) [Figure: target GMMs Q1 (speaker #1) and Q2 (speaker #2) adapted from the UBM with means μ1, μ2, μ3 in the 26-dimensional MFCC feature space (R26)]
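
The recipe on this slide can be sketched in a few lines of Python. The snippet below is only an illustrative sketch built on scikit-learn's GaussianMixture: the means-only MAP adaptation and the relevance factor r=16 are assumptions of the sketch, not details taken from [Reynolds 1995], and the frame-independent score is written as the average log-likelihood ratio against the UBM.

```python
# Minimal GMM-UBM sketch (means-only MAP adaptation, relevance factor r=16 assumed).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(frames, n_components=8, seed=0):
    """Train the universal background model with EM on pooled frames."""
    return GaussianMixture(n_components=n_components, covariance_type='diag',
                           random_state=seed).fit(frames)

def map_adapt_means(ubm, speaker_frames, r=16.0):
    """Means-only MAP adaptation of the UBM towards one target speaker."""
    post = ubm.predict_proba(speaker_frames)            # (T, G) responsibilities
    n_g = post.sum(axis=0)                              # soft counts per Gaussian
    f_g = post.T @ speaker_frames                       # first-order statistics
    ml_means = f_g / np.maximum(n_g, 1e-10)[:, None]
    alpha = (n_g / (n_g + r))[:, None]                  # adaptation coefficients
    adapted = GaussianMixture(n_components=ubm.n_components, covariance_type='diag')
    adapted.weights_, adapted.covariances_ = ubm.weights_, ubm.covariances_
    adapted.means_ = alpha * ml_means + (1 - alpha) * ubm.means_
    adapted.precisions_cholesky_ = ubm.precisions_cholesky_
    return adapted

def llr_score(test_frames, speaker_gmm, ubm):
    """Frame-independent score: mean over frames of log Pr(y|S) - log Pr(y|UBM)."""
    return float(np.mean(speaker_gmm.score_samples(test_frames)
                         - ubm.score_samples(test_frames)))
```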

  5. GMM Based Algorithm – Analysis • Invalid frame independence assumption: factors such as channel, emotion, lexical variability, and speaker aging cause frame dependency • GMM scoring is inefficient – linear in the length of the audio • GMM scoring does not support indexing

  6. 1 Introduction to GMM based classification 2 Mapping speech segments into segment space 3 Intra-class variability modeling 4 Speaker diarization 5 Summary Outline Intra-Class Variability Modeling for Speech Processing

  7. Mapping Speech Segments into Segment Space GMM scoring approximation 1/4 • Definitions • X: training session for target speaker • Y: test session • Q: GMM trained for X • P: GMM trained for Y • Goal • Compute Pr(Y|Q) using GMMs P and Q only • Motivation • Efficient speaker recognition and indexing • More accurate modeling

  8. Mapping Speech Segments into Segment Space GMM scoring approximation 2/4 Negative cross entropy: (1) 1/|Y| · log Pr(Y|Q) ≈ −H(P‖Q), the negative cross entropy between GMMs P and Q • Approximating the cross entropy between two GMMs • Matching based lower bound [Aronowitz 2004] • Unscented-transform based approximation [Goldberger & Aronowitz 2005] • Other options in [Hershey 2007]

  9. Mapping Speech Segments into Segment Space GMM scoring approximation 3/4 • Matching based approximation of the cross entropy (2) • Assuming weights and covariance matrices are speaker independent (+ some approximations), the score reduces to a weighted distance between the Gaussian means of P and Q (3) • A mapping T from a session to the vector of its normalized GMM means is induced (4)
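
Equations (2)-(4) are not reproduced in this transcript, so the sketch below only illustrates the general idea under an assumed form of the mapping: stack the adapted Gaussian means, scaled by the speaker-independent mixture weights and diagonal covariances, so that the approximated score becomes, up to additive constants, a squared Euclidean distance between supervectors. The exact normalization in the ACE papers may differ.

```python
# Hypothetical supervector mapping T: stack adapted means, scaled by sqrt(weight)
# and the inverse standard deviation shared across speakers (diagonal covariances assumed).
import numpy as np

def supervector(gmm_means, ubm_weights, ubm_diag_covs):
    """Map a session GMM (its adapted means, shape (G, D)) to a single supervector."""
    scale = np.sqrt(ubm_weights)[:, None] / np.sqrt(ubm_diag_covs)  # (G, D)
    return (scale * gmm_means).ravel()                              # (G*D,)

def approx_score(sv_train, sv_test):
    """Score approximation: larger when the two supervectors are close (constants dropped)."""
    return -0.5 * float(np.sum((sv_train - sv_test) ** 2))
```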

  10. Mapping Speech Segments into Segment Space GMM scoring approximation 4/4 Results Figure and Table taken from: H. Aronowitz, D. Burshtein, “Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)”, in IEEE Trans. on Audio, Speech & Language Processing, September 2007.

  11. Other Mapping Techniques • Anchor modeling projection [Sturim 2001]: efficient but inaccurate • MLLR transforms [Stolcke 2005]: accurate but inefficient • Kernel-PCA-based mapping [Aronowitz 2007c]: accurate & efficient • Given a set of objects and a kernel function (a dot product between each pair of objects), kernel-PCA finds a mapping of the objects into Rn which preserves the kernel function.

  12. 1 Introduction to GMM based classification 2 Mapping speech segments into segment space 3 Intra-class variability modeling 4 Speaker diarization 5 Summary Outline Intra-Class Variability Modeling for Speech Processing

  13. Intra-Class Variability Modeling [Aronowitz 2005b] Introduction • The classic GMM algorithm does not explicitly model intra-speaker inter-session variability: • channel, noise • language • stress, emotion, aging • The frame independence assumption (1) Pr(Y|S) = ∏t Pr(yt|S) does not hold in these cases! • Instead, we can use a more relaxed assumption: frames are independent only given the session GMM Q (2), which leads to (3) Pr(Y|S) = ∫ Pr(Q|S) Pr(Y|Q) dQ

  14. Old vs. New Generative Models • Old Model: Speaker → a GMM → frame sequence generated independently • New Model: Speaker → a PDF over GMM space → a session GMM generated per session → frame sequence
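
A toy numpy sketch of the two generative models compared above; the Gaussian drift of the session means around the speaker means, and all of the sizes below, are assumptions chosen only to keep the illustration small.

```python
# Toy sampling from the two generative models (diagonal covariances, fixed weights assumed).
import numpy as np
rng = np.random.default_rng(0)
G, D, T = 4, 2, 200                    # Gaussians, feature dimension, frames per session

def sample_frames(means, covs, weights, n):
    """Draw n frames from a GMM with the given means, diagonal covariances and weights."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return means[comp] + rng.normal(size=(n, means.shape[1])) * np.sqrt(covs[comp])

weights = np.full(G, 1.0 / G)
covs = np.ones((G, D))
speaker_means = rng.normal(size=(G, D))         # the speaker's "true" GMM means

# Old model: every session reuses the same speaker GMM.
old_session = sample_frames(speaker_means, covs, weights, T)

# New model: each session first draws its own GMM means around the speaker means
# (intra-speaker, inter-session variability), then generates frames from that session GMM.
intra_speaker_std = 0.3                         # assumed scale of session-to-session drift
session_means = speaker_means + intra_speaker_std * rng.normal(size=(G, D))
new_session = sample_frames(session_means, covs, weights, T)
```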

  15. Session-GMM Space [Figure: the GMMs for session A and session B of speaker #1, together with sessions of speakers #2 and #3, shown as points in session-GMM space]

  16. Modeling in Session-GMM space 1/2 • Recall the mapping T induced by the GMM approximation analysis: T(X) is called a supervector • A speaker is modeled by a multivariate normal distribution in supervector space (3) • A typical dimension of the covariance matrix is 50,000 x 50,000 • The covariance matrix is estimated robustly using PCA + regularization: covariance is assumed to be a low rank matrix with an additional non-zero (noise) diagonal
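
A rough numpy sketch of the "PCA + regularization" estimate described above: the intra-speaker covariance is represented as a low-rank part learned from delta supervectors plus an isotropic noise floor. The rank, the noise variance, and the use of per-speaker means to form the deltas are assumptions of this sketch, not parameters taken from the paper.

```python
# Estimate intra-speaker covariance as low-rank + diagonal from delta supervectors.
import numpy as np

def intra_speaker_covariance(supervectors_by_speaker, rank=50, noise_var=1e-3):
    """supervectors_by_speaker: list of arrays, each of shape (n_sessions_i, dim)."""
    deltas = np.vstack([sv - sv.mean(axis=0, keepdims=True)
                        for sv in supervectors_by_speaker])     # remove each speaker's mean
    # PCA on the pooled delta supervectors: keep the top `rank` directions.
    _, s, vt = np.linalg.svd(deltas, full_matrices=False)
    eigvals = (s ** 2) / max(len(deltas) - 1, 1)
    basis, var = vt[:rank], eigvals[:rank]
    # The covariance is kept implicit (low-rank part + noise floor), so the full
    # dim x dim matrix is never materialized.
    return basis, var, noise_var

def log_likelihood(delta, basis, var, noise_var):
    """Gaussian log-likelihood of a delta supervector under the low-rank + diagonal model."""
    proj = basis @ delta                      # coordinates along the retained directions
    resid = delta - basis.T @ proj            # component in the orthogonal complement
    quad = np.sum(proj ** 2 / (var + noise_var)) + np.sum(resid ** 2) / noise_var
    logdet = (np.sum(np.log(var + noise_var))
              + (delta.size - len(var)) * np.log(noise_var))
    return float(-0.5 * (quad + logdet + delta.size * np.log(2 * np.pi)))
```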

  17. Modeling in Session-GMM Space 2/2 Estimating the covariance matrix [Figure: sessions of speakers #1, #2 and #3 shown in supervector space and, after removing each speaker's mean, pooled in delta supervector space, where the covariance matrix is estimated]

  18. Experimental Setup Datasets • The intra-speaker covariance matrix is estimated from the NIST-2006-SRE corpus • Evaluation is done on the NIST-2004-SRE corpus System description • ETSI MFCC (13-cep + 13-delta-cep) • Energy based voice activity detector • Feature warping • 2048 Gaussians • Target models are adapted from a GI-UBM • ZT-norm score normalization

  19. Results 38% reduction in EER

  20. Other Modeling Techniques • NAP+SVMs [Campbell 2006] • Factor Analysis [Kenny 2005] • Kernel-PCA [Aronowitz 2007c] Kernel-PCA based algorithm • Model each supervector as a sum s + u • s ∈ S: common speaker subspace • u ∈ U: speaker unique subspace • S is spanned by a set of development supervectors (700 speakers) • U is the orthogonal complement of S in supervector space • Intra-speaker variability is modeled separately in S and in U • U was found to be more discriminative than S • EER was reduced by 44% compared to baseline GMM
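
A minimal sketch of the S / U split described above, assuming S is simply the column space of the development supervectors; the kernel-PCA construction actually used in [Aronowitz 2007c] is described in the backup slides.

```python
# Split a supervector into its component in the common speaker subspace S (spanned by
# development supervectors) and the residual in the speaker-unique subspace U.
import numpy as np

def split_supervector(sv, dev_matrix):
    """sv: (dim,) supervector; dev_matrix: (n_dev, dim) development supervectors spanning S."""
    q, _ = np.linalg.qr(dev_matrix.T)         # orthonormal basis of S, shape (dim, n_dev)
    s_part = q @ (q.T @ sv)                   # projection onto S
    u_part = sv - s_part                      # residual in the orthogonal complement U
    return s_part, u_part
```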

  21. Kernel-PCA Based Modeling [Figure: sessions x and y are mapped by f into the kernel-induced feature space; K-PCA splits each image into coordinates Tx / Ty in the common speaker subspace (Rn, spanned by the anchor sessions) and components ux / uy in the speaker unique subspace]

  22. 1 Introduction to GMM based classification 2 Mapping speech segments into segment space 3 Intra-class variability modeling 4 Speaker diarization 5 Summary Outline Intra-Class Variability Modeling for Speech Processing

  23. Trainable Speaker Diarization [Aronowitz 2007d] Goals • Detect speaker changes – “speaker segmentation” • Cluster speaker segments – “speaker clustering” Motivation for new method • Current algorithms do not exploit available training data (besides tuning thresholds, etc.) Method • Explicitly model inter-segment intra-speaker variability from labeled training data, and use it as the metric for the change-detection / clustering algorithms.

  24. Speaker recognition on pairs of 3s segments • Dev data: BNAD05 (5hr) – Arabic, broadcast news • Eval data: BNAT05 – Arabic, broadcast news (207 target models, 6756 test segments)

  25. Speaker Diarization System & Experiments • Speaker change detection: 2 adjacent sliding windows (3s each), speaker verification scoring + normalization • Speaker clustering: speaker verification scoring + normalization, bottom-up clustering • Speaker Error Rate (SER) on BNAT05 • Anchor modeling (baseline): 12.9% • Kernel-PCA based method: 7.9%
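
A simplified sketch of the change-detection front end described above: adjacent 3 s windows are compared with a plain Euclidean distance between their supervectors, and changes are hypothesized at local maxima above a threshold. The real system uses speaker-verification scoring with the trained intra-speaker variability model plus score normalization, which is not reproduced here.

```python
# Two-window speaker change detection sketch (distance metric and threshold are assumptions).
import numpy as np

def change_scores(window_supervectors):
    """window_supervectors[i] is the supervector of the i-th sliding 3 s window."""
    # Higher distance between adjacent windows => more likely a speaker change between them.
    return [float(np.linalg.norm(left - right))
            for left, right in zip(window_supervectors[:-1], window_supervectors[1:])]

def detect_changes(scores, threshold):
    """Hypothesize a change wherever the score is a local maximum above the threshold."""
    changes = []
    for i in range(1, len(scores) - 1):
        if scores[i] > threshold and scores[i] >= scores[i - 1] and scores[i] >= scores[i + 1]:
            changes.append(i)
    return changes
```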

  26. 1 Introduction to GMM based classification 2 Mapping speech segments into segment space 3 Intra-class variability modeling 4 Speaker diarization 5 Summary Outline Intra-Class Variability Modeling for Speech Processing

  27. Summary 1/2 • A method for mapping speech segments into a GMM supervector space was described • Intra-speaker inter-session variability is modeled in GMM supervector space • Speaker recognition • EER was reduced by 38% on the NIST-2004 SRE • A corresponding kernel-PCA based approach reduces EER by 44% • Speaker diarization • SER for speaker diarization was reduced by 39%.

  28. Summary 2/2 Algorithms based on the proposed framework • Speaker recognition[Aronowitz 2005b; Aronowitz 2007c] • Speaker diarization (“who spoke when”) [Aronowitz 2007d] • VAD (voice activity detection) [Aronowitz 2007a] • Language identification [Noor & Aronowitz 2006] • Gender identification [Bocklet 2008] • Age detection [Bocklet 2008] • Channel/bandwidth classification [Aronowitz 2007d]

  29. Bibliography 1/2 [1] D. A. Reynolds et al., “Speaker identification and verification using Gaussian mixture speaker models,” Speech Communication, 17, 91-108. [2] D. E. Sturim et al., “Speaker indexing in large audio databases using anchor models”, in Proc. ICASSP, 2001. [3] H. Aronowitz, D. Burshtein, A. Amir, "Speaker indexing in audio archives using test utterance Gaussian mixture modeling", in Proc. ICSLP, 2004. [4] H. Aronowitz, D. Burshtein, A. Amir, "A session-GMM generative model using test utterance Gaussian mixture modeling for speaker verification", in Proc. ICASSP, 2005. [5] P. Kenny et al., “Factor Analysis Simplified”, in Proc. ICASSP, 2005. [6] H. Aronowitz, D. Irony, D. Burshtein, “Modeling Intra-Speaker Variability for Speaker Recognition”, in Proc. Interspeech, 2005. [7] J. Goldberger and H. Aronowitz, "A distance measure between GMMs based on the unscented transform and its application to speaker recognition", in Proc. Interspeech, 2005. [8] H. Aronowitz, D. Burshtein, "Efficient Speaker Identification and Retrieval", in Proc. Interspeech, 2005.

  30. Bibliography 2/2 [9] A. Stolcke et al., “MLLR Transforms as Features in Speaker Recognition”, in Proc. Interspeech, 2005. [10] E. Noor, H. Aronowitz, "Efficient Language Identification using Anchor Models and Support Vector Machines", in Proc. ISCA Odyssey Workshop, 2006. [11] W. M. Campbell et al., “SVM Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation”, in Proc. ICASSP, 2006. [12] H. Aronowitz, “Segmental modeling for audio segmentation”, in Proc. ICASSP, 2007. [13] J. R. Hershey and P. A. Olsen, “Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models”, in Proc. ICASSP, 2007. [14] H. Aronowitz, D. Burshtein, “Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)”, in IEEE Trans. on Audio, Speech & Language Processing, September 2007. [15] H. Aronowitz, “Speaker Recognition using Kernel-PCA and Intersession Variability Modeling”, in Proc. Interspeech, 2007. [16] H. Aronowitz, “Trainable Speaker Diarization”, in Proc. Interspeech, 2007. [17] T. Bocklet et al., “Age and Gender Recognition for Telephone Applications Based on GMM Supervectors and Support Vector Machines”, in Proc. ICASSP, 2008.

  31. Thanks! Presentation is available online at: http://aronowitzh.googlepages.com/

  32. Backup slides

  33. Kernel-PCA Based Mapping 2/5 Goals • Map sessions into feature space • Model in feature space [Figure: sessions x and y in session space are mapped by f into the dot-product feature space via the kernel trick, alongside the anchor sessions]

  34. Kernel-PCA Based Mapping 3/5 Given • a kernel K • n anchor sessions A1,…,An Find an orthonormal basis for the subspace spanned by the mapped anchor sessions Method • Compute eigenvectors of the centralized kernel-matrix ki,j = K(Ai,Aj) • Normalize eigenvectors by square-roots of corresponding eigenvalues → {vi} • The corresponding linear combinations of the mapped anchor sessions form the requested basis
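
The recipe on this slide corresponds to standard kernel-PCA. The numpy sketch below follows that recipe (centre the anchor kernel matrix, take its eigenvectors, scale by inverse square-root eigenvalues); the kernel function itself and the anchor list are inputs assumed by the sketch.

```python
# Kernel-PCA basis from n anchor sessions, following the recipe on the slide.
import numpy as np

def kpca_basis(anchors, kernel):
    """Return projection coefficients and centring statistics from the anchor sessions."""
    n = len(anchors)
    k = np.array([[kernel(a, b) for b in anchors] for a in anchors])
    one = np.full((n, n), 1.0 / n)
    kc = k - one @ k - k @ one + one @ k @ one          # centred kernel matrix
    eigvals, eigvecs = np.linalg.eigh(kc)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1e-10
    # Normalize each eigenvector by the square root of its eigenvalue -> {v_i}.
    alphas = eigvecs[:, keep] / np.sqrt(eigvals[keep])
    return alphas, k.mean(axis=0), k.mean()

def project(x, anchors, kernel, alphas, col_means, total_mean):
    """Coordinates T_x of a new session x in the common speaker subspace."""
    kx = np.array([kernel(x, a) for a in anchors])
    kx_centred = kx - col_means - kx.mean() + total_mean
    return kx_centred @ alphas
```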

  35. Kernel-PCA Based Mapping 4/5 • Common speaker subspace S – the span of the mapped anchor sessions • Speaker unique subspace U – the orthogonal complement of S • Given sessions x, y, each image f(x), f(y) may be uniquely represented as sx + ux, sy + uy with sx, sy in S and ux, uy in U • T is a mapping x→Rn with the property that it preserves dot products in the common speaker subspace: ⟨Tx,Ty⟩ = ⟨sx,sy⟩

  36. Kernel-PCA Based Mapping 5/5 [Figure: sessions x and y are mapped by f into feature space; K-PCA splits each image into coordinates Tx / Ty in the common speaker subspace (Rn, spanned by the anchor sessions) and components ux / uy in the speaker unique subspace]

  37. Modeling in Segment-GMM Supervector Space [Figure: the frame sequences of segments #1, #2, …, #n are each mapped into segment-GMM supervector space, where the speech, silence and music classes are modeled]

  38. Segmental Modeling for Audio Segmentation Goal • Segment audio accurately and robustly into speech / silence / music segments Novel idea • Acoustic modeling is usually done on a frame basis; segmentation/classification is usually done on a segment basis (using smoothing) • Why not explicitly model whole segments? • Note: speaker, noise, music-context, channel (etc.) are constant during a segment

  39. Speech / Silence Segmentation – Results 1/2

  40. Speech / Silence Segmentation – Results 2/2

  41. LID in Session Space [Figure: English, French and Arabic training sessions and a test session shown as points in session space]

  42. LID in Session Space - Algorithm • Front end: shifted delta cepstrum (SDC). • Represent every train/test session by a GMM super-vector. • Train a linear SVM to classify GMM super-vectors. • Results • EER=4.1% on the NIST-03 Eval (30sec sessions).
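
A sketch of the back end from the algorithm above: a single linear SVM over GMM supervectors. scikit-learn is used here for brevity and is an assumption of the sketch; the SDC front end and the supervector extraction are taken as given.

```python
# Language identification back end: linear SVM on GMM supervectors.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_lid(train_supervectors, train_languages):
    """train_supervectors: (n_sessions, dim); train_languages: one language label per session."""
    clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
    clf.fit(train_supervectors, train_languages)
    return clf

def identify(clf, test_supervector):
    """Predict the language of one test session from its supervector."""
    return clf.predict(test_supervector.reshape(1, -1))[0]
```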

  43. Anchor Modeling Projection Applications • Speaker indexing [Sturim et al., 2001] • Intersession variability modeling in projected space [Collet et al., 2005] • Speaker clustering [Reynolds et al., 2004] • Speaker segmentation [Collet et al., 2006] • Language identification [Noor and Aronowitz, 2006] Given: anchor models λ1,…,λn and session X = x1,…,xF Projection: the i-th coordinate of the projected session is the average normalized log-likelihood of X under anchor model λi
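
A small sketch of the projection defined above, computing one coordinate per anchor model as the average log-likelihood of the session frames; the UBM-based normalization used here is an assumption, since the slide only says "normalized".

```python
# Anchor modeling projection: one coordinate per anchor GMM, normalized by the UBM (assumed).
import numpy as np

def anchor_projection(frames, anchor_gmms, ubm):
    """frames: (F, D) feature vectors; anchor_gmms: list of fitted GMMs; ubm: fitted GMM."""
    ubm_ll = ubm.score_samples(frames)                     # per-frame UBM log-likelihood
    return np.array([float(np.mean(g.score_samples(frames) - ubm_ll))
                     for g in anchor_gmms])
```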

  44. Intra-Class Variability Modeling Introduction • The classic GMM algorithm does not explicitly model intra-speaker inter-session variability: • Noise • Channel • Language • Changing speaker characteristics – stress, emotion, aging • The frame independence assumption (1) Pr(Y|S) = ∏t Pr(yt|S) does not hold in these cases! • Instead, we get (2): a session-level model in which each session's GMM varies around the speaker's model
