

Robust Speaker Recognition

JHU Summer School 2008

Lukas Burget

Brno University of Technology


Robust speaker recognition

Variability refers to changes in channel effects between training and successive detection attempts. Channel/session variability encompasses several factors:

  • The microphones: carbon-button, electret, hands-free, array, etc.

  • The acoustic environment: office, car, airport, etc.

  • The transmission channel: landline, cellular, VoIP, etc.

  • Differences in the speaker's voice: aging, mood, spoken language, etc.

Anything that affects the spectrum can cause problems: speaker and channel effects are bound together in the spectrum, and hence in the features used by speaker verifiers.

Intersession Variability

NIST SRE 2008, interview speech:

  • Different microphone in training and test: about 3% EER

  • The same microphone in training and test: < 1% EER

The largest challenge to practical use of speaker detection systems is channel/session variability.


Channel/Session Compensation

Channel/session compensation occurs at several levels in a speaker detection system:

  • Signal domain: noise removal, tone removal

  • Feature domain: cepstral mean subtraction, RASTA filtering, mean & variance normalization, feature warping, feature mapping, eigenchannel adaptation in the feature domain

  • Model domain: speaker model synthesis, eigenchannel compensation, joint factor analysis, nuisance attribute projection

  • Score domain: Z-norm, T-norm, ZT-norm

(Block diagram: front-end processing feeds a target model, adapted from the background model; both models score the input, and the resulting LR score is normalized.)




Adaptive Noise Suppression

  • Basic idea of spectral subtraction (or a Wiener filter): Y(n) = X(n) − N(n), where

    • Y(n) is the enhanced speech spectrum,

    • X(n) is the spectrum of the n-th frame of noisy speech,

    • N(n) is an estimate of the stationary additive noise spectrum.

  • Reformulated as filtering: Y(n) = H(n)X(n), where H(n) = (X(n) − N(n)) / X(n).

  • It is necessary to smooth H(n) in time and to make sure the magnitude spectrum is not negative.
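The two safeguards above (time-smoothing of H(n) and flooring the gain so the magnitude spectrum stays non-negative) can be sketched in numpy. This is a minimal illustration, not the toolkit used in the slides; `floor` and `alpha` are illustrative parameters:

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, floor=0.01, alpha=0.9):
    """Spectral subtraction as a time-smoothed Wiener-style gain (sketch).

    noisy_mag : (frames, bins) magnitude spectra X(n)
    noise_mag : (bins,) estimate of the stationary noise magnitude N(n)
    """
    H_prev = np.ones(noisy_mag.shape[1])
    out = np.empty_like(noisy_mag)
    for n, X in enumerate(noisy_mag):
        H = (X - noise_mag) / np.maximum(X, 1e-10)  # H(n) = (X - N) / X
        H = np.maximum(H, floor)                    # keep the magnitude non-negative
        H = alpha * H_prev + (1 - alpha) * H        # smooth H(n) in time
        out[n] = H * X                              # Y(n) = H(n) X(n)
        H_prev = H
    return out
```

In practice the noise estimate would be updated from frames detected as background, as the following slide describes.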


Robust speaker recognition

Adaptive Noise Suppression

  • Goal: Suppress wideband noise and preserve the speech

  • Approach: Maintain transient and dynamic speech components, such as energy bursts in consonants, that are important “information-carriers”

  • The suppression algorithm has two primary components:

    • Detection of speech or background in each frame.

    • A suppression component using an adaptive Wiener filter, which requires:

      • the underlying speech signal spectrum, obtained by smoothing the enhanced output,

      • the background spectrum,

      • a signal change measure, given by a spectral derivative, for controlling smoothing constants.


Robust speaker recognition

Adaptive Noise Suppression

  • C3 example from ICSI, processed with the LLEnhance toolkit for wideband noise reduction.

(Audio examples at SNR = 15 dB and SNR = 25 dB.)




Cepstral Mean Subtraction

  • MFCC feature extraction scheme: Fourier transform → magnitude → filter bank → Log() → cosine transform.

  • Consider the same speech signal recorded twice, over different microphones attenuating certain frequencies.

  • Scaling in the magnitude spectrum domain corresponds to a constant shift of the log filter bank outputs (e.g. scaling by 0.5 ≈ a shift of −0.3 in log10).

(Figure: two log filter bank output trajectories over frames, offset by a constant.)


Robust speaker recognition

Cepstral Mean Subtraction

  • Assuming the frequency characteristics of the two microphones do not change over time, the whole temporal trajectories of the affected log filter bank outputs differ by a constant.

  • The shift disappears after subtracting the mean computed over the segment.

  • Usually only speech frames are considered for the mean estimation.

  • Since the cosine transform is a linear operation, the same trick can be applied directly in the cepstral domain.
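The subtraction described above fits in a few lines of numpy; the optional mask reflects the note that usually only speech frames enter the mean estimate (a sketch, with hypothetical names):

```python
import numpy as np

def cepstral_mean_subtraction(cepstra, speech_mask=None):
    """Subtract the per-coefficient mean computed over the segment.

    cepstra     : (frames, coeffs) cepstral features
    speech_mask : optional boolean (frames,) array; if given, the mean is
                  estimated from speech frames only, but subtracted everywhere.
    """
    frames = cepstra if speech_mask is None else cepstra[speech_mask]
    return cepstra - frames.mean(axis=0)
```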


NIST SRE 2005 all trials

(DET plot, miss probability [%] vs. false alarm probability [%]: 2048 Gaussians, 13 MFCC + deltas, CMS.)

RASTA Filtering

  • Filter the log filter bank output (or, equivalently, cepstral) temporal trajectories with a band-pass filter.

  • Remove slow changes to compensate for the channel effect (≈ CMS over a 0.5 s sliding window).

  • Remove fast changes (> 25 Hz), which are unlikely to be caused by the speaker, given the limited ability to quickly change the vocal tract configuration.

(Figures: the filter's frequency characteristic, magnitude [dB] versus frequency [Hz] from 0.01 to 100 Hz; its impulse response over time [s]; and a cepstral trajectory over frames, original versus RASTA filtered.)
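A band-pass filter of this kind can be sketched as follows. The transfer function is the classic RASTA filter H(z) = 0.1·(2 + z⁻¹ − z⁻³ − 2z⁻⁴)/(1 − p·z⁻¹); the pole value p is an assumption here (implementations vary between roughly 0.94 and 0.98):

```python
import numpy as np

def rasta_filter(traj, pole=0.94):
    """Band-pass filter one log filter bank (or cepstral) trajectory.

    Implements H(z) = 0.1 * (2 + z^-1 - z^-3 - 2*z^-4) / (1 - pole*z^-1)
    as a direct-form recursion over frames.
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # FIR numerator (ramp)
    y = np.zeros_like(traj, dtype=float)
    for n in range(len(traj)):
        acc = sum(b[k] * traj[n - k] for k in range(5) if n - k >= 0)
        y[n] = acc + (pole * y[n - 1] if n > 0 else 0.0)  # single-pole smoothing
    return y
```

Because the numerator coefficients sum to zero, a constant (channel) offset in the input decays to zero at the output, which is the CMS-like behaviour described above.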


NIST SRE 2005 all trials

(DET plot: 2048 Gaussians, 13 MFCC + deltas, CMS; with RASTA.)


Mean and Variance Normalization

  • While convolutive noise causes a constant shift of the cepstral coefficient temporal trajectories, noise additive in the spectral domain fills valleys in the trajectories.

  • In addition to subtracting the mean, each trajectory can be normalized to unit variance (i.e. divided by its standard deviation) to compensate for this effect.

(Figure: trajectories over frames for clean speech and speech with additive noise, original and after CMN/CVN.)
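Combining the two steps, CMN/CVN is a per-coefficient standardization over the segment (a minimal sketch; `eps` is an illustrative guard, not from the slides):

```python
import numpy as np

def mean_variance_normalize(feats, eps=1e-10):
    """CMN + CVN: zero mean and unit variance per coefficient over a segment."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / np.maximum(sigma, eps)  # guard against flat trajectories
```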


Feature Warping

  • Warp each cepstral coefficient, within a 3-second sliding window, to a Gaussian distribution.

  • Combines advantages of the previous techniques (CMN/CVN, RASTA).

  • The resulting coefficients are (locally) Gaussianized and thus more suitable for GMM modelling.

(Figure: the empirical cumulative distribution, from 0.0 to 1.0, is mapped through the inverse Gaussian cumulative density function.)
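The rank-based warping can be sketched as below: the value's rank within the sliding window gives an empirical CDF, which is mapped through the inverse Gaussian CDF. The window length of 300 frames is an assumption (3 s at a 10 ms frame rate):

```python
import numpy as np
from statistics import NormalDist

def feature_warp(traj, win=300):
    """Warp one cepstral-coefficient trajectory to a standard normal
    distribution within a sliding window (sketch)."""
    norm = NormalDist()
    half = win // 2
    out = np.empty_like(traj, dtype=float)
    for t in range(len(traj)):
        window = traj[max(0, t - half): t + half + 1]
        rank = np.sum(window < traj[t]) + 1        # rank of the current value
        cdf = (rank - 0.5) / len(window)           # empirical CDF in (0, 1)
        out[t] = norm.inv_cdf(cdf)                 # inverse Gaussian CDF
    return out
```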


NIST SRE 2005 all trials

(DET plot: 2048 Gaussians, 13 MFCC + deltas, CMS; with RASTA; with feature warping.)


NIST SRE 2005 all trials

(DET plot: 2048 Gaussians, 13 MFCC + deltas, CMS; with RASTA; with feature warping; + triple deltas; + HLDA.)


Example of 2D GMM

(Figure: a Gaussian mixture model fitted to two-dimensional data.)


Robust speaker recognition

HLDA

Heteroscedastic Linear Discriminant Analysis provides a linear transformation that de-correlates classes.


Robust speaker recognition

HLDA

HLDA allows for dimensionality reduction while preserving the discriminability between classes (HLDA without dimensionality reduction is also called MLLT).

(Figure: after the HLDA rotation, one axis is a useful dimension and the other a nuisance dimension.)




Robust speaker recognition

Speaker Model Synthesis

  • It is generally difficult to get enrollment speech from all microphone types to be used.

  • The SMS approach addresses this by synthetically generating speaker models as if they came from different microphones (Teunen, ICSLP 2000).

  • A mapping of model parameters between different microphone types is applied.

(Diagram: synthesis between cellular, electret, and carbon-button channel models.)


Robust speaker recognition

Speaker Model Synthesis

  • Learning the mapping of model parameters between different microphone types:

    • Start with a channel-independent root model.

    • Create channel models by adapting the root with channel-specific data.

    • Learn the mean shift between channel models.


Speaker Model Synthesis

  • Training a speaker model:

    • Adapt the channel model which scores highest on the training data to get the target model.

    • Synthesize a target model for the test channel by applying the learned shift.

  • GMM weights and variances can also be adapted and used to improve the mapping of model parameters between different microphone types.
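In mean-supervector terms (the deck later writes this as MCD2 = MCD1 + kCD1CD2), the synthesis step is just a vector shift. A minimal sketch, with all names hypothetical:

```python
import numpy as np

def synthesize_target(target_cd1, root_cd1, root_cd2):
    """Speaker Model Synthesis as a mean-supervector shift (sketch).

    target_cd1 : target speaker's mean supervector adapted on channel CD1
    root_cd1   : channel model CD1 (adapted from the root model)
    root_cd2   : channel model CD2 (adapted from the root model)
    Returns the synthesized target supervector for channel CD2.
    """
    shift = root_cd2 - root_cd1   # learned cross-channel mean shift k
    return target_cd1 + shift     # M_CD2 = M_CD1 + k
```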




Robust speaker recognition

Feature Mapping

  • Aim: apply a transform to map channel-dependent feature spaces into a channel-independent feature space.

  • Approach:

    • Train a channel-independent (CI) model by pooling data from all channel types.

    • Train channel-dependent (CD) models using MAP adaptation.

    • For an utterance, find the top-scoring CD model (channel detection).

    • Map each feature vector in the utterance into the CI space.

(Diagram: CD 1, CD 2, …, CD N models mapping into the CI model.)

D.A. Reynolds, “Channel Robust Speaker Verification via Feature Mapping,” ICASSP 2003.


Feature mapping

  • As for SMS, create channel models by adapting the root with channel-specific data.

  • Learn mean shifts between each channel model and the channel-independent root model.


Feature mapping

  • For each (training or test) speech segment, determine the maximum-likelihood channel model.

  • For each frame of the segment, record the top-1 Gaussian.

  • For each frame, apply the mapping taking x under the CD pdf to y under the CI pdf.

  • The target model is adapted from the CI model using the mapped features.

  • Mapped features and CI models are used in test.
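For the top-1 Gaussian of each frame, the CD-to-CI mapping is usually written as the per-Gaussian affine transform below; a numpy sketch assuming diagonal covariances:

```python
import numpy as np

def map_frame(x, mu_cd, sigma_cd, mu_ci, sigma_ci):
    """Map one feature vector from its top-1 CD Gaussian to the CI space.

    Per-Gaussian affine mapping: y = (x - mu_cd) * (sigma_ci / sigma_cd) + mu_ci
    so that x under the CD pdf corresponds to y under the CI pdf.
    """
    return (x - mu_cd) * (sigma_ci / sigma_cd) + mu_ci
```

A vector at the CD Gaussian's mean is mapped exactly onto the CI Gaussian's mean, as expected.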


NIST SRE 2005 all trials

(DET plot: 2048 Gaussians, 13 MFCC + deltas, CMS; with RASTA; with feature warping; + triple deltas; + HLDA; + feature mapping (14 classes).)


Session variability in mean supervector space

  • GMM mean supervector: a column vector created by concatenating the mean vectors of all GMM components.

  • When variances are shared by all speaker models, the supervector M fully defines the speaker model.

  • Speaker Model Synthesis can be rewritten as MCD2 = MCD1 + kCD1CD2, where kCD1CD2 is the cross-channel shift.

  • Drawbacks of SMS (and Feature Mapping):

    • Channel-dependent models must be created for each channel.

    • Different factors causing intersession variability may combine (e.g. channel and language), so compensation must be trained for each such combination.

    • The factors are not discrete (i.e. their effects on the intersession variability may be more or less strong).

  • There is evidence that only a limited number of directions in the supervector space are strongly affected by intersession variability; different directions possibly correspond to different factors.


Robust speaker recognition

Session variability in mean supervector space

Example: a single Gaussian model with 2D features.

(Figure: UBM and target speaker model, with a high speaker variability direction and a high inter-session variability direction.)


Robust speaker recognition

Session compensation in supervector space

For recognition, move both models (the target speaker model and the UBM) along the high inter-session variability direction(s) to fit the test data well (e.g. in the ML sense).

(Figure: target speaker model, UBM, and test data in the supervector space.)



6D example of supervector space


Identifying high intersession variability directions

  • Take multiple speech segments from many training speakers recorded under different channel conditions. For each segment, derive a supervector by MAP-adapting the UBM.

  • From each supervector, subtract the mean computed over the supervectors of the corresponding speaker.

  • Find the directions with the largest intersession variability using PCA (the eigenvectors of the average within-speaker covariance matrix).

(Figure: supervectors of speakers 1, 2, and 3, and the resulting eigenchannel direction U.)
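The three steps above translate directly into numpy; this is a sketch, and `n_channels` plus the data shapes are assumptions:

```python
import numpy as np

def eigenchannels(supervectors, speaker_ids, n_channels=2):
    """Estimate eigenchannel directions U by PCA (sketch).

    supervectors : (segments, dim) MAP-adapted supervectors
    speaker_ids  : (segments,) speaker label for each segment
    Returns (dim, n_channels): top eigenvectors of the average
    within-speaker covariance matrix.
    """
    centered = supervectors.astype(float).copy()
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        centered[idx] -= centered[idx].mean(axis=0)   # remove per-speaker means
    cov = centered.T @ centered / len(centered)       # within-speaker covariance
    eigval, eigvec = np.linalg.eigh(cov)
    return eigvec[:, np.argsort(eigval)[::-1][:n_channels]]
```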


Eigenchannel adaptation

  • The speaker model is obtained in the usual way by MAP-adapting the UBM.

  • For test, adapt the speaker model and the UBM by moving their supervectors in the direction(s) of the eigenchannel(s) to fit the test data well: find the factors x maximizing the likelihood of the test data for the adapted supervector M + Ux.

  • The score is the LLR computed using the adapted speaker model and UBM.

(Figure: target speaker model M, UBM, test data, and eigenchannel direction U.)

N. Brummer, SDV NIST SRE’04 system description, 2004.


NIST SRE 2005 all trials

(DET plot: 2048 Gaussians, 13 MFCC + deltas, CMS; with RASTA; with feature warping; + triple deltas; + HLDA; + feature mapping (14 classes); + eigenchannel adaptation.)


Nuisance Attribute Projection

  • NAP is an intersession compensation technique proposed for SVMs.

  • Project out the eigenchannel directions U from supervectors before using them for SVM training or test.
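Projecting out the nuisance directions amounts to applying (I − UUᵀ) to each supervector, assuming U has orthonormal columns; a minimal sketch:

```python
import numpy as np

def nap_project(supervector, U):
    """Remove nuisance (eigenchannel) directions from a supervector (sketch).

    U : (dim, k) orthonormal basis of nuisance directions.
    Applies the projection (I - U U^T) without forming the full matrix.
    """
    return supervector - U @ (U.T @ supervector)
```

After projection, the supervector has no component left along any nuisance direction, and applying the projection twice changes nothing.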


Robust speaker recognition

Constructing models in supervector space

  • Speaker Model Synthesis: MCD2 = MCD1 + kCD1CD2, a constant supervector shift for the recognized training and test channels.

  • Eigenchannel adaptation: Mtest = Mtrain + Ux, where the shift is a linear combination of the eigenchannel basis U with factors x tuned to the test data.

  • Eigenvoice adaptation: also consider a supervector subspace V with high speaker variability and use it to obtain the speaker model:

    • M = MUBM + Vy, the speaker model given by a linear combination of the UBM supervector and the eigenvoice basis, with speaker factors y tuned to match the enrollment data.

  • The two subspaces can be combined: M = MUBM + Vy + Ux, with both x and y estimated on the enrollment data, and only x updated for the test data to adapt the speaker model to the test channel condition.

(Figure: high speaker variability and high intersession variability directions in the supervector space.)


Robust speaker recognition

Joint Factor Analysis

  • M = MUBM + Vy + Dz + Ux

  • Probabilistic model:

    • Gaussian priors are assumed for the factors y, z, x.

    • The hyperparameters MUBM, V, D, U can be trained using the EM algorithm.

    • D is a diagonal matrix describing the remaining speaker variability not covered by the eigenvoices.

(Figure: eigenvoice directions v1, v2; eigenchannel directions u1, u2; and diagonal entries d11, d22, d33.)
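The JFA decomposition above composes a supervector from its factors; a minimal sketch of that synthesis step only (the EM training of the hyperparameters is not shown):

```python
import numpy as np

def jfa_supervector(m_ubm, V, D, U, y, z, x):
    """Compose a speaker- and channel-dependent supervector (sketch).

        M = M_UBM + V y + D z + U x

    V : (dim, n_voices)  eigenvoice basis
    D : (dim,)           diagonal of the residual speaker-variability matrix
    U : (dim, n_chan)    eigenchannel basis
    """
    return m_ubm + V @ y + D * z + U @ x  # D*z applies the diagonal matrix
```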


NIST SRE 2005 all trials

(DET plot: 2048 Gaussians, 13 MFCC + deltas, CMS; with RASTA; with feature warping; + triple deltas; + HLDA; + feature mapping (14 classes); + eigenchannel adaptation; Joint Factor Analysis (extrapolated result).)




Robust speaker recognition

Z-norm

  • Target-model LR scores have different biases and scales across test data:

    • An unusual channel or poor-quality speech in the training segments leads to lower scores from the target model.

    • With little training data, the target model stays close to the UBM, so all LLR scores are close to 0.

  • Z-norm attempts to remove these bias and scale differences from the LR scores:

    • Estimate the mean and standard deviation of the target model's scores on non-target, same-sex utterances drawn from data similar to the test data.

    • During testing, normalize the LR score with these statistics, aligning each model’s non-target score distribution to N(0, 1).

(Figure: pooled LR score distributions for two target models, before and after Z-norm.)
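The normalization itself is a simple standardization of the trial score with the model's impostor-score statistics (a sketch with hypothetical names):

```python
import numpy as np

def znorm(score, impostor_scores):
    """Z-norm: align a target model's non-target score distribution to N(0, 1).

    impostor_scores : LR scores of this target model against non-target,
                      same-sex utterances similar to the test data.
    """
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (score - mu) / sigma
```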


Robust speaker recognition

T-norm

  • Similar idea to Z-norm, but compensating for differences in the test data.

  • Estimates the bias and scale parameters for score normalization using a fixed “cohort” set of speaker models:

    • Normalizes the target score relative to a non-target model ensemble.

    • Similar to standard cohort normalization, except for the standard-deviation scaling.

  • Cohorts of the same gender as the target are used.

  • Can be used in conjunction with Z-norm (ZT-norm or TZ-norm, depending on the order).

  • Introduced in 1999 by Ensigma (DSP Journal, January 2000).

(Figure: the test utterance is scored against the target model and a set of cohort models to produce the T-norm score.)
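Combining the two gives ZT-norm. One common ordering is sketched below (an assumption about the exact bookkeeping: the trial score and the cohort scores are Z-normed first, then T-norm is applied against the Z-normed cohort):

```python
import numpy as np

def zt_norm(score, z_mu, z_sigma, cohort_scores, cohort_mu, cohort_sigma):
    """ZT-norm sketch: Z-norm first, then T-norm against the Z-normed cohort.

    z_mu, z_sigma           : Z-norm statistics of the target model
    cohort_scores           : this test utterance's scores against the cohort
    cohort_mu, cohort_sigma : per-cohort-model Z-norm statistics (arrays)
    """
    z_score = (score - z_mu) / z_sigma                      # Z-norm the trial score
    z_cohort = (cohort_scores - cohort_mu) / cohort_sigma   # Z-norm the cohort scores
    return (z_score - np.mean(z_cohort)) / np.std(z_cohort) # then T-norm
```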


Effect of ZT-norm

(DET plot, NIST SRE 2006 telephone trials: eigenchannel adaptation and Joint Factor Analysis, each with no normalization and with ZT-norm.)


Score fusion

NIST SRE 2006, all trials.

  • Linear logistic regression fusion of scores from:

    • GMM with eigenchannel adaptation,

    • SVM based on GMM supervectors,

    • SVM based on MLLR transforms (the transformation adapting a speaker-independent LVCSR system to the speaker).

  • The linear logistic regression fusion is trained using many target and non-target trials from a development set.
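Linear logistic regression fusion learns a weight per subsystem plus an offset from development trials; the gradient-descent trainer below is a toy sketch (real systems use dedicated tooling, and the learning rate and iteration count are arbitrary):

```python
import numpy as np

def train_fusion(scores, labels, iters=2000, lr=0.1):
    """Linear logistic regression fusion (sketch).

    scores : (trials, systems) subsystem scores from a development set
    labels : (trials,) 1.0 for target trials, 0.0 for non-target trials
    Returns fusion weights w and offset b; the fused score is scores @ w + b.
    """
    n, k = scores.shape
    w, b = np.zeros(k), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(scores @ w + b)))  # sigmoid of fused score
        grad_w = scores.T @ (p - labels) / n          # logistic-loss gradients
        grad_b = np.mean(p - labels)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```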


Conclusions