

Building a Robust Speaker Recognition System

Oldřich Plchot, Ondřej Glembek, Pavel Matějka

December 9th 2012

The PRISM Team

SRI International

Harry Bratt, Lukas Burget, Luciana Ferrer, Martin Graciarena, Aaron Lawson, Yun Lei, Nicolas Scheffer

Sachin Kajarekar, Elizabeth Shriberg, Andreas Stolcke

Brno University of Technology

Jan H. Černocký, Ondřej Glembek, Pavel Matějka, Oldřich Plchot

PRISM Robustness

Error rates lowered

“How did we achieve these results?”

“What are the outstanding research issues?”

BEST Phase I PI conference, Nov. 29th, 2011

Robustness
  • A need for effectiveness in non-ideal conditions
    • Moving beyond biometric evaluation in clean, controlled acquisition environments
    • Extract robust and discriminative biometric features, invariant to these types of variability
  • A need for predictability
    • A system claiming 99% accuracy should not give 80% on unseen data
    • Unless otherwise warned by the system

A comprehensive approach

Multi-stream High order and Low order features

Advanced speaker modeling and system combination

Prediction of difficult scenarios – QM vector

Robustness vs. unknown – carefully test on held-out data, beware of overtraining

A comprehensive approach
  • Multi-stream High order and Low order features: Prosody, MLLR, constraints, and MFCC, PLP, …
    • Multiple HOFs: new complementary information
    • Multiple LOFs: ditto + redundancy for increased robustness
  • Advanced speaker modeling and system combination: Unified modeling framework i-vector / probabilistic linear discriminant analysis (PLDA)
    • Robust variation-compensation scheme for multiple features and variability types
    • i-vector / PLDA framework adapted to all high- and low-level features
    • Discriminative training for more compact, and thus more robust, systems

THE MAGIC? - iVectors

The iVector extractor is a model similar to JFA

with a single subspace T → easier to train

no need for speaker labels → the subspace can be trained on a large amount of unlabeled recordings

We assume a standard normal prior on the factors i.

The iVector, a point estimate of i, can now be extracted for every recording as its low-dimensional, fixed-length representation (typically 200 dimensions).

However, an iVector contains information about both speaker and channel. Hopefully these can be separated by the classifier that follows.

Dehak, N., et al., "Support Vector Machines versus Fast Scoring in the Low-Dimensional Total Variability Space for Speaker Verification," in Proc. Interspeech 2009, Brighton, UK, September 2009.
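
The point estimate described on this slide can be sketched numerically. This is a minimal illustration, not the PRISM implementation: it assumes a diagonal-covariance UBM, per-component zeroth-order statistics and first-order statistics already centered on the UBM means, and it returns the posterior mean of i under the standard normal prior. The function name `extract_ivector` and all array shapes are assumptions made for this example.

```python
import numpy as np

def extract_ivector(T, sigma, N, F):
    """Posterior mean of i under a standard normal prior (illustrative sketch).

    T:     (C, D, R) total-variability matrix, one (D, R) block per UBM component
    sigma: (C, D)    diagonal UBM covariances (assumed diagonal here)
    N:     (C,)      zeroth-order statistics (soft frame counts)
    F:     (C, D)    first-order statistics, centered on the UBM means
    """
    R = T.shape[2]
    # T' Sigma^-1 per component: scale each row of each block by 1/sigma
    T_invS = T / sigma[:, :, None]
    # Posterior precision: I + sum_c N_c T_c' Sigma_c^-1 T_c
    precision = np.eye(R) + np.einsum("c,cdr,cds->rs", N, T_invS, T)
    # Linear term: sum_c T_c' Sigma_c^-1 f_c
    linear = np.einsum("cdr,cd->r", T_invS, F)
    return np.linalg.solve(precision, linear)
```

With no frames at all (N and F equal to zero) the estimate falls back to the prior mean, the zero vector; with many frames it approaches the factors that generated the statistics.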

Illustration

A low-dimensional vector can represent complex patterns in a multi-dimensional space

[Figure: three panels of (μ1, μ2) = (m1, m2) + [t11 t12 t13; t21 t22 t23] · (i1, i2, i3)]

Probabilistic Linear Discriminant Analysis (PLDA)

Let every speech recording be represented by an iVector.

What would now be the appropriate probabilistic model for verification?

iVectors are assumed to be normally distributed

An iVector still contains channel information → our model should consider both speaker and channel variability, just like in JFA.

A natural choice is a simplified JFA model with only a single Gaussian. Such a model is known as PLDA and is described by the familiar equation:
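
In the conventional JFA-style notation, the PLDA model for an iVector i can be written as below. Only U and the residual ε are named on these slides; the symbols m (global mean), V (speaker subspace), y (speaker factors), x (channel factors) and the diagonal residual covariance Σ_ε are assumptions taken from the usual formulation.

```latex
i = m + V y + U x + \epsilon,
\qquad
y \sim \mathcal{N}(0, I), \quad
x \sim \mathcal{N}(0, I), \quad
\epsilon \sim \mathcal{N}(0, \Sigma_{\epsilon})
```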

Why PLDA?

For our low-dimensional iVectors, we usually choose U to be a full-rank matrix → no need to consider the residual ε

We can rewrite the definition of PLDA as

or equivalently as

… the familiar LDA assumptions!
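
A sketch of the two rewritten forms, assuming the usual PLDA model i = m + V y + U x + ε with standard normal y and x. With a full-rank U the channel term and the residual collapse into a single full within-class covariance. The names Σ_WC (within-class), Σ_AC (across-class) and the speaker-dependent mean s are assumptions introduced here for illustration.

```latex
% Form 1: speaker factor y with a full within-class covariance
p(i \mid y) = \mathcal{N}(i;\; m + V y,\; \Sigma_{WC}),
\qquad
p(y) = \mathcal{N}(y;\; 0,\; I),
\qquad
\Sigma_{WC} = U U^{\top}

% Form 2: equivalent two-covariance view, with s = m + V y
p(i \mid s) = \mathcal{N}(i;\; s,\; \Sigma_{WC}),
\qquad
p(s) = \mathcal{N}(s;\; m,\; \Sigma_{AC}),
\qquad
\Sigma_{AC} = V V^{\top}
```

The second form is a Gaussian class mean plus a shared within-class Gaussian, which are the assumptions behind classical LDA.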

PLDA-based verification

Let us again consider the verification score given by the log-likelihood ratio between the same-speaker and different-speaker hypotheses, now in the context of modeling iVectors with PLDA:

… before: intractable, with iVectors: feasible.

All the integrals are now convolutions of Gaussians and can be solved analytically, giving, after some manipulation:

FAST!
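
The log-likelihood ratio can be sketched in a few lines. This minimal example assumes the two-covariance view of PLDA, with an across-class covariance `sigma_ac` and a within-class covariance `sigma_wc` (both names are assumptions), and it evaluates the Gaussians directly rather than using the fast closed-form quadratic expression the slide refers to.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log density of a multivariate Gaussian N(x; mean, cov)."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + maha)

def plda_llr(i1, i2, m, sigma_ac, sigma_wc):
    """log p(i1, i2 | same speaker) - log p(i1, i2 | different speakers)."""
    sigma_tot = sigma_ac + sigma_wc
    # Same speaker: both iVectors share one speaker mean, so they are
    # jointly Gaussian with cross-covariance sigma_ac.
    joint_cov = np.block([[sigma_tot, sigma_ac], [sigma_ac, sigma_tot]])
    same = log_gauss(np.concatenate([i1, i2]), np.concatenate([m, m]), joint_cov)
    # Different speakers: the two iVectors are independent.
    diff = log_gauss(i1, m, sigma_tot) + log_gauss(i2, m, sigma_tot)
    return same - diff
```

A positive score favors the same-speaker hypothesis, and the score is symmetric in the two iVectors.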

Performance compared to Eigenchannels and JFA

NIST SRE 2010, tel-tel (cond. 5)

[Figure legend: Baseline (relevance MAP), Eigenchannel adapt., JFA, iVector+PLDA]

iVector+PLDA system:

  • Implementation simpler than for JFA
  • Allows for extremely fast verification
  • Provides significant improvements, especially in the important low false-alarm region

iVector+PLDA – enhancements

NIST SRE 2010, tel-tel (cond. 5)

[Figure legend: iVector+PLDA; iVector+PLDA, full-covariance UBM; LDA150 + length normalization; red + mean normalization]

Ideas behind the enhancements:

  • Make it easier for PLDA by preprocessing the data with LDA
  • Make the heavy-tailed iVectors more Gaussian
  • Help a little more with channel compensation through condition-based mean normalization
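
Two of these ideas can be sketched as a preprocessing step. This is an illustration under stated assumptions, not the authors' recipe: the order (mean normalization first, then length normalization), the function name `normalize_ivectors` and the condition labels are all hypothetical, and the LDA projection is assumed to have been applied beforehand.

```python
import numpy as np

def normalize_ivectors(X, conditions):
    """Condition-based mean normalization followed by length normalization.

    X:          (n, d) array of iVectors (for example after an LDA projection)
    conditions: length-n sequence of condition labels (hypothetical, e.g. "tel", "mic")
    """
    X = np.asarray(X, dtype=float).copy()
    conditions = np.asarray(conditions)
    for c in np.unique(conditions):
        mask = conditions == c
        X[mask] -= X[mask].mean(axis=0)       # remove the per-condition mean
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)       # project onto the unit sphere
```

Scaling every iVector to unit length is what pulls a heavy-tailed distribution closer to the Gaussian shape that PLDA assumes.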
Diverse systems unified

%FA @ 10% Miss

“All features are now modeled using the i-vector paradigm, even for combination”

“New technologies for prosody modeling, e.g. subspace multinomial modeling”

BEST Phase I Final review, Nov. 3rd, 2011

BEST evaluation submissions

Early iVector fusion, optimal

%FA @ 10% Miss

PRIMARY

Complex multi-feature / combination of low- and high- level systems

½% false alarms @ 10% miss for our PRISM MFCC system: look at another operating point? (if the rate is that low in the evaluation)
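
The "%FA @ 10% Miss" operating point can be computed from raw trial scores as sketched below. The function name `fa_at_miss` and the thresholding convention (accept when the score is at or above the threshold) are assumptions for this example.

```python
def fa_at_miss(target_scores, nontarget_scores, miss_rate=0.10):
    """False-alarm rate (in %) at the threshold that rejects `miss_rate` of targets."""
    targets = sorted(target_scores)
    # Pick the threshold so that about `miss_rate` of target trials fall below it
    n_missed = int(miss_rate * len(targets))
    threshold = targets[n_missed]
    # A non-target trial accepted at this threshold is a false alarm
    false_alarms = sum(s >= threshold for s in nontarget_scores)
    return 100.0 * false_alarms / len(nontarget_scores)
```

When this number becomes very small, only a handful of non-target trials determine it, which is one reason to consider a different operating point.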

A comprehensive approach
  • Multi-stream High order and Low order features: Prosody, MLLR, constraints, and MFCC, PLP, …
  • Advanced speaker modeling and system combination: Unified modeling framework
  • Prediction of difficult scenarios: Universal audio characterization for system combination
    • Detect the difficulty of the problem, e.g. enroll on noise, test on telephone
    • React appropriately, e.g. calibrate scores for sound decisions

Predicting challenging scenarios

[Figure labels: Microphone, Tel, Noise 8 dB, Noise 15 dB, Noise 20 dB, Reverb 0.3, Reverb 0.5, Reverb 0.7]

  • Unified acoustic characterization: a novel approach to extracting any metadata in a unified way
    • Designed with the BEST program goal in mind: the ability to handle unseen data or compounded variability types
    • Avoids the unnecessary burden of developing a new system for each new type of metadata
  • Identification system, where the training data is divided into conditions
  • Investigating how to integrate intrinsic conditions: language and vocal effort
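
One way to picture such an identification system is a classifier that maps an iVector to a vector of condition posteriors. The sketch below assumes one Gaussian per training condition with a shared covariance and equal priors; the slides state only that the training data is divided into conditions, so this model choice and the name `condition_posteriors` are assumptions.

```python
import numpy as np

def condition_posteriors(x, means, shared_cov):
    """Posterior over conditions for one iVector x under Gaussian condition models.

    means:      (K, d) one mean per training condition (e.g. tel, mic, noise, reverb)
    shared_cov: (d, d) covariance shared by all conditions; equal priors assumed
    """
    diffs = means - x
    # Squared Mahalanobis distance from x to each condition mean
    maha = np.einsum("kd,kd->k", diffs, np.linalg.solve(shared_cov, diffs.T).T)
    log_lik = -0.5 * maha              # terms common to all conditions cancel
    log_lik -= log_lik.max()           # numerical stability before exponentiating
    post = np.exp(log_lik)
    return post / post.sum()
```

The resulting posterior vector is one candidate for the metadata (QM) vector used later in calibration and fusion.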

Robust calibration / fusion
  • Condition prediction features as new higher order information for calibration
    • Calibration: scale and shift scores for sound decision making on all operating points
    • Confidence under matched vs. mismatched conditions will differ
  • Discriminative training of the bilinear form
    • The model provides a bias for each condition type
  • Further research
    • Assess generalization
    • Affect system fusion weights, not just calibration
    • Early inclusion of the information

Fusion with QM
  • Offset
  • Linear combination weights
  • Score from system k
  • Vectors of metadata
  • Bilinear combination matrix
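
The terms listed above can be combined as in the following sketch: an offset, a weighted sum of the per-system scores, and a bilinear term between two metadata vectors. Treating the two vectors as belonging to the enrollment and test sides of a trial is an assumption, as are the function and parameter names.

```python
import numpy as np

def fuse_with_qm(scores, weights, offset, q_enroll, q_test, W):
    """Fused score: offset + sum_k weights[k] * scores[k] + q_enroll' W q_test."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    q_enroll = np.asarray(q_enroll, dtype=float)
    q_test = np.asarray(q_test, dtype=float)
    linear = float(weights @ scores)                       # system combination
    bilinear = float(q_enroll @ np.asarray(W) @ q_test)    # condition-dependent bias
    return offset + linear + bilinear
```

If the metadata vectors are condition posteriors, each entry of W acts as a bias for one pair of enrollment and test conditions.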

A comprehensive approach
  • Multi-stream High order and Low order features: Prosody, MLLR, constraints, and MFCC, PLP, …
  • Advanced speaker modeling and system combination: Unified modeling framework
  • Prediction of difficult scenarios: Unified condition prediction for system combination
  • Robustness vs. unknown: The PRISM data set
    • Expose systems to sufficiently diverse variability types of interest
    • Aim for generalization on non-ideal or unseen data scenarios
    • Use advanced strategies to compensate for these degradations

The PRISM data set

  • A multi-variability, large-scale speaker recognition evaluation set
    • Unprecedented design effort across many data sets
    • Simulation of extrinsic variability types: reverb & noise
    • Incorporation of intrinsic and cross-language variability
    • 1000 speakers, 30K audio files and more than 70M trials
  • Open design: recipe published at the SRE11 analysis workshop [Ferrer11]
  • Extrinsic data simulation
    • Degradation of a clean interview data set from SRE'08 and '10 (close mics)
    • A variety of degradations aiming at generalization: diversity of SNRs / reverbs to cover unseen data

Noisy data set (8 dB, 15 dB, 20 dB)

  • Noises from freesound.org, mixed using FaNT (Aurora)
  • Real noise samples: cocktail-party type, office noises
  • Different noises for training and evaluation

Reverb data set

  • Uses RIR + Fconv
  • Chooses 3 RT30 values: 0.3, 0.5, 0.7
  • 15 different room configurations
  • 9 for training, 3 for enrollment, 3 for test
Research opportunities
  • Multi-feature systems
    • Use novel low-level features for noise robustness
    • Noise / Reverb robust pitch extraction algorithms
  • Deeper understanding of combination: Aiming for simpler systems
    • Information fusion at earlier stage than score level
    • New speech feature design?
  • Acoustic characterization
    • Deep integration of condition prediction in the pipeline
    • Affecting fusion weights during system combination
    • Integrate language and intrinsic variations
    • Assessing improvements on unseen data, compounded variations
  • Hard extrinsic variations bring in new domains of expertise borrowed from speech recognition and other fields (noise-robust modeling, speech enhancement: de-reverberation, de-noising, binary masks, …)

Research opportunities: relaxing constraints even more
  • Compounded variations: Reverb + noise + language switch
  • Explore new types of variations
    • New kinds of intrinsic variations: vocal effort (furtive, oration), aging, sickness
    • Naturally occurring reverberant and noisy speech
  • Other parametric relaxations
    • Unconstrained duration for speaker enrollment and testing (as low as a second?)
    • Robustness to multi-speaker audio in enrollment and testing, another kind of variability that is VERY important for interview data processing
