Automatic Speaker Recognition: Technologies, Evaluations and Possible Future

Gérard CHOLLET
CNRS-LTCI, GET-ENST
chollet@tsi.enst.fr

Outline
• Why Speaker Recognition?
• Taxonomy (i.e. tasks)
• Applications (security, forensic,…)
• Pros and Cons
• Speaker Characteristics in the Speech Signal
• How to perform Speaker Recognition?
• Evaluation (NIST,…)
• Voice Transformations and Forgery (occasional, dedicated)
• Audio-visual Speaker Verification
• Conclusions, Perspectives

1. Why Should a Computer Recognize Who Is Speaking?
• Protection of individual property (habitation, bank account, personal data, messages, mobile phone, PDA,...)
• Limited access (secured areas, databases)
• Personalization (only respond to its master’s voice)
• Locate a particular person in an audio-visual document (information retrieval)
• Who is speaking in a meeting?
• Is a suspect the criminal? (forensic applications)

2. Taxonomy of the Automatic Speaker Recognition Tasks
• Speaker verification (voice biometrics?):
  • Are you really who you claim to be?
• Speaker identification (Speaker ID):
  • Does this speech segment come from a known speaker?
  • How large is the set of speakers (population of the world)?
• Speaker detection, segmentation, indexing, retrieval, tracking:
  • Looking for recordings of a particular speaker
• Combining speech and speaker recognition:
  • Adaptation to a new speaker, speaker typology
  • Personalization in dialogue systems

3. Applications
• Access Control
  • Physical facilities, Computer networks, Websites
• Transaction Authentication
  • Telephone banking, e-Commerce
• Speech data Management
  • Voice messaging, Search engines
• Law Enforcement
  • Forensics, Home incarceration

4. Advantages and Disadvantages
• Advantages
  • Best-suited modality over the telephone
  • Low cost (microphone, A/D), ubiquity
  • Possible integration on a smart (SIM) card
  • Natural bimodal fusion: speaking face
• Disadvantages
  • Lack of discretion
  • Possibility of imitation and electronic imposture
  • Lack of robustness to noise, distortion,…
  • Temporal drift

5. Speaker Characteristics in the Speech Signal
• Differences in
  • Vocal tract shapes and muscular control
  • Fundamental frequency (typical values)
    • 100 Hz (Male), 200 Hz (Female), 300 Hz (Child)
  • Glottal waveform
  • Phonotactics
  • Lexical usage
• The difference between the voices of twins is a limiting case
• Voices can also be imitated, disguised and electronically transformed

5.1 Speaker Characteristics: different factors
• Supra-segmental factors (>30 ms)
  • speaking speed (timing and rhythm of speech units)
  • intonation patterns
  • dialect, accent, pronunciation habits
• Segmental factors (~30 ms)
  • glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
  • vocal tract: characterized by its transfer function and represented by MFCCs (Mel Frequency Cepstral Coefficients)
[Figure: spectral envelope of /i:/ for Speaker A and Speaker B; amplitude A versus frequency f]
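
MFCCs rely on a mel-warped frequency axis. The sketch below illustrates that warping only, not a full MFCC front end. It assumes the common 2595·log10(1 + f/700) variant of the mel scale, which the slides do not specify; the function names, the 24 filters and the 0 to 4000 Hz band are illustrative choices.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to mels (assumed 2595*log10 variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_centres(n_filters, f_low_hz, f_high_hz):
    """Centre frequencies (Hz) of n_filters bands equally spaced in mel."""
    lo, hi = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

# Hypothetical telephone-band filter bank: 24 filters between 0 and 4000 Hz.
centres = mel_filter_centres(24, 0.0, 4000.0)
```

The centres are dense at low frequencies and sparse at high frequencies, which is the intended effect of the warping.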

5.2 Speaker Characteristics: Acoustic Features
• Short-term spectral analysis
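
A minimal sketch of short-term spectral analysis: the signal is cut into overlapping frames, each frame is windowed, and a magnitude spectrum is computed per frame. The frame length, hop size, Hamming window and naive DFT are assumptions made for illustration; a practical front end would use an FFT and frames of roughly 20 to 30 ms.

```python
import cmath
import math

def frames(signal, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def magnitude_spectrum(frame):
    """Naive DFT magnitude (bins 0..N/2) of a Hamming-windowed frame."""
    n = len(frame)
    x = [s * w for s, w in zip(frame, hamming(n))]
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def short_term_spectra(signal, frame_len=64, hop=32):
    """One magnitude spectrum per frame."""
    return [magnitude_spectrum(f) for f in frames(signal, frame_len, hop)]

# Toy input: a 1 kHz tone sampled at 8 kHz (hypothetical values).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
spectra = short_term_spectra(tone)
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
peak_hz = peak_bin * sample_rate / 64
```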

5.3 Speaker Characteristics: Intra- and Inter-Speaker Variability

6.1 How: history

6.2 How: current approaches

6.3 How: HMM Structure is Application Dependent

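6.4 How: Gaussian Mixture Models (GMMs)
• Parametric representation of the probability distribution of observations: a weighted sum of M Gaussian densities,
  p(x | λ) = w_1 N(x; μ_1, Σ_1) + … + w_M N(x; μ_M, Σ_M), where the weights w_i are non-negative and sum to 1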
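
A minimal sketch of scoring feature vectors against a GMM, assuming diagonal covariance matrices (a common simplification that the slides do not state). The log-sum-exp step avoids numerical underflow; all function and parameter names are illustrative.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | GMM) = log sum_i w_i N(x; mean_i, var_i), via log-sum-exp."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

def average_log_likelihood(vectors, weights, means, variances):
    """Mean per-frame log-likelihood of a sequence of feature vectors."""
    return sum(gmm_log_likelihood(x, weights, means, variances)
               for x in vectors) / len(vectors)

# Toy two-component model in two dimensions (hypothetical parameters).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
score = average_log_likelihood([[0.1, -0.2], [2.9, 3.2]],
                               weights, means, variances)
```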

6.5 How: GMM example
• 8 Gaussians per mixture

6.6 How: Decision Theory for Speaker Verification
• Two types of errors:
  • False rejection (a client is rejected)
  • False acceptance (an impostor is accepted)
• Decision theory: given an observation O and a claimed identity
  • H0 hypothesis: it comes from an impostor
  • H1 hypothesis: it comes from our client
• H1 is chosen if and only if P(H1|O) > P(H0|O), which can be rewritten (using Bayes' law) as:
  P(O|H1) / P(O|H0) > P(H0) / P(H1)
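
The decision rule can be sketched in the log domain: accept the claimed identity when the log-likelihood ratio exceeds log(P(H0)/P(H1)). In this toy version the client and impostor models are single one-dimensional Gaussians and frames are treated as independent; both are simplifying assumptions, and all names are illustrative.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_likelihood_ratio(observations, client, impostor):
    """Sum over frames of log p(o|H1) - log p(o|H0).

    client and impostor are (mean, variance) pairs; H1 is the client
    hypothesis and H0 the impostor hypothesis, as on the slide.
    """
    return sum(log_gauss(o, *client) - log_gauss(o, *impostor)
               for o in observations)

def accept(observations, client, impostor, prior_client=0.5):
    """Choose H1 iff the ratio exceeds log(P(H0) / P(H1))."""
    threshold = math.log((1.0 - prior_client) / prior_client)
    return log_likelihood_ratio(observations, client, impostor) > threshold

# Hypothetical models: client centred on 0, impostors centred on 3.
client_model = (0.0, 1.0)
impostor_model = (3.0, 1.0)
genuine_trial = accept([0.1, -0.2, 0.3], client_model, impostor_model)
impostor_trial = accept([2.9, 3.1], client_model, impostor_model)
```

In practice the threshold is tuned on development data rather than derived from the priors alone, since the costs of the two error types also matter.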

6.8 How: Decision

6.9 How: Distribution of scores

6.10 How: Detection Error Tradeoff (DET) Curve
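
A DET curve plots the false rejection rate against the false acceptance rate as the decision threshold sweeps over the score range, usually on normal-deviate axes. The sketch below computes the operating points and an approximate equal error rate from two score lists. The scores are made up for illustration, and the convention "accept when score >= threshold" is an assumption.

```python
def error_rates(client_scores, impostor_scores, threshold):
    """Return (false acceptance rate, false rejection rate) at a threshold."""
    fr = sum(s < threshold for s in client_scores) / len(client_scores)
    fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return fa, fr

def det_points(client_scores, impostor_scores):
    """(threshold, FA, FR) for every observed score used as a threshold."""
    thresholds = sorted(set(client_scores) | set(impostor_scores))
    return [(t, *error_rates(client_scores, impostor_scores, t))
            for t in thresholds]

def equal_error_rate(client_scores, impostor_scores):
    """Operating point where FA and FR are closest; returns (EER, threshold)."""
    t, fa, fr = min(det_points(client_scores, impostor_scores),
                    key=lambda p: abs(p[1] - p[2]))
    return (fa + fr) / 2.0, t

# Made-up scores, for illustration only.
client_scores = [2.0, 1.5, 0.4, 3.0]
impostor_scores = [-1.0, 0.5, -0.3, 0.1]
eer, eer_threshold = equal_error_rate(client_scores, impostor_scores)
```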

7. Evaluation
• Decision cost (FA, FR, priors, costs,…)
• Reference systems (open software)
  • Torch - a Machine Learning library (www.torch.ch)
  • ALIZE (www.lia.univ-avignon.fr/heberges/ALIZE/)
  • BECARS (www.tsi.enst.fr/~blouet/Becars/)
• Evaluations (algorithms, field trials, ergonomics,…)
  • NIST Speaker detection campaigns
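
A decision cost combines the two error rates with their costs and the prior probability of a target (client) trial. The sketch below uses the usual weighted-sum form. The default values (miss cost 10, false-alarm cost 1, target prior 0.01) are those often quoted for NIST speaker detection evaluations of this period; they are assumptions here, since the slides do not state them.

```python
def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Expected cost of a detector at one operating point.

    p_miss is the false rejection rate and p_fa the false acceptance rate;
    the cost and prior defaults are assumed values, not taken from the slides.
    """
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def best_operating_point(points):
    """Pick the (threshold, p_fa, p_miss) triple with the lowest cost."""
    return min(points, key=lambda p: detection_cost(p[2], p[1]))

# Hypothetical operating points: (threshold, FA rate, FR rate).
points = [(0.0, 0.20, 0.01), (1.0, 0.05, 0.05), (2.0, 0.01, 0.30)]
chosen = best_operating_point(points)
```

With a low target prior, false acceptances dominate the cost, so the minimum tends to sit at a stricter threshold than the equal error rate point.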

7.1 Evaluations: National Institute of Standards & Technology (NIST)
• Annual evaluation since 1995
• Common paradigm for comparing technologies

7.2 Evaluations: NIST 2004

8. Voice Transformations and Forgery (occasional, dedicated)
• Isolated individuals with few resources, or “professional impostors” with a dedicated budget, can threaten the security of speaker recognition systems
• Voice transformation technologies (e.g. segmental synthesis using an inventory of client speech data) are now available
• Speaker recognition research should explicitly address this forgery issue and define appropriate countermeasures
• Prevention by predicting many different forgery scenarios

9. Speaking Faces: Motivations
• A person speaking in front of a camera offers 2 modalities for identity verification (speech and face).
• The sequence of face images and the synchronisation of speech and lip movements could be exploited.
• Imposture is much more difficult than with single modalities.
• Many PCs, PDAs and mobile phones are equipped with a camera.
• Audio-Visual Identity Verification will offer non-intrusive security for e-commerce, e-banking,…

9.1 Speaking faces: Audio-Visual Approach

A talking face model
• Using Hidden Markov Models (HMMs)
• Each state of the model generates a sequence of feature vectors

10. Conclusions and Perspectives
• Deliberate forgery is a challenge for speech-only systems
• Verification of identity based on features extracted from talking faces should be developed
• Common databases and evaluation protocols are necessary
• Free access to reference systems and databases will facilitate future developments
• Apply this paradigm to more than audio-visual modalities => see BioSecure-NoE
