Speaking Faces Verification

Kevin McTait

Raphaël Blouet

Gérard Chollet

Silvia Colón

Guido Aversano

SecurePhone Workshop - 24/25 June 2004



Outline

- Speaking faces verification problem

- State of the art in speaking faces verification

- Choice of system architecture

- Fusion of audio and visual modalities

- Initial results using BANCA database (Becars: voice only system)

Problem definition

Detection and tracking of lips in the video sequence:

Locate head/face in image frame

Locate mouth/lips area (Region of Interest)

Determine/calculate lip contour coordinates and intensity parameters (visual feature extraction)

Other parameters: visible teeth, tongue, jaw movement, eyebrows, cheeks etc…

Modelling parameters

Model deformation of lip (or other) parameters over time:


Fusion of visual and acoustic parameters/models

Calculate likelihood of model relative to client/world model in order to accept/reject

Augment in-house speaker verification system (Becars) with visual parameters
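
The accept/reject step above can be sketched as a log-likelihood ratio test. This is a minimal illustration, not the Becars implementation: the function names, the per-frame normalisation and the default threshold are all assumptions made for the example.

```python
def llr_score(client_loglik, world_loglik, n_frames):
    """Per-frame log-likelihood ratio of the client model against the world model.

    Both inputs are total log-likelihoods of the same observation sequence.
    Dividing by the number of frames is an assumed normalisation so that
    scores from utterances of different lengths stay comparable.
    """
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    return (client_loglik - world_loglik) / n_frames


def accept(client_loglik, world_loglik, n_frames, threshold=0.0):
    """Accept the identity claim when the normalised ratio exceeds the threshold."""
    return llr_score(client_loglik, world_loglik, n_frames) > threshold
```

In practice the threshold would be tuned on development data to trade false acceptances against false rejections.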



Limited device (storage and CPU processing power)

Subject variability (aging, beard, glasses…), pose, illumination

Low complexity algorithms

Subspace transforms, learning methods

Image based approaches, hue colouration/chromaticity clues

Model based approaches

Active Shape Models

Identification: based on spatio-temporal analysis of video sequence

Person represented by deformable parametric model of visible speech articulators (usually lips) with their temporal characteristics

Active Shape Model consists of shape parameters (lip contours) and greyscale/colour intensity (for illumination)

Model trained on training set using PCA to recover principal modes of deformation of the model

Model used to track lips over time, model parameters recovered from lip tracking results

Shape and intensity modelled by GMMs, temporal dependencies (state transition probabilities) by HMMs

Verification: using a Viterbi algorithm, if estimation of likelihood of model generating the observed sequence of features corresponding to a client is above a threshold, then accept, else reject
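
The Viterbi scoring mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: one diagonal-covariance Gaussian per state (the slides describe GMM emissions), log-domain parameters supplied by the caller, and hypothetical function names.

```python
import math

NEG_INF = float("-inf")  # log of a zero probability, e.g. a forbidden transition


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def viterbi_loglik(obs, log_init, log_trans, means, variances):
    """Log-likelihood of the best state path through the HMM for sequence obs.

    log_init[j]     : log probability of starting in state j
    log_trans[i][j] : log probability of moving from state i to state j
    means[j], variances[j] : Gaussian emission parameters of state j
    """
    n_states = len(log_init)
    # Initialise with the first observation.
    delta = [log_init[j] + log_gauss(obs[0], means[j], variances[j])
             for j in range(n_states)]
    # Recursion: keep only the best predecessor for each state.
    for o in obs[1:]:
        delta = [max(delta[i] + log_trans[i][j] for i in range(n_states))
                 + log_gauss(o, means[j], variances[j])
                 for j in range(n_states)]
    return max(delta)
```

The returned score would then be compared with a threshold, typically after subtracting the score of a world model on the same sequence.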

Active Shape Models

Robust detection, tracking & parameterisation of visual features

Statistical, avoids use of constraints, thresholds, penalties

Model only allowed to deform to shapes similar to those seen in training set (trained using PCA)

Represent object by set of labelled points representing contours, height, width, area etc.

Model consists of 5 Bézier curves (B-spline functions), each defined by two end points P0 and P2 and one control point P1:

P(t) = θ0(t)P0 + θ1(t)P1 + θ2(t)P2
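
A curve of this form can be evaluated as in the sketch below. It assumes the blending functions θ0, θ1, θ2 are the quadratic Bernstein polynomials, which is the standard definition of a quadratic Bézier curve with end points P0 and P2 and control point P1; the function names are illustrative.

```python
def bezier_point(p0, p1, p2, t):
    """Point on a quadratic Bézier curve at parameter t in [0, 1]."""
    b0 = (1.0 - t) ** 2        # weight of end point P0
    b1 = 2.0 * t * (1.0 - t)   # weight of control point P1
    b2 = t ** 2                # weight of end point P2
    return tuple(b0 * a + b1 * b + b2 * c for a, b, c in zip(p0, p1, p2))


def sample_curve(p0, p1, p2, n=10):
    """Return n points along the curve (n >= 2), including both end points."""
    return [bezier_point(p0, p1, p2, i / (n - 1)) for i in range(n)]
```

The curve passes through P0 at t = 0 and P2 at t = 1, and is pulled towards P1 in between without generally touching it.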

points distribution model

shape approximation
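
The restriction to shapes similar to the training set can be illustrated with a small sketch of a PCA shape model. It assumes the mean shape, the principal modes and their eigenvalues have already been estimated, and it uses the common convention of limiting each shape parameter to three standard deviations; that limit and all names here are assumptions, not taken from the slides.

```python
import math


def clamp_params(b, eigenvalues, n_std=3.0):
    """Limit each shape parameter b_k to +/- n_std * sqrt(eigenvalue_k)."""
    clamped = []
    for bk, lam in zip(b, eigenvalues):
        limit = n_std * math.sqrt(lam)
        clamped.append(max(-limit, min(limit, bk)))
    return clamped


def reconstruct_shape(mean_shape, modes, b):
    """Shape = mean shape + sum over modes of b_k * mode_k.

    mean_shape and each mode are flat coordinate lists [x1, y1, x2, y2, ...].
    """
    shape = list(mean_shape)
    for bk, mode in zip(b, modes):
        for i, m in enumerate(mode):
            shape[i] += bk * m
    return shape
```

Clamping before reconstruction keeps a tracked contour from drifting into shapes the model never saw during training.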

Spatio-temporal model

  • Visual observation of speaker: O = o1, o2, …, oT

  • Assumption: feature vectors follow normal distribution as in acoustic domain, modelled by GMMs

  • Assumption: temporal changes are piece-wise stationary and follow first order Markov process

  • Each state in HMM represents several consecutive feature vectors
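
The Gaussian mixture assumption above can be made concrete with a short sketch of a GMM log-likelihood. It assumes diagonal covariance matrices and treats frames as independent when scoring a whole sequence; both choices and the function names are illustrative.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector under a Gaussian mixture model."""
    terms = [math.log(w) + log_gauss_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    # Log-sum-exp keeps the sum stable when individual densities are tiny.
    return top + math.log(sum(math.exp(t - top) for t in terms))


def sequence_loglik(frames, weights, means, variances):
    """Sum of per-frame log-likelihoods for O = o1, ..., oT."""
    return sum(gmm_loglik(o, weights, means, variances) for o in frames)
```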

ASM: Training

ASM: Tracking

ASM: Lip Tracking Examples

Image Based Approach

Hue and saturation levels to find lip region (ROI)

Eliminate outliers (red blobs) by constraints (geometric, gradient, saturation)

Motion constraints: difference image (1d) pixelwise absolute difference between two adjacent frames

a) greyscale image

b) hue image

c) binary hue/saturation thresholding

d) accumulated difference image

e) binary image after thresholding

f) combined binary image c AND e

Find largest connected region
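
The steps above (colour mask, motion mask, logical AND, largest region) can be sketched in plain Python. Images are nested lists, and the hue range, saturation floor, motion threshold and 4-connectivity are assumed values chosen for illustration only.

```python
import colorsys
from collections import deque


def hue_sat_mask(rgb, hue_low=0.05, hue_high=0.95, sat_min=0.4):
    """True where a pixel is reddish (hue near 0 or 1) and well saturated."""
    mask = []
    for row in rgb:
        out = []
        for r, g, b in row:
            h, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            out.append((h <= hue_low or h >= hue_high) and s >= sat_min)
        mask.append(out)
    return mask


def motion_mask(grey_frames, thresh):
    """Threshold the accumulated absolute difference of adjacent frames."""
    height, width = len(grey_frames[0]), len(grey_frames[0][0])
    acc = [[0] * width for _ in range(height)]
    for prev, cur in zip(grey_frames, grey_frames[1:]):
        for y in range(height):
            for x in range(width):
                acc[y][x] += abs(cur[y][x] - prev[y][x])
    return [[v > thresh for v in row] for row in acc]


def largest_region(mask):
    """Pixel coordinates of the largest 4-connected True region."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    best = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            region, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:  # breadth-first flood fill
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < height and 0 <= nx < width \
                            and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    return best


def lip_roi(rgb, grey_frames, motion_thresh=20):
    """Bounding box (top, left, bottom, right) of the moving reddish region."""
    colour = hue_sat_mask(rgb)
    motion = motion_mask(grey_frames, motion_thresh)
    combined = [[c and m for c, m in zip(cr, mr)] for cr, mr in zip(colour, motion)]
    region = largest_region(combined)
    if not region:
        return None
    ys = [p[0] for p in region]
    xs = [p[1] for p in region]
    return (min(ys), min(xs), max(ys), max(xs))
```

Combining the two masks discards static red blobs, which pass the colour test but show no frame-to-frame difference.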

Image Based Approach (2)

Derive lip dimensions using colour and edge information

Markov random field framework to combine two sources of info and segment lips from background

Implementation close to completion

Other Approaches

Deformable template/model/contour based:

Geometric shapes, shape models, eigenvectors, appearance models, deform in order to minimise energy/distance function relating to template parameters and image, template matching (correlation), best fit template, active shape models, active appearance models, model fitting problem

Learning based approach:


Knowledge based approach:

Subject rules or information to find and extract features, eye/nose detection symmetry

Visual Motion analysis:

Motion analysis techniques, motion cues, difference images after thresholding and filtering

Optical flow, filter tracking (computationally expensive)

Hue and saturation thresholding

Intensity of ruddy areas, problem of removing outliers

Image subspace transforms:

DCT, PCA, Discrete Wavelet, KLT (DWT + PCA analysis of ROI), FFT

Fusion of audio-visual information

Instance of general classifier problem (bimodal classifier)

2 observation streams: audio + video providing info about hidden class labels

Typically each observation stream used to train a single modality classifier

Aim: combine both streams to produce bimodal classifier to recognise pertinent classes with higher level of accuracy

2 general types/levels of fusion:

Feature fusion

Decision fusion

Feature Fusion

Feature fusion: HMM classifier with a concatenated feature vector of audio and visual parameters (time-synchronous features, possibly including upsampling)

Generation process of feature vector

Using single stream HMM with emission (class conditional observation) probabilities given by Gaussian distribution:
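
Building the concatenated, time-synchronous feature vector can be sketched as below. The example assumes the video stream has fewer frames than the audio stream and is brought to the audio frame count by linear interpolation; the interpolation method and the function names are assumptions for illustration.

```python
def upsample(features, n_out):
    """Linearly interpolate a non-empty list of feature vectors to n_out frames."""
    n_in = len(features)
    if n_in == 1 or n_out == 1:
        return [list(features[0]) for _ in range(n_out)]
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)   # position on the input time axis
        i = min(int(pos), n_in - 2)          # index of the left neighbour
        frac = pos - i
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(features[i], features[i + 1])])
    return out


def fuse_features(audio, video):
    """Concatenate each audio vector with the time-aligned video vector."""
    video_up = upsample(video, len(audio))
    return [list(a) + v for a, v in zip(audio, video_up)]
```

The fused vectors can then be modelled by a single-stream HMM exactly as audio-only vectors would be, at the cost of a higher feature dimension.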

Decision Fusion

State synchronous decision fusion

Captures reliability of each stream

HMM state level

combine single modality HMM classifier outputs

Class conditional log-likelihoods from the 2 classifiers linearly combined with appropriate weights

Various levels: state (phone, syllable, word…)

multi-stream HMMs classifier, state emission probs:

Product HMMs, factorial HMMs…

Other classifiers (SVMs, Bayesian classifiers, MLP…)
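
The linear combination of class-conditional log-likelihoods described above can be sketched as follows. The weights are assumed to sum to one and the default audio weight is an arbitrary placeholder; in practice the weight would reflect the estimated reliability of each stream.

```python
def fuse_scores(audio_loglik, video_loglik, audio_weight=0.7):
    """Weighted sum of the two single-modality log-likelihoods."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_loglik + (1.0 - audio_weight) * video_loglik


def classify(audio_scores, video_scores, audio_weight=0.7):
    """Pick the class label with the highest fused score.

    audio_scores and video_scores map each class label to the log-likelihood
    produced by the corresponding single-modality classifier.
    """
    fused = {label: fuse_scores(audio_scores[label], video_scores[label], audio_weight)
             for label in audio_scores}
    return max(fused, key=fused.get)
```

Lowering the audio weight, for example in noisy acoustic conditions, lets the visual stream dominate the decision.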

BANCA: Results