Video Rewrite: Driving Visual Speech with Audio

1 Video Rewrite: Driving Visual Speech with Audio • Christoph Bregler • Michele Covell • Malcolm Slaney • Interval Research Corporation

2 Goal: Photo-realistic Talking Face • Handcoded 3D Model OR • Video Rewrite

2 Facial Animation History: • Parke (1972) • Cohen & Massaro, Benoit et al. (1993) • Waters & Terzopoulos (1990), DEC-Face • Lewis (1991) • Litwinowicz & Williams (1994) • Chen, Graf, Petajan, et al. (1995) • Scott et al. (1994) • Ezzat & Poggio (1997) • Pighin et al. + Guenter et al. (1998) • Brand (1999) • Cosatto, Graf (2000)

3 Video Rewrite: Overview • Analysis • Synthesis /D/ /IY/ /P/ /AH/

5 Annotation: • Phonetic • Head Pose • Mouth Shape /D/ /OH/ /AH/ /N/

6 Phonetic Annotation HMM Labels /D/ /IY/ /P/ /AH/ /IY-P-AH/ /D-IY-P/

6 Phonetic Annotation • Acoustic Front-End: RASTA-PLP (Channel Invariant) • HMM Models / Gaussian Mixture Models (HTK) • Phoneme Set: 56 categories (CMU) • Triphone models trained on TIMIT • Annotation using Forced-Viterbi • (and CMU pronunciation dictionary)
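The forced-Viterbi step above can be sketched as follows. This is a minimal illustration, not the HTK pipeline named on the slide: it assumes the per-frame costs (for example, negative log-likelihoods from the acoustic models) are already available as a plain matrix, and the function name `forced_align` is hypothetical. Because the transcript is known, each frame may only stay in the current phoneme or advance to the next one.

```python
def forced_align(frame_costs):
    """Align T audio frames to a known sequence of K phonemes.

    frame_costs[t][k] is an assumed, precomputed cost of frame t under the
    k-th phoneme of the transcript (lower is better). Returns one phoneme
    index per frame; the path starts at 0, ends at K-1, and never skips.
    """
    T, K = len(frame_costs), len(frame_costs[0])
    if T < K:
        raise ValueError("need at least one frame per phoneme")
    INF = float("inf")
    dp = [[INF] * K for _ in range(T)]    # best cost ending at (t, k)
    back = [[0] * K for _ in range(T)]    # predecessor phoneme index
    dp[0][0] = frame_costs[0][0]
    for t in range(1, T):
        for k in range(K):
            stay = dp[t - 1][k]
            advance = dp[t - 1][k - 1] if k > 0 else INF
            if advance < stay:
                dp[t][k] = advance + frame_costs[t][k]
                back[t][k] = k - 1
            else:
                dp[t][k] = stay + frame_costs[t][k]
                back[t][k] = k
    # Trace back from the final phoneme at the final frame.
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy example: 5 frames, transcript of 2 phonemes.
costs = [[0, 1], [0, 1], [1, 0], [1, 0], [1, 0]]
alignment = forced_align(costs)
```

The phoneme boundaries are the frames where the returned index changes; those boundaries are what label the video frames.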

7 Head Pose Annotation match planar template

8 Mouth / Chin Annotation Eigenpoints

8 Eigenpoints - Training - Graylevel + XY Control points

8 Eigenpoints - Mapping - Graylevel + XY Control point Space

11 Synthesis - Overview - background face

12 Synthesis: • Transcribe • Find Lip Clips • Stitch Together /J/ /EH/ /IY/ /L/

13 Matching: /AA/ /T/ /AA/

14 Matching: Co-Articulation /AA/ /T/ /AA/ • /UW-T-UW/ ?

15 Matching: Co-Articulation /AA/ /T/ /AA/ • /UW-T-UW/ • /AA-T-AA/: match

16 Co-Articulation: Tri-Phones • More than 20,000 Tri-Phones in English • /UW-T-UW/ /AA-T-AA/ /AA-S-AA/ …
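As a small illustration of the triphone unit, a phoneme string can be cut into overlapping triples with a sliding window. The helper name `triphones` is hypothetical.

```python
def triphones(phonemes):
    """Return the overlapping (left, center, right) triples of a phoneme list."""
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


# The phoneme string /D/ /IY/ /P/ /AH/ from the annotation slides.
labels = ["-".join(t) for t in triphones(["D", "IY", "P", "AH"])]
print(labels)
```

Each phoneme appears in up to three triples, which is why the number of distinct triphones grows so quickly and why a short training video cannot contain them all.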

16 Viseme-based Perceptual Match • Confusion Matrix, Owens (1985): P B S T K … × P B S T K … • 11 Consonant Clusters: - CH, JH, SH, ZH - K, G, N, L - T, D, S, Z - P, B, M - F, V - TH, DH

McGurk Effect -- Baldy by Cohen & Massaro

17 Matching: Viseme-Distance /AA/ /T/ /AA/ • correct phone, wrong context: /UW-T-UW/ • correct viseme, correct context: /AA-S-AA/

18 Matching: Viseme-Distance /AA/ /T/ /AA/ • /UW-T-UW/ • /AA-S-AA/: approximate match
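The approximate match can be sketched with a toy viseme distance. The six clusters are the ones listed on the viseme slide (the slide mentions 11 but only these are legible here); the cost values 0, 0.5, and 1 and the function names are assumptions for illustration, whereas the slides derive similarity from a confusion matrix.

```python
# Consonant clusters as listed on the viseme slide.
VISEME_CLUSTERS = [
    {"CH", "JH", "SH", "ZH"},
    {"K", "G", "N", "L"},
    {"T", "D", "S", "Z"},
    {"P", "B", "M"},
    {"F", "V"},
    {"TH", "DH"},
]


def viseme_class(phoneme):
    """Index of the cluster containing the phoneme, or None if unlisted."""
    for idx, cluster in enumerate(VISEME_CLUSTERS):
        if phoneme in cluster:
            return idx
    return None


def phoneme_distance(a, b):
    """Assumed costs: 0 same phoneme, 0.5 same viseme cluster, 1 otherwise."""
    if a == b:
        return 0.0
    ca, cb = viseme_class(a), viseme_class(b)
    if ca is not None and ca == cb:
        return 0.5
    return 1.0


def triphone_distance(target, candidate):
    """Sum of per-position phoneme distances between two triphones."""
    return sum(phoneme_distance(t, c) for t, c in zip(target, candidate))


target = ("AA", "T", "AA")
d_wrong_context = triphone_distance(target, ("UW", "T", "UW"))
d_same_viseme = triphone_distance(target, ("AA", "S", "AA"))
```

Under these assumed costs /AA-S-AA/ scores lower than /UW-T-UW/, which agrees with the slides: a visually similar consonant in the right context is a better stand-in than the right consonant in the wrong context.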

18 Matching: Overlapping Triphones • Shape Distance

18 Matching: Trade-Offs • N-Viseme Distance • Shape Distance • Rate of Speech Distance /IY/ /P/ /AA/ /T/ /AA/

18 Matching: N-Best Dynamic Programming • $\mathrm{Error} = \sum_t \left[ \alpha V(t) + \beta R(t) + \gamma S(t-1,t) \right]$ • N-best
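The dynamic-programming search can be sketched as a Viterbi pass over candidate clips. This assumes the error is a weighted sum over triphone positions t of a viseme distance V, a rate-of-speech distance R, and a shape distance S between consecutive clips, with weights alpha, beta, gamma. The clip fields `v`, `r`, `shape` and all numeric values are hypothetical, and this sketch keeps only the single best path rather than an N-best list.

```python
def best_clip_sequence(candidates, alpha, beta, gamma):
    """Pick one clip per target triphone, minimising the total error.

    candidates[t] is the list of clips considered for position t. Each clip
    is a dict with assumed fields: 'id', 'v' (viseme distance to the target),
    'r' (rate-of-speech distance) and 'shape' (a scalar lip-shape summary).
    """
    def unit(clip):
        return alpha * clip["v"] + beta * clip["r"]

    def join(prev, clip):
        return gamma * abs(prev["shape"] - clip["shape"])

    # best holds (cost so far, path) for every clip at the current position.
    best = [(unit(c), [c]) for c in candidates[0]]
    for t in range(1, len(candidates)):
        new_best = []
        for c in candidates[t]:
            cost, path = min(
                ((pc + join(pp[-1], c), pp) for pc, pp in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + unit(c), path + [c]))
        best = new_best
    return min(best, key=lambda item: item[0])


candidates = [
    [{"id": "a0", "v": 0.0, "r": 0.25, "shape": 1.0},
     {"id": "a1", "v": 0.5, "r": 0.0, "shape": 3.0}],
    [{"id": "b0", "v": 0.0, "r": 0.0, "shape": 3.0},
     {"id": "b1", "v": 0.25, "r": 0.25, "shape": 1.0}],
]
total, path = best_clip_sequence(candidates, alpha=1.0, beta=1.0, gamma=0.5)
chosen = [clip["id"] for clip in path]
```

In this toy case clip a0 is the cheaper choice in isolation, yet a1 is selected because its lip shape joins b0 without a jump; that is the trade-off the weights control.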

19 Stitching + +

20 Stitching + +

21 Stitching Morphing

21 Morphing Affine-Warp + Beier-Neely

21 Simple Lighting Correction 1.) Intensity X 2.) Alpha Blending X
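One reading of the two steps, sketched under stated assumptions: the mouth clip is first multiplied by an intensity gain, then alpha-blended into the background face. How the gain is chosen is not given on the slide; matching mean intensities is an assumption made here, and the function names are hypothetical. Images are plain nested lists of grey levels to keep the sketch dependency-free.

```python
def mean_intensity(img):
    """Average grey level of an image stored as a list of rows."""
    pixels = [p for row in img for p in row]
    return sum(pixels) / len(pixels)


def composite(mouth, background, alpha):
    """Intensity-correct the mouth clip, then alpha-blend it onto the background.

    All three arguments are same-sized lists of rows; alpha is in [0, 1],
    where 1 keeps the mouth pixel and 0 keeps the background pixel.
    """
    mouth_mean = mean_intensity(mouth)
    # Assumed gain: scale the mouth so its mean matches the background's.
    gain = mean_intensity(background) / mouth_mean if mouth_mean else 1.0
    return [
        [a * (gain * m) + (1.0 - a) * b
         for m, b, a in zip(m_row, b_row, a_row)]
        for m_row, b_row, a_row in zip(mouth, background, alpha)
    ]


mouth = [[40.0, 60.0]]
background = [[120.0, 80.0]]
alpha = [[1.0, 0.5]]
blended = composite(mouth, background, alpha)
```

A soft alpha mask that falls off toward the edge of the mouth region hides the seam between the inserted clip and the surrounding face.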

22 Video Rewrite Results • Ellen - Video Model: 8 minutes of data • JFK - Video Model: 2 minutes of data

23 Contributions • Data-driven lip animation • Automatic using vision and speech recognition • Photo realistic: • implicitly captures specific appearance + dynamics

24 Video Rewrite Thanks! Acknowledgments: S. Ahmad, M. Bajura, F. Crow, T. Darrell, M. Davis, G. Gordon, K. Force, B. Fuson, B. Lassiter, J. Lewis, K. Rahardja, S. Snibbe, C. Sequine, E. Tauber, B. Verplank, S. White, J. Woodfill, John F. Kennedy

1994: Scott et al (JPL + Graphco Technologies) /e/ /o/ /n/

1994: Scott et al (JPL + Graphco Technologies)
