
Video Rewrite: Driving Visual Speech with Audio

Christoph Bregler, Michele Covell, Malcolm Slaney (Interval Research Corporation)


Presentation Transcript


  1. Video Rewrite: Driving Visual Speech with Audio • Christoph Bregler • Michele Covell • Malcolm Slaney • Interval Research Corporation

  2. Goal: Photo-realistic Talking Face • Video Rewrite OR Handcoded 3D Model

  3. Facial Animation History: • Parke (1972) • Cohen & Massaro, Benoit et al. (1993) • Waters & Terzopoulos (1990), DEC-Face • Lewis (1991) • Litwinowicz & Williams (1994) • Chen, Graf, Petajan, et al. (1995) • Scott et al. (1994) • Ezzat & Poggio (1997) • Pighin et al. + Guenter et al. (1998) • Brand (1999) • Cosatto, Graf (2000)

  4. Video Rewrite: Overview • Analysis • Synthesis • /D/ /IY/ /P/ /AH/

  5. Video Rewrite: Overview • Analysis • Synthesis • /D/ /IY/ /P/ /AH/

  6. Annotation: • Phonetic • Head Pose • Mouth Shape (/D/ /OH/ /AH/ /N/)

  7. Phonetic Annotation • HMM Labels: /D/ /IY/ /P/ /AH/ • Triphones: /D-IY-P/ /IY-P-AH/
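The labels on this slide pair a phoneme string with the overlapping triphones derived from it. A minimal sketch of that expansion follows; the function name `triphones` and the `SIL` padding symbol at the utterance boundaries are assumptions for illustration, not something the slides specify.

```python
def triphones(phones, pad="SIL"):
    """Return overlapping left-centre-right triphones for a phoneme sequence.

    `pad` is a hypothetical boundary symbol so that the first and last
    phonemes also get a (partial) context.
    """
    padded = [pad] + list(phones) + [pad]
    # Slide a window of three phonemes, one step at a time.
    return ["-".join(padded[i:i + 3]) for i in range(len(padded) - 2)]


print(triphones(["D", "IY", "P", "AH"]))
```

Each phoneme becomes the centre of one triphone, so neighbouring triphones share two phonemes. That overlap is what the later matching slides rely on.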

  8. Phonetic Annotation • Acoustic Front-End: RASTA-PLP (Channel Invariant) • HMM Models / Gaussian Mixture Models (HTK) • Phoneme Set: 56 categories (CMU) • Triphone models trained on TIMIT • Annotation using Forced-Viterbi (and CMU pronunciation dictionary)
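Forced-Viterbi annotation aligns audio frames to a phoneme sequence that is already known from the transcript. The sketch below shows only that dynamic-programming idea. It assumes one state per phone, no transition probabilities, and a precomputed table `loglik[t][k]` (log-likelihood of frame `t` under the k-th phone of the transcript). It is not HTK's implementation, and `forced_align` is a hypothetical name.

```python
import math


def forced_align(loglik):
    """Assign each frame to a phone index, monotonically, maximising the
    summed log-likelihood. Simplified: one state per phone, no transition
    probabilities (an assumption, not the HMM setup on the slide)."""
    T, K = len(loglik), len(loglik[0])
    NEG = -math.inf
    score = [[NEG] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    score[0][0] = loglik[0][0]  # the alignment must start in the first phone
    for t in range(1, T):
        for k in range(K):
            stay = score[t - 1][k]                       # remain in phone k
            move = score[t - 1][k - 1] if k > 0 else NEG  # advance from k-1
            if move > stay:
                best, back[t][k] = move, k - 1
            else:
                best, back[t][k] = stay, k
            if best > NEG:
                score[t][k] = best + loglik[t][k]
    # The alignment must end in the last phone; trace the pointers back.
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Four frames, two phones (say /D/ then /IY/), made-up log-likelihoods.
frames = [[-1, -5], [-1, -4], [-6, -1], [-5, -1]]
print(forced_align(frames))
```

The returned list gives one phone index per frame, so phone boundaries fall wherever the index changes.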

  9. Annotation: • Phonetic • Head Pose • Mouth Shape (/D/ /OH/ /AH/ /N/)

  10. Head Pose Annotation: match planar template

  11. Annotation: • Phonetic • Head Pose • Mouth Shape (/D/ /OH/ /AH/ /N/)

  12. Mouth / Chin Annotation: Eigenpoints

  13. Eigenpoints: Training • Graylevel + XY Control Points

  14. Eigenpoints: Mapping • Graylevel + XY Control Point Space

  15. Video Rewrite: Overview • Analysis • Synthesis • /D/ /IY/ /P/ /AH/

  16. Video Rewrite: Overview • Analysis • Synthesis • /D/ /IY/ /P/ /AH/

  17. Synthesis: Overview • background face

  18. Synthesis: • Transcribe • Find Lip Clips • Stitch Together (/J/ /EH/ /IY/ /L/)

  19. Matching: target /AA/ /T/ /AA/

  20. Matching: Co-Articulation • target /AA/ /T/ /AA/ • candidate /UW-T-UW/ ?

  21. Matching: Co-Articulation • target /AA/ /T/ /AA/ • candidates /UW-T-UW/, /AA-T-AA/ • match

  22. Co-Articulation: Tri-Phones • More than 20,000 Tri-Phones in English • /UW-T-UW/ /AA-T-AA/ /AA-S-AA/ …

  23. Viseme-based Perceptual Match • Confusion Matrix, Owens (1985): P B S T K … • 11 Consonant Clusters, including: (CH, JH, SH, ZH); (K, G, N, L); (T, D, S, Z); (P, B, M); (F, V); (TH, DH)

  24. McGurk Effect -- Baldy by Cohen & Massaro

  25. Matching: Viseme-Distance • target /AA/ /T/ /AA/ • correct phone, wrong context: /UW-T-UW/ • correct viseme, correct context: /AA-S-AA/

  26. Matching: Viseme-Distance • target /AA/ /T/ /AA/ • candidates /UW-T-UW/, /AA-S-AA/ • approximate match
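The viseme-distance slides compare a target triphone against stored candidates that may differ in phone or in context. The sketch below is one way to score such candidates, using only the consonant clusters listed on the Owens (1985) slide. The cost values (0 for the same phone, 0.5 for the same viseme cluster, 1 otherwise) and the unweighted sum over the three positions are assumptions for illustration; the slides do not give the actual costs.

```python
# Consonant clusters as listed on the viseme slide (only those shown there).
VISEME_CLASSES = [
    {"CH", "JH", "SH", "ZH"},
    {"K", "G", "N", "L"},
    {"T", "D", "S", "Z"},
    {"P", "B", "M"},
    {"F", "V"},
    {"TH", "DH"},
]


def phone_distance(a, b, same_class_cost=0.5):
    """0 for identical phones, a reduced cost inside one viseme cluster,
    1 otherwise. The cost values are hypothetical."""
    if a == b:
        return 0.0
    if any(a in cls and b in cls for cls in VISEME_CLASSES):
        return same_class_cost
    return 1.0


def triphone_distance(target, candidate):
    """Sum the per-position phone distances of two 'L-C-R' triphones."""
    pairs = zip(target.split("-"), candidate.split("-"))
    return sum(phone_distance(a, b) for a, b in pairs)


for cand in ["UW-T-UW", "AA-S-AA"]:
    print(cand, triphone_distance("AA-T-AA", cand))
```

Under these assumed costs, /AA-S-AA/ scores lower than /UW-T-UW/ for the target /AA-T-AA/, which mirrors the slide's point that a clip with the right viseme and context can beat one with the right phone in the wrong context.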

  27. Matching: Overlapping Triphones • Shape Distance

  28. Matching: Trade-Offs • /IY/ /P/ /AA/ /T/ /AA/ • N-Viseme Distance • Shape Distance • Rate of Speech Distance

  29. Matching: N-Best Dynamic Programming • $\mathrm{Error} = \sum_{t} \left[ \alpha\, V(t) + \beta\, R(t) + \gamma\, S(t-1, t) \right]$ • N-best candidates per step $t$
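The error on this slide can be read as a weighted sum over time of a per-clip viseme term V(t), a per-clip rate-of-speech term R(t), and a pairwise shape term S(t-1, t) between consecutive clips. A minimal sketch of minimising such a sum over N-best candidate lists with dynamic programming follows. The data layout (dicts with `v`, `r`, and `shape` keys), the function names, and the scalar shape feature in the example are all assumptions, not the authors' implementation.

```python
def select_clips(candidates, shape_dist, alpha=1.0, beta=1.0, gamma=1.0):
    """Pick one candidate per step so that
    sum_t [alpha*V(t) + beta*R(t) + gamma*S(t-1, t)] is minimal.

    candidates[t] is the (hypothetical) N-best list for step t; each entry
    is a dict with 'v' and 'r' costs plus whatever shape_dist needs.
    Returns (total_error, list of chosen indices).
    """
    def local(c):
        return alpha * c["v"] + beta * c["r"]

    cost = [local(c) for c in candidates[0]]
    back = []  # back[t-1][j]: best predecessor index for candidate j at step t
    for t in range(1, len(candidates)):
        new_cost, ptr = [], []
        for c in candidates[t]:
            trans = [cost[i] + gamma * shape_dist(p, c)
                     for i, p in enumerate(candidates[t - 1])]
            i_best = min(range(len(trans)), key=trans.__getitem__)
            new_cost.append(trans[i_best] + local(c))
            ptr.append(i_best)
        cost = new_cost
        back.append(ptr)

    j = min(range(len(cost)), key=cost.__getitem__)
    total, path = cost[j], [j]
    for ptr in reversed(back):  # follow the back-pointers to the first step
        j = ptr[j]
        path.append(j)
    return total, path[::-1]


# Two steps, two candidates each; 'shape' is a made-up scalar lip feature.
nbest = [
    [{"v": 0.0, "r": 0.1, "shape": 1.0}, {"v": 0.5, "r": 0.0, "shape": 3.0}],
    [{"v": 0.0, "r": 0.0, "shape": 3.0}, {"v": 0.5, "r": 0.0, "shape": 1.0}],
]
print(select_clips(nbest, lambda p, c: abs(p["shape"] - c["shape"])))
```

In this toy input the locally best first clip is not selected, because joining it to the best second clip would cost more in shape distance than it saves. This is the trade-off the preceding slide describes.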

  30. Stitching

  31. Stitching

  32. Stitching: Morphing

  33. Morphing: Affine-Warp + Beier-Neely

  34. Simple Lighting Correction: 1.) Intensity 2.) Alpha Blending
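The two steps named on this slide can be illustrated with a small sketch. It assumes that the intensity step is a single gain matching the mean brightness of the mouth clip to the background face, and that the blend is the usual per-pixel mix `alpha * mouth + (1 - alpha) * background`. The slide gives no formulas, so both are assumptions. Plain nested lists stand in for grayscale images, and the function names are hypothetical.

```python
def mean(img):
    """Mean pixel value of a 2D list of floats."""
    pixels = [p for row in img for p in row]
    return sum(pixels) / len(pixels)


def match_intensity(mouth, background):
    """Scale the mouth clip so its mean brightness equals the background's
    (an assumed, very simple form of intensity correction)."""
    gain = mean(background) / mean(mouth)
    return [[gain * p for p in row] for row in mouth]


def alpha_blend(mouth, background, alpha):
    """Per-pixel blend; alpha is a same-sized mask with values in [0, 1],
    where 1 keeps the mouth pixel and 0 keeps the background pixel."""
    return [
        [a * m + (1.0 - a) * b for m, b, a in zip(m_row, b_row, a_row)]
        for m_row, b_row, a_row in zip(mouth, background, alpha)
    ]


mouth = [[50.0, 150.0]]
face = [[100.0, 300.0]]
mask = [[1.0, 0.5]]
print(alpha_blend(match_intensity(mouth, face), face, mask))
```

A mask that falls smoothly from 1 at the lips to 0 at the clip border hides the seam between the inserted mouth and the background face.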

  35. Video Rewrite Results • Ellen: Video Model, 8 minutes of data • JFK: Video Model, 2 minutes of data

  36. Contributions • Data-driven lip animation • Automatic, using vision and speech recognition • Photo-realistic: implicitly captures specific appearance + dynamics

  37. Video Rewrite • Thanks! • Acknowledgments: S. Ahmad, M. Bajura, F. Crow, T. Darrell, M. Davis, G. Gordon, K. Force, B. Fuson, B. Lassiter, J. Lewis, K. Rahardja, S. Snibbe, C. Sequine, E. Tauber, B. Verplank, S. White, J. Woodfill, John F. Kennedy

  38. 1994: Scott et al (JPL + Graphco Technologies) /e/ /o/ /n/

