Create Photo-Realistic Talking Face

Changbo Hu

2001.11.26

*This work was done while visiting Microsoft Research China with Baining Guo and Bo Zhang.




Outline

  • Introduction of talking face

  • Motivations

  • System overview

  • Techniques

  • Conclusions


Introduction

  • What is a talking face

    • Face (lip) animation, driven by voice

    • Applications

  • The process of talking face

    • Face model

    • Motion capture

    • Mapping between audio and video

    • Rendering



Some face models

  • Waters, 93, DECface, 2D wireframe model

  • Terzopoulos, 95, Skin and muscle model

  • Bregler, 97, Video Rewrite, Sample-image based

  • T. S. Huang, 98, Mesh model from range data

  • Poggio, 98, MikeTalk, Viseme morphing

  • Guenter, 99, Making Faces, 3D from multiple cameras

  • Zhengyou Zhang, 00, 3D face modeling from video through epipolar constraint

  • Cosatto, 00, Planar quads model


Motivations

  • Aim: a graphical interface for a conversational agent

    • Photo-realistic

    • Driven by Chinese speech

    • Smooth connection between sentences

  • Extended from “Video Rewrite”

System overview: Pipeline of the system (1)

System overview: Pipeline of the system (2)

[Pipeline diagram; box labels: New text, TTS system, Wav sound, Triphone sequence, Train database, Synthesized triphone sequence, Background sequence, Lip motion sequence, Rewrite to faces]


  • Analysis:

    • Audio process

    • Image process

  • Synthesis

    • Lip image

    • Background image

    • Stitch together

Audio part: Sound Segmentation

  • Given the wav file and the script

  • Use HMMs to train the segmentation system

  • Segment the wav file into a phoneme sequence

  • Example of the segmentation result: [figure not included in the transcript]

Annotation with Phoneme

  • Use phonemes to annotate the video frames

  • Each phoneme in a sentence corresponds to a short stretch of the video sequence
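
As a rough illustration of the annotation step, the sketch below labels each video frame with the phoneme whose time interval contains the frame's midpoint. The function name `annotate_frames`, the `(phoneme, start, end)` segment format, and the frame rate are assumptions made for this example; the slides do not say how the segmentation output is stored.

```python
from typing import List, Tuple

# Hypothetical segment format: (phoneme, start time in s, end time in s).
Segment = Tuple[str, float, float]


def annotate_frames(segments: List[Segment], fps: float, n_frames: int) -> List[str]:
    """Label each frame with the phoneme active at the frame's midpoint."""
    labels = []
    for i in range(n_frames):
        t = (i + 0.5) / fps  # midpoint time of frame i
        label = "SIL"  # frames outside every segment default to silence
        for phoneme, start, end in segments:
            if start <= t < end:
                label = phoneme
                break
        labels.append(label)
    return labels


segments = [("SIL", 0.0, 0.1), ("T", 0.1, 0.2), ("IY", 0.2, 0.4)]
labels = annotate_frames(segments, fps=10.0, n_frames=4)
print(labels)
```

Using the frame midpoint rather than the frame start is a design choice of this sketch; it avoids assigning a frame to a phoneme that ends just after the frame begins.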

Phoneme Distance Analysis

  • Phoneme and triphone basics

  • Chinese phonemes vs. English phonemes

  • Distance metric definitions

  • Results

Phoneme Basics

  • Phonemes represent the basic elements of speech. All possible speech can be represented by combinations of phonemes.

    CH, JH, S, EH, EY, OY, AE, SIL…

  • A triphone is three consecutive phonemes. It represents not only pronunciation characteristics but also context information.

    T-IY-P, IY-P-AA, P-AA-T…
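
A minimal sketch of how a phoneme sequence can be expanded into overlapping triphones, reproducing the example above. The helper name `triphones` and the hyphen-joined string format are assumptions for illustration only.

```python
from typing import List


def triphones(phonemes: List[str]) -> List[str]:
    """Return every run of three consecutive phonemes, joined with '-'."""
    return ["-".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


sequence = ["T", "IY", "P", "AA", "T"]
result = triphones(sequence)
print(result)
```

Consecutive triphones share two phonemes, which is what later allows neighbouring triphone videos to be compared over their overlap.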

Chinese Phoneme vs. English

  • Chinese phonemes fall into two basic groups: Initials and Finals.

    Initials: B, P, M, F, …

    Finals: a3, o1, e2, eng3, iang4, ue5, …

  • Each Chinese final has 5 tones: 1, 2, 3, 4, 5.

    Different tones: a1, a2, a3, a4, a5.

  • Chinese finals are actually not basic elements of speech.

    For example: iang1, iao1, uang1, iong1…

  • The Chinese phoneme set is much larger than the English one.

Phoneme Distance Analysis

  • Define the distance between any two phonemes.

  • Since we synthesize only video, not sound, tone is ignored

  • Lip shape motion is the core element for distance metrics.
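
Ignoring tone can be sketched as stripping the trailing tone digit from a final before looking up distances. The function `strip_tone` is hypothetical and assumes the labelling shown earlier (finals such as a3 or iang4, initials without digits).

```python
def strip_tone(phoneme: str) -> str:
    """Drop a trailing tone digit (1-5) from a Chinese final; leave initials as-is."""
    if len(phoneme) > 1 and phoneme[-1] in "12345":
        return phoneme[:-1]
    return phoneme


toneless = [strip_tone(p) for p in ["B", "a3", "iang4", "ue5"]]
print(toneless)
```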

Phoneme Distance Analysis

  • Phoneme 1: Videos 1 to 4 are time-aligned to a uniform length, then averaged to get an average video.

  • Phoneme 2: Videos 1 and 2 are time-aligned and averaged in the same way.

  • By comparing the two aligned average videos, we generate the distance matrix of the whole phoneme set.
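
The averaging and comparison steps can be sketched as follows. This is only an illustration under stated assumptions: each video is reduced to a sequence of lip-shape feature vectors, time alignment is plain linear interpolation, and the distance between two average videos is the mean Euclidean distance between corresponding frames. None of these choices, nor the names `resample`, `average_video`, `sequence_distance`, and `distance_matrix`, come from the slides.

```python
from typing import Dict, List

Video = List[List[float]]  # one lip-shape feature vector per frame (assumed)


def resample(video: Video, length: int) -> Video:
    """Time-align a video to `length` frames by linear interpolation."""
    if len(video) == 1 or length == 1:
        return [list(video[0]) for _ in range(length)]
    out = []
    for k in range(length):
        pos = k * (len(video) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(video) - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b for a, b in zip(video[lo], video[hi])])
    return out


def average_video(videos: List[Video], length: int) -> Video:
    """Average several samples of one phoneme after aligning them."""
    aligned = [resample(v, length) for v in videos]
    dims = len(aligned[0][0])
    return [[sum(v[t][d] for v in aligned) / len(aligned) for d in range(dims)]
            for t in range(length)]


def sequence_distance(a: Video, b: Video) -> float:
    """Mean Euclidean distance between corresponding frames."""
    total = 0.0
    for fa, fb in zip(a, b):
        total += sum((x - y) ** 2 for x, y in zip(fa, fb)) ** 0.5
    return total / len(a)


def distance_matrix(samples: Dict[str, List[Video]], length: int = 5):
    """Pairwise distances between the average videos of all phonemes."""
    avg = {p: average_video(v, length) for p, v in samples.items()}
    return {p: {q: sequence_distance(avg[p], avg[q]) for q in avg} for p in avg}


samples = {
    "a": [[[0.0], [2.0]], [[0.0], [1.0], [2.0]]],  # two samples, different lengths
    "m": [[[1.0], [1.0]]],
}
dist = distance_matrix(samples, length=3)
print(dist["a"]["m"])
```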

Image part: Pose Tracking

  • Assume a planar model for the face

  • Standard minimization method to find the transform matrix (affine transform) [Black, 95]

  • A mask is used to restrict tracking to the parts of the face that are of interest

Template Picture

Mask Image
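
To make the roles of the template, the mask, and the affine transform concrete, here is a small sketch of a masked matching cost. It is not the minimization of [Black, 95]: a brute-force search over integer translations stands in for it, images are plain nested lists of grey values, and the names `warp_point`, `masked_ssd`, and `best_translation` are invented for this example.

```python
from typing import List, Tuple

Image = List[List[float]]
# Assumed parameter order: x' = a*x + b*y + tx, y' = c*x + d*y + ty
Affine = Tuple[float, float, float, float, float, float]


def warp_point(p: Affine, x: float, y: float) -> Tuple[float, float]:
    a, b, tx, c, d, ty = p
    return a * x + b * y + tx, c * x + d * y + ty


def masked_ssd(template: Image, mask: Image, frame: Image, p: Affine) -> float:
    """Sum of squared differences over masked template pixels (nearest-neighbour)."""
    h, w = len(frame), len(frame[0])
    cost = 0.0
    for y, row in enumerate(template):
        for x, value in enumerate(row):
            if not mask[y][x]:
                continue  # pixel lies outside the region of interest
            u, v = warp_point(p, x, y)
            ui, vi = int(round(u)), int(round(v))
            if 0 <= ui < w and 0 <= vi < h:
                cost += (frame[vi][ui] - value) ** 2
            else:
                cost += value ** 2  # treat out-of-frame pixels as black
    return cost


def best_translation(template: Image, mask: Image, frame: Image, search: int = 2) -> Affine:
    """Brute-force stand-in for the minimization: try integer shifts only."""
    candidates = [(1.0, 0.0, float(tx), 0.0, 1.0, float(ty))
                  for ty in range(-search, search + 1)
                  for tx in range(-search, search + 1)]
    return min(candidates, key=lambda p: masked_ssd(template, mask, frame, p))


template = [[1.0, 2.0], [3.0, 4.0]]
mask = [[1, 1], [1, 1]]
frame = [[0.0] * 4 for _ in range(4)]
frame[2][1], frame[2][2], frame[3][1], frame[3][2] = 1.0, 2.0, 3.0, 4.0
pose = best_translation(template, mask, frame)
print(pose)
```

A real tracker would search all six affine parameters with a gradient-based or robust method; the cost function is the part that carries over.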

Pose tracking

  • Motion prediction using parameters with physical meaning

Pose Tracking

Some tracking results:

Lip Motion Tracking

  • Using Eigen-points (Covell, 91)

  • Feature points include the jaw, lips, and teeth

  • The training database is specified manually

  • Automatic tracking through all pose-tracked images

Lip Motion Tracking

[Figures: Train Database; Auto Tracking Results]

Synthesis of new sentences

  • New text is converted to wav by the TTS system

  • The wav is segmented into a phoneme sequence

  • DP is used to find an optimal video sequence from the training database

  • Time-align the triphone videos and stitch them together

  • Transform the lip sequence and paste it onto the background faces

Lip sequence synthesis

[Diagram labels: New phoneme sequences; Optimal phoneme sequences; Triphone 1 to Triphone 9; Triphone A, Triphone B, Triphone C; Dynamic Programming]


Edge Cost Definition

  • Two parts:

    • Phoneme distance: the distances of the 3 phonemes added together

    • Lip shape distance over the overlapping portion of the triphone videos

  • The two parts are combined as a weighted sum
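
The dynamic-programming search with this two-part cost can be sketched as a Viterbi-style pass over candidate triphone clips. Everything specific here is an assumption: the clip representation, the callbacks `phoneme_cost` and `lip_cost`, the weights, and the decision to charge the phoneme distance when a candidate is entered. The slides only state that the two parts are weighted and added.

```python
from typing import Callable, List, Sequence, Tuple


def best_path(targets: Sequence, candidates: Sequence[Sequence],
              phoneme_cost: Callable, lip_cost: Callable,
              w_phoneme: float = 1.0, w_lip: float = 1.0) -> Tuple[List, float]:
    """Pick one candidate clip per target triphone, minimising the total cost."""
    # cost[j]: cheapest total cost of any path ending at candidate j of this step
    cost = [w_phoneme * phoneme_cost(targets[0], c) for c in candidates[0]]
    back = []
    for i in range(1, len(targets)):
        new_cost, pointers = [], []
        for c in candidates[i]:
            options = [cost[k] + w_lip * lip_cost(prev, c)
                       for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(options)), key=options.__getitem__)
            new_cost.append(options[k_best] + w_phoneme * phoneme_cost(targets[i], c))
            pointers.append(k_best)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost)), key=cost.__getitem__)
    total = cost[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)], total


# Toy data: a clip is (triphone label, lip-opening value); both costs are stand-ins.
def toy_phoneme_cost(target: str, clip) -> float:
    return float(sum(a != b for a, b in zip(target.split("-"), clip[0].split("-"))))


def toy_lip_cost(prev_clip, clip) -> float:
    return abs(prev_clip[1] - clip[1])


targets = ["T-IY-P", "IY-P-AA"]
candidates = [
    [("T-IY-P", 0.9), ("D-IY-P", 0.2)],
    [("IY-P-AA", 0.1), ("IY-B-AA", 0.8)],
]
chosen, total = best_path(targets, candidates, toy_phoneme_cost, toy_lip_cost)
print(chosen, total)
```

Raising `w_lip` relative to `w_phoneme` trades exact phoneme matches for smoother lip transitions between neighbouring clips.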

Background video generation

  • The background is a video sequence of the virtual character saying something else

  • Similarity measurement of background

  • Select “standard frame”

    • The frame with the maximal number of frames similar to it

    • Filter out the frames with jerkiness
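
One possible reading of the standard-frame selection is sketched below. It assumes each background frame is summarised by a pose parameter vector, that two frames are "similar" when the Euclidean distance between their vectors is below a threshold, and that filtering jerkiness means dropping frames that are not similar to the standard frame. The slides do not define any of these, and `select_standard_frame` is a made-up name.

```python
from typing import List, Sequence, Tuple


def pose_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def select_standard_frame(poses: List[Sequence[float]],
                          threshold: float) -> Tuple[int, List[int]]:
    """Return the index of the standard frame and the indices of frames kept."""
    # For every frame, count how many frames (itself included) are similar to it.
    counts = [sum(1 for q in poses if pose_distance(p, q) <= threshold)
              for p in poses]
    best = max(range(len(poses)), key=counts.__getitem__)
    # Assumed filter: keep only frames similar to the standard frame.
    keep = [i for i, p in enumerate(poses)
            if pose_distance(p, poses[best]) <= threshold]
    return best, keep


poses = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
standard, kept = select_standard_frame(poses, threshold=0.15)
print(standard, kept)
```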

Stitch the time-aligned result to background faces

  • Write back with a mask

  • Transform the synthesized lip to the background face

Mask image for write-back operation

Original background frame

Write-back result of the same frame
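
The write-back with a mask can be sketched as a per-pixel blend of the synthesized lip patch into the background frame. This assumes grey-value images as nested lists, a mask with weights between 0 and 1, and a lip patch that has already been transformed into the background frame's pose; the function `write_back` and its parameters are illustrative, not taken from the slides.

```python
from typing import List

Image = List[List[float]]


def write_back(background: Image, lip: Image, mask: Image,
               top: int, left: int) -> Image:
    """Blend `lip` into a copy of `background` at (top, left), weighted by `mask`."""
    out = [row[:] for row in background]  # leave the original frame untouched
    for y, (lip_row, mask_row) in enumerate(zip(lip, mask)):
        for x, (value, alpha) in enumerate(zip(lip_row, mask_row)):
            by, bx = top + y, left + x
            # alpha = 1 keeps the lip pixel, alpha = 0 keeps the background pixel
            out[by][bx] = alpha * value + (1.0 - alpha) * out[by][bx]
    return out


background = [[10.0] * 3 for _ in range(3)]
lip = [[100.0, 100.0]]
mask = [[1.0, 0.5]]
result = write_back(background, lip, mask, top=1, left=0)
print(result[1])
```

A soft-edged mask (values falling from 1 to 0 near the border) hides the seam between the pasted lip region and the surrounding face.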

More video results

Conclusion and Future Work

  • Pose tracking and lip motion tracking

  • Size of the training database

  • Talking face with expression

  • Real-time generation?

  • Fast modeling for different people


Thank you
