Create Photo-Realistic Talking Face

Changbo Hu
2001.11.26

*This work was done while visiting Microsoft Research China, with Baining Guo and Bo Zhang.

Outline
  • Introduction of talking face
  • Motivations
  • System overview
  • Techniques
  • Conclusions

Introduction
  • What is a talking face
    • Face (lip) animation, driven by voice
    • Applications
  • The process of talking face
    • Face model
    • Motion capture
    • Mapping between audio and video
    • Rendering
  • Waters, 93, DECface, 2D wireframe model
  • Terzopoulos, 95, skin and muscle model
  • Bregler, 97, Video Rewrite, sample-image based
  • T. S. Huang, 98, mesh model from range data
  • Poggio, 98, MikeTalk, viseme morphing
  • Guenter, 99, Making Faces, 3D from multiple cameras
  • Zhengyou Zhang, 00, 3D face modeling from video through epipolar constraint
  • Cosatto, 00, planar quads model

Motivations
  • Aim: a graphical interface for a conversational agent
    • Photo-realistic
    • Driven by Chinese
    • Smooth connection between sentences
  • Extended from “Video Rewrite”
System overview: Pipeline of the system (2)

[Figure: pipeline diagram. Blocks: New text, TTS system, Wav sound, Triphone sequence, Train database, Synthesized triphone sequence, Background sequence, Lip motion sequence, Rewrite to faces]

  • Analysis:
    • Audio process
    • Image process
  • Synthesis
    • Lip image
    • Background image
    • Stitch together
Audio part: Sound Segmentation
  • Given the wav file and the script
  • Train an HMM-based segmentation system
  • Segment the wav file into a phoneme sequence
  • Example of the segmentation result (phoneme, start, end):

s      43   61
if4    62   74
j      75   80
ia1    81   97
sh     98  109
ang1  110  121
y     122  130
e4    131  133
y     134  145
in2   146  154
h     155  164
ang2  165  194
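
As a minimal sketch of how a segmentation result like the one above could be read into a program, the snippet below parses lines of the form "label start end". It assumes the two numbers are inclusive integer boundaries in whatever unit the segmenter uses (the slides do not state the unit); the names `Segment` and `parse_segments` are hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    phoneme: str  # phoneme label, e.g. "ia1"
    start: int    # first index covered by the phoneme (unit not stated in the slides)
    end: int      # last index covered by the phoneme (assumed inclusive)


def parse_segments(text: str) -> list[Segment]:
    """Parse whitespace-separated 'label start end' lines, skipping other lines."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # blank or malformed line
        label, start, end = parts
        segments.append(Segment(label, int(start), int(end)))
    return segments


example = """
s 43 61
if4 62 74
j 75 80
ia1 81 97
"""
segments = parse_segments(example)
```

In this example each segment starts one step after the previous one ends, which is what makes the inclusive-boundary reading plausible.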

Annotation with Phoneme
  • Use the phonemes to annotate the video frames
  • Each phoneme in a sentence corresponds to a short span of the video sequence
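
The annotation step can be sketched as a mapping from phoneme intervals to video frame indices. This is only an illustration under assumptions the slides do not confirm: segment boundaries are taken to be counts of a fixed time step (`unit_seconds`, hypothetically 10 ms) and the video frame rate (`fps`) is hypothetically 25. The function name `annotate_frames` is also made up for this sketch.

```python
def annotate_frames(segments, unit_seconds=0.01, fps=25.0):
    """Return {video_frame_index: phoneme} for (phoneme, start, end) segments.

    start/end are assumed to be inclusive indices of steps lasting
    unit_seconds each; both defaults are assumptions, not values from the slides.
    """
    labels = {}
    for phoneme, start, end in segments:
        first = int(start * unit_seconds * fps)
        last = int((end + 1) * unit_seconds * fps)  # exclusive video frame bound
        # Try to give even a very short phoneme one frame.
        for frame in range(first, max(last, first + 1)):
            # If two phonemes land on the same frame, the earlier one is kept.
            labels.setdefault(frame, phoneme)
    return labels


labels = annotate_frames([("s", 43, 61), ("if4", 62, 74)])
```

With these assumed settings, "s" covers video frames 10 to 14 and "if4" covers frames 15 to 17.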
Phoneme Distance Analysis
  • Phoneme and triphone basics
  • Chinese phonemes vs. English phonemes
  • Distance metric definitions
  • Results
Phoneme Basics
  • Phonemes represent the basic elements of speech. All possible speech can be represented by a combination of phonemes.

  • A triphone is three consecutive phonemes. It represents not only pronunciation characteristics but also context information.
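
To make the triphone idea concrete, here is a small sketch that turns a phoneme sequence into one triphone per phoneme, each carrying its left and right neighbor as context. Padding the ends with a silence label (`"sil"`) is an assumption for illustration; the slides do not say how sentence boundaries are handled.

```python
def triphones(phonemes, pad="sil"):
    """Return one (left, center, right) triphone per phoneme.

    The sequence is padded with `pad` on both sides so the first and last
    phonemes also get a context; the pad label is a hypothetical choice.
    """
    padded = [pad] + list(phonemes) + [pad]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


result = triphones(["s", "if4", "j"])
```

For the three phonemes above, `result` holds three triphones centered on "s", "if4" and "j" respectively.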


Chinese Phoneme vs. English
  • Chinese phonemes fall into two basic groups: initials and finals.
    • Initials: B, P, M, F, …
    • Finals: a3, o1, e2, eng3, iang4, ue5, …
  • Each Chinese final has 5 tones: 1, 2, 3, 4, 5.
    • Different tones: a1, a2, a3, a4, a5.
  • Chinese finals are not actually basic elements of speech.
    • For example: iang1, iao1, uang1, iong1…
  • The Chinese phoneme set is much larger than the English one.
Phoneme Distance Analysis
  • Define the distance between any two phonemes.
  • Since we synthesize only video, not sound, tone is ignored.
  • Lip shape motion is the core element of the distance metric.
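
Ignoring tone can be sketched as dropping the trailing tone digit from a final, so that a1 to a5 all collapse to the same label. The helper name `strip_tone` is hypothetical, and the sketch assumes tones are always written as a single trailing digit from 1 to 5, as in the examples in these slides.

```python
def strip_tone(phoneme: str) -> str:
    """Remove a trailing tone digit (1-5) from a Chinese final, if present."""
    if len(phoneme) > 1 and phoneme[-1] in "12345":
        return phoneme[:-1]
    return phoneme


toneless = [strip_tone(p) for p in ["s", "if4", "j", "ia1", "sh", "ang1"]]
```

Initials such as "s" or "sh" carry no tone digit and pass through unchanged.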
Phoneme Distance Analysis

[Figure: for each phoneme, its sample videos (Phoneme 1: Videos 1 to 4; Phoneme 2: Videos 1 and 2) are time-aligned to a uniform length and averaged to get an average video.]

By comparing the two aligned average videos, we generate the distance matrix of the whole phoneme set.
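
The align, average and compare procedure can be sketched with NumPy as follows. Several things here are assumptions for illustration only: each video is represented as an array of lip-shape feature vectors with shape (frames, features), time alignment is done by linear interpolation, and the distance is the root-mean-square difference between average videos. The slides do not specify the features, the alignment method or the metric.

```python
import numpy as np


def resample(seq, length):
    """Linearly interpolate a (frames, features) sequence to `length` frames."""
    seq = np.asarray(seq, dtype=float)
    src = np.linspace(0.0, 1.0, len(seq))
    dst = np.linspace(0.0, 1.0, length)
    return np.stack(
        [np.interp(dst, src, seq[:, d]) for d in range(seq.shape[1])], axis=1
    )


def average_video(samples, length):
    """Time-align all samples of one phoneme and average them frame by frame."""
    return np.mean([resample(s, length) for s in samples], axis=0)


def distance_matrix(samples_by_phoneme, length=10):
    """Return (phoneme names, matrix of RMS distances between average videos)."""
    names = sorted(samples_by_phoneme)
    averages = [average_video(samples_by_phoneme[n], length) for n in names]
    dist = np.zeros((len(names), len(names)))
    for i, a in enumerate(averages):
        for j, b in enumerate(averages):
            dist[i, j] = np.sqrt(np.mean((a - b) ** 2))
    return names, dist


# Toy data: one feature per frame, sequences of different lengths.
samples = {
    "a": [[[0.0], [2.0]], [[0.0], [1.0], [2.0]]],
    "b": [[[1.0], [2.0], [3.0]]],
}
names, dist = distance_matrix(samples, length=3)
```

The resulting matrix is symmetric with a zero diagonal, so in practice only the upper triangle would need to be computed.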

Image part: Pose Tracking
  • Assume a planar model for the face
  • Standard minimization method to find the transform matrix (affine transform) [Black, 95]
  • A mask is used to restrict tracking to the part of the face of interest

[Figures: template picture; mask image]
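
To illustrate what finding an affine transform by minimization looks like, the sketch below fits a 2x3 affine matrix to point correspondences by linear least squares. This is a simplified stand-in, not the method of [Black, 95], which minimizes an image-intensity error rather than using matched points; the function names are hypothetical. Restricting the input points to those inside the mask would play the role of the mask in this simplified setting.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares 2x3 affine matrix M such that dst ~ src @ M[:, :2].T + M[:, 2]."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    design = np.hstack([src, np.ones((len(src), 1))])  # rows are [x, y, 1]
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)  # shape (3, 2)
    return params.T  # shape (2, 3)


def apply_affine(matrix, points):
    """Apply a 2x3 affine matrix to an (N, 2) array of points."""
    points = np.asarray(points, dtype=float)
    return points @ matrix[:, :2].T + matrix[:, 2]


true_matrix = np.array([[1.1, 0.1, 5.0], [-0.2, 0.9, -3.0]])
template_points = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [7.0, 4.0]])
observed_points = apply_affine(true_matrix, template_points)
estimated = fit_affine(template_points, observed_points)
```

At least three non-collinear correspondences are needed for the six affine parameters to be determined.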

Pose Tracking
  • Motion prediction using parameters with physical meaning

[Figure: some tracking results]

Lip Motion Tracking
  • Using Eigen Points (Covell, 91)
  • Feature points include the jaw, lips and teeth
  • The training database is specified manually
  • Automatic tracking through all pose-tracked images

[Figures: training database; automatic tracking results]

Synthesize New Sentences
  • New text is converted to wav by the TTS system
  • The wav is segmented into a phoneme sequence
  • DP is used to find an optimal video sequence from the training database
  • Triphone videos are time-aligned and stitched together
  • The lip sequence is transformed and pasted onto the background faces
Lip Sequence Synthesis

[Figure: lip sequence synthesis. Labels: New phoneme sequences, Optimal phoneme sequences, Triphone 1 to Triphone 9, Triphone A to Triphone C]

Dynamic Programming

Edge Cost Definition
  • Two parts:
    • Phoneme distance: the distances of the 3 phonemes added together
    • Lip shape distance for the overlapping portion of the triphone videos
  • The two parts are combined as a weighted sum
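
The following is a minimal sketch of how a dynamic-programming search with such an edge cost might look; it is not the authors' implementation. It assumes each target triphone has a list of candidate clips from the training database, each candidate is a dict with a `"triphone"` key, and the caller supplies `phoneme_dist` and `lip_dist` functions. The weights `w_phon` and `w_lip` and all names are hypothetical.

```python
def edge_cost(target, cand, prev, phoneme_dist, lip_dist, w_phon=1.0, w_lip=1.0):
    """Weighted sum of the phoneme part and the lip-shape part of the cost."""
    # Phoneme part: distances of the 3 phoneme pairs added together.
    phon = sum(phoneme_dist(t, c) for t, c in zip(target, cand["triphone"]))
    # Lip part: mismatch with the previous clip (none for the first clip).
    lip = 0.0 if prev is None else lip_dist(prev, cand)
    return w_phon * phon + w_lip * lip


def best_sequence(targets, candidates, phoneme_dist, lip_dist, w_phon=1.0, w_lip=1.0):
    """Pick one candidate per target triphone with minimal total edge cost."""
    def cost_of(t, cand, prev):
        return edge_cost(targets[t], cand, prev, phoneme_dist, lip_dist, w_phon, w_lip)

    # cost[j]: cheapest total cost of any path ending in candidates[t][j].
    cost = [cost_of(0, c, None) for c in candidates[0]]
    back = []  # back[t - 1][j]: best predecessor index for candidates[t][j]
    for t in range(1, len(targets)):
        new_cost, pointers = [], []
        for cand in candidates[t]:
            options = [
                cost[i] + cost_of(t, cand, prev)
                for i, prev in enumerate(candidates[t - 1])
            ]
            i_best = min(range(len(options)), key=options.__getitem__)
            new_cost.append(options[i_best])
            pointers.append(i_best)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest path backwards from the best final candidate.
    j = min(range(len(cost)), key=cost.__getitem__)
    total = cost[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[t][j] for t, j in enumerate(path)], total
```

The search visits every pair of neighboring candidates once, so its cost grows with the number of target triphones times the square of the candidates per position.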
Background video generation
  • The background is a video sequence of the virtual character speaking something else
  • Similarity measurement of background frames
  • Select a “standard frame”
      • The frame with the maximal number of frames similar to it
      • Filter out the frames that cause jerkiness
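
Standard-frame selection can be sketched as counting, for every frame, how many other frames lie within a similarity threshold, and taking the frame with the highest count. The similarity measure (root-mean-square pixel difference), the threshold, and the idea of keeping only frames similar to the standard frame as the jerkiness filter are all assumptions for illustration; the slides do not define them.

```python
import numpy as np


def select_standard_frame(frames, threshold):
    """Return (index of the standard frame, indices of frames similar to it).

    Builds all pairwise differences at once, so it is only suitable for a
    small number of small frames.
    """
    flat = np.asarray(frames, dtype=float).reshape(len(frames), -1)
    diff = flat[:, None, :] - flat[None, :, :]
    dist = np.sqrt((diff ** 2).mean(axis=2))  # pairwise RMS pixel difference
    similar = dist <= threshold
    counts = similar.sum(axis=1) - 1  # do not count a frame as similar to itself
    best = int(np.argmax(counts))
    return best, np.flatnonzero(similar[best])


# Toy data: five constant 2x2 "frames" with different brightness.
frames = [np.full((2, 2), v) for v in (0.0, 0.1, 0.2, 5.0, 9.0)]
best, kept = select_standard_frame(frames, threshold=0.15)
```

In the toy data the second frame is within the threshold of both of its neighbors, so it is chosen and the two outlier frames are dropped.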
Stitch the time-aligned result to background faces
  • Write back with a mask
  • Transform the synthesized lip to the background face

[Figures: mask image for the write-back operation; original background frame; write-back result of the same frame]
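
A masked write-back can be sketched as a per-pixel blend between the synthesized lip image and the background frame. The sketch assumes the lip image has already been transformed into the background frame's coordinates and that the mask holds values between 0 (keep background) and 1 (take the lip image); the function name `write_back` is hypothetical.

```python
import numpy as np


def write_back(background, lip, mask):
    """Blend `lip` into `background`, weighted per pixel by `mask` in [0, 1]."""
    background = np.asarray(background, dtype=float)
    lip = np.asarray(lip, dtype=float)
    alpha = np.asarray(mask, dtype=float)
    if background.ndim == 3 and alpha.ndim == 2:
        alpha = alpha[..., None]  # share one mask across color channels
    return alpha * lip + (1.0 - alpha) * background


background = np.zeros((2, 2))
lip = np.full((2, 2), 10.0)
mask = np.array([[1.0, 0.0], [0.5, 0.0]])
result = write_back(background, lip, mask)
```

A mask with soft edges (values between 0 and 1 near the boundary) avoids a visible seam around the pasted region.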

Conclusion and Future Work
  • Pose tracking and lip motion tracking
  • Size of the training database
  • Talking face with expression
  • Real-time generation?
  • Fast modeling for different people