Slides by Changbo Hu, 2001.11.26.

Create Photo-Realistic Talking Face

Changbo Hu


*This work was done while visiting Microsoft Research China, with Baining Guo and Bo Zhang

Outline

  • Introduction to talking faces

  • Motivations

  • System overview

  • Techniques

  • Conclusions

Introduction

  • What is a talking face

    • Face (lip) animation, driven by voice

    • Applications

  • The process of talking face

    • Face model

    • Motion capture

    • Mapping between audio and video

    • Rendering


Literature

  • Waters, 93, DECface, 2D wireframe model

  • Terzopoulos, 95, skin and muscle model

  • Bregler, 97, Video Rewrite, sample-image based

  • T. S. Huang, 98, mesh model from range data

  • Poggio, 98, MikeTalk, viseme morphing

  • Guenter, 99, Making Faces, 3D from multiple cameras

  • Zhengyou Zhang, 00, 3D face modeling from video through epipolar constraints

  • Cosatto, 00, planar quads model

Motivations

  • Aim: a graphical interface for a conversational agent

    • Photo-realistic

    • Driven by Chinese speech

    • Smooth connection between sentences

  • Extends “Video Rewrite”

System overview: Pipeline of the system (1)

System overview: Pipeline of the system (2)

[Pipeline diagram. Boxes: New text; TTS system; Wav sound; Triphone sequence; Train database; Synthesized triphone sequence; Background sequence; Lip motion sequence; Rewrite to faces]

Techniques

  • Analysis:

    • Audio processing

    • Image processing

  • Synthesis

    • Lip image

    • Background image

    • Stitch together

Audio part: Sound Segmentation

  • Given the wav file and the script

  • Train an HMM-based segmentation system

  • Segment the wav file into a phoneme sequence

  • Example of the segmentation result:



    phoneme  start  end
    s        43     61
    if4      62     74
    j        75     80
    ia1      81     97
    sh       98     109
    ang1     110    121
    y        122    130
    e4       131    133
    y        134    145
    in2      146    154
    h        155    164
    ang2     165    194

Annotation with Phonemes

  • Use phonemes to annotate video frames

  • Each phoneme in a sentence corresponds to a short span of the video sequence
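
The annotation step can be sketched as follows. This is a minimal illustration, not the authors' code: the slides do not state the unit of the segmentation boundaries or the video frame rate, so the 10 ms unit and 25 fps used here are assumptions, and `parse_segments` and `annotate_frames` are hypothetical names.

```python
def parse_segments(text):
    """Parse lines of the form 'phoneme start end' into tuples."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            segments.append((parts[0], int(parts[1]), int(parts[2])))
    return segments


def annotate_frames(segments, unit_seconds=0.01, fps=25.0):
    """Label each video frame with the phoneme active at the frame midpoint.

    Start and end are treated as inclusive indices in segmentation units.
    unit_seconds and fps are assumed values, not taken from the slides.
    Returns a dict mapping video frame index -> phoneme label.
    """
    first = min(start for _, start, _ in segments) * unit_seconds
    last = (max(end for _, _, end in segments) + 1) * unit_seconds
    labels = {}
    frame = int(first * fps)
    while (frame + 0.5) / fps < last:
        mid = (frame + 0.5) / fps  # midpoint time of this video frame
        for phoneme, start, end in segments:
            if start * unit_seconds <= mid < (end + 1) * unit_seconds:
                labels[frame] = phoneme
                break
        frame += 1
    return labels


segments = parse_segments("s 43 61\nif4 62 74\nj 75 80\n")
labels = annotate_frames(segments)
```

Labeling by frame midpoint is one simple choice; a frame that straddles a boundary could also be assigned by majority overlap.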

Phoneme Distance Analysis

  • Phoneme and triphone basics

  • Chinese phonemes vs. English phonemes

  • Distance metric definitions

  • Results

Phoneme Basics

  • Phonemes are the basic elements of speech. Any utterance can be represented as a combination of phonemes.

    CH, JH, S, EH, EY, OY, AE, SIL…

  • A triphone is three consecutive phonemes. It captures not only pronunciation characteristics but also context information.

    T-IY-P, IY-P-AA, P-AA-T…
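
As a small illustration of the triphone definition, a phoneme sequence can be turned into overlapping triphones with a sliding window of width three. The function name `triphones` is hypothetical.

```python
def triphones(phonemes):
    """Return every run of three consecutive phonemes, joined with '-'."""
    return ["-".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


result = triphones(["T", "IY", "P", "AA", "T"])
# result == ["T-IY-P", "IY-P-AA", "P-AA-T"]
```

Consecutive triphones share two phonemes, which is what later allows neighboring triphone videos to overlap when they are stitched.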

Chinese Phonemes vs. English

  • Chinese phonemes fall into two basic groups: initials and finals.

    Initials: B, P, M, F, …

    Finals: a3, o1, e2, eng3, iang4, ue5, …

  • Each Chinese final can take one of 5 tones: 1, 2, 3, 4, 5.

    Different tones: a1, a2, a3, a4, a5.

  • Chinese finals are not actually basic elements of speech.

    For example: iang1, iao1, uang1, iong1…

  • The Chinese phoneme set is much larger than the English one.

Phoneme Distance Analysis

  • Define the distance between any two phonemes.

  • Since we synthesize only video, not sound, tone is ignored

  • Lip shape motion is the core element of the distance metric.
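
Ignoring tone amounts to merging finals that differ only in their trailing tone digit. A minimal sketch, assuming labels follow the notation used in these slides (a final followed by one digit from 1 to 5); `strip_tone` is a hypothetical helper name.

```python
def strip_tone(phoneme):
    """Drop a trailing tone digit (1-5) from a final such as 'iang4'."""
    if len(phoneme) > 1 and phoneme[-1] in "12345":
        return phoneme[:-1]
    return phoneme


toneless = [strip_tone(p) for p in ["s", "if4", "j", "ia1", "sh", "ang1"]]
# toneless == ["s", "if", "j", "ia", "sh", "ang"]
```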

Phoneme Distance Analysis

[Diagram: the sample videos of each phoneme (Phoneme 1: Videos 1 to 4; Phoneme 2: Videos 1 and 2) are time-aligned to a uniform length, then averaged to get one average video per phoneme.]

By comparing the two aligned average videos, we generate the distance matrix of the whole phoneme set.
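
The align, average, and compare procedure might look like the sketch below. It makes several assumptions that the slides do not specify: each video is reduced to a list of per-frame lip-shape feature vectors, time alignment is plain linear interpolation, and the comparison is the mean Euclidean distance between corresponding frames. All function names are hypothetical.

```python
import math


def resample(track, length):
    """Linearly interpolate a list of feature vectors to `length` frames."""
    if len(track) == 1 or length == 1:
        return [list(track[0]) for _ in range(length)]
    out = []
    for i in range(length):
        pos = i * (len(track) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(track) - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(track[lo], track[hi])])
    return out


def average_track(tracks, length):
    """Time-align all tracks to `length` frames and average them per frame."""
    aligned = [resample(t, length) for t in tracks]
    n = len(aligned)
    dims = len(aligned[0][0])
    return [[sum(t[i][d] for t in aligned) / n for d in range(dims)]
            for i in range(length)]


def track_distance(a, b):
    """Mean Euclidean distance between corresponding frames of two tracks."""
    return sum(math.dist(x, y) for x, y in zip(a, b)) / len(a)


def distance_matrix(samples, length=10):
    """samples maps phoneme -> list of tracks; returns {(p, q): distance}."""
    means = {p: average_track(tracks, length) for p, tracks in samples.items()}
    return {(p, q): track_distance(means[p], means[q])
            for p in means for q in means}
```

The resulting matrix is symmetric with a zero diagonal, so only one triangle needs to be stored in practice.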

Image part: Pose Tracking

  • Assume a planar model for the face

  • Standard minimization method to find the transform matrix (affine transform) [Black, 95]

  • A mask restricts the computation to the part of the face of interest

[Figures: template picture; mask image]
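
To make the six-parameter affine model concrete, the sketch below fits an affine transform by ordinary least squares. Note that this is a deliberately simplified stand-in: the slides describe minimizing an image-based error inside a mask, whereas this example fits the parameters to point correspondences, which are assumed to be given. `solve3` and `fit_affine` are hypothetical names.

```python
def solve3(m, v):
    """Solve the 3x3 system m * p = v by Gauss-Jordan elimination.

    Uses partial pivoting; a singular system (for example, collinear
    points) is not handled and raises ZeroDivisionError.
    """
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def fit_affine(src, dst):
    """Least-squares fit of x' = a*x + b*y + c and y' = d*x + e*y + f.

    src and dst are equal-length lists of (x, y) points.
    Returns [[a, b, c], [d, e, f]].
    """
    rows = [(x, y, 1.0) for x, y in src]
    # Normal equations: (A^T A) p = A^T b, solved once per output coordinate.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for k in range(2):  # k = 0 fits x', k = 1 fits y'
        atb = [sum(r[i] * d[k] for r, d in zip(rows, dst)) for i in range(3)]
        params.append(solve3(ata, atb))
    return params
```

A mask could be mimicked here by fitting only the correspondences that fall inside the region of interest.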

Pose Tracking

  • Motion prediction using parameters with physical meaning

Pose Tracking

Some tracking results:

Lip Motion Tracking

  • Using eigen-points (Covell, 96)

  • Feature points include jaw, lip, and teeth

  • Training database labeled manually

  • Automatic tracking through all pose-tracked images

Lip Motion Tracking

[Figures: train database; auto tracking results]

Synthesis of new sentences

  • New text is converted to wav by the TTS system

  • The wav is segmented into a phoneme sequence

  • Use dynamic programming (DP) to find an optimal video sequence from the training database

  • Time-align the triphone videos and stitch them together

  • Transform the lip sequence and paste it onto the background faces

Lip sequence synthesis

[Diagram: the new phoneme sequence (triphones A, B, C) is matched against database candidates (triphones 1 to 9) to give the optimal phoneme sequence.]

Dynamic Programming

Edge Cost Definition

  • Two parts:

    • Phoneme distance: the distances of the 3 phonemes added together

    • Lip shape distance over the overlapping portion of the triphone videos

  • The two parts are combined as a weighted sum
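
One way to read the DP search with this cost is a Viterbi-style pass over candidate clips, sketched below. How the two cost parts attach to the graph is an interpretation: here the phoneme distance is charged for choosing a candidate, and the lip shape distance for the transition from the previous candidate. The clip layout (a dict with a `triphone` key), the `lip_dist` callback, and the weights are hypothetical.

```python
def phoneme_cost(target, candidate, dist):
    """Sum of the three per-position phoneme distances between two triphones."""
    return sum(dist[(t, c)] for t, c in zip(target, candidate))


def best_sequence(targets, candidates, dist, lip_dist, w_phoneme=1.0, w_lip=1.0):
    """Pick one candidate clip per target triphone with minimal total cost.

    targets: list of triphones (3-tuples of phonemes)
    candidates: one list of clips per target; each clip has a 'triphone' key
    dist: dict mapping (phoneme, phoneme) -> distance
    lip_dist: function(prev_clip, clip) -> lip shape distance on the overlap
    Returns (list of chosen clips, total cost).
    """
    # cost[i][j]: cheapest total cost of a path ending at candidate j of step i
    cost = [[w_phoneme * phoneme_cost(targets[0], c["triphone"], dist)
             for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[i]:
            node = w_phoneme * phoneme_cost(targets[i], c["triphone"], dist)
            options = [cost[i - 1][k] + w_lip * lip_dist(p, c)
                       for k, p in enumerate(candidates[i - 1])]
            k_best = min(range(len(options)), key=options.__getitem__)
            row.append(options[k_best] + node)
            ptr.append(k_best)
        cost.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    total = cost[-1][j]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        if i > 0:
            j = back[i][j]
    path.reverse()
    return path, total
```

The weights trade off phonetic fidelity against visual smoothness: a larger `w_lip` favors clips that join smoothly even when their phonemes match the target less closely.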

Background video generation

  • The background is a video sequence of the virtual character saying something else

  • Similarity measurement of background

  • Select “standard frame”

    • The frame with the largest number of frames similar to it

    • Filter out frames that cause jerkiness
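
The standard-frame selection can be sketched as a simple counting rule. The slides do not define the similarity measurement, so this example assumes each background frame is summarized by a pose parameter vector and that two frames are similar when their Euclidean distance is within a threshold. Treating frames far from the standard frame as the jerky ones is also an assumption. Both function names are hypothetical.

```python
import math


def select_standard_frame(poses, threshold):
    """Return the index of the frame with the most frames similar to it."""
    def similar_count(i):
        return sum(1 for j, q in enumerate(poses)
                   if j != i and math.dist(poses[i], q) <= threshold)
    return max(range(len(poses)), key=similar_count)


def keep_similar(poses, standard, threshold):
    """Indices of frames within `threshold` of the standard frame."""
    return [i for i, p in enumerate(poses)
            if math.dist(p, poses[standard]) <= threshold]


poses = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.0), (10.0, 10.0)]
standard = select_standard_frame(poses, threshold=0.6)
kept = keep_similar(poses, standard, threshold=0.6)
```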

Stitch the time-aligned result to background faces

  • Write back with a mask

  • Transform the synthesized lip to the background face

[Figures: mask image for write-back operation; original background frame; write-back result of the same frame]
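
A masked write-back is commonly implemented as a per-pixel blend, and the sketch below shows that idea under stated assumptions: images are grayscale lists of rows, the lip image has already been transformed into the background frame's coordinates, and the mask holds weights between 0 (keep background) and 1 (use lip pixel). The slides do not say whether the mask is binary or soft. `write_back` is a hypothetical name.

```python
def write_back(background, lip, mask):
    """Blend lip pixels into the background: out = m * lip + (1 - m) * bg."""
    return [[m * l + (1.0 - m) * b for b, l, m in zip(brow, lrow, mrow)]
            for brow, lrow, mrow in zip(background, lip, mask)]


background = [[10.0, 10.0], [10.0, 10.0]]
lip = [[50.0, 50.0], [50.0, 50.0]]
mask = [[0.0, 1.0], [0.5, 0.0]]
result = write_back(background, lip, mask)
# result == [[10.0, 50.0], [30.0, 10.0]]
```

A soft mask edge, with values between 0 and 1, helps hide the seam between the pasted lip region and the background face.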

Conclusion and Future Work

  • Pose tracking and lip motion tracking

  • Size of the training database

  • Talking face with expression

  • Real-time generation?

  • Fast modeling for different people