
Create Photo-Realistic Talking Face - PowerPoint PPT Presentation



Create Photo-Realistic Talking Face. Changbo Hu 2001.11.26 * This work was done during visiting Microsoft Research China with Baining Guo and Bo Zhang. Outline. Introduction of talking face Motivations System overview Techniques Conclusions. Introduction. What is a talking face

Presentation Transcript

Create Photo-Realistic Talking Face

Changbo Hu

2001.11.26

*This work was done while visiting Microsoft Research China, with Baining Guo and Bo Zhang


Outline

  • Introduction to talking faces

  • Motivations

  • System overview

  • Techniques

  • Conclusions


Introduction

  • What is a talking face

    • Face (lip) animation, driven by voice

    • Applications

  • The process of talking face

    • Face model

    • Motion capture

    • Mapping between audio and video

    • Rendering, photo-realistic?


Literature

  • Waters, 93, DECface, 2D wireframe model

  • Terzopoulos, 95, Skin and muscle model

  • Bregler, 97, Video Rewrite, Sample-image based

  • TS Huang, 98, Mesh model from range data

  • Poggio, 98, MikeTalk, Viseme morphing

  • Guenter, 99, Making Faces, 3D from multiple cameras

  • Zhengyou Zhang, 00, 3D face modeling from video through epipolar constraint

  • Cosatto, 00, Planar quads model



Motivations

  • Aim: a graphical interface for a conversational agent

    • Photo-realistic

    • Driven by Chinese speech

    • Smooth connection between sentences

  • Extended from “Video Rewrite”


System overview: Pipeline of the system (1)


System overview: Pipeline of the system (2)

[Pipeline diagram; box labels: New text, TTS system, Wav sound, Segmentation, Triphone sequence, Train database, Synthesized triphone sequence, Background sequence, Lip motion sequence, Rewrite to faces]


Techniques

  • Analysis:

    • Audio process

    • Image process

  • Synthesis

    • Lip image

    • Background image

    • Stitch together


Audio part: Sound Segmentation

  • Given the wav file and the script

  • Train an HMM-based segmentation system

  • Segment the wav file into a phoneme sequence

  • Example of the segmentation result:

SILOPEN 0 23

SILOPEN 24 42

s 43 61

if4 62 74

j 75 80

ia1 81 97

sh 98 109

ang1 110 121

y 122 130

e4 131 133

y 134 145

in2 146 154

h 155 164

ang2 165 194
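
The listing above has one segment per line: a phoneme label followed by two integers. A minimal sketch of reading such a listing, assuming the integers are inclusive start and end indices (the slide does not state their unit). `Segment`, `parse_segmentation`, and `is_contiguous` are hypothetical names, not part of the original system.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    label: str  # phoneme label, e.g. "ang1" or "SILOPEN"
    start: int  # first index of the segment (unit not stated on the slide)
    end: int    # last index of the segment, assumed inclusive


def parse_segmentation(text: str) -> List[Segment]:
    """Parse lines of the form '<label> <start> <end>', skipping blank lines."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if len(parts) != 3:
            raise ValueError(f"expected 3 fields, got: {line!r}")
        label, start, end = parts
        segments.append(Segment(label, int(start), int(end)))
    return segments


def is_contiguous(segments: List[Segment]) -> bool:
    """True if every segment starts right after the previous one ends."""
    return all(b.start == a.end + 1 for a, b in zip(segments, segments[1:]))


example = """
s 43 61
if4 62 74
j 75 80
ia1 81 97
"""
segments = parse_segmentation(example)
```

The contiguity check is a cheap sanity test: in the example result, each segment begins one index after the previous one ends.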


Annotation with Phoneme

  • Use phonemes to annotate video frames

  • Each phoneme in a sentence corresponds to a short span of the video sequence


Phoneme Distance Analysis

  • Phoneme & triphone basics

  • Chinese Phoneme vs. English Phoneme

  • Distance Metrics definitions

  • Results


Phoneme Basics

  • Phonemes represent the basic elements of speech. All possible speech can be represented by a combination of phonemes.

    CH, JH, S, EH, EY, OY, AE, SIL…

  • A triphone is three consecutive phonemes. It represents not only pronunciation characteristics but also context information.

    T-IY-P, IY-P-AA, P-AA-T…
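
As a small illustration of the triphone definition, the sketch below slides a window of three over a phoneme sequence. The function name `triphones` and the hyphen-joined output format are assumptions chosen to match the example on the slide.

```python
from typing import List


def triphones(phonemes: List[str]) -> List[str]:
    """Return every run of three consecutive phonemes, joined by '-'."""
    return ["-".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


result = triphones(["T", "IY", "P", "AA", "T"])
# result == ["T-IY-P", "IY-P-AA", "P-AA-T"]
```

A sequence of n phonemes yields n - 2 overlapping triphones, and each neighbouring pair shares two phonemes.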


Chinese Phoneme vs. English

  • Chinese phonemes fall into two basic groups: initials and finals.

    Initials: B, P, M, F, …

    Finals: a3, o1, e2, eng3, iang4, ue5, …

  • Each Chinese final has 5 tones: 1, 2, 3, 4, 5.

    Different tones: a1, a2, a3, a4, a5.

  • Chinese finals are not actually basic elements of speech.

    For example: iang1, iao1, uang1, iong1…

  • The Chinese phoneme set is much larger than the English one.


Phoneme Distance Analysis

  • Define the distance between any two phonemes.

  • Since we synthesize only video, not sound, tone is ignored.

  • Lip shape motion is the core element of the distance metric.
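
Ignoring tone can be sketched as dropping the trailing tone digit from a final's label, so that a1 through a5 all map to the same visual class. This assumes labels follow the convention shown in the slides (a final followed by one digit from 1 to 5); `strip_tone` is a hypothetical helper.

```python
def strip_tone(label: str) -> str:
    """Drop a trailing tone digit 1-5 from a label; leave other labels alone."""
    if len(label) > 1 and label[-1] in "12345":
        return label[:-1]
    return label


labels = ["s", "if4", "j", "ia1", "sh", "ang1", "ang2"]
toneless = [strip_tone(p) for p in labels]
# "ang1" and "ang2" now share the label "ang"
```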


Phoneme Distance Analysis

Phoneme 1: Video 1, Video 2, Video 3, Video 4

Phoneme 2: Video 1, Video 2

For each phoneme:

  • Time-align the videos to a uniform length

  • Average the videos to get an average video

By comparing the two aligned average videos, we generate the distance matrix of the whole phoneme set.
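
The time-align, average, and compare steps can be sketched as follows. The slides do not say how a video is represented or which distance is used, so this sketch assumes each frame is a short vector of lip-shape features, aligns by linear interpolation, and compares with a mean Euclidean distance. All names here are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Frame = List[float]  # assumed per-frame lip features, e.g. [width, height]
Clip = List[Frame]


def resample(clip: Clip, length: int) -> Clip:
    """Linearly interpolate a clip to exactly `length` frames."""
    if len(clip) == 1:
        return [list(clip[0]) for _ in range(length)]
    out = []
    for i in range(length):
        pos = i * (len(clip) - 1) / (length - 1) if length > 1 else 0.0
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(clip) - 1)
        t = pos - lo
        out.append([(1 - t) * a + t * b for a, b in zip(clip[lo], clip[hi])])
    return out


def average_clip(clips: List[Clip], length: int) -> Clip:
    """Time-align all clips to `length` frames, then average frame by frame."""
    aligned = [resample(c, length) for c in clips]
    n = len(aligned)
    dims = len(aligned[0][0])
    return [
        [sum(c[i][d] for c in aligned) / n for d in range(dims)]
        for i in range(length)
    ]


def clip_distance(a: Clip, b: Clip) -> float:
    """Mean Euclidean distance between corresponding frames of two clips."""
    return sum(math.dist(x, y) for x, y in zip(a, b)) / len(a)


def distance_matrix(averages: Dict[str, Clip]) -> Dict[Tuple[str, str], float]:
    """Pairwise distances between the average clips of all phonemes."""
    return {
        (p, q): clip_distance(a, b)
        for p, a in averages.items()
        for q, b in averages.items()
    }


avg1 = average_clip([[[0.0], [2.0]], [[0.0], [1.0], [2.0]]], length=3)
avg2 = average_clip([[[1.0], [1.0]]], length=3)
matrix = distance_matrix({"p1": avg1, "p2": avg2})
```

Here the two recordings of the first phoneme have different lengths but the same motion, so they average to the same three-frame clip.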


Image part: Pose Tracking

  • Assume a planar model for the face

  • A standard minimization method finds the transform matrix (affine transform) [Black, 95]

  • A mask restricts the computation to the part of the face of interest

Template Picture

Mask Image
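
To make the idea of recovering an affine transform by minimization concrete, here is a deliberately simplified stand-in. It is not the method cited on the slide, which works on image data with a mask; instead it fits the six affine parameters to known point correspondences by ordinary least squares. `solve3` and `fit_affine` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve3(m: List[List[float]], v: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("degenerate point set (for example, collinear)")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def fit_affine(src: Sequence[Point], dst: Sequence[Point]) -> List[List[float]]:
    """Least-squares 2x3 affine matrix mapping src points onto dst points."""
    rows = [[x, y, 1.0] for x, y in src]
    # Normal equations: (A^T A) p = A^T b, solved once per output coordinate.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for k in range(2):  # k = 0 fits x', k = 1 fits y'
        atb = [sum(r[i] * d[k] for r, d in zip(rows, dst)) for i in range(3)]
        params.append(solve3(ata, atb))
    return params


template = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
observed = [(2.0, 3.0), (4.0, 3.0), (2.0, 5.0), (4.0, 5.0)]
matrix = fit_affine(template, observed)  # scale by 2, then shift by (2, 3)
```

Each row of the result is (a, b, c) in x' = a*x + b*y + c. At least three non-collinear points are needed.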


Pose Tracking

  • Motion prediction using parameters with physical meaning


Pose Tracking

Some tracking results:


Lip Motion Tracking

  • Using Eigen Points (Covell, 91)

  • Feature points include the jaw, lips, and teeth

  • Training database specified manually

  • Auto tracking through all pose-tracked images



Lip Motion Tracking

Train Database

(hand-labeled)

Auto Tracking Results


Synthesizing New Sentences

  • New text is converted to wav by the TTS system

  • The wav is segmented into a phoneme sequence

  • DP finds an optimal video sequence from the training database

  • Time-align the triphone videos and stitch them together

  • Transform the lip sequence and paste it onto the background faces


Lip Sequence Synthesis

[Diagram; labels: New phoneme sequences, Optimal phoneme sequences, Triphone 1 to Triphone 9, Triphone A, Triphone B, Triphone C]


Dynamic Programming

[Diagram; labels: Begin, Triphone1, Triphone2, Triphone3, Triphone4, Triphone5, End]
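
A dynamic-programming search of this kind can be sketched as a shortest path through a lattice: one column of candidate database clips per target triphone, with a cost on each candidate and on each transition. The slides do not give the recurrence, so the Viterbi-style formulation and the names `best_path`, `node_cost`, and `edge_cost` are assumptions. Every column is assumed to be non-empty.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

C = TypeVar("C")


def best_path(
    candidates: Sequence[Sequence[C]],
    node_cost: Callable[[int, C], float],
    edge_cost: Callable[[C, C], float],
) -> Tuple[float, List[C]]:
    """Pick one candidate per position so total node + edge cost is minimal."""
    # best[j] = (cost of the cheapest path ending at candidate j, that path)
    best = [(node_cost(0, c), [c]) for c in candidates[0]]
    for pos in range(1, len(candidates)):
        nxt = []
        for c in candidates[pos]:
            cost, path = min(
                ((pc + edge_cost(pp[-1], c), pp) for pc, pp in best),
                key=lambda t: t[0],
            )
            nxt.append((cost + node_cost(pos, c), path + [c]))
        best = nxt
    return min(best, key=lambda t: t[0])


# Toy example: candidates are numbers, and transitions cost their difference.
columns = [[1, 5], [2, 9], [3, 4]]
total, path = best_path(
    columns,
    node_cost=lambda pos, c: 0.0,
    edge_cost=lambda a, b: abs(a - b),
)
```

Because only the best path into each candidate is kept, the work grows with the number of positions times the square of the candidates per position, instead of with the number of all possible sequences.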


Edge Cost Definition

  • Two parts:

    • Phoneme distance: the three phonemes’ distances added together

    • Lip shape distance over the overlapping portion of the triphone videos

  • The two parts are combined as a weighted sum
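
One reading of this definition, written out as a sketch: the phoneme part sums the distances between the three phonemes of a wanted triphone and those of a candidate, and the lip part is a precomputed distance over the overlapping frames. The weights, the pairing of target against candidate, and all names are assumptions, since the slide does not specify them.

```python
from typing import Callable, Tuple

Triphone = Tuple[str, str, str]


def edge_cost(
    target: Triphone,
    candidate: Triphone,
    phoneme_dist: Callable[[str, str], float],
    lip_overlap_dist: float,
    w_phoneme: float = 1.0,
    w_lip: float = 1.0,
) -> float:
    """Weighted sum of the summed phoneme distances and the lip-shape distance."""
    phoneme_part = sum(phoneme_dist(t, c) for t, c in zip(target, candidate))
    return w_phoneme * phoneme_part + w_lip * lip_overlap_dist


def toy_dist(a: str, b: str) -> float:
    """Placeholder metric: 0 for identical phonemes, 1 otherwise."""
    return 0.0 if a == b else 1.0


cost = edge_cost(("T", "IY", "P"), ("T", "IY", "B"), toy_dist,
                 lip_overlap_dist=0.5, w_phoneme=1.0, w_lip=2.0)
```

In practice `toy_dist` would be replaced by a lookup into a phoneme distance matrix, and the two weights would need tuning.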


Background Video Generation

  • The background is a video sequence in which the virtual character is saying something else

  • Similarity measurement of the background

  • Select a “standard frame”

    • The frame that has the maximal number of frames similar to it

    • Filter out the frames that cause jerkiness
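
Choosing the standard frame can be sketched as a counting problem: for each frame, count how many other frames lie within some similarity threshold, and take the frame with the highest count. The distance function and the threshold are placeholders, since the slides do not define the similarity measure. `standard_frame_index` is a hypothetical name.

```python
from typing import Callable, Sequence, TypeVar

F = TypeVar("F")


def standard_frame_index(
    frames: Sequence[F],
    dist: Callable[[F, F], float],
    threshold: float,
) -> int:
    """Index of the frame with the most other frames within `threshold`."""

    def similar_count(i: int) -> int:
        return sum(
            1
            for j, other in enumerate(frames)
            if j != i and dist(frames[i], other) <= threshold
        )

    # On ties, max() keeps the earliest frame.
    return max(range(len(frames)), key=similar_count)


# Toy example: each "frame" is reduced to a single pose value.
poses = [0.0, 1.0, 1.2, 0.9, 5.0]
index = standard_frame_index(poses, lambda a, b: abs(a - b), threshold=0.25)
```

Frames far from the standard frame under the same distance would then be natural candidates for filtering.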


Stitch the time-aligned result to background faces

  • Write back with a mask

  • Transform the synthesized lip to the background face
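
The write-back step can be sketched as a per-pixel blend controlled by the mask. This assumes the lip image has already been transformed into the background frame's pose and that all three images have the same size. The grayscale list-of-rows representation and the name `write_back` are illustrative only.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def write_back(background: Image, lips: Image, mask: Image) -> Image:
    """Blend `lips` into `background`; mask 1.0 keeps lips, 0.0 keeps background."""
    return [
        [m * l + (1.0 - m) * b for b, l, m in zip(brow, lrow, mrow)]
        for brow, lrow, mrow in zip(background, lips, mask)
    ]


background = [[10.0, 10.0], [10.0, 10.0]]
lips = [[50.0, 50.0], [50.0, 50.0]]
mask = [[0.0, 1.0], [0.5, 0.0]]
result = write_back(background, lips, mask)
# result == [[10.0, 50.0], [30.0, 10.0]]
```

Fractional mask values near the boundary give a soft edge, which helps hide the seam between the pasted lips and the background face.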


Mask image for write-back operation

Original background frame

Write-back result of the same frame




Conclusion and Future Work

  • Pose tracking and lip motion tracking

  • Size of the training database

  • Talking face with expression

  • Real-time generation?

  • Fast modeling for different people



