Create Photo-Realistic Talking Face


Create Photo-Realistic Talking Face. Changbo Hu, 2001.11.26. This work was done during a visit to Microsoft Research China, with Baining Guo and Bo Zhang. Outline: introduction to talking faces, motivations, system overview, techniques, conclusions.

Presentation Transcript


Create Photo-Realistic Talking Face

Changbo Hu

2001.11.26

*This work was done during a visit to Microsoft Research China, with Baining Guo and Bo Zhang


Outline

  • Introduction to talking faces

  • Motivations

  • System overview

  • Techniques

  • Conclusions


Introduction

  • What is a talking face

    • Face (lip) animation, driven by voice

    • Applications

  • The process of building a talking face

    • Face model

    • Motion capture

    • Mapping between audio and video

    • Rendering: photo-realistic?


Literature

  • Waters, 93, DECface, 2D wireframe model

  • Terzopoulos, 95, Skin and muscle model

  • Bregler, 97, Video Rewrite, sample-image based

  • T. S. Huang, 98, Mesh model from range data

  • Poggio, 98, MikeTalk, viseme morphing

  • Guenter, 99, Making Faces, 3D from multiple cameras

  • Zhengyou Zhang, 00, 3D face modeling from video through epipolar constraint

  • Cosatto, 00, Planar quads model


Some Face models


Motivations

  • Aim: a graphical interface for a conversational agent

    • Photo-realistic

    • Driven by Chinese speech

    • Smooth connection between sentences

  • Extends “Video Rewrite”


System overview: Pipeline of the system (1)


System overview: Pipeline of the system (2)

New text -> TTS system -> Wav sound -> Segmentation -> Triphone sequence

Triphone sequence + Train database -> Synthesized triphone sequence -> Lip motion sequence

Lip motion sequence + Background sequence -> Rewrite to faces


Techniques

  • Analysis:

    • Audio processing

    • Image processing

  • Synthesis:

    • Lip image

    • Background image

    • Stitch together


Audio part: Sound Segmentation

  • Given the wav file and the script

  • Train the segmentation system using HMMs

  • Segment the wav file into a phoneme sequence

  • Example of the segmentation result:

Phoneme   Start   End
SILOPEN   0       23
SILOPEN   24      42
s         43      61
if4       62      74
j         75      80
ia1       81      97
sh        98      109
ang1      110     121
y         122     130
e4        131     133
y         134     145
in2       146     154
h         155     164
ang2      165     194


Annotation with Phoneme

  • Use phonemes to annotate video frames

  • Each phoneme in a sentence corresponds to a short span of the video sequence
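
The annotation step can be sketched as follows. This is a minimal illustration, assuming the segmentation output is a list of (phoneme, start, end) entries and that start and end are inclusive video frame indices; the slides do not state the unit. The function name `annotate_frames` is hypothetical, and the sample values are read from the segmentation example.

```python
def annotate_frames(segments):
    """Map every frame index to the phoneme label that covers it.

    segments: list of (label, start, end) with inclusive bounds.
    Treating the bounds as frame indices is an assumption.
    """
    labels = {}
    for label, start, end in segments:
        for frame in range(start, end + 1):
            labels[frame] = label
    return labels


segments = [("SILOPEN", 0, 23), ("SILOPEN", 24, 42), ("s", 43, 61), ("if4", 62, 74)]
frame_labels = annotate_frames(segments)
print(frame_labels[50])  # s
```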


Phoneme Distance Analysis

  • Phoneme & triphone basics

  • Chinese phonemes vs. English phonemes

  • Distance metric definitions

  • Results


Phoneme Basics

  • Phonemes are the basic elements of speech. Any utterance can be represented as a combination of phonemes.

    CH, JH, S, EH, EY, OY, AE, SIL…

  • A triphone is three consecutive phonemes. It captures not only pronunciation characteristics but also context information.

    T-IY-P, IY-P-AA, P-AA-T…
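
A short sketch of how triphones can be enumerated from a phoneme sequence. The function name `triphones` and the hyphen-joined string form are illustrative choices, not taken from the slides.

```python
def triphones(phonemes):
    """Return every run of three consecutive phonemes, joined with '-'."""
    return ["-".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


print(triphones(["T", "IY", "P", "AA", "T"]))
# ['T-IY-P', 'IY-P-AA', 'P-AA-T']
```

A sequence of n phonemes yields n - 2 overlapping triphones, so neighbouring triphones share two phonemes of context.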


Chinese Phoneme vs. English

  • Chinese phonemes fall into two basic groups: initials and finals.

    Initials: B, P, M, F, …

    Finals: a3, o1, e2, eng3, iang4, ue5, …

  • Each Chinese final has 5 tones: 1, 2, 3, 4, 5.

    Different tones: a1, a2, a3, a4, a5.

  • Chinese finals are not actually basic elements of speech.

    For example: iang1, iao1, uang1, iong1…

  • The Chinese phoneme set is much larger than the English one.


Phoneme Distance Analysis

  • Define the distance between any two phonemes.

  • Since we synthesize only video, not sound, tone is ignored.

  • Lip shape motion is the core element of the distance metric.
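
Ignoring tone can be sketched as dropping the trailing tone digit from each final before any distance lookup. This assumes the label convention used in the examples (a final followed by one digit from 1 to 5); the helper name `strip_tone` is hypothetical.

```python
def strip_tone(phoneme):
    """Drop a trailing tone digit (1-5) so that a1 and a3 map to the same unit."""
    if phoneme and phoneme[-1] in "12345":
        return phoneme[:-1]
    return phoneme


print([strip_tone(p) for p in ["sh", "ang1", "y", "e4", "in2"]])
# ['sh', 'ang', 'y', 'e', 'in']
```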


Phoneme Distance Analysis

Phoneme 1: Video 1, Video 2, Video 3, Video 4 -> time-align to a uniform length -> average the videos to get an average video

Phoneme 2: Video 1, Video 2 -> time-align to a uniform length -> average the videos to get an average video

By comparing the two aligned average videos, we generate the distance matrix of the whole phoneme set.
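
The align, average and compare steps can be sketched as below. Several details are assumptions rather than facts from the slides: each video is represented as a list of per-frame lip feature vectors, time alignment is done by linear resampling, and the distance between two average videos is the mean Euclidean distance between corresponding frames. All function names are hypothetical.

```python
import math


def resample(frames, length):
    """Linearly resample a list of feature vectors to `length` frames."""
    if len(frames) == 1 or length == 1:
        return [list(frames[0]) for _ in range(length)]
    out = []
    for i in range(length):
        pos = i * (len(frames) - 1) / (length - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(frames) - 1)
        t = pos - lo
        out.append([(1 - t) * a + t * b for a, b in zip(frames[lo], frames[hi])])
    return out


def average_video(videos, length):
    """Time-align all sample videos of one phoneme, then average frame by frame."""
    aligned = [resample(v, length) for v in videos]
    dims = len(aligned[0][0])
    return [[sum(v[f][d] for v in aligned) / len(aligned) for d in range(dims)]
            for f in range(length)]


def video_distance(a, b):
    """Mean Euclidean distance between corresponding frames of two aligned videos."""
    per_frame = [math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
                 for fa, fb in zip(a, b)]
    return sum(per_frame) / len(per_frame)


def distance_matrix(samples, length=10):
    """samples: dict mapping phoneme -> list of videos. Returns {(p, q): distance}."""
    avg = {p: average_video(videos, length) for p, videos in samples.items()}
    return {(p, q): video_distance(avg[p], avg[q]) for p in avg for q in avg}


# Toy data: one-dimensional "lip opening" per frame.
samples = {
    "a": [[[0.0], [2.0]], [[0.0], [1.0], [2.0]]],
    "o": [[[1.0], [3.0]]],
}
D = distance_matrix(samples, length=3)
print(D[("a", "o")])  # 1.0
```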


Image part: Pose Tracking

  • Assume a planar model for the face

  • Standard minimization method to find the transform matrix (affine transform) [Black, 95]

  • A mask restricts tracking to the region of interest on the face

Template Picture

Mask Image


Pose tracking

  • Motion prediction using parameters with physical meaning


Pose Tracking

Some tracking results:


Lip Motion Tracking

  • Using Eigen Points (Covell, 91)

  • Feature points include jaw, lips and teeth

  • Training database specified manually

  • Auto tracking through all pose-tracked images


Lip motion tracking


Lip Motion Tracking

Train Database (hand-labeled)

Auto Tracking Results


Synthesizing new sentences

  • New text is converted to a wav file by the TTS system

  • The wav file is segmented into a phoneme sequence

  • Dynamic programming (DP) finds an optimal video sequence from the training database

  • Time-align the triphone videos and stitch them together.

  • Transform the lip sequence and paste it onto the background faces.


Lip sequence synthesis

[Figure: new phoneme sequences and the optimal phoneme sequences; boxes labeled Triphone 1 to Triphone 9 and Triphone A, B, C]


Dynamic Programming

[Figure: graph from Begin to End through nodes Triphone1 to Triphone5]
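
A minimal sketch of a dynamic-programming search over candidate triphone clips, in the style of a Viterbi shortest path. The split into a per-clip cost (`node_cost`) and a joining cost between consecutive clips (`link_cost`) is an assumption made for illustration; the slides only show the graph. All names and the toy costs are hypothetical.

```python
def best_path(candidates, node_cost, link_cost):
    """Pick one clip per position so that the summed cost is minimal.

    candidates[t] lists the database clips usable at position t.
    Returns (total_cost, chosen_clips).
    """
    # best[c] = (cost of the cheapest path ending in clip c, that path)
    best = {c: (node_cost(0, c), [c]) for c in candidates[0]}
    for t in range(1, len(candidates)):
        nxt = {}
        for c in candidates[t]:
            cost, path = min(
                ((prev_cost + link_cost(p, c), prev_path)
                 for p, (prev_cost, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            nxt[c] = (cost + node_cost(t, c), path + [c])
        best = nxt
    return min(best.values(), key=lambda item: item[0])


# Toy example: nine candidate clips, three per position.
clip_cost = {"1": 2, "2": 1, "3": 3, "4": 1, "5": 2, "6": 1, "7": 1, "8": 0, "9": 2}
smooth = {("1", "4"), ("2", "5"), ("5", "8"), ("3", "6")}  # pairs that join for free
total, path = best_path(
    [["1", "2", "3"], ["4", "5", "6"], ["7", "8", "9"]],
    node_cost=lambda t, c: clip_cost[c],
    link_cost=lambda p, c: 0 if (p, c) in smooth else 1,
)
print(total, path)  # 3 ['2', '5', '8']
```

Keeping only the cheapest path into each candidate at every position makes the search linear in the number of positions, instead of enumerating every combination of clips.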


Edge Cost Definition

  • Two parts:

    • Phoneme distance: the distances of the 3 phonemes added together

    • Lip shape distance for the overlapping portion of the triphone videos

  • The two parts are combined as a weighted sum
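
The edge cost described above can be sketched as a weighted sum. The weight names `w_phoneme` and `w_lip`, the dictionary form of the phoneme distance table, and passing the lip shape distance as a precomputed number are all assumptions for illustration; the slides do not give the weights or the data layout.

```python
def edge_cost(target, candidate, phoneme_dist, lip_dist, w_phoneme=1.0, w_lip=1.0):
    """Weighted sum of phoneme distance and lip shape distance.

    target, candidate: triphones as 3-tuples of phoneme labels.
    phoneme_dist: dict mapping (phoneme, phoneme) -> distance.
    lip_dist: lip shape distance over the overlapping frames, precomputed.
    """
    phoneme_part = sum(phoneme_dist[(a, b)] for a, b in zip(target, candidate))
    return w_phoneme * phoneme_part + w_lip * lip_dist


phoneme_dist = {("sh", "s"): 0.5, ("ang", "ang"): 0.0, ("y", "y"): 0.0}
cost = edge_cost(("sh", "ang", "y"), ("s", "ang", "y"), phoneme_dist,
                 lip_dist=2.0, w_phoneme=1.0, w_lip=0.25)
print(cost)  # 1.0
```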


Background video generation

  • The background is a video sequence of the virtual character saying something else

  • Similarity measure between background frames

  • Select “standard frame”

    • The frame with the largest number of frames similar to it

    • Filter out frames that cause jerkiness
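
Selecting the standard frame can be sketched as counting, for each frame, how many other frames fall within a similarity threshold. The distance function, the threshold, and the reduction of each frame to a single number are assumptions for illustration; the slides do not define the similarity measure.

```python
def standard_frame(frames, distance, threshold):
    """Index of the frame that has the most other frames within `threshold`."""
    def similar_count(i):
        return sum(1 for j in range(len(frames))
                   if j != i and distance(frames[i], frames[j]) <= threshold)
    return max(range(len(frames)), key=similar_count)


# Toy data: each frame reduced to one pose number (hypothetical).
frames = [0, 10, 12, 9, 50]
index = standard_frame(frames, distance=lambda a, b: abs(a - b), threshold=2)
print(index)  # 1
```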


Stitch the time-aligned result to background faces

  • Write back with a mask

  • Transform the synthesized lip to the background face
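
The write-back with a mask can be sketched as a per-pixel blend. This assumes the synthesized lip image has already been transformed into the background frame's coordinates, that both images are grayscale grids of the same size, and that the mask holds weights between 0.0 and 1.0; none of these details are given on the slides. The function name `write_back` is hypothetical.

```python
def write_back(background, lip, mask):
    """Blend per pixel: mask 1.0 takes the lip image, 0.0 keeps the background."""
    return [
        [m * l + (1.0 - m) * b for b, l, m in zip(brow, lrow, mrow)]
        for brow, lrow, mrow in zip(background, lip, mask)
    ]


background = [[10.0, 10.0], [10.0, 10.0]]
lip = [[50.0, 50.0], [50.0, 50.0]]
mask = [[0.0, 0.5], [1.0, 0.0]]
print(write_back(background, lip, mask))
# [[10.0, 30.0], [50.0, 10.0]]
```

Fractional mask values near the boundary give a gradual transition between the pasted lip region and the surrounding face.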


Mask image for write-back operation

Original background frame

Write-back result of the same frame


More video results


More video results


Conclusion and Future Work

  • Pose tracking and lip motion tracking

  • Size of the training database

  • Talking face with expression

  • Real-time generation?

  • Fast modeling for different people


Animation


Thank you

