1 / 29

Video Rewrite Driving Visual Speech with Audio

Video Rewrite Driving Visual Speech with Audio. Christoph Bregler Michele Covell Malcolm Slaney Presenter : Jack jeryes 3/3/2008. What is video rewrite ?. use existing footage to create new video of a person mouthing words that he did not speak in the original footage . Example:.

jui
Download Presentation

Video Rewrite Driving Visual Speech with Audio

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Video RewriteDriving Visual Speech with Audio Christoph Bregler Michele Covell Malcolm Slaney Presenter:Jack jeryes 3/3/2008

  2. What is video rewrite? use existing footage to create new video of a person mouthing words that he did not speak in the original footage

  3. Example:

  4. Why video rewrite? • movie dubbing : to sync the actors’ lip motions to the new soundtrack • Teleconferencing • Special effects

  5. Approach • Learn from example footage how a person’s face changes during speech (dynamics and idiosyncrasies)

  6. Stages Video rewrite have two statges: • Analysis stage • Synthesis stage

  7. Analysis stage: use the audio track to segment the video into triphones. Vision techniques find the head orientation , mouth & chin shape and position in each image

  8. Synthesis stage: segments new audio and uses it to select triphones from the video model. Based on labels from the analysis stage, the new mouth images are morphed into a new background face

  9. Analysis for video modeling • the analysis stage creates an annotated database of example video clips, derived from unconstrained footage. (video model) -Annotation Using Image Analysis -Annotation Using Audio Analysis

  10. Annotation Using Image Analysis As face moves within the frame, need to know -mouth position -lip shapes at all times. Using eigenpoints (good for low resolution)

  11. Eigenpoints : A small set of hand-labeled facial images is used to train subspace models. Given a new image, the eigenpoint models tell us the positions of points on the lips and jaw

  12. Eigenpoints (cont.) • 54 eigenpoints for each image : 34 on the mouth 20 on the chin and jaw line. • Only 26 images hand labeled 26 / 14,218 about 0.2% • Extended the hand-annotated dataset by morphing pairs to form intermediate images

  13. Eigenpoints (cont.) • Eigenpoints doesn’t allow variety of motions. • thus, warp each face image into a standard reference plane, prior to eigpoints labeling • Use affine transform to minimize the mean-squared error between a large portion of the face image and a facial template

  14. Mask to estimate global warp Each image is warped to account for changes in the head’s position, size, and rotation. The transform minimizes the difference between the transformed images and the face template. The mask (left) forces the minimization to consider only the upper face (right).

  15. global mapping… • Once the best global mapping is found, it is inverted and applied to the image, putting that face into the standard coordinate frame. • We then perform eigenpoints analysis on this pre-warped image to find the fiduciary points. • Finally, we back-project the fiduciary points through the global warp to place them on the original face image

  16. Annotation Using Audio Analysis • All the speech segmented into sequences of phonemes • the /T/ in “beet” looks different from the /T/ in “boot.” • Consider coarticulation

  17. Annotation Using Audio Analysis • Use triphones: collections of three sequential phonemes “teapot” is split into : /SIL-T-IY/ /T-IY-P/ /IY-P-AA/ /P-AA-T/ and /AA-T-SIL/

  18. Annotation Using Audio Analysis • While synthesize a video, -Emphasize the middle of each triphone. -Cross-fade the overlapping regions of neighboring triphones

  19. Synthesis using a video model segments new audio and uses it to select triphones from the video model. Based on labels from the analysis stage, the new mouth images are morphed into a new background face

  20. Synthesis using a video model • background, head tilts and the eyes blink taken from the source footage in the same order as they were shot • the triphone images include the mouth, chin, and part of the cheeks, • use illumination-matching techniques to avoid visible seams

  21. Selection of Triphone Videos choosing a sequence of clips that approximates the desired transitions and shape continuity

  22. Selection of Triphone Videos Given a triphone in the new speech utterance, we compute a matching distance to each triphone in the video database Dp = phoneme-context distance Ds = lip-shape distance

  23. Dp = phoneme-context distance Dp is based on categorical distances between phoneme categories and between viseme classes Dp= waited sum (viseme-distance , phonemic-distance)

  24. 26 viseme classes : 1- /CH/ /JH/ /SH/ /ZH/ 2- /K/ /G/ /N/ /L/ /T/ /D/ 3- /P/ /B/ /M/ ..

  25. Dp = phoneme-context distance -Phonemic-distance( /P/ , /P/ ) = 0 same phonemic category -Viseme-distance ( /P/ ,/IY/ ) = 1 different viseme classes Dp ( /P/ ,/B/ ) = between 0-1 same viseme class different phonemic category

  26. Ds = lip-shape distance Ds ,measures how closely the mouth Contours match in overlapping segments of adjacent triphone videos In “teapot” : /IY/ and /P/ in /T-IY-P/ shall match the contours for /IY/ and /P/ in /IY-P-AA/

  27. Ds = lip-shape distance Euclidean distance frame by frame between 4-elements feature vector (overall lip width , overall lip high, inner lip height, height of visible teeth)

  28. Stitching all Together • The remaining task is to stitch the triphone videos into the background sequence

More Related