
End-to-End Speech-Driven Facial Animation with Temporal GANs



Presentation Transcript


  1. End-to-End Speech-Driven Facial Animation with Temporal GANs Patrick Groot Koerkamp (6628478)

  2. High level overview • Generating videos of a talking head • Audio synchronized lip movements • Natural facial expressions (Blinks and Eyebrow movements) • Temporal GAN

  3. Generative Adversarial Networks (GAN) • Generator • Discriminator
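A minimal sketch of the generator/discriminator game named on this slide, assuming the standard GAN objective (Goodfellow et al.): the discriminator maximizes log D(x) + log(1 − D(G(z))) while the generator tries to fool it. The probabilities here are hypothetical discriminator outputs, not values from the paper.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy form of the discriminator objective:
    reward confidence on real samples, reject generated ones."""
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss commonly used in practice:
    the generator wants the discriminator to score fakes as real."""
    return -np.mean(np.log(d_fake))

d_real = np.array([0.9, 0.8])  # D is fairly confident real frames are real
d_fake = np.array([0.2, 0.1])  # D mostly rejects generated frames
print(discriminator_loss(d_real, d_fake))  # low: D is winning
print(generator_loss(d_fake))              # high: G must improve
```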

  4. Motivation • Simplify film animation process • Better lip-syncing • Generate parts of occluded faces • Improve band-limited visual telecommunications

  5. Background • Goal: generate realistic faces • Prior work: map audio features (MFCC) to computer-graphics face models • Significant overhead • Alternative: transform audio features directly to video frames • Neglects facial expressions • Generates from present information only • No facial dynamics • Challenging

  6. Proposal / Contributions • GAN capable of generating videos • From an audio signal and a single still image • Subject independent • No reliance on handcrafted audio or visual features • No post-processing • Comprehensive assessment • Method performance • Image quality • Lip-reading verification • Identity preservation • Realism (Turing test)

  7. Related work • Speech-Driven Facial Animation • Acoustics, Vocal-tract, Facial motion • Hidden Markov Models (HMM) • Deep neural networks • Convolutional neural networks • GAN-Based Video Synthesis • Image/Video generation • MoCoGAN • Cross-modal applications

  8. End-to-End Speech-Driven Facial Synthesis • 1 Generator • ReLU activations, TanH output • 2 Discriminators • ReLU activations, Sigmoid output

  9. Generator • Identity Encoder • Context Encoder • Audio Encoder • RNN • Frame Decoder • Noise Generator
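The six generator components on this slide can be sketched as one forward pass: encode the identity from the still image once, run audio features through an RNN, and decode each hidden state (plus per-frame noise) into a frame. This is an illustrative numpy sketch, not the paper's PyTorch implementation; every learned sub-network is replaced by a fixed random projection, and all dimensions except the 256-dim audio feature (stated on a later slide) are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder dimensions; only AUDIO_FEAT = 256 comes from the slides.
AUDIO_FEAT, ID_FEAT, NOISE, HIDDEN, FRAME = 256, 128, 10, 256, 64 * 64

W_audio = rng.standard_normal((AUDIO_FEAT, 800)) * 0.01        # stand-in for 7-layer CNN
W_id    = rng.standard_normal((ID_FEAT, FRAME)) * 0.01         # stand-in for 6-layer CNN
W_gru   = rng.standard_normal((HIDDEN, HIDDEN + AUDIO_FEAT)) * 0.01
W_dec   = rng.standard_normal((FRAME, HIDDEN + ID_FEAT + NOISE)) * 0.01

def generate_video(still_image, audio_chunks):
    """One frame per audio chunk: identity encoding is computed once and
    reused, while the RNN state evolves with the audio."""
    z_id = W_id @ still_image                 # Identity Encoder (constant)
    h = np.zeros(HIDDEN)                      # RNN state (GRU in the paper)
    frames = []
    for chunk in audio_chunks:
        a = W_audio @ chunk                   # Audio Encoder: 256-dim feature
        h = np.tanh(W_gru @ np.concatenate([h, a]))
        z_noise = rng.standard_normal(NOISE)  # Noise Generator: drives expressions
        frames.append(np.tanh(W_dec @ np.concatenate([h, z_id, z_noise])))
    return np.stack(frames)                   # Frame Decoder output, TanH range

video = generate_video(rng.standard_normal(FRAME),
                       [rng.standard_normal(800) for _ in range(5)])
print(video.shape)  # (5, 4096): five frames from five audio chunks
```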

  10. Audio Encoder & Context Encoder • Audio Encoder • 7 Layer CNN • Extracts 256 dimensional features • Passed to RNN • Context Encoder • Audio Encoder followed by a 2 Layer GRU (Gated Recurrent Unit)
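The GRU named on this slide follows the standard update equations (Cho et al., 2014); the context encoder stacks two such layers. A single-cell sketch with made-up toy dimensions (only the 256-dim input matches the slide):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: gates decide how much of the old state to keep."""
    z = sigmoid(Wz @ x + Uz @ h)             # update gate
    r = sigmoid(Wr @ x + Ur @ h)             # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h)) # candidate state
    return (1 - z) * h + z * h_tilde         # interpolate old and new state

rng = np.random.default_rng(0)
X, H = 256, 16  # 256-dim audio features (slide); 16-dim state is illustrative
params = [rng.standard_normal((H, X)) * 0.1 if i % 2 == 0 else
          rng.standard_normal((H, H)) * 0.1 for i in range(6)]

h = np.zeros(H)
for _ in range(3):  # run a few audio timesteps
    h = gru_cell(rng.standard_normal(X), h, *params)
print(h.shape)  # (16,)
```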

  11. Identity Encoder & Frame Decoder • Identity Encoder • 6 Layer CNN • Produces identity encoding • Frame Decoder • 6 Layer CNN • Generates a frame of the sequence

  12. Discriminators • Frame Discriminator • 6 Layer CNN • Is a single frame real or not? • Sequence Discriminator • Is the whole video sequence real or not?
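The two discriminators operate at different granularities: one scores individual frames, the other scores the clip as a whole. A dependency-free sketch, assuming the frame discriminator is conditioned on the still image and the sequence discriminator sees frame/audio pairs; the linear scorers here are hypothetical stand-ins for the paper's CNNs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
FRAME, AUDIO = 4096, 256
w_frame = rng.standard_normal(2 * FRAME) * 0.01   # stand-in for 6-layer CNN
w_seq   = rng.standard_normal(FRAME + AUDIO) * 0.01

def frame_discriminator(frame, still_image):
    """Per-frame real/fake probability, conditioned on the input still image."""
    return sigmoid(w_frame @ np.concatenate([frame, still_image]))

def sequence_discriminator(frames, audio_feats):
    """Clip-level score: averages per-step scores over frame+audio pairs."""
    scores = [sigmoid(w_seq @ np.concatenate([f, a]))
              for f, a in zip(frames, audio_feats)]
    return float(np.mean(scores))

frames = [rng.standard_normal(FRAME) for _ in range(3)]
audio  = [rng.standard_normal(AUDIO) for _ in range(3)]
still  = rng.standard_normal(FRAME)
frame_score = frame_discriminator(frames[0], still)
seq_score = sequence_discriminator(frames, audio)
print(frame_score, seq_score)  # both in (0, 1)
```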

  13. Training • Losses • L1 reconstruction loss between generated and real frames • Adversarial loss, yielding the optimal generator G* = arg min_G max_D L(D, G) • Adam • Learning rates • Generator: 2 * 10^-4 • Frame Discriminator: 10^-3 • Decay after epoch 20 (10% rate) • Sequence Discriminator: 5 * 10^-5
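The generator's objective combines the slide's two terms: an adversarial term (fool the discriminator) plus an L1 reconstruction term. A hedged sketch, the weighting `lam` is an assumption since the slide names only "L1"; the learning-rate table reproduces the values stated above.

```python
import numpy as np

def l1_loss(generated, target):
    """Mean absolute pixel error between generated and real frames."""
    return np.mean(np.abs(generated - target))

def generator_total_loss(d_fake_score, generated, target, lam=1.0):
    """Adversarial term plus lam-weighted L1 reconstruction term."""
    adv = -np.log(d_fake_score)  # small when the discriminator is fooled
    return adv + lam * l1_loss(generated, target)

# Learning rates exactly as stated on the slide (Adam optimizer).
LR = {"generator": 2e-4,
      "frame_discriminator": 1e-3,     # decays 10% after epoch 20
      "sequence_discriminator": 5e-5}

# Example: discriminator undecided (0.5), unit reconstruction error.
print(generator_total_loss(0.5, np.zeros(3), np.ones(3)))
```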

  14. Experiments • PyTorch • Nvidia GTX 1080 Ti • Takes a week to train • Avg. generation time: 7ms • 75 sequential frames synthesized in 0.5s • CPU • Avg. generation time: 1s • 75 sequential frames synthesized in 15s

  15. Experiments (2) • Datasets • GRID • TCD-TIMIT • Training data increased by mirroring • Metrics • Generated video quality: PSNR & SSIM • Frame sharpness: FDBM & CPBD • Content: ACD • Accuracy of spoken message: WER
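PSNR, the first video-quality metric listed, is computed directly from the mean squared error between a generated frame and the ground-truth frame (higher is better). A minimal sketch with made-up 8x8 frames:

```python
import numpy as np

def psnr(generated, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    mse = np.mean((generated.astype(float) - target.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

a = np.full((8, 8), 100.0)
b = np.full((8, 8), 110.0)   # uniform error of 10 per pixel -> MSE = 100
print(round(psnr(a, b), 2))  # 28.13
```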

  16. Qualitative Results • Produces realistic videos • Also works on previously unseen faces • Characteristic human expressions • Frowns • Blinks

  17. Qualitative Results (2) • GAN-based static baseline • L1 loss and adversarial loss • Used as baseline for quantitative assessment • Failures of the static baseline • Opening the mouth when silent • Neglecting previous frames

  18. Quantitative Results • Performance measured on the GRID & TCD-TIMIT datasets • Compared to the static baseline • 30-person survey (Turing test) • 10 videos • 153 responses • Avg. 63% correct

  19. Quiz

  20. Future work • Different architectures • More natural sequences • Expressions are currently generated randomly • Natural extension • Capture mood • Reflect mood in facial expressions

  21. Questions
