Multimodal Deep Learning
Presentation Transcript


Multimodal Deep Learning

Jiquan Ngiam

Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee & Andrew Ng

Stanford University


McGurk Effect


Audio-Visual Speech Recognition


Feature Challenge

[Diagram: input features feed a classifier (e.g. SVM).]
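The pipeline the slide sketches is the conventional one: a feature extractor produces a fixed-length vector per example, and an off-the-shelf classifier such as an SVM is trained on top. A minimal, self-contained illustration with stand-in random features (all names and shapes here are hypothetical scaffolding, not the authors' code):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in feature matrix and labels (shapes are illustrative only)
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 32))   # one feature vector per example
labels = rng.integers(0, 26, size=100)  # e.g. 26 letter classes

clf = LinearSVC().fit(features, labels)
print("training accuracy:", clf.score(features, labels))
```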


Representing Lips

  • Can we learn better representations for audio/visual speech recognition?

  • How can multimodal data (multiple sources of input) be used to find better features?


Unsupervised Feature Learning

[Diagram: an unlabeled input example is mapped to a learned feature vector.]




Multimodal Features

[Diagram: audio and video inputs are combined into a single multimodal feature vector.]


Cross-Modality Feature Learning

[Diagram: video input is mapped to a feature vector, with audio available as an additional cue during feature learning.]


Feature Learning Models


Feature Learning with Autoencoders

[Diagram: one autoencoder per modality — audio input encoded and decoded to an audio reconstruction, video input encoded and decoded to a video reconstruction.]
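As a sketch of what one such per-modality autoencoder looks like in a modern framework (the original work used different tooling; the sizes and names below are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnimodalAutoencoder(nn.Module):
    """Encode one modality (audio or video) into a hidden
    representation, then decode back to a reconstruction."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                                     nn.Sigmoid())
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        h = self.encoder(x)           # learned features
        return self.decoder(h), h     # reconstruction and features

# e.g. audio spectrogram frames flattened to 100-d vectors (illustrative)
model = UnimodalAutoencoder(input_dim=100, hidden_dim=64)
x = torch.randn(32, 100)
recon, feats = model(x)
loss = F.mse_loss(recon, x)           # reconstruction objective
```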


Bimodal Autoencoder

[Diagram: audio and video inputs feed a single hidden representation, which reconstructs both the audio and the video.]
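A corresponding sketch of the bimodal version: the only change is that the hidden layer sees the concatenation of both modalities and must reconstruct each one (again, hypothetical sizes and names):

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """A single hidden representation over concatenated audio and
    video, decoded back into separate per-modality reconstructions."""
    def __init__(self, audio_dim: int, video_dim: int, shared_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim + video_dim, shared_dim), nn.Sigmoid())
        self.audio_decoder = nn.Linear(shared_dim, audio_dim)
        self.video_decoder = nn.Linear(shared_dim, video_dim)

    def forward(self, audio, video):
        h = self.encoder(torch.cat([audio, video], dim=1))  # shared code
        return self.audio_decoder(h), self.video_decoder(h)
```

(The training-scheme sketch later in the transcript reuses this forward(audio, video) interface.)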




Shallow Learning

  • Mostly unimodal features learned — most hidden units end up strongly connected to only one modality

[Diagram: a single layer of hidden units over audio and video inputs.]




Bimodal Autoencoder

[Diagram: the same network with only video input, still reconstructing both audio and video.]

Cross-modality Learning:

Learn better video features by using audio as a cue


Cross-modality Deep Autoencoder

[Diagram: video input passes through stacked hidden layers to a learned representation, which decodes to both video and audio reconstructions.]
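A hedged sketch of this deep, video-only variant: video alone is encoded through stacked layers, and the resulting representation is asked to reconstruct both modalities (dimensions and names are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalityDeepAutoencoder(nn.Module):
    """Video-only input; the learned representation must reconstruct
    both the video and the (withheld) audio."""
    def __init__(self, audio_dim, video_dim, hidden_dim, rep_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(video_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, rep_dim), nn.Sigmoid())
        self.video_decoder = nn.Sequential(
            nn.Linear(rep_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, video_dim))
        self.audio_decoder = nn.Sequential(
            nn.Linear(rep_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, audio_dim))

    def forward(self, video):
        rep = self.encoder(video)     # learned representation
        return self.video_decoder(rep), self.audio_decoder(rep)

# During training the audio target is available even though it is
# never an input, so the loss covers both reconstructions:
model = CrossModalityDeepAutoencoder(100, 300, 256, 128)
video, audio = torch.randn(8, 300), torch.randn(8, 100)
v_rec, a_rec = model(video)
loss = F.mse_loss(v_rec, video) + F.mse_loss(a_rec, audio)
```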


Cross-modality Deep Autoencoder

[Diagram: the symmetric variant with audio-only input, again reconstructing both modalities.]


Bimodal Deep Autoencoders

[Diagram: video input is first encoded into "visemes" (mouth shapes) and audio into "phonemes"; both feed a shared representation that reconstructs audio and video.]


Bimodal Deep Autoencoders

[Animation builds: the same network with video-only input (through the "viseme" pathway) and with audio-only input (through the "phoneme" pathway).]




Training Bimodal Deep Autoencoder

[Diagram: three training configurations of the same network — audio and video input, video-only input, and audio-only input — each reconstructing both modalities through the shared representation.]

  • Train a single model to perform all 3 tasks

  • Similar in spirit to denoising autoencoders (Vincent et al., 2008)
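One plausible rendering of this three-task scheme, reusing the BimodalAutoencoder sketch from earlier (zeroing a modality stands in for the "one input removed" setting; the details are assumptions, not the authors' exact recipe):

```python
import torch
import torch.nn.functional as F

def three_task_loss(model, audio, video):
    """Present each example three ways -- both modalities, video only,
    audio only -- and always reconstruct the clean audio and video,
    in the spirit of a denoising autoencoder's clean targets."""
    inputs = [(audio, video),                       # task 1: both
              (torch.zeros_like(audio), video),     # task 2: video only
              (audio, torch.zeros_like(video))]     # task 3: audio only
    total = 0.0
    for a_in, v_in in inputs:
        a_rec, v_rec = model(a_in, v_in)
        total = total + F.mse_loss(a_rec, audio) + F.mse_loss(v_rec, video)
    return total
```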


Evaluations


Visualizations of Learned Features

[Figure: learned features visualized at the 0 ms, 33 ms, 67 ms, and 100 ms frames.]

Audio (spectrogram) and video features learned over 100 ms windows


Lip-reading with AVLetters

[Diagram: cross-modality deep autoencoder applied to video input.]

  • AVLetters:

    • 26-way Letter Classification

    • 10 Speakers

    • 60x80 pixel lip regions

  • Cross-modality learning


Lip-reading with AVLetters

[Results: letter classification accuracy on AVLetters.]


Lip-reading with CUAVE

[Diagram: cross-modality deep autoencoder applied to video input.]

  • CUAVE:

    • 10-way Digit Classification

    • 36 Speakers

  • Cross-Modality Learning


Lip-reading with CUAVE

[Results: digit classification accuracy on CUAVE.]


Multimodal Recognition

[Diagram: bimodal deep autoencoder with audio and video inputs and a shared representation.]

  • CUAVE:

    • 10-way Digit Classification

    • 36 Speakers

  • Evaluate in clean and noisy audio scenarios

    • In the clean audio scenario, audio performs extremely well alone


Multimodal Recognition

[Results: recognition accuracy in clean and noisy audio scenarios.]


Shared Representation Evaluation

[Diagram: supervised testing with a linear classifier — train on the shared representation computed from audio, test on the shared representation computed from video.]



Method: Learned Features + Canonical Correlation Analysis
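A sketch of how this evaluation might look with off-the-shelf tools: scikit-learn's CCA aligns the two learned feature spaces, then a linear classifier trained on one modality is tested on the other ("hearing to see"). The random arrays below are placeholders for real learned features:

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import LinearSVC

# Placeholder learned features for the same examples in each modality
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(200, 64))
video_feats = rng.normal(size=(200, 64))
labels = rng.integers(0, 10, size=200)

# Project both modalities into a common CCA space
cca = CCA(n_components=32)
cca.fit(audio_feats, video_feats)
audio_c, video_c = cca.transform(audio_feats, video_feats)

# Train a linear classifier on audio, test it on video
clf = LinearSVC().fit(audio_c, labels)
print("cross-modal accuracy:", clf.score(video_c, labels))
```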


McGurk Effect

A visual /ga/ combined with an audio /ba/ is often perceived as /da/.


Conclusion

[Diagram: the cross-modality deep autoencoder (learned representation from video input) and the bimodal deep autoencoder (shared representation from audio and video inputs), side by side.]

  • Applied deep autoencoders to discover features in multimodal data

  • Cross-modality Learning:

    We obtained better video features (for lip-reading) using audio as a cue

  • Multimodal Feature Learning:

    Learn representations that relate across audio and video data




Bimodal Learning with RBMs

[Diagram: a restricted Boltzmann machine with one layer of hidden units over concatenated audio and video inputs.]
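For completeness, a small NumPy sketch of CD-1 training for such an RBM (binary units; the sizes and the concatenation layout are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BimodalRBM:
    """RBM whose visible layer is the concatenation of audio and
    video inputs, trained with one step of contrastive divergence."""
    def __init__(self, n_visible, n_hidden, lr=0.01):
        self.W = 0.01 * rng.normal(size=(n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def cd1_step(self, v0):
        # positive phase: hidden probabilities and a sample
        p_h0 = sigmoid(v0 @ self.W + self.b_h)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        # negative phase: one step of Gibbs sampling
        p_v1 = sigmoid(h0 @ self.W.T + self.b_v)
        p_h1 = sigmoid(p_v1 @ self.W + self.b_h)
        # contrastive-divergence parameter updates
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
        self.b_v += self.lr * (v0 - p_v1).mean(axis=0)
        self.b_h += self.lr * (p_h0 - p_h1).mean(axis=0)

# audio + video concatenated as the visible layer (sizes illustrative)
rbm = BimodalRBM(n_visible=100 + 300, n_hidden=128)
batch = (rng.random((32, 400)) < 0.5).astype(float)
rbm.cd1_step(batch)
```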

