Multimodal Deep Learning

Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee & Andrew Ng

Stanford University


McGurk Effect


Audio-Visual Speech Recognition


Feature Challenge

[Figure: recognition pipeline from raw input through learned features to a classifier (e.g., SVM)]


Representing Lips

  • Can we learn better representations for audio/visual speech recognition?

  • How can multimodal data (multiple sources of input) be used to find better features?


Unsupervised Feature Learning

[Figure: a raw input vector (e.g., audio) is mapped to a learned feature vector]


Multimodal Features

[Figure: audio and video input vectors are combined into a single multimodal feature vector]


Cross-Modality Feature Learning

[Figure: a single-modality input vector is mapped to features that were learned with both modalities available]


Feature Learning Models


Feature Learning with Autoencoders

[Figure: two separate autoencoders, one reconstructing audio from audio input and one reconstructing video from video input]
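As a rough sketch of this unimodal building block (the PyTorch framing, layer sizes, and sigmoid activation are illustrative assumptions, not the authors' implementation):

```python
import torch.nn as nn

class UnimodalAutoencoder(nn.Module):
    """Sketch of a one-modality autoencoder: encode the input,
    then reconstruct it from the hidden representation."""
    def __init__(self, n_input, n_hidden):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_input, n_hidden), nn.Sigmoid())
        self.decoder = nn.Linear(n_hidden, n_input)

    def forward(self, x):
        h = self.encoder(x)       # learned feature representation
        x_hat = self.decoder(h)   # reconstruction of the input
        return x_hat, h

# Illustrative sizes (assumptions): a flattened spectrogram window
# for audio, a flattened lip-region patch for video.
audio_ae = UnimodalAutoencoder(n_input=100, n_hidden=256)
video_ae = UnimodalAutoencoder(n_input=60 * 80, n_hidden=256)
```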


Bimodal Autoencoder

[Figure: concatenated audio and video inputs feed a single hidden representation, which reconstructs both the audio and the video]
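A minimal sketch of this concatenated-input variant, under the same assumptions as above:

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Sketch: concatenate both modalities, encode them into one
    hidden representation, and reconstruct both modalities from it."""
    def __init__(self, n_audio, n_video, n_hidden):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_audio + n_video, n_hidden), nn.Sigmoid())
        self.audio_decoder = nn.Linear(n_hidden, n_audio)
        self.video_decoder = nn.Linear(n_hidden, n_video)

    def forward(self, audio, video):
        h = self.encoder(torch.cat([audio, video], dim=1))
        return self.audio_decoder(h), self.video_decoder(h), h
```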


Shallow Learning

  • Mostly unimodal features are learned: with a single shallow hidden layer over the concatenated inputs, most hidden units end up responding to only one modality

[Figure: one layer of hidden units over concatenated audio and video inputs]



Bimodal Autoencoder

[Figure: the same network with only the video input presented; it must still reconstruct both audio and video]

Cross-modality Learning: learn better video features by using audio as a cue


Cross-modality Deep Autoencoder

[Figure: deep autoencoder with video input only; stacked hidden layers form the learned representation, which reconstructs both audio and video]
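A hedged sketch of this video-only deep variant; the depth, layer sizes, and activations are assumptions, since the deck does not spell out the exact architecture:

```python
import torch.nn as nn

class CrossModalityDeepAutoencoder(nn.Module):
    """Sketch: encode one modality (video) through stacked layers,
    then decode BOTH modalities from the learned representation."""
    def __init__(self, n_audio, n_video, n_hidden):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_video, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid())
        self.audio_decoder = nn.Sequential(
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_audio))
        self.video_decoder = nn.Sequential(
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_video))

    def forward(self, video):
        h = self.encoder(video)   # learned representation
        return self.audio_decoder(h), self.video_decoder(h), h
```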


Cross-modality Deep Autoencoder

[Figure: the same architecture with audio as the only input, again reconstructing both audio and video]


Bimodal Deep Autoencoders

[Figure: modality-specific first layers learn "visemes" (mouth shapes) from video and "phonemes" from audio; these feed a shared representation, from which both audio and video are reconstructed]
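A sketch of this shared-representation architecture, again with assumed sizes; the modality-specific first layers play the role of the "visemes"/"phonemes" layers above:

```python
import torch
import torch.nn as nn

class BimodalDeepAutoencoder(nn.Module):
    """Sketch: modality-specific encoders feed a shared representation;
    both modalities are decoded from that shared layer."""
    def __init__(self, n_audio, n_video, n_hidden, n_shared):
        super().__init__()
        # Modality-specific first layers ("phonemes" / "visemes").
        self.audio_enc = nn.Sequential(nn.Linear(n_audio, n_hidden), nn.Sigmoid())
        self.video_enc = nn.Sequential(nn.Linear(n_video, n_hidden), nn.Sigmoid())
        # Shared layer over both modality-specific representations.
        self.shared = nn.Sequential(
            nn.Linear(2 * n_hidden, n_shared), nn.Sigmoid())
        self.audio_dec = nn.Sequential(
            nn.Linear(n_shared, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_audio))
        self.video_dec = nn.Sequential(
            nn.Linear(n_shared, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_video))

    def forward(self, audio, video):
        ha = self.audio_enc(audio)
        hv = self.video_enc(video)
        s = self.shared(torch.cat([ha, hv], dim=1))  # shared representation
        return self.audio_dec(s), self.video_dec(s), s
```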



Training Bimodal Deep Autoencoder

[Figure: one network trained under three input settings (both modalities present, video only, and audio only); in every case it reconstructs both audio and video]

  • Train a single model to perform all 3 tasks (see the training sketch below)

  • Similar in spirit to denoising autoencoders (Vincent et al., 2008)
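A hedged sketch of that three-task loop; masking the absent modality with zeros and using a mean-squared-error loss are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def training_step(model, audio, video, optimizer):
    """One update over the three reconstruction tasks: both inputs
    present, video only (audio zeroed), and audio only (video zeroed)."""
    optimizer.zero_grad()
    loss = 0.0
    for a_in, v_in in [(audio, video),
                       (torch.zeros_like(audio), video),   # video-only task
                       (audio, torch.zeros_like(video))]:  # audio-only task
        a_hat, v_hat, _ = model(a_in, v_in)
        # Always reconstruct the *clean* audio and video targets.
        loss = loss + F.mse_loss(a_hat, audio) + F.mse_loss(v_hat, video)
    loss.backward()
    optimizer.step()
    return loss.item()
```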


Evaluations


Visualizations of Learned Features

[Figure: audio (spectrogram) and video features learned over 100 ms windows, shown at 0, 33, 67, and 100 ms]


Lip-reading with AVLetters

[Figure: cross-modality deep autoencoder with video input; the learned representation feeds the letter classifier]

  • AVLetters:

    • 26-way Letter Classification

    • 10 Speakers

    • 60×80 pixel lip regions

  • Cross-modality learning (classifier sketch below)
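To make the evaluation pipeline concrete, here is a sketch of training a linear classifier on the learned video features; scikit-learn's LinearSVC and the placeholder data are assumptions (the Feature Challenge slide names an SVM only as an example classifier):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Assumed inputs: X_train holds learned representations extracted by
# the autoencoder's encoder; y_train holds letter labels in 0..25.
# Random placeholders stand in for the real extracted features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 256))
y_train = rng.integers(0, 26, size=500)

clf = LinearSVC(C=1.0)                  # linear classifier on features
clf.fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
```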



Lip-reading with CUAVE

[Figure: the same cross-modality deep autoencoder applied to CUAVE video]

  • CUAVE:

    • 10-way Digit Classification

    • 36 Speakers

  • Cross-modality learning



Multimodal Recognition

[Figure: bimodal deep autoencoder with both audio and video inputs; the shared representation feeds the digit classifier]

  • CUAVE:

    • 10-way Digit Classification

    • 36 Speakers

  • Evaluate in clean and noisy audio scenarios (noise sketch below)

    • In the clean audio scenario, audio alone performs extremely well
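One plausible way to construct the noisy-audio condition is additive white Gaussian noise at a chosen SNR; the exact noise model used in the evaluation is an assumption here:

```python
import numpy as np

def add_noise(audio, snr_db, rng=np.random.default_rng(0)):
    """Corrupt an audio signal with white Gaussian noise at snr_db."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=audio.shape)
    return audio + noise

# Usage: evaluate the classifier on add_noise(test_audio, snr_db=0).
```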



Shared Representation Evaluation

[Figure: supervised testing across modalities: a linear classifier is trained on the shared representation computed from audio, then tested on the shared representation computed from video]


Method: Learned Features + Canonical Correlation Analysis
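A minimal sketch of pairing the learned features with CCA, assuming scikit-learn and pre-extracted, paired audio/video features (random placeholders here):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Assumed inputs: paired learned features from the two modalities.
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(500, 256))
video_feats = rng.normal(size=(500, 256))

# Project both modalities into a common, maximally correlated subspace.
cca = CCA(n_components=50)
cca.fit(audio_feats, video_feats)
audio_proj, video_proj = cca.transform(audio_feats, video_feats)
# Train the linear classifier on audio_proj; test it on video_proj.
```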


McGurk Effect

A visual /ga/ combined with an audio /ba/ is often perceived as /da/.



Conclusion

[Figure: the cross-modality deep autoencoder (learned representation from video alone) and the bimodal deep autoencoder (shared representation from audio and video) side by side]

  • Applied deep autoencoders to discover features in multimodal data

  • Cross-modality Learning: we obtained better video features (for lip-reading) by using audio as a cue

  • Multimodal Feature Learning: learn representations that relate across audio and video data




Bimodal Learning with RBMs

[Figure: a single layer of hidden units over concatenated audio and video inputs, trained as a restricted Boltzmann machine]
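A compact sketch of one contrastive-divergence (CD-1) update for such a bimodal RBM; binary units, the tensor layout, and the learning rate are assumptions:

```python
import torch

def rbm_cd1_step(W, b_vis, b_hid, audio, video, lr=0.01):
    """One CD-1 update for an RBM whose visible layer is the
    concatenation of audio and video inputs. W: (n_vis, n_hid)."""
    v0 = torch.cat([audio, video], dim=1)
    # Up: sample hidden units given the visible data.
    p_h0 = torch.sigmoid(v0 @ W + b_hid)
    h0 = torch.bernoulli(p_h0)
    # Down, then up again: one step of Gibbs sampling.
    p_v1 = torch.sigmoid(h0 @ W.T + b_vis)
    p_h1 = torch.sigmoid(p_v1 @ W + b_hid)
    # Gradient approximation: data statistics minus model statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_vis += lr * (v0 - p_v1).mean(dim=0)
    b_hid += lr * (p_h0 - p_h1).mean(dim=0)
```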

