
Audio-Visual Graphical Models

Nebojsa Jojic

Microsoft Research

Redmond, Washington

Hagai Attias

Microsoft Research

Redmond, Washington

Matthew Beal

Gatsby Unit

University College London

Overview
  • Some background to the problem
  • A simple video model
  • A simple audio model
  • Combining these in a principled manner
  • Results of tracking experiments
  • Further work and thoughts.

Beal, Jojic and Attias, ICASSP’02

Motivation – applications
  • Teleconferencing
    • We need speaker’s identity, position, and individual speech.
    • The case of multiple speakers.
  • Denoising
    • Speech enhancement using video cues (at different scales).
    • Video enhancement using audio cues.
  • Multimedia editing
    • Isolating/removing/adding objects, visually and aurally.
  • Multimedia retrieval
    • Efficient multimedia searching.

Motivation – current state of the art
  • Video models and Audio models
    • Abundance of work on object tracking, image stabilization…
    • A large amount of work on speech recognition, ICA (blind source separation), microphone array processing…
  • Very little work on combining these
    • We desire a principled combination.
    • Robust learning of environments using multiple modalities.
    • Various past approaches:
      • Information theory: Hershey & Movellan (NIPS 12)
      • SVD-esque: (FaceSync) Slaney & Covell (NIPS 13)
      • Subspace stats.: Fisher et al. (NIPS 13).
      • Periodicity analysis: Ross Cutler
      • Particle filters: Vermaak and Blake et al (ICASSP 2001).
      • System engineering: Yong Rui (CVPR 2001).
  • Our approach: Graphical Models, Bayes nets.

Generative density modeling
  • Probability models that
    • reflect desired structure
    • randomly generate plausible images and sounds,
    • represent the data by parameters
  • ML estimation
  • p(image|class) used for recognition, detection, ...
  • Examples: Mixture of Gaussians, PCA/FA/ICA, Kalman filter, HMM
  • All parameters can be learned from data!


Speaker detection & tracking problem

[Figure: the video scenario (camera, image position lx, ly) and the audio scenario (mic. 1, mic. 2, source at lx)]

Bayes Nets for Multimedia
  • Video models
    • Models such as Jojic & Frey (NIPS’99; CVPR’99, ’00, ’01).
  • Audio models
    • Work of: Attias (Neural Comp’98); Attias, Platt, Deng & Acero (NIPS’00, EuroSpeech’01).

A generative video model for scenes (see Frey & Jojic, CVPR’99, NIPS’01)
  • Class s
  • Mean of class s
  • Latent image z
  • Shift (lx, ly)
  • Transformed image z
  • Generated/observed image y
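The generative process on this slide can be sketched as ancestral sampling, one node at a time. This is a minimal illustration under stated assumptions: Gaussian class templates with per-pixel variances, a uniform prior over shifts, circular (wrap-around) shifts, and isotropic camera noise. The function name `sample_frame` and all array shapes are hypothetical and not taken from the slides.

```python
import numpy as np

def sample_frame(means, variances, class_prior, noise_std, rng):
    """Draw one frame: class s -> latent image z -> shift (lx, ly) -> observed y."""
    n_classes, height, width = means.shape
    s = int(rng.choice(n_classes, p=class_prior))          # class s
    # Latent image: class mean plus per-pixel Gaussian variation.
    z = means[s] + np.sqrt(variances[s]) * rng.standard_normal((height, width))
    lx = int(rng.integers(width))                           # horizontal shift (uniform prior assumed)
    ly = int(rng.integers(height))                          # vertical shift (uniform prior assumed)
    shifted = np.roll(z, shift=(ly, lx), axis=(0, 1))       # transformed image (wrap-around assumed)
    y = shifted + noise_std * rng.standard_normal((height, width))  # add camera noise
    return s, (lx, ly), y

# Toy usage with two random 8x10 templates.
rng = np.random.default_rng(0)
means = rng.random((2, 8, 10))
variances = np.full((2, 8, 10), 0.01)
s, (lx, ly), y = sample_frame(means, variances, np.array([0.5, 0.5]), 0.05, rng)
```

Learning runs this process in reverse: given only the frames y, it infers the class and shift and re-estimates the templates.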


Example
  • Hand-held camera
  • Moving subject
  • Cluttered background

[Figure: input data frames; one-class summary (mean and variance); result with 5 classes]


A failure mode of this model


Modeling scenes - the audio part

[Figure: the audio scenario, showing the camera, mic. 1, mic. 2, and the source at lx]


Unaided audio model

[Figure: audio waveform and video frames along a time axis, with the posterior over t shown on a -15 to +15 scale]

  • Posterior probability over t, the time delay.
  • Periods of quiet cause uncertainty in t – (grey blurring).
  • Occasionally reverberation or noise corrupts inference on t, and we become certain of a false time delay.
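A posterior over the time delay of the kind described above can be sketched as follows. This is an illustration under assumptions, not the audio model from the slides: the second microphone signal is treated as a scaled, circularly shifted copy of the first plus Gaussian noise, the prior over t is flat, and the names `delay_posterior`, `gain`, and `precision` are hypothetical.

```python
import numpy as np

def delay_posterior(x1, x2, max_delay=15, gain=1.0, precision=1.0):
    """Posterior over integer delay t, assuming x2[n] = gain * x1[n - t] + Gaussian noise."""
    delays = np.arange(-max_delay, max_delay + 1)
    # Squared reconstruction error for every candidate delay (circular shift assumed).
    errors = np.array([np.sum((x2 - gain * np.roll(x1, int(t))) ** 2) for t in delays])
    log_post = -0.5 * precision * errors        # Gaussian log-likelihood, flat prior over t
    log_post -= log_post.max()                  # stabilise before exponentiating
    post = np.exp(log_post)
    return delays, post / post.sum()

# Toy usage: a white-noise stand-in for speech, delayed by 5 samples at mic. 2.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(512)
x2 = np.roll(x1, 5) + 0.1 * rng.standard_normal(512)
delays, post = delay_posterior(x1, x2)
best_delay = int(delays[np.argmax(post)])
```

When the segment is nearly silent, the errors are similar for every delay and the posterior flattens, which corresponds to the uncertainty during quiet periods noted above.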

Limit of this simple audio model

Multimodal localization
  • Time delay t is approximately linear in horizontal position lx
  • Define a stochastic mapping from spatial location to temporal shift:
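One mapping consistent with the two points above is a linear-Gaussian link. This is a sketch under assumptions: the slope α, offset β, and noise parameter ν_t are taken to correspond to the AV calibration parameters listed on the M-step slide, and ν_t is assumed to be a precision (inverse variance).

```latex
p(t \mid l_x) = \mathcal{N}\!\left(t \,\middle|\, \alpha\, l_x + \beta,\; \nu_t^{-1}\right)
```

Under this form, a sharp belief about lx from the video constrains t, and a sharp belief about t from the audio constrains lx.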


The combined model
  • Two halves connected by the t – lx link

Maximize na log p(xt) + nv log p(yt)

Learning using EM: E-Step

The distribution Q over the hidden variables is inferred given the current setting of all model parameters.
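For the two hidden variables that connect the modalities, this inference can be sketched on a grid. This is a minimal illustration, assuming discrete grids for lx and t, per-modality log-likelihoods that have already been computed, and a linear-Gaussian link between lx and t. The parameters `alpha`, `beta`, and `nu_t` and the name `fuse_posteriors` are hypothetical.

```python
import numpy as np

def fuse_posteriors(video_loglik, audio_loglik, positions, delays, alpha, beta, nu_t):
    """Joint posterior Q(lx, t) from per-modality log-likelihoods and a linear-Gaussian link."""
    # log p(t | lx) for every (lx, t) pair; the normaliser is treated as constant (assumption).
    link = -0.5 * nu_t * (delays[None, :] - (alpha * positions[:, None] + beta)) ** 2
    log_joint = video_loglik[:, None] + audio_loglik[None, :] + link
    log_joint -= log_joint.max()                # stabilise before exponentiating
    q = np.exp(log_joint)
    q /= q.sum()
    return q, q.sum(axis=1), q.sum(axis=0)      # Q(lx, t), Q(lx), Q(t)

# Toy usage: the video is ambiguous between lx = 10 and lx = 40,
# while the audio favours t = -5, which the link maps to lx = 10.
positions = np.arange(50.0)
delays = np.arange(-15.0, 16.0)
video_loglik = np.full(50, -20.0)
video_loglik[[10, 40]] = 0.0
audio_loglik = -0.5 * (delays + 5.0) ** 2
q, q_lx, q_t = fuse_posteriors(video_loglik, audio_loglik, positions, delays,
                               alpha=0.5, beta=-10.0, nu_t=1.0)
best_lx = int(positions[np.argmax(q_lx)])
```

In the toy example the audio evidence resolves an ambiguity that the video alone cannot, which is the motivation for linking the two halves of the model.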

Learning using EM: M-Step

Given the distribution over hidden variables, the parameters are set to maximize the data likelihood.

  • Video:
    • object templates μs and precisions φs
    • camera noise ψ
  • Audio:
    • relative microphone attenuations λ1, λ2 and noise levels ν1, ν2
  • AV calibration between modalities:
    • α, β, νt
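The AV calibration update can be sketched as a weighted least-squares fit. This is an illustration under assumptions: the link between lx and t is linear-Gaussian with a slope, an offset, and a precision; Q(lx, t) is available on a discrete grid for each frame; and the name `update_calibration` is hypothetical. The video and audio parameter updates are not shown.

```python
import numpy as np

def update_calibration(q_frames, positions, delays):
    """M-step sketch for the link t = alpha * lx + beta + noise with precision nu_t.

    q_frames has shape (frames, len(positions), len(delays)); each slice is Q(lx, t).
    """
    w = q_frames.sum(axis=0)                    # pool posterior mass over frames
    w = w / w.sum()
    lx = positions[:, None]
    t = delays[None, :]
    mean_lx, mean_t = np.sum(w * lx), np.sum(w * t)
    cov = np.sum(w * (lx - mean_lx) * (t - mean_t))
    var_lx = np.sum(w * (lx - mean_lx) ** 2)
    alpha = cov / var_lx                        # weighted regression slope
    beta = mean_t - alpha * mean_lx             # weighted regression offset
    residual = np.sum(w * (t - alpha * lx - beta) ** 2)
    nu_t = 1.0 / residual                       # precision = inverse expected squared residual
    return alpha, beta, nu_t

# Toy usage: four frames, each with all posterior mass on one (lx, t) pair.
positions = np.arange(50.0)
delays = np.arange(-15.0, 16.0)
q_frames = np.zeros((4, 50, 31))
for frame, (lx_i, t_i) in enumerate(zip([10, 20, 30, 40], [-5, 1, 5, 10])):
    q_frames[frame, lx_i, t_i + 15] = 1.0
alpha, beta, nu_t = update_calibration(q_frames, positions, delays)
```

Because the weights come from Q, frames in which the posterior is diffuse contribute little to the fit.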

Efficient inference and integration over all shifts (Frey and Jojic, NIPS’01)

E-step: Estimating the posterior Q(lx,ly,) involves computing Mahalanobis distances for all possible shifts in the image.

M-step: Estimating the model parameters involves integrating over all possible shifts, taking into account the probability map Q(lx,ly,).

The E-step reduces to correlation and the M-step to convolution, both of which are computed efficiently using FFTs.
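The correlation trick for the E-step can be sketched as follows. This is a minimal illustration, assuming circular shifts and a diagonal precision map attached to the template. Expanding the squared distance gives three terms, two of which are cross-correlations that an FFT evaluates for all shifts at once. The names `circ_corr` and `distances_all_shifts` are hypothetical.

```python
import numpy as np

def circ_corr(a, b):
    """c[ly, lx] = sum over pixels of a * roll(b, (ly, lx)), computed with FFTs."""
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))))

def distances_all_shifts(y, mean, prec):
    """Mahalanobis distance between frame y and the template at every circular shift."""
    # sum prec_l * (y - mean_l)^2 = sum y^2 prec_l - 2 sum y (prec*mean)_l + sum prec*mean^2
    return (circ_corr(y ** 2, prec)
            - 2.0 * circ_corr(y, prec * mean)
            + np.sum(prec * mean ** 2))

# Toy usage: the frame is the template shifted by (ly, lx) = (3, 7) plus a little noise.
rng = np.random.default_rng(0)
mean = rng.random((16, 20))
prec = 1.0 + rng.random((16, 20))
y = np.roll(mean, (3, 7), axis=(0, 1)) + 0.01 * rng.standard_normal((16, 20))
dist = distances_all_shifts(y, mean, prec)
ly_hat, lx_hat = (int(i) for i in np.unravel_index(np.argmin(dist), dist.shape))
```

A direct loop over shifts costs on the order of N² operations for an N-pixel image, whereas the FFT route costs on the order of N log N.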

Demonstration of tracking

[Figure: tracking results in panels A, AV, and V, with the label na/nv]


Inside EM iterations

[Figure: posteriors Q(|x1,x2,y) and Q(lx|x1,x2,y) in panels labelled 1, 2, 4, and 10]

Tracking Stabilization

Work in progress: models
  • Incorporating a more sophisticated speech model
    • Layers of sound.
    • Reverberation filters.
    • Extension to y-localization is trivial.
    • Temporal models of speech.
  • Incorporating a more sophisticated video model
    • Layered templates (sprites) each with their own audio (circumvents dimensionality issues).
    • Fine-scale correlations between pixel intensities and speech.
    • Hierarchical models? (Factor Analyser trees).
  • Tractability issues:
    • Variational approximations in both audio and video.

Basic flexible layer model (CVPR’01)

Future work: applications
  • Multimedia editing
    • Removing/adding objects’ appearances and associated sounds.
    • With layers in both audio and video (cocktail party / danceclub).
  • Video-assisted speech enhancement
    • Improved denoising with knowledge of source location.
    • Exploit fine-scale correlations of video with audio. (e.g. lips)
  • Multimedia retrieval
    • Given a short clip as a query, search for similar matches in a database.

Summary
  • A generative model of audio-visual data
  • All parameters are learned from the data, including the camera/microphone calibration, in a few iterations of EM
  • Extensions to multi-object models
  • Real issue: the other curse of dimensionality

Pixel-audio correlations analysis

[Figure: original video sequence, with the inferred activation of latent variables (factors, subspace vectors) under Factor Analysis (probabilistic PCA) and under SVD]
