Audio-Visual Graphical Models

Presentation Transcript

  1. Audio-Visual Graphical Models
  Nebojsa Jojic, Microsoft Research, Redmond, Washington
  Hagai Attias, Microsoft Research, Redmond, Washington
  Matthew Beal, Gatsby Unit, University College London

  2. Overview
  • Some background to the problem
  • A simple video model
  • A simple audio model
  • Combining these in a principled manner
  • Results of tracking experiments
  • Further work and thoughts

  3. Motivation – applications
  • Teleconferencing
    • We need the speaker's identity, position, and individual speech.
    • The case of multiple speakers.
  • Denoising
    • Speech enhancement using video cues (at different scales).
    • Video enhancement using audio cues.
  • Multimedia editing
    • Isolating/removing/adding objects, visually and aurally.
  • Multimedia retrieval
    • Efficient multimedia searching.

  4. Motivation – current state of the art
  • Video models and audio models
    • An abundance of work on object tracking, image stabilization, …
    • A large amount in speech recognition, ICA (blind source separation), microphone-array processing, …
  • Very little work on combining these
    • We desire a principled combination.
    • Robust learning of environments using multiple modalities.
  • Various past approaches:
    • Information theory: Hershey & Movellan (NIPS 12)
    • SVD-esque (FaceSync): Slaney & Covell (NIPS 13)
    • Subspace statistics: Fisher et al. (NIPS 13)
    • Periodicity analysis: Ross Cutler
    • Particle filters: Vermaak, Blake et al. (ICASSP 2001)
    • Systems engineering: Yong Rui (CVPR 2001)
  • Our approach: graphical models (Bayes nets).

  5. Generative density modeling
  • Probability models that
    • reflect the desired structure,
    • randomly generate plausible images and sounds,
    • represent the data by parameters.
  • ML estimation: p(image | class) used for recognition, detection, …
  • Examples: mixture of Gaussians, PCA/FA/ICA, Kalman filter, HMM.
  • All parameters can be learned from data!
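
To make the "p(image | class) used for recognition" line concrete, here is a minimal Python sketch (my illustration, not the authors' code) that scores a flattened image under per-class diagonal Gaussians and picks the best class; all names and shapes are invented for the example.

    import numpy as np

    def log_gaussian(y, mean, var):
        # Diagonal-Gaussian log-density of a flattened image y
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

    rng = np.random.default_rng(0)
    n_classes, n_pixels = 5, 64
    means = rng.normal(size=(n_classes, n_pixels))   # learned class templates
    variances = np.ones((n_classes, n_pixels))       # learned class variances

    y = means[2] + 0.1 * rng.normal(size=n_pixels)   # a noisy observation
    scores = [log_gaussian(y, means[s], variances[s]) for s in range(n_classes)]
    print(int(np.argmax(scores)))                    # recognizes class 2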

  6. The speaker detection & tracking problem
  [Figure: a camera flanked by mic.1 and mic.2, with a source at horizontal position lx.
  Video scenario: locate the speaker at (lx, ly) in the image.
  Audio scenario: infer the inter-microphone time delay τ.]

  7. Bayes Nets for Multimedia
  • Video models
    • Models such as Jojic & Frey (NIPS'99; CVPR'99, '00, '01).
  • Audio models
    • Work of Attias (Neural Comp. '98); Attias, Platt, Deng & Acero (NIPS'00, EuroSpeech'01).

  8. A generative video model for scenes (see Frey & Jojic, CVPR'99, NIPS'01)
  [Graphical model: class s and class mean μs generate the latent image z; the shift (lx, ly) produces the transformed image; camera noise yields the generated/observed image y.]
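
A minimal generative sketch of this slide, under my reading of the Frey & Jojic model: sample a class, a latent image around its template, a discrete shift, and camera noise. The names mu, phi, psi mirror the Greek parameters listed on slide 19; the shapes are invented.

    import numpy as np

    rng = np.random.default_rng(1)
    H, W, n_classes = 16, 16, 2
    mu = rng.normal(size=(n_classes, H, W))   # class templates (mu_s)
    phi = np.full((n_classes, H, W), 10.0)    # per-pixel precisions (phi_s)
    psi = 0.01                                # camera noise variance (psi)

    s = rng.integers(n_classes)                             # class
    z = mu[s] + rng.normal(size=(H, W)) / np.sqrt(phi[s])   # latent image
    ly, lx = rng.integers(H), rng.integers(W)               # discrete shift
    x = np.roll(np.roll(z, ly, axis=0), lx, axis=1)         # transformed image
    y = x + np.sqrt(psi) * rng.normal(size=(H, W))          # observed frame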

  9. Example
  • Hand-held camera
  • Moving subject
  • Cluttered background
  [Figure: data frames; a one-class summary (mean and variance); a 5-class summary.]

  10. A generative video model for scenes (see Frey & Jojic, CVPR'99, NIPS'01)
  [Graphical model repeated from slide 8.]

  11. A failure mode of this model

  12. Modeling scenes – the audio part
  [Figure: the source at lx reaches mic.1 and mic.2 with a relative time delay τ; each microphone observes an attenuated, noisy copy of the source waveform.]
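
A hedged sketch of the two-microphone observation model I read from this figure: mic.2 hears a delayed, attenuated copy of what mic.1 hears, plus independent sensor noise. The delay is a whole number of samples for simplicity; lam and nu correspond to the attenuations and noise levels listed on slide 19.

    import numpy as np

    rng = np.random.default_rng(2)
    T, tau = 1000, 7                     # signal length; true delay in samples
    source = rng.normal(size=T)          # unknown source waveform
    lam1, lam2 = 1.0, 0.8                # microphone attenuations (lambda_1,2)
    nu1, nu2 = 0.01, 0.01                # sensor noise variances (nu_1,2)

    x1 = lam1 * source + np.sqrt(nu1) * rng.normal(size=T)
    x2 = lam2 * np.roll(source, tau) + np.sqrt(nu2) * rng.normal(size=T)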

  13. Unaided audio model
  [Figure: audio waveform and video frames over time; the posterior over the time delay τ, plotted on a −15…+15 sample axis.]
  • Posterior probability over τ, the time delay.
  • Periods of quiet cause uncertainty in τ (grey blurring).
  • Occasionally reverberations/noise corrupt inference on τ and we become certain of a false time delay.
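
The posterior over τ in this figure can be sketched as follows (my illustration, reusing x1, x2, lam and nu from the previous snippet's model): score every candidate delay in −15…+15 by its residual under the Gaussian noise model and normalize. Quiet stretches flatten the posterior, which is the grey blurring the slide describes.

    import numpy as np

    def delay_posterior(x1, x2, lam=0.8, nu=0.01, max_shift=15):
        taus = np.arange(-max_shift, max_shift + 1)
        # log p(x2 | tau) up to a constant, under x2 = lam*shift(x1, tau) + noise
        log_p = np.array([-0.5 * np.sum((x2 - lam * np.roll(x1, t)) ** 2) / nu
                          for t in taus])
        log_p -= log_p.max()              # stabilize before exponentiating
        q = np.exp(log_p)
        return taus, q / q.sum()          # candidate delays and Q(tau)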

  14. Limit of this simple audio model

  15. Multimodal localization
  • The time delay τ is approximately linear in the horizontal position lx.
  • Define a stochastic mapping from spatial location to temporal shift:
    p(τ | lx) = N(τ; α·lx + β, ν_τ)
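
In code, this stochastic mapping (with the calibration parameters α, β, ν_τ that slide 19 learns) is just a Gaussian whose mean is linear in lx; a sketch of my reading, not the paper's implementation:

    import numpy as np

    def log_p_tau_given_lx(tau, lx, alpha, beta, nu_tau):
        # p(tau | lx) = N(tau; alpha*lx + beta, nu_tau)
        mean = alpha * lx + beta
        return -0.5 * (np.log(2 * np.pi * nu_tau) + (tau - mean) ** 2 / nu_tau)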

  16. The combined model
  [Graphical model: the video model of slide 8 and the audio model of slide 12, connected by the τ-lx link.]

  17. The combined model
  • The two halves are connected by the τ-lx link.
  • Maximize Σ_t [ n_a log p(x_t) + n_v log p(y_t) ], where x_t is the audio observation and y_t the video frame at time t.
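
A minimal sketch of how such a weighted objective fuses the modalities at inference time, assuming each modality supplies a log-likelihood map over the horizontal position lx (the names and the per-position framing are mine):

    import numpy as np

    def fuse_position_posterior(log_p_audio, log_p_video, n_a=1.0, n_v=1.0):
        # weighted combination of per-lx log-likelihood maps
        log_q = n_a * log_p_audio + n_v * log_p_video
        log_q -= log_q.max()              # stabilize
        q = np.exp(log_q)
        return q / q.sum()                # fused posterior over lx

Raising n_a relative to n_v shifts trust toward the audio evidence; that ratio is what the demo on slide 21 varies.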

  18. Learning using EM: E-step
  The distribution Q over the hidden variables is inferred given the current setting of all model parameters.
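
As a generic illustration of this step (not the paper's full E-step, which infers the class, the shifts, and τ jointly), a mixture-model E-step computes Q(s | y) by Bayes' rule from the current parameters:

    import numpy as np

    def e_step(log_lik, log_prior):
        # log_lik[s] = log p(y | s) under the current parameters
        log_q = log_lik + log_prior
        log_q -= log_q.max()              # stabilize
        q = np.exp(log_q)
        return q / q.sum()                # Q(s | y)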

  19. Learning using EM: M-step
  Given the distribution over the hidden variables, the parameters are set to maximize the data likelihood.
  • Video: object templates μs and precisions φs; camera noise ψ.
  • Audio: relative microphone attenuations λ1, λ2 and noise levels ν1, ν2.
  • AV: calibration between modalities (α, β, ν_τ).
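
As one concrete (and simplified) instance of such an update, a class template can be re-estimated as the Q-weighted average of the frames after undoing each frame's inferred shift; this sketch assumes a single shift hypothesis per frame, and all names are invented:

    import numpy as np

    def update_template(frames, shifts, weights):
        # frames: (T, H, W); shifts: per-frame (ly, lx); weights: Q(shift | frame)
        num = np.zeros(frames.shape[1:])
        for y, (ly, lx), w in zip(frames, shifts, weights):
            num += w * np.roll(np.roll(y, -ly, axis=0), -lx, axis=1)  # undo shift
        return num / np.sum(weights)      # new template mu_s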

  20. Efficient inference and integration over all shifts (Frey & Jojic, NIPS'01)
  • E-step: estimating the posterior Q(lx, ly) involves computing Mahalanobis distances for all possible shifts in the image.
  • M-step: estimating the model parameters involves integrating over all possible shifts, taking into account the probability map Q(lx, ly).
  • The E-step reduces to correlation and the M-step to convolution; both are done efficiently using FFTs.
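
A sketch of the FFT trick, under the simplifying assumption of a uniform noise variance: the per-shift squared distance ||y − shift(μ, l)||² expands into constants plus a circular cross-correlation, so the whole map over (lx, ly) costs a single 2-D FFT product.

    import numpy as np

    def shift_posterior(y, mu, noise_var=1.0):
        # circular cross-correlation of y with mu over all 2-D shifts via FFT
        corr = np.real(np.fft.ifft2(np.fft.fft2(y) * np.conj(np.fft.fft2(mu))))
        dist = np.sum(y ** 2) + np.sum(mu ** 2) - 2 * corr  # one entry per shift
        log_q = -0.5 * dist / noise_var
        log_q -= log_q.max()              # stabilize
        q = np.exp(log_q)
        return q / q.sum()                # posterior Q(lx, ly) over all shifts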

  21. Demonstration of tracking
  [Video: audio-only (A), audio-visual (AV), and video-only (V) tracking, with the modality weight ratio n_a/n_v varied.]

  22. Learning using EM: M-step (slide repeated; see slide 19)

  23. Inside EM iterations
  [Figure: the posteriors Q(τ | x1, x2, y) and Q(lx | x1, x2, y) after EM iterations 1, 2, 4, and 10.]

  24. Tracking and stabilization
  [Video: the tracked sequence and the stabilized sequence.]

  25. Work in progress: models
  • Incorporating a more sophisticated speech model
    • Layers of sound.
    • Reverberation filters.
    • Extension to y-localization is trivial.
    • Temporal models of speech.
  • Incorporating a more sophisticated video model
    • Layered templates (sprites), each with its own audio (circumvents dimensionality issues).
    • Fine-scale correlations between pixel intensities and speech.
    • Hierarchical models? (Factor-analyser trees.)
  • Tractability issues:
    • Variational approximations in both audio and video.

  26. Basic flexible layer model (CVPR'01)

  27. Future work: applications
  • Multimedia editing
    • Removing/adding objects' appearances and associated sounds.
    • With layers in both audio and video (cocktail party / dance club).
  • Video-assisted speech enhancement
    • Improved denoising with knowledge of the source location.
    • Exploit fine-scale correlations of video with audio (e.g. lips).
  • Multimedia retrieval
    • Given a short clip as a query, search for similar matches in a database.

  28. Summary
  • A generative model of audio-visual data.
  • All parameters are learned from the data, including the camera/microphone calibration, in a few iterations of EM.
  • Extensions to multi-object models.
  • The real issue: the other curse of dimensionality.

  29. Pixel-audio correlation analysis
  [Figure: the original video sequence; factor analysis (probabilistic PCA) and SVD subspaces; the inferred activations of the latent variables (factors, subspace vectors).]
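
A hedged sketch of the SVD panel: stack flattened frames with a per-frame audio feature, center, and take the SVD; the left singular vectors give per-frame activations of the joint pixel-audio subspace. The data here is synthetic; the slide used a real sequence.

    import numpy as np

    rng = np.random.default_rng(3)
    T, n_pixels = 100, 256
    video = rng.normal(size=(T, n_pixels))        # flattened frames
    audio = rng.normal(size=(T, 1))               # one audio feature per frame

    X = np.hstack([video, audio])                 # joint observation matrix
    X -= X.mean(axis=0)                           # center before the SVD
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    k = 3
    activations = U[:, :k] * S[:k]                # top-k activations, shape (T, k)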