
Audiovisual Event Detection & Recognition



Presentation Transcript


  1. Audiovisual Event Detection & Recognition • Audiovisual speech recognition • Manifold Discriminant Features • Fusion using boosted combination of DBNs • TRECVid and PASCAL Competitions, 2009 • GMM supervector normalizes inter-session variability • Sparse coding to model the manifold of low-level features • Non-speech audiovisual event detection • Over-generate features, select, tandem NN+HMM, and compensate variability using a GMM supervector

  2. BACKGROUND:AUDIOVISUAL SPEECH RECOGNITION

  3. AVICAR (Audiovisual Speech) Database

  4. Lip Rectangle Dimensionality Reduction using Local Discriminant Graph • Maximize the Local Inter-Manifold Interpolation Errors, subject to a constant Same-Class Interpolation Error • Find $P$ to maximize $D_D = \sum_i \| P^T (x_i - \sum_k c_k y_k) \|^2$, with $y_k \in \mathrm{KNN}(x_i)$ drawn from other classes • Subject to $D_S = \text{constant}$, where $D_S = \sum_i \| P^T (x_i - \sum_j c_j x_j) \|^2$, with $x_j \in \mathrm{KNN}(x_i)$ drawn from the same class
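The objective on this slide can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the LLE-style interpolation weights (constrained to sum to one), the regularization constant `reg`, and solving the constrained problem as a generalized eigenproblem are all assumptions made here for illustration, and the function names are hypothetical.

```python
import numpy as np

def interpolation_residual(x, neighbors, reg=1e-3):
    """Residual x - sum_j c_j * n_j, with weights c summing to one (LLE-style, assumed)."""
    diffs = neighbors - x                       # (k, d) offsets to each neighbor
    G = diffs @ diffs.T                         # local Gram matrix, (k, k)
    G += reg * (np.trace(G) + 1e-12) * np.eye(len(neighbors))
    c = np.linalg.solve(G, np.ones(len(neighbors)))
    c /= c.sum()
    return x - c @ neighbors

def ldg_projection(X, y, k=3, n_components=2, reg=1e-3):
    """Find P maximizing other-class interpolation error for fixed same-class error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = X.shape[1]
    S_same = np.zeros((d, d))                   # scatter of same-class residuals (D_S)
    S_diff = np.zeros((d, d))                   # scatter of other-class residuals (D_D)
    for i in range(len(X)):
        dist = np.linalg.norm(X - X[i], axis=1)
        for mask, S in ((y == y[i], S_same), (y != y[i], S_diff)):
            idx = np.where(mask)[0]
            idx = idx[idx != i]                 # never use the point as its own neighbor
            nearest = idx[np.argsort(dist[idx])[:k]]
            r = interpolation_residual(X[i], X[nearest], reg)
            S += np.outer(r, r)                 # in-place update of the aliased matrix
    S_same += reg * np.trace(S_same) / d * np.eye(d)
    # Generalized eigenproblem S_diff p = lambda * S_same p, via Cholesky whitening.
    L_inv = np.linalg.inv(np.linalg.cholesky(S_same))
    evals, evecs = np.linalg.eigh(L_inv @ S_diff @ L_inv.T)
    top = np.argsort(evals)[::-1][:n_components]
    P = L_inv.T @ evecs[:, top]                 # columns satisfy P^T S_same P = I
    return P, S_same, S_diff
```

Here `X` would hold one feature vector per lip rectangle and `y` the class labels; `X @ P` gives the reduced features. Fixing `P^T S_same P = I` is one way to realize the "constant same-class error" constraint.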

  5. Lip Reading Results (Digits) DCT=discrete cosine transform; PCA=principal components analysis; LDA=linear discriminant analysis; LEA=local eigenvector analysis; LDG=local discriminant graph

  6. Audiovisual Speech Recognition: Word Error Rate (Connected Digits)

  7. Best Result: (AV CHMM) + (AV Articulatory Feature DBN)

  8. TREC VIDEO RETRIEVAL EVALUATION and PASCAL VISUAL OBJECT CLASSES CHALLENGE 2009

  9. TRECVID: NIST competition on text and video retrieval • Task: surveillance video classification • PASCAL: Pattern Analysis, Statistical Modelling and Computational Learning • Task: predict whether at least one object of a given class is present in the image. 20 classes are selected, including people, animals, vehicles, and indoor objects.

  10. Method

  11. Variability Compensation using WCCN (within-class covariance normalization) • Treat the negative log likelihoods, $Z_j = -\log p(x \mid j)$, as a high-dimensional pseudo-feature vector, called the "supervector" • Z-normalize the supervector to reduce the effect of irrelevant variability, using a robust regularized covariance matrix: $\hat{S} = \gamma S + (1-\gamma) I$ • Z-normalization results in better linear separability
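The normalization step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the pooled within-class estimate of the covariance, the default smoothing weight of 0.9, and the function name `wccn_transform` are assumptions.

```python
import numpy as np

def wccn_transform(Z, labels, gamma=0.9):
    """Return (B, S_hat) where S_hat is the regularized within-class covariance
    and B satisfies B @ B.T = inv(S_hat), so Z @ B is the normalized supervector.

    Z      : (n_samples, n_dims), Z[i, j] = -log p(x_i | j)
    labels : (n_samples,) class label of each sample
    gamma  : smoothing weight (assumed value; tune on held-out data)
    """
    Z = np.asarray(Z, dtype=float)
    labels = np.asarray(labels)
    n, d = Z.shape
    S = np.zeros((d, d))
    for c in np.unique(labels):
        centered = Z[labels == c] - Z[labels == c].mean(axis=0)
        S += centered.T @ centered              # pool scatter around each class mean
    S /= n
    S_hat = gamma * S + (1.0 - gamma) * np.eye(d)   # shrink toward the identity
    B = np.linalg.cholesky(np.linalg.inv(S_hat))    # B @ B.T == inv(S_hat)
    return B, S_hat
```

After the transform, `B.T @ S_hat @ B` is the identity, so directions with large within-class (irrelevant) variability are scaled down before a linear classifier is trained on `Z @ B`.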

  12. RESULTS • Our methods: (1) Gaussian mixture models (GMMs) of the distribution of patches in the image; (2) local sparse coding to model the manifold of image patches; (3) methods (1) and (2) combined at the kernel level • TRECVid: Illinois/NEC team ranks #1 out of 16 teams in the TRECVid 2009 surveillance video task • PASCAL: Illinois/NEC team ranks #1 in the classification task out of 48 entered methods from 20 groups worldwide.
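Combining two representations "at the kernel level" can be sketched as a weighted sum of two kernel matrices. The slides do not say which kernels or weights were used, so the RBF kernels, the bandwidths, and the equal weighting below are assumptions, and both function names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth):
    """Gaussian (RBF) kernel between rows of A and rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

def combine_kernels(K_gmm, K_sparse, weight=0.5):
    """Convex combination of two kernels; it stays positive semidefinite."""
    return weight * K_gmm + (1.0 - weight) * K_sparse

# Hypothetical per-image descriptors from the two methods.
rng = np.random.default_rng(0)
gmm_feats = rng.normal(size=(6, 4))       # stand-in for GMM-based features
sparse_feats = rng.normal(size=(6, 8))    # stand-in for sparse-coding features
K = combine_kernels(rbf_kernel(gmm_feats, gmm_feats, 1.0),
                    rbf_kernel(sparse_feats, sparse_feats, 2.0))
```

The combined matrix `K` can be passed to any kernel classifier that accepts a precomputed kernel.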

  13. AUDIOVISUAL EVENT DETECTION

  14. Non-Speech Acoustic Event Detection

  15. AED: Why is it Hard? DIFFICULTIES - Unknown spectral structure - Different spectral structure for each event - Low SNR (speech as background noise)

  16. AED: Solution • System Overview • Result: Illinois team ranked #1 out of 6 teams in CLEAR AED 2007

  17. Current Research: Audiovisual Fusion • (feature selection)+(ANN)+(HMM)+(supervector compensation) • Likelihood-space fusion of audio and video features
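One common form of likelihood-space fusion is a weighted sum of per-class log likelihoods from the audio and video models. The sketch below assumes that form; the stream weight and the example scores are made up for illustration, and the slides do not specify the actual fusion rule.

```python
import numpy as np

def fuse_log_likelihoods(audio_ll, video_ll, audio_weight=0.7):
    """Weighted sum of per-class log likelihoods from the two streams.

    audio_ll, video_ll : (n_segments, n_classes) arrays of log p(features | class)
    audio_weight       : stream weight in [0, 1] (assumed; tune on held-out data)
    """
    audio_ll = np.asarray(audio_ll, dtype=float)
    video_ll = np.asarray(video_ll, dtype=float)
    return audio_weight * audio_ll + (1.0 - audio_weight) * video_ll

def classify(fused_ll):
    """Pick the class with the highest fused score for each segment."""
    return fused_ll.argmax(axis=-1)

# Made-up scores for one segment and three event classes.
audio_ll = [[-10.0, -12.0, -11.0]]
video_ll = [[-9.0, -5.0, -8.0]]
audio_led = classify(fuse_log_likelihoods(audio_ll, video_ll, audio_weight=0.7))
video_led = classify(fuse_log_likelihoods(audio_ll, video_ll, audio_weight=0.3))
```

With these scores the decision changes with the stream weight: trusting audio more selects class 0, while trusting video more selects class 1.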
