
Information Fusion from Multiple Modalities for Multimedia Mining Applications



  1. Information Fusion from Multiple Modalities for Multimedia Mining Applications Giridharan Iyengar (joint work w/ H. Nock) Audio-Visual Speech Technologies Human Language Technologies IBM TJ Watson Research Center

  2. Acknowledgements • John R. Smith and the IBM Video TREC team • Members of Human Language Technologies department

  3. Outline • Project Overview • TREC 2002 • Semantic Modeling using Multiple modalities • Concept Detection • Special detector – AV Synchrony • Retrieval • Summary

  4. What do we want? • A framework that facilitates learning and detection of concepts in digital media: • Given a concept and annotated example(s), learn a representation for the concept in digital media • Learn models directly using statistical methods (e.g., a face detector) • Leverage existing concepts and learn a mapping of the new concept onto existing concepts (e.g., people spending leisure time at the beach) • Given digital media, detect instances of concepts

  5. Our Thesis • Combine multiple modalities to infer knowledge robustly

  6. Performance Dimensions [diagram: three axes – Concept Accuracy, Concept Acquisition, Concept Coverage] • Concept Accuracy: What is the accuracy? What accuracy is desired when acquiring new concepts? • Concept Acquisition: How many training examples are needed for a desired level of accuracy – for a general concept? For a specific concept? • Concept Coverage: How many concepts can be covered? • The dimensions are inter-related

  7. Multimedia Semantic Learning Framework [architecture diagram: a video repository feeds visual segmentation and feature extraction – visual features (color, …) and audio features (MFCC, …) – which train visual, speech, and non-speech audio models; model outputs are fused for retrieval, with training labels from MPEG-7 annotation*; axis labels span Signals → Features → Semantics → Subjective and recent past → today → near future (our goal)] *Annotation tool available from Alphaworks

  8. Multimodal Video Annotation Tool (in Alphaworks) • MPEG-1 in → MPEG-7 out • Embedded automatic shot-change detection • Lexicon editing • Handles multiple video formats

  9. Model Retrieval Tools • Model browser • Allows browsing of all models under a given modality • Primarily for users of the models • Model analyzer • Primarily for model builders • Permits comparisons between different models for a given concept • Presents model statistics such as PR curves and Average Precision

  10. Concept Modeling Approaches [diagram: SVM, BPM, HMM, and a graphical model with concept node C over model nodes M1, M2, M3]

  11. Outline • Project Overview • TREC 2002 • Semantic Modeling Using Multiple Modalities • Concept Detection • Special Detector – AV Synchrony • Retrieval • Summary

  12. What is TREC? • The NIST-coordinated Text REtrieval Conference (TREC) • Series of annual information retrieval benchmarks • Spawned in 1992 from the DARPA TIPSTER information retrieval project • TREC has become an important forum for evaluating and advancing the state of the art in information retrieval • Tracks for spoken document retrieval, cross-language retrieval, Web document retrieval, and video retrieval • Document collections are huge and standardized • Participating groups represent the who's who of IR research • 10-12 commercial companies (e.g., Excalibur, Lexis-Nexis, Xerox, IBM) • 20-30 university / research groups across all tracks • 70% participation from the US

  13. Video TREC 02 • The 2nd Video TREC • 70 hours of MPEG-1, partitioned into development, search, and feature-test sets • Shot-boundary detection • Concept detection (10 concepts) • Benchmarking • Donations • Search • 25 queries (named entities, events, scenes, objects)

  14. Outline • Project Overview • TREC 2002 • Semantic Modeling Using Multiple Modalities • Concept Detection • Special Detector – AV Synchrony • Retrieval • Summary

  15. Performance of IBM Concept Detectors at TREC02 • AP (Average Precision) = sum of the precision values at each relevant retrieved shot, divided by the total number of ground-truth relevant shots in the corpus • Equivalently, the normalized area under the ideal PR curve
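A minimal sketch of this AP computation in Python (the ranked list and relevant count below are hypothetical):

```python
def average_precision(ranked_relevance, total_relevant):
    """Non-interpolated AP: sum the precision at each rank where a
    relevant shot is retrieved, then divide by the total number of
    ground-truth relevant shots in the corpus."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0

# Hypothetical example: relevant shots at ranks 1, 3 and 6; 4 relevant shots in corpus.
print(average_precision([1, 0, 1, 0, 0, 1], total_relevant=4))  # (1/1 + 2/3 + 3/6)/4 ≈ 0.54
```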

  16. Is (Spoken) Text useful to detect Visual Concepts? • Use Speech transcripts to detect visual concepts in shots • Turns concept detection into a speech-based retrieval problem • Index Speech transcripts • Use training set to identify useful query terms • Generic approach to improve concept coverage • Comparable performance on TREC02 data

  17. Discriminative Model Fusion • Novel approach: build models for new concepts using a basis of existing models • Incorporates information about model confidences • Can be viewed as a feature projection [diagram: confidence scores from detectors M1–M6 are stacked into a "model vector"; new-concept annotations train a new concept model in this model vector space]

  18. Discriminative Model Fusion: Algorithm • Support Vector Machine: largest-margin hyperplane in the projected feature space • With good kernel choices, all operations can be done in the low-dimensional input feature space • We use Radial Basis Function kernels • Trained with the Sequential Minimal Optimization (SMO) algorithm
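A minimal sketch of this training step, assuming scikit-learn's SVC (which uses an SMO-style solver) rather than the original IBM implementation; the model-vector values and detector count are hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

# Each shot is represented by its "model vector": confidence scores from
# pre-existing concept detectors M1..M4 (hypothetical values).
model_vectors = np.array([[0.9, 0.1, 0.7, 0.2],
                          [0.8, 0.2, 0.6, 0.3],
                          [0.1, 0.9, 0.2, 0.8],
                          [0.2, 0.8, 0.1, 0.7]])
labels = np.array([1, 1, 0, 0])  # annotations for the new concept

# RBF kernel, as on the slide; fit finds the largest-margin hyperplane
# in the induced feature space.
dmf_model = SVC(kernel="rbf").fit(model_vectors, labels)

# Detection: project a test shot into model-vector space, then score it.
print(dmf_model.decision_function([[0.85, 0.15, 0.65, 0.25]]))
```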

  19. Discriminative Model Fusion: Advantages • Can be used to improve existing models (accuracy) and to build new models (coverage) • Can be used to fuse text-based models with content-based models (multimodality) [diagram: model vectors of detector scores (M1–M9) illustrating reuse of the same basis]

  20. DMF results • Experiment • Build model vector from 6 text-based detectors and 42 pre-existing concept detectors • Build 6 target concept detectors in this model vector space using DMF • Accuracy Results • Concepts either improve (by 23-91%) or stay the same • MAP improves by 12% over all 6 visual concepts.

  21. Outline • Project Overview • TREC 2002 • Semantic Modeling Using Multiple Modalities • Concept Detection • Special Detector – AV Synchrony • Retrieval • Summary

  22. Audio-Visual Synchrony Detection • Problem: • Is it narration (a voiceover) or a monologue? • Are the audio and video synchronous? Plausible (i.e., caused by the same person speaking)? • Applications include: • ASR metadata (speaker turns) • Talking-head detection (video summarization) • Dubbing

  23. Existing Work • Hershey and Movellan (NIPS 1999): single Gaussian model; mutual information between Gaussian models • Cutler and Davis (ICME 2000): time-delay neural network • Fisher III et al. (NIPS 2001): learn a projection to a common subspace that maximizes mutual information • Slaney and Covell (NIPS 2001): learn canonical correlations

  24. Approach 1: Evaluate Synchrony Using Mutual Information • Detect faces and speech • Convert speech audio and face video into feature vectors • E.g. A1,…,AT = MFC coefficient vectors for audio • E.g. V1,…,VT = DCT coefficients for video • Consider each (A1,V1),…,(AT,VT) as an independent sample from a joint distribution p(A, V) • Evaluate mutual information as Consistency Score I(A;V) • assume distributional forms for p(A), p(V) and p(A,V)
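For reference, the consistency score is the standard mutual information of the assumed joint distribution (shown here for the discrete case; an integral replaces the sum for continuous densities):

```latex
I(A;V) = \sum_{a,v} p(a,v)\,\log\frac{p(a,v)}{p(a)\,p(v)}
```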

  25. Implementation 1: Discrete Distributions ("VQ-MI") • 1. Build codebooks to quantize audio and video feature vectors (using the training set) • 2. Convert test speech and the test face sequence into feature vectors A1,…,AT and V1,…,VT and quantize them using the codebooks • 3. Use the quantized sequences to estimate discrete distributions p(A), p(V) and p(A,V), and calculate Mutual Information
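A minimal sketch of the VQ-MI scheme, assuming k-means codebooks (via scikit-learn) and maximum-likelihood histogram estimates; the codebook size k=16 is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

def vq_mi(test_audio, test_video, train_audio, train_video, k=16):
    """Quantize audio/video feature vectors with codebooks built on
    training data, then estimate discrete I(A;V) from co-occurrence counts."""
    a = KMeans(n_clusters=k, n_init=10).fit(train_audio).predict(test_audio)
    v = KMeans(n_clusters=k, n_init=10).fit(train_video).predict(test_video)

    joint = np.zeros((k, k))
    for ai, vi in zip(a, v):
        joint[ai, vi] += 1
    joint /= joint.sum()                           # p(A, V)
    pa, pv = joint.sum(axis=1), joint.sum(axis=0)  # marginals p(A), p(V)

    nz = joint > 0                                 # skip zero cells (log 0)
    return np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pv)[nz]))
```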

  26. Implementation 2: Gaussian Distributions ("G-MI") • 1. Convert test speech and the test face sequence into feature vectors A1,…,AT and V1,…,VT • 2. Use these to estimate multivariate Gaussian distributions p(A), p(V) and p(A,V) (some similarities with Hershey and Movellan, NIPS 1999) • 3. Calculate the Consistency Score I(A;V) • NOTE: long test sequences may not be Gaussian, so divide them into locally Gaussian segments using the Bayesian Information Criterion (Chen and Gopalakrishnan, 1998)
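Under the joint-Gaussian assumption, I(A;V) has a closed form in covariance determinants; a minimal sketch of the G-MI score (BIC segmentation omitted):

```python
import numpy as np

def gaussian_mi(A, V):
    """G-MI: treat the (A_t, V_t) pairs as samples from a joint Gaussian and
    evaluate I(A;V) = 0.5 * ln(|Sigma_A| |Sigma_V| / |Sigma_AV|)."""
    joint = np.hstack([A, V])             # A: T x dA MFCCs, V: T x dV DCTs
    cov = np.cov(joint, rowvar=False)
    dA = A.shape[1]
    _, logdet_a = np.linalg.slogdet(cov[:dA, :dA])  # log |Sigma_A|
    _, logdet_v = np.linalg.slogdet(cov[dA:, dA:])  # log |Sigma_V|
    _, logdet_j = np.linalg.slogdet(cov)            # log |Sigma_AV|
    return 0.5 * (logdet_a + logdet_v - logdet_j)
```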

  27. Approach 2: Evaluate Plausibility ("AV-LL") • 1. Convert test speech and the test face sequence into feature vectors A1,…,AT and V1,…,VT • 2. Hypothesize the uttered word sequence W using audio-only automatic speech recognition (or the ground-truth script, if available) • 3. Calculate the Consistency Score p(A,V|W) – here, likelihoods from Hidden Markov Models as used in audio-visual speech recognition • NOTE: preliminary experiments also considered an approximation to p(W|A,V), but results were less successful

  28. Experimental Setup • Corpus and test set construction: • Constructed from IBM ViaVoice AV data • Full-face, front-facing speakers • Continuous, clean speech • 1016 four-second-long "true" (i.e., corresponding) speech and face combinations extracted • For each "true" case, three "confuser" examples pair the same speech with faces saying something else • Separate training data used for training models (for schemes "VQ-MI" and "AV-LL") • Pre-processing: • Audio: MFC coefficients • Video: DCT coefficients of the mouth region of interest • Test purpose: • Assume perfect face and speech detection • Evaluate the usefulness of different consistency definitions

  29. Synchrony Results • Gaussian clearly superior to the VQ and AV schemes • VQ and AV require training → possible mismatch between training and test data • For VQ, estimation of discrete densities suffers from a resolution/accuracy trade-off

  30. Application to Speaker Localization • Data from Clemson University's CUAVE corpus • Investigating two tasks: • "Active Speaker": is the left or the right person speaking? • "Active Mouth": where is the active speaker's mouth? • Assume only one person is speaking at any given time (for now)

  31. Speaker Localization • Task 1: Active Speaker • Compute Gaussian-based MI between each video pixel and the audio signal over a time window: • Scheme 1: pixel intensities and audio log-FFT • Scheme 2: "delta-like" features based on changes in pixel intensity across time (Butz) and audio log-FFT • Compare total MI (left half of screen) vs. total MI (right half of screen) • Shift the window and repeat • Task 2: Active Speaker's Mouth • Search for a compact region with good MI scores • No smoothing of the region between images • Estimate the mouth center every second; an estimate is considered correct if it falls within a fixed search radius of the true center

  32. Speaker Localization: Mutual Information Images [figure: per-pixel MI images for two video feature choices – pixel intensities vs. intensity deltas]

  33. Speaker Localization Results • Note: no significant gain for speaker localization from adding prior face detection • Speaker mouth localization improves by 4% for AV-MI and 2% for video-only

  34. Using Synchrony to Detect Monologues in TREC02 • Monologue detector: • Should have a face • Should contain speech • Speech and face should be synchronized • Threshold the mutual information image at various levels • Synchrony score: ratio of the average mutual information in the highest-scoring N×N pixel region to the average mutual information over the whole image (see the sketch below)
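A minimal sketch of that region-to-image MI ratio, assuming SciPy and a precomputed per-pixel MI image; the region size n=16 is an assumption:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def synchrony_score(mi_image, n=16):
    """Average MI in the best-scoring n x n pixel region divided by the
    average MI over the whole image (higher suggests a synchronized face)."""
    region_means = uniform_filter(mi_image, size=n, mode="constant")
    return region_means.max() / mi_image.mean()
```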

  35. Monologue Results • The IBM monologue detector was the best in TREC 2002 • Using synchrony does better than face + speech detection alone

  36. Outline • Project Overview • TREC 2002 • Semantic Modeling Using Multiple Modalities • Concept Detection • Special Detector – AV Synchrony • Retrieval • Summary

  37. Video Retrieval Using Speech [pipeline diagram] • Indexing: speech transcripts (ASR) → remove frequent words, POS-tag + morph (e.g., "RUNS" → RUN) → divide into documents (e.g., 100-word overlapping windows) → create morph index → map documents to shots • Retrieval: query term string → remove frequent words, POS-tag + morph → rank documents • Challenges: ASR quality, text document definition, document ranking, mapping documents to shots
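A minimal sketch of the document-construction step, assuming 100-word windows with 50% overlap (the overlap amount is an assumption; the slide specifies only 100-word overlapping windows):

```python
def transcript_documents(words, size=100, step=50):
    """Slice an ASR transcript (a list of morphed terms) into overlapping
    fixed-length 'documents' for indexing; each document would then be
    mapped back to the shots it overlaps in time."""
    if len(words) <= size:
        return [words]
    return [words[i:i + size] for i in range(0, len(words) - size + 1, step)]
```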

  38. SDR Details: Fusion of Multiple SDR Systems • Take multiple SDR systems: OKAPI 1, OKAPI 2, and soft Boolean, all using the same ASR output • Examine complementarity using a common query set: no system returns a superset of another system's results • Form an additive weighted combination: the "fusion" system • The fusion system achieved the 2nd-best overall performance at TREC02
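A minimal sketch of the additive weighted combination, assuming min-max score normalization per system; the weights and scores below are hypothetical (e.g., weights would be tuned on held-out queries):

```python
import numpy as np

def fuse_sdr(system_scores, weights):
    """Weighted sum of per-document scores from several SDR systems,
    each min-max normalized so the weights are comparable."""
    fused = np.zeros(len(system_scores[0]))
    for scores, w in zip(system_scores, weights):
        s = np.asarray(scores, dtype=float)
        rng = s.max() - s.min()
        fused += w * ((s - s.min()) / rng if rng else s)
    return fused

# Hypothetical scores from OKAPI 1, OKAPI 2 and the soft-Boolean system:
print(fuse_sdr([[2.0, 1.0, 0.5, 0.1],
                [1.5, 1.8, 0.2, 0.4],
                [0.9, 0.3, 0.8, 0.2]], weights=[0.4, 0.4, 0.2]))
```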

  39. Integrating Speech and Video Retrieval: Ongoing Research • TREC02 SDR + CBR examples: • Sometimes multimodal integration improves over the top unimodal system ("rockets taking off") • Sometimes multimodal integration degrades the top unimodal system: video degrades fusion ("parrots"); audio degrades fusion ("nuclear clouds") [per-query Fused/SDR/CBR scores shown on slide] • Use non-speech audio and video cues to improve the speech-score-to-shot-score mapping • Manual subsetting of videos results in a 44% improvement in MAP; speech-only ranking of videos results in a 5% improvement • Can multimodal cues come closer to manual performance? • Is using multimodal cues to rank videos simpler than ranking shots?

  40. Summary • Information fusion across modalities helps a variety of tasks • However, combining speech/text with image-based retrieval remains an open issue

  41. Open Research Challenges • Multimodal information fusion is not a solved problem! • Combining text with image content for retrieval • Model performance: • Progressive improvements in model performance (accuracy) • Under limited training data, with increasing complexity (acquisition) • Maintaining accuracy as the number of concepts increases (coverage) • SDR improvements: • What level of ASR performance is the minimum for optimal retrieval performance? • Limits on text (speech)-based visual models? • Automatic query processing (acquisition)

  42. Discussion • Papers, Presentations, Other work • http://www.research.ibm.com/AVSTG/

  43. SDR Details (1): ASR Performance • SUMMARY: 41% relative improvement in Word Error Rate (WER) over the TREC 2001 speech transcription approach • VIDEO TREC 2001: used the ViaVoice for Broadcast News system • 93k prototypes • single trigram model • Improved ASR uses the Watson HUB-4 Broadcast News system • 285k-prototype speaker-independent system • mixture of 3 LMs (4-gram, 3-gram, etc.) • Additionally, incorporates: • supervised adaptation (8 videos from the training set) • improved speech vs. non-speech segmentation • unsupervised test-set adaptation

  44. SDR Details (1) ctd: Does Improving ASR Affect Video Retrieval? • Manually compile (limited) ground truth on the FeatureTrain+Validate subset of TREC 2002 • Set up retrieval systems using ASR at different word error rates (WERs) • Use the TREC 2002 set of 25 queries to evaluate Mean Average Precision (MAP) • Open question: what is the upper limit on MAP given "perfect" ASR? • Result: improvements in ASR → improved video retrieval
