
Mohammad S. Al Awad 985426 26-May-2008






Presentation Transcript


  1. Toward Semantic Indexing and Retrieval Using Hierarchical Audio Models — Wei-Ta Chu, Wen-Huang Cheng, Jane Yung-Jen Hsu and Ja-Ling Wu, Multimedia Systems, 2005. Presented by Mohammad S. Al Awad, 985426, 26-May-2008

  2. Outline • Introduction • Background • Audio Event? • Semantic Context? • Problem statement • Hierarchical Framework • Modeling • Performance • Indexing and Retrieval

  3. Introduction • Semantic indexing and content retrieval in: • Audio: speech, music, noise and silence • Audiovisual: shots, dialogue and action scenes • Representation of high-level query semantics • E.g., scenes associated with semantic meaning vs. color layouts and object positions

  4. Background • Previous work concentrated on identifying isolated sounds such as applause, gunshots, cheers or silence • Tools used: Bayesian networks and Support Vector Machines (SVMs) to fuse information from different sounds • Critique: isolated sounds carry weaker semantics on their own

  5. Audio Event • A short audio clip that represents the sound of an object or event • Audio events can be characterized by statistical patterns and temporal evolution

  6. Semantic Context • The context of a semantic concept is an analysis unit that represents a more reasonable granularity for multimedia content usage • Semantic concept: gunplay scene • Semantic context: gunshots and explosions in an action movie

  7. Problem Statement • Index multimedia documents by detecting high-level semantic contexts. To characterize a semantic context, audio events highly relevant to specific semantic concepts are collected and modeled. • Occurrence patterns of gunshot and explosion events are used to characterize “gunplay” scenes, and the patterns of engine and car-braking events are used to characterize “car chasing” scenes.

  8. Hierarchical Framework • Low-level events, such as gunshot, explosion, engine and car-braking sounds, are modeled • Based on the statistical information collected from the various audio event detection results, two methods are investigated to fuse this information: the Gaussian mixture model (GMM) and the hidden Markov model (HMM)

  9. Hierarchical Framework (cont.)

  10. Modeling • Feature extraction • Audio event modeling • Confidence evaluation • Semantic context modeling • Gaussian mixture model • Hidden Markov model

  11. Feature Extraction • Extract suitable time- and frequency-domain features to build a feature vector • Audio streams: 16-kHz, 16-bit mono; frames of 400 samples with 50% overlap
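The framing described above (400-sample frames, 50% overlap on 16-kHz mono audio) can be sketched as follows; the function name and the random test signal are illustrative, not from the paper:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=200):
    """Split a mono signal into 400-sample frames with 50% overlap
    (hop = 200 samples, i.e. 25 ms frames at 16 kHz)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

x = np.random.randn(16000)      # one second of audio at 16 kHz
frames = frame_signal(x)        # shape (79, 400)
```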

  12. Feature Extraction (tools) • Perceptual features • Short-time energy (STE): the loudness or volume of a frame • Band-energy ratio (BER): the spectrum is divided into four bands, and the energy of each sub-band is divided by the total energy • Zero-crossing rate (ZCR): the average number of signal sign changes in an audio frame • Mel-frequency cepstral coefficients (MFCC) • Frequency centroid (FC) • Bandwidth (BW)

  13. Feature Extraction (feature vector) • 16-dimension base feature vector: 1 (STE) + 4 (BER) + 1 (ZCR) + 1 (FC) + 1 (BW) + 8 (MFCC) • 16-dimension difference vector: frame-to-frame difference between adjacent frames Ai and Ai+1 • Result: a 32-dimension feature vector per frame
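A minimal sketch of how such a 32-dimension vector could be assembled. STE and ZCR are computed for real; the remaining 14 slots (BER, FC, BW, MFCC) are left as zero placeholders since their full implementations are beyond a slide-sized example:

```python
import numpy as np

def short_time_energy(frame):
    # Mean squared amplitude: a simple loudness measure.
    return np.mean(frame ** 2)

def zero_crossing_rate(frame):
    # Average number of sign changes per sample.
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0

def frame_features(frame):
    # Placeholder 16-dim layout: STE, ZCR, then zeros where
    # BER(4), FC(1), BW(1) and 8 MFCCs would go in the paper.
    v = np.zeros(16)
    v[0] = short_time_energy(frame)
    v[1] = zero_crossing_rate(frame)
    return v

def feature_sequence(frames):
    base = np.array([frame_features(f) for f in frames])
    # Difference between adjacent frames A_i and A_{i+1};
    # the last frame is padded with a zero difference.
    delta = np.diff(base, axis=0, append=base[-1:])
    return np.hstack([base, delta])   # 32 dims per frame
```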

  14. Audio Event Modeling • Hidden Markov Models (HMMs) are used to model audio samples • Each HMM takes the extracted features as input • The Forward algorithm computes the log-likelihood of an audio segment with respect to each audio event • The Baum-Welch algorithm estimates the transition probabilities between states (which carry physical meaning) • A clustering algorithm determines the model size and number of states
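The Forward-algorithm scoring step can be sketched as below. For brevity this uses discrete observation symbols rather than the continuous-density HMMs a real audio system would need; the scaling trick avoids numeric underflow on long sequences:

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM.
    pi: initial state probs, A[i, j]: transition probs,
    B[i, k]: prob of emitting symbol k in state i."""
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()              # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        s = alpha.sum()
        log_lik += np.log(s)
        alpha /= s
    return log_lik
```

A segment would be scored against every audio event model (gunshot, explosion, engine, car braking), and the resulting log-likelihoods passed on to confidence evaluation.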

  15. Audio Event Modeling (training) • HMM models: gunshot, explosion, engine, car braking • Training data: 100 audio clips of 3–10 s for each HMM model

  16. Confidence Evaluation • To determine how close a segment is to an audio event, a confidence metric is calculated • The audio segment is compared with each audio event model in 1-second steps (the analysis window) • The log-likelihood from the Forward algorithm is used • An audio segment might not belong to any audio event model • Likelihood ratio test: based on the distribution of log-likelihood values
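One way to turn the raw log-likelihoods into per-second confidence scores is a likelihood ratio against a background model, squashed into (0, 1). The sigmoid squashing and the background-model comparison are illustrative choices, not the paper's exact test:

```python
import numpy as np

def confidence(loglik_event, loglik_background):
    """Map the log-likelihood ratio between an event model and a
    background model to a (0, 1) confidence score via a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(loglik_event - loglik_background)))

def score_stream(seg_logliks, bg_logliks):
    """Confidence per 1-s analysis window and per event model.
    Both inputs: arrays of shape (n_seconds, n_events)."""
    return confidence(np.asarray(seg_logliks), np.asarray(bg_logliks))
```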

  17. Confidence Evaluation (depicted) These confidence scores are the input of the high-level modeling and provide important clues to bridge audio events and semantic contexts.

  18. Semantic Context Modeling (GMM) • Goal: detect high-level semantic contexts based on the confidence scores of audio events that are highly relevant to the semantic concept • Training data: 30 gunplay and car-chasing scenes, each 3–5 min, selected from 10 Hollywood action movies • Five-fold cross-validation (random split: 24 training, 6 testing)

  19. GMM how does it work? • A semantic context lasts for a period of time, and not all relevant audio events exist at every moment • A texture window of 5 s is defined, with 2.5 s overlap • Go through the confidence values (analysis window with 1-s step) • Construct pseudo-semantic features
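The texture-window step can be sketched as below. Since confidence values arrive at 1-s steps, the 2.5-s hop is rounded to 2 steps here, and summarizing each event's scores by their mean and max within the window is an illustrative choice of pseudo-semantic feature, not the paper's exact construction:

```python
import numpy as np

def pseudo_semantic_features(conf, win=5, hop=2):
    """conf: (n_seconds, n_events) confidence scores at 1-s steps.
    Slide a 5-s texture window (hop ~2.5 s, rounded to 2 steps)
    and summarize each event's scores by mean and max."""
    feats = []
    for start in range(0, conf.shape[0] - win + 1, hop):
        w = conf[start:start + win]
        feats.append(np.hstack([w.mean(axis=0), w.max(axis=0)]))
    return np.array(feats)
```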

  20. GMM how does it work? • Semantic context detection • In the case of gunplay scenes, if all the feature elements of the gunshot and explosion events are located in the detection regions, the segment is said to convey the semantics of gunplay.
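The detection decision can be sketched as thresholding a segment's likelihood under the context's mixture model; the diagonal-covariance GMM scoring and the fixed threshold below are a minimal sketch, with all parameters assumed given rather than trained by EM as in the paper:

```python
import numpy as np

def gaussian_loglik(x, mean, var):
    """Diagonal-covariance Gaussian log-density."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def gmm_loglik(x, weights, means, vars_):
    """Log-likelihood under a mixture of diagonal Gaussians."""
    comp = [np.log(w) + gaussian_loglik(x, m, v)
            for w, m, v in zip(weights, means, vars_)]
    return np.logaddexp.reduce(comp)

def detect_gunplay(feature_vec, gmm_params, threshold):
    """Flag a segment as 'gunplay' when its pseudo-semantic feature
    vector falls in the model's high-likelihood (detection) region."""
    return gmm_loglik(feature_vec, *gmm_params) > threshold
```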

  21. Semantic Context Modeling (HMM) • Critique of the GMM model: • It does not model the time-duration density • Segments may receive spuriously low or high confidence scores because of environmental or emerging sounds • The HMM model captures the spectral variation of acoustic features in time by considering state transitions and giving different likelihood values • An ergodic (fully connected) HMM is used

  22. HMM how does it work? • Calculate the probability of the partial observation sequence ending in state i at time t, given a model λ • Using the Forward algorithm, calculate the log-likelihood value that represents how likely the semantic context is to occur

  23. Performance • Uncertainty is avoided: aural information tends to remain the same whether the visual scene is day or night, downtown or forest • It is rare to have a car-chasing concept without engine sound! • High precision indicates high confidence in the detection results • Short-length events, e.g. car braking, yield lower precision • False alarms, i.e. incorrect detections

  24. Performance

  25. Indexing and Retrieval • Concept matching between aural and visual information • If visual information is taken into account, characteristic consistency between different video clips with the same concept can be exploited • Generalized framework: replace audio event models with visual object models, thus detecting both audio and audiovisual concepts

  26. Future Work • Careful design of pseudo-semantic feature vectors to construct a meta-classifier (feature selection pool) • Blind source separation (media-aesthetic rules)

  27. Thank you
