Semantic Indexing of multimedia content using visual, audio and text cues Written By: . W . H. Adams . Giridharan Iyengar . Ching -Yung Lin . Milind Ramesh Naphade . Chalapathy Neti . Harriet J. Nock . John R. Smith. Presented by: Archana R eddy Jammula 800802639.
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Semantic Indexing of multimedia content using visual, audio and text cuesWritten By:.W. H. Adams.GiridharanIyengar. Ching-Yung Lin .MilindRamesh Naphade. ChalapathyNeti. Harriet J. Nock. John R. Smith
searching, and retrieving content.
QBE TO QUERY-BY-KEYWORD (QBK)
Emphasis on the extraction of semantics from individual modalities, in some instances, using audio and visual modalities.
CHANGE: Approach that combines audio and visual content analysis with textual information retrieval for semantic modeling of multimedia
Combines content analysis with information
retrieval in a unified setting for the semantic labeling
of multimedia content. Proposal of anovel
approach for representing semantic-concepts using a basis of
other semantic-concepts and propose a novel discriminant
framework to fuse the different modalities.
A priori definition of a set of atomic semantic-concepts (objects, scenes, and events) which is assumed to be broad enough to cover the semantic query space of interest high levelConcepts.
The reliable estimation of class conditional parameters in the requires large amounts of training data for each class and the forms assumed for class conditional distributions may not be the most appropriate.
Use of a more discriminant learning approach requires fewer parameters and assumptions may yield better results
a particular form of the joint probability density function.
P(E, A,V,T) = P(E)P(A/E)P(V/E)P(T/E)
(a) Bayesian Networks
(b) Support Vector Machines
Music- 84% of manually labeled audio samples
Rocket engine explosion- 60% of manually labeled audio samples
in audio, video, and/or speech.
summarize performance, defined as average precision . over the top 100 retrieved documents.
PRECISION-RECALL CURVES and text cues
Fig: Effect of duration modeling
Fig: Implicit versus explicit fusion
FOM results: audio retrieval, different intermediate concepts.
FOM results: audio retrieval, GMM versus HMM performance
and implicit versus explicit fusion.
Two sets of results:
FOM results: speech retrieval using human knowledge based query.
Take the scores from all semantic models concatenating them into a 9-dimensional feature vector.
Fig: Fusion and text cuesof audio, text, and visual models using the SVM
fusion model for rocket launch retrieval.
Fig: The and text cuestop 20 video shots of rocket launch/take-off retrieved using multimodal detection based on the SVM model. Nineteen of the
top 20 are rocket launch shots.
FOM results for unimodal retrieval and the two multimodal fusion models
semantic-concept rocket launch:
-For concept classification using information in single modalities
-For concept classification using information from multiple modalities.
determining the low-level features which are most appropriate for labeling atomic concepts and for determining atomic concepts which are related
to higher-level semantic-concepts.
THANK YOU and text cues