Semantics from Sound: Modeling Audio and Text

Presentation Transcript


  1. Semantics from Sound: Modeling Audio and Text Thesis Proposal May 31, 2006 Douglas Turnbull Department of Computer Science & Engineering UC San Diego Committee: Charles Elkan, Gert Lanckriet, Serge Belongie, Sanjoy Dasgupta, Shlomo Dubnov

  2. Describing What We Hear Sound carries rich information from which we derive semantic understanding: • Interpreter translating the speech of a foreign dignitary • Analyst critiquing the tone of a political debate • Critic writing a music review • Movie director describing a sound effect

  3. Computer Audition We are interested in developing computer-based systems that can listen to and describe sound. Sound Classes: • Speech • Music • Sound Effects • Environmental Sounds

  4. Computer Audition We are interested in developing computer-based systems that can listen to and describe sound. Common Tasks: • Speech-to-text recognition • Speaker characterization - emotion, tone, accents • Music information retrieval • Monitoring using sound - sonar, bird migration We will primarily focus on non-speech audio research.

  5. Semantic Audio Annotation and Retrieval We would like a system that can both • annotate audio content with semantically meaningful words • retrieve relevant audio given a text-based query We learn a probabilistic model using a heterogeneous data set of audio and text. • This can be framed as a supervised or unsupervised learning problem. • Our initial approach uses a supervised multi-class naïve Bayes model.

  6. Semantic Audio Annotation and Retrieval This work involves 3 main components: • Heterogeneous data modeling • Audio data representation • Extracting features from short-time segments • Integrating to derive medium-time features • Modeling long-time (track-level) aspects of audio Audio representation will depend on the class of audio.

  7. Semantic Audio Annotation and Retrieval This work involves 3 main components: • Heterogeneous data modeling • Audio data representation • Text processing • Identifying the semantics of sound - vocabulary selection • Processing of music review documents or sound effect captions.

  8. Outline • Related work • “Modeling Music with Words” • Future research directions • Thesis logistics

  9. Outline • Related work • “Modeling Music with Words” • Future research directions • Thesis logistics

  10. Semantic-Audio Retrieval (SAR) Slaney’s SAR system is the only existing audio annotation and retrieval system [Sla02a, Sla02b, Buc05]. • Learn separate hierarchical models in each space • Semantic space • cluster documents • represent each cluster with a multinomial distribution [BF01]. • Audio space • learn a GMM for each ‘anchor’ audio file • create a distance matrix: -[LA(B) + LB(A)]/2 for anchors A & B, where LA(B) is the likelihood of B under A’s GMM • agglomerative clustering based on distances to all anchor points • Create separate linkages between spaces • Annotation (audio-to-text linkage) • Retrieval (text-to-audio linkage)
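
A minimal sketch of the acoustic-space clustering step, assuming L_A(B) denotes the average log-likelihood of anchor B's feature vectors under anchor A's GMM, and using scikit-learn's EM-trained GaussianMixture; the number of mixture components is illustrative. The resulting matrix could then be handed to an agglomerative clustering routine.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def anchor_distance_matrix(anchor_frames, n_components=8):
    """Pairwise distances between 'anchor' audio files, in the spirit of SAR.

    anchor_frames: list of (T_i, d) arrays of audio feature vectors,
    one per anchor file.
    """
    # Fit one GMM per anchor file with EM.
    gmms = [GaussianMixture(n_components=n_components).fit(X)
            for X in anchor_frames]
    n = len(anchor_frames)
    D = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            L_ab = gmms[a].score(anchor_frames[b])   # L_A(B): avg log-lik of B under A
            L_ba = gmms[b].score(anchor_frames[a])   # L_B(A): avg log-lik of A under B
            D[a, b] = -(L_ab + L_ba) / 2.0           # -[L_A(B) + L_B(A)] / 2
    return D
```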

  11. Semantic-Audio Retrieval (SAR) Annotation • Evaluate the query song under each node’s GMM in the acoustic space • Identify the documents associated with the highest-likelihood node • Learn a multinomial distribution from those documents • Generate an annotation

  12. Semantic-Audio Retrieval (SAR) Retrieval • Learn a GMM for audio associated with each node of the semantic hierarchy • Evaluate the query text under each node in semantic space • Estimate a GMM from audio files associated with the highest likelihood node • Retrieve audio files that have high likelihood under this GMM

  13. Semantic-Audio Retrieval (SAR) Comments: • Subsequent work includes • Mixture of Experts formulation [Sla02b] • Alternative text and audio representations [Buc05] • Hierarchy for semantic concepts may be restrictive • e.g., should the top-level concept be instrumentation or genre? • Inference is computationally expensive • Annotation: evaluation of the query song under each anchor model • Retrieval: evaluation of the training set under the learned query model • Few quantitative results are shown

  14. Automatic Record Review Whitman and Ellis model heterogeneous data of audio and text to generate ‘unbiased’ record reviews [WE04]. • A classifier (SVM) is learned for each word in their vocabulary. • ‘Grounded’ words produce classifiers that can separate the musical audio content. • Sentences from an existing review with many ‘grounded’ terms are retained, while sentences with ‘biased’ terms are deleted. Comments: • Creative solution for vocabulary selection • Retrieval is discussed but never implemented
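
A rough sketch of the per-word classifier idea, not the authors' exact pipeline: one SVM per vocabulary word, with cross-validated accuracy standing in for their grounding criterion. The track-level feature summary, RBF kernel, and accuracy threshold are all assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def grounded_words(track_features, documents, vocabulary, threshold=0.65):
    """Return words whose per-word SVM separates the audio reasonably well.

    track_features: (D, d) array, one summary feature vector per track.
    documents: list of D sets of words (the text paired with each track).
    """
    grounded = []
    for word in vocabulary:
        y = np.array([1 if word in doc else 0 for doc in documents])
        if y.sum() < 5 or y.sum() > len(y) - 5:
            continue  # too rare or too common to cross-validate meaningfully
        acc = cross_val_score(SVC(kernel="rbf"), track_features, y, cv=5).mean()
        if acc > threshold:
            grounded.append(word)
    return grounded
```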

  15. Related Research Areas • Semantic Multimedia Retrieval • Image/Video annotation and retrieval models • System evaluation • Music Information Retrieval • Query-by-example - acoustic similarity • Music classification - genre, instrument, emotion • Digital Signal Processing • Audio feature extraction • e.g., Mel-frequency cepstral coefficients • Audio representation • e.g., Gaussian mixture models (GMMs), hidden Markov models (HMMs)
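
For concreteness, a small sketch of MFCC extraction using the librosa library (not a tool named in the proposal); the sample rate and number of coefficients are illustrative defaults.

```python
import librosa

def mfcc_features(path, n_mfcc=13):
    """Load an audio file and return a (T, n_mfcc) bag of MFCC frame vectors."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    # Short-time Mel-frequency cepstral coefficients, one column per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T
```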

  16. Related Research Areas • Computer Vision • Feature design and selection • Boosting • Interest point detection • Automatic segmentation • Joint semantic audio and visual models for video data • Multiple Instance Learning • Modeling when each data point is represented as a bag of features • Some semantic concepts apply to just a few of the features • Multiple View Learning • Incorporating additional sources of information • e.g., in the music domain: • Multiple song or album reviews • Musical playlists and sales records • Song lyrics

  17. Outline • Related work • “Modeling Music with Words” • Future research directions • Thesis logistics

  18. Modeling Music with Words • Joint work with Luke Barrington and Gert Lanckriet • We developed the first automatic music annotation and retrieval system • Supervised multi-class naïve Bayes approach • Each word in our vocabulary is a ‘class’ • Developed by Carneiro and Vasconcelos [CV05] for image annotation as a reaction against supervised one-versus-all and unsupervised models [BDF+02, BJ03, FLM04] • Scalable in database size • Fewer demands on quality of labeling (e.g. weakly labeled data) • Produces a natural ranking of semantic concepts • Explicitly models the semantics of the problem

  19. System Overview [System diagram: training song-review pairs pass through audio-feature and text-feature extraction; the extracted features feed parameter estimation for the parametric model; a novel song is evaluated under the model for annotation, and a text query drives inference for retrieval.]

  20. System Overview Vocabulary: M semantic tokens (referred to as ‘words’) • Each semantic token is a ‘musically informative’ unigram or bigram • e.g., ‘rock’, ‘romantic’, ‘bob;dylan’, ‘electric;guitar’ Text representation: each document is represented as a binary document vector y = (y1,…,yM) where yi is 1 if word i is present, and 0 otherwise. Audio representation: each audio track X = {x1,…,xT} is represented by a bag of T real-valued feature vectors, where T depends on track length. • Music vectors are extracted every 3/4 second [MB03] • dynamic Mel-frequency cepstral coefficients (DMFCC) • auditory filterbank temporal envelopes (AFTE) • Sound effect vectors are extracted every 10 msec [Buc05] • delta cepstrum features based on MFCCs + derivatives Heterogeneous data set: track-document pairs - {(X1, y1), … , (XD, yD)}
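
A minimal sketch of this data layout; the function and variable names are illustrative, not part of the proposal.

```python
import numpy as np

def binary_document_vector(document_words, vocabulary):
    """y_i = 1 if vocabulary word i appears in the document, else 0."""
    index = {w: i for i, w in enumerate(vocabulary)}
    y = np.zeros(len(vocabulary), dtype=int)
    for w in document_words:
        if w in index:
            y[index[w]] = 1
    return y

# The heterogeneous data set is then a list of (X, y) track-document pairs,
# where X is a (T, d) array of audio feature vectors and T varies with track length.
```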

  21. Multi-class naïve Bayes model We learn a class conditional distribution P(x|i) for each word i in our vocabulary. • Each ‘word-level’ distribution is modeled with a Gaussian mixture model (GMM). • The training data is the set of tracks that have word i in the associated text document. • The parameters of the GMM are learned using the expectation-maximization (EM) algorithm. • Direct estimation • Naïve Averaging estimation using ‘track-level’ models • ‘Hierarchy of mixtures’ estimation using ‘track-level’ models [Vas01]
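
A sketch of direct estimation only, with scikit-learn's EM-trained GaussianMixture standing in for the word-level models; the number of components and diagonal covariances are assumptions. Naïve averaging and the hierarchy-of-mixtures estimators would instead combine pre-trained track-level GMMs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_models(pairs, vocabulary, n_components=8):
    """Direct estimation of one GMM p(x|i) per vocabulary word.

    pairs: list of (X, y) where X is a (T, d) array of audio feature vectors
    and y is the binary document vector over `vocabulary`.
    """
    word_models = {}
    for i, word in enumerate(vocabulary):
        # Training data: frames from every track whose document contains word i.
        frames = [X for (X, y) in pairs if y[i] == 1]
        if not frames:
            continue  # word never appears in the training documents
        word_models[word] = GaussianMixture(
            n_components=n_components, covariance_type="diag").fit(np.vstack(frames))
    return word_models
```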

  22. Annotation Given our densities p(x|i) and a query track (x1,…,xT), we select the word i* = argmax_i P(i | x1,…,xT). If we assume xi and xj are conditionally independent given the word, then P(x1,…,xT | i) = ∏t p(xt | i). Assuming a uniform word prior and taking a log transform, we have i* = argmax_i (1/T) ∑t log p(xt | i). We compute the average log likelihood (dividing by T) since longer songs will have disproportionately low likelihoods [RQD00].
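
A sketch of this annotation rule, reusing the word-level GMMs from the previous sketch; note that GaussianMixture.score already returns the per-frame average log-likelihood, which supplies the 1/T normalization.

```python
def annotate(X_query, word_models, n_words=10):
    """Rank words by (1/T) * sum_t log p(x_t | word) for the query track."""
    scores = {word: gmm.score(X_query)  # per-frame average log-likelihood
              for word, gmm in word_models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_words]
```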

  23. Retrieval We would like to rank test songs by the likelihood P(x1,…,xT|q) given a query word q. However, in practice, this results in almost the same ranking for all query words. There are two reasons: • Length Bias • Longer songs will have proportionately lower likelihood resulting from the sum of additional log terms. • This results from the poor conditional independence assumption between audio feature vectors [RQD00]. • Solution: we compute the average (per-vector) log likelihood for each song.

  24. Retrieval We would like to rank test songs by the likelihood P(x1,…,xT|q) given a query word q. However, in practice, this results in almost the same ranking for all query words. There are two reasons: • Length Bias • Song Bias • Many conditional word distributions P(x|q) are similar to the generic song distribution P(x) • Songs with high probability under p(x) (e.g., generic songs) will often have high probability under p(x|q)

  25. Retrieval Instead, we normalize by the song bias P(x1,…,xT), and thus rank songs by the posterior p(q|x1,…,xT) = P(x1,…,xT|q) P(q) / P(x1,…,xT). Normalizing by p(x) allows each song to place emphasis (i.e., weight) on words that increase the probability of x.
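
A sketch of this normalized retrieval rule, assuming a uniform word prior and approximating p(x) by a uniform mixture over the word models; per-frame average log-likelihoods stand in for the length normalization. These modeling choices are assumptions, not the proposal's exact estimator.

```python
import numpy as np
from scipy.special import logsumexp

def retrieve(query_word, songs, word_models):
    """Rank songs by p(query_word | x_1..x_T), normalizing away the song bias p(x).

    songs: list of (song_id, X) pairs, X a (T, d) array of feature vectors.
    """
    words = list(word_models)
    q = words.index(query_word)
    ranking = []
    for song_id, X in songs:
        # Average log-likelihood of the song under every word model.
        log_lik = np.array([word_models[w].score(X) for w in words])
        # log p(x) under a uniform mixture over the word models.
        log_px = logsumexp(log_lik) - np.log(len(words))
        ranking.append((log_lik[q] - log_px, song_id))
    ranking.sort(reverse=True)
    return [song_id for _, song_id in ranking]
```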

  26. Experimental Setup Data: 2131 Song-review pairs • Audio: mainstream western music from the last 60 years • DMFCC and AFTE features • Text: expert song reviews from AMG Allmusic database • Vocabulary of 317 unigrams and bigrams Model: Supervised multi-class naïve Bayes • Direct Estimation and Naïve averaging for word densities Tasks: Annotation: annotate each test song with 10 words Retrieval: rank order all test songs given a query word

  27. Qualitative Annotation Results

  28. Qualitative Retrieval Results

  29. Evaluation Metrics Annotation: mean per-word recall and precision For each word w, let |wH| = # of human annotations with word w |wA| = # of automatic annotations with word w |wC| = # of correct automatic annotations Per-word Recall = |wC| / |wH| Per-word Precision = |wC| / |wA| Mean recall and precision: average over all words in the vocabulary. Precision ranges between 0.0 (bad) and 1.0 (good). Recall ranges between 0.0 (bad) and 0.84 (best possible) since we annotate with far fewer words than are present in the test set corpus. • The test set contains 8232 words • Our system outputs 4250 words
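
A small sketch of these per-word metrics; skipping words with zero human or zero automatic annotations is an assumption about how undefined ratios are handled.

```python
def per_word_metrics(human, auto, vocabulary):
    """Mean per-word recall and precision.

    human, auto: dicts mapping song id -> set of words (ground truth and
    system output, respectively).
    """
    recalls, precisions = [], []
    for w in vocabulary:
        wH = sum(1 for s in human if w in human[s])   # human annotations containing w
        wA = sum(1 for s in auto if w in auto[s])     # automatic annotations containing w
        wC = sum(1 for s in auto if w in auto[s] and w in human.get(s, set()))
        if wH:
            recalls.append(wC / wH)
        if wA:
            precisions.append(wC / wA)
    return sum(recalls) / len(recalls), sum(precisions) / len(precisions)
```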

  30. Evaluation Metrics Retrieval: mean AP and AROC • Average precision (AP): • iterate through the ranking • average the precisions at each point where we correctly identify a new song • values between 0.0 (bad) and 1.0 (perfect) • Area under the ROC curve (AROC) • ROC function - true positive rate vs. false positive rate • Area under ROC - integrate the ROC function as we iterate through the ranking • Values between 0.0 (bad) and 1.0 (perfect) • A random model will give an AROC of 0.5 Mean AP and AROC are averages over all words in the vocabulary
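
A sketch of these retrieval metrics using scikit-learn's average_precision_score and roc_auc_score; skipping query words for which all test songs are relevant (or none are) is an assumption, since AROC is undefined in that case.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def retrieval_metrics(scores, relevance):
    """Mean AP and mean AROC over all query words.

    scores: dict word -> array of system scores, one per test song.
    relevance: dict word -> binary array, 1 if the song is relevant to the word.
    """
    aps, aucs = [], []
    for w, s in scores.items():
        y_true = relevance[w]
        if y_true.sum() in (0, len(y_true)):
            continue  # AROC needs both relevant and irrelevant songs
        aps.append(average_precision_score(y_true, s))
        aucs.append(roc_auc_score(y_true, s))
    return float(np.mean(aps)), float(np.mean(aucs))
```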

  31. Quantitative System Evaluation Tasks: Annotation: annotate each test song with 10 words Retrieval: rank order all test songs given a query word Evaluation metrics: Annotation: mean per-word recall and precision Retrieval: mean AP and AROC Baseline models: • Random words - uniform distribution over words • Prior (stochastic) - pick words from a multinomial distribution parameterized by prior probability P(i) of word i • Prior (deterministic) - rank words according to P(i)

  32. Quantitative System Evaluation Annotation: the system produces significantly better results than random (3x recall, 2x precision). • DMFCC features produce superior results • Direct and naïve average estimation produce comparable results • Best recall (0.09) and precision (0.12) leave room for improvement

  33. Quantitative System Evaluation Retrieval: the system produces significantly better results than random • DMFCC features produce superior results • Direct and naïve average estimation produce comparable results • Best AP (0.11) and AROC (0.61) leave room for improvement

  34. Comments on results Our results leave much to be desired, but: • Ground truth is noisy • Authors do not make an explicit list of words when reviewing a song • Companies (Microsoft, Moodlogic) have collected clean music data • State-of-the-art image annotation and vision systems have precision and recall of about 0.25. • Directly comparing image and audio systems is dangerous since the relative objectivity of the tasks and the quality of the data affect performance. • Early sound effects results show promise • Recall = 0.16 (random 0.01) • Precision = 0.10 (random 0.02) • Mean AP = 0.15 (random 0.05) • Area under ROC = 0.75 (random 0.50)

  35. Outline • Related work • “Modeling Music with Words” • Future research directions • Thesis logistics

  36. 1: Implementing unsupervised models Recent unsupervised models have been proposed for modeling semantics [BJ03, FLM04, LW03, BDF+02] • Introduce a ‘latent’ variable that encodes a set of states • Each state represents a joint distribution between content-based features and semantic concepts. Two recent models are: • Correspondence Latent Dirichlet Allocation [BJ03] • Latent states: hidden topics that are learned during training • Parameter estimation involves variational approximation or MCMC sampling • Multiple Bernoulli Relevance Model [FLM04] • Latent states: one per training image • Parameter estimation reduces to counting and clever smoothing

  37. 2: Audio Representation Modeling longer-term temporal aspects of audio • GMM formulation represents audio as a bag of vectors • HMMs and conditional random fields (CRFs) can be used to model trajectories of features over time • Feature integration using signal processing techniques (e.g., modulation spectra [SAP04]) Incorporating automatic audio segmentation [Got03] • Represent homogeneous segments of audio content Adapting feature selection algorithms used in vision research • Boosting [VJ02, OFPA06] • Interest point detection [MS02, AR05]

  38. Other Directions 3: Text Processing Incorporate part-of-speech tagging, model synonym relationships, and learn dependencies between semantic tokens [TH02]. 4: Multiple Instance Learning [DLLP97, RC05] Using the average log likelihood for annotation is problematic if some semantic concepts are related to just a few feature vectors. We could use alternatives, such as the minimum or maximum log likelihood.
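
To make the alternative concrete, a sketch that scores a word by its best-matching frames rather than the average over all frames; the top_k parameter is illustrative, and this is only one possible multiple-instance-style rule, not a method from the proposal.

```python
import numpy as np

def annotate_top_frames(X_query, word_models, n_words=10, top_k=10):
    """Score each word by the mean of its top_k per-frame log-likelihoods."""
    scores = {}
    for word, gmm in word_models.items():
        frame_ll = gmm.score_samples(X_query)        # log p(x_t | word) per frame
        scores[word] = np.sort(frame_ll)[-top_k:].mean()
    return sorted(scores, key=scores.get, reverse=True)[:n_words]
```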

  39. Other Directions 5: Multiple View Learning [RS05] Each of our heterogeneous data points may have many sources of information: • Alternative song or album reviews • Lyrics • Playlist or sales records information Multiple view learning involves combining these additional sources of information. 6: Modeling with Clean Annotations Our noisy dataset degrades our performance. Working with companies that have collected cleaner data sets can lead to mutually beneficial collaboration. • Microsoft - paid experts produce high quality manual annotations • Moodlogic - large user base fills out standard annotation forms

  40. Outline • Related work • “Modeling Music with Words” • Future research directions • Thesis logistics

  41. Prioritized Work Schedule • Unsupervised Models: CorrLDA [BJ03] • Audio Representation • HMM [FPW05] • Feature selection using boosting [VJ02] • Automatic segmentation [Got03] • Alternative extraction and integration techniques • Multiple instance learning • Using clean annotations • Language modeling Note: Bold represents current or ‘near future’ research.

  42. Thesis Chapters • Introduction to modeling the semantics of audio data • Related work • Semantic modeling of multimedia data • Audio feature design • Text processing • Audio and text representation • Modeling using supervised multi-class approach • Second technical contribution • Unsupervised approach, or • Multiple instance learning, or • Others (MVL, similarity with semantics)… • Conclusions, discussions, and future work Note: Bold represents chapters with novel technical material.

  43. Schedule Spring 06 • ISMIR paper “Modeling Music with Words” • Thesis proposal • NIPS paper “Modeling the Semantics of Sound” Summer 06 • NSF EAPSI research fellowship with Dr. Masataka Goto at the Japanese National Institute of Advanced Industrial Science and Technology (AIST) Fall, Winter, Spring 06-07 • Research at UCSD • Potential papers at ICASSP, ICML, NIPS, ISMIR, SIGIR Summer 07 • Research outside of UCSD - Columbia (Dan Ellis), U. of Victoria (George Tzanetakis), Microsoft (John Platt), OFAI (Gerhard Widmer), IRCAM, Moodlogic, etc. Fall, Winter 07 • Prepare and defend thesis • Apply for post-doctoral, research lab, and/or teaching positions Spring 08 • Submit thesis, continue job search, finish outstanding papers

  44. The beginning… Image from www.therecordcollector.org

  45. References

  46. References

  47. Correspondence Latent Dirichlet Allocation Corr-LDA is a popular latent model developed for image annotation [BJ03] • Each image is an (r,w) pair • r is a vector of N image region feature vectors • w is a vector of M keywords from an image caption vocabulary of W keywords

  48. Correspondence Latent Dirichlet Allocation The generative process for an image and image annotation according to the Corr-LDA model is • Draw θ from a Dirichlet distribution with parameter α • θ is a multinomial random variable and can be thought of as a “distribution over topics.” • For each of the N image regions rn, draw a topic zn. Draw an image region from the Gaussian distribution associated with topic zn. • For each of the M keywords wm, pick one of the topics that was chosen in step 2. Draw a keyword from the multinomial distribution associated with this topic.
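
A sketch that simulates the generative process just described with NumPy; all model parameters (Dirichlet α, per-topic Gaussians, per-topic word multinomials) are assumed given, and this is forward sampling only, not the variational or MCMC inference the model requires.

```python
import numpy as np

def corr_lda_generate(alpha, topic_means, topic_covs, topic_word_probs,
                      n_regions, n_words, rng=None):
    """Sample one (image regions, caption words) pair from Corr-LDA.

    alpha: length-K Dirichlet parameter; topic_means/topic_covs: Gaussian
    parameters per topic; topic_word_probs: (K, W) word multinomial per topic.
    """
    rng = rng or np.random.default_rng()
    K = len(alpha)
    theta = rng.dirichlet(alpha)                   # step 1: "distribution over topics"
    z = rng.choice(K, size=n_regions, p=theta)     # step 2: a topic per image region
    regions = np.array([rng.multivariate_normal(topic_means[k], topic_covs[k])
                        for k in z])
    words = []
    for _ in range(n_words):
        region = rng.choice(n_regions)             # step 3: pick one drawn region ...
        k = z[region]                              # ... and hence its topic
        words.append(rng.choice(topic_word_probs.shape[1], p=topic_word_probs[k]))
    return regions, np.array(words)
```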
