
AdvAIR


  1. AdvAIR An Advanced Audio Information Retrieval System Supervised by Prof. Michael R. Lyu Prepared by Alex Fok, Shirley Ng 2002 Fall

  2. Outline • Introduction • System Overview • Applications • Experiment • Future Work • Q&A

  3. Introduction

  4. Motivation • Rapid expansion of audio information due to the blooming of the Internet • Little attention paid to audio mining • Lack of a framework for generic audio information processing

  5. Targets • An open platform that can provide a basis for various voice-oriented applications • Enhanced audio information retrieval performance with guaranteed accuracy • Generic speech analysis tools for data mining

  6. Approaches • Robust low-level sound information preprocessing module • Speed-oriented yet accurate algorithms • Generalized model concept for various usages • A visual framework for presentation

  7. System Design

  8. System Flow Chart • Core platform: Audio Signal → Preprocessing → Features Extraction → Segmentation and Clustering → Training and Modeling → Database Storage • Extended tools: Speaker Identification, Linguistic Identification, Scene Cutting (implements video scene change and speaker tracking)

  9. Features Extraction • Energy Measurement • Zero Crossing Rate • Pitch • Humans resolve frequencies non-linearly across the audio spectrum • MFCC approach • Simulates the vocal tract shape
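The first two features above can be computed per frame in a few lines; a minimal numpy sketch (the frame values are toy data, not samples from the system):

```python
import numpy as np

def short_time_features(frame):
    """Short-time energy and zero-crossing rate of one audio frame."""
    frame = np.asarray(frame, dtype=np.float64)
    energy = float(np.sum(frame ** 2))             # energy measurement
    signs = np.sign(frame)
    zcr = float(np.mean(signs[1:] != signs[:-1]))  # fraction of sign changes
    return energy, zcr

# Toy frame whose samples alternate in sign, so every step is a crossing.
energy, zcr = short_time_features([0.1, -0.2, 0.3, -0.1, 0.05])
```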

  10. Features Extraction (cont'd) • The idea of a filter bank, which approximates the non-linear frequency resolution • Bins hold a weighted sum representing the spectral magnitude of the channels • Lower and upper frequency cut-offs bound the analysis band
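The filter-bank idea can be sketched as triangular filters spaced evenly on the mel scale; the filter count, FFT size, sample rate and cut-offs below are common illustrative values, not the system's actual settings:

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sr=16000, f_lo=0.0, f_hi=8000.0):
    """Triangular filters evenly spaced on the mel scale (illustrative sizes)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter edges: evenly spaced in mel between the two cut-offs.
    mel_pts = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        for k in range(l, c):                 # rising edge of the triangle
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge of the triangle
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

fb = mel_filterbank()
```

Multiplying a power-spectrum frame by `fb.T` gives one weighted sum per channel, i.e. the bin values the slide describes.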

  11. Segmentation • Segmentation cuts the audio stream at acoustic change points • BIC (Bayesian Information Criterion) is used • It is threshold-free and robust • The input audio stream is modeled as Gaussians

  12. Segmentation • Notations for an audio stream: • N : Number of frames • X = {xi : i = 1,2,…,N} : a set of feature vectors • μ is the mean • Σ is the full covariance matrix

  13. Segmentation for single change pt. • (Figure: an audio stream of frames 1…N with a change point at frame i) • Assume the change point is at frame i • H0, H1: two different models • H0 models the data as one Gaussian • x1…xN ~ N(μ, Σ) • H1 models the data as two Gaussians • x1…xi ~ N(μ1, Σ1) • xi+1…xN ~ N(μ2, Σ2)

  14. Segmentation for single change pt. (cont'd) • The maximum likelihood ratio statistic is R(i) = N log|Σ| − N1 log|Σ1| − N2 log|Σ2| • where N1 = i and N2 = N − i are the numbers of frames before and after the split
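The statistic above can be computed directly from sample covariances; a minimal numpy sketch (the 2-D feature stream and the mean jump are toy data, not real MFCC features):

```python
import numpy as np

def likelihood_ratio(X, i):
    """R(i) = N log|S| - N1 log|S1| - N2 log|S2| with N1 = i, N2 = N - i;
    S, S1, S2 are sample covariances of the whole stream and its halves."""
    N = len(X)
    logdet = lambda Y: np.linalg.slogdet(np.cov(Y, rowvar=False))[1]
    return N * logdet(X) - i * logdet(X[:i]) - (N - i) * logdet(X[i:])

# Toy 2-D feature stream with a mean jump at frame 100.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
r_true = likelihood_ratio(X, 100)   # split at the true change point
r_off = likelihood_ratio(X, 10)     # split far from it
```

Splitting at the true change point yields a much larger R than splitting elsewhere, which is what the BIC maximization exploits.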

  15. Segmentation for single change pt. (cont'd) • Which model fits the data better, a single Gaussian (H0) or two Gaussians (H1)? • BIC(i) = R(i) − λP, where P penalizes the extra model complexity of H1 • BIC(i) is +ve: i is the change point • BIC(i) is −ve: i is not the change point

  16. Segmentation for single change pt. (cont'd) • To detect a single change point, we calculate BIC(i) for all i = 1,2,…,N • The frame i with the largest BIC value is the change point • O(N) BIC computations to detect a single change point

  17. Segmentation for multiple change pt. • Step 1: Initialize interval [a,b]; set a = 1, b = 2 • Step 2: Detect a change point in interval [a,b] with the BIC single change point detection algorithm • Step 3: If there is no change point in interval [a,b], set b = b+1; else let t be the change point detected and set a = t+1, b = t+2 • Step 4: Go to Step 2
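The four steps above can be sketched as a growing-window loop. This is a toy adaptation, not the production engine: covariance estimates need several frames per side, so a minimum-segment guard replaces the 2-frame starting interval, and the penalty weight λ = 2.0, window sizes and data are illustrative choices.

```python
import numpy as np

def delta_bic(W, i, lam=2.0):
    """BIC(i) = R(i) - lam * P for splitting window W at frame i,
    with penalty P = (d + d(d+1)/2)/2 * log N (lam tuned for toy data)."""
    N, d = W.shape
    logdet = lambda Y: np.linalg.slogdet(np.cov(Y, rowvar=False))[1]
    R = N * logdet(W) - i * logdet(W[:i]) - (N - i) * logdet(W[i:])
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    return R - lam * P

def detect_changes(X, min_seg=10):
    """Growing-window multiple change point detection (Steps 1-4 above)."""
    changes, a, b = [], 0, 2 * min_seg
    while b <= len(X):
        W = X[a:b]
        # Candidate splits keep at least min_seg frames on each side.
        cands = list(range(min_seg, len(W) - min_seg + 1))
        vals = [delta_bic(W, i) for i in cands]
        best = int(np.argmax(vals))
        if vals[best] > 0:                    # change point found:
            t = a + cands[best]
            changes.append(t)
            a, b = t, t + 2 * min_seg         # restart just after it
        else:
            b += 1                            # no change yet: grow interval
    return changes

# Toy stream with one mean jump at frame 60.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(10, 1, (60, 2))])
changes = detect_changes(X)
```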

  18. Enhanced Implementation Algorithm • Original multiple change point detection algorithm: • starts detecting change points within 2 frames • increases the investigation interval by 1 each time • Enhanced implementation algorithm: • the minimum processing interval used in our engine is 100 frames • increases the investigation interval by 100 each time

  19. Enhanced Implementation Algorithm (cont'd) • Why do we choose to increase the interval by 100 frames? • If the increase is too large, a scene change may be missed • It must be smaller than 170 frames because there are around 170 frames in 1 second • If the increase is too small, processing is too slow

  20. Enhanced Implementation Algorithm (cont'd) • Advantage: speed-up • Trade-off: the detected change point is less accurate • To compensate: • investigate the frames around the change point again • the investigation interval is incremented by 1 to locate a more accurate change point

  21. Training and Modeling • Before doing the various identifications, training and modeling are needed • A probability-based model, the Gaussian Mixture Model (GMM), is used • GMM is used for language identification, gender identification and speaker identification • A GMM is modeled by many different Gaussian distributions • A Gaussian distribution is represented by its mean and variance

  22. Gaussian Mixture Model (GMM) • A model for speaker i is a weighted sum of Gaussian distributions • To train a model is to calculate the mean, variance and weight of each Gaussian distribution; together these parameters form the model λ

  23. Training of speaker GMMs • Collect sound clips that are long enough for each speaker (e.g. 20-minute sound clips) • Steps for training one speaker model: • Step 1: Start with an initial model λ • Step 2: Calculate a new mean, variance and weighting (a new λ) by training • Step 3: Use the new λ if it represents the speaker better than the old λ • Step 4: Repeat Steps 2 to 3 • Finally, we get a λ that can represent the model
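Steps 1-4 above correspond to EM training of the mixture. A minimal diagonal-covariance sketch follows; the component count k, iteration budget, initialization scheme and toy data are illustrative assumptions, not the engine's actual settings:

```python
import numpy as np

def train_gmm(X, k=2, n_iter=50, seed=0):
    """Minimal diagonal-covariance GMM trained with EM (Steps 1-4)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step 1: initial model -- random frames as means, global variance,
    # equal component weights.
    mu = X[rng.choice(N, k, replace=False)].copy()
    var = np.tile(X.var(axis=0), (k, 1)) + 1e-6
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # Step 2: per-frame responsibilities (E-step) ...
        logp = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1)) + np.log(w)
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # ... then the new mean, variance and weighting (M-step).
        nk = r.sum(axis=0) + 1e-12
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ X ** 2) / nk[:, None] - mu ** 2 + 1e-6
        w = nk / N
        # Steps 3-4: EM never decreases the likelihood, so the new
        # parameters replace the old ones and the loop repeats.
    return w, mu, var

# Toy "speaker" data: two well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(10, 1, (200, 2))])
w, mu, var = train_gmm(X, k=2)
```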

  24. Applications

  25. Applications • Video scene change and speaker tracking • Speaker Identification • Telephony message notification

  26. Video Scene Change and Speaker Tracking • Video clip → AdvAIR core (segmentation) → timing and speaker information • The video playing mechanism uses this, together with the speakers index information, to drive the multimedia presentation

  27. Usage • Speaker tracking enhances data mining about a particular person (e.g. a politician in a conference) • Audio information indexing and sorting for audio library storage • An auxiliary tool for video cutting and editing applications

  28. Screenshot • Input clip • Multimedia player • Time information and indexing

  29. Speaker Identification • Training stage: preprocessed speaker clips → GMM model training → speaker models database • Testing stage: sound source → speaker comparison mechanism against the models database → speaker identity

  30. Usage • Security authentication • Speaker identification for telephone-based systems • Criminal investigation (used similarly to a fingerprint)

  31. Screenshot • Input source • Flexible-length comparison • Media player for visual verification • Speaker identity

  32. Telephony Message Notification • While the user can't listen, the system records the calling party's message • AdvAIR segmentation and GMM model comparison against the model database sort the caller into the desired or non-desired group • For the desired group, the Messaging API notifies the user through the Short Message System or the e-mail system

  33. Experiment Results

  34. Threshold-free BIC criterion • Background noise affects accuracy

  35. Enhanced Implementation • The speed enhancement is determined by the number of change points relative to the stream length

  36. GMM model closed-set speaker identification • Training stage: 10 speakers (5 males, 5 females), 20 minutes for each speaker • Testing stage: 50 sound clips with 5-second duration • 48 sound clips identified correctly, i.e. 96%

  37. GMM model open-set speaker identification • Accept or Reject as the result • Same setting as the closed-set experiment, i.e. 10 speakers with 20 minutes each • Correct: 45/50 = 90% • False reject: 3/50 = 6% • False accept: 2/50 = 4%
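The accept-or-reject decision can be sketched as a score threshold on the best-matching model; the speaker names, scores and threshold below are made-up illustrative values:

```python
def open_set_decide(scores, threshold):
    """Pick the best-scoring enrolled speaker, but reject the claim when
    even the best score falls below a tuned threshold (open-set decision).
    scores: {speaker name: average log-likelihood under that speaker's GMM}."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "reject"

# All names, scores and the threshold are hypothetical.
known = open_set_decide({"speaker_a": -41.2, "speaker_b": -48.7}, -45.0)
unknown = open_set_decide({"speaker_a": -50.1, "speaker_b": -52.3}, -45.0)
```

Tuning the threshold trades false accepts against false rejects, which is why short test clips (noisier scores) hurt open-set accuracy.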

  38. Problems and Limitation

  39. Problems and limitations • Accuracy is affected by background noise • Some speakers have very similar voice features • The open-set speaker identification decision function is not very accurate if the duration is short • Segmentation is still a time-consuming process

  40. Future Work • Speaker gender identification • Robust open-set speaker identification • Speech content recognition • Music pattern matching • Distributed system for segmentation

  41. Q & A
