
AdvAIR


  1. AdvAIR An Advanced Audio Information Retrieval System Supervised by Prof. Michael R. Lyu Prepared by Alex Fok, Shirley Ng 2002 Fall

  2. Outline • Introduction • System Overview • Applications • Experiment • Future Work • Q&A

  3. Introduction

  4. Motivation • Rapid expansion of audio information due to the blooming of the Internet • Little attention paid to audio mining • Lack of a framework for generic audio information processing

  5. Targets • An open platform that can provide a basis for various voice-oriented applications • Enhanced audio information retrieval performance with guaranteed accuracy • Generic speech analysis tools for data mining

  6. Approaches • Robust low-level sound information preprocessing module • Speed-oriented yet accurate algorithms • Generalized model concept for various usages • A visual framework for presentation

  7. System Design

  8. System Flow Chart • Core platform: Audio Signal → Preprocessing → Features Extraction → Segmentation and Clustering → Training and Modeling → Database Storage • Extended tools: Speaker Identification, Linguistic Identification, Scene Cutting (implements video scene change and speaker tracking)

  9. Features Extraction • Energy Measurement • Zero Crossing Rate • Pitch • Humans resolve frequencies non-linearly across the audio spectrum • MFCC approach • Simulates the vocal tract shape
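The first two features above can be computed per frame in a few lines; a minimal numpy sketch (the frame values are toy data, not samples from the system):

```python
import numpy as np

def short_time_features(frame):
    """Short-time energy and zero-crossing rate of one audio frame."""
    frame = np.asarray(frame, dtype=np.float64)
    energy = float(np.sum(frame ** 2))             # energy measurement
    signs = np.sign(frame)
    zcr = float(np.mean(signs[1:] != signs[:-1]))  # fraction of sign changes
    return energy, zcr

# Toy frame whose samples alternate in sign, so every step is a crossing.
energy, zcr = short_time_features([0.1, -0.2, 0.3, -0.1, 0.05])
```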

  10. Features Extraction (cont'd) • The idea of a filter bank, which approximates the non-linear frequency resolution • Bins hold a weighted sum representing the spectral magnitude of the channels • Lower and upper frequency cut-offs bound the analysis band
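The filter-bank idea can be sketched as triangular filters spaced evenly on the mel scale; the filter count, FFT size, sample rate and cut-offs below are common illustrative values, not the system's actual settings:

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sr=16000, f_lo=0.0, f_hi=8000.0):
    """Triangular filters evenly spaced on the mel scale (illustrative sizes)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter edges: evenly spaced in mel between the two cut-offs.
    mel_pts = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        for k in range(l, c):                 # rising edge of the triangle
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge of the triangle
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

fb = mel_filterbank()
```

Multiplying a power-spectrum frame by `fb.T` gives one weighted sum per channel, i.e. the bin values the slide describes.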

  11. Segmentation • Segmentation cuts the audio stream at acoustic change points • BIC (Bayesian Information Criterion) is used • It is threshold-free and robust • The input audio stream is modeled as Gaussians

  12. Segmentation • Notations for an audio stream: • N : Number of frames • X = {xi : i = 1,2,…,N} : a set of feature vectors • μ is the mean • Σ is the full covariance matrix

  13. Segmentation for single change pt. • (Figure: an audio stream of frames 1…N with a change point at frame i) • Assume the change point is at frame i • H0, H1: two different models • H0 models the data as one Gaussian • x1…xN ~ N(μ, Σ) • H1 models the data as two Gaussians • x1…xi ~ N(μ1, Σ1) • xi+1…xN ~ N(μ2, Σ2)

  14. Segmentation for single change pt. (cont'd) • The maximum likelihood ratio statistic is R(i) = N log|Σ| − N1 log|Σ1| − N2 log|Σ2| • where N1 = i and N2 = N − i are the numbers of frames before and after the split
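The statistic above can be computed directly from sample covariances; a minimal numpy sketch (the 2-D feature stream and the mean jump are toy data, not real MFCC features):

```python
import numpy as np

def likelihood_ratio(X, i):
    """R(i) = N log|S| - N1 log|S1| - N2 log|S2| with N1 = i, N2 = N - i;
    S, S1, S2 are sample covariances of the whole stream and its halves."""
    N = len(X)
    logdet = lambda Y: np.linalg.slogdet(np.cov(Y, rowvar=False))[1]
    return N * logdet(X) - i * logdet(X[:i]) - (N - i) * logdet(X[i:])

# Toy 2-D feature stream with a mean jump at frame 100.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
r_true = likelihood_ratio(X, 100)   # split at the true change point
r_off = likelihood_ratio(X, 10)     # split far from it
```

Splitting at the true change point yields a much larger R than splitting elsewhere, which is what the BIC maximization exploits.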

  15. Segmentation for single change pt. (cont'd) • Which model fits the data better, a single Gaussian (H0) or two Gaussians (H1)? • BIC(i) = R(i) − λP, where P penalizes the extra model complexity of H1 • BIC(i) is +ve: i is the change point • BIC(i) is −ve: i is not the change point

  16. Segmentation for single change pt. (cont'd) • To detect a single change point, we calculate BIC(i) for all i = 1,2,…,N • The frame i with the largest BIC value is the change point • O(N) BIC computations to detect a single change point

  17. Segmentation for multiple change pt. • Step 1: Initialize interval [a,b]; set a = 1, b = 2 • Step 2: Detect a change point in interval [a,b] with the BIC single change point detection algorithm • Step 3: If there is no change point in interval [a,b], set b = b+1; else let t be the change point detected and set a = t+1, b = t+2 • Step 4: Go to Step 2
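The four steps above can be sketched as a growing-window loop. This is a toy adaptation, not the production engine: covariance estimates need several frames per side, so a minimum-segment guard replaces the 2-frame starting interval, and the penalty weight λ = 2.0, window sizes and data are illustrative choices.

```python
import numpy as np

def delta_bic(W, i, lam=2.0):
    """BIC(i) = R(i) - lam * P for splitting window W at frame i,
    with penalty P = (d + d(d+1)/2)/2 * log N (lam tuned for toy data)."""
    N, d = W.shape
    logdet = lambda Y: np.linalg.slogdet(np.cov(Y, rowvar=False))[1]
    R = N * logdet(W) - i * logdet(W[:i]) - (N - i) * logdet(W[i:])
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    return R - lam * P

def detect_changes(X, min_seg=10):
    """Growing-window multiple change point detection (Steps 1-4 above)."""
    changes, a, b = [], 0, 2 * min_seg
    while b <= len(X):
        W = X[a:b]
        # Candidate splits keep at least min_seg frames on each side.
        cands = list(range(min_seg, len(W) - min_seg + 1))
        vals = [delta_bic(W, i) for i in cands]
        best = int(np.argmax(vals))
        if vals[best] > 0:                    # change point found:
            t = a + cands[best]
            changes.append(t)
            a, b = t, t + 2 * min_seg         # restart just after it
        else:
            b += 1                            # no change yet: grow interval
    return changes

# Toy stream with one mean jump at frame 60.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(10, 1, (60, 2))])
changes = detect_changes(X)
```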

  18. Enhanced Implementation Algorithm • Original multiple change point detection algorithm: • starts detecting change points within 2 frames • increases the investigation interval by 1 each time • Enhanced implementation algorithm: • the minimum processing interval used in our engine is 100 frames • increases the investigation interval by 100 each time

  19. Enhanced Implementation Algorithm (cont'd) • Why do we choose to increase the interval by 100 frames? • If the increase is too large, a scene change may be missed • It must be smaller than 170 frames because there are around 170 frames in 1 second • If the increase is too small, processing is too slow

  20. Enhanced Implementation Algorithm (cont'd) • Advantage: speed-up • Trade-off: the detected change point is less accurate • To compensate: • investigate the frames around the change point again • the investigation interval is incremented by 1 to locate a more accurate change point

  21. Training and Modeling • Before doing the various identifications, training and modeling are needed • A probability-based model, the Gaussian Mixture Model (GMM), is used • GMM is used for language identification, gender identification and speaker identification • A GMM is modeled by many different Gaussian distributions • A Gaussian distribution is represented by its mean and variance

  22. Gaussian Mixture Model (GMM) • A model for speaker i is a weighted sum of Gaussian distributions • To train a model is to calculate the mean, variance and weight of each Gaussian distribution; together these parameters form the model λ

  23. Training of speaker GMMs • Collect sound clips that are long enough for each speaker (e.g. 20-minute sound clips) • Steps for training one speaker model: • Step 1: Start with an initial model λ • Step 2: Calculate a new mean, variance and weighting (a new λ) by training • Step 3: Use the new λ if it represents the speaker better than the old λ • Step 4: Repeat Steps 2 to 3 • Finally, we get a λ that can represent the model
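Steps 1-4 above correspond to EM training of the mixture. A minimal diagonal-covariance sketch follows; the component count k, iteration budget, initialization scheme and toy data are illustrative assumptions, not the engine's actual settings:

```python
import numpy as np

def train_gmm(X, k=2, n_iter=50, seed=0):
    """Minimal diagonal-covariance GMM trained with EM (Steps 1-4)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step 1: initial model -- random frames as means, global variance,
    # equal component weights.
    mu = X[rng.choice(N, k, replace=False)].copy()
    var = np.tile(X.var(axis=0), (k, 1)) + 1e-6
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # Step 2: per-frame responsibilities (E-step) ...
        logp = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1)) + np.log(w)
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # ... then the new mean, variance and weighting (M-step).
        nk = r.sum(axis=0) + 1e-12
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ X ** 2) / nk[:, None] - mu ** 2 + 1e-6
        w = nk / N
        # Steps 3-4: EM never decreases the likelihood, so the new
        # parameters replace the old ones and the loop repeats.
    return w, mu, var

# Toy "speaker" data: two well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(10, 1, (200, 2))])
w, mu, var = train_gmm(X, k=2)
```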

  24. Applications

  25. Applications • Video scene change and speaker tracking • Speaker Identification • Telephony message notification

  26. Video Scene Change and Speaker Tracking • Video clip → AdvAIR core (segmentation) → timing and speaker information • The video playing mechanism uses this, together with the speakers index information, to drive the multimedia presentation

  27. Usage • Speaker tracking enhances data mining about a particular person (e.g. a politician in a conference) • Audio information indexing and sorting for audio library storage • An auxiliary tool for video cutting and editing applications

  28. Screenshot • Input clip • Multimedia player • Time information and indexing

  29. Speaker Identification • Training stage: preprocessed speaker clips → GMM model training → speaker models database • Testing stage: sound source → speaker comparison mechanism against the models database → speaker identity

  30. Usage • Security authentication • Speaker identification for telephone-based systems • Criminal investigation (used similarly to a fingerprint)

  31. Screenshot • Input source • Flexible-length comparison • Media player for visual verification • Speaker identity

  32. Telephony Message Notification • While the user can't listen, the system records the calling party's message • AdvAIR segmentation and GMM model comparison against the model database sort the caller into the desired or non-desired group • For the desired group, the Messaging API notifies the user through the Short Message System or the e-mail system

  33. Experiment Results

  34. Threshold-free BIC criterion • Background noise affects accuracy

  35. Enhanced Implementation • The speed enhancement is determined by the number of change points relative to the stream length

  36. GMM model closed-set speaker identification • Training stage: 10 speakers (5 males, 5 females), 20 minutes for each speaker • Testing stage: 50 sound clips with 5-second duration • 48 sound clips identified correctly, i.e. 96%

  37. GMM model open-set speaker identification • Accept or Reject as the result • Same setting as the closed-set experiment, i.e. 10 speakers with 20 minutes each • Correct: 45/50 = 90% • False reject: 3/50 = 6% • False accept: 2/50 = 4%
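The accept-or-reject decision can be sketched as a score threshold on the best-matching model; the speaker names, scores and threshold below are made-up illustrative values:

```python
def open_set_decide(scores, threshold):
    """Pick the best-scoring enrolled speaker, but reject the claim when
    even the best score falls below a tuned threshold (open-set decision).
    scores: {speaker name: average log-likelihood under that speaker's GMM}."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "reject"

# All names, scores and the threshold are hypothetical.
known = open_set_decide({"speaker_a": -41.2, "speaker_b": -48.7}, -45.0)
unknown = open_set_decide({"speaker_a": -50.1, "speaker_b": -52.3}, -45.0)
```

Tuning the threshold trades false accepts against false rejects, which is why short test clips (noisier scores) hurt open-set accuracy.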

  38. Problems and Limitation

  39. Problems and limitations • Accuracy is affected by background noise • Some speakers have very similar voice features • The open-set speaker identification decision function is not very accurate if the duration is short • Segmentation is still a time-consuming process

  40. Future Work • Speaker gender identification • Robust open-set speaker identification • Speech content recognition • Music pattern matching • Distributed system for segmentation

  41. Q & A
