
Multiple Audio Sources Detection and Localization


Presentation Transcript


  1. Multiple Audio Sources Detection and Localization Guillaume Lathoud, IDIAP Supervised by Dr Iain McCowan, IDIAP

  2. Outline • Context and problem. • Approach. • Discretize: ( sector, time frame, frequency bin ). • Example. • Experiments. • Multiple loudspeakers. • Multiple humans. • Conclusion.

  3. Context • Automatic analysis of recordings: • Meeting annotation. • Speaker tracking for speech acquisition. • Surveillance applications.

  4. Context • Automatic analysis of recordings: • Meeting annotation. • Speaker tracking for speech acquisition. • Surveillance applications. • Questions to answer: • Who? What? Where? When? • Location can be used for very precise segmentation.

  5. Microphone Array

  6. Why Multiple Sources? • Spontaneous multi-party speech: • Short. • Sporadic. • Overlaps.

  7. Why Multiple Sources? • Spontaneous multi-party speech: • Short. • Sporadic. • Overlaps. • Problem: frame-level multisource localization and detection. One frame = 16 ms.

  8. Why Multiple Sources? • Spontaneous multi-party speech: • Short. • Sporadic. • Overlaps. • Problem: frame-level multisource localization and detection. One frame = 16 ms. • Many localization methods exist… But: • Speech is wideband. • Detection issue: how many?

  9. Outline • Context and problem. • Approach. • Discretize: ( sector, time frame, frequency bin ). • Example. • Experiments. • Multiple loudspeakers. • Multiple humans. • Conclusion.

  10. Sector-based Approach Question: is there at least one active source in a given sector?

  11. Sector-based Approach Question: is there at least one active source in a given sector? → Answer it for each frequency bin separately.

  12. Frame-level Analysis • One time frame every 16 ms. • Discretize both space and frequency. [Figure: grid with axes s = sector of space, f = frequency bin.]

  13. Frame-level Analysis • One time frame every 16 ms. • Discretize both space and frequency. • Sparsity assumption [Roweis 03]. [Figure: grid with axes s = sector of space, f = frequency bin.]

  14. Frame-level Analysis • One time frame every 16 ms. • Discretize both space and frequency. • Sparsity assumption [Roweis 03]. [Figure: sector × frequency grid (s = sector of space, f = frequency bin); example per-sector values 0, 9, 2, 0, 10, 0, 1.]

  15. Frame-level Analysis • One time frame every 16 ms. • Discretize both space and frequency. • Sparsity assumption [Roweis 03]. [Figure: sector × frequency grid (s = sector of space, f = frequency bin); example per-sector values 0, 9, 2, 0, 10, 0, 1.]
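
A minimal sketch of the time-frequency discretization used on slides 12-15, in Python. The 16 ms frame length comes from the slides; the 16 kHz sample rate, the Hann window and the non-overlapping framing are assumptions, and all names are illustrative:

    import numpy as np

    FS = 16000                    # assumed sample rate (Hz)
    FRAME_LEN = int(0.016 * FS)   # one frame = 16 ms -> 256 samples at 16 kHz

    def stft_frames(x):
        """Cut a 1-D signal into 16 ms frames and return one spectrum per frame.

        Output shape: (num_frames, num_bins). Each (frame, bin) cell is one
        time-frequency point; under the sparsity assumption, each such point
        is later attributed to at most one sector of space.
        """
        num_frames = len(x) // FRAME_LEN
        frames = x[:num_frames * FRAME_LEN].reshape(num_frames, FRAME_LEN)
        window = np.hanning(FRAME_LEN)
        return np.fft.rfft(frames * window, axis=1)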

  16. Frequency Bin Analysis • Compute the phase difference between 2 microphones: θ(f) in [-π, +π]. • Repeat for all P microphone pairs: Θ(f) = [θ1(f) … θP(f)], with P = M(M-1)/2.

  17. Frequency Bin Analysis • Compute the phase difference between 2 microphones: θ(f) in [-π, +π]. • Repeat for all P microphone pairs: Θ(f) = [θ1(f) … θP(f)], with P = M(M-1)/2. • For each sector s, compare the measured phases Θ(f) with the centroid Φs: pseudo-distance d( Θ(f), Φs ). [Figure: pseudo-distances d( Θ(f), Φ1 ) … d( Θ(f), Φ7 ) plotted over sector and frequency bin f.]

  18. Frequency Bin Analysis • Compute the phase difference between 2 microphones: θ(f) in [-π, +π]. • Repeat for all P microphone pairs: Θ(f) = [θ1(f) … θP(f)], with P = M(M-1)/2. • For each sector s, compare the measured phases Θ(f) with the centroid Φs: pseudo-distance d( Θ(f), Φs ). • Apply the sparsity assumption: only the best sector is active in each bin.
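
Putting slides 16-18 together: for one frame, measure the phase differences of all P = M(M-1)/2 microphone pairs at each frequency bin, compute the pseudo-distance to every sector centroid Φs, and give each bin to the closest sector. A minimal sketch, assuming the per-microphone spectra of one frame and the precomputed centroids are given as arrays; how the centroids are built (e.g. from delay-sum steering delays) is not shown here, and the array layout is an assumption:

    import itertools
    import numpy as np

    def assign_bins_to_sectors(spectra, centroids):
        """spectra: complex array (M, F), one spectrum per microphone for one frame.
        centroids: real array (S, P, F) of per-sector phase centroids Phi_s,
        pairs ordered like itertools.combinations(range(M), 2), P = M(M-1)/2.
        Returns the index of the winning sector for each of the F bins."""
        M, F = spectra.shape
        pairs = list(itertools.combinations(range(M), 2))          # P pairs

        # Phase difference theta_p(f) in [-pi, +pi] for every pair and bin.
        theta = np.stack([np.angle(spectra[i] * np.conj(spectra[j]))
                          for (i, j) in pairs])                    # (P, F)

        # Pseudo-distance d(Theta(f), Phi_s) = sum_p sin^2((theta_p(f) - Phi_s,p) / 2).
        d = np.sum(np.sin((theta[None, :, :] - centroids) / 2.0) ** 2, axis=1)  # (S, F)

        # Sparsity assumption: each frequency bin goes to the single best sector.
        return np.argmin(d, axis=0)                                # (F,)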

  19. Outline • Context and problem. • Approach. • Discretize: ( sector, time frame, frequency bin ). • Example. • Experiments. • Multiple loudspeakers. • Multiple humans. • Conclusion.

  20. Real Data: Single Speaker. Without sparsity assumption [SAPA 04], similar to [ICASSP 01].

  21. Real Data: Single Speaker. Without sparsity assumption [SAPA 04], similar to [ICASSP 01]. With sparsity assumption (this work).

  22. Outline • Context and problem. • Approach. • Discretize: ( sector, time frame, frequency bin ). • Example. • Experiments. • Multiple loudspeakers. • Multiple humans. • Conclusion.

  23. Real Data: Multiple Loudspeakers

  24. Task 2: Multiple Loudspeakers 2 loudspeakers simultaneously active

  25. Real Data: Multiple Loudspeakers 2 loudspeakers simultaneously active

  26. Real Data: Multiple Loudspeakers 3 loudspeakers simultaneously active

  27. Outline • Context and problem. • Approach. • Discretize: ( sector, time frame, frequency bin ). • Example. • Experiments. • Multiple loudspeakers. • Multiple humans. • Conclusion.

  28. Real data: Humans

  29. Real data: Humans 2 speakers simultaneously active (includes short silences)

  30. Real data: Humans 3 speakers simultaneously active (includes short silences)

  31. Conclusion • Sector-based approach. • Localization and detection. • Effective on real multispeaker data.

  32. Conclusion • Sector-based approach. • Localization and detection. • Effective on real multispeaker data. • Current work: • Optimize centroids. • Multi-level implementation. • Compare multilevel with existing methods.

  33. Conclusion • Sector-based approach. • Localization and detection. • Effective on real multispeaker data. • Current work: • Optimize centroids. • Multi-level implementation. • Compare multilevel with existing methods. • Possible integration with Daimler.

  34. Thank you!

  35. Pseudo-distance • Measured phases Θ(f) = [θ1(f) … θP(f)] in [-π, +π]^P. • For each sector s, a centroid Φs = [Φs,1 … Φs,P]. • d( Θ(f), Φs ) = Σp sin²( (θp(f) − Φs,p) / 2 ). • cos(x) = 1 − 2 sin²( x / 2 ) ⇒ argmax beamformed energy = argmin d.
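
A short derivation of the last bullet, assuming a PHAT-style (magnitude-normalized) weighting so that the sector-dependent part of the steered delay-sum energy at bin f is a sum of cosines of the phase mismatches:

    \sum_{p=1}^{P} \cos\big(\theta_p(f) - \Phi_{s,p}\big)
      = \sum_{p=1}^{P} \Big( 1 - 2\sin^2\tfrac{\theta_p(f) - \Phi_{s,p}}{2} \Big)
      = P - 2\, d\big(\Theta(f), \Phi_s\big)

Since P does not depend on the sector, maximizing this energy over s is the same as minimizing d( Θ(f), Φs ).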

  36. Delay-sum vs Proposed (1/3). With delay-sum centroids (this work); with optimized centroids (this work).

  37. Delay-sum vs Proposed (2/3). 2 loudspeakers simultaneously active; 3 loudspeakers simultaneously active.

  38. Delay-sum vs Proposed (3/3). 2 humans simultaneously active; 3 humans simultaneously active.

  39. Energy and Localization
