This research explores the problem of detecting and localizing multiple audio sources within spontaneous multi-party conversations. It leverages a sector-based approach to analyze audio recordings for speaker tracking, meeting annotation, and surveillance applications. The study details methods for discretizing signals into time frames and frequency bins, highlights experimental results with multiple loudspeakers and human speakers, and discusses the implications for effective real-time analysis. The approach aims to enhance frame-level localization and detection despite challenges posed by overlapping speech.
Multiple Audio Sources Detection and Localization Guillaume Lathoud, IDIAP Supervised by Dr Iain McCowan, IDIAP
Outline • Context and problem. • Approach. • Discretize: ( sector, time frame, frequency bin ). • Example. • Experiments. • Multiple loudspeakers. • Multiple humans. • Conclusion.
Context • Automatic analysis of recordings: • Meeting annotation. • Speaker tracking for speech acquisition. • Surveillance applications. • Questions to answer: • Who? What? Where? When? • Location can be used for very precise segmentation.
Why Multiple Sources? • Spontaneous multi-party speech: • Short. • Sporadic. • Overlaps. • Problem: frame-level multisource localization and detection. One frame = 16 ms. • Many localization methods exist… But: • Speech is wideband. • Detection issue: how many sources?
Sector-based Approach Question: is there at least one active source in a given sector? Answer it for each frequency bin separately
Frame-level Analysis • One time frame every 16 ms. • Discretize both space and frequency. • Sparsity assumption [Roweis 03]. [Figure: grid with sectors of space (s) on one axis and frequency bins (f) on the other; example values shown: 0, 9, 2, 0, 10, 0, 1.]
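The discretization into time frames and frequency bins can be sketched as follows. This is a minimal illustration, not the author's implementation: the 16 kHz sample rate, the non-overlapping frames, and the function names `frames` and `dft_bins` are all assumptions made for the example.

```python
import cmath
import math

def frames(signal, sample_rate=16000, frame_ms=16):
    """Split a 1-D signal into consecutive, non-overlapping frames.

    Assumption: 16 kHz sampling, so one 16 ms frame holds 256 samples.
    """
    n = sample_rate * frame_ms // 1000
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]

def dft_bins(frame):
    """Naive DFT of one frame; returns complex values for bins 0..N/2."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]

# Usage: a 1 kHz tone, one second long.
tone = [math.sin(2 * math.pi * 1000 * t / 16000) for t in range(16000)]
first_frame = frames(tone)[0]
spectrum = dft_bins(first_frame)
```

With these assumed settings each bin is 62.5 Hz wide, so each (frame, bin) pair is one cell of the time-frequency grid on which the per-sector decision is made.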
Frequency Bin Analysis • Compute the phase between 2 microphones: $\theta(f) \in [-\pi, +\pi]$. • Repeat for all $P = M(M-1)/2$ pairs of the $M$ microphones: $\Theta(f) = [\theta_1(f) \ldots \theta_P(f)]$. • For each sector $s$, compare the measured phases $\Theta(f)$ with the centroid $\Phi_s$: pseudo-distance $d(\Theta(f), \Phi_s)$. • Apply the sparsity assumption: only the best-matching sector is considered active in this frequency bin. [Figure: for frequency bin f, one pseudo-distance per sector: $d(\Theta(f), \Phi_1)$, $d(\Theta(f), \Phi_2)$, $d(\Theta(f), \Phi_3)$, …, $d(\Theta(f), \Phi_7)$.]
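The per-bin steps above can be sketched in a few lines. This is an illustrative reading of the slide, not the author's code: the function names are hypothetical, the pair phase is taken as the phase of one microphone's DFT value times the conjugate of the other's, and the pseudo-distance is the sum of squared half-angle sines given in the backup slide on the pseudo-distance.

```python
import cmath
import math
from itertools import combinations

def pair_phases(spectra_at_f):
    """Inter-microphone phases at one frequency bin.

    spectra_at_f: one complex DFT value per microphone (M values).
    Returns P = M(M-1)/2 phases, each in [-pi, +pi].
    """
    return [cmath.phase(a * b.conjugate()) for a, b in combinations(spectra_at_f, 2)]

def pseudo_distance(theta, centroid):
    """Sum over pairs of sin^2 of half the phase difference."""
    return sum(math.sin((t - c) / 2) ** 2 for t, c in zip(theta, centroid))

def best_sector(theta, centroids):
    """Sparsity assumption: keep only the sector with the smallest pseudo-distance."""
    dists = [pseudo_distance(theta, c) for c in centroids]
    return min(range(len(dists)), key=dists.__getitem__)

# Usage with 3 microphones (P = 3 pairs) and 3 made-up sector centroids.
mics = [cmath.exp(0j), cmath.exp(0.5j), cmath.exp(1.2j)]
theta = pair_phases(mics)
centroids = [[0.0, 0.0, 0.0], [-0.5, -1.2, -0.7], [1.0, 1.0, 1.0]]
winner = best_sector(theta, centroids)
```

Using the sine of the half-difference makes the comparison insensitive to 2π wrap-around, which a plain Euclidean distance between phase vectors would not be.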
Real Data: Single Speaker • Without sparsity assumption [SAPA 04], similar to [ICASSP 01]. • With sparsity assumption (this work).
Task 2: Multiple Loudspeakers 2 loudspeakers simultaneously active
Real Data: Multiple Loudspeakers 2 loudspeakers simultaneously active
Real Data: Multiple Loudspeakers 3 loudspeakers simultaneously active
Real data: Humans 2 speakers simultaneously active (includes short silences)
Real data: Humans 3 speakers simultaneously active (includes short silences)
Conclusion • Sector-based approach. • Localization and detection. • Effective on real multispeaker data. • Current work: • Optimize centroids. • Multi-level implementation. • Compare multilevel with existing methods. • Possible integration with Daimler.
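Pseudo-distance • Measured phases $\Theta(f) = [\theta_1(f) \ldots \theta_P(f)] \in [-\pi, +\pi]^P$. • For each sector $s$, a centroid $\Phi_s = [\Phi_{s,1} \ldots \Phi_{s,P}]$. • $d(\Theta(f), \Phi_s) = \sum_p \sin^2\left( (\theta_p(f) - \Phi_{s,p}) / 2 \right)$ • Since $\cos(x) = 1 - 2 \sin^2(x/2)$: argmax of beamformed energy = argmin of $d$.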
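The link between beamformed energy and the pseudo-distance can be written out in a few lines. The derivation below is a sketch under one assumption that the slide does not state: the microphone signals are magnitude-normalized (phase-only), so that $X_m(f) = e^{j\alpha_m}$. The symbols $\alpha_m$ and $\beta_{s,m}$ (the steering phase of sector $s$ at microphone $m$) are introduced here for illustration only.

```latex
% Assumption: phase-only signals X_m = e^{j\alpha_m}; steering phases \beta_{s,m}.
% For the pair p = (a, b), a < b:
%   \theta_p = \alpha_a - \alpha_b,   \Phi_{s,p} = \beta_{s,a} - \beta_{s,b}.
\begin{align*}
E_s(f) &= \Bigl|\sum_{m=1}^{M} e^{j(\alpha_m - \beta_{s,m})}\Bigr|^2
        = M + 2\sum_{p=1}^{P} \cos\bigl(\theta_p(f) - \Phi_{s,p}\bigr) \\
       &= M + 2\sum_{p=1}^{P} \Bigl[1 - 2\sin^2\Bigl(\frac{\theta_p(f) - \Phi_{s,p}}{2}\Bigr)\Bigr] \\
       &= M + 2P - 4\,d\bigl(\Theta(f), \Phi_s\bigr)
        = M^2 - 4\,d\bigl(\Theta(f), \Phi_s\bigr)
\end{align*}
```

Because $M^2$ does not depend on the sector, the sector that maximizes $E_s(f)$ is the one that minimizes $d$, under the stated normalization.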
Delay-sum vs Proposed (1/3) With delay-sum centroids (this work) With optimized centroids (this work)
Delay-sum vs Proposed (2/3) 2 loudspeakers simultaneously active 3 loudspeakers simultaneously active
Delay-sum vs Proposed (3/3) 2 humans simultaneously active 3 humans simultaneously active
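The slides do not define how a delay-sum centroid is obtained. One plausible reading, shown here purely as an assumption, is to take the theoretical inter-microphone phases of a single reference point in the sector under free-field propagation. The function name, the 343 m/s speed of sound, and the sign convention (phase of microphone a relative to microphone b) are all choices made for this sketch.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def delay_sum_centroid(point, mics, freq_hz):
    """Theoretical pair phases for a source at `point` (hypothetical helper).

    point: source coordinates; mics: list of microphone coordinates.
    Returns one phase per microphone pair (a < b), wrapped to [-pi, +pi].
    """
    dist = [math.dist(point, m) for m in mics]
    phases = []
    for a in range(len(mics)):
        for b in range(a + 1, len(mics)):
            # A longer path to mic a means a later arrival, i.e. a more negative phase.
            tau = (dist[b] - dist[a]) / SPEED_OF_SOUND
            phi = 2 * math.pi * freq_hz * tau
            phases.append(math.atan2(math.sin(phi), math.cos(phi)))  # wrap
    return phases

# Usage: two microphones 20 cm apart, source on the array axis, 500 Hz.
mics = [(-0.1, 0.0), (0.1, 0.0)]
centroid = delay_sum_centroid((1.0, 0.0), mics, 500.0)
```

The optimized centroids compared in these slides would replace this single-point value with one tuned over the whole sector; how that optimization is done is not described in the source.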