Multiple Audio Sources Detection and Localization

Guillaume Lathoud, IDIAP. Supervised by Dr Iain McCowan, IDIAP.

Outline • Context and problem. • Approach. • Discretize: (sector, time frame, frequency bin). • Example. • Experiments. • Multiple loudspeakers. • Multiple humans. • Conclusion.

Context • Automatic analysis of recordings: • Meeting annotation. • Speaker tracking for speech acquisition. • Surveillance applications. • Questions to answer: • Who? What? Where? When? • Location can be used for very precise segmentation.

Microphone Array

Why Multiple Sources? • Spontaneous multi-party speech: • Short. • Sporadic. • Overlaps. • Problem: frame-level multisource localization and detection. One frame = 16 ms. • Many localization methods exist… But: • Speech is wideband. • Detection issue: how many sources?

Sector-based Approach • Question: is there at least one active source in a given sector? • Answer it for each frequency bin separately.

Frame-level Analysis • One time frame every 16 ms. • Discretize both space and frequency. • Sparsity assumption [Roweis 03]. [Figure: grid with sectors of space s on one axis and frequency bins f on the other; values shown in the figure: 0, 9, 2, 0, 10, 0, 1.]
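The time and frequency discretization on this slide can be sketched as follows. This is only an illustration: the 16 kHz sampling rate, the non-overlapping frames, the Hann window and the function name `stft_frames` are assumptions, since the slides state only that there is one frame every 16 ms.

```python
import numpy as np

def stft_frames(x, fs=16000, frame_ms=16.0):
    """Cut a multichannel signal into frames and take one FFT per frame.

    x: array of shape (M, N), M microphones and N samples (1-D input is
       treated as a single microphone).
    Returns a complex array of shape (M, T, F): microphone, time frame,
    frequency bin.
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    # Assumption: frame length equals the 16 ms hop (256 samples at 16 kHz).
    frame_len = int(round(fs * frame_ms / 1000.0))
    n_frames = x.shape[1] // frame_len
    # Drop the incomplete last frame, then reshape to (M, T, frame_len).
    frames = x[:, :n_frames * frame_len].reshape(x.shape[0], n_frames, frame_len)
    window = np.hanning(frame_len)  # assumed window, not from the slides
    return np.fft.rfft(frames * window, axis=-1)
```

Each cell of the (sector, time frame, frequency bin) grid is then filled by analysing one frequency bin of one frame across all microphones.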

Frequency Bin Analysis • Compute the phase between 2 microphones: $\theta(f) \in [-\pi, +\pi]$. • Repeat for all $P = M(M-1)/2$ pairs of the $M$ microphones: $\Theta(f) = [\theta_1(f), \ldots, \theta_P(f)]$. • For each sector $s$, compare the measured phases $\Theta(f)$ with the centroid $\Phi_s$: pseudo-distance $d(\Theta(f), \Phi_s)$. • Apply the sparsity assumption: • Only the best sector (smallest pseudo-distance) is active. [Figure: for one frequency bin $f$, the pseudo-distances $d(\Theta(f), \Phi_1), \ldots, d(\Theta(f), \Phi_7)$, one per sector.]
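A minimal sketch of these steps for a single time frame, assuming the centroids are already available as an array of phases per sector, pair and frequency bin. The array layout and the function names `pair_phases`, `pseudo_distance` and `sparse_sector_counts` are hypothetical; how the centroids are obtained is not covered here.

```python
import itertools
import numpy as np

def pair_phases(X):
    """X: complex spectra of one frame, shape (M, F).

    Returns the phase differences of all P = M(M-1)/2 microphone pairs,
    shape (P, F), each value in [-pi, +pi].
    """
    pairs = list(itertools.combinations(range(X.shape[0]), 2))
    return np.stack([np.angle(X[i] * np.conj(X[j])) for i, j in pairs])

def pseudo_distance(theta, centroids):
    """theta: (P, F) measured phases; centroids: (S, P, F), assumed layout.

    Returns d of shape (S, F): sum over pairs of sin^2(phase error / 2).
    """
    return np.sum(np.sin((theta[None] - centroids) / 2.0) ** 2, axis=1)

def sparse_sector_counts(theta, centroids):
    """Sparsity assumption: each frequency bin goes to its best sector only."""
    d = pseudo_distance(theta, centroids)
    best = np.argmin(d, axis=0)  # best sector index per frequency bin
    counts = np.bincount(best, minlength=centroids.shape[0])
    return best, counts
```

Here `counts[s]` is the number of frequency bins attributed to sector `s` in the frame. Turning that count into an active or inactive decision needs a threshold, which this sketch leaves out.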

Real Data: Single Speaker • Without sparsity assumption [SAPA 04], similar to [ICASSP 01]. • With sparsity assumption (this work).

Real Data: Multiple Loudspeakers

Task 2: Multiple Loudspeakers • 2 loudspeakers simultaneously active.

Real Data: Multiple Loudspeakers • 2 loudspeakers simultaneously active.

Real Data: Multiple Loudspeakers • 3 loudspeakers simultaneously active.

Real data: Humans

Real data: Humans • 2 speakers simultaneously active (includes short silences).

Real data: Humans • 3 speakers simultaneously active (includes short silences).

Conclusion • Sector-based approach. • Localization and detection. • Effective on real multispeaker data. • Current work: • Optimize centroids. • Multi-level implementation. • Compare the multi-level implementation with existing methods. • Possible integration with Daimler.

Thank you!

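Pseudo-distance • Measured phases $\Theta(f) = [\theta_1(f), \ldots, \theta_P(f)] \in [-\pi, +\pi]^P$. • For each sector $s$, a centroid $\Phi_s = [\Phi_{s,1}, \ldots, \Phi_{s,P}]$. • $d(\Theta(f), \Phi_s) = \sum_p \sin^2\left( (\theta_p(f) - \Phi_{s,p}) / 2 \right)$. • Since $\cos(x) = 1 - 2 \sin^2(x/2)$, argmax of beamformed energy = argmin of $d$.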
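The link between beamformed energy and the pseudo-distance can be checked numerically. The sketch below assumes phase-only (unit-magnitude) spectra at each of the M microphones, which the slide does not state. Under that assumption the delay-sum energy is M + 2 Σ_p cos(θ_p − Φ_p), and the cosine identity turns it into M + 2P − 4d, so the steering that maximizes the energy is the one that minimizes d. All function names are illustrative.

```python
import itertools
import numpy as np

def pseudo_distance(theta, phi):
    """Sum over microphone pairs of sin^2 of half the phase error."""
    return float(np.sum(np.sin((theta - phi) / 2.0) ** 2))

def phase_only_delay_sum_energy(alpha, beta):
    """Energy |sum_m exp(j(alpha_m - beta_m))|^2 for one frequency bin.

    alpha: measured phase at each microphone.
    beta:  steering phase at each microphone for the candidate location.
    """
    return float(np.abs(np.sum(np.exp(1j * (alpha - beta)))) ** 2)

def energy_and_distance(alpha, beta):
    """Return (energy, d, P) for the same measured and steering phases."""
    pairs = list(itertools.combinations(range(len(alpha)), 2))
    theta = np.array([alpha[i] - alpha[j] for i, j in pairs])  # measured pair phases
    phi = np.array([beta[i] - beta[j] for i, j in pairs])      # centroid pair phases
    return (phase_only_delay_sum_energy(alpha, beta),
            pseudo_distance(theta, phi),
            len(pairs))

# Example with M = 8 microphones and random phases.
rng = np.random.default_rng(0)
M = 8
alpha = rng.uniform(-np.pi, np.pi, M)
beta = rng.uniform(-np.pi, np.pi, M)
energy, d, P = energy_and_distance(alpha, beta)
print(np.isclose(energy, M + 2 * P - 4 * d))
```

The pair phases are not wrapped to [-π, +π] here because sin² of a half-angle is unchanged by adding 2π to the angle.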

Delay-sum vs Proposed (1/3) • With delay-sum centroids (this work). • With optimized centroids (this work).

Delay-sum vs Proposed (2/3) • 2 loudspeakers simultaneously active. • 3 loudspeakers simultaneously active.

Delay-sum vs Proposed (3/3) • 2 humans simultaneously active. • 3 humans simultaneously active.

Energy and Localization
