Spatial vs. Blind Approaches for Speaker Separation: Structural Differences and Beyond

Julien Bourgeois, RIC/AD

Problem Context
Several simultaneous speakers (sources) s1(t), s2(t) are spatially located. Road noise is spatially diffuse. A microphone array gets mixtures of the sources and noise, x1(t) ... x4(t). The array processor must recover clean individual speech flows: separate and denoise the sources.

“Spatial” vs. “Statistical” Techniques
Spatial: Min Power Filter - “Cocooning”. Statistical: Min Dependence Filters.

Spatial technique (Beamforming)
[Diagram: sources s1 and s2, unknown mixing filters h1 (weak) and h2, microphone signals x1 (signal ref) and x2 (noise ref), adaptive filter w2, output y1.]
High cross-talk levels: cancellation of the target signal (leakage). Solution: Voice Activity Detector.
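The leakage effect can be illustrated with a minimal sketch. It assumes an instantaneous (scalar) mixing model x1 = s1 + h2*s2, x2 = h1*s1 + s2 and an output y1 = x1 + w2*x2; this model, the function names, and the white Gaussian stand-ins for speech are illustrative assumptions, not taken from the slides. The filter w2 is chosen to minimize the output power while both sources are active (no Voice Activity Detector).

```python
import numpy as np

def min_power_w2(x1, x2):
    """Least-squares w2 that minimizes the mean power of y1 = x1 + w2 * x2."""
    return -np.dot(x1, x2) / np.dot(x2, x2)

def beamformer_gains(h1, h2, n=200_000, seed=0):
    """Return (w2, gain on target s1, gain on interferer s2) for the assumed model."""
    rng = np.random.default_rng(seed)
    s1 = rng.standard_normal(n)   # target (hypothetical white Gaussian stand-in)
    s2 = rng.standard_normal(n)   # interferer
    x1 = s1 + h2 * s2             # signal reference
    x2 = h1 * s1 + s2             # noise reference; h1 is the cross-talk of the target
    w2 = min_power_w2(x1, x2)
    # y1 = (1 + w2*h1) * s1 + (h2 + w2) * s2
    return w2, 1 + w2 * h1, h2 + w2

print(beamformer_gains(h1=0.0, h2=0.5))  # no cross-talk: w2 close to -h2, target untouched
print(beamformer_gains(h1=0.5, h2=0.5))  # strong cross-talk: target gain drops below 1
```

Under these assumptions, with h1 = 0 the minimum-power filter converges to w2 = -h2 and removes the interferer, whereas with a large h1 the same criterion trades part of the target away, which is the leakage the slide refers to.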

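Blind Source Separation (BSS)
[Diagram: sources s1 and s2, unknown mixing filters h1 and h2, microphone signals x1 and x2, adaptive filters w1 and w2, outputs y1 and y2, with a dependence measure on the outputs.]
Sources are assumed to be independent. w1 and w2 are jointly optimized such that the outputs are independent.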

BSS - Second Order Criteria
There are plenty of independence measures... We choose a decorrelation criterion. Other separation criteria include Higher Order Statistics, which are difficult to estimate. Second Order Statistics are easier to estimate...

BSS - Second Order Criteria
... but they do not determine w1 and w2 uniquely. Specifically, the decorrelators form a set (hyperbolas), and not all of them are separators. We need more information. Non-stationary sources give “non-stationary hyperbolas”, which intersect at the solution.
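Why decorrelators form hyperbolas can be sketched as follows. The derivation assumes an instantaneous (scalar) mixing model and uncorrelated sources with powers \sigma_1^2 and \sigma_2^2; the slide's own equations did not survive extraction, so this is a reconstruction under those assumptions rather than the author's exact formulation.

```latex
\begin{align}
x_1 &= s_1 + h_2 s_2, \qquad x_2 = h_1 s_1 + s_2, \\
y_1 &= x_1 + w_2 x_2 = (1 + w_2 h_1)\, s_1 + (h_2 + w_2)\, s_2, \\
y_2 &= x_2 + w_1 x_1 = (h_1 + w_1)\, s_1 + (1 + w_1 h_2)\, s_2, \\
E[y_1 y_2] &= (1 + w_2 h_1)(h_1 + w_1)\,\sigma_1^2
            + (h_2 + w_2)(1 + w_1 h_2)\,\sigma_2^2 = 0 .
\end{align}
```

The last line is bilinear in (w_1, w_2), so for fixed source powers its zero set is a hyperbola in the (w_1, w_2) plane. Two points lie on it whatever the powers are: (w_1, w_2) = -(h_1, h_2), where the factors (h_1 + w_1) and (h_2 + w_2) vanish, and (w_1, w_2) = -(1/h_2, 1/h_1), where (1 + w_1 h_2) and (1 + w_2 h_1) vanish. Changing the ratio \sigma_1^2 / \sigma_2^2 moves the hyperbola but not these two points.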

BSS - Graphically...
[Plots: D2(t1), D2(t2), and D2(t1) + D2(t2).]
Non-stationary sources generate hyperbolas that intersect at the separation point -(h1, h2) and at -(1/h2, 1/h1).
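The intersection can be checked numerically with a small sketch. It assumes a scalar mixing model x1 = s1 + h2*s2, x2 = h1*s1 + s2 with outputs y1 = x1 + w2*x2 and y2 = x2 + w1*x1, for which the decorrelation condition E[y1*y2] = 0 reads A*w1*w2 + B*w1 + C*w2 + A = 0. The model, the function names, and the example source powers are illustrative assumptions.

```python
import numpy as np

def hyperbola_coeffs(h1, h2, p1, p2):
    """Coefficients (A, B, C) of A*w1*w2 + B*w1 + C*w2 + A = 0,
    the decorrelation condition for source powers p1 and p2."""
    A = h1 * p1 + h2 * p2
    B = p1 + h2**2 * p2
    C = h1**2 * p1 + p2
    return A, B, C

def intersect_decorrelators(h1, h2, powers_t1, powers_t2):
    """Intersect the decorrelation hyperbolas of two time blocks t1 and t2."""
    A, B, C = hyperbola_coeffs(h1, h2, *powers_t1)
    A2, B2, C2 = hyperbola_coeffs(h1, h2, *powers_t2)
    # Eliminating w2 = -(B*w1 + A) / (A*w1 + C) leaves a quadratic in w1.
    w1 = np.sort(np.roots([A * B2 - A2 * B, B2 * C - B * C2, A2 * C - A * C2]).real)
    w2 = -(B * w1 + A) / (A * w1 + C)
    return list(zip(w1, w2))

# Hypothetical example: source powers (p1, p2) differ between the two blocks.
points = intersect_decorrelators(0.3, 0.5, powers_t1=(1.0, 0.2), powers_t2=(0.3, 1.0))
print(points)  # compare with -(1/h2, 1/h1) and -(h1, h2)
```

If the two blocks have the same power ratio (stationary sources), the two hyperbolas coincide, the quadratic degenerates, and no isolated intersection exists, which is why non-stationarity is needed.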

Beamforming vs. BSS
Beamforming: weak cross-talk levels or Voice Activity Detector. Leakage problem. 1D search.
BSS: independence prior on (s1, s2). Permutation ambiguity. 2D search.
Asymptotic performance of BSS is more “robust” than that of beamforming.

Adaptive Behavior: Comparison Framework
[Diagram: s1 = 0, source s2, mixing filter h2, microphone signals x1 and x2, adaptive filter w2, outputs y1 and y2.]
Comparison framework: only one source, s2, stationary Gaussian; s1 = h1 = 0 (no leakage). This avoids structural differences between the two criteria. Both criteria are minimized with a STOCHASTIC gradient descent. Q: How well is this gradient estimated with finite-length signals?

Estimation Error on the Gradient
At the starting point w2 = 0, numerical evaluation of the variance of the estimation error.
[Plot panels: BSS, Beamforming.]
BSS converges more slowly because its gradient is more “random”. In noisy conditions, BSS does not bring any gain if the cross-talk is below a certain threshold. This threshold is smaller for MV (beamforming).
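A rough Monte Carlo sketch of this comparison is given below. It assumes a noise-free scalar model with s1 = h1 = 0, so x1 = h2*s2 and x2 = s2, a power criterion E[y1^2] for beamforming, and a squared cross-correlation (E[y1*y2])^2 as the BSS criterion. The exact criterion D2 used in the slides is not recoverable from the text, so the BSS criterion, the function name, and all parameter values are assumptions.

```python
import numpy as np

def normalized_gradient_variance(h2=0.5, n=100, trials=5000, seed=0):
    """Variance / mean^2 of the finite-sample gradient estimates at w2 = 0.

    Returns (beamforming, bss) for blocks of n samples.
    """
    rng = np.random.default_rng(seed)
    s2 = rng.standard_normal((trials, n))   # single stationary Gaussian source
    x1, x2 = h2 * s2, s2                    # s1 = h1 = 0 (no leakage)
    y1, y2 = x1, x2                         # outputs at the starting point w1 = w2 = 0
    # Beamforming (MV): J = E[y1^2], dJ/dw2 = 2 E[y1 x2]
    g_mv = 2 * np.mean(y1 * x2, axis=1)
    # Assumed BSS criterion: D = (E[y1 y2])^2, dD/dw2 = 2 E[y1 y2] E[x2 y2]
    g_bss = 2 * np.mean(y1 * y2, axis=1) * np.mean(x2 * y2, axis=1)

    def norm_var(g):
        return np.var(g) / np.mean(g) ** 2

    return norm_var(g_mv), norm_var(g_bss)

v_mv, v_bss = normalized_gradient_variance()
print(v_mv, v_bss)
```

Under these assumptions the beamforming gradient is a single sample average while the BSS gradient is a product of two, so its normalized variance comes out several times larger for the same block length. This is consistent with the slide's remark that the BSS gradient is more “random”, but it does not reproduce the author's numerical evaluation.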

Conclusion
Beamforming is based on the power minimization principle. In practice, it needs weak cross-talk levels or a Voice Activity Detector (VAD). Asymptotic performance depends on the quality of the VAD. Robust stochastic behavior.
Blind Source Separation is based on the independence of the sources. Asymptotic performance: exact separation. Stochastic behavior: needs longer signals to estimate the gradient. Moreover, sources on a finite (short) time scale are not exactly independent.
Neither method can reduce diffuse background noise.
