
Data-Adaptive Source Separation for Audio Spatialization



  1. Data-Adaptive Source Separation for Audio Spatialization M. Tech. project presentation by Pradeep Gaddipati (08307029) Supervisors: Prof. Preeti Rao and Prof. V. Rajbabu

  2. Outline • Problem statement • Audio spatialization • Source separation • Data-adaptive TFR • Concentration measure (sparsity) • Reconstruction of the signal from the TFR • Performance evaluation • Data-adaptive TFR for sinusoid detection • Conclusions and future work

  3. Problem statement • Spatial audio – surround sound • commonly used in movies, gaming, etc. • creates a suspension of disbelief • applicable when the playback device is located at a considerable distance from the listener • Mobile phones • headphones – used for playback • spatial audio – ineffective over headphones • lacks body-reflection cues – leads to in-the-head localization • content cannot be re-recorded – hence the need for audio spatialization

  4. Audio spatialization • Audio spatialization – a spatial rendering technique for converting the available audio into the desired listening configuration • Analysis – separating the individual sources • Re-synthesis – re-creating the desired listener-end configuration

  5. Source separation [Figure: three sources mixed into stereo mixtures] • Source separation – obtaining estimates of the underlying sources from a set of sensor observations • Time-frequency transform • Source analysis – estimation of mixing parameters • Source synthesis – estimation of sources • Inverse time-frequency representation

  6. Mixing model • Anechoic mixing model • mixtures, xi • sources, sj • Under-determined (M < N) • M = number of mixtures • N = number of sources • Mixing parameters • attenuation parameters, aij • delay parameters, δij Figure: Anechoic mixing model – audio is observed at the microphones with differing intensities and arrival times (because of propagation delays) but with no reverberation. Source: P. O'Grady, B. Pearlmutter and S. Rickard, “Survey of sparse and non-sparse methods in source separation,” International Journal of Imaging Systems and Technology, 2005.
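The model equation itself was a figure in the original slides; the standard anechoic form it refers to, written in the slide's own notation, is:

```latex
x_i(t) \;=\; \sum_{j=1}^{N} a_{ij}\, s_j\!\left(t - \delta_{ij}\right),
\qquad i = 1,\dots,M, \quad M < N
```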

  7. Mixtures

  8. Time-frequency transform

  9. Source analysis (estimation of mixing parameters) • Time-frequency representation of mixtures • Requirement for source separation [1] • W-disjoint orthogonality
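The W-disjoint orthogonality (WDO) requirement of [1] states that at most one source is active in any time-frequency bin. With S_j denoting the TFR of source j, a standard statement is:

```latex
% WDO: sources do not overlap in the time-frequency plane
S_j(\tau,\omega)\, S_k(\tau,\omega) \;=\; 0
\qquad \forall\, j \neq k,\ \forall\, (\tau,\omega)
```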

  10. Source analysis (estimation of mixing parameters) • For every time-frequency bin • estimate the mixing parameters [1] • Create a 2-dimensional histogram • peaks indicate the mixing parameters
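The per-bin estimation and histogram step can be sketched in a few lines. A minimal Python sketch, assuming X1 and X2 hold the STFTs of the two stereo channels and omega the angular frequency of each STFT row (names hypothetical; the slides follow the formulation of [1]):

```python
import numpy as np

def duet_histogram(X1, X2, omega, n_bins=50):
    """Per-bin mixing-parameter estimates and their 2-D histogram.

    X1, X2 : complex STFTs of the two stereo channels (freq x time)
    omega  : angular frequency (rad/sample) of each STFT row; row 0 (DC) is skipped
    Peaks of the returned histogram indicate the mixing parameters (a_j, delta_j).
    """
    R = X2[1:, :] / (X1[1:, :] + 1e-12)          # inter-channel ratio per T-F bin
    a = np.abs(R)                                 # attenuation estimate per bin
    d = -np.angle(R) / omega[1:, None]            # delay estimate per bin (samples)
    w = (np.abs(X1[1:, :]) * np.abs(X2[1:, :])).ravel()  # energy weighting
    hist, a_edges, d_edges = np.histogram2d(a.ravel(), d.ravel(),
                                            bins=n_bins, weights=w)
    return hist, a_edges, d_edges
```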

  11. Source analysis (estimation of mixing parameters)

  12. Source synthesis (estimation of sources) [Figure: time-frequency masks applied to the mixture give the estimates of the three sources]

  13. Source synthesis (estimation of sources) [Figure: the mixture and the three estimated sources]

  14. Source synthesis (estimation of sources) • Source estimation techniques • degenerate unmixing technique (DUET) [1] • lq-basis pursuit (LQBP) [2] • delay and scale subtraction scoring (DASSS) [3]

  15. Source synthesis (DUET) • Every time-frequency bin of the mixture is assigned to one of the sources based on a distance measure
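A plausible form of this distance, following the maximum-likelihood mask construction in [1] (with (a_j, δ_j) the estimated parameters of source j): each bin (τ, ω) is assigned to the source minimizing

```latex
j^\ast(\tau,\omega) \;=\; \arg\min_{j}\;
\frac{\bigl|\, a_j e^{-i\omega\delta_j} X_1(\tau,\omega) - X_2(\tau,\omega) \,\bigr|^2}{1 + a_j^2}
```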

  16. Source synthesis (LQBP) • Relaxes the assumption of WDO – assumes at most ‘M’ sources present at each T-F bin • M = no. of mixtures, N = no. of sources, (M < N) • lq measure decides which ‘M’ sources are present
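In outline (a sketch following [2], with A(ω) the frequency-domain mixing matrix, not necessarily the paper's exact formulation): at each bin the source coefficients are chosen to minimize an lq measure subject to reproducing the mixture, which selects at most M active sources:

```latex
\hat{c}(\tau,\omega) \;=\; \arg\min_{c}\; \sum_{j=1}^{N} |c_j|^{q}
\quad \text{subject to} \quad A(\omega)\, c = x(\tau,\omega), \qquad 0 < q < 1
```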

  17. Source synthesis (DASSS) • Identifies which bins have only one dominant source • uses DUET for those bins • assumes at most ‘M’ sources present in the rest of the bins • an error threshold decides which ‘M’ sources are present

  18. Inverse time-frequency transform [Figure: stereo mixtures, with original and estimated waveforms for sources 1–3]

  19. Scope for improvement • Requirement for source separation • W-disjoint orthogonality (WDO) amongst the sources • The sparser the TFR of the mixtures [4] • the less the overlap amongst the sources (i.e. the higher the WDO) • the easier their separation

  20. Data-adaptive TFR • For music/speech signals • different components (harmonics/transients/modulations) occur at different time instants • the best window differs for different components • this suggests a data-dependent, time-varying window function to achieve high sparsity [6] • To obtain a sparser TFR of the mixture • use different analysis window lengths at different time instants – at each instant, the one that gives maximum sparsity (see the sketch below)
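A minimal Python sketch of the adaptation loop, under the settings quoted on the next slide (Hamming windows of 30/60/90 ms, 10 ms hop, kurtosis as the concentration measure); function names are illustrative, not from the project code:

```python
import numpy as np

FS = 22050                                    # sampling rate used in the dataset
HOP = int(0.010 * FS)                         # 10 ms hop
WIN_SIZES = [int(t * FS) for t in (0.030, 0.060, 0.090)]

def kurtosis(c):
    """Concentration of a coefficient vector: 4th moment over squared 2nd moment."""
    c = np.abs(c)
    return np.sum(c ** 4) / (np.sum(c ** 2) ** 2 + 1e-12)

def adaptive_window_sequence(x):
    """For each frame centre, pick the Hamming window length whose
    magnitude spectrum is the sparsest (maximum kurtosis)."""
    half = max(WIN_SIZES) // 2
    centres = range(half, len(x) - half, HOP)

    def frame_kurtosis(c, L):
        seg = x[c - L // 2 : c - L // 2 + L] * np.hamming(L)
        return kurtosis(np.fft.rfft(seg))

    return [max(WIN_SIZES, key=lambda L: frame_kurtosis(c, L)) for c in centres]
```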

  21. Data-adaptive TFR Data-adaptive time-frequency representation of a singing voice; window function = Hamming; window sizes = 30, 60 and 90 ms; hop size = 10 ms; concentration measure = kurtosis

  22. Sparsity measure (concentration measure) • What is sparsity? • a small number of coefficients contain a large proportion of the energy • Common sparsity measures [5] • Kurtosis • Gini index • Which sparsity measure to use for adaptation? • the one which shows the same trend as WDO as a function of analysis window size
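For completeness, a sketch of the Gini index in the same style, following the definition in Hurley and Rickard [5] (the kurtosis helper appears in the sketch above):

```python
import numpy as np

def gini_index(c):
    """Gini index of a coefficient vector: 0 = energy spread evenly,
    approaching 1 = energy concentrated in very few coefficients."""
    c = np.sort(np.abs(np.ravel(c)))           # sorted ascending
    n = c.size
    k = np.arange(1, n + 1)
    return 1.0 - 2.0 * np.sum((c / (c.sum() + 1e-12)) * ((n - k + 0.5) / n))
```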

  23. WDO and sparsity (some formulae) • W-disjoint orthogonality [4] • Kurtosis • Gini Index
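The formulae themselves did not survive the transcript (they were images). Plausible reconstructions from [4] and [5], with M_j the binary mask for source j, S_j its TFR, and Y_j the sum of the interfering sources:

```latex
% WDO measure for source j (Rickard [4]): preserved-signal ratio minus leakage
\mathrm{WDO}_j \;=\; \mathrm{PSR}_j - \frac{\mathrm{PSR}_j}{\mathrm{SIR}_j}
\;=\; \frac{\lVert M_j S_j\rVert^2 - \lVert M_j Y_j\rVert^2}{\lVert S_j\rVert^2},
\qquad Y_j = \textstyle\sum_{k\neq j} S_k

% Kurtosis of the TFR coefficients c (computed frame-wise)
\kappa(c) \;=\; \frac{\sum_k |c_k|^4}{\bigl(\sum_k |c_k|^2\bigr)^2}

% Gini index (Hurley & Rickard [5]), coefficients sorted ascending
G(c) \;=\; 1 - 2\sum_{k=1}^{N} \frac{c_{(k)}}{\lVert c\rVert_1}
\left(\frac{N-k+\tfrac{1}{2}}{N}\right)
```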

  24. Dataset description • Dataset: BSS oracle • Sampling frequency: 22050 Hz • 10 sets each of music and speech signals • One set: 3 signals • Duration: 11 seconds

  25. WDO and sparsity • WDO vs. window size • obtain the TFRs of the sources in a set • obtain source-masks based on the magnitudes of the TFRs in each T-F bin • using the source-masks and the TFRs of the sources, obtain the WDO measure • NOTE: for the data-adaptive TFR, obtain the TFRs of the sources using the window sequence obtained from adapting on the mixture • Sparsity vs. window size • obtain the TFR of one channel of the mixture • calculate the frame-wise sparsity of that TFR

  26. WDO vs. window size

  27. Kurtosis vs. window size

  28. Gini Index vs. window size

  29. WDO and sparsity (observations) • The highest sparsity (kurtosis/Gini index) is obtained when the data-adaptive TFR is used • The highest WDO is obtained with the data-adaptive TFR (with kurtosis as the adaptation criterion) • Kurtosis is observed to follow a trend similar to that of WDO

  30. Inverse data-adaptive TFR • Constraint (introduced by source separation) • the TFR should be invertible • Solution • select analysis windows such that they satisfy the constant overlap-add (COLA) criterion [7] • Techniques • transition window • modified (extended) window
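For hop size H and a (possibly frame-varying) analysis window w_m on frame m, a standard statement of the COLA criterion is that the shifted windows must tile to a constant:

```latex
% Constant overlap-add: shifted analysis windows sum to a constant
\sum_{m} w_m(n - mH) \;=\; C \quad \text{for all } n
```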

  31. Transition window technique

  32. Modified window technique

  33. Problems with reconstruction • Transition window technique • adaptation is carried out only on alternate frames • the WDO obtained amongst the underlying sources is lower • Modified window technique • the extended window has larger side-lobes than a normal Hamming window • these spread the signal energy into neighbouring bins • the WDO measure decreases

  34. Dataset description • Dataset – BSS oracle • Mixtures per set: 72 = 24 × 3 • attenuation parameters (24 = 4P3) • {10°, 30°, 60°, 80°} • Delay parameters • {(0, 0, 0), (0, 1, 2), (0, 2, 1)} • A total of 720 (72 × 10) mixtures (test cases) for each of the music and speech groups

  35. Performance (mixing parameters)

  36. Performance (source estimation) • Evaluate the source-masks using one of the source estimation techniques (DUET or LQBP) • Using the set of estimated source-masks and the TFRs of the original sources, calculate the WDO measure of each source-mask • The WDO measure indicates how well the mask • preserves the source of interest • suppresses the interfering sources

  37. Performance (source estimation)

  38. Data-adaptive TFR (for sinusoid detection) Data-adaptive time-frequency representation of a singing voice; window function = Hamming; window sizes = 20, 40 and 60 ms; hop size = 10 ms; concentration measure = kurtosis; frequency range = 1000 to 3000 Hz

  39. Data-adaptive TFR (for sinusoid detection)

  40. Conclusions • Mixing model – anechoic • Kurtosis can be used as the adaptation criterion for the data-adaptive TFR • The data-adaptive TFR provides a higher WDO measure amongst the underlying sources than a fixed-window STFT • Better estimates of the mixing parameters and the sources are obtained using the data-adaptive TFR • The performance of DUET is better than that of LQBP

  41. Future work • Testing of the DASSS source estimation technique • Reconstruction of the signal from the TFR • Consideration of a more realistic mixing model that accounts for reverberation effects, such as an echoic mixing model

  42. Acknowledgments I would like to thank Nokia, India for providing financial support and technical inputs for the work reported here.

  43. References [1] A. Jourjine, S. Rickard and O. Yilmaz, “Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2000. [2] R. Saab, O. Yilmaz, M. J. McKeown and R. Abugharbieh, “Underdetermined anechoic blind source separation via lq basis pursuit with q < 1,” IEEE Transactions on Signal Processing, 2007. [3] A. S. Master, “Bayesian two source modelling for separation of N sources from stereo signal,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 281–284, 2004.

  44. References [4] S. Rickard, “Sparse sources are separated sources,” European Signal Processing Conference (EUSIPCO), 2006. [5] N. Hurley and S. Rickard, “Comparing measures of sparsity,” IEEE Transactions on Information Theory, 2009. [6] D. L. Jones and T. Parks, “A high resolution data-adaptive time-frequency representation,” IEEE Transactions on Acoustics, Speech and Signal Processing, 1990. [7] P. Basu, P. J. Wolfe, D. Rudoy, T. F. Quatieri and B. Dunn, “Adaptive short-time analysis-synthesis for speech enhancement,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.

  45. Thank you. Questions?
