
Machine Learning on Sound

Machine Learning on Sound ... how hard can it be? Audio Information Seminar, Thursday, June 8, 2006. Kaare Brandt Petersen. Agenda: Motivation. The reason it might be hard: from data to information; features. The good news: computer power and machine learning; examples. Conclusions.


Presentation Transcript


  1. Machine Learning on Sound ... how hard can it be?
     Audio Information Seminar, Thursday, June 8, 2006
     Kaare Brandt Petersen

  2. Agenda
     • Motivation
     • The reason it might be hard:
       - From data to information
       - Features
     • The good news:
       - Computer power and machine learning
       - Examples
     • Conclusions

  3. Motivation
     • What can we do with audio information?
     • News archives: find the grumpy voice in a TV broadcast from a busy street in the Middle East. Search in news archives.
     • Music: 6 billion friends. Navigating the world landscape of music.

  4. Data
     • Sound as perceived by humans and by computers
     • Sample values:
       -0.00076293945313 0.00231933593750 -0.00714111328125 0.00772094726563
       0.00076293945313 -0.00772094726563 -0.00900268554688 -0.00527954101563
       -0.00076293945313 -0.00231933593750 -0.00714111328125 0.00024414062500
       0.01312255859375 0.00650024414063 -0.01052856445313 -0.01089477539063
       -0.00305175781250 -0.01052856445313 -0.01089477539063 -0.00305175781250
     • 12 Monkeys (movie from 1995), dialogue and sound events:
       [ Beeps ]
       [ Male voice - indoor ]
       - "There's the television"
       - "It's all right there"
       [ Steps ]
       - "All right there!"
       [ Music - violins ]
       - "Look. Listen. Kneel. Pray"
       - "Commercials!"

  5. Data
     • Is the data-to-information translation really necessary?
     • Querying an archive:
       1) Query by signal processing [ humans learn how computers think ], e.g. ZCR < 198
       2) Query by information [ computers learn how humans think ], e.g. "happy jazz"
       3) Query by example [ various approaches ]

  6. Bridging the gap: From data to information
     • Going from 5 million real numbers to "Opera"
     • Constructing sound features the right way
     • Diagram labels: Data, Meaning, Context, Information

  7. Features
     • Many short-time features:
       - Zero crossing rate
       - Spectral flatness
       - Spectral bandwidth
       - Spectral centroids
       - Spectral rolloff
       - Spectral flux
       - Energy
       - ...
       - Mel Frequency Cepstral Coefficients (MFCC) [Foote97, Rabiner93]
       - Real Cepstral Coefficients (RCC)
       - Linear Prediction Coefficients (LPC)
       - Wavelets
       - Gamma-tone filterbanks
       - Sone / Bark
       - Chroma features
       - ...
     • Figure: 12 Monkeys sound clip. Panels: Waveform, Spec, ZCR, Sp-Flatness, Sp-Bandwidth, Sp-Centroid, Chroma, MFCC 1, MFCC 2-7
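Two of the short-time features listed on this slide, the zero crossing rate and the spectral centroid, are simple enough to sketch directly. The following is a minimal illustration using only the Python standard library, not the implementation behind the slides. The function names, the 8 kHz sample rate, and the synthetic sine frame are assumptions made for the example, and the DFT is the naive O(N^2) form, which is only reasonable for short frames.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (illustrative, not fast)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of the frame, in Hz."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0  # silent frame: no meaningful centroid
    n = len(frame)
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total


# Hypothetical test signal: a 1 kHz sine sampled at 8 kHz, one 64-sample frame.
# The phase offset keeps samples away from exact zeros.
SAMPLE_RATE = 8000
frame = [
    math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE + 0.3) for t in range(64)
]
zcr = zero_crossing_rate(frame)
centroid = spectral_centroid(frame, SAMPLE_RATE)
```

For a pure tone the centroid sits at the tone's frequency, and the zero crossing rate is roughly twice the frequency divided by the sample rate. Real clips would be cut into many such frames, giving one feature value per frame.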

  8. Features
     • Aggregating short-time features: audio clip = data cloud
     • Distribution of values:
       - Basic statistics [Wold96]
       - Histograms and vector quantization [Foote97]
       - Gaussian Mixture Models [Auc02]
       - K-means clustering [Logan01]
       - Anchors by Neural Networks [Beren03]
     • Temporal modelling:
       - SVD of e.g. spectrogram [Gu04]
       - AR-coefficients [Meng05]
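The simplest aggregation on this slide, basic statistics over the "data cloud" of per-frame values, can be sketched as follows. This is an illustration under assumptions: the function name `aggregate_mean_var`, the two-feature layout, and the toy numbers are invented for the example and do not come from [Wold96].

```python
from statistics import fmean, pvariance


def aggregate_mean_var(frames):
    """Collapse per-frame feature vectors into one clip-level vector:
    the mean of every feature, followed by the variance of every feature."""
    columns = list(zip(*frames))  # one tuple per feature dimension
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return means + variances


# Hypothetical clip: 4 frames, each with 2 short-time features
# (for instance a zero crossing rate and an energy value).
clip_frames = [
    [0.10, 1.0],
    [0.20, 3.0],
    [0.30, 1.0],
    [0.40, 3.0],
]
clip_vector = aggregate_mean_var(clip_frames)
```

The result has a fixed length regardless of clip duration, which is what makes it usable as input to a standard classifier. The cost is that the order of the frames is discarded, which is the gap the temporal models on the slide address.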

  9. Features
     • What we are trying to do: From data to information
     • Data:
       -0.00076293945313 0.00231933593750 -0.00714111328125 0.00772094726563
       0.00076293945313 -0.00772094726563 -0.00900268554688 -0.00527954101563
       -0.00076293945313 -0.00231933593750 -0.00714111328125 0.00024414062500
       0.01312255859375 0.00650024414063 -0.01052856445313 -0.01089477539063
     • Low-level features: ZCR, Spectral, MFCC, Chroma, Sone/Bark, RCC, LPC, ...
     • High-level features: Basic stats, GMM, K-means, Anchors, AR coeff, SVD, HMM, ...
     • Information: "Rough", "Deep", "Sparky", "Broad", "Melancholic", "Majestic", "Jazz", "Rock", ...

  10. Features
      • Music similarity example:
        - "Shape of my heart", Backstreet Boys, 2000
        - "That's the way it is", Celine Dion, 2000
        - "Cantaloop", Us3, 1993
      • "The limitations observed in this paper (...) suggests that the usual route to timbre similarity may not be the optimal one" [Auc04]

  11. The bad news
      • Sound data is far from the information
      • Not all features are useful
      • It is not obvious what the information labels should be

  12. The good news
      • Computer power
      • Signal processing
        - Strong development in signal processing and machine learning in general
        - Large amounts of data
        - Increased interest in sound and music processing

  13. Example: Genre estimation
      • Genre estimation by temporal integration
        Peter Ahrendt, Anders Meng [Meng05]
      • Processing: Sound -> MFCC -> AR
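The "MFCC -> AR" step summarises how each feature evolves over time by fitting an autoregressive model to its frame-by-frame trajectory. The sketch below is a simplified, univariate version that solves the Yule-Walker equations with the Levinson-Durbin recursion. It is an assumption-laden illustration rather than the model of [Meng05]: the function names are invented here, and the "trajectory" is a synthetic AR(1) series standing in for one MFCC coefficient over time.

```python
import random


def autocovariance(series, max_lag):
    """Biased autocovariance estimates r[0..max_lag] of a mean-removed series."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n
        for lag in range(max_lag + 1)
    ]


def ar_coefficients(series, order):
    """Yule-Walker AR fit via the Levinson-Durbin recursion.

    Returns (coeffs, noise_variance) for the model
    x[t] = coeffs[0]*x[t-1] + ... + coeffs[order-1]*x[t-order] + e[t].
    Assumes the series is not constant (r[0] > 0).
    """
    r = autocovariance(series, order)
    coeffs = []
    error = r[0]
    for k in range(order):
        acc = r[k + 1] - sum(coeffs[j] * r[k - j] for j in range(k))
        reflection = acc / error
        coeffs = [
            coeffs[j] - reflection * coeffs[k - 1 - j] for j in range(k)
        ] + [reflection]
        error *= 1.0 - reflection ** 2
    return coeffs, error


# Synthetic stand-in for one feature trajectory: x[t] = 0.8*x[t-1] + noise.
rng = random.Random(0)
trajectory = [0.0]
for _ in range(1999):
    trajectory.append(0.8 * trajectory[-1] + rng.gauss(0.0, 1.0))

coeffs, noise_var = ar_coefficients(trajectory, order=2)
```

The fitted coefficients, possibly together with the mean and noise variance, then serve as the clip-level feature vector. Fitting one such model per feature dimension ignores cross-feature dynamics, which a multivariate AR model would capture.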

  14. Example: Genre estimation
      • Genre estimation by temporal integration + kernel methods
        Jeronimo Arenas-Garcia, Tue Lehn-Schiøler, Kaare Brandt Petersen [ArGa06]
      • Processing: Sound -> MFCC -> AR -> KOPLS
      • Btw: A data harvesting tool coming up - ISMIR 2006

  15. Example: Source separation
      • Spectrogram modelling with sparse NTF2D
        Morten Mørup, Mikkel Schmidt [Mørup06]
      • W = time-frequency patterns
      • H = time, amplitude, pitch
      • Audio examples: Original (mixed); Separated sources: (Harp), (Flute)

  16. Example: CNN
      • Translating a CNN news broadcast
        Kasper Jørgensen, Lasse Mølgaard, Lars Kai Hansen [Jorg06]
      • Music or Speech? Sound -> MFCC, STE, SpF, ZCR -> mean/var
      • Speaker change detection: Sound -> MFCC -> VQ
      • Speech recognition: Sphinx 4 (Carnegie Mellon)
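The "MFCC -> VQ" step for speaker change detection can be illustrated with a generic vector-quantization distortion test: train a small codebook on frames from one window, then measure how well it fits a later window. This is one common way to use VQ for change detection and is not necessarily the procedure of [Jorg06]. The function names, the 2-D synthetic "frames", and the codebook size are all assumptions made for the sketch.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebook(vectors, size, iterations=10):
    """Plain k-means (Lloyd's algorithm) with evenly spaced initial centroids."""
    step = max(1, len(vectors) // size)
    codebook = [list(v) for v in vectors[::step][:size]]
    for _ in range(iterations):
        clusters = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(len(codebook)),
                          key=lambda i: squared_distance(v, codebook[i]))
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                codebook[i] = [sum(col) / len(members)
                               for col in zip(*members)]
    return codebook


def vq_distortion(vectors, codebook):
    """Mean squared distance from each vector to its nearest codeword."""
    return sum(
        min(squared_distance(v, c) for c in codebook) for v in vectors
    ) / len(vectors)


# Synthetic stand-ins for MFCC frames from two different speakers.
rng = random.Random(1)


def fake_frames(center, count):
    return [[c + rng.gauss(0.0, 0.5) for c in center] for _ in range(count)]


speaker_1_train = fake_frames([0.0, 0.0], 200)
speaker_1_later = fake_frames([0.0, 0.0], 200)
speaker_2_later = fake_frames([3.0, 3.0], 200)

codebook = train_codebook(speaker_1_train, size=4)
same_speaker = vq_distortion(speaker_1_later, codebook)
other_speaker = vq_distortion(speaker_2_later, codebook)
```

A later window from the same speaker fits the codebook with low distortion, while a window from a different speaker fits poorly. A change detector would compare this distortion against a threshold, whose value has to be tuned on real data.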

  17. Conclusions
      It is hard:
      • Sound data is far from the information
      • Good features are hard to find
      but machine learning is catching up:
      • Examples: Genre, Source separation, CNN-translation

  18. References
      [Wold96] Wold, E.; Blum, T.; Keislar, D. & Wheaton, J., "Content-based Classification, Search, and Retrieval of Audio", IEEE Multimedia, 1996, 3, 27-36
      [Foote97] Foote, J., "Content-based retrieval of music and audio", Multimedia Storage and Archiving Systems II, Proc. of SPIE, 1997, 3229, 138-147
      [Logan01] Logan and Salomon, "A music similarity function based on signal analysis", ICME 2001
      [Beren03] Berenzweig, Ellis and Lawrence, "Anchorspace for classification and similarity measurement of music", ICME 2003
      [Rabiner93] Rabiner, L. & Juang, B.H., "Fundamentals of Speech Recognition", Prentice-Hall, 1993
      [Gu04] Gu, Lu, Cai and Zhang, "Dominant Feature vector based audio similarity measure", Proceedings of the Pacific Rim Conference on Multimedia, PCM, 2004
      [Tza02] Tzanetakis and Cook, "Musical Genre Classification of Audio Signals", IEEE Transactions on Speech and Audio Processing, 2002, 10, 293-302
      [Auc02] Aucouturier and Pachet, "Music Similarity Measures: What's the use?", ISMIR 2002
      [Meng05] Anders Meng, Peter Ahrendt and Jan Larsen, "Improving Music Genre Classification by Short-Time Feature Integration", ICASSP, 2005
      [Auc04] Aucouturier and Pachet, "Improving Timbre Similarity: How high is the sky?", JNRSAS, 2004
      [Mørup06] "Sparse Non-negative Tensor Factor Double Deconvolution (SNTF2D) for multi channel time-frequency analysis", submitted to JMLR 2006
      [ArGa06] "Reduced Kernel Orthonormal Partial Least Squares", submitted for NIPS 2006
      [Jorg06] Kasper Jørgensen, Lasse Mølgaard, Lars Kai Hansen, "Unsupervised speaker change detection for broadcast news segmentation", EUSIPCO 2006
