
Machine Learning on Sound PowerPoint PPT Presentation



Machine Learning on Sound: ... how hard can it be? Audio Information Seminar, Thursday, June 8, 2006. Kaare Brandt Petersen. Agenda: motivation; the reason it might be hard (from data to information, features); the good news (computer power and machine learning, examples); conclusions.


Presentation Transcript



Machine Learning on Sound

... how hard can it be?

Audio Information Seminar

Thursday, June 8, 2006

Kaare Brandt Petersen


Agenda

  • Motivation

  • The reason it might be hard:
      - From data to information
      - Features

  • The good news:
      - Computer power and machine learning
      - Examples

  • Conclusions


Motivation

  • What can we do with audio information?

  • News archive: Find the grumpy voice in a TV broadcast from a busy street in the Middle East. Search in news archives

  • Music: 6 billion friends. Navigating the world landscape of music


Data

  • Sound as perceived by humans and by computers

-0.00076293945313, 0.00231933593750, -0.00714111328125, 0.00772094726563,
0.00076293945313, -0.00772094726563, -0.00900268554688, -0.00527954101563,
-0.00076293945313, -0.00231933593750, -0.00714111328125, 0.00024414062500,
0.01312255859375, 0.00650024414063, -0.01052856445313, -0.01089477539063,
-0.00305175781250, -0.01052856445313, -0.01089477539063, -0.00305175781250
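The sample values on this slide are consistent with 16-bit PCM audio scaled to the range [-1, 1): for example, 25/32768 = 0.000762939453125. The slide does not say how the numbers were produced, so the following is only a minimal sketch under that assumption; the function name and the integer samples are hypothetical.

```python
def pcm16_to_float(samples):
    """Scale signed 16-bit integer samples to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]

# Hypothetical integer samples, chosen so that the result matches
# the first four values on the slide up to rounding.
raw = [-25, 76, -234, 253]
values = pcm16_to_float(raw)
```

Seen this way, a few minutes of audio is simply a list of millions of such numbers, with no labels attached.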

12 Monkeys (movie from 1995)

Dialogue and sound events:

[ Beeps ]

[ Male voice - indoor ]

- "There's the television"

- "It's all right there"

[ Steps ]

- "All right there!"

[ Music - violins ]

- "Look. Listen. Kneel. Pray"

- "Commercials!"


Data

  • Is the data-to-information translation really necessary?

Archive

1) Query by signal processing [ humans learn how computers think ]

2) Query by information [ computers learn how humans think ]

3) Query by example [ various approaches ]

ZCR < 198
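A query like "ZCR < 198" asks the user to think in signal-processing terms. As a rough sketch of what such a query evaluates, the code below counts zero crossings in a frame of samples and compares the count with a threshold. The slide does not say whether 198 is a count per frame or a rate per second; a count per frame is assumed here, and both function names are hypothetical.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples (zero counts as positive)."""
    signs = [1 if x >= 0 else -1 for x in frame]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

def matches_query(frame, threshold=198):
    """Evaluate the query 'ZCR < threshold' for one frame (assumed unit: crossings per frame)."""
    return zero_crossings(frame) < threshold
```

Noisy or high-pitched frames change sign often and fail the query, while low-pitched frames pass it.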

"happy jazz"


Bridging the gap: From data to information

Going from 5 million real numbers to "Opera"

Constructing sound features the right way

Data

Meaning

Context

Information


Features

  • Many short-time features:
      - Zero crossing rate
      - Spectral flatness
      - Spectral bandwidth
      - Spectral centroid
      - Spectral rolloff
      - Spectral flux
      - Energy
      - ...
      - Mel Frequency Cepstral Coefficients (MFCC) [Foote97, Rabiner93]
      - Real Cepstral Coefficients (RCC)
      - Linear Prediction Coefficients (LPC)
      - Wavelets
      - Gamma-tone filterbanks
      - Sone / Bark
      - Chroma features
      - ...

[Figure: 12 Monkeys sound clip. Panels: Waveform, Spec, ZCR, Sp-Flatness, Sp-Bandwidth, Sp-Centroid, Chroma, MFCC 1, MFCC 2-7]
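To make a few of the spectral features concrete, here is a minimal sketch that computes spectral centroid, flatness and rolloff for a single frame using NumPy. Definitions vary between papers (magnitude or power spectrum, 85% or 95% rolloff); this version assumes a Hann window, the power spectrum and an 85% rolloff point, none of which is stated on the slide. The function name is hypothetical.

```python
import numpy as np

def spectral_features(frame, sample_rate):
    """Return (centroid_hz, flatness, rolloff_hz) for one frame of samples."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hanning(len(frame))
    # Power spectrum; the small constant avoids log(0) in the flatness.
    power = np.abs(np.fft.rfft(windowed)) ** 2 + 1e-12
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    # Centroid: power-weighted mean frequency.
    centroid = np.sum(freqs * power) / np.sum(power)
    # Flatness: geometric mean over arithmetic mean (near 0 = tonal, larger = noisy).
    flatness = np.exp(np.mean(np.log(power))) / np.mean(power)
    # Rolloff: lowest frequency below which 85% of the power lies (assumed fraction).
    cumulative = np.cumsum(power)
    rolloff = freqs[np.searchsorted(cumulative, 0.85 * cumulative[-1])]
    return centroid, flatness, rolloff
```

Applied to every short frame of a clip, each feature becomes a time series like the panels in the figure.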


Features

  • Aggregating short-time features
      - Audio clip = data cloud

  • Distribution of values
      - Basic statistics [Wold96]
      - Histograms and vector quantization [Foote97]
      - Gaussian Mixture Models [Auc02]
      - K-means clustering [Logan01]
      - Anchors by Neural Networks [Beren03]

  • Temporal modelling
      - SVD of e.g. spectrogram [Gu04]
      - AR-coefficients [Meng05]
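The simplest aggregation on this slide, basic statistics, can be sketched as follows: the cloud of per-frame feature vectors is collapsed into one fixed-length vector of means and variances. This is an illustration only, not the exact procedure of [Wold96]; the function name is hypothetical.

```python
import numpy as np

def summarize_clip(frame_features):
    """Collapse a (n_frames, n_dims) feature matrix into one vector [means, variances]."""
    frame_features = np.asarray(frame_features, dtype=float)
    mean = frame_features.mean(axis=0)
    var = frame_features.var(axis=0)  # population variance over frames
    return np.concatenate([mean, var])

# Two frames with two features each.
summary = summarize_clip([[1.0, 10.0], [3.0, 14.0]])
```

Clips of any length then map to vectors of the same size, which is what a classifier needs, but the order of the frames is lost. The temporal models on the slide address that.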


Features

  • What we are trying to do: From data to information

Data: the raw sample values shown on the earlier Data slide (-0.00076293945313, 0.00231933593750, ...)

Low-level features: ZCR, Spectral, MFCC, Chroma, Sone/Bark, RCC, LPC, ...

High-level features: Basic stats, GMM, K-means, Anchors, AR coeff, SVD, HMM, ...

Information: "Rough", "Deep", "Sparky", "Broad", "Melancholic", "Majestic", "Jazz", "Rock", ...


Features

  • Music similarity example

"Shape of my heart", Backstreet Boys, 2000

"That's the way it is", Celine Dion, 2000

"The limitations observed in this paper (...) suggests that the usual route to timbre similarity may not be the optimal one" [Auc04]

"Cantaloop", Us3, 1993


The bad news

  • Sound data is far from the information

  • Not all features are useful

  • It is not obvious what the information labels should be


The good news

  • Computer power

  • Signal processing
      - Strong development in signal processing and machine learning in general
      - Large amounts of data
      - Increased interest in sound and music processing


Example: Genre estimation

  • Genre estimation by temporal integration. Peter Ahrendt, Anders Meng [Meng05]

  • Processing: Sound -> MFCC -> AR
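The AR step in this chain can be sketched as fitting an autoregressive model to the trajectory of each feature dimension and using the fitted coefficients as the clip-level feature. The code below is a simplified illustration that assumes a separate univariate least-squares fit per dimension; it is not claimed to be the model used in [Meng05], and the function names and the default order are hypothetical.

```python
import numpy as np

def ar_coefficients(series, order=3):
    """Least-squares fit of x[t] ≈ a[0]*x[t-1] + ... + a[order-1]*x[t-order]."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    # Column k holds the values lagged by k+1 steps, aligned with the targets.
    lagged = np.column_stack(
        [x[order - k - 1 : len(x) - k - 1] for k in range(order)]
    )
    targets = x[order:]
    coeffs, *_ = np.linalg.lstsq(lagged, targets, rcond=None)
    return coeffs

def integrate_features(frame_features, order=3):
    """Concatenate per-dimension AR coefficients of a (n_frames, n_dims) matrix."""
    frames = np.asarray(frame_features, dtype=float)
    return np.concatenate(
        [ar_coefficients(frames[:, d], order) for d in range(frames.shape[1])]
    )
```

Unlike mean and variance, the AR coefficients depend on the order of the frames, so they capture how the features evolve over time.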


Example: Genre estimation

  • Genre estimation by temporal integration + kernel methods. Jeronimo Arenas-Garcia, Tue Lehn-Schiøler, Kaare Brandt Petersen [ArGa06]

  • Processing: Sound -> MFCC -> AR -> KOPLS

    Btw: A data harvesting tool coming up - ISMIR 2006


Example: Source separation

  • Spectrogram modelling with sparse NTF2D. Morten Mørup, Mikkel Schmidt [Mørup06]
      - W = time-frequency patterns
      - H = time, amplitude, pitch

[Figure: Original (mixed) and separated sources: (Harp), (Flute)]
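Sparse NTF2D is considerably more elaborate than what fits in a short example. As a hedged stand-in, the sketch below shows plain non-negative matrix factorization (NMF) with multiplicative updates, the basic idea that such models extend: a non-negative spectrogram V is approximated by W @ H. In this simplified version W holds purely spectral patterns and H their activations over time, which is less expressive than the W and H described on the slide. The function name and defaults are hypothetical.

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0):
    """Approximate non-negative V (freq x time) as W @ H with multiplicative updates."""
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + 1e-3
    H = rng.random((rank, V.shape[1])) + 1e-3
    eps = 1e-9  # guards against division by zero
    for _ in range(n_iter):
        # Updates that reduce the squared reconstruction error
        # while keeping all entries non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

With one component per instrument, the outer product of column k of W and row k of H gives that source's share of the mixed spectrogram.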


Example: CNN

  • Translating a CNN news broadcast. Kasper Jørgensen, Lasse Mølgaard, Lars Kai Hansen [Jorg06]

  • Music or Speech? Sound -> MFCC, STE, SpF, ZCR -> mean/var

  • Speaker change detection: Sound -> MFCC -> VQ

  • Speech recognition: Sphinx 4 (Carnegie Mellon)
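One way to read the "MFCC -> VQ" chain for speaker change detection is this: learn a small codebook from the feature vectors of one window, then check how well it quantizes the next window. A jump in distortion suggests a new speaker. The sketch below assumes a k-means codebook and Euclidean distortion; it is an illustration of the idea, not the method of [Jorg06], and all names and defaults are hypothetical.

```python
import numpy as np

def train_codebook(vectors, n_codes=4, n_iter=20, seed=0):
    """Learn a vector-quantization codebook with a few k-means iterations."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    start = rng.choice(len(vectors), size=n_codes, replace=False)
    codebook = vectors[start].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        for k in range(n_codes):
            members = vectors[nearest == k]
            if len(members) > 0:  # keep the old code word if a cluster is empty
                codebook[k] = members.mean(axis=0)
    return codebook

def distortion(vectors, codebook):
    """Mean distance from each vector to its nearest code word."""
    vectors = np.asarray(vectors, dtype=float)
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()
```

In use, the vectors would be MFCC frames from consecutive windows of the broadcast, with a threshold on the distortion deciding where to place a segment boundary.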


Conclusions

It is hard:

  • Sound data is far from the information

  • Good features are hard to find

    but machine learning is catching up:

  • Examples: Genre, Source separation, CNN-translation


References

[Wold96] Wold, E.; Blum, T.; Keislar, D. & Wheaton, J., "Content-based Classification, Search, and Retrieval of Audio", IEEE Multimedia, 1996, 3, 27-36

[Foote97] Foote, J., "Content-based retrieval of music and audio", Multimedia Storage and Archiving Systems II, Proc. of SPIE, 1997, 3229, 138-147

[Logan01] Logan and Salomon, "A music similarity function based on signal analysis", ICME 2001

[Beren03] Berenzweig, Ellis and Lawrence, "Anchorspace for classification and similarity measurement of music", ICME 2003

[Rabiner93] Rabiner, L. & Juang, B.H., "Fundamentals of Speech Recognition", Prentice-Hall, 1993

[Gu04] Gu, Lu, Cai and Zhang, "Dominant Feature vector based audio similarity measure", Proceedings of the Pacific Rim Conference on Multimedia, PCM, 2004

[Tza02] Tzanetakis and Cook, "Musical Genre Classification of Audio Signals", IEEE Transactions on Speech and Audio Processing, 2002, 10, 293-302

[Auc02] Aucouturier and Pachet, "Music Similarity Measures: What's the use?", ISMIR 2002

[Meng05] Anders Meng, Peter Ahrendt and Jan Larsen, "Improving Music Genre Classification by Short-Time Feature Integration", ICASSP, 2005

[Auc04] Aucouturier and Pachet, "Improving Timbre Similarity: How high is the sky?", JNRSAS, 2004

[Mørup06] "Sparse Non-negative Tensor Factor Double Deconvolution (SNTF2D) for multi channel time-frequency analysis", submitted to JMLR 2006

[ArGa06] "Reduced Kernel Orthonormal Partial Least Squares", submitted for NIPS 2006

[Jorg06] Kasper Jørgensen, Lasse Mølgaard, Lars Kai Hansen, "Unsupervised speaker change detection for broadcast news segmentation", EUSIPCO 2006
