Locating Singing Voice Segments Within Music Signals

Adam Berenzweig and Daniel P.W. Ellis

LabROSA, Columbia University

alb63@columbia.edu, dpwe@ee.columbia.edu

LabROSA
  • What
  • Where
  • Who
  • Why you love us
The Future as We Hear It
  • Online Digital Music Libraries
  • The Coming Age of Streaming Music Services
  • Information Retrieval: How do we find what we want?
  • Recommendation: How do we know what we want to find?
    • Collaborative Filtering vs. Content-Based
    • What is Quality?
Motivation
  • Lyrics Recognition: Baby Steps
    • Segmentation
    • Forced Alignment
    • A Corpus
  • Song structure through singing structure?
    • Fingerprinting
    • Retrieval
    • Feature for similarity measures
Lyrics Recognition: Can YOU do it?
  • Notoriously hard, even for humans.
    • amIright.com, kissThisGuy.com
  • Why so hard?
    • Noise, music, whatever.
    • Singing is not speech: voice transformations
    • Strange word sequences (“poetry”)
  • Need a corpus
History of the Problem
  • Segmentation for Speech Recognition: Music/Speech
    • Scheirer & Slaney
  • Forced Alignment - Karaoke
    • Cano et al. [REF NEEDED]
  • Acoustic feature design: Custom job or Kitchen Sink?
  • Idea! Use a speech recognizer: PPF (Posterior Probability Features)
    • Williams & Ellis
  • Ultimately: Source separation, CASA
Architecture Overview
[Block diagram. Components: Audio; PLP (cepstra); Speech Recognizer (Neural Net), producing a posteriogram; Feature Calculation (Entropy H, H/h#, Dynamism D, P(h#)); Time-averaging; Gaussian Model (two blocks); Segmentation (HMM)]

Architecture Overview

[Block diagram. Components: Audio; PLP (cepstra); Speech Recognizer (Neural Net), producing a posteriogram; Neural Net (two blocks); Segmentation (HMM)]

“So how’s that working out for you, being clever?”
  • Entropy
  • Entropy excluding background
  • Dynamism
  • Background probability
  • Distribution Match: Likelihoods under single Gaussian model
    • Cepstra
    • PPF
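The four scalar features above can be sketched per frame as follows. This is a minimal illustration, not the authors' code: it assumes the posteriogram is a list of per-frame class posteriors, that the background class h# sits at a known index (`bg`, hypothetical), and that dynamism is the mean squared frame-to-frame change in the posteriors. All function names are invented for the example.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of one posterior vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def entropy_excluding_background(p, bg):
    """Entropy over the non-background classes, renormalized to sum to 1."""
    rest = [x for i, x in enumerate(p) if i != bg]
    total = sum(rest)
    if total <= 0.0:
        return 0.0
    return entropy([x / total for x in rest])

def dynamism(prev, cur):
    """Mean squared change in class posteriors between successive frames."""
    return sum((a - b) ** 2 for a, b in zip(prev, cur)) / len(cur)

def frame_features(posteriogram, bg=0):
    """Return one (H, H/h#, D, P(h#)) tuple per frame."""
    feats = []
    for n, p in enumerate(posteriogram):
        # The first frame has no predecessor, so its dynamism is 0.
        prev = posteriogram[n - 1] if n > 0 else p
        feats.append((entropy(p),
                      entropy_excluding_background(p, bg),
                      dynamism(prev, p),
                      p[bg]))
    return feats

# Toy posteriogram: 3 frames, 4 classes, class 0 assumed to be h#.
posteriogram = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.80, 0.05, 0.05],
    [0.25, 0.25, 0.25, 0.25],
]
features = frame_features(posteriogram, bg=0)
for H, H_no_bg, D, p_bg in features:
    print(f"H={H:.3f}  H/h#={H_no_bg:.3f}  D={D:.4f}  P(h#)={p_bg:.2f}")
```

A confident, peaked posterior gives low entropy, and a flat one (the last toy frame) gives the maximum.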

Recovering context with the HMM

  • Transition probabilities
    • Inverse average segment duration
  • Emission probabilities
    • Gaussian fit to time-averaged distribution
  • Segmentation: the Viterbi path
  • Evaluation
    • Frame error rate (no boundary consideration)
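A minimal sketch of the HMM step, under stated assumptions: two states (0 = no vocals, 1 = vocals), a single scalar feature per frame, one Gaussian emission per state, and a probability of leaving a state equal to the inverse of its average segment duration in frames. The means, variances, durations, and data below are made up for illustration.

```python
import math

def transition_matrix(avg_durations):
    """Leave probability = 1 / average segment duration (in frames)."""
    n = len(avg_durations)
    A = []
    for i, d in enumerate(avg_durations):
        leave = 1.0 / d
        row = [leave / (n - 1)] * n   # split the leave mass over other states
        row[i] = 1.0 - leave          # self-transition
        A.append(row)
    return A

def gaussian_loglik(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(obs, A, means, variances):
    """Most likely state sequence, assuming a uniform initial distribution."""
    n_states = len(A)
    log_a = [[math.log(a) for a in row] for row in A]
    score = [gaussian_loglik(obs[0], means[s], variances[s]) - math.log(n_states)
             for s in range(n_states)]
    back = []
    for x in obs[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: score[r] + log_a[r][s])
            ptr.append(best)
            new_score.append(score[best] + log_a[best][s]
                             + gaussian_loglik(x, means[s], variances[s]))
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def frame_error_rate(predicted, reference):
    """Fraction of mislabelled frames; boundaries get no special treatment."""
    return sum(1 for p, r in zip(predicted, reference) if p != r) / len(reference)

# Toy data: frames 3 and 8 are outliers that fool a frame-by-frame decision.
obs = [0.1, 0.2, 0.0, 0.9, 0.0, 1.0, 1.1, 0.9, 0.2, 1.0]
reference = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
means, variances = [0.0, 1.0], [0.1, 0.1]

A = transition_matrix([100.0, 100.0])
path = viterbi(obs, A, means, variances)
framewise = [max((0, 1), key=lambda s: gaussian_loglik(x, means[s], variances[s]))
             for x in obs]

print("framewise error:", frame_error_rate(framewise, reference))
print("Viterbi error:  ", frame_error_rate(path, reference))
```

On this toy input the frame-by-frame labels flip at the two outlier frames, while the Viterbi path holds its state through them because a state change is expensive.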
Results
  • [Table, figures]
  • Listen!
    • Good, bad
    • trigger & stick
    • genre effects?

E = .075

  • P(h#) in effect

E = .68

  • P(h#) gone bad

‘m’, ‘n’

‘uw’

‘ey’

  • E = .61
  • Strong phones trigger, but can’t hold it
  • Production quality effect?

‘s’

  • E = .25
  • “Trigger and Stick”

‘bcl’, ‘dcl’, ‘b’, ‘d’

‘l’, ‘r’

  • E = .54
  • False phones

E = .20

  • Genre effect?
Discussion
  • The Moral of the Story: Just give it the data
  • PPF is better than cepstra. Speech Recognizer is pretty powerful.
  • Why does the extra Gaussian model help PPF but not cepstra?
  • Time averaging helps PPF: proves that it’s using the overall distribution, not short-time detail (at least when modelled by single Gaussians)
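To make the time-averaging point concrete, here is a small sketch of smoothing a per-frame feature track and fitting a single Gaussian to it. The window length, the toy data, and the function names are assumptions made for the example, and the slides do not say which averaging window was used.

```python
import math

def moving_average(values, window):
    """Average each frame with its neighbours in a centred window
    (the window is truncated at the edges of the track)."""
    half = window // 2
    out = []
    for n in range(len(values)):
        lo, hi = max(0, n - half), min(len(values), n + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def fit_gaussian(samples):
    """Maximum-likelihood mean and variance of a 1-D sample."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, var

def log_likelihood(x, mean, var):
    """Log density of x under a single 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

# A feature track that alternates rapidly around a stable average.
raw = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
smoothed = moving_average(raw, window=3)

raw_mean, raw_var = fit_gaussian(raw)
smooth_mean, smooth_var = fit_gaussian(smoothed)
print("raw:      mean=%.3f var=%.4f" % (raw_mean, raw_var))
print("smoothed: mean=%.3f var=%.4f" % (smooth_mean, smooth_var))
```

Averaging keeps the overall level of the track and discards the frame-to-frame detail, so the fitted Gaussian becomes much narrower around the same mean.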