“Pixels that Sound”: Find pixels that correspond (correlate!?) to sound


“Pixels that Sound”

Find pixels that correspond (correlate!?) to sound

Kidron, Schechner, Elad, CVPR 2005


Audio-Visual Analysis: Applications

  • Lip reading – detection of lips (or person)

    • Slaney, Covell (2000)

    • Bregler, Konig (1994)

  • Analysis and synthesis of music from motion

    • Murphy, Andersen, Jensen (2003)

  • Source separation based on vision

    • Li, Dimitrova, Li, Sethi (2003)

    • Smaragdis, Casey (2003)

    • Nock, Iyengar, Neti (2002)

    • Fisher, Darrell, Freeman, Viola (2001)

    • Hershey, Movellan (1999)

  • Tracking

    • Vermaak, Gangnet, Blake, Pérez (2001)

  • Biological systems

    • Gutfreund, Zheng, Knudsen (2002)


Problem: Different Modalities

Audio-visual analysis combines a microphone and a camera.

Audio data

  • 44.1 kHz, few bands

  • Not stereophonic

Visual data

  • 25 frames/sec

  • Each frame: 576 x 720 pixels

Kidron, Schechner, Elad, Pixels that Sound


Previous Work

  • Pointwise correlation (not typical)

    • Nock, Iyengar, Neti (2002)

    • Hershey, Movellan (1999)

  • Cluster of pixels - linear superposition

    • Canonical Correlation Analysis (CCA): ill-posed (lack of data)

      • Smaragdis, Casey (2003)

      • Li, Dimitrova, Li, Sethi (2003)

      • Slaney, Covell (2000)

    • Mutual Information (MI): highly complex

      • Fisher et al. (2001)

      • Cutler, Davis (2000)

      • Bregler, Konig (1994)
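As a point of comparison, pointwise correlation can be sketched in a few lines: each pixel's time series is correlated independently with a single audio feature. This is only an illustrative sketch under assumed data shapes, not the method of any of the cited papers; the function name `pointwise_correlation` and the synthetic data are hypothetical.

```python
import numpy as np

def pointwise_correlation(video, audio):
    """Pearson correlation between each pixel's time series and a 1D audio feature.

    video: array of shape (n_frames, n_pixels); audio: array of shape (n_frames,).
    Returns an array of shape (n_pixels,); constant pixels get correlation 0.
    """
    v = video - video.mean(axis=0)          # remove each pixel's temporal mean
    a = audio - audio.mean()                # remove the audio feature's mean
    denom = np.linalg.norm(v, axis=0) * np.linalg.norm(a)
    safe = np.where(denom > 0, denom, 1.0)  # avoid division by zero
    return np.where(denom > 0, (v.T @ a) / safe, 0.0)

# Synthetic example (assumed sizes): 30 frames, 100 pixels.
rng = np.random.default_rng(0)
n_frames, n_pixels = 30, 100
audio = rng.standard_normal(n_frames)
video = rng.standard_normal((n_frames, n_pixels))
video[:, 7] = 2.0 * audio + 1.0             # hypothetical "sounding" pixel

corr = pointwise_correlation(video, audio)
best = int(np.argmax(np.abs(corr)))         # index of the most correlated pixel
```

Because every pixel is scored on its own, this approach cannot capture a cluster of pixels that only correlates with the sound as a linear superposition.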


CCA

[Figure: video features (Pixel #1, Pixel #2, Pixel #3) and audio features (Band #1, Band #2) are each passed through a projection. Labels: "Video", "Audio", "Projection" (twice), "Optimal visual components", "Optimal".]


Visual Projection

  • Video features

    • Pixel intensities

    • Transform coeff (wavelet)

    • Image differences

Projection: the video features are projected to a 1D variable v.

[Figure: example numeric values 3, 40, 120, 52, 68, 74, 36, 859.]


Audio Projection

  • Audio features

    • Average energy per frame

    • Transform coeffs per frame

Projection: the audio features are projected to a 1D variable a.
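The "average energy per frame" feature can be sketched as follows, assuming the rates given on the modalities slide (44.1 kHz audio, 25 video frames per second), which gives 44100 / 25 = 1764 audio samples per video frame. The function name `energy_per_frame` and the test tone are hypothetical, and the slides do not say how the energy was actually computed.

```python
import numpy as np

SAMPLE_RATE = 44_100                            # Hz, from the slides
FRAME_RATE = 25                                 # video frames per second, from the slides
SAMPLES_PER_FRAME = SAMPLE_RATE // FRAME_RATE   # 1764 audio samples per video frame

def energy_per_frame(samples):
    """Mean squared amplitude of the audio within each video frame interval."""
    samples = np.asarray(samples, dtype=float)
    n_frames = len(samples) // SAMPLES_PER_FRAME
    trimmed = samples[: n_frames * SAMPLES_PER_FRAME]   # drop a partial last frame
    return (trimmed.reshape(n_frames, SAMPLES_PER_FRAME) ** 2).mean(axis=1)

# Two seconds of audio: a 500 Hz tone for the first second, silence afterwards.
t = np.arange(2 * SAMPLE_RATE) / SAMPLE_RATE
signal = np.sin(2 * np.pi * 500.0 * t)
signal[SAMPLE_RATE:] = 0.0

energy = energy_per_frame(signal)               # one value per video frame
```

The result is a single number per video frame, so the audio feature lives on the same time axis as the visual features.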


Canonical Correlation

  • Representation: video, audio

  • Projections (per time window)

  • Random variables (time dependent)

  • Correlation coefficient


CCA Formulation

  • The formulation yields an eigenvalue problem:

    • Knutsson, Borga, Landelius (1995)

  • Canonical correlation: equivalent to the largest eigenvalue

  • Projections: equivalent to the corresponding eigenvectors
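For reference, the standard textbook form of CCA is written out below. The slide's own equations are missing from this transcript, so the notation is an assumption: $\mathbf{w}_v$ and $\mathbf{w}_a$ are the visual and audio projection vectors, and $\mathbf{C}_{vv}$, $\mathbf{C}_{aa}$, $\mathbf{C}_{va} = \mathbf{C}_{av}^{T}$ are the covariance and cross-covariance matrices of the visual and audio features.

```latex
% Canonical correlation: maximal correlation coefficient between the two projections
\rho = \max_{\mathbf{w}_v,\,\mathbf{w}_a}
  \frac{\mathbf{w}_v^{T}\mathbf{C}_{va}\mathbf{w}_a}
       {\sqrt{\mathbf{w}_v^{T}\mathbf{C}_{vv}\mathbf{w}_v}\;
        \sqrt{\mathbf{w}_a^{T}\mathbf{C}_{aa}\mathbf{w}_a}}

% Equivalent eigenvalue problems (the largest eigenvalue is \rho^2)
\mathbf{C}_{vv}^{-1}\mathbf{C}_{va}\mathbf{C}_{aa}^{-1}\mathbf{C}_{av}\,\mathbf{w}_v
  = \rho^{2}\,\mathbf{w}_v ,
\qquad
\mathbf{C}_{aa}^{-1}\mathbf{C}_{av}\mathbf{C}_{vv}^{-1}\mathbf{C}_{va}\,\mathbf{w}_a
  = \rho^{2}\,\mathbf{w}_a
```

Note that this form requires inverting $\mathbf{C}_{vv}$, which is only possible when the covariance estimate has full rank.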


Visual Data

[Figure: the visual data laid out as a matrix, with spatial location (pixel intensities) along one axis and time t (frames) along the other.]


Rank Deficiency

[Figure: the visual data matrix, spatial location (pixel intensities) versus time t (frames), appearing in an equation.]


Estimation of Covariance

Rank deficient
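A small numerical sketch shows why the covariance estimate is rank deficient: with far fewer frames than pixels, the sample covariance of the pixels cannot have rank above the number of frames minus one. The sizes here are assumptions chosen to keep the example fast; a full 576 x 720 frame has 414,720 pixels, which makes the gap far larger.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_pixels = 30, 500              # assumed sizes: far fewer frames than pixels
V = rng.standard_normal((n_frames, n_pixels))   # rows = frames, columns = pixels

Vc = V - V.mean(axis=0)                   # subtract each pixel's temporal mean
C_vv = Vc.T @ Vc / (n_frames - 1)         # (n_pixels x n_pixels) covariance estimate

rank = np.linalg.matrix_rank(C_vv)        # at most n_frames - 1, far below n_pixels
print(f"covariance is {C_vv.shape[0]} x {C_vv.shape[1]}, rank {rank}")
```

Since the rank is far below the matrix dimension, the estimated covariance is singular and cannot be inverted.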


Ill-Posedness

Impossible to invert!

  • Prior solutions:

    • Use many more frames → poor temporal resolution.

    • Aggressive spatial pruning → poor spatial resolution.

    • Trivial regularization


A General Problem

  • Large number of weights

  • Small amount of data

The problem is ILL-POSED

Overfitting is likely


An Equivalent Problem

Maximizing the correlation is equivalent to minimizing an objective function.


Single Audio Band

  • A has a single column

  • Known data

  • Minimizing (the denominator is non-zero)


Full correlation if

[Equation: the audio vector a = (a(1), a(2), ..., a(ti), ..., a(30)), indexed by time, equals the visual data matrix V applied to the weights.]

Underdetermined system!


“Out of clutter, find simplicity.

From discord, find harmony.”

Albert Einstein

Detected correlated pixels


Sparse Solution

minimum ℓ0-norm

  • Non-convex

  • Exponential complexity


The ℓ1-norm criterion

minimum ℓ1-norm

  • Sparse (in common situations): Donoho, Elad (2005)

  • Convex

  • Polynomial complexity
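A minimum ℓ1-norm solution of an underdetermined linear system can be found by linear programming. The sketch below uses the standard split of the unknown into positive and negative parts and SciPy's `linprog`; the function name `min_l1_solution`, the matrix sizes, and the synthetic data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_solution(V, a):
    """Solve min ||w||_1 subject to V @ w = a as a linear program.

    Uses the split w = p - q with p, q >= 0, so that the LP cost
    sum(p) + sum(q) equals ||w||_1 at the optimum.
    """
    n = V.shape[1]
    cost = np.ones(2 * n)
    A_eq = np.hstack([V, -V])               # V @ (p - q) = a
    res = linprog(cost, A_eq=A_eq, b_eq=a, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n] - res.x[n:]

# Synthetic example (assumed sizes): 30 frames, 200 pixels, 3 "sounding" pixels.
rng = np.random.default_rng(1)
n_frames, n_pixels = 30, 200
V = rng.standard_normal((n_frames, n_pixels))
w_true = np.zeros(n_pixels)
w_true[[5, 50, 120]] = [1.0, -2.0, 1.5]     # hypothetical sparse weights
a = V @ w_true

w_l1 = min_l1_solution(V, a)
print("entries above 1e-6:", int(np.sum(np.abs(w_l1) > 1e-6)))
```

The returned vector satisfies the constraints and has an ℓ1-norm no larger than that of the sparse vector used to generate the data.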


The Minimum Norm Solution

minimum ℓ2-norm

  • Solving using ℓ2-norm (pseudo-inverse, SVD, QR)

  • Energy spread
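The energy spread of the minimum ℓ2-norm solution is easy to see numerically. In this sketch the system is solved with NumPy's pseudo-inverse; the sizes and the sparse vector used to generate the data are assumptions for illustration.

```python
import numpy as np

# Synthetic example (assumed sizes): 30 frames, 200 pixels, 3 "sounding" pixels.
rng = np.random.default_rng(1)
n_frames, n_pixels = 30, 200
V = rng.standard_normal((n_frames, n_pixels))
w_true = np.zeros(n_pixels)
w_true[[5, 50, 120]] = [1.0, -2.0, 1.5]    # hypothetical sparse weights
a = V @ w_true

# Minimum l2-norm solution of the underdetermined system V @ w = a.
w_l2 = np.linalg.pinv(V) @ a

nonzero = int(np.sum(np.abs(w_l2) > 1e-6))
print(f"{nonzero} of {n_pixels} weights are non-negligible")
```

The solution reproduces the audio exactly, but its energy is spread over many pixels instead of being concentrated on the few that generated the data.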


Audio-visual events

  • Maximum correlation: eigenproblem

  • Minimum objective function G: linear programming

  • Fully correlated

  • Sparse

  • Polynomial

  • No parameters to tweak


Multiple Audio Bands - Solution

The optimization problem:

  • Non-convex constraint

  • ℓ1-ball

    • Convex

    • Linear


Multiple Audio Bands

[Figure: faces S1, S2, S3, S4.]

  • Optimization over each face is linear programming

  • No parameters to tweak
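One way to read "each face: linear programming" is sketched below. The transcript does not preserve the optimization problem itself, so the formulation here is an assumption: minimize the ℓ1-norm of the visual weights `w_v` subject to `V @ w_v = A @ w_a`, with the audio weights `w_a` constrained to the surface of the ℓ1-ball (ℓ1-norm equal to 1). That constraint is non-convex, but fixing the sign of each audio weight selects one face, on which the problem becomes a linear program. The function name `solve_over_faces` and the data are hypothetical.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def solve_over_faces(V, A):
    """Solve one LP per face of the l1-ball of audio weights; keep the best.

    Assumed problem: min ||w_v||_1  s.t.  V @ w_v = A @ w_a,  ||w_a||_1 = 1.
    Returns (objective, w_v, w_a) for the face with the smallest objective.
    """
    n_v, n_a = V.shape[1], A.shape[1]
    best = None
    for signs in itertools.product([1.0, -1.0], repeat=n_a):
        s = np.array(signs)                 # sign pattern that defines this face
        # Unknowns [p, q, u], all >= 0, with w_v = p - q and w_a = s * u.
        cost = np.concatenate([np.ones(2 * n_v), np.zeros(n_a)])
        A_eq = np.vstack([
            np.hstack([V, -V, -A * s]),     # V @ w_v - A @ w_a = 0
            np.concatenate([np.zeros(2 * n_v), np.ones(n_a)])[None, :],  # sum(u) = 1
        ])
        b_eq = np.concatenate([np.zeros(V.shape[0]), [1.0]])
        res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        if res.success and (best is None or res.fun < best[0]):
            x = res.x
            best = (res.fun, x[:n_v] - x[n_v:2 * n_v], s * x[2 * n_v:])
    return best

# Synthetic example (assumed sizes): 30 frames, 120 pixels, 2 audio bands (4 faces).
rng = np.random.default_rng(2)
V = rng.standard_normal((30, 120))
A = rng.standard_normal((30, 2))
fun, w_v, w_a = solve_over_faces(V, A)
```

With two audio bands there are four sign patterns, matching the four faces S1 to S4. The number of faces doubles with every added band, so this enumeration is practical only for a few bands.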


Sharp & Dynamic, Despite Distraction

[Figure: frames 9, 42, 68, 115, 146, 169.]


Performing in Audio Noise

[Figure: frames 51, 83, 106, 177.]

  • Sparse

  • Localization on the proper elements

  • False alarm – temporally inconsistent

  • Handling dynamics


ℓ2-norm: Energy Spread

[Figure: Movie #1, frame 146; Movie #2, frame 83.]


ℓ1-norm: Localization

[Figure: Movie #1, frame 146; Movie #2, frame 83.]


The “Chorus Ambiguity”

Synchronized talk

Who’s talking?

  • Possible solutions:

    • Left

    • Right

    • Both

Not unique (ambiguous)


The “Chorus Ambiguity”

[Figure: two panels, each with axes feature 1 and feature 2 and each labeled with a norm whose index is lost; one solution is labeled “Both”.]

