
An Auditory Scene Analysis Approach to Speech Segregation

DeLiang Wang

Perception and Neurodynamics Lab

The Ohio State University



Outline of presentation

  • Introduction

    • Speech segregation problem

    • Auditory scene analysis (ASA) approach

  • Voiced speech segregation based on pitch tracking and amplitude modulation analysis

    • Ideal binary mask as CASA goal

  • Unvoiced speech segregation

    • Auditory segmentation

  • Neurobiological basis of ASA



Real-world audition

What?

  • Source type

    • Speech: message; speaker (age, gender, linguistic origin, mood, …)

    • Music

    • Car passing by

Where?

  • Left, right, up, down

  • How close?

Channel characteristics

  • Environment characteristics

    • Room configuration

    • Ambient noise



Humans versus machines

  • Car noise is not a very effective speech masker

    • At 10 dB

    • At 0 dB

  • Human word error rate at 0 dB SNR is around 1%, as opposed to 100% for unmodified recognisers (around 40% with noise adaptation)

Source: Lippmann (1997)



Speech segregation problem

  • In a natural environment, speech is usually corrupted by acoustic interference. Speech segregation is critical for many applications, such as automatic speech recognition and hearing prosthesis

  • Most speech separation techniques, e.g. beamforming and blind source separation via independent component analysis, require multiple sensors. However, such techniques have clear limitations:

    • They assume a stationary sensor configuration

    • They cannot deal with single-microphone mixtures or situations where multiple sounds arrive from directions close to one another

  • Most speech enhancement methods developed for the monaural situation can deal only with stationary acoustic interference



Auditory scene analysis (Bregman’90)

  • Listeners are able to parse the complex mixture of sounds arriving at the ears in order to retrieve a mental representation of each sound source

    • Ball-room problem, Helmholtz, 1863 (“complicated beyond conception”)

    • Cocktail-party problem, Cherry’53

  • Two conceptual processes of auditory scene analysis (ASA):

    • Segmentation. Decompose the acoustic mixture into sensory elements (segments)

    • Grouping. Combine segments into groups, so that segments in the same group are likely to have originated from the same environmental source



Computational auditory scene analysis

  • Computational ASA (CASA) systems approach sound separation based on ASA principles

    • Weintraub’85, Cooke’93, Brown & Cooke’94, Ellis’96, Wang & Brown’99

  • CASA progress: Monaural segregation with minimal assumptions

  • CASA challenges

    • Broadband high-frequency mixtures

    • Reliable pitch tracking of noisy speech

    • Unvoiced speech



Outline of presentation

  • Introduction

    • Speech segregation problem

    • Auditory scene analysis (ASA) approach

  • Voiced speech segregation based on pitch tracking and amplitude modulation analysis

    • Ideal binary mask as CASA goal

  • Unvoiced speech segregation

    • Auditory segmentation

  • Neurobiological basis of ASA



Resolved and unresolved harmonics

  • For voiced speech, lower harmonics are resolved while higher harmonics are not

  • For unresolved harmonics, the envelopes of filter responses fluctuate at the fundamental frequency of speech

  • Our model (Hu & Wang’04) applies different grouping mechanisms for low-frequency and high-frequency signals:

    • Low-frequency signals are grouped based on periodicity and temporal continuity

    • High-frequency signals are grouped based on amplitude modulation (AM) and temporal continuity
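A minimal numerical check of the AM cue above (parameter values are illustrative, not from the slides): two adjacent unresolved harmonics summed within one wide high-frequency filter produce an envelope that fluctuates at the fundamental, even though neither component lies at F0.

```python
import numpy as np
from scipy.signal import hilbert

fs = 16000                       # sample rate (Hz)
f0 = 200                         # fundamental frequency (Hz)
t = np.arange(0, 0.05, 1 / fs)   # 50 ms of signal

# Two adjacent high harmonics (10th and 11th) falling within one wide
# high-frequency auditory filter; their sum beats at f0
x = np.cos(2 * np.pi * 10 * f0 * t) + np.cos(2 * np.pi * 11 * f0 * t)

# Envelope via the analytic signal; its dominant fluctuation rate is f0
env = np.abs(hilbert(x))
spec = np.abs(np.fft.rfft(env - env.mean()))
freqs = np.fft.rfftfreq(env.size, 1 / fs)
print(f"Envelope fluctuates at {freqs[spec.argmax()]:.0f} Hz")  # ~200 Hz
```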



Diagram of the Hu-Wang model



Cochleogram: Auditory peripheral model

Spectrogram

  • Plot of log energy across time and frequency (linear frequency scale)

Cochleogram

  • Cochlear filtering by the gammatone filterbank (or another model of cochlear filtering), followed by a stage of nonlinear rectification; the latter corresponds to hair cell transduction, modeled by either a hair cell model or a simple compression operation (log or cube root)

  • Quasi-logarithmic frequency scale, and filter bandwidth is frequency-dependent

  • Previous work suggests better resilience to noise than the spectrogram
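A rough sketch of a cochleogram along these lines, assuming 4th-order gammatone filters with center frequencies on the ERB-rate scale, half-wave rectification, and cube-root compression of frame energies; the function names and parameter values here are mine, not the Hu-Wang implementation.

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth (Glasberg & Moore, 1990)."""
    return 24.7 * (4.37 * f / 1000 + 1)

def gammatone_ir(fc, fs, duration=0.05, order=4):
    """Impulse response of a 4th-order gammatone filter at fc."""
    t = np.arange(0, duration, 1 / fs)
    b = 1.019 * erb(fc)
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / (np.abs(g).max() + 1e-12)

def cochleogram(x, fs, n_channels=32, fmin=80.0, fmax=5000.0, frame=0.02):
    """Gammatone filtering, half-wave rectification, and cube-root
    compression of frame energies -> (n_channels, n_frames) map."""
    # Center frequencies spaced uniformly on the ERB-rate scale
    erb_rate = lambda f: 21.4 * np.log10(4.37 * f / 1000 + 1)
    inv = lambda e: (10 ** (e / 21.4) - 1) * 1000 / 4.37
    cfs = inv(np.linspace(erb_rate(fmin), erb_rate(fmax), n_channels))
    hop = int(frame * fs)
    n_frames = len(x) // hop
    cg = np.zeros((n_channels, n_frames))
    for ch, fc in enumerate(cfs):
        # Filter, then half-wave rectify (crude hair cell stage)
        y = np.maximum(np.convolve(x, gammatone_ir(fc, fs), mode="same"), 0)
        for m in range(n_frames):
            cg[ch, m] = np.cbrt(np.mean(y[m * hop:(m + 1) * hop] ** 2))
    return cg, cfs
```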



Mid-level auditory representations

  • Mid-level representations form the basis for segment formation and subsequent grouping

  • Correlogram extracts periodicity and AM from simulated auditory nerve firing patterns

  • Summary correlogram is used to identify global pitch

  • Cross-channel correlation between adjacent correlogram channels identifies regions that are excited by the same harmonic or formant



Correlogram

  • Short-term autocorrelation of the output of each frequency channel of the cochleogram

  • Peaks in summary correlogram indicate pitch periods (F0)

  • A standard model of pitch perception

Correlogram & summary correlogram of a double vowel, showing F0s
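A minimal sketch of the correlogram and summary correlogram just described; frame length, lag window, and pitch search range are illustrative assumptions.

```python
import numpy as np

def correlogram(responses, fs, frame=0.02, max_lag_ms=12.5):
    """Short-term autocorrelation of each cochlear channel.
    responses: (n_channels, n_samples) rectified filter outputs."""
    hop = int(frame * fs)
    max_lag = int(max_lag_ms * fs / 1000)
    n_ch, n_samp = responses.shape
    n_frames = n_samp // hop
    acg = np.zeros((n_frames, n_ch, max_lag))
    for m in range(n_frames):
        for c in range(n_ch):
            s = responses[c, m * hop:(m + 1) * hop]
            s = s - s.mean()
            full = np.correlate(s, s, mode="full")[len(s) - 1:]
            acg[m, c] = full[:max_lag] / (full[0] + 1e-12)  # normalize at lag 0
    return acg

def global_pitch(acg, fs, fmin=80.0, fmax=400.0):
    """Summary correlogram: sum across channels; the largest peak within
    a plausible pitch range gives each frame's pitch period."""
    summary = acg.sum(axis=1)                      # (n_frames, max_lag)
    lo, hi = int(fs / fmax), int(fs / fmin)
    lags = lo + summary[:, lo:hi].argmax(axis=1)   # lag in samples
    return fs / lags                               # pitch estimates in Hz
```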



Cross-channel correlation

(a) Correlogram and cross-channel correlation of hair cell response to clean speech

(b) Corresponding representations for response envelopes
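In code, cross-channel correlation can be sketched as the normalized inner product of adjacent channels' autocorrelation patterns (a simplified reading of the description above; the same function applies to envelope autocorrelations for panel (b)):

```python
import numpy as np

def cross_channel_correlation(acg):
    """Correlation between the normalized autocorrelation patterns of
    adjacent channels; values near 1 suggest the two channels are
    excited by the same harmonic or formant.
    acg: (n_frames, n_channels, n_lags) correlogram."""
    a = acg - acg.mean(axis=2, keepdims=True)
    a = a / (np.linalg.norm(a, axis=2, keepdims=True) + 1e-12)
    # Inner product of each channel's pattern with its upper neighbor's
    return np.sum(a[:, :-1, :] * a[:, 1:, :], axis=2)
```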



Initial segregation

  • Segments are formed based on temporal continuity and cross-channel correlation

  • Segments generated in this stage tend to reflect resolved harmonics, but not unresolved ones

  • Initial grouping into a foreground (target) stream and a background stream according to global pitch using the oscillatory correlation model of Wang and Brown (1999)



Pitch tracking

  • Pitch periods of target speech are estimated from the segregated speech stream

  • Estimated pitch periods are checked and re-estimated using two psychoacoustically motivated constraints:

    • Target pitch should agree with the periodicity of the time-frequency units in the initial speech stream

    • Pitch periods change smoothly, thus allowing for verification and interpolation
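A toy sketch of the smoothness constraint; the relative-change threshold and the linear interpolation scheme are assumptions for illustration, not the model's actual verification rules.

```python
import numpy as np

def smooth_pitch_track(periods, valid, max_rel_change=0.2):
    """Enforce the smoothness constraint on a pitch-period track.
    periods: per-frame pitch-period estimates (in samples); valid:
    flags from the periodicity check against the initial speech stream.
    Frames failing either test are re-estimated by linear interpolation
    between trusted neighbors (assumes at least one trusted frame)."""
    p = np.asarray(periods, dtype=float).copy()
    trusted = np.asarray(valid, dtype=bool).copy()
    for m in range(1, len(p)):
        # Reject abrupt jumps: pitch periods should change smoothly
        if trusted[m] and trusted[m - 1] and \
           abs(p[m] - p[m - 1]) > max_rel_change * p[m - 1]:
            trusted[m] = False
    bad = ~trusted
    p[bad] = np.interp(np.flatnonzero(bad), np.flatnonzero(trusted), p[trusted])
    return p
```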



Pitch tracking example

(a) Global pitch (Line: pitch track of clean speech) for a mixture of target speech and ‘cocktail-party’ intrusion

(b) Estimated target pitch



T-F unit labeling

  • In the low-frequency range:

    • A time-frequency (T-F) unit is labeled by comparing the periodicity of its autocorrelation with the estimated target pitch

  • In the high-frequency range:

    • Due to their wide bandwidths, high-frequency filters respond to multiple harmonics. These responses are amplitude modulated due to beats and combination tones (Helmholtz, 1863)

    • A T-F unit in the high-frequency range is labeled by comparing its AM repetition rate with the estimated target pitch
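A hedged sketch of the labeling rule: compare the autocorrelation (low frequency) or envelope autocorrelation (high frequency) at the estimated pitch lag against its maximum. The threshold value is illustrative, not taken from the model.

```python
import numpy as np

def label_unit(acf, env_acf, pitch_lag, low_freq_channel, theta=0.85):
    """Label one T-F unit as target (1) or interference (0).
    acf: normalized autocorrelation of the unit's response;
    env_acf: autocorrelation of its envelope; pitch_lag: estimated
    target pitch period in lag samples; theta: illustrative threshold."""
    if low_freq_channel:
        # Low frequency: the response itself should be periodic at the
        # target pitch period
        score = acf[pitch_lag] / (acf.max() + 1e-12)
    else:
        # High frequency: the AM (envelope) rate should match the pitch
        score = env_acf[pitch_lag] / (env_acf.max() + 1e-12)
    return int(score > theta)
```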



AM example

(a) The output of a gammatone filter (center frequency: 2.6 kHz) in response to clean speech

(b) The corresponding autocorrelation function



AM repetition rates

  • To obtain AM repetition rates, a filter response is half-wave rectified and bandpass filtered

  • The resulting signal within a T-F unit is modeled by a single sinusoid using the gradient descent method. The frequency of the sinusoid indicates the AM repetition rate of the corresponding response
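One way to realize the sinusoid fit, as a sketch: for a fixed frequency the amplitude and phase (here the sine/cosine coefficients a and b) have a closed-form least-squares solution, while the frequency itself is refined by gradient descent. The learning rate and iteration count are illustrative and untuned.

```python
import numpy as np

def fit_am_rate(y, fs, f_init, n_iter=100, lr=10.0):
    """Fit y(t) ~ a*sin(2*pi*f*t) + b*cos(2*pi*f*t) within one T-F unit.
    f_init would come from the current pitch estimate; the returned f
    is the AM repetition rate of the response."""
    t = np.arange(len(y)) / fs
    f = float(f_init)
    for _ in range(n_iter):
        s, c = np.sin(2 * np.pi * f * t), np.cos(2 * np.pi * f * t)
        B = np.stack([s, c], axis=1)
        (a, b), *_ = np.linalg.lstsq(B, y, rcond=None)  # closed-form a, b
        r = y - (a * s + b * c)                   # residual
        dmodel_df = 2 * np.pi * t * (a * c - b * s)
        grad = -2.0 * np.sum(r * dmodel_df)       # d(sum r^2)/df
        f -= lr * grad / len(y)                   # lr is illustrative
    return f
```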



Final segregation

  • New segments corresponding to unresolved harmonics are formed based on temporal continuity and cross-channel correlation of response envelopes (i.e. common AM). Then they are grouped into the foreground stream according to AM repetition rates

  • Other units are grouped according to temporal and spectral continuity



Ideal binary mask for performance evaluation

  • Within a T-F unit, the ideal binary mask is 1 if target energy is stronger than interference energy, and 0 otherwise

  • Motivation: Auditory masking - stronger signal masks weaker one within a critical band

  • We have suggested using the ideal binary mask as the ground truth for CASA performance evaluation

    • Consistent with recent speech intelligibility results (Roman et al.’03; Brungart et al.’05)
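The definition translates directly into code; the 0 dB local criterion below mirrors the "target energy stronger than interference energy" rule above (the `lc_db` parameter name is mine).

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    """Ideal binary mask: 1 where the premixed target is stronger than
    the interference within a T-F unit. target_energy / noise_energy:
    cochleagram energies of the unmixed signals; lc_db = 0 corresponds
    to the 'stronger than' criterion."""
    eps = 1e-12
    local_snr = 10 * np.log10((target_energy + eps) / (noise_energy + eps))
    return (local_snr > lc_db).astype(np.uint8)
```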



Ideal binary mask illustration



Voiced speech segregation example



Systematic SNR results

  • Evaluation on a corpus of 100 mixtures (Cooke, 1993): 10 voiced utterances x 10 noise intrusions (see next slide)

  • Average SNR gain: 12.3 dB; 5.2 dB better than the Wang-Brown model (1999), and 6.4 dB better than the spectral subtraction method

(Figure: SNR results in dB for the Hu-Wang model)



CASA progress on voiced speech segregation

  • 100 mixture set used by Cooke (1993)

    • 10 voiced utterances mixed with 10 noise intrusions (N0: tone, N1: white noise, N2: noise bursts, N3: ‘cocktail party’, N4: rock music, N5: siren, N6: telephone, N7: female utterance, N8: male utterance, N9: female utterance)

(Audio demos: original mixtures of voiced speech with telephone, male, and female intrusions, and the corresponding outputs of Cooke (1993), Ellis (1996), Wang & Brown (1999), and Hu & Wang (2004))



Outline of presentation

  • Introduction

    • Speech segregation problem

    • Auditory scene analysis (ASA) approach

  • Voiced speech segregation based on pitch tracking and amplitude modulation analysis

    • Ideal binary mask as CASA goal

  • Unvoiced speech segregation

    • Auditory segmentation

  • Neurobiological basis of ASA



Segmentation and unvoiced speech segregation

  • To deal with unvoiced speech segregation, we (Hu & Wang’04) proposed a model of auditory segmentation that applies to both voiced and unvoiced speech

  • The task of segmentation is to decompose an auditory scene into contiguous T-F regions, each of which should contain signal from the same sound source

    • The definition of segmentation does not distinguish between voiced and unvoiced sounds

  • This is equivalent to identifying onsets and offsets of individual T-F regions, which generally correspond to sudden changes of acoustic energy

  • The segmentation strategy is based on onset and offset analysis



Scale-space analysis for auditory segmentation

  • From a computational standpoint, auditory segmentation is similar to image (visual) segmentation

    • Visual segmentation: Finding bounding contours of visual objects

    • Auditory segmentation: Finding onset and offset fronts of segments

  • Onset/offset analysis employs scale-space theory, which is a multiscale analysis commonly used in image segmentation

    • Smoothing

    • Onset/offset detection and onset/offset front matching

    • Multiscale integration
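A simplified sketch of the scale-space idea: smooth the cochleagram intensities at several Gaussian scales, then mark peaks and valleys of the temporal derivative as onset and offset candidates. The scales and threshold are assumptions, and the model's matching of onset/offset fronts across frequency is omitted here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def onset_offset_fronts(intensity, scales=(2.0, 4.0, 8.0), thresh=0.05):
    """Multiscale onset/offset detection on a cochleagram.
    intensity: (n_channels, n_frames) log energies. At each smoothing
    scale, peaks of the temporal derivative above thresh mark onsets
    and valleys below -thresh mark offsets; marks from all scales are
    pooled (a crude stand-in for full multiscale integration)."""
    onsets = np.zeros(intensity.shape, dtype=bool)
    offsets = np.zeros(intensity.shape, dtype=bool)
    for sigma in scales:
        d = np.gradient(gaussian_filter1d(intensity, sigma, axis=1), axis=1)
        mid = d[:, 1:-1]
        peak = (mid > d[:, :-2]) & (mid > d[:, 2:]) & (mid > thresh)
        valley = (mid < d[:, :-2]) & (mid < d[:, 2:]) & (mid < -thresh)
        onsets[:, 1:-1] |= peak
        offsets[:, 1:-1] |= valley
    return onsets, offsets
```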



Example of auditory segmentation



Speech segregation

  • The general strategy for speech segregation is to first segregate voiced speech using the pitch cue, and then deal with unvoiced speech

  • To segregate unvoiced speech, we perform auditory segmentation, and then group segments that correspond to unvoiced speech



Segment classification

  • For nonspeech interference, grouping is in fact a classification task – to classify segments as either speech or non-speech

  • The following features are used for classification:

    • Spectral envelope

    • Segment duration

    • Segment intensity

  • Training data

    • Speech: Training part of the TIMIT database

    • Interference: 90 natural intrusions including street noise, crowd noise, wind, etc.

  • A Gaussian mixture model is trained for each phoneme, as well as for interference, which provides the basis for a likelihood ratio test
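A minimal sketch of the likelihood ratio test; for brevity it trains a single speech GMM on random placeholder features rather than one GMM per phoneme on TIMIT-derived features, as the model actually does.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder 3-D features standing in for spectral envelope, duration,
# and intensity; in the actual system they come from TIMIT phonemes and
# the 90 natural intrusions
speech_feats = rng.normal(0.0, 1.0, size=(500, 3))
noise_feats = rng.normal(1.5, 1.0, size=(500, 3))

speech_gmm = GaussianMixture(n_components=8, random_state=0).fit(speech_feats)
noise_gmm = GaussianMixture(n_components=8, random_state=0).fit(noise_feats)

def is_speech(x, threshold=0.0):
    """Likelihood ratio test: accept a segment as speech when its
    log-likelihood under the speech model exceeds the interference
    model's by more than `threshold`."""
    llr = speech_gmm.score_samples(x[None, :]) - noise_gmm.score_samples(x[None, :])
    return bool(llr[0] > threshold)

print(is_speech(np.array([0.1, -0.2, 0.0])))  # likely True for this toy data
```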



Example of segregating fricatives/affricates

Utterance: “That noise problem grows more annoying each day”

Interference: Crowd noise with music (IBM: Ideal binary mask)



Example of segregating stops

Utterance: “A good morrow to you, my boy”

Interference: Rain



Outline of presentation

  • Introduction

    • Speech segregation problem

    • Auditory scene analysis (ASA) approach

  • Voiced speech segregation based on pitch tracking and amplitude modulation analysis

    • Ideal binary mask as CASA goal

  • Unvoiced speech segregation

    • Auditory segmentation

  • Neurobiological basis of ASA



How does the auditory system perform ASA?

  • Information about acoustic features (pitch, spectral shape, interaural differences, AM, FM) is extracted in distributed areas of the auditory system

  • Binding problem: How are these features combined to form a perceptual whole (stream)?

    • Hierarchies of feature-detecting cells exist, but do not seem to constitute a solution to the binding problem



Oscillatory correlation theory for ASA

  • Neural oscillators are used to represent auditory features

  • Oscillators representing features of the same source are synchronized, and are desynchronized from those representing different sources

  • Originally proposed by von der Malsburg & Schneider (1986), and further developed by Wang (1996)

  • Supported by growing experimental evidence



Oscillatory correlation representation

(Figure: oscillatory correlation network; FD: feature detector)



Oscillatory correlation for ASA

  • LEGION dynamics (Terman & Wang’95) provides a computational foundation for the oscillatory correlation theory

  • The utility of oscillatory correlation has been demonstrated for speech segregation (Wang-Brown’99), modeling auditory attention (Wrigley-Brown’04), etc.
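A toy simulation of Terman-Wang relaxation oscillators with excitatory coupling: coupled oscillators tend to align their activity phases, while an uncoupled one drifts independently. Parameter values are illustrative, and the global inhibitor that LEGION uses to desynchronize different groups is omitted.

```python
import numpy as np

def terman_wang(I, W, T=2000.0, dt=0.05, eps=0.02, gamma=6.0, beta=0.1):
    """Euler simulation of Terman-Wang relaxation oscillators:
      dx/dt = 3x - x^3 + 2 - y + I + coupling
      dy/dt = eps * (gamma * (1 + tanh(x / beta)) - y)
    I: external input per oscillator; W: excitatory coupling weights."""
    rng = np.random.default_rng(1)
    n = len(I)
    x, y = rng.uniform(-2.0, 2.0, n), np.zeros(n)
    trace = []
    for _ in range(int(T / dt)):
        active = np.where(x > 0.0, 1.0, 0.0)   # Heaviside of activity
        dx = 3 * x - x ** 3 + 2 - y + I + W @ active
        dy = eps * (gamma * (1 + np.tanh(x / beta)) - y)
        x, y = x + dt * dx, y + dt * dy
        trace.append(x.copy())
    return np.array(trace)

# Oscillators 0 and 1 are mutually coupled; oscillator 2 receives the
# same input but no coupling, so its phase drifts relative to the pair
W = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
trace = terman_wang(I=np.array([0.8, 0.8, 0.8]), W=W)
```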



Summary

  • CASA approach to monaural speech segregation

  • Performs substantially better than previous CASA systems for voiced speech segregation

    • AM cue and target pitch tracking are important for performance improvement

  • Early steps for unvoiced speech segregation

    • Auditory segmentation based on onset/offset analysis

    • Segregation using speech classification

  • Oscillatory correlation theory for ASA



Acknowledgment

  • Joint work with Guoning Hu

  • Funded by AFOSR/AFRL and NSF

