Learning Long-Term Temporal Features

A Comparative Study

Barry Chen

Speech Lunch Talk


Log-Critical Band Energies

Log-Critical Band Energies

Conventional

Feature Extraction

Log-Critical Band Energies

TRAPS/HATS

Feature Extraction

What is a TRAP? (Background Tangent)

  • TRAPs were originally developed by our colleagues at OGI: Sharma, Jain (now at SRI), Hermansky and Sivadas (both now at IDIAP)

  • Stands for TempRAl Pattern

  • TRAP = the speech energy pattern in a narrow frequency band over a period of time (usually 0.5 – 1 second long)
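
As a rough illustration of the definition above, a TRAP can be read as one column slice of a log critical band energy matrix. The sketch below is not the original implementation: the function name `extract_trap`, the 25-frame context on each side (51 frames in total) and the edge handling are assumptions made for the example, and the input is random toy data.

```python
import numpy as np

def extract_trap(log_energies, band, center, context=25):
    """Return the temporal pattern of one band around one frame.

    log_energies: array of shape (num_frames, num_bands)
    context: frames taken on each side of the center frame (assumed value)
    """
    num_frames = log_energies.shape[0]
    # Clamp indices so windows near the edges repeat the first/last frame.
    idx = np.clip(np.arange(center - context, center + context + 1),
                  0, num_frames - 1)
    return log_energies[idx, band]

# Toy input: 200 frames x 15 critical bands of made-up log energies.
rng = np.random.default_rng(0)
log_energies = rng.normal(size=(200, 15))

trap = extract_trap(log_energies, band=4, center=100)
print(trap.shape)  # (51,)
```

With a 10 ms frame step (also an assumption), 51 frames cover roughly half a second, the lower end of the range quoted above.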

Example of TRAPS

Mean Temporal Patterns for 45 phonemes at 500 Hz

TRAPS Motivation

  • Psychoacoustic studies suggest that the human peripheral auditory system integrates information over a longer time scale

  • Information measurements (joint mutual information) show that information still exists more than 100 ms away within a single critical band

  • Potential robustness to speech degradations

Let’s Explore

  • TRAPS and HATS are examples of a specific two-stage approach to learning long-term temporal features

  • Is this constrained two-stage approach better than an unconstrained one-stage approach?

  • Are the non-linear transformations of critical band trajectories, provided in different ways by TRAPS and HATS, actually necessary?
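
To make the contrast between the two approaches concrete, the sketch below compares only the shapes of the inputs each one hands to its final classifier. The 15 bands and 51 frames come from the "15 Bands x 51 Frames" system named later in the talk; the per-band output size of 40 and the random matrices standing in for learned per-band transforms are assumptions for illustration.

```python
import numpy as np

num_bands, num_ctx_frames = 15, 51
band_dim = 40  # hypothetical size of each per-band output

rng = np.random.default_rng(0)
# One analysis window of toy log critical band energies: frames x bands.
window = rng.normal(size=(num_ctx_frames, num_bands))

# One-stage: a single classifier sees every band and every frame at once.
one_stage_input = window.reshape(-1)

# Two-stage: each band trajectory goes through its own transform first,
# and only the per-band outputs are concatenated for the second stage.
band_transforms = [rng.normal(size=(num_ctx_frames, band_dim))
                   for _ in range(num_bands)]
stage_one = [window[:, b] @ band_transforms[b] for b in range(num_bands)]
two_stage_input = np.concatenate(stage_one)

print(one_stage_input.shape)  # (765,)
print(two_stage_input.shape)  # (600,)
```

The constraint in the two-stage case is that no first-stage transform ever sees more than one band.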

Learn Everything in One Step

Learn in Individual Bands

One-Stage Approach

2-Stage Linear Approaches

PCA/LDA Comments

  • PCA on log critical band energy trajectories scales and rotates the dimensions along the directions of highest variance

  • LDA projects onto directions that maximize class separability, measured as the ratio of between-class covariance to within-class covariance

  • Keep top 40 dimensions for comparison with MLP-based approaches
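
The two linear transforms described above can be sketched with plain NumPy. This is a minimal illustration on random toy data, not the setup used in the talk: the function names, the 45 toy classes and the sample counts are assumptions, the PCA shown only rotates and truncates (no variance scaling), and the LDA uses a simple eigendecomposition of Sw⁻¹Sb.

```python
import numpy as np

def pca_transform(X, k):
    """Project rows of X onto the k highest-variance directions."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest
    return (X - mean) @ eigvecs[:, order]

def lda_transform(X, y, k):
    """Project rows of X onto k directions maximizing between/within scatter."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))                    # within-class scatter
    Sb = np.zeros((d, d))                    # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][:k]
    return (X - mean) @ eigvecs[:, order].real

# Toy data: 45 classes, 20 trajectories each, 51 frames per trajectory.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(45), 20)
class_means = rng.normal(size=(45, 51))
X = class_means[y] + rng.normal(size=(900, 51))

Z_pca = pca_transform(X, 40)
Z_lda = lda_transform(X, y, 40)
print(Z_pca.shape, Z_lda.shape)  # (900, 40) (900, 40)
```

Note that LDA yields at most (number of classes − 1) useful directions, so keeping 40 dimensions only makes sense when there are more than 40 classes.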

2-Stage MLP-Based Approaches

MLP Comments

  • As with the other 2-stage approaches, we first learn patterns independently in separate critical band trajectories, and then learn correlations among these discriminative trajectories

  • Interpretation of various MLP layers:

    • Input to hidden weights – discriminant linear transformations

    • Hidden unit outputs – Non-linear discriminant transforms

    • Before Softmax – transforms hidden activation space to unnormalized phone probability space

    • Output Activations – critical band phone probabilities
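
The four layer interpretations listed above correspond to four points at which a single-band MLP can be tapped. The forward pass below is a hedged sketch with random weights: the layer sizes (51 inputs, 40 hidden units, 45 outputs) and the name `band_mlp_taps` are assumptions, and no training is shown.

```python
import numpy as np

def band_mlp_taps(trap, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer MLP, returning every tap point."""
    pre_sigmoid = trap @ W1 + b1                  # input-to-hidden: linear transform
    hidden = 1.0 / (1.0 + np.exp(-pre_sigmoid))   # hidden outputs: non-linear transform
    pre_softmax = hidden @ W2 + b2                # unnormalized phone scores
    exp = np.exp(pre_softmax - pre_softmax.max()) # shift for numerical stability
    posteriors = exp / exp.sum()                  # output activations (sum to 1)
    return pre_sigmoid, hidden, pre_softmax, posteriors

# Toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 51, 40, 45
W1 = 0.1 * rng.normal(size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.normal(size=(n_hidden, n_out))
b2 = np.zeros(n_out)

trap = rng.normal(size=n_in)
pre_sigmoid, hidden, pre_softmax, posteriors = band_mlp_taps(trap, W1, b1, W2, b2)
print(hidden.shape, posteriors.shape)  # (40,) (45,)
```

Read against the variants compared later in the talk, `pre_sigmoid` would play the role of "HATS before sigmoid", `hidden` of HATS, `pre_softmax` of "TRAPS before softmax", and `posteriors` of TRAPS.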

Experimental Setup

  • Training: ~68 hours of conversational telephone speech from English CallHome, Switchboard I, and Switchboard Cellular

    • 1/10 used for cross-validation set for MLPs

  • Testing: 2001 Hub-5 Evaluation Set (Eval2001)

    • 2,255,609 frames and 62,890 words

  • Back-end recognizer: SRI’s Decipher System. 1st pass decoding using a bigram language model and within-word triphone acoustic models (thanks to Andreas Stolcke for all his help)

Frame Accuracy Performance

Standalone Feature System

  • Transform MLP outputs by:

    • log transform to make features more Gaussian

    • PCA for decorrelation

  • Same as Tandem setup introduced by Hermansky, Ellis, and Sharma

  • Use transformed MLP outputs as front-end features for the SRI recognizer
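
The two-step transform above (log, then PCA) can be sketched as follows. This is an illustration under assumptions rather than the actual pipeline: the function name `tandem_features`, the probability floor and the toy posteriors are made up, and the PCA is estimated on the same data it transforms, whereas a real system would estimate it on training data.

```python
import numpy as np

def tandem_features(posteriors, k=None, floor=1e-10):
    """Log-transform MLP posteriors, then decorrelate them with PCA.

    posteriors: array of shape (num_frames, num_classes)
    k: number of leading components to keep (None keeps all)
    """
    logp = np.log(np.maximum(posteriors, floor))  # floor avoids log(0)
    mean = logp.mean(axis=0)
    cov = np.cov(logp - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # highest variance first
    if k is not None:
        order = order[:k]
    return (logp - mean) @ eigvecs[:, order]

# Toy posteriors: softmax over random scores, 500 frames x 45 classes.
rng = np.random.default_rng(0)
scores = rng.normal(size=(500, 45))
posteriors = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

feats = tandem_features(posteriors)
print(feats.shape)  # (500, 45)
```

After the PCA step the feature dimensions are uncorrelated on the data used to estimate it, which suits back ends that model features with diagonal-covariance Gaussians.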

Standalone Features

Combination with State-of-the-Art Front-End Feature

  • SRI’s 2003 PLP front-end feature is 12th-order PLP with three deltas. Heteroscedastic linear discriminant analysis (HLDA) then transforms this 52-dimensional feature vector into the 39-dimensional HLDA(PLP+3d)

  • Concatenate PCA truncated MLP features to HLDA(PLP+3d) and use as augmented front-end feature

    • Similar to Qualcomm-ICSI-OGI features in AURORA
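
The dimension bookkeeping for the augmented feature can be checked with a short sketch. Only the 52 and 39 come from the slide; reading 52 as 13 static coefficients times four streams (static plus three deltas) is an assumption, as is the choice of 25 PCA-truncated MLP dimensions. The random matrix merely stands in for a trained HLDA transform.

```python
import numpy as np

num_frames = 100
plp_dim = 13 * 4   # assumed: 13 static coefficients x (static + three deltas) = 52
hlda_dim = 39      # from the slide
mlp_dim = 25       # hypothetical number of PCA-truncated MLP dimensions

rng = np.random.default_rng(0)
plp = rng.normal(size=(num_frames, plp_dim))

# Stand-in for a trained HLDA projection (52 -> 39); random here.
hlda_matrix = rng.normal(size=(plp_dim, hlda_dim))
hlda_plp = plp @ hlda_matrix

# Stand-in for the PCA-truncated MLP features of the same frames.
mlp_feats = rng.normal(size=(num_frames, mlp_dim))

# Frame-by-frame concatenation gives the augmented front-end feature.
augmented = np.concatenate([hlda_plp, mlp_feats], axis=1)
print(augmented.shape)  # (100, 64)
```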

Ranking Table

Observations

  • Across the three testing setups:

    • HATS is always #1

    • The one-stage 15 Bands x 51 Frames system is always #6 (second to last)

    • TRAPS is always last

    • PCA, LDA, HATS before sigmoid, and TRAPS before softmax trade places in the rankings

Interpretation

  • The learning constraints introduced by the 2-stage approach are helpful if done right.

  • Non-linear discriminant transform of HATS is better than linear discriminant transforms from LDA and HATS before sigmoid

  • The further mapping from hidden activations to critical-band phone posteriors is not helpful

    • Perhaps, mapping to critical-band phones is too difficult and inherently noisy

  • Finally, like TRAPS, HATS is complementary to the more conventional features and combines synergistically with PLP 9 Frames.

Frame Accuracy Performance

Standalone Features WER
