
# Less is More?

Yi Wu. Advisor: Alex Rudnicky.

Presentation Transcript

### Less is More?

Yi Wu

People: “There is no data like more data!”

### Goal: Use less to perform more
• Identify an informative subset of a large corpus for acoustic model (AM) training.
• Expectations for the selected set:
  • Good performance
  • Fast selection

### Motivation
• System improvement becomes increasingly smaller as we keep adding training data.
• Training an acoustic model is time consuming.
• We need guidance on which data is most needed.

### Approach Overview
• Applies to well-transcribed data
• Selection is based on the transcriptions
• Choose a subset that has a “uniform” distribution over speech units (word, phoneme, character)

### How to sample wisely? A simplified example
• We are free to choose how many samples to draw from each class.
• We train the model using the maximum likelihood estimator (MLE).
• When a new sample is generated, we use our model to determine its class.

Question: How should we sample to achieve the minimum error?

### The optimal Bayes classifier

If we had the exact form of $f_i(x)$, the Bayes classification rule would be optimal.
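
For reference, the Bayes decision rule is conventionally written as below. The prior symbol $\pi_i$ and the class count $k$ are notation assumed here, not taken from the slides; $f_i(x)$ is the class-conditional density of class $i$.

$$
\hat{c}(x) = \arg\max_{i \in \{1,\dots,k\}} \pi_i \, f_i(x)
$$

When the true $\pi_i$ and $f_i$ are used, no classifier has a lower expected error rate than this rule.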

### To approximate the optimal
• We use our MLE estimates of $f_i(x)$.
• The true error is bounded by the optimal Bayes error plus an error bound for our worst-estimated class.

### Sample Uniformly
• We want to sample each class equally.
• The selected data will then have good coverage of each class.
• This gives a robust estimate for each class.
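
The coverage argument can be illustrated with a small sketch. It assumes, purely for illustration, that each class is a one-dimensional Gaussian with the same known standard deviation `sigma`, so that the standard error of the MLE mean for a class with `n` samples is `sigma / sqrt(n)`. The function name and the two sample allocations are hypothetical.

```python
import math

def worst_case_std_error(counts, sigma=1.0):
    """Largest standard error of the MLE mean over all classes.

    Assumes each class is Gaussian with known std `sigma` (an
    illustrative assumption, not something stated in the slides).
    """
    return max(sigma / math.sqrt(n) for n in counts)

# Two ways to spend the same budget of 300 samples on three classes.
uniform = [100, 100, 100]
skewed = [250, 40, 10]

print(worst_case_std_error(uniform))  # worst class under equal sampling
print(worst_case_std_error(skewed))   # worst class under skewed sampling
```

With a fixed budget, the equal split gives the smallest worst-case standard error, because any unequal split leaves at least one class with fewer samples.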

### Data Selection for ASR System
• The prior is estimated independently by the language model.
• To make the acoustic model accurate, we want to sample W uniformly.
• The unit can be the phoneme, character, or word; we want its distribution in the selected data to be uniform.
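
The role of the prior can be made explicit with the usual ASR decoding rule. This is a sketch in standard notation: $X$ (the acoustic observations) is a symbol assumed here, and $W$ is read as the hypothesized word (or unit) sequence.

$$
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)
$$

The language model supplies $P(W)$ and the acoustic model supplies $P(X \mid W)$, so in the classification analogy $W$ plays the role of the class.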
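
### Entropy: a measure of “uniformness”
• Use the entropy of the word (phoneme) distribution as the evaluation measure.
• Suppose the words (phonemes) have the sample distribution $p_1, p_2, \ldots, p_n$.
• Choose the subset that maximizes $H(p) = -\sum_{i=1}^{n} p_i \log p_i$.
• Maximizing entropy is equivalent to minimizing the KL divergence from the uniform distribution $u$ over the $n$ units, since $H(p) = \log n - D_{\mathrm{KL}}(p \,\|\, u)$.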
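
A minimal sketch of the entropy measure and its relation to the KL divergence from uniform follows. It assumes words are the unit and that transcripts are whitespace-tokenized; the function names and the toy transcripts are hypothetical.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def kl_from_uniform(probs):
    """KL divergence D(p || u), with u uniform over len(probs) outcomes."""
    n = len(probs)
    return sum(p * math.log(p * n) for p in probs if p > 0)

def unit_distribution(transcripts):
    """Relative frequency of each whitespace-separated unit."""
    counts = Counter(unit for t in transcripts for unit in t.split())
    total = sum(counts.values())
    return [c / total for c in counts.values()]

probs = unit_distribution(["a b c", "a a b"])
# The two quantities always sum to log(n).
print(entropy(probs) + kl_from_uniform(probs), math.log(len(probs)))
```

Because the two quantities sum to the constant `log(n)`, ranking subsets by entropy gives the same order as ranking them by closeness to uniform.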

### Computational Issue
• It is computationally intractable to find the transcription set that maximizes the entropy.
• Use a forward greedy search instead.
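
One way a forward greedy search could look is sketched below. This is not the authors' implementation: it assumes the budget is a number of utterances, that words are the unit, and that transcripts are whitespace-tokenized. All names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (natural log) of a Counter of unit counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

def greedy_select(transcripts, budget):
    """Repeatedly add the utterance that maximizes the entropy of the
    pooled unit counts, until `budget` utterances are chosen."""
    unit_lists = [t.split() for t in transcripts]
    remaining = list(range(len(transcripts)))
    selected, counts = [], Counter()
    for _ in range(min(budget, len(transcripts))):
        best_idx, best_h = None, -1.0
        for i in remaining:
            h = entropy(counts + Counter(unit_lists[i]))
            if h > best_h:
                best_idx, best_h = i, h
        selected.append(best_idx)
        counts.update(unit_lists[best_idx])
        remaining.remove(best_idx)
    return selected, entropy(counts)

transcripts = ["a a a a", "a a b", "c d", "a a a"]
selected, h = greedy_select(transcripts, budget=2)
print(selected, h)
```

Each step rescans every remaining utterance, so this sketch costs on the order of budget times corpus size entropy evaluations. That is far cheaper than searching all subsets, but a corpus of hundreds of hours would likely need incremental entropy updates.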

### Combination
• There are multiple entropies we want to maximize.
• Combination method:
  • Weighted sum
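
A weighted-sum combination might be sketched as follows. The choice of units (word and character) and the equal weights are assumptions for illustration; the slides do not give the weights that were used.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (natural log) of a Counter of unit counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

def combined_score(counts_by_unit, weights):
    """Weighted sum of the entropies of several unit distributions."""
    return sum(weights[name] * entropy(counts)
               for name, counts in counts_by_unit.items())

text = "the cat sat on the mat"
counts_by_unit = {
    "word": Counter(text.split()),
    "character": Counter(text.replace(" ", "")),
}
weights = {"word": 0.5, "character": 0.5}  # hypothetical weights
score = combined_score(counts_by_unit, weights)
print(score)
```

A single scalar score like this lets the same greedy search be reused unchanged, with the weights controlling how much each unit type matters.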

### Experiment Setup
• System: Sphinx III
• Features: 39-dimensional MFCC
• Training corpus: Chinese BN 97 (30 hr) + GALE Y1 (810 hr)
• Test set: RT04 (60 min)

### Summary
• Choose data so that the speech units are uniformly distributed
• Maximize entropy using a greedy algorithm