Detection of Burst Onset Using Random Forest Technique and Its Application to Voice Onset Time Estimate

NGASR 2011 Summer Workshop

Speaker: 林奇嶽


Outline

  • Burst Onset Detection

    • Burst onset

    • Feature representation

    • Random forest (RF)

    • Experimental results

  • Voice Onset Time Estimate

    • Voice onset time (VOT)

    • Proposed HMM+RF system

    • Experimental results

  • Conclusion


Section I

Burst Onset Detection


Burst Onset: Fundamental phonetics

  • A stop or an affricate consonant consists of the following speech events:

    • Closure: airflow is completely blocked by certain articulators in the vocal tract (voice bar or silence).

    • Release: the blockage is suddenly released, resulting in a puff of air rushing out of the mouth.

    • Aspiration (stop) or frication (affricate)

  • The most salient event is the onset of the release, which is commonly termed the burst onset.


Burst Onset: Fundamental phonetics

  • Burst onset could be the shortest event in a speech signal.

    • A sudden increase of all-band energy appears as a stripe pattern in a Fourier-based spectrogram. This all-band energy dies out immediately.

[Figure: spectrogram examples for the words “don’t” and “carry”]


Burst Onset: Fundamental phonetics

  • To detect burst onsets in continuous speech, we focus on a small spectro-temporal patch containing a “closure-burst transition”.

[Figure: spectro-temporal patches in the words “don’t” and “carry”]


Feature representation: Two-dimensional Cepstral Coefficient

  • Two-dimensional cepstral coefficients (TDCC) are used to encode such a “closure-burst transition”.

  • In deriving TDCC for each spectro-temporal patch, we perform two discrete cosine transforms to compact the transition information into a small set of coefficients.

    • 1st DCT: cepstral analysis (along frequency axis)

    • 2nd DCT: dynamic behavior of the coefficients from the first DCT (along time axis)

    • Between the two DCTs is a cepstral mean subtraction (CMS)
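The two-DCT derivation above can be sketched as follows. This is a minimal illustration rather than the speaker's implementation: it assumes each frame is already a log-magnitude spectrum, uses an unnormalized DCT-II, and the truncation sizes `n_quef` and `n_time` are hypothetical values chosen only for the example.

```python
import math

def dct2(x):
    """Unnormalized DCT-II of a 1-D sequence."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
            for k in range(n)]

def tdcc(patch, n_quef=4, n_time=3):
    """patch: list of frames (time axis), each a list of log-spectral values
    (frequency axis). n_quef and n_time are hypothetical truncation sizes."""
    # 1st DCT along the frequency axis: cepstral analysis of each frame.
    ceps = [dct2(frame)[:n_quef] for frame in patch]
    # CMS: subtract each cepstral coefficient's mean over the frames of the patch.
    means = [sum(c[q] for c in ceps) / len(ceps) for q in range(n_quef)]
    ceps = [[c[q] - means[q] for q in range(n_quef)] for c in ceps]
    # 2nd DCT along the time axis: dynamics of each cepstral trajectory.
    return [dct2([c[q] for c in ceps])[:n_time] for q in range(n_quef)]

# Toy patch: 6 frames of 8 spectral bins each (made-up values).
patch = [[math.log(1.0 + f + 2.0 * t) for f in range(8)] for t in range(6)]
coeffs = tdcc(patch)
flat = [v for row in coeffs for v in row]  # row-major flattening into one vector
```

In this sketch the mean subtraction makes the zeroth temporal coefficient of every row vanish (up to rounding), so only the higher-order temporal coefficients carry the transition information.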


Feature representation: Two-dimensional Cepstral Coefficient

  • Similarity of dynamic feature derivation between the conventional regression formula and TDCC.

[Figure: coefficient value versus relative frame distance, shown for the derivative coefficient and the accelerative coefficient]


Feature representation: Derive TDCC from a spectro-temporal patch

  • Each frame in a patch is an LPC-derived spectrum.

    • Frame length: 10 ms (160 samples)

    • Frame shift: 2 ms (32 samples)

    • LP analysis with an order of 24. The LPC-derived spectrum is obtained with a 512-point DFT.

  • Coefficients are extracted in a row-major fashion: 55 coefficients are extracted to form a 55x1 vector.
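The frame sizes quoted above follow from the 16 kHz sampling rate of the speech material. The short sketch below checks that arithmetic and cuts a signal into overlapping frames; the helper names `ms_to_samples` and `frame_signal` are hypothetical, and dropping the incomplete final frame is an assumption of this example.

```python
def ms_to_samples(duration_ms, sample_rate_hz=16000):
    """Convert a duration in milliseconds to a number of samples."""
    return int(round(duration_ms * sample_rate_hz / 1000))

def frame_signal(samples, frame_len, frame_shift):
    """Cut a signal into overlapping frames; an incomplete last frame is dropped."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, frame_shift)]

frame_len = ms_to_samples(10)    # 10 ms frame length
frame_shift = ms_to_samples(2)   # 2 ms frame shift
frames = frame_signal([0.0] * 16000, frame_len, frame_shift)  # one second of audio
```

With a 2 ms shift, consecutive 10 ms frames overlap by 8 ms, which gives the fine time resolution that a very short event such as a burst onset requires.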


Feature representation: Waveform and Feature Plane

[Figure: closure-burst transition patterns for detecting burst onsets]


Random forest: Fundamental

  • A random forest (RF) combines the following techniques:

    • An ensemble of classifiers

      • RF is an ensemble of tree classifiers

    • Bootstrapping and aggregating (bagging)

      • Generate multiple training sets for tree classifiers

      • Final decision is made by a plurality vote (majority vote)

    • Random subspace

      • Introduce randomness during node splitting.


Random forest: Fundamental

  • RF construction procedure

    • Bootstrap a training set for each tree classifier.

    • Grow one tree and add it to the forest. This step terminates when the specified number of trees is reached.

      • When searching for an optimal cut, consider only a few randomly selected dimensions. Repeat this whenever a node needs a split.

      • Grow the tree to its maximal size (highest purity) without any pruning afterwards.

    • During testing, each tree in the forest hypothesizes a class for the input vector. A final decision is then made by a plurality vote.
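The construction procedure can be sketched in plain Python as below. This is an illustrative toy, not the speaker's code: the Gini impurity criterion, the midpoint thresholds, the majority-leaf fallback, and all function names are assumptions made for this example. Class labels are assumed to be strings, since tuples are used for internal nodes.

```python
import math
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, dims):
    """Search only the given dimensions for the cut with the lowest weighted Gini."""
    best = None  # (score, dimension, threshold)
    for d in dims:
        values = sorted(set(x[d] for x in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [yi for xi, yi in zip(X, y) if xi[d] <= thr]
            right = [yi for xi, yi in zip(X, y) if xi[d] > thr]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, d, thr)
    return best

def grow_tree(X, y, n_sub, rng):
    """Grow a tree until its leaves are pure; no pruning afterwards."""
    if len(set(y)) == 1:
        return y[0]
    dims = rng.sample(range(len(X[0])), n_sub)  # random subspace for this node
    split = best_split(X, y, dims)
    if split is None:  # chosen dimensions are constant here: use a majority leaf
        return Counter(y).most_common(1)[0][0]
    _, d, thr = split
    left = [i for i, x in enumerate(X) if x[d] <= thr]
    right = [i for i, x in enumerate(X) if x[d] > thr]
    return (d, thr,
            grow_tree([X[i] for i in left], [y[i] for i in left], n_sub, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], n_sub, rng))

def predict_tree(node, x):
    while isinstance(node, tuple):
        d, thr, left, right = node
        node = left if x[d] <= thr else right
    return node

def train_forest(X, y, n_trees=30, seed=0):
    rng = random.Random(seed)
    n_sub = max(1, round(math.sqrt(len(X[0]))))  # d close to sqrt(D)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], n_sub, rng))
    return forest

def predict_forest(forest, x):
    """Plurality vote over the trees."""
    votes = Counter(predict_tree(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]
```

Calling `train_forest(X, y)` on a list of feature vectors `X` and string labels `y` returns a list of trees, and `predict_forest(forest, x)` returns the class that most trees vote for.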


Random forest: Fundamental

[Figure: RF construction diagram, annotated as follows]

  • Bootstrapping training data

  • D-dimensional vector: randomly select d dimensions to search for an optimal split, where d ≈ sqrt(D)

  • Each node achieves highest purity. There is no posterior pruning.

  • Each tree classifier is fully grown and then added to the ensemble.

  • Repeat the procedure several times to construct more tree classifiers.


Random forest: Broad phonetic category of manners

  • Articulatory manners

    • stop, affricate, fricative, nasal, semivowel, vowel, non-speech

    • “Stop” is further divided into

      • Voiced-stop burst

      • Voiceless-stop burst

      • Stop-aspiration

    • “burst”: voiced-stop burst, voiceless-stop burst

    • “non-burst”: all other classes


Random forest: Imbalanced training data

  • The problem of imbalanced training data

    • The numbers of training vectors from different manners are highly imbalanced.

      • #Vowel >> #Fricative > … > #Stop (#Burst)

    • Conventional bootstrap causes problems.

      • Most training vectors are selected from the majority classes such as “Vowel” and “Fricative”.

      • The target class “Burst”, however, may not be sampled sufficiently. The resulting tree classifier therefore lacks the discriminative power to detect burst onsets.


Random forest: Asymmetric Bootstrap

  • Generate balanced training data

[Figure: vectors from classes such as burst, fricative, and vowel are resampled into the bootstrapped training data]

  • The procedure repeats several times

  • Over-sampling the “burst” class

  • Down-sampling the other classes
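A balanced resampling of this kind can be sketched as below. The slides do not state how many vectors are drawn per class, so drawing the same hypothetical count `per_class` from every class (with replacement) is an assumption of this example, as is the function name.

```python
import random
from collections import Counter, defaultdict

def asymmetric_bootstrap(X, y, per_class, rng):
    """Draw `per_class` vectors with replacement from every class, which
    over-samples rare classes and down-samples frequent ones."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    Xb, yb = [], []
    for label, vectors in by_class.items():
        for _ in range(per_class):
            Xb.append(rng.choice(vectors))
            yb.append(label)
    return Xb, yb

# Toy imbalanced data: many vowels, fewer fricatives, very few bursts.
labels = ["vowel"] * 100 + ["fricative"] * 30 + ["burst"] * 5
vectors = [[float(i)] for i in range(len(labels))]
Xb, yb = asymmetric_bootstrap(vectors, labels, per_class=20, rng=random.Random(0))
counts = Counter(yb)  # every class now contributes 20 vectors
```

Repeating the call once per tree gives each tree classifier its own balanced training set.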


Random forest: Detect burst onsets

  • For each input vector, the forest votes for its class.
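The vote itself can be sketched as follows. Treating the fraction of trees that vote for “burst” as a confidence score is an assumption of this example (the slides mention a confidence threshold later but do not define it), and the function name is hypothetical.

```python
from collections import Counter

def vote(tree_outputs, target="burst"):
    """Return the plurality class and the fraction of trees voting for `target`."""
    counts = Counter(tree_outputs)
    winner = counts.most_common(1)[0][0]
    confidence = counts[target] / len(tree_outputs)
    return winner, confidence

# Hypothetical outputs of a 30-tree forest for one input vector.
outputs = ["burst"] * 21 + ["fricative"] * 6 + ["vowel"] * 3
winner, confidence = vote(outputs)
```

A frame could then be accepted as a burst onset only when the winner is “burst” and the confidence exceeds a chosen threshold.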


Random forest: Detect burst onsets

[Figure: burst onset detection plotted along the frame axis]


Experimental Results: Speech materials

  • TIMIT corpus (English read speech)

    • Microphone speech, 16 kHz sampling rate, 16-bit PCM format.

    • 630 speakers, including 438 males and 192 females

    • 8 different dialect regions in the US (DR1~DR8)

    • Training set: 462 speakers (326M, 136F)

    • Testing set: 168 speakers (112M, 56F)

    • Each speaker spoke 10 sentences:

      • 2 SA sentences: fixed contexts

      • 5 SX sentences: phonetically compact

      • 3 SI sentences: phonetically diverse


Experimental Results: Speech materials

  • TIMIT corpus

    • Training data are from four speakers in DR1.

    • Training data for the “burst” class are exclusively from stops.

    • Testing data are all utterances from the TIMIT TEST set.

      • 6991 stops

      • 631 affricates


Experimental Results: RF-based burst onset detector

  • Random forest settings

    • Training dataset: 4 speakers from TIMIT DR1

    • Broad phonetic category of articulatory manners

      • nine classes

      • Apply asymmetric bootstrap to balance the training data

    • 56-dim feature vector (D=56), including 55 TDCCs and 1 average log-energy of the patch.

    • The detector consists of 30 trees

      • The dimension of random subspace during the node splitting is d=8

      • No posterior tree pruning


Experimental Results: Detection Examples

[Figure: detection example on the utterance “Put the butcher block table”, with the stops, the affricate, and the dental fricative marked]


Experimental Results: Precision of detection

  • Median: 3.1 ms

  • Interdecile range: 12.6 ms

  • Precision: Voiceless > Voiced
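For reference, the two statistics above can be computed as in the sketch below. It assumes the input is a list of temporal deviations (in ms) between detected and labeled burst onsets, which the slide does not spell out, and it uses the default quantile method of Python's `statistics` module; the function name is hypothetical.

```python
import statistics

def median_and_interdecile_range(deviations_ms):
    """Median and interdecile range (9th decile minus 1st decile)."""
    deciles = statistics.quantiles(deviations_ms, n=10)  # nine cut points
    return statistics.median(deviations_ms), deciles[-1] - deciles[0]

# Made-up deviations, for illustration only.
med, idr = median_and_interdecile_range(list(range(1, 100)))
```

The interdecile range covers the central 80% of the deviations, so it is less sensitive to a few gross errors than the full range.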


Experimental Results: Sources of false alarm

  • Most onsets of dental fricatives are detected as burst onsets, and these false alarms are hard to reject.

  • Other sources are fricatives and pause segments.

[Figure: two examples of dental fricatives]


Experimental Results: Missed detection rate

  • The missed detection rate increases as the confidence threshold increases.

    • Stops: 5.1% → 6.5%

    • Affricates: 13.6% → 15.8%


Experimental Results: Comparison of different RF settings

  • D: number of feature dimensions

  • d: number of randomly selected dimensions in node splitting


Experimental Results: Comparison of various learning machines

  • Accuracy: RF  SVM > GMM

  • Execution time: RF  GMM >> SVM

  • SVM kernel: RBF  LIN


Experimental Results: Comparison of various amounts of training data

  • Training data are from dialect region one (DR1)

  • SVM-RBF starts to surpass RF as more data are included.

    • SVM-RBF takes far more time in training and testing.


Summary

  • The proposed RF-based detector is able to efficiently detect burst onsets in continuous speech.

    • The detector needs only a small amount of training data.

    • Experimental results demonstrate its applicability.

  • The proposed asymmetric bootstrap technique can resolve the problem of imbalanced training data.


Section II

Voice Onset Time Estimate


Voice Onset Time

  • Voice onset time (VOT) was proposed in the 1960s. It was expected to effectively distinguish between English /b, d, g/ and /p, t, k/.

    • Other cues are “voicing”, “articulatory force”, and “aspiration.”

  • VOT is defined as the time difference between burst onset and voicing onset.


Voice Onset Time: Two examples of VOT

[Figure: two examples of VOT, labeled “borrow” and “tim”]


Voice Onset Time

  • VOT can be classified into several categories

    • Voicing Lead: VOT is negative-valued

    • Voicing Coincide: VOT is about zero

    • Voicing Lag: VOT is positive-valued

  • Distributions of VOT are different from language to language.

    • Two-modal: English, Spanish, Mandarin, Dutch

    • Three-modal: Korean, Thai

    • Four-modal: Hindi
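The definition and the three categories can be sketched as below. VOT is taken as the voicing onset time minus the burst onset time, which makes voicing lead negative as stated above. The width of the "about zero" band, `zero_band_ms`, is a hypothetical value chosen for illustration, and the function names are likewise assumptions.

```python
def vot_ms(burst_onset_ms, voicing_onset_ms):
    """VOT: voicing onset time minus burst onset time, in milliseconds."""
    return voicing_onset_ms - burst_onset_ms

def vot_category(vot, zero_band_ms=5.0):
    """Classify a VOT value; zero_band_ms is a hypothetical tolerance around zero."""
    if vot < -zero_band_ms:
        return "voicing lead"      # voicing starts before the burst
    if vot > zero_band_ms:
        return "voicing lag"       # voicing starts after the burst
    return "voicing coincide"      # voicing starts at about the burst

examples = [vot_category(vot_ms(100.0, v)) for v in (20.0, 102.0, 160.0)]
```

The three example values correspond to voicing that starts 80 ms before, 2 ms after, and 60 ms after the burst.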


Voice Onset Time: Existing automatic methods to estimate VOT

  • Automatic VOT estimation methods include

    • Forced alignment performed by an HMM phone recognizer (HMM-FA)

      • Pros: efficient, suitable for large corpus

      • Cons: aligned boundaries normally do not meet the onsets

    • Onset detector for burst and voicing onsets (OD)

      • Pros: estimated onset locations are more accurate

      • Cons: only suitable for isolated words

    • Combination of the two (HMM-FA+OD)

      • Has the advantages of both previous methods


Proposed HMM+RF System: Flowchart of the system


Proposed HMM+RF System: System overview

  • The proposed system consists of two parts:

    • Forced alignment based on HMM

      • Roughly locate stop consonants in continuous speech.

      • The aligned boundaries typically do not align with true onset locations.

    • Onset Detection based on random forest

      • For each aligned stop consonant, the detector searches its neighborhood for its burst and voicing onsets.


Proposed HMM+RF System: HMM-based phone recognizer

  • HMM-based phone recognizer

    • Training dataset: the whole TIMIT training set

    • 48 context-independent English phones

    • HMM topology: three-state left-to-right HMM, each state has eight Gaussian components.

      • ML training + EM algorithm

      • Run five iterations of embedded training each time the number of Gaussian components is doubled.

    • 13-dim MFCC + 1-dim log-energy plus their derivative and accelerative coefficients.


Proposed HMM+RF System: RF-based onset detector

  • Random forest based onset detector

    • Training dataset: 4 speakers from TIMIT training set

    • Broad phonetic category of articulatory manners

      • Burst → burst onset

      • Vocalic → voicing onset

    • 56-dim TDCC vector

    • The detector consists of 30 trees

      • The dimension of random subspace during the node split is 8

      • No posterior tree pruning

      • Apply asymmetric bootstrap to balance the training data from the broad phonetic categories.


Proposed HMM+RF System: More details about the onset detector

  • Burst onset detection

    • The procedure is the same as described in Section I.

  • Voicing onset detection

    • The first frame of a detected ‘vocalic’ segment following a detected burst onset is regarded as the voicing onset.
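The voicing onset rule can be sketched as below. The per-frame label sequence, the label strings, and the function names are hypothetical, and the 2 ms frame shift is carried over from Section I on the assumption that the same framing is used here.

```python
def find_voicing_onset(frame_labels, burst_frame):
    """Index of the first 'vocalic' frame after the burst onset, or None."""
    for i in range(burst_frame + 1, len(frame_labels)):
        if frame_labels[i] == "vocalic":
            return i
    return None

def vot_from_frames(frame_labels, burst_frame, frame_shift_ms=2.0):
    """VOT in ms from frame indices, assuming a fixed frame shift."""
    voicing_frame = find_voicing_onset(frame_labels, burst_frame)
    if voicing_frame is None:
        return None
    return (voicing_frame - burst_frame) * frame_shift_ms

# Hypothetical per-frame decisions around one stop consonant.
labels = ["closure"] * 5 + ["burst"] + ["aspiration"] * 10 + ["vocalic"] * 8
voicing_frame = find_voicing_onset(labels, burst_frame=5)
vot = vot_from_frames(labels, burst_frame=5)
```

Returning `None` when no vocalic frame follows the burst lets a caller fall back to the forced-alignment boundary, which is one possible design choice rather than something the slides specify.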


Proposed HMM+RF System: More details about the onset detector

  • Voicing onset adjustment procedure

    • (a) Aspiration or release portion: is large

    • (b) Vocalic portion: is small

    • in the region between (a) and (b) will be large


Proposed HMM+RF System: More details about the onset detector

  • An example of voicing onset adjustment


Experimental Results: Evaluation dataset

  • Subset of TIMIT testing set

    • 3,784 stop consonants in 968 distinct words.

      • 2,344 word-initial stop consonants and 1,440 word-medial stop consonants.

      • The selected stop consonants are left-context independent, but right-context dependent.


Experimental Results: Evaluation dataset

  • The list of eligible succeeding vowels in the experiment.

Ht. (Vowel Height): Low, Mid-Low, Mid-High, High

Bk. (Vowel Backness): Front, Central, Back


Experimental Results: Performance Evaluation

  • Four systems to be compared

    • HMM-FA-PL

      • HMM Forced Alignment at Phone Level

    • HMM-FA-PL+OD

      • HMM-FA-PL with Onset Detection

    • HMM-FA-SL

      • HMM Forced Alignment at State Level

    • HMM-FA-SL+OD

      • HMM-FA-SL with Onset Detection


Experimental Results: Performance Evaluation

  • The measure is the absolute temporal deviation between an estimated VOT and its true value.

  • The deviations are presented in terms of cumulative relative frequency distributions.

    • Four tolerances: 5 ms, 10 ms, 15 ms, and 20 ms
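The evaluation measure can be sketched as below: for each tolerance, the fraction of stops whose absolute VOT deviation falls within it. Counting a deviation exactly equal to the tolerance as within tolerance is an assumption of this example, and the function name and data are made up for illustration.

```python
def cumulative_relative_frequency(estimated_ms, reference_ms,
                                  tolerances_ms=(5, 10, 15, 20)):
    """Fraction of estimates whose absolute deviation is within each tolerance."""
    deviations = [abs(e - r) for e, r in zip(estimated_ms, reference_ms)]
    return {tol: sum(d <= tol for d in deviations) / len(deviations)
            for tol in tolerances_ms}

# Made-up VOT values in ms; the absolute deviations are 2, 8, 12, and 25 ms.
estimated = [12, 30, 48, 75]
reference = [10, 22, 60, 50]
crf = cumulative_relative_frequency(estimated, reference)
```

Because the distribution is cumulative, the reported fraction can only stay the same or grow as the tolerance widens.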


Experimental Results: VOT estimates in voiced and voiceless stops

  • VOT estimates of voiced stops with HMM-FA-PL are very poor.

    • HMM topology limitation

    • HMM-FA-SL significantly improves the estimates

  • The effect of an additional onset detection is remarkable.


Experimental Results: 3D-histograms of estimate deviations

  • With additional OD, the estimates of burst and voicing onsets are both enhanced.

  • HMM-FA-SL corrects the estimate deviation of burst onset in HMM-FA-PL.


Experimental Results: Performance Comparison

Absolute deviation of estimation

* RS: Reassigned Spectrum

** Stouten & Van hamme (2009) employed RS technique to estimate VOT.


Experimental Results: Performance in Detail

  • VOTs of the voiced velar stop /g/ are estimated less accurately than those of the other five stops.

    • On average, VOTs of velar stops (/g/, /k/) are less accurately estimated.

  • VOTs of word-medial voiced stops are less accurately estimated than their word-initial counterparts.

    • This is caused by failed detection of the burst onset.

    • In contrast, the estimates for voiceless stops in word-medial and word-initial positions are statistically the same.


Experimental Results: Failed onset detection in word-medial stops

  • Example of failed burst onset detection

    • No noticeable burst onset


Experimental Results: Failed onset detection in word-medial stops

  • Example of failed burst onset detection

    • Surrounded by strong vocalic pulses.


Summary

  • HMM-based forced alignment provides less accurate VOT estimates; however, applying an additional onset detection can significantly improve the accuracy.

  • The accuracy of VOT estimation varies with a stop’s position in a word and its place of articulation.


Conclusion

  • The proposed RF-based burst onset detector employs the spectro-temporal patterns of closure-burst transition to efficiently detect burst onsets in continuous speech.

  • Burst onset detection is combined with voicing onset detection to significantly improve the VOT estimates initially made by HMM-based forced alignment.

  • The method could be useful for speech event annotation and speech assessment.


Thank You

