Automatic Detection of Voice Onset Time Contrasts For Use in Pronunciation Assessment

Project Description

Abe Kazemzadeh¹, Joseph Tepperman¹, Jorge Silva¹, Hong You², Sungbok Lee¹, Abeer Alwan², and Shrikanth Narayanan¹

University of Southern California¹ and University of California, Los Angeles²

  • Automatically distinguish whether a voiceless stop consonant is pronounced with a native or accented pronunciation based on voice onset time (VOT) characteristics.

  • Use data from the Tball corpus: ESL children doing oral reading tasks.

  • Evaluate different methods of accomplishing this.

    • State duration measurements

    • Explicit modeling of aspiration

    • Phone probability discrimination



Motivation for Studying VOT

  • Baseline method error rates:

    • p: 55%, t: 23%, k: 29%

    • p: 19%, t: 20%, k: 48% using the duration of the 3rd HMM state

  • With aspiration model:

    • Short VOT / long VOT

    • p: 5% / 36%

    • t: 11% / 38%

    • k: 57% / 17%

  • With probability comparison:

    • p: 36% / 4%

    • t: 0% / 5%

    • k: 0% / 6%

    • (trained on test data; overtrained?)

  • Baseline: use duration measurements from a forced alignment.

  • Insert an /h/ symbol in the transcriptions with standard pronunciation, train accordingly and decode the test files to see if the /h/ phone is recognized.

  • Cut out the phones of interest from the audio file, train separate models and a combined model, and evaluate the likelihood of the separate models w.r.t. the combined model.

  • The data was transcribed by ear with special symbols for non-standard pronunciations.

    • Because the data for non-standard pronunciations was sparse, the symbol for dental /t/ was included in the short VOT class.

  • Standard 3-state HMMs

    • 4 mixtures, T-state silence model.

    • Different frame rates were tested.

    • Bootstrap and flat start methods were tested.

    • Used the Cambridge Hidden Markov Model Toolkit (HTK).

  • This study was motivated by a desire to determine whether a phone was pronounced with a non-standard pronunciation.

  • Other reasons to study VOT

    • It is an important contrastive feature

    • It gives information about stress

    • It gives information about word segmentation

    • It may give information about emphasis
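
The third method listed above (separate models scored against a combined model) can be sketched as a log-likelihood comparison. This is only an illustration: the slides do not state the exact decision rule, so the rule below (pick the class whose model gains more over the combined model) is an assumption, and the function name and arguments are hypothetical. The log-likelihoods are assumed to come from an external recognizer run on the excised phone segment.

```python
def classify_by_likelihood(loglik_short, loglik_long, loglik_combined):
    """Compare class-specific models against a combined model.

    All three arguments are log-likelihoods of the same phone segment.
    Assumed decision rule (not taken from the slides): choose the class
    whose model improves most over the combined model.
    """
    # Scores relative to the combined model. Subtracting the same value
    # from both does not change the winner, but it normalizes the scores
    # so they can be compared across segments or thresholded.
    score_short = loglik_short - loglik_combined
    score_long = loglik_long - loglik_combined
    label = "short" if score_short > score_long else "long"
    return label, score_short, score_long
```

Normalizing by the combined model is one way to make the scores usable with a threshold, which is how the approach might extend to other pronunciation contrasts.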

What is VOT?

  • Voice onset time is defined for stops

    • e.g. /p,b,t,d,k,g/

  • It is the interval between the release of closure of an articulator (the transient “burst”) and the start of voicing.

  • VOT has a continuum of values:

    • When the start of voicing precedes the release of closure for a stop, the VOT takes on a negative value.

    • When the release of closure and onset of voicing are coincident, VOT is zero.

    • When voicing comes after release of closure, VOT is positive.
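
The sign convention above can be written out as a small sketch. It assumes the burst release time and the voicing onset time have already been measured (for example by hand labeling); the function names and the tolerance used to call two events "coincident" are illustrative assumptions, not values from the slides.

```python
def voice_onset_time_ms(burst_release_s, voicing_onset_s):
    """VOT in milliseconds: voicing onset time minus burst release time.

    Both inputs are times in seconds measured from the same origin.
    """
    return (voicing_onset_s - burst_release_s) * 1000.0


def vot_category(vot_ms, tolerance_ms=5.0):
    """Label a VOT value as negative, zero, or positive.

    tolerance_ms is a hypothetical margin for treating release and
    voicing onset as coincident; it is not specified in the source.
    """
    if vot_ms < -tolerance_ms:
        return "negative"  # voicing starts before the release
    if vot_ms > tolerance_ms:
        return "positive"  # voicing starts after the release
    return "zero"          # release and voicing roughly coincide
```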

Tball Corpus


  • Los Angeles area elementary schools.

  • 256 children, mainly Spanish native speakers.

  • Reading words, letters, and numbers, and naming pictures and colors.

  • Collected in cooperation between USC and UCLA.

  • Studies have noted that for VOT, /k/ > /t/ > /p/

    • This could explain why the baseline gets poor results for p

    • and why the aspiration model predicts the short VOT class best for /p,t/ but predicts the long VOT class best for /k/

  • Roughly, each method increased in difficulty.

  • The results improved from the baseline, but the last approach (comparing probabilities) may have been over-trained.

  • Comparing probabilities may be easier to extend to other pronunciation modeling tasks.

  • Increasing the frame rate didn't help much.

    • Don't use a 1 ms frame rate unless you want to test your patience.

  • If an initial consonant has a short VOT, this does not necessarily imply a non-standard accent.

    • Words like “today” and “together” have stress on the 2nd syllable, so the VOT of the initial consonant is shorter even for standard pronunciation.

Physical Realization of VOT

Evaluation Method

  • Stop consonants are produced with a closure of the vocal tract at a specific point, the place of articulation.

  • During the closure, there is a build-up of sub-laryngeal pressure.

  • When the closure is released, there is a transient burst of air, frication due to turbulence at the place of articulation, and aspiration noise from turbulence at the glottis.

  • Voicing may occur before, during, or after the release of closure.

  • The evaluation metric used was the error rate for each class, evaluated separately.

    • This was necessary because there were far fewer instances of the non-standard pronunciations.

    • If total error rate were used, low error rates could be achieved by classifying everything as long VOT.

  • When using thresholds, the point of equal error rate for both classes was used.

    • This was necessary because moving the threshold would tilt the error rate toward one class or the other.
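
The per-class metric and the equal-error threshold can be illustrated with a minimal sketch. It assumes each token has a scalar duration (for instance a state duration from a forced alignment) and a reference label of "short" or "long", and that both classes are present. The function names, the search over observed durations, and the toy data are assumptions for illustration only.

```python
def per_class_error_rates(durations_ms, labels, threshold_ms):
    """Error rate for each class, computed separately.

    A token is predicted "long" when its duration is at or above the
    threshold, otherwise "short". Assumes both classes occur in labels.
    """
    errors = {"short": 0, "long": 0}
    counts = {"short": 0, "long": 0}
    for duration, label in zip(durations_ms, labels):
        predicted = "long" if duration >= threshold_ms else "short"
        counts[label] += 1
        if predicted != label:
            errors[label] += 1
    return {c: errors[c] / counts[c] for c in counts}


def equal_error_threshold(durations_ms, labels):
    """Pick the observed duration at which the two error rates are closest."""
    best = None
    for candidate in sorted(set(durations_ms)):
        rates = per_class_error_rates(durations_ms, labels, candidate)
        gap = abs(rates["short"] - rates["long"])
        if best is None or gap < best[0]:
            best = (gap, candidate, rates)
    return best[1], best[2]


# Invented toy data, not taken from the corpus.
toy_durations = [10, 20, 40, 30, 50, 60, 70]
toy_labels = ["short", "short", "short", "long", "long", "long", "long"]
toy_threshold, toy_rates = equal_error_threshold(toy_durations, toy_labels)
```

Reporting the two rates separately keeps a majority-class classifier from looking good: labeling every token "long" would give a 0% error on the long class but a 100% error on the short class.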



Linguistic Significance of VOT

  • VOT distinguishes consonants with the same place of articulation (/p/ vs. /b/, /t/ vs. /d/, etc.)

  • However, different languages use different VOT intervals to mark these contrasts (e.g. the initial stops of “taco” and “pasta”).

  • English voiceless stops: VOT = +40 to +50 ms

  • Spanish voiceless stops: VOT = near zero

  • English voiced stops: VOT = near zero

  • Spanish voiced stops: negative VOT (voicing before release of closure)

  • In English, voiceless stops have a long VOT at the beginning of a word and before stressed vowels, so aspiration is a perceptual cue to word boundaries and stress.

  • Since the frication and aspiration during the VOT are due to the build-up of pressure from the lungs, VOT may correspond with emphasis.


Future Work


  • Since VOT is a time/timing related phenomenon, it may help to explicitly model the state duration density in the HMMs.

  • Other optimization criteria might be better suited than maximum likelihood estimation to train models for this purpose.

  • When classifying stop consonants based on VOT characteristics, different approaches work better on different stops.

    • Measuring the duration of the stop state works reasonably well for /t,k/ because they have a longer VOT than /p/.

    • Detecting insertion of an aspiration model during decoding works well for /p,t/ but not /k/, which has too many false positives.

    • Comparing phone probabilities worked well except for unaspirated /p/.
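
One reason explicit duration modeling might help, as suggested in the future work above: in a standard HMM, the time spent in a state with self-transition probability a follows a geometric distribution, P(d) = a^(d-1) (1 - a), which always peaks at one frame. The sketch below just evaluates that implied distribution; the function names are illustrative, and the slides do not specify what duration density would replace it.

```python
def geometric_duration_pmf(self_loop_prob, max_frames):
    """Implicit state-duration distribution of a standard HMM state.

    Returns P(d) for d = 1..max_frames, where d is the number of
    consecutive frames spent in the state.
    """
    return [
        (self_loop_prob ** (d - 1)) * (1.0 - self_loop_prob)
        for d in range(1, max_frames + 1)
    ]


def expected_duration_frames(self_loop_prob):
    """Mean of the geometric duration distribution, in frames."""
    return 1.0 / (1.0 - self_loop_prob)
```

Because the most likely duration is always a single frame, this implicit model cannot represent a state whose typical duration is, say, 40 to 50 ms with little mass at very short durations.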


Special Thanks to the Tball Project for the data, EE619 class for feedback, and Daylen Riggs and Nathan Go for help with the transcriptions.

References on Request