Enhancing Speech Recognition Through Prosody Modeling

Bringing Prosody into Automatic Speech Recognition: Improving word recognition and advancing linguistic science • Jennifer Cole, Dept. of Linguistics • Mark Hasegawa-Johnson, Dept. of Electrical and Computer Engineering

Prosody: the rhythmic and intonational patterns of spoken language • A benefit for comprehension: prosody cues phrasing and information status of words. • A cost for speech recognition: prosody conditions acoustic variation.

Sources of variation in speech • The acoustic features of a speech sound vary as a function of: • Phonological context (assimilations, deletions, insertions) • Phonetic context (coarticulation, masking) • Speaker voice • Speaking style and tempo • Prosodic factors: accent, phrasal position

Modeling acoustic variation in ASR • Acoustic variation that results from local phonological and phonetic context can be accommodated in ASR through the use of “diphone” and “triphone” models. • Variation due to speaker, speech style, and prosody is not determined by local phone context, and is not explicitly modeled in most ASR systems.
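As a rough illustration of how triphone models encode local context, the sketch below expands a phone sequence into context-dependent labels. The "left-center+right" notation is one common convention in HMM toolkits; the function name and the "sil" boundary symbol are assumptions made for this example.

```python
def to_triphones(phones, boundary="sil"):
    """Return one 'left-center+right' label per phone in the sequence."""
    # Pad both ends so the first and last phones also get a context.
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


labels = to_triphones(["p", "ae", "t"])
print(labels)  # ['sil-p+ae', 'p-ae+t', 'ae-t+sil']
```

Each label depends only on the immediately neighboring phones, which is why such models cannot capture variation conditioned by accent or phrase position.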

Variation and confusability • Prosodically-conditioned variation causes greater overlap between contrastive phones in acoustic space. • Greater overlap between phones can result in greater confusability, and is a likely source of error in word recognition.

Accent leads to greater overlap: acoustic cues to consonant voicing • Combining accent conditions results in greater overlap between p/b. [Figure: VOT distributions of voiced and voiceless stops (p, b) in unaccented and accented conditions.]

Separating accented from unaccented phones yields better distinctions within accent category • No overlap between voiced/voiceless phones within accent category. • Separate models for accented and unaccented consonants should result in better recognition. [Figure: VOT distributions of p and b, plotted separately for unaccented and accented conditions.]
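The pooling argument can be made concrete with a small sketch. It measures overlap as the lowest error rate any single VOT threshold can reach when separating voiced from voiceless tokens. The VOT values (in ms) are invented for illustration and are not measurements from the corpus; the function name is likewise hypothetical.

```python
def min_threshold_error(voiced, voiceless):
    """Lowest error rate of any single threshold t (voiceless if VOT > t)."""
    candidates = sorted(set(voiced) | set(voiceless))
    total = len(voiced) + len(voiceless)
    best = total
    for t in candidates:
        # Voiced tokens above t and voiceless tokens at or below t are errors.
        errors = sum(v > t for v in voiced) + sum(v <= t for v in voiceless)
        best = min(best, errors)
    return best / total


# Hypothetical VOT values in ms (assumed, not from the corpus).
unacc_b, unacc_p = [5, 8, 10, 12], [25, 30, 35, 40]
acc_b, acc_p = [18, 22, 26, 30], [50, 60, 70, 80]

err_unaccented = min_threshold_error(unacc_b, unacc_p)
err_accented = min_threshold_error(acc_b, acc_p)
err_pooled = min_threshold_error(unacc_b + acc_b, unacc_p + acc_p)
print(err_unaccented, err_accented, err_pooled)  # 0.0 0.0 0.125
```

With these toy values each accent category is perfectly separable on its own, but pooling introduces errors because accented /b/ falls in the range of unaccented /p/.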

Immediate goal: • Improving ASR: by modeling prosodic distinctions in speech recognition we expect to achieve more accurate phone and word recognition. If we do, then… • Advancing linguistic science: results from our speech recognition experiments will tell us about the prosodic effects that occur in “non-lab” (natural) spoken English.

Future goals: • Recognition of pitch and durational cues to pragmatic meaning and discourse structure. • An approach to modeling other sources of systematic variation: “foreign” or dialectal accent, speech disorders, child speech…

Our approach to prosody-dependent phone modeling • Determine which prosodic features condition confusion-inducing variation for which kinds of phones. • Train a speech recognizer on prosodically specified phones. • Requires a training corpus of prosodically-labeled and phone-labeled speech. • Develop an approach to recognize prosodic features from acoustic cues.
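One way to picture "prosodically specified phones" is as phone labels extended with prosodic tags, so that the recognizer trains a separate model for each combination. The sketch below assumes a hypothetical label scheme (`phone_accent_position`) and hypothetical tag names; the slides do not specify the actual label format.

```python
# Phrase positions named in the slides; the string values are assumptions.
POSITIONS = {"initial", "medial", "final"}


def prosodic_label(phone, accented, position):
    """Build a prosody-dependent phone label (hypothetical naming scheme)."""
    if position not in POSITIONS:
        raise ValueError(f"unknown phrase position: {position}")
    accent_tag = "acc" if accented else "unacc"
    return f"{phone}_{accent_tag}_{position}"


# Hypothetical annotated tokens: (phone, accented?, phrase position).
tokens = [("p", True, "initial"), ("ae", True, "medial"), ("t", False, "final")]
labels = [prosodic_label(*tok) for tok in tokens]
print(labels)  # ['p_acc_initial', 'ae_acc_medial', 't_unacc_final']
```

Multiplying the phone inventory by prosodic categories shrinks the training data available per model, which is one reason to first determine which prosodic features matter for which phones.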

Linguistic models of acoustic variation • Research in phonetics shows that prosodic factors are a significant source of variation. • In English: • Lexical stress • Nuclear (phrasal) accent • Phrase position (initial/medial/final)

Prosodic effects on articulation • Sounds in stressed or accented syllables are more strongly articulated. • Stressed/accented speech gestures are • Faster • Longer • Bigger (greater displacement) (Cho 2001; de Jong 1991, 1995; Edwards & Beckman 1988; Edwards et al. 1991; Beckman et al. 1992; Harrington et al. 1995; Cooper 1991)

Prosodic effects on acoustics (English) • Vowels in unstressed syllables are reduced (centralized) and shortened compared to stressed vowels (Lindblom 1963). • Segment durations are longer • In accented syllables (Beckman & Edwards 1994) • In phrase-initial syllables (Fougeron & Keating 1997) • In phrase-final syllables (Edwards & Beckman 1988; Crystal & House 1988; Wightman et al 1994)

Limitations of prior studies • Findings are from laboratory studies of controlled speech, produced in the absence of a real discourse context. • Focus is on supralaryngeal articulations (C and V place and manner). • The bulk of the evidence comes from articulatory studies.

Research questions • What is the full range of effects of prosodic factors on acoustic features? • How are laryngeal features affected? • How far does prosody influence acoustic variation in non-laboratory speech?

Dual methods for investigating prosodic effects • Acoustic analysis provides a direct measure of prosodically-conditioned variation. • Speech recognition experiments provide indirect evidence for prosodic effects. • If recognition improves when prosodic context is explicitly modeled, then prosodic effects must have decreased the distinctiveness of contrastive phones in the speech corpus studied.

Phase I: Boston University Radio News speech corpus • What are the effects of accent on the acoustic cues that distinguish voiced from voiceless stops in American English? • /p,t,k/ vs. /b,d,g/ • Does accent condition a significant degree of variation in this speech corpus? • We begin by looking at the acoustic cues for the voicing contrast for stops in V# 'CV contexts: • C is onset in word-initial, stressed syllable • Comparing Cs in accented and unaccented syllables

Why Radio News Speech? • Speech not controlled for purposes of phonetics research. • Speech produced under real communicative context. • Speech produced by professional radio news announcers. (good? bad?) • Multiple speakers reading same news story • Speech is prosodically labeled based on ToBI labeling standards (Beckman-Pierrehumbert model) … saves us lots of time/work!

TIMIT database confusion matrix (*A = actual phoneme, *R = recognized phoneme) [Figure: phoneme confusion matrix.]
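For readers unfamiliar with the format, a confusion matrix of this kind can be tallied from aligned actual and recognized phoneme labels. The sketch below is a minimal illustration; the label sequences are invented, not TIMIT results, and the one-to-one alignment of the two sequences is an assumption.

```python
from collections import Counter


def confusion_counts(actual, recognized):
    """Count (actual, recognized) phoneme pairs from two aligned sequences."""
    if len(actual) != len(recognized):
        raise ValueError("sequences must be aligned one-to-one")
    return Counter(zip(actual, recognized))


# Hypothetical aligned phoneme labels (not taken from TIMIT).
actual = ["p", "b", "p", "b", "t", "d"]
recognized = ["p", "p", "p", "b", "t", "t"]

counts = confusion_counts(actual, recognized)
# Off-diagonal cells where a voiced stop was recognized as voiceless.
voicing_errors = counts[("b", "p")] + counts[("d", "t")]
print(counts[("p", "p")], voicing_errors)  # 2 2
```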

Acoustic cues for the phonological voicing contrast • VOT • voiceless > voiced • closure (lead) voicing for voiced stops • F0 (measured at onset of following vowel) • voiceless > voiced • Closure duration • voiceless > voiced

Hypothesized Accentual Effects • Paradigmatic strengthening • Syntagmatic strengthening [Figure: two panels, one per hypothesis, each showing acoustic values of contrastive pairs in unaccented and accented conditions.]

Predicted effects of accent • Paradigmatic Strengthening: greater acoustic distinctions between voiced and voiceless stops for all measures • not a problem for ASR • Syntagmatic Strengthening: similar effects for both voiced and voiceless stops: • increase in VOT and Closure Duration • increase in acoustic energy, resulting in higher F0
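The two hypotheses can be stated as simple checks on the four condition means of a measure. The sketch below is one possible operationalization, assuming voiceless > voiced for the measure (as listed for VOT, F0, and closure duration); the function name and the example means are hypothetical.

```python
def accent_effects(unacc_voiced, unacc_voiceless, acc_voiced, acc_voiceless):
    """Classify accent effects from the four condition means of one measure."""
    # Paradigmatic: the voiced/voiceless contrast grows under accent.
    contrast_change = (acc_voiceless - acc_voiced) - (unacc_voiceless - unacc_voiced)
    # Syntagmatic: both categories shift upward under accent.
    voiced_shift = acc_voiced - unacc_voiced
    voiceless_shift = acc_voiceless - unacc_voiceless
    return {
        "paradigmatic": contrast_change > 0,
        "syntagmatic": voiced_shift > 0 and voiceless_shift > 0,
    }


# Hypothetical mean VOT values in ms (assumed for illustration).
result = accent_effects(unacc_voiced=10, unacc_voiceless=40,
                        acc_voiced=15, acc_voiceless=60)
print(result)  # {'paradigmatic': True, 'syntagmatic': True}
```

The two effects are not mutually exclusive: both categories can shift upward while the voiceless category shifts further, which widens the contrast.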

ANOVA Results for effects of Voicing and Accent on means of acoustic measures • Voicing was a significant factor for all three measures → these cues signal voicing. • Significant effects of Accent found for VOT, F0 and Closure Duration. • Accent effects: • Increased VOT for all stops except /g/; • Raised F0 for all stops except /b/; • Increased Closure Duration for all stops except /g/.
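A two-factor ANOVA of this kind partitions the variance of a measure into Voicing, Accent, interaction, and within-cell components. The sketch below computes the F ratios for a balanced two-way design using only the standard library; the slides do not describe the exact ANOVA design used, so the balanced layout, the function name, and the data are all assumptions. p-values are omitted because they require the F distribution.

```python
from statistics import mean


def two_way_anova(cells):
    """F ratios for a balanced two-way design; cells[(a, b)] -> observations."""
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))  # observations per cell
    if n < 2 or any(len(v) != n for v in cells.values()):
        raise ValueError("balanced design with at least 2 observations per cell")

    grand = mean(x for v in cells.values() for x in v)
    a_mean = {a: mean(x for b in b_levels for x in cells[(a, b)]) for a in a_levels}
    b_mean = {b: mean(x for a in a_levels for x in cells[(a, b)]) for b in b_levels}
    cell_mean = {k: mean(v) for k, v in cells.items()}

    # Sums of squares for each main effect, the interaction, and residual.
    ss_a = n * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels)
    ss_b = n * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels)
    ss_ab = n * sum((cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
                    for a in a_levels for b in b_levels)
    ss_within = sum((x - cell_mean[k]) ** 2 for k, v in cells.items() for x in v)

    df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
    ms_within = ss_within / (len(a_levels) * len(b_levels) * (n - 1))
    return {
        "ss": (ss_a, ss_b, ss_ab, ss_within),
        "F_a": (ss_a / df_a) / ms_within,
        "F_b": (ss_b / df_b) / ms_within,
        "F_ab": (ss_ab / (df_a * df_b)) / ms_within,
    }


# Hypothetical VOT values in ms, keyed by (voicing, accent).
vot = {
    ("voiced", "unacc"): [8, 12], ("voiced", "acc"): [13, 17],
    ("voiceless", "unacc"): [38, 42], ("voiceless", "acc"): [58, 62],
}
anova = two_way_anova(vot)
print(anova["F_a"], anova["F_b"], anova["F_ab"])
```

Here factor `a` is Voicing and factor `b` is Accent; a large interaction term would correspond to accent affecting voiced and voiceless stops differently.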

[Figure: plot of the stops p, t, k, b, d, g.]

[Figure: the same stops with upper- and lower-case labels, marking the region of overlap of voiced and voiceless groups within an accent category.]

For all 3 places: a greater overlap of voiced and voiceless groups when accent conditions are pooled. [Figure: stops with upper- and lower-case labels.]

[Figure: plot of the stops p, t, k, b, d, g.]

[Figure: the same stops with upper- and lower-case labels, marking the region of overlap of voiced and voiceless groups within an accent category.]

For bilabials and velars: a greater overlap of voiced and voiceless groups when accent conditions are pooled. [Figure: stops with upper- and lower-case labels.]

[Figure: plot of the stops p, t, k, b, d, g.]

[Figure: the same stops with upper- and lower-case labels, marking the region of overlap of voiced and voiceless groups within an accent category.]

For bilabials and alveolars: a greater overlap of voiced and voiceless groups when accent conditions are pooled. [Figure: stops with upper- and lower-case labels.]

Summary of results from acoustic study • VOT, F0 and Closure duration: accent induces increased values for both voiced and voiceless stops → syntagmatic strengthening • VOT and F0: effects are bigger and more consistent for voiceless stops than for voiced stops → paradigmatic strengthening
