Explore the impact of prosody on word recognition and linguistic studies, addressing variations in speech patterns. Learn about acoustic features, confusability, accent effects on phone recognition, and advancing linguistic science through modeling prosody. Research methods and goals include investigating prosodic cues in non-lab English for improved ASR and recognizing systematic variations. Discover how prosodic factors influence articulation and acoustic characteristics in English speech.
Jennifer Cole, Dept. of Linguistics • Mark Hasegawa-Johnson, Dept. of Electrical and Computer Engineering • Bringing Prosody into Automatic Speech Recognition: Improving word recognition and advancing linguistic science
Prosody: the rhythmic and intonational patterns of spoken language • A benefit for comprehension: prosody cues phrasing and information status of words. • A cost for speech recognition: prosody conditions acoustic variation.
Sources of variation in speech • The acoustic features of a speech sound vary as a function of: • Phonological context (assimilations, deletions, insertions) • Phonetic context (coarticulation, masking) • Speaker voice • Speaking style and tempo • Prosodic factors: accent, phrasal position
Modeling acoustic variation in ASR • Acoustic variation that results from local phonological and phonetic context can be accommodated in ASR through the use of “diphone” and “triphone” models. • Variation due to speaker, speech style, and prosody is not determined by local phone context, and is not explicitly modeled in most ASR systems.
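To make the "triphone" idea concrete, here is a minimal sketch of how a phone sequence can be expanded into context-dependent labels. The `left-center+right` notation and the `sil` padding symbol are assumptions chosen for illustration (one common convention), and `to_triphones` is a hypothetical helper, not part of the system described in these slides.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into left-center+right context labels.

    The boundary symbol and the label notation are illustrative assumptions.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Each phone gets its own model name depending on its immediate neighbours.
labels = to_triphones(["p", "ih", "t"])
print(labels)
```

Labels of this kind capture only the neighbouring phones, which is why speaker, style, and prosody fall outside what they can represent.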
Variation and confusability • Prosodically-conditioned variation causes greater overlap between contrastive phones in acoustic space. • Greater overlap between phones can result in greater confusability, and is a likely source of error in word recognition.
Accent leads to greater overlap: acoustic cues to consonant voicing • Combining accent conditions results in greater overlap between p/b. • [Figure: voiced vs. voiceless p/b along the VOT axis, with unaccented and accented tokens marked]
Separating accented from unaccented phones yields better distinctions within accent category • No overlap between voiced/voiceless phones within accent category. • Separate models for accented and unaccented consonants should result in better recognition. • [Figure: VOT of p and b, plotted separately for the unaccented and accented conditions]
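The pooled-versus-separate argument can be illustrated with a small sketch. All VOT values below are invented for illustration and are not measurements from this study; fitting a single Gaussian per category is likewise an assumption made only to obtain an overlap number.

```python
from statistics import NormalDist

# Hypothetical VOT values in ms (invented for illustration only).
vot = {
    ("b", "unaccented"): [5, 8, 10, 12, 15],
    ("b", "accented"): [15, 18, 20, 22, 25],
    ("p", "unaccented"): [30, 34, 36, 38, 42],
    ("p", "accented"): [55, 60, 62, 64, 70],
}


def overlap(voiced, voiceless):
    """Overlap coefficient (0..1) of Gaussians fitted to the two samples."""
    return NormalDist.from_samples(voiced).overlap(
        NormalDist.from_samples(voiceless)
    )


# One model per phone, ignoring accent.
pooled = overlap(
    vot[("b", "unaccented")] + vot[("b", "accented")],
    vot[("p", "unaccented")] + vot[("p", "accented")],
)

# Separate models for each accent condition.
within = {
    accent: overlap(vot[("b", accent)], vot[("p", accent)])
    for accent in ("unaccented", "accented")
}

print(f"pooled overlap: {pooled:.4f}")
for accent, value in within.items():
    print(f"{accent} overlap: {value:.4f}")
```

With these made-up numbers the pooled /p/ and /b/ distributions overlap more than either within-accent pair, which is the pattern the slide describes.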
Immediate goal: • By modeling prosodic distinctions in speech recognition we expect to achieve more accurate phone and word recognition. (Improving ASR) If we do, then… • Results from our speech recognition experiments will tell us about the prosodic effects that occur in “non-lab” (natural) spoken English. (Advancing linguistic science)
Future goals: • Recognition of pitch and durational cues to pragmatic meaning and discourse structure. • An approach to modeling other sources of systematic variation: “foreign” or dialectal accent, speech disorders, child speech…
Our approach to prosody-dependent phone modeling • Determine which prosodic features condition confusion-inducing variation for which kinds of phones. • Train a speech recognizer on prosodically specified phones. • Requires a training corpus of prosodically-labeled and phone-labeled speech. • Develop an approach to recognize prosodic features from acoustic cues.
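As a rough sketch of what "prosodically specified phones" could look like in a training transcript, the snippet below splits each phone label by accent status. The `(phone, accented)` input format, the `_acc`/`_unacc` suffixes, and the `tag_phones` helper are all hypothetical; the slides do not specify a label scheme.

```python
from collections import Counter


def tag_phones(tokens):
    """Map (phone, accented) pairs to prosody-dependent model names.

    The suffix scheme is an illustrative assumption.
    """
    return [
        f"{phone}_{'acc' if accented else 'unacc'}"
        for phone, accented in tokens
    ]


# Hypothetical aligned tokens: phone label plus an accent flag.
tokens = [("p", True), ("ih", True), ("t", False), ("p", False)]
labels = tag_phones(tokens)

# Splitting by accent doubles the model inventory, so per-model
# training counts are worth checking before training.
counts = Counter(labels)
print(labels)
print(counts)
```

Doubling the inventory thins out the training data per model, which is one reason to first determine which phones actually show confusion-inducing variation.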
Linguistic models of acoustic variation • Research in phonetics shows that prosodic factors are a significant source of variation. • In English: • Lexical stress • Nuclear (phrasal) accent • Phrase position (initial/medial/final)
Prosodic effects on articulation • Sounds in stressed or accented syllables are more strongly articulated, • Stressed/accented speech gestures are • Faster • Longer • Bigger (greater displacement) Cho 2001; deJong 1991, 1995; Edwards & Beckman 1988; Edwards et al 1991; Beckman et al 1992; Harrington et al 1995; Cooper 1991
Prosodic effects on acoustics (English) • Vowels in unstressed syllables are reduced (centralized) and shortened compared to stressed vowels (Lindblom 1963). • Segment durations are longer • In accented syllables (Beckman & Edwards 1994) • In phrase-initial syllables (Fougeron & Keating 1997) • In phrase-final syllables (Edwards & Beckman 1988; Crystal & House 1988; Wightman et al 1994)
Limitations of prior studies • Findings are from laboratory studies of controlled speech, produced in the absence of real discourse context. • Focus is on supralaryngeal articulations (C and V place and manner). • The bulk of the evidence comes from articulatory studies.
Research questions • What is the full range of effects of prosodic factors on acoustic features? • How are laryngeal features affected? • How far does prosody influence acoustic variation in non-laboratory speech?
Dual methods for investigating prosodic effects • Acoustic analysis provides a direct measure of prosodically-conditioned variation. • Speech recognition experiments provide indirect evidence for prosodic effects. • If recognition improves when prosodic context is explicitly modeled, then prosodic effects must have decreased the distinctiveness of contrastive phones in the speech corpus studied.
Phase I: Boston University Radio News speech corpus • What are the effects of accent on the acoustic cues that distinguish voiced from voiceless stops in American English? • /p,t,k/ vs. /b,d,g/ • Does accent condition a significant degree of variation in this speech corpus? • We begin by looking at the acoustic cues for the voicing contrast for stops in V# 'CV contexts: • C is onset in word-initial, stressed syllable • Comparing Cs in accented and unaccented syllables
Why Radio News Speech? • Speech not controlled for purposes of phonetics research. • Speech produced under real communicative context. • Speech produced by professional radio news announcers. (good? bad?) • Multiple speakers reading same news story • Speech is prosodically labeled based on ToBI labeling standards (Beckman-Pierrehumbert model) … saves us lots of time/work!
TIMIT database confusion matrix • *A = actual phoneme, *R = recognized phoneme • [Figure: confusion matrix]
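For readers unfamiliar with the format, a confusion matrix of this kind can be tallied from aligned (actual, recognized) phone pairs. The pairs below are invented for illustration and are not TIMIT results; `confusion_matrix` is a hypothetical helper.

```python
from collections import Counter


def confusion_matrix(pairs):
    """Return (phones, matrix) with rows = actual, columns = recognized."""
    counts = Counter(pairs)
    phones = sorted({phone for pair in pairs for phone in pair})
    matrix = [[counts[(a, r)] for r in phones] for a in phones]
    return phones, matrix


# Hypothetical (actual, recognized) pairs, not taken from any corpus.
pairs = [("p", "p"), ("p", "b"), ("b", "b"), ("b", "b"), ("b", "p"), ("p", "p")]
phones, matrix = confusion_matrix(pairs)

for actual, row in zip(phones, matrix):
    print(actual, row)
```

Off-diagonal cells count confusions, so a voiced/voiceless pair such as /p/ and /b/ shows up as two symmetric off-diagonal entries.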
Acoustic cues for the phonological voicing contrast • VOT • voiceless > voiced • closure (lead) voicing for voiced stops • F0 (measured at onset of following vowel) • voiceless > voiced • Closure duration • voiceless > voiced
Hypothesized accentual effects • Paradigmatic strengthening • Syntagmatic strengthening • [Figure: two panels, one per hypothesis, each plotting acoustic values of contrastive pairs in the unaccented and accented conditions]
Predicted effects of accent • Paradigmatic Strengthening: greater acoustic distinctions between voiced and voiceless stops for all measures • not a problem for ASR • Syntagmatic Strengthening: similar effects for both voiced and voiceless stops: • increase in VOT and Closure Duration • increase in acoustic energy, resulting in higher F0
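The two predictions can be told apart from the four cell means of a single acoustic measure. The sketch below assumes the measure is one where voiceless > voiced (as for VOT, F0, and closure duration), uses invented means, and ignores statistical significance; `accent_effects` is a hypothetical helper.

```python
def accent_effects(means):
    """Classify accent effects from (voicing, accent) cell means.

    Assumes a measure on which voiceless stops have the larger value.
    """
    shift_voiced = means[("voiced", "accented")] - means[("voiced", "unaccented")]
    shift_voiceless = (
        means[("voiceless", "accented")] - means[("voiceless", "unaccented")]
    )
    contrast_unaccented = (
        means[("voiceless", "unaccented")] - means[("voiced", "unaccented")]
    )
    contrast_accented = (
        means[("voiceless", "accented")] - means[("voiced", "accented")]
    )
    return {
        # Both categories move in the same (upward) direction under accent.
        "syntagmatic": shift_voiced > 0 and shift_voiceless > 0,
        # The voiced/voiceless distance grows under accent.
        "paradigmatic": contrast_accented > contrast_unaccented,
    }


# Hypothetical mean VOT values in ms (invented for illustration).
means = {
    ("voiced", "unaccented"): 10.0,
    ("voiced", "accented"): 20.0,
    ("voiceless", "unaccented"): 36.0,
    ("voiceless", "accented"): 62.0,
}
result = accent_effects(means)
print(result)
```

The two patterns are not mutually exclusive: both categories can shift upward while the voiceless category shifts further, as in these made-up numbers.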
ANOVA results for effects of Voicing and Accent on means of acoustic measures • Voicing was a significant factor for all three measures: these cues signal voicing. • Significant effects of Accent found for VOT, F0 and Closure Duration. • Accent effects: • Increased VOT for all stops except /g/; • Raised F0 for all stops except /b/; • Increased Closure Duration for all stops except /g/.
[Figure: plot with labels k t p g b d]
Region of overlap of voiced and voiceless groups within an accent category [Figure: plot with labels K k P T t p g B G D d b]
For all 3 places: a greater overlap of voiced and voiceless groups when accent conditions are pooled. [Figure: plot with labels K k P T t p g B G D d b]
[Figure: plot with labels k d t p b g]
Region of overlap of voiced and voiceless groups within an accent category [Figure: plot with labels K T t P k p d D g B b G]
For bilabials and velars: a greater overlap of voiced and voiceless groups when accent conditions are pooled. [Figure: plot with labels K T t P k p d D g B b G]
[Figure: plot with labels b d p k t g]
Region of overlap of voiced and voiceless groups within an accent category [Figure: plot with labels B K D P T b g d p t k G]
For bilabials and alveolars: a greater overlap of voiced and voiceless groups when accent conditions are pooled. [Figure: plot with labels B K D P T b g d G p t k]
Summary of results from acoustic study • VOT, F0 and Closure duration: accent induces increased values for both voiced and voiceless stops → syntagmatic strengthening • VOT and F0: effects are bigger and more consistent for voiceless stops than for voiced stops → paradigmatic strengthening