CSE 551: Structure of Spoken Language

Lecture 6: Characteristics of Place of Articulation; Phonetic Transcription
John-Paul Hosom
Fall 2004

Acoustic-Phonetic Features: Manner of Articulation
Approximately 8 manners of articulation:

Name          Sub-Types          Examples
Vowel         vowel, diphthong   aa, iy, uw, eh, ow, …
Approximant   liquid, glide      l, r, w, y
Nasal                            m, n, ng
Stop          unvoiced, voiced   p, t, k, b, d, g
Fricative     unvoiced, voiced   f, th, s, sh, v, dh, z, zh
Affricate     unvoiced, voiced   ch, jh
Aspiration                       h
Flap                             dx, nx

A change in manner of articulation is usually abrupt and visible; manner provides much information about the location of phonemes.

Acoustic-Phonetic Features: Place of Articulation
Approximately 8 places of articulation for consonants:

Name              Examples
Labial            p, b, m, (w)
Labio-Dental      f, v
Dental            th, dh
Alveolar          t, d, s, z, n, l
Palato-Alveolar   sh, zh, ch*, jh*, r**
Palatal           y
Velar             k, g, ng, (w)
Glottal           h

*  may start as alveolar (/t/, /d/) followed by palato-alveolar
** /r/ is really a retroflex, and has a complex place of articulation

Place of articulation is more subject to coarticulation than manner; the F2 trajectory is important for identifying place of articulation.
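As a rough illustration, the manner and place tables can be held as lookup tables and inverted to describe a phoneme by both features. This is only a sketch: the names `MANNER`, `PLACE`, `invert`, and `describe` are hypothetical, the vowel list contains only the examples shown, and /w/ is filed under labial only (the table lists it as both labial and velar).

```python
# Manner and place classes transcribed from the two tables above.
# Assumption: /w/ is kept under "labial" only, although the slide also
# lists it (in parentheses) under velar.
MANNER = {
    "vowel": ["aa", "iy", "uw", "eh", "ow"],
    "approximant": ["l", "r", "w", "y"],
    "nasal": ["m", "n", "ng"],
    "stop": ["p", "t", "k", "b", "d", "g"],
    "fricative": ["f", "th", "s", "sh", "v", "dh", "z", "zh"],
    "affricate": ["ch", "jh"],
    "aspiration": ["h"],
    "flap": ["dx", "nx"],
}
PLACE = {
    "labial": ["p", "b", "m", "w"],
    "labio-dental": ["f", "v"],
    "dental": ["th", "dh"],
    "alveolar": ["t", "d", "s", "z", "n", "l"],
    "palato-alveolar": ["sh", "zh", "ch", "jh", "r"],
    "palatal": ["y"],
    "velar": ["k", "g", "ng"],
    "glottal": ["h"],
}


def invert(table):
    """Turn {class: [phonemes]} into {phoneme: class}."""
    return {ph: cls for cls, phones in table.items() for ph in phones}


MANNER_OF = invert(MANNER)
PLACE_OF = invert(PLACE)


def describe(phoneme):
    """Return (manner, place); place is None where the table lists none."""
    return MANNER_OF.get(phoneme), PLACE_OF.get(phoneme)


def same_manner_and_place(phoneme):
    """Other phonemes sharing both features (e.g. voiced/unvoiced pairs)."""
    target = describe(phoneme)
    return sorted(p for p in MANNER_OF if p != phoneme and describe(p) == target)
```

For example, `describe("ng")` gives `("nasal", "velar")`, and `same_manner_and_place("p")` gives `["b"]`: manner and place alone do not separate /p/ from /b/, which differ in voicing.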

Acoustic-Phonetic Features: Place of Articulation
• Labial (/p/, /b/, /m/, /w/):
  • constriction (or complete closure) at lips
  • the only unvoiced labial is /p/
  • the only nasal labial is /m/
  • characterized by F1, F2, (even) F3 of adjacent vowel(s) rapidly and briefly decreasing at border with labial

Acoustic-Phonetic Features: Place of Articulation
• Labio-Dental (/f/, /v/):
  • produced by constriction between lower lip and upper teeth
  • only fricatives are labio-dental in English
  • can be characterized by rising formants into adjacent vowels (similar to characteristics of labials)
• Dental (/th/, /dh/):
  • produced by constriction between tongue tip and upper teeth (sometimes tongue tip is closer to alveolar ridge)
  • only fricatives are dental in English
  • may be characterized by stronger energy above 6 kHz, but weaker than /sh/, /zh/ fricatives

Acoustic-Phonetic Features: Place of Articulation
• Alveolar (/t/, /d/, /s/, /z/, /n/, /l/):
  • tongue tip is at or near alveolar ridge
  • a large number of English consonants are alveolar
  • primary cue to alveolars: F2 of neighboring vowel(s) is around 1800 Hz, except for /l/
  • /l/ has low F1 (500 Hz) and F2 (1000 Hz), high F3

Acoustic-Phonetic Features: Place of Articulation
• Palato-Alveolar (/sh/, /zh/, /ch/, /jh/, /r/):
  • tongue is between alveolar ridge and hard palate
  • 2 fricatives, 2 affricates, 1 retroflex
  • retroflex has “depression” midway along tongue
  • the palato-alveolar fricatives tend to have strong energy due to weak constriction allowing large airflow
  • /r/ (and /er/) most easily identified by F3 below 2000 Hz
• Palatal (/y/):
  • produced with tongue close to hard palate
  • “extreme” production of /iy/
  • F1-F2 tend to be more spread than /iy/, F1 is lower than /iy/

Acoustic-Phonetic Features: Place of Articulation
• Velar (/k/, /g/, /ng/):
  • produced with constriction against velum (soft palate)
  • only plosives /k/ and /g/, and nasal /ng/
  • characteristic of velars is the “velar pinch”, in which F2 and F3 of the neighboring vowel become very close at the boundary with the velar. More visible in front vowel /ih/

Acoustic-Phonetic Features: Place of Articulation
• Glottal (/h/):
  • /h/ is the nominal glottal phoneme in English; in reality, the tongue can be in any vowel-like position
  • the primary cue for /h/ is formant structure without voicing, an energy dip, and/or an increase in aspiration noise in higher frequencies.

Distinctive Phonetic Features: Summary
• Distinctive features may be used to categorize phonetic sub-classes and show relationships between phonemes
• There is often not a one-to-one correspondence between a feature value and a particular trait in the speech signal
• A variety of context-dependent and context-independent cues (sometimes conflicting, sometimes complementary) serve to identify features
• Speech is highly variable and highly context-dependent, and cues to phonemic identity are spread in both the spectral and time domains. This diffusion of features makes automatic speech recognition difficult, but human speech recognition is able to use it for robustness.

Redundancy
• Distinctive features are not always independent; some redundancy may be implied (especially with binary features)
• Example: Spanish
  +high → −low
  +low → −high
  −back → −round
  +round → +back
  +low → +back
  +low → −round
  −back → −low
  +round → −low
These relationships are language and feature-set specific. (from Schane, p. 35-38)
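The Spanish example can be checked mechanically. The sketch below reads the example as eight implications (unsigned features taken as minus values, each pair as "antecedent implies consequent") and tests them against a five-vowel feature matrix. The matrix itself is an assumption, not taken from the slide, and treats /a/ as +back; `VOWELS`, `RULES`, and `violations` are hypothetical names.

```python
# Assumed binary feature matrix for the five Spanish vowels
# (illustrative; the slide does not give the matrix itself).
VOWELS = {
    "i": {"high": True,  "low": False, "back": False, "round": False},
    "e": {"high": False, "low": False, "back": False, "round": False},
    "a": {"high": False, "low": True,  "back": True,  "round": False},
    "o": {"high": False, "low": False, "back": True,  "round": True},
    "u": {"high": True,  "low": False, "back": True,  "round": True},
}

# Each rule reads: (feature, value) implies (feature, value).
RULES = [
    (("high", True), ("low", False)),
    (("low", True), ("high", False)),
    (("back", False), ("round", False)),
    (("round", True), ("back", True)),
    (("low", True), ("back", True)),
    (("low", True), ("round", False)),
    (("back", False), ("low", False)),
    (("round", True), ("low", False)),
]


def violations(vowels, rules):
    """List every (vowel, rule) pair where the antecedent holds but the consequent fails."""
    bad = []
    for name, feats in vowels.items():
        for (f, v), (g, w) in rules:
            if feats[f] == v and feats[g] != w:
                bad.append((name, (f, v), (g, w)))
    return bad
```

Under these assumptions `violations(VOWELS, RULES)` is empty. A front rounded vowel, which Spanish lacks, breaks two of the rules, which illustrates why such redundancies are language specific.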

Redundancy
• Redundant information can be indicated by circling redundant features.
• Some redundancies are universal (can’t be +high and +low)
• Phonetic sequences also have constraints (redundant info.):
  • English has no more than 3 word-initial consonants; in this case, the first consonant is always /s/; the next is always /p/, /t/, or /k/; the third is always /r/ or /l/
(from Schane, p. 36-40)
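The three-consonant constraint above is simple enough to state as a predicate. This sketch encodes only the rule as given on the slide; the function name is hypothetical, and the phoneme symbols are the ones used in these notes.

```python
def valid_three_consonant_onset(c1, c2, c3):
    """Check a word-initial three-consonant cluster against the rule above:
    first /s/, then /p/, /t/, or /k/, then /r/ or /l/."""
    return c1 == "s" and c2 in {"p", "t", "k"} and c3 in {"r", "l"}
```

For example, the onsets of "string" (s, t, r) and "splash" (s, p, l) pass, while any cluster not starting with /s/ fails. Because the first slot is fully predictable, it carries no information once the listener knows three consonants are present.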

Phonetic Transcription
Given a corpus of speech data, it’s often necessary to create a transcription:
• word level
• phoneme level
• time-aligned phoneme level
• time-aligned detailed phoneme level (with diacritics)
• other information: phonetic stress, emotion, syntax, repair
Most common are word-level and time-aligned phoneme level.
Time-aligned phonetic transcription examples:
  0    110  .pau
  110  180  h
  180  240  eh
  240  280  l
  280  390  ow
  390  540  .pau
t uw .br
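A time-aligned transcription like the first example can be read as a sequence of (start, end, label) triples. The following is a minimal sketch of a parser, assuming whitespace-separated integer times (presumably in msec, which the slide does not state); `parse_labels` and `is_contiguous` are hypothetical helper names.

```python
def parse_labels(text):
    """Parse 'start end label' triples from whitespace-separated text."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected a multiple of three tokens")
    segments = []
    for i in range(0, len(tokens), 3):
        start, end, label = int(tokens[i]), int(tokens[i + 1]), tokens[i + 2]
        if end <= start:
            raise ValueError(f"segment {label!r} has non-positive duration")
        segments.append((start, end, label))
    return segments


def is_contiguous(segments):
    """True if every segment starts exactly where the previous one ends."""
    return all(a[1] == b[0] for a, b in zip(segments, segments[1:]))


EXAMPLE = "0 110 .pau 110 180 h 180 240 eh 240 280 l 280 390 ow 390 540 .pau"
segments = parse_labels(EXAMPLE)
phonemes = [label for _, _, label in segments if not label.startswith(".")]
```

Here `phonemes` is `["h", "eh", "l", "ow"]`; labels beginning with a period (such as `.pau`) are treated as non-speech markers, which is an assumption based on the example.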

Phonetic Transcription
Are phonemes precise quantities with exact boundaries?
No… humans disagree on phonetic labels and boundary positions; disagreement may be a matter of interpretation of the utterance.
Phonetic label agreement between humans:
Full, Base Label Set: 55 (English), 62 (German), 50 (Mandarin), 42 (Spanish)
Broad Categories: 7, corresponding to manner of articulation
*From Cole, Oshika, et al., ICSLP’94

Phonetic Transcription
• 70% agreement on 55 phonemes, 90% agreement on 7 categories??
• Best phoneme-level automatic speech recognition result on TIMIT, with a 39-phoneme symbol set: 75.8% (Antoniou and Reynolds)
• Differences:
  • Human agreement was evaluated on spontaneous speech (stories); TIMIT is read speech
  • Humans used 55 phonemes; 39 phonemes are used for evaluating TIMIT
• Phoneme agreement doesn’t translate into word accuracy… human word error rates are typically an order of magnitude lower than those of the best automatic speech recognition system.

Phonetic Transcription
Phonetic label boundary agreement between humans:
Agreement is measured by comparing two manual labelings, A and B, and computing the percentage of cases in which B labels are within some threshold (20 msec) of A labels.
[Figure: agreement (%) vs. threshold (msec)]
Average agreement of 93.8% within 20 msec threshold; maximum agreement of 96% within 20 msec
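The agreement measure described above can be sketched in a few lines. The sketch assumes that labelings A and B contain the same phoneme sequence, so that boundaries pair up one-to-one, and that "within" the threshold is inclusive; neither detail is stated on the slide. The function name and the sample boundary times are made up for illustration.

```python
def boundary_agreement(a, b, threshold_ms=20):
    """Percentage of boundaries in B lying within threshold_ms of the
    corresponding boundary in A (boundaries paired by position)."""
    if len(a) != len(b) or not a:
        raise ValueError("labelings must be non-empty and of equal length")
    hits = sum(1 for x, y in zip(a, b) if abs(x - y) <= threshold_ms)
    return 100.0 * hits / len(a)


# Hypothetical boundary times (msec) from two labelers of one utterance.
labels_a = [110, 180, 240, 280, 390]
labels_b = [112, 175, 265, 281, 400]
agreement_20 = boundary_agreement(labels_a, labels_b)        # 80.0
agreement_5 = boundary_agreement(labels_a, labels_b, 5)      # 60.0
```

Sweeping `threshold_ms` over a range of values gives a curve of agreement against threshold, the kind of plot the slide refers to.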

Phonetic Transcription
Is there a “correct” answer?
No; inherently subjective, although semi-arbitrary guidelines can be imposed.
Is measuring accuracy meaningless?
No; phonemes do have identity and order, although details may be subjective. Sometimes very precise (if semi-arbitrary) labels and boundaries are extremely important (e.g. concatenative text-to-speech databases).
What about getting a computer to generate transcriptions?
Advantages:
• consistent, fast
Disadvantages:
• not accurate, compared to human transcription
• not robust to different speakers, environments

Phonetic Transcription
• Automatic Phonetic Alignment (assume phonetic identity is known):
• Two common methods:
  (1) “Forced Alignment”: Use an existing speech recognizer, constrained to recognize only the “correct” phoneme sequence. The search process used by HMM recognizers returns both phoneme identity and location; the location information gives the boundaries.
  (2) Dynamic Time Warping: (a) Use text-to-speech or utterance “templates” to generate the same speech content with known boundaries. (b) Warp the time scale of the reference (TTS or template) to the input speech to minimize spectral error. (c) Convert the known boundary locations to the original time scale.
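The Dynamic Time Warping method can be illustrated with a toy example. This is a minimal sketch, not a production aligner: frames are single numbers standing in for spectral vectors, the distance is an absolute difference standing in for spectral error, and boundaries are frame indices rather than times. The names `dtw_path` and `map_boundaries` are hypothetical.

```python
def dtw_path(ref, inp, dist):
    """Classic DTW: return (total cost, warping path) aligning ref to inp.
    The path is a list of (ref_index, inp_index) pairs."""
    n, m = len(ref), len(inp)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], inp[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path


def map_boundaries(path, ref_boundaries):
    """Map each reference frame index to the first input frame aligned with it."""
    first = {}
    for i, j in path:
        first.setdefault(i, j)
    return [first[b] for b in ref_boundaries]


# Toy data: three "phonemes" (values 0, 1, 2) with different durations.
reference = [0, 0, 1, 1, 2, 2]          # known boundaries at frames 2 and 4
spoken = [0, 0, 0, 1, 2, 2, 2]
total, path = dtw_path(reference, spoken, lambda a, b: abs(a - b))
boundaries = map_boundaries(path, [2, 4])
```

In this toy case the warp has zero cost and the reference boundaries at frames 2 and 4 map to input frames 3 and 4. With real speech, each frame would be a spectral feature vector and the mapped frame indices would be converted to times using the frame rate.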

Phonetic Transcription
Accuracy of automatic alignment
Speaker-independent alignment using Forced Alignment:
[Figure: agreement (%) vs. threshold (msec)]

Phonetic Transcription
Comparing manual and automatic alignment of the TIMIT corpus:
• Automatic method still makes “stupid” mistakes.
• Manual labeling criteria are not rigorously defined.
• Performance degrades significantly in the presence of noise.
• Assumes the correct phonetic sequence is known…
