Introduction to Speech Synthesis and Production Systems

SPEECH SYNTHESIS
PC Pandey, EE Dept, IIT Bombay, March '03

Speech units
• Sentences & phrases
• Words
• Syllables
• Phonemes
• Subphonemic acoustic segments

Speech features
Prosodic (suprasegmental) features
• Intensity variation
• Pitch variation
Phonemic features
• Articulatory
• Acoustic
• Perceptual

Classification of phonemes
Vowels
• Pure vowels
• Diphthongs
Consonants
• Semivowels
• Whisper
• Stops
• Nasals
• Fricatives
• Affricates

Speech production system

Schematic of speech production

Vowel spectrum

Speech synthesis
Generation of speech by a machine
Applications
• Voice response systems (limited vocabulary)
• Text-to-speech synthesis (unlimited vocabulary)
• Analysis-by-synthesis (speech research)
• Generation of speech-like test signals
• Analysis-synthesis systems
  * channel capacity reduction
  * secure communication
  * speech enhancement
  * voice transformation
  * processing for hearing aids

Development of speech synthesizers
• Mechanical / electro-mechanical (1760-1930)
• Electronic analog with keyboard input (1930s)
• Electronic analog analysis-synthesis systems (1930-50)
• Digital synthesizers (1950 onwards)
  * software based
  * hardware based

Mechanical synthesizers
• Von Kempelen, 1780
• Wheatstone's speaking machine

Riesz, 1930s: Speaking machine

Dudley, 1930s: Voder
Electronic analog synthesizer with mechanical keyboard

Fant, 1950s: OVE

Holmes, 1960s: Parallel formant synthesizer

Klatt, 1970s: Cascade/parallel formant synthesizer

Modern synthesis approaches
Waveform based
• high quality natural output
• limited vocabulary
• large storage requirement
Speech model based
• unlimited speech synthesis with small storage
• difficulty in parameter generation & concatenation
Text-to-speech synthesis
• Text pre-processing & phonetic transcription
• Parsing for syntactic & semantic structure
• Prosodic information & sound units
• Speech waveform generation

Speech model based approaches
• Articulatory
• Source-filter
  * channel vocoder
  * LPC vocoder
  * homomorphic vocoder
  * formant-based synthesizer
• Acoustic
  * phase vocoder
  * sinusoidal model
  * harmonic plus noise model (HNM)
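As an illustration of the source-filter idea behind a formant-based synthesizer, here is a minimal sketch, not any particular synthesizer's implementation. It assumes an impulse-train source and a cascade of two-pole digital resonators, one per formant. The function names, sampling rate, and formant values are hypothetical choices for this example only.

```python
import numpy as np

def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients of a two-pole resonator y[n] = a0*x[n] + b1*y[n-1] + b2*y[n-2]."""
    r = np.exp(-np.pi * bw_hz / fs)          # pole radius set by the bandwidth
    theta = 2.0 * np.pi * freq_hz / fs       # pole angle set by the centre frequency
    b1 = 2.0 * r * np.cos(theta)
    b2 = -r * r
    a0 = 1.0 - b1 - b2                       # normalizes the gain at 0 Hz to 1
    return a0, b1, b2

def resonate(x, freq_hz, bw_hz, fs):
    """Filter x through one formant resonator."""
    a0, b1, b2 = resonator_coeffs(freq_hz, bw_hz, fs)
    y = np.zeros(len(x))
    y1 = y2 = 0.0
    for n, xn in enumerate(x):
        yn = a0 * xn + b1 * y1 + b2 * y2
        y[n] = yn
        y2, y1 = y1, yn
    return y

def synthesize_vowel(f0_hz, formants, fs=8000, duration_s=0.2):
    """Source: impulse train at the pitch period. Filter: cascade of resonators."""
    n = int(fs * duration_s)
    period = int(round(fs / f0_hz))
    source = np.zeros(n)
    source[::period] = 1.0                   # one impulse per pitch period
    out = source
    for freq_hz, bw_hz in formants:          # (centre frequency, bandwidth) pairs
        out = resonate(out, freq_hz, bw_hz, fs)
    return out

# Illustrative (assumed) formant values, not measured data.
vowel = synthesize_vowel(100, [(700, 90), (1200, 110)])
```

Changing `f0_hz` alters the pitch without moving the formants, which is the separation of source and filter that this family of models relies on.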

HARMONIC PLUS NOISE MODEL (Stylianou, 1995; 2001)
Speech signal divided into:
• harmonic part
• noise part
Parameters:
• Harmonic amplitudes and phases
• Max. voiced frequency
• V/UV & pitch
• Noise parameters
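The split into a harmonic part and a noise part can be sketched for a single voiced frame as follows. This is a simplified illustration under stated assumptions, not the published HNM algorithm: the harmonic part is a sum of sinusoids at multiples of the pitch up to the maximum voiced frequency, and the noise part is approximated here by white noise with its spectrum zeroed at and below that frequency. The function name, argument names, and default values are hypothetical.

```python
import numpy as np

def hnm_frame(f0_hz, amps, phases, fmax_voiced_hz, noise_gain,
              fs=16000, n=320, rng=None):
    """Return (harmonic, noise) for one voiced frame of n samples."""
    rng = np.random.default_rng(0) if rng is None else rng
    t = np.arange(n) / fs

    # Harmonic part: keep only harmonics at or below the max voiced frequency.
    harmonic = np.zeros(n)
    for k, (a, ph) in enumerate(zip(amps, phases), start=1):
        if k * f0_hz > fmax_voiced_hz:
            break
        harmonic += a * np.cos(2.0 * np.pi * k * f0_hz * t + ph)

    # Noise part (assumed simplification): white noise restricted to the
    # band above the max voiced frequency by zeroing FFT bins.
    spec = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    spec[freqs <= fmax_voiced_hz] = 0.0
    noise = noise_gain * np.fft.irfft(spec, n)
    return harmonic, noise

harmonic, noise = hnm_frame(100, [1.0, 0.5, 0.25], [0.0, 0.0, 0.0],
                            fmax_voiced_hz=250, noise_gain=0.1)
frame = harmonic + noise   # synthesized frame is the sum of both parts
```

For an unvoiced frame the harmonic part would be absent and only the noise part would be generated, which is where the V/UV parameter enters.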

IMPLEMENTATION OF HNM

ANALYSIS

SYNTHESIS

SEGMENT CONCATENATION
For generation of longer units from smaller ones.
Steps:
1) Parsing of the phonetic transcript
2) Fetching the parameters of the required units
3) Pitch and intensity modifications for prosody
4) Smoothing of the parameter tracks at unit boundaries
5) Interpolation of the parameters over the frame length from end point values
6) Synthesis
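Steps 4 and 5 above can be sketched as follows. This is a minimal illustration, assuming each parameter (for example one harmonic amplitude) is stored as one value per frame, that the boundary mismatch is spread linearly over a few frames on each side, and that both units are non-empty. The function names and the `span` parameter are hypothetical, and the slides do not specify the exact smoothing rule used.

```python
import numpy as np

def smooth_boundary(left, right, span=3):
    """Join two per-frame parameter tracks, spreading the jump at the
    unit boundary linearly over `span` frames on each side."""
    left = np.asarray(left, dtype=float).copy()
    right = np.asarray(right, dtype=float).copy()
    span = min(span, len(left), len(right))
    jump = right[0] - left[-1]                    # mismatch at the boundary
    ramp = np.arange(1, span + 1) / (2.0 * span)  # rises to 1/2 at the boundary
    left[-span:] += jump * ramp                   # left unit moves up to the midpoint
    right[:span] -= jump * ramp[::-1]             # right unit moves down to the midpoint
    return np.concatenate([left, right])

def interpolate_frame(start_value, end_value, frame_len):
    """Per-sample linear interpolation of one parameter across a frame,
    from its value at the frame start towards its value at the frame end."""
    return np.linspace(start_value, end_value, frame_len, endpoint=False)

track = smooth_boundary([1, 1, 1, 1], [3, 3, 3, 3], span=2)
samples = interpolate_frame(track[2], track[3], frame_len=4)
```

With these inputs the two units meet at the midpoint of the original jump, so the joined track has no step at the boundary.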

RESULTS
• All VCV syllables and vowels are natural & intelligible when synthesized using the harmonic part only, except /a∫a/ and /asa/
• HNM preserves speaking styles (anger, high articulatory rate)
Synthesized /a∫a/
Synthesized /asa/

RESULTS (continued)
• Glottal closure instants (GCIs) obtained from the glottal signal give better synthesis.
Pitch contours for "/ap kΛhœn ja rΛhE hœn/"
• From glottal signal
• From speech (Childers and Hu, 1994)

RESULTS (continued)
• Good quality of the larger units constructed from parameters of the smaller units.
Recorded /ΛbhImani/
Synthesized from /ΛbhI/, /Ima/, /ani/

DEMONSTRATIONS

Further developments
• High quality multilingual / multi-dialect text-to-speech synthesis
• Voice transformations
• Processing for aids for the hearing impaired

THANKS
