speech synthesis n.
Skip this Video
Download Presentation
Speech Synthesis

Loading in 2 Seconds...

play fullscreen
1 / 38

Speech Synthesis - PowerPoint PPT Presentation

  • Uploaded on

Speech Synthesis. April 8, 2008. Some Reminders. Class presentations begin on Thursday: Jenessa Tara Nicole Joel I’m planning on passing out a final exam review sheet on Thursday, too. Lastly: the teen buzz. Moral of the Story.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Speech Synthesis' - lundy

Download Now An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
speech synthesis

Speech Synthesis

April 8, 2008

some reminders
Some Reminders
  • Class presentations begin on Thursday:
    • Jenessa
    • Tara
    • Nicole
    • Joel
  • I’m planning on passing out a final exam review sheet on Thursday, too.
  • Lastly: the teen buzz.
moral of the story
Moral of the Story
  • Remember--categorical perception was initially used to justify the claim that listeners converted a continuous signal into a discrete linguistic representation.
  • In reality, listeners don’t just discard all the continuous information.
    • (especially for sounds like vowels)
  • Perceptual categories have to be more richly detailed than the classical categories found in phonology textbooks.
  • (We need details in order to deal with variability.)
wait a minute
Wait a minute…
  • (Classical) Categorical perception really does occur…
    • But only in limited circumstances.
  • Works best for:
  • Sounds with rapid transitions
    • (consonants, not vowels)
  • Tasks that require retaining more than one sound in memory.
    • Ex: AXB discrimination induces more categoriality than AX discrimination.
  • In these circumstances, sounds are stored in memory with less acoustic details in them.
cp results
CP Results

Responses to different pairs

  • Generally: more “correct” different responses than predicted.
  • Experienced listeners gave more different responses than new listeners.
cp results1
CP Results

Responses to same pairs

  • Experienced listeners also gave more “different” responses in this condition.
  • = Indicative of response bias
perceptual recap
Perceptual Recap
  • Overall percent correct:
    • Experienced listeners = 75%
    • New listeners = 78.5%
  • Overall percentage of “different” responses:
    • Experienced listeners = 39.2%
    • New listeners = 29.5%
  • (Another) moral of the story:
    • Correct response percentages can be boosted by bias towards one response over another.
    •  Correct answers don’t always reflect sensitivity.
perceptual recap ii
Perceptual Recap II
  • Continuous word recognition scores:
    • New items correctly recognized: 97.7%
    • Repeated items correctly recognized: 67.7%
  • Of the repeated items:
    • Same voice: 73.6%
    • Different voice: 61.8%
  • After 10 intervening stimuli:
    • Same voice: 72.7%
    • Different voice: 61.8%
  • General finding: same voice effect does not diminish over time.
speech synthesis a basic overview
Speech Synthesis:A Basic Overview
  • Speech synthesis is the generation of speech by machine.
  • The reasons for studying synthetic speech have evolved over the years:
  • Novelty
  • To control acoustic cues in perceptual studies
  • To understand the human articulatory system
    • “Analysis by Synthesis”
  • Practical applications
    • Reading machines for the blind, navigation systems
speech synthesis a basic overview1
Speech Synthesis:A Basic Overview
  • There are four basic types of synthetic speech:
  • Mechanical synthesis
  • Formant synthesis
    • Based on Source/Filter theory
  • Concatenative synthesis
    • = stringing bits and pieces of natural speech together
  • Articulatory synthesis
    • = generating speech from a model of the vocal tract.
1 mechanical synthesis
1. Mechanical Synthesis
  • The very first attempts to produce synthetic speech were made without electricity.
    • = mechanical synthesis
  • In the late 1700s, models were produced which used:
    • reeds as a voicing source
    • differently shaped tubes for different vowels
mechanical synthesis part ii
Mechanical Synthesis, part II
  • Later, Wolfgang von Kempelen and Charles Wheatstone created a more sophisticated mechanical speech device…
    • with independently manipulable source and filter mechanisms.
mechanical synthesis part iii
Mechanical Synthesis, part III
  • An interesting historical footnote:
    • Alexander Graham Bell and his dog.
  • Mechanical synthesis has largely gone out of style ever since.
    • …but check out Mike Brady’s talking robot.
the voder
The Voder
  • The next big step in speech synthesis was to generate speech electronically.
  • This was most famously demonstrated at the New York World’s Fair in 1939 with the Voder.
  • The Voder was a manually controlled speech synthesizer.
    • (operated by highly trained young women)
voder principles
Voder Principles
  • The Voder basically operated like a vocoder.
  • Voicing and fricative source sounds were filtered by 10 different resonators…
  • each controlled by an individual finger!
  • Only about 1 in 10 had the ability to learn how to play the Voder.
the pattern playback
The Pattern Playback
  • Shortly after the invention of the spectrograph, the pattern playback was developed.
    • = basically a reverse spectrograph.
  • Idea at this point was still to use speech synthesis to determine what the best cues were for particular sounds.
2 formant synthesis
2. Formant Synthesis
  • The next synthesizer was PAT (Parametric Artificial Talker).
  • PAT was a parallel formant synthesizer.
  • Idea: three formants are good enough for intelligble speech.
  • Subtitles: What did you say before that? Tea or coffee? What have you done with it?
2 formant synthesis part ii
2. Formant Synthesis, part II
  • Another formant synthesizer was OVE, built by the Swedish phonetician Gunnar Fant.
  • OVE was a cascade formant synthesizer.
  • In the ‘50s and ‘60s, people debated whether parallel or cascade synthesis was better.
  • Weeks and weeks of tuning each system could get much better results:
synthesis by rule
Synthesis by rule
  • The ultimate goal was to get machines to generate speech automatically, without any manual intervention.
    • synthesis by rule
  • A first attempt, on the Pattern Playback:
  • (I painted this by rule without looking at a spectrogram. Can you understand it?)
  • Later, from 1961, on a cascade synthesizer:
    • Note: first use of a computer to calculate rules for synthetic speech.
  • Compare with the HAL 9000:
parallel vs cascade
Parallel vs. Cascade
  • The rivalry between the parallel and cascade camps continued into the ‘70s.
  • Cascade synthesizers were good at producing vowels and required fewer control parameters…
    • but were bad with nasals, stops and fricatives.
  • Parallel synthesizers were better with nasals and fricatives, but not as good with vowels.
  • Dennis Klatt proposed a synthesis (sorry):
    • and combined the two…
  • KlattTalk has since become the standard for formant synthesis. (DECTalk)
  • http://www.asel.udel.edu/speech/tutorials/synthesis/vowels.html
  • Dennis Klatt also made significant improvements to the artificial voice source waveform.
  • Perfect Paul:
  • Beautiful Betty:
  • Female voices have remained problematic.
  • Also note: lack of jitter and shimmer
lpc synthesis
LPC Synthesis
  • Another method of formant synthesis, developed in the ‘70s, is known as Linear Predictive Coding (LPC).
  • Here’s an example:
  • As a general rule, LPC synthesis is pretty lousy.
    • But it’s cheap!
  • LPC synthesis greatly reduces the amount of information in speech…
  • To recapitulate childhood: http://www.speaknspell.co.uk/
filters lpc
Filters + LPC
  • One way to understand LPC analysis is to think about a moving average filter.
  • A moving average filter reduces noise in a signal by making each point equal to the average of the points surrounding it.

yn = (xn-2 + xn-1 + xn + xn+1 + xn+2) / 5

filters lpc1
Filters + LPC
  • Another way to write the smoothing equation is
    • yn = .2*xn-2 + .2*xn-1 + .2*xn + .2*xn+1 + .2*xn+2
  • Note that we could weight the different parts of the equation differently.
    • Ex: yn = .1*xn-2 + .2*xn-1 + .4*xn + .2*xn+1 + .1*xn+2
  • Another trick: try to predict future points in the waveform on the basis of only previous points.
  • Objective: find the combination of weights that predicts future points as perfectly as possible.
deriving the filter
Deriving the Filter
  • Let’s say that minimizing the prediction errors for a certain waveform yields the following equation:
    • yn = .5*xn - .3*xn-1 + .2*xn-2 - .1*xn-3
  • The weights in the equation define a filter.
  • Example: how would the values of y change if the input to the equation was a transient where:
    • at time n, x = 1
    • at all other times, x = 0
  • Graph y at times n to n+3.
decomposing the filter
Decomposing the Filter
  • Putting a transient into the weighted filter equation yields a new waveform:
  • The new equation reflects the weights in the equation.
  • We can apply Fourier Analysis to the new waveform to determine its spectral characteristics.
lpc spectrum
LPC Spectrum
  • When we perform a Fourier Analysis on this waveform, we get a very smooth-looking spectrum function:

LPC spectrum

Original spectrum

  • This function is a good representation of what the vocal tract filter looks like.
lpc applications
LPC Applications
  • Remember: the LPC spectrum is derived from the weights of a linear predictive equation.
  • One thing we can do with the LPC-derived spectrum is estimate formant frequencies of a filter.
    • (This is how Praat does it)
  • Note: the more weights in the original equation, the more formants are assumed to be in the signal.
  • We can also use that LPC-derived filter, in conjunction with a voice source, to create synthetic speech.
    • (Like in the Speak & Spell)
3 concatenative synthesis
3. Concatenative Synthesis
  • Formant synthesis dominated the synthetic speech world up until the ‘90s…
    • Then concatenative synthesis started taking over.
  • Basic idea: string together recorded samples of natural speech.
  • Most common option: “diphone” synthesis
    • Concatenated bits stretch from the middle of one phoneme to the middle of the next phoneme.
  • Note: inventory has to include all possible phoneme sequences
    • = only possible with lots of computer memory.
concatenated samples
Concatenated Samples
  • Concatenated synthesis tends to sound more natural than formant synthesis.
    • (basically because of better voice quality)
  • Early (1977) combination of LPC + diphone synthesis:
  • LPC + demisyllable-sized chunks (1980):
  • More recent efforts with the MBROLA synthesizer:
  • Also check out the Macintalk Pro synthesizer!
recent developments
Recent Developments
  • Contemporary concatenative speech synthesizers use variable unit selection.
  • Idea: record a huge database of speech…
    • And play back the largest unit of speech you can, whenever you can.
  • Interesting development #2: synthetic voices tailored to particular speakers.
  • Check it out:
4 articulatory synthesis
4. Articulatory Synthesis
  • Last but not least, there is articulatory synthesis.
    • Generation of acoustic signals on the basis of models of the vocal tract.
  • This is the most complicated of all synthesis paradigms.
    • (we don’t understand articulations all that well)
  • Some early attempts:
  • Paul Boersma built his own articulatory synthesizer…
    • and incorporated it into Praat.
synthetic speech perception
Synthetic Speech Perception
  • In the early days, speech scientists thought that synthetic speech would lead to a form of “super speech”
    • = ideal speech, without any of the extraneous noise of natural productions.
  • However, natural speech is always more intelligible than synthetic speech.
    • And more natural sounding!
  • But: perceptual learning is possible.
    • Requires lots and lots of practice.
    • And lots of variability. (words, phonemes, contexts)
  • An extreme example: blind listeners.
more perceptual findings
More Perceptual Findings

Reducing the number of possible messages dramatically increases intelligibility.

more perceptual findings1
More Perceptual Findings

2. Formant synthesis produces better vowels;

    • Concatenative synthesis produces better consonants (and transitions)

3. Synthetic speech uses up more mental resources.

    • memory and recall of number lists
  • Synthetic speech perception is a lot easier for native speakers of a language.
    • And also adults.

5. Older listeners prefer slower rates of speech.

audio visual speech synthesis
Audio-Visual Speech Synthesis
  • The synthesis of audio-visual speech has primarily been spearheaded by Dominic Massaro, at UC-Santa Cruz.
    • “Baldi”
  • Basic findings:
    • Synthetic visuals can induce the McGurk effect.
    • Synthetic visuals improve perception of speech in noise
      • …but not as well as natural visuals.
  • Check out some samples.