
Overview of Text to Speech



Presentation Transcript


  1. Overview of Text to Speech “Getting the computer to read your printed document out loud”

  2. Text to Speech • “Text-to-Speech software is used to convert words from a computer document (e.g. word processor document, web page) into audible speech spoken through the computer speaker”

  3. Benefits • The benefits of speech synthesis have been many, including computers that can read books to people, better hearing aids, more simultaneous telephone conversations on the same cable, talking machines for vocally impaired or deaf people and better aids for speech therapy.

  4. The history of speech synthesis • What you may not know is that the first synthetic speech was produced as early as the late 18th century. • The machine was built of wood and leather and was very complicated to use, yet it generated audible speech. It was constructed by Wolfgang von Kempelen and was of great importance in the early study of phonetics. • The following picture shows the original construction as it can be seen at the Deutsches Museum (von Meisterwerken der Naturwissenschaft und Technik) in Munich, Germany.

  5. Von Kempelen's Machine

  6. Voder • In the early 20th century it became possible to use electricity to create synthetic speech. The first known electrical speech synthesizer was the "Voder"; its creator, Homer Dudley, demonstrated it to a broader audience at the 1939 World's Fair in New York.

  7. OVE • One of the pioneers of the development of speech synthesis in Sweden was Gunnar Fant. • During the 1950s he was responsible for the development of the first Swedish speech synthesizer, OVE (Orator Verbis Electris). • At that time only Walter Lawrence's Parametric Artificial Talker (PAT) could compete with OVE in speech quality. • OVE and PAT were text-to-speech systems using Formant (parametric) synthesis.

  8. Speech synthesis becomes more human-like • The greatest improvements in naturalness have come during the last 10 years. • The first voices we used for ReadSpeaker back in 2001 were produced using Diphone synthesis. • These voices are sampled from real recorded speech and split into phonemes, small units of human speech; this was the first example of Concatenation synthesis. However, they still have an artificial, synthetic sound. We still use diphone voices for some smaller languages, and they are widely used to speech-enable handheld computers and mobile phones due to their limited resource consumption, in both memory and CPU.

  9. Unit Selection • It wasn't until the introduction of a technique called Unit selection that voices started to sound very natural. This is still concatenation synthesis, but the units used are larger than phonemes, sometimes a complete sentence.

  10. Why use Speech Synthesis • Visual issue (difficulty seeing text) • Cognitive issue (low reading level/comprehension) • Motor issue (difficulty handling a book or paper)

  11. Forms of Text • E-text: most of the text you see on your computer. Examples: Internet, email, word processor documents, e-books. • Paper text: any text printed on paper. Examples: newspapers, books, magazines.

  13. Characteristics of Speech synthesis systems • Many speech synthesis systems take text as input and produce speech as output. • Hence they are often known as text-to-speech (TTS) systems. • The naturalness of a speech synthesizer usually refers to how much the output sounds like the speech of a real person. • The intelligibility of a speech synthesizer refers to how easily the output can be understood.

  14. Parts of Speech Synthesizers • Speech synthesizers usually consist of two parts, often called the front end and the back end.

  15. First Part • The first part has two major tasks. First it takes the raw text and converts things like numbers and abbreviations into their written-out word equivalents. This process is often called text normalization, pre-processing, or tokenization. Then it assigns phonetic transcriptions to each word, and divides and marks the text into various linguistic units like phrases, clauses, and sentences. The combination of phonetic transcriptions and prosody information makes up the symbolic linguistic representation that the first part of the system passes to the second part.
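
To make the front end's data flow concrete, here is a minimal Python sketch of the two tasks; the names (SymbolicRepresentation, normalize, to_phonemes, front_end) and the tiny replacement table are invented for illustration and do not belong to any particular TTS product.

```python
from dataclasses import dataclass

@dataclass
class SymbolicRepresentation:
    """Output of the front end: phonetic transcription plus prosody marks."""
    words: list          # normalized, written-out words
    phonemes: list       # one phoneme list per word
    phrase_breaks: list  # indices of words followed by a phrase boundary

def normalize(raw_text: str) -> list:
    """Very small stand-in for text normalization / tokenization."""
    replacements = {"Dr.": "doctor", "etc.": "et cetera", "3": "three"}
    tokens = raw_text.split()
    return [replacements.get(t, t.lower().strip(",.")) for t in tokens]

def to_phonemes(word: str) -> list:
    """Placeholder grapheme-to-phoneme step (see the dictionary and rule-based slides below)."""
    return list(word)  # letters as fake 'phonemes', just to show the data flow

def front_end(raw_text: str) -> SymbolicRepresentation:
    words = normalize(raw_text)
    phonemes = [to_phonemes(w) for w in words]
    phrase_breaks = [i for i, t in enumerate(raw_text.split()) if t.endswith((",", "."))]
    return SymbolicRepresentation(words, phonemes, phrase_breaks)

print(front_end("Dr. Smith arrived, etc."))
```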

  16. Second Part • The other part, the back end, takes the symbolic linguistic representation and converts it into actual sound output. • The back end is often referred to as the synthesizer.

  17. Text normalization challenges • The process of normalizing text is rarely straightforward. Texts are full of homographs (i.e. words that are spelt the same but are pronounced differently, e.g. Read the book, The book was read), numbers and abbreviations that all ultimately require expansion into a phonetic representation. • There are many words in English which are pronounced differently based on context (i.e. homographs). Some examples: • project: My latest project is to learn how to better project my voice. • bow: The girl with the bow in her hair was told to bow deeply when greeting her superiors.

  18. Most TTS systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well-understood, or computationally effective.

  19. Numbers • Deciding how to convert numbers is another problem TTS systems have to address. • It is a fairly simple programming challenge to convert a number into words, like 1325 becoming "one thousand three hundred twenty-five". • However, numbers occur in many different contexts in texts, and 1325 should probably be read as "thirteen twenty-five" when part of an address (1325 Main St.) and as "one three two five" if it is the last four digits of a social security number. • Often a TTS system can infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the systems provide a way to specify the type of context if it is ambiguous.
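
A small Python sketch of such context-dependent number expansion, using the slide's 1325 example; the heuristics (a four-digit number followed by a capitalized word is read as an address, an explicit id_digits flag triggers digit-by-digit reading) are purely illustrative.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

def two_digits(n: int) -> str:
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def cardinal(n: int) -> str:
    """Cardinal reading for 0..9999, which is enough for this sketch."""
    if n < 100:
        return two_digits(n)
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + two_digits(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return ONES[thousands] + " thousand" + (" " + cardinal(rest) if rest else "")

def expand_number(token: str, next_token: str = "", id_digits: bool = False) -> str:
    """Pick a reading for a digit string based on (very rough) context cues."""
    if id_digits:                                   # e.g. last four digits of an ID number
        return " ".join(ONES[int(d)] for d in token)
    if len(token) == 4 and next_token.istitle():    # house number: "1325 Main St."
        return two_digits(int(token[:2])) + " " + two_digits(int(token[2:]))
    return cardinal(int(token))

print(expand_number("1325"))                     # one thousand three hundred twenty-five
print(expand_number("1325", next_token="Main"))  # thirteen twenty-five
print(expand_number("1325", id_digits=True))     # one three two five
```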

  20. Abbreviations • Similarly, abbreviations like "etc." are easily rendered as "et cetera", but often abbreviations can be ambiguous. • For example, the abbreviation "in." in the following example: "Yesterday it rained 3 in. Take 1 out, then put 3 in." • "St." can also be ambiguous: "St. John St." • TTS systems with intelligent front ends can make educated guesses about how to deal with ambiguous abbreviations, while others do the same thing in all cases, resulting in nonsensical but sometimes comical outputs: "Yesterday it rained three in." or "Take one out, then put three inches."
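
The same kind of educated guessing can be sketched for abbreviations; the two rules below are toy heuristics, and as the comment notes they would still misread the second "in." in the slide's example.

```python
def expand_abbreviation(token: str, prev_token: str = "", next_token: str = "") -> str:
    """Educated guesses for ambiguous abbreviations; the rules are toy heuristics."""
    if token == "in.":
        # After a number, "in." is usually the unit "inches" -- but note that this
        # still misreads "then put 3 in.", which is exactly the ambiguity above.
        return "inches" if prev_token.rstrip(".,").isdigit() else "in"
    if token == "St.":
        # Before a capitalized name, read "Saint"; otherwise assume "Street".
        return "Saint" if next_token[:1].isupper() else "Street"
    return token

print(expand_abbreviation("in.", prev_token="3"))     # inches
print(expand_abbreviation("St.", next_token="John"))  # Saint
print(expand_abbreviation("St.", prev_token="John"))  # Street
```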

  21. Text-to-phoneme challenges • Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-phoneme conversion, as phoneme is the term used by linguists to describe distinctive sounds in a language.

  22. Dictionary Based approach • The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciation is stored by the program. Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary.
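
A minimal sketch of the dictionary-based approach, assuming a toy lexicon with ARPAbet-style symbols; real lexicons contain well over 100,000 entries.

```python
# Toy pronunciation lexicon (purely illustrative).
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "hello":  ["HH", "AH", "L", "OW"],
    "world":  ["W", "ER", "L", "D"],
}

def dictionary_g2p(word: str):
    """Dictionary-based text-to-phoneme lookup; None means the word is out of vocabulary."""
    return LEXICON.get(word.lower())

print(dictionary_g2p("Hello"))    # ['HH', 'AH', 'L', 'OW']
print(dictionary_g2p("zyzzyva"))  # None -> falls through to the rule-based approach below
```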

  23. Rule based approach • The other approach used for text-to-phoneme conversion is the rule-based approach, where rules for the pronunciations of words are applied to words to work out their pronunciations based on their spellings. This is similar to the "sounding out" approach to learning reading.
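
A minimal sketch of the rule-based "sounding out" approach, using an invented and greatly simplified rule table; in practice the dictionary lookup above is tried first and rules like these only handle out-of-vocabulary words.

```python
# Ordered spelling-to-sound rules, longer spellings first; a real rule set is far
# larger and context-sensitive. Symbols are ARPAbet-style and purely illustrative.
RULES = [("sh", ["SH"]), ("ch", ["CH"]), ("th", ["TH"]), ("ee", ["IY"]), ("oo", ["UW"]),
         ("a", ["AE"]), ("e", ["EH"]), ("i", ["IH"]), ("o", ["AA"]), ("u", ["AH"]),
         ("b", ["B"]), ("c", ["K"]), ("d", ["D"]), ("f", ["F"]), ("g", ["G"]),
         ("h", ["HH"]), ("k", ["K"]), ("l", ["L"]), ("m", ["M"]), ("n", ["N"]),
         ("p", ["P"]), ("r", ["R"]), ("s", ["S"]), ("t", ["T"]), ("v", ["V"]),
         ("w", ["W"]), ("y", ["Y"]), ("z", ["Z"])]

def rule_based_g2p(word: str):
    """'Sound out' a word by greedily applying spelling-to-sound rules left to right."""
    word, phonemes, i = word.lower(), [], 0
    while i < len(word):
        for spelling, sounds in RULES:
            if word.startswith(spelling, i):
                phonemes.extend(sounds)
                i += len(spelling)
                break
        else:
            i += 1   # skip letters this toy rule set does not cover (j, q, x, ...)
    return phonemes

print(rule_based_g2p("cheese"))  # ['CH', 'IY', 'S', 'EH'] -- crude, but a usable guess
```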

  24. Synthesizer technologies • There are two main technologies used for generating synthetic speech waveforms: concatenative synthesis and formant synthesis, sometimes called parametric speech synthesis. • There are others, such as • Recorded prompts • Intonation modeling

  25. Formant Synthesis • Formant synthesis does not use any human speech samples at runtime. Instead, the output synthesized speech is created using an acoustic model. • Parameters such as frequency and amplitude are varied over time to create a waveform of artificial speech. • This method is sometimes called rule-based synthesis, but some argue that because many concatenative systems use rule-based components for some parts of the system, like the front end, the term is not specific enough.

  26. Many systems based on formant synthesis technology generate artificial, robotic-sounding speech, and the output would never be mistaken for the speech of a real human. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have some advantages over concatenative systems.

  27. First, formant-synthesized speech can be very reliably intelligible, even at very high speeds, avoiding the acoustic glitches that often plague concatenative systems. • High-speed synthesized speech is often used by the visually impaired for quickly navigating computers using a screen reader. • Second, formant synthesizers are often smaller programs than concatenative systems because they do not have a database of speech samples. • Last, because formant-based systems have total control over all aspects of the output speech, a wide variety of prosody can be output, conveying not just questions and statements but a variety of emotions and tones of voice.

  28. Formant • This synthesis is a kind of source-filter method based on mathematical models of the human speech organs. The vocal tract is modelled as a number of resonances resembling the formants (frequency bands with high energy) in natural speech. The first electronic voices, Voder and later on OVE and PAT, spoke with totally synthetic, electronically produced sounds using formant synthesis. As with articulatory synthesis, the memory consumption is small but CPU usage is large.

  29. The Source-Filter Model of Formant Synthesis • Excitation or Voicing Source(s) to model sound source • standard wave of glottal pulses for voiced sounds • randomly varying noise for unvoiced sounds • modification of airflow due to lips, etc.

  30. Formant Synthesis continued • high frequency (F0 rate), quasi-periodic, choppy • modeled with vector of glottal waveform patterns in voiced regions • Acoustic Filter(s) • shapes the frequency character of vocal tract and radiation character at the lips • relatively slow (samples around 5ms suffice) and stationary • modeled with LPC (linear predictive coding)
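
The source-filter idea can be sketched numerically: a pulse train at the fundamental frequency is passed through a cascade of second-order resonators placed at formant frequencies. The sketch assumes numpy and scipy are available, and the formant values for the vowel are only ballpark figures, not taken from any particular synthesizer.

```python
import numpy as np
from scipy.signal import lfilter

fs, f0, duration = 16000, 120, 0.5        # sample rate (Hz), pulse rate (Hz), seconds

# Source: an impulse train standing in for the quasi-periodic glottal pulses.
source = np.zeros(int(fs * duration))
source[::fs // f0] = 1.0

# Filter: a cascade of second-order resonators at rough formant values for a vowel
# like /a/; centre frequencies and bandwidths below are only ballpark figures.
formants = [(730, 90), (1090, 110), (2440, 170)]   # (frequency Hz, bandwidth Hz)
signal = source
for freq, bw in formants:
    r = np.exp(-np.pi * bw / fs)                   # pole radius from the bandwidth
    theta = 2 * np.pi * freq / fs                  # pole angle from the frequency
    a = [1.0, -2.0 * r * np.cos(theta), r * r]     # resonator denominator coefficients
    b = [1.0 - r]                                  # crude gain normalisation
    signal = lfilter(b, a, signal)

signal /= np.max(np.abs(signal))                   # normalise before playback or writing
# e.g. scipy.io.wavfile.write("vowel.wav", fs, (signal * 32767).astype(np.int16))
```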

  31. Concatenative synthesis • Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. • Generally, concatenative synthesis gives the most natural sounding synthesized speech. • However, natural variation in speech and automated techniques for segmenting the waveforms sometimes result in audible glitches in the output, detracting from the naturalness. There are three main subtypes of concatenative synthesis:

  32. Subtypes • Unit selection synthesis uses large speech databases (more than one hour of recorded speech). During database creation, each recorded utterance is segmented into some or all of the following linguistic constructs: phonemes, words, phrases, and sentences. • Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a given language. In diphone synthesis, only one example of each diphone is contained in the speech database. • Domain-specific synthesis concatenates pre-recorded words and phrases to create complete utterances.

  33. Concatenative Synthesis • Record basic inventory of sounds • Retrieve appropriate sequence of units at run time • Concatenate and adjust durations and pitch • Synthesize waveform
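
A minimal numpy sketch of the concatenation step, assuming the units have already been recorded and selected; the duration and pitch adjustment (e.g. PSOLA) is deliberately omitted, and the crossfade length is an arbitrary choice.

```python
import numpy as np

def concatenate_units(unit_waveforms, unit_sequence, fs=16000, crossfade_ms=10):
    """Join pre-recorded units with a short linear crossfade to hide each boundary.

    unit_waveforms: dict mapping a unit name (e.g. a diphone label) to a 1-D array.
    unit_sequence:  unit names chosen for this utterance at run time.
    """
    xf = int(fs * crossfade_ms / 1000)
    fade_in, fade_out = np.linspace(0.0, 1.0, xf), np.linspace(1.0, 0.0, xf)

    output = unit_waveforms[unit_sequence[0]].copy()
    for name in unit_sequence[1:]:
        unit = unit_waveforms[name]
        # Overlap-add across the crossfade region, then append the rest of the unit.
        output[-xf:] = output[-xf:] * fade_out + unit[:xf] * fade_in
        output = np.concatenate([output, unit[xf:]])
    return output

# Toy usage with random 'recordings'; a real inventory holds studio-recorded speech.
rng = np.random.default_rng(0)
voice = {u: rng.standard_normal(4000) for u in ["h-e", "e-l", "l-o"]}
waveform = concatenate_units(voice, ["h-e", "e-l", "l-o"])
```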

  34. Concatenating synthesis • A concatenative synthesis is made from recorded pieces of speech (sound clips) that are then unitized and assembled into speech. • Depending on the length of the sound clips used, it becomes a diphone or a polyphone synthesis. • The latter, in a more developed version, is also called Unit Selection synthesis, where the synthesizer has access to both long and short segments of speech and the best segments for the actual context are chosen.

  35. Diphone • In phonetics, a diphone is an adjacent pair of phones. The term usually refers to a recording of the transition between two phones. • A phone is the actual pronunciation of a phoneme.

  36. Diphone and Polyphone Synthesis • Phone sequences capture co-articulation • That is, how combinations of phones sound

  37. Diphone and Polyphone Synthesis • Data Collection Methods • Collect data from a single (professional) speaker • Select text with maximal coverage (typically with greedy algorithm), or • Record minimal pairs in desired contexts (real words or nonsense)

  38. Reduce the number collected by • phonotactic constraints • collapsing in cases of no co-articulation • Cut speech at positions that minimize context contamination • Need single phones, diphones and sometimes triphones
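
The greedy, maximal-coverage text selection mentioned on slide 37 can be sketched as follows; the function names are invented, and character bigrams stand in for diphones in the toy usage at the end.

```python
def greedy_script_selection(candidate_sentences, diphones_of, target_coverage):
    """Greedily pick recording-script sentences that add the most uncovered diphones.

    candidate_sentences: sentences available to record.
    diphones_of:         function mapping a sentence to the set of diphones it contains.
    target_coverage:     set of diphones the recorded inventory must cover.
    """
    covered, script = set(), []

    def gain(sentence):
        return (diphones_of(sentence) & target_coverage) - covered

    while covered != target_coverage:
        best = max(candidate_sentences, key=lambda s: len(gain(s)))
        if not gain(best):      # remaining diphones never occur in the candidates
            break
        covered |= gain(best)
        script.append(best)
    return script, covered

# Toy usage: character bigrams stand in for diphones.
bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
sentences = ["a cat sat", "the dog ran", "see the cat"]
print(greedy_script_selection(sentences, bigrams, bigrams("the cat sat")))
```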

  39. Diphone • For a diphone synthesis the elements taken from the recorded speech are very small. • The speech may sound a bit monotonic. • Diphone synthesis doesn't work that well • in languages where there is a lot of inconsistency in the pronunciation rules (English, Swedish, etc.) • in special cases where letters are pronounced differently than in general. • Diphone synthesis works better for languages with very consistent pronunciation (Spanish, Finnish, etc.). • Another advantage is that the prosody, the intonation, can be described in much detail.

  40. Signal Processing for Concatenative Synthesis • Diphones recorded in one context must be generated in other contexts • Features are extracted from recorded units • Signal processing manipulates features to smooth boundaries where units are concatenated • Signal processing modifies signal via ‘interpolation’ • intonation • duration

  41. Unit selection • The greatest difference between a Unit selection voice and a diphone voice is the length of the speech segments used. • Entire words and phrases are stored in the unit database. This implies that the database for a Unit selection voice is many times bigger than for a diphone voice. • Thus, the memory consumption is huge while the CPU consumption is low.

  42. Unit Selection • The most important issue is still to get a natural and smooth prosody. • This is hard because the units contain both intonation and pronunciation, since entire phrases are used almost directly from the recorded data. • Since the first Unit selection voice was released over eight years ago, each new voice release has brought considerable improvement.
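
A compact sketch of the selection step behind such voices: a Viterbi-style dynamic program that picks one database unit per target so the summed target and join costs are minimal. The function and parameter names are invented, and real systems use far richer cost functions.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Choose one database unit per target, minimising total target + join cost.

    targets:     desired unit specifications in order (e.g. diphones plus prosody targets).
    candidates:  candidates[i] is the list of database units usable for targets[i].
    target_cost: target_cost(spec, unit) -> float, how well a unit matches its target.
    join_cost:   join_cost(unit_a, unit_b) -> float, how smoothly two units concatenate.
    """
    # Dynamic programming over the lattice of candidate units.
    best = [(target_cost(targets[0], u), [u]) for u in candidates[0]]
    for spec, cands in zip(targets[1:], candidates[1:]):
        new_best = []
        for u in cands:
            prev_cost, prev_path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda pair: pair[0],
            )
            new_best.append((prev_cost + target_cost(spec, u), prev_path + [u]))
        best = new_best
    return min(best, key=lambda pair: pair[0])   # (total cost, chosen unit sequence)
```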

  43. HMM synthesis • A fairly new technology is speech synthesis based on HMMs, a mathematical concept called Hidden Markov Models. • It is a statistical method where the text-to-speech system is based on a model that is not known beforehand but is refined by continuous training. • The technique consumes large CPU resources but very little memory. • This approach seems to give better prosody, without glitches, while still producing very natural-sounding, human-like speech.

  44. Recorded Prompts • The simplest (and most common) solution is to record prompts spoken by a (trained) human • Produces human quality voice • Limited by number of prompts that can be recorded • Can be extended by limited cut-and-paste or template filling
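
A trivial sketch of the template-filling idea, assuming a hypothetical inventory of prompt files; the file names and the template are made up for illustration.

```python
# Hypothetical file names for a small prompt inventory recorded by a voice talent.
PROMPT_CLIPS = {
    "you_have": "prompts/you_have.wav",
    "new_messages": "prompts/new_messages.wav",
    "one": "prompts/one.wav",
    "three": "prompts/three.wav",
}

def prompt_playlist(count_word: str):
    """Template filling: 'You have <count> new messages' from pre-recorded clips."""
    return [PROMPT_CLIPS["you_have"], PROMPT_CLIPS[count_word], PROMPT_CLIPS["new_messages"]]

print(prompt_playlist("three"))   # the three clips would simply be played back to back
```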

  45. Articulatory synthesis • In an articulatory synthesis, models of the human articulators (tongue, lips, teeth, jaw) and the vocal folds are used to simulate how an airflow passes through them and to calculate what the resulting sound will be like. It is a great challenge to find good mathematical models, and therefore the development of articulatory synthesis is still in research. The technique is very computation-intensive but its memory requirements are almost nothing.

  46. Sable • SABLE is an emerging standard extending SGML • http://www.cstr.ed.ac.uk/projects/sable.html • marks: emphasis(#), break(#), pitch(base/mid/range,#), rate(#), volume(#), semanticMode(date/time/email/URL/...), speaker(age,sex) • Implemented in Festival Synthesizer (free for research, etc.): http://www.cstr.ed.ac.uk/projects/festival.html

  47. Assistive Applications of speech synthesis • Systems that provide voice synthesis output for blind users are generally referred to as screen readers. Brown (1989) [Cook and Hussey 95] has identified the capabilities an ideal voice output system should have.

  48. Key Features • 1: Good audio environment. No background noise; good speakers, earphones, etc. • 2: Good intelligibility. The output should be intelligible. Studies have shown this to be paramount. Studies have also shown that naturalness of the voice is also desirable, particularly for female users of speech synthesis; very synthetic-sounding voices are not as acceptable. • 3: The screen reader should work with all commercially available software, i.e. the blind user should have access to the same software the sighted user has. • This includes access to both text and graphics. • 4: The adapted output system should work with a variety of speech synthesizer systems.

  49. User Interface • The user interface should have the following characteristics. • 1: Spoken letters often sound the same, e.g. b and v. To reduce ambiguity the synthesizer should have access to the aviator's alphabet (Alpha, Bravo, Charlie, etc.). • 2: To match the capabilities of normal vision, the screen reader should be able to read forward or backwards, read punctuation, highlights and other syntactical conventions. • 3: A sighted reader often scans whole passages to get context or a sense of the text. The screen reader should be able to read complete sentences and passages. • 4: Computer programs often generate prompts and output messages. The screen reader should be able to read these.

  50. Operational Characteristics • The following are desirable operational characteristics of the screen reader. • 1: It should be easy to use and maintain. It shouldn't require huge amounts of training. Screen readers are often complex and difficult to master. • 2: Screen readers have two modes, application and review. Review is where the reader is basically reading. Application is where the functionality of the application can be accessed. For example, a document in a word processor could be read in review mode but edited in application mode. Ideally the two modes should be merged. If a mistake were noted while a document is being read, it would be beneficial to change it there and then and not have to switch out of review into application mode.
