
Stages in “text-to-speech” synthesis


Presentation Transcript


  1. EE2F1 Multimedia (1): Speech & Audio Technology, Lecture 7: Speech Synthesis (1). Martin Russell, Electronic, Electrical & Computer Engineering, School of Engineering, The University of Birmingham

  2. Stages in “text-to-speech” synthesis • Text normalisation • Text-to-phone conversion • Linguistic analysis • Semantic analysis • Conversion of phone-sequence to sequence of synthesiser control parameters • Synthesis of acoustic speech signal
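
As a rough illustration of how these stages chain together, here is a toy pipeline sketched in Python. Every function name, the two-entry pronunciation dictionary and the digit-expansion rule are placeholders invented for illustration; the later stages (linguistic and semantic analysis, control-parameter generation, waveform synthesis) are the subject of the rest of the lecture and are omitted.

import re

def normalise_text(text: str) -> str:
    """Text normalisation: expand non-standard tokens such as digits into words."""
    digits = {"2": "two", "7": "seven"}          # illustrative subset only
    return re.sub(r"\d", lambda m: digits.get(m.group(), m.group()), text).lower()

def text_to_phones(words: str) -> list:
    """Text-to-phone conversion via a (tiny, illustrative) pronunciation lexicon."""
    lexicon = {"platform": ["p", "l", "a", "t", "f", "oo", "m"],
               "two": ["t", "uu"]}
    return [p for w in words.split() for p in lexicon.get(w, [])]

def synthesise(text: str) -> list:
    """Chain the first two stages; the remaining stages are left out here."""
    return text_to_phones(normalise_text(text))

print(synthesise("platform 2"))   # ['p', 'l', 'a', 't', 'f', 'oo', 'm', 't', 'uu']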

  3. Approaches to synthesis • Final stage is to convert ‘phone’ or word sequence into a sequence of synthesiser control parameters • Two main approaches: • Waveform concatenation • Model-based speech synthesis (includes articulatory synthesis)

  4. Waveform Concatenation • Join together, or concatenate, stored sections of real speech • Sections may correspond to whole words, or sub-word units • Early systems based on whole words • e.g., the Speaking Clock – UK telephone system, 1936 • Storage and access are major issues • Speech quality requires data rates of 16,000 to 32,000 bits per second (bps)
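
As a minimal sketch of whole-word concatenation, the Python fragment below joins stored word recordings end to end with a short silence between them. The directory layout ("words/<word>.wav"), the sampling rate and the silence length are assumptions for illustration; only NumPy and the standard-library wave module are used.

import wave
import numpy as np

def load_word(path: str) -> np.ndarray:
    """Read a mono 16-bit PCM WAV file into an array of samples."""
    with wave.open(path, "rb") as w:
        return np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

def concatenate_words(words, silence_ms=50, sample_rate=16000) -> np.ndarray:
    """Join stored word recordings with a short silence between them."""
    gap = np.zeros(int(sample_rate * silence_ms / 1000), dtype=np.int16)
    pieces = []
    for word in words:
        pieces.append(load_word("words/%s.wav" % word))   # hypothetical file layout
        pieces.append(gap)
    return np.concatenate(pieces)

# e.g. a speaking-clock style announcement built from stored words
signal = concatenate_words(["the", "time", "is", "nine", "twenty", "five"])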

  5. 1936 “Speaking Clock” From John Holmes, “Speech synthesis and recognition”, courtesy of British Telecommunications plc

  6. Whole word concatenation (1) • Whole word concatenation can give good quality speech (as in the speaking clock), but has many disadvantages: • pronunciation of a word is influenced by neighbouring words (co-articulation) • prosodic effects like intonation, rate-of-speaking and amplitude are also influenced by context • interpretation of a sentence is strongly influenced by which individual word is emphasised (“Mary didn’t buy Sam a pizza”)

  7. Whole word concatenation (2) • Disadvantages (continued): • words must be extracted from the right sort of sentence • most suitable for applications where structure of the sentence is constrained, e.g., announcements, lists… • may need to record more than one example of each word, e.g., raised pitch at end of a list, pre-pause lengthening…

  8. Example – original recording The next train to arrive at platform 2 will call at Bromsgrove, Droitwich Spa, Worcester Foregate Street and Malvern Link

  9. Example – trivial concatenative synthesis The next train to arrive at platform 2 will call at Malvern Link, Worcester Foregate Street, Droitwich Spa and Bromsgrove

  10. Example repeated • Original recording • ‘Concatenative synthesis’

  11. Whole word concatenation (3) • Disadvantages (continued): • to add new words the original speaker must be found, or all words must be re-recorded • even with specialist facilities, selection and extraction of suitable words is labour intensive and time consuming

  12. Sub-word concatenation (1) • Limitations of word-based methods suggest concatenative speech synthesis based on sub-word units • Need well-annotated, phonetically-balanced corpus of speech recordings • Extract fragments from waveforms in the corpus which represent ‘basic units’ of speech, and can be concatenated and used for speech synthesis
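
A sketch of how such fragments might be cut from an annotated corpus is given below. The label-file format (one "phone start-time end-time" triple per line, times in seconds) and the idea of keeping several examples per phone are assumptions for illustration, not a standard.

from collections import defaultdict
import numpy as np

def load_alignments(label_path: str):
    """Parse 'phone start end' lines into (label, start_s, end_s) triples."""
    entries = []
    with open(label_path) as f:
        for line in f:
            phone, start, end = line.split()
            entries.append((phone, float(start), float(end)))
    return entries

def build_unit_inventory(waveform: np.ndarray, sample_rate: int, alignments):
    """Cut each labelled phone out of the waveform and index it by its label."""
    units = defaultdict(list)
    for phone, start, end in alignments:
        fragment = waveform[int(start * sample_rate):int(end * sample_rate)]
        units[phone].append(fragment)    # keep several examples of each unit
    return units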

  13. Sub-word concatenation (2) • Difficulties include: • identification of a set of suitable units • careful annotation of large amounts of data • derivation of a good method for concatenation

  14. Sub-word concatenation (3) • Sub-word concatenation overcomes difficulties with adding new words to the application vocabulary, • But other problems are exacerbated. • In particular, coarticulation and pitch continuity problems occur within, as well as between, words. • Necessary to use several examples of each phone (corresponding roughly to different allophones).

  15. Sub-word concatenation (4) • Natural to select fragments that characterise the phone target values, but modelling transitions between these targets is a significant problem

  16. Example: sub-word concatenation • “stack” (original recording) • “task” (sub-word concatenative synthesis)

  17. Transitional units (1) • Central regions of many speech sounds are approximately stationary and less susceptible to coarticulation effects. • Hence select fragments which characterise transitions between phones, rather than phone targets. • e.g., diphone - transition between two phones.
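
A sketch of cutting diphone units from phone-aligned speech is shown below: each unit runs from the mid-point of one phone to the mid-point of the next, so that the concatenation points fall in the relatively stationary phone centres rather than in the transitions. The (label, start, end) alignment triples follow the same assumed format as in the earlier sub-word sketch.

import numpy as np

def cut_diphones(waveform: np.ndarray, sample_rate: int, alignments):
    """Return a dict mapping 'a-b' diphone names to waveform fragments."""
    diphones = {}
    for (p1, s1, e1), (p2, s2, e2) in zip(alignments, alignments[1:]):
        mid1 = int((s1 + e1) / 2 * sample_rate)   # centre of the first phone
        mid2 = int((s2 + e2) / 2 * sample_rate)   # centre of the second phone
        diphones["%s-%s" % (p1, p2)] = waveform[mid1:mid2]
    return diphones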

  18. Transitional units (2) • There are contextually-induced differences between instantiations of the central region of a phone, which cause discontinuities if they are not attended to. • Possible solutions are: • use several different examples of each diphone • store short transition regions, and • interpolate between end values

  19. Transitional units (3) • Coping with coarticulation effects by modelling transitions: • (a) using multiple examples to cope with variation in the instantiation of the phone centres, and • (b) interpolating between short transition regions (see the sketch below)
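
A minimal sketch of the interpolation idea: bridge the end value of one stored region and the start value of the next with a short linear ramp, so there is no step discontinuity at the join. The bridge length here is an illustrative choice, and the same idea applies equally to a parameter track (e.g. a formant frequency contour) or to a waveform.

import numpy as np

def join_with_interpolation(a: np.ndarray, b: np.ndarray, bridge: int = 80) -> np.ndarray:
    """Concatenate a and b with a linear ramp between their end values."""
    ramp = np.linspace(float(a[-1]), float(b[0]), bridge)
    return np.concatenate([a, ramp.astype(a.dtype), b])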

  20. More on prosody • Discontinuity in the fundamental frequency is exacerbated for sub-word methods. • Can use the source-filter model to separate the excitation signal from the vocal-tract shape. • Vocal-tract shape descriptions can then be concatenated and an appropriately smooth fundamental frequency pattern can be added separately.
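
A sketch of adding the fundamental-frequency (F0) contour separately, as described above: sparse F0 targets (the example values here are invented) are interpolated onto a regular frame grid to give one smooth contour, instead of inheriting a discontinuous F0 from the individual recorded fragments.

import numpy as np

def smooth_f0_contour(target_times_s, target_f0_hz, duration_s, frame_rate_hz=100.0):
    """Linearly interpolate sparse F0 targets onto a regular frame grid."""
    frame_times = np.arange(0.0, duration_s, 1.0 / frame_rate_hz)
    return np.interp(frame_times, target_times_s, target_f0_hz)

# e.g. a gently falling declarative contour over a two-second utterance
f0 = smooth_f0_contour([0.0, 1.0, 2.0], [130.0, 110.0, 90.0], duration_s=2.0)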

  21. PSOLA: Pitch Synchronous Overlap and Add • PSOLA (Charpentier, 1986) • Most successful current approach to concatenative synthesis • In PSOLA, the end regions of windowed waveform samples are overlapped pitch-synchronously and added • BT’s Laureate is an example
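
Below is a sketch of the core overlap-add step of TD-PSOLA, under these assumptions: the pitch epochs (one sample index per glottal cycle) have already been detected, each short-term frame is roughly two pitch periods long, Hann-windowed and centred on an epoch, and the synthesis epoch positions are supplied. This illustrates the principle only; it is not the Laureate implementation.

import numpy as np

def extract_frames(x: np.ndarray, epochs) -> list:
    """Cut a Hann-windowed, two-pitch-period frame centred on each epoch."""
    frames = []
    for i in range(1, len(epochs) - 1):
        frame = x[epochs[i - 1]:epochs[i + 1]].astype(float)
        frames.append(frame * np.hanning(len(frame)))
    return frames

def overlap_add(frames, synth_epochs, length: int) -> np.ndarray:
    """Place each windowed frame at its synthesis epoch and sum the overlaps."""
    y = np.zeros(length)
    for frame, centre in zip(frames, synth_epochs):
        start = centre - len(frame) // 2
        lo, hi = max(start, 0), min(start + len(frame), length)
        y[lo:hi] += frame[lo - start:hi - start]
    return y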

  22. PSOLA From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001

  23. Speech modification using PSOLA • In addition to speech synthesis from segments, there are two other common applications of PSOLA: • Pitch modification • Duration modification
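
Both modifications fall out of the same overlap-add machinery sketched after the PSOLA slide above: pitch is raised or lowered by re-spacing the synthesis epochs, and duration is changed by repeating or omitting frames. The sketch below assumes the extract_frames and overlap_add helpers from that earlier sketch; the scale factors are illustrative.

import numpy as np

def psola_modify(x: np.ndarray, epochs: np.ndarray,
                 pitch_scale: float = 1.0, time_scale: float = 1.0) -> np.ndarray:
    """pitch_scale > 1 raises the pitch; time_scale > 1 lengthens the utterance."""
    frames = extract_frames(x, epochs)

    # Duration change: select (possibly repeated or skipped) analysis frames.
    n_out = int(round(len(frames) * time_scale))
    picks = np.minimum((np.arange(n_out) / time_scale).astype(int), len(frames) - 1)

    # Pitch change: shrink or stretch the spacing between synthesis epochs.
    periods = np.diff(epochs)[:-1] / pitch_scale        # one spacing per frame
    spacing = periods[picks]
    synth_epochs = (epochs[1] + np.cumsum(spacing)).astype(int)

    out_len = int(synth_epochs[-1]) + len(frames[picks[-1]])
    return overlap_add([frames[i] for i in picks], synth_epochs, out_len)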

  24. Increasing pitch using PSOLA From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001

  25. Decreasing pitch using PSOLA From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001

  26. The ‘Laureate’ System • The BT “Laureate” system is a modern, PSOLA-based synthesiser • See Edington et al. (1996a), also look at the web site • Demonstration

  27. PSOLA strengths and weaknesses • Strengths • Produces good quality speech • Weaknesses • Large, annotated corpus needed for each ‘voice’ • Requires accurate pitch peak detection • Inflexible – new voices can only be produced by recording and labelling significant speech corpora from new speakers (automatic annotation of corpora, using techniques from speech recognition, can reduce the labelling effort)

  28. Summary • Concatenative speech synthesis • Whole word concatenation • Importance of prosody • Sub-word concatenation • Choice of sub-word units • PSOLA
