
Creating a Voice for Festival

Creating a Voice for Festival. Presentation by Matthew Hood Supervisors: S. Bangay A. Lobb . Voice: cmu_uk_rab_diphone. Presentation Overview. About the project Festival About Text to Speech 3 layer approach Waveform Generation Languages, phones and diphones Making a voice


Presentation Transcript


  1. Creating a Voice for Festival Presentation by Matthew Hood Supervisors: S. Bangay A. Lobb Voice: cmu_uk_rab_diphone

  2. Presentation Overview • About the project • Festival • About Text to Speech • 3 layer approach • Waveform Generation • Languages, phones and diphones • Making a voice • Recording Diphones • Labelling • Results

  3. About the Project • Text to speech programs have been around for many years without much excitement. • Many new applications have arisen, sparking new interest. • One of the factors limiting its usefulness is the limited number of voices (fewer than 10?) • Creating a voice is a long, tedious process. But a greater problem is the lack of documentation. • This project aims to give a comprehensive overview of how to make a voice in Festival, pointing out all the pitfalls ahead of time.

  4. Festival • Festival is an open source TTS system developed at the University of Edinburgh in the late 90s. • “It offers a free, portable, language independent, run-time speech synthesis engine for various platforms under various APIs.” [Black et al] • Supported by the FestVox toolkit. • Documented in “Building Synthetic Voices” [Black et al]

  5. General Text to Speech • Text Analysis: words and utterances are identified. • Linguistic Analysis: words are analysed in context and a pronunciation is generated, e.g. 1990. • Waveform Generation: utterances are turned into sound and the words are “spoken”. Due to abstraction from the previous layers, this is the only layer where the voice is used.

  6. Waveform Generation • Festival is a concatenative synthesis system. • This means sound clips are joined together to generate speech, e.g. talking clocks. Recorded sound set: “The time is”; “past”; “o’clock”; numbers etc. Generated output: “The time is” – “half” – “past” – “three”. Voice: cmu_us_kal_diphone
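The talking-clock idea can be sketched in a few lines of Python. This is only an illustration, not Festival's implementation: the `CLIPS` inventory and the `concatenate` helper are hypothetical names, and short integer lists stand in for recorded waveforms.

```python
# Hypothetical clip inventory: each recorded phrase maps to its audio samples.
# Short integer lists stand in for real recorded waveforms.
CLIPS = {
    "The time is": [1, 2, 3],
    "half": [4, 5],
    "past": [6],
    "three": [7, 8],
    "o'clock": [9],
}


def concatenate(phrases, clips=CLIPS):
    """Join the recorded clip for each phrase, in order, into one waveform."""
    waveform = []
    for phrase in phrases:
        # A KeyError here means the phrase was never recorded.
        waveform.extend(clips[phrase])
    return waveform


output = concatenate(["The time is", "half", "past", "three"])
print(output)
```

The limitation is visible in the lookup: anything outside the recorded set cannot be spoken, which is why a general system needs smaller units.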

  7. Waveform Generation • For a more general system it is not feasible to record everything that could be said. • Speech needs to be broken down into smaller units. • A phone is a single phonetic sound that is generated by a human when speaking. Examples: eh (get, feather); s (sit, mass); zh (vision, casual).

  8. Languages • A language is defined by its phoneme set. • A phoneme set is a collection of every phonetic sound used in any word in the language (including silence). • US English phoneset used in Festival has 44 phones. • BUT it is not enough to record every phone in the phoneset.

  9. Diphones • We do not always pronounce a phone the same way. • Its pronunciation depends on its neighbouring phones. This is known as the co-articulatory effect. • Festival relies on the simplifying assumption that the co-articulatory effect does not extend across more than a pair of phones. • Such pairs of adjacent phones are known as diphones.

  10. Diphones • By combining recorded diphones, we can now “say” any word in the language. • E.g. Jack (jh-ae-k) is built from the diphones __-jh, jh-ae, ae-k and k-__.
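A minimal sketch of how a phone sequence breaks down into diphones, assuming the word is padded with a silence marker on each side (written `__` as in the slide). The function name `to_diphones` is made up for illustration and is not part of Festival.

```python
def to_diphones(phones, silence="__"):
    """Return the diphones needed to say a phone sequence.

    The sequence is padded with silence at both ends, then every pair of
    adjacent phones becomes one diphone.
    """
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# "Jack" is the phone sequence jh-ae-k.
print(to_diphones(["jh", "ae", "k"]))
```

A word of n phones always needs n + 1 diphones under this padding.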

  11. Recording Diphones • Because of the co-articulatory effect, it is nearly impossible to pronounce a diphone accurately on its own. • Using made-up words is preferable to using real words. E.g. us_006 “pau t aa k aa k aa pau” gives “k-aa” and “aa-k”; us_603 “pau t aa t ey ah t aa pau” gives “ey-ah”.

  12. Recording Diphones • In theory the number of diphones needed to speak a language is the number of phones squared. • But we don’t actually use every combination. • The standard US diphone list used by Festival contains 1396 diphones. • It is often worth extending this list to take into account strong accents or common foreign words.
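As a quick check of the arithmetic, the sketch below uses only the two figures quoted in these slides: 44 phones in the US English phoneset and 1396 diphones in the standard list. The variable names are illustrative.

```python
PHONES = 44     # US English phoneset size quoted in the Languages slide
LISTED = 1396   # size of the standard US diphone list quoted above

theoretical = PHONES ** 2          # every ordered pair of phones
fraction = LISTED / theoretical    # share of pairs that are actually recorded

print(theoretical)        # 1936
print(f"{fraction:.1%}")  # 72.1%
```

On these numbers, roughly a quarter of the theoretical pairs are left out of the standard list.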

  13. Recording Diphones • Because pronouncing the words can be a bit tricky, especially the first few times you try, FestVox provides a prompting tool.

  14. Recording Environment • The better the recording the better the voice. • With a decent sound card it is possible to record straight onto the PC. • Background noise must be kept to a minimum. • Takes approximately 1.5 hours to record all diphones. • Environment must be repeatable.

  15. Labelling • Labelling is the hardest and one of the most important parts of creating a voice. • A label file consists of a series of boundary times. • Emu label is an open source program that graphically shows where in the wave file the phones are marked. • Part of the Emu Speech Tools available on SourceForge.

  16. Hand Labelling Us-0603 “ey-ah” • Displays phones, frequency and waveform. • Sound extracted from mid point of labels. • Worth moving further into the phone when recording eh-__.
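The midpoint idea can be sketched as follows. This is an assumption-laden illustration, not the FestVox code: it assumes each label entry is an `(end_time, phone)` pair with the first phone starting at 0.0, and that a diphone runs from the middle of its first phone to the middle of its second. The boundary times are invented for the example and the function names are hypothetical.

```python
def phone_midpoints(labels):
    """Return (phone, midpoint) pairs from (end_time, phone) label entries."""
    midpoints = []
    start = 0.0  # assumed: the first phone starts at the beginning of the file
    for end, phone in labels:
        midpoints.append((phone, (start + end) / 2))
        start = end
    return midpoints


def diphone_span(labels, left, right):
    """Return (start, end) times of the first left-right diphone in the labels."""
    mids = phone_midpoints(labels)
    for (p1, m1), (p2, m2) in zip(mids, mids[1:]):
        if p1 == left and p2 == right:
            return m1, m2
    raise ValueError(f"diphone {left}-{right} not found")


# Invented boundary times (seconds) for the prompt "pau t aa t ey ah t aa pau".
labels = [
    (0.5, "pau"), (0.625, "t"), (0.875, "aa"), (1.0, "t"), (1.25, "ey"),
    (1.5, "ah"), (1.625, "t"), (1.875, "aa"), (2.5, "pau"),
]
print(diphone_span(labels, "ey", "ah"))
```

Because the cut points are derived from the boundaries, a misplaced label shifts the extracted sound directly, which is why labelling accuracy matters so much.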

  17. Auto Labelling - results • FestVox provides an auto labeller. • 1.6% failure rate. • 8 – 15% error rate. • 70% useable diphones. (400+ hand correction)

  18. Auto labeller • Test, test and retest. • Created splittest.pl • Hand label any problem phones. • Remove DB markers.

  19. Finishing voice • Once happy with labels. • Optional pitchmark extraction. • Volume levelling. • Load the voice into Festival and test with actual speech. • Build final voice database. • Create symbolic link.

  20. What I have learnt & achieved • Learnt a lot about speech and speech synthesis. • Learnt a lot about Linux and sound editing. • Created a number of variations of ru_us_matt_diphone, used to test different labelling methods, how recordings affect results etc. • Final paper giving step by step guide and helpful hints. • There is much room for future work, including voice adaptation. • Am sick of the sound of my own voice. Voice: ru_us_matt_diphone
