Data-driven approach to rapid prototyping Xhosa speech synthesis Albert Visagie Justus Roux Centre for Language and Speech Technology Stellenbosch University South Africa
Introduction • Japan-South African Intergovernmental Science and Technology Cooperation Programme. • Goals: • Understand what is needed from a linguistic and technology standpoint. • Build a text-analysis front-end. • Experimental platform.
Outline • Xhosa: • orthography, • phonetics, • tone • Approach: • Text analysis, • HTS.
Xhosa • Xhosa is spoken in South Africa, by about 8 million people. • One of the official languages of South Africa • Writing system is relatively young, and based on English letters. • Many dialects. • Borrowed clicks from Khoisan.
Xhosa: Orthography Agglutinative language. Nouns: • 15 classes (including plural & singular). • Nouns affixed for diminutive. Verbs: • Verbs affixed according to subject, tense, negative etc. Examples: • teach: -fund- • preacher (teacher): umfundisi = u + m(u) + fund + is + i • small preacher: umfundisana = u + m(u) + fund + is + ana • He/she will teach them: uzakubafundisa = u + za + ku + ba + fund + is + a
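The decompositions on this slide can be checked mechanically. The sketch below joins each morpheme sequence back into its surface form. It assumes that the parenthesised "(u)" in "m(u)" marks a vowel that is elided in the written word; that reading, and the function name `compose`, are illustrative assumptions and not part of the described system.

```python
import re

def compose(morphemes):
    """Join morphemes into a surface form.

    Assumption: material in parentheses, e.g. the "(u)" in "m(u)",
    is elided in the written word and is therefore dropped.
    """
    return "".join(re.sub(r"\([^)]*\)", "", m) for m in morphemes)

# Decompositions copied from the slide.
EXAMPLES = {
    "umfundisi": ["u", "m(u)", "fund", "is", "i"],
    "umfundisana": ["u", "m(u)", "fund", "is", "ana"],
    "uzakubafundisa": ["u", "za", "ku", "ba", "fund", "is", "a"],
}

for surface, parts in EXAMPLES.items():
    print(surface, "<-", " + ".join(parts), "ok" if compose(parts) == surface else "MISMATCH")
```

Going the other way (from surface form to morphemes) is the hard direction, because several segmentations can be valid for the same word.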
Xhosa: Phonetics Consonants: • Implosive /b/ • Ejectives and aspirated versions of stops. • 15 Clicks Vowels • Five basic vowels, including long versions.
Xhosa: Tone • According to the literature, it’s a tone language. • High, Low, and Falling tones. • Recent dictionary: has tone marked for root morphemes, rules can be constructed to predict movement under morphological composition. • Recent work: • Downing, Roux, argue for accent. • Kuun: Statistical experiment suggests highly regular structure. • Observed regularity on pitch rises and duration increase gives a simple method to use in a first prototype.
Approach Focus on language-dependent components: • Build the text analyser, • use an existing synthesiser. Choice: HTS 2.0 • Model-driven, trainable synthesiser. • Contains language-independent F0 and duration models. • Good use of synthesis database by predicting spectrum, F0 and segment duration separately.
HTS: Symbolic Features Each segment of audio (HMM state) is labelled according to its linguistic context Examples: • Phonetic context: labels of preceding and following phones. • Parts-of-speech. • Stress or canonical tone. • Counting.
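To make the "phonetic context" feature concrete, here is a minimal sketch that attaches the two preceding and two following phones to each phone in a sequence. The `ll^l-c+r=rr` delimiter layout, the padding symbol `x`, and the function name `context_labels` are assumptions chosen for illustration; the full label format used with HTS carries many more fields (part of speech, tone, counts) that are omitted here.

```python
def context_labels(phones, pad="x"):
    """Return one quinphone-style label per phone: ll^l-c+r=rr.

    `pad` fills the missing context at the utterance edges
    (a hypothetical choice of symbol).
    """
    padded = [pad, pad] + list(phones) + [pad, pad]
    labels = []
    for i in range(2, len(padded) - 2):
        ll, l, c, r, rr = padded[i - 2 : i + 3]
        labels.append(f"{ll}^{l}-{c}+{r}={rr}")
    return labels

# "lali" from the demo sentence, one symbol per letter.
for label in context_labels(["l", "a", "l", "i"]):
    print(label)
```

Because every label encodes its neighbours, two occurrences of the same phone in different contexts get different labels, which is what lets the trained models cluster them by context.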
Text Analyser Components Components: • Orthographic to phonetic • Morphological analysis • Parts-of-speech • Canonical tone marks
Orthographic to Phonetic • The orthography is very young, and highly consistent with the pronunciation. • Hand-written letter-to-sound rewrite rules. • Lexicon for loan words.
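Hand-written letter-to-sound rules with a loan-word lexicon can be sketched as a longest-match rewrite with a lexicon lookup in front. The rule table below is a small hypothetical subset, and the phone symbols are placeholder ASCII labels; neither is the rule set or phone inventory used in the project.

```python
# Hypothetical subset of grapheme-to-phone rules; the right-hand sides
# are placeholder ASCII labels, not the project's phone set.
RULES = {
    "ph": "p_h", "th": "t_h", "kh": "k_h",  # aspirated stops
    "hl": "K",                              # lateral fricative
    "nq": "n!",                             # nasal click
    "c": "|", "q": "!", "x": "||",          # basic clicks
}
MAX_LEN = max(len(g) for g in RULES)

def letter_to_sound(word, lexicon=None):
    """Convert a word to phones: lexicon first, then longest-match rules."""
    word = word.lower()
    if lexicon and word in lexicon:
        return list(lexicon[word])          # loan-word exception
    phones, i = [], 0
    while i < len(word):
        for n in range(MAX_LEN, 0, -1):     # try the longest grapheme first
            g = word[i : i + n]
            if g in RULES:
                phones.append(RULES[g])
                i += len(g)
                break
        else:
            phones.append(word[i])          # default: letter maps to itself
            i += 1
    return phones

print(letter_to_sound("ukunqwenela"))
```

Trying the longest grapheme first matters for digraphs such as "nq": a single-letter pass would split it into "n" followed by a plain click.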
Morphology • Specially bootstrapped from a Zulu version for this project. • Requires a lexicon of root morphemes. • Works with isolated words. • Ambiguous! • Ideal: root morpheme boundaries, affix types, POS tagger for disambiguation. • Implemented: None
Parts-of-Speech • Morphological analysis. • Ideal: POS tagger. • Implemented: Exhaustive lists of closed sets – pronouns, conjunctions, prepositions, etc.
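The implemented closed-set approach amounts to a table lookup with a fallback tag for everything else. The sketch below assumes a tiny word list of what are believed to be common Xhosa pronouns and conjunctions; the entries, the tag names, and the `open` fallback are placeholders, not the project's lists.

```python
# Placeholder closed-class lists (assumed entries, far from exhaustive).
CLOSED_CLASS = {
    "pronoun": {"mna", "wena", "yena", "thina"},
    "conjunction": {"kodwa", "okanye", "kuba"},
}

# Invert to a word -> tag table for constant-time lookup.
WORD_TO_TAG = {w: tag for tag, words in CLOSED_CLASS.items() for w in words}

def tag_tokens(tokens, default="open"):
    """Tag each token with its closed-class label, or `default` otherwise."""
    return [(t, WORD_TO_TAG.get(t.lower(), default)) for t in tokens]

print(tag_tokens(["kodwa", "amadoda"]))
```

Anything not in the lists is treated as open-class, so nouns and verbs stay undifferentiated until a morphological analyser or a POS tagger is available.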
Tone • A printed dictionary with canonical tone markings for root morphemes is available. • Rules can be constructed to determine movement of at least High tones, under morphological composition. • Highly regular structure: 3rd-from-last syllable starts high pitch excursion, 2nd-from-last syllable lengthened. • Ideal: Exhaustive specification of set tones • Implemented: Word-level syllable counts (3-1, 2-2, 1-3)
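The word-level syllable counts can be sketched as follows. The code assumes that "3-1, 2-2, 1-3" pairs each syllable's position counted from the end of the word with its position counted from the start, and it uses a deliberately naive syllabifier that closes a syllable at every vowel letter (so syllabic nasals and long vowels are not handled). The function names and the two boolean flags are illustrative.

```python
VOWELS = set("aeiou")

def syllabify(word):
    """Naive open-syllable split: each syllable ends at a vowel letter."""
    syllables, current = [], ""
    for ch in word.lower():
        current += ch
        if ch in VOWELS:
            syllables.append(current)
            current = ""
    if current:                      # trailing consonants, if any
        if syllables:
            syllables[-1] += current
        else:
            syllables.append(current)
    return syllables

def position_features(word):
    """Per-syllable counts from the end and from the start of the word."""
    sylls = syllabify(word)
    n = len(sylls)
    feats = []
    for i, s in enumerate(sylls):
        from_end = n - i
        feats.append({
            "syllable": s,
            "from_end": from_end,
            "from_start": i + 1,
            "pitch_rise_start": from_end == 3,  # 3rd-from-last syllable
            "lengthened": from_end == 2,        # 2nd-from-last syllable
        })
    return feats

for f in position_features("ukuba"):
    print(f["syllable"], f"{f['from_end']}-{f['from_start']}")
```

For the three-syllable word "ukuba" this yields the pairs 3-1, 2-2, 1-3. With only these counts as features, the trained models can pick up the regular pitch rise and lengthening without any explicit tone marks.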
Tests • Basic intelligibility test: Listeners asked to transcribe what they hear. • Incomplete phrases. • Two versions of the question set, and natural utterances (recoded). • Mother-tongue and second-language speakers. • Impressions: • “He’s from the townships.” • “That’s perfect, there’s nothing wrong with that.” • Also frowns and repeats.
Next Steps • Comprehension test? • Impressions. • Baseline comparative/preference test. • Improvements • Question phrases. • Information from morphological analysis. • Canonical tone markings. • Zulu
Conclusion • The system worked very well, considering the bare minimum of knowledge currently incorporated. • Data-driven approach with HTS is well suited to bootstrapping a new language. • An experimental platform is now in place.
Demos “Ubangele amadoda amaninzi kule lali,” • Natural: • Synthesised: “waqalisa ukunqwenela ukuba nomzi.” • Natural: • Synthesised: Click song: