S. P. Kishore*, Rohit Kumar** and Rajeev Sangal* * Language Technologies Research Center International Institute of Inf

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT S. P. Kishore*, Rohit Kumar** and Rajeev Sangal* * Language Technologies Research Center International Institute of Information Technology Hyderabad ** Punjab Engineering College Chandigarh

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT ORGANIZATION OF THE TALK • Text to Speech (TTS) Systems: Introduction • Components of TTS system • Concatenative Synthesis: Data Driven Approach • Issues in Data Driven Synthesis Approaches • Unit Selection Algorithm based on Prosodic Features • Development of Indian Language Text to Speech Systems • Demonstrations • Conclusions

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT TEXT TO SPEECH SYSTEMS : INTRODUCTION A Text to Speech system converts an arbitrary text to speech wave form.

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT LANGUAGE PROCESSING MODULES PREPROCESSOR: • Text is processed to expand abbreviations, address, numbers etc. , and for sentence end detection - . , ? ! :) : ; • CONTEXTUAL ANALYZER: • Analyze the part of speech of words - In English, words like record, permit, present are produced with stress on the first syllable if a noun and on the second if a verb. • Analyze the structure of the sentence - declarative, interrogative. Useful to get appropriate prosodic structure. PHRASE ANALYSIS: • Decompose the text into phrases and clauses – provides useful hints to the synthesis engine to give prosodic pauses

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT LANGUAGE PROCESSING MODULES LETTER TO SOUND (LTS) RULES: • Pronunciation of words may differ from spelling – for example in English: known, write wrong etc. • LTS rules provides phonetic transcription of the input text. • Use pronunciation dictionaries for English like languages. COARTICULATION AND PROSODIC GENERATOR: • Incorporate co-articulation and prosodic knowledge such as intonation and duration

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT SYNTHESIS STRATEGIES • ARTICULATORY BASED SYNTHESIS: • Simplified models of articulators or shapes of the observed vocal tract are devised • Rules are specified to control the position of the articulation • Difficulty lies in modeling of the motion of the articulators PARAMETER BASED SYNTHESIS: • Speech segments are parameterized in terms of format frequencies or linear prediction coefficients • Rules are formed to manipulate the parameters of the speech unit to manifest co-articulation, intonation and duration • Several hundred precisely crafted rules may be required

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT SYNTHESIS STRATEGIES • CONCATENATIVE BASED SYNTHESIS: • Stored speech segments are concatenated for synthesis • Speech segments are also refer to as Units • Unit can be phrase, word, syllable, diphone or phone. • A longer unit such as phrase or word may have wide range of prosodical variations depending on the context. For example: 1) He cleaned it. 2) It has to be cleaned. • A smaller unit such as phone may not have the required co-articulation. For example a mere concatenation of isolated sound “c” “l” “e” “a” “n” does not result in “clean” as spoken in continuous speech.

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT SYNTHESIS STRATEGIES • CONCATENATIVE BASED SYNTHESIS: • Sub-word units such as diphone, syllable are considered to be suitable for concatenation. • Prosodic Variations: • Intonation and duration could be acquired and incorporated in the form of rules • Store multiple realizations of units with differing prosody

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT DATA DRIVEN SYNTHESIS APPROACH

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT ISSUES IN DATA DRIVEN APPROACHES • Choice of unit • Basic unit could be sub-word unit • It should be possible to select variable length units • Characteristics of the speech database • Balanced in terms of coverage and prosodic diversity of all the units • Duration of the speech database? • Criteria to select a unit • Unit is selected depending on how well it matches with the input specification and on how well it matches with the other units in the sequence

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT EARLIER WORKS ON UNIT SELECTION ALGORITHM Hunt and Black (1996) UNIT FEATURES • Phonemic information such as labels of neighboring units and position in phrases • Prosodic features such as energy, duration and pitch FOR SYNTHESIS • Prosodic features are PREDICTED for target units • A particular realization of a unit is selected depending on how well it minimizes the cost function of unit distortion and discontinuity measure

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT EARLIER WORKS ON UNIT SELECTION ALGORITHM Black and Taylor (1997) UNIT FEATURES • Decision trees for each unit based on the questions concerning phonemic and prosodic context • Units of the same type are further clustered into different groups based on the acoustic similarity within FOR SYNTHESIS • For each target unit, an appropriate decision tree is selected to find the best cluster of candidate units • A search is then made to find the best path through the candidate units • A particular realization of a unit is selected taking into account the distance of the unit from its cluster center and the cost of joining two adjacent units

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT EARLIER APPROACES FOR UNIT SELECTION • Need a prediction component to get the prosodic features of the target units • Build decision trees and defines a measure for acoustic similarity based on which the units of the same type are further clustered • Questions: • Can we avoid prediction of prosodic features of target units using a prior knowledge (it may not available also?) • How do you define a measure for acoustic similarity based on Cepstrum for longer units such as syllables and words (For two units with different durations, we need more than simple Euclidean distance between Cepstral Coefficients of the units)

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT UNIT SELECTION BASED ON PROSODIC AND PHONETIC FEATURES • Unit selection mainly based on Pitch, Duration and Energy • No need of prior prosodic knowledge • Does not cluster the units based on Euclidean distance between Cepstral coefficients • Hypothesis: A Unit is best perceived in synthesized speech if its neighboring units have same or at least similar prosodic and phonetic features as that of in the recorded speech (from which the unit is selected).

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT UNIT SELECTION BASED ON PROSODIC AND PHONETIC FEATURES • UNIT FEATURES • Phonetic context: • Position of the unit (syllable) • Last phone of the previous unit • First phone of the succeeding unit • Prosodic context: • Pitch, duration and energy • Previous unit’s pitch, duration and energy • Succeeding unit’s pitch, duration and energy

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT UNIT SELECTION BASED ON PROSODIC AND PHONETIC FEATURES • FOR SYNTHESIS • Having selected (k-1) th unit, a realization of k th unit if it satisfies the following conditions • Position in the word is same as required • It should have the coarticulation effect of last phone of previous unit • Prosodic features such as pitch, duration and energy should match with the expectations of (k-1) th unit • WHAT ARE THESE EXPECTATIONS AND HOW DO YOU MEASURE PROSODIC SIMILARITY

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT UNIT SELECTION BASED ON PROSODIC AND PHONETIC FEATURES EXPECTATIONS OF (K-1) th UNIT Actual values of pitch, duration and energy of the unit following (k-1) th unit in the speech corpus If Ea Da and Pa are the energy, duration and pitch of k th syllable Ee, De, and Pe are the expectations of (k-1) th syllable then Prosodic Matching Function

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT SIGNIFICANCE OF PROSODIC MATCHING FUNCTION • A lesser value of function m (.) indicates better prosodic match • The function m(.) attains the least possible value “0” if • Ee = Ea , De = Da , and Pe = Pa (k th unit happens to the successor of (k-1) th in the speech corpus ) • Selecting k th unit will lead to selection of longer sequence consisting of two units • Thus the function m(.) will implicitly selects longer available sequence such as words, phrases and even sentences • HOW WELL IT WORKS ?

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT DEVELOPMENT OF TEXT TO SPEECH SYSTEMS FOR INDIAN LANGUAGES • Nature of Indian Scripts • Common phonetic Base • About 35 Consonants and 18 Vowels • Phonetic nature of languages - We almost speak what we write • Choice of Unit – Syllable • Basic units of Indian writing system are characters • These characters are close to syllables and are typically of the form C, V, CV, CVC, CCV

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT DEVELOPMENT OF TEXT TO SPEECH SYSTEMS FOR INDIAN LANGUAGES Letter to Sound Rules(LTS) Almost one to one correspondence between character and sound Inherent Vowel Suppression(IVS) 1.No two successive characters undergoes IVS 2.The last character of the word always have its vowel suppressed unless the vowel is not /a/. e.g. rAma -> rAm 3.For characters in word middle position, IVS occurs if the next character in the word is a.Either not the last character b.Or has a vowel other than /a/.

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT DEVELOPMENT OF TEXT TO SPEECH SYSTEMS FOR INDIAN LANGUAGES Syllabification Rules 1.When nasals (/mz/ & /mzz/) succeeds a vowel, there are treated as part of vowel. e.g. samzskrit 2.When 3 or more consonants between consecutive vowels, 1st consonants -> coda of previous syllable Other consonants -> onset of next syllable 3.When exactly 2 consonants between consecutive vowels, 1st consonants -> coda of previous syllable 2nd consonants -> onset of next syllable Exception: Second consonant is from {/r/, /s/, /sh/, /shz/} 1st & 2nd consonant -> onset of next syllable

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT DEVELOPMENT OF SPEECH DATABASES • GENERATION OF SPEECH CORPUS • Sentences are selected from large text corpus (available with LTRC) taking into account of high frequency syllables • A sentence is selected if it has at least one high frequency syllable not present in the previous selected sentences • Recording is done in lab environment • LABELLING THE CORPUS • Mark the boundaries of speech segments at phone level • Emulabel (www.festvox.org/emu)

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT DEVELOPMENT OF SPEECH DATABASES • CORPUS DETAILS • Hindi • 96 minutes • 2391 high frequency syllables (23096 total realizations) • Telugu • 110 minutes • 2291 high frequency syllables (33417 total realizations)

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT DEMOS AND COMPARISON PMF based Speech Synthesizer Festvox based Speech Synthesizer Hindi Sample 1 CLICK HERE CLICK HERE Hindi Sample 2 CLICK HERE CLICK HERE Telugu Sample 1 CLICK HERE CLICK HERE Telugu Sample 2 CLICK HERE CLICK HERE

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT CONCLUSION • Data driven synthesis methods have capabilities to produce high quality synthesis • Syllable is more suitable unit than diphone for Indian languages • Unit selection algorithm based on simple prosodic features performs as good as or sometimes better than other methods • Selection of a unit satisfying local phonetic constraints and prosodic constraints through prosodic matching function produce high quality speech output

S. P. Kishore*, Rohit Kumar** and Rajeev Sangal* * Language Technologies Research Center International Institute of Inf