Modeling and Generation of Accentual Phrase F0 Contours Based on Discrete HMMs Synchronized at Mora-Unit Transitions Atsuhiro Sakurai (Texas Instruments Japan, Tsukuba R&D Center) Koji Iwano (currently with Tokyo Institute of Technology, Japan) Keikichi Hirose (Dep. of Frontier Eng., The University of Tokyo, Japan)
Introduction to Corpus-Based Intonation Modeling
• Traditional approach: rules derived from linguistic expertise. Human-dependent (too complicated and not satisfactory, because the phenomena involved are not completely understood)
• Corpus-based approach: modeling derived from statistical analysis of speech corpora. Automatic (potential to improve as better speech corpora become available)
Background
• HMMs are widely used in speech recognition, and fast learning algorithms exist
• Macroscopic discrete HMMs associated with accentual phrases can store information such as accent type and prosodic structure
• Morae are extremely important for describing Japanese intonation: sequences of high and low morae can characterize accent types
Overview of the Method
• Definition of HMM and alphabet:
  • Accent types modeled by discrete HMMs
  • 2-code mora F0 contour alphabet used as output symbols
  • State transitions synchronized with mora transitions
• Classification of HMMs and training:
  • HMMs classified according to linguistic attributes
  • Training by the usual forward-backward (FB) algorithm
• Generation of F0 contours:
  • Best sequence of symbols generated by a modified Viterbi algorithm
The Mora-F0 Alphabet
• Two codes: stylized mora F0 contours and mora-to-mora F0, with 34 symbols each
• Obtained by LBG clustering from a 500-sentence database (ATR continuous speech database, speaker MHT)
• The entire database is labeled using the 2-code symbols
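The LBG codebook construction mentioned above can be sketched as follows. This is a minimal illustration only, assuming plain Euclidean distortion and k-means-style refinement; the function name `lbg_codebook` and its parameters are hypothetical and not taken from the paper (note also that LBG doubles the codebook at each split, so a 34-entry codebook would need a final trimming step not shown here):

```python
import numpy as np

def lbg_codebook(vectors, size, eps=0.01, n_iter=20):
    """Grow a VQ codebook by LBG splitting: start from the global
    centroid, perturb each codeword into two, then refine the enlarged
    codebook with k-means (Lloyd) iterations until `size` is reached."""
    vectors = np.asarray(vectors, dtype=float)
    codebook = vectors.mean(axis=0, keepdims=True)
    while len(codebook) < size:
        # Split every codeword into a (1+eps) / (1-eps) pair.
        codebook = np.concatenate([codebook * (1 + eps),
                                   codebook * (1 - eps)])
        for _ in range(n_iter):
            # Assign each vector to its nearest codeword ...
            dists = np.linalg.norm(vectors[:, None] - codebook[None], axis=2)
            labels = dists.argmin(axis=1)
            # ... then move each codeword to the centroid of its cell
            # (empty cells keep their current codeword).
            for k in range(len(codebook)):
                cell = vectors[labels == k]
                if len(cell):
                    codebook[k] = cell.mean(axis=0)
    return codebook
```

In the paper's setting the input vectors would be the stylized mora F0 contours (for the shape code) and mora F0 values (for the F0 code), clustered separately.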
The Accentual Phrase HMM
• Accentual phrases are classified according to:
  • Accent type
  • Position of the accentual phrase in the sentence
  • (Optional: number of morae, part-of-speech, syntactic structure)
[Figure: an accentual phrase modeled by an HMM whose state transitions are synchronized with mora transitions]
Example: "Karewa Tookyookara kuru." (He comes from Tokyo)
[Figure: the sentence is segmented into three accentual phrases M1, M2, M3; each mora is labeled with a 2-code symbol (shape, F0), and each phrase is annotated with its accent type and its position in the sentence]
HMM Topologies (a) Accent types 0 and 1 (b) Other accent types
Training Database
• ATR continuous speech database (500 sentences, speaker MHT)
• Segmented into morae and accentual phrases
• Mora labels using the mora-F0 alphabet: shape (stylized F0 contour) and mora F0
• Accentual phrase labels: number of morae, position in the sentence
Output Code Generation
How to use the HMM for synthesis?
A) Recognition: one output sequence is given; find its likelihood and the best path
B) Synthesis: find the best output sequence and the best path
Intonation Modeling Using HMM
Viterbi Search for the Recognition Problem:

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    D_min(t, i_t) = min_{i_{t-1}} { D_min(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(y(t) | i_t)] }
    ψ(t, i_t) = argmin_{i_{t-1}} { D_min(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(y(t) | i_t)] }
  next i_t
next t
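The recognition-side recursion above can be sketched directly in the negative-log (min-cost) domain. This is an illustrative sketch, not the authors' code; `viterbi_neglog` and its array layout are assumptions, and for simplicity all states are allowed as initial states at zero cost:

```python
import numpy as np

def viterbi_neglog(neglog_a, neglog_b, obs):
    """Min-cost Viterbi path for a discrete HMM.
    neglog_a[i, j] = -log a(j | i), neglog_b[j, k] = -log b(k | j),
    obs = sequence of output-symbol indices y(1), ..., y(T)."""
    S = neglog_a.shape[0]
    T = len(obs)
    D = np.full((T, S), np.inf)        # D[t, i] = min cost of ending in state i at time t
    psi = np.zeros((T, S), dtype=int)  # back-pointers psi(t, i)
    D[0] = neglog_b[:, obs[0]]         # any initial state, zero transition cost
    for t in range(1, T):
        for i in range(S):
            cand = D[t-1] + neglog_a[:, i]
            psi[t, i] = cand.argmin()
            D[t, i] = cand[psi[t, i]] + neglog_b[i, obs[t]]
    # Trace the best path back through the pointers.
    path = [int(D[-1].argmin())]
    for t in range(T-1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return float(D[-1].min()), path[::-1]
```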
Intonation Modeling Using HMM
Modified Viterbi Search for the Synthesis Problem:

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    D_min(t, i_t) = min_{i_{t-1}} { D_min(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(y_max(t) | i_t)] }
    ψ(t, i_t) = argmin_{i_{t-1}} { D_min(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(y_max(t) | i_t)] }
  next i_t
next t
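The synthesis-side modification can be sketched by replacing the observed symbol with each state's most probable symbol y_max, i.e. the one minimizing -log b. Again a hedged sketch under the same assumed array layout; `viterbi_synthesis` is a hypothetical name:

```python
import numpy as np

def viterbi_synthesis(neglog_a, neglog_b, T):
    """Modified Viterbi for synthesis: no observation sequence is given;
    each state i contributes the cost of its locally best symbol
    y_max(i) = argmin_k neglog_b[i, k], and the best state path plus
    the corresponding symbol sequence are returned."""
    S = neglog_a.shape[0]
    best_cost = neglog_b.min(axis=1)    # -log b(y_max | i) per state
    best_sym = neglog_b.argmin(axis=1)  # y_max per state
    D = np.full((T, S), np.inf)
    psi = np.zeros((T, S), dtype=int)
    D[0] = best_cost
    for t in range(1, T):
        for i in range(S):
            cand = D[t-1] + neglog_a[:, i]
            psi[t, i] = cand.argmin()
            D[t, i] = cand[psi[t, i]] + best_cost[i]
    path = [int(D[-1].argmin())]
    for t in range(T-1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    path = path[::-1]
    return [int(best_sym[i]) for i in path], path
```

The returned symbol sequence is what would be decoded back into mora F0 shapes and levels to build the phrase's F0 contour.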
Use of Bigram Probabilities

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    D_min(t, i_t) = min_{i_{t-1}} { D_min(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + min_k [-log b(y_k(t) | y(t-1), i_t)] }
    ψ(t, i_t) = argmin_{i_{t-1}} { D_min(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + min_k [-log b(y_k(t) | y(t-1), i_t)] }
  next i_t
next t

where k = 1, ..., K (dimension of y)
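The bigram-conditioned search can be sketched as below. Caveat: this follows the slide's recursion, which chooses the symbol greedily per state given the symbol on the best incoming path; an exact search would carry (state, previous symbol) pairs in the dynamic program. `viterbi_synthesis_bigram`, the tensor layout, and the initial symbol `y0` are all assumptions for illustration:

```python
import numpy as np

def viterbi_synthesis_bigram(neglog_a, neglog_bb, T, y0):
    """Synthesis search with bigram-conditioned outputs.
    neglog_bb[y_prev, y, i] = -log b(y | y_prev, i); y0 is an assumed
    initial conditioning symbol. Returns (symbol sequence, state path)."""
    S = neglog_a.shape[0]
    D = np.full((T, S), np.inf)
    psi = np.zeros((T, S), dtype=int)
    sym = np.zeros((T, S), dtype=int)   # greedy symbol choice per (t, state)
    for i in range(S):
        sym[0, i] = neglog_bb[y0, :, i].argmin()
        D[0, i] = neglog_bb[y0, sym[0, i], i]
    for t in range(1, T):
        for i in range(S):
            cand = D[t-1] + neglog_a[:, i]
            j = int(cand.argmin())
            psi[t, i] = j
            y_prev = sym[t-1, j]        # symbol on the best incoming path
            sym[t, i] = neglog_bb[y_prev, :, i].argmin()
            D[t, i] = cand[j] + neglog_bb[y_prev, sym[t, i], i]
    path = [int(D[-1].argmin())]
    for t in range(T-1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    path = path[::-1]
    return [int(sym[t, path[t]]) for t in range(T)], path
```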
Phrase Boundary Level Modeling Using HMM

Bound. Level | Pause Y/N | J-ToBI B.I.
      1      |     Y     |      3
      2      |     N     |      3
      3      |     N     |      2
The Effect of Bigrams
[Figure: generated F0 contours for phrases PH1_0, PH1_1 and PH1_2, comparing the original model with the bigram-based model]
Comments
• We presented a novel approach to intonation modeling for TTS synthesis based on discrete mora-synchronous HMMs.
• From now on, more features should be included in the HMM modeling (phonetic context, part-of-speech, etc.), and the approach should be compared to rule-based methods.
• Training-data scarcity is a major problem to overcome (through feature clustering, an F0 contour generation model, etc.).
Hidden Markov Models (HMM)
A Hidden Markov Model (HMM) is a finite state automaton in which both state transitions and outputs are stochastic. At each time period it changes to a new state, generating a new vector according to the output distribution of that state. Symbols: 1, 2, ..., K
[Figure: 4-state HMM with transition probabilities a11, ..., a44, a12, a13, a23, a34 and output distributions b(1|i), ..., b(K|i) for each state i]
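The generative behavior described above can be sketched with a few lines of code. A minimal illustration only, assuming row-stochastic transition and output tables; `sample_hmm` and its fixed start state are hypothetical choices, not part of the paper's formulation:

```python
import random

def sample_hmm(a, b, start, n_steps, rng=None):
    """Generate a state and symbol sequence from a discrete HMM.
    a[i][j] = probability of moving from state i to state j;
    b[i][k] = probability of emitting symbol k in state i."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    state = start
    states, symbols = [], []
    for _ in range(n_steps):
        # Emit a symbol from the current state's output distribution ...
        symbols.append(rng.choices(range(len(b[state])), weights=b[state])[0])
        states.append(state)
        # ... then move according to the transition distribution.
        state = rng.choices(range(len(a[state])), weights=a[state])[0]
    return states, symbols
```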
Step 1: Database Creation
• ATR continuous speech database (500 sentences, speaker MHT)
• Segmentation into mora units
• Assignment of mora labels
• Extraction of F0 patterns
• Clustering by the LBG algorithm
• Assignment of cluster classes to the entire database
Introduction of Bigrams

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    D_min(t, i_t) = min_{i_{t-1}} { D_min(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + min_k [-log b(y_k(t) | y(t-1), i_t)] }
    ψ(t, i_t) = argmin_{i_{t-1}} { D_min(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + min_k [-log b(y_k(t) | y(t-1), i_t)] }
  next i_t
next t

where k = 1, ..., K (dimension of y)
Discussion and Future Work
• Training data is scarce
• Further work is needed to integrate the method into a TTS system:
  - Take other linguistic information into account (phonemes, number of morae, part-of-speech, etc.)
  - Devise ways to overcome the data shortage (clustering, etc.)
  - Investigate how to concatenate the models