HTK tutorial

Speaker: ricer
Date: 2005.08.26

Outline
• Data preparation
  • Corpora: label & speech data
• Three models for Automatic Speech Recognition
  • Acoustic model
    • Feature extraction
    • HMM (Hidden Markov Model)
  • Pronunciation dictionary
  • Searching net
    • Free-syllable net
    • Large vocabulary
• Recognizer evaluation

Data preparation
• We have
  • Wave files and the corresponding labels
  • File name example: MD01M0P0000 (fields: Mandarin, year, female speaker, serial number)
• We want
  • A list of all files (function: file2list)
      J:/sp12/wav/1.wav, …*.lab
      J:/sp12/wav/2.wav, …*.lab
  • All labels in one file, the Master Label File (function: lab2mlf)
      #!MLF!#
      "*/1.lab"
      sia
      tian
      .
      "*/2.lab"
      sia
      yu
      .
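
The slide does not show how lab2mlf is implemented, so the following is only a minimal sketch of the idea: collecting per-utterance labels into one Master Label File. The function name `build_mlf` and its dictionary input are assumptions made for illustration; they are not part of the tutorial's tools.

```python
def build_mlf(transcripts):
    """Return Master Label File text.

    transcripts maps an utterance name (e.g. "1") to its list of labels.
    This is a hypothetical stand-in for the slide's lab2mlf function.
    """
    lines = ["#!MLF!#"]
    for name, labels in transcripts.items():
        lines.append(f'"*/{name}.lab"')  # pattern matching any directory
        lines.extend(labels)             # one label per line
        lines.append(".")                # a single dot closes each entry
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_mlf({"1": ["sia", "tian"], "2": ["sia", "yu"]}), end="")
```

Run on the two utterances of the slide, this prints the same MLF as shown above.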

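Example utterance xxx1000: 夏天下雨 (夏天 "summer", 下雨 "to rain"), transcription sia4_tieN1_sia4_Y3
Mandarin speech waveform: 夏天 (sia4 tieN1) 下雨 (sia4 Y3)
Acoustic units:
• Tonal syllables: 夏 sia4, 天 tieN1, 下 sia4, 雨 Y3
• Toneless syllables: sia sp tieN sia Y
• Toneless phonemes: s i a sp t i e N s i a Y
• Intra-syllable right-context dependent: s+i i+a a sp t+i i+e e+N N s+i i+a a Y
• Intra-syllable left- and right-context dependent: s+i s-i+a i-a sp t+i t-i+e i-e+N e-N s+i s-i+a i-a Y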

Model lists and Master Label Files (function: hled)

Model lists:
  Tonal syllable   Mono-syllable   Mono-phone   Tri-phone
  syl_tone.mod     syl_sp.mod      phn.mod      tri.mod
  sia4             sia             s            s+i
  tian1            tian            i            s-i+a
  yu3              yu              a            i-a
  …                sp              t            t+i
                   …               …            …

Master Label File at each level:
• Tonal syllable: #!MLF!# "*/1.lab" sia4 tian1 . "*/2.lab" sia4 yu3 .
• Mono-syllable: #!MLF!# "*/1.lab" sia sp tian . "*/2.lab" sia sp yu .
• Mono-phone: #!MLF!# "*/1.lab" s i a sp t i a n . "*/2.lab" s i a sp y u .
• Tri-phone: #!MLF!# "*/1.lab" s+i s-i+a i-a sp t+i t-i+a i-a+n a-n . "*/2.lab" s+i s-i+a i-a sp y+u y-u .

Data preparation
Mono-phone MLF → hled → tri-phone MLF
hled3("syl.mlf", "ex.led", "syl2tri.dic", "tri.mlf", "tri.mod")
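
The mono-phone to tri-phone conversion can be sketched in a few lines. This is an illustration of the intra-syllable context expansion only, not the code behind hled3; the function names are hypothetical. Context never crosses a syllable boundary, and sp is inserted between syllables as in the example MLF.

```python
def syllable_to_triphones(phones):
    """Expand the phones of one syllable into intra-syllable tri-phones.

    Uses HTK's left-centre+right naming; the first phone has no left
    context and the last phone has no right context.
    """
    out = []
    last = len(phones) - 1
    for i, p in enumerate(phones):
        left = f"{phones[i - 1]}-" if i > 0 else ""
        right = f"+{phones[i + 1]}" if i < last else ""
        out.append(f"{left}{p}{right}")
    return out


def utterance_to_triphones(syllables):
    """Expand a list of syllables, putting sp between syllables."""
    out = []
    for n, phones in enumerate(syllables):
        if n > 0:
            out.append("sp")
        out.extend(syllable_to_triphones(phones))
    return out


if __name__ == "__main__":
    utt = [["s", "i", "a"], ["t", "i", "a", "n"]]
    print(" ".join(utterance_to_triphones(utt)))
```

For "sia tian" this reproduces the tri-phone label sequence of utterance 1.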

Feature extraction
Mel-Frequency Cepstrum Coefficients (MFCC)
• See vip/eda.cfg
  NATURALREADORDER = TRUE
  SOURCEFORMAT = WAV
  TARGETKIND = MFCC_E_D_A
  TARGETRATE = 100000.0
  WINDOWSIZE = 200000.0
  USEHAMMING = TRUE
  PREEMCOEF = 0.97
  NUMCHANS = 26
  NUMCEPS = 12
  ENORMALISE = TRUE
  DELTAWINDOW = 2
  ACCWINDOW = 2
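
As a quick check of what this configuration implies, the sketch below derives the frame shift, window length and feature vector size. HTK expresses times in units of 100 ns; the helper names are made up for this example.

```python
# Values copied from the eda.cfg slide.
config = {
    "TARGETKIND": "MFCC_E_D_A",
    "TARGETRATE": 100000.0,   # frame shift, in 100 ns units
    "WINDOWSIZE": 200000.0,   # analysis window, in 100 ns units
    "NUMCEPS": 12,
}

UNITS_PER_MS = 10000.0  # 1 ms = 10,000 units of 100 ns


def to_ms(htk_time):
    """Convert an HTK time value (100 ns units) to milliseconds."""
    return htk_time / UNITS_PER_MS


def vector_size(cfg):
    """Feature vector size implied by TARGETKIND and NUMCEPS."""
    qualifiers = cfg["TARGETKIND"].split("_")[1:]
    # Static part: cepstra, plus energy (_E) and/or C0 (_0) if requested.
    static = cfg["NUMCEPS"] + ("E" in qualifiers) + ("0" in qualifiers)
    # Deltas (_D) and accelerations (_A) each add a copy of the static size.
    copies = 1 + ("D" in qualifiers) + ("A" in qualifiers)
    return static * copies


if __name__ == "__main__":
    print(to_ms(config["TARGETRATE"]), "ms frame shift")
    print(to_ms(config["WINDOWSIZE"]), "ms window")
    print(vector_size(config), "features per frame")
```

The result, 13 static coefficients times three, is the 39-dimensional vector used in the HMM prototype.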

Creating mono-phone HMM
Function: hcompv
hcompv("3n3s1m.pro", "phn.mod", "mfc.lst", "phn.mlf")

Topology (figure): a left-to-right HMM with three emitting states (state1, state2, state3) and transitions a01, a11, a12, a22, a23, a33, a34.

Transition matrix:
  0.000e+0 1.000e+0 0.000e+0 0.000e+0 0.000e+0
  0.000e+0 5.000e-1 5.000e-1 0.000e+0 0.000e+0
  0.000e+0 0.000e+0 5.000e-1 5.000e-1 0.000e+0
  0.000e+0 0.000e+0 0.000e+0 5.000e-1 5.000e-1
  0.000e+0 0.000e+0 0.000e+0 0.000e+0 0.000e+0

• See vip/3n3s1m.pro
  ~o <VecSize> 39 <MFCC_E_D_A> <DiagC>
  <BeginHMM>
  <NumStates> 5
  <StreamInfo> 3 13 13 13
  <State> 2
  <SWeights> 3 1.000000e+000 1.000000e+000 1.000000e+000
  <Stream> 1
  <Mean> 13
  0 0 0 0 0 0 0 0 0 0 0 0 0
  <Variance> 13
  1 1 1 1 1 1 1 1 1 1 1 1 1
  <Stream> 2
  <Mean> 13
  0 0 0 0 0 0 0 0 0 0 0 0 0
  <Variance> 13
  1 1 1 1 1 1 1 1 1 1 1 1 1
  <Stream> 3
  <Mean> 13
  0 0 0 0 0 0 0 0 0 0 0 0 0
  <Variance> 13
  1 1 1 1 1 1 1 1 1 1 1 1 1
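
The transition matrix above follows a simple pattern, sketched here for illustration. The function name and the `self_loop` parameter are assumptions; the prototype file itself simply lists the numbers.

```python
def left_to_right_transitions(num_states=5, self_loop=0.5):
    """Build the transition matrix of a left-to-right HMM prototype.

    The first and last states are non-emitting (entry and exit).
    Each emitting state either stays where it is or moves to the next state.
    """
    a = [[0.0] * num_states for _ in range(num_states)]
    a[0][1] = 1.0  # the entry state always jumps to the first emitting state
    for i in range(1, num_states - 1):
        a[i][i] = self_loop
        a[i][i + 1] = 1.0 - self_loop
    # The exit state has no outgoing transitions, so its row stays all zero.
    return a


if __name__ == "__main__":
    for row in left_to_right_transitions():
        print(" ".join(f"{x:.3e}" for x in row))
```

Every row except the last sums to one, which is a useful sanity check when writing a prototype by hand.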

Creating mono-phone HMM
erest(0, "mfc.lst", "phn.mlf", "phn.mod", 4)
• Before re-estimation: all Gaussians have the same mean and variance
• After re-estimation: each Gaussian is refined to fit the data of its own model (a, i, t, s)

Creating mono-phone HMM
*.sts
  No.  Model name  Counts  State 1       State 2       State 3
  1    "A"         592     4905.791992   2986.182129   2671.728271
  2    "C"         340     2879.937256   1012.802734   1034.989014
  3    "E"         124     1856.104492   837.523865    670.962036
  4    "G"         2082    12491.683594  13483.448242  7560.445313
  5    "I"         580     5163.432617   2224.220703   2649.229248
  6    "J"         358     1926.428955   990.966858    1072.944946

hhed(5, "ssp.hed", "phn_sp.mod", 6)
ssp.hed:
  AT 2 4 0.2 {sil.transP}
  AT 4 2 0.2 {sil.transP}
  AT 1 3 0.3 {sp.transP}
  TI ssp {sil.state[3],sp.state[2]}

Creating tri-phone HMM
Mono-phone HMM → hhed → tri-phone HMM
hhed(10, "tri.hed", "phn_sp.mod", 11)

Mono-phone HMM:
  ~h "i"
  <BEGINHMM>
  <NUMSTATES> 5
  <STATE> 2
  <SWEIGHTS> 3 1.000000e+000 1.000000e+000 1.000000e+000
  <STREAM> 1
  <MEAN> 13
  -1.231087e+001 -9.749413e-001 9.766034e+000 …
  <VARIANCE> 13
  2.357146e+001 6.214857e+001 6.707030e+001 …
  <GCONST> 6.855833e+001
  <STREAM> 2
  <MEAN> 13
  -4.161490e-002 -3.644128e-001 2.665605e-002 …
  <VARIANCE> 13
  9.070483e-001 3.527379e+000 4.040065e+000 …
  <GCONST> 3.090531e+001
  <STREAM> 3
  <MEAN> 13
  -2.207233e-002 2.695016e-002 -4.460607e-001 …
  <VARIANCE> 13
  1.712317e-001 4.176200e-001 3.959162e-001 …
  <GCONST> 6.822828e+000
  ……

Tri-phone HMMs:
  ~h "s-i+a"
  <BEGINHMM>
  <NUMSTATES> 5
  <STATE> 2
  <SWEIGHTS> 3 1.000000e+000 1.000000e+000 1.000000e+000
  <STREAM> 1
  <MEAN> 13
  -1.231087e+001 -9.749413e-001 9.766034e+000 …
  <VARIANCE> 13
  2.357146e+001 6.214857e+001 6.707030e+001 …
  <GCONST> 6.855833e+001
  ……
  ~h "t-i+a"
  <BEGINHMM>
  <NUMSTATES> 5
  <STATE> 2
  <SWEIGHTS> 3 1.000000e+000 1.000000e+000 1.000000e+000
  <STREAM> 1
  <MEAN> 13
  -1.231087e+001 -9.749413e-001 9.766034e+000 …
  <VARIANCE> 13
  2.357146e+001 6.214857e+001 6.707030e+001 …
  <GCONST> 6.855833e+001
  ……

Creating tri-phone HMM
Figure: each tri-phone starts with a Gaussian that has the same mean and variance as its mono-phone (s-i+a and t-i+a from i; i-a+n and i-a from a).

Creating tri-phone HMM
hhed(15, "mix2.hed", "tri.mod", 16)
• Before: a single Gaussian for each model (i-a+n, s-i+a, t-i+a, i-a)
• After: Gaussian mixtures (two Gaussians) for each model (i-a+n, s-i+a, t-i+a, i-a)

behhed(2, "hmm/hmm15/15.sts", "hed/mix2.hed")
mix2.hed:
  MU 2 {s-i+a.state[2].stream[1-3].mix}
  MU 2 {t-i+a.state[3].stream[1-3].mix}
  ……

erest(16, "mfc.lst", "tri.mlf", "tri.mod", 20)
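
The slide does not say how behhed decides which states get a second Gaussian. One plausible reading, given that it takes the .sts file as input, is that only states with enough training data are split. The sketch below illustrates that idea; the occupancy threshold, the function name and the sample statistics are all assumptions, not the tutorial's actual rule or data.

```python
def make_mu_commands(stats, num_mix, min_occupancy):
    """Write HHEd MU commands for states with enough training data.

    stats maps a model name to the occupancy of its emitting states,
    as listed in a .sts file. min_occupancy is a hypothetical threshold.
    """
    commands = []
    for model, occupancies in stats.items():
        for offset, occ in enumerate(occupancies):
            state = offset + 2  # emitting states are numbered from 2
            if occ >= min_occupancy:
                commands.append(
                    f"MU {num_mix} {{{model}.state[{state}].stream[1-3].mix}}"
                )
    return commands


if __name__ == "__main__":
    # Invented occupancy values, for illustration only.
    stats = {"s-i+a": [500.0, 40.0, 35.0], "t-i+a": [30.0, 450.0, 20.0]}
    for line in make_mu_commands(stats, num_mix=2, min_occupancy=100.0):
        print(line)
```

With these invented counts the output has the same form as the two mix2.hed lines on the slide.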

Creating tri-phone HMM: training data
Figure: training data grouped by mixture component: s+i (mix1, mix2), i-a (mix1, mix2, mix3), s-i+a (mix1, mix2).

Pronunciation dictionary
  Character  Pronunciation probability  Pronunciation  HMM models
  一         0.16134                    i2             i sp
  一         0.26218                    i4             i sp
  一         0.57647                    i1             i sp
  乙         1.00000                    i3             i sp
  丁         1.00000                    ding1          d+i d-i+n i-n+g n-g sp
  七         1.00000                    ci1            c+i c-i sp

Syl2phn.dic:
  sia   s i a
  tian  t i a n
  yu    y u
  sp    sp
  sil   sil

Syl2tri.dic:
  sia   s+i s-i+a i-a sp
  tian  t+i t-i+a i-a+n a-n sp
  yu    y+u y-u sp
  sp    [] sp
  sil   [] sil

Function: hdman
hdman("syl.mod", "syl2phn.dic", "syl2rcd.dic", "man1.log", "man2.log");

Searching net
• Linear net
• Free Hanzi (Chinese character) net
• Tree-structured net
Figure: example nets built from the characters 台北市, 政府, 中, 縣, 廳.

Searching net
Hparse("free_syl.grm", "free_syl.net")

free_syl.grm:
  $free_syl = sia | tian | yu;
  (<$free_syl>)

free_syl.net:
  VERSION=1.0
  N=6 L=10
  I=0 W=yu
  I=1 W=!NULL
  I=2 W=tian
  I=3 W=sia
  I=4 W=!NULL
  I=5 W=!NULL
  J=0 S=1 E=0
  J=1 S=5 E=0
  J=2 S=0 E=1
  J=3 S=2 E=1
  J=4 S=3 E=1
  J=5 S=1 E=2
  J=6 S=5 E=2
  J=7 S=1 E=3
  J=8 S=5 E=3
  J=9 S=1 E=4

Figure: the same network drawn as a graph, with nodes i0 to i5 (yu, tian, sia and three !NULL nodes) and arcs j0 to j9.
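
To make the network file easier to read, here is a small sketch that parses the lattice text shown on this slide into nodes and arcs. In this format, I lines define nodes with a word W, and J lines define arcs from node S to node E. The parser handles only the fields that appear here, and the function name is made up for the example.

```python
FREE_SYL_NET = """\
VERSION=1.0
N=6 L=10
I=0 W=yu
I=1 W=!NULL
I=2 W=tian
I=3 W=sia
I=4 W=!NULL
I=5 W=!NULL
J=0 S=1 E=0
J=1 S=5 E=0
J=2 S=0 E=1
J=3 S=2 E=1
J=4 S=3 E=1
J=5 S=1 E=2
J=6 S=5 E=2
J=7 S=1 E=3
J=8 S=5 E=3
J=9 S=1 E=4
"""


def parse_net(text):
    """Return (nodes, arcs): node id -> word, and a list of (start, end)."""
    nodes, arcs = {}, []
    for line in text.splitlines():
        fields = dict(item.split("=", 1) for item in line.split())
        if "I" in fields:       # node definition
            nodes[int(fields["I"])] = fields["W"]
        elif "J" in fields:     # arc definition
            arcs.append((int(fields["S"]), int(fields["E"])))
    return nodes, arcs


if __name__ == "__main__":
    nodes, arcs = parse_net(FREE_SYL_NET)
    for start, end in arcs:
        print(f"{nodes[start]:>6} -> {nodes[end]}")
```

Printing the arcs shows the loop: every syllable node leads to the !NULL node 1, which leads back to each syllable or on to the exit node 4.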

Recognizer evaluation: testing data
Figure: testing data matched against the trained models (s+i, i-a, s-i+a).
Vite("mfc.lst", 25, "tri.mod", "syl2tri.dic", "freesyl.net", "rec_freesyl.mlf", "rec_freesyl.log")

Recognizer evaluation
Mandarin syllable network
• Layers 1 to 3 find the best syllable sequence from the acoustic characteristics.
• The best path of Chinese characters is then found from the best syllable sequence.
• "tai wan" could translate to 台灣 (Taiwan) or 太晚 (too late), two different meanings.
• "tai bei" could translate to 台北 (Taipei) or 泰北 (northern Thailand), two different locations.
Figure elements:
• Acoustic HMMs: Model "t", Model "ah", Model "ai", …; tri-phones s+i s-i+a i-a t+i t-i+a i-a+n a-n y+u y-u
• Syllable HMMs: tai → t t+ai; wan → wan; da → d d+a; xuei → x x+uei; dai → d d+ai; bah → b b+ah; …
• Syllable network: states s1 to s10 over the syllables sh, le, tai, bei, bah, wan, qi, dai, hah, liau, da, xuei
• Bi-lingual dictionary

Recognizer evaluation
Result("rec_freesyl.mlf", "syl.mlf", "syl.mod", "tri.rec")
• syl.mlf: #!MLF!# "*/1.lab" sia tian . "*/2.lab" sia yu .
• rec_freesyl.mlf: #!MLF!# "*/1.lab" sia yu yu . "*/2.lab" sia yu .
• tri.rec:
    Aligned transcription: I:/…
      LAB: sia tian
      REC: sia yu yu
    Aligned transcription: I:/…
      LAB: sia yu
      REC: sia yu
    WORD: %Correct=50 [H=1, S=1, N=2]
    SYLL: %Corr=75, Acc=50 [H=3, D=0, S=1, I=1, N=4]
• H = correct, D = deletions, S = substitutions, I = insertions, N = number of labels in the reference
• %Corr = H/N = 3/4 = 75; Acc = (H - I)/N = (3 - 1)/4 = 50
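
The two syllable-level figures can be recomputed from the counts in the result file. The sketch below only illustrates the formulas; the function name is made up for this example.

```python
def syllable_scores(hits, deletions, substitutions, insertions):
    """Return (%Corr, Acc) from alignment counts.

    N, the number of reference labels, is hits + deletions + substitutions.
    Insertions lower the accuracy but not the percent correct.
    """
    n = hits + deletions + substitutions
    percent_correct = 100.0 * hits / n
    accuracy = 100.0 * (hits - insertions) / n
    return percent_correct, accuracy


if __name__ == "__main__":
    # Counts from the slide: H=3, D=0, S=1, I=1.
    corr, acc = syllable_scores(3, 0, 1, 1)
    print(f"%Corr={corr:.0f}, Acc={acc:.0f}")
```

The inserted "yu" in utterance 1 is why the accuracy (50) is lower than the percent correct (75).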

Homework ('01 corpus)
• CGU
• (tri-phone, free-syllable net)
• g1
• Taiwanese: MDXXXX
• Mandarin: TWXXXX
• Male: XXM1XX
• Female: XXM0XX
• Due: in two weeks
• Data: download
