Mark Hasegawa-Johnson jhasegaw@uiuc University of Illinois at Urbana-Champaign, USA

Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson jhasegaw@uiuc.edu University of Illinois at Urbana-Champaign, USA

Lecture 8. Inference in Non-Tree Graphs
• The multiple-parent problem
• Solution #1: Parent merger
• Solution #2: Moralize, Triangulate, and create a Junction Tree
• Inference in Any DBN: Sum-Product Algorithm in a Junction Tree
• Example: Factorial HMM
• Example: Zweig-triangle LVCSR
• Example: Articulatory phonology

Conditional Independence of Descendants and Non-descendants
• The Sum-Product algorithm can use computations that are local at each node, v, because of the following theorem:
• Theorem: a Bayesian network is a tree if and only if, for every variable v, the descendants D_v and non-descendants N_v of v are conditionally independent given v:
  p(D_v, N_v | v) = p(D_v | v) p(N_v | v)

Example: Descendants and Non-descendants in a Tree
[Figure: tree over nodes a, b, c, d, e, f]
• p(D_c | c) = p(d|c) p(e|c) p(f|c)
• p(N_c, c) = p(a) p(b|a) p(c|a)

Example: Descendants and Non-descendants in a Non-Tree
[Figure: graph over nodes a, b, c, d, e, f, in which d has two parents, b and c]
• p(D_c, c) = Σ_{a,b} p(a) p(b|a) p(c|a) p(d|b,c) p(e|c) p(f|c)
• p(N_c, c) = Σ_d p(a) p(b|a) p(c|a) p(d|b,c)
• So is it necessary for EVERY computation to be global?
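The two examples above can be checked numerically. The sketch below builds the joint distribution for both graphs from their factorizations, using binary variables and made-up probability tables (all table values and helper names here are assumptions for illustration, not from the lecture), and measures how far p(D_c, N_c | c) is from p(D_c | c) p(N_c | c).

```python
from itertools import product

def bern(p1):
    """Distribution over {0, 1} with P(1) = p1."""
    return {0: 1.0 - p1, 1: p1}

# Hypothetical conditional probability tables for binary variables.
p_a = bern(0.3)
p_b_a = {a: bern(0.2 + 0.5 * a) for a in (0, 1)}
p_c_a = {a: bern(0.6 - 0.4 * a) for a in (0, 1)}
p_e_c = {c: bern(0.1 + 0.7 * c) for c in (0, 1)}
p_f_c = {c: bern(0.5 + 0.3 * c) for c in (0, 1)}
p_d_c = {c: bern(0.4 + 0.2 * c) for c in (0, 1)}      # tree: d has one parent, c
p_d_bc = {(b, c): bern(0.1 + 0.5 * b + 0.3 * c)       # non-tree: d has parents b and c
          for b in (0, 1) for c in (0, 1)}

def joint(tree):
    """Joint table p(a,b,c,d,e,f), keyed by (a, b, c, d, e, f)."""
    table = {}
    for a, b, c, d, e, f in product((0, 1), repeat=6):
        pd = p_d_c[c][d] if tree else p_d_bc[(b, c)][d]
        table[(a, b, c, d, e, f)] = (p_a[a] * p_b_a[a][b] * p_c_a[a][c]
                                     * pd * p_e_c[c][e] * p_f_c[c][f])
    return table

def max_violation(table):
    """Largest |p(D,N|c) - p(D|c) p(N|c)| with N = (a, b) and D = (d, e, f)."""
    worst = 0.0
    for c in (0, 1):
        pc = sum(p for k, p in table.items() if k[2] == c)
        p_n, p_d = {}, {}
        for k, p in table.items():
            if k[2] == c:
                p_n[k[:2]] = p_n.get(k[:2], 0.0) + p / pc
                p_d[k[3:]] = p_d.get(k[3:], 0.0) + p / pc
        for k, p in table.items():
            if k[2] == c:
                worst = max(worst, abs(p / pc - p_n[k[:2]] * p_d[k[3:]]))
    return worst

tree_gap = max_violation(joint(tree=True))
nontree_gap = max_violation(joint(tree=False))
print(tree_gap, nontree_gap)
```

With these tables the gap for the tree is zero up to rounding, while the gap for the non-tree is clearly non-zero, because d ties the descendants of c to the non-descendant b.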

Local Computations in a Non-Tree
[Figure: the non-tree graph over nodes a, b, c, d, e, f]
• Here are some computations that can be local:
  • d depends only on the combination (b,c)
  • (b,c) depend only on a
  • e depends only on c, or equivalently, e depends on (b,c)

The “Parent Merger” Algorithm
[Figure: graph over nodes a through p, with merged supernodes {bc}, {def}, {ghij}, {klm}, {no}]
• Algorithm #1 for turning a Non-Tree into a Tree:
  • If any node has multiple parents, merge them
  • If any resulting supernode has multiple parents, merge them
  • Repeat until no node has multiple parents
• Why this algorithm is sometimes undesirable:
  • In an upward-branching graph, this results in a supernode with N_g N_h N_i N_j = many possible values
  • Many values ⇒ lots of computation

Algorithm #2: Junction Trees
• Moralize
• Triangulate
• Read off the cliques into a Junction Tree
• Add variables to cliques, as necessary to ensure Locality of Influence

Moralization
[Figure: a directed graph over nodes a, b, c, d, e, f, n, o, p, and its moralized, undirected version]
• “Moralization” is the process of connecting the parents of every node.
• Goal: to show that the values of the parents cannot really be computed independently.
• Once the graph has been moralized, we usually show it as an undirected graph. The dependency structure is still necessary for inference, but it is not necessary for finding the best junction tree.
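Moralization is easy to write down in code. The sketch below is one possible way to do it, assuming the graph is stored as a dictionary from each node to its list of parents (a representation chosen here for illustration). It is applied to the six-node non-tree example, whose parent sets can be read off its factorization p(a)p(b|a)p(c|a)p(d|b,c)p(e|c)p(f|c).

```python
from itertools import combinations

def moralize(parents):
    """Return the undirected edges of the moral graph.

    parents: dict mapping each node to a list of its parents.
    Each edge is a frozenset of two node names.
    """
    edges = set()
    for child, pars in parents.items():
        for p in pars:
            edges.add(frozenset((p, child)))      # keep the edge, drop its direction
        for p1, p2 in combinations(pars, 2):
            edges.add(frozenset((p1, p2)))        # connect ("marry") co-parents
    return edges

# Non-tree example: d has two parents, b and c.
parents = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["c"], "f": ["c"]}
moral_edges = moralize(parents)
print(sorted("".join(sorted(e)) for e in moral_edges))
```

The only edge added for this graph is b-c, which connects the two parents of d.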

Triangulation
[Figure: a graph over nodes a through j, shown in three stages: original, moralized, and triangulated]
• A “triangular” or “chordal” graph is a graph in which every cycle of length greater than three has a chord, i.e., an edge joining two nodes that are not adjacent in the cycle.
• “Triangulation” is the process of adding edges to a graph in order to make it triangular.
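One common way to triangulate is node elimination: visit the nodes in some order, connect all remaining neighbors of the current node, then remove it. The sketch below assumes this technique (the slides do not name a specific procedure) and uses a hypothetical five-node cycle a-b-c-d-e-a as input. The elimination order is also an assumption; different orders can give different fill-in edges and clique sizes.

```python
def triangulate(edges, order):
    """Triangulate an undirected graph by node elimination.

    edges: iterable of frozensets {u, v}; order: sequence of all nodes.
    Returns (fill_in_edges, maximal_cliques).
    """
    adj = {}
    for e in edges:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    fill, cliques = set(), []
    for node in order:
        nbrs = adj.pop(node, set())
        for u in nbrs:
            adj[u].discard(node)
        for u in nbrs:                       # connect every pair of remaining neighbors
            for v in nbrs:
                if u < v and v not in adj[u]:
                    adj[u].add(v)
                    adj[v].add(u)
                    fill.add(frozenset((u, v)))
        clique = frozenset(nbrs | {node})
        if not any(clique <= c for c in cliques):   # keep only maximal cliques
            cliques.append(clique)
    return fill, cliques

# Hypothetical example: a chordless cycle of length five.
cycle = {frozenset(p) for p in ("ab", "bc", "cd", "de", "ea")}
fill_edges, cliques = triangulate(cycle, order="abcde")
print(sorted("".join(sorted(e)) for e in fill_edges))
print(sorted("".join(sorted(c)) for c in cliques))
```

For this order the two fill-in edges are b-e and c-e, and the cliques of the triangulated graph are {a,b,e}, {b,c,e}, and {c,d,e}.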

Digression: Why are Moralization and Triangulation Allowed?
• An edge a→b means that, during inference, we must use a probability table of size N_a N_b: p(b|a)
• One special case of the p(b|a) probability table is the case in which every row is the same, i.e., p(b|a) = p(b)
• Therefore the graph without the edge is a special case of the graph with the edge.
  [Figure: two graphs over nodes a, b, c; the first is a special case of the second]
• Put another way, information about the special features of a problem is coded by absent edges, not present edges.
• Adding edges is equivalent to forcing yourself to solve a harder, more general problem, rather than a simple specific one.

Cliques
[Figure: triangulated graph over nodes a through j]
• A “clique” is a group of nodes in which every pair of nodes is connected by an edge
• The “separator” of two cliques is the set of nodes that are members of both cliques. For example:
  • Clique efg and clique def have {e,f} as their separator

Forming a Junction Tree
[Figure: triangulated graph over nodes a through j, and its junction tree with clique labels e,f,g; f,g,h; d,e,f; g,h,j; c,d,e; g,j; a,b,d]
• It is always possible to create a Junction Tree from a triangular graph using the following algorithm:
  • Start with any clique as the root node
  • Next comes the clique whose separator with the root node is largest
  • Locality of influence: if cliques A and B both contain node c, then node c must also be added to every clique in the junction tree between A and B
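The "largest separator first" rule can be sketched as a greedy, Prim-style spanning tree over the cliques. The version below is an assumption about how to generalize the two steps above: at every step it attaches the unused clique that shares the most nodes with some clique already in the tree. It is run on the six cliques of the fully triangulated nine-node example from later in this lecture; the function name is hypothetical.

```python
def build_junction_tree(cliques):
    """Greedy maximum-separator spanning tree over a list of cliques (sets).

    Returns a list of (parent_index, child_index, separator) triples,
    with cliques[0] as the root.
    """
    in_tree = [0]
    tree = []
    while len(in_tree) < len(cliques):
        candidates = [(i, j) for i in in_tree
                      for j in range(len(cliques)) if j not in in_tree]
        i, j = max(candidates, key=lambda ij: len(cliques[ij[0]] & cliques[ij[1]]))
        tree.append((i, j, cliques[i] & cliques[j]))
        in_tree.append(j)
    return tree

cliques = [frozenset(s) for s in ("abc", "bcde", "cdef", "defn", "efno", "nop")]
junction_tree = build_junction_tree(cliques)
for parent, child, sep in junction_tree:
    print(sorted(cliques[parent]), "->", sorted(cliques[child]), "separator", sorted(sep))
```

For these cliques the result is a chain, with separators {b,c}, {c,d,e}, {d,e,f}, {e,f,n}, and {n,o}.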

Triangulation: A Hard Example
[Figure: graph over nodes a, b, c, d, e, f, n, o, p, and a candidate junction tree with cliques a,b,c; b,c,e?; b,d,e?; c,e,f?; d,e,n?; e,f,o?; e,n,o?; n,o,p]
• In this example, the graph on the left is not yet fully triangulated (for example, the cycle dbcfon has no chord). Here is one possible triangulation algorithm: create a junction tree, then add variables to the cliques as necessary to maintain locality of influence.

Example: Triangulation to Maintain Locality of Influence
[Figure: junction tree with cliques a,b,c; b,c,e?; b,d,e?; c,e,f?; d,e,n?; e,f,o?; e,n,o?; n,o,p]
• Every node that’s in both the second clique and the fourth clique must also exist in the third clique

Example: Triangulation to Maintain Locality of Influence
[Figure: junction tree with cliques a,b,c; b,c,e?; b,c,d,e; c,e,f?; d,e,n?; e,f,o?; e,n,o?; n,o,p]
• Add node c to the 3rd clique, because c is in both the 2nd and 4th cliques. Putting c into this clique is equivalent to drawing an edge between c and d, so that b,c,d,e are all interconnected.

Example: Triangulation to Maintain Locality of Influence
[Figure: junction tree with cliques a,b,c; b,c,d,e; c,e,f?; d,e,n?; e,f,o?; e,n,o?; n,o,p]
• … and then delete clique (b,c,e), because it’s now redundant with clique (b,c,d,e).

Example: Triangulation to Maintain Locality of Influence
[Figure: junction tree with cliques a,b,c; b,c,d,e; c,d,e,f; d,e,n?; e,f,o?; e,n,o?; n,o,p]
• Similar reasoning: d is in both the 2nd and 4th cliques, so it had better be added to the 3rd. This is equivalent to drawing an edge between d and f.

Example: Triangulation to Maintain Locality of Influence
[Figure: junction tree with cliques a,b,c; b,c,d,e; c,d,e,f; d,e,f,n; e,f,o?; e,n,o?; n,o,p]
• … and f is in both the 3rd and 5th cliques, so it had better be added to the 4th. This is equivalent to drawing an edge between f and n.

Finishing Up
[Figure: fully triangulated graph over nodes a, b, c, d, e, f, n, o, p, and the final junction tree with cliques a,b,c; b,c,d,e; c,d,e,f; d,e,f,n; e,f,n,o; n,o,p]

Inference in a Junction Tree
• The “clique” representation means that each inference step is contained within one clique. For each clique:
  • Input: a probability table for values of the lower separator. Size of this table = N^I, where N is the number of possible values of each variable, and I is the number of nodes in the input separator
  • Product: multiply by information about other nodes in the clique. Resulting table is of size N^C, where C is the number of nodes in the clique.
  • Sum: for each possible setting of variables in the output separator (N^O possible settings), marginalize out the values of all other variables (a sum operation, containing N^(C-O) terms in the sum). Total complexity: O{N^C}
  • Pass the resulting table to the next clique.
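The input, product, sum, and output steps for one clique can be sketched as a single function. Everything named below (the function `clique_step`, the table layout, and the example numbers) is an assumption made for illustration. The example uses clique (d,e,f,n) with input separator (e,f,n) and output separator (d,e,f), binary variables, an all-ones input table, and a made-up table for p(n | d,e,f).

```python
from itertools import product

def clique_step(clique_vars, card, in_vars, in_table, factor, out_vars):
    """One product-then-sum step of the sum-product algorithm in a clique.

    clique_vars: tuple of variable names in the clique (C of them)
    card:        number of values N taken by each variable
    in_vars:     variables of the input separator
    in_table:    dict, tuple of in_vars values -> number
    factor:      function(assignment dict) -> local probability
    out_vars:    variables of the output separator
    """
    out = {}
    for values in product(range(card), repeat=len(clique_vars)):   # N^C terms in total
        x = dict(zip(clique_vars, values))
        term = in_table[tuple(x[v] for v in in_vars)] * factor(x)   # product
        key = tuple(x[v] for v in out_vars)
        out[key] = out.get(key, 0.0) + term                         # sum
    return out

def p_n_given_def(x):
    """Hypothetical table for p(n | d,e,f); values are made up."""
    p1 = 0.2 + 0.2 * (x["d"] + x["e"] + x["f"])
    return p1 if x["n"] == 1 else 1.0 - p1

ones = {key: 1.0 for key in product(range(2), repeat=3)}   # input table over (e,f,n)
message = clique_step(("d", "e", "f", "n"), 2, ("e", "f", "n"), ones,
                      p_n_given_def, ("d", "e", "f"))
print(message)
```

Because the input table is all ones, the output is Σ_n p(n | d,e,f) = 1 for every setting of (d,e,f). A non-trivial input table would carry information about observations up to the next clique.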

Inference Example
[Figure: graph with b and o observed, and junction tree with cliques a,b,c; b,c,d,e; c,d,e,f; d,e,f,n; e,f,n,o; n,o,q]
• Suppose b and o are observed, and we want to find p(♦,b,o) for all variables ♦.
• Clique noq:
  • Output separator variables are n,o
  • Product: p(observation, q | n,o) = p(q | n,o)
  • Sum: p(observation | n,o) = Σ_q p(q | n,o) = 1
  • We could have skipped this step by observing that p(o | o) is always 1

Inference Example
[Figure: graph with b and o observed, and junction tree with cliques a,b,c; b,c,d,e; c,d,e,f; d,e,f,n; e,f,n,o; n,o,q]
• Clique efno:
  • Input: p(observation | n,o)
  • Product: nodes not in the output separator are moved to the left:
    p(observation,o | e,f,n) = p(observation | n,o) p(o | e,f,n)
  • Sum: over unobserved elements not in the output separator (there are none here, because o is observed):
    p(observation | e,f,n) = p(observation,o | e,f,n)
  • Output: probabilities for every setting of the output separator:
    p(observation | e,f,n)

Inference Example
[Figure: graph with b and o observed, and junction tree with cliques a,b,c; b,c,d,e; c,d,e,f; d,e,f,n; e,f,n,o; n,o,q]
• Clique defn:
  • Input: p(observation | e,f,n)
  • Product: nodes not in the output separator are moved to the left:
    p(observation,n | d,e,f) = p(observation | e,f,n) p(n | d,e,f)
  • Sum: over unobserved elements not in the output separator:
    p(observation | d,e,f) = Σ_n p(observation,n | d,e,f)
  • Output:
    p(observation | d,e,f)

Inference Example: “Propagate Down”
[Figure: graph with b observed, and junction tree with cliques a,b,c; b,c,d,e; c,d,e,f; d,e,f,n; e,f,n,o; n,o,q]
• Clique abc:
  • Product: p(a,b,c) = p(b,c|a) p(a)
  • Sum: over every unobserved variable that’s not in the output separator:
    p(b,c) = Σ_a p(a,b,c)

Inference Example: “Propagate Down”
[Figure: graph with b observed, and junction tree with cliques a,b,c; b,c,d,e; c,d,e,f; d,e,f,n; e,f,n,o; n,o,q]
• Clique bcde:
  • Product: p(b,c,d,e) = p(d,e|b,c) p(b,c)
  • Sum: over every unobserved variable that’s not in the output separator (there are none here, because b is observed):
    p(observations above, c,d,e) = p(b,c,d,e)
  • Output: a probability table of size N^3:
    p(observations above, c,d,e)

Inference Example: “Propagate Down”
[Figure: graph with b observed, and junction tree with cliques a,b,c; b,c,d,e; c,d,e,f; d,e,f,n; e,f,n,o; n,o,q]
• Clique cdef:
  • Input: p(observations above, c,d,e)
  • Product: p(observations above, c,d,e,f) = p(f|c,d,e) p(observations above, c,d,e)
  • Sum: over every unobserved variable that’s not in the output separator:
    p(observations above, d,e,f) = Σ_c p(observations above, c,d,e,f)
  • Output: a probability table of size N^3:
    p(observations above, d,e,f)

… and so on…

Computational Complexity
• Complexity of inference is O{N^C}, where N is the number of values each node takes, and C is the number of nodes in the largest clique.
• Actually, complexity is O{max Π_i N_i}, where the ith variable in the clique is defined in the range 1 ≤ v_i ≤ N_i, and max finds the maximum of this product over all cliques
• Therefore, a triangulation algorithm should minimize the size of the largest clique.
• Unfortunately, automatic minimum-maximum-clique triangulation is NP-hard. Good approximate algorithms exist, but…
• Humans are better at this than machines: design your graph with small cliques.
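The cost measure max Π_i N_i is simple to evaluate for a candidate junction tree. The sketch below assumes the cliques of the nine-node example and a hypothetical cardinality of 10 values per variable; the function name is also an assumption.

```python
from math import prod

def largest_clique_table(cliques, card):
    """Return max over cliques of the product of the cardinalities N_i."""
    return max(prod(card[v] for v in clique) for clique in cliques)

cliques = ["abc", "bcde", "cdef", "defn", "efno", "nop"]
card = {v: 10 for v in "abcdefnop"}          # hypothetical: every variable takes 10 values
cost = largest_clique_table(cliques, card)
print(cost)
```

With four variables in the largest clique and N = 10, the dominant table has N^4 = 10000 entries, compared with 10^9 entries for the full joint over nine variables.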

Factorial HMM (FHMM)
[Figure: factorial HMM with hidden chains q_1, …, q_T and v_1, …, v_T, and observations x_1, …, x_T]
• Factorial HMM:
  • q_t and v_t represent two different types of background information, each with its own history
  • Observations x_t depend on both hidden processes
• Model parameters:
  • p(v_{t+1}|v_t), p(q_{t+1}|q_t), p(x_t|q_t,v_t)
• Computational Complexity of Sum-Product Algorithm:
  • O{N^4 T} using “parent-merger” triangulation
  • O{N^3 T} using a better triangulation (five slides from now)

Example: Speech in Music (Deoras and Hasegawa-Johnson, ICSLP 2004)
[Figure: factorial HMM with hidden chains q_1, …, q_T and v_1, …, v_T, and observations x_1, …, x_T]
• q_t = one person speaking
  • Speech log spectrum given by p_s(y_t(e^{jω})|q_t) = mixture Gaussian
• v_t = music playing in the background
  • Music log spectrum given by p_m(z_t(e^{jω})|v_t) = mixture Gaussian
• Observed log spectrum = max(speech, music)
  • x_t(e^{jω}) ≈ max(y_t(e^{jω}), z_t(e^{jω})), in the sense that x_t(e^{jω}) ≥ max(y_t(e^{jω}), z_t(e^{jω})) ≥ x_t(e^{jω}) − 6 dB
  • p(x_t|q_t,v_t) = p_s(x_t|q_t) ∫_{−∞}^{x_t} p_m(z|v_t) dz + p_m(x_t|v_t) ∫_{−∞}^{x_t} p_s(y|q_t) dy

AVSR: The Boltzmann Zipper (Hennecke, Stork, and Prasad, 1996)
[Figure: DBN with four layers: video observations y_1, …, y_T; viseme states v_1, …, v_T; audio phoneme states q_1, …, q_T; audio spectral observations x_1, …, x_T]
• Same as AVSR model from last time, except that now v_t has memory, independent of q_t. Model parameters:
  • p(q_{t+1}|q_t), p(x_t|q_t)
  • p(v_{t+1}|v_t,q_{t+1}), p(y_t|q_t)
• Sum-Product algorithm: O{N^3 T}, just like FHMM
• The extra observations add complexity only of O{T}

AVSR: The Coupled HMM (Chu and Huang, 2000)
[Figure: DBN with four layers: video observations y_1, …, y_T; viseme states v_1, …, v_T; audio phoneme states q_1, …, q_T; audio spectral observations x_1, …, x_T]
• Advantage over Boltzmann Zipper: more flexible, because neither vision nor sound is “privileged” over the other.
  • p(q_{t+1}|v_t,q_t), p(x_t|q_t)
  • p(v_{t+1}|v_t,q_t), p(y_t|q_t)
• Disadvantage: can’t be triangulated like FHMM, so complexity is O{N^4 T} rather than O{N^3 T}

Inference using Parent Merger
[Figure: factorial HMM with hidden chains q_1, …, q_T and v_1, …, v_T, and observations x_1, …, x_T]
• N_t = observed non-descendants of (q_t,v_t) = {x_1,…,x_{t-1}}
• D_t = observed descendants of (q_t,v_t) = {x_t,…,x_T}
• Forward algorithm:
  • p(N_{t+1},q_{t+1},v_{t+1}) = Σ_{q_t} Σ_{v_t} p(x_t|q_t,v_t) p(q_{t+1}|q_t) p(v_{t+1}|v_t) p(N_t,q_t,v_t)
• Backward algorithm:
  • p(D_t|q_t,v_t) = p(x_t|q_t,v_t) Σ_{q_{t+1}} Σ_{v_{t+1}} p(q_{t+1}|q_t) p(v_{t+1}|v_t) p(D_{t+1}|q_{t+1},v_{t+1})
• Complexity:
  • (T frames) × (N^2 sums/frame) × (N^2 terms/sum) = O{N^4 T}

A Smarter Triangulation
[Figure: factorial HMM with hidden chains q_1, …, q_T and v_1, …, v_T, and observations x_1, …, x_T]
• Forward Algorithm, step 1:
  • p(q_{t+1},v_t,N_{t+1}) = Σ_{q_t} p(x_t|q_t,v_t) p(q_{t+1}|q_t) p(q_t,v_t,N_t)

A Smarter Triangulation
[Figure: factorial HMM with hidden chains q_1, …, q_T and v_1, …, v_T, and observations x_1, …, x_T]
• Forward Algorithm, step 1:
  • p(q_{t+1},v_t,N_{t+1}) = Σ_{q_t} p(x_t|q_t,v_t) p(q_{t+1}|q_t) p(q_t,v_t,N_t)
• Forward Algorithm, step 2:
  • p(q_{t+1},v_{t+1},N_{t+1}) = Σ_{v_t} p(v_{t+1}|v_t) p(q_{t+1},v_t,N_{t+1})
• Computational Complexity:
  • (T frames) × (2N^2 sums/frame) × (N terms/sum) = O{N^3 T}
• Complexity is N times higher than that of a one-stream HMM
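The parent-merger forward recursion and the two-step recursion compute the same quantity, which can be checked with a small sketch. The transition tables, the uniform prior, and the stand-in observation likelihood `obs` below are all made-up values for illustration; only the two recursions themselves come from the slides.

```python
from itertools import product

N, T = 3, 5
states = range(N)

def normalize(row):
    s = sum(row)
    return [r / s for r in row]

# Hypothetical parameters (any valid tables would do).
Aq = [normalize([1 + (i + 2 * j) % 3 for j in states]) for i in states]  # Aq[i][j] = p(q_{t+1}=j | q_t=i)
Av = [normalize([1 + (2 * i + j) % 4 for j in states]) for i in states]  # Av[i][j] = p(v_{t+1}=j | v_t=i)
prior = {(q, v): 1.0 / N ** 2 for q in states for v in states}           # p(q_1, v_1)

def obs(t, q, v):
    """Stand-in for p(x_t | q_t=q, v_t=v) evaluated at the observed x_t."""
    return 0.1 + ((q + 2 * v + t) % 5) / 10.0

def forward_merged():
    """Parent merger: N^2 sums per frame, N^2 terms per sum."""
    alpha = dict(prior)
    for t in range(1, T + 1):
        alpha = {(q2, v2): sum(obs(t, q, v) * Aq[q][q2] * Av[v][v2] * alpha[(q, v)]
                               for q, v in product(states, states))
                 for q2, v2 in product(states, states)}
    return alpha

def forward_two_step():
    """Smarter triangulation: 2 N^2 sums per frame, N terms per sum."""
    alpha = dict(prior)
    for t in range(1, T + 1):
        # Step 1: sum out q_t, giving a table over (q_{t+1}, v_t).
        mid = {(q2, v): sum(obs(t, q, v) * Aq[q][q2] * alpha[(q, v)] for q in states)
               for q2, v in product(states, states)}
        # Step 2: sum out v_t, giving a table over (q_{t+1}, v_{t+1}).
        alpha = {(q2, v2): sum(Av[v][v2] * mid[(q2, v)] for v in states)
                 for q2, v2 in product(states, states)}
    return alpha

alpha_merged = forward_merged()
alpha_two_step = forward_two_step()
print(sum(alpha_merged.values()), sum(alpha_two_step.values()))
```

Both functions return the table p(N_{T+1}, q_{T+1}, v_{T+1}), and summing it over (q, v) gives the likelihood of the whole observation sequence. The two-step version does less work per frame because each sum runs over one hidden variable instead of two.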

“Compiling” an FHMM into an HMM
[Figure: chain of junction states (q_t,v_t), (q_{t+1},v_t), (q_{t+1},v_{t+1}), with observations x_t and x_{t+1}]
• Purpose: GMTK (Bilmes and Zweig, ICASSP 2002) can implement FHMM directly, but by compiling FHMM to HMM, we can also use HTK (Young, Evermann, Hain et al., 2002) and other software tools
• Method:
  • Each state specifies the variables in the output separator of one clique, e.g., (q_{t+1},v_t) is the separator between cliques (q_t,q_{t+1},v_t) and (q_{t+1},v_t,v_{t+1}).
  • Transition probability matrix p(q_{t+1},v_t|q_t,v_t) is N^2 × N^2, but only N^3 entries can be non-zero, thus complexity is O{N^3}

A Note on Parameter Tying
[Figure: chain of junction states (q_t,v_t), (q_{t+1},v_t), (q_{t+1},v_{t+1}), with observations x_t and x_{t+1}]
• Transition probability table specifies p(separator|separator), e.g., p(q_{t+1},v_t | q_t,v_t)
• The only non-zero entries are those in which the shared variable (here v_t) keeps its value; each such entry is the probability of the variable that differs between separators, e.g., p(q_{t+1} | q_t,v_t)
• With “parameter tying,” we can constrain different elements of the transition matrix to equal one another, thus forcing the model to match the condition p(q_{t+1}|q_t,v_t) = p(q_{t+1}|q_t).
• If the two chains are known to be truly independent, e.g., speech and background music, parameter tying may help to avoid over-training the model.
• If the two chains are possibly dependent, allow the full transition matrix p(q_{t+1}|q_t,v_t): the result is a Boltzmann zipper.
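The sparsity and tying of the compiled transition table can be made concrete with a small sketch. It builds the N^2 × N^2 table p(q_{t+1},v_t' | q_t,v_t) for a hypothetical N = 3 and a made-up p(q_{t+1}|q_t), stored as a dictionary (a layout chosen here for illustration, not an HTK or GMTK format).

```python
from itertools import product

N = 3
# Hypothetical p(q_{t+1} | q_t); each row sums to 1.
Aq = [[(1 + (i + j) % N) / sum(1 + (i + k) % N for k in range(N)) for j in range(N)]
      for i in range(N)]

def compiled_transition(Aq, N):
    """Table p(q_{t+1}, v_t' | q_t, v_t), keyed by ((q_t, v_t), (q_{t+1}, v_t')).

    An entry is non-zero only when v_t' == v_t. With parameter tying, its
    value Aq[q_t][q_{t+1}] does not depend on v_t.
    """
    table = {}
    for q, v, q2, v2 in product(range(N), repeat=4):
        table[(q, v), (q2, v2)] = Aq[q][q2] if v2 == v else 0.0
    return table

table = compiled_transition(Aq, N)
nonzero = sum(1 for p in table.values() if p > 0.0)
print(len(table), nonzero)
```

The table has N^4 = 81 entries, of which N^3 = 27 are non-zero, and with tying only the N^2 = 9 values in Aq are free parameters. Dropping the tying would mean replacing Aq[q][q2] with a table that also depends on v.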

“Compiling” an FHMM into an HMM
• In order to handle non-emitting states, we need a total of 2N^2 junction states: N^2 emitting, N^2 non-emitting
• Finite State Diagram looks like this (NOT A DBN; this is here to help you design the HTK configuration, if desired):
[Figure: finite state diagram for N = 2. Emitting states (q_t,v_t): 1,1; 1,2; 2,1; 2,2, with observation PDFs p(x_t | q_t=1,v_t=1), p(x_t | q_t=1,v_t=2), p(x_t | q_t=2,v_t=1), p(x_t | q_t=2,v_t=2). Non-emitting states (q_{t+1},v_t): 1,1; 1,2; 2,1; 2,2. Blue arrows: left-to-right transition. Red arrows: right-to-left transition. Black arrows: both.]
• Note: no self-loops. Emitting and non-emitting states alternate.

Graphical Models for Large-Vocabulary Speech Recognition

“Zweig Triangles” (Zweig, 1998)
[Figure: two-frame DBN with nodes w_t, wdTr_t, i_t, q_t, x_t and w_{t+1}, wdTr_{t+1}, i_{t+1}, q_{t+1}, x_{t+1}]
• w_t: word. 1 ≤ w_t ≤ N_w
  • N_w = # words in vocabulary
  • p(w_{t+1}=w_t | w_t, wdTr_t=0) = 1
  • p(w_{t+1} | w_t, wdTr_t=1) = bigram word grammar
• i_t: segment index. 1 ≤ i_t ≤ N_i
  • p(i_{t+1} | i_t, wdTr_t=0) > 0 iff i_t ≤ i_{t+1} ≤ i_t + 1
  • p(i_{t+1}=1 | i_t, wdTr_t=1) = 1
• wdTr_t: is there a word transition?
  • p(wdTr_t=0 | i_t<N_i) = 1
  • p(wdTr_t=1 | i_t=N_i) = probability word ends
• q_t: segment label, for example, q_t could equal “/aa/ state 3.”
  • p(q_t|i_t,w_t) = probability that the i_t-th phonetic segment in w_t is q_t
  • Often deterministic: p(q_t|i_t,w_t) = 1 iff q_t is the i_t-th phone of w_t
• x_t: observation
  • p(x_t|q_t) usually mixture Gaussian
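The constraints on wdTr_t and i_t can be sketched as two small probability functions. The numeric values `p_end` and `p_advance` and the function names are hypothetical; the slide fixes only which entries are zero or one. The choice to stay in the last segment when no word transition occurs is also an assumption.

```python
def p_word_transition(wd_tr, i, n_i, p_end=0.3):
    """p(wdTr_t = wd_tr | i_t = i): a word can end only from its last segment."""
    if i < n_i:
        return 1.0 if wd_tr == 0 else 0.0
    return p_end if wd_tr == 1 else 1.0 - p_end

def p_next_index(i_next, i, wd_tr, n_i, p_advance=0.4):
    """p(i_{t+1} = i_next | i_t = i, wdTr_t = wd_tr), segment indices 1..n_i."""
    if wd_tr == 1:                       # new word: restart at segment 1
        return 1.0 if i_next == 1 else 0.0
    if i == n_i:                         # assumption: stay in the last segment
        return 1.0 if i_next == i else 0.0
    if i_next == i:                      # stay in the current segment
        return 1.0 - p_advance
    if i_next == i + 1:                  # advance by exactly one segment
        return p_advance
    return 0.0

n_i = 4
for i in range(1, n_i + 1):
    for wd_tr in (0, 1):
        row = [p_next_index(j, i, wd_tr, n_i) for j in range(1, n_i + 1)]
        print(i, wd_tr, row)
```

Each printed row is a distribution over i_{t+1}. Without a word transition, probability mass sits only on i_t and i_t + 1.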

Example: Pronunciation Variability
[Figure: tract variables TB-LOC, VELUM, TT-LOC, LIP-OP, TB-OPEN, TT-OPEN, VOICING]
• Pronunciation variability (e.g., apparent deletions or substitutions of phonemes) can be parsimoniously described as resulting from asynchrony and reduction of quasi-independent tract variables such as LIP-OPENING and TONGUE-TIP-OPENING (Browman & Goldstein 1990).

A DBN Model of Articulatory Phonology for Speech Recognition (Livescu and Glass, 2004)
• word_t: word ID at frame #t
• wdTr_t: word transition?
• ind_t^i: which gesture, from the canonical word model, should articulator i be trying to implement?
• async_t^{i;j}: how asynchronous are articulators i and j?
• U_t^i: canonical setting of articulator #i
• S_t^i: surface setting of articulator #i

Summary
• Multiple parents violate conditional independence of descendants and non-descendants ⇒ sum-product fails
• A simple solution: parent merger
• A more computationally efficient solution:
  • Moralize
  • Triangulate
  • Create a junction tree
• Sum-Product Algorithm in a Junction Tree has complexity of O{N^C}, where C is the number of nodes in the largest clique
• Example: Factorial HMM
  • Applications: speech with background noise, audiovisual speech
  • Complexity: O{N^4 T} with parent merger, O{N^3 T} with triangulation
• Example: Large Vocabulary Speech Recognition
  • Zweig triangles: word grammar and phone model in one graph
  • Livescu model: a DBN for pronunciation variability
