Conditional Random Fields for ASR

Jeremy Morris, July 25, 2006

Overview • Problem Statement (Motivation) • Conditional Random Fields • Experiments • Attribute Selection • Experimental Setup • Results • Future Work

Problem Statement • Developed as part of the ASAT Project • (Automatic Speech Attribute Transcription) • Goal: Develop a system for bottom-up speech recognition using 'speech attributes'

Speech Attributes? • Any information that could be useful for recognizing the spoken language • Phonetic attributes • Speaker attributes (gender, age, etc.) • Any other useful attributes that could be used for speech recognition • Note that there is no guarantee that attributes will be independent of each other • One part of this project is to explore ways to create a framework for easily combining new features for experimental purposes • Examples: /d/ (manner: stop, place of articulation: dental, voicing: voiced); /t/ (manner: stop, place of articulation: dental, voicing: unvoiced); /iy/ (height: high, backness: front, roundness: nonround)

Evidence Combination • Two basic ways to build hypotheses • Top Down: generate a hypothesis, then see if the data fits the hypothesis • Bottom Up: examine the data, then search for a hypothesis that fits • [Diagram: a hypothesis (hyp) drawn above the data for each approach]

Top Down • Traditional Automatic Speech Recognition (ASR) systems use a top-down approach (HMMs) • Hypothesis is the phone we are predicting • Data is some encoding of the acoustic speech signal • A likelihood of the signal given the phone label is learned from data • A prior probability for the phone label is learned from the data • These are combined through Bayes' rule to give us the posterior probability • [Diagram: phone /iy/ above observation X, labelled with the prior P(/iy/) and the likelihood P(X|/iy/)]
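The combination step on this slide can be made concrete with a small sketch. The phone set and all probability values below are hypothetical numbers chosen for illustration, not figures from the experiments.

```python
def posterior(likelihoods, priors):
    """Combine P(X | phone) and P(phone) via Bayes' rule into P(phone | X)."""
    joint = {phone: likelihoods[phone] * priors[phone] for phone in priors}
    evidence = sum(joint.values())  # P(X), the normalizing constant
    return {phone: p / evidence for phone, p in joint.items()}


# Hypothetical values for a single observed frame X.
likelihoods = {"/iy/": 0.30, "/k/": 0.05, "/t/": 0.05}  # P(X | phone)
priors = {"/iy/": 0.50, "/k/": 0.25, "/t/": 0.25}       # P(phone)

post = posterior(likelihoods, priors)
best = max(post, key=post.get)  # phone with the highest posterior
```

The top-down system never estimates the posterior directly; it only ever learns the two factors on the right-hand side of Bayes' rule.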

Bottom Up • Bottom-up models have the same high-level goal – determine the label from the observation • But instead of a likelihood, the posterior probability is learned from the data • Neural networks have been used to learn these probabilities • [Diagram: observation X below phone /iy/, labelled with the posterior P(/iy/|X)]

Speech is a Sequence • Speech is not a single, independent event • It is a combination of multiple events over time • A model to recognize spoken language should take into account dependencies across time • [Diagram: label sequence /k/ /k/ /iy/ /iy/ /iy/]

Speech is a Sequence • A top-down (generative) model can be extended into a time sequence as a Hidden Markov Model (HMM) • Now our likelihood of the data is over the entire sequence instead of a single phone • [Diagram: label sequence /k/ /k/ /iy/ /iy/ /iy/ above observations X X X X X]
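As a rough illustration of a likelihood "over the entire sequence", the forward algorithm below sums over every possible label path of a toy HMM. The two-phone state set, the discrete observation symbols "a" and "b", and all probabilities are assumptions made for this sketch; a real recognizer would use continuous acoustic features.

```python
def forward_likelihood(obs, states, start, trans, emit):
    """Return P(observation sequence) under an HMM, summed over all state paths."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model: two phone states, two discrete observation symbols.
STATES = ["/k/", "/iy/"]
START = {"/k/": 0.8, "/iy/": 0.2}
TRANS = {"/k/": {"/k/": 0.6, "/iy/": 0.4}, "/iy/": {"/k/": 0.1, "/iy/": 0.9}}
EMIT = {"/k/": {"a": 0.9, "b": 0.1}, "/iy/": {"a": 0.2, "b": 0.8}}

likelihood = forward_likelihood(["a", "b", "b"], STATES, START, TRANS, EMIT)
```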

Speech is a Sequence • Tandem is a method for using bottom-up (discriminative) evidence • The hypothesis output of a neural network is used to train an HMM • Not a pure discriminative method, but a combination of generative and discriminative methods • [Diagram: labels /k/ /iy/ /iy/, intermediate variables Y Y Y, observations X X X]

Bottom-up Modelling • The idea is to have a system that combines evidence layer by layer • Speech attributes contribute to phone attribute detection • Phone attributes contribute to “syllable” attribute detection, and so on • Each layer combines information from previous layers to form its hypotheses • We want to do this probabilistically – no hard decisions • Note that there is no guarantee of independence among the observed speech features – in fact, they are often very dependent.

Conditional Random Fields • A form of discriminative modelling • Has been used successfully in various domains such as part-of-speech tagging and other Natural Language Processing tasks • Processes evidence bottom-up • Combines multiple features of the data • Builds the probability P(sequence | data)

Conditional Random Fields • CRFs are based on the idea of Markov Random Fields • Modelled as an undirected graph connecting labels with observations • Observations in a CRF are not modelled as random variables; the model conditions on them • State functions help determine the identity of the state • Transition functions add associations between transitions from one label to another • [Diagram: label chain /k/ /k/ /iy/ /iy/ /iy/ above observations X X X X X]

Conditional Random Fields • [Equation on slide: the CRF distribution P(y|x), an exponential of weighted state feature functions f and transition feature functions g, annotated as follows] • State feature function f([x is stop], /t/): one possible state feature function for our attributes and labels • State feature weight λ = 10: one possible weight value for this state feature (strong) • Transition feature function g(x, /iy/, /k/): one possible transition feature function, indicating /k/ followed by /iy/ • Transition feature weight μ = 4: one possible weight value for this transition feature • The Hammersley-Clifford theorem states that a random field is an MRF iff it can be described in the above form • The exponent is the sum of the clique potentials of the undirected graph
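For reference, the form in question is presumably the standard linear-chain CRF distribution. The version below is a reconstruction using the slide's own symbols (state features f with weights λ, transition features g with weights μ). The indices i, j, and t, and the argument order of g (current label first, previous label second, matching g(x, /iy/, /k/)), are assumptions.

```latex
P(\mathbf{y} \mid \mathbf{x}) =
  \frac{1}{Z(\mathbf{x})}
  \exp\left( \sum_{t} \left[
      \sum_{i} \lambda_i \, f_i(\mathbf{x}, y_t, t)
    + \sum_{j} \mu_j \, g_j(\mathbf{x}, y_t, y_{t-1}, t)
  \right] \right),
\qquad
Z(\mathbf{x}) =
  \sum_{\mathbf{y}'}
  \exp\left( \sum_{t} \left[
      \sum_{i} \lambda_i \, f_i(\mathbf{x}, y'_t, t)
    + \sum_{j} \mu_j \, g_j(\mathbf{x}, y'_t, y'_{t-1}, t)
  \right] \right)
```

Here Z(x) sums over every candidate label sequence y', so the probabilities of all label sequences for a given observation sequence add up to one.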

Conditional Random Fields • Conceptual Overview • Each attribute of the data we are trying to model fits into a feature function that associates the attribute and a possible label • A positive value if the attribute appears in the data • A zero value if the attribute is not in the data • Each feature function carries a weight that gives the strength of that feature function for the proposed label • High positive weights indicate a good association between the feature and the proposed label • High negative weights indicate a negative association between the feature and the proposed label • Weights close to zero indicate the feature has little or no impact on the identity of the label
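The conceptual overview can be sketched as a tiny scoring routine: every (attribute, label) pair and every (previous label, label) pair carries a weight, the weighted feature values are summed along the sequence, and the result is exponentiated and normalized. The weights, attribute names, and frames below are hypothetical, and the normalizer is computed by brute-force enumeration, which is only feasible for toy inputs.

```python
import itertools
import math

LABELS = ["/k/", "/iy/"]

# Hypothetical weights: (attribute, label) for state features,
# (previous label, label) for transition features.
STATE_W = {("stop", "/k/"): 2.0, ("stop", "/iy/"): -1.5,
           ("vowel", "/iy/"): 2.0, ("vowel", "/k/"): -1.5}
TRANS_W = {("/k/", "/k/"): 0.5, ("/k/", "/iy/"): 1.0,
           ("/iy/", "/iy/"): 0.5, ("/iy/", "/k/"): -1.0}


def score(frames, labels):
    """Sum of weight * feature value over all state and transition features."""
    total = 0.0
    for t, (frame, label) in enumerate(zip(frames, labels)):
        for attr, value in frame.items():
            # A zero value means the attribute is absent and contributes nothing.
            total += STATE_W.get((attr, label), 0.0) * value
        if t > 0:
            total += TRANS_W.get((labels[t - 1], label), 0.0)
    return total


def sequence_probability(frames, labels):
    """P(labels | frames): exponentiated score, normalized over all sequences."""
    z = sum(math.exp(score(frames, cand))
            for cand in itertools.product(LABELS, repeat=len(frames)))
    return math.exp(score(frames, labels)) / z


frames = [{"stop": 1.0, "vowel": 0.0},
          {"stop": 0.0, "vowel": 1.0},
          {"stop": 0.0, "vowel": 1.0}]
best = max(itertools.product(LABELS, repeat=len(frames)),
           key=lambda cand: score(frames, cand))
```

Positive weights raise the score of sequences that pair a label with a matching attribute, negative weights lower it, and a missing weight behaves like a weight of zero.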

Experiments • Goal: Implement a Conditional Random Field Model on ASAT-style data • Perform phone recognition • Compare results to those obtained via a Tandem system • Experimental Data • TIMIT read speech corpus • Moderate-sized corpus of clean, prompted speech, complete with phonetic-level transcriptions

Attribute Selection • Attribute Detectors • ICSI QuickNet Neural Networks • Two different types of attributes • Phonological feature detectors • Place, Manner, Voicing, Vowel Height, Backness, etc. • Features are grouped into eight classes, with each class having a variable number of possible values based on the IPA phonetic chart • Phone detectors • Neural networks output based on the phone labels – one output per label • Classifiers were applied to 2960 utterances from the TIMIT training set

Experimental Setup • Code built on the Java CRF toolkit on SourceForge • http://crf.sourceforge.net • Performs training to maximize the log-likelihood of the training set with respect to the model • Uses a limited-memory BFGS (L-BFGS) algorithm to maximize the log-likelihood, driving its gradient to zero • For CRF models, maximizing the log-likelihood of the empirical distribution of the data as predicted by the model is the same as maximizing the entropy (Berger et al.)
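To make the training objective concrete, here is a minimal sketch of the log-likelihood and its gradient (observed feature counts minus expected feature counts) for a toy linear-chain CRF. It is not the toolkit's code: plain gradient ascent stands in for L-BFGS, the normalizer is computed by brute-force enumeration, and the data, attribute names, step size, and iteration count are all hypothetical.

```python
import itertools
import math

LABELS = ["/k/", "/iy/"]


def features(frames, labels):
    """Feature vector (as a dict) for one labelled sequence."""
    feats = {}
    for t, (frame, label) in enumerate(zip(frames, labels)):
        for attr, value in frame.items():
            key = ("state", attr, label)
            feats[key] = feats.get(key, 0.0) + value
        if t > 0:
            key = ("trans", labels[t - 1], label)
            feats[key] = feats.get(key, 0.0) + 1.0
    return feats


def dot(weights, feats):
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())


def log_likelihood_and_gradient(weights, data):
    ll, grad = 0.0, {}
    for frames, gold in data:
        cands = [features(frames, c)
                 for c in itertools.product(LABELS, repeat=len(frames))]
        scores = [dot(weights, f) for f in cands]
        top = max(scores)  # log-sum-exp trick for numerical stability
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        gold_feats = features(frames, gold)
        ll += dot(weights, gold_feats) - log_z
        for k, v in gold_feats.items():   # observed feature counts
            grad[k] = grad.get(k, 0.0) + v
        for f, s in zip(cands, scores):   # minus expected feature counts
            p = math.exp(s - log_z)
            for k, v in f.items():
                grad[k] = grad.get(k, 0.0) - p * v
    return ll, grad


def train(data, steps=100, rate=0.1):
    """Plain gradient ascent (a stand-in for L-BFGS in this sketch)."""
    weights, history = {}, []
    for _ in range(steps):
        ll, grad = log_likelihood_and_gradient(weights, data)
        history.append(ll)
        for k, g in grad.items():
            weights[k] = weights.get(k, 0.0) + rate * g
    return weights, history


# Hypothetical training data: (frames of detector outputs, gold labels).
data = [
    ([{"stop": 0.9, "vowel": 0.1}, {"stop": 0.2, "vowel": 0.8}],
     ("/k/", "/iy/")),
    ([{"stop": 0.8, "vowel": 0.2}, {"stop": 0.7, "vowel": 0.3},
      {"stop": 0.1, "vowel": 0.9}],
     ("/k/", "/k/", "/iy/")),
]
weights, history = train(data)
```

The gradient vanishes exactly when the expected feature counts under the model match the counts observed in the training data, which is the constraint that links maximum likelihood to the maximum-entropy view. A quasi-Newton method such as L-BFGS uses the same objective and gradient but typically needs far fewer iterations.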

Experimental Setup • Outputs from the neural nets are themselves treated as feature functions for the observed sequence – each attribute/label combination gives us a value for one feature function • Note that this makes the feature functions non-binary (real-valued).
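A minimal sketch of this mapping, assuming one frame of detector outputs is available as a dictionary; the attribute names and output values are hypothetical.

```python
def frame_features(detector_outputs, label):
    """Pair every detector output with the proposed label.

    The feature value is the detector's real-valued output, not a 0/1 flag.
    """
    return {(attr, label): value for attr, value in detector_outputs.items()}


def feature_space(attributes, labels):
    """All attribute/label combinations: one feature function (and weight) each."""
    return [(attr, label) for attr in attributes for label in labels]


# Hypothetical neural-network outputs for one frame.
outputs = {"manner=stop": 0.92, "voicing=voiced": 0.15, "height=high": 0.03}
feats = frame_features(outputs, "/t/")
space = feature_space(outputs.keys(), ["/t/", "/d/", "/iy/"])
```

With A attribute outputs and L labels, this gives A × L state feature functions, each needing its own weight.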

Results

Future Work • More features • What kinds of features can we add to improve our transitions? • Tuning • The HMM model has parameters that can be tuned for better performance – can we tweak the CRF to do something similar? • Word recognition • How does this model do at the full word recognition level, instead of just phones? • Other corpora • Can we extend this method beyond TIMIT to different types of corpora? (e.g. WSJ)
