Natural Language Understanding

Natural Language Understanding Raivydas Simenas

Overview • History • Speech Recognition • Natural Language Understanding • statistical methods to resolve ambiguities • Current situation

History • Roots in teaching the deaf to speak using “visible speech” • 1874: Alexander Graham Bell’s invention of the harmonic telegraph • Harmonics of different frequencies in an electrical signal could be separated • Could send multiple messages over the same wire at the same time • 1940s: separating the speech signal into different frequency components using the spectrogram • 1950s: the beginning of computer use for automatic speech recognition

The Nature of Speech • Phoneme – a basic sound, e.g. a vowel • The complexity of the human vocal apparatus: about 18 phonemes per second • Speech viewed as a sound wave • Identifying sounds: decomposing the sound wave into its frequency components

The Spectrogram I • A visual representation of speech which contains all the salient information • Plots the amount of energy at different frequencies against time • Discontinuous speech (making a pause after each word) – easier to recognize on the spectrogram
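The idea of plotting energy at different frequencies against time can be sketched in a few lines. This is only an illustration under simplifying assumptions: it uses a naive discrete Fourier transform on non-overlapping frames with no window function, and the names `dft_energy`, `spectrogram`, and the synthetic test signal are hypothetical, not taken from the slides.

```python
import cmath
import math


def dft_energy(frame):
    """Return the energy |X[k]|^2 in each non-negative frequency bin of one frame."""
    n = len(frame)
    energies = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        energies.append(abs(coeff) ** 2)
    return energies


def spectrogram(signal, frame_len, hop):
    """Return one row of bin energies per frame; successive rows are successive times."""
    return [dft_energy(signal[start:start + frame_len])
            for start in range(0, len(signal) - frame_len + 1, hop)]


# Hypothetical test signal: a low tone (8 cycles per frame) followed by
# a higher tone (16 cycles per frame).
frame_len = 64
low = [math.sin(2 * math.pi * 8 * t / frame_len) for t in range(128)]
high = [math.sin(2 * math.pi * 16 * t / frame_len) for t in range(128)]
rows = spectrogram(low + high, frame_len, hop=64)

# The strongest frequency bin in each frame shows the change of tone over time.
peaks = [max(range(len(row)), key=row.__getitem__) for row in rows]
print(peaks)
```

Real systems typically use overlapping, windowed frames and a fast Fourier transform; the sketch keeps only the core idea of a time-by-frequency energy grid.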

The Spectrogram II • The same word uttered twice (especially by different speakers – speaker independence) might look radically different on a spectrogram • The need to recognize invariant features in a spectrogram • Formants: resonant frequencies sustained for a short time period in pronouncing a vowel • Normalization: distinguishing between relevant and irrelevant information • Nonlinear time compression: compensating for variations in speaking rate • Matching a spoken word to a template
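One common way to combine nonlinear time alignment with template matching is dynamic time warping (DTW). The slides do not name DTW, so treat the following as a sketch of one possible technique rather than the method the author had in mind. The feature sequences are plain numbers for brevity (real systems would compare feature vectors), and the templates and function names are hypothetical.

```python
def dtw_distance(a, b):
    """Cost of the best nonlinear alignment between two feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Stretch a, stretch b, or advance both sequences together.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_template(utterance, templates):
    """Return the name of the stored template closest to the utterance."""
    return min(templates, key=lambda name: dtw_distance(utterance, templates[name]))


# Hypothetical one-dimensional "feature tracks" for two stored words.
templates = {"yes": [1, 3, 4, 3, 1], "no": [4, 2, 1, 2, 4]}
# The same contour as "yes", spoken at half speed.
slow_yes = [1, 1, 3, 3, 4, 4, 3, 3, 1, 1]
print(best_template(slow_yes, templates))
```

Because the alignment may repeat elements of either sequence, the slowed-down utterance matches its template with zero cost even though the two sequences differ in length.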

Robust Speech Recognition • Need to maintain accuracy when the quality of the input speech is degraded or when the speech characteristics differ due to a change in environment or speaker • Dynamic parameter adaptation: alter either the input signal or the internally stored representations • Optimal parameter estimation: based on a statistical model characterizing the differences between training and test sets • Empirical feature comparison: based on comparison between high-quality speech and the same speech recorded under degraded conditions

Stochastic Methods in Speech Recognition • Generating the sequence of word hypotheses for an acoustic signal is most often done using statistics • The process: • A sequence of acoustic signals is represented using a collection of vectors • Such collections are used to build acoustic word models, which consist of probabilities of certain sequences of vectors representing a word • Acoustic word models utilize Markov chains

Representing Sentences • Syntactic form: indicates the way the words are related to each other in a sentence • Logical form: identifying the semantic relationships between words based solely on the knowledge of the language (independently of the situation) • Final meaning representation: mapping the information from the syntactic and logical form into knowledge representation • System uses knowledge representation to represent and reason about its application domain

Parsing a Sentence • Parsing – determining the structure of the sentence according to the grammar • Tree representation of a sentence • Transition network grammars • Start at the initial node • Can traverse an arc only if it is labeled with the category of the current word
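The traversal rule for a transition network can be illustrated with a toy recognizer. The network, the lexicon, and the names `ARCS`, `FINAL`, `LEXICON`, and `accepts` are all hypothetical. The sketch also assumes each word has exactly one lexical category, so no backtracking is needed; a real transition network parser would have to handle ambiguous words and build a tree.

```python
# Hypothetical network for sentences of the form ART ADJ* N V.
# Each arc maps (current node, word category) to the next node.
ARCS = {
    ("start", "ART"): "np1",
    ("np1", "ADJ"): "np1",
    ("np1", "N"): "np2",
    ("np2", "V"): "end",
}
FINAL = {"end"}

# Hypothetical lexicon with one category per word.
LEXICON = {
    "the": "ART", "a": "ART", "old": "ADJ",
    "dog": "N", "man": "N", "barked": "V", "cried": "V",
}


def accepts(words):
    """Return True if the word sequence leads from the start node to a final node."""
    node = "start"
    for word in words:
        category = LEXICON.get(word)
        # An arc can be traversed only if its label matches the word's category.
        node = ARCS.get((node, category))
        if node is None:
            return False
    return node in FINAL


print(accepts("the old dog barked".split()))
print(accepts("dog the barked".split()))
```

The first sentence follows the arcs ART, ADJ, N, V and ends in a final node; the second fails immediately because no arc labeled N leaves the initial node.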

Stochastic Methods for Ambiguity Resolution I • Some sentences can be parsed in many different ways, e.g. “time flies like an arrow” • The most popular method for resolving such ambiguity is based on statistics • Some facts from probability theory • The concept of the random variable, e.g. the lexical category of “flies” • Probability function • assigns a probability to every possible value of the random variable, e.g. 0.3 for “flies” being a noun, 0.7 for its being a verb • conditional probability functions (Pr(A|B)), e.g. the probability of a verb occurring given that a noun has already occurred

Stochastic Methods for Ambiguity Resolution II • Probabilities are used to predict future events given some data about the past • Maximum likelihood estimator (MLE) • Probability of X happening in the future = number of cases of X happening in the past / total number of events in the past • Works well only if X occurred often, not very useful for low-frequency events • Expected likelihood estimator (ELE) • Probability of X happening in the future = f(number of cases of X happening in the past) / Sum(f(number of cases of each event happening in the past)), e.g. if f(x) = x + 0.5 and the observed values are 0.4 for X and 0.6 for Y, then ELE(X) = (0.4 + 0.5)/((0.4 + 0.5) + (0.6 + 0.5)) = 0.45 • MLE is a special case of ELE, i.e. for MLE f(x) = x • Given a large amount of text, one can use MLE or ELE to estimate the lexical category of an ambiguous word, e.g. the word “flies”
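The two estimators differ only in the function applied to each count, which a short sketch makes concrete. The counts below are hypothetical (chosen so that the MLE reproduces the 0.3 and 0.7 figures used earlier for “flies”), and `estimate` is an illustrative name.

```python
def estimate(counts, f=lambda c: c):
    """Estimate probabilities as f(count) / sum of f(count) over all events.

    With the identity function this is the MLE; with f(c) = c + 0.5 it is
    the ELE described above.
    """
    total = sum(f(c) for c in counts.values())
    return {event: f(c) / total for event, c in counts.items()}


# Hypothetical counts of the lexical category of "flies" in a small corpus.
# ADJ was never observed.
counts = {"N": 3, "V": 7, "ADJ": 0}

mle = estimate(counts)
ele = estimate(counts, f=lambda c: c + 0.5)

print(mle)
print(ele)
```

The MLE assigns probability zero to the unseen category, while the ELE reserves a small share for it, which is why it behaves better for low-frequency events.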

Stochastic Methods for Ambiguity Resolution III • Always choosing the interpretation that occurs most frequently in the training set achieves about a 90% success rate on average (not good) • Some of the local context should be used to determine the lexical category of a word • Ideally, for a sequence of words w1,w2,…,wn we want a lexical category sequence c1,c2,…,cn which maximizes the probability of the right interpretation • In practice, approximations of such probabilities are made

Stochastic Methods for Ambiguity Resolution IV • n-gram models • Look at the probability of a lexical category C_i which follows the sequence of lexical categories C_{i-1}, C_{i-2}, …, C_{i-n+1} • Probability of c1,c2,…,ck occurring is approximately the product of the n-gram probabilities for each word, e.g. the probability of a sequence ART, N, V is 0.71*1*0.43 = 0.3053 • In practice, bigram or trigram models are used most often • The models capturing this concept are called Hidden Markov Models
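The product in the ART, N, V example can be reproduced with a small bigram table. The slide gives only the three factors, so the sketch assumes they stand for Pr(ART | start of sentence), Pr(N | ART), and Pr(V | N); the table `BIGRAM`, the `<start>` marker, and the function name are hypothetical.

```python
# Assumed reading of the three factors in the example above.
BIGRAM = {
    ("<start>", "ART"): 0.71,
    ("ART", "N"): 1.0,
    ("N", "V"): 0.43,
}


def sequence_probability(categories, bigram, start="<start>"):
    """Approximate Pr(c1, ..., ck) as the product of bigram probabilities."""
    prob = 1.0
    prev = start
    for cat in categories:
        # Unseen category pairs get probability 0 in this simple sketch.
        prob *= bigram.get((prev, cat), 0.0)
        prev = cat
    return prob


p = sequence_probability(["ART", "N", "V"], BIGRAM)
print(round(p, 4))  # 0.3053
```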

Stochastic Methods for Ambiguity Resolution V • In order to determine the most likely interpretation of a given sequence of n words, we want to maximize the value of Pr(c1,c2,…,cn | w1,w2,…,wn) • The Viterbi algorithm • Given k lexical categories, the total number of possibilities to consider for a sequence of n words is k^n • The Viterbi algorithm reduces this number to const*n*k^2
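A minimal sketch of the Viterbi algorithm for a bigram model follows. It assumes the usual Hidden Markov Model formulation, in which each step multiplies a category transition probability by the probability of the word given its category; the slides do not spell out that second factor, so it is an assumption here. All probabilities, the `<start>` marker, and the names `viterbi`, `trans`, and `emit` are hypothetical.

```python
def viterbi(words, categories, trans, emit, start="<start>"):
    """Return (probability, category sequence) of the most likely tagging.

    trans[(prev, cat)] is the bigram probability Pr(cat | prev);
    emit[(word, cat)] is the assumed probability Pr(word | cat).
    """
    # best[c] = (probability, path) of the best sequence so far ending in c
    best = {
        c: (trans.get((start, c), 0.0) * emit.get((words[0], c), 0.0), [c])
        for c in categories
    }
    for word in words[1:]:
        new_best = {}
        for c in categories:
            # Keep only the best predecessor for each category: k * k work per word.
            prob, path = max(
                ((best[p][0] * trans.get((p, c), 0.0), best[p][1]) for p in categories),
                key=lambda item: item[0],
            )
            new_best[c] = (prob * emit.get((word, c), 0.0), path + [c])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Hypothetical model: "flies" on its own is more likely a verb than a noun.
categories = ["ART", "N", "V"]
trans = {
    ("<start>", "ART"): 0.7, ("<start>", "N"): 0.3,
    ("ART", "N"): 1.0, ("N", "V"): 0.4, ("N", "N"): 0.2,
    ("V", "N"): 0.4, ("V", "ART"): 0.6,
}
emit = {
    ("the", "ART"): 0.5,
    ("flies", "N"): 0.02, ("flies", "V"): 0.07,
    ("sting", "V"): 0.01, ("sting", "N"): 0.005,
}

prob, tags = viterbi(["the", "flies", "sting"], categories, trans, emit)
print(tags)
```

Although “flies” is more often a verb in this toy model, the preceding article makes the noun reading win, so the context changes the choice. The two nested loops over categories inside the loop over words give the n*k^2 cost mentioned above.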

Logical Form • Although interpreting a sentence often requires knowledge of the context, some interpretation can be done independently of it • basic semantic properties of a word, its different senses etc. • Ontology • each word has one or more senses in which it can be used, e.g. “go” has about 40 senses • the different senses of all the words of a natural language are organized into classes of objects, such as events, actions etc. • the set of such classes is called an ontology • The logical form of an utterance can be viewed as a function that maps the current discourse situation into a new one resulting from the occurrence of the utterance

Current Situation • Inexpensive software for speech recognition • The issues: large vocabulary, continuous speech and speaker independence • Automated speech recognition for restricted domains • The speed of serial processes in a computer vs. the number of parallel processes in the human brain

