
Application Areas



Presentation Transcript


  1. Application Areas • For two lectures, we will examine a number of application areas of AI research • Visual image understanding • Speech recognition • Natural language processing • mostly we’ll look at NL understanding • but we will briefly talk about NL generation and machine translation • Search engine technology • Next time, we will cover the following topics and possibly others (specific topics to be determined) • Homeland security • Sensor interpretation • Robotic vehicles • AI in space

  2. Perception Problems • Vision understanding, natural language processing and speech recognition all have several things in common • Each problem is so complex that it must be solved through a series of mappings from subproblem to subproblem • Each problem requires a great deal of knowledge that is not necessarily available or well-understood such that successful applications often utilize non-knowledge-based mechanisms • Each problem contains some degree of uncertainty, often implemented using HMMs or neural networks • Early approaches in AI were symbolic and often suffered for several reasons • Poor run-time performance (because of the sheer amount of knowledge needed and the slow processors of the time) • Models that were based on our incomplete knowledge of human (or animal) vision, auditory abilities and language • Lack of learning so that knowledge acquisition was essential • Research into all three areas has progressed, but slowly

  3. Vision Understanding Mapping • The vision problem is shown to the left as a series of mappings • Low level processing and filtering to convert from analog to digital • Pixels → edges/lines • Lines → regions • Regions → surfaces • Add texture, shading, contours • Surfaces → objects • Classify objects • Analyze scene (if necessary)
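To make the first mapping concrete, here is a minimal sketch of gradient-based edge detection (pixels → edges) using a Sobel filter; the 8x8 test image, kernel choice, and threshold are illustrative assumptions, not material from the slides.

```python
# A minimal sketch of the pixels -> edges mapping using a Sobel filter;
# the image here is a synthetic array, not real camera input.
import numpy as np

def sobel_edges(image, threshold=1.0):
    """Return a boolean edge map from a 2-D grayscale array."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = image.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = image[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(kx * patch)
            gy[i, j] = np.sum(ky * patch)
    magnitude = np.hypot(gx, gy)       # gradient strength at each pixel
    return magnitude > threshold       # strong gradients count as edge pixels

# toy image: a bright square on a dark background
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0
print(sobel_edges(img).astype(int))
```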

  4. A Few Details • Computer vision has been studied for decades • There is no single solution to the problem • Each of the mappings has its own solution • often mathematical, such as edge detection and the mapping from edges to regions • often applying constraint satisfaction algorithms to reduce the amount of search or computation required for low level processes • The “intelligence” part really comes in toward the end of the process • Object classification • Scene analysis • Surface and object disambiguation (determining which object a particular surface belongs to, dealing with optical illusions) • Computer vision is practically an entire CS discipline and so is beyond the scope of what we can cover here (sadly)

  5. Edge Detection • Waltz created an algorithm for edge detection by • finding junction points (intersections of lines) • determining the orientation of the lines into the junction points • applying constraint satisfaction to select which lines belong to which surfaces • Below, convex edges are denoted with + and concave edges with – • Other approaches may be necessary for curve, contour or blob detection and analysis • Often, these approaches use such mathematical models as eigen-models (eigenform, eigenface), quadrics or superquadrics, distance measures, closest point computations, etc • For trihedral junction points (intersection of 3 lines), these are the 18 legal labelings
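The constraint-satisfaction step can be sketched as domain filtering over line labels. Note that the junction catalog below is a made-up two-entry fragment purely for illustration; it is not the real 18-entry trihedral catalog Waltz used.

```python
# A toy sketch of Waltz-style constraint propagation over line labels.
# The junction catalog is a hypothetical fragment for illustration only.

LABELS = {"+", "-", ">", "<"}          # convex, concave, two occluding directions

CATALOG = {                            # hypothetical junction type -> allowed label tuples
    "L": {("+", "+"), (">", "<")},
    "Y": {("+", "+", "+"), ("-", "-", "-")},
}

def waltz_filter(junctions, domains):
    """junctions: list of (junction type, [line ids]); domains: line id -> set of labels."""
    changed = True
    while changed:                     # repeat until no line's label set shrinks
        changed = False
        for jtype, lines in junctions:
            for idx, line in enumerate(lines):
                # keep a label only if some allowed tuple uses it in this slot
                # and is consistent with the other lines' current label sets
                keep = {tup[idx] for tup in CATALOG[jtype]
                        if all(tup[k] in domains[l] for k, l in enumerate(lines))}
                if keep != domains[line]:
                    domains[line] = keep
                    changed = True
    return domains

junctions = [("L", ["a", "b"]), ("Y", ["b", "c", "d"])]
print(waltz_filter(junctions, {l: set(LABELS) for l in "abcd"}))
```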

  6. Vision Sub-Applications • Machine-produced character recognition • Solved satisfactorily through neural networks • Handwritten character recognition • Many solutions: neural networks, genetic algorithms, HMMs, Bayesian probabilities, nearest-neighbor matching approaches, symbolic approaches • printed character recognition is highly accurate; cursive recognition varies greatly • Face recognition • Many solutions, often attempting to map the face’s contours and texture into mathematical equations and Gaussian distributions • Image stabilization and image (object) tracking • Solutions include neural networks, fuzzy logic, best-fit search • UAV input – we’ll discuss UAVs next time
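As a rough illustration of the neural-network route to character recognition, this sketch trains a small multi-layer perceptron on scikit-learn's bundled 8x8 digit images; the layer size and train/test split are arbitrary choices, not details from the slides.

```python
# A minimal sketch of neural-network character recognition on the bundled
# 8x8 digit images (machine-printed digits, but the pipeline is the same shape).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X_train, y_train)                       # learn pixel -> digit mapping
print("test accuracy:", net.score(X_test, y_test))
```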

  7. Two Approaches to Handwritten Character Recognition • Neural networks with voting • Symbolic approach using pattern matching

  8. Speech Recognition • Although research began in earnest in 1970, speech recognition is a problem far from solved • Problems: • Multispeaker – people speak at different frequencies • Continuous speech – the speech signal for a given sound depends on the preceding and succeeding sounds, and in continuous speech it is extremely hard to find the beginning of a new word, making the search process even more computationally difficult • Large vocabularies – not only does this complicate the search process because there might be many words that match, but it also brings in ambiguity • SR attempts • Knowledge-based approaches (particularly Hearsay) • Neural networks • Hidden Markov Models • Hybrid approaches

  9. The Task Pictorially • The speech signal is segmented; overlapping segments are processed to create a small window of speech • Processing typically involves FFT and Linear Predictive Coding (LPC) analysis • This provides a series of energy patterns at different frequencies
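A minimal sketch of this front-end step: cut the signal into short overlapping windows and take an FFT of each to get energy per frequency band. A real recognizer would follow this with LPC or similar analysis; the 440 Hz test tone and frame sizes are assumptions for illustration.

```python
# A minimal sketch of framing plus FFT: overlapping windows of the signal
# are transformed to give energy per frequency bin for each frame.
import numpy as np

def frame_energies(signal, frame_len=256, hop=128):
    """Return an array of per-frame FFT magnitude spectra."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame))   # energy at each frequency bin
        frames.append(spectrum)
    return np.array(frames)

# toy input: a 440 Hz tone sampled at 8 kHz
t = np.arange(8000) / 8000.0
tone = np.sin(2 * np.pi * 440 * t)
print(frame_energies(tone).shape)   # (number of frames, frame_len // 2 + 1)
```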

  10. Phonetic Dependence • Below are two waveforms created by uttering the same vowel sound, “ee”, as in three (on the left) and tea (on the right) • notice how dissimilar the “ee” portion is; in fact, the one on the right is even longer • this problem is caused by co-articulation – one sound directly impacts the next sound

  11. Hearsay-II • Hearsay-II attempted to solve the problem through symbolic problem solvers • Each called a knowledge group • They would communicate through a global mechanism called a blackboard • Each KG knew what part of the BB to read from and where to post partial conclusions • A scheduler would use a complex algorithm to decide which KG should be invoked next based on the priority of KGs and what knowledge was currently available on the BB • Hearsay-II could recognize continuous speech from several speakers over a 1000-word vocabulary with a limited syntax, with accuracy of around 90%

  12. Sample of Hearsay-II’s Blackboard • Sentence is “Are any by Feigenbaum and Feldman?”

  13. One Neural Network Approach • One approach is to build a separate network for every word known to the system • For a small-vocabulary isolated-word system, this approach may work • no need to worry about finding the separation between words or the effect that a word ending might have on the next word’s beginning • Note that NNs have fixed-size inputs • An input here will be the processed speech signal in the form of LPC coefficients

  14. Continuous Speech • The preceding solution does not work for continuous speech • Since there is no easy way to determine where one word ends and the next begins, we cannot rely on word models alone; instead, we need phonetic models • The problem here is that the sound of a phoneme is influenced by the preceding and succeeding sounds • a neural network only learns a “snapshot” of data, and what we need is context dependent • One solution is to use a recurrent NN, which remembers the output of the previous input to provide context or memory • Note that the RNN is much more difficult to train, but can solve the speech problem more effectively than the normal NN

  15. Another Solution: Multiple Nets • Here, there are multiple neural networks for the various levels • The segmentation module is responsible for dividing the continuous signal into segments • The unit module generates phonetic units • The word module combines possible phonetic units into words

  16. HMM Approach • The HMM approach views speech recognition as finding a path through a graph of connected phonetic and/or grammar models, each of which is an HMM • In this case, the speech signal is the observable, and the unit of speech uttered is the hidden state • Typically, several frames of the speech signal are mapped into a codebook (a fancy name for selecting one of a set of acoustic classifications) • Separate HMMs are developed for every phonetic unit • For instance, we might have a /d/ HMM and an /i/ HMM, etc • There may be multiple paths through a single HMM to allow for differences in duration caused by co-articulation and other effects • Here, the speech problem is one of working through several layers of HMMs: at the lowest level are the phonetic HMMs, which combine to make up word HMMs, which in turn combine to make up grammar HMMs

  17. Discrete Speech Using HMMs • Here, digits spoken one at a time are recognized • The HMM word model for a digit consists of 5 states, any of which can repeat • each model is trained before being used, to adjust the transition probabilities to the speaker • The process is simple: given the LPC features, work through each word model and find the one that yields the greatest probability using Viterbi
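A minimal Viterbi sketch for this isolated-word setup: score a quantized observation sequence against one word model and pick the word whose model scores highest. The two-state model and all probabilities below are toy values, not trained parameters.

```python
# A minimal Viterbi sketch: best-path log probability of an observation
# sequence under a single word-model HMM.
import numpy as np

def viterbi_log_prob(obs, start_p, trans_p, emit_p):
    """Best-path log probability of an observation sequence under one HMM."""
    v = np.log(start_p) + np.log(emit_p[:, obs[0]])       # initial state scores
    for o in obs[1:]:
        # for each next state, keep only the best predecessor, then emit
        v = np.max(v[:, None] + np.log(trans_p), axis=0) + np.log(emit_p[:, o])
    return np.max(v)

# toy 2-state word model over 3 codebook symbols (values are made up)
start = np.array([0.9, 0.1])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit = np.array([[0.6, 0.3, 0.1],
                 [0.1, 0.3, 0.6]])

observations = [0, 1, 2]                                  # one utterance, already quantized
print(viterbi_log_prob(observations, start, trans, emit))
```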

  18. Continuous Speech with HMM • Many simplifications made for discrete speech do not work for continuous speech • HMMs will have to model smaller units, possibly phonemes • To reduce the search space, use a beam search • To ease the word-to-word transitions, use bigrams or unigrams • The process is similar to the previous slide: all phoneme HMMs are searched using Viterbi, but here, transition probabilities are included along with more codebooks to handle the phoneme-to-phoneme transitions and word-to-word transitions • A successful path through the phoneme HMMs to match the word “seeks” • A trained phoneme HMM for /d/

  19. HMM Codebook • HMMs need to have an observable state • For the speech problem, the observable is a speech signal – this needs to be converted to a state • The LPC coefficients are mapped to a codebook entry using a nearest-neighbor type of search • A larger codebook means larger HMMs (more observable states to model) but more accurate nearest-neighbor matching • HMM systems will commonly use codebooks of at least 256 entries
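A minimal sketch of the codebook lookup: map one frame's LPC-style feature vector to the index of its nearest codebook entry, which becomes the HMM observable. The random 256-entry codebook stands in for one trained with, say, k-means.

```python
# A minimal sketch of vector quantization: nearest-neighbor lookup of a
# feature vector in a codebook. The codebook here is random for illustration.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 12))     # 256 entries, 12 LPC-style coefficients each

def quantize(feature_vector, codebook):
    """Return the index of the nearest codebook entry (the HMM observable)."""
    distances = np.linalg.norm(codebook - feature_vector, axis=1)
    return int(np.argmin(distances))

frame = rng.normal(size=12)               # one processed speech frame
print(quantize(frame, codebook))
```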

  20. HMM/NN Hybrid • The strength of the NN is in its low-level recognition ability • The strength of the HMM is in matching the LPC values to a codebook and selecting the right phoneme • Why not combine them? • Here, a neural network is trained and used to determine the classification of the frames rather than a matching codebook – thus, the system can learn to better match the acoustic information to a phonetic classification • Phonetic classifications are gathered together into an array and mapped to the HMMs

  21. Outstanding Problems • Most current solutions are stochastic or neural network and therefore exclude potentially useful symbolic knowledge which might otherwise aid during the recognition problem (e.g., semantics, discourse, pragmatics) • Speech segmentation – dividing the continuous speech into individual words • Selection of the proper phonetic units – we have seen phonemes and words, but also common are diphones, demisyllables and other possibilities – speech science still has not determined which type of unit is the proper type of unit to model for speech recognition • Handling intonation, stress, dialect, accent, etc • Dealing with very large vocabularies (currently, speech systems recognize no more than a few thousand words, not an entire language) • Accuracy is still too low for reliability (95-98% is common)

  22. NLP • We mainly look at NLU here • How can a machine “understand” natural language? • As we know, machines don’t understand, so the goal is to transform a natural language input (speech or text) into an internal representation • the system may be one that takes that representation and selects an action, or merely stores it • for instance, if the NLU system is the front end of a database, then the goal is to form a DB query, submit it, and respond to the user with the DB results, or if it is the front end of an OS, the goal is to generate an OS command • Research began in the 1940s and virtually died in the early 1950s until progress was made in the field of linguistics in the 1960s • Since then, a number of approaches have been tried: • Symbolic • Stochastic/probabilistic • Neural network

  23. NLP • NLU on the left – input text (or speech) and come to a meaning by syntactic parsing followed by semantic analysis • NLG on the right – a planning problem – how to create a sentence to convey a given idea?

  24. NLU Processes • Morphological analysis • Syntactic parsing • identifying grammatical categories • top down parser • bottom up parser • Semantic parsing • Identifying meaning • template matching • alternatives are semantic grammars and using other forms of representations • ambiguity handled by some form of word sense disambiguation • Discourse analysis • Combining word meanings for a full sentence • handling references • applying world knowledge – causality, expectations, inferences • Pragmatic analysis • Speech acts, cognitive states, and beliefs, illocutionary acts
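To make the syntactic-parsing step concrete, here is a minimal top-down (recursive descent) parser over a toy grammar; the grammar, lexicon, and sentence are invented for illustration and are much simpler than the RTN/ATN parsers shown on the next slide.

```python
# A minimal sketch of top-down syntactic parsing with a toy grammar.

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"], ["Name"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "Det": {"the", "a"}, "N": {"dog", "ball"},
    "Name": {"mary"}, "V": {"saw", "ran"},
}

def parse(symbol, words, pos):
    """Yield (parse tree, next position) for every way `symbol` can cover words[pos:]."""
    if symbol in LEXICON:                                   # terminal category
        if pos < len(words) and words[pos] in LEXICON[symbol]:
            yield (symbol, words[pos]), pos + 1
        return
    def expand(parts, at):                                  # expand a production left to right
        if not parts:
            yield [], at
            return
        for subtree, nxt in parse(parts[0], words, at):
            for rest, end in expand(parts[1:], nxt):
                yield [subtree] + rest, end
    for production in GRAMMAR[symbol]:
        for children, end in expand(production, pos):
            yield (symbol, children), end

sentence = "mary saw the dog".split()
full_parses = [tree for tree, end in parse("S", sentence, 0) if end == len(sentence)]
print(full_parses[0])
```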

  25. Parsing • Recursive Transition Network (RTN) parser • Augmented Transition Network (ATN) parser

  26. An NLU Problem: Ambiguity • At the grammatical level • words can take on multiple grammatical roles • “Our company is training workers” has 3 syntactic parses • “List the sales of the products produced in 1973 with the products produced in 1972” has 455 parses! • At the semantic level • words can take on multiple meanings even within one grammatical category • consider the sentence “I made her duck” • At the pragmatic and discourse levels • what is the meaning of “it sure is cold in here” – is this just a statement of discomfort, or a request to make it warmer? • identifying the relationship for references: “The chef made the boy some stew. He ate it and thanked him” – who do the he and him refer to? What does it refer to? • Some real US headlines with multiple interpretations:

  27. Fun Headlines • Hospitals are Sued by 7 Foot Doctors • Astronaut Takes Blame for Gas in Spacecraft • New Study of Obesity Looks for Larger Test Group • Chef Throws His Heart into Helping Feed Needy • Include your Children when Baking Cookies

  28. Stochastic Solutions Offer Promise • HMMs & Bayesian probabilities are often used – how? • in either case, build a corpus – the probability that a given word, or a given word plus a transition to another word, will have a specific meaning • this requires obtaining statistics on word frequencies, collocation frequencies (phrases), and interpretation frequencies • Problem: Zipf’s law • many words appear infrequently and therefore will have low probabilities in stochastic models • rather than using the word count to determine the probability of a given word, rank the words by frequency of occurrence and then use the position in the list to compute a probability • frequency ∝ 1 / rank • In addition, we need to filter collocations to remove common phrases like “and so”, “one of the”, etc, to obtain more reasonable frequency rankings • otherwise, collocation frequencies will be misleadingly skewed toward very common phrases
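A minimal sketch of the rank-based estimate (frequency ∝ 1/rank): rank words by count, then assign each a probability from its rank rather than its raw count. The tiny text sample is an assumption for illustration.

```python
# A minimal sketch of Zipf-style, rank-based word probabilities.
from collections import Counter

text = "the cat sat on the mat and the dog sat on the log".split()
counts = Counter(text)

ranked = [w for w, _ in counts.most_common()]          # rank 1 = most frequent
harmonic = sum(1.0 / r for r in range(1, len(ranked) + 1))
zipf_prob = {w: (1.0 / (i + 1)) / harmonic for i, w in enumerate(ranked)}

print(zipf_prob["the"], zipf_prob["log"])   # a frequent word vs. a rare word
```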

  29. Other Solutions • Symbolic solutions include: • Logic-based models for parsing and context-free grammars to generate parsers • finite state automata such as RTNs and ATNs • Parsing by dynamic programming • Knowledge representation approaches and case grammars (word models) for syntactic and semantic parsing • Ad hoc knowledge-based approaches • Neural network solutions include: • Parsing (combining recurrent networks and self-organizing maps) • Parsing relative clauses using recurrent networks • Case role assignment • Word sense disambiguation

  30. NLG, Machine Translation • NLG: given a concept to relate, translate it into a legal statement • Like NLU, a mapping process, but this time in reverse • much more straightforward than NLU because ambiguity is not present • but there are many ways to say something; a good NLG system will know its audience and select the proper words through register (audience context) • a sophisticated NLG system will use reference and possibly even parts of speech • Machine Translation: • This is perhaps the hardest problem in NLP because it must combine NLU and NLG • Simple word-to-word translation is insufficient • Meaning, references, idioms, etc must all be taken care of • Current MT systems are highly inaccurate

  31. Application Areas • MS Word – spell checker/corrector, grammar checker, thesaurus • WordNet • Search engines (more generically, information retrieval, including library searches) • Database front ends • Question-answering systems within restricted domains • Automated documentation generation • News categorization/summarization • Information extraction • Machine translation • for instance, web page translation • Language composition assistants – help non-native speakers with the language • On-line dictionaries

  32. Search Engine Technology • Search engines generally comprise three components • Web crawler • simple, non-AI, traverser of web pages/sites • given web page, accumulate all URLs, add them to a queue or stack • retrieve next page given the URL from the queue (breadth-first) or stack (depth-first/recursive) • convert material to text format when possible • Summary extractor • take web page and summarize its content (possibly just create a bag of words, possibly attempt some form of classification) – store summary, classification and URL in DB • note: some engines only save summaries, others store summaries and the entire web page, but only use summaries during the search process • create index of terms to web pages (possibly a hash table) • Search engine portal • accept query • find all related items in the DB via hashing • sort using some form of rating scheme • display URLs, titles and possibly brief summaries
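A minimal sketch of the crawler component: breadth-first traversal of a link graph using a queue (a stack would give the depth-first variant). The in-memory link dictionary stands in for real HTTP fetches and HTML link extraction.

```python
# A minimal breadth-first crawler sketch over a toy link graph.
from collections import deque

LINKS = {   # hypothetical page -> outgoing URLs
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://c.example"],
    "http://c.example": ["http://a.example"],
}

def crawl(seed):
    """Breadth-first crawl: a queue gives BFS; a stack would give DFS."""
    queue, seen, order = deque([seed]), {seed}, []
    while queue:
        url = queue.popleft()
        order.append(url)                       # a real crawler would fetch the page,
        for link in LINKS.get(url, []):         # convert it to text, and summarize it here
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("http://a.example"))
```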

  33. Page Categorization/Summaries • The tricky part of the search engine is to properly categorize or summarize a web page • Information retrieval techniques are common • Keywords from a bag of words • Statistical analysis to gauge similarities between pages • Link information such as page rank, hits, hubs, etc • Filtering • Many web pages (e.g., stores) try to take advantage of the syntactic nature of search engines and place meta tags in their pages that contain all English words • Filtering is useful in eliminating pages that attempt such tricks • Sorting • Once web pages have been found that match the given query, how are they sorted? • using word count, giving extra credit if any of the words are found in the page’s title or the link text, examine font size and style for importance of the words in the document, etc

  34. Page Ranking • Based on the idea of academic citation to determine something’s importance • PR(A) = (1 – d) + d * (PR(T1) / C(T1) + … + PR(Tn) / C(Tn)) • PR(A) – page rank of page A • d – a “damping factor” between 0 and 1 (usually set to .85) • C(A) – number of links leaving page A • T1..Tn are the n pages that point at A • The page rank corresponds to the principal eigenvector of a normalized “link matrix” (that is, a matrix of pages and their links) • One way to view page rank is to think of an average web surfer who is randomly walking around pages by clicking on links (and never clicking the back button) • the page rank is in essence the probability that this page will be reached randomly • the damping factor is the likelihood that the surfer will get bored at this page and request another random page (rather than following a specific link of interest)
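A minimal sketch that iterates the PR formula above until it settles; the three-page link graph is made up for illustration, and a fixed iteration count stands in for a proper convergence test.

```python
# A minimal PageRank sketch that iterates the slide's formula:
# PR(A) = (1 - d) + d * sum over in-links T of PR(T) / C(T).
def pagerank(links, d=0.85, iterations=50):
    """links: page -> list of pages it points to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # sum PR(T)/C(T) over every page T that links to p
            incoming = sum(pr[t] / len(links[t]) for t in pages if p in links[t])
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))
```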

  35. Google’s Architecture • There are numerous distributed crawlers working all the time • Web pages are compressed before being stored • Each page has a unique document ID provided by the store server • The indexer uncompresses files and parses them into word occurrences including the position in the document of the given word • These word occurrences are stored in “barrels” to create an index of word-to-document mappings (using ISAM) • The Sorter resorts the barrel information by word to create a reverse index • The URL resolver converts relative URLs into absolute URLs

  36. An Architecture for Improved Search • Search engines are limited in that they do not take advantage of personalized knowledge • This is of course understandable given that search engines have to support millions of users • Imagine that you had your own search engine DB, could you provide user-specific knowledge to improve search? Three different techniques were tried using the architecture to the left A search engine provides a number (millions?) of URLs, which are downloaded and kept locally User personalized knowledge is applied to filter the items found locally in order to provide only those pages most likely to be of use

  37. User Profiles • Partial user profile accumulated from text files (word and frequency): debugging 1100, dec 1785, degree 4138, def 4938, default 1349, define 2752, defined 1140, department 2587, dept 1328, description 2691, design 1780, deskset 6472, development 1403, different 2336, digital 1424, directory 2517, disk 1907, distributed 5127 • The first approach merely created a bag of words annotated by their frequencies • Both the words and the frequencies were derived by accumulating text from the user’s stored files • Word counts were summed based on which profile words appeared in the retrieved document – this score was used to order the retrieved pages
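A minimal sketch of this first approach: score each retrieved page by summing the profile frequencies of the words it contains, then order pages by that score. The four-word profile and the two pages are toy values.

```python
# A minimal sketch of bag-of-words profile scoring for retrieved pages.
profile = {"debugging": 1100, "default": 1349, "design": 1780, "disk": 1907}

def profile_score(page_text, profile):
    """Sum the user-profile frequency of every profile word found in the page."""
    words = set(page_text.lower().split())
    return sum(freq for word, freq in profile.items() if word in words)

pages = {
    "p1": "disk layout and default settings for debugging",
    "p2": "gardening tips and recipes",
}
ranked = sorted(pages, key=lambda p: profile_score(pages[p], profile), reverse=True)
print(ranked)   # pages most relevant to this user's profile come first
```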

  38. User-Defined Knowledge Base • An alternative approach is to allow the user to define his/her own knowledge base of topics, people, phrases and keywords that are of importance • Classification is done through a user-defined hierarchy where each concept has its own matching knowledge • Two concepts are shown to the left – threshold means the number of matches required for the category to be established

  39. Learning-Driven Approach • Learning what bag of words best represents a category by having viewers vote on which web pages are relevant for a given query and which are not • Voting causes weights of the words in the bag to be altered • Implementations have included linear matrix forms, perceptrons and neural networks • this approach can also be used for recommendation systems
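A minimal perceptron sketch of the voting idea: each relevant/not-relevant vote nudges the weights of the bag-of-words features. The vocabulary, votes, learning rate, and epoch count are all illustrative assumptions.

```python
# A minimal perceptron sketch: user votes adjust bag-of-words weights.
VOCAB = ["python", "snake", "programming", "zoo"]

def featurize(text):
    """Binary bag-of-words features over the small fixed vocabulary."""
    words = set(text.lower().split())
    return [1.0 if w in words else 0.0 for w in VOCAB]

def train(votes, lr=0.5, epochs=20):
    """votes: list of (page text, 1 if the user voted it relevant else 0)."""
    w = [0.0] * len(VOCAB)
    for _ in range(epochs):
        for text, label in votes:
            x = featurize(text)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            # perceptron update: move the weights toward the user's vote
            w = [wi + lr * (label - pred) * xi for wi, xi in zip(w, x)]
    return w

votes = [("python programming tutorial", 1), ("python snake at the zoo", 0)]
print(dict(zip(VOCAB, train(votes))))
```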
