
A Whirlwind Tour of Natural Language Processing


Presentation Transcript


  1. A Whirlwind Tour of Natural Language Processing Mark Sammons Cognitive Computation Group, UIUC

  2. Who Cares about NLP? …Eddie Izzard, that’s who… (Those of a sensitive disposition toward explicit language should probably cover their ears…)

  3. Remember Star Trek? HAL in 2001? The Heart of Gold in Hitch-hiker’s Guide…? • Grand Vision of Artificial Intelligence: computers that actively communicate. • A substantial effort devoted to achieving AI. • But how do we decide whether a machine is smart? • IBM’s Deep Blue plays a mean game of chess… …but is it intelligent? • Early idea of evaluation: Turing Test • If a human can’t tell that it’s a machine… • AI philosophy: is *appearance* of intelligent behavior the same as intelligence? • General assumption: NLP is AI-complete (play on concept of NP-completeness) – i.e. need Intelligence to properly solve NLP

  4. More Realistically… where does NLP help? Already here: Context-sensitive spelling, grammar checkers in text editors Machine Translation, e.g. in web browsers Automated phone trees (by some definition of “help”) Web search Under development: Better Machine Translation Better search Voice command in e.g. cars

  5. Outline • Why NLP is hard • NLP domains: Speech vs. Text • Attacking NLP problems • Linguistics: building explanatory models • Statistics: data-driven approaches • Machine Learning & NLP • NLP Problems and Solutions

  6. Why is NLP so hard? (diagram) Mapping between Meaning and Language is hard for two core reasons: Variability and Ambiguity

  7. Variability Example: Relation Extraction: “Works for” Jim Carpenter works for the U.S. Government. The American government employed Jim Carpenter. Jim Carpenter was fired by the US Government. Jim Carpenter worked in a number of important positions. … As a press liaison for the IRS, he made contacts in the White House. Top Russian interior minister Yevgeny Topolov met yesterday with his US counterpart, Jim Carpenter. Former US Secretary of Defense Jim Carpenter spoke today…

  8. Context Sensitive Paraphrasing [3] Ambiguity: Loosen, Strengthen, Step up, Toughen, Improve, Fasten, Impose, Intensify, Ease, Beef up, Simplify, Curb, Reduce. He used a Phillips head to tighten the screw. The bank owner tightened security after a spate of local crimes. The Federal Reserve will aggressively tighten monetary policy. ……….

  9. Domain Size • Ideal goal: must handle all well-formed strings of text • Problem: infinite domain • Sequential modifiers: I saw Martin Sheen in a movie I saw Martin Sheen in a movie in Paris I saw Martin Sheen in a movie in Paris in the Spring I saw Martin Sheen in a movie in Paris in the Spring with my friend… • Unbounded relative clauses: I saw Martin Sheen, who was with a friend I knew from high school, which was well known for its long, storied history of ………….., in a movie…..

  10. Outline • Why NLP is hard • NLP domains: Speech vs. Text • Attacking NLP problems • Linguistics: building explanatory models • Statistics: data-driven approaches • Machine Learning & NLP • NLP Problems and Solutions

  11. Speech Recognition • NOT “voice recognition” • How hard can it be? • First image: “Fix the wing” • Second image: same utterance in a noisy airport maintenance environment

  12. Speech Recognition – yup, it’s hard… “Yuhgudda unnuhstahn sheeguhnuhbeeyah, yunoewaah, dissappointed.” “You’ve got to understand she’s going to be, ah, you know, ah, disappointed.” • Difficult to recognize words and word boundaries (multiple variations for a single word) • Even given word boundaries, utterances are ill-formed compared to text • Hesitations, repetitions, fragmentary sentences, self-interruptions, poor word choice, sound quality… LBJ/Mansfield audio sample

  13. Development and Evaluation for Speech Recognition • Switchboard (and other) corpora • Large set of phone conversations • Audio signals aligned with transcriptions of utterances (phone sequences) • Dictionaries aligning words with phone sequence equivalents • Typically, machine learning approaches applied • Signal processing techniques extract features from signals • Statistical methods relate these features to particular phones – create a model • Analyze new signals, use model to identify plausible phone sequences • Choose most likely sequence given another statistical model
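The last two bullets – relating features to phones and choosing the most likely phone sequence – are typically handled with a hidden Markov model decoded by the Viterbi algorithm. A toy sketch in Python; the phone set, observation symbols, and all probabilities are invented for illustration, not taken from any real system:

```python
import math

# Toy HMM for phone decoding: hidden states are phones, observations are
# discretized acoustic features. All symbols and probabilities are invented.
STATES = ["ih", "t"]
START = {"ih": 0.6, "t": 0.4}
TRANS = {"ih": {"ih": 0.3, "t": 0.7}, "t": {"ih": 0.6, "t": 0.4}}
EMIT = {"ih": {"hi_freq": 0.2, "lo_freq": 0.8},
        "t":  {"hi_freq": 0.9, "lo_freq": 0.1}}

def viterbi(obs):
    """Return the most likely phone sequence for an observation sequence."""
    # best[s] = (log-prob of the best path ending in state s, that path)
    best = {s: (math.log(START[s] * EMIT[s][obs[0]]), [s]) for s in STATES}
    for o in obs[1:]:
        new_best = {}
        for s in STATES:
            # pick the predecessor state that maximizes the path score
            lp, path = max(
                (best[p][0] + math.log(TRANS[p][s]), best[p][1])
                for p in STATES)
            new_best[s] = (lp + math.log(EMIT[s][o]), path + [s])
        best = new_best
    return max(best.values())[1]

print(viterbi(["lo_freq", "hi_freq"]))  # ['ih', 't']
```

Real recognizers work the same way in outline, but over thousands of context-dependent states and continuous acoustic features.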

  14. Speech Recognition System (Courtesy of ComputerWorld…)

  15. The State of the Art in Speech-to-Text Translation • Current performance on known tasks: 98% word accuracy for dictation • Very controlled circumstances • State of the art for spontaneous speech: • News broadcast: ~90% • Switchboard (phone conversations): ~80% • A lot of work even to get to a clean text representation of signal • Notice that I haven’t even begun to address tasks like search using this input (Note also that there are many other research directions in speech processing – e.g. speaker identification)

  16. What about Text? • A lot of overlap • If you can solve NLP in text, and can accurately parse speech into text, the two problems are the same • Text domain has some nice characteristics • Paragraph, Sentence, Word segmentation already present • Well-formed utterances (in many/most sub-domains) • Little regional variation • Most information is already in the form of text

  17. Outline • Why NLP is hard • NLP domains: Speech vs. Text • Attacking NLP problems • Linguistics: building explanatory models • Statistics: data-driven approaches • Machine Learning & NLP • NLP Problems and Solutions

  18. Linguistics • Linguists: meaning through structure + lexical knowledge “Colorless green ideas sleep furiously” • Discover the rules of language (a grammar) • Prescriptive grammar: rules describe what you shouldn’t do. • Generative grammar: a finite set of rules that can generate all possible strings in a language, and only those strings that are valid in that language [3] • “Generate” here means “assign a structural description to” • Attempts to move beyond simplistic linear models, where words are dependent only on previous words

  19. Divide and Conquer: Morphology Consider the sub-problem of recognizing well-formed variations of words Popular method: Finite State Automata/Transducers Automaton: recognizes patterns Transducer: maps from an input pattern to an output pattern – e.g. indicate whether a noun is plural

  20. Morphology Example: plurals [5] (finite-state diagram with states q0, q1, q2: a regular-noun arc (N) leads from q0 to q1, and an “-s” arc (+PL) from q1 to the accepting state q2; irregular singular nouns (N) and irregular plural nouns (N +PL) reach q2 on their own arcs)
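The transducer idea can be sketched in a few lines of Python. The word lists below are tiny invented placeholders, not a real lexicon:

```python
# A minimal morphological analyzer in the spirit of the slide's automaton:
# regular nouns take -s for the plural; irregular forms are listed explicitly.
# The word lists are tiny invented placeholders, not a real lexicon.
REGULAR = {"cat", "dog", "tree"}
IRREG_SG = {"goose", "mouse"}
IRREG_PL = {"geese": "goose", "mice": "mouse"}

def analyze(word):
    """Map a surface form to (lemma, number), or None if not recognized."""
    if word in REGULAR | IRREG_SG:
        return (word, "SG")
    if word in IRREG_PL:                               # irregular-plural arc
        return (IRREG_PL[word], "PL")
    if word.endswith("s") and word[:-1] in REGULAR:    # regular-noun arc, then -s arc
        return (word[:-1], "PL")
    return None

print(analyze("cats"))   # ('cat', 'PL')
print(analyze("geese"))  # ('goose', 'PL')
```

Real morphological analyzers compile such rules into finite-state transducers so that analysis and generation are the same machine run in opposite directions.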

  21. Basic Generative Grammar: Context-Free Grammar Generate parse trees, decompose into constituents, infer generative rules: S => NP VP VP => V VP VP => VP PP VP => V ADJP NP => PRO PRO => He V => wants PP => to ….. [4] Accomplishes the goal of a finite description of an infinite domain, at least for syntactic structure
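A CFG can also be run “forwards” to generate strings. A minimal Python sketch, using a small invented grammar in the style of the fragment above (the rules are chosen so that every derivation terminates; this is not the slide’s exact grammar):

```python
import random

# A tiny illustrative CFG: nonterminals map to lists of possible expansions.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["PRO"], ["DT", "NN"]],
    "VP":  [["V"], ["V", "NP"]],
    "PRO": [["He"]],
    "DT":  [["the"]],
    "NN":  [["man"], ["tree"]],
    "V":   [["wants"], ["climbed"]],
}

def generate(symbol="S"):
    """Expand a symbol by recursively choosing rules until only words remain."""
    if symbol not in GRAMMAR:          # terminal: an actual word
        return [symbol]
    rule = random.choice(GRAMMAR[symbol])
    return [word for sym in rule for word in generate(sym)]

print(" ".join(generate()))  # e.g. "He climbed the tree"
```

Because `S => NP VP` is the only start rule, every generated string has a subject followed by a verb phrase, which is exactly the “finite description of an infinite domain” idea.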

  22. Context-Free Grammar • Drawbacks to CFGs: • Real natural language may not be context-free • Hard to model some phenomena, e.g. limits on nesting: The cat ran away. The cat the dog bit ran away. The cat the dog the horse kicked bit ran away. • Phenomena like agreement, morphology, and long-distance dependencies require a very complex set of rules • What about unseen words/phrases/sentences? • Given a sentence, there may be multiple ways to explain it: I pointed to the man with the crutch.

  23. That doesn’t deter Real Linguists… • A range of formalisms have been developed • Different ways of tackling composition of words, phrases, clauses • Trade-off between importance of sentence structure and individual words • Strong emphasis on generality, particularly across languages • Typically much more involved than the simplistic CFG in the previous example • There is ongoing work to encode a hand-written grammar of English – English Resource Grammar • Uses Head-driven Phrase Structure Grammar • Explains syntax via a Typed Feature Structure model

  24. HPSG sample Feature Structure (for one word)

  25. General Points Much work on analyzing languages for structure Wide range of theories; all have some descriptive power All assume close relation between structure and meaning We will see CFGs again later…

  26. Outline • Why NLP is hard • NLP domains: Speech vs. Text • Attacking NLP problems: 4 research strands • Linguistics: building explanatory models • Logic: defining meaning and reasoning • Statistics: data-driven approaches • Machine Learning & NLP • NLP Problems and Solutions

  27. Data-Driven Approaches • Consider a partially completed sentence… • We can capture some measure of this intuitive restriction on word choice using probabilities • Bigrams, trigrams, n-grams • Effect of adding complexity in terms of storage requirements? 50,000² = 2.5 billion possible bigrams • We can estimate these probabilities directly from a corpus (body of text): p(wₙ | wₙ₋₁) = C(wₙ₋₁ wₙ) / C(wₙ₋₁) • Applications: spelling checker, augmentative communication systems, speech processing…
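The estimation formula above is a one-liner over corpus counts. A minimal sketch, with an invented toy corpus:

```python
from collections import Counter

def bigram_model(tokens):
    """Estimate p(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) from a corpus."""
    unigrams = Counter(tokens[:-1])              # counts of history words
    bigrams = Counter(zip(tokens, tokens[1:]))   # counts of adjacent pairs
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

corpus = "the man spoke and the dog barked and the man left".split()
p = bigram_model(corpus)
print(p[("the", "man")])   # C(the man)/C(the) = 2/3
```

With a 50,000-word vocabulary the table of possible pairs is what gives the 2.5 billion figure on the slide; in practice only observed pairs are stored, as here.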

  28. N-gram model samples The following sentences were generated using n-gram models trained on Shakespeare’s works (~885,000 words, ~29,000 types) [5]: 1-gram: Every enter now severally so, let 2-gram: What means sir. I confess she? Then all sorts, he is trim, captain. 3-gram: This shall forbid it should be branded, if renown made it empty. 4-gram: Enter Leonato’s brother Antonio, and the rest, but seek the weary beds of people sick.

  29. N-Gram Modeling • What’s it good for? • Determine plausibility of new sentence: The man spoke briefly… The dog spoke briefly… The spoke briefly man… The wheel spoke briefly… • Given N-gram models of two domains, identify most likely source: ACENOR stocks caught fire today on word of a take-over…. Teen pop sensation Tilde Greengrass roared into Austin today… • Teen Angst Poetry and Band Names… • Drawbacks: how to handle unseen sequences?
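One standard answer to the unseen-sequences question is smoothing. A sketch of add-one (Laplace) smoothing, again over an invented toy corpus; this is the simplest such scheme, not the only one:

```python
from collections import Counter

def smoothed_bigram_prob(tokens, prev, word):
    """Add-one (Laplace) smoothed p(word | prev) = (C(prev word) + 1) / (C(prev) + V).
    Every bigram over the vocabulary gets a nonzero probability."""
    vocab = set(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

corpus = "the man spoke and the dog barked".split()
print(smoothed_bigram_prob(corpus, "dog", "spoke"))  # unseen pair, yet nonzero
```

Add-one smoothing steals probability mass from seen events and spreads it over unseen ones; more refined schemes (Good-Turing, Kneser-Ney) do this redistribution more carefully.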

  30. Computational Linguistics • We just used very elementary statistics to make some potentially interesting discoveries about language • In fact, given the right resources, we can use statistics to build automated resources for linguistic analysis… • Part of speech tagging: (DT the) (NN man) (VBD climbed) (IN up) (DT the ) (NN tree) • Phrase boundary detection & phrase labeling (NP the man) (VP climbed) (PP up the tree) • Parsing….
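As a taste of how far elementary statistics go, here is a hypothetical most-frequent-tag baseline for part-of-speech tagging: tag each word with the tag it carried most often in training data. The training sentences below are invented, and real taggers do much better by also using context:

```python
from collections import Counter, defaultdict

def train_baseline_tagger(tagged_sentences):
    """Most-frequent-tag baseline: map each word to its most common tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

train = [[("the", "DT"), ("man", "NN"), ("climbed", "VBD")],
         [("the", "DT"), ("tree", "NN")]]
tagger = train_baseline_tagger(train)
print([tagger[w] for w in ["the", "man", "climbed"]])  # ['DT', 'NN', 'VBD']
```

Even this trivial model is a surprisingly strong baseline on English, because most words have one dominant tag; the hard cases are the ambiguous ones like “spoke”.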

  31. Parsing Revisited • We saw earlier an outline of a Context-Free Grammar model of language S => NP VP VP => VP PP NP => NP PP NP => DT NN (NP I) (VP saw) (NP the man) (PP with the telescope) (NP I) (VP saw) (NP the man) (PP with the book) • Two valid parses for each… are they equally valid?

  32. Probabilistic CFGs • In the n-gram modeling example, we derived probabilities based on a corpus. Can we do the same for CFG rules? • Not the same problem: for n-gram modeling, the words alone were sufficient • Need a corpus with additional information – the parse trees • Given such a corpus, can use statistical analysis to derive the rules themselves, and the relative probabilities of rules. • This pattern – applying statistical methods to a labeled data set to extract a predictive model – is common in Machine Learning.
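Under a PCFG, the probability of a parse tree is the product of the probabilities of the rules it uses, which is how competing parses like the ones above can be compared. A sketch with a hypothetical tree and invented rule probabilities (a real model would estimate them from a treebank):

```python
# Probability of a parse tree under a PCFG: the product of the probabilities
# of the rules used in it. All rule probabilities below are invented.
RULE_P = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("I",)): 0.3,
    ("NP", ("him",)): 0.5,
    ("V", ("saw",)): 1.0,
    ("PP", ("nearby",)): 1.0,
}

def tree_prob(tree):
    """tree is a (label, child, child, ...) tuple; leaves are plain strings."""
    if isinstance(tree, str):          # a word: contributes no rule probability
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_P[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

# "I saw him nearby", with the PP attached inside the object NP.
tree = ("S", ("NP", "I"),
             ("VP", ("V", "saw"),
                    ("NP", ("NP", "him"), ("PP", "nearby"))))
print(tree_prob(tree))
```

A statistical parser searches over all trees licensed by the grammar and returns the one with the highest such product.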

  33. Outline • Why NLP is hard • NLP domains: Speech vs. Text • Attacking NLP problems: 4 research strands • Linguistics: building explanatory models • Logic: defining meaning and reasoning • Statistics: data-driven approaches • Machine Learning & NLP • NLP Problems and Solutions

  34. Machine Learning: Classification (diagram) — a set D of training examples (x, y) is fed into a learning algorithm, which outputs a classifier h: X -> Y mapping each input x to an output y

  35. Machine Learning (supervised) • Given some labeled data, and assuming some set of models, find the model that best maps each example to its label. • Statistically: represent examples using some abstraction (set of features), compute the relation between features and labels. • Choice of model affects best possible performance. • Complex model: may get better results (more expressive), but requires much more data to train (and labeled data is expensive) • Simple model: fewer parameters, so less expressive, but easier to learn • Some examples…
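A concrete instance at the “simple model” end of this trade-off is the perceptron: few parameters, cheap to train, limited to linear decision boundaries. A minimal sketch on an invented, linearly separable toy dataset:

```python
# Minimal perceptron: learn a linear classifier h(x) = sign(w.x + b)
# from labeled examples. The toy data below is invented and separable.
def train_perceptron(data, epochs=10):
    """data: list of (feature_vector, label) pairs with label in {-1, +1}."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                       # mistake: nudge toward y
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Label is the sign of the first feature; the second feature is noise.
data = [([1.0, 0.2], 1), ([0.8, -0.5], 1), ([-1.0, 0.3], -1), ([-0.7, -0.2], -1)]
w, b = train_perceptron(data)
print([predict(w, b, x) for x, _ in data])  # [1, 1, -1, -1]
```

For NLP, x is typically a sparse vector of features like “previous word is ‘the’”, which is why even such simple linear learners are workhorses for tagging and chunking.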

  36. Outline • Why NLP is hard • NLP domains: Speech vs. Text • Attacking NLP problems: 4 research strands • Linguistics: building explanatory models • Logic: defining meaning and reasoning • Statistics: data-driven approaches • Machine Learning & NLP • NLP Problems and Solutions

  37. NLP Problems and Solutions (focused) Part-of-Speech tagging Context Sensitive Spelling Correction Named Entity Recognition Relation detection Comma Resolution Verb and Noun Phrase Chunking Prepositional Phrase Attachment Coreference Resolution Statistical Parsing Semantic Role Labeling Emotion and Subjectivity detection

  38. Example: Named Entity Recognition • A lot of Machine Learning work – significant overfitting • Key difficulties – adaptation to: new domains/corpora; slightly different definitions of an entity; new languages; new types of entities • How to reduce the resources needed to produce a semantic categorization for a new domain/language/entity type? • Entities are inherently ambiguous (e.g. JFK can be both a location and a person, depending on context) • Can appear in various forms; can be nested • Using lists is not sufficient – new entities are always being introduced

  39. Grand Challenges Machine Translation Message Understanding (Information Extraction) Question Answering Information Retrieval & Data Mining Textual Entailment

  40. Textual Entailment • Work at the level of meaning • Frame the task of understanding text as recognizing when two text fragments mean the same thing (one meaning ‘contains’ the other) • Dagan and Glickman (2004) pose this problem as Recognizing Textual Entailment • Now we can recast many problems in terms of TE: The American government employed Jim Carpenter. Top Russian interior minister Yevgeny Topolov met yesterday with his US counterpart, Jim Carpenter. Former US Secretary of Defense Jim Carpenter spoke today… Jim Carpenter works for the U.S. Government. ?

  41. PASCAL RTE Challenges (2004-present) • Move away from strict definition (Chierchia & McConnell-Ginet, 2001 [6]): a text T entails a hypothesis H if H is true in every circumstance (possible world) in which T is true • ‘Applied’ definition (Dagan & Glickman, 2004 [7]): T entails H (T ⇒ H) if humans reading T will infer that H is most likely true • 800 development, 800 test pairs for each challenge

  42. Some Examples (2nd RTE Challenge)

  43. Incomplete List of Citations [1] Peter Bell and Simon King. Sparse Gaussian graphical models for speech recognition. In Proc. Interspeech 2007, Antwerp, Belgium, August 2007. [2] Connor & Roth. ECML 2007. [3] Chomsky, Noam (1957, 2002). Syntactic Structures. Mouton de Gruyter. [4] Image courtesy of Bill Wilson, Univ. of New South Wales, Australia: http://www.cse.unsw.edu.au/~billw/ [5] Jurafsky and Martin. Speech and Language Processing. Prentice-Hall, 2000. [6] Chierchia & McConnell-Ginet. Meaning and Grammar: An Introduction to Semantics (2nd ed.), 2000. [7] Dagan & Glickman. Probabilistic textual entailment: Generic applied modeling of language variability. PASCAL Workshop on Text Understanding and Mining, 2004. Some slides came from Prof. Dan Roth, University of Illinois.
