
Natural Language Processing & Information Extraction CSE 573 Dan Weld


Presentation Transcript


  1. Natural Language Processing & Information Extraction, CSE 573, Dan Weld

  2. todo • Be more explicit on how rule learning works (before Rapier, BWI) – show search through rule space • Too much of a lecture – add more interaction / questions

  3. Confession • Proposal 1 • Keep homework and final exam • Teach next week so we can cover all the cool stuff • Proposal 2 • Talk quickly • Mini-project replaces homework & final, due 12/11

  4. Mini-Project Options • Write a program to solve the counterfeit coin problem on the midterm. • Build a DPLL and/or a WalkSAT satisfiability solver. • Build a spam filter using naïve Bayes, decision trees, or compare learners in the Weka ML package. • Write a program that learns Bayes nets.

  5. Communication

  6. Is the Problem

  7. Of the century… Herman is reprinted with permission from LaughingStock Licensing Inc., Ottawa Canada. All rights reserved.

  8. Why Study NLP? • A hallmark of human intelligence. • Text is the largest repository of human knowledge and is growing quickly. • emails, news articles, web pages, IM, scientific articles, insurance claims, customer complaint letters, transcripts of phone calls, technical documents, government documents, patent portfolios, court decisions, contracts, ……

  9. What is NLP? • Natural Language Processing • Understand human language (speech, text) • Successfully communicate (NL generation) • Pragmatics: “Can you pass the salt?” • Semantics: “Paris is the capital of France” • How to represent meaning? • Syntax: analyze the structure of a sentence • Morphology: analyze the structure of words • Dogs = dog + s.

  10. Fundamental NLP Tasks • Speech recognition: speech signal to words • Parsing: decompose sentence into units • Part of speech tagging: e.g., noun vs. verb • Semantic role labeling: for a verb, find the units that fill pre-defined “semantic roles” (e.g., Agent, Patient, or Instrument) • Example: “John hit the ball with the bat”

  11. Fundamental NLP Tasks • Semantics: map sentence to corresponding “meaning representation” (e.g., logic) • give(John, Book, Sally) • Quantification: Everybody loves somebody • Word Sense Disambiguation • orange juice vs. orange coat • Where do word senses come from? • Co-reference resolution: • The dog found a cookie; He ate it. • Implicit “text” - what is left unsaid? • Joe entered the restaurant, ordered, ate and left. The owner said “Come back soon!” • Joe entered the restaurant, ordered, ate and left. The owner said “Hey, I’ll call the police!”

  12. NLP Applications • Question answering • Who is the first Taiwanese president? • Text Categorization/Routing • e.g., customer e-mails. • Text Mining • Find everything that interacts with BRCA1. • Machine Translation • Information Extraction • Spelling correction • Is that just dictionary lookup?

  13. Main Challenge in NLP: Ambiguity • Lexical: • Label (noun or verb)? • London (Jack London or the capital of the UK)? • Syntactic (examples from newspaper headlines): • Prepositional Phrase Attachment • Ban on Nude Dancing on Governor’s Desk • Killer Sentenced to Die for Second Time in 10 Years • Word Sense Disambiguation • 1990: Iraqi Head Seeking Arms • Syntactic Ambiguity (what’s modifying what) • Juvenile Court to Try Shooting Defender. • Semantic ambiguity: • “snake poison” • Rampant metaphors: • “prices went up”

  14. Speech Act Theory (Searle ’75) • Example Speech Acts: • assertives = speech acts that commit a speaker to the truth of the expressed proposition • directives = speech acts that are to cause the hearer to take a particular action, e.g. requests, commands and advice • commissives = speech acts that commit a speaker to some future action, e.g. promises and oaths • “Could you pass the salt?” • “Yes.” • Research has focused on specifying semantics and analyzing primitives instead of implemented “speech actors”

  15. Semantics • How do you map a sentence to a semantic representation? • What are the semantic primitives? • Schank: 12 (or so) primitives • (Conceptual dependency theory, 1972) • The basis of natural language is conceptual. • The conceptual level is interlingual, • while the sentential level is language-specific. • De-emphasize syntax • Focus on semantic expectations as the basis for NLU • Status: didn’t yield practical systems, out of favor, but..

  16. Syntax is about Sentence Structures • Sentences have structures and are made up of constituents. • The constituents are phrases. • A phrase consists of a head and modifiers. • The category of the head determines the category of the phrase • e.g., a phrase headed by a noun is a noun phrase

  17. Alternative Parses: “Teacher strikes idle kids”. Two parse trees (figure): (a) [S [NP [N Teacher] [N strikes]] [VP [V idle] [N kids]]], reading “strikes” as a noun and “idle” as a verb; (b) [S [NP [N Teacher]] [VP [V strikes] [NP [A idle] [N kids]]]], reading “strikes” as a verb and “idle” as an adjective.

  18. How do you choose the parse? • Old days: syntactic rules, semantics. • Since the early 1980s: statistical parsing. • Parsing as supervised learning • Input: treebank of (sentence, tree) pairs • Output: PCFG parser + lexical language model • Parsing decisions depend on the words! • Example: VP → V + NP + NP? • Does the verb take two objects or one? • ‘told Sue the story’ versus ‘invented the Internet’
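The PCFG idea above can be sketched in miniature: score each candidate parse by the product of the probabilities of the rules in its derivation, and keep the most probable one. The grammar rules and all probabilities below are invented purely for illustration (a real parser estimates them from a treebank); the two derivations are the alternative parses of “Teacher strikes idle kids” from the earlier slide.

```python
from math import prod

# Two derivations of "Teacher strikes idle kids"; each is a list of
# (rule, probability) pairs. All probabilities are invented.
parse_a = [  # "strikes" read as a noun, "idle" as a verb
    ("S -> NP VP", 1.0), ("NP -> N N", 0.2), ("VP -> V NP", 0.7),
    ("NP -> N", 0.5), ("N -> teacher", 0.10), ("N -> strikes", 0.01),
    ("V -> idle", 0.005), ("N -> kids", 0.10),
]
parse_b = [  # "strikes" read as a verb, "idle" as an adjective
    ("S -> NP VP", 1.0), ("NP -> N", 0.5), ("VP -> V NP", 0.7),
    ("NP -> A N", 0.3), ("N -> teacher", 0.10), ("V -> strikes", 0.05),
    ("A -> idle", 0.2), ("N -> kids", 0.10),
]

def parse_prob(derivation):
    """P(parse) = product of the rule probabilities in its derivation."""
    return prod(p for _, p in derivation)

best = max(("A", parse_a), ("B", parse_b), key=lambda t: parse_prob(t[1]))
```

With these made-up numbers the verb reading of “strikes” wins, which is how a PCFG disambiguates: the rare lexical rules (a noun “strikes”, a verb “idle”) drag down the probability of the first parse.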

  19. Issues with Statistical Parsing • Statistical parsers still make “plenty” of errors • Treebanks are language specific • Treebanks are genre specific • Train on WSJ → fail on the Web • standard distributional assumption • Unsupervised, un-lexicalized parsers exist • But performance is substantially weaker

  20. How Difficult is Morphology? • Examples from Woods et al. 2000 • delegate (de + leg + ate) take the legs from • caress (car + ess) female car • cashier (cashy + er) more wealthy • lacerate (lace + rate) speed of tatting • ratify (rat + ify) infest with rodents • infantry (infant + ry) childish behavior

  21. What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION Slides from Cohen & McCallum

  22. What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. [Same article excerpt as in slide 21.] IE output, NAME / TITLE / ORGANIZATION: Bill Gates, CEO, Microsoft; Bill Veghte, VP, Microsoft; Richard Stallman, founder, Free Soft.. Slides from Cohen & McCallum

  23. What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + clustering + association. [Same article excerpt as in slide 21.] Segmented strings: Microsoft Corporation; CEO; Bill Gates; Microsoft; Gates; Microsoft; Bill Veghte; Microsoft; VP; Richard Stallman; founder; Free Software Foundation. Slides from Cohen & McCallum

  24. What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering. [Same article excerpt and segmented strings as in slide 23.] Slides from Cohen & McCallum

  25. What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering. [Same article excerpt and segmented strings as in slide 23.] Slides from Cohen & McCallum

  26. What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering. [Same article excerpt and segmented strings as in slide 23.] Resulting table, NAME / TITLE / ORGANIZATION: Bill Gates, CEO, Microsoft; Bill Veghte, VP, Microsoft; Richard Stallman, founder, Free Soft.. Slides from Cohen & McCallum

  27. IE in Context (pipeline figure): Document collection → Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search / Data mine. Supporting steps: Create ontology; Label training data; Train extraction models. Slides from Cohen & McCallum

  28. IE History Pre-Web • Mostly news articles • De Jong’s FRUMP [1982] • Hand-built system to fill Schank-style “scripts” from news wire • Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96] • Most early work dominated by hand-built models • E.g. SRI’s FASTUS, hand-built FSMs. • But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98] Web • AAAI ’94 Spring Symposium on “Software Agents” • Much discussion of ML applied to Web. Maes, Mitchell, Etzioni. • Tom Mitchell’s WebKB, ‘96 • Build KB’s from the Web. • Wrapper Induction • First by hand, then ML: [Doorenbos ‘96], [Soderland ’96], [Kushmerick ’97],… Slides from Cohen & McCallum

  29. What makes IE from the Web Different? Less grammar, but more formatting & linking Newswire Web www.apple.com/retail Apple to Open Its First Retail Store in New York City MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience. "Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles." www.apple.com/retail/soho www.apple.com/retail/soho/theatre.html The directory structure, link structure, formatting & layout of the Web is its own new grammar. Slides from Cohen & McCallum

  30. Landscape of IE Tasks (1/4): Pattern Feature Domain • Text paragraphs without formatting, e.g. the bio: “Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.” • Grammatical sentences and some formatting & links • Non-grammatical snippets, rich formatting & links • Tables. Slides from Cohen & McCallum

  31. Landscape of IE Tasks (2/4): Pattern Scope • Web site specific (formatting/layout; e.g., Amazon.com book pages) • Genre specific (layout; e.g., resumes) • Wide, non-specific (language; e.g., university names). Slides from Cohen & McCallum

  32. Landscape of IE Tasks (3/4): Pattern Complexity (word patterns) • Closed set, e.g., U.S. states: “He was born in Alabama…”, “The big Wyoming sky…” • Regular set, e.g., U.S. phone numbers: “Phone: (413) 545-1323”, “The CALD main office can be reached at 412-268-1299” • Complex pattern, e.g., U.S. postal addresses: “University of Arkansas P.O. Box 140 Hope, AR 71802”, “Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210” • Ambiguous patterns, needing context + many sources of evidence, e.g., person names: “…was among the six houses sold by Hope Feldman that year.”, “Pawel Opalinski, Software Engineer at WhizBang Labs.” Slides from Cohen & McCallum

  33. Landscape of IE Tasks (4/4): Pattern Combinations. Example text: “Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.” • Single entity (“named entity” extraction): Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut • Binary relationship: Relation: Person-Title, Person: Jack Welch, Title: CEO; Relation: Company-Location, Company: General Electric, Location: Connecticut • N-ary record: Relation: Succession, Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt. Slides from Cohen & McCallum

  34. Evaluation of Single Entity Extraction. TRUTH: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke. PRED: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke. Precision = # correctly predicted segments / # predicted segments = 2/6. Recall = # correctly predicted segments / # true segments = 2/4. F1 = harmonic mean of Precision & Recall = 1 / (((1/P) + (1/R)) / 2) = 2PR / (P + R). Slides from Cohen & McCallum
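The slide's numbers can be checked directly. A minimal helper under the usual segment-level definitions:

```python
def precision_recall_f1(n_correct, n_predicted, n_true):
    """Segment-level IE scoring; F1 is the harmonic mean of P and R."""
    p = n_correct / n_predicted
    r = n_correct / n_true
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# The slide's example: 2 correct out of 6 predicted, 4 true segments.
p, r, f1 = precision_recall_f1(n_correct=2, n_predicted=6, n_true=4)
# p = 1/3, r = 1/2, f1 = 0.4
```

Note why the harmonic mean is used: it punishes imbalance, so a system cannot buy a high score by predicting everything (perfect recall, terrible precision) or almost nothing.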

  35. State of the Art Performance • Named entity recognition • Person, Location, Organization, … • F1 in high 80’s or low- to mid-90’s • Binary relation extraction • Contained-in (Location1, Location2), Member-of (Person1, Organization1) • F1 in 60’s, 70’s, or 80’s • Wrapper induction • Extremely accurate performance obtainable • Human effort (~30 min) required on each site Slides from Cohen & McCallum

  36. Landscape of IE Techniques (1/1): Models (figure). Each model family illustrated on “Abraham Lincoln was born in Kentucky.”: • Lexicons: member? (Alabama, Alaska, …, Wisconsin, Wyoming) • Classify pre-segmented candidates: classifier asks “which class?” • Sliding window: classifier asks “which class?”; try alternate window sizes • Boundary models: BEGIN/END classifiers • Finite state machines: most likely state sequence? • Context free grammars: most likely parse? (NNP NNP V V P NP; S, NP, VP, PP) …and beyond. Any of these models can be used to capture words, formatting, or both. Slides from Cohen & McCallum

  37. Landscape: Our Focus • Pattern complexity: closed set → regular → complex → ambiguous • Pattern feature domain: words → words + formatting → formatting • Pattern scope: site-specific → genre-specific → general • Pattern combinations: entity → binary → n-ary • Models: lexicon, regex, window, boundary, FSM, CFG. Slides from Cohen & McCallum

  38. Sliding Windows Slides from Cohen & McCallum

  39. Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement Slides from Cohen & McCallum

  40.–42. Extraction by Sliding Window: the same CMU UseNet Seminar Announcement as above, repeated on successive slides with the sliding window advanced across the text. Slides from Cohen & McCallum

  43. A “Naïve Bayes” Sliding Window Model [Freitag 1997]. Example: “… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …”, with the token sequence w_{t-m} … w_{t-1} [ w_t … w_{t+n} ] w_{t+n+1} … w_{t+n+m} labeled prefix, contents, suffix. • Estimate Pr(LOCATION | window) using Bayes rule • Try all “reasonable” windows (vary length, position) • Assume independence for length, prefix words, suffix words, content words • Estimate from data quantities like: Pr(“Place” in prefix | LOCATION) • If Pr(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it. Other examples of sliding window: [Baluja et al 2000] (decision tree over individual words & their context). Slides from Cohen & McCallum
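A minimal sketch of the window score, assuming (as the slide does) independence of length, prefix words, content words, and suffix words. Every probability table below is an invented placeholder, not an estimate from real seminar data; a real system would count these quantities in labeled announcements and smooth properly.

```python
import math

# Invented placeholder tables (NOT estimated from data).
P_prefix = {"Place": 0.4, ":": 0.3}    # Pr(word in prefix | LOCATION)
P_content = {"Wean": 0.20, "Hall": 0.25, "Rm": 0.15, "5409": 0.05}
P_suffix = {"Speaker": 0.3}            # Pr(word in suffix | LOCATION)
P_len = {4: 0.3}                       # Pr(field length | LOCATION)
UNSEEN = 1e-4                          # crude smoothing for unseen events

def log_score(prefix, contents, suffix):
    """log Pr(window | LOCATION) under the independence assumptions:
    length, prefix, contents, and suffix factor apart."""
    s = math.log(P_len.get(len(contents), UNSEEN))
    for w in prefix:
        s += math.log(P_prefix.get(w, UNSEEN))
    for w in contents:
        s += math.log(P_content.get(w, UNSEEN))
    for w in suffix:
        s += math.log(P_suffix.get(w, UNSEEN))
    return s

# The true LOCATION window should outscore a misplaced one.
good = log_score(["Place", ":"], ["Wean", "Hall", "Rm", "5409"], ["Speaker"])
bad = log_score(["Place", ":"], ["pm", "Place"], ["Speaker"])
```

Working in log space avoids underflow when many small probabilities are multiplied, which is standard practice for naïve Bayes models like this one.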

  44. “Naïve Bayes” Sliding Window Results. Domain: CMU UseNet Seminar Announcements (the same announcement as above). Field F1: Person Name 30%; Location 61%; Start Time 98%. Slides from Cohen & McCallum

  45. Realistic sliding-window-classifier IE • What windows to consider? • All windows containing as many tokens as the shortest example, but no more tokens than the longest example • How to represent a classifier? It might: • Restrict the length of window; • Restrict the vocabulary or formatting used before/after/inside window; • Restrict the relative order of tokens, etc. • Learning Method • SRV: Top-Down Rule Learning [Freitag AAAI ‘98] • Rapier: Bottom-Up [Califf & Mooney, AAAI ‘99] Slides from Cohen & McCallum
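The window-enumeration rule in the first bullet can be sketched directly; the sentence and training-example lengths below are illustrative:

```python
def candidate_windows(tokens, train_lengths):
    """Enumerate windows no shorter than the shortest training example
    and no longer than the longest, per the heuristic above."""
    lo, hi = min(train_lengths), max(train_lengths)
    for i in range(len(tokens)):
        for k in range(lo, hi + 1):
            if i + k <= len(tokens):
                yield i, tokens[i:i + k]

# Suppose labeled fields in training were 1 or 2 tokens long.
tokens = "Abraham Lincoln was born in Kentucky .".split()
wins = list(candidate_windows(tokens, train_lengths=[1, 2]))
# 7 windows of length 1 plus 6 of length 2 = 13 candidates
```

Bounding the window length this way keeps the candidate set linear in the document length (times the length range), instead of quadratic over all spans.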

  46. Rapier: results – precision/recall Slides from Cohen & McCallum

  47. Rule-learning approaches to sliding-window classification: Summary • SRV, Rapier, and WHISK [Soderland KDD ‘97] • Representations for classifiers allow restriction of the relationships between tokens, etc • Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog) • Use of these “heavyweight” representations is complicated, but seems to pay off in results • Can simpler representations for classifiers work? Slides from Cohen & McCallum

  48. BWI: Learning to detect boundaries [Freitag & Kushmerick, AAAI 2000] • Another formulation: learn three probabilistic classifiers: • START(i) = Prob( position i starts a field) • END(j) = Prob( position j ends a field) • LEN(k) = Prob( an extracted field has length k) • Then score a possible extraction (i,j) by START(i) * END(j) * LEN(j-i) • LEN(k) is estimated from a histogram Slides from Cohen & McCallum
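The scoring rule above can be sketched with hypothetical classifier outputs; the START/END probabilities and the length histogram below are invented stand-ins for what BWI would actually learn:

```python
def score_extraction(i, j, start_p, end_p, len_hist, total):
    """Score extraction (i, j) as START(i) * END(j) * LEN(j - i),
    with LEN estimated from a histogram of training field lengths."""
    length_prob = len_hist.get(j - i, 0) / total
    return start_p.get(i, 0.0) * end_p.get(j, 0.0) * length_prob

# Hypothetical classifier outputs on one tokenized document.
start_p = {3: 0.8, 5: 0.1}      # Prob(position i starts a field)
end_p = {5: 0.7, 6: 0.2}        # Prob(position j ends a field)
len_hist = {1: 1, 2: 6, 3: 3}   # field lengths seen in training
total = sum(len_hist.values())  # 10 training fields

best = max(
    ((i, j) for i in start_p for j in end_p if j > i),
    key=lambda ij: score_extraction(*ij, start_p, end_p, len_hist, total),
)
```

Here the pair (3, 5) wins: even though position 6 is a plausible end, the length histogram favors 2-token fields over 3-token ones, which is exactly the role LEN plays in the product.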

  49. BWI: Learning to detect boundaries • BWI uses boosting to find “detectors” for START and END • Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i). • Each “pattern” is a sequence of • tokens and/or • wildcards like: anyAlphabeticToken, anyNumber, … • Weak learner for “patterns” uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE,AFTER patterns Slides from Cohen & McCallum
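The token-and-wildcard patterns can be sketched as a simple matcher. The wildcard names follow the slide's examples; the matching logic itself is an assumption for illustration, not BWI's actual implementation:

```python
def matches(pattern, tokens):
    """Match a BEFORE/AFTER-style token pattern against a token span.
    Pattern elements are literal tokens or the wildcards named on
    the slide (anyAlphabeticToken, anyNumber)."""
    if len(pattern) != len(tokens):
        return False
    for p, t in zip(pattern, tokens):
        if p == "anyAlphabeticToken":
            if not t.isalpha():
                return False
        elif p == "anyNumber":
            if not t.isdigit():
                return False
        elif p != t:  # literal token must match exactly
            return False
    return True

# A BEFORE pattern a START detector might learn for seminar times.
time_before = ["Time", ":"]
```

A weak learner would grow such patterns greedily, one token or wildcard at a time, keeping the extension that best separates true boundaries from false ones on the reweighted (boosted) training set.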
