1 / 49

CSA2050: Natural Language Processing

CSA2050: Natural Language Processing. Information Extraction Information Extraction Named Entities IE Systems MUC Finite State Machines Pattern Recognition. Classification at different granularities. Text Categorization: Classify an entire document Information Extraction (IE):

Download Presentation

CSA2050: Natural Language Processing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CSA2050: Natural Language Processing Information Extraction Information Extraction Named Entities IE Systems MUC Finite State Machines Pattern Recognition CSA3180: Information Extraction I

  2. Classification at different granularities • Text Categorization: • Classify an entire document • Information Extraction (IE): • Identify and classify small units within documents • Named Entity Extraction (NE): • A subset of IE • Identify and classify proper names • People, locations, organizations CSA3180: Information Extraction I

  3. Martin Baker, a person Genomics job Employers job posting form CSA3180: Information Extraction I

  4. Aggregator Websites CSA3180: Information Extraction I

  5. foodscience.com-Job2 JobTitle: Ice Cream Guru Employer: foodscience.com JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest Contact Phone: 800-488-2611 DateExtracted: January 8, 2001 Source: www.foodscience.com/jobs_midwest.html OtherCompanyJobs: foodscience.com-Job1 CSA3180: Information Extraction I

  6. Aggregator Websites • Read in many web pages from different sites • Extract information into a database • Screen Scraping • Can then return data matching particular queries • Data mining can extract meaningful insight that might not have been obvious CSA3180: Information Extraction I

  7. CSA3180: Information Extraction I

  8. Data Mining CSA3180: Information Extraction I

  9. IE from Research Papers CSA3180: Information Extraction I

  10. IE from Commercial Websites CSA3180: Information Extraction I

  11. NAME TITLE ORGANIZATION What is Information Extraction? As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… CSA3180: Information Extraction I

  12. What is Information Extraction? As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… IE NAME TITLE ORGANIZATION Bill GatesCEOMicrosoft Bill VeghteVPMicrosoft Richard StallmanfounderFree Soft.. CSA3180: Information Extraction I

  13. What is Information Extraction? As a familyof techniques: Information Extraction = segmentation + classification + association October 14, 2002, 4:00 a.m. PT For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation aka “named entity extraction” CSA3180: Information Extraction I

  14. What is Information Extraction? A familyof techniques: Information Extraction = segmentation + classification + association October 14, 2002, 4:00 a.m. PT For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation CSA3180: Information Extraction I

  15. What is Information Extraction? A familyof techniques: Information Extraction = segmentation + classification+ association October 14, 2002, 4:00 a.m. PT For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation CSA3180: Information Extraction I

  16. IE in Context Create ontology Spider Filter by relevance IE Segment Classify Associate Cluster Database Load DB Documentcollection Train extraction models Query, Search Data mine Label training data CSA3180: Information Extraction I

  17. Text paragraphs without formatting Grammatical sentencesand some formatting & links Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR. IE in Context: Formatting CSA3180: Information Extraction I

  18. Non-grammatical snippets,rich formatting & links Tables IE in Context: Formatting CSA3180: Information Extraction I

  19. IE in Context: Coverage Web site specific Genre specific Formatting Layout Amazon.com Book Pages Resumes CSA3180: Information Extraction I

  20. IE in Context: Coverage Wide, non-specific Language University Names CSA3180: Information Extraction I

  21. Regular set Closed set U.S. phone numbers U.S. states Phone: (413) 545-1323 He was born in Alabama… The CALD main office can be reached at 412-268-1299 The big Wyoming sky… Ambiguous patterns,needing context andmany sources of evidence Complex pattern U.S. postal addresses Person names University of Arkansas P.O. Box 140 Hope, AR 71802 …was among the six houses sold by Hope Feldman that year. Pawel Opalinski, SoftwareEngineer at WhizBang Labs. Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210 IE in Context: Complexity CSA3180: Information Extraction I

  22. IE in Context: Single Field/Record Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt. Single entity Binary relationship N-ary record Person: Jack Welch Relation: Person-Title Person: Jack Welch Title: CEO Relation: Succession Company: General Electric Title: CEO Out: Jack Welsh In: Jeffrey Immelt Person: Jeffrey Immelt Relation: Company-Location Company: General Electric Location: Connecticut Location: Connecticut “Named entity” extraction CSA3180: Information Extraction I

  23. State of the Art • Named entity recognition from newswire text • Person, Location, Organization, … • F1 in high 80’s or low- to mid-90’s • Binary relation extraction • Contained-in (Location1, Location2)Member-of (Person1, Organization1) • F1 in 60’s or 70’s or 80’s • Web site structure recognition • Extremely accurate performance obtainable • Human effort (~10min?) required on each site CSA3180: Information Extraction I

  24. IE Generations • Hand-Built Systems – Knowledge Engineering [1980s– ] • Rules written by hand • Require experts who understand both the systems and the domain • Iterative guess-test-tweak-repeat cycle • Automatic, Trainable Rule-Extraction Systems [1990s– ] • Rules discovered automatically using predefined templates, using automated rule learners • Require huge, labeled corpora (effort is just moved!) • Statistical Models [1997 – ] • Use machine learning to learn which features indicate boundaries and types of entities. • Learning usually supervised; may be partially unsupervised CSA3180: Information Extraction I

  25. Classify Pre-segmentedCandidates Sliding Window Abraham Lincoln was born in Kentucky. Abraham Lincoln was born in Kentucky. Classifier Classifier which class? which class? Try alternatewindow sizes: Context Free Grammars Boundary Models Finite State Machines Abraham Lincoln was born in Kentucky. Abraham Lincoln was born in Kentucky. Abraham Lincoln was born in Kentucky. BEGIN Most likely state sequence? NNP NNP V V P NP Most likely parse? Classifier PP which class? VP NP VP BEGIN END BEGIN END S IE Techniques Lexicons Abraham Lincoln was born in Kentucky. member? Alabama Alaska … Wisconsin Wyoming CSA3180: Information Extraction I

  26. Pros Annotating text is simpler & faster than writing rules. Domain independent Domain experts don’t need to be linguists or programers. Learning algorithms ensure full coverage of examples. Trainable IE Systems Cons • Hand-crafted systems perform better, especially at hard tasks. (but this is changing) • Training data might be expensive to acquire • May need huge amount of training data • Hand-writing rules isn’t that hard!! CSA3180: Information Extraction I

  27. MUC: Genesis of IE • DARPA funded significant efforts in IE in the early to mid 1990’s. • Message Understanding Conference (MUC) was an annual event/competition where results were presented. • Focused on extracting information from news articles: • Terrorist events • Industrial joint ventures • Company management changes • Information extraction of particular interest to the intelligence community (CIA, NSA). (Note: early ’90’s) CSA3180: Information Extraction I

  28. MUC • Named entity • Person, Organization, Location • Co-reference • Clinton President Bill Clinton • Template element • Perpetrator, Target • Template relation • Incident • Multilingual CSA3180: Information Extraction I

  29. MUC Typical Text Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month CSA3180: Information Extraction I

  30. MUC Typical Text Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with alocal concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month CSA3180: Information Extraction I

  31. MUC Templates • Relationship • tie-up • Entities: • Bridgestone Sports Co, a local concern, a Japanese trading house • Joint venture company • Bridgestone Sports Taiwan Co • Activity • ACTIVITY 1 • Amount • NT$2,000,000 CSA3180: Information Extraction I

  32. MUC Templates • ATIVITY 1 • Activity • Production • Company • Bridgestone Sports Taiwan Co • Product • Iron and “metal wood” clubs • Start Date • January 1990 CSA3180: Information Extraction I

  33. Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$200000000 Example from Fastus (1993) CSA3180: Information Extraction I

  34. Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. ACTIVITY-1 Activity: PRODUCTION Company: “Bridgestone Sports Taiwan Co.” Product: “iron and ‘metal wood’ clubs” Start Date: DURING: January 1990 TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$200000000 CSA3180: Information Extraction I

  35. Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month. ACTIVITY-1 Activity: PRODUCTION Company: “Bridgestone Sports Taiwan Co.” Product: “iron and ‘metal wood’ clubs” Start Date: DURING: January 1990 TIE-UP-1 Relationship: TIE-UP Entities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house” Joint Venture Company: “Bridgestone Sports Taiwan Co.” Activity: ACTIVITY-1 Amount: NT$200000000 CSA3180: Information Extraction I

  36. production of 20, 000 iron and metal wood clubs [company] [set up] [Joint-Venture] with [company] 1.Complex Words: Recognition of multi-words and proper names set up new Taiwan dollars 2.Basic Phrases: Simple noun groups, verb groups and particles a Japanese trading house had set up 3.Complex phrases: Complex noun groups and verb groups 4.Domain Events: Patterns for events of interest to the application Basic templates are to be built. 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event. CSA3180: Information Extraction I

  37. Evaluating IE Accuracy • Always evaluate performance on independent, manually-annotated test data not used during system development. • Measure for each test document: • Total number of correct extractions in the solution template: N • Total number of slot/value pairs extracted by the system: E • Number of extracted slot/value pairs that are correct (i.e. in the solution template): C • Compute average value of metrics adapted from IR: • Recall = C/N • Precision = C/E • F-Measure = Harmonic mean of recall and precision CSA3180: Information Extraction I

  38. Named Entities • Named Entities • Person Name: Colin Powell, Frodo • Location Name: Middle East, Aiur • Organization: UN, DARPA • Domain Specific vs. Open Domain CSA3180: Information Extraction I

  39. Nymble (BBN Corporation) • State of the art system • Near-human performance ~90% accuracy • Statistical system • Approach: Hidden Markov Model (HMM) CSA3180: Information Extraction I

  40. Nymble (BBN Corporation) • Noisy channel paradigm • Originally, entities were marked in the raw text • Post noisy channel, annotation is lost • Probability of most likely sequence of name classes (NC) given a sequence of words (W) Pr(NC|W) = Pr(W,NC) / Pr(W) since the a priori probability of the word sequence can be considered constant for any given sentence  maximize just numerator CSA3180: Information Extraction I

  41. Nymble (BBN Corporation) Person Start of Sentence End of Sentence Organization Five other classes Not-A-Name CSA3180: Information Extraction I

  42. Automatic Content Extraction • DARPA ACE Program • Identify Entities • Named: Bilbo, San Diego, UNICEF • Nominal: the president, the hobbit • Pronominal: she • Reference resolution • Clinton  the president  he CSA3180: Information Extraction I

  43. Question Answering The over-used pipeline paradigm: Question Analysis Information Retrieval Answer Extraction Question Answer Answer Merging CSA3180: Information Extraction I

  44. Question Answering • Feedback loops can be present for constraint relaxation purposes • Not all QA systems adhere to the pipeline architecture • Question answering flavors • Factoid vs. complex • Who invented paper? vs. Which of Mr. Bush’s friends are Black Sabbath fans? • Closed vs. open domain CSA3180: Information Extraction I

  45. Answer Extraction The over-used pipeline paradigm: • Focus on open domain, factoid question answering Question Analysis Information Retrieval Answer Extraction Question Answer Answer Merging CSA3180: Information Extraction I

  46. Practical Issues • Web Spell Checking • Mispling • nucular • Infrequent forms: • Niagra vs. Niagara, Filenes vs. Filene’s • Google QA • Genome, video, games CSA3180: Information Extraction I

  47. Practical Issues • Traditional Information Extraction • Either expert built or statistical • Specific strategies for specific question types • Person Bio vs. Location question types • Ability to generalize to new questions and new question types CSA3180: Information Extraction I

  48. Practical Issues • Who invented Blah? Blah was invented by PersonName Blah was Verb by PersonName where Verb is synonym to invented Blah VerbPhrase by PersonName CSA3180: Information Extraction I

  49. Popular Resources • Experts and/or Learning Algorithms • Gazeteers • NE taggers • Part Of Speech taggers • Parsers • Wordnet • Stopword list • Stemmer CSA3180: Information Extraction I

More Related