Extraction as Classification


  1. Extraction as Classification

  2. What is “Information Extraction”? As a task: filling slots in a database from sub-segments of text.
  October 14, 2002, 4:00 a.m. PT: For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels -- the coveted code behind the Windows operating system -- to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
  IE output (slots filled from the story, feeding QA systems and end users):
  NAME              TITLE    ORGANIZATION
  Bill Gates        CEO      Microsoft
  Bill Veghte       VP       Microsoft
  Richard Stallman  founder  Free Soft..

  3. What is “Information Extraction”? As a family of techniques: Information Extraction = segmentation + classification + association + clustering. (Segmentation plus classification of the segments is aka Named Entity Recognition.)
  Applied to the same news story as above, segmentation finds the entity mentions: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation.

  4. Landscape of IE Tasks (1/4): Degree of Formatting
  • Text paragraphs without formatting: "Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR."
  • Grammatical sentences and some formatting & links
  • Non-grammatical snippets, rich formatting & links
  • Tables

  5. Landscape of IE Tasks (3/4): Complexity of the extraction task, e.g. word patterns:
  • Closed set (U.S. states): "He was born in Alabama…", "The big Wyoming sky…"
  • Regular set (U.S. phone numbers): "Phone: (413) 545-1323", "The CALD main office can be reached at 412-268-1299"
  • Complex pattern (U.S. postal addresses): "University of Arkansas, P.O. Box 140, Hope, AR 71802", "Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210"
  • Ambiguous patterns, needing context and many sources of evidence (person names): "…was among the six houses sold by Hope Feldman that year.", "Pawel Opalinski, Software Engineer at WhizBang Labs."
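
To make the easier categories concrete, here is a minimal sketch (in Python; the pattern and the abbreviated state list are illustrative, not taken from the slides): a closed set reduces to dictionary membership, while a regular set such as U.S. phone numbers yields to a plain regular expression.

```python
import re

# Illustrative regex for the "regular set" category: U.S. phone numbers
# written as (413) 545-1323 or 412-268-1299.
PHONE = re.compile(r"\(?\b\d{3}\)?[ -]\d{3}-\d{4}\b")

# A "closed set" such as U.S. states is just membership in a fixed list.
STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}  # ... all 50 in practice

text = "The CALD main office can be reached at 412-268-1299."
print(PHONE.findall(text))   # ['412-268-1299']
print("Alabama" in STATES)   # True
```

The last two categories are exactly where such hand-written patterns stop working and learned extractors earn their keep.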

  6. Landscape of IE Tasks (4/4): Single Field / Record
  "Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt."
  • Single entity ("named entity" extraction): Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut
  • Binary relationship: Relation: Person-Title (Person: Jack Welch, Title: CEO); Relation: Company-Location (Company: General Electric, Location: Connecticut)
  • N-ary record: Relation: Succession (Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt)

  7. Models for NER (running example: "Abraham Lincoln was born in Kentucky.")
  • Lexicons: is the token a member of a list? (Alabama, Alaska, …, Wisconsin, Wyoming)
  • Classify pre-segmented candidates: a classifier asks "which class?" for each candidate span
  • Sliding window: a classifier asks "which class?"; try alternate window sizes
  • Boundary models: classifiers mark BEGIN and END positions
  • Token tagging: what is the most likely state sequence? This is often treated as a structured prediction problem, classifying tokens sequentially (HMMs, CRFs, …); see the sketch below
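
A minimal sketch of the token-tagging view using the common BIO labeling scheme (the scheme, labels, and decoder are illustrative; the slide does not commit to a particular tag set):

```python
# BIO-style token tagging: each token gets B-x (begins entity x),
# I-x (inside x), or O (outside any entity).
tokens = ["Abraham", "Lincoln", "was", "born", "in", "Kentucky", "."]
tags   = ["B-PER",   "I-PER",   "O",   "O",    "O",  "B-LOC",    "O"]

def decode_spans(tokens, tags):
    """Turn a BIO tag sequence back into (entity_type, text) spans."""
    spans, start, kind = [], None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((kind, " ".join(tokens[start:i])))
            start, kind = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        # an "I-" tag simply extends the current span
    return spans

print(decode_spans(tokens, tags))  # [('PER', 'Abraham Lincoln'), ('LOC', 'Kentucky')]
```

A sequence model (HMM, CRF, …) supplies the tags; the decoder above only converts them into spans.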

  8. MUC-7
  • Last Message Understanding Conference (forerunner to ACE), about 1998
  • 200 articles in the development set (newswire text about aircraft accidents)
  • 200 articles in the final test (launch events)
  • Names of: persons, organizations, locations, dates, times, currency & percentages

  9. MUC-7 tasks
  • NE: named entity recognition
  • TE: Template Elements (attributes)
  • TR: Template Relations (binary relations)
  • ST: Scenario Template (events)
  • CO: coreference

  10. [MUC-7 results chart: LTG, NetOwl (commercial rule-based system), IdentiFinder (HMMs), MENE+Proteus, Manitoba (NB-filtered names)]

  11. Borthwick, Sterling, Agichtein, Grishman: the MENE system

  12. Borthwick et al.: MENE system
  • Simple idea: tag every token
  • 4 tags per field, for x = person, organization, …: x_start, x_continue, x_end, x_unique
  • Also "other"
  • To extract:
  • Compute P(y|xi) for each token xi
  • Use Viterbi to find the most likely consistent sequence: continue follows start; end follows continue or start; …

  13. Borthwick et al.: MENE system (cont’d)
  • Simple idea: tag every token
  • 4 tags per field, for x = person, organization, …: x_start, x_continue, x_end, x_unique
  • Also "other"
  • Learner: maxent/logistic regression, regularized by dropping rare features
  • To extract:
  • Compute P(y|xi) for each token xi
  • Use Viterbi to find the most likely consistent sequence: continue follows start; end follows continue or start; …
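
A sketch of MENE's tag set and of the consistency constraints as I read the bullets above (the tag names follow the slide; the exact legality rules are my interpretation):

```python
# Per field x: x_start, x_continue, x_end, x_unique, plus a shared "other".
FIELDS = ["person", "organization"]
TAGS = ["other"] + [f"{x}_{p}" for x in FIELDS
                    for p in ("start", "continue", "end", "unique")]

def legal_transition(prev, cur):
    """My reading of the constraints: an open entity (after _start or
    _continue) must be continued or ended in the same field; otherwise
    only "other", x_start, or x_unique may follow."""
    if prev.endswith("_start") or prev.endswith("_continue"):
        field = prev.rsplit("_", 1)[0]
        return cur in (f"{field}_continue", f"{field}_end")
    return cur == "other" or cur.endswith(("_start", "_unique"))

print(legal_transition("person_start", "person_continue"))  # True
print(legal_transition("other", "person_end"))              # False
```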

  14. Viterbi in MENE: find the consistent tag sequence that maximizes the product of the per-token conditional probabilities,
  $\hat{y}_{1\ldots n} = \arg\max_{y_{1\ldots n}\ \text{consistent}} \prod_{i=1}^{n} P(y_i \mid x_i)$
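
A minimal Viterbi sketch for that objective, reusing TAGS and legal_transition from the previous sketch (the probability interface is an assumption, not MENE's actual code):

```python
import math

def viterbi(prob, tags, legal):
    """prob[i][t] = P(t | token i's history) from the per-token classifier;
    legal(prev, cur) tests transitions. Returns the most probable tag
    sequence that is legal throughout."""
    best = {t: (math.log(prob[0].get(t, 1e-12)), [t])
            for t in tags if legal("other", t)}       # legal opening tags only
    for dist in prob[1:]:
        step = {}
        for cur in tags:
            cands = [(lp + math.log(dist.get(cur, 1e-12)), path + [cur])
                     for prev, (lp, path) in best.items() if legal(prev, cur)]
            if cands:
                step[cur] = max(cands)
        best = step
    # (a fuller version would also require the last tag to close any open entity)
    return max(best.values())[1]

# toy per-token distributions for "Bill Gates said"
prob = [{"person_start": 0.6, "person_unique": 0.3, "other": 0.1},
        {"person_end": 0.7, "other": 0.3},
        {"other": 0.9}]
print(viterbi(prob, TAGS, legal_transition))
# ['person_start', 'person_end', 'other']
```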

  15. Borthwick et al.: MENE system
  • Features g(h,f) for the loglinear model are functions of the "history" h (features of the token and its context) and the "future" f (the predicted class)
  • Lexical features combine the identity of a token in a window with the predicted class
  • e.g. g(h,f) = [token-1(h) = "Mr." and f = person_unique]
  • Section features combine the section name & the class
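
A sketch of the indicator-feature pattern g(h,f), reproducing the slide's "Mr." example (the dict representation of the history h is an assumption):

```python
def lexical_feature(offset, word, future):
    """Build an indicator g(h, f) such as
    [token-1(h) = "Mr." and f = person_unique]."""
    def g(history, f):
        # history: tokens around the current position, keyed by offset
        return 1 if history.get(offset) == word and f == future else 0
    return g

g = lexical_feature(-1, "Mr.", "person_unique")
print(g({-1: "Mr.", 0: "Smith"}, "person_unique"))  # 1
print(g({-1: "Dr.", 0: "Smith"}, "person_unique"))  # 0
```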

  16. Borthwick et al.: MENE system
  • Dictionary features g(h,f) for the loglinear model:
  • Match each multi-word dictionary d against the text
  • For matched token sequences record d_start, d_cont, …
  • Combine these values with the category
  • e.g. g(h,f) = [places0(h) = "places_uniq" and f = organization_start]
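
A sketch of the multi-word dictionary matching step, recording start/continue/end/unique flags per token as the d_start, d_cont bullets suggest (the longest-match strategy is an assumption):

```python
def dictionary_tags(tokens, entries):
    """Mark tokens covered by dictionary entries with start/continue/end/
    unique flags; these values are then combined with the predicted class."""
    tags = ["none"] * len(tokens)
    for i in range(len(tokens)):
        # prefer the longest dictionary entry starting at token i
        for j in range(len(tokens), i, -1):
            if tuple(tokens[i:j]) in entries:
                if j - i == 1:
                    tags[i] = "unique"
                else:
                    tags[i:j] = ["start"] + ["continue"] * (j - i - 2) + ["end"]
                break
    return tags

places = {("New", "York"), ("Kentucky",)}
print(dictionary_tags(["born", "in", "New", "York"], places))
# ['none', 'none', 'start', 'end']
```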

  17. Dictionaries in MENE

  18. Borthwick et al.: MENE system
  • External system features g(h,f) for the loglinear model:
  • Run someone else’s system s on the text
  • For token sequences record sx_start, sx_cont, …
  • Combine these values with the category
  • e.g. g(h,f) = [proteus0(h) = "places_uniq" and f = organization_start]

  19. MENE results (dry run)

  20. MENE learning curves [figure; reported scores include 92.2, 93.3, and 96.3]

  21. Longer names vs. short names (full forms on first mention, short forms like "Comcast" and "Disney" afterwards)
  • Largest U.S. Cable Operator Makes Bid for Walt Disney
  • By ANDREW ROSS SORKIN
  • The Comcast Corporation, the largest cable television operator in the United States, made a $54.1 billion unsolicited takeover bid today for The Walt Disney Company, the storied family entertainment colossus.
  • If successful, Comcast's audacious bid would once again reshape the entertainment landscape, creating a new media behemoth that would combine the power of Comcast's powerful distribution channels to some 21 million subscribers in the nation with Disney's vast library of content and production assets. Those include its ABC television network, ESPN and other cable networks, and the Disney and Miramax movie studios.

  22. LTG system
  • Another MUC-7 competitor
  • Hand-coded rules for "easy" cases (amounts, etc.)
  • A process of repeated tagging and "matching" for hard cases:
  • Sure-fire (high-precision) rules for names whose type is clear ("Phillip Morris, Inc", "The Walt Disney Company")
  • Partial matches to sure-fire rules are filtered with a maxent classifier (candidate filtering) using contextual information, etc.
  • Higher-recall rules, avoiding conflicts with the partial-match output ("Phillip Morris announced today…", "Disney’s…")
  • A final partial-match & filter step on titles, with a different learned filter
  • Exploits discourse/context information (a sketch of the cascade follows)
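
A sketch of that cascade under stated assumptions (the sure-fire rule, the head-word heuristic for partial matches, and the filter interface are all illustrative stand-ins):

```python
import re

def ltg_cascade(text, surefire_rules, keep):
    """Phase 1: sure-fire, high-precision rules. Phase 2: shorter partial
    matches of the names found in phase 1, kept only if the learned filter
    (the callable `keep`, a maxent classifier in LTG) approves them."""
    names = set()
    for pattern, kind in surefire_rules:
        for m in re.finditer(pattern, text):
            names.add((m.group(1), kind))
    for full, kind in list(names):
        head = full.split()[0]                 # e.g. "Phillip" of "Phillip Morris"
        for m in re.finditer(r"\b%s\b" % re.escape(head), text):
            if keep(text, m.start(), kind):    # stand-in for the learned filter
                names.add((head, kind))
    return names

rules = [(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*), Inc", "ORG")]
text = "Phillip Morris, Inc announced today that Phillip Morris will..."
print(ltg_cascade(text, rules, keep=lambda t, pos, kind: True))
# e.g. {('Phillip Morris', 'ORG'), ('Phillip', 'ORG')} (set order varies)
```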

  23. LTG Results

  24. [MUC-7 results chart, repeated: LTG, NetOwl (commercial rule-based system), IdentiFinder (HMMs), MENE+Proteus, Manitoba (NB-filtered names)]

  25. Jansche & Abney paper
  "He was a grammarian, and could doubtless see further into the future than others." -- J.R.R. Tolkien, "Farmer Giles of Ham"

  26. Background on JA paper

  27. SCAN: Search & Summarization for Audio Collections (AT&T Labs)

  28. Why IE from personal voicemail?
  • A unified interface for email, voicemail, fax, … requires uniform headers: Sender, Time, Subject, …
  • Headers are key for a uniform interface
  • Independently, voicemail access is slow: it is useful to have fast access to the important parts of a message (contact number, caller)

  29. Background on JA, cont’d
  • Quick review of Huang, Zweig & Padmanabhan (IBM Yorktown), "Information Extraction from Voicemail":
  • Goal: find the identity and contact number of callers in voicemail (NER + role classification)
  • Evaluated three systems on ~5000 labeled, manually transcribed messages:
  • Baseline system: 200 hand-coded rules based on "trigger phrases"
  • State-of-the-art Ratnaparkhi-style MaxEnt tagger: lexical unigrams, bigrams, dictionary features for names, numbers, and "trigger phrases", plus feature selection
  • Poor results:
  • On manually transcribed data, F1 in the 80s for both tasks (good!)
  • On ASR data, F1 about 50% for caller names and 80% for contact numbers, even with a very loose performance metric
  • The best learning method barely beat the baseline rule-based system.

  30. What’s interesting in this paper
  • How and when do we use ML?
  • Robust information extraction: generalizing from manual transcripts (i.e., human-produced written versions of voicemail) to automatic (ASR) transcripts
  • The place of hand-coding vs. learning in information extraction
  • How to break up the task; where and how to use engineering
  [Pipeline diagram: candidate generator → candidate phrase → learned filter → extracted phrase]

  31. Voicemail corpus • About 10,000 manually transcribed and annotated voice messages. • 1869 used for evaluation • Not quite the usual NER task: we only want the caller’s name

  32. Observation: caller phrases are short and near the beginning of the message.

  33. Caller-phrase extraction
  • Propose start positions i1, …, iN
  • Use a learned decision tree to pick the best start i
  • Propose end positions i+j1, i+j2, …, i+jM
  • Use a learned decision tree to pick the best end offset j
  (a sketch follows)
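
A sketch of the two-stage scheme (the candidate ranges and toy scorers are illustrative; the paper's scorers are learned decision trees):

```python
def extract_caller_phrase(tokens, score_start, score_end, max_len=8):
    """Pick the best start with one learned scorer, then the best end with
    another, exploiting the fact that caller phrases open the message."""
    starts = range(min(len(tokens), 10))       # only early start positions
    i = max(starts, key=lambda s: score_start(tokens, s))
    ends = range(i + 1, min(len(tokens), i + max_len) + 1)   # short phrases
    j = max(ends, key=lambda e: score_end(tokens, i, e))
    return tokens[i:j]

# toy scorers standing in for the paper's decision trees
score_start = lambda toks, s: 1.0 if s > 0 and toks[s - 1] == "is" else 0.0
score_end = lambda toks, s, e: 1.0 if e < len(toks) and toks[e] == "from" else 0.0

tokens = "hi this is Jane Smith from accounting".split()
print(extract_caller_phrase(tokens, score_start, score_end))  # ['Jane', 'Smith']
```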

  34. Baseline (HZP; Collins log-linear)
  • IE as tagging, similar to Borthwick:
  • Pr(tag_i | word_i, word_{i-1}, …, word_{i+1}, …, tag_{i-1}, …) estimated via a MaxEnt model
  • Beam search to find the best tag sequence given the word sequence (we’ll talk more about this next week); see the sketch below
  • Features of the model are words, word pairs, word-pair + tag trigrams, …
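
A minimal beam-search sketch for this kind of tagger: keep only the k best partial tag sequences at each position instead of running the full dynamic program (the probability interface and beam width are assumptions):

```python
import math

def beam_search(n, tags, prob, k=3):
    """prob(i, prefix, tag) -> P(tag_i | words, earlier tags). Keeps the k
    highest-scoring partial tag sequences at each position."""
    beam = [(0.0, [])]                         # (log-probability, tag prefix)
    for i in range(n):
        cands = [(lp + math.log(max(prob(i, path, t), 1e-12)), path + [t])
                 for lp, path in beam for t in tags]
        beam = sorted(cands, key=lambda c: c[0], reverse=True)[:k]
    return beam[0][1]

# toy model: token 1 looks like a name, the rest do not
toy = lambda i, path, t: 0.9 if (t == "name") == (i == 1) else 0.1
print(beam_search(3, ["name", "other"], toy))  # ['other', 'name', 'other']
```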

  35. Performance

  36. Observation: caller names are really short and near the beginning of the message.

  37. What about ASR transcripts?

  38. Extracting phone numbers
  • Phase 1: a hand-coded grammar proposes candidate phone numbers
  • Not too hard, due to the limited vocabulary
  • Optimize recall (96%), not precision (30%)
  • Phase 2: a learned decision tree filters the candidates
  • Uses length, position, and a few context features
  (a sketch follows)
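
A sketch of the two-phase extractor under stated assumptions: a recall-oriented pattern over the spoken-digit vocabulary over-generates candidates, and a stand-in for the learned decision tree prunes them on length and position features:

```python
import re

# Phase 1: recall-oriented hand-coded pattern; it over-generates on purpose.
DIGIT = r"(?:zero|oh|one|two|three|four|five|six|seven|eight|nine)"
CANDIDATE = re.compile(r"(?:\b%s\b[ -]?){7,11}" % DIGIT)

def extract_numbers(transcript, keep):
    """Phase 2: prune candidates with a learned filter (a decision tree on
    length, position, and context in the paper; any callable here)."""
    out = []
    for m in CANDIDATE.finditer(transcript):
        feats = {"length": len(m.group().split()), "position": m.start()}
        if keep(feats):
            out.append(m.group().strip())
    return out

t = "call me back at five five five one two one two thanks"
print(extract_numbers(t, keep=lambda f: f["length"] >= 7))
# ['five five five one two one two']
```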

  39. Results

  40. Their Conclusions
