Information Extraction

Presentation Transcript


  1. Information Extraction Day 18

  2. HW#5 • The mystery languages: 1 2 3 4 5 6 7 8 9 10

  3. HW#5 • The mystery languages: 1 - Afrikaans 2 - Breton 3 - Basque 4 - Catalan 5 - Croatian 6 - Icelandic 7 - Indonesian 8 - Manx 9 - Frisian 10 - Estonian

  4. Roadmap • Information extraction: • Definition & Motivation • General approach • Case Study of Extracting Structured Language Data

  7. Information Extraction • Motivation: • Unstructured data hard to manage, assess, analyze • Many tools for manipulating structured data: • Databases: efficient search, agglomeration • Statistical analysis packages • Decision support systems • Goal: • Given unstructured information in texts • Convert to structured data • E.g., populate DBs

  8. IE Example • Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco. • Lead airline: • Amount: • Effective Date: • Follower:

  9. IE Example • Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco. • Lead airline: United Airlines • Amount: $6 • Effective Date: 2006-10-26 • Follower: American Airlines
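
The filled template on this slide is effectively one database row. A minimal sketch, assuming a hypothetical `fare_raise` table whose columns mirror the four slots (the table name, column names, and the choice of SQLite are illustrative, not part of the lecture):

```python
import sqlite3

# Hypothetical schema: one column per template slot from the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fare_raise ("
    "lead_airline TEXT, amount_usd INTEGER, effective_date TEXT, follower TEXT)"
)

# Slot values copied from the filled template on the slide.
record = ("United Airlines", 6, "2006-10-26", "American Airlines")
conn.execute("INSERT INTO fare_raise VALUES (?, ?, ?, ?)", record)

# Once the text is structured, ordinary queries apply.
rows = conn.execute(
    "SELECT lead_airline, follower FROM fare_raise WHERE amount_usd >= 5"
).fetchall()
print(rows)
```

This is the sense in which extraction "populates DBs": each matching news story would contribute one such row.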

  14. Other Applications • Shared tasks: • Message Understanding Conferences (MUC) • Joint ventures/mergers: • CompanyA, CompanyB, CompanyC, amount, type • Terrorist incidents • Incident type, Actor, Victim, Date, Location • General: • Pricegrabber • Stock market analysis • Apartment finder

  18. Information Extraction Components • Incorporates range of techniques, tools • FSAs (pattern extraction) • Supervised learning: • Classification • Sequence labeling • Semi- and unsupervised learning: clustering, bootstrapping • Part-of-Speech tagging • Named Entity Recognition • Chunking • Parsing
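
As a rough illustration of the FSA (pattern extraction) component: regular expressions describe finite-state patterns, so a small regex can pull out simple entity types. The two patterns below are assumptions made for this sketch, not the patterns used in the lecture:

```python
import re

# Illustrative finite-state patterns (assumed for this sketch).
MONEY = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")
WEEKDAY = re.compile(r"\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b")

text = (
    "United Airlines said Friday it has increased fares by $6 per round "
    "trip. The increase took effect Thursday."
)

money = MONEY.findall(text)
days = WEEKDAY.findall(text)
print(money, days)
```

Hand-written patterns like these tend to be precise but brittle, which is one reason the slide also lists supervised and semi-supervised learners.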

  21. Information Extraction Process • Typically pipeline of major components • First step: Named Entity Recognition (NER) • Often application-specific • Generic type: Person, Organization, Location • Specific: genes to college courses • Identify NE mentions: • specific instances

  22. Example: Pre-NER • Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.

  23. Example: post-NER • Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].
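
Sequence labelers usually work with per-token tags rather than bracketed spans. A minimal sketch that converts bracket notation into BIO tags, assuming the label is separated from the mention by a space (`[ORG United Airlines]`) and that whitespace tokenization is good enough; `to_bio` is a hypothetical helper name:

```python
import re

# One bracketed span: an uppercase label, a space, then the mention text.
SPAN = re.compile(r"\[([A-Z]+) ([^\]]+)\]")

def to_bio(annotated: str) -> list[tuple[str, str]]:
    """Convert '[ORG United Airlines] said' into (token, BIO-tag) pairs."""
    pairs = []
    pos = 0
    for m in SPAN.finditer(annotated):
        # Tokens outside any bracket get the 'O' (outside) tag.
        pairs += [(tok, "O") for tok in annotated[pos:m.start()].split()]
        label, words = m.group(1), m.group(2).split()
        pairs.append((words[0], f"B-{label}"))           # first token of the span
        pairs += [(w, f"I-{label}") for w in words[1:]]  # remaining tokens
        pos = m.end()
    pairs += [(tok, "O") for tok in annotated[pos:].split()]
    return pairs

sample = "[ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6]"
bio = to_bio(sample)
for token, tag in bio:
    print(f"{token}\t{tag}")
```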

  26. Information Extraction: Reference Resolution • Builds on NER • Links entity mentions referring to same things • Creates entity equivalence classes • Includes anaphora resolution • E.g., pronoun resolution (it, he, she…; one; this, that) • Also non-anaphoric, general mentions
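
To make "entity equivalence classes" concrete, here is a deliberately naive sketch: two organization mentions are linked when one is a token prefix of the other (so "United" joins "United Airlines"). The heuristic and the function names are assumptions for illustration only; it does not handle pronouns such as "it", and real resolvers use much richer evidence:

```python
def is_token_prefix(a: list[str], b: list[str]) -> bool:
    """True if the shorter token list is a prefix of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    return longer[:len(shorter)] == shorter

def cluster_mentions(mentions: list[str]) -> list[set[str]]:
    """Greedily place each mention into the first class it matches."""
    classes: list[set[str]] = []
    for mention in mentions:
        tokens = mention.split()
        for cls in classes:
            if any(is_token_prefix(tokens, other.split()) for other in cls):
                cls.add(mention)
                break
        else:
            # No existing class matched: start a new equivalence class.
            classes.append({mention})
    return classes

orgs = ["United Airlines", "American Airlines", "AMR Corp.", "United", "UAL Corp."]
classes = cluster_mentions(orgs)
print(classes)
```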

  34. Relation Detection & Classification • Relation detection & classification: • Find and classify relations among entities in text • Mix of application-specific and generic relations • Generic relations: • Family, employment, part-whole, membership, geospatial • Generic relations in the example: • United is part-of UAL Corp. • American Airlines is part-of AMR Corp. • Tim Wagner is an employee of American Airlines • Domain-specific relations: • United serves Chicago; United serves Denver, etc.
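
The part-of relations in the example are signalled by the appositive "X, a unit of Y". A minimal pattern-based sketch over NER-bracketed text, assuming `[LABEL mention]` notation with a space after the label; the single pattern is an illustration, not a general relation classifier:

```python
import re

# An ORG mention, then ", a unit of", then another ORG mention.
UNIT_OF = re.compile(r"\[ORG ([^\]]+)\], a unit of \[ORG ([^\]]+)\]")

annotated = (
    "[ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched "
    "the move. [ORG United], a unit of [ORG UAL Corp.], said the increase "
    "took effect [TIME Thursday]."
)

relations = [
    ("part-of", child, parent) for child, parent in UNIT_OF.findall(annotated)
]
print(relations)
```

Domain-specific relations such as "serves" would need their own cues, or a classifier trained over entity pairs.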

  38. Event Detection & Classification • Event detection and classification: • Find key events • Fare increase by United • Matching fare increase by American • Reporting of fare increases • How cued? • Instances of “said” and “cite”
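
Reporting events can be spotted by their cue words. A small sketch that flags them with a trigger list; the inflected forms beyond "said" and "cite" are an assumption added for this example, and the sentence split is done by hand:

```python
import re

# Cue words from the slide, expanded to a few inflected forms (assumed).
REPORTING = re.compile(r"\b(said|says|say|cite[sd]?|citing)\b", re.IGNORECASE)

sentences = [
    "Citing high fuel prices, United Airlines said Friday it has increased fares by $6.",
    "American Airlines immediately matched the move, spokesman Tim Wagner said.",
    "The increase applies to most routes.",
]

events = []
for index, sentence in enumerate(sentences):
    for match in REPORTING.finditer(sentence):
        # Record (sentence index, event type, trigger word).
        events.append((index, "REPORTING", match.group(0).lower()))
print(events)
```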

  43. Temporal Expressions: Recognition and Analysis • Temporal expression recognition • Finds temporal expressions in text • E.g., Friday, Thursday in example text • Others: • Date expressions: • Absolute: days of week, holidays • Relative: two days from now, next year • Clock times: • noon, 3:30pm • Temporal analysis: • Map expressions to points/spans on timeline • Anchor: e.g., Friday, Thursday: tied to example dateline
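
Anchoring can be sketched as date arithmetic against the dateline. The example assumes a dateline of Friday 2006-10-27 (inferred from the effective date 2006-10-26 in the filled template, not stated in the slides) and assumes that a bare weekday in news text refers to the most recent such day on or before the dateline; `resolve_weekday` is a hypothetical helper:

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve_weekday(name: str, dateline: date) -> date:
    """Map a weekday name to the most recent such day on or before the dateline."""
    target = WEEKDAYS.index(name.lower())  # Monday == 0, matching date.weekday()
    days_back = (dateline.weekday() - target) % 7
    return dateline - timedelta(days=days_back)

# Assumed dateline for the airline story (a Friday).
dateline = date(2006, 10, 27)
print(resolve_weekday("Friday", dateline))    # 2006-10-27
print(resolve_weekday("Thursday", dateline))  # 2006-10-26
```

Relative expressions such as "next year" would need their own rules in addition to the anchor.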

  45. Template Filling • Texts describe stereotypical situations in domain • Identify texts evoking template situations • Fill slots in templates based on text • In example: fare-raise is stereotypical event
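
A fare-raise template can be sketched as a record whose slots are filled from the output of the earlier stages. The event dictionaries below stand in for that upstream output; their keys (`type`, `actor`, `amount`, `date`) and the class and function names are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FareRaise:
    """Template with one field per slot; unfilled slots stay None."""
    lead_airline: Optional[str] = None
    amount: Optional[str] = None
    effective_date: Optional[str] = None
    follower: Optional[str] = None

def fill_template(events: list[dict]) -> FareRaise:
    template = FareRaise()
    for event in events:
        if event["type"] == "fare-increase":
            template.lead_airline = event["actor"]
            template.amount = event.get("amount")
            template.effective_date = event.get("date")
        elif event["type"] == "fare-match":
            template.follower = event["actor"]
    return template

# Hypothetical output of the event and temporal stages for the airline story.
events = [
    {"type": "fare-increase", "actor": "United Airlines",
     "amount": "$6", "date": "2006-10-26"},
    {"type": "fare-match", "actor": "American Airlines"},
]
filled = fill_template(events)
print(filled)
```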

  46. Information Extraction Steps • Named Entity Recognition • Reference Resolution • Relation Detection & Classification • Event Detection & Classification • Temporal Expression Recognition & Analysis • Template-filling

  47. Case Study: Extracting Semi-Structured Language Data (“Template Filling”) • IE and language data

  48. Linguistic Document • Page from scholarly linguistic paper • Note the embedded annotated language data • Identify characteristics • Of the data with respect to the rest of the document • Of the data itself

  49. Linguistic Document • Plan: Extract data of this type from linguistic docs • Find linguistic docs (crawl) • Identify docs that contain this kind of data • Identify the language for each instance • Align instances with • Papers/Titles • Authors • Languages • Extract the data • Preserve and align annotations • Enrich the data with further annotations
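
The "identify docs that contain this kind of data" step could start from a surface heuristic. The sketch below flags three consecutive lines where the first two have the same number of whitespace tokens and the third is wrapped in quotes. This heuristic, the function name `find_igt`, and the German sentence are illustrative assumptions, not the method or data from the lecture:

```python
def find_igt(lines: list[str]) -> list[int]:
    """Return start indices of 3-line windows that look like glossed examples."""
    hits = []
    for i in range(len(lines) - 2):
        transcription, gloss, translation = (ln.strip() for ln in lines[i:i + 3])
        if not (transcription and gloss and translation):
            continue
        # Word-by-word glosses usually have one gloss per transcribed word.
        same_width = len(transcription.split()) == len(gloss.split())
        # Free translations are conventionally set in quotes.
        quoted = translation[0] in "'\"‘“" and translation[-1] in "'\"’”"
        if same_width and quoted:
            hits.append(i)
    return hits

page = [
    "Consider the following German sentence.",
    "Ich sehe den Hund",
    "I see.1SG the.ACC dog",
    "'I see the dog.'",
    "The object bears accusative case.",
]
hits = find_igt(page)
print(hits)
```

A document with at least one hit would then be passed on to language identification and extraction.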

  50. Interlinear Glossed Text • Interlinear Glossed Text (IGT): enriched language data used for illustrative purposes as part of a larger analysis • Transcription line • Gloss line • Translation line
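
The three-line structure maps naturally onto a small record type. A minimal sketch, assuming word-level alignment by whitespace position; the class name, method name, and the German example are illustrative and not taken from the lecture:

```python
from dataclasses import dataclass

@dataclass
class IGT:
    transcription: str  # language data in the original language
    gloss: str          # word-by-word gloss
    translation: str    # free translation

    def aligned(self) -> list[tuple[str, str]]:
        """Pair each transcribed word with the gloss in the same position."""
        words, glosses = self.transcription.split(), self.gloss.split()
        if len(words) != len(glosses):
            raise ValueError("transcription and gloss lines differ in length")
        return list(zip(words, glosses))

example = IGT(
    transcription="Ich sehe den Hund",
    gloss="I see.1SG the.ACC dog",
    translation="I see the dog.",
)
pairs = example.aligned()
print(pairs)
```

Keeping the alignment explicit is what lets later steps preserve and enrich the annotations rather than treating the instance as flat text.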
