
Introduction to Information Extraction

Introduction to Information Extraction. Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan chia@csie.ncu.edu.tw. Problem Definition.



  1. Introduction to Information Extraction Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan chia@csie.ncu.edu.tw

  2. Problem Definition • Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form. • The output template of the IE task • Several fields (slots) • Several instances of a field
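The output template described above can be pictured as a small data structure. The following is only a minimal sketch: the class name `Template`, its methods, and the slot names are assumptions made for illustration, and the values are borrowed from the corporate management example later in this deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Template:
    """One filled output template: each slot name maps to a list of values."""
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, slot: str, value: str) -> None:
        # A slot may hold several instances, so values accumulate in a list.
        self.slots.setdefault(slot, []).append(value)

    def get(self, slot: str) -> List[str]:
        # An unfilled slot yields an empty list rather than an error.
        return self.slots.get(slot, [])


# Hypothetical slot names; values taken from the Cray Research example.
t = Template()
t.add("person", "John Rollwagen")
t.add("position", "chairman")
t.add("position", "CEO")
```

Storing a list per slot keeps "several instances of a field" and "no value at all" in the same representation, which is one possible design among many.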

  3. The difficulty of an IE task depends on … • Text type • From Wall Street Journal articles and email messages to HTML documents. • Domain • From financial news and tourist information to various languages. • Scenario

  4. Various IE Tasks • Free-text IE: • For MUC (Message Understanding Conference) • E.g. terrorist activities, corporate joint ventures • Semi-structured IE: • E.g. meta-search engines, shopping agents, bio-integration systems

  5. Types of IE from MUC • Named Entity recognition (NE) • Finds and classifies names, places, etc. • Coreference Resolution (CO) • Identifies identity relations between entities in texts. • Template Element construction (TE) • Adds descriptive information to NE results. • Scenario Template production (ST) • Fits TE results into specified event scenarios.

  6. Named Entity Recognition • http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_3.html

  7. NE Recognition (Cont.) • Spanish: 93% • Japanese: 92% • Chinese: 84.51%

  8. Coreference Resolution • Coreference resolution (CO) involves identifying identity relations between entities in texts. • For example, in “Alas, poor Yorick, I knew him well.” • Tie “Yorick” with “him”. • The Sheffield system scored 51% recall and 71% precision. • http://www.cs.nyu.edu/cs/faculty/grishman/COtask21.book_4.html

  9. Template Element Production • Adds descriptive information to named entities • The Sheffield system scores 71%

  10. Scenario Template Extraction • STs are the prototypical outputs of IE systems. • They tie together TE entities into event and relation descriptions. • Performance for Sheffield: 49% • http://www.cs.nyu.edu/cs/faculty/grishman/IEtask15.book_2.html

  11. Example • The users' interests center on operational domains such as drug enforcement, money laundering, organized crime, terrorism, … 1. Input: texts dealing with drug enforcement, money laundering, organized crime, terrorism, and legislation; 2. NE: recognizes entities in those texts and assigns them to one of a number of categories drawn from the set of entities of interest (person, company, ...); 3. TE: associates certain types of descriptive information with these entities, e.g. the location of companies; 4. ST: identifies a set (relatively small to begin with) of events of interest by tying entities together into event relations.

  12. Example Text

  13. Output Example (NE, TE)

  14. Output (STs)

  15. Another IE Example • Corporate Management Changes • Purpose • which positions in which organizations are changing hands? • who is leaving a position and where the person is going to? • who is appointed to a position and where the person is coming from? • the locations and types of the organizations involved in the succession events; • the names and titles of the persons involved in the succession events • http://www.cs.umanitoba.ca/~lindek/ie-ex.htm

  16. Input Text President Clinton nominated John Rollwagen, the chairman and CEO of Cray Research Inc., as the No. 2 Commerce Department official. Mr. Rollwagen said he wants to push the Clinton administration to aggressively confront U.S. trading partners such as Japan to open their markets, particularly for high-tech industries. In a letter sent throughout the Eagan, Minn.-based company on Friday, Mr. Rollwagen warned: "Whether we like it or not, our country is in an economic war; and we are at a key turning point in that war." ...... Cray said it has appointed John F. Carlson, its president and chief operating officer, to succeed him. ......

  17. Extraction Result
      Corporate Management Database
      Person            Organization         Position   Transition
      John Rollwagen    Cray Research Inc.   chairman   out
      John Rollwagen    Cray Research Inc.   CEO        out
      John F. Carlson   Cray Research Inc.   chairman   in
      John F. Carlson   Cray Research Inc.   CEO        in
      Organization Database
      Name                  Location       Alias   Type
      Cray Research Inc.    Eagan, Minn.   Cray    COMPANY
      Commerce Department                          GOVERNMENT
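To make the link between this extraction result and the questions on the "Another IE Example" slide concrete, here is a small sketch that holds the Corporate Management Database rows as records and asks who is leaving and who is arriving. The type name `Succession` and the helper `people` are hypothetical names introduced for illustration; the rows themselves are copied from the slide.

```python
from typing import List, NamedTuple


class Succession(NamedTuple):
    person: str
    organization: str
    position: str
    transition: str  # "in" or "out"


# Rows copied from the Corporate Management Database on the slide.
ROWS: List[Succession] = [
    Succession("John Rollwagen", "Cray Research Inc.", "chairman", "out"),
    Succession("John Rollwagen", "Cray Research Inc.", "CEO", "out"),
    Succession("John F. Carlson", "Cray Research Inc.", "chairman", "in"),
    Succession("John F. Carlson", "Cray Research Inc.", "CEO", "in"),
]


def people(rows: List[Succession], transition: str) -> List[str]:
    """Return the distinct persons with the given transition, in row order."""
    seen: List[str] = []
    for row in rows:
        if row.transition == transition and row.person not in seen:
            seen.append(row.person)
    return seen


leaving = people(ROWS, "out")   # who is leaving a position
arriving = people(ROWS, "in")   # who is appointed to a position
```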

  18. MUC • Data sets • MET2: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/met2/met2package.tar.gz • MUC3&4: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_data/muc34.tar.gz • MUC6&7 from LDC: http://www.ldc.upenn.edu/ • MUC-6: http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html • MUC-7: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html

  19. Summary • Evaluation • Precision = (# of correctly extracted fields) / (# of extracted fields) • Recall = (# of correctly extracted fields) / (# of fields to be extracted) • Design Methodology • Natural Language Processing • Machine Learning
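The two measures on this slide can be computed directly once extracted and expected fields are compared as sets. The sketch below is an illustration under assumptions: the function name `precision_recall`, the representation of a field as a (slot, value) pair, exact-match comparison, and the convention of returning 0.0 when a denominator is empty are all choices made here, not part of the slides. The sample data is made up around the Cray Research example.

```python
from typing import Set, Tuple

Field = Tuple[str, str]  # (slot name, extracted value)


def precision_recall(extracted: Set[Field], gold: Set[Field]) -> Tuple[float, float]:
    """Field-level precision and recall with exact matching."""
    correct = len(extracted & gold)
    # Assumed convention: an empty denominator gives 0.0.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical answer key and system output, for illustration only.
gold = {
    ("person", "John Rollwagen"),
    ("position", "chairman"),
    ("position", "CEO"),
    ("organization", "Cray Research Inc."),
}
extracted = {
    ("person", "John Rollwagen"),
    ("position", "chairman"),
    ("organization", "Cray"),  # wrong under exact matching
}
p, r = precision_recall(extracted, gold)
```

Here two of the three extracted fields are correct and two of the four expected fields are found, so precision is 2/3 and recall is 1/2.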

  20. IE from Semi-structured Documents • Output Template: k-tuple • Multiple instances of a field • Missing data

  21. Various IE Tasks for Semi-structured Documents • Multiple-record page extraction • One-record (singular) page extraction

  22. Multiple-record page extraction

  23. One-record (singular) page extraction

  24. Summary • Evaluation • Precision = (# of correctly extracted records) / (# of extracted records) • Recall = (# of correctly extracted records) / (# of records to be extracted) • Design Methodology • Machine Learning • Pattern Mining

  25. News Group IE • Example: Computer-Related Jobs

  26. Output Template • Between free-text IE and semi-structured IE • [CaliffRapier 99]

  27. Annotated Training Examples • Most systems require annotated training examples (answer keys) • AutoSlog, Rapier, SRV, WIEN, Softmealy, Stalker • Only a few systems work from unannotated training examples • AutoSlog-TS, IEPAD, OLERA

  28. The Type of Extraction Rule • Delimiter-based Rule • WIEN, Stalker • Content-based Rule • Context-based Rule • Rapier, AutoSlog, SRV, IEPAD
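As a rough illustration of what a delimiter-based rule looks like, the sketch below extracts every string found between a left and a right delimiter. It is written in the spirit of left/right delimiter wrappers and is not the actual algorithm of WIEN or Stalker; the function name `extract_between`, the page fragment, and its values are all made up for this example.

```python
from typing import List


def extract_between(page: str, left: str, right: str) -> List[str]:
    """Return every substring enclosed by the left and right delimiters."""
    values: List[str] = []
    pos = 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end])
        pos = end + len(right)
    return values


# Hypothetical page fragment; the markup and values are invented.
page = "<li><b>Congo</b> <i>242</i></li><li><b>Egypt</b> <i>20</i></li>"
names = extract_between(page, "<b>", "</b>")
codes = extract_between(page, "<i>", "</i>")
```

A rule of this kind looks only at the text surrounding a field, never at the field's own content, which is what separates it from content-based rules.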

  29. Background Knowledge • For Rule Generalization • Implicit or Explicit • Example • Specified format for date, email, etc. • Special feature for color, location, etc.
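One way to picture explicit background knowledge of the "specified format" kind is a table of patterns that a rule learner could consult when generalizing over tokens. This is a sketch under assumptions: the name `classify`, the `PATTERNS` table, and the two deliberately simple regular expressions are illustrative only and do not come from any of the systems named in this deck.

```python
import re
from typing import Optional

# Deliberately simple patterns, assumed for illustration; real date and
# email formats are far more varied than these cover.
PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def classify(token: str) -> Optional[str]:
    """Return the name of the first pattern that matches the whole token."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(token):
            return name
    return None


kind = classify("chia@csie.ncu.edu.tw")
```

With such a table, a learner can generalize a rule from one concrete address to "any token of type email" instead of memorizing the literal string.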

  30. Conclusion • Define the IE problem • Specify the input: training examples • with annotation, or • without annotation • Depict the extraction rule • Use necessary background knowledge

  31. References • *H. Cunningham, Information Extraction – a User Guide, http://www.dcs.shef.ac.uk • *MUC-6, http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html • *I. Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, AAAI-99 Workshop on Machine Learning for Information Extraction. • Califf, Relational Learning of Pattern-Match Rules for Information Extraction, AAAI-99.
