Introduction to Information Extraction. Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan chia@csie.ncu.edu.tw. Problem Definition.

Introduction to Information Extraction

Chia-Hui Chang

Dept. of Computer Science and Information Engineering, National Central University, Taiwan

chia@csie.ncu.edu.tw

Problem Definition
  • Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form.

Input → extractor → structured output

  • The output template of the IE task
    • Several fields (slots)
    • Several instances of a field
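
As a minimal sketch of such an output template, the structure below maps each slot to a list of instances. The class name `Template`, its methods, and the slot names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Template:
    """One filled output template: slot name -> list of extracted instances."""
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, slot: str, value: str) -> None:
        # A slot may hold several instances, so values accumulate in a list.
        self.slots.setdefault(slot, []).append(value)

    def get(self, slot: str) -> List[str]:
        # An unfilled slot yields an empty list rather than raising an error.
        return self.slots.get(slot, [])


# Hypothetical slot names and values, for illustration only.
template = Template()
template.add("person", "John Rollwagen")
template.add("position", "chairman")
template.add("position", "CEO")
print(template.get("position"))
```

Storing every slot as a list keeps single-valued and multi-valued fields uniform, at the cost of one extra indexing step for fields that only ever hold one value.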
The difficulty of an IE task depends on …
  • Text type
    • From plain text to semi-structured Web pages
    • e.g. Wall Street Journal articles, email messages, or HTML documents.
  • Domain
    • From financial news or tourist information to various languages.
  • Scenario
Various IE Tasks
  • Free-text IE:
    • For MUC (Message Understanding Conference)
    • E.g. terrorist activities, corporate joint ventures
  • Semi-structured IE:
    • E.g. meta-search engines, shopping agents, bio-integration systems
Types of IE from MUC
  • Named Entity recognition (NE)
    • Finds and classifies names, places, etc.
  • Coreference Resolution (CO)
    • Identifies identity relations between entities in texts.
  • Template Element construction (TE)
    • Adds descriptive information to NE results.
  • Scenario Template production (ST)
    • Fits TE results into specified event scenarios.
NE Recognition (Cont.)
  • Spanish: 93%
  • Japanese: 92%
  • Chinese: 84.51%
Coreference Resolution
  • Coreference resolution (CO) involves identifying identity relations between entities in texts.
  • For example, in

Alas, poor Yorick, I knew him well.

  • Tie “Yorick” with “him”.
  • The Sheffield system scored 51% recall and 71% precision.

http://www.cs.nyu.edu/cs/faculty/grishman/COtask21.book_4.html

Scenario Template Extraction
  • STs are the prototypical outputs of IE systems.
  • They tie together TE entities into event and relation descriptions.
  • Performance for Sheffield: 49%

http://www.cs.nyu.edu/cs/faculty/grishman/IEtask15.book_2.html

Example
  • User interests center on operational domains such as drug enforcement, money laundering, organized crime, terrorism, ….

1. Input: texts dealing with drug enforcement, money laundering, organized crime, terrorism, and legislation;

2. NE: recognizes entities in those texts and assigns them to one of a number of categories drawn from the set of entities of interest (person, company, . . . );

3. TE: associates certain types of descriptive information with these entities, e.g. the location of companies;

4. ST: identifies a set (relatively small to begin with) of events of interest by tying entities together into event relations.
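
The four steps above can be sketched as a toy pipeline. This is only an illustration under strong assumptions: the gazetteer, the "based in" pattern, the "investigating" trigger word, the sample sentence, and all function names are hypothetical and do not come from any MUC system.

```python
import re

# Hypothetical gazetteer: surface string -> entity category.
GAZETTEER = {
    "Acme Corp": "company",
    "Jane Doe": "person",
}


def named_entities(text):
    """NE step: find gazetteer names in the text and label their category."""
    found = []
    for name, category in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append({"name": name, "type": category, "start": match.start()})
    # Report entities in the order they appear in the text.
    return sorted(found, key=lambda e: e["start"])


def template_elements(text, entities):
    """TE step: attach descriptive information, here a company's location."""
    elements = []
    for entity in entities:
        element = dict(entity)
        if entity["type"] == "company":
            # Toy pattern: "<company>, based in <Location>".
            pattern = re.escape(entity["name"]) + r", based in ([A-Z][a-z]+)"
            m = re.search(pattern, text)
            if m:
                element["location"] = m.group(1)
        elements.append(element)
    return elements


def scenario_templates(text, elements):
    """ST step: tie entities together into an event relation."""
    people = [e for e in elements if e["type"] == "person"]
    companies = [e for e in elements if e["type"] == "company"]
    events = []
    # Toy trigger: the word "investigating" links each person to each company.
    if "investigating" in text:
        for p in people:
            for c in companies:
                events.append(
                    {"event": "investigation", "agent": p["name"], "target": c["name"]}
                )
    return events


# 1. Input: a made-up sentence standing in for a real document.
text = "Jane Doe is investigating Acme Corp, based in Springfield."
entities = named_entities(text)              # 2. NE
elements = template_elements(text, entities)  # 3. TE
events = scenario_templates(text, elements)   # 4. ST
print(events)
```

Each stage consumes the output of the previous one, which mirrors how TE adds to NE results and ST ties TE results into events.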

Another IE Example
  • Corporate Management Changes
  • Purpose
    • which positions in which organizations are changing hands?
    • who is leaving a position and where the person is going to?
    • who is appointed to a position and where the person is coming from?
    • the locations and types of the organizations involved in the succession events;
    • the names and titles of the persons involved in the succession events
  • http://www.cs.umanitoba.ca/~lindek/ie-ex.htm
Input Text

President Clinton nominated John Rollwagen, the chairman and CEO of Cray Research Inc., as the No. 2 Commerce Department official. Mr. Rollwagen said he wants to push the Clinton administration to aggressively confront U.S. trading partners such as Japan to open their markets, particularly for high-tech industries. In a letter sent throughout the Eagan, Minn.-based company on Friday, Mr. Rollwagen warned: "Whether we like it or not, our country is in an economic war; and we are at a key turning point in that war." ......

Cray said it has appointed John F. Carlson, its president and chief operating officer, to succeed him. ......

Extraction Result

Corporate Management Database

Person           Organization        Position  Transition
John Rollwagen   Cray Research Inc.  chairman  out
John Rollwagen   Cray Research Inc.  CEO       out
John F. Carlson  Cray Research Inc.  chairman  in
John F. Carlson  Cray Research Inc.  CEO       in

Organization Database

Name                 Location      Alias  Type
Cray Research Inc.   Eagan, Minn.  Cray   COMPANY
Commerce Department  -             -      GOVERNMENT

MUC
  • Data sets for
    • MET2: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/met2/met2package.tar.gz
    • MUC3&4: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_data/muc34.tar.gz
    • MUC6&7 from LDC: http://www.ldc.upenn.edu/
  • MUC-6: http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
  • MUC-7: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html

Summary
  • Evaluation
    • Precision = (# of correctly extracted fields) / (# of extracted fields)
    • Recall = (# of correctly extracted fields) / (# of fields to be extracted)
  • Design Methodology for Text IE
    • Natural Language Processing
    • Machine Learning
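
The two measures can be computed directly from the set of extracted fields and the set of fields that should have been extracted. The sketch below assumes each field is a (slot, value) pair and counts an extraction as correct only on an exact match; the function name, the sample data, and the choice to return 0.0 for an empty denominator are assumptions made for illustration.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted fields against gold fields."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)  # number of correctly extracted fields
    # Assumption: an empty denominator yields 0.0 instead of raising an error.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard and system output as (slot, value) pairs.
gold = {
    ("person", "John Rollwagen"),
    ("organization", "Cray Research Inc."),
    ("position", "chairman"),
    ("position", "CEO"),
}
extracted = {
    ("person", "John Rollwagen"),
    ("position", "chairman"),
    ("position", "president"),
}
p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three extracted fields are correct and two of the four gold fields are found, so precision is 2/3 and recall is 1/2.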

IE from Web pages
  • Output Template: k-tuple
    • Multiple instances of a field
    • Missing data
Web data extraction
  • Various Web pages
    • Multiple-record page extraction
    • One-record (singular) page extraction
Applications
  • Information integration
    • Meta Search Engines
    • Shopping agents
    • Travel agents
Information Integration Systems

[Figure: layered architecture of an information integration system, running from Abstracted Information down to Unprocessed, Unintegrated Details. Labels in the figure:]
  • Human & Computer Users
  • User Services:
    • Query
    • Monitor
    • Update
  • Information Integration Service
    • Mediators
    • Wrappers
  • Functions: Agent/Module Coordination, Mediation, Semantic Integration, Translation and Wrapping
  • Heterogeneous Data Sources: SQL, ORB, Text, Images/Video, Spreadsheets, Hierarchical & Network Databases, Object & Knowledge Bases, Relational Databases


Web Wrappers
  • What is a wrapper?
    • A program that extracts desired information from Web pages.

Web pages → wrapper → structured information

  • Web wrappers wrap...
    • “Query-able” or “search-able” Web sites
    • Web pages with large itemized lists
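
A hand-written wrapper for a page with an itemized list can be sketched with Python's standard `html.parser` module. The class name `ListWrapper` and the sample page are hypothetical, and the sketch assumes that every `<li>` element holds exactly one record and that lists are not nested.

```python
from html.parser import HTMLParser


class ListWrapper(HTMLParser):
    """Collects the text of every <li> element as one extracted record."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_item = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # A new list item starts a new record.
            self._in_item = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_item:
            # Text inside nested tags such as <b> is gathered as well.
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_item:
            self.records.append("".join(self._buffer).strip())
            self._in_item = False


# Hypothetical page with an itemized list of products.
page = (
    "<html><body><ul>"
    "<li>Book A <b>$10</b></li>"
    "<li>Book B <b>$12</b></li>"
    "</ul></body></html>"
)
wrapper = ListWrapper()
wrapper.feed(page)
print(wrapper.records)
```

A wrapper like this is tied to one page layout; if the site changes its markup, the wrapper has to be rewritten.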
Summary
  • Evaluation
    • Precision = (# of correctly extracted records) / (# of extracted records)
    • Recall = (# of correctly extracted records) / (# of records to be extracted)
  • Methodology for Web IE
    • Programming package
    • Machine Learning
    • Pattern Mining


Type III: News Group IE
  • Example: Computer-Related Jobs
Output Template
  • Between free-text IE and semi-structured IE
  • [CaliffRapier 99]
Wrapper Induction Systems
  • Wrapper induction (WI) or information extraction (IE) systems are software tools designed to generate wrappers.
  • Taxonomy of Web IE systems by
    • Task domain
      • free text vs semi-structured pages
    • Automation degree
      • supervised vs unsupervised
    • Techniques applied
      • Machine learning vs pattern mining
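
To make the idea of generating a wrapper from examples concrete, here is a toy supervised sketch. It learns one left and one right delimiter from annotated pages and then applies them to an unseen page. This is a deliberately simple delimiter-based rule, not the method of any particular system. The function names and sample pages are hypothetical, and the sketch assumes each annotated value occurs exactly once in its page.

```python
import os  # os.path.commonprefix compares any list of strings character by character


def learn_delimiters(examples):
    """Learn (left, right) delimiters from (page, annotated value) pairs."""
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)
        # Reverse the left context so a common prefix equals a common suffix.
        lefts.append(page[:start][::-1])
        rights.append(page[start + len(value):])
    left = os.path.commonprefix(lefts)[::-1]
    right = os.path.commonprefix(rights)
    return left, right


def apply_wrapper(page, left, right):
    """Extract the text between the learned delimiters."""
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]


# Hypothetical training pages, each annotated with the value to extract.
examples = [
    ("<p>Lamp</p><b>Price:</b> <i>$10</i><hr>", "$10"),
    ("<p>Desk</p><b>Price:</b> <i>$250</i><br>", "$250"),
]
left, right = learn_delimiters(examples)
result = apply_wrapper("<p>Chair</p><b>Price:</b> <i>$7</i><hr>", left, right)
print(repr(left), repr(right), result)
```

The annotated values make this a supervised approach; an unsupervised system would have to find the repeated pattern without being told which strings are the targets.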
Task Domain
  • Document type
  • Extraction level
    • Field-level, record-level, page-level
  • Extraction target variation
    • Missing Attributes
    • Multi-valued Attributes
    • Multi-order attribute Permutations
    • Nested Data Objects
  • Template variation
    • Various Templates for an attribute
    • Common Templates for various attributes
  • Untokenized Attributes
Automation Degree
  • Page-fetching Support
  • Annotation Requirement
  • Output Support
  • API Support
Techniques Applied
  • Scan passes
  • Extraction rule types
  • Learning algorithms
  • Tokenization schemes
  • Features used
Conclusion
  • Define the IE problem
  • Specify the input: training examples
    • with annotation, or
    • without annotation
  • Depict the extraction rule
    • Use necessary background knowledge
References
  • *H. Cunningham, Information Extraction – a User Guide, http://www.dcs.shef.ac.uk
  • *MUC-6, http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
  • *I. Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, The AAAI-99 Workshop on Machine Learning for Information Extraction.
  • Califf, Relational Learning of Pattern-Match Rules for Information Extraction, AAAI-99.