Information Extraction

Sunita Sarawagi

IIT Bombay

http://www.it.iitb.ac.in/~sunita

Information Extraction (IE) & Integration

The extraction task: given

  • E: a set of structured elements
  • S: an unstructured source

extract all instances of E from S.

  • Many versions involving many source types
  • Actively researched in varied communities
  • Several tools and techniques
  • Several commercial applications
IE from free format text
  • Classical Named Entity Recognition
    • Extract person, location, organization names

According to Robert Callahan, president of Eastern's flight attendants union, the past practice of Eastern's parent, Houston-based Texas Air Corp., has involved ultimatums to unions to accept the carrier's terms

  • Several applications
    • News tracking
      • Monitor events
    • Bio-informatics
      • Protein and Gene names from publications
    • Customer care
      • Part number, problem description from emails in help centers
Problem definition

Source: concatenation of structured elements with limited reordering and some missing fields

  • Example: addresses, bibliographic records

Address fields: House number, Building, Road, Area, City, Zip

156 Hillside ctype Scenic drive Powai Mumbai 400076

Bibliographic fields: Author, Year, Title, Journal, Volume, Page

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
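The segmentation task for the address example can be pictured as mapping the raw string onto the slide's fields. A minimal illustration (the dict representation is an assumption; field names and values come from the slide):

```python
# Segmentation of the slide's address example into its structured fields.
address = "156 Hillside ctype Scenic drive Powai Mumbai 400076"

segmented = {
    "House number": "156",
    "Building": "Hillside ctype",
    "Road": "Scenic drive",
    "Area": "Powai",
    "City": "Mumbai",
    "Zip": "400076",
}

# Concatenating the field values in order recovers the source string,
# which is exactly the "concatenation of structured elements" framing.
assert " ".join(segmented.values()) == address
```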

Relation Extraction: Disease Outbreaks
  • Extract structured relations from text

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…

[Figure: news stories → Information Extraction System (e.g., NYU's Proteus) → a table of disease outbreaks reported in The New York Times]

Personal Information Systems
  • Automatically add a bibtex entry of a paper I download
  • Integrate a resume in email with the candidates database

[Figure: personal data sources (files, emails, the web) linked to structured collections of papers, people, projects, and resumes]

Hand-Coded Methods
  • Easy to construct in many cases
    • e.g., to recognize prices, phone numbers, zip codes, conference names, etc.
  • Easier to debug & maintain
    • Especially if written in a "high-level" language (as is usually the case), e.g.:

ContactPattern ← RegularExpression(Email.body, "can be reached at")
PersonPhone ← Precedes(Person, Precedes(ContactPattern, Phone, D), D)

  • Easier to incorporate / reuse domain knowledge
  • Can be quite labor intensive to write

[From Avatar]

Example of Hand-Coded Entity Tagger [Ramakrishnan G., 2005; slides from Doan et al., SIGMOD 2006]

Rule 1: finds person names with a salutation (e.g., Dr. Laura Haas): an initial, a dot, then two capitalized words

<token>INITIAL</token>
<token>DOT</token>
<token>CAPSWORD</token>
<token>CAPSWORD</token>

Rule 2: finds person names where two capitalized words are present in a person dictionary

<token>PERSONDICT, CAPSWORD</token>
<token>PERSONDICT, CAPSWORD</token>

CAPSWORD: a word starting with an uppercase letter whose second letter is lowercase; regex \p{Upper}\p{Lower}[\p{Alpha}]{1,25}
  • E.g., DeWitt satisfies it (DEWITT does not)

DOT: the character '.'
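Rule 1 can be sketched as a single regular expression in Python. A minimal sketch: the CAPSWORD pattern is the one given on the slide; the slide does not define INITIAL precisely, so a short capitalized abbreviation such as "Dr" or "Prof" is assumed here.

```python
import re

# Token patterns.  CAPSWORD is the regex from the slide; INITIAL's exact
# definition is not given, so a 1-4 letter capitalized abbreviation is assumed.
INITIAL = r"[A-Z][a-z]{0,3}"
DOT = r"\."
CAPSWORD = r"[A-Z][a-z][A-Za-z]{1,25}"

# Rule 1: INITIAL DOT CAPSWORD CAPSWORD  (e.g. "Dr. Laura Haas")
rule1 = re.compile(
    r"\b" + INITIAL + DOT + r"\s+" + CAPSWORD + r"\s+" + CAPSWORD + r"\b"
)

text = "Please contact Dr. Laura Haas or DEWITT about the meeting."
print(rule1.findall(text))  # → ['Dr. Laura Haas']
```

Note that "DEWITT" is not matched: its second letter is uppercase, so it fails the CAPSWORD pattern, exactly as the slide's DeWitt/DEWITT example describes.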

Hand Coded Rule Example: Conference Name

# These are subordinate patterns
my $wordOrdinals = "(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
my $numberOrdinals = "(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
my $ordinals = "(?:$wordOrdinals|$numberOrdinals)";
my $confTypes = "(?:Conference|Workshop|Symposium)";
# A word starting with a capital letter and ending with 0 or more spaces
my $words = "(?:[A-Z]\\w+\\s*)";
# e.g. "International Conference ..." or the conference name for workshops (e.g. "VLDB Workshop ...")
my $confDescriptors = "(?:international\\s+|[A-Z]+\\s+)";
my $connectors = "(?:on|of)";
# Conference abbreviations like "(SIGMOD'06)"
my $abbreviations = "(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))";

# The actual pattern we search for.  A typical conference name this pattern
# will find is "3rd International Conference on Blah Blah Blah (ICBBB-05)"
my $fullNamePattern = "((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)";

##############################################################
# Given a <dbworldMessage>, look for the conference pattern
##############################################################
lookForPattern($dbworldMessage, $fullNamePattern);

#########################################################
# In a given <file>, look for occurrences of <pattern>
# <pattern> is a regular expression
#########################################################
sub lookForPattern {
    my ($file, $pattern) = @_;

Some Hand Coded Entity Taggers
  • FRUMP [DeJong 82]
  • CIRCUS / AutoSlog [Riloff 93]
  • SRI FASTUS [Appelt, 1996]
  • MITRE Alembic (available for use)
  • Alias-I LingPipe (available for use)
  • OSMX [Embley, 2005]
  • DBLife [Doan et al, 2006]
  • Avatar [Jayram et al, 2006]
Learning models for extraction
  • Rule-based extractors
    • For each label, build two classifiers for accepting its two boundaries.
    • Each classifier: sequence of rules
      • Each rule: conjunction of predicates
        • E.g.: if the previous token is a last name, the current token is ".", and the next token is an article → mark the start of a title.
    • Examples: Rapier, GATE, LP2 & several more
  • Critique of rule-based approaches
    • Cannot output meaningful uncertainty values
    • Brittle
    • Limited flexibility in clues that can be exploited
    • Not good at combining several weak clues.
    • (Pros) Somewhat easier to tune.
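The boundary rule in the example above (previous token a last name, current token ".", next token an article → start of title) can be sketched as a conjunction of predicates over a token window. A minimal sketch; the toy dictionaries and function name are assumptions, not part of any cited system:

```python
# A boundary rule is a conjunction of predicates over tokens around
# position i; if every predicate holds, a title starts at position i+1.
LAST_NAMES = {"Dordick", "Clark", "Estell"}        # toy dictionary (assumed)
ARTICLES = {"a", "an", "the", "A", "An", "The"}    # toy dictionary (assumed)

def title_start_rule(tokens, i):
    """True if a title boundary opens right after position i."""
    return (
        i >= 1 and i + 1 < len(tokens)
        and tokens[i - 1] in LAST_NAMES    # previous token is a last name
        and tokens[i] == "."               # current token is a dot
        and tokens[i + 1] in ARTICLES      # next token is an article
    )

tokens = "J . S . Dordick . A Study of Enzymes".split()
boundaries = [i + 1 for i in range(len(tokens)) if title_start_rule(tokens, i)]
print(boundaries)  # → [6]  (the title starts at token 6, "A")
```

A rule learner such as Rapier or LP2 would induce many such conjunctions from labeled examples rather than have them written by hand.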
Statistical models of IE
  • Generative models like HMM
    • Intuitive
    • Very restricted feature setslower accuracy
    • Output probabilities are highly skewed (counterpart, naïve Bayes)
  • Conditional discriminative models
    • Local models: Maximum entropy models
    • Global models: Conditional Random Fields.

Conditional models

  • output meaningful probabilities,
  • flexible, generalize,
  • getting increasingly popular
  • State-of-the-art!
IE with Hidden Markov Models
  • Probabilistic models for IE

[Figure: an HMM over bibliographic fields (states such as Title, Author, Journal, Year) with transition probabilities on the edges (e.g., 0.1, 0.8) and per-state emission probabilities over symbols (e.g., A, B, C, dddd, dd)]

HMM Structure
  • Naïve model: one state per element
  • Nested model: each element is itself an HMM
HMM Dictionary
  • For each word (= feature), associate the probability of emitting that word
      • Multinomial model
  • More advanced models use overlapping features of a word, for example:
      • part of speech,
      • capitalized or not,
      • type: number, letter, word, etc.
    • Maximum entropy models (McCallum 2000)
Learning model parameters
  • When training data defines a unique path through the HMM:
    • Transition probabilities
      • Probability of transitioning from state i to state j =
        (number of transitions from i to j) / (total transitions out of state i)
    • Emission probabilities
      • Probability of emitting symbol k from state i =
        (number of times k is emitted at i) / (total symbols emitted at i)
  • When training data defines multiple paths:
    • A more general EM-like algorithm (Baum-Welch)
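When every training sequence is fully labeled (so each defines a unique path), the two ratios above reduce to simple counting. A minimal sketch, with made-up state names and a toy two-record training set:

```python
from collections import Counter

# Training data: one (state, symbol) pair per token; each labeled
# sequence defines a unique path through the HMM.
sequences = [
    [("House", "156"), ("Road", "Grant"), ("City", "Mumbai"), ("Pin", "400070")],
    [("House", "115"), ("Road", "Hillside"), ("City", "Powai"), ("Pin", "400076")],
]

trans, trans_tot = Counter(), Counter()
emit, emit_tot = Counter(), Counter()
for seq in sequences:
    for (s1, _), (s2, _) in zip(seq, seq[1:]):
        trans[(s1, s2)] += 1      # transitions from s1 to s2
        trans_tot[s1] += 1        # total transitions out of s1
    for s, w in seq:
        emit[(s, w)] += 1         # times w is emitted at s
        emit_tot[s] += 1          # total symbols emitted at s

def p_trans(i, j):
    return trans[(i, j)] / trans_tot[i]

def p_emit(i, k):
    return emit[(i, k)] / emit_tot[i]

print(p_trans("House", "Road"))   # → 1.0 (both paths go House → Road)
print(p_emit("City", "Mumbai"))   # → 0.5 (one of two City emissions)
```

With multiple consistent paths per sequence these counts become expected counts, which is exactly what the Baum-Welch (EM) iteration computes.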
Using the HMM to segment
  • Find highest probability path through the HMM.
  • Viterbi: a dynamic programming algorithm, quadratic in the number of states per token

[Figure: Viterbi lattice for "115 Grant street Mumbai 400070" — one column per token, with states House, Road, City, Other, Pin in each column; segmentation follows the highest-probability path through the lattice]
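The highest-probability path through such a lattice is found with Viterbi dynamic programming: O(n·m²) for n tokens and m states. A minimal sketch with made-up start, transition, and emission probabilities; a real segmenter would use the learned HMM parameters (and log-probabilities for numerical stability):

```python
def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the tokens."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s](tokens[0]), [s]) for s in states}
    for tok in tokens[1:]:
        best = {
            s: max(
                (p * trans_p[(prev, s)] * emit_p[s](tok), path + [s])
                for prev, (p, path) in best.items()
            )
            for s in states
        }
    return max(best.values())[1]

# Toy model for the address example (all numbers assumed, not learned).
states = ["House", "Road", "City", "Pin"]
start_p = {"House": 0.7, "Road": 0.1, "City": 0.1, "Pin": 0.1}
likely = {("House", "Road"), ("Road", "Road"), ("Road", "City"), ("City", "Pin")}
trans_p = {(a, b): 0.4 if (a, b) in likely else 0.05
           for a in states for b in states}

def make_emit(state):
    def emit(tok):
        if state == "Pin":
            return 0.9 if tok.isdigit() and len(tok) == 6 else 0.01
        if state == "House":
            return 0.8 if tok.isdigit() else 0.05
        if state == "City":
            return 0.9 if tok in {"Mumbai", "Powai"} else 0.05
        return 0.3  # Road: weak preference for remaining tokens
    return emit

emit_p = {s: make_emit(s) for s in states}
path = viterbi("115 Grant street Mumbai 400070".split(),
               states, start_p, trans_p, emit_p)
print(path)  # → ['House', 'Road', 'Road', 'City', 'Pin']
```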

Comparative Evaluation
  • Naïve model – one state per element in the HMM
  • Independent HMM – one HMM per element
  • Rule learning method – Rapier
  • Nested model – each state in the Naïve model replaced by an HMM
Results: Comparative Evaluation

The Nested model does best in all three cases

(from Borkar 2001)

HMM approach: summary

  • Inter-element sequencing → outer HMM transitions
  • Intra-element sequencing → inner HMM
  • Element length → multi-state inner HMM
  • Characteristic words → dictionary
  • Non-overlapping tags → global optimization
Statistical models of IE
  • Generative models like HMM
    • Intuitive
    • Very restricted feature setslower accuracy
    • Output probabilities are highly skewed (counterpart, naïve Bayes)
  • Conditional discriminative models
    • Local models: Maximum entropy models
    • Global models: Conditional Random Fields.

Conditional models

  • output meaningful probabilities,
  • flexible, generalize,
  • getting increasingly popular
  • State-of-the-art!
Basic chain model for extraction

[Figure: a chain of labels y1 … y9 over the tokens of "My review of Fermat's last theorem by S. Singh" (x); in the independent model, each yi is predicted separately from x]

Features
  • The word as-is
  • Orthographic word properties
    • Capitalized? Digit? Ends-with-dot?
  • Part of speech
    • Noun?
  • Match in a dictionary
    • Appears in a dictionary of people names?
    • Appears in a list of stop-words?
  • Fire these for each label on:
    • the token,
    • tokens up to W positions to the left or right, or
    • a concatenation of tokens.
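The feature kinds listed above can be sketched as a per-token feature function. A minimal illustration: the feature names and toy dictionaries are assumptions, and the part-of-speech feature is omitted since it would need a tagger.

```python
STOPWORDS = {"of", "the", "by", "my"}      # toy stop-word list (assumed)
PERSON_NAMES = {"Singh", "Haas"}           # toy person dictionary (assumed)

def token_features(tokens, i, window=1):
    """Features at position i: the word as-is, orthographic properties,
    dictionary matches, and the same word features on neighbors."""
    tok = tokens[i]
    feats = {
        f"word={tok}": 1,
        "is_capitalized": int(tok[0].isupper()),
        "is_digit": int(tok.isdigit()),
        "ends_with_dot": int(tok.endswith(".")),
        "in_person_dict": int(tok in PERSON_NAMES),
        "is_stopword": int(tok.lower() in STOPWORDS),
    }
    # Word features fired on tokens up to `window` positions away.
    for d in range(1, window + 1):
        if i - d >= 0:
            feats[f"word@-{d}={tokens[i - d]}"] = 1
        if i + d < len(tokens):
            feats[f"word@+{d}={tokens[i + d]}"] = 1
    return feats

tokens = "My review of Fermat's last theorem by S. Singh".split()
print(token_features(tokens, 8))   # features for "Singh"
```

In a conditional model, each such feature is paired (per label, or per label pair) with a learned weight.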
Basic chain model for extraction

[Figure: the same chain of labels y1 … y9 over "My review of Fermat's last theorem by S. Singh", now modeled jointly: a global conditional model over Pr(y1, y2, …, y9 | x)]

Features
  • A feature vector for each position (user provided), defined on the i-th label, the previous label, and word i & its neighbors
  • Parameters: a weight for each feature vector (machine learnt)
Transforming real-world extraction
  • Partition each label into different parts?
  • Independent extraction per label?

[Per-token tag set: Begin, Continue, End, Unique, plus Other for untagged tokens]
Typical numbers
  • Seminar announcements (CMU):
    • speaker, location, timings
    • SVMs for start/end boundaries
    • 250 training examples
    • F1: 85% speaker and location, 92% timings (Finn & Kushmerick '04)
  • Job postings in newsgroups:
    • 17 fields: title, location, company, language, etc.
    • 150 training examples
    • F1: 84% overall (LP2) (Lavelli et al. '04)
Publications
  • Cora dataset
    • Paper headers: extract title, author, affiliation, address, email, abstract
      • 94% F1 with CRFs
      • 76% F1 with HMMs
    • Paper citations: extract title, author, date, editor, booktitle, pages, institution
      • 91% F1 with CRFs
      • 78% F1 with HMMs

Peng & McCallum 2004