Relation Extraction

Slides from Dan Jurafsky, Rion Snow, Jim Martin, Chris Manning and William Cohen

Background: Information Extraction
  • Extract information from text
    • Sometimes called text analytics commercially
    • Extract entities (the people, organizations, locations, times, dates, genes, diseases, medicines, etc. in a text)
    • Extract the relations between entities
    • Figure out the larger events that are taking place
Information Extraction
  • Creating knowledge bases and ontologies
  • Implications for cognitive modeling
  • Digital Libraries
    • Google Scholar and CiteSeer need to extract titles, authors and references
  • Bioinformatics
  • Patent analysis
  • Specific market segments for stock analysis
  • SEC filings
  • Intelligence analysis
Paradigms: Data Mining vs Text Mining
  • Data mining is a set of techniques for extracting knowledge from large volumes of data (e.g. databases) using simple patterns and statistical methods
  • In contrast, the goal of text mining (or machine reading) is to extract the relationships between entities that are stated in the examined text, which is semantically deeper than what data mining does
  • The difference is in the source information:
    • the relationships already exist in the source text,
    • so we don't need to guess them, only discover them!
Outline
  • Reminder: Named Entity Tagging
  • Relation Extraction
    • Hand-built patterns
    • Seed (bootstrap) methods
    • Supervised classification
    • Distant supervision
Slide from William Cohen
What is “Information Extraction”

As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

IE

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Slide from William Cohen
What is “Information Extraction”

As a family of techniques:

Information Extraction = segmentation + classification + association + clustering

Mentions found in the text (“named entity extraction”):
  • Microsoft Corporation
  • CEO
  • Bill Gates
  • Microsoft
  • Gates
  • Microsoft
  • Bill Veghte
  • Microsoft
  • VP
  • Richard Stallman
  • founder
  • Free Software Foundation

Slide from William Cohen
What is “Information Extraction”

As a family of techniques:

Information Extraction = segmentation + classification + association + clustering

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Extracting Structured Knowledge

Each article can contain hundreds or thousands of items of knowledge...

“The Lawrence Livermore National Laboratory (LLNL) in Livermore, California is a scientific research laboratory founded by the University of California in 1952.”

LLNL EQ Lawrence Livermore National Laboratory

LLNL LOC-IN California

Livermore LOC-IN California

LLNL IS-A scientific research laboratory

LLNL FOUNDED-BY University of California

LLNL FOUNDED-IN 1952
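
The facts on this slide can be stored as plain (subject, relation, object) triples. The following is a minimal Python sketch, not part of the original slides; the `Triple` type and the `facts_about` helper are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# The six facts listed on the slide, copied as-is.
FACTS = [
    Triple("LLNL", "EQ", "Lawrence Livermore National Laboratory"),
    Triple("LLNL", "LOC-IN", "California"),
    Triple("Livermore", "LOC-IN", "California"),
    Triple("LLNL", "IS-A", "scientific research laboratory"),
    Triple("LLNL", "FOUNDED-BY", "University of California"),
    Triple("LLNL", "FOUNDED-IN", "1952"),
]


def facts_about(subject, facts=FACTS):
    """Return a {relation: object} view of every fact with this subject."""
    return {t.relation: t.obj for t in facts if t.subject == subject}


if __name__ == "__main__":
    print(facts_about("LLNL")["FOUNDED-IN"])  # prints 1952
```

A flat list like this is enough for lookup by subject; a real knowledge base would also index by relation and object.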

Goal: Machine-readable summaries

Structured knowledge extraction: Summary for machine

Textual abstract: Summary for human

From Unstructured Text to Structured Knowledge

Unstructured Text
  • News articles...
  • Blog posts...
  • Scientific journal articles...
  • Tweets, instant messages, chat logs...

Structured Knowledge

slide from Rion Snow

Relation Extraction
  • What is relation extraction?
  • Founded in 1801 as South Carolina College, USC is the flagship institution of the University of South Carolina System and offers more than 350 programs of study leading to bachelor's, master's, and doctoral degrees from fourteen degree-granting colleges and schools to an enrollment of approximately 45,251 students, 30,967 on the main Columbia campus. … [wiki]
  • complex relation = summarization
  • focus on binary relation predicate(subject, object) or triples <subj predicate obj>
Wiki Info Box – structured data
  • template
  • standard things about Universities
  • Established
  • type
  • faculty
  • students
  • location
  • mascot
Focus on extracting binary relations
  • predicate(subject, object) from predicate logic
  • triples <subj relation object>
  • Directed graphs
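
One way to picture the directed-graph view is an adjacency list keyed by subject, where each triple contributes one labelled edge. This sketch is an illustration only: `build_graph` and `reachable` are hypothetical helpers, and the first triple (LLNL LOC-IN Livermore) is an added toy edge, not a fact taken from the slides.

```python
from collections import defaultdict

# Toy triples; the first one is an illustrative addition.
TRIPLES = [
    ("LLNL", "LOC-IN", "Livermore"),
    ("Livermore", "LOC-IN", "California"),
    ("LLNL", "FOUNDED-BY", "University of California"),
]


def build_graph(triples):
    """Map each subject to its list of (relation, object) outgoing edges."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return graph


def reachable(graph, start):
    """Return every node that can be reached from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

Because edges are directed, `reachable(graph, "California")` is empty even though California is reachable from LLNL.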
Why relation extraction?
  • Create new structured KBs
  • Augment existing ones: words -> WordNet, facts -> Freebase or DBpedia
  • Support question answering: Jeopardy
  • Which relations
  • Automated Content Extraction (ACE) http://www.itl.nist.gov/iad/mig//tests/ace/
  • 17 relations
  • ACE examples
Unified Medical Language System (UMLS)
  • UMLS Semantic Network: 134 entity types, 54 relations

http://www.nlm.nih.gov/research/umls/

Current Relations in the UMLS Semantic Network
  • isa
  • associated_with
    • physically_related_to
      • part_of, consists_of, contains, connected_to, interconnects, branch_of, tributary_of, ingredient_of
    • spatially_related_to
      • location_of, adjacent_to, surrounds, traverses
    • functionally_related_to
      • affects, …
    • temporally_related_to
      • co-occurs_with, precedes
    • conceptually_related_to
      • evaluation_of, degree_of, analyzes, assesses_effect_of, measurement_of, measures, diagnoses, property_of, derivative_of, developmental_form_of, method_of, …
Databases of Wikipedia Relations
  • DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and to make this information readily available
  • DBpedia allows you to make sophisticated queries

http://dbpedia.org/About

English version of the DBpedia knowledge base
  • 3.77 million things
  • 2.35 million are classified in an ontology, including:
    • 764,000 persons,
    • 573,000 places (including 387,000 populated places),
    • 333,000 creative works (including 112,000 music albums, 72,000 films and 18,000 video games),
    • 192,000 organizations (including 45,000 companies and 42,000 educational institutions),
    • 202,000 species and 
    • 5,500 diseases.
Freebase
  • Google (Freebase wiki): http://wiki.freebase.com/wiki/Main_Page
Ontological relations
  • IS-A hypernym
  • Instance-of
  • has-Part
  • hyponym (opposite of hypernym)
Slide from Chris Manning
Reminder: Task 1: Named Entity Tagging
  • General NER or Biomedical NER

<PER> John Hennessy </PER> is a professor at <ORG> Stanford University </ORG>, in <LOC> Palo Alto </LOC>.

<RNA> TAR </RNA> independent transactivation by <PROTEIN> Tat </PROTEIN> in cells derived from the <CELL> CNS </CELL> - a novel mechanism of <DNA> HIV-1 gene </DNA> regulation.

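Relation extraction usually starts from entity mentions like the ones tagged above. As a small sketch, assuming the inline `<TYPE> ... </TYPE>` markup shown on the slide, the mentions can be pulled out with a regular expression; `tagged_entities` is a hypothetical helper, not part of any NER toolkit.

```python
import re

# Matches <TYPE> text </TYPE>, where the closing tag must repeat the opening type.
TAG_RE = re.compile(r"<(?P<type>[A-Z]+)>\s*(?P<text>.*?)\s*</(?P=type)>")


def tagged_entities(text):
    """Return (entity_type, surface_text) pairs from inline entity markup."""
    return [(m.group("type"), m.group("text")) for m in TAG_RE.finditer(text)]


SENTENCE = (
    "<PER> John Hennessy </PER> is a professor at "
    "<ORG> Stanford University </ORG>, in <LOC> Palo Alto </LOC>."
)

if __name__ == "__main__":
    for entity_type, surface in tagged_entities(SENTENCE):
        print(entity_type, surface)
```

The resulting list of typed mentions is the input that the relation extraction methods in the rest of the deck pair up and classify.
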
Relations between words
  • Language understanding applications need word meaning!
    • Question answering
    • Conversational agents
    • Summarization
  • One key meaning component: word relations
    • Hierarchical (ontological) relations
      • “San Francisco” ISA “city”
    • Other relations between words
      • “alternator” is a part of a “car”
Relation Prediction

“...works by such authors as Herrick, Goldsmith, and Shakespeare.”

“If you consider authors like Shakespeare...”

“Some authors (including Shakespeare)...”

“Shakespeare was the author of several...”

“Shakespeare, author of The Tempest...”

Shakespeare IS-A author (0.87)

How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?

Hyponymy
  • One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other
    • car is a hyponym of vehicle
    • dog is a hyponym of animal
    • mango is a hyponym of fruit
  • Conversely
    • vehicle is a hypernym/superordinate of car
    • animal is a hypernym of dog
    • fruit is a hypernym of mango
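
Hyponymy is transitive, so a taxonomy can be queried by walking upward from a word. A minimal sketch, assuming a toy dictionary in which each word has a single direct hypernym (real WordNet senses can have several); `hypernym_chain` and `is_hyponym` are hypothetical helpers, and the `vehicle -> artifact` link is an illustrative addition to the three pairs on the slide.

```python
# Toy taxonomy: word -> direct hypernym.
HYPERNYM = {
    "car": "vehicle",       # from the slide
    "dog": "animal",        # from the slide
    "mango": "fruit",       # from the slide
    "vehicle": "artifact",  # illustrative extra level
}


def hypernym_chain(word, taxonomy=HYPERNYM):
    """Return the hypernyms of `word`, nearest first."""
    chain = []
    while word in taxonomy:
        word = taxonomy[word]
        chain.append(word)
    return chain


def is_hyponym(specific, general, taxonomy=HYPERNYM):
    """True if `general` appears anywhere above `specific` in the taxonomy."""
    return general in hypernym_chain(specific, taxonomy)
```

The relation is asymmetric: `is_hyponym("car", "vehicle")` holds, while `is_hyponym("vehicle", "car")` does not.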
WordNet relations

X is-a-kind-of Y (hyponym / hypernym)

X is-a-part-of Y (meronym / holonym)

WordNet is incomplete
  • Ontological relations are missing for many words
  • Especially true for specific domains (restaurants, auto parts, finance)
Slide from Eugene Agichtein
Other kinds of Relations: Disease Outbreaks
  • Extract structured information from text

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…

Disease Outbreaks in The New York Times

Information Extraction System (e.g., NYU’s Proteus)

More relations: Protein Interactions

“We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.”

[Figure: relation diagram. Node labels: CBF-A, CBF-C, CBF-B, CBF-A-CBF-C complex. Edge labels: interact, complex, associates.]

Slide from Rosario and Hearst

Slide from Jim Martin
Yet More Relations

CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

Slide from Jim Martin
Relation Types
  • For generic news texts...
More relations: UMLS
  • Unified Medical Language System
    • integrates linguistic, terminological and semantic information
    • Semantic Network consists of 134 semantic types and 54 relations between types

Pharmacologic Substance affects Pathologic Function

Pharmacologic Substance causes Pathologic Function

Pharmacologic Substance complicates Pathologic Function

Pharmacologic Substance diagnoses Pathologic Function

Pharmacologic Substance prevents Pathologic Function

Pharmacologic Substance treats Pathologic Function

Slide from Paul Buitelaar

Relations in Ontologies: GO (Gene Ontology)
  • GO (Gene Ontology)
    • Aligns descriptions of gene products in different databases, including plant, animal and microbial genomes
    • Organizing principles are molecular function, biological process and cellular component

Accession: GO:0009292

Ontology: biological process

Synonyms: broad: genetic exchange

Definition: In the absence of a sexual life cycle, the processes involved in the introduction of genetic information to create a genetically different individual.

Term Lineage all : all (164142)

GO:0008150 : biological process (115947)

GO:0007275 : development (11892)

GO:0009292 : genetic transfer (69)

Slide from Paul Buitelaar

Relations in Ontologies: geographical

[Figure: ontology diagram, labelled “F-Logic similar Ontology”. Concept nodes: Geographical Entity (GE), Inhabited GE, Natural GE, city, country, mountain, river. Instance nodes: Zugspitze, Neckar, Germany, Stuttgart, Berlin. Edge labels: is-a, instance_of, located_in, capital_of, flow_through. Attributes: height (m) 2962, length (km) 367.]

Design: Philipp Cimiano

Slide from Paul Buitelaar

MeSH (Medical Subject Headings) Thesaurus

[Figure labels: MeSH Descriptor, Definition, Synonym set]

Slide from Illhoi Yoo, Xiaohua (Tony) Hu, and Il-Yeol Song

MeSH Tree
  • MeSH Ontology
    • Hierarchically arranged from most general to most specific.
    • Actually a graph rather than a tree
      • descriptors normally appear in more than one place in the tree

Slide from Yoo, Hu, Song

Slide from Doug Appelt
Types of ACE Relations, 2003
  • ROLE - relates a person to an organization or a geopolitical entity
    • Subtypes: member, owner, affiliate, client, citizen
  • PART - generalized containment
    • Subtypes: subsidiary, physical part-of, set membership
  • AT - permanent and transient locations
    • Subtypes: located, based-in, residence
  • SOCIAL - social relations among persons
    • Subtypes: parent, sibling, spouse, grandparent, associate
Predicting the “is-a” relation

“...works by such authors as Herrick, Goldsmith, and Shakespeare.”

“If you consider authors like Shakespeare...”

“Some authors (including Shakespeare)...”

“Shakespeare was the author of several...”

“Shakespeare, author of The Tempest...”

Shakespeare IS-A author (0.87)

How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?

Why this is hard: Ambiguity!

Which relations hold between 2 entities?

Treatment / Disease: Cure? Prevent? Side Effect?

Slide from B. Rosario and M. Hearst
Different relations between Disease (Hepatitis) and Treatment
  • Cure
    • These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135.
  • Prevent
    • A two-dose combined hepatitis A and B vaccine would facilitate immunization programs
  • Vague
    • Effect of interferon on hepatitis B
5 easy methods for relation extraction
  • Hand-built patterns
  • Supervised methods
  • Bootstrapping (seed) methods
  • Unsupervised methods
  • Distant supervision
Goal: Add hyponyms to WordNet directly from text.
  • Intuition from Hearst (1992)
    • “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”
  • What does Gelidium mean?
  • How do you know?
Hearst’s Hand-Designed Lexico-Syntactic Patterns

(Hearst, 1992): Automatic Acquisition of Hyponyms

“Y such as X ((, X)* (, and/or) X)”

“such Y as X…”

“X… or other Y”

“X… and other Y”

“Y including X…”

“Y, especially X…”
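
Four of the patterns above can be approximated with regular expressions. This is a rough sketch under strong assumptions: a noun phrase is approximated by a single word (`NP`), whereas a real system would run a noun-phrase chunker first, and `hearst_pairs` is a hypothetical helper name. The “X and/or other Y” patterns are left out for brevity.

```python
import re

NP = r"[A-Za-z][A-Za-z-]*"  # crude one-word stand-in for a noun phrase
# "X" or "X, X, and X": extra items only count when the list ends in and/or.
X_LIST = rf"{NP}(?:(?:, {NP})*,? (?:and|or) {NP})?"

PATTERNS = [
    re.compile(rf"\b(?P<y>{NP}),? such as (?P<xs>{X_LIST})"),
    re.compile(rf"\bsuch (?P<y>{NP}) as (?P<xs>{X_LIST})"),
    re.compile(rf"\b(?P<y>{NP}),? including (?P<xs>{X_LIST})"),
    re.compile(rf"\b(?P<y>{NP}),? especially (?P<xs>{X_LIST})"),
]
SPLIT_RE = re.compile(r",? (?:and|or) |, ")


def hearst_pairs(sentence):
    """Return (hyponym, hypernym) candidates found by the patterns."""
    pairs = []
    for pattern in PATTERNS:
        for m in pattern.finditer(sentence):
            for x in SPLIT_RE.split(m.group("xs")):
                pairs.append((x, m.group("y")))
    return pairs


if __name__ == "__main__":
    print(hearst_pairs(
        "Agar is a substance prepared from a mixture of red algae, "
        "such as Gelidium, for laboratory or industrial use"
    ))  # prints [('Gelidium', 'algae')]
```

On the Agar sentence this answers the question from the earlier slide: Gelidium is proposed as a kind of algae, even though the word itself is unknown.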

Problem with hand-built patterns
  • Requires that we hand-build patterns for each relation!
    • don’t want to have to do this for all possible relations!
    • we’d like better accuracy
2. Supervised Relation Extraction
  • Sometimes done in 3 steps
    • Find all pairs of named entities
    • Decide if the 2 entities are related
    • If yes, classify the relation
  • Why the extra step?
    • Cuts down on training time for classification by eliminating most pairs
    • Allows separate feature sets that are appropriate for each task
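
The three steps can be sketched as a small pipeline. Everything below is illustrative: `is_related` and `relation_type` are hand-written stand-in rules occupying the places where trained classifiers would go, the token threshold is arbitrary, and the `PART` label is borrowed from the ACE 2003 types listed earlier.

```python
from itertools import combinations

SENTENCE = ("American Airlines, a unit of AMR, immediately matched the move, "
            "spokesman Tim Wagner said.")
# Entity mentions in sentence order, as a named entity tagger might return them.
ENTITIES = [("American Airlines", "ORG"), ("AMR", "ORG"), ("Tim Wagner", "PER")]


def text_between(sentence, m1, m2):
    """Text between the end of mention m1 and the start of mention m2."""
    start = sentence.index(m1) + len(m1)
    return sentence[start:sentence.index(m2, start)]


def is_related(between):
    """Step 2 stand-in: a trained binary classifier would go here."""
    return len(between.split()) <= 4


def relation_type(between):
    """Step 3 stand-in: a trained multi-class classifier would go here."""
    return "PART" if "unit of" in between else "OTHER"


def extract(sentence, entities):
    results = []
    for (m1, _t1), (m2, _t2) in combinations(entities, 2):    # step 1
        between = text_between(sentence, m1, m2)
        if is_related(between):                                # step 2
            results.append((m1, relation_type(between), m2))   # step 3
    return results
```

Of the three candidate pairs, only (American Airlines, AMR) survives the filter, so the more expensive step 3 runs once instead of three times.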
Slide from Jim Martin
Relation Analysis
  • Usually just run on named entities within the same sentence
Slide from Jing Jiang
Relation Extraction
  • Task definition: to label the semantic relation between a pair of entities in a sentence (fragment)

…[leader]arg-1 of a minority [government]arg-2…

Candidate labels:
  • PHYS: Physical
  • PER-SOC: Personal / Social
  • EMP-ORG: Employment / Membership / Subsidiary
  • NIL

Supervised Learning
  • Supervised machine learning (e.g. [Zhou et al. 2005], [Bunescu & Mooney 2005], [Zhang et al. 2006], [Surdeanu & Ciaramita 2007])
  • Training data is needed for each relation type

…[leader]arg-1 of a minority [government]arg-2…

Example features:
  • arg-1 word: leader
  • arg-2 type: ORG
  • dependency: arg-1 of arg-2

Candidate labels: PHYS, PER-SOC, EMP-ORG, NIL

Slide from Jing Jiang

Features: Words
  • Headwords of M1 and M2, and combination
      • George Washington Bridge
  • Bag of words and bigrams in M1 and M2
  • Words or bigrams in particular positions to the left and right of the M1 and M2
      • +/- 1, 2, 3
  • Bag of words or bigrams between the two entities
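
The word features above can be turned into a sparse feature dictionary. A minimal sketch, assuming pre-tokenized input and mention spans given as (start, end) token offsets with the end exclusive; the last token of a mention serves as a crude headword, only a one-word window is used, and the feature names are invented for illustration.

```python
def word_features(tokens, m1, m2):
    """Binary word features for mentions m1 and m2, with m1 before m2."""
    feats = {}
    head1, head2 = tokens[m1[1] - 1], tokens[m2[1] - 1]
    feats["head_m1=" + head1] = 1
    feats["head_m2=" + head2] = 1
    feats["heads=" + head1 + "_" + head2] = 1   # combination of the headwords
    for word in tokens[m1[1]:m2[0]]:            # bag of words between mentions
        feats["between=" + word] = 1
    if m1[0] > 0:                               # word just left of M1
        feats["left_m1=" + tokens[m1[0] - 1]] = 1
    if m2[1] < len(tokens):                     # word just right of M2
        feats["right_m2=" + tokens[m2[1]]] = 1
    return feats


TOKENS = "American Airlines , a unit of AMR , immediately matched the move".split()

if __name__ == "__main__":
    for name in sorted(word_features(TOKENS, (0, 2), (6, 7))):
        print(name)
```

Dictionaries of this shape can be handed to most classifier libraries after vectorization; bigram and wider-window features follow the same pattern.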
Features: Named Entity Type and Mention Level
  • Named-entity types (ORG, LOC, etc)
  • Concatenation of the types
  • Entity Level of M1 and M2
    • (NAME, NOMINAL, PRONOUN)
Slide from Jim Martin
Features: Parse Tree and Base Phrases
  • Syntactic environment
    • Constituent path through the tree from one to the other
    • Base syntactic chunk sequence from one to the other
    • Dependency path
Features: Gazetteers and trigger words
  • Personal relative trigger list
    • from WordNet: parent, wife, husband, grandparent, etc
  • Country name list
Classifiers for supervised methods
  • Now you can use any classifier you like
    • SVM
    • Logistic regression
    • Naïve Bayes
    • etc
Summary
  • Can get high accuracies with enough hand-labeled training data, if the test data looks similar enough to the training data
  • But
    • labeling 5000 relations (and named entities) is expensive
    • the approach doesn’t generalize well to different genres
Slide from Jim Martin
Bootstrapping Approaches
  • What if you don’t have enough annotated text to train on?
    • But you might have some seed tuples
    • Or you might have some patterns that work pretty well
  • Can you use those seeds to do something useful?
    • Co-training and active learning use the seeds to train classifiers to tag more data to train better classifiers...
    • Bootstrapping tries to learn directly (populate a relation) through direct use of the seeds
Slide from Jim Martin
Bootstrapping Example: Seed Tuple
  • <Mark Twain, Elmira> Seed tuple
    • Grep (google)
    • “Mark Twain is buried in Elmira, NY.”
      • X is buried in Y
    • “The grave of Mark Twain is in Elmira”
      • The grave of X is in Y
    • “Elmira is Mark Twain’s final resting place”
      • Y is X’s final resting place.
  • Use those patterns to grep for new tuples that you don’t already know
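
One round of this loop can be sketched in a few lines. The simplifications are deliberate and should be read as assumptions: a pattern is only the text between the two seed strings (so “Y is X’s final resting place”, which depends on the right-hand context, is not handled), a named entity is approximated by a run of capitalized words (`NAME`), and the Emily Dickinson sentence is an invented test input, not part of the slides.

```python
import re

NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"  # crude stand-in for a named entity


def induce_patterns(seed, sentences):
    """Return (x_first, middle_text) contexts in which the seed pair occurs."""
    x, y = seed
    found = set()
    for s in sentences:
        i, j = s.find(x), s.find(y)
        if i < 0 or j < 0:
            continue
        if i < j:
            found.add((True, s[i + len(x):j]))
        else:
            found.add((False, s[j + len(y):i]))
    return found


def apply_patterns(patterns, sentences):
    """Use the induced contexts to find new (x, y) tuples."""
    tuples = set()
    for x_first, middle in patterns:
        regex = re.compile(rf"({NAME}){re.escape(middle)}({NAME})")
        for s in sentences:
            for a, b in regex.findall(s):
                tuples.add((a, b) if x_first else (b, a))
    return tuples


if __name__ == "__main__":
    pats = induce_patterns(("Mark Twain", "Elmira"),
                           ["Mark Twain is buried in Elmira, NY."])
    print(apply_patterns(pats, ["Emily Dickinson is buried in Amherst."]))
```

Feeding the new tuples back into `induce_patterns` gives the next iteration; without some scoring of patterns and tuples the loop drifts quickly, which is the problem Snowball addresses.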
Hearst (1992) proposal for bootstrapping
  • Choose lexical relation R
  • Gather a set of pairs that have this relation
  • Find places in the corpus where these expressions occur near each other and record the environment
  • Find the commonalities among these environments and hypothesize that common ones yield patterns that indicate the relation of interest
DIPRE (Brin 1998)
  • Extract <author, book> pairs
  • Start with these 5 seeds
  • Learn these patterns:
  • Now iterate, using these patterns to get more instances and patterns…
Snowball [Agichtein & Gravano 2000]
  • Exploit duality between patterns and tuples
    • find tuples that match a set of patterns
    • find patterns that match a lot of tuples
  • Bootstrapping approach

[Figure: Initial Seed Tuples → Occurrences of Seed Tuples → Tag Entities → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table]

Snowball

[Table: initial seed tuples]

Occurrences of seed tuples:

Computer servers at Microsoft’s headquarters in Redmond…

In mid-afternoon trading, shares of Redmond-based Microsoft fell…

The Armonk-based IBM introduced a new line…

The combined company will operate from Boeing’s headquarters in Seattle.

Intel, Santa Clara, cut prices of its Pentium processor.

Snowball
  • Require that X and Y be named entities of particular types

ORGANIZATION {<’s 0.7> <headquarters 0.7> <in 0.7>} LOCATION

LOCATION {<- 0.75> <based 0.75>} ORGANIZATION

Patterns

An (extraction) pattern has the format <left, tag1, middle, tag2, right>, where tag1 and tag2 are named-entity tags and left, middle, and right are vectors of weighted terms.

  • patterns derived directly from occurrences are too specific

ORGANIZATION 's central headquarters in LOCATION is home to...

tag1 = ORGANIZATION
middle = {<'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>}
tag2 = LOCATION
right = {<is 0.75>, <home 0.75>}

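The 5-tuple format lends itself to a simple similarity score between a pattern and a text occurrence. The slides do not spell out the scoring, so the rule below is an assumption: the score is the sum of the dot products of the left, middle and right term vectors when both entity tags agree, and zero otherwise. `dot` and `match` are hypothetical helper names, and term vectors are plain dictionaries.

```python
def dot(u, v):
    """Dot product of two sparse term-weight vectors stored as dicts."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def match(pattern, occurrence):
    """Similarity of two <left, tag1, middle, tag2, right> 5-tuples."""
    left1, tag1a, mid1, tag1b, right1 = pattern
    left2, tag2a, mid2, tag2b, right2 = occurrence
    if (tag1a, tag1b) != (tag2a, tag2b):
        return 0.0  # entity types must agree
    return dot(left1, left2) + dot(mid1, mid2) + dot(right1, right2)


# Generalized pattern and the over-specific occurrence from the slides.
PATTERN = ({}, "ORGANIZATION",
           {"'s": 0.7, "headquarters": 0.7, "in": 0.7},
           "LOCATION", {})
OCCURRENCE = ({}, "ORGANIZATION",
              {"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5},
              "LOCATION", {"is": 0.75, "home": 0.75})

if __name__ == "__main__":
    print(round(match(PATTERN, OCCURRENCE), 2))  # prints 1.05
```

A soft score like this lets the occurrence with the extra word “central” still match the shorter pattern, which exact string patterns cannot do.
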
Pattern Clusters

cluster patterns, cluster centroids define patterns

Cluster 1
  • {<servers 0.75> <at 0.75>} ORGANIZATION {<’s 0.5> <central 0.5> <headquarters 0.5> <in 0.5>} LOCATION
  • {<operate 0.75> <from 0.75>} ORGANIZATION {<’s 0.7> <headquarters 0.7> <in 0.7>} LOCATION

Cluster 2
  • {<shares 0.75> <of 0.75>} LOCATION {<- 0.75> <based 0.75>} ORGANIZATION {<fell 1>}
  • {<the 1>} LOCATION {<- 0.75> <based 0.75>} ORGANIZATION {<introduced 0.75> <a 0.75>}

Distant supervision paradigm

Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17

Mintz, Bills, Snow, Jurafsky (2009) Distant supervision for relation extraction without labeled data. ACL-2009.

  • Instead of hand-creating 5 seed examples
  • Use a large database to get our seed examples
    • lots of examples
    • supervision from a database, not a corpus!
      • Not genre-dependent!
    • Create lots and lots of noisy features from all these examples
    • Combine in a classifier
Distant supervision paradigm
  • Has advantages of supervised classification:
    • use of rich hand-created knowledge
  • Has advantages of unsupervised classification:
    • can use very large amounts of unlabeled data
    • allows for a very large number of weak features
    • not sensitive to the training corpus
Relation Classification with “Distant Supervision”

We construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet.

Suppose scientists could erase memories with a single substance in the brain.

Slide from Rion Snow

Relation Classification with “Distant Supervision”

This leads to high-signal examples like:

“...consider authors like Shakespeare...”

“Some authors (including Shakespeare)...”

“Shakespeare was the author of several...”

“Shakespeare, author of The Tempest...”

But noisy examples like:

“The author of Shakespeare in Love...”

“...authors at the Shakespeare Festival...”

Training set (TREC and Wikipedia):

14,000 hypernym pairs, ~600,000 total pairs

Slide from Rion Snow

How to learn patterns

Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17

1. Take corpus sentences
   … doubly heavy hydrogen atom called deuterium…
2. Collect noun pairs
   752,311 pairs from 6M words of newswire, e.g. (atom, deuterium)
3. Is pair an IS-A in WordNet?
   14,387 yes, 737,924 no; (atom, deuterium): YES
4. Parse the sentences
5. Extract patterns
   69,592 dependency paths (>5 pairs)
6. Train classifier on these patterns
   Logistic regression with 70K features (actually converted to 974,288 bucketed binary features)
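
Steps 2, 3 and 5 can be sketched as a function that turns (noun pair, pattern) observations into labelled feature counts. This is an illustration under stated assumptions: patterns are plain strings standing in for dependency paths, `known_isa` is a small set standing in for WordNet, `build_training_set` is a hypothetical name, and the threshold is lowered to 2 pairs so that the toy data survives it.

```python
from collections import Counter, defaultdict

# (hyponym candidate, hypernym candidate, pattern) observations.
EXAMPLES = [
    ("deuterium", "atom", "Y called X"),
    ("sarcoma", "cancer", "Y called X"),
    ("efflorescence", "condition", "Y called X"),
    ("Shakespeare", "author", "Y at the X"),   # noisy, rare pattern
]
KNOWN_ISA = {("deuterium", "atom"), ("sarcoma", "cancer")}  # stand-in for WordNet


def build_training_set(examples, known_isa, min_pairs=2):
    """Keep patterns seen with at least `min_pairs` distinct pairs, then label pairs."""
    pairs_per_pattern = defaultdict(set)
    for x, y, pattern in examples:
        pairs_per_pattern[pattern].add((x, y))
    kept = {p for p, pairs in pairs_per_pattern.items() if len(pairs) >= min_pairs}

    features = defaultdict(Counter)
    for x, y, pattern in examples:
        if pattern in kept:
            features[(x, y)][pattern] += 1
    labels = {pair: pair in known_isa for pair in features}
    return dict(features), labels
```

The pair (efflorescence, condition) receives a negative label only because the lookup set does not contain it; after training, a classifier that has learned the “called” pattern can still score it highly, which is how new pairs are discovered.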

One of 70,000 patterns

“<superordinate> ‘called’ <subordinate>”

Learned from cases such as:

“sarcoma / cancer”: …an uncommon bone cancer called osteogenic sarcoma and to…

“deuterium / atom”: …heavy water rich in the doubly heavy hydrogen atom called deuterium.

New pairs discovered:

“efflorescence / condition”: …and a condition called efflorescence are other reasons for…

“o’neal_inc / company”: …The company, now called O'Neal Inc., was sole distributor of E-Ferol…

“hat_creek_outfit / ranch”: …run a small ranch called the Hat Creek Outfit.

“hiv-1 / aids_virus”: …infected by the AIDS virus, called HIV-1.

“bateau_mouche / attraction”: …local sightseeing attraction called the Bateau Mouche...

“kibbutz_malkiyya / collective_farm”: …an Israeli collective farm called Kibbutz Malkiyya…

Idea: use each pattern as a feature!

Precision/Recall for Hypernym Classification:

logistic regression

10-fold Cross Validation on 14,000 WordNet-Labeled Pairs

Slide from Rion Snow

Extracting more relations with distant supervision

Training Set: 2700 relations (> 10 instances), 3.7 million entities, 5.2 million instances

Corpus: 2.3 million articles, 31.5 million sentences

Mintz, Bills, Snow, Jurafsky (2009) Distant supervision for relation extraction without labeled data. ACL-2009.

Algorithm: Distant Supervision
  • A kind of weakly supervised learning
    • Use a large database to get seed examples
    • Create lots and lots of noisy pattern features from all these examples
    • Combine in a classifier
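
The first step, using a database to label text automatically, can be sketched as follows. The assumptions are loud ones: the knowledge base is a one-entry dictionary, entity matching is plain substring search rather than named entity tagging, `label_sentences` is a hypothetical name, and the second sentence is invented to show how noise enters the training set.

```python
# Knowledge base: (entity1, entity2) -> relation. One illustrative entry.
KB = {("Edwin Hubble", "Marshfield"): "BORN-IN"}

SENTENCES = [
    "Astronomer Edwin Hubble was born in Marshfield, Missouri",
    "Edwin Hubble rarely returned to Marshfield",   # invented, noisy example
]


def label_sentences(kb, sentences):
    """Label every sentence that mentions both entities of a KB pair."""
    rows = []
    for sentence in sentences:
        for (e1, e2), relation in kb.items():
            if e1 in sentence and e2 in sentence:
                rows.append((sentence, e1, e2, relation))
    return rows


if __name__ == "__main__":
    for row in label_sentences(KB, SENTENCES):
        print(row[3], "<-", row[0])
```

Both sentences receive the BORN-IN label although only the first expresses it. Features are then extracted from all labelled sentences and combined in a classifier, which is expected to down-weight contexts that appear with many unrelated pairs.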
Extract parse and other features

“Astronomer Edwin Hubble was born in Marshfield, Missouri”

  • “Named entities”

Edwin Hubble is a PERSON

Marshfield is a LOCATION

  • Lexical items nearby…
New relations learned
  • Montmartre IS-IN Paris
  • Fort Erie IS-IN Ontario
  • Fyodor Kamensky DIED-IN Clearwater
  • Upton Sinclair WROTE Lanny Budd
  • Vince McMahon FOUNDED WWE
  • Thomas Mellon HAS-PROFESSION Judge
Human evaluation: precision using Mechanical Turk labelers

Feature     Precision
Syntactic   .67
Lexical     .66
Both        .69

Where syntactic knowledge helps
  • Back Street is a 1932 film made by Universal Pictures, directed by John M. Stahl, and produced by Carl Laemmle Jr.
  • Back Street and John M. Stahl are very far apart in surface string
  • But are close together in dependency parse
Unsupervised relation extraction
  • Banko et al. 2007, “Open information extraction from the Web”
  • Extracting relations from the web with
    • no training data
    • no predetermined list of relations
  • The Open Approach
    • Use parse data to train a “trustworthy” classifier
    • Extract trustworthy relations among NPs
    • Rank relations based on text redundancy
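
The last step can be illustrated with a simple frequency count. This is a stand-in, not the scoring used by Banko et al.: it assumes that an extraction seen in more sentences is more likely to be correct and ranks by raw count. `rank_by_redundancy` is a hypothetical name and the extractions are toy data.

```python
from collections import Counter


def rank_by_redundancy(extractions):
    """Return (triple, count) pairs, most frequently extracted first."""
    return Counter(extractions).most_common()


# Toy extractions, one per supporting sentence.
EXTRACTIONS = [
    ("Shakespeare", "was the author of", "The Tempest"),
    ("Shakespeare", "was the author of", "The Tempest"),
    ("Shakespeare", "was the author of", "The Tempest"),
    ("Mark Twain", "is buried in", "Elmira"),
    ("Mark Twain", "is buried in", "Elmira"),
    ("Shakespeare", "in", "Love"),
]

if __name__ == "__main__":
    for triple, count in rank_by_redundancy(EXTRACTIONS):
        print(count, triple)
```

The spurious (Shakespeare, in, Love) triple ends up last because only one sentence supports it.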