Introduction to Information Extraction

Transition: Documents to Phrases
  • Information Retrieval and Text Mining make document-level judgments
    • Rank documents for a query
    • Assign a label to a document
  • We’re going to start looking more closely at the text within a document.
  • IE is a first step: we’re going to identify a few nuggets of interesting text, and pull them out.
Information Extraction

Definition:

The automatic extraction of structured information from unstructured documents.

Overall Goals:

  • Making information more accessible to people
  • Making information more machine-processable

Practical Goal: Build large knowledge bases

Traditional Information Extraction

Systems find instances of target relations.

e.g., HeadquarteredIn(<company>, <city>)

Some newswire text:

EMI Music Publishing Latin America, the Latin music and entertainment arm of the EMI music conglomerate, has its headquarters in Miami, FL.

HeadquarteredIn(EMI, Miami)
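
As a rough illustration of finding an instance of a target relation, the sketch below applies a single hand-written surface pattern to the newswire sentence above. The regular expression and the function name `extract_headquartered_in` are assumptions made for this example and are not taken from any particular IE system.

```python
import re

# Hypothetical surface pattern for HeadquarteredIn(<company>, <city>).
# Assumption: the company is the capitalized phrase that opens the sentence,
# and the city is the capitalized phrase right after "has its headquarters in".
HQ_PATTERN = re.compile(
    r"^(?P<company>[A-Z]\w*(?: [A-Z]\w*)*)"   # leading capitalized phrase
    r".*?\bhas its headquarters in "          # trigger phrase
    r"(?P<city>[A-Z]\w*(?: [A-Z]\w*)*)"       # capitalized phrase after trigger
)


def extract_headquartered_in(sentence):
    """Return a (company, city) pair, or None when the pattern does not match."""
    match = HQ_PATTERN.search(sentence)
    if match is None:
        return None
    return match.group("company"), match.group("city")


text = ("EMI Music Publishing Latin America, the Latin music and entertainment "
        "arm of the EMI music conglomerate, has its headquarters in Miami, FL.")
print(extract_headquartered_in(text))
```

The pattern returns the full company phrase as it appears in the text. Mapping that phrase to the shorter name EMI used in the tuple above is a separate normalization step that this sketch does not attempt.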

Outline
  • Goals and Uses
  • Major Problems and Obstacles
  • Brief history of techniques
  • Demo
Information Extraction in Applications
  • Structured Search
  • Opinion Mining/Sentiment Extraction
  • Data Mining over Extracted Relationships
Structured Search

Search today is primarily “keyword search”.

e.g., a search for “EMI headquarters”

But what if you want to know something that’s not listed on any one page, but is spread out over many pages?

e.g., What music companies are headquartered in major cities in the Southeastern US?

How many schools in PA closed two or more times because of snow?

What are some high-paying job offers for computer science PhDs?

- Probably no single document mentions all these.

- Many different documents mention parts of the answer.

- If we extracted all these relationships into a database, running such a query would be trivial.
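
To make the last point concrete, here is a minimal sketch of the music-company question run as a query over extracted relations, using Python's built-in `sqlite3` module. The table names, column names, and every row except the EMI/Miami pair are made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database of (mostly hypothetical) extracted relations."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE headquartered_in (company TEXT, city TEXT)")
    conn.execute("CREATE TABLE company_info (company TEXT, industry TEXT)")
    conn.execute("CREATE TABLE city_info (city TEXT, region TEXT, is_major INTEGER)")
    # Only the EMI/Miami pair comes from the slides; the rest is placeholder data.
    conn.executemany("INSERT INTO headquartered_in VALUES (?, ?)", [
        ("EMI", "Miami"),
        ("Example Records", "Atlanta"),
        ("Sample Software", "Miami"),
        ("Placeholder Music", "Seattle"),
    ])
    conn.executemany("INSERT INTO company_info VALUES (?, ?)", [
        ("EMI", "music"),
        ("Example Records", "music"),
        ("Sample Software", "software"),
        ("Placeholder Music", "music"),
    ])
    conn.executemany("INSERT INTO city_info VALUES (?, ?, ?)", [
        ("Miami", "Southeast", 1),
        ("Atlanta", "Southeast", 1),
        ("Seattle", "Northwest", 1),
    ])
    return conn


def music_companies_in_major_southeastern_cities(conn):
    """Join the three extracted relations to answer the structured query."""
    query = """
        SELECT h.company, h.city
        FROM headquartered_in AS h
        JOIN company_info AS c ON c.company = h.company
        JOIN city_info AS ci ON ci.city = h.city
        WHERE c.industry = 'music'
          AND ci.region = 'Southeast'
          AND ci.is_major = 1
        ORDER BY h.company
    """
    return conn.execute(query).fetchall()


rows = music_companies_in_major_southeastern_cities(build_demo_db())
print(rows)
```

Each table could be filled from different documents, which is the point of the slide: no single page has to contain the whole answer.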

Data Mining over Extracted Relationships

Researchers have built classifiers for predicting breast cancer based on databases of doctors’ and nurses’ reports.

However, the reports often have incomplete fields, and many fields are raw text.

Information extraction can fill in the missing fields from the text, to support the classifiers.

Problems for IE
  • Typical NLP problems
    • Paraphrase – many ways to say the same thing
    • Ambiguity – the same word/phrase/sentence may mean different things in different contexts
  • IE-specific problems: data integration
    • Representation: what counts as a relationship? an entity?
    • Large-scale entity and relation resolution
Entity Resolution
  • How many distinct “Alexander Yates” entities are there on the Web?
  • One of those entities is a professor at Temple
  • Is that the same one who is the author of Moondogs, or a different one? How do you know?
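
A very simple starting point for entity resolution is a name-compatibility test, sketched below. The heuristic (same last name, compatible first name or initial) and the function names are assumptions for illustration only. It can propose that "Edison" and "Thomas Edison" might refer to the same entity, but it cannot answer the harder question on this slide: two mentions with the identical name "Alexander Yates" pass the test whether or not they are the same person.

```python
def name_tokens(name):
    """Lowercase the name and drop periods, so "C." becomes "c"."""
    stripped = (tok.strip(".").lower() for tok in name.split())
    return [tok for tok in stripped if tok]


def tokens_compatible(a, b):
    """True if the tokens are equal or one is a one-letter initial of the other."""
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)


def may_corefer(name_a, name_b):
    """Heuristic: same last token, and first tokens compatible when both exist."""
    a, b = name_tokens(name_a), name_tokens(name_b)
    if not a or not b or a[-1] != b[-1]:
        return False
    if len(a) == 1 or len(b) == 1:
        # One mention is a bare surname, so nothing contradicts a match.
        return True
    return tokens_compatible(a[0], b[0])


print(may_corefer("Thomas Edison", "Edison"))
print(may_corefer("C. Smith", "Smith"))
print(may_corefer("Eli Whitney", "Thomas Edison"))
```

A test like this only generates candidate matches. Deciding whether two compatible names really denote one entity needs further evidence, such as the relations each mention participates in.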
Sample extractions for “invented” from TextRunner (http://www.cs.washington.edu/research/textrunner/):

  • Smith invented the margherita
  • Alexander Graham Bell invented the telephone
  • Thomas Edison invented light bulbs
  • Eli Whitney invented the cotton gin
  • Edison invented the phonograph

Another extraction from the same demo:

  • Al Gore invented the Internet

Two extractions with different names for the inventor:

  • Smith invented the margherita
  • C. Smith invented the margherita

Two more extractions with different names for the inventor:

  • Thomas Edison invented light bulbs
  • Edison invented the phonograph

Representations for IE
  • Relation Resolution
    • Raised(fire truck, ladder) = Lifted(fire truck, ladder)
    • Lifted(UN, sanctions) = Removed(UN, sanctions)
    • Raised(Walmart, prices) ? Removed(Walmart, prices)
  • What set of relationships exist in the world?
    • Extremely old problem in philosophy; no good answer.
  • Which set of relations should we try to extract examples of?
Open Information Extraction on the Web

TextRunner (Banko et al., IJCAI’07)

Unsupervised, single-pass extraction for the Web.

No relation names required for input.

Sentence: EBay was founded by Pierre Omidyar.

  • Noun: EBay
  • Relation: was founded by
  • Noun Phrase: Pierre Omidyar

Extracted Tuple: was founded by (EBay, Pierre Omidyar)
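
The shape of such a tuple can be illustrated with a toy extractor that needs no relation names as input. The sketch below is a capitalization heuristic written for this example only. It is a stand-in, not TextRunner's extractor, and it assumes both arguments are capitalized phrases with a lowercase relation phrase between them.

```python
import re

# Toy assumption: <Capitalized phrase> <lowercase words> <Capitalized phrase>.
TRIPLE = re.compile(
    r"^(?P<arg1>[A-Z]\w*(?: [A-Z]\w*)*) "
    r"(?P<rel>[a-z]+(?: [a-z]+)*) "
    r"(?P<arg2>[A-Z]\w*(?: [A-Z]\w*)*)\.?$"
)


def extract_tuple(sentence):
    """Return (relation, arg1, arg2), or None if the toy pattern does not fit."""
    match = TRIPLE.match(sentence)
    if match is None:
        return None
    return match.group("rel"), match.group("arg1"), match.group("arg2")


print(extract_tuple("EBay was founded by Pierre Omidyar."))
```

Sentences whose second argument is not capitalized, such as "Thomas Edison invented light bulbs", fall outside this heuristic, which is one reason real open extractors rely on linguistic analysis rather than capitalization.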

Some Sample IE Techniques
  • Manually constructed patterns
  • Pattern-learning and bootstrapping
  • Supervised Classifiers (more on this later)
Manually-Constructed IE Patterns

Pattern: A:physical-object was bombed by B

⇒ exists C . terrorist-attack(C) ^ perpetrator(C, B) ^ target(C, A)

Example: “The parliament building was bombed by guerrillas.”

  • perpetrator(C, guerrillas) and target(C, parliament building)
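
A hand-built pattern of this kind can be sketched as a regular expression plus a type check. In the sketch below, the `PHYSICAL_OBJECTS` lexicon, the dictionary output format, and the function name are all assumptions for illustration. A real system would use a much richer notion of "physical object".

```python
import re

# Hypothetical stand-in for the A:physical-object type constraint.
PHYSICAL_OBJECTS = {"parliament building", "embassy", "bridge"}

# "A was bombed by B", with an optional leading article on A.
BOMBED = re.compile(
    r"^(?:[Tt]he |An? )?(?P<target>.+?) was bombed by (?P<perpetrator>.+?)\.?$"
)


def match_bombing(sentence):
    """Return the event described by the pattern, or None if it does not apply."""
    match = BOMBED.match(sentence)
    if match is None:
        return None
    target = match.group("target")
    if target.lower() not in PHYSICAL_OBJECTS:
        # The type constraint on A fails, so the pattern does not fire.
        return None
    return {
        "event": "terrorist-attack",
        "perpetrator": match.group("perpetrator"),
        "target": target,
    }


print(match_bombing("The parliament building was bombed by guerrillas."))
```

The type check is what keeps a sentence like "The idea was bombed by critics." from producing a terrorist-attack event.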

Marti Hearst Patterns for Hyponymy
  • Hyponym: the set X is a hyponym of the set Y if for all x ∈ X, x ∈ Y
    • In other words, X is a subclass of Y
    • E.g., “physicists” is a hyponym of “scientists”
    • Hypernym is the opposite, a superclass
  • Hearst (COLING 1992) defined a set of about 5 really common patterns for extracting hyponyms:
    • Y such as X (, X2, X3, …)
    • X and/or/among other Y
    • Y, including X (, X2, X3, …)
    • Y, especially X (, X2, X3, …)
    • These still get used all of the time (including in KnowItAll)
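
As a sketch of how the first pattern, "Y such as X (, X2, X3, …)", might be applied, the code below uses two regular expressions. It assumes that Y is the single word right before "such as" and that the list of X's runs to the next punctuation mark other than a comma. Real implementations typically match noun phrases from a parser or chunker instead. The function name and example sentence are made up for illustration.

```python
import re

# Y is approximated by the single word before "such as" (an assumption).
SUCH_AS = re.compile(r"(?P<hypernym>\w+),? such as (?P<hyponyms>[\w ,]+)")

# Separators inside the X list: commas, optionally followed by and/or,
# or a bare "and"/"or" between two items.
LIST_SEPARATOR = re.compile(r",\s*(?:and |or )?|\s+(?:and|or)\s+")


def hearst_such_as(sentence):
    """Return (hyponym, hypernym) pairs found by the 'Y such as X' pattern."""
    match = SUCH_AS.search(sentence)
    if match is None:
        return []
    hypernym = match.group("hypernym")
    items = LIST_SEPARATOR.split(match.group("hyponyms"))
    return [(item.strip(), hypernym) for item in items if item.strip()]


print(hearst_such_as("She admired scientists such as physicists, chemists and biologists."))
```

Each returned pair reads "X is a hyponym of Y", matching the definition at the top of this slide.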
Rule Learning
  • Thinking up some patterns for hyponyms might not be too hard, but what about some new relationship?
    • E.g., enzymes and the molecular pathway(s) they’re involved in?
    • Cities and their mayors? Films and their directors?
  • Can we automate the process of identifying patterns?
  • Rule learning automates this process, if it is given some examples of the relationship of interest.
    • For instance, some example enzyme names and the names of the pathways they’re involved in.
Bootstrapping

Rule Learning ⇄ High-confidence Extractions (each step feeds the other)
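
The loop between rule learning and high-confidence extractions can be sketched as follows. Everything here is an assumption made for illustration: a "rule" is just the literal text between the two arguments, the tiny corpus is made up, and confidence is approximated by the number of distinct rules that produce a pair (the hypothetical `min_patterns` threshold). Actual systems use more careful rule representations and scoring.

```python
import re
from collections import Counter


def learn_patterns(corpus, pairs):
    """Rule learning: collect the text found between the arguments of known pairs."""
    patterns = set()
    for sentence in corpus:
        for x, y in pairs:
            match = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(corpus, patterns, min_patterns=1):
    """Extraction: keep pairs produced by at least `min_patterns` distinct rules."""
    votes = Counter()
    for pattern in patterns:
        found = set()
        for sentence in corpus:
            match = re.fullmatch(r"(.+?)" + re.escape(pattern) + r"(.+?)\.?", sentence)
            if match:
                found.add((match.group(1), match.group(2)))
        votes.update(found)
    return {pair for pair, count in votes.items() if count >= min_patterns}


def bootstrap(corpus, seeds, rounds=2, min_patterns=1):
    """Alternate rule learning and extraction, starting from the seed pairs."""
    known = set(seeds)
    for _ in range(rounds):
        patterns = learn_patterns(corpus, known)
        known |= apply_patterns(corpus, patterns, min_patterns)
    return known


corpus = [
    "Alexander Graham Bell invented the telephone.",
    "Eli Whitney invented the cotton gin.",
    "Eli Whitney was the inventor of the cotton gin.",
    "Thomas Edison was the inventor of the phonograph.",
]
seeds = {("Alexander Graham Bell", "the telephone")}
print(sorted(bootstrap(corpus, seeds, rounds=2)))
```

With one round, only the " invented " rule is learned, so the Whitney pair is found but the Edison pair is not. The second round learns " was the inventor of " from the Whitney pair and then reaches Edison, which is the effect the cycle is meant to produce.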

Demos

TextRunner

http://www.cs.washington.edu/research/textrunner/

YAGO

http://www.mpi-inf.mpg.de/yago-naga/yago/demo.html

Google Sets

http://labs.google.com/sets