Information Extraction

(Slides based on those by Ray Mooney, Craig Knoblock, Dan Weld and Perry)

Information Extraction (IE)
  • Identify specific pieces of information (data) in an unstructured or semi-structured textual document.
  • Transform unstructured information in a corpus of documents or web pages into a structured database.
  • Applied to different types of text:
    • Newspaper articles
    • Web pages
    • Scientific articles
    • Newsgroup messages
    • Classified ads
    • Medical notes
Information Extraction vs. NLP?
  • Information extraction attempts to recover some of the structure and meaning in web pages that are, ideally, template-driven.
  • As IE becomes more ambitious and the text more free-form, IE ultimately becomes equivalent to NLP.
  • The web does give NLP one particular boost:
    • Massive corpora
MUC
  • DARPA funded significant efforts in IE in the early to mid 1990s.
  • The Message Understanding Conference (MUC) was an annual event/competition where results were presented.
  • Focused on extracting information from news articles:
    • Terrorist events
    • Industrial joint ventures
    • Company management changes
  • Information extraction is of particular interest to the intelligence community (CIA, NSA).
Other Applications
  • Job postings:
    • Newsgroups: Rapier from austin.jobs
    • Web pages: Flipdog
  • Job resumes:
    • BurningGlass
    • Mohomine
  • Seminar announcements
  • Company information from the web
  • Continuing education course info from the web
  • University information from the web
  • Apartment rental ads
  • Molecular biology information from MEDLINE
Sample Job Posting

Subject: US-TN-SOFTWARE PROGRAMMER

Date: 17 Nov 1996 17:37:29 GMT

Organization: Reference.Com Posting Service

Message-ID: <56nigp$mrs@bilbo.reference.com>

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training.

Present Operating System is DOS. May go to OS-2 or UNIX in future.

Please reply to:

Kim Anderson

AdNET

(901) 458-2888 fax

kimander@memphisonline.com

Extracted Job Template

computer_science_job

id: 56nigp$mrs@bilbo.reference.com

title: SOFTWARE PROGRAMMER

salary:

company:

recruiter:

state: TN

city:

country: US

language: C

platform: PC \ DOS \ OS-2 \ UNIX

application:

area: Voice Mail

req_years_experience: 2

desired_years_experience: 5

req_degree:

desired_degree:

post_date: 17 Nov 1996

Amazon Book Description

….

</td></tr>

</table>

<b class="sans">The Age of Spiritual Machines : When Computers Exceed Human Intelligence</b><br>

<font face=verdana,arial,helvetica size=-1>

by <a href="/exec/obidos/search-handle-url/index=books&field-author=

Kurzweil%2C%20Ray/002-6235079-4593641">

Ray Kurzweil</a><br>

</font>

<br>

<a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg">

<img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90

height=140 align=left border=0></a>

<font face=verdana,arial,helvetica size=-1>

<span class="small">

<span class="small">

<b>List Price:</b> <span class=listprice>$14.95</span><br>

<b>Our Price: <font color=#990000>$11.96</font></b><br>

<b>You Save:</b> <font color=#990000><b>$2.99 </b>

(20%)</font><br>

</span>

<p> <br>

Extracted Book Template

Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence

Author: Ray Kurzweil

List-Price: $14.95

Price: $11.96

:

:

Web Extraction
  • Many web pages are generated automatically from an underlying database.
  • Therefore, the HTML structure of pages is fairly specific and regular (semi-structured).
  • However, output is intended for human consumption, not machine interpretation.
  • An IE system for such generated pages allows the web site to be viewed as a structured database.
  • An extractor for a semi-structured web site is sometimes referred to as a wrapper.
  • The process of extracting from such pages is sometimes referred to as screen scraping.
Web Extraction using DOM Trees
  • Web extraction may be aided by first parsing web pages into DOM trees.
  • Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract.
  • May still need regex patterns to identify the proper portion of the final CharacterData node.
Sample DOM Tree Extraction

[Figure: DOM tree of the Amazon page, with Element nodes (HTML, HEADER, BODY, B, FONT, A) and Character-Data leaves ("Age of Spiritual Machines" under B; "by" under FONT; "Ray Kurzweil" under A).]

Title: HTML → BODY → B → CharacterData

Author: HTML → BODY → FONT → A → CharacterData

Template Types
  • Slots in a template are typically filled by a substring from the document.
  • Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself.
    • Terrorist act: threatened, attempted, accomplished.
    • Job type: clerical, service, custodial, etc.
    • Company type: SEC code
  • Some slots may allow multiple fillers.
    • Programming language
  • Some domains may allow multiple extracted templates per document.
    • Multiple apartment listings in one ad
Simple Extraction Patterns
  • Specify an item to extract for a slot using a regular expression pattern.
    • Price pattern: “\$\d+(\.\d{2})?\b”
  • May require preceding (pre-filler) pattern to identify proper context.
    • Amazon list price:
      • Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”
      • Filler pattern: “\$\d+(\.\d{2})?\b”
  • May require succeeding (post-filler) pattern to identify the end of the filler.
    • Amazon list price:
      • Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”
      • Filler pattern: “.+”
      • Post-filler pattern: “</span>”
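
The pre-filler / filler / post-filler idea can be sketched with Python's `re` module. This is an illustrative sketch only: the variable names are assumptions, and the sample `HTML` string is the list-price line from the Amazon page shown earlier.

```python
import re

# Literal context before and after the filler, escaped so that no character
# is interpreted as a regex operator.
PRE_FILLER = re.escape("<b>List Price:</b> <span class=listprice>")
POST_FILLER = re.escape("</span>")

# Filler: a dollar amount with optional cents. There is no leading \b because
# "$" is not a word character, so \b before it would only match after a
# letter, digit, or underscore.
FILLER = r"\$\d+(?:\.\d{2})?\b"

LIST_PRICE = re.compile(PRE_FILLER + "(" + FILLER + ")" + POST_FILLER)

HTML = "<b>List Price:</b> <span class=listprice>$14.95</span><br>"

match = LIST_PRICE.search(HTML)
list_price = match.group(1) if match else None
```

Only the parenthesized filler is captured; the pre- and post-filler patterns just pin down where it may occur.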
Simple Template Extraction
  • Extract slots in order, starting the search for the filler of slot n+1 where the filler for slot n ended. This assumes the slots always appear in a fixed order.
    • Title
    • Author
    • List price
  • Alternatively, make patterns specific enough to identify each filler when always starting from the beginning of the document.
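
A minimal sketch of in-order slot extraction, assuming Python's `re` module. The slot patterns, the function name `extract_in_order`, and the shortened `PAGE` string (including its author URL) are hypothetical stand-ins for the real page.

```python
import re

# One (slot, pattern) pair per slot, listed in the order the slots appear.
SLOT_PATTERNS = [
    ("title", re.compile(r'<b class="sans">(.+?)</b>')),
    ("author", re.compile(r'<a href="[^"]*">\s*(.+?)</a>')),
    ("list_price", re.compile(r"<span class=listprice>(\$\d+(?:\.\d{2})?)</span>")),
]


def extract_in_order(text, slot_patterns):
    """Fill each slot, resuming the search where the previous filler ended."""
    template, pos = {}, 0
    for slot, pattern in slot_patterns:
        match = pattern.search(text, pos)
        if match is None:
            template[slot] = None   # leave the slot empty, keep the position
            continue
        template[slot] = match.group(1)
        pos = match.end()
    return template


PAGE = (
    '<b class="sans">The Age of Spiritual Machines</b><br>'
    'by <a href="/author/kurzweil">Ray Kurzweil</a><br>'
    "<b>List Price:</b> <span class=listprice>$14.95</span><br>"
)

book = extract_in_order(PAGE, SLOT_PATTERNS)
```

Because each search starts after the previous match, a loose pattern such as the author link cannot accidentally match an earlier link on the page.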
Pre-Specified Filler Extraction
  • If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot.
    • Job category
    • Company type
  • Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.
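
As an illustration of filling a slot by text categorization, here is a small multinomial Naive Bayes classifier with add-one smoothing. The slides do not name a particular classifier, so Naive Bayes is an assumption, and the training documents and labels are made-up toy data.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the category with the highest log-posterior for the document."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: job postings paired with the "job type" slot value.
TRAINING = [
    ("filing typing and answering phones in the front office", "clerical"),
    ("data entry and filing for a busy office", "clerical"),
    ("cleaning floors and emptying trash in office buildings", "custodial"),
    ("night shift cleaning and floor waxing", "custodial"),
]

model = train(TRAINING)
job_type = classify("office assistant needed for filing and data entry", model)
```

The filler ("clerical") never has to appear in the posting itself, which is the point of a pre-specified filler set.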
Learning for IE
  • Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering.
  • Alternative is to use machine learning:
    • Build a training set of documents paired with human-produced filled extraction templates.
    • Learn extraction patterns for each slot using an appropriate machine learning algorithm.
Information Extraction from Unstructured Text: Automated Support for “Semantic Web”
  • Semantic web needs:
    • Tagged data
    • Background knowledge
  • (blue sky approaches to) automate both
    • Knowledge Extraction
      • Extract base level knowledge (“facts”) directly from the web
    • Automated tagging
      • Start with a background ontology and tag other web pages
        • Semtag/Seeker
Extraction from Free Text involves Natural Language Processing

Analogy to regex patterns on DOM trees for structured text

  • If extracting from automatically generated web pages, simple regex patterns usually work.
  • If extracting from more natural, unstructured, human-written text, some NLP may help.
    • Part-of-speech (POS) tagging
      • Mark each word as a noun, verb, preposition, etc.
    • Syntactic parsing
      • Identify phrases: NP, VP, PP
    • Semantic word categories (e.g. from WordNet)
      • KILL: kill, murder, assassinate, strangle, suffocate
  • Off-the-shelf software available to do this!
    • The “Brill” tagger
  • Extraction patterns can use POS or phrase tags.
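
A minimal sketch of an extraction pattern that uses POS tags. The tokens are hand-tagged with Penn Treebank-style tags for illustration; in practice the tags would come from a tagger. The function name and the trigger phrase are assumptions.

```python
# Hand-tagged sentence from the sample job posting (tags are illustrative).
TAGGED = [
    ("Present", "JJ"), ("Operating", "NN"), ("System", "NN"),
    ("is", "VBZ"), ("DOS", "NNP"), (".", "."),
]


def extract_after_trigger(tagged, trigger, filler_tag):
    """Return the run of tokens with filler_tag that follows the trigger words."""
    words = [word.lower() for word, _ in tagged]
    n = len(trigger)
    for i in range(len(tagged) - n + 1):
        if words[i:i + n] == trigger:
            filler = []
            for word, tag in tagged[i + n:]:
                if tag != filler_tag:
                    break
                filler.append(word)
            if filler:
                return " ".join(filler)
    return None


# Pattern: "operating system is" followed by one or more proper nouns.
platform = extract_after_trigger(TAGGED, ["operating", "system", "is"], "NNP")
```

The tag constraint lets the pattern accept any proper noun as a platform name without listing platforms in advance.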
I. Generate-n-Test Architecture

Generic extraction patterns (Hearst ’92):

  • “…Cities such as Boston, Los Angeles, and Seattle…”

(“C such as NP1, NP2, and NP3”) => IS-A(each(head(NP)), C), …

Template-driven extraction (where the template is in terms of the syntax tree)

  • “Detailed information for several countries such as maps, …” ProperNoun(head(NP))
  • “I listen to pretty much all music but prefer country such as Garth Brooks”
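
A rough sketch of the generate step for the “C such as NP1, NP2, and NP3” pattern. It works on raw text with regular expressions instead of a syntax tree, and it approximates ProperNoun(head(NP)) by requiring every word of a candidate to be capitalized; both simplifications, and all names here, are assumptions.

```python
import re

# Class word, then everything after "such as" up to a period or ellipsis.
PATTERN = re.compile(r"(\w+) such as ([^.…]+)")
# Separators between list items: ", and ", ", " or " and ".
SEPARATORS = re.compile(r",\s*and\s+|,\s*|\s+and\s+")


def is_proper_noun(phrase):
    """Crude proxy for ProperNoun(head(NP)): every word is capitalized."""
    words = phrase.split()
    return bool(words) and all(word[0].isupper() for word in words)


def hearst_candidates(sentence):
    """Return candidate IS-A(instance, class) pairs as (instance, class) tuples."""
    match = PATTERN.search(sentence)
    if match is None:
        return []
    class_name = match.group(1).lower()
    items = [part.strip() for part in SEPARATORS.split(match.group(2))]
    return [(item, class_name) for item in items if is_proper_noun(item)]


cities = hearst_candidates("Cities such as Boston, Los Angeles, and Seattle.")
maps = hearst_candidates("Detailed information for several countries such as maps, …")
music = hearst_candidates(
    "I listen to pretty much all music but prefer country such as Garth Brooks"
)
```

The proper-noun check rejects “maps”, but the “country such as Garth Brooks” sentence still yields a wrong candidate, which is why a separate test step is needed.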
Test

Assess candidate extractions using Mutual Information (PMI-IR) (Turney ’01).

Many variations are possible…

Assessment
  • PMI = frequency of I & D co-occurrence, relative to the frequency of I alone
  • 5-50 discriminators Di
  • Each PMI for Di is a feature fi
  • Naïve Bayes evidence combination:

PMI is used for feature selection. NBC is used for learning. Hits are used for assessing PMI as well as conditional probabilities.
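
The combination formula did not survive in this transcript. As a hedged illustration, the sketch below applies the standard Naive Bayes rule to per-feature likelihoods; it is not necessarily the exact form shown on the slide, and the prior and likelihood values are made up.

```python
def naive_bayes_posterior(prior, likelihoods):
    """Combine independent features into P(extraction is correct | features).

    likelihoods: list of (P(f_i | correct), P(f_i | incorrect)) pairs,
    one pair per discriminator feature.
    """
    p_correct = prior
    p_incorrect = 1.0 - prior
    for given_correct, given_incorrect in likelihoods:
        p_correct *= given_correct
        p_incorrect *= given_incorrect
    return p_correct / (p_correct + p_incorrect)


# Hypothetical numbers: two discriminator features that both fired.
posterior = naive_bayes_posterior(0.5, [(0.9, 0.2), (0.8, 0.3)])
```

Each feature would typically be a thresholded PMI score, with the conditional probabilities estimated from hit counts on known positive and negative examples.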

Assessment In Action
  • I = “Yakima” (1,340,000)
  • D = <class name>
  • I+D = “Yakima city” (2760)
  • PMI = (2760 / 1.34M) ≈ 0.002
  • I = “Avocado” (1,000,000)
  • I+D = “Avocado city” (10)
  • PMI = (10 / 1M) = 0.00001 << 0.002
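
The arithmetic in this example can be checked with a few lines of Python. The function name `pmi` is an assumption; the hit counts are the ones quoted on the slide.

```python
def pmi(hits_instance_with_discriminator, hits_instance):
    """PMI score as used here: hits("I D") divided by hits("I")."""
    return hits_instance_with_discriminator / hits_instance


yakima = pmi(2760, 1_340_000)    # "Yakima city" vs. "Yakima"
avocado = pmi(10, 1_000_000)     # "Avocado city" vs. "Avocado"
ratio = yakima / avocado         # how much stronger the city evidence is
```

With these counts the score for “Yakima” is roughly two hundred times the score for “Avocado”.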
Some Sources of Ambiguity
  • Time: “Clinton is the president” (in 1996).
  • Context: “common misconceptions..”
  • Opinion: Elvis…
  • Multiple word senses: Amazon, Chicago, Chevy Chase, etc.
    • Dominant senses can mask recessive ones!
    • Approach: unmasking. ‘Chicago –City’
Chicago

[Figure: panels labeled City and Movie]

Chicago Unmasked

[Figure: panels labeled City sense and Movie sense]

Impact of Unmasking on PMI

Name          Recessive   Original   Unmask   Boost
Washington    city        0.50       0.99     96%
Casablanca    city        0.41       0.93     127%
Chevy Chase   actor       0.09       0.58     512%
Chicago       movie       0.02       0.21     972%

CBioC: Collaborative Bio-Curation
  • Motivation
    • To help extract information nuggets from articles and abstracts and store them in a database.
    • The challenge is that the number of articles is huge and keeps growing, and that processing them requires handling natural language.
    • The two existing approaches
      • Human curation and automatic information extraction systems
      • Neither meets the challenge: the first is expensive, while the second is error-prone.
CBioC (cont’d)
  • Approach: We propose a solution that is inexpensive and that scales up.
    • Our approach takes advantage of automatic information extraction methods as a starting point.
      • It is based on the premise that if there are a lot of articles, then there must be a lot of readers and authors of these articles.
    • We provide a mechanism by which the readers of the articles can participate and collaborate in the curation of information.
    • We refer to our approach as “Collaborative Curation”.
What is the main difference between KnowItAll and CBioC?

Assessment: KnowItAll does it by search hit counts; CBioC does it by voting.

Annotation

“The Chicago Bulls announced yesterday that Michael Jordan will...”

“The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan,_Michael">Michael Jordan</resource> will...”

Semantic Annotation

Named Entity Identification

The simplest task of metadata extraction on NL is to establish a “type” relation between entities in the NL resources and concepts in ontologies.

Picture from http://lsdis.cs.uga.edu/courses/SemWebFall2005/courseMaterials/CSCI8350-Metadata.ppt

Semantics
  • Semantic Annotation
    • The content of the annotation consists of some rich semantic information
    • Targeted not only at human readers of resources but also at software agents
    • Formal: metadata following structural standards
    • Informal: personal notes written in the margin while reading an article
    • Explicit: carries sufficient information for interpretation
    • Tacit: many personal annotations (telegraphic and incomplete)

http://www-scf.usc.edu/~csci586/slides/6

Uses of Annotation

http://www-scf.usc.edu/~csci586/slides/8

Objectives of Annotation
  • Generate Metadata for existing information
    • e.g., author-tag in HTML
    • RDF descriptions to HTML
    • Content description to Multimedia files
  • Employ metadata for
    • Improved search
    • Navigation
    • Presentation
    • Summarization of contents

http://www.aifb.uni-karlsruhe.de/WBS/sst/Teaching/Intelligente%20System%20im%20WWW%20SS%202000/10-Annotation.pdf

Annotation
  • Current practice of annotation for knowledge identification and extraction:
    • is complex
    • is time consuming
    • needs annotation by experts
  • Reduce burden of text annotation for Knowledge Management

www.racai.ro/EUROLAN-2003/html/presentations/SheffieldWilksBrewsterDingli/Eurolan2003AlexieiDingli.ppt

SemTag & Seeker
  • WWW-03 Best Paper Prize
  • Seeded with TAP ontology (72k concepts)
    • And ~700 human judgments
  • Crawled 264 million web pages
  • Extracted 434 million semantic tags
    • Automatically disambiguated
SemTag
  • Uses broad, shallow knowledge base
  • TAP – lexical and taxonomic information about popular objects
    • Music
    • Movies
    • Sports
    • Etc.
SemTag
  • Problem:
    • No write access to original document, so how do you annotate?
  • Solution:
    • Store annotations in a web-available database
SemTag
  • Semantic Label Bureau
    • Separate store of semantic annotation information
    • HTTP server that can be queried for annotation information
    • Example
      • Find all semantic tags for a given document
      • Find all semantic tags for a particular object
SemTag
  • Methodology
SemTag
  • Three phases
    • Spotting Pass:
      • Tokenize the document
      • Save all label instances, each with a 20-word window
    • Learning Pass:
      • Find the corpus-wide distribution of terms at each internal node of the taxonomy
      • Based on a representative sample
    • Tagging Pass:
      • Scan windows to disambiguate each reference
      • Keep references finally determined to be a TAP object
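
A minimal sketch of the spotting pass. It assumes single-token labels and splits the 20-word window evenly, ten words on each side of the label; both choices and the function name `spot` are assumptions made for illustration.

```python
def spot(tokens, labels, half_window=10):
    """Return (label, context) pairs for every label occurrence in tokens.

    The context is up to half_window tokens on each side of the label,
    so at most 2 * half_window words in total.
    """
    spots = []
    for i, token in enumerate(tokens):
        if token.lower() in labels:
            before = tokens[max(0, i - half_window):i]
            after = tokens[i + 1:i + 1 + half_window]
            spots.append((token, before + after))
    return spots


# Hypothetical document: 30 filler tokens with one label in the middle.
tokens = [f"w{i}" for i in range(30)]
tokens[15] = "Jaguar"
spots = spot(tokens, {"jaguar"})
```

Each saved context is what the later passes use to decide which taxonomy node, if any, the label refers to.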
SemTag
  • Solution:
    • Taxonomy Based Disambiguation (TBD)
  • TBD expectation:
    • Human tuned parameters used in small, critical sections
    • Automated approaches deal with bulk of information
SemTag
  • TBD methodology:
    • Each node in the taxonomy is associated with a set of labels
      • Cats, Football, Cars all contain “jaguar”
    • Each label in the text is stored with a window of 20 words – the context
    • Each node has an associated similarity function mapping a context to a similarity
      • Higher similarity → more likely to contain a reference
SemTag
  • Similarity:
    • Built a 200,000 word lexicon (the 200,100 most common words minus the 100 most common)
    • 200,000 dimensional vector space
    • Training: spots (label, context) and correct node
    • Estimated the distribution of terms for nodes
    • Standard cosine similarity
    • TFIDF vectors (context vs. node)
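
A toy sketch of the similarity computation: TF-IDF vectors for a context and for each candidate node, compared by cosine similarity. The three node term lists and the context sentence are made up, and the vocabulary is tiny rather than 200,000 words.

```python
import math
from collections import Counter

# Hypothetical term samples for three taxonomy nodes that all contain "jaguar".
NODE_TERMS = {
    "Cats": "jaguar cat jungle prey spotted predator".split(),
    "Football": "jaguar jacksonville quarterback touchdown season".split(),
    "Cars": "jaguar sedan engine luxury dealer".split(),
}

# Document frequency: in how many nodes each term occurs.
DOC_FREQ = Counter(term for terms in NODE_TERMS.values() for term in set(terms))
N_DOCS = len(NODE_TERMS)


def tfidf_vector(tokens):
    """Sparse TF-IDF vector; terms outside the lexicon are dropped."""
    counts = Counter(t for t in tokens if t in DOC_FREQ)
    return {t: c * math.log(N_DOCS / DOC_FREQ[t]) for t, c in counts.items()}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


node_vectors = {node: tfidf_vector(terms) for node, terms in NODE_TERMS.items()}
context = tfidf_vector("the jaguar stalked its prey through the jungle".split())
similarities = {node: cosine(context, vec) for node, vec in node_vectors.items()}
best_node = max(similarities, key=similarities.get)
```

Because “jaguar” occurs in every node, its IDF weight is zero and the decision rests entirely on the surrounding context words.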
SemTag
  • Some internal nodes very popular:
    • Associate a measurement of how accurate Sim is likely to be at a node
    • Also, how ambiguous the node is overall (consistency of human judgment)
  • TBD Algorithm: returns 1 or 0 to indicate whether a particular context c is on topic for a node v
  • 82% accuracy on 434 million spots
Summary
  • Information extraction can be motivated either as explicating more structure from the data or as an automated route to the Semantic Web
  • Extraction complexity depends on whether the text you have is “templated” or “free-form”
    • Extraction from templated text can be done by regular expressions
    • Extraction from free-form text requires NLP
      • Can be done in terms of part-of-speech tagging
  • “Annotation” involves connecting terms in a free-form text to items in the background knowledge
    • It too can be automated