slide1 n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Motivation PowerPoint Presentation
Download Presentation
Motivation

Loading in 2 Seconds...

play fullscreen
1 / 53

Motivation - PowerPoint PPT Presentation


  • 93 Views
  • Uploaded on

Declarative Information Extraction The Avatar Group IBM Almaden Research Center Rajasekar Krishnamurthy, Yunyao Li , Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, and Huaiyu Zhu. Motivation. Where is the party?. Hmmm…I don’t know. Let me check my email.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Motivation' - yasuo


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide1

Declarative Information ExtractionThe Avatar Group IBM Almaden Research CenterRajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, and Huaiyu Zhu

Sonoma State University Computer Science Colloquium

motivation
Motivation

Where is the party?

Hmmm…I don’t know. Let me check my email.

John and Jane are going to a salsa party tonight! But …

where is the party
Where is the party?

Hi guys,

We are planning a salsa party tonight starting at

10:00pm for our class at Miami Beach Club,

175 San Pedro Square

San Jose, CA 95109

Whoever who is interested, please let me know

so we can organize some car-pooling.

-Juan

PS: you can call me at 408.123.4567 if needed.

salsa address

0 email found

salsa

100 emails found

address

0 email found

The address of the party!

But the email itself does not contain the word “address”!

information extraction

Select Address

From EVENTS

Where event = ‘salsa party’

Event Address

salsa party175 San Pedro Square ...

... ...

175 San Pedro Square …

Information Extraction
  • Exploit the extracted datain your applications
    • E.g. for search, for advertisement
  • Distill structured data from unstructured and semi-structured text
    • E.g. extracting phone numbers from emails, extracting person names from the web

Hi guys,

We are planning a salsa party tonight starting at 10:00pm for our salsa class at Miami Beach Club,

175 San Pedro Square San Jose, CA 95109

Whoever who is interested, please let me know so we can organize some car-pooling.

-Juan

PS: you can call me at 408.123.4567 if needed.

revisit where is the party
Revisit: Where is the Party?

salsa address

Lotus Notes 8.01 Live Text

San Jose, CA 95109

and many others
And many others
  • Literature Citations/ Research Communities
    • DBLife
    • Google Scholar
  • Terminology Extraction
  • Document Summarization
  • Life Science
    • Eg. Gene Sequence Extraction, Protein Interaction Extraction

… …

As the amount of data in text explodes,

information extraction is becoming

increasing important!

basic terminology
Basic Terminology

Programs used to extract structured data

Structured data extracted by annotators

Higher Level Applications

annotations

Annotator

annotations

Annotator

Data Repository

documents

annotations

Annotator

background avatar
Background: Avatar
  • Working on information extraction (IE) since 2003
  • Main goals:
    • Extract structured information from text
    • Build a system that can scale IE to real enterprise apps
    • Build new enterprise applications that leverage IE
evolution of the avatar ie system
Evolution of the Avatar IE System

Evolutionary Triggers

2004

Custom Code

Large number of annotators

RAP(CPSL-style cascading grammar system)

2005

Diverse data sets, Complex extraction tasks

RAP++(RAP + Extensions outside the scope of grammars)

2006

Performance, Expressivity

System T(algebraic information extraction system)

2007

2008

the custom code era

The Custom Code Era

Sonoma State University Computer Science Colloquium

extracting information with custom code
Extracting Information with Custom Code
  • “It’s just pattern matching”
    • Use scripts and regular expressions
  • Then reality sets in…
    • Dozens of rules, even for simple concepts
    • Many special cases
    • Convoluted logic
    • Painfully slow code
the age of cascading grammars

The Age of Cascading Grammars

Sonoma State University Computer Science Colloquium

historical perspective
Historical Perspective
  • MUC (Message Understanding Conference) – 1987 to 1997
    • Competition-style conferences organized by DARPA
    • Shared data sets and performance metrics
      • News articles, Radio transcripts, Military telegraphic messages
  • Classical IE Tasks
    • Entity and Relationship/Link extraction
    • Event detection, sentiment mining etc.
    • Entity resolution/matching
  • Several IE systems were built
    • FRUMP [DeJong82], CIRCUS /AutoSlog [Riloff93], FASTUS [Appelt96], LaSIE/GATE, TextPro, PROTEUS, OSMX [Embley05]
cascading finite state grammars
Cascading Finite-state Grammars
  • Most IE systems share a common formalism
    • Input text viewed as a sequence of tokens
    • Rules expressed as regular expression patterns over the lexical features of these tokens
  • Several levels of processing  Cascading Grammars
  • CPSL
    • A standard language for specifying cascading grammars
    • Created in 1998
  • Several known implementations
    • TextPro: reference implementation of CPSL by Doug Appelt
    • JAPE (Java Annotation Pattern Engine)
      • Part of the GATE NLP framework
      • Under active consideration for commercial use by several companies
cascading grammars by example

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis arcu augue rutrum velit, sed <Name> at <Phone> hendrerit faucibus pede mi ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in e sagittis facilisis, arcu augue rutrum velit, sed <PersonPhone>, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin, in sagittis facilisis, John Smith at <Phone> amet lt arcu tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis nunc volutpat enim, quis viverra lacus nulla sit lectus.

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin l enina i facilisis, <Name> at 555-1212 arcu tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis arcu augue rutrum velit, sed John Smith at 555-1212 hendrerit faucibus pede mi ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti sociosqu ad litora

Cascading Grammars By Example

Level 2

Name Token[~ “at”] Phone  PersonPhone

Level 1

Token[~ “[1-9]\d{2}-\d{4}”]  Phone

Token[~ “John | Smith| …”]+  Name

Level 0 (Tokenize)

experiences with cascading grammars
Experiences with Cascading Grammars
  • Benefits
    • Big step forward from custom code
    • Can express many simple concepts
  • Drawbacks
    • Expressiveness
      • Dealing with overlap
      • Building complex structures
    • Performance
sequencing overlapping input annotations
Sequencing Overlapping Input Annotations

ProperNoun

Instrument

Marco Doe on the Hammond organ

John Pipe plays the guitar

Instrument

ProperNoun

ProperNoun

Instrument

<ProperNoun> <within 30 characters> <Instrument>

Regular Expression Dictionary

Match Match

<d1|d2|…dn>

<[A-Z]\w+(\s[A-Z]\w+)?>

Example rule from the Band Review

sequencing overlapping input annotations1

Instrument

John Pipe plays the guitar

ProperNoun

Sequencing Overlapping Input Annotations
  • Possible options
    • Pre-specified disambiguation rules (e.g., pick earlier annotation)
    • Supply tie-breaking rules for every possible overlap scenario
    • Let implementation make an internal non-deterministic choice (as in JAPE, RAP, ..)

Over 4.5M blog entries a choice one way or another on a single rule would change the number of annotations by +/- 25%.

There is no magic!

Which of the two should we pick?

Prefer ProperNoun over  Instrument

John Pipe plays the guitar

ProperNounToken Token Instrument

John Pipe plays the guitar

TokenInstrument TokenToken Instrument

Instrument

Marco Doe onthe Hammond organ

ProperNounTokenToken Instrument

ProperNoun

Marco Doe on the Hammond organ

Marco Doe onthe Hammond organ

ProperNounTokenToken PoperNountoken

Instrument

ProperNoun

complex structures example signature annotator

Person

Organization

Phone

URL

Within 50 tokens

At least 1 Phone

Start with Person

Person

Organization

Phone

URL

At least 2 of {Phone, Organization, URL}

End with one of these.

Complex Structures Example: Signature Annotator

Laura Haas, PhD

Distinguished Engineer and Director, Computer Science

Almaden Research Center

408-927-1700

http://www.almaden.ibm.com/cs

complex structures existing solutions
Complex Structures: Existing Solutions
  • Approximate using regular expressions
  • Example: Signature
    • Rule: (Person Token{,25} Phone (Token{,25} Contact)+)|

(Person (Token{,25} Contact)+ Token{,25} Phone (Token{,25} Contact)*)

    • Problems:
      • Need to enumerate all possible orders of sub-annotations
        • What if you want at least one phone and one email?
      • Does not restrict total token count
performance

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis arcu augue rutrum velit, sed <Name> at <Phone> hendrerit faucibus pede mi ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in e sagittis facilisis, arcu augue rutrum velit, sed <PersonPhone>, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin, in sagittis facilisis, John Smith at <Phone> amet lt arcu tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis nunc volutpat enim, quis viverra lacus nulla sit lectus.

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin l enina i facilisis, <Name> at 555-1212 arcu tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis arcu augue rutrum velit, sed John Smith at 555-1212 hendrerit faucibus pede mi ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti sociosqu ad litora

Performance

Level 2

Name Token[~ “at”] Phone  PersonPhone

Level 1

Token[~ “[1-9]\d{2}-\d{4}”]  Phone

Token[~ “John | Smith| …”]+  Name

Each level in a cascading grammar looks at each character in each document

Level 0

dawn of declarative information extraction

Dawn of Declarative Information Extraction

Sonoma State University Computer Science Colloquium

system t architecture

Operator

Runtime

System-T Architecture

AQL Language

Specify annotator semantics declaratively

Annotation Algebra

Optimizer

Choose an efficient execution plan that implements semantics

declarative information extraction aql
Declarative Information Extraction: AQL
  • SQL-like language for defining annotators
  • Declarative
    • Define basic patterns and the relationships between them
    • Let the system worry about the order of operations
aql example
AQL Example

<ProperNoun> <within 30 characters> <Instrument>

Regular Expression Dictionary

Match Match

select CombineSpans(name.match, instrument.match)

as annot

from

Regex(/[A-Z]\w+(\s[A-Z]\w+)?/, DocScan.text) name,

Dictionary(“instr.dict”, DocScan.text) instrument

where

Follows(0, 30, name.match, instrument.match);

annotation algebra
Annotation Algebra
  • Each Operator in the algebra…
    • …operates on one or more tuples of annotations
    • …produces tuples of annotations
  • “Document at a time” execution model
    • Algebra expression is defined over
      • the current document d
      • annotations defined over d
  • Algebra expression is evaluated over each document in the corpus individually
basic single argument operator
Basic Single-Argument Operator

Document

Annotation 1

Output Tuple 1

Document

Annotation 2

Output Tuple 2

Operator

Parameters

Document

Input Tuple

comparison with cascading grammars

…John Smith at 555-1212…

…<PersonPhone>…

…<Name> at <Phone>…

…John Smith at 555-1212…

Comparison with Cascading Grammars

John Smith at 555-1212

Apply PersonPhone

Join

John Smith

Block

555-1212

Apply Name Rule

John

Smith

Apply Phone Rule

Dictionary

Regex

Fewer passes over the documents

Grammar

Algebra

revisit problem of sequencing annotations

Instrument

ProperNoun

John Pipe plays the guitar

Marco Benevento on the Hammond organ

ProperNoun

Instrument

Instrument

ProperNoun

Revisit Problem of Sequencing Annotations
slide31

Algebra expression for the Rule from Band Review(Reiss, Raghavan, Krishnamurthy, Zhu and Vaithyanathan, ICDE 2008)

<ProperNoun> <within 30 characters> <Instrument>

Regular Expression Dictionary

Match Match

Join

(followed within 30 characters)

Regular expression

Dictionary

ProperNoun

Instrument

slide32

doc

John Pipe

guitar

doc

Marco Benevento

Hammond organ

doc

John Pipe

doc

Marco Benevento

doc

Hammond

Regex

Dictionary

Instrument

ProperNoun

John Pipe plays the guitar

Marco Benevento on the Hammond organ

ProperNoun

Instrument

Instrument

ProperNoun

ProperNoun <0-30 chars> Instrument

Join

ProperNoun

Instrument

doc

Pipe

doc

guitar

doc

Hammond organ

how is aggregation handled

Person

Organization

Phone

URL

Within 50 tokens

At least 1 Phone

Start with Person

Person

Organization

Phone

URL

At least 2 of {Phone, Organization, URL}

End with one of these.

How is aggregation handled

Laura Haas, PhD

Distinguished Engineer and Director, Computer Science

Almaden Research Center

408-927-1700

http://www.almaden.ibm.com/cs

back to signature

Signature

Person

Organization

Phone

URL

Join

Organization

Phone

URL

Phone

URL

Block

Organization

Union

Back to signature

Person

Person

Org

Phone

URL

Cleaner and potentially faster

performance1
Performance
  • Performance issues with grammars
    • Complete pass through tokens for each rule
    • Many of these passes are wasted work
  • Dominant approach: Make each pass go faster
    • Doesn’t solve root problem!
  • Algebraic approach: Build a query optimizer!
optimizations
Optimizations
  • Query optimization is a familiar topic in databases
  • What’s different in text?
    • Operations over sequences and texts
    • Document boundaries
    • Costs concentrated in extraction operators (dictionary, regular expression)
  • Can leverage these characteristics
    • Text-specific optimizations
    • Significant performance improvements
optimization example
Optimization Example

<ProperNoun> <within 30 characters> <Instrument>

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elentum non ante. John Pipeplayed theguitar. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, arcu augue rutrum ve

Regex match

Dictionary match

0-30 characters

slide38

Classic Query Optimization

(Followed within 30 characters)

Join

<ProperNoun>

<Instrument>

Plan A

Find <Instrument> within 30 characters

Find <ProperNoun> within 30 characters

Consider text to the right

Consider text to the left

<Instrument>

<ProperNoun>

Plan B

Plan C

example of text specific optimization

…John Smith at 555-1212…

Example of Text-Specific Optimization:

John Smith at 555-1212

  • Conditional Evaluation (CE)
    • Leverage document-at-a-time processing
    • Don’t evaluate the inner operand of a join if the outer has no results
    • Costing plans is challenging

CEJoin

John Smith

555-1212

Dictionary

Regex

Don’t evaluate this Regex when there are no dictionary matches.

experimental results band review annotator
Experimental Results (Band Review Annotator)

Classical query optimization

Text-specific optimizations

iopes extracting relationships and composite entities
IOPES: Extracting Relationships and Composite Entities
  • IOPES = IBM Omnifind Personal Email Search
  • Extract entities such as email address, url
  • Associations such as name ↔ phone number
  • Complex entities like conference schedules, directions, signature blocks
thank you
Thank you!
  • For more information…
    • Try out IOPES
      • http://www.alphaworks.ibm.com/tech/emailsearch
    • Avatar Project home page
      • http://almaden.ibm.com/cs/projects/avatar/
    • Contact me
      • yunyaoli@us.ibm.com
backup slides

Backup Slides

Sonoma State University Computer Science Colloquium

block operator b
Block Operator (b)

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. In augue mi, scelerisque non, dictum non, vestibulum congue, erat. Donec non felis. Maecenas urna nunc, pulvinar et, fringilla a, porta at, diam. In iaculis dignissim erat. Quisque pharetra. Suspendisse cursus viverra urna. Aliquam erat volutpat. Donec quis sapien et metus molestie eleifend. Maecenas sit amet metus eleifend nibh semper fringilla. Pellentesque habitant morbi tristique senectus et netus et malesuada

Constraint on distance between inputs

Input

Input

Block

Input

Input

Constraint on number of inputs

conditional evaluation ce

…John Smith at 555-1212…

Conditional Evaluation (CE)

John Smith at 555-1212

  • Leverage document-at-a-time processing
  • Don’t evaluate the inner operand of a join if the outer has no results
  • Costing plans is challenging

CEJoin

John Smith

555-1212

Dictionary

Regex

Don’t evaluate this Regex when there are no dictionary matches.

restricted span evaluation

…John Smith at 555-1212…

Restricted Span Evaluation

John Smith at 555-1212

  • Leverage the sequential nature of text
  • Only evaluate the inner on the relevant portions of the document
  • Limited applicability (compared with CE)
    • Only certain operands and predicates

RSEJoin

555-1212

John Smith

Regex

Dictionary

Only look for dictionary matches in the vicinity of a phone number.

implementing restricted span evaluation rse
Implementing Restricted Span Evaluation (RSE)

s1 binding

  • RSE join operator
  • RSE extraction operator
  • Pass join bindings down to the inner of a join
  • Requires special physical operators at edges of plan

p(s1,s2)Dict(D,s2)

s1

p

R1

RSEDict

D

s2’s that satisfyp(binding, s2)

RSEDictionaryOperator

rse dictionary operator

Length of longest dictionary entry

RSE Dictionary Operator

To find dictionary matches that end in this range…

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin tincidunt eleifend quam. Aliquam ut pede ut enim dapibus venenatis.

…need to examine this range.

  • RSE version of an operator must produce the exact same answer
    • Ongoing work: RSE Regular Expression operator
closely related work shen doan naughton ramakrishnan vldb 2007
Closely related work (Shen, Doan, Naughton, Ramakrishnan, VLDB 2007)

Regular Expressions and Custom Code

Cascading Grammars

Workflows

CPSL, AFST

UIMA, GATE

In the context of Project Cimple.Search for “cimple wisc”

System T

DBLife

delving deeper into system t versus dblife
Delving deeper into System T versus DBLife

Conditional Evaluation

Pushing Down Text Properties

DBLife

Restricted Span Evaluation

Scoping Extractions

System T

Shared Dictionary Matching

Pattern Matching

cascading grammar reality
Cascading Grammar Reality

Pre-processing step: Tokenization of the document text

  • Set of simple grammar rules for person name recognition

Tokenize(Document Text)  Sequence of <Token>

Level 1: Rules that look for patterns in each token to produce corresponding annotations

Token[~ “Mr. | Mrs. | Dr. | …”]  Salutation

Token[~ “Ph.D | MBA | …”]  Qualification

Token[~ “[A-Z][a-z]*”]  CapsWord

Token[~ “Michael | Richard | Smith| …”]  PersonDict

Level 2: Rules that look for patterns involving Level-1 annotations to identify Persons

PersonDict PersonDict  Person

Salutation CapsWord CapsWord  Person

CapsWord CapsWord Token[~“,”]? Qualification Person

Richard Smith

Dr. Laura Haas

Laura Haas, Ph.D

iopes extracting relationships and composite entities1
IOPES: Extracting Relationships and Composite Entities
  • IOPES = IBM Omnifind Personal Email Search
  • Entities like addresses, person names
  • Relationships like name ↔ phone number
  • Complex entities like conference schedules, directions, signature blocks
extracting entities in notes 8 01 live text
Extracting Entities in Notes 8.01 Live Text
  • Leverages Information Extraction Techniques
  • Names, addresses, phone numbers…
  • Ships with Lotus Notes 8.01