

Building Discerning Knowledge Bases from Multiple Source Documents, with Novel Fact Filtering

Jason Hale (1), Sumali Conlon (1), Tim McCready (1), Susan Lukose (2), Anil Vinjamur (2)

(1) Department of Management Information Systems, University of Mississippi, University, MS 38677

(2) Department of Computer and Information Science, University of Mississippi, University, MS 38677


Presentation Outline

  • Background
  • Motivation
  • Research Goals
  • Systems Architecture
  • Method of Approach
  • Future Research

[Diagram: Web articles → Information Extraction Agent → XML Facts → Novelty Fact Filtering Agent → Novel Facts → Knowledge Base]

  • Web articles: WSJ "Article LMN" (1/2/05), containing facts AB, HI, and X !Y; Reuters "OPQ" (1/3/05), containing facts BC, HI, and XY
  • Matches found: BC complements AB; HI duplicates HI; XY conflicts with X !Y
  • Knowledge Base: FACTS (ABC, HI, X !Y); ARTICLES (LMN, OPQ); CONFLICTS (X !Y, X Y); SOURCES (WSJ, Reuters)


Business Information

Yesterday

  • Scarce
  • Expensive
  • Printed text
  • Slow moving
  • Stale upon arrival
  • Hoarded by experts
  • Manually processed
  • Trusted, but not always correct

Today

  • Overabundant
  • Cheap
  • Electronic text
  • Electric speed
  • Fresh mixed w/ stale
  • Communicable
  • Semi-automatic
  • Mix of correct/incorrect, trusted/untrusted


Looking for Information on the Web

  • Repetitive information in multiple packages
  • No time to read them all
  • You want just the facts you need
    • From all (and just) the relevant docs: Information Retrieval (IR)
    • Maybe without reading any articles: Information Extraction (IE)
    • Definitely without redundant reading: Novelty Filtering
  • Impossible to keep up with manually


Ongoing Research Goals of UM Team

  • Advancing Information Extraction Methods
    • Extracting financial information from online documents (Reuters, Wall Street Journal)
    • via the FIRST system (Lukose et al., AMCIS 2004)
  • Making business information available on the web more processable
    • Converting the extracted facts into XML
    • FIRST Quarter (Vinjamur et al., AMCIS 2005)


Ongoing Research of Our Team: Goals Addressed in this Paper

  • Making web business information more manageable
    • Adding a Novelty Filtering Layer to the evolving FIRST Quarter IE system
    • Storing novel facts extracted by FIRST Quarter in a Knowledge Base
  • Liberating facts from their sources
    • Multiple sourcing (Wall Street Journal, Reuters)
    • Fact trustworthiness


Flexible Information extRaction SysTem (FIRST)

  • Extracted info from the Wall Street Journal only
    • Corporate earnings facts and predictions
  • Human text-pattern based rule creation
  • Used natural language processing
    • w/ WordNet to enhance recall
    • w/ KWIC Index to enhance precision
  • Output facts in semi-structured text
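To make "text-pattern based rule" concrete, here is a minimal sketch of what one hand-written extraction rule might look like. It is not taken from FIRST: the regular expression, the function name `extract_earnings`, and the sample sentence are all hypothetical. The verb and noun alternations only stand in for the kind of synonym expansion that WordNet could supply.

```python
import re

# Hypothetical rule, NOT from FIRST: "<Company> reported|posted earnings|profit of $<amount> a|per share"
EARNINGS_RULE = re.compile(
    r"(?P<company>[A-Z][\w&.]*(?: [A-Z][\w&.]*)*) "   # capitalized company name
    r"(?:reported|posted) (?:earnings|profit) of "    # alternations stand in for synonyms
    r"\$(?P<amount>\d+(?:\.\d+)?) (?:a|per) share"    # dollar amount per share
)

def extract_earnings(sentence):
    """Return the matched items as a dict, or None if the rule does not fire."""
    match = EARNINGS_RULE.search(sentence)
    return match.groupdict() if match else None

# Made-up sentence for illustration only.
fact = extract_earnings("Acme Corp. reported earnings of $1.25 a share.")
print(fact)
```

Widening the alternations tends to raise recall, while tightening the surrounding context tends to raise precision, which is the trade-off the slide attributes to WordNet and the KWIC index.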


FIRST Quarter Enhancements

  • Extracting from multiple sources:
    • WSJ, Reuters, etc.
    • multi-sourced facts
    • requires humans adding more rules
  • Extracting time and date information
  • Extracting more-structured facts


Information Retrieval Agent

A theoretical IR agent retrieves relevant, text-based corporate earnings reports from multiple web sources and feeds them to an IE agent, such as FIRST Quarter.

[Diagram: Web → Information Retrieval Agent → Articles → Information Extraction Agent → XML Facts → Novelty Fact Filtering Agent → Novel Facts → Knowledge Base]

  • Articles: Reuters "OPQ" (1/3/05), containing facts BC, HI, and X!Y; WSJ "Article LMN" (1/2/05), containing facts AB, HI, and XY
  • Matches: BC complements AB; HI duplicates HI; XY conflicts with X!Y



Example WSJ Article Fed Into FIRST Quarter


Information Extraction Agent

Information is extracted from the text, producing discrete XML facts. This pool of XML facts is funneled into a novelty filter.


Tasks of the FIRST Quarter Novelty Filter

  • Weed out duplicate facts
  • Fold in complementary facts
    • facts of differing precision
  • Detect and manage conflicting facts
    • corrected facts

Each XML fact is packaged with metadata identifying its respective source.
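As an illustration of a fact "packaged with metadata identifying its source", the sketch below parses one such fact. The transcript does not show the actual FIRST Quarter XML layout, so every element name, attribute name, and value here is an assumption made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: element names, attribute names, and values are invented.
SAMPLE_FACT = """
<fact>
  <source publisher="WSJ" article="LMN" date="2005-01-02"/>
  <item name="company">ExampleCorp</item>
  <item name="earnings_per_share">1.25</item>
</fact>
"""

def parse_fact(xml_text):
    """Split one XML fact into its items and its source metadata."""
    root = ET.fromstring(xml_text.strip())
    source = dict(root.find("source").attrib)
    items = {item.get("name"): item.text for item in root.findall("item")}
    return items, source

items, source = parse_fact(SAMPLE_FACT)
print(items)
print(source)
```

Keeping the source separate from the items is what lets a filter compare the items of two facts while still recording where each one came from.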



XML Fact Extracted by FIRST Quarter


In concept… partial facts are detected in the novelty filter and joined into complete facts before entering the knowledge base.

  • Knowledge Base tables: FACTS, ARTICLES, CONFLICTS, SOURCES


In practice… each partial fact is interrogated in isolation, made to reveal its source, then admitted to the knowledge base.

  • Source of the fact shown: article LMN, source WSJ
  • Knowledge Base tables: FACTS, ARTICLES, SOURCES


Match Types

  • Complementing Facts
  • Duplicate Facts
  • Facts of Differing Precision
  • Conflicting Facts

As each subsequent fact is digested, the Novelty Fact Filtering Agent asks: does it match a fact already learned? AB and BC provide complementary info about B, so rather than inserting another partial fact, we augment (update) the existing fact.

  • Incoming fact: article OPQ, source Reuters
  • Knowledge Base (FACTS, ARTICLES, SOURCES) already holds: article LMN, source WSJ
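The matching step can be sketched as follows, under the assumption that a fact is a mapping from item names to values and that two facts "match" when they share at least one item. The function names `classify` and `fold_in`, and the placeholder values, are hypothetical and are not taken from FIRST Quarter.

```python
def classify(known, new):
    """Compare a new fact with a stored one (both are dicts of item -> value)."""
    shared = set(known) & set(new)
    if not shared:
        return "unrelated"            # nothing in common, so no match
    if any(known[key] != new[key] for key in shared):
        return "conflict"             # same item, different value
    if set(new) <= set(known):
        return "duplicate"            # the new fact adds nothing
    return "complement"               # agrees on shared items and adds new ones

def fold_in(known, new):
    """Augment the stored fact in place with the items of the new fact."""
    known.update(new)
    return known

# AB and BC from the slide, with placeholder values.
ab = {"A": "a1", "B": "b1"}
bc = {"B": "b1", "C": "c1"}

kind = classify(ab, bc)
if kind == "complement":
    fold_in(ab, bc)                   # the stored fact now covers A, B, and C
print(kind, ab)
```

Facts of differing precision are not handled here; equality on shared items is the simplest comparison that reproduces the AB plus BC example.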


Novel fact HI is detected from a familiar source. HI enters the knowledge base and remembers its sole source.

  • Source of the fact: article LMN, source WSJ
  • Knowledge Base as labeled on the slide: FACTS (ABC); ARTICLES (LMN, OPQ); SOURCES (WSJ, Reuters)


Novel fact XY is detected and digested as a sole-sourced fact.

  • Source of the fact: article LMN, source WSJ
  • Knowledge Base as labeled on the slide: FACTS (ABC); ARTICLES (LMN, OPQ); SOURCES (WSJ, Reuters)


Duplicate fact HI is found to have come from a 2nd source. We remember the new source but discard the duplicate fact. HI is now linked to multiple sources.

  • Incoming fact: article OPQ, source Reuters
  • Knowledge Base tables: FACTS, ARTICLES, FACT_ARTICLE, SOURCES (articles LMN and OPQ; sources WSJ and Reuters)
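One way to let a single fact point at several articles is a link table, which is presumably the role of FACT_ARTICLE. The sketch below uses an in-memory SQLite database; the table names follow the slide, but every column and the helper names `row_id` and `record_fact` are assumptions, since the actual schema is not in the transcript.

```python
import sqlite3

# Table names follow the slide; every column is an assumption for illustration.
SCHEMA = """
CREATE TABLE sources  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT UNIQUE,
                       source_id INTEGER REFERENCES sources(id));
CREATE TABLE facts    (id INTEGER PRIMARY KEY, content TEXT UNIQUE);
CREATE TABLE fact_article (
    fact_id    INTEGER REFERENCES facts(id),
    article_id INTEGER REFERENCES articles(id),
    PRIMARY KEY (fact_id, article_id)
);
"""

def row_id(conn, table, column, value):
    """Look up the id of an existing row (table and column are fixed names here)."""
    return conn.execute(
        f"SELECT id FROM {table} WHERE {column} = ?", (value,)
    ).fetchone()[0]

def record_fact(conn, fact, article, source):
    """Store the fact once, and link it to every article that reports it."""
    conn.execute("INSERT OR IGNORE INTO sources (name) VALUES (?)", (source,))
    conn.execute(
        "INSERT OR IGNORE INTO articles (title, source_id) VALUES (?, ?)",
        (article, row_id(conn, "sources", "name", source)),
    )
    conn.execute("INSERT OR IGNORE INTO facts (content) VALUES (?)", (fact,))
    conn.execute(
        "INSERT OR IGNORE INTO fact_article (fact_id, article_id) VALUES (?, ?)",
        (row_id(conn, "facts", "content", fact),
         row_id(conn, "articles", "title", article)),
    )

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
record_fact(conn, "HI", "LMN", "WSJ")       # first sighting: HI is sole-sourced
record_fact(conn, "HI", "OPQ", "Reuters")   # duplicate: only a new link is added

fact_count = conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
hi_sources = [row[0] for row in conn.execute(
    """SELECT s.name FROM facts f
       JOIN fact_article fa ON fa.fact_id = f.id
       JOIN articles a ON a.id = fa.article_id
       JOIN sources s ON s.id = a.source_id
       WHERE f.content = 'HI' ORDER BY s.name"""
)]
print(fact_count, hi_sources)
```

The UNIQUE constraint on the fact content is what makes the second insert a no-op, so the duplicate is discarded while its source is still remembered.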


Fact X!Y is matched against known facts and found to conflict with XY. Both facts are moved to a Conflicts table.

  • Knowledge Base tables: FACTS, ARTICLES, CONFLICTS, SOURCES (articles LMN and OPQ; sources WSJ and Reuters)


X!Y is later extracted from a 3rd source (WSJ "ZZZ", 1/4/05) and matched against known facts and conflicts. Since it matches an existing conflict, X!Y is vindicated, while XY is disavowed. X!Y is now a dual-sourced fact.

  • Knowledge Base tables: FACTS, ARTICLES, CONFLICTS, SOURCES (articles LMN, OPQ, and ZZZ; sources WSJ and Reuters)
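The conflict walkthrough can be sketched in a few lines. The slides do not state the exact rule by which a conflicting fact is vindicated, so this sketch assumes one: a value wins once it is backed by strictly more articles than any rival. The class name, the method name, and the (subject, value) representation are hypothetical.

```python
class KnowledgeBase:
    """Toy model of the FACTS and CONFLICTS tables (not the authors' design)."""

    def __init__(self):
        self.facts = {}      # subject -> (value, set of supporting articles)
        self.conflicts = {}  # subject -> {value: set of supporting articles}

    def add(self, subject, value, article):
        if subject in self.conflicts:
            candidates = self.conflicts[subject]
            candidates.setdefault(value, set()).add(article)
            ranked = sorted(candidates.items(),
                            key=lambda pair: len(pair[1]), reverse=True)
            # Assumed rule: vindicate a value once it strictly leads in support.
            if len(ranked[0][1]) > len(ranked[1][1]):
                self.facts[subject] = ranked[0]
                del self.conflicts[subject]
        elif subject not in self.facts:
            self.facts[subject] = (value, {article})       # novel fact
        else:
            known_value, articles = self.facts[subject]
            if known_value == value:
                articles.add(article)                      # duplicate: new source only
            else:
                del self.facts[subject]                    # move both to conflicts
                self.conflicts[subject] = {known_value: articles,
                                           value: {article}}

kb = KnowledgeBase()
kb.add("X", "Y", "LMN")     # WSJ "LMN": XY enters as a sole-sourced fact
kb.add("X", "!Y", "OPQ")    # Reuters "OPQ": X!Y conflicts with XY
in_conflict = "X" in kb.conflicts
kb.add("X", "!Y", "ZZZ")    # WSJ "ZZZ": X!Y gains a second article
print(in_conflict, kb.facts["X"])
```

Counting articles is only one possible policy; weighting by source trustworthiness or by article date would be equally compatible with what the slides show.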


Knowledge Base Schema


Method of Approach

  • Find a pair of related earnings reports from WSJ and Reuters.
  • Manually extract all targeted facts from the articles.
  • For each document in the pair, count the number of:
    • Facts to be extracted
    • Items to be extracted
    • Duplicate facts
    • Complementing facts
    • Conflicting facts


Method of Approach (cont.)

  • Feed the document pair into the FIRST Quarter system.
  • At the end, look in the database and compare the results with the manually extracted facts.
  • If not all facts were processed correctly, then:
    • Manually update the rule base
    • Re-process the pair of source documents
    • Back up and wipe out the database
    • Re-process the corpus of test documents, and compare with the backup database to compute the new scores


Method of Approach

  • We will be finished with FIRST Quarter when:
    • The last X pairs of new documents processed do not result in improved accuracy over the previous X pairs, in spite of rule updates. [WE STOP IMPROVING]
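A minimal sketch of this stopping rule, assuming one accuracy score is recorded per document pair and that "improved" means a higher mean over the window. The function name `should_stop` and the sample scores are made up for illustration.

```python
def should_stop(scores, x):
    """True when the last x scores show no gain in mean over the x before them."""
    if len(scores) < 2 * x:
        return False                   # not enough history to compare two windows
    previous = scores[-2 * x:-x]
    latest = scores[-x:]
    return sum(latest) / x <= sum(previous) / x

# Made-up accuracy histories, one score per processed document pair.
still_improving = should_stop([0.70, 0.75, 0.80, 0.80], x=2)
plateaued = should_stop([0.80, 0.84, 0.81, 0.82], x=2)
print(still_improving, plateaued)
```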


Measures of Effectiveness

  • Fact-level Recall/Precision
  • Item-level Recall/Precision
  • Duplicate Fact Recall/Precision
  • Complementing Fact Recall/Precision
  • Conflicting Fact Recall/Precision
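Each of these measures can be computed the same way once the manually extracted answers and the system output are available as sets. The sketch below assumes facts (or items, duplicates, and so on) can be compared by identifier; the function name and the identifiers are hypothetical.

```python
def precision_recall(system, gold):
    """Precision and recall of system output against manually extracted answers."""
    system, gold = set(system), set(gold)
    correct = system & gold
    precision = len(correct) / len(system) if system else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# Made-up identifiers: four facts expected, three returned, two of them correct.
gold_facts = {"f1", "f2", "f3", "f4"}
system_facts = {"f1", "f2", "f5"}
precision, recall = precision_recall(system_facts, gold_facts)
print(precision, recall)
```

Running it once per level (facts, items, duplicates, complementing facts, conflicting facts) gives the five recall/precision pairs listed above.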


FIRST Results to Date

  • Precision = (number of items tagged correctly) / (number of items tagged)
    • FIRST's precision = 90%
  • Recall = (number of items tagged correctly by the system) / (number of possible items that experts would tag)
    • FIRST's recall = 85%
  • F = 2PR / (P + R)
    • FIRST's F value = 87.43%
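The reported F value follows directly from the precision and recall figures: 2 × 0.90 × 0.85 / (0.90 + 0.85) = 1.53 / 1.75 ≈ 0.8743. A short check, with `f_measure` as an illustrative function name:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)

f_value = f_measure(0.90, 0.85)
print(f"{f_value:.2%}")  # 87.43%
```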


Future Research Goals of UM Team

  • Incorporate machine learning techniques to improve FIRST Quarter IE precision and recall
  • Build tools to mark up / weed out copies of processed source docs:
    • to reflect which facts were extracted
    • to weed out redundant information
  • Add an IR agent to feed docs to the FIRST Quarter system, building the knowledge base automatically from the web
  • Add web services built on the knowledge base


Questions?

