
Building Discerning Knowledge Bases from Multiple Source Documents, with Novel Fact Filtering

Jason Hale (1), Sumali Conlon (1), Tim McCready (1), Susan Lukose (2), Anil Vinjamur (2). (1) Department of Management Information Systems, University of Mississippi, University, MS 38677


Presentation Transcript


  1. Building Discerning Knowledge Bases from Multiple Source Documents, with Novel Fact Filtering. Jason Hale (1), Sumali Conlon (1), Tim McCready (1), Susan Lukose (2), Anil Vinjamur (2). (1) Department of Management Information Systems, University of Mississippi, University, MS 38677. (2) Department of Computer and Information Science, University of Mississippi, University, MS 38677.

  2. Presentation Outline • Background • Motivation • Research Goals • Systems Architecture • Method of Approach • Future Research [Architecture diagram: Web Articles, Information Extraction Agent, XML Facts, Novelty Fact Filtering Agent, Novel Facts, Knowledge Base. Example articles: WSJ “Article LMN” (1/2/05) and Reuters “OPQ” (1/3/05). BC complements AB; HI duplicates HI; XY conflicts with X!Y. Knowledge Base contents: Facts (ABC, HI, X!Y); Articles (LMN, OPQ); Conflicts (X!Y, XY); Sources (WSJ, Reuters).]

  3. Business Information • Yesterday: Scarce; Expensive; Printed text; Slow moving; Stale upon arrival; Hoarded by experts; Manually processed; Trusted, but not always correct • Today: Overabundant; Cheap; Electronic text; Electric speed; Fresh mixed with stale; Communicable; Semi-automatic; Mix of correct/incorrect, trusted/untrusted

  4. Looking for Information on the Web • Repetitive information in multiple packages • No time to read them all • Impossible to keep up with manually • You want just the facts you need - From all (and just) the relevant docs: Information Retrieval (IR) - Maybe without reading any articles: Information Extraction (IE) - Definitely without redundant reading: Novelty Filtering

  5. Ongoing Research Goals of UM Team • Advancing Information Extraction Methods • Extracting financial information from online documents (Reuters, Wall Street Journal) via the FIRST System (Lukose et al., AMCIS 2004) • Making business information available on the web more processable • Converting the extracted facts into XML: FIRST Quarter (Vinjamur et al., AMCIS 2005)

  6. Ongoing Research of Our Team: Goals Addressed in this Paper • Making web business information more manageable • Adding a Novelty Filtering Layer to the evolving FIRST Quarter IE System • Storing novel facts extracted by FIRST Quarter in a Knowledge Base • Liberating facts from their sources • Multiple sourcing (Wall Street Journal, Reuters) • Fact trustworthiness

  7. Flexible Information extRaction SysTem (FIRST) • Extracted info from Wall Street Journal only • Corporate earnings facts and predictions • Human text-pattern based rule creation • Used natural language processing - w/ WordNet to enhance recall - w/ KWIC Index to enhance precision • Output facts in semi-structured text

  8. FIRST Quarter Enhancements • Extracting from multiple sources: WSJ, Reuters, etc. - multi-sourced facts - requires humans adding more rules • Extracting time and date information • Extracting more-structured facts

  9. [Architecture diagram: Web Articles, Information Retrieval Agent, Information Extraction Agent, XML Facts, Novelty Fact Filtering Agent, Novel Facts, Knowledge Base. Example articles: WSJ “Article LMN” (1/2/05), containing A, B, H, I, and XY; Reuters “OPQ” (1/3/05), containing B, C, HI, and X!Y. BC complements AB; HI duplicates HI; XY conflicts with X!Y.] A theoretical IR agent retrieves relevant, text-based corporate earnings reports from multiple web sources and feeds them to an IE agent, such as FIRST Quarter.

  10. Example WSJ Article Fed Into FIRST Quarter

  11. [Architecture diagram: XML Facts] Information is extracted from the text, producing discrete XML facts. This pool of XML facts is funneled into a novelty filter.

  12. Tasks of the FIRST Quarter Novelty Filter • Weed out duplicate facts • Fold in complementary facts - facts of differing precision • Detect and manage conflicting facts - corrected facts [Architecture diagram] Each XML fact is packaged with metadata identifying its respective source (for example: Reuters, “OPQ”, 1/3/05).

  13. XML Fact Extracted by FIRST Quarter

  14. [Architecture diagram] In concept, partial facts are detected in the novelty filter and joined into complete facts before entering the knowledge base. Knowledge Base tables: FACTS, ARTICLES, CONFLICTS, SOURCES.

  15. [Architecture diagram] In practice, each partial fact is interrogated in isolation, made to reveal its source (article LMN, source WSJ), then admitted to the knowledge base. Knowledge Base tables: FACTS, ARTICLES, SOURCES.

  16. Match Types: Complementing Facts • Duplicate Facts • Facts of Differing Precision • Conflicting Facts [Architecture diagram] As each subsequent fact is digested, the Novelty Fact Filtering Agent asks: does it match a fact already learned? AB and BC provide complementary info about B, so rather than inserting another partial fact, we augment (update) the existing fact. Knowledge Base tables: FACTS, ARTICLES (LMN, OPQ), SOURCES (WSJ, Reuters).

  17. [Architecture diagram] Novel fact HI is detected from a familiar source (article LMN, WSJ). HI enters the knowledge base and remembers its sole source. Knowledge Base: FACTS (ABC), ARTICLES (LMN, OPQ), SOURCES (WSJ, Reuters).

  18. [Architecture diagram] Novel fact XY is detected (article LMN, WSJ) and digested as a sole-sourced fact. Knowledge Base: FACTS (ABC), ARTICLES (LMN, OPQ), SOURCES (WSJ, Reuters).

  19. Match Types: Complementing Facts • Duplicate Facts • Facts of Differing Precision • Conflicting Facts [Architecture diagram] Duplicate fact HI is found to have come from a 2nd source (Reuters, article OPQ). We remember the new source but discard the duplicate fact. HI is now linked to multiple sources. Knowledge Base tables: FACTS, ARTICLES (LMN, OPQ), FACT_ARTICLE, SOURCES (WSJ, Reuters).

  20. Match Types: Complementing Facts • Duplicate Facts • Facts of Differing Precision • Conflicting Facts [Architecture diagram] Fact X!Y is matched against known facts and found to conflict with XY. Both facts are moved to a Conflicts table. Knowledge Base tables: FACTS, ARTICLES (LMN, OPQ), CONFLICTS, SOURCES (WSJ, Reuters).

  21. [Architecture diagram] X!Y is later extracted from a 3rd source (WSJ, “ZZZ”, 1/4/05) and matched against known facts and conflicts. Since it matches an existing conflict, X!Y is now a dual-sourced fact. X!Y is vindicated, while XY is disavowed. Knowledge Base tables: FACTS, ARTICLES (LMN, OPQ, ZZZ), CONFLICTS, SOURCES (WSJ, Reuters).
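The novelty-filter walk-through in the preceding slides can be sketched in a few lines of Python. This is only an illustrative sketch, not the FIRST Quarter implementation: the class and method names (`NoveltyFilter`, `StoredFact`, `add`), the idea of matching facts by a caller-supplied `key`, and the rule that a conflicting fact is "vindicated" as soon as a second source repeats it are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class StoredFact:
    """A fact's items plus every source that reported it (hypothetical structure)."""
    items: dict
    sources: set = field(default_factory=set)


class NoveltyFilter:
    """Toy novelty filter: classifies each incoming fact against what is already known."""

    def __init__(self):
        self.facts = {}      # key -> StoredFact (plays the role of the FACTS table)
        self.conflicts = {}  # key -> list of StoredFact (plays the role of CONFLICTS)

    def add(self, key, items, source):
        # Case 1: the key is already in dispute.
        if key in self.conflicts:
            for candidate in self.conflicts[key]:
                if candidate.items == items:
                    # Assumption: a second source settles the conflict.
                    candidate.sources.add(source)
                    self.facts[key] = candidate
                    del self.conflicts[key]
                    return "vindicated"
            self.conflicts[key].append(StoredFact(dict(items), {source}))
            return "conflict"

        # Case 2: nothing known under this key yet.
        known = self.facts.get(key)
        if known is None:
            self.facts[key] = StoredFact(dict(items), {source})
            return "novel"

        # Case 3: shared items disagree, so both versions go to conflicts.
        shared = known.items.keys() & items.keys()
        if any(known.items[name] != items[name] for name in shared):
            del self.facts[key]
            self.conflicts[key] = [known, StoredFact(dict(items), {source})]
            return "conflict"

        # Case 4: consistent with the known fact; always remember the source.
        known.sources.add(source)
        if items.keys() <= known.items.keys():
            return "duplicate"      # nothing new, discard the copy
        known.items.update(items)   # fold the new items into the existing fact
        return "complement"


# Replaying the slides' abstract example (keys and values are placeholders).
nf = NoveltyFilter()
print(nf.add("B", {"A": 1, "B": 2}, "WSJ/LMN"))          # AB
print(nf.add("H", {"H": 1, "I": 2}, "WSJ/LMN"))          # HI
print(nf.add("X", {"X": 1, "Y": True}, "WSJ/LMN"))       # XY
print(nf.add("B", {"B": 2, "C": 3}, "Reuters/OPQ"))      # BC complements AB
print(nf.add("H", {"H": 1, "I": 2}, "Reuters/OPQ"))      # HI duplicates HI
print(nf.add("X", {"X": 1, "Y": False}, "Reuters/OPQ"))  # X!Y conflicts with XY
print(nf.add("X", {"X": 1, "Y": False}, "WSJ/ZZZ"))      # X!Y from a 3rd source
```

Keeping the source set on each stored fact is what lets a duplicate be discarded without losing the information that a second article reported it.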

  22. Knowledge Base Schema

  23. Method of Approach • Find a pair of related earnings reports from WSJ and Reuters. • Manually extract all targeted facts from the articles. • For each document in the pair, count the number of: • Facts to be extracted • Items to be extracted • Duplicate facts • Complementing facts • Conflicting facts

  24. Method of Approach (cont.) • Feed the document pair into the FIRST Quarter system. • At the end, look in the database and compare the results with the manually extracted facts. • If not all facts were processed correctly, then: • Manually update the rule base • Re-process the pair of source documents. • Back up and wipe out the database • Re-process the corpus of test documents, and compare with the backup database to compute the new scores

  25. Method of Approach • We will be finished with FIRST Quarter when: • The last X pairs of new documents processed do not result in improved accuracy over the previous X, in spite of rule updates. [WE STOP IMPROVING]
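One way to read the stopping rule above is as a comparison of two adjacent windows of scores. The sketch below assumes that accuracy is summarized by a single score per document pair (for example an F value) and that windows are compared by their means; neither detail is stated in the slides, and the function name `should_stop` and the sample scores are purely illustrative.

```python
from statistics import mean


def should_stop(scores, window):
    """Return True when the latest `window` scores show no gain over the previous `window`.

    `scores` is a chronological list with one accuracy score per processed
    document pair; `window` is the X from the stopping rule.
    """
    if len(scores) < 2 * window:
        return False  # not enough history to compare two full windows
    recent = mean(scores[-window:])
    previous = mean(scores[-2 * window:-window])
    return recent <= previous


# Illustrative scores only, not measured results.
print(should_stop([0.70, 0.80, 0.85, 0.85], window=2))  # still improving
print(should_stop([0.80, 0.90, 0.85, 0.80], window=2))  # no longer improving
```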

  26. Measures of Effectiveness • Fact-level Recall/Precision • Item-level Recall/Precision • Duplicate Fact Recall/Precision • Complementing Fact Recall/Precision • Conflicting Fact Recall/Precision

  27. FIRST Results to Date • Precision = (number of items tagged correctly) / (number of items tagged) • FIRST's Precision = 90% • Recall = (number of items tagged correctly by the system) / (number of possible items that experts would tag) • FIRST's Recall = 85% • F = 2PR / (P + R) • FIRST's F value = 87.43%
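As a quick check, the reported F value follows from the reported precision and recall. The helper name `f_measure` is just a label for this sketch; the inputs are the 90% and 85% figures from the slide.

```python
def f_measure(precision, recall):
    """Balanced F value: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


f = f_measure(0.90, 0.85)
print(f"{f:.2%}")  # prints 87.43%
```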

  28. Future Research Goals of UM Team • Incorporate machine learning techniques to improve FIRST Quarter IE precision and recall • Build tools to mark up/weed out copies of processed source docs - to reflect which facts were extracted - to weed out redundant information • Add an IR agent to feed the FIRST Quarter system docs, to build the knowledge base automatically from the web • Add web services built on the knowledge base.

  29. Questions? [Architecture diagram: Web Articles, Information Extraction Agent, XML Facts, Novelty Fact Filtering Agent, Novel Facts, Knowledge Base. BC complements AB; HI duplicates HI; XY conflicts with X!Y. Knowledge Base contents: Facts (ABC, HI, X!Y); Articles (LMN, OPQ); Conflicts (X!Y, XY); Sources (WSJ, Reuters).]
