
WoK: A Web of Knowledge

David W. Embley

Brigham Young University

Provo, Utah, USA

A Web of Pages → A Web of Facts
  • Birthdate of my great grandpa Orson
  • Price and mileage of red Nissans, 1990 or newer
  • Location and size of chromosome 17
  • US states with property crime rates above 1%
Toward a Web of Knowledge
  • Fundamental questions
    • What is knowledge?
    • What are facts?
    • How does one know?
  • Philosophy
    • Ontology
    • Epistemology
    • Logic and reasoning
Ontology
  • Existence → asks “What exists?”
  • Concepts, relationships, and constraints with formal foundation
Epistemology
  • The nature of knowledge → asks: “What is knowledge?” and “How is knowledge acquired?”
  • Populated conceptual model
Logic and Reasoning
  • Principles of valid inference → asks: “What is known?” and “What can be inferred?”
  • For us, it answers: what can be inferred (in a formal sense) from conceptualized data.

Find price and mileage of red Nissans, 1990 or newer

Making this Work → How?
  • Distill knowledge from the wealth of digital web data
  • Annotate web pages
  • Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge

[Figure: annotated web page; callouts labeled “Annotation” and “Fact”]

Turning Raw Symbols into Knowledge
  • Symbols: $ 11,500 117K Nissan CD AC
  • Data: price(11,500) mileage(117K) make(Nissan)
  • Conceptualized data:
    • Car(C123) has Price($11,500)
    • Car(C123) has Mileage(117,000)
    • Car(C123) has Make(Nissan)
    • Car(C123) has Feature(AC)
  • Knowledge
    • “Correct” facts
    • Provenance
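
The progression from symbols to conceptualized data can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's implementation: the `Fact` class, the three regular expressions, and the source URL are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str   # non-lexical object, e.g. "Car(C123)"
    relation: str  # relationship, e.g. "has Price"
    value: object  # lexical value in its internal representation
    source: str    # provenance: where the raw symbols were found


# Raw symbols as they appear on the slide.
symbols = "$ 11,500 117K Nissan CD AC"

# Symbols -> data: illustrative recognizers, not the real extraction rules.
price = int(re.search(r"\$\s*([\d,]+)", symbols).group(1).replace(",", ""))
mileage = int(re.search(r"(\d+)K\b", symbols).group(1)) * 1000
make = re.search(r"\b(Nissan|Toyota|Honda)\b", symbols).group(1)
has_ac = re.search(r"\bAC\b", symbols) is not None

# Data -> conceptualized data: every value is tied to one car object,
# and each fact keeps a pointer back to its (hypothetical) source page.
source = "http://example.com/ads/123"
facts = [
    Fact("Car(C123)", "has Price", price, source),
    Fact("Car(C123)", "has Mileage", mileage, source),
    Fact("Car(C123)", "has Make", make, source),
]
if has_ac:
    facts.append(Fact("Car(C123)", "has Feature", "AC", source))

for fact in facts:
    print(fact.subject, fact.relation, fact.value)
```

Keeping `source` on every fact is one simple way to carry provenance; whether a fact is "correct" still has to be judged separately.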
Actualization (with Extraction Ontologies)

Find me the price and mileage of all red Nissans – I want a 1990 or newer.

Explanation: How it Works
  • Extraction Ontologies
  • Semantic Annotation
  • Free-Form Query Interpretation
Extraction Ontologies
  • Object sets
  • Relationship sets
  • Participation constraints
  • Lexical
  • Non-lexical
  • Primary object set
  • Aggregation
  • Generalization/Specialization

Extraction Ontologies

Data Frame:
  • Internal Representation: float
  • Values
    • External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
    • Left Context: $
  • Key Word Phrase
    • Key Words: ([Pp]rice)|([Cc]ost)| …
  • Operators
    • Operator: >
    • Key Words: (more\s*than)|(more\s*costly)|…
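
A data frame of this kind might be modeled as follows. This is a minimal sketch, not the actual data-frame format: the class name `DataFrame`, its method names, and the value pattern (which assumes comma-separated thousands rather than reusing the slide's expression verbatim) are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DataFrame:
    name: str
    value_pattern: str    # external representation of values
    keyword_pattern: str  # context key words that signal this concept
    operators: dict = field(default_factory=dict)  # operator -> key-word regex

    def to_internal(self, text):
        """Convert a matched string to the internal representation (float)."""
        return float(text.replace("$", "").replace(",", "").strip())

    def recognize(self, text):
        """Return the internal value of every external match in text."""
        return [self.to_internal(m.group(0))
                for m in re.finditer(self.value_pattern, text)]

    def has_keyword(self, text):
        return re.search(self.keyword_pattern, text) is not None

    def operator_in(self, text):
        """Return the first operator whose key words occur in text, if any."""
        for op, pattern in self.operators.items():
            if re.search(pattern, text):
                return op
        return None


# Assumed pattern: "$", optional spaces, comma-grouped digits, optional cents.
price = DataFrame(
    name="Price",
    value_pattern=r"\$\s*\d{1,3}(?:,\d{3})*(?:\.\d{2})?",
    keyword_pattern=r"[Pp]rice|[Cc]ost",
    operators={">": r"more\s*than|more\s*costly"},
)

print(price.recognize("Asking $11,500 or best offer"))
print(price.operator_in("cars that cost more than $5,000"))
```

Because the recognizers are declared as data rather than written as page-specific code, the same frame can be applied to any page in the domain.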

Generality & Resiliency of Extraction Ontologies
  • Generality: assumptions about web pages
    • Data rich
    • Narrow domain
    • Document types
      • Single-record documents (hard, but doable)
      • Multiple-record documents (harder)
      • Records with scattered components (even harder)
  • Resiliency: declarative
    • Still works when web pages change
    • Works for new, unseen pages in the same domain
    • Scalable, but takes work to declare the extraction ontology
Free-Form Query Interpretation
  • Parse Free-Form Query (with respect to data extraction ontology)
  • Select Ontology
  • Formulate Query Expression
  • Run Query Over Semantically Annotated Data
Parse Free-Form Query

“Find me the [price] and [mileage] of all [red] [Nissan]s – I want a [1996] [or newer]”

Bracketed terms are the ones recognized with respect to the extraction ontology; “or newer” maps to the >= operator.


Select Ontology

“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
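
One plausible way to select an ontology is to score each candidate by how many of its recognizers fire on the query and keep the best. The sketch below assumes this scoring rule; the two ontologies, their names, and their patterns are hypothetical stand-ins for real extraction ontologies.

```python
import re

# Hypothetical ontologies, each reduced to a few recognizer patterns.
ONTOLOGIES = {
    "CarAds": {
        "Price": r"\bprice\b",
        "Mileage": r"\bmileage\b",
        "Color": r"\b(red|blue|white|black)\b",
        "Make": r"\b(Nissan|Toyota|Honda)s?\b",
        "Year": r"\b(19|20)\d{2}\b",
    },
    "ApartmentRentals": {
        "Price": r"\b(price|rent)\b",
        "Bedrooms": r"\b\d+\s*bedrooms?\b",
        "Pets": r"\bpets?\b",
    },
}


def select_ontology(query):
    """Return the ontology whose recognizers match the query most often."""
    scores = {
        name: sum(1 for pattern in recognizers.values()
                  if re.search(pattern, query, re.IGNORECASE))
        for name, recognizers in ONTOLOGIES.items()
    }
    return max(scores, key=scores.get), scores


query = "Find me the price and mileage of all red Nissans - I want a 1996 or newer"
best, scores = select_ontology(query)
print(best, scores)
```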


Formulate Query Expression

  • Conjunctive queries and aggregate queries
  • Projection on mentioned object sets
  • Selection via values and operator keywords
    • Color = “red”
    • Make = “Nissan”
    • Year >= 1996
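
The projection and selection steps can be sketched as below. This is an assumption-laden illustration: the records in `CARS` are invented, and the keyword patterns are hard-coded here rather than taken from an extraction ontology.

```python
import operator
import re

# Invented records standing in for semantically annotated data.
CARS = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 4500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1993, "Price": 2100, "Mileage": 160000},
    {"Make": "Nissan", "Color": "blue", "Year": 2001, "Price": 6900, "Mileage": 82000},
    {"Make": "Toyota", "Color": "red", "Year": 1999, "Price": 5200, "Mileage": 98000},
]

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


def formulate(query):
    """Build (projection, selection) from a free-form query."""
    q = query.lower()
    # Projection: object sets mentioned without a constraining value.
    projection = [name for name in ("Price", "Mileage") if name.lower() in q]
    # Selection: recognized values, plus operator key words such as "or newer".
    selection = []
    m = re.search(r"\b(red|blue|white|black)\b", q)
    if m:
        selection.append(("Color", "=", m.group(1)))
    m = re.search(r"\b(nissan|toyota|honda)s?\b", q)
    if m:
        selection.append(("Make", "=", m.group(1).capitalize()))
    m = re.search(r"\b((?:19|20)\d{2})\b(?:\s+or\s+(newer|older))?", q)
    if m:
        op = {"newer": ">=", "older": "<="}.get(m.group(2), "=")
        selection.append(("Year", op, int(m.group(1))))
    return projection, selection


def run(projection, selection, records):
    """Evaluate the conjunctive query over the records."""
    return [
        {name: r[name] for name in projection}
        for r in records
        if all(OPS[op](r[attr], val) for attr, op, val in selection)
    ]


projection, selection = formulate(
    "Find me the price and mileage of all red Nissans - I want a 1996 or newer"
)
rows = run(projection, selection, CARS)
print(selection)
print(rows)
```

All selection conditions are combined with AND, which matches the conjunctive queries named on the slide; aggregate queries are not covered by this sketch.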

>= Operator

Great! But Problems Still Need Resolution
  • How do we create extraction ontologies?
    • Manual creation requires several dozen person-hours
    • Semi-automatic creation
      • TISP (Table Interpretation by Sibling Pages)
      • TANGO (Table ANalysis for Generating Ontologies)
      • Nested Schemas with Regular Expressions
      • Synergistic Bootstrapping
      • Form-based Information Harvesting
  • How do we scale up?
    • Practicalities of technology transfer and usage
    • Millions of queries over zillions of facts for thousands of ontologies
Manual Creation
  • Library of instance recognizers
  • Library of lexicons
Automatic Annotation with TISP (Table Interpretation with Sibling Pages)
  • Recognize tables (discard non-tables)
  • Locate table labels
  • Locate table values
  • Find label/value associations
Recognize Tables

  • Layout Tables (discard)
  • Data Table
  • Nested Data Tables

Locate Table Labels

Examples:

Identification.Gene model(s).Protein

Identification.Gene model(s).2

Locate Table Labels

Examples:

Identification.Gene model(s).Gene Model

Identification.Gene model(s).2

Find Label/Value Associations

Example:

(Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918

Technique Details
  • Unnest tables
  • Match tables in sibling pages
    • “Perfect” match (table for layout → discard)
    • “Reasonable” match (sibling table)
  • Determine & use table-structure pattern
    • Discover pattern
    • Pattern usage
    • Dynamic pattern adjustment
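
A rough sketch of the sibling-page idea follows. It assumes a simple heuristic that the slides do not spell out: cells whose text is identical across two sibling pages are treated as labels, and cells that differ are treated as values. The tables are already unnested lists of rows, and every cell value other than `WP:CE28918` is made up.

```python
def classify_cells(table_a, table_b):
    """Mark each cell "L" (label) if equal across siblings, else "V" (value)."""
    return [
        ["L" if a == b else "V" for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(table_a, table_b)
    ]


def associate(table, kinds):
    """Pair each value cell with the nearest label to its left in the same row."""
    pairs = {}
    for row, row_kinds in zip(table, kinds):
        label = None
        for cell, kind in zip(row, row_kinds):
            if kind == "L":
                label = cell
            elif label is not None:
                pairs[label] = cell
    return pairs


# Two sibling pages from the same (hypothetical) site, same table layout.
page_a = [["Protein", "WP:CE28918"], ["Gene model", "GM-A"]]
page_b = [["Protein", "WP:CE00000"], ["Gene model", "GM-B"]]

kinds = classify_cells(page_a, page_b)
print(kinds)
print(associate(page_a, kinds))
```

A table that matches perfectly across siblings would come out as all labels, which is consistent with discarding it as a layout table.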
Semi-Automatic Annotation with TANGO (Table Analysis for Generating Ontologies)
  • Recognize and normalize table information
  • Construct mini-ontologies from tables
  • Discover inter-ontology mappings
  • Merge mini-ontologies into a growing ontology
Recognize Table Information

                                 Religion
Country      Population          Albanian  Muslim  Roman     Shi’a   Sunni   other
             (July 2001 est.)    Orthodox          Catholic  Muslim  Muslim
Afghanistan  26,813,057                                      15%     84%     1%
Albania      3,510,484           20%       70%     10%
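
Normalizing such a table amounts to flattening it into one fact per filled cell. The sketch below assumes a particular reading of the columns (Afghanistan's percentages under Shi’a Muslim, Sunni Muslim, and other; Albania's under Albanian Orthodox, Muslim, and Roman Catholic); the data layout and the `normalize` function are illustrative only.

```python
# Religion sub-columns, in the order they appear in the table header.
RELIGIONS = ["Albanian Orthodox", "Muslim", "Roman Catholic",
             "Shi'a Muslim", "Sunni Muslim", "other"]

# One entry per table row; None marks an empty cell (assumed column placement).
ROWS = {
    "Afghanistan": {"population": 26813057,
                    "religion": [None, None, None, 15, 84, 1]},
    "Albania": {"population": 3510484,
                "religion": [20, 70, 10, None, None, None]},
}


def normalize(rows, religions):
    """Flatten the nested-header table into (country, attribute, value) facts."""
    facts = []
    for country, data in rows.items():
        facts.append((country, "Population (July 2001 est.)", data["population"]))
        for religion, percent in zip(religions, data["religion"]):
            if percent is not None:
                facts.append((country, "Religion." + religion, percent))
    return facts


facts = normalize(ROWS, RELIGIONS)
for fact in facts:
    print(fact)
```

The dotted attribute names keep the header nesting (Religion over its sub-columns) visible in the flattened facts.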

Construct Mini-Ontology

[Figure: the Country / Population / Religion table from the previous slide, from which the mini-ontology is constructed]
Semi-Automatic Annotation via Synergistic Bootstrapping (Based on Nested Schemas with Regular Expressions)
  • Build a page-layout, pattern-based annotator
  • Automate layout recognition based on examples
  • Auto-generate examples with extraction ontologies
  • Synergistically run pattern-based annotator & extraction-ontology annotator
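
The interplay between the two annotators can be illustrated with a toy example. Everything here is an assumption made for the sketch: the price recognizer stands in for an extraction ontology, and the "layout pattern" is simply the most common three characters of markup on either side of the recognized values.

```python
import re
from collections import Counter

PRICE = re.compile(r"\$\d[\d,]*")  # stand-in for an ontology recognizer


def conceptual_annotate(html):
    """Ontology-based step: spans of values the recognizer finds."""
    return [m.span() for m in PRICE.finditer(html)]


def generate_pattern(html, spans, width=3):
    """Learn the most common (left, right) markup context of the examples."""
    contexts = Counter(
        (html[max(0, start - width):start], html[end:end + width])
        for start, end in spans
    )
    (left, right), _ = contexts.most_common(1)[0]
    return re.compile(re.escape(left) + r"(.*?)" + re.escape(right))


def structural_annotate(html, pattern):
    """Layout-driven step: everything that sits in the learned context."""
    return [m.group(1) for m in pattern.finditer(html)]


page = (
    "<tr><td>Nissan</td><td><b>$4,500</b></td></tr>"
    "<tr><td>Toyota</td><td><b>$5,200</b></td></tr>"
    "<tr><td>Honda</td><td><b>3,900 USD</b></td></tr>"
)

examples = conceptual_annotate(page)
pattern = generate_pattern(page, examples)
print(structural_annotate(page, pattern))
```

The ontology recognizer misses the third price because it lacks a dollar sign, but the layout pattern learned from the first two examples still picks it up.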
Synergistic Execution

  • Extraction Ontology
  • Partially Annotated Document
  • Conceptual Annotator (ontology-based annotation)
  • Pattern Generation
  • Document Layout Patterns
  • Annotated Document
  • Structural Annotator (layout-driven annotation)


Form-Based Information Harvesting
  • Forms
    • General familiarity
    • Reasonable conceptual framework
    • Appropriate correspondence
      • Transformable to ontological descriptions
      • Capable of accepting source data
  • Instance recognizers
    • Some pre-existing instance recognizers
    • Lexicons
  • Automated extraction ontology creation?
Form Creation
  • Basic form-construction facilities:
    • single-entry field
    • multiple-entry field
    • nested form
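
The three facilities map naturally onto a small recursive data structure, and a form built from them can be turned into an ontological description. The sketch below is an assumption, not the tool's actual model: the class names, the example form, and the reading of single-entry as "one" and multiple-entry as "many" are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Field:
    label: str
    multiple: bool = False  # False: single-entry, True: multiple-entry


@dataclass
class Form:
    title: str
    fields: List[Union["Field", "Form"]] = field(default_factory=list)


def to_ontology(form):
    """List (object set, related object set, cardinality) triples for a form."""
    triples = []
    for item in form.fields:
        if isinstance(item, Form):
            triples.append((form.title, item.title, "nested"))
            triples.extend(to_ontology(item))
        else:
            cardinality = "many" if item.multiple else "one"
            triples.append((form.title, item.label, cardinality))
    return triples


# Hypothetical form: a protein with several names and a nested location form.
protein = Form("Protein", [
    Field("Name", multiple=True),
    Field("Organism"),
    Form("Location", [Field("Chromosome")]),
])

for triple in to_ontology(protein):
    print(triple)
```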
Almost Ready to Harvest
  • Need reading path: DOM-tree structure
  • Need to resolve mapping problems
    • Split/Merge
    • Union/Selection
Almost Ready to Harvest …
  • Need reading path: DOM-tree structure
  • Need to resolve mapping problems
    • Split/Merge
    • Union/Selection

Name
  • Voltage-dependent anion-selective channel protein 3
  • VDAC-3
  • hVDAC3
  • Outer mitochondrial membrane protein porin 3

Almost Ready to Harvest …
  • Need reading path: DOM-tree structure
  • Need to resolve mapping problems
    • Split/Merge
    • Union/Selection

Name
  • T-complex protein 1 subunit theta
  • TCP-1-theta
  • CCT-theta
  • Renal carcinoma antigen NY-REN-15

Can Now Harvest

Name
  • 14-3-3 protein epsilon
  • Mitochondrial import stimulation factor L subunit
  • Protein kinase C inhibitor protein-1
  • KCIP-1
  • 14-3-3E

Can Now Harvest

Name
  • Voltage-dependent anion-selective channel protein 3
  • VDAC-3
  • hVDAC3
  • Outer mitochondrial membrane protein porin 3

Can Now Harvest

Name
  • Tryptophanyl-tRNA synthetase, mitochondrial precursor
  • EC 6.1.1.2
  • Tryptophan--tRNA ligase
  • TrpRS
  • (Mt)TrpRS

Harvesting Populates Ontology

Also helps adjust ontology constraints

Can Harvest from Additional Sites

Name
  • T-complex protein 1 subunit theta
  • TCP-1-theta
  • CCT-theta
  • Renal carcinoma antigen NY-REN-15

Automating Extraction Ontology Creation

Lexicons

Name
  • 14-3-3 protein epsilon
  • Mitochondrial import stimulation factor L subunit
  • Protein kinase C inhibitor protein-1
  • KCIP-1
  • 14-3-3E
  • T-complex protein 1 subunit theta
  • TCP-1-theta
  • CCT-theta
  • Renal carcinoma antigen NY-REN-15
  • Tryptophanyl-tRNA synthetase, mitochondrial precursor
  • EC 6.1.1.2
  • Tryptophan--tRNA ligase
  • TrpRS
  • (Mt)TrpRS
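
Once names have been harvested, turning them into a lexicon-based recognizer is straightforward. The following is a minimal sketch, assuming that a lexicon is just a set of literal strings compiled into one alternation; the function name and the sample sentence are hypothetical.

```python
import re

# A few of the harvested Name values from the slides.
harvested_names = [
    "14-3-3 protein epsilon", "KCIP-1", "14-3-3E",
    "T-complex protein 1 subunit theta", "TCP-1-theta", "CCT-theta",
    "TrpRS", "(Mt)TrpRS",
]


def build_recognizer(lexicon):
    """Compile a lexicon into a single regular expression of literal entries."""
    # Longest entries first, so "(Mt)TrpRS" is preferred over "TrpRS".
    entries = sorted(set(lexicon), key=len, reverse=True)
    return re.compile("|".join(re.escape(entry) for entry in entries))


recognizer = build_recognizer(harvested_names)

# Made-up sentence, used only to exercise the recognizer.
text = "Binding of KCIP-1 and (Mt)TrpRS to CCT-theta was measured."
found = recognizer.findall(text)
print(found)
```

Each newly harvested page adds entries to the lexicon, so the recognizer can be rebuilt as harvesting proceeds.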

Automating Extraction Ontology Creation

Instance Recognizers
  • Number Patterns
  • Context Keywords and Phrases

Automatic Semantic Annotation

Recognize and annotate with respect to an ontology

Ontology Transformations

Transformations to and from all

Practicalities: WoK Query Interfaces

(Future Work)

  • Advanced free-form queries with disjunction and negation
  • Form-based query language
  • Table-based query languages
  • Graphical query languages
Practicalities: Bootstrapping the WoK

(Future Work)

  • Won’t just happen without sufficient content
  • Niche applications
    • Historical Data (e.g. Genealogy)
    • Topical Blogs
  • Local WoKs
    • Intra-organizational effort
    • Individual interests
Practicalities: Scalability

(Future Work)

  • Potential rapid growth
    • Thousands of ontologies
    • Millions of simultaneous queries
    • Billions of annotated pages
    • Trillions of facts
  • Search-engine-like caching & query processing
Key to Success: Simplicity via Automation
  • Automatic (or near automatic) creation of extraction ontologies
  • Automatic (or near automatic) annotation of web pages
  • Simple but accurate query specification without specialized training

www.deg.byu.edu
