
WoK: A Web of Knowledge - PowerPoint PPT Presentation



WoK: A Web of Knowledge. David W. Embley, Brigham Young University, Provo, Utah, USA. A Web of Pages → A Web of Facts: birthdate of my great grandpa Orson; price and mileage of red Nissans, 1990 or newer; location and size of chromosome 17; US states with property crime rates above 1%.


WoK: A Web of Knowledge

David W. Embley

Brigham Young University

Provo, Utah, USA


A Web of Pages → A Web of Facts

  • Birthdate of my great grandpa Orson

  • Price and mileage of red Nissans, 1990 or newer

  • Location and size of chromosome 17

  • US states with property crime rates above 1%


Toward a Web of Knowledge

  • Fundamental questions

    • What is knowledge?

    • What are facts?

    • How does one know?

  • Philosophy

    • Ontology

    • Epistemology

    • Logic and reasoning


Ontology

  • Existence → asks “What exists?”

  • Concepts, relationships, and constraints with formal foundation


Epistemology

  • The nature of knowledge → asks: “What is knowledge?” and “How is knowledge acquired?”

  • Populated conceptual model


Logic and Reasoning

  • Principles of valid inference → asks: “What is known?” and “What can be inferred?”

  • For us, it answers: what can be inferred (in a formal sense) from conceptualized data.

Find price and mileage of red Nissans, 1990 or newer


Making this Work → How?

  • Distill knowledge from the wealth of digital web data

  • Annotate web pages

  • Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge

[Figure labels: Annotation, Fact]


Turning Raw Symbols into Knowledge

  • Symbols: $ 11,500 117K Nissan CD AC

  • Data: price(11,500) mileage(117K) make(Nissan)

  • Conceptualized data:

    • Car(C123) has Price($11,500)

    • Car(C123) has Mileage(117,000)

    • Car(C123) has Make(Nissan)

    • Car(C123) has Feature(AC)

  • Knowledge

    • “Correct” facts

    • Provenance
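
The progression above (symbols, data, conceptualized data, knowledge with provenance) can be sketched in a few lines. This is only an illustration under stated assumptions: the `Fact` class, the recognizer regular expressions, the list of makes, and the source URL are all hypothetical and are not taken from the presentation.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str    # non-lexical object, e.g. the car "C123"
    relation: str   # e.g. "Price"
    value: object
    source: str     # provenance: where the raw symbols were found

def symbols_to_data(raw):
    """Turn raw symbols into typed data values (hypothetical recognizers)."""
    data = {}
    price = re.search(r"\$\s*(\d{1,3}(?:,\d{3})*)", raw)
    if price:
        data["Price"] = int(price.group(1).replace(",", ""))
    mileage = re.search(r"\b(\d+)K\b", raw)
    if mileage:
        data["Mileage"] = int(mileage.group(1)) * 1000
    make = re.search(r"\b(Nissan|Toyota|Honda)\b", raw)
    if make:
        data["Make"] = make.group(1)
    data["Feature"] = re.findall(r"\b(AC|CD)\b", raw)
    return data

def conceptualize(car_id, data, source):
    """Attach each data value to a car object, keeping its provenance."""
    facts = []
    for relation, value in data.items():
        values = value if isinstance(value, list) else [value]
        facts.extend(Fact(car_id, relation, v, source) for v in values)
    return facts

# The URL is a placeholder for wherever the ad was actually found.
facts = conceptualize("C123", symbols_to_data("$ 11,500 117K Nissan CD AC"),
                      source="https://example.com/ad/123")
```

Keeping the source on every fact is what lets a user check a claimed fact against the page it came from.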


Actualization (with Extraction Ontologies)

Find me the price and mileage of all red Nissans – I want a 1990 or newer.





Explanation: How it Works

  • Extraction Ontologies

  • Semantic Annotation

  • Free-Form Query Interpretation


Extraction Ontologies

Object sets

Relationship sets

Participation constraints

Lexical

Non-lexical

Primary object set

Aggregation

Generalization/Specialization


Extraction Ontologies

Data Frame:

  • Internal Representation: float

  • Values

    • External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?

    • Left Context: $

  • Key Word Phrase

    • Key Words: ([Pp]rice)|([Cc]ost)| …

  • Operators

    • Operator: >

    • Key Words: (more\s*than)|(more\s*costly)|…
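
A data frame of this kind can be pictured as a small declarative record of regular expressions. The sketch below is an assumption-laden illustration, not the extraction-ontology implementation: the `DataFrame` class and its method names are hypothetical, and the value pattern is a variant of the slide's expression that also accepts comma-separated thousands.

```python
import re
from dataclasses import dataclass, field

@dataclass
class DataFrame:
    name: str
    value_pattern: str      # external representation of values
    keyword_pattern: str    # key words that signal the concept
    operators: dict = field(default_factory=dict)  # operator -> key-word regex

    def values(self, text):
        """Return the internal (float) representation of each value found."""
        return [float(m.group(1).replace(",", ""))
                for m in re.finditer(self.value_pattern, text)]

    def mentioned(self, text):
        """True if a key word for this concept appears in the text."""
        return re.search(self.keyword_pattern, text) is not None

    def operator(self, text):
        """Return the first operator whose key words appear, else None."""
        for op, pattern in self.operators.items():
            if re.search(pattern, text):
                return op
        return None

PRICE = DataFrame(
    name="Price",
    value_pattern=r"\$\s*(\d+(?:,\d{3})*(?:\.\d{2})?)",  # "$" is the left context
    keyword_pattern=r"[Pp]rice|[Cc]ost",
    operators={">": r"more\s*than|more\s*costly"},
)
```

Because everything is declared as data, the same `PRICE` frame can be applied to an ad ("Asking $11,500") and to a query ("more than $5,000") without page-specific code.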


Generality & Resiliency of Extraction Ontologies

  • Generality: assumptions about web pages

    • Data rich

    • Narrow domain

    • Document types

      • Single-record documents (hard, but doable)

      • Multiple-record documents (harder)

      • Records with scattered components (even harder)

  • Resiliency: declarative

    • Still works when web pages change

    • Works for new, unseen pages in the same domain

    • Scalable, but takes work to declare the extraction ontology



Free-Form Query Interpretation

  • Parse Free-Form Query

    (with respect to data extraction ontology)

  • Select Ontology

  • Formulate Query Expression

  • Run Query Over Semantically Annotated Data


Parse Free-Form Query

“Find me the price and mileage of all red Nissans – I want a 1996 or newer”

Recognized terms:

  • price

  • mileage

  • red

  • Nissan

  • 1996

  • or newer (>= Operator)
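
As a rough illustration of parsing a free-form query against an extraction ontology, the sketch below marks which object sets are requested and which values and operators constrain them. The recognizer tables and the rule that "or newer" attaches to Year are simplifying assumptions made for this example only.

```python
import re

# Hypothetical recognizers for a car-ads ontology.
KEYWORDS = {"Price": r"\bprice\b", "Mileage": r"\bmileage\b"}
VALUES = {
    "Color": r"\b(red|blue|white|black)\b",
    "Make": r"\b(Nissan|Toyota|Honda)s?\b",
    "Year": r"\b(19\d{2}|20\d{2})\b",
}
OPERATORS = {">=": r"\bor\s+newer\b", "<=": r"\bor\s+older\b"}

def parse_query(query):
    """Return (requested object sets, {object set: (operator, value)})."""
    requested = [name for name, pattern in KEYWORDS.items()
                 if re.search(pattern, query, re.IGNORECASE)]
    constraints = {}
    for name, pattern in VALUES.items():
        match = re.search(pattern, query, re.IGNORECASE)
        if match:
            constraints[name] = ("=", match.group(1))
    # Assumption: comparison key words such as "or newer" apply to Year.
    for op, pattern in OPERATORS.items():
        if "Year" in constraints and re.search(pattern, query, re.IGNORECASE):
            constraints["Year"] = (op, constraints["Year"][1])
    return requested, constraints

requested, constraints = parse_query(
    "Find me the price and mileage of all red Nissans – I want a 1996 or newer")
```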


Select Ontology

“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
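
One plausible way to select an ontology is to pick the one whose recognizers match the most of the query. The presentation does not state its selection rule, so the scoring below, the ontology names, and the pattern lists are all assumptions for illustration.

```python
import re

# Hypothetical ontologies, each reduced to a list of recognizer patterns.
ONTOLOGIES = {
    "CarAds": [r"\bprice\b", r"\bmileage\b",
               r"\b(Nissan|Toyota|Honda)s?\b", r"\b(19|20)\d{2}\b"],
    "RealEstate": [r"\bprice\b", r"\bbedrooms?\b", r"\bacres?\b"],
}

def select_ontology(query, ontologies):
    """Return the name of the ontology with the most matching recognizers."""
    def score(patterns):
        return sum(1 for p in patterns if re.search(p, query, re.IGNORECASE))
    return max(ontologies, key=lambda name: score(ontologies[name]))

chosen = select_ontology(
    "Find me the price and mileage of all red Nissans – I want a 1996 or newer",
    ONTOLOGIES)
```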


Formulate Query Expression

  • Conjunctive queries and aggregate queries

  • Projection on mentioned object sets

  • Selection via values and operator keywords

    • Color = “red”

    • Make = “Nissan”

    • Year >= 1996

>= Operator


Formulate Query Expression

For

Let

Where

Return
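
The selection and projection described above can be sketched over a handful of annotated records. This is a minimal stand-in, assuming records are plain dictionaries; the sample cars are invented. The loop plays the role of For, the `all(...)` test the role of Where, and the projected dictionary the role of Return.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le,
       ">": operator.gt, "<": operator.lt}

def run_query(records, projection, selection):
    """Conjunctive query: keep records meeting every constraint, then project."""
    results = []
    for record in records:                                   # For
        if all(name in record and OPS[op](record[name], value)
               for name, (op, value) in selection.items()):  # Where
            results.append({name: record.get(name) for name in projection})  # Return
    return results

# Invented sample data standing in for semantically annotated pages.
CARS = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 11500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1994, "Price": 4900, "Mileage": 160000},
    {"Make": "Toyota", "Color": "red", "Year": 2001, "Price": 9000, "Mileage": 90000},
]

answer = run_query(
    CARS,
    projection=["Price", "Mileage"],
    selection={"Color": ("=", "red"), "Make": ("=", "Nissan"), "Year": (">=", 1996)},
)
```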


Run Query Over Semantically Annotated Data


Great! But Problems Still Need Resolution

  • How do we create extraction ontologies?

    • Manual creation requires several dozen person hours

    • Semi-automatic creation

      • TISP (Table Interpretation by Sibling Pages)

      • TANGO (Table ANalysis for Generating Ontologies)

      • Nested Schemas with Regular Expressions

      • Synergistic Bootstrapping

      • Form-based Information Harvesting

  • How do we scale up?

    • Practicalities of technology transfer and usage

    • Millions of queries over zillions of facts for thousands of ontologies




Manual Creation

  • Library of instance recognizers

  • Library of lexicons


Automatic Annotation with TISP (Table Interpretation with Sibling Pages)

  • Recognize tables (discard non-tables)

  • Locate table labels

  • Locate table values

  • Find label/value associations


Recognize Tables

Layout Tables (discard)

Data Table

Nested Data Tables


Locate Table Labels

Examples:

Identification.Gene model(s).Protein

Identification.Gene model(s).2


Locate Table Labels

Examples:

Identification.Gene model(s).Gene Model

Identification.Gene model(s).2



Find Label/Value Associations

Example:

(Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
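
The example pairs a column-label path with a row-label path and maps the pair to a cell value. A minimal sketch of building such associations follows; the function name, the table layout, and every cell except WP:CE28918 are hypothetical.

```python
def associate(section_path, column_labels, rows):
    """Map (column-label path, row-label path) pairs to cell values."""
    facts = {}
    for row_label, cells in rows:
        for column_label, cell in zip(column_labels, cells):
            key = (f"{section_path}.{column_label}", f"{section_path}.{row_label}")
            facts[key] = cell
    return facts

associations = associate(
    "Identification.Gene model(s)",
    ["Gene Model", "Protein"],
    [
        ("1", ["model-a", "protein-a"]),    # placeholder cells
        ("2", ["model-b", "WP:CE28918"]),   # value taken from the example above
    ],
)
```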



Interpretation Technique: Sibling Page Comparison

[Figure callouts across four slides: Same; Almost Same; Different, Same]


Technique Details

  • Unnest tables

  • Match tables in sibling pages

    • “Perfect” match (table for layout → discard)

    • “Reasonable” match (sibling table)

  • Determine & use table-structure pattern

    • Discover pattern

    • Pattern usage

    • Dynamic pattern adjustment
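
The matching idea can be illustrated with two already-unnested tables from sibling pages: cells that stay the same are likely labels, cells that vary are likely values, and a table that matches perfectly carries no data. This is a simplified sketch; the cell-by-cell comparison, the 0.3 threshold, and the sample cells other than WP:CE28918 are assumptions, not the TISP algorithm itself.

```python
def match_ratio(table_a, table_b):
    """Fraction of aligned cells that are identical in both tables."""
    pairs = [(a, b) for row_a, row_b in zip(table_a, table_b)
             for a, b in zip(row_a, row_b)]
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 0.0

def classify_match(table_a, table_b, reasonable=0.3):
    """'layout' for a perfect match (discard), 'sibling' for a reasonable one."""
    ratio = match_ratio(table_a, table_b)
    if ratio == 1.0:
        return "layout"
    return "sibling" if ratio >= reasonable else "unrelated"

def split_labels_and_values(table_a, table_b):
    """Cells equal across siblings are labels; differing cells are values."""
    labels, values = [], []
    for r, (row_a, row_b) in enumerate(zip(table_a, table_b)):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            (labels if a == b else values).append((r, c))
    return labels, values

PAGE_A = [["Gene model", "Protein"], ["model-a", "WP:CE28918"]]
PAGE_B = [["Gene model", "Protein"], ["model-b", "protein-b"]]

kind = classify_match(PAGE_A, PAGE_B)
labels, values = split_labels_and_values(PAGE_A, PAGE_B)
```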




Semi-Automatic Annotation with TANGO (Table Analysis for Generating Ontologies)

  • Recognize and normalize table information

  • Construct mini-ontologies from tables

  • Discover inter-ontology mappings

  • Merge mini-ontologies into a growing ontology


Recognize Table Information

Columns: Country; Population (July 2001 est.); Religion (Albanian Orthodox, Muslim, Roman Catholic, Shi’a Muslim, Sunni Muslim, other)

Afghanistan: 26,813,057; Shi’a Muslim 15%, Sunni Muslim 84%, other 1%

Albania: 3,510,484; Albanian Orthodox 20%, Muslim 70%, Roman Catholic 10%


Construct Mini-Ontology

Columns: Country; Population (July 2001 est.); Religion (Albanian Orthodox, Muslim, Roman Catholic, Shi’a Muslim, Sunni Muslim, other)

Afghanistan: 26,813,057; Shi’a Muslim 15%, Sunni Muslim 84%, other 1%

Albania: 3,510,484; Albanian Orthodox 20%, Muslim 70%, Roman Catholic 10%



Merge
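
To make the construct-and-merge steps concrete, the sketch below derives object sets and relationship sets from a normalized table and merges two mini-ontologies by name. It is a deliberately naive stand-in: TANGO discovers inter-ontology mappings, whereas this example assumes identically named object sets are the same, and the `Capital` mini-ontology is invented.

```python
# Normalized form of the country table (percentages as integers).
TABLE = {
    "Afghanistan": {"Population": 26813057,
                    "Religion": {"Shi'a Muslim": 15, "Sunni Muslim": 84, "other": 1}},
    "Albania": {"Population": 3510484,
                "Religion": {"Albanian Orthodox": 20, "Muslim": 70, "Roman Catholic": 10}},
}

def mini_ontology(table, key="Country"):
    """Derive (object sets, relationship sets) from a normalized table."""
    object_sets, relationships = {key}, set()
    for attributes in table.values():
        for name, value in attributes.items():
            object_sets.add(name)
            if isinstance(value, dict):
                # Nested column group: a ternary relationship with a percentage.
                object_sets.add("Percent")
                relationships.add((key, name, "Percent"))
            else:
                relationships.add((key, name))
    return object_sets, relationships

def merge(ontology_a, ontology_b):
    """Naive merge: union the sets, treating equal names as the same concept."""
    return ontology_a[0] | ontology_b[0], ontology_a[1] | ontology_b[1]

religion = mini_ontology(TABLE)
capitals = ({"Country", "Capital"}, {("Country", "Capital")})  # invented example
growing = merge(religion, capitals)
```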


Semi-Automatic Annotation via Synergistic Bootstrapping (Based on Nested Schemas with Regular Expressions)

  • Build a page-layout, pattern-based annotator

  • Automate layout recognition based on examples

  • Auto-generate examples with extraction ontologies

  • Synergistically run pattern-based annotator & extraction-ontology annotator


Synergistic Execution

Extraction Ontology

Partially Annotated Document

Conceptual Annotator (ontology-based annotation)

Pattern Generation

Document Layout Patterns

Annotated Document

Structural Annotator (layout-driven annotation)
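
The interplay of the two annotators can be sketched on a toy document represented as (layout path, text) pairs. Everything here is an assumption made for illustration: the path notation, the single price recognizer, and the rule that a layout path inherits the label of any value recognized at it.

```python
import re

PRICE = re.compile(r"^\$\s*\d[\d,]*$")  # hypothetical ontology recognizer

def conceptual_annotate(nodes):
    """Ontology-based annotation: label only what the recognizer matches."""
    return {i: "Price" for i, (_, text) in enumerate(nodes) if PRICE.match(text)}

def generate_patterns(nodes, annotations):
    """Learn which layout paths carry which labels from the partial annotation."""
    return {nodes[i][0]: label for i, label in annotations.items()}

def structural_annotate(nodes, patterns):
    """Layout-driven annotation: label every node at a learned path."""
    return {i: patterns[path] for i, (path, _) in enumerate(nodes)
            if path in patterns}

NODES = [
    ("tr/td[1]", "Nissan"),
    ("tr/td[2]", "$11,500"),
    ("tr/td[1]", "Toyota"),
    ("tr/td[2]", "Call for price"),   # missed by the recognizer
]

partial = conceptual_annotate(NODES)
patterns = generate_patterns(NODES, partial)
full = structural_annotate(NODES, patterns)
```

The layout pattern picks up the value the recognizer missed, which is the sense in which the two annotators help each other.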


Form-Based Information Harvesting

  • Forms

    • General familiarity

    • Reasonable conceptual framework

    • Appropriate correspondence

      • Transformable to ontological descriptions

      • Capable of accepting source data

  • Instance recognizers

    • Some pre-existing instance recognizers

    • Lexicons

  • Automated extraction ontology creation?


Form Creation

  • Basic form-construction facilities:

    • single-entry field

    • multiple-entry field

    • nested form








Almost Ready to Harvest

  • Need reading path: DOM-tree structure

  • Need to resolve mapping problems

    • Split/Merge

    • Union/Selection


Name

Voltage-dependent anion-selective channel protein 3

VDAC-3

hVDAC3

Outer mitochondrial membrane protein porin 3





Name

T-complex protein 1 subunit theta

TCP-1-theta

CCT-theta

Renal carcinoma antigen NY-REN-15




Can Now Harvest


Name

14-3-3 protein epsilon

Mitochondrial import stimulation factor L subunit

Protein kinase C inhibitor protein-1

KCIP-1

14-3-3E


Name

Voltage-dependent anion-selective channel protein 3

VDAC-3

hVDAC3

Outer mitochondrial membrane protein porin 3


Name

Tryptophanyl-tRNA synthetase, mitochondrial precursor

EC 6.1.1.2

Tryptophan--tRNA ligase

TrpRS

(Mt)TrpRS



Harvesting Populates Ontology

Also helps adjust ontology constraints


Can Harvest from Additional Sites

Name

T-complex protein 1 subunit theta

TCP-1-theta

CCT-theta

Renal carcinoma antigen NY-REN-15


Automating Extraction Ontology Creation

Lexicons

Name

14-3-3 protein epsilon

Mitochondrial import stimulation factor L subunit

Protein kinase C inhibitor protein-1

KCIP-1

14-3-3E

T-complex protein 1 subunit theta

TCP-1-theta

CCT-theta

Renal carcinoma antigen NY-REN-15

Tryptophanyl-tRNA synthetase, mitochondrial precursor

EC 6.1.1.2

Tryptophan--tRNA ligase

TrpRS

(Mt)TrpRS


Automating Extraction Ontology Creation

Instance Recognizers

Number Patterns

Context Keywords and Phrases
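
Harvested names can seed a lexicon, which in turn can act as an instance recognizer. The sketch below is one simple way to do that, assuming harvested values arrive as lists of strings per page; the function names are hypothetical.

```python
import re

def build_lexicon(harvested):
    """Collect the distinct, non-empty names harvested from all pages."""
    return sorted({name.strip() for names in harvested
                   for name in names if name.strip()})

def lexicon_recognizer(lexicon):
    """Compile the lexicon into a regex, longest entries first so that
    a longer name wins over any shorter name it contains."""
    alternatives = sorted(lexicon, key=len, reverse=True)
    return re.compile("|".join(re.escape(name) for name in alternatives))

HARVESTED = [
    ["14-3-3 protein epsilon", "KCIP-1", "14-3-3E"],
    ["T-complex protein 1 subunit theta", "TCP-1-theta", "CCT-theta"],
    ["KCIP-1"],  # the same name harvested again from another page
]

lexicon = build_lexicon(HARVESTED)
recognizer = lexicon_recognizer(lexicon)
found = recognizer.findall("KCIP-1 binds TCP-1-theta")
```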



Automatic Semantic Annotation

Recognize and annotate with respect to an ontology


Ontology Transformations

Transformations to and from all


Practicalities: WoK Query Interfaces

(Future Work)

  • Advanced free-form queries with disjunction and negation

  • Form-based query language

  • Table-based query languages

  • Graphical query languages


Practicalities: Bootstrapping the WoK

(Future Work)

  • Won’t just happen without sufficient content

  • Niche applications

    • Historical Data (e.g. Genealogy)

    • Topical Blogs

  • Local WoKs

    • Intra-organizational effort

    • Individual interests


Practicalities: Scalability

(Future Work)

  • Potential rapid growth

    • Thousands of ontologies

    • Millions of simultaneous queries

    • Billions of annotated pages

    • Trillions of facts

  • Search-engine-like caching & query processing


Key to Success: Simplicity via Automation

  • Automatic (or near automatic) creation of extraction ontologies

  • Automatic (or near automatic) annotation of web pages

  • Simple but accurate query specification without specialized training

www.deg.byu.edu

