WoK: A Web of Knowledge. David W. Embley, Brigham Young University, Provo, Utah, USA. A Web of Pages → A Web of Facts. Example queries: birthdate of my great grandpa Orson; price and mileage of red Nissans, 1990 or newer; location and size of chromosome 17; US states with property crime rates above 1%.


WoK: A Web of Knowledge

David W. Embley

Brigham Young University

Provo, Utah, USA


A Web of Pages → A Web of Facts

  • Birthdate of my great grandpa Orson

  • Price and mileage of red Nissans, 1990 or newer

  • Location and size of chromosome 17

  • US states with property crime rates above 1%


Toward a Web of Knowledge

  • Fundamental questions

    • What is knowledge?

    • What are facts?

    • How does one know?

  • Philosophy

    • Ontology

    • Epistemology

    • Logic and reasoning


Ontology

  • Existence → asks “What exists?”

  • Concepts, relationships, and constraints with formal foundation


Epistemology

  • The nature of knowledge → asks: “What is knowledge?” and “How is knowledge acquired?”

  • Populated conceptual model


Logic and Reasoning

  • Principles of valid inference → asks: “What is known?” and “What can be inferred?”

  • For us, it answers: what can be inferred (in a formal sense) from conceptualized data.

Find price and mileage of red Nissans, 1990 or newer


Making this Work → How?

  • Distill knowledge from the wealth of digital web data

  • Annotate web pages

  • Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge

Figure labels: Annotation, Fact


Turning Raw Symbols into Knowledge

  • Symbols: $ 11,500 117K Nissan CD AC

  • Data: price(11,500) mileage(117K) make(Nissan)

  • Conceptualized data:

    • Car(C123) has Price($11,500)

    • Car(C123) has Mileage(117,000)

    • Car(C123) has Make(Nissan)

    • Car(C123) has Feature(AC)

  • Knowledge

    • “Correct” facts

    • Provenance
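The progression from symbols to conceptualized data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expressions, the `RECOGNIZERS` table, and the `conceptualize` function are hypothetical names invented here, not the recognizers the slides rely on.

```python
import re

# Hypothetical recognizers for the car-ad example; the patterns are
# illustrative assumptions, not the rules of a real extraction ontology.
RECOGNIZERS = {
    "Price": re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*)"),
    "Mileage": re.compile(r"\b(\d+)K\b"),
    "Make": re.compile(r"\b(Nissan|Toyota|Honda)\b"),
    "Feature": re.compile(r"\b(AC|CD)\b"),
}


def conceptualize(symbols, car_id):
    """Turn a raw symbol string into (object, relationship, object) facts."""
    facts = []
    for object_set, pattern in RECOGNIZERS.items():
        for match in pattern.finditer(symbols):
            value = match.group(1)
            if object_set == "Price":
                value = int(value.replace(",", ""))  # "11,500" -> 11500
            elif object_set == "Mileage":
                value = int(value) * 1000            # "117K" -> 117000
            facts.append((f"Car({car_id})", "has", f"{object_set}({value})"))
    return facts


facts = conceptualize("$ 11,500 117K Nissan CD AC", "C123")
```

Each resulting tuple mirrors one "Car(C123) has ..." line above. Turning such facts into knowledge would still require checking correctness and recording provenance, which this sketch leaves out.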


Actualization (with Extraction Ontologies)

Find me the price and mileage of all red Nissans – I want a 1990 or newer.


Data Extraction Demo


Semantic Annotation Demo


Free-Form Query Demo


Explanation: How it Works

  • Extraction Ontologies

  • Semantic Annotation

  • Free-Form Query Interpretation


Extraction Ontologies

Object sets

Relationship sets

Participation constraints

Lexical

Non-lexical

Primary object set

Aggregation

Generalization/Specialization


Extraction Ontologies

Data Frame:

  • Internal Representation: float

  • Values

    • External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?

    • Left Context: $

  • Key Word Phrase

    • Key Words: ([Pp]rice)|([Cc]ost)| …

  • Operators

    • Operator: >

    • Key Words: (more\s*than)|(more\s*costly)|…
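A data frame of this kind can be approximated with a small Python class. The sketch below is an assumption-laden illustration: `DataFrameSpec`, its methods, and the `PRICE` instance are hypothetical names, and the value pattern is a rewritten variant of the slide's expression (it also accepts thousands separators), not the original recognizer.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DataFrameSpec:
    """Simplified stand-in for a data frame attached to one object set."""
    name: str
    value_pattern: re.Pattern      # external representation of values
    keyword_pattern: re.Pattern    # key word phrases that signal the object set
    operators: dict = field(default_factory=dict)  # operator symbol -> keyword pattern

    def find_values(self, text):
        """Return recognized values converted to the internal representation (float)."""
        return [float(m.group(1).replace(",", ""))
                for m in self.value_pattern.finditer(text)]

    def is_mentioned(self, text):
        """True if a key word phrase for this object set occurs in the text."""
        return self.keyword_pattern.search(text) is not None

    def find_operator(self, text):
        """Return the first operator whose keywords occur in the text, else None."""
        for symbol, pattern in self.operators.items():
            if pattern.search(text):
                return symbol
        return None


# Hypothetical Price data frame; "$" serves as the left context of each value.
PRICE = DataFrameSpec(
    name="Price",
    value_pattern=re.compile(r"\$\s*((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?)"),
    keyword_pattern=re.compile(r"[Pp]rice|[Cc]ost"),
    operators={">": re.compile(r"more\s*than|more\s*costly")},
)
```

For example, `PRICE.find_values("Asking $11,500.00")` yields a float value, and `PRICE.find_operator("cost more than $5,000")` yields the `>` operator. Because the frame is declarative, changing what counts as a price means editing patterns rather than code.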


Generality & Resiliency of Extraction Ontologies

  • Generality: assumptions about web pages

    • Data rich

    • Narrow domain

    • Document types

      • Single-record documents (hard, but doable)

      • Multiple-record documents (harder)

      • Records with scattered components (even harder)

  • Resiliency: declarative

    • Still works when web pages change

    • Works for new, unseen pages in the same domain

    • Scalable, but takes work to declare the extraction ontology


Semantic Annotation


Free-Form Query Interpretation

  • Parse Free-Form Query (with respect to data extraction ontology)

  • Select Ontology

  • Formulate Query Expression

  • Run Query Over Semantically Annotated Data
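The first step, parsing the free-form query with respect to an extraction ontology, might look roughly like the following. All recognizer tables and the `parse_query` function are hypothetical; a real extraction ontology would supply these patterns through its data frames.

```python
import re

# Illustrative recognizers (assumptions made for this sketch only).
VALUE_RECOGNIZERS = {
    "Color": re.compile(r"\b(red|blue|green|black|white)\b", re.I),
    "Make": re.compile(r"\b(Nissan|Toyota|Honda)s?\b", re.I),
    "Year": re.compile(r"\b(19\d{2}|20\d{2})\b"),
}
KEYWORD_RECOGNIZERS = {
    "Price": re.compile(r"\bprice\b", re.I),
    "Mileage": re.compile(r"\bmileage\b", re.I),
}
OPERATOR_RECOGNIZERS = {
    ">=": re.compile(r"\bor\s+newer\b", re.I),
}


def parse_query(query):
    """Mark which object sets, values, and operators a free-form query mentions."""
    mentioned = [name for name, pattern in KEYWORD_RECOGNIZERS.items()
                 if pattern.search(query)]
    values = {}
    for name, pattern in VALUE_RECOGNIZERS.items():
        match = pattern.search(query)
        if match:
            values[name] = match.group(1)
    operators = [symbol for symbol, pattern in OPERATOR_RECOGNIZERS.items()
                 if pattern.search(query)]
    return {"mentioned": mentioned, "values": values, "operators": operators}


parsed = parse_query(
    "Find me the price and mileage of all red Nissans – I want a 1990 or newer."
)
```

The ontology whose recognizers match the most of the query would be a natural candidate for the "Select Ontology" step, although the selection criterion is not spelled out on this slide.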


Parse Free-Form Query

“Find me the price and mileage of all red Nissans – I want a 1996 or newer”

Recognized in the query:

  • price

  • mileage

  • red

  • Nissan

  • 1996

  • or newer (>= operator)


Select Ontology

“Find me the price and mileage of all red Nissans – I want a 1996 or newer”


Formulate Query Expression

  • Conjunctive queries and aggregate queries

  • Projection on mentioned object sets

  • Selection via values and operator keywords

    • Color = “red”

    • Make = “Nissan”

    • Year >= 1996 (“or newer” selects the >= operator)


Formulate Query Expression

Expression clauses: For, Let, Where, Return


Run Query Over Semantically Annotated Data
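A conjunctive query with projection and selection, such as the one formulated for the red Nissans, can be run over annotated records as in this minimal sketch. The `run_query` function, the `OPS` table, and the sample `cars` records are hypothetical; only the three selection conditions come from the slides.

```python
import operator

# Comparison operators a formulated query may use.
OPS = {
    "=": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}


def run_query(records, projection, selections):
    """Keep records that satisfy every selection, then project the requested object sets."""
    results = []
    for record in records:
        if all(attr in record and OPS[op](record[attr], value)
               for attr, op, value in selections):
            results.append({attr: record.get(attr) for attr in projection})
    return results


# Hypothetical annotated data, invented for illustration.
cars = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 11500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "blue", "Year": 2001, "Price": 9000, "Mileage": 80000},
    {"Make": "Nissan", "Color": "red", "Year": 1993, "Price": 3000, "Mileage": 150000},
]

selections = [("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)]
answer = run_query(cars, ["Price", "Mileage"], selections)
```

Only the first sample record satisfies all three conditions, so the answer contains its price and mileage alone.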


Great! But Problems Still Need Resolution

  • How do we create extraction ontologies?

    • Manual creation requires several dozen person-hours

    • Semi-automatic creation

      • TISP (Table Interpretation by Sibling Pages)

      • TANGO (Table ANalysis for Generating Ontologies)

      • Nested Schemas with Regular Expressions

      • Synergistic Bootstrapping

      • Form-based Information Harvesting

  • How do we scale up?

    • Practicalities of technology transfer and usage

    • Millions of queries over zillions of facts for thousands of ontologies


Manual Creation


  • Library of instance recognizers

  • Library of lexicons


Automatic Annotation with TISP (Table Interpretation by Sibling Pages)

  • Recognize tables (discard non-tables)

  • Locate table labels

  • Locate table values

  • Find label/value associations


Recognize Tables

Layout Tables (discard)

Data Table

Nested Data Tables


Locate Table Labels

Examples:

Identification.Gene model(s).Protein

Identification.Gene model(s).2


Locate Table Labels

Examples:

Identification.Gene model(s).Gene Model

Identification.Gene model(s).2


Locate Table Values

Value


Find Label/Value Associations

Example:

(Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918



Interpretation Technique: Sibling Page Comparison

Figure labels: Same, Almost Same, Different


Technique Details

  • Unnest tables

  • Match tables in sibling pages

    • “Perfect” match (table for layout → discard)

    • “Reasonable” match (sibling table)

  • Determine & use table-structure pattern

    • Discover pattern

    • Pattern usage

    • Dynamic pattern adjustment
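The core idea of sibling page comparison can be sketched as follows: cells that are identical across sibling pages are likely labels, cells that vary are likely values, and a table that matches perfectly is likely layout. This is a deliberately simplified illustration; the cell-by-cell comparison, the `threshold` parameter, and all function names are assumptions, and the actual technique may compare table structures quite differently.

```python
def flatten(table):
    """Flatten a table given as a list of rows into a flat list of cells."""
    return [cell for row in table for cell in row]


def match_kind(table_a, table_b, threshold=0.3):
    """Classify how well two tables from sibling pages match.

    The threshold is a hypothetical tuning parameter for this sketch.
    """
    cells_a, cells_b = flatten(table_a), flatten(table_b)
    if not cells_a or len(cells_a) != len(cells_b):
        return "different"
    same = sum(a == b for a, b in zip(cells_a, cells_b))
    ratio = same / len(cells_a)
    if ratio == 1.0:
        return "perfect"      # identical everywhere: layout table, discard
    if ratio >= threshold:
        return "reasonable"   # shared labels, varying values: sibling table
    return "different"


def split_labels_and_values(table_a, table_b):
    """Cells equal in both tables are treated as labels; the rest as values of table_a."""
    labels, values = [], []
    for row_a, row_b in zip(table_a, table_b):
        for cell_a, cell_b in zip(row_a, row_b):
            (labels if cell_a == cell_b else values).append(cell_a)
    return labels, values


# Hypothetical sibling tables with placeholder values.
page_a = [["Gene Model", "Protein"], ["gene-a", "protein-a"]]
page_b = [["Gene Model", "Protein"], ["gene-b", "protein-b"]]
layout = [["Home", "Search"]]
```

Here `match_kind(page_a, page_b)` reports a reasonable match, so the shared header cells become labels and the varying cells become values. Unnesting and dynamic pattern adjustment are outside the scope of this sketch.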


Generated RDF


WoK Demo (via TISP)


Semi-Automatic Annotation with TANGO (Table Analysis for Generating Ontologies)

  • Recognize and normalize table information

  • Construct mini-ontologies from tables

  • Discover inter-ontology mappings

  • Merge mini-ontologies into a growing ontology


Recognize Table Information

Country      Population         Religion
             (July 2001 est.)   Albanian   Muslim   Roman      Shi’a    Sunni    other
                                Orthodox            Catholic   Muslim   Muslim
Afghanistan  26,813,057                                        15%      84%      1%
Albania      3,510,484          20%        70%      10%
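Normalizing the table above amounts to flattening its two-level header and skipping empty cells. The sketch below assumes the column alignment read off the slide (Afghanistan's percentages under Shi’a Muslim, Sunni Muslim, and other; Albania's under Albanian Orthodox, Muslim, and Roman Catholic). The `normalize` function and the fact layout are hypothetical, not TANGO's actual output format.

```python
# Religion columns in left-to-right order, as read from the table header.
RELIGIONS = [
    "Albanian Orthodox", "Muslim", "Roman Catholic",
    "Shi'a Muslim", "Sunni Muslim", "other",
]

# One entry per table row: country, population, one percentage per column.
# None marks an empty cell. The cell placement is an assumption of this sketch.
ROWS = [
    ("Afghanistan", 26813057, [None, None, None, 15, 84, 1]),
    ("Albania", 3510484, [20, 70, 10, None, None, None]),
]


def normalize(rows, religions):
    """Flatten the table into (country, attribute, value) facts, skipping empty cells."""
    facts = []
    for country, population, percents in rows:
        facts.append((country, "Population", population))
        for religion, percent in zip(religions, percents):
            if percent is not None:
                facts.append((country, religion, percent))
    return facts


facts = normalize(ROWS, RELIGIONS)
```

A flat list of facts like this is one plausible starting point for building a mini-ontology with Country, Population, and Religion object sets.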


Construct Mini-Ontology


Discover Mappings


Merge


Semi-Automatic Annotation via Synergistic Bootstrapping (Based on Nested Schemas with Regular Expressions)

  • Build a page-layout, pattern-based annotator

  • Automate layout recognition based on examples

  • Auto-generate examples with extraction ontologies

  • Synergistically run pattern-based annotator & extraction-ontology annotator


Synergistic Execution

Extraction Ontology

Partially Annotated Document

Conceptual Annotator (ontology-based annotation)

Pattern Generation

Document Layout Patterns

Annotated Document

Structural Annotator (layout-driven annotation)


Form-Based Information Harvesting

  • Forms

    • General familiarity

    • Reasonable conceptual framework

    • Appropriate correspondence

      • Transformable to ontological descriptions

      • Capable of accepting source data

  • Instance recognizers

    • Some pre-existing instance recognizers

    • Lexicons

  • Automated extraction ontology creation?


Form Creation

  • Basic form-construction facilities:

    • single-entry field

    • multiple-entry field

    • nested form


Created Sample Form


Generated Ontology View


Source-to-Form Mapping



Almost Ready to Harvest

  • Need reading path: DOM-tree structure

  • Need to resolve mapping problems

    • Split/Merge

    • Union/Selection


Almost Ready to Harvest …

Name

Voltage-dependent anion-selective channel protein 3

VDAC-3

hVDAC3

Outer mitochondrial membrane protein porin 3


Almost Ready to Harvest …

Name

T-complex protein 1 subunit theta

TCP-1-theta

CCT-theta

Renal carcinoma antigen NY-REN-15


Can Now Harvest

Name

14-3-3 protein epsilon

Mitochondrial import stimulation factor L subunit

Protein kinase C inhibitor protein-1

KCIP-1

14-3-3E


Can Now Harvest

Name

Voltage-dependent anion-selective channel protein 3

VDAC-3

hVDAC3

Outer mitochondrial membrane protein porin 3


Can Now Harvest

Name

Tryptophanyl-tRNA synthetase, mitochondrial precursor

EC 6.1.1.2

Tryptophan—tRNA ligase

TrpRS

(Mt)TrpRS


Harvesting Populates Ontology

Also helps adjust ontology constraints


Can Harvest from Additional Sites

Name

T-complex protein 1 subunit theta

TCP-1-theta

CCT-theta

Renal carcinoma antigen NY-REN-15


Automating Extraction Ontology Creation

Lexicons

Name

14-3-3 protein epsilon

Mitochondrial import stimulation factor L subunit

Protein kinase C inhibitor protein-1

KCIP-1

14-3-3E

T-complex protein 1 subunit theta

TCP-1-theta

CCT-theta

Renal carcinoma antigen NY-REN-15

Tryptophanyl-tRNA synthetase, mitochondrial precursor

EC 6.1.1.2

Tryptophan—tRNA ligase

TrpRS

(Mt)TrpRS


Automating Extraction Ontology Creation

Instance Recognizers

Number Patterns

Context Keywords and Phrases


Automatic Source-to-Form Mapping


Automatic Semantic Annotation

Recognize and annotate with respect to an ontology


Ontology Transformations

Transformations to and from all


Practicalities: WoK Query Interfaces

(Future Work)

  • Advanced free-form queries with disjunction and negation

  • Form-based query language

  • Table-based query languages

  • Graphical query languages


Practicalities: Bootstrapping the WoK

(Future Work)

  • Won’t just happen without sufficient content

  • Niche applications

    • Historical Data (e.g. Genealogy)

    • Topical Blogs

  • Local WoKs

    • Intra-organizational effort

    • Individual interests


Practicalities: Scalability

(Future Work)

  • Potential rapid growth

    • Thousands of ontologies

    • Millions of simultaneous queries

    • Billions of annotated pages

    • Trillions of facts

  • Search-engine-like caching & query processing


Key to Success: Simplicity via Automation

  • Automatic (or near automatic) creation of extraction ontologies

  • Automatic (or near automatic) annotation of web pages

  • Simple but accurate query specification without specialized training

www.deg.byu.edu

