
Future Database Needs

SC 32 Study Period

February 5, 2007

JTC1 SC32N1633

Bruce Bargmeyer,

Lawrence Berkeley National Laboratory

University of California

Tel: +1 510-495-2905

bebargmeyer@lbl.gov

Topics
  • Study period purpose
  • New challenges
  • A brief tutorial on Semantics and semantic computing
  • Where XMDR fits
    • Semantic computing technologies
    • Traditional Data Administration
  • Some limitations of current relational technologies
  • Some input from other sources
Future Database Needs: Study Period
  • A one-year study period to identify and understand case studies related to this area.
  • Bring together a small group of experts in a meeting on “Case Studies on new Database Standards Requirements”.  
  • The workshop would provide input to existing SC32 projects and may provide background material for new proposals for upgrades or for new work within SC32 in time for 2007 SC32 Plenary

--Document 32N1451


The Internet Revolution

A world wide web of diverse content:

The information glut is nothing new. The access to it is astonishing.

Challenge: Find and process non-explicit data

For example…

Patient data on drugs contains brand names (e.g. Tylenol, Anacin-3, Datril,…);

However, want to study patients taking analgesic agents

[Figure: concept hierarchy with nodes Analgesic Agent, Non-Narcotic Analgesic, Analgesic and Antipyretic, Nonsteroidal Antiinflammatory Drug, Acetaminophen, Datril, Tylenol, Anacin-3]

Challenge: Specify and compute across Relations, e.g., within a food web in an Arctic ecosystem

An organism is connected to another organism for which it is a source

of food energy and material by an arrow representing the direction of

biomass transfer.

Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)

Challenge: Combine Data, Metadata & Concept Systems

Inference Search Query:

“find water bodies downstream from Fletcher Creek where chemical contamination was over 10 micrograms per liter between December 2001 and March 2003”

[Figure: three panels labeled Concept system, Data, and Metadata. The concept system panel shows Contamination with Biological, Radioactive, and Chemical beneath it, and mercury, lead, and cadmium beneath Chemical.]

Challenge: Use data from systems that record the same facts with different terms

[Figure: several kinds of registries, each holding "Common Content": Dublin Core Registries, Software Component Registries, Database Catalogs, ISO 11179 Registries, UDDI Registries, OASIS/ebXML Registries, CASE Tool Repositories, and Ontological Registries. Terms shown for the same content include Table Column, Data Element, Business Specification, Country Identifier, XML Tag, Attribute, Business Object, Coverage, and Term Hierarchy.]

Same Fact, Different Terms

Data Element Concept
  Attributes: Name: Country Identifiers; Context; Definition; Unique ID: 5769; Conceptual Domain; Maintenance Org.; Steward; Classification; Registration Authority; Others
  Values: Algeria, Belgium, China, Denmark, Egypt, France, . . ., Zimbabwe

Data Elements
  Attributes: Name; Context; Definition; Unique ID: 4572; Value Domain; Maintenance Org.; Steward; Classification; Registration Authority; Others

  ISO 3166       ISO 3166      ISO 3166       ISO 3166       ISO 3166
  English Name   French Name   2-Alpha Code   3-Alpha Code   3-Numeric Code
  Algeria        L'Algérie     DZ             DZA            012
  Belgium        Belgique      BE             BEL            056
  China          Chine         CN             CHN            156
  Denmark        Danemark      DK             DNK            208
  Egypt          Egypte        EG             EGY            818
  France         La France     FR             FRA            250
  . . .          . . .         . . .          . . .          . . .
  Zimbabwe       Zimbabwe      ZW             ZWE            716

Challenge: Gain Common Understanding of meaning between Data Creators and Data Users

[Figure: Data Creation by several organizations (EEA, USGS, DoD, EPA, Others . . .), each holding text and data organized under topic lists such as environ, agriculture, climate, human health, industry, tourism, soil, water, air, or the same topics in Spanish (ambiente, agricultura, tiempo, ...), with coded numeric values. Information systems and Users draw on this content. Center label: A common interpretation of what the data represents.]

Challenge: Drawing Together Dispersed Data

[Figure: Data Creation by several organizations (EEA, USGS, DoD, EPA, Others . . .), each holding text and data organized under topic lists such as environ, agriculture, climate, human health, industry, tourism, soil, water, air, or the same topics in Spanish (ambiente, agricultura, tiempo, ...), with coded numeric values. Information systems and Users draw on this content. Center label: A common interpretation of what the data represents.]

Semantic Computing
  • We are laying the foundation to make a quantum leap toward a substantially new way of computing: Semantic Computing
  • How can we make use of semantic computing?
  • What do organizations need to do to prepare for and stimulate semantic computing?
Coming: A Semantic Revolution

  • Searching and ranking
  • Pattern analysis
  • Knowledge discovery
  • Question answering
  • Reasoning
  • Semi-automated decision making
The Nub of It
  • Processing that takes “meaning” into account
  • Processing based on the relations between things, not just computing about the things themselves.
  • Computing that takes people out of the processing, reducing the human toil
    • Data access, extraction, mapping, translation, formatting, validation, inferencing, …
  • Delivering higher-level results that are more helpful for the user’s thought and action
Semantics Challenges
  • Managing, harmonizing, and vetting semantics is essential to enable enterprise semantic computing
  • Managing, harmonizing and vetting semantics is important for traditional data management.
    • In the past we just covered the basics
  • Enabling “community intelligence” through efforts similar to Wikipedia, Wiktionary, Flickr
A Brief Tutorial on Semantics
  • What is meaning?
  • What are concepts?
  • What are relations?
  • What are concept systems?
  • What is “reasoning”?
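Meaning: The Semiotic Triangle

[Figure: triangle with corners Thought or Reference (Concept), Symbol (“Rose”, “ClipArt”), and Referent. The Symbol symbolises the Thought or Reference, the Thought or Reference refers to the Referent, and the Symbol stands for the Referent.]

C. K. Ogden and I. A. Richards. The Meaning of Meaning.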

Semiotic Triangle: Concepts, Definitions and Signs

[Figure: the semiotic triangle with corners CONCEPT, Sign (“Rose”, “ClipArt”), and Referent, and edges labeled Symbolizes, Refers To, and Stands For. A Definition is attached to the triangle.]

Definitions in the EPA Environmental Data Registry

Mailing Address:
http://www.epa/gov/edr/sw/AdministeredItem#MailingAddress
The exact address where a mail piece is intended to be delivered, including urban-style address, rural route, and PO Box

State USPS Code:
http://www.epa/gov/edr/sw/AdministeredItem#StateUSPSCode
The U.S. Postal Service (USPS) abbreviation that represents a state or state equivalent for the U.S. or Canada

Mailing Address State Name:
http://www.epa/gov/edr/sw/AdministeredItem#StateName
The name of the state where mail is delivered

Computable Meaning

[Figure: the semiotic triangle (CONCEPT, “Rose” / “ClipArt”, Referent; edges Symbolizes, Refers To, Stands For), with the concept linked to other concepts by rdfs:subClassOf, owl:equivalentClass, and owl:disjointWith.]

If “rose” is owl:disjointWith “daffodil”, then a computer can determine that an assertion is invalid, if it states that a rose is also a daffodil (e.g., in a knowledgebase).
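
The disjointness check described above can be sketched in a few lines of Python. This is a minimal illustration, not an OWL reasoner: the `DISJOINT` set stands in for owl:disjointWith statements, and the individuals `flower1` and `flower2`, as well as the helper name `find_violations`, are hypothetical.

```python
# Pairs of classes declared disjoint (a stand-in for owl:disjointWith statements).
DISJOINT = {frozenset({"Rose", "Daffodil"})}


def find_violations(type_assertions, disjoint_pairs):
    """Return (individual, class_a, class_b) for each individual typed with two disjoint classes."""
    violations = []
    for individual, classes in type_assertions.items():
        ordered = sorted(classes)
        for i, class_a in enumerate(ordered):
            for class_b in ordered[i + 1:]:
                if frozenset({class_a, class_b}) in disjoint_pairs:
                    violations.append((individual, class_a, class_b))
    return violations


# Hypothetical knowledge base contents.
assertions = {
    "flower1": {"Rose"},
    "flower2": {"Rose", "Daffodil"},  # invalid: asserted to be both
}
violations = find_violations(assertions, DISJOINT)
print(violations)
```

A real knowledge base would also have to consider subclasses of the disjoint classes; the sketch checks only directly asserted types.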

What are Relations?

[Figure: the concept WaterBody with the instances Merced River, Fletcher Creek, and Merced Lake, each connected to WaterBody by an isA relation.]

Concepts and relations can be represented as nodes and edges in formal graph structures, e.g., “is-a” hierarchies.

Concept Systems have Nodes and may have Relations

Nodes represent concepts

Lines (arcs) represent relations

[Figure: example graph with nodes labeled A, 1, 2, a, b, c, d]

Concept systems are concepts and the relations between them.

Concept systems can be represented & queried as graphs
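
As a rough sketch of what "represented and queried as graphs" can mean in practice, the following Python stores a concept system as labeled edges and answers two simple queries. The class name `ConceptSystem` and its methods are assumptions made for illustration; the water-body facts are taken from the "What are Relations?" slide.

```python
from collections import defaultdict


class ConceptSystem:
    """Concepts are nodes; each relation is a labeled, directed edge."""

    def __init__(self):
        self.edges = defaultdict(set)  # (source, relation) -> set of targets

    def add(self, source, relation, target):
        self.edges[(source, relation)].add(target)

    def targets(self, source, relation):
        """Nodes reached from `source` by one `relation` edge."""
        return set(self.edges.get((source, relation), set()))

    def sources(self, relation, target):
        """Nodes that have a `relation` edge pointing at `target`."""
        return {s for (s, r), ts in self.edges.items() if r == relation and target in ts}


cs = ConceptSystem()
cs.add("Merced River", "isA", "WaterBody")
cs.add("Fletcher Creek", "isA", "WaterBody")
cs.add("Merced Lake", "isA", "WaterBody")

water_bodies = cs.sources("isA", "WaterBody")
print(sorted(water_bodies))
```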

A More Complex Concept Graph

[Figure: concept lattice of inland water features. Attribute nodes: Linear, Non-linear, Large, Large linear, Small linear, Small non-linear, Deep, Shallow, Natural, Artificial, Flowing, Stagnant. Feature nodes: River, Stream, Canal, Reservoir, Lake, Marsh, Pond.]

Concept lattice of inland water features

From Supervaluation Semantics for an Inland Water Feature Ontology, Paulo Santos and Brandon Bennett http://ijcai.org/papers/1187.pdf#search=%22terminology%20water%20ontology%22

Types of Concept System Graph Structures

  • Directed Acyclic Graph
  • Tree
  • Bipartite Graph
  • Partial Order Graph
  • Partial Order Tree
  • Clique
  • Powerset of 3 element set
  • Ordered Tree
  • Compound Graph
  • Faceted Classification

Graph Taxonomy

[Figure: taxonomy of graph types with nodes Graph, Directed Graph, Undirected Graph, Directed Acyclic Graph, Clique, Bipartite Graph, Partial Order Graph, Faceted Classification, Lattice, Partial Order Tree, Tree, Ordered Tree]

Note: not all bipartite graphs are undirected.

What Kind of Relations are There? Lots!

Relationship class: A particular type of connection existing between people related to or having dealings with each other.

  • acquaintanceOf - A person having more than slight or superficial knowledge of this person but short of friendship.
  • ambivalentOf - A person towards whom this person has mixed feelings or emotions.
  • ancestorOf - A person who is a descendant of this person.
  • antagonistOf - A person who opposes and contends against this person.
  • apprenticeTo - A person to whom this person serves as a trusted counselor or teacher.
  • childOf - A person who was given birth to or nurtured and raised by this person.
  • closeFriendOf - A person who shares a close mutual friendship with this person.
  • collaboratesWith - A person who works towards a common goal with this person.
Example of relations in a food web in an Arctic ecosystem

An organism is connected to another organism for which it is a source

of food energy and material by an arrow representing the direction of

biomass transfer.

Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)

Ontologies are a type of Concept System
  • Ontology: explicit formal specifications of the terms in the domain and relations among them (Gruber 1993)
  • An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of basic concepts in the domain and relations among them.
  • Why would someone want to develop an ontology? Some of the reasons are:
    • To share common understanding of the structure of information among people or software agents
    • To enable reuse of domain knowledge
    • To make domain assumptions explicit
    • To separate domain knowledge from the operational knowledge
    • To analyze domain knowledge

http://www.ksl.stanford.edu/people/dlm/papers/ontology101/ontology101-noy-mcguinness.html

What is Reasoning? Inference

Disease
  Infectious Disease (is-a Disease)
    Polio (is-a Infectious Disease)
    Smallpox (is-a Infectious Disease)
  Chronic Disease (is-a Disease)
    Heart disease (is-a Chronic Disease)
    Diabetes (is-a Chronic Disease)

[Figure legend] Signifies inferred is-a relationship
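
The inferred is-a relationships can be derived by following the asserted is-a edges transitively. The sketch below uses the disease hierarchy from this slide; the function names `ancestors` and `inferred_is_a` are illustrative assumptions.

```python
# Asserted is-a edges from the slide: child -> list of direct parents.
IS_A = {
    "Infectious Disease": ["Disease"],
    "Chronic Disease": ["Disease"],
    "Polio": ["Infectious Disease"],
    "Smallpox": ["Infectious Disease"],
    "Heart disease": ["Chronic Disease"],
    "Diabetes": ["Chronic Disease"],
}


def ancestors(concept, is_a):
    """All concepts reachable by following is-a edges upward."""
    seen = set()
    stack = list(is_a.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def inferred_is_a(is_a):
    """Pairs (concept, ancestor) that hold by transitivity but were not asserted directly."""
    return {
        (concept, ancestor)
        for concept in is_a
        for ancestor in ancestors(concept, is_a)
        if ancestor not in is_a[concept]
    }


inferred = inferred_is_a(IS_A)
print(sorted(inferred))
```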

Reasoning: Taxonomies & partonomies can be used to support inference queries

California
  Alameda County (part-of California)
    Berkeley (part-of Alameda County)
    Oakland (part-of Alameda County)
  Santa Clara County (part-of California)
    San Jose (part-of Santa Clara County)
    Santa Clara (part-of Santa Clara County)

E.g., if a database contains information on events by city, we could query that database for events that happened in a particular county or state, even though the event data does not contain explicit state or county codes.
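
A minimal sketch of such an inference query follows. The part-of edges come from the slide; the event records and the function names `is_within` and `events_in` are hypothetical and exist only to show the roll-up.

```python
# part-of edges from the slide: place -> the region it is part of.
PART_OF = {
    "Berkeley": "Alameda County",
    "Oakland": "Alameda County",
    "San Jose": "Santa Clara County",
    "Santa Clara": "Santa Clara County",
    "Alameda County": "California",
    "Santa Clara County": "California",
}

# Hypothetical event data recorded by city only (no county or state codes).
EVENTS = [
    ("street fair", "Berkeley"),
    ("marathon", "Oakland"),
    ("tech expo", "San Jose"),
]


def is_within(place, region, part_of):
    """True if `place` equals `region` or is connected to it by a chain of part-of edges."""
    while place is not None:
        if place == region:
            return True
        place = part_of.get(place)
    return False


def events_in(region, events, part_of):
    """Names of events whose city lies inside `region`."""
    return [name for name, city in events if is_within(city, region, part_of)]


print(events_in("Alameda County", EVENTS, PART_OF))
print(events_in("California", EVENTS, PART_OF))
```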

Reasoning: Relationship metadata can be used to infer non-explicit data

  • For example…
  • (1) patient data on drugs currently being taken contains brand names (e.g. Tylenol, Anacin-3, Datril,…);
  • (2) a concept system connects different drug types and names with one another (via is-a, part-of, etc. relationships);
  • (3) so… patient data can be linked and searched by inferred terms like “acetaminophen” and “analgesic” as well as trade names explicitly stored as text strings in the database

[Figure: concept hierarchy with nodes Analgesic Agent, Non-Narcotic Analgesic, Analgesic and Antipyretic, Nonsteroidal Antiinflammatory Drug, Acetaminophen, Datril, Tylenol, Anacin-3]
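
Steps (1) to (3) amount to expanding a query term into all of its narrower terms before matching the stored strings. The sketch below shows this under stated assumptions: the arrangement of the is-a edges is an illustrative reading of the figure labels (not an extract from any thesaurus), and the patient records, `UnlistedDrug`, and the function names are hypothetical.

```python
# Assumed is-a edges (child -> parent), arranged from the figure labels for illustration.
IS_A = {
    "Tylenol": "Acetaminophen",
    "Datril": "Acetaminophen",
    "Anacin-3": "Acetaminophen",
    "Acetaminophen": "Analgesic and Antipyretic",
    "Analgesic and Antipyretic": "Non-Narcotic Analgesic",
    "Nonsteroidal Antiinflammatory Drug": "Non-Narcotic Analgesic",
    "Non-Narcotic Analgesic": "Analgesic Agent",
}

# Hypothetical patient data: the database stores only brand names as text.
RECORDS = [("p1", "Tylenol"), ("p2", "Datril"), ("p3", "UnlistedDrug"), ("p4", "Anacin-3")]


def descendants(concept, is_a):
    """All concepts below `concept` in the is-a hierarchy."""
    children = {}
    for child, parent in is_a.items():
        children.setdefault(parent, []).append(child)
    found, stack = set(), [concept]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def patients_taking(concept, records, is_a):
    """Patients whose stored drug name is the concept itself or any narrower term."""
    terms = descendants(concept, is_a) | {concept}
    return sorted(patient for patient, drug in records if drug in terms)


print(patients_taking("Analgesic Agent", RECORDS, IS_A))
```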

Reasoning: Least Common Ancestor Query

What is the least common ancestor concept in the NCI Thesaurus for Acetaminophen and Morphine Sulfate? (answer = Analgesic Agent)

[Figure: concept hierarchy with nodes Analgesic Agent, Opioid, Non-Narcotic Analgesic, Opiate, Morphine Sulfate, Codeine Phosphate, Nonsteroidal Antiinflammatory Drug, Analgesic and Antipyretic, Acetaminophen]
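
A least common ancestor query can be sketched as follows. The `PARENTS` edges are an assumed arrangement of the figure labels, not an extract from the NCI Thesaurus, and the function names are illustrative. Because a concept may have several parents, the function returns a set.

```python
# Assumed hierarchy (child -> set of direct parents), for illustration only.
PARENTS = {
    "Opioid": {"Analgesic Agent"},
    "Non-Narcotic Analgesic": {"Analgesic Agent"},
    "Opiate": {"Opioid"},
    "Morphine Sulfate": {"Opiate"},
    "Codeine Phosphate": {"Opiate"},
    "Analgesic and Antipyretic": {"Non-Narcotic Analgesic"},
    "Nonsteroidal Antiinflammatory Drug": {"Non-Narcotic Analgesic"},
    "Acetaminophen": {"Analgesic and Antipyretic"},
}


def ancestors_or_self(concept, parents):
    """The concept plus everything reachable by following parent edges upward."""
    seen, stack = {concept}, [concept]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def least_common_ancestors(a, b, parents):
    """Common ancestors of a and b that have no other common ancestor beneath them."""
    common = ancestors_or_self(a, parents) & ancestors_or_self(b, parents)
    return {
        c for c in common
        if not any(c != d and c in ancestors_or_self(d, parents) for d in common)
    }


print(least_common_ancestors("Acetaminophen", "Morphine Sulfate", PARENTS))
```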

Reasoning: Example “sibling” queries: concepts that share a common ancestor
  • Environmental:
    • "siblings" of Wetland (in NASA SWEET ontology)
  • Health
    • Siblings of ERK1 finds all 700+ other kinase enzymes
    • Siblings of Novastatin finds all other statins
  • 11179 Metadata
    • Sibling values in an enumerated value domain
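
A sibling query finds the concepts that share at least one direct parent with a given concept. The sketch below reuses the disease hierarchy from the inference slide rather than the ontologies named above, whose contents are not reproduced here; the function name `siblings` is an assumption.

```python
# Direct parents per concept (disease hierarchy from the inference slide).
PARENTS = {
    "Infectious Disease": {"Disease"},
    "Chronic Disease": {"Disease"},
    "Polio": {"Infectious Disease"},
    "Smallpox": {"Infectious Disease"},
    "Heart disease": {"Chronic Disease"},
    "Diabetes": {"Chronic Disease"},
}


def siblings(concept, parents):
    """Concepts other than `concept` that share at least one direct parent with it."""
    own_parents = parents.get(concept, set())
    return {
        other for other, other_parents in parents.items()
        if other != concept and other_parents & own_parents
    }


print(siblings("Polio", PARENTS))
```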
Reasoning: More complex “sibling” queries: concepts with multiple ancestors

  • Health
    • Find all the siblings of Breast Neoplasm
  • Environmental
    • Find all chemicals that are a carcinogen (cause cancer) and toxin (are poisonous) and teratogenic (cause birth defects)

[Figure: concept graph with parent concepts site neoplasms and breast disorders, and child concepts Breast neoplasm, Non-Neoplastic Breast Disorder, Eye neoplasm, and Respiratory System neoplasm]
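
The environmental query above asks for concepts that sit under several ancestors at once. One way to sketch it is to compute each concept's ancestor set and keep the concepts whose set contains every required class. The chemicals `chem-A` to `chem-D` and the class `group-X` are hypothetical placeholders, not real classifications.

```python
# Hypothetical concept graph: concept -> set of direct parents.
PARENTS = {
    "chem-A": {"carcinogen", "toxin", "teratogen"},
    "chem-B": {"carcinogen", "toxin"},
    "chem-C": {"toxin"},
    "chem-D": {"group-X"},
    "group-X": {"carcinogen", "toxin", "teratogen"},
}


def ancestors(concept, parents):
    """Every concept reachable by following parent edges upward."""
    seen, stack = set(), [concept]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def concepts_under_all(required, parents):
    """Concepts that have every class in `required` among their ancestors."""
    return sorted(c for c in parents if set(required) <= ancestors(c, parents))


matches = concepts_under_all({"carcinogen", "toxin", "teratogen"}, PARENTS)
print(matches)
```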

End of Tutorial about concept systems

What are the “Database Language” challenges?

Metadata Registries & Database Technologies – Which Does What?

Traditional Data Registries (11179 Edition 2)

  • Register metadata which describes data—in databases, applications, XML Schemas, data models, flat files, paper
  • Assist in harmonizing, standardizing, and vetting metadata
  • Assist data engineering
  • Provide a source of well formed data designs for system designers
  • Record reporting requirements
  • Assist data generation, by describing the meaning of data entry fields and the potential valid values
  • Register provenance information that can be provided to end users of data
  • Assist with information discovery by pointing to systems where particular data is maintained.
Traditional MDR: Manage Code Sets

Data Element Concept
  Attributes: Name: Country Identifiers; Context; Definition; Unique ID: 5769; Conceptual Domain; Maintenance Org.; Steward; Classification; Registration Authority; Others
  Values: Algeria, Belgium, China, Denmark, Egypt, France, . . ., Zimbabwe

Data Elements
  Attributes: Name; Context; Definition; Unique ID: 4572; Value Domain; Maintenance Org.; Steward; Classification; Registration Authority; Others

  ISO 3166       ISO 3166      ISO 3166       ISO 3166       ISO 3166
  English Name   French Name   2-Alpha Code   3-Alpha Code   3-Numeric Code
  Algeria        L'Algérie     DZ             DZA            012
  Belgium        Belgique      BE             BEL            056
  China          Chine         CN             CHN            156
  Denmark        Danemark      DK             DNK            208
  Egypt          Egypte        EG             EGY            818
  France         La France     FR             FRA            250
  . . .          . . .         . . .          . . .          . . .
  Zimbabwe       Zimbabwe      ZW             ZWE            716

What Can XMDR Do?

Support a new generation of semantic computing

  • Concept system management
  • Harmonizing and vetting concept systems
  • Linkage of concept systems to data
  • Interrelation of multiple concept systems
  • Grounding ontologies and RDF in agreed upon semantics
  • Reasoning across XMDR content (concept systems and metadata)
  • Provision of Semantic Services
We are trying to manage semantics in an increasingly complex content space

  • Structured data
  • Semi-structured data
  • Unstructured data
  • Text
  • Pictographic
  • Graphics
  • Multimedia
  • Voice video

Case Study
  • Combining Concept Systems, Data, and Metadata to answer queries.
Linking Concepts: Text Document

Title 40--Protection of Environment
CHAPTER I--ENVIRONMENTAL PROTECTION AGENCY
PART 141--NATIONAL PRIMARY DRINKING WATER REGULATIONS

§ 141.62 40 CFR Ch. I (7–1–02 Edition)

§ 141.62 Maximum contaminant levels for inorganic contaminants.

(a) [Reserved]

(b) The maximum contaminant levels for inorganic contaminants specified in paragraphs (b)(2)–(6), (b)(10), and (b)(11)–(16) of this section apply to community water systems and non-transient, non-community water systems. The maximum contaminant level specified in paragraph (b)(1) of this section only applies to community water systems. The maximum contaminant levels specified in (b)(7), (b)(8), and (b)(9) of this section apply to community water systems; non-transient, non-community water systems; and transient non-community water systems.

  Contaminant      MCL (mg/l)
  (1) Fluoride     4.0
  (2) Asbestos     7 Million Fibers/liter (longer than 10 μm).
  (3) Barium       2
  (4) Cadmium      0.005
  (5) Chromium     0.1
  (6) Mercury      0.002
  (7) Nitrate      10 (as Nitrogen)
Thesaurus Concept System (From GEMET)

Chemical Contamination

Definition: The addition or presence of chemicals to, or in, another substance to such a degree as to render it unfit for its intended purpose.

Broader Term: contamination

Narrower Terms: cadmium contamination, lead contamination, mercury contamination

Related Terms: chemical pollutant, chemical pollution

Deutsch: Chemische Verunreinigung

English (US): chemical contamination

Español: contaminación química

SOURCE: General Multi-Lingual Environmental Thesaurus (GEMET)

Concept System (Thesaurus)

[Figure: Contamination with Biological, Radioactive, and Chemical beneath it; cadmium, lead, and mercury beneath Chemical; chemical pollutant and chemical pollution shown as related terms.]

Data

[Figure: Merced River, Fletcher Creek, and Merced Lake with points labeled A, B, and X; data labels: Monitoring Stations, Measurements]

Metadata

[Figure label: Contaminants]

Relations among Inland Bodies of Water

[Figure: Fletcher Creek, Merced River, and Merced Lake connected by relations labeled “feeds into” and “fed from”; for example, Fletcher Creek feeds into Merced River.]

Combining Data, Metadata & Concept Systems

Inference Search Query:

“find water bodies downstream from Fletcher Creek where chemical contamination was over 2 parts per billion between December 2001 and March 2003”

[Figure: three panels labeled Concept system, Data, and Metadata. The concept system panel shows Contamination with Biological, Radioactive, and Chemical beneath it, and mercury, lead, and cadmium beneath Chemical.]
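
The inference search query combines three ingredients: a relation between water bodies (to find what is downstream), a concept system (to expand "chemical contamination" into specific contaminants), and measurement data with its unit metadata. The sketch below shows one way these could fit together. The `FEEDS_INTO` topology and all measurement values are hypothetical and chosen only to exercise the query.

```python
from datetime import date

# Concept system: narrower terms, as in the thesaurus slides.
NARROWER = {
    "contamination": ["biological", "radioactive", "chemical"],
    "chemical": ["mercury", "lead", "cadmium"],
}

# Assumed "feeds into" relations; the real topology may differ.
FEEDS_INTO = {"Fletcher Creek": ["Merced River"], "Merced River": ["Merced Lake"]}

# Hypothetical measurements: (water body, contaminant, value, unit, date).
MEASUREMENTS = [
    ("Merced River", "mercury", 3.0, "ppb", date(2002, 6, 1)),
    ("Merced Lake", "lead", 1.0, "ppb", date(2002, 7, 15)),
    ("Merced Lake", "cadmium", 4.5, "ppb", date(2004, 1, 10)),
    ("Fletcher Creek", "mercury", 9.0, "ppb", date(2002, 2, 2)),
]


def closure(start, graph):
    """Everything reachable from `start` by following edges in `graph`."""
    found, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), []):
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    return found


def contaminated_downstream(source, concept, threshold_ppb, start, end):
    bodies = closure(source, FEEDS_INTO)    # relations among water bodies
    substances = closure(concept, NARROWER)  # concept system expansion
    return sorted({
        body
        for body, substance, value, unit, day in MEASUREMENTS
        if body in bodies and substance in substances
        and unit == "ppb" and value > threshold_ppb  # unit check uses the metadata
        and start <= day <= end
    })


result = contaminated_downstream(
    "Fletcher Creek", "chemical", 2.0, date(2001, 12, 1), date(2003, 3, 31)
)
print(result)
```

None of the stored measurements mentions "chemical" or "downstream"; both are resolved through the relations and the concept system before the data is filtered.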

Example – Environmental Text Corpus
  • Idea: Develop an environmental research corpus that could attract R&D efforts. Include the reports and other material from over $1b EPA sponsored research.
    • Prepare the corpus and make it available
      • Research results from years of ORD R&D
    • Publish associated metadata and concept systems in XMDR
    • Use open source software for EPA testing
Information Extraction & Semantic Computing

[Figure: an Extraction Engine pipeline with the steps Segment, Classify, Associate, Normalize, Deduplicate, Discover patterns, Select models, Fit parameters, Inference, and Report results, linked to an 11179-3 (E3) XMDR and producing Actionable Information and Decision Support]

Metadata Registries are Useful

Registered semantics

  • For “training” extraction engines
  • The “Normalize” function can make use of standard code sets that have mapping between representation forms.
  • The “Classify” function can interact with pre-established concept systems.

Provenance

  • High precision for proper nouns, less precision (e.g., 70%) for other concepts. This impacts downstream processing, so precision needs to be tracked.
Normalize – Need Registered and Mapped Concepts/Code Sets

Data Element Concept
  Attributes: Name: Country Identifiers; Context; Definition; Unique ID: 5769; Conceptual Domain; Maintenance Org.; Steward; Classification; Registration Authority; Others
  Values: Algeria, Belgium, China, Denmark, Egypt, France, . . ., Zimbabwe

Data Elements
  Attributes: Name; Context; Definition; Unique ID: 4572; Value Domain; Maintenance Org.; Steward; Classification; Registration Authority; Others

  ISO 3166       ISO 3166      ISO 3166       ISO 3166       ISO 3166
  English Name   French Name   2-Alpha Code   3-Alpha Code   3-Numeric Code
  Algeria        L'Algérie     DZ             DZA            012
  Belgium        Belgique      BE             BEL            056
  China          Chine         CN             CHN            156
  Denmark        Danemark      DK             DNK            208
  Egypt          Egypte        EG             EGY            818
  France         La France     FR             FRA            250
  . . .          . . .         . . .          . . .          . . .
  Zimbabwe       Zimbabwe      ZW             ZWE            716

Challenge for Database Languages
  • The extraction database can contain graphs with > a billion nodes.
    • Types of queries that can be done
    • Query performance
    • Linkage of “extract database” concepts and relations to same concepts and relations in traditional databases.
Example – 11179-3 (E3) Support Semantic Web Applications

XMDR may be used to “ground” the Semantics of an RDF Statement.

The address state code is “AB”. This can be expressed as a directed Graph, e.g., an RDF statement:

  Graph:  Node     --Edge-->        Node
  RDF:    Subject  --Predicate-->   Object
          Address  --State Code-->  AB

Example: Grounding RDF nodes and relations: URIs Reference a Metadata Registry

dbA:e0139 --ai:MailingAddress--> dbA:ma344 --ai:StateUSPSCode--> “AB”^^ai:StateCode

@prefix dbA: “http:/www.epa.gov/databaseA”

@prefix ai: “http://www.epa.gov/edr/sw/AdministeredItem#”
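
As a rough illustration of grounding, the sketch below stores the two statements as plain subject-predicate-object tuples and checks that every predicate URI resolves to a definition in a registry. The `REGISTRY` dictionary is a stand-in for a metadata registry lookup, with definitions copied from the EPA Environmental Data Registry slide. The `#` separator appended to the `dbA` prefix and the function names are assumptions.

```python
AI = "http://www.epa.gov/edr/sw/AdministeredItem#"
DBA = "http://www.epa.gov/databaseA#"  # assumed form of the dbA prefix

# Stand-in for a metadata registry: URI -> registered definition.
REGISTRY = {
    AI + "MailingAddress": (
        "The exact address where a mail piece is intended to be delivered, "
        "including urban-style address, rural route, and PO Box"
    ),
    AI + "StateUSPSCode": (
        "The U.S. Postal Service (USPS) abbreviation that represents a state "
        "or state equivalent for the U.S. or Canada"
    ),
}

# The statements from the slide as (subject, predicate, object) tuples.
TRIPLES = [
    (DBA + "e0139", AI + "MailingAddress", DBA + "ma344"),
    (DBA + "ma344", AI + "StateUSPSCode", "AB"),
]


def ungrounded_predicates(triples, registry):
    """Predicate URIs that have no definition in the registry."""
    return sorted({p for _, p, _ in triples if p not in registry})


def definition_of(predicate, registry):
    """Registered definition for a predicate URI, or None if it is not registered."""
    return registry.get(predicate)


print(ungrounded_predicates(TRIPLES, REGISTRY))
print(definition_of(AI + "StateUSPSCode", REGISTRY))
```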

Definitions in the EPA Environmental Data Registry

Mailing Address:
http://www.epa/gov/edr/sw/AdministeredItem#MailingAddress
The exact address where a mail piece is intended to be delivered, including urban-style address, rural route, and PO Box

State USPS Code:
http://www.epa/gov/edr/sw/AdministeredItem#StateUSPSCode
The U.S. Postal Service (USPS) abbreviation that represents a state or state equivalent for the U.S. or Canada

Mailing Address State Name:
http://www.epa/gov/edr/sw/AdministeredItem#StateName
The name of the state where mail is delivered

Ontologies for Data Mapping

[Figure: an ontology of concepts (Geographic Area, Geographic Sub-Area, Country, Country Identifier) linked to representations: Country Name (Short Name, Long Name), Country Code (ISO 3166 2-Character Code, ISO 3166 3-Character Code, ISO 3166 3-Numeric Code, FIPS Code), and to data elements such as Mailing Address Country Name and Distributor Country Name.]

Ontologies can help to capture and express semantics

Example: Content Mapping Service
  • Collect data from many sources – files contain data that has the same facts represented by different terms. E.g., one system responds with Danemark, DK, another with DNK, another with 208; map all to Denmark.
  • XMDR could accept XML files with the data from different code sets and return a result mapped to a single code set.
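
The core of such a mapping service is a lookup from every registered representation of a value to one target representation. The sketch below uses the ISO 3166 rows shown on the code set slides and maps everything to the English name. It works on plain Python strings rather than XML files, and the names `ROWS`, `LOOKUP`, and `normalize` are assumptions.

```python
# (English name, French name, 2-alpha, 3-alpha, 3-numeric) rows from the code set slides.
ROWS = [
    ("Algeria", "L'Algérie", "DZ", "DZA", "012"),
    ("Belgium", "Belgique", "BE", "BEL", "056"),
    ("China", "Chine", "CN", "CHN", "156"),
    ("Denmark", "Danemark", "DK", "DNK", "208"),
    ("Egypt", "Egypte", "EG", "EGY", "818"),
    ("France", "La France", "FR", "FRA", "250"),
    ("Zimbabwe", "Zimbabwe", "ZW", "ZWE", "716"),
]

# Every representation points back to the English name.
LOOKUP = {value: row[0] for row in ROWS for value in row}


def normalize(value):
    """Map any known representation to the English name; None if the value is unknown."""
    return LOOKUP.get(value.strip())


responses = ["Danemark", "DK", "DNK", "208"]
print([normalize(r) for r in responses])
```

Numeric codes are kept as strings so that leading zeros such as in 012 and 056 are preserved.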
Actions to Manage Enterprise Semantics
  • Define data, concepts, and relations
  • Harmonize and vet data and concept systems
  • Ground semantics for RDF, concept systems, ontologies
  • Provide semantics services
Challenge: Concept System Store

Concept systems:
  • Keywords
  • Controlled Vocabularies
  • Thesauri
  • Taxonomies
  • Ontologies
  • Axiomatized Ontologies

(Essentially graphs: node-relation-node + axioms)

[Figure: an ISO/IEC 11179 Metadata Registry serving Users, with components labeled Metadata Registry, Concept System, Thesaurus, Themes, Ontology, GEMET, Data Standards, and Structured Metadata; a brace ties the list above to the concept system components.]

Challenge: Management of Concept Systems

Concept system:
  • Registration
  • Harmonization
  • Standardization
  • Acceptance (vetting)
  • Mapping (correspondences)

[Figure: an ISO/IEC 11179 Metadata Registry serving Users, with components labeled Metadata Registry, Concept System, Thesaurus, Themes, Ontology, GEMET, Data Standards, and Structured Metadata; a brace ties the list above to the concept system components.]

Challenge: Life Cycle Management

Life cycle management: Data and Concept systems (ontologies)

[Figure: an ISO/IEC 11179 Metadata Registry serving Users, with components labeled Metadata Registry, Concept System, Thesaurus, Themes, Ontology, GEMET, Data Standards, and Structured Metadata; a brace ties life cycle management to these components.]

Challenge: Grounding Semantics

Metadata Registries

Semantic Web

RDF Triples:
  • Subject (node URI)
  • Verb (relation URI)
  • Object (node URI)

Ontologies

[Figure: an ISO/IEC 11179 Metadata Registry serving Users, with components labeled Metadata Registry, Concept System, Thesaurus, Themes, Ontology, GEMET, Data Standards, and Structured Metadata.]

Some Limitations of Relational Technologies & SQL
  • Limited graph computations
    • Weak graph query language
  • Limited object computations
    • Weak object query language
  • Inadequate linkage of metadata to data (underspecified “catalog”)
    • CASE tools also disable, rather than enable, data administration & semantics management
Limitations (Cont.)
  • Limited linkage of concept system (graphs) to data (relational, graph, object)
Some Input From WG 2 and XMDR
  • Look at recent work on a graph query language by David Silberberg of Johns Hopkins University Applied Physics Lab.
Input from WG 2 and XMDR
  • David Jensen, of the University of Massachusetts Amherst ( http://kdl.cs.umass.edu/people/jensen/ ) has been developing a very interesting Proximity system and in the process has worked with complex patterns in very large data sets, including alternative query languages and database technologies ( http://kdl.cs.umass.edu/proximity/index.html ). QGRAPH is a new visual language for querying and updating graph databases. A key feature of QGRAPH is that the user can draw a query consisting of vertices and edges with specified relations between their attributes. The response will be the collection of all subgraphs of the database that have the desired pattern.
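
To make the idea of "a query consisting of vertices and edges" concrete, here is a brute-force Python sketch that returns every assignment of database vertices to pattern vertices satisfying the attribute conditions and the required edges. It is not QGRAPH and does not reflect its syntax or implementation; the graph contents, the `type` attribute, and the function name `match_pattern` are hypothetical, and the exhaustive search would not scale to large graphs.

```python
from itertools import permutations


def match_pattern(vertices, edges, pattern_vertices, pattern_edges):
    """Return all bindings of pattern vertex names to distinct graph vertices.

    vertices:         {vertex_id: attribute dict}
    edges:            set of (source_id, target_id)
    pattern_vertices: {name: predicate over an attribute dict}
    pattern_edges:    list of (name, name) pairs that must be connected
    """
    names = list(pattern_vertices)
    matches = []
    for combo in permutations(vertices, len(names)):
        binding = dict(zip(names, combo))
        attributes_ok = all(pattern_vertices[n](vertices[binding[n]]) for n in names)
        edges_ok = all((binding[a], binding[b]) in edges for a, b in pattern_edges)
        if attributes_ok and edges_ok:
            matches.append(binding)
    return matches


# Hypothetical graph database of water bodies.
VERTICES = {
    "Fletcher Creek": {"type": "creek"},
    "Merced River": {"type": "river"},
    "Merced Lake": {"type": "lake"},
}
EDGES = {("Fletcher Creek", "Merced River"), ("Merced River", "Merced Lake")}

# Pattern: some creek c with an edge to some river r.
matches = match_pattern(
    VERTICES,
    EDGES,
    {"c": lambda attrs: attrs["type"] == "creek", "r": lambda attrs: attrs["type"] == "river"},
    [("c", "r")],
)
print(matches)
```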
Input from WG 2 and XMDR
  • Query languages are necessary to extract useful information from massive data sets. Moreover, annotated corpora require thousands of hours of manual annotation to create, revise and maintain. Query languages are also useful during this process. For example, queries can be used to find parse errors or to transform annotations into different schemes. However, they suffer from several problems.
    • First, updates are not supported as query languages focus on the needs of linguists searching for syntactic constructions.
    • Second, their relationship to existing database query languages is poorly understood, making it difficult to apply standard database indexing and query optimization techniques. As a consequence they do not scale well.
    • Finally, linguistic annotations have both a sequential and a hierarchical organization. Query languages must support queries that refer to both of these types of structure simultaneously. Such hybrid queries should have a concise syntax. The interplay between these factors has resulted in a variety of mutually-inconsistent approaches.

Catherine Lai and Steven Bird

Department of Computer Science and Software Engineering

University of Melbourne, Victoria 3010, Australia

Input from WG 2 and XMDR
  • Try to keep an eye on companies that are grappling with advanced database, knowledge management,  information extraction, and analysis requirements, such as Metamatrix, I2, NetViz, Top Quadrant, OntologyWorks, Franz, Cogito, or Objectivity, with new ones cropping up very often.    
  • Check out the EU sites given the large investments being made there in areas of interest. For example, KAON. 
  • Watch the outcome of an NSF funded project on querying linguistic databases, including annotated corpora ( http://projects.ldc.upenn.edu/QLDB/ ). Steven Bird at U. Melbourne is one of the principals on that project.
Input from WG 2 and XMDR
  • Need for graph query languages that go beyond RDF and XML
  • Frank Olken: Make SQL a strongly typed language with respect to measurement dimensionality.
  • Performance: project graph-structured queries against graph-structured data. The query can be expressed in SQL only with great difficulty. Complex objects: the model gets complex, and Humpty Dumpty has to be put together again at query time.
  • Political problem in government: vendors are on board, so it is hard to pursue other technologies.
  • Object systems: is OMG working on it? (OQL?) Java has an ugly layer that maps into the relational system. Franz has SPARQL built on top of a graph store.
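
Frank Olken's suggestion about measurement dimensionality can be illustrated outside SQL. The sketch below is Python, checks dimensions at run time rather than in a static type system, and uses a hypothetical `Quantity` class with an assumed choice of base dimensions (length, mass, time). It only shows the kind of error such typing would catch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    value: float
    dims: tuple  # exponents of (length, mass, time); an assumed set of base dimensions

    def __add__(self, other):
        # Addition is only meaningful between quantities of the same dimension.
        if self.dims != other.dims:
            raise TypeError(f"cannot add dimensions {self.dims} and {other.dims}")
        return Quantity(self.value + other.value, self.dims)

    def __mul__(self, other):
        # Multiplication adds the dimension exponents.
        return Quantity(self.value * other.value,
                        tuple(a + b for a, b in zip(self.dims, other.dims)))

    def __truediv__(self, other):
        # Division subtracts the dimension exponents.
        return Quantity(self.value / other.value,
                        tuple(a - b for a, b in zip(self.dims, other.dims)))


LENGTH = (1, 0, 0)
TIME = (0, 0, 1)

distance = Quantity(100.0, LENGTH)
duration = Quantity(20.0, TIME)
speed = distance / duration  # dimensions (1, 0, -1), i.e. length / time

try:
    distance + duration      # length + time is rejected
    mixed_sum_rejected = False
except TypeError:
    mixed_sum_rejected = True
```

A dimension-aware SQL would reject the equivalent of `distance + duration` when the query is compiled, before any data is read.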
Input from WG 2 and XMDR
  • Link Mining Applications: Progress and Challenges - Ted E. Senator

Link mining is a fairly new research area that lies at the intersection of link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining. However, and perhaps more important, it also represents an important and essential set of techniques for constructing useful applications of data mining in a wide variety of real and important domains, especially those involving complex event detection from highly structured data. Imagine a complete “link mining toolkit.” What would such a toolkit look like?

Input from WG 2 and XMDR

Link Mining Applications: Progress and Challenges - Ted E. Senator

  • Most important, it would require a language that enabled the natural representation of entities and links. Such a language would also allow for the representation of pattern templates and for specifying matches between the templates and their instantiations.
  • The language would have to accept an arbitrary database schema as input, with a specified mapping between relations in the database and fundamental link types in the language.
  • It would have to compile into efficient and rapidly executable database queries.
  • It would need to be able to represent grouped entities and multiple abstraction hierarchies and reason at all levels.
  • It would have to enable the creation of new schema elements in the database to represent newly discovered concepts.
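
Three of the requirements above (a mapping from database relations to link types, pattern templates, and compilation into database queries) can be sketched together. This is a minimal Python illustration using the standard `sqlite3` module, not a design from Senator's paper; the `LINK_TYPES` mapping, the table and column names, and the `compile_pattern` function are all hypothetical.

```python
import sqlite3

# Hypothetical schema mapping: link type -> (table, source column, target column).
# Names in this mapping are trusted and interpolated directly into the SQL text.
LINK_TYPES = {
    "calls": ("phone_call", "caller", "callee"),
    "pays": ("payment", "payer", "payee"),
}


def compile_pattern(pattern):
    """Compile a list of (source_var, link_type, target_var) triples into one
    SQL join. A variable that appears again becomes an equality constraint."""
    froms, wheres, first_seen = [], [], {}
    for i, (src, link, dst) in enumerate(pattern):
        table, src_col, dst_col = LINK_TYPES[link]
        alias = f"t{i}"
        froms.append(f"{table} AS {alias}")
        for var, col in ((src, src_col), (dst, dst_col)):
            ref = f"{alias}.{col}"
            if var in first_seen:
                wheres.append(f"{first_seen[var]} = {ref}")
            else:
                first_seen[var] = ref
    select = ", ".join(f"{ref} AS {var}" for var, ref in first_seen.items())
    sql = f"SELECT {select} FROM {', '.join(froms)}"
    if wheres:
        sql += " WHERE " + " AND ".join(wheres)
    return sql


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE phone_call (caller TEXT, callee TEXT);
    CREATE TABLE payment (payer TEXT, payee TEXT);
    INSERT INTO phone_call VALUES ('ann', 'bob'), ('cat', 'dan');
    INSERT INTO payment VALUES ('bob', 'eve'), ('dan', 'ann');
""")

# Pattern template: X calls Y, and Y pays Z.
template = [("x", "calls", "y"), ("y", "pays", "z")]
query = compile_pattern(template)
instances = conn.execute(query).fetchall()
```

Each returned row is one instantiation of the template, so the match between a template and its instances is simply the result set of the compiled query.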
Input from WG 2 and XMDR

Link Mining Applications: Progress and Challenges - Ted E. Senator

  • It would need to represent both pattern templates and pattern instances, and to have a mechanism for tracking matches between the two.
  • It would have to have constructs for representing fundamental relationships such as part-of, is-a, and connected-to (the most generic link relationship), as well as perhaps other high-level link types such as temporal relationships (e.g., before, after, during, overlapping, etc.), geo-spatial relationships, organizational relationships, trust relationships, and activities and events.
  • The toolkit would include at least one and possibly many pattern matchers. It would require tools for creating and editing patterns. It would have to include visualizations for many different types of structured data.
  • It would need mechanisms for handling uncertainty and confidence.
  • It would have to track the dependence of any conclusion (e.g., pattern match or discovered pattern) back to the underlying data, and perhaps incorporate backtracking so the impact of data corrections could be detected.
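
The last requirement, tracing a conclusion back to the underlying data so that the impact of a data correction can be detected, amounts to keeping a two-way index between conclusions and records. The Python sketch below assumes a hypothetical `ProvenanceIndex` class and made-up identifiers; it is one simple way to meet the requirement, not the mechanism described in the paper.

```python
from collections import defaultdict


class ProvenanceIndex:
    """Tracks which underlying records each conclusion depends on."""

    def __init__(self):
        self._supports = {}               # conclusion id -> set of record ids
        self._used_by = defaultdict(set)  # record id -> set of conclusion ids

    def record_conclusion(self, conclusion_id, record_ids):
        """Register a pattern match or discovered pattern with its evidence."""
        self._supports[conclusion_id] = set(record_ids)
        for rid in record_ids:
            self._used_by[rid].add(conclusion_id)

    def evidence_for(self, conclusion_id):
        """Trace a conclusion back to the underlying data."""
        return set(self._supports[conclusion_id])

    def affected_by_correction(self, record_id):
        """List the conclusions to re-check when a record is corrected."""
        return set(self._used_by.get(record_id, set()))


# Hypothetical identifiers for two pattern matches and their source records.
index = ProvenanceIndex()
index.record_conclusion("match-1", ["r1", "r2"])
index.record_conclusion("match-2", ["r2", "r3"])

to_recheck = index.affected_by_correction("r2")
untouched = index.affected_by_correction("r9")
```

Correcting record `r2` flags both matches for re-evaluation, while a record that supports no conclusion flags none.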
Input from WG 2 and XMDR

Link Mining Applications: Progress and Challenges - Ted E. Senator

  • It would need configuration management tools to track the history of discovered and matched patterns.
  • It would need workflow mechanisms to support multiple users in an organizational structure.
  • It would need mechanisms for ingesting domain-specific knowledge.
  • It would have to be able to deal with multiple data types including text and imagery.
  • And it would have to be able to rapidly incorporate new link mining techniques as they are developed.
  • Finally, it would need to include mechanisms for maximum privacy protection.
Where to Progress Semantics Management?
  • SC 32 in WG 2 and WG 3 as extensions to ongoing work or as New Work Items
  • W3C as XQuery, SPARQL, Semantic Web Deployment WG (RDF vocabularies, SKOS)
  • OMG as extensions to the MOF
Thanks & Acknowledgements
  • John McCarthy
  • Karlo Berket
  • Kevin Keck
  • Frank Olken
  • Harold Solbrig
  • L8 and SC 32/WG 2 Standards Committees
  • Major XMDR Project Sponsors and Collaborators
    • U.S. Environmental Protection Agency
    • Department of Defense
    • National Cancer Institute
    • U.S. Geological Survey
    • Mayo Clinic
    • Apelon