Scalable Knowledge Extraction from Legacy Sources with SEEK

Joachim Hammer

Dept. of CISE

University of Florida

jhammer@cise.ufl.edu

3-June-2003

Outline of Talk
  • Motivation
  • SEEK Information Architecture
  • Knowledge Extraction
    • Schema Extraction
    • Code Analysis
  • Status and Future Work
SEEK Project
  • Faculty: Joachim Hammer, Mark Schmalz, William O’Brien, Ray Issa, Joe Geunes, Sherman Bai
  • Current Students: Oguzhan Topsakal, Mingxi Wu, Ivan Mutis, Haiyan Xie, Bibo Yang, Bo Lu
  • Disciplines: Computer Science, Building Construction, Industrial Engineering
  • Sponsored by NSF
  • Year 2 of 4
Motivation
  • Need for integrated access to intelligence sources in support of national security related applications
    • Ability to rapidly sift through massive volumes of data for situational analysis and planning
  • This is hard: sources have unique and often incompatible information systems with varying levels of sophistication
    • E.g., Web, public records, sensor networks, state and federal databases, etc.
  • Current integration approaches rely on manual coding of connection software, which is not scalable
  • Goal: develop a toolkit to facilitate integration of heterogeneous legacy data and knowledge
Information Environment

[Figure: information environment. Labels: public/information access; agency 1; agency 2; external data sources; reporting authority/collaborative analysis.]

Many computers, many users, many information needs

Application Areas
  • Homeland Defense
    • Threat prediction and detection
  • Emergency Management
    • Emergency response planning, damage assessment
  • Extended Enterprise/Supply Network
    • Decision/negotiation support to improve performance and customization
SEEK Environment & Context

[Figure: SEEK environment. Labels: coordinator/lead; Agency (three instances); SEEK (three instances); Decision Support/Analysis; Secure Hosting Infrastructure.]

SEEK Information Architecture

[Figure: SEEK information architecture. Labels: Connection Toolkit; Legacy Source; SEEK Components; Domain Expert; Application; Knowledge Extraction Module; Analysis Module; Wrapper; Legacy Data and Systems. Caption: Secure, value-added extraction of source data.]

  • AM: query analysis, knowledge composition (mediation)
  • W: source connection and translation
  • KEM: configuration of W and AM at build-time

Run-Time: Querying and Analysis
  • Different information contexts: application, analysis module, source
    • Translator needed to convert between information contexts
    • Assume existence of translator between AM and application contexts
  • Analysis module provides robust (value-added) mediation
    • Solution strategy based on information available in source
    • Capable of composing final answer out of multiple source results
  • SEEK wrapper responsible for syntactic and semantic conversions
    • Formulates source queries based on capabilities of source
    • Restructures source results to conform to information context of AM
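
The wrapper's two duties listed above (formulating source queries and restructuring source results) can be sketched in a few lines. This is only an illustration under stated assumptions: the AM-side names (`Task`, `start_date`, and so on) and both function names are hypothetical, while the legacy names `T`, `T_NAME`, `T_ST_D`, and `T_FIN_D` come from the legacy schema shown later in the talk. It is not the SEEK wrapper itself.

```python
# Hypothetical mapping between the AM's information context and the legacy source.
# Only the legacy names (T, T_NAME, T_ST_D, T_FIN_D) appear in the talk.
CONCEPT_TO_RELATION = {"Task": "T"}
ATTRIBUTE_TO_COLUMN = {
    "Task": {"name": "T_NAME", "start_date": "T_ST_D", "finish_date": "T_FIN_D"},
}


def to_source_query(concept, attributes):
    """Formulate a source-level SQL query for a request phrased in AM terms."""
    relation = CONCEPT_TO_RELATION[concept]
    columns = [ATTRIBUTE_TO_COLUMN[concept][attr] for attr in attributes]
    return f"SELECT {', '.join(columns)} FROM {relation}"


def to_am_result(attributes, rows):
    """Restructure source rows (tuples) into records keyed by AM attribute names."""
    return [dict(zip(attributes, row)) for row in rows]


requested = ["start_date", "finish_date"]
query = to_source_query("Task", requested)
records = to_am_result(requested, [("2003-06-01", "2003-06-03")])
```

A real wrapper would also have to respect the query capabilities of the source; here the source is assumed to accept plain SQL.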
Build-Time: Knowledge Extraction
  • Extract information about legacy source to facilitate development of wrapper and configuration of AM
    • Produces “description” of accessible knowledge in source
  • Schema extraction from data source
  • Analysis of application code to augment schema with semantics and extract business rules
  • Schema matching to infer mappings between the information context of the AM/application and that of the legacy source
  • Quality and accuracy of extracted knowledge (and hence of the wrapper and AM) improve over time and with human input
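
To make the schema matching step concrete, here is a deliberately naive sketch that proposes mappings by name similarity alone. The talk does not describe SEEK's matching algorithm (the module is reported as under development), so the technique, the threshold, and the AM-side terms are all assumptions; only the legacy column names are taken from the talk.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case a name and drop underscores so that naming styles compare fairly."""
    return name.lower().replace("_", "")


def match_schema(am_terms, legacy_columns, threshold=0.5):
    """Map each AM term to its most similar legacy column, if similar enough."""
    mappings = {}
    for term in am_terms:
        scored = [
            (SequenceMatcher(None, normalize(term), normalize(col)).ratio(), col)
            for col in legacy_columns
        ]
        score, best = max(scored)
        if score >= threshold:
            mappings[term] = best
    return mappings


# AM terms are hypothetical; legacy column names come from the talk's schema.
mappings = match_schema(["res_name", "project_id"], ["RES_NAME", "PROJ_ID", "T_NAME"])
```

Proposals like these would still need validation by a domain expert, in line with the human input mentioned above.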
Architectural Overview

[Figure: architectural overview. Components: Data Reverse Engineering (DRE), containing the Schema Extractor (SE) and the Semantic Analyzer (SA); Schema Matching (SM); Domain Model; Domain Ontology; Legacy Source, containing the Legacy DB and the Legacy Application Code. Flow labels: Schema Information; Embedded Queries; "revise, validate"; "train, validate"; "Schema, semantics, business rules"; "Mapping rules"; "to wrapper generator".]


Schema Extraction
  • Based on data reverse engineering algorithms, e.g., Chiang 94/95, Petit et al. 96
    • Reduced dependency on human input
    • Removes restrictive assumptions of earlier algorithms (e.g., consistent naming, legacy schema in 3NF)
  • Use database catalog to directly extract concepts and simple constraints
  • Use database instances to infer relationships and constraints
  • Interact with code analysis to augment schema with semantics
  • Produces E/R-like representation of the entities, relationships, and constraints
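
As an illustration of reading concepts and simple constraints directly from the database catalog, the sketch below queries SQLite's catalog. SQLite, the column types, and the function name `extract_catalog` are assumptions made for the example; the talk does not say which DBMS the legacy sources use.

```python
import sqlite3


def extract_catalog(conn):
    """Return {table: {"columns", "primary_key", "not_null"}} from the SQLite catalog."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Each row is (cid, name, type, notnull, dflt_value, pk);
        # pk is the 1-based position within the primary key, or 0.
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        key_rows = sorted((row for row in info if row[5] > 0), key=lambda row: row[5])
        schema[table] = {
            "columns": [row[1] for row in info],
            "primary_key": [row[1] for row in key_rows],
            "not_null": [row[1] for row in info if row[3]],
        }
    return schema


# Toy tables: names follow the talk's legacy schema, types and keys are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Proj (P_ID INTEGER PRIMARY KEY, P_NAME TEXT NOT NULL);
    CREATE TABLE T (PROJ_ID INTEGER NOT NULL, T_UID INTEGER NOT NULL, T_NAME TEXT,
                    PRIMARY KEY (PROJ_ID, T_UID));
""")
catalog = extract_catalog(conn)
```

Relationships that the catalog does not declare have to be inferred from the instances instead, which is the role of the instance-based step above.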
Semantic Analysis
  • Identify semantic descriptions in the application code for schema items in the database
    • E.g., trace database schema names back to output statements
  • Use code slicing to reduce the application code to only those statements that are of interest to the analyzer (Horwitz, Reps 92)
  • Apply pattern matcher
    • discover associations among variables
    • identify patterns that encode business information
      • E.g., business rules encoded in IF-THEN-ELSE statements
  • Versions for C, C++, and Java
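
The idea behind slicing can be shown with a much simplified backward slice over straight-line code. The sketch assumes each statement has already been reduced to the variables it defines and uses, and it ignores control flow and procedure calls, so it is far weaker than the slicing cited above. The statement texts are modeled on the scheduling application shown later in the talk; `counter` is a hypothetical unrelated variable.

```python
def backward_slice(statements, criterion_vars):
    """Keep only the statements that can affect the criterion variables.

    statements: list of (text, defined_vars, used_vars) in program order.
    """
    relevant = set(criterion_vars)
    kept = []
    for text, defines, uses in reversed(statements):
        if relevant & set(defines):
            kept.append(text)
            relevant -= set(defines)  # this statement supplies these values
            relevant |= set(uses)     # so its inputs become relevant in turn
    kept.reverse()
    return kept


program = [
    ("int bValue = 0;", {"bValue"}, set()),
    ("int counter = 0;", {"counter"}, set()),
    ("EXEC SQL SELECT T_ST_D,T_FIN_D INTO :aValue, :cValue FROM T "
     "WHERE T_PRITY = :bValue;", {"aValue", "cValue"}, {"bValue"}),
    ("int flag = 0;", {"flag"}, set()),
]

# Slice with respect to aValue, the variable that reaches an output statement.
sliced = backward_slice(program, {"aValue"})
```

Only the declaration of `bValue` and the embedded query survive, which is the reduced code a pattern matcher would then inspect.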
DRE Implementation

[Figure: DRE implementation pipeline. Inputs: Legacy Source (Application Code, DB Interface Module, Data) and configuration. Intermediate artifacts: AST, Queries, Metadata Repository. Outputs: Business Knowledge and Schema, passed through the Knowledge Encoder (XML DTD, XML DOC) to the Schema Matcher.]

  • Step 1: AST Generation
  • Step 2: Dictionary Extraction
  • Step 3: Code Analysis
  • Step 4: Inclusion Dependency Mining
  • Step 5: Relation Classification
  • Step 6: Attribute Classification
  • Step 7: Entity Identification
  • Step 8: Relationship Classification
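
Inclusion dependency mining, one of the steps in this pipeline, tests whether every value of one column also occurs in another column, which hints at a foreign-key relationship. The following is a minimal sketch, assuming a SQLite database and a caller-supplied list of candidate column pairs; the function names and the sample rows are invented for illustration, and the talk does not describe how SEEK selects candidates.

```python
import sqlite3


def holds(conn, dependent, referenced):
    """True if every non-NULL value of the dependent column occurs in the referenced one."""
    (dep_table, dep_col), (ref_table, ref_col) = dependent, referenced
    query = (
        f'SELECT COUNT(*) FROM ('
        f'SELECT "{dep_col}" FROM "{dep_table}" WHERE "{dep_col}" IS NOT NULL '
        f'EXCEPT SELECT "{ref_col}" FROM "{ref_table}")'
    )
    return conn.execute(query).fetchone()[0] == 0


def mine_inclusion_dependencies(conn, candidates):
    """Return the candidate (dependent, referenced) pairs that hold on the instances."""
    return [pair for pair in candidates if holds(conn, *pair)]


# Toy instances: table and column names follow the talk, the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Proj (P_ID INTEGER, P_NAME TEXT);
    CREATE TABLE T (PROJ_ID INTEGER, T_UID INTEGER);
    INSERT INTO Proj VALUES (1, 'bridge'), (2, 'tunnel'), (3, 'tower');
    INSERT INTO T VALUES (1, 10), (1, 11), (2, 12);
""")
found = mine_inclusion_dependencies(conn, [
    (("T", "PROJ_ID"), ("Proj", "P_ID")),
    (("Proj", "P_ID"), ("T", "PROJ_ID")),
    (("T", "T_UID"), ("Proj", "P_ID")),
])
```

A dependency found this way is only evidence from the current instances; an empty dependent column satisfies the test trivially, so results still need validation.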

Legacy Database Schema

Proj  [P_ID, P_NAME, DES_S, DES_F, A_S, A_F, …]
Avail [PROJ_ID, AVAIL_UID, RES_ID, AVAIL_FROM, AVAIL_TO, UNITS]
Res   [PROJ_ID, RES_UID, RES_NAME, R_ACWP, R_BCWP, R_BCWS, …]
T     [PROJ_ID, T_UID, T_ID, T_NAME, T_DUR, T_ST_D, T_FIN_D, …]
Assn  [PROJ_ID, ASSN_UID, T_UID, R_ID, ASSN_BASE_C, ASSN_ACT_W, …]
…

Scheduling Application

/* program for task scheduling */
char *aValue;
char *cValue;
int bValue = 0;
/* more code … */
EXEC SQL SELECT T_ST_D,T_FIN_D INTO :aValue, :cValue FROM T WHERE T_PRITY = :bValue;
/* more code … */
int flag = 0;
if (cValue <= aValue)
{
    flag = 1; /* exception handling */
}
/* more code … */
printf("Task Start Date %s", aValue);
printf("Task Finish Date %s", cValue);
/* more code … */
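
The semantic analyzer's goal for this program is to trace schema names through host variables to the output statements that describe them. The sketch below does this with two regular expressions over a straight-quoted copy of the relevant statements. It is a hedged illustration only: the function names are hypothetical, and a regex scan stands in for the AST-based analysis described in the talk.

```python
import re

# Straight-quoted copy of the statements of interest from the scheduling application.
CODE = '''
EXEC SQL SELECT T_ST_D,T_FIN_D INTO :aValue, :cValue FROM T WHERE T_PRITY = :bValue;
printf("Task Start Date %s", aValue);
printf("Task Finish Date %s", cValue);
'''


def columns_for_variables(code):
    """Pair each host variable with the column it is fetched INTO."""
    pairs = {}
    pattern = re.compile(r"SELECT\s+(.*?)\s+INTO\s+(.*?)\s+FROM\s+(\w+)", re.S | re.I)
    for cols, hosts, table in pattern.findall(code):
        col_names = [col.strip() for col in cols.split(",")]
        var_names = [host.strip().lstrip(":") for host in hosts.split(",")]
        for col, var in zip(col_names, var_names):
            pairs[var] = f"{table}.{col}"
    return pairs


def labels_for_variables(code):
    """Pair each printed variable with the text preceding its format specifier."""
    found = re.findall(r'printf\s*\(\s*"([^"%]*)%\w"\s*,\s*(\w+)\s*\)', code)
    return {var: text.strip() for text, var in found}


def describe_columns(code):
    """Attach a human-readable description to each column that reaches an output."""
    columns = columns_for_variables(code)
    labels = labels_for_variables(code)
    return {columns[var]: labels[var] for var in columns if var in labels}


descriptions = describe_columns(CODE)
```

Here `T_ST_D` and `T_FIN_D` pick up the descriptions "Task Start Date" and "Task Finish Date", the kind of semantics used to augment the extracted schema.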

Extracted Conceptual Schema

[Figure: E/R diagram of the extracted schema. Entities: Proj, Res, Assn, T, Avail. Relationships: three, each labeled "has", with cardinalities 1, N, and M. Attributes shown: P_ID, P_Name, Des_S, Proj_ID, Res_UID, Res_Name, Res_ID, Avail_UID, T_ID, T_UID.]

Extracted Business Rules
  • Variables have been replaced by their extracted meaning (to the extent that they are known)
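
The substitution described here can be sketched as follows, using the condition and action from the scheduling application's if-statement. The `MEANINGS` table and the IF/THEN output format are assumptions for illustration; the rule text on the original slide is not preserved in this transcript.

```python
import re

# Meanings assumed to have been recovered by semantic analysis; flag has none.
MEANINGS = {"aValue": "Task Start Date", "cValue": "Task Finish Date"}


def rewrite_rule(condition, action, meanings):
    """Render an if-statement as a rule, replacing variables whose meaning is known."""
    def substitute(text):
        return re.sub(
            r"\b[A-Za-z_]\w*\b",
            lambda match: meanings.get(match.group(0), match.group(0)),
            text,
        )
    return f"IF {substitute(condition)} THEN {substitute(action)}"


rule = rewrite_rule("cValue <= aValue", "flag = 1", MEANINGS)
# rule == "IF Task Finish Date <= Task Start Date THEN flag = 1"
```

Variables without a known meaning, such as `flag`, are left as they are.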
Current Status & Future Research

Current

  • Implemented interactive knowledge extraction prototype consisting of SE and SA (supply chain & construction domains)
  • Developing schema matching module
  • Application of SEEK toolkit to emergency response system
    • Data collection in cooperation with City of Gainesville Fire & Rescue
    • Application to management of EOC planned

Future

  • Development of analysis module
  • Enhance DRE with the ability to improve over time and with additional use cases
Summary and Conclusion
  • SEEK is a structured approach to integrating domain-specific legacy sources
  • Modular architecture provides several important capabilities
  • (Semi)automatic knowledge extraction
    • DRE, semantic analysis, schema matching
  • Important contributions to theory of knowledge capture and integration
  • Requirement for building scalable sharing architectures
  • Enabling technology for (semi)automatic ontology creation
    • Enabler for Semantic Web?
More Info
  • M. S. Schmalz, J. Hammer, M. Wu, and O. Topsakal, "EITH - A unifying representation for database schema and application code in enterprise knowledge extraction." To be presented at 22nd International Conference on Conceptual Modeling (ER 2003), Chicago, IL, 2003.
  • “Scalable Extraction of Enterprise Knowledge.” Conditionally accepted for publication in Research Frontiers in Supply Chain Management and E-Commerce, E. Akcali, J. Geunes, P.M. Pardalos, H.E. Romeijn, and Z.J. Shen (eds.). Kluwer Science Series in Applied Optimization. (Accepted for publication in 2004.)
  • “SEEKing knowledge in legacy information systems to support interoperability.” ECAI-02 Workshop on Ontologies and Semantic Interoperability, Lyon, France, July 21-26, 2002.
  • “SEEK: accomplishing enterprise information integration across heterogeneous sources,” ITCON – Journal of Information Technology in Construction – Special Edition on Knowledge Management, 7, pp. 101-124, 2002.
  • “Robust mediation of supply chain information.” ASCE Specialty Conference on Fully Integrated and Automated Project Processes (FIAPP) in Civil Engineering, Blacksburg, VA, September 26-28, 2001, 415-425.

Web Site: http://www.dbcenter.cise.ufl.edu/seek/