
Data Mining Research and Applications



Data Mining Research and Applications. Workshop on Cyberinfrastructure for Environmental Research and Education, October 31, 2002. Steve Tanner, Information Technology and Systems Center, University of Alabama in Huntsville. [email protected], 256.824.5143, www.itsc.uah.edu.



Data Mining Research and Applications

Workshop on Cyberinfrastructure

For Environmental Research and Education

October 31, 2002

Steve Tanner

Information Technology and Systems Center

University of Alabama in Huntsville

[email protected]

256.824.5143

www.itsc.uah.edu



Key Questions:

  • What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure?

  • What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens?

  • How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system?

  • How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?



Data Mining

  • Data Mining is an interdisciplinary field drawing from areas such as statistics, machine learning, pattern recognition and others

  • Automated discovery of patterns, anomalies, etc. from vast observational and model data sets

  • Derived knowledge for decision making, predictions and disaster response

  • ADaM – Algorithm Development and Mining System

    datamining.itsc.uah.edu



Techniques used for Data Mining

Data Mining systems usually involve a toolbox of many different techniques and a means for combining them

  • Clustering Techniques

    • K Means

    • Isodata

    • Maximum

  • Pattern Recognition

    • Bayes Classifier

    • Minimum Distance Classifier

  • Image Analysis

    • Boundary Detection

    • Cooccurrence Matrix

    • Dilation and Erosion

    • Histogram Operations

    • Polygon Circumscript

    • Spatial Filtering

    • Texture Operations

  • Genetic Algorithms

  • Neural Networks

  • Etc.
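As an illustration of one toolbox entry, K Means clustering can be sketched in a few lines. This is a minimal plain-Python sketch of the standard algorithm, not ADaM's implementation; the function names `nearest` and `kmeans`, the seed, and the sample points are all assumptions made for the example.

```python
import random

def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return dists.index(min(dists))

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k of the input points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else center
            for center, members in zip(centers, clusters)
        ]
        if new_centers == centers:  # no center moved, so the result is stable
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels

# Hypothetical sample: two well-separated groups of 2-D points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels = kmeans(points, k=2)
```

The returned `labels` list gives one cluster index per input point, which is the form a later classification or visualization step would typically consume.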



Typical Everyday Encounters with Data Mining

  • Google

    • Complex algorithm sequence to decide order

  • Amazon.Com

    • Additional purchase suggestions

  • Credit Card Fraud

    • Event notification of odd usage

Most current Data Mining applications are text based. Text provides an easily readable source of heterogeneous data. Mining of scientific data sets is more complex.


User Perspective and Data Perspective of the Data Mining Process

Figure labels: Analysis, Decision, Value, Volume, Transformation, Knowledge, Preprocessing, Information, Dataset Specific Algorithms, Domain Specific Algorithms, Data, Calibration & Navigation, Data Stores, Dataset. The two halves of the figure are titled User Perspective and Data Perspective.


Data Mining

  • Provides automation of the analysis process

  • Can be used for dimensionality reduction when manual examination of data is impossible

  • Can have limitations

    • May not utilize domain knowledge

    • May be difficult to prove validity of the results

  • There may not be a physical basis

  • Should be viewed as a complementary tool and not a replacement for scientific analysis

Scientific Analysis

  • Harnesses human analysis capabilities

    • Highly creative

  • Based on theory and hypothesis formulation

    • Physical basis is normally used for algorithms

  • Drawing insights about the underlying phenomena

  • Rapidly widening gap between data collection capabilities and the ability to analyze data

  • Potential of vast amounts of data to be unused



Similarity between Data Mining and Scientific Analysis Process


Mining Environments

Mining Framework (ADaM)

  • Complete System (Client and Engine)

  • Mining Engine (User provides its own client)

  • Application Specific Mining Systems

  • Operations Tool Kit

  • Stand Alone Mining Algorithms

  • Data Fusion

Distributed/Federated Mining

  • Distributed services

  • Distributed data

  • Chaining using Interchange Technologies

On-board Mining (EVE)

  • Real time and distributed mining

  • Processing environment constraints



Using the Mining Framework: Focusing on the information in data


The ADaM Processing Model

Data states shown in the figure: Raw Data, Translated Data, Preprocessed Data, Patterns/Models, Results

Input

  • PIP-2
  • SSM/I Pathfinder
  • SSM/I TDR
  • SSM/I NESDIS Lvl 1B
  • SSM/I MSFC Brightness Temp
  • US Rain
  • Landsat
  • ASCII Grass
  • Vectors (ASCII Text)
  • HDF
  • HDF-EOS
  • GIF
  • Intergraph Raster
  • Others...

Processing: Preprocessing

  • Selection and Sampling
    • Subsetting
    • Subsampling
    • Select by Value
    • Coincidence Search
  • Grid Manipulation
    • Grid Creation
    • Bin Aggregate
    • Bin Select
    • Grid Aggregate
    • Grid Select
    • Find Holes
  • Image Processing
    • Cropping
    • Inversion
    • Thresholding
  • Others...

Processing: Analysis

  • Clustering
    • K Means
    • Isodata
    • Maximum
  • Pattern Recognition
    • Bayes Classifier
    • Min. Dist. Classifier
  • Image Analysis
    • Boundary Detection
    • Cooccurrence Matrix
    • Dilation and Erosion
    • Histogram Operations
    • Polygon Circumscript
    • Spatial Filtering
    • Texture Operations
  • Genetic Algorithms
  • Neural Networks
  • Others...

Output

  • GIF Images
  • HDF Raster Images
  • HDF Scientific Data Sets
  • HDF-EOS
  • Polygons (ASCII, DXF)
  • SSM/I MSFC Brightness Temp
  • TIFF Images
  • GeoTIFF
  • Others...


Iterative Nature of the Data Mining Process

Figure labels: DATA; PREPROCESSING; CLEANING and INTEGRATION; SELECTION and TRANSFORMATION; MINING; EVALUATION and PRESENTATION; KNOWLEDGE DISCOVERY



Distributed/Federated Mining: Meshing data and algorithms to generate knowledge



ADaM : Mining Environment for Scientific Data

  • The system provides knowledge discovery, feature detection and content-based searching for data values, as well as for metadata.

    • contains over 120 different operations

    • Operations vary from specialized science data-set specific algorithms to various digital image processing techniques, processing modules for automatic pattern recognition, machine perception, neural networks, genetic algorithms and others



Classification Based on Texture Features and Edge Density

  • Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds

  • Comparison based on

    • Accuracy of detection

    • Amount of time required to classify

Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery



Parallel Version of Cloud Extraction

  • GOES images can be used to recognize cumulus cloud fields

  • Cumulus clouds are small and do not show up well in 4 km resolution IR channels

  • Detection of cumulus cloud fields in GOES can be accomplished by using texture features or edge detectors

Figure labels: Master, Slave 1, Slave 2, Slave 3; GOES Image; Laplacian Filter; Sobel Horizontal Filter; Sobel Vertical Filter; Energy Computation (four instances); Classifier; Cloud Image; Cumulus Cloud Mask

  • Three edge detection filters are used together to detect cumulus clouds, an approach that lends itself to implementation on a parallel cluster
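The filter-then-energy flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the ADaM code: the kernels are the textbook 3x3 Laplacian and Sobel masks, "energy" is taken to be the mean squared filter response, the threshold classifier is a hypothetical stand-in, and a thread pool stands in for the worker nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Textbook 3x3 kernels; the slides do not give the exact masks used.
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_H = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
SOBEL_V = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

def filter_energy(image, kernel):
    """Apply a 3x3 kernel to interior pixels; return the mean squared response."""
    rows, cols = len(image), len(image[0])
    total = 0.0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            response = sum(kernel[i][j] * image[r - 1 + i][c - 1 + j]
                           for i in range(3) for j in range(3))
            total += response * response
    return total / ((rows - 2) * (cols - 2))

def texture_features(image):
    """Run the three filters concurrently, one task per filter."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(filter_energy, image, kernel)
                   for kernel in (LAPLACIAN, SOBEL_H, SOBEL_V)]
        return [future.result() for future in futures]

def looks_like_cumulus(image, threshold=1.0):
    """Hypothetical classifier: flag the tile if total edge energy is high."""
    return sum(texture_features(image)) > threshold

# Hypothetical tiles: a uniform scene and a checkerboard-textured scene.
flat = [[5] * 6 for _ in range(6)]
speckled = [[((r + c) % 2) * 10 for c in range(6)] for r in range(6)]
flat_features = texture_features(flat)
speckled_features = texture_features(speckled)
```

Each filter is independent of the others, which is why the work divides naturally among workers; on a real cluster each task would run as a separate process or on a separate node rather than as a Python thread.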



Automated Data Analysis for Boundary Detection and Quantification

  • Analysis of polar cap auroras in large volumes of spacecraft UV images

  • Science Rationale: Indicators to predict geomagnetic storms, which can

    • Damage satellites

    • Disrupt radio connections

  • Developing different mining algorithms to detect and quantify polar cap boundary

Polar Cap Boundary



Detecting Signatures

  • Science Rationale: Mesocyclone signatures in radar data are indicators of tornadic activity

  • Developing an algorithm based on wind velocity shear signatures

    • Improve accuracy and reduce false alarm rates
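A velocity shear test can be pictured with a deliberately simple sketch. It is not the algorithm under development: it assumes a single ring of radial velocities ordered by azimuth, and flags adjacent gates where the wind switches between inbound (negative) and outbound (positive) with a large jump. The function name, the sample values, and the 30 m/s threshold are hypothetical.

```python
def shear_signatures(radial_velocities, min_shear):
    """Indices i where velocity changes sign between gate i and gate i + 1
    and the jump between them is at least min_shear."""
    hits = []
    for i in range(len(radial_velocities) - 1):
        a, b = radial_velocities[i], radial_velocities[i + 1]
        if a * b < 0 and abs(b - a) >= min_shear:
            hits.append(i)
    return hits

# Hypothetical radial velocities (m/s) along one range ring.
ring = [2.0, 3.0, -18.0, 22.0, 4.0, 1.0]
hits = shear_signatures(ring, min_shear=30.0)
```

Raising `min_shear` trades missed detections for fewer false alarms, which is the balance the slide's accuracy goal refers to.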



Genetic Subtyping Using Hierarchical Clustering

  • Biologists are interested in comparing DNA sequences to see how closely related they are to one another

  • Phylogenetic trees are constructed by performing hierarchical clustering on DNA sequences using genetic distance as a distance measure

  • Such trees show which organisms are most likely to share common ancestors, and may provide information about how various subtypes of organisms evolved

  • This information is useful when studying disease-causing organisms such as viruses and bacteria, because genetically similar types should behave in similar ways
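The tree-building step described above can be sketched with average-linkage agglomerative clustering. The sketch makes two assumptions that the slides do not state: sequences are already aligned and of equal length, and genetic distance is the simple fraction of differing positions (real studies usually use model-based distances). The sequence names and contents are invented for the example.

```python
def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def hierarchical_tree(seqs):
    """Average-linkage agglomerative clustering; returns a nested tuple of names."""
    # Each cluster is a pair: (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                left, right = clusters[i][1], clusters[j][1]
                # Cluster distance is the mean pairwise distance of members.
                dist = sum(p_distance(seqs[a], seqs[b])
                           for a in left for b in right) / (len(left) * len(right))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical aligned sequences.
seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTCCGA", "D": "TTGTCCGT"}
tree = hierarchical_tree(seqs)
```

Here A and B differ at one site, as do C and D, so each pair joins first and the two pairs merge last.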



Mining on Data Ingest: Tropical Cyclone Detection

Advanced Microwave Sounding Unit (AMSU-A) Data

  • Mining Plan:

  • Water cover mask to eliminate land

  • Laplacian filter to compute temperature gradients

  • Science Algorithm to estimate wind speed

  • Contiguous regions with wind speeds above a desired threshold identified

  • Additional test to eliminate false positives

  • Maximum wind speed and location produced
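The mining plan above can be sketched end to end on a small grid. Everything specific here is an assumption for illustration: the wind estimate is a placeholder that scales the magnitude of the Laplacian (the actual science algorithm is not given), the false-positive test is a minimum region size, and the grid values, threshold, and parameter names are hypothetical.

```python
def laplacian(grid):
    """4-neighbour Laplacian on interior cells; border cells are left at 0."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = (grid[r - 1][c] + grid[r + 1][c] +
                         grid[r][c - 1] + grid[r][c + 1] - 4 * grid[r][c])
    return out

def regions_above(field, mask, threshold):
    """Yield lists of 4-connected unmasked cells whose value exceeds threshold."""
    rows, cols = len(field), len(field[0])
    seen = set()
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or not mask[r][c] or field[r][c] <= threshold:
                continue
            stack, region = [(r, c)], []
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols and (ny, nx) not in seen
                            and mask[ny][nx] and field[ny][nx] > threshold):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            yield region

def detect_cyclones(tb, water, wind_threshold=17.0, min_cells=2, scale=5.0):
    """Return (max_wind, (row, col)) for each candidate region over water."""
    gradient = laplacian(tb)                                    # temperature gradients
    wind = [[scale * abs(v) for v in row] for row in gradient]  # placeholder wind estimate
    detections = []
    for region in regions_above(wind, water, wind_threshold):
        if len(region) < min_cells:                             # crude false-positive test
            continue
        peak = max(region, key=lambda rc: wind[rc[0]][rc[1]])
        detections.append((wind[peak[0]][peak[1]], peak))
    return detections

# Hypothetical 5x5 brightness temperature grid with a warm core, all water.
tb = [[250.0] * 5 for _ in range(5)]
tb[2][2], tb[2][3] = 256.0, 255.0
water = [[True] * 5 for _ in range(5)]
detections = detect_cyclones(tb, water)
```

Applying the water mask inside the region search means land cells can neither start nor extend a candidate region.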

Figure labels: Data Archive; Calibration/Limb Correction/Converted to Tb; Mining Environment; Result; Knowledge Base; Further Analysis; Hurricane Floyd

Results are placed on the web, made available to the National Hurricane Center and Joint Typhoon Warning Center, and stored for further analysis

pm-esip.msfc.nasa.gov/


Multiple Mining Environments: Passive Microwave ESIP Information System

Figure labels:

  • Data Ingest & Processing: AMSU-A Ingest; AMSU Product Generation; TMI Ingest and Product Generation; ADaM-based Processing; Custom Processing

  • Distributed Data Stores: TMI; AMSU-A; SSM/I; SSM/T2

  • ADaM Servers: Input; Process; Subset/Grid/Format; Output

  • PM-ESIP: Catalog; Order; Staging

  • Web Interfaces & Applications: Data Ordering; FTP; STT Application

  • Visualization & Exploration: Temperature Trends; AMSU-A Images; Cyclone Winds


Interoperability: Accessing Heterogeneous Data

Figure "The Problem", labels: Data Format 1; Data Format 2; Data Format 3

  • Science data comes in:

    • Different formats, types and structures

    • Different states of processing (raw, calibrated, derived, modeled or interpreted)

    • Enormous volumes

  • Heterogeneity leads to data usability problems

  • One approach: Standard data formats

    • Difficult to implement and enforce

    • Can’t anticipate all needs

      • Some data can’t be modeled or is lost in translation

    • The cost of converting legacy data

  • A better approach: Interchange Technologies

    • Earth Science Markup Language

Figure "The Problem", further labels: Format Converter; Reader 1; Reader 2; Application

Figure "The Solution", labels: Data Format 1; Data Format 2; Data Format 3; ESML File (three instances); ESML Library; Application
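The interchange idea, one generic library driven by an external description of each format, can be sketched as below. This is not ESML syntax or the ESML library API: the description here is a plain Python dictionary with hypothetical keys (`skip_lines`, `delimiter`, `fields`), used only to show how the application code stays the same while the description changes.

```python
import io

def read_with_description(text_stream, description):
    """Parse rows from a text stream according to an external format description."""
    rows = []
    for line_no, line in enumerate(text_stream):
        if line_no < description["skip_lines"] or not line.strip():
            continue  # skip header lines and blank lines
        values = line.strip().split(description["delimiter"])
        rows.append({name: cast(value)
                     for (name, cast), value in zip(description["fields"], values)})
    return rows

FIELDS = [("lat", float), ("lon", float), ("tb", float)]

# Two hypothetical formats holding the same measurement.
csv_description = {"skip_lines": 1, "delimiter": ",", "fields": FIELDS}
whitespace_description = {"skip_lines": 0, "delimiter": None, "fields": FIELDS}

csv_rows = read_with_description(
    io.StringIO("lat,lon,tb\n34.7,-86.6,251.5\n"), csv_description)
ws_rows = read_with_description(
    io.StringIO("34.7 -86.6 251.5\n"), whitespace_description)
```

Supporting a new format then means writing a new description rather than a new reader or a conversion of the legacy data.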


Chained Image Processing Services

Service chaining integrates modules, or services, developed on distributed platforms and in different languages into a single processing solution.

Figure labels, Chained Services: WMS (Java/Windows); Format (Perl/Linux); Resample (Perl/C – Linux); GeoCrop (Perl/Linux); Draw Image (PERL/C – Linux); Data Reader (Java/C+ Windows)

Other figure labels: Data Streams; Data Files; ESML; ESML Lib; Knowledge Base
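Chaining can be pictured as function composition, where each service consumes the previous service's output. In this sketch local Python functions stand in for the remote services; their names echo the figure labels, but their signatures, the order of the chain, and the sample grid are assumptions.

```python
from functools import reduce

def geo_crop(grid, rows, cols):
    """Keep the half-open row and column ranges of a 2-D grid."""
    (r0, r1), (c0, c1) = rows, cols
    return [row[c0:c1] for row in grid[r0:r1]]

def resample(grid, step):
    """Keep every step-th row and column."""
    return [row[::step] for row in grid[::step]]

def format_as_text(grid):
    """Serialize the grid as space-separated text, one row per line."""
    return "\n".join(" ".join(str(v) for v in row) for row in grid)

def chain(*services):
    """Compose services left to right into one callable."""
    return lambda data: reduce(lambda acc, service: service(acc), services, data)

pipeline = chain(
    lambda g: geo_crop(g, (0, 4), (0, 4)),
    lambda g: resample(g, 2),
    format_as_text,
)

grid = [[r * 10 + c for c in range(6)] for r in range(6)]
result = pipeline(grid)
```

In a distributed setting each step would be a network call instead of a local function, but the composition logic stays the same.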


Data Integration using Web Mapping Services

Figure labels: Countries; Cyclone Events; AMSU-A Channel 01; MCS Events; Coastlines; Knowledge Base; AMSU-A; ITSC; Globe

AMSU-A data overlaid with MCS and Cyclone events for September 2000, merged with world boundaries from Globe.



Fused Displays from Multiple Servers

Analysis: Correlate MCSs and cyclones with atmospheric temperatures for September 2000.


Concept Hierarchy for Data Mining and Fusion

Figure labels: MULTI-LEVEL MINING; CONCEPT MINING; DECISION SUPPORT; EVENT A; EVENT B; CONCEPTUAL LEVEL; FEATURE SET I; FEATURE I; FEATURE II; FEATURE III; FEATURE X; FEATURE Y; Model and Observation Data; DATA FILE LEVEL



On-Board Real-Time Processing Sensor Control/Targeting

EVE – Environment for On-board Processing

  • Anomaly detection

  • Data Mining

  • Autonomous Decision Making

  • Immediate response

  • Direct satellite to Earth delivery of results

www.itsc.uah.edu/eve


A Reconfigurable Web of Interacting Sensors

Figure labels: Communications; Weather; Satellite Constellations; Military; Ground Network (three instances)



Example Plan: Threshold events in AMSU-A Streaming Data

EVE
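Thresholding a data stream can be sketched with a generator that reports each run of samples above a threshold as one event. This is an illustrative sketch only, not an EVE plan: the function name, the event tuple layout, and the sample values are assumptions.

```python
def threshold_events(stream, threshold):
    """Yield (start_index, end_index, peak_value) for each run of
    consecutive samples above threshold."""
    start = peak = None
    index = -1
    for index, value in enumerate(stream):
        if value > threshold:
            if start is None:
                start, peak = index, value  # a new event begins
            else:
                peak = max(peak, value)
        elif start is not None:
            yield (start, index - 1, peak)  # the event ended on the previous sample
            start = peak = None
    if start is not None:                   # stream ended during an event
        yield (start, index, peak)

# Hypothetical streaming samples.
samples = [1, 2, 9, 11, 3, 2, 8, 1, 12]
events = list(threshold_events(samples, 7))
```

Because the generator holds only the current event's state, it suits the constrained on-board setting where the full stream cannot be buffered.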


Data Integration and Mining: From Global Information to Local Knowledge

Figure labels: Emergency Response; Precision Agriculture; Urban Environments; Weather Prediction



Key Questions:

  • What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure?

  • What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens?

  • How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system?

  • How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?

