distributed data mining research in the nasa intelligent systems program
Download
Skip this Video
Download Presentation
Distributed Data Mining Research in the NASA Intelligent Systems Program

Loading in 2 Seconds...

play fullscreen
1 / 34

Distributed Data Mining Research in the NASA Intelligent Systems Program - PowerPoint PPT Presentation


  • 212 Views
  • Uploaded on

NASA AISRP Mtg NASA Ames, Moffett Field, CA April 4 – 6, 2005. Distributed Data Mining Research in the NASA Intelligent Systems Program. Kirk D. Borne. School of Computational Sciences, George Mason University Fairfax, Virginia [email protected] Outline. What is ISP?

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Distributed Data Mining Research in the NASA Intelligent Systems Program' - vine


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
distributed data mining research in the nasa intelligent systems program
NASA AISRP Mtg NASA Ames, Moffett Field, CA April 4 – 6, 2005

Distributed Data Mining Research in the NASA Intelligent Systems Program

Kirk D. Borne

School of Computational Sciences, George Mason University

Fairfax, Virginia

[email protected]

outline
Outline
  • What is ISP?
  • Short Descriptions
  • Space Science Project
  • Earth Science Project
what is was isp
What is (was) ISP?
  • NASA Code R / Ames ISP (Intelligent Systems Program) had 2 components:
    • Open competition (low TRL = pure research)
    • NASA Center-to-Center Mission Infusion Tasks (mid-TRL = development and application)
      • Mission Infusion Teams were tasked with taking the low-TRL technology to mid-TRL, by infusing the technology into a mission, project, or existing program.
funded isp mission infusion projects
Funded ISP Mission Infusion Projects
  • Two funded projects (both are now completed):
    • Distributed Data Mining in the NVO*
      • PI: K. Borne, GMU
      • Co-I: Cynthia Cheung, NASA
      • Collaborator: Hillol Kargupta, UMBC
      • Space Science infusion task
    • Automated Wildfire Detection (and Prediction) through Artificial Neural Networks
      • PI: Jerry Miller, NASA
      • Co-I: K. Borne, GMU
      • Collaborators: NOAA-NESDIS staff
      • Earth Science infusion task
    • Projects were funded by the IDU (Intelligent Data Understanding) component of ISP

*NVO = National Virtual Observatory

aisrp projects still more data mining
AISRP Projects – still more data mining
  • Machine Learning and Data Mining for Automatic Detection and Interpretation of Solar Events
    • Develop an automatic system for CME detection, tracking, characterization, and source region location
    • Discover specific associations between solar events (CME) and Earth events (Space Weather)
    • PI: Art Poland, GMU
    • Co-I’s: Jie Zhang, K. Borne, Harry Wechsler
  • Novel Approaches to Semi-supervised Data Exploration
    • Develop an efficient and effective automated system for astronomical object classification, with an emphasis on star-galaxy discrimination and morphological galaxy classification
    • PI: David Bazell, Eureka
    • Co-I’s: K. Borne (GMU), David Miller (Penn St.)
short descriptions of isp projects
Short Descriptions of ISP Projects
  • Distributed Data Mining (DDM) in the NVO
    • Search for examples of interacting/colliding/merging galaxies across multiple distributed databases
    • Apply Distributed Learning
    • Apply Distributed Classification
    • Use DDM algorithms being developed by UMBC group
    • Apply algorithms within NVO data environment
  • Automated Wildfire Detection (and Prediction) through Artificial Neural Networks (ANN)
    • Identify all wildfires in Earth-observing satellite images
    • Train ANN to mimic human analysts’ classifications
    • Apply ANN to new data (from 3 remote-sensing satellites: GOES, AVHRR, MODIS)
    • Extend NOAA fire product from USA to the whole Earth
slide7
Searching, retrieving, mining, integrating, and analyzing geographically distributed data repositories is one of the major challenges in data mining today
why so many telescopes and databases
Why so many telescopes and databases? …

Because …

  • Many great astronomical
  • discoveries have come
  • from inter-comparisons
  • of various wavelengths:
  • Quasars
  • Gamma-ray bursts
  • Ultraluminous IR galaxies
  • X-ray black-hole binaries
  • Radio galaxies
  • . . .
slide9
Distributed Data Mining – 2 perspectives:– robust statistical analysis of “typical” events– automated search for “rare” events

Figure: The clustering of data clouds (dc#) within a multidimensional parameter space (p#).

Such a mapping can be used to search for and identify clusters, voids, outliers, one-of-kinds, relationships, and associations among arbitrary parameters in a database (or among various parameters in geographically distributed databases).

Credit: S. G. Djorgovski

slide10
Space Science Project:Distributed Data Mining in the NVO –A case study to find colliding, interacting, and merging galaxies among the IR-luminous galaxy population
  • Examine several distributed databases (HST, 2MASS, Sloan, FIRST, IRAS)
  • Solve a particular science problem (a NVO science scenario)

NASA

Information

Power Grid

(IPG)

slide11
NVO Science Cases & Drivers(from Aspen 2001 NVO Workshop)
  • Solar System : NEOs, Long-Period Comets, TNOs, Killer Asteroids!!!
  • The Digital Galaxy : Find star streams and populations -- relics of past/present assembly phase. Identify components of disk, thick disk, bulge, halo, arms, ??
  • The Low-Surface Brightness Universe : spatial filtering, multi-wavelength searches, intersection of the image and catalog domains
  • Panchromatic Census of AGN (Active Galactic Nuclei) : Complete sample of the AGN zoo, their emission mechanisms, and their environments
  • Precision Cosmology & Large-Scale Structure : **Hierarchical Assembly History of Galaxies and Structure**, Cosmological Parameters, Dark Matter and Galaxy Biasing as f(z)
  • Precision science of any kind that depends on very large sample sizes
  • "Survey Science Deluxe"
  • Search for rare and exotic objects (e.g., high-z QSOs, high-z Sne, L/T dwarfs)
  • Serendipity : Explore new domains of parameter space (e.g., time domain, or "color-color space" of all kinds)

**This is the scientific goal of the ISP-funded project described here.

slide13
Ultra-Luminous Infrared Galaxies (ULIRGs)and other IR-Luminous Galaxies (LIRGs): Nearly 100% are involved in collisions and mergers

ULIRGs:

the most luminous

galaxies in the

Universe

slide14
Merger Tree - Galaxy Merger Family History

Past

Present

The goal of this study is to identify collision and merger remnant candidates at increasing redshift, in order to measure the galaxy hierarchical mass assembly rate as a function of cosmic epoch.

distributed data mining in the nvo
Distributed Data Mining in the NVO
  • 1. Identify classes of galaxies among several large photometric catalogs (e.g., 2MASS, Sloan DSS, FIRST, NVSS, etc.): the galaxy class is either normal or IR-luminous (the latter being indicative of collision/merger activity)
  • 2. Identify all known examples of ULIRGs:
    • linked to Starburst Galaxies, Gamma-Ray Bursts, Quasars, Hierarchical Galaxy Assembly, etc.
  • 3. Learn new properties of ULIRGs (e.g., Association Rule Mining) by examining multiple distributed databases.
  • 4. Build a classifier from these rules.
  • 5. Find new cases of ULIRGs in the distributed databases.
  • 6. Results will contribute to understanding of many classes of astronomical phenomena.
  • 7. Techniques will be applicable to NVO, LWS, other VxOs, JWST science program, ..., and E/PO projects (e.g., mining Kepler mission catalog by students; or [email protected])
slide16
An example of clustering in a 3-dimensional color-color parameter space using data from two different (distributed) astronomical databases. In this case, the 3 colors are pairings of 2MASS near-IR and Sloan optical magnitudes.

Plot provided by H.Kargupta (UMBC)

science result
Science Result

IRAS F12509+3122

  • Successfully completed one true proof-of-concept science case within small subset of the IRAS, HST, and FIRST databases.
  • We re-discovered exactly the type of object that we are hoping to find automatically with our data mining tools:
    • We found a very distant hyper-luminous infrared galaxy, one of the brightest galaxies in the known Universe.
  • This particular galaxy was previously known (catalogued), but we re-discovered it serendipitously.
  • References:
    • K.Borne, "Distributed Data Mining in the National Virtual Observatory", SPIE Data Mining & Knowledge Discovery V, vol. 5098, p. 211 (2003).
    • K.Borne, "A National Virtual Observatory (NVO) Science Case: Properties of Very Luminous IR Galaxies (VLIRGs)", in "The Emergence of Cosmic Structure", p. 307 (2003).

Redshift = 0.780

additional application areas of isp funded nvo data mining project
Additional application areas of ISP-funded NVO data mining project
  • Application of XML to distributed data mining:
    • ADQL (Astronomical Data Query Language)
    • XMLA (XML for Analysis)
    • PMML (Predictive Modeling Markup Language)
  • Application of different data mining techniques:
    • Bayes classification
    • Neural nets
    • Decision trees
    • Association rule mining
    • Genetic Algorithms for rapid data modeling
    • Supervised and Unsupervised Learning algorithms for robust classification
  • Application of Beowulfs to parallel high-performance data mining
  • Application to new mission data sets: GALEX, Spitzer, WISE, JWST, LWS, Sensor Webs, Constellations (distributed Sciencecraft)
slide19
NASA Intelligent Systems (IS) Project

Intelligent Data Understanding (IDU)

Earth Science Project

Automated Wildfire Detection Through Artificial Neural Networks

Jerry Miller (P.I.), NASA, GSFC

Dr. Kirk Borne (Co-I), GMUDr. Brian Thomas, University of Maryland

Dr. Zhenping Huang, University of Maryland

Yuechen Chi, GMU

Donna McNamara, NOAA-NESDIS, Camp Springs, MD

George Serafino , NOAA-NESDIS, Camp Springs, MD

noaa s hazard mapping system
NOAA’S HAZARD MAPPING SYSTEM

NOAA’s Hazard Mapping System (HMS) is an interactive processing system that allows trained satellite analysts to manually integrate data from 3 automated fire detection algorithms corresponding to the GOES, AVHRR and MODIS sensors. The result is a quality controlled fire product in graphic (Fig 1), ASCII (Table 1) and GIS formats for the continental US.

Figure – Hazard Mapping System (HMS) Graphic Fire Product for day 5/19/2003

overall task objectives
OVERALL TASK OBJECTIVES
  • To mimic the NOAA-NESDIS Fire Analysts’ subjective decision-making and fire detection algorithms with a Neural Network in order to:
    • remove subjectivity in results
    • improve automation & consistency
    • allow NESDIS to expand coverage globally
slide22
Hazard Mapping System (HMS) ASCII Fire Product

OLD FORMAT NEW FORMAT (as of May 16, 2003)

Lon, Lat Lon, Lat, Time, Satellite, Method of Detection

-80.531, 25.351 -80.597, 22.932, 1830, MODIS AQUA, MODIS

-81.461, 29.072 -79.648, 34.913, 1829, MODIS, ANALYSIS

-83.388, 30.360 -81.048, 33.195, 1829, MODIS, ANALYSIS

-95.004, 30.949 -83.037, 36.219, 1829, MODIS, ANALYSIS

-93.579, 30.459 -83.037, 36.219, 1829, MODIS, ANALYSIS

-108.264, 27.116 -85.767, 49.517, 1805, AVHRR NOAA-16, FIMMA

-108.195, 28.151 -84.465, 48.926, 2130, GOES-WEST, ABBA

-108.551, 28.413 -84.481, 48.888, 2230, GOES-WEST, ABBA

-108.574, 28.441 -84.521, 48.864, 2030, GOES-WEST, ABBA

-105.987, 26.549 -84.557, 48.891, 1835, MODIS AQUA, MODIS

-106.328, 26.291 -84.561, 48.881, 1655, MODIS TERRA, MODIS

-106.762, 26.152 -84.561, 48.881, 1835, MODIS AQUA, MODIS

-106.488, 26.006 -89.433, 36.827, 1700, MODIS TERRA, MODIS

-106.516, 25.828 -89.750, 36.198, 1845, GOES, ANALYSIS

goes ch2 3 78 4 03 m northern florida fire
GOES CH2 (3.78 - 4.03 μm) – Northern Florida Fire

2003: Day 126 , –82.10 Deg West Longitude, 30.49 Deg North Latitude File: florida_ch2.png

noaa nesdis fire detection system
NOAA-NESDIS FIRE DETECTION SYSTEM

WF-ABBA = Wildfire Automated Biomass Burning Alg

FIMMA = Fire Identification Mapping and Monitoring Alg

NOAA S/C

NASA TAP-OFF POINT

FOR IMAGERY

WF-ABBA FIRE DET CH’s 1, 2, 4

(0.62, 3.9, 10.7 μm)

GOES EAST-WEST

IMAGER

5 CHAN

10-BIT WDS

10-bit

MCIDAS

(COTS)

HAZARD

MAPPING

SYSTEM

(HMS)

-------

ENVI

GVAR

FORMAT

CH’S 1, 2, 4 ( 0.62, 3.9, 10.7 μm )

8-BIT WDS, LCC

FIRE

ANALYSTS

NOAA 14-17

AVHRR

5 CHAN

10-BIT WDS

Geo-correction

FIMMA FIRE DET

CH’s 2, 3b, 4, 5

(0.91, 3.7, 10.8, 12 μm)

DAILY

NOAA

FIRE

PRODUCT

(automated

algorithms

and manual

additions)

HRPT

FORMAT

TERASCAN

(COTS)

10-bit

CH’S 1, 2, 3b (0.63, 0.91, 3.7 μm)

8-BIT WDS, LCC

MODIS MOD14 FIRE PRODUCT

CH’s 2, 22, 31 (0.86, 03.9, 11 μm)

NASA S/C

TERRA-AQUA

MODIS

36 CHAN

12-BIT WDS

Bow-Tie Effect Removal

MCIDAS

(COTS)

CH’S 1, 2, 22 ( 0.66, 0.86, 3.96 μm )

8-BIT WDS, LCC

HDF

FORMAT

LCC = Lambert Conformal Conic Projection

MCIDAS = Man Computer Interactive Data Access System

simplified data extraction procedure
SIMPLIFIED DATA EXTRACTION PROCEDURE

DATA:

GOES (96 Files/day)

AVHRR (25 Files/day)

MODIS (14 Files/day)

Daily

HMS ASCII

Fire Product

Geographic Coords (lat/lon)

SpectralData

Image

Coords

Neural Network

Training Set

ENVI Function Call

Conversion to Image

Coords (row/col)

Image Ref’s

Filter Out

Bad data points

decision regions and boundaries for highly ideal scatter plot clustering patterns
DECISION REGIONS AND BOUNDARIES FOR HIGHLY IDEAL SCATTER PLOT CLUSTERING PATTERNS

Single Fire Signature

Multiple Fire Signatures

X2

X2

Surface Fire

Crown Fire

Fire

Ground Fire

Background

Background

X1

X1

scatter plot of background subtracted goes ch 1 vs ch 2
Scatter Plot of Background-Subtracted GOES CH 1 vs. CH 2

Fire (lower) and non-fire (upper) separation of clusters

2003: June 2 Northern Florida File: scatter_fires12.png

(GOES CH1, CH2, CH4 are input to neural network)

scatter plot of background subtracted goes ch 2 vs ch 4
Scatter Plot of Background –Subtracted GOES CH 2 vs. CH 4

Fire (left) and non-fire (right) separation of clusters

2003: June 2 Northern Florida File:scatter_fires22.png (GOES CH1, CH2, CH4 are input to neural network)

neural network configuration
Neural Network Configuration

Connections

(weights)

Connections

(weights)

Band A

Inputs:1 - 49

Band B

Inputs: 50 - 98

Output

Classification

Output

Layer 2

Band C

Inputs: 99 - 147

Input

Layer 0

Hidden

Layer 1

typical error matrix for modis instrument
Typical Error Matrix(for MODIS instrument)

True Positive False Positive

False Negative True Negative

TRAINING DATA

Fire NonFire Totals

2834

(TP)

173

(FP)

3007

Fire

NonFire

Totals

Neural Network Classification

3421

318

(FN)

3103

(TN)

3152

3276

6428

slide31
Typical Measures of Accuracy
  • Overall Accuracy = (TP+TN)/(TP+TN+FP+FN)
  • Producer’s Accuracy (fire) = TP/(TP+FN)
  • Producer’s Accuracy (nonfire) = TN/(FP+TN)
  • User’s Accuracy (fire) = TP/(TP+FP)
  • User’s Acuracy (nonfire) = TN/(TN+FN)

Accuracy of our NN Classification

  • Overall Accuracy = 92.4%
  • Producer’s Accuracy (fire) = 89.9%
  • Producer’s Accuracy (nonfire) = 94.7%
  • User’s Accuracy (fire) = 94.2%
  • User’s Acuracy (nonfire) = 90.7%
slide33
Summary – NVO Data Mining Applications

Data Mining Resource Guide for Space Science:

http://nvo.gsfc.nasa.gov/nvo_datamining.html

Sample Data Mining Applications within the NVO:

  • Discover data stored in geographically distributed heterogeneous systems.
  • Search huge databases for trends and correlations in high-dimensional parameter spaces: identify new properties or new classes of objects.
  • Search for rare, one-of-a-kind, and exotic objects in huge databases.
  • Identify temporal variations in objects from millions or billions of observations.
  • Identify moving objects in huge survey catalogs and image databases.
  • Identify parameter glitches / anomalies / deviations either in static databases (e.g., archives) or in dynamic data (e.g., science / telemetry / engineering data streams from remote satellites).
  • Find clusters, nearest neighbors, outliers, and/or zones of avoidance in the distribution of astrophysical objects or other observables in arbitrary parameter spaces.
  • Serendipitously explore the huge databases that will be part of the NVO, through access to distributed, autonomous, federated, heterogeneous, multi-wavelength, multi-mission astrophysics data archives.

http://www.us-vo.org/

addressing nasa exploration challenges through intelligent data understanding
Addressing NASA Exploration Challengesthrough Intelligent Data Understanding
  • Autonomy: “making systems more intelligent”
  • Robotic Networks: “enabling networks of cooperating robotic systems”
  • Data-Rich Virtual Presence: “local and remote, both real-time and asynchronous virtual presence to enable effective science and robust operations (including tele-presense , tele-science, tele-supervision)”

Source: Human & Robotic Technology,

Program Formulation Plan, 15 May 2004

ad