Environmental Data Warehousing and Mining

Nabil R. Adam, Vijay Atluri, Dihua Guo, Songmei Yu. Rutgers University CIMIC. NSF Workshop on Next Generation Data Mining (NGDM02), November 1-3, 2002.


Environmental Data Warehousing and Mining

Nabil R. Adam

Vijay Atluri, Dihua Guo, Songmei Yu

Rutgers University CIMIC

NSF Workshop on Next Generation Data Mining NGDM02

November 1-3, 2002



Outline

  • Setting

    • A Real-world Lab – The NJ Meadowlands Area

  • Motivating Examples

  • Environmental Data Warehousing

  • Environmental Data Mining

RUTGERS-CIMIC/MERI



MERI (Meadowlands Environmental Research Institute)

  • Established in 1998 as a Collaboration between The New Jersey Meadowlands Commission and Rutgers CIMIC.

  • Provides a world class environmental research institute for urban and coastal wetlands focused on the district

  • Administered by Rutgers-CIMIC

    Mission

  • Conduct and sponsor research in ecology, environmental science and information technology to monitor, preserve and improve ecological and human health and welfare in the Meadowlands District, NJ.


MERI

  • Budget $1.6 Million/year (2002-2007)

  • Staff

    • Faculty, Students, FT NJMC/Rutgers Scientists/Staff

  • Disciplines

    • Biology, Ecology, Geology, Environmental Sc., Hydrological Modeling, Remote Sensing/Geographic Information Systems, and Information Technology

  • Work closely with the NJ Meadowlands Commission to disseminate research results:

    • To the scientific community and the various government agencies

    • Information and technology transfer to local municipalities

    • Develop scientific content for education and exhibits

    • Provide high school and college students with science internships


Digital Meadowlands

[Diagram: the Digital Meadowlands system and its users. Labeled elements: Satellite Imagery (AVHRR), NASA archives, Radar, Aerial Photos, Monitoring Stations, Sensors, Maps, documents, Processed Satellite Images, Environmental Parameters, Reports, Interactive Maps, 3D visualization, Fly-by/Drill-down, Users.]



Visualization Drill-down


  • ASTER (Advanced Spaceborne Thermal Emission and Reflection Radiometer)

  • Ground resolution: 15 m (bands 1-3), 30 m (bands 4-9), 90 m (bands 10-14)

  • Spectral bands: 14

  • Swath width: 60 km

  • Application:

    • Daily monitoring of flood-prone areas. Flood-prone areas are shown in red. Under flood conditions the sensor would detect water (blue) covering the flood-prone areas (red).


Satellite Images

  • Various data sources and data types

    • various types of satellite images with different resolutions captured by different sensors

      • AVHRR: direct downloads from polar-orbiting satellites (NOAA 12, NOAA 14 and NOAA 15), 1 km resolution, 5 bands

      • LANDSAT and RADAR: obtained from NASA archives, 30 m resolution, 7 bands

      • Hyper-spectral images: 34-band images from the AISA (Airborne Imaging Spectrometer for Applications) sensor, 250-1000m resolution

      • Aerial ortho-photographs: high resolution (1 m) images (IKONOS, QUICKBIRD)

      • MODIS: images for global dynamics and processes occurring on the land, in the oceans, and in the lower atmosphere from NASA's satellites Terra and Aqua, 250-1000 m resolution, 36 bands

      • ASTER: detailed maps of land surface temperature, emissivity, reflectance and elevation from NASA's satellite Terra, 14 bands, 15-90 m resolution


Automated, near real-time monitoring system

[Diagram: a weather station with data logger, and a water monitor wired to the data logger, connect over a modem link to a central computer that ingests the data and stores it in a database. Users access the database through the World Wide Web (WWW interface).]



Real Time Data from Monitoring Stations


Use of Water Quality Data: Tracking the effectiveness of pollution control measures

Regulatory minimum level



One Example of Satellite Imagery: AVHRR


Information Sources -- Traditional

  • Include

  • A Library of a large variety of Documents

    • Scientific publications

    • Guidelines and Regulations

    • Measurements and Impact Studies

  • Documents contain Text, Tables, Pictures, Drawings and Maps

  • Census information that describes the socio-economic and health characteristics of the population


The User Community

  • Researchers, faculty, graduate students in a variety of disciplines including biology, ecology, geology, environmental science, and IT:

    • make scientific observations such as the changes in vegetation pattern and its effect on temperature over the years

  • Policy Makers:

    • query various critical parameters such as ambient air and water quality and visualize the results in a graphical form

    • gain help in the evaluation and formulation of environmental policies

  • The Public:

    • learn information about their county, community, home on such issues as environment, health, and infrastructure

  • K-12 Educators and students


The Data Volume

  • Satellite images

    • AVHRR: about 50 MB per image and 2-4 images per satellite per day; CIMIC downloads images from 3 satellites, generating about 15 GB of data per month

    • MODIS and ASTER images are available every day or every other day.

    • IKONOS, QUICKBIRD 1 m resolution images (7.5-minute quad): each image would be roughly 15000*12000*8, which means 1.44 GB.

  • Our Mass Storage

    • EMC CLARiiON FC4500

    • Capacity – up to 18TB

    • Good cost per MB, excellent performance, scalability, and flexibility

      • It satisfies the needs of the Online Querying Information System

      • 1 GB cache: optimized by a script for reads or writes at different times

      • The backplane provides a data transfer rate of 200 MB/s from the disks to the fiber channel port, which transfers the data over the fiber channel cables to the host at 100 MB/s.

      • Additional fiber channel

      • Very flexible configuration capabilities
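
The volume figures above can be checked with some quick arithmetic. The sketch below assumes a 30-day month, decimal units (1 GB = 1000 MB), and that the factor 8 in the IKONOS/QUICKBIRD estimate means 8 bytes per pixel; none of these assumptions is stated on the slide.

```python
# Back-of-the-envelope check of the data volumes quoted above.
# Assumptions (not from the slide): 30-day month, decimal units,
# and the factor "8" read as 8 bytes per pixel.

AVHRR_IMAGE_MB = 50
SATELLITES = 3
DAYS_PER_MONTH = 30


def avhrr_monthly_gb(images_per_satellite_per_day: float) -> float:
    """Monthly AVHRR download volume in GB."""
    mb = (AVHRR_IMAGE_MB * images_per_satellite_per_day
          * SATELLITES * DAYS_PER_MONTH)
    return mb / 1000


def image_size_gb(width: int, height: int, bytes_per_pixel: int) -> float:
    """Uncompressed raster size in GB."""
    return width * height * bytes_per_pixel / 1e9


if __name__ == "__main__":
    low, high = avhrr_monthly_gb(2), avhrr_monthly_gb(4)
    print(f"AVHRR: {low:.1f} to {high:.1f} GB per month")
    print(f"1 m quad image: {image_size_gb(15000, 12000, 8):.2f} GB")
```

Under these assumptions, 2 to 4 images per satellite per day gives 9 to 18 GB per month, which brackets the quoted 15 GB.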


Environmental Knowledge Discovery

Examples

Data Warehousing

Data Mining

How and Why

Hypothesis testing



Motivating Examples (1)

  • Identify a natural disturbance affecting wetland vegetation, such as fire, pathogen infestation or wilting by drought, in the New Jersey Meadowlands

    • What should we have?

      • A time series of satellite images (a few years)

      • Calculated soil and vegetation indices for images

      • Digital elevation models (DEM) of Meadowlands

      • Precipitation record for time series

      • Zoning designation for area being observed

    • What do we need to do?

      • Identify the sudden drop in the vegetation index (NDVI) in areas where NDVI has been consistently high through time (outlier detection)

      • Determine areas where suddenly the soil index is high due to the exposure of bare mineral soil. (classification)

      • Combine high soil index with low NDVI and precipitation record to determine the occurrence of vegetation disturbance (characterization)
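
The steps above can be sketched for a single pixel's time series. This is a minimal illustration, not the authors' method: the thresholds (`high`, `drop`, `soil_high`, `dry_mm`), the window length `history`, and the rule order in `characterize` are all hypothetical choices.

```python
from typing import List


def sudden_ndvi_drops(series: List[float], high: float = 0.6,
                      drop: float = 0.3, history: int = 3) -> List[int]:
    """Return indices t where NDVI stayed >= `high` over the previous
    `history` observations and then fell by at least `drop` below
    their mean (a simple outlier test)."""
    flagged = []
    for t in range(history, len(series)):
        past = series[t - history:t]
        consistently_high = all(v >= high for v in past)
        baseline = sum(past) / history
        if consistently_high and baseline - series[t] >= drop:
            flagged.append(t)
    return flagged


def characterize(ndvi_dropped: bool, soil_index: float, precip_mm: float,
                 soil_high: float = 0.5, dry_mm: float = 20.0) -> str:
    """Toy rule combining NDVI, soil index and precipitation."""
    if not ndvi_dropped:
        return "no disturbance"
    if precip_mm < dry_mm:
        return "possible drought wilting"
    if soil_index >= soil_high:
        return "possible disturbance exposing bare soil"
    return "unexplained NDVI drop"


if __name__ == "__main__":
    pixel = [0.71, 0.74, 0.72, 0.73, 0.31, 0.35]
    print(sudden_ndvi_drops(pixel))
    print(characterize(True, soil_index=0.7, precip_mm=85.0))
```

In practice the same test would run over every pixel of the image stack, and the thresholds would be calibrated against known disturbance events.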


Motivating Examples (2)

  • Find bird resting patterns along the eastern seaboard migration corridor

    • Data needs

      • Extent of ecosystems that support invertebrates along the migration corridor

      • Availability of invertebrates in water and sediments through the migrating period.

    • What we need to know

      • The number of birds and bird types as related to the availability of food at each rest stop. (trend detection)

      • Detect abnormal bird populations (low or high) which are not explained by availability of food at specific resting stops. (outlier detection)
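
One way to read "not explained by availability of food" is as a large residual from a fitted trend. The sketch below fits a least-squares line of bird count against a food-availability score and flags stops whose residual exceeds `k` standard deviations. The linear model, the score itself, and the cutoff `k` are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def unexplained_stops(food: List[float], birds: List[float],
                      k: float = 2.0) -> List[int]:
    """Indices of rest stops whose bird count deviates from the
    food-based trend by more than k residual standard deviations."""
    intercept, slope = fit_line(food, birds)
    residuals = [b - (intercept + slope * f) for f, b in zip(food, birds)]
    spread = pstdev(residuals)
    if spread == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r) > k * spread]


if __name__ == "__main__":
    food = [1, 2, 3, 4, 5, 6, 7, 8]            # hypothetical scores
    birds = [10, 20, 30, 100, 50, 60, 70, 80]  # stop 3 is unusually busy
    print(unexplained_stops(food, birds))
```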


Motivating Examples (3)

  • Investigate the associations between change in forest cover and illegal exploitation of protected tropical forests

    • Data we need

      • Satellite image/maps

      • Calculated deforestation rates using NDVI indices

      • Data on truck movement

      • Records on ship movements from local ports

      • Data on migrant worker camps

    • What can we get?

      • Relate deforestation rate to new road construction and truck traffic in areas where the topography and local ecosystems support exotic tropical trees. (association detection)


Other Motivating Examples

  • Hydroclimatological study (Praveen Kumar and Amanda BT. White, 2002)

    • How can we link the changes in NDVI to changes in the hydrologic condition?

    • Can we distinguish between the changes due to various factors, such as inter-annual climate variability and human action impact?

    • Is it possible to distinguish between variabilities related to inter-annual and long-term trends?

    • Is there correlation between NDVI variations and ecoregion, or between NDVI with other parameters, such as climate, physiography, topography, or hydrology?

    • Are the trends confined to certain regions? Is the nature of the variability and trend different in different regions?

    • Are there any systematic changes over last 10 years?

    • Are there regions where changes are attributable to human impact, such as logging?


Environmental Data Warehousing (EDW)

  • Poses a number of challenging requirements with respect to

    • The design of the data model due to the nature of analytical operations to be performed

    • The nature of the views to be maintained by the environmental warehouse.


EDW Challenges: Nature of the Environmental Data

  • Each dimension in itself is multi-dimensional in nature, e.g.,

    • raster images such as satellite downloads

      • used to generate various images of different types including land-use, water, temperature, NDVI

      • each of them has multiple dimensions

        • the geographic extent and coordinates

        • the time and date of its capture

        • resolution, ...

    • regional maps represented as vector data

      • temporal and spatial

    • streaming data collected from various sensors

      • Temperature, air quality, atmospheric pressure, water quality: dissolved oxygen, mineral contents, salinity

      • geographic location (spatial dimension)

      • temporal dimension


Nature of the Environmental data

[Figure: schema diagram of the environmental data. Recoverable labels: Image ID, Land-use, Category, Vector Map, Themes, Types, Temperature, Vegetation, Water, Developed, Barren, Forested upland, Etc., Dot, Line, Polygon, NDWT Index, Chlorophyll. The labels Time (Year, Date, Timestamp), Spatial ((LX, LY), (UX, UY), Resolution) and Attributes appear repeatedly across the diagram.]


Nature of the Environmental data

  • Each dimensional table is itself multi-dimensional by nature

  • Traditional data warehouse models are not suitable for an environmental data warehouse

  • Our Proposal: cascaded star schema
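
The slide does not define the cascaded star schema in detail. The sketch below shows one possible reading, in which each dimension of a central fact is itself a small star with its own time, spatial and attribute sub-dimensions. All class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TimeDim:
    year: int
    date: str        # e.g. "2002-10-14"
    timestamp: str   # e.g. "2002-10-14T15:30:00"


@dataclass
class SpatialDim:
    lower: Tuple[float, float]   # (LX, LY)
    upper: Tuple[float, float]   # (UX, UY)
    resolution_m: float


@dataclass
class SubStar:
    """A dimension that is itself multi-dimensional."""
    theme: str
    time: TimeDim
    spatial: SpatialDim
    attributes: Dict[str, float] = field(default_factory=dict)


@dataclass
class CentralFact:
    """Central fact whose dimensions are themselves stars."""
    image_id: str
    dimensions: Dict[str, SubStar] = field(default_factory=dict)


def themes_in_year(fact: CentralFact, year: int) -> List[str]:
    """Names of the dimensions whose time sub-dimension matches `year`."""
    return sorted(name for name, dim in fact.dimensions.items()
                  if dim.time.year == year)


def build_example() -> CentralFact:
    box = SpatialDim(lower=(0.0, 0.0), upper=(500.0, 500.0), resolution_m=30.0)
    fact = CentralFact(image_id="img-001")
    fact.dimensions["vegetation"] = SubStar(
        "vegetation", TimeDim(2002, "2002-10-14", "2002-10-14T15:30:00"),
        box, {"ndvi_mean": 0.62})
    fact.dimensions["water"] = SubStar(
        "water", TimeDim(2002, "2002-10-14", "2002-10-14T15:30:00"), box)
    fact.dimensions["land_use"] = SubStar(
        "land_use", TimeDim(2001, "2001-06-01", "2001-06-01T00:00:00"), box)
    return fact


if __name__ == "__main__":
    print(themes_in_year(build_example(), 2002))
```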


EDW Challenges: Complex Nature of Queries (1)

  • Retrieve changes in the vegetation pattern over a certain region during the last 10 years, and their effect on the regional maps over that time period

    • requires

      • layering of the images representing the vegetation patterns with those of the maps whose time intervals of validity overlap

      • traverse along this temporal dimension with the overlaid image

    • In the traditional data warehouse sense,

      • first construct two data cubes along the time dimensions for each of the vegetation images and maps

      • then fuse these two cubes into one
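
The fusion step can be illustrated on the time dimension alone. In this sketch each cube is reduced to a list of validity intervals, and two entries are paired whenever their intervals overlap. The year-granularity intervals and the labels are hypothetical simplifications.

```python
from typing import List, Tuple

# (start_year, end_year inclusive, label)
Interval = Tuple[int, int, str]
Fused = Tuple[int, int, str, str]


def fuse_by_time(veg: List[Interval], maps: List[Interval]) -> List[Fused]:
    """Pair vegetation images with maps whose validity intervals
    overlap, keeping only the shared part of each interval."""
    fused = []
    for v_start, v_end, v_label in veg:
        for m_start, m_end, m_label in maps:
            start, end = max(v_start, m_start), min(v_end, m_end)
            if start <= end:  # the two intervals overlap
                fused.append((start, end, v_label, m_label))
    return sorted(fused)


if __name__ == "__main__":
    vegetation = [(1992, 1995, "veg_A"), (1996, 2001, "veg_B")]
    regional_maps = [(1990, 1997, "map_1"), (1998, 2002, "map_2")]
    for layer in fuse_by_time(vegetation, regional_maps):
        print(layer)
```

Traversing the fused list in order corresponds to moving along the temporal dimension with the overlaid image.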


Demo

  • http://cimic.rutgers.edu/~songmei/dw.html


EDW Challenges: Complex Nature of Queries (2)

  • Observe the changes in the surface water and population due to the changes in the vegetation pattern

    • fusion of multiple cubes is required

  • Simulate a fly-by over a region starting with a specific point and elevation, and traverse the region on a specific path with reducing elevation levels at a certain speed, and reaching a destination (a 3-dimensional trajectory)

    • Requires

      • retrieving images that span adjacent regions that overlap the spatial trajectory, but with increasing resolution levels to simulate the effect of reduced elevation level

      • display them at a speed that matches the desired velocity of the fly-by.
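
A rough sketch of the fly-by retrieval follows. For each leg of the trajectory it picks the image that covers the current position and whose resolution is closest to the one wanted at that elevation, then computes how long to display it. The catalog, the linear elevation-to-resolution factor `res_per_elev`, and all names are hypothetical.

```python
import math
from typing import List, NamedTuple, Tuple


class Image(NamedTuple):
    name: str
    bbox: Tuple[float, float, float, float]  # (lx, ly, ux, uy) in metres
    resolution_m: float


def covers(img: Image, x: float, y: float) -> bool:
    lx, ly, ux, uy = img.bbox
    return lx <= x <= ux and ly <= y <= uy


def plan_flyby(path: List[Tuple[float, float, float]],
               catalog: List[Image], speed_m_s: float,
               res_per_elev: float = 0.01) -> List[Tuple[str, float]]:
    """path holds (x, y, elevation) points. Returns one
    (image name, seconds to display) pair per leg."""
    frames = []
    for start, end in zip(path, path[1:]):
        x, y, elevation = start
        wanted = elevation * res_per_elev  # lower altitude, finer pixels
        candidates = [img for img in catalog if covers(img, x, y)]
        if not candidates:
            continue  # no imagery for this leg
        best = min(candidates, key=lambda img: abs(img.resolution_m - wanted))
        frames.append((best.name, math.dist(start, end) / speed_m_s))
    return frames


if __name__ == "__main__":
    catalog = [
        Image("avhrr", (0, 0, 100000, 100000), 1000.0),
        Image("landsat", (0, 0, 50000, 50000), 30.0),
        Image("ikonos", (0, 0, 10000, 10000), 1.0),
    ]
    path = [(40000, 40000, 100000), (20000, 20000, 3000),
            (5000, 5000, 100), (4000, 4000, 50)]
    for name, seconds in plan_flyby(path, catalog, speed_m_s=500.0):
        print(name, round(seconds, 1))
```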


    EDW Challenges: Efficient Software and Mature Technology

    • We need software applications to manage and manipulate images efficiently, either through preset operations or ad hoc

      • Example of calculating NDVI

        select (char) ( 255.0 * (band2 - band1)/(band2 + band1))

        [1000:1500, 1000:1500]

        from landsat_band1 as band1, landsat_band2 as band2

    • Example from the database area: RasDaMan

      • A basic research project sponsored by the European Community to develop comprehensive MDD database technology

      • Multi-dimensional data models (MDD) to store images

      • Interacts with Oracle for meta data and blob management
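
For readers unfamiliar with array queries, the NDVI query above can be mirrored in plain Python on nested lists. The sketch assumes that the window bounds are inclusive, that negative values are clipped to 0 before the byte cast, and that a zero denominator yields 0. The query itself does not state any of these choices.

```python
from typing import List

Band = List[List[int]]


def ndvi_byte(band1_value: int, band2_value: int) -> int:
    """255 * (band2 - band1) / (band2 + band1), clipped to 0..255."""
    total = band1_value + band2_value
    if total == 0:
        return 0  # assumption: avoid division by zero
    value = 255.0 * (band2_value - band1_value) / total
    return max(0, min(255, int(value)))


def ndvi_window(band1: Band, band2: Band, row_lo: int, row_hi: int,
                col_lo: int, col_hi: int) -> Band:
    """Apply ndvi_byte over an inclusive window of both bands."""
    return [[ndvi_byte(band1[r][c], band2[r][c])
             for c in range(col_lo, col_hi + 1)]
            for r in range(row_lo, row_hi + 1)]


if __name__ == "__main__":
    band1 = [[50, 100], [0, 80]]
    band2 = [[150, 100], [0, 20]]
    print(ndvi_window(band1, band2, 0, 1, 0, 1))
```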


    RasDaMan (1)


    RasDaMan (2)

    • Distinguished Features

      • A clear distinction is made between the logical (query) level and the physical (storage organization and data transmission) level of array management.

      • On the conceptual level, arrays are treated as a general data abstraction: they can be of any dimensionality, they can have an arbitrary (fixed or variable) number of elements per dimension, and both primitive and derived types are admissible as array base types.

      • The model has formal set-algebraic semantics based on AFATL Image Algebra, a rigid mathematical framework able to express any image transformation.

      • On the physical level, a novel combination of tiling and spatial indexing allows for the efficient execution of queries on MDD while offering the benefits of conventional database technology, such as query performance depending on the result set (and not on the overall data set size), concurrency control, support for crash recovery, and transaction management.

      • A data definition language for multidimensional arrays, together with a SQL-based and optimized query language called RasQL allows for powerful associative retrieval and data manipulation
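
The claim that query cost depends on the result set rather than the overall data size can be illustrated with regular tiling. The sketch below is not RasDaMan's implementation: it uses fixed square tiles and integer arithmetic in place of a spatial index, and the image and tile sizes are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (lo, hi)


def tiles_for_query(query: Tuple[Range, Range],
                    tile_size: int) -> List[Tuple[int, int]]:
    """Tile coordinates that a 2-D range query must read."""
    (r0, r1), (c0, c1) = query
    return [(tr, tc)
            for tr in range(r0 // tile_size, r1 // tile_size + 1)
            for tc in range(c0 // tile_size, c1 // tile_size + 1)]


def total_tiles(shape: Tuple[int, int], tile_size: int) -> int:
    """Number of tiles needed to store the whole array."""
    rows, cols = shape
    # -(-n // t) is integer ceiling division
    return (-(-rows // tile_size)) * (-(-cols // tile_size))


if __name__ == "__main__":
    touched = tiles_for_query(((1000, 1500), (1000, 1500)), tile_size=512)
    print(len(touched), "of", total_tiles((8000, 8000), 512), "tiles read")
```

For the window [1000:1500, 1000:1500] on an 8000 by 8000 image with 512-pixel tiles, only the tiles that intersect the window are read, regardless of how large the stored image is.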


    Ongoing Work

    • Formulating the necessary primitives for the specification and execution of queries

    • Extending the OLAP operations for the cascaded star

      • roll-up: aggregating on a specific dimension, i.e., summarize data

      • drill-down: from higher level summary to lower level detailed

      • slicing: projecting data along a subset of dimensions with an equality selection of other dimensions

      • dicing: similar to slicing except that instead of equality selection of other dimensions, a range selection is used

      • pivoting: reorient the multidimensional cube

      • zoom-in, zoom-out, aggregation of views using the above OLAP operations
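
The classical operations listed above can be sketched on a tiny cube stored as a dictionary from coordinate tuples to measures. This covers only the standard OLAP semantics, not the cascaded-star extensions. The axes (year, region, cover type) and the values are made up.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

Cube = Dict[Tuple, float]


def roll_up(cube: Cube, axis: int) -> Cube:
    """Aggregate (sum) over one axis, removing it from the key."""
    out = defaultdict(float)
    for key, value in cube.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)


def slice_cube(cube: Cube, axis: int, value) -> Cube:
    """Equality selection on one axis, projecting onto the others."""
    return {k[:axis] + k[axis + 1:]: v
            for k, v in cube.items() if k[axis] == value}


def dice(cube: Cube, axis: int, lo, hi) -> Cube:
    """Range selection on one axis, keeping all axes."""
    return {k: v for k, v in cube.items() if lo <= k[axis] <= hi}


def pivot(cube: Cube, order: Sequence[int]) -> Cube:
    """Reorient the cube by permuting its axes."""
    return {tuple(k[i] for i in order): v for k, v in cube.items()}


EXAMPLE: Cube = {
    (2000, "north", "marsh"): 10.0,
    (2000, "south", "marsh"): 4.0,
    (2001, "north", "marsh"): 8.0,
    (2001, "south", "marsh"): 5.0,
    (2001, "south", "water"): 2.0,
}

if __name__ == "__main__":
    print(roll_up(EXAMPLE, axis=1))        # summarize over regions
    print(slice_cube(EXAMPLE, 0, 2001))    # fix the year
    print(dice(EXAMPLE, 0, 2001, 2001))    # range on the year
```

Drill-down is the inverse of roll-up and needs the detailed cube to be kept, which is why `roll_up` returns a new dictionary instead of modifying its input.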


    Environmental Data Mining – Challenges (1)

    • How can we mine spatial and non-spatial data from multispectral satellite images and thematic maps? (Krzysztof Koperski, Junas Adhikary, and Jiawei Han, 1996)

      • Current research uses only a single type of map or image

      • Mine them at the same time

      • Resolutions are different

      • The representations of the thematic maps are different

    • How to deal with the complex relationships among objects (Krzysztof Koperski, Junas Adhikary, and Jiawei Han, 1996)

      • Relationships

        • Spatial relationship: distance

        • Topological relationship: disjoint, overlap, far away, etc.

        • Direction

      • Current clustering methods represent a large object by its centroid, which works for objects of similar size and regular shape but not when, e.g., one of them is a very long, narrow band


    Environmental Data Mining – Challenges (2)

    • How to utilize the various data seamlessly

      • The diverse data types

        • Structured data: vector, raster, relational database

        • Unstructured data: text, multimedia, and geo-referenced stream data.

      • Needs supporting data

        • Some can be found in the Data Warehouse: summary, average

        • Some need to be created on the fly: variation, etc.

    • How to utilize the geographic visualization tool

      • Can it replace statistical visualization tools in some areas?


    Data Mining Techniques

    • From the motivating examples we notice that several data mining techniques are involved

      • Segmentation

        • Clustering

        • Classification

      • Rule detection

      • Trend detection

      • Outlier detection


    EDM Techniques – Rule detection

    • Examples:

      • Can we distinguish between the changes due to various factors, such as inter-annual climate variability and human action impact?

      • Can we link the changes in NDVI to changes in the hydrologic conditions, or changes in population?

    • Various rules

      • Characteristic rules: one characteristic of data

      • Discriminant rules: the feature discriminating or contrasting a class of data from other classes

      • Association rules: one set of features is correlated with another set of features


    EDM Techniques – Association Rule detection

    • Algorithms

      • Classic algorithms

        • Apriori: finds frequent itemsets for Boolean association rules (Jiawei Han and Micheline Kamber, 2000)

        • Statistical techniques: regression model

      • Spatial data mining algorithm (Krzysztof Koperski and Jiawei Han, 1996)

        • a top-down search technique

        • Use spatial approximation

        • Preprocessing is required for object recognition

    • A comprehensive algorithm is needed for mining a combination of spatial and non-spatial data at the same time
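
As a reminder of how the classic Apriori search works, here is a compact sketch over Boolean features. Treating each map cell as a "transaction" of feature flags is an illustrative assumption, and the feature names are invented. It does not address the spatial extensions discussed above.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def apriori(transactions: List[Set[str]],
            min_support: float) -> Dict[FrozenSet[str], float]:
    """Return every frequent itemset with its support (level-wise search)."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)

    def support(itemset: FrozenSet[str]) -> float:
        return sum(1 for t in sets if itemset <= t) / n

    items = sorted({i for t in sets for i in t})
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]
    frequent: Dict[FrozenSet[str], float] = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in frequent
                          for s in combinations(c, k - 1))
                   and support(c) >= min_support]
    return frequent


def confidence(frequent: Dict[FrozenSet[str], float],
               lhs: Iterable[str], rhs: Iterable[str]) -> float:
    """Confidence of the rule lhs -> rhs from stored supports."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    return frequent[lhs | rhs] / frequent[lhs]


if __name__ == "__main__":
    cells = [
        {"near_road", "ndvi_drop", "truck_traffic"},
        {"near_road", "ndvi_drop"},
        {"near_road", "truck_traffic", "ndvi_drop"},
        {"ndvi_stable"},
        {"near_road", "ndvi_stable"},
    ]
    found = apriori(cells, min_support=0.6)
    for itemset, s in found.items():
        print(sorted(itemset), s)
    print(confidence(found, {"near_road"}, {"ndvi_drop"}))
```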


    EDM Techniques - Segmentation

    • Example

      • Are the trends confined to certain regions? Is the nature of the variability and trend different in different regions?

      • Is it possible to distinguish between variability related to inter-annual and long-term trends?

    • Clustering:

      • Groups spatial objects such that objects in the same group are similar and objects in different groups are unlike each other.

    • Classification:

      • Selects a relevant set of attributes and attribute values that determine an effective mapping of spatial objects into pre-defined target classes. (H. J. Miller and J. Han, 2001)

      • Name a set of pre-determined classes (inter-annual changes, long term changes)


    EDM Techniques – Segmentation (Cont’d)

    • Algorithms

      • Classification: the classes are pre-defined

        • Decision tree induction

        • Bayesian classification

      • Cluster

        • Partitioning algorithms: k-means method, k-medoids method

          • The problem here is that the result depends strongly on the initial guess of the centroids

        • Hierarchical algorithms: AGNES, DIANA, BIRCH, CURE

          • The hierarchical algorithms are not optimal for large datasets

        • Density-based: DBSCAN, OPTICS, DENCLUE

          • Handles only dots (points), without meaningful interpretation

        • Grid-based: STING, WaveCluster, CLIQUE

          • How to partition high-dimensional data
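
The sensitivity of k-means to its starting centroids is easy to reproduce. The sketch below runs a plain one-dimensional k-means (Lloyd iterations) on the same six points from two different initial guesses. The data and the initial centroids are chosen purely for illustration.

```python
from typing import List, Sequence, Tuple


def kmeans_1d(points: Sequence[float], centroids: Sequence[float],
              max_iter: int = 100) -> Tuple[List[float], List[List[float]]]:
    """Plain k-means on scalars, starting from the given centroids."""
    centroids = list(centroids)
    clusters: List[List[float]] = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, clusters


def sse(centroids: Sequence[float], clusters: Sequence[Sequence[float]]) -> float:
    """Sum of squared distances of points to their cluster centroid."""
    return sum((p - c) ** 2 for c, members in zip(centroids, clusters)
               for p in members)


if __name__ == "__main__":
    data = [0, 1, 10, 11, 20, 21]
    for start in ([0, 10, 20], [0, 1, 15]):
        cents, groups = kmeans_1d(data, start)
        print(start, "->", cents, "SSE =", sse(cents, groups))
```

Both runs converge, but the second start ends in a much worse local optimum: two singleton clusters and one cluster holding the remaining four points.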


    EDM Techniques – Outlier Detection

    • To find inconsistencies and abnormalities

    • Example

      • Can we identify abnormal changes in NDVI or in particular species?

      • Has it been unusually hot this October?

    • Algorithms (Raymond T. Ng, 2001)

      • Distribution-based approach: outliers are the objects that do not follow the assumed standard distribution.

        • Hard to know the distribution

        • Not suitable for high-dimensional datasets

      • Depth-based method: represent the data in k-dimensional space and assign a depth to each object.

        • Does not scale beyond 3-D

      • Distance-based outlier detection

        • Require the existence of an appropriate distance function
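
A distance-based outlier can be stated as: an object is an outlier if at least a fraction `p` of the other objects lie farther than distance `d` from it. The naive sketch below implements that test with Euclidean distance. The quadratic nested loop, the choice of Euclidean distance, and counting the fraction over the other objects only are simplifying assumptions.

```python
import math
from typing import List, Sequence, Tuple


def db_outliers(points: Sequence[Tuple[float, ...]],
                p: float, d: float) -> List[int]:
    """Indices of points for which at least a fraction p of the
    other points lie farther than d away."""
    n = len(points)
    outliers = []
    for i, a in enumerate(points):
        far = sum(1 for j, b in enumerate(points)
                  if j != i and math.dist(a, b) > d)
        if n > 1 and far >= p * (n - 1):
            outliers.append(i)
    return outliers


if __name__ == "__main__":
    readings = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
    print(db_outliers(readings, p=0.9, d=3.0))
```

Swapping `math.dist` for a domain-specific function is where the requirement for an appropriate distance function shows up in practice.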


    References

    [1] Praveen Kumar and Amanda BT. White, “Scalable Knowledge Discovery for Hydroclimatological Studies”, University of Illinois, 2002.

    [2] H. J. Miller and J. Han, “Geographic Data Mining and Knowledge Discovery”, Taylor & Francis, 2001.

    [3] Nabil Adam, Vijay Atluri, Songmei Yu and Yelena Yesha, “Efficient Storage and Management of Environmental Information”, presented at the 11th Mass Storage Conference held by IEEE and NASA, Maryland, April 2002.

    [4] Wendolin Bosques, Ricardo Rodriguez, Angelica Rondon and Ramon Vasquez, “A Spatial Data Retrieval and Image Processing Expert System for the World Wide Web”, 21st International Conference on Computers and Industrial Engineering, 1997, pages 433-436.

    [5] Krzysztof Koperski and Jiawei Han, “Discovery of Spatial Association Rules in Geographic Information Databases”, Proceedings of the 4th International Symposium on Advances in Spatial Databases (SSD), Vol. 951, Springer-Verlag, pages 47-66.

    [6] Kirk Barrett, “The Meadowlands Environmental Research Institute”, Science on the Semantic Web (SWS) Workshop, Oct 2002.

    [7] Jiawei Han, Russ B. Altman, Vipin Kumar, Heikki Mannila, and Daryl Pregibon, “Emerging Scientific Applications in Data Mining”, Communications of the ACM, August 2002, Vol. 45, No. 8, pages 54-58.

    [8] Krzysztof Koperski, Junas Adhikary, and Jiawei Han, “Spatial Data Mining: Progress and Challenges Survey Paper”, SIGMOD ’96 Workshop on Research Issues in Data Mining and Knowledge Discovery.

    [9] Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, 2000.

    [10] Raymond T. Ng, “Detecting Outliers from Large Datasets”, in “Geographic Data Mining and Knowledge Discovery”, Taylor & Francis, 2001.


    Focus Areas

    • Environmental monitoring

    • Remote sensing/GIS for land use planning

    • Plant and animal inventory and assessment

    • Salt-marsh and Landfill Characterization and Restoration

    • Assessment and Remediation of Contaminated Sediments

    • Land use information management for planning and engineering (predicting land use trends for planners, code enforcement for engineers)

    • Scientific data warehousing for efficient management of environmental and remote sensing data

    • Scientific data mining for discovering trends, patterns and relationships among land use and environmental data

    • Automating land use permit processing workflows through transparent inter-agency interaction



    Introduction to Environmental Data (cont’d.)

    • Value-added products:

      • water

      • vegetation

      • temperature

      • true colors (composites)

    • models of the topography and spatial attributes of the landscape

      • roads, rivers, parcels, schools, zip code areas, city streets and administrative boundaries

      • Maps, reports, data sets from government agencies

    • census information that describes the socio-economic and health characteristics of the population

    • real-time data from ground monitoring stations
