Environmental Data Warehousing and Mining - PowerPoint PPT Presentation

Presentation Transcript

  1. Environmental Data Warehousing and Mining • Nabil R. Adam, Vijay Atluri, Dihua Guo, Songmei Yu • Rutgers University CIMIC • NSF Workshop on Next Generation Data Mining (NGDM02), November 1-3, 2002

  2. Outline • Setting • A Real-world Lab – The NJ Meadowlands Area • Motivating Examples • Environmental Data Warehousing • Environmental Data Mining RUTGERS-CIMIC/MERI

  3. MERI (Meadowlands Environmental Research Institute) • Established in 1998 as a collaboration between the New Jersey Meadowlands Commission and Rutgers CIMIC • Provides a world-class environmental research institute for urban and coastal wetlands, focused on the district • Administered by Rutgers-CIMIC • Mission: conduct and sponsor research in ecology, environmental science and information technology to monitor, preserve and improve ecological and human health and welfare in the Meadowlands District, NJ.

  4. MERI • Budget $1.6 Million/year (2002-2007) • Staff • Faculty, Students, FT NJMC/Rutgers Scientists/Staff • Disciplines • Biology, Ecology, Geology, Environmental Sc., Hydrological Modeling, Remote Sensing/Geographic Information Systems, and Information Technology • Work closely with the NJ Meadowlands Commission to disseminate research results: • To the scientific community and the various government agencies • Information and technology transfer to local municipalities • Develop scientific content for education and exhibits • Provide high school and college students with science internships

  5. Digital Meadowlands [diagram; labels: Satellite Imagery (AVHRR), Radar, Aerial Photos, NASA archives, Monitoring Stations, Sensors, Maps, Documents, Reports, Processed Satellite Images, Environmental Parameters, Interactive Maps, 3D visualization, Fly-by/Drill-down, Users]

  6. Visualization: Drill-down

  7. ASTER (Advanced Spaceborne Thermal Emission and Reflection Radiometer) • Ground resolution: 15 m (bands 1-3), 30 m (bands 4-9), 90 m (bands 10-14) • Spectral bands: 14 • Swath width: 60 km • Application: • Daily monitoring of flood-prone areas. Flood-prone areas are shown in red. Under flood conditions the sensor would detect water (blue) covering flood-prone areas (red).

  8. Satellite Images • Various data sources and data types • Various types of satellite images with different resolutions, captured by different sensors • AVHRR: direct downloads from polar-orbiting satellites (NOAA 12, NOAA 14 and NOAA 15), 1 km resolution, 5 bands • LANDSAT and RADAR: obtained from NASA archives, 30 m resolution, 7 bands • Hyper-spectral images: 34-band images from the AISA (Airborne Imaging Spectrometer for Applications) sensor, 250-1000m resolution • Aerial ortho-photographs: high resolution (1 m) images (IKONOS, QUICKBIRD) • MODIS: images for global dynamics and processes occurring on the land, in the oceans, and in the lower atmosphere from NASA's satellites Terra and Aqua, 250-1000 m resolution, 36 bands • ASTER: detailed maps of land surface temperature, emissivity, reflectance and elevation from NASA's satellite Terra, 14 bands, 15-90 m resolution

  9. Automated, near real-time monitoring system • Weather station with data logger • Water monitor wired to data logger • Modem link • Central computer ingests data and stores it in a database • WWW interface: users access the database through the World Wide Web

  10. Real-Time Data from Monitoring Stations

  11. Use of Water Quality Data: Tracking the effectiveness of pollution control measures [chart; annotation: regulatory minimum level]

  12. One Example of Satellite Imagery: AVHRR


  14. Information Sources: Traditional • Include • A library of a large variety of documents • Scientific publications • Guidelines and regulations • Measurements and impact studies • Documents contain text, tables, pictures, drawings and maps • Census information that describes the socio-economic and health characteristics of the population

  15. The User Community • Researchers, faculty, graduate students in a variety of disciplines including biology, ecology, geology, environmental science, and IT: • make scientific observations such as the changes in vegetation pattern and its effect on temperature over the years • Policy Makers: • query various critical parameters such as ambient air and water quality and visualize the results in a graphical form • gain help in the evaluation and formulation of environmental policies • The Public: • learn information about their county, community, home on such issues as environment, health, and infrastructure • K-12 Educators and students

  16. The Data Volume • Satellite images • AVHRR: 50 MB per image, 2-4 images per satellite per day; CIMIC downloads images from 3 satellites, which generates about 15 GB of data per month • MODIS and ASTER images are available every day or every other day • IKONOS and QUICKBIRD 1 m resolution images (7.5 quad): each image is roughly 15000*12000*8, which means 1.44 GB • Our Mass Storage • EMC CLARiiON FC4500 • Capacity: up to 18 TB • Good cost per MB, excellent performance, scalability, and flexibility • It satisfies the needs of the Online Querying Information System • 1 GB cache: optimized for reads or writes at different times by a script • The backplane provides a data transfer rate of 200 MB/second from the disks to the fiber channel port, which transfers the data over the fiber channel cables to the host at 100 MB/second • Additional fiber channel • Very flexible configuration capabilities
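
The volume figures on this slide can be checked with back-of-the-envelope arithmetic. The sketch below assumes a 30-day month, decimal units (1 GB = 1000 MB = 10^9 bytes), and reads the factor 8 in 15000*12000*8 as bytes per pixel; none of these assumptions is stated on the slide.

```python
# Rough check of the data-volume figures.
# Assumptions (not from the slide): 30-day month, decimal units,
# and the factor 8 interpreted as bytes per pixel.
MB_PER_AVHRR_IMAGE = 50
SATELLITES = 3
DAYS_PER_MONTH = 30


def avhrr_gb_per_month(images_per_satellite_per_day):
    """Monthly AVHRR volume in GB for a given daily image count."""
    mb = (MB_PER_AVHRR_IMAGE * images_per_satellite_per_day
          * SATELLITES * DAYS_PER_MONTH)
    return mb / 1000


low = avhrr_gb_per_month(2)   # 2 images per satellite per day
high = avhrr_gb_per_month(4)  # 4 images per satellite per day

# One 1 m resolution image: 15000 x 12000 pixels x 8 bytes.
high_res_gb = 15000 * 12000 * 8 / 1e9

print(f"AVHRR: {low:.0f} to {high:.0f} GB per month")
print(f"1 m image: {high_res_gb:.2f} GB")
```

Under these assumptions the AVHRR range brackets the 15 GB per month quoted on the slide, and the single-image size matches the 1.44 GB figure.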

  17. Environmental Knowledge Discovery [diagram; labels: Examples, Data Warehousing, Data Mining, How and Why, Hypothesis testing]

  18. Motivating Examples (1) • Identify a natural disturbance affecting wetland vegetation, such as fire, pathogen infestation or wilting by drought, in the New Jersey Meadowlands • What should we have? • A time series of satellite images (a few years) • Calculated soil and vegetation indices for the images • Digital elevation models (DEM) of the Meadowlands • Precipitation record for the time series • Zoning designation for the area being observed • What do we need to do? • Identify a sudden drop in the vegetation index (NDVI) in areas where NDVI has been consistently high through time (outlier detection) • Determine areas where the soil index is suddenly high due to the exposure of bare mineral soil (classification) • Combine high soil index with low NDVI and the precipitation record to determine the occurrence of vegetation disturbance (characterization)
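
The first step above, finding a sudden NDVI drop where NDVI has been consistently high, can be sketched as a simple z-score test against each pixel's own history. This is an illustrative sketch only: the function name, the thresholds (`high`, `z_cut`, `min_history`) and the sample series are assumptions, not values or methods from the presentation.

```python
from statistics import mean, stdev


def sudden_drops(series, high=0.6, z_cut=3.0, min_history=4):
    """Return indices where NDVI falls sharply after a consistently high history.

    `high`, `z_cut` and `min_history` are illustrative thresholds.
    """
    flagged = []
    for i in range(min_history, len(series)):
        history = series[:i]
        if min(history) < high:
            continue  # NDVI was not consistently high before this date
        mu, sigma = mean(history), stdev(history)
        sigma = max(sigma, 1e-6)  # guard against a perfectly flat history
        if (mu - series[i]) / sigma > z_cut:
            flagged.append(i)
    return flagged


# Made-up NDVI values for one pixel, one value per year.
yearly_ndvi = [0.71, 0.69, 0.72, 0.70, 0.68, 0.31, 0.35]
drops = sudden_drops(yearly_ndvi)
print(drops)
```

Only the first low value is flagged here; later values are skipped because the history is no longer consistently high, which matches the "sudden drop" wording rather than a persistent low state.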

  19. Motivating Examples (2) • Find bird resting patterns along the eastern seaboard migration corridor • Data needs • Extent of ecosystems that support invertebrates along the migration corridor • Availability of invertebrates in water and sediments through the migrating period. • What we need to know • The number of birds and bird types as related to the availability of food at each rest stop. (trends detection) • Detect abnormal bird populations (low or high) which are not explained by availability of food at specific resting stops. (outlier detection)

  20. Motivating Examples (3) • Investigate the associations between change in forest cover and illegal exploitation of protected tropical forests • Data we need • Satellite images/maps • Calculated deforestation rates using NDVI indices • Data on truck movement • Records on ship movements from local ports • Data on migrant worker camps • What can we get? • Relate deforestation rate to new road construction and truck traffic in areas where the topography and local ecosystems support exotic tropical trees (association detection)

  21. Other Motivating Examples • Hydroclimatological study (Praveen Kumar and Amanda BT. White, 2002) • How can we link the changes in NDVI to changes in the hydrologic condition? • Can we distinguish between the changes due to various factors, such as inter-annual climate variability and human action impact? • Is it possible to distinguish between variabilities related to inter-annual and long-term trends? • Is there a correlation between NDVI variations and ecoregion, or between NDVI and other parameters, such as climate, physiography, topography, or hydrology? • Are the trends confined to certain regions? Is the nature of the variability and trend different in different regions? • Are there any systematic changes over the last 10 years? • Are there regions where changes are attributable to human impact, such as logging?

  22. Environmental Data Warehousing (EDW) • Poses a number of challenging requirements with respect to • The design of the data model due to the nature of analytical operations to be performed • The nature of the views to be maintained by the environmental warehouse.

  23. EDW Challenges: Nature of the Environmental Data • Each dimension is itself multi-dimensional in nature, e.g., • raster images such as satellite downloads • used to generate various images of different types including land-use, water, temperature, NDVI • each of them has multiple dimensions • the geographic extent and coordinates • the time and date of its capture • resolution, ... • regional maps represented as vector data • temporal and spatial • streaming data collected from various sensors • temperature, air quality, atmospheric pressure, water quality: dissolved oxygen, mineral contents, salinity • geographic location (spatial dimension) • temporal dimension

  24. Nature of the Environmental Data [diagram of nested dimensions; labels: Time (Year, Date, Timestamp); Spatial ((LX, LY), (UX, UY), Resolution); Land-use Category; Land-use Themes (Temperature, Vegetation, Water, Developed, Barren, Forested upland, etc.); Vector Map (Dot, Line, Polygon); Image ID; Attributes (NDWT Index, Chlorophyll, Temperature)]

  25. Nature of the Environmental Data • Each dimensional table is itself multi-dimensional by nature • Traditional data warehouse models are not suitable for an environmental data warehouse • Our Proposal: cascaded star schema
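
The slides name the cascaded star schema but do not show its definition. One way to picture the idea is a fact table whose dimension tables are themselves small stars with their own time and spatial sub-dimensions. The sketch below uses SQLite through Python; every table and column name is hypothetical and the layout is an assumption, not the authors' actual design.

```python
import sqlite3

# Hypothetical table and column names; a sketch of the idea only.
SCHEMA = """
CREATE TABLE time_dim (
    time_id INTEGER PRIMARY KEY, year INTEGER,
    capture_date TEXT, capture_ts TEXT
);
CREATE TABLE spatial_dim (
    spatial_id INTEGER PRIMARY KEY,
    lx REAL, ly REAL, ux REAL, uy REAL, resolution_m REAL
);

-- Each dimension of the central fact table is itself a small star
-- with its own time and spatial sub-dimensions.
CREATE TABLE image_dim (
    image_id   INTEGER PRIMARY KEY,
    theme      TEXT,
    time_id    INTEGER REFERENCES time_dim(time_id),
    spatial_id INTEGER REFERENCES spatial_dim(spatial_id)
);
CREATE TABLE map_dim (
    map_id     INTEGER PRIMARY KEY,
    land_use   TEXT,
    time_id    INTEGER REFERENCES time_dim(time_id),
    spatial_id INTEGER REFERENCES spatial_dim(spatial_id)
);

-- Central fact table: one row links an image with a map.
CREATE TABLE observation_fact (
    image_id  INTEGER REFERENCES image_dim(image_id),
    map_id    INTEGER REFERENCES map_dim(map_id),
    mean_ndvi REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO time_dim VALUES (1, 2001, '2001-07-15', '2001-07-15T10:30:00')")
conn.execute("INSERT INTO spatial_dim VALUES (1, 0.0, 0.0, 10.0, 10.0, 30.0)")
conn.execute("INSERT INTO image_dim VALUES (1, 'vegetation', 1, 1)")
conn.execute("INSERT INTO map_dim VALUES (1, 'wetland', 1, 1)")
conn.execute("INSERT INTO observation_fact VALUES (1, 1, 0.62)")

# A query cascades from the fact table through a dimension
# into that dimension's own sub-dimension.
row = conn.execute("""
    SELECT t.year, i.theme, m.land_use, f.mean_ndvi
    FROM observation_fact f
    JOIN image_dim i ON i.image_id = f.image_id
    JOIN map_dim m   ON m.map_id = f.map_id
    JOIN time_dim t  ON t.time_id = i.time_id
""").fetchone()
print(row)
```

The point of the sketch is the two-level join path: a conventional star stops at `image_dim`, while the cascaded form continues into `time_dim` and `spatial_dim` for each dimension separately.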

  26. EDW Challenges: Complex Nature of Queries (1) • Retrieve changes in the vegetation pattern over a certain region during the last 10 years, and their effect on the regional maps over that time period • Requires • layering the images representing the vegetation patterns with those maps whose time intervals of validity overlap • traversing along the temporal dimension with the overlaid image • In the traditional data warehouse sense, • first construct two data cubes along the time dimension, one for the vegetation images and one for the maps • then fuse these two cubes into one
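
The layering step depends on matching images and maps whose validity intervals overlap. A minimal sketch of that matching follows; the item names, dates and the closed-interval convention are assumptions made for illustration.

```python
from datetime import date


def overlaps(a_start, a_end, b_start, b_end):
    """True when two closed validity intervals share at least one day."""
    return a_start <= b_end and b_start <= a_end


def layer(images, maps):
    """Pair every image with every map whose validity interval overlaps it.

    Each item is (name, valid_from, valid_to). Pairs are sorted by the
    shared interval, ready to be traversed along the time axis.
    """
    pairs = []
    for img_name, i_from, i_to in images:
        for map_name, m_from, m_to in maps:
            if overlaps(i_from, i_to, m_from, m_to):
                shared = (max(i_from, m_from), min(i_to, m_to))
                pairs.append((shared, img_name, map_name))
    return sorted(pairs)


# Hypothetical vegetation images and regional maps.
vegetation_images = [
    ("veg_1995", date(1995, 1, 1), date(1997, 12, 31)),
    ("veg_1998", date(1998, 1, 1), date(2002, 12, 31)),
]
regional_maps = [
    ("map_A", date(1994, 1, 1), date(1999, 6, 30)),
    ("map_B", date(1999, 7, 1), date(2002, 12, 31)),
]
timeline = layer(vegetation_images, regional_maps)
for (start, end), img, m in timeline:
    print(start, end, img, m)
```

Each output row corresponds to one overlaid image that is valid for the shared interval, which is what a traversal along the temporal dimension would step through.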

  27. Demo • http://cimic.rutgers.edu/~songmei/dw.html




  31. EDW Challenges: Complex Nature of Queries (2) • Observe the changes in the surface water and population due to the changes in the vegetation pattern • Fusion of multiple cubes is required • Simulate a fly-by over a region, starting at a specific point and elevation, traversing the region on a specific path with decreasing elevation at a certain speed, and reaching a destination (a 3-dimensional trajectory) • Requires • retrieving images that span adjacent regions overlapping the spatial trajectory, with increasing resolution levels to simulate the effect of reduced elevation • displaying them at a speed that matches the desired velocity of the fly-by

  32. EDW Challenges: Efficient Software and Mature Technology • We need software applications to efficiently manage and manipulate images, either through preset operations or ad hoc • Example of calculating NDVI: select (char) (255.0 * (band2 - band1) / (band2 + band1)) [1000:1500, 1000:1500] from landsat_band1 as band1, landsat_band2 as band2 • Example in the area of DB: RasDaMan • A basic research project sponsored by the European Community to develop comprehensive MDD database technology • Multi-dimensional data models (MDD) to store images • Interacts with Oracle for metadata and BLOB management
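
For readers unfamiliar with array query languages, the NDVI query above can be mirrored in plain Python on small nested lists. This is a sketch of the arithmetic only, not of RasDaMan itself: the window bounds are treated as inclusive, and the clamping to 0..255 before truncation is a choice of this sketch rather than documented query semantics.

```python
def scaled_ndvi(band1, band2, rows, cols):
    """Compute 255 * (band2 - band1) / (band2 + band1) over a sub-window.

    band1 and band2 are equally sized 2-D lists (red and near-infrared).
    rows and cols are (low, high) bounds, treated as inclusive here.
    Results are clamped to 0..255 and truncated to integers; the
    clamping is an assumption of this sketch.
    """
    out = []
    for r in range(rows[0], rows[1] + 1):
        line = []
        for c in range(cols[0], cols[1] + 1):
            total = band2[r][c] + band1[r][c]
            value = 255.0 * (band2[r][c] - band1[r][c]) / total if total else 0.0
            line.append(int(min(255.0, max(0.0, value))))
        out.append(line)
    return out


# Tiny made-up bands standing in for the two image collections.
red = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
nir = [[30, 20, 90], [40, 150, 60], [210, 80, 90]]
window = scaled_ndvi(red, nir, rows=(0, 1), cols=(1, 2))
print(window)
```

The array database evaluates the same expression cell by cell, but only on the tiles that the requested window touches.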


  34. RasDaMan (2) • Distinguished Features • A clear distinction is made between the logical (query) level and the physical (storage organization and data transmission) level of array management. • On the conceptual level, arrays are treated as a general data abstraction: they can be of any dimensionality, they can have an arbitrary (fixed or variable) number of elements per dimension, and both primitive and derived types are admissible as array base types. • The model has formal set-algebraic semantics based on AFATL Image Algebra, a rigorous mathematical framework able to express any image transformation. • On the physical level, a novel combination of tiling and spatial indexing allows for the efficient execution of queries on MDD while offering the benefits of conventional database technology, such as query performance depending on the result set (and not on the overall data set size), concurrency control, support for crash recovery, and transaction management. • A data definition language for multidimensional arrays, together with an optimized SQL-based query language called RasQL, allows for powerful associative retrieval and data manipulation.
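
The claim that query cost depends on the result set rather than the overall data set size follows from tiling: a range query only needs the tiles it intersects. The sketch below assumes a regular grid of square tiles starting at index 0; the slides do not describe RasDaMan's actual tiling or index structures, so this is an illustration of the principle only.

```python
def tiles_for_query(query, tile_size):
    """List the (row, col) indices of tiles that a range query touches.

    query is ((row_lo, row_hi), (col_lo, col_hi)) with inclusive bounds;
    tiles are square blocks of tile_size x tile_size cells starting at 0.
    """
    (row_lo, row_hi), (col_lo, col_hi) = query
    first_row, last_row = row_lo // tile_size, row_hi // tile_size
    first_col, last_col = col_lo // tile_size, col_hi // tile_size
    return [(r, c)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


# A 1000:1500 x 1000:1500 window on an image stored in 512 x 512 tiles
# (the tile size is an arbitrary choice for this example).
needed = tiles_for_query(((1000, 1500), (1000, 1500)), tile_size=512)
print(needed)
```

The same four tiles are read whether the stored image is 2,000 or 200,000 cells wide, which is the property the slide highlights.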

  35. Ongoing Work • Formulating the necessary primitives for the specification and execution of queries • Extending the OLAP operations for the cascaded star • roll-up: aggregating on a specific dimension, i.e., summarizing data • drill-down: moving from a higher-level summary to lower-level detail • slicing: projecting data along a subset of dimensions with an equality selection on the other dimensions • dicing: similar to slicing, except that a range selection is used on the other dimensions instead of an equality selection • pivoting: reorienting the multidimensional cube • zoom-in, zoom-out, aggregation of views using the above OLAP operations
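
The standard (non-cascaded) forms of roll-up, slicing and dicing can be illustrated on a tiny in-memory cube. This sketch shows only the conventional operations that the slide proposes to extend; the dimension names and the area values are made up.

```python
from collections import defaultdict

# Tiny cube: (year, region, theme) -> area in hectares (made-up values).
cube = {
    (2000, "north", "vegetation"): 40, (2000, "north", "water"): 10,
    (2000, "south", "vegetation"): 25, (2000, "south", "water"): 15,
    (2001, "north", "vegetation"): 35, (2001, "north", "water"): 12,
    (2001, "south", "vegetation"): 20, (2001, "south", "water"): 18,
}
DIMS = ("year", "region", "theme")


def roll_up(cells, drop):
    """Aggregate (sum) away one dimension of cells keyed by the full DIMS tuple."""
    axis = DIMS.index(drop)
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)


def slice_(cells, dim, value):
    """Equality selection on one dimension."""
    axis = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[axis] == value}


def dice(cells, dim, low, high):
    """Range selection on one dimension (inclusive bounds)."""
    axis = DIMS.index(dim)
    return {k: v for k, v in cells.items() if low <= k[axis] <= high}


by_year_theme = roll_up(cube, "region")
water_only = slice_(cube, "theme", "water")
year_2001 = dice(cube, "year", 2001, 2001)
print(by_year_theme[(2000, "vegetation")])
```

Drill-down is the inverse of `roll_up` (returning to the unaggregated cells), and pivoting would amount to reordering the key tuple.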

  36. Environmental Data Mining – Challenges (1) • How can we mine spatial and non-spatial data from multispectral satellite images and thematic maps? (Krzysztof Koperski, Junas Adhikary, and Jiawei Han, 1996) • Current research uses only a single type of map or image • Mine them at the same time • Resolutions are different • The representations of the thematic maps are different • How to deal with the complex relationships among objects (Krzysztof Koperski, Junas Adhikary, and Jiawei Han, 1996) • Relationships • Spatial relationship: distance • Topological relationship: disjoint, overlap, far away, etc. • Direction • Current clustering methods represent a large object by its centroid; this works for objects of similar size and regular shape, but not when one of them is a very narrow, long band

  37. Environmental Data Mining – Challenges (2) • How to utilize the various data seamlessly • The diverse data types • Structured data: vector, raster, relational database • Unstructured data: text, multimedia, and geo-referenced stream data • Needs supporting data • Some can be found in the data warehouse: summary, average • Some need to be created on the fly: variation, etc. • How to utilize the geographic visualization tool • Can it replace statistical visualization tools in some areas?

  38. Data Mining Techniques • From the motivating examples we notice that several data mining techniques are involved • Segmentation • Clustering • Classification • Rule detection • Trend detection • Outlier detection

  39. EDM Techniques – Rule Detection • Examples: • Can we distinguish between the changes due to various factors, such as inter-annual climate variability and human action impact? • Can we link the changes in NDVI to changes in the hydrologic conditions, or changes in population? • Various rules • Characteristic rules: describe one characteristic of the data • Discriminant rules: the features discriminating or contrasting a class of data from other classes • Association rules: one set of features is correlated with another set of features

  40. EDM Techniques – Association Rule Detection • Algorithms • Classic algorithms • Apriori: finds frequent itemsets for Boolean association rules (Jiawei Han and Micheline Kamber, 2000) • Statistical techniques: regression model • Spatial data mining algorithm (Krzysztof Koperski and Jiawei Han, 1996) • A top-down search technique • Uses spatial approximation • Pre-processing is required for object recognition • Needs a comprehensive algorithm for mining a combination of spatial and non-spatial data at the same time
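
As a reminder of how Apriori finds frequent itemsets, here is a minimal level-wise sketch with the usual join and prune steps. The transactions are made-up sets of conditions observed per map cell, and the support threshold is arbitrary; the spatial algorithm cited on the slide is not reproduced here.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-item subset of a candidate must be frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


# Each "transaction" lists conditions observed in one map cell (made-up data).
cells = [
    {"low_ndvi", "high_soil_index", "new_road"},
    {"low_ndvi", "high_soil_index"},
    {"low_ndvi", "new_road"},
    {"high_soil_index", "new_road", "low_ndvi"},
    {"high_ndvi"},
]
frequent = apriori(cells, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The three-item candidate is pruned without counting because one of its two-item subsets is infrequent, which is the property that keeps the search tractable.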

  41. EDM Techniques – Segmentation • Example • Are the trends confined to certain regions? Is the nature of the variability and trend different in different regions? • Is it possible to distinguish between variability related to inter-annual and long-term trends? • Clustering: • Groups spatial objects such that objects in the same group are similar and objects in different groups are unlike each other. • Classification: • Selects a relevant set of attributes and attribute values that determine an effective mapping of spatial objects into pre-defined target classes. (H. J. Miller and J. Han, 2001) • Name a set of pre-determined classes (inter-annual changes, long-term changes)

  42. EDM Techniques – Segmentation (Cont’d) • Algorithms • Classification: the classes are pre-defined • Decision tree induction • Bayesian classification • Clustering • Partitioning algorithms: k-means method, k-medoids method • The problem here is that the result strongly depends on the initial guess of the centroids • Hierarchical algorithms: AGNES, DIANA, BIRCH, CURE • The hierarchical algorithms are not optimal for large datasets • Density-based: DBSCAN, OPTICS, DENCLUE • Only dots, without meaningful interpretation • Grid-based: STING, WaveCluster, CLIQUE • How to partition high-dimensional data
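
The dependence of k-means on the initial centroids is easy to reproduce. The sketch below runs plain Lloyd iterations on made-up one-dimensional data from two different starting points; the data, the starting centroids and the helper names are all illustrative.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain Lloyd iterations on 1-D data from the given initial centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid when a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def cost(points, centroids):
    """Sum of squared distances to the nearest centroid."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points)


data = [1, 2, 3, 11, 12, 13, 21, 22, 23]    # three obvious clumps
spread_init = kmeans_1d(data, [2, 12, 22])  # one start per clump
crowded_init = kmeans_1d(data, [1, 2, 3])   # all starts in the first clump
print(spread_init, cost(data, spread_init))
print(crowded_init, cost(data, crowded_init))
```

The crowded start converges to a local optimum that splits the first clump and merges the other two, with a much higher cost than the spread start. Common remedies such as multiple restarts are outside what the slides discuss.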

  43. EDM Techniques – Outlier Detection • To find inconsistencies and abnormalities • Example • Can we identify abnormal changes in NDVI or in a particular species? • Has it been unusually hot this October? • Algorithms (Raymond T. Ng, 2001) • Distribution-based approach: outliers are the objects that do not follow the assumed standard distribution • Hard to know the distribution • Not suitable for high-dimensional datasets • Depth-based method: represent the data in k-dimensional space and assign a depth to each object • Does not scale beyond 3-D • Distance-based outlier detection • Requires the existence of an appropriate distance function
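
A distance-based test can be sketched as follows: a point is flagged when at least a given fraction of the other points lies farther away than a chosen radius. The function name, the radius, the fraction and the observations are assumptions for illustration; the plain Euclidean distance on unscaled values is a simplification, which is exactly where the need for an appropriate distance function shows up.

```python
import math


def distance_outliers(points, radius, min_fraction_far):
    """Flag points for which at least min_fraction_far of the other points
    lie farther away than radius (Euclidean distance)."""
    flagged = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        far = sum(1 for q in others if math.dist(p, q) > radius)
        if far / len(others) >= min_fraction_far:
            flagged.append(p)
    return flagged


# (October mean temperature in degrees C, mean NDVI) per year; made-up values.
observations = [(14.0, 0.62), (13.5, 0.60), (14.2, 0.64), (13.8, 0.61), (19.5, 0.40)]
unusual = distance_outliers(observations, radius=2.0, min_fraction_far=0.9)
print(unusual)
```

This brute-force form compares every pair of points, so it is quadratic in the number of observations; it is meant to show the definition, not an algorithm suited to large datasets.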

  44. References [1] Praveen Kumar and Amanda BT. White, “Scalable Knowledge Discovery for Hydroclimatological Studies”, University of Illinois, 2002. [2] H. J. Miller and J. Han, “Geographic Data Mining and Knowledge Discovery”, Taylor & Francis, 2001. [3] Nabil Adam, Vijay Atluri, Songmei Yu and Yelena Yesha, “Efficient Storage and Management of Environmental Information”, presented at the 11th Mass Storage Conference held by IEEE and NASA, Maryland, April 2002. [4] Wendolin Bosques, Ricardo Rodriguez, Angelica Rondon and Ramon Vasquez, “A Spatial Data Retrieval and Image Processing Expert System for the World Wide Web”, 21st International Conference on Computers and Industrial Engineering, 1997, pages 433-436. [5] Krzysztof Koperski and Jiawei Han, “Discovery of Spatial Association Rules in Geographic Information Databases”, Proceedings of the 4th International Symposium on Advances in Spatial Databases (SSD), Vol. 951, Springer-Verlag, pages 47-66. [6] Kirk Barrett, “The Meadowlands Environmental Research Institute”, Science on the Semantic Web (SWS) Workshop, Oct 2002. [7] Jiawei Han, Russ B. Altman, Vipin Kumar, Heikki Mannila, and Daryl Pregibon, “Emerging Scientific Applications in Data Mining”, Communications of the ACM, August 2002, Vol. 45, No. 8, pages 54-58. [8] Krzysztof Koperski, Junas Adhikary, and Jiawei Han, “Spatial Data Mining: Progress and Challenges, Survey Paper”, SIGMOD ’96 Workshop on Research Issues in Data Mining and Knowledge Discovery. [9] Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, 2000. [10] Raymond T. Ng, “Detecting Outliers from Large Datasets”, in “Geographic Data Mining and Knowledge Discovery”, Taylor & Francis, 2001.

  45. Focus Areas • Environmental monitoring • Remote sensing/GIS for land use planning • Plant and animal inventory and assessment • Salt-marsh and Landfill Characterization and Restoration • Assessment and Remediation of Contaminated Sediments • Land use information management for planning and engineering (predict land use trends for planner, code enforcement for engineers) • Scientific data warehousing for efficient management of environmental and remote sensing data • Scientific data mining for discovering trends, patterns and relationships among land use and environmental data • Automating land use permit processing workflows through transparent inter-agency interaction

  46. Introduction to Environmental Data (cont’d.) • Value-added products: • water • vegetation • temperature • true colors (composites) • models of the topography and spatial attributes of the landscape • roads, rivers, parcels, schools, zip code areas, city streets and administrative boundaries • Maps, reports, data sets from government agencies • census information that describes the socio-economic and health characteristics of the population • real-time data from ground monitoring stations