
Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Discovering Communicable Scientific Knowledge from Spatio-Temporal Data. Mark Schwabacher NASA Ames Research Center Computational Sciences Division mark.schwabacher@arc.nasa.gov http://ic-www.arc.nasa.gov/people/schwabacher/


Presentation Transcript


  1. Discovering Communicable Scientific Knowledge from Spatio-Temporal Data
     Mark Schwabacher
     NASA Ames Research Center, Computational Sciences Division
     mark.schwabacher@arc.nasa.gov
     http://ic-www.arc.nasa.gov/people/schwabacher/
     Joint work with Pat Langley and Jeff Shrager (ISLE) and Chris Potter, Steve Klooster, Lisy Torregrosa, and Vanessa Brooks (NASA Earth Science)

  2. Outline
     • Description of Earth science problem
     • Choice of representation and algorithm
     • Results
     • Visualizations
     • Discovery of an error in the data
     • Future work

  3. Earth Science Problem
     • The Normalized Difference Vegetation Index (NDVI) is a measure of vegetation across the globe derived from satellite data
     • NDVI is used in various Earth-science models
     • Unfortunately, NDVI is only available for the years since 1983, when a satellite with these sensors was launched
     • We would like to predict NDVI at a point on the globe from ground-based climate variables representing temperature, precipitation, and moisture

  4. Choice of Representation
     For scientific applications, the learned models should be
     • Understandable
     • Communicable

  5. Representation used by scientists
     Our Earth Science collaborators had built the following model, which uses an “if” statement to select between two linear models, one for warmer locations and one for cooler locations:
     if GDD < 3000 then ln(NDVI) = 0.715 ln(GDD) + 0.377 ln(PPT) - 0.448
     if GDD >= 3000 then NDVI = 189.89 AMI + 44.02 ln(PPT) + 227.99
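The two-branch model on this slide can be written directly as a function. This is a minimal sketch, not the collaborators' code: the function name `scientists_ndvi` is hypothetical, and because the transcript does not define GDD, PPT, or AMI or give their units, the sketch simply treats them as positive numeric inputs. The cool branch predicts ln(NDVI), so the sketch exponentiates it to return NDVI from both branches.

```python
import math


def scientists_ndvi(gdd: float, ppt: float, ami: float) -> float:
    """Evaluate the two-branch model from the slide (hypothetical wrapper).

    gdd, ppt, ami are the slide's GDD, PPT, and AMI variables; their units
    are not stated in the transcript, so none are assumed here.
    """
    if gdd < 3000:
        # Cooler locations: the slide gives a linear model for ln(NDVI).
        ln_ndvi = 0.715 * math.log(gdd) + 0.377 * math.log(ppt) - 0.448
        return math.exp(ln_ndvi)
    # Warmer locations: the slide gives a linear model for NDVI itself.
    return 189.89 * ami + 44.02 * math.log(ppt) + 227.99
```

Note that the threshold test `gdd < 3000` sends a location with GDD exactly 3000 to the warm branch, matching the slide's `>=`.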

  6. Choice of Algorithm
     • We selected regression rules as a generalization of the Earth scientists’ representation
     • We selected Cubist to learn them (http://www.rulequest.com)
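To make "regression rules" concrete: each rule pairs a condition with a linear model, and the scientists' if/else model is the special case of two rules with complementary conditions. The sketch below is an illustrative data structure only, not Cubist's internal representation; the names `RegressionRule`, `predict`, and `toy_rules`, and the decision to average the outputs when several rules match, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, float]


@dataclass
class RegressionRule:
    """A condition plus a linear model that applies when the condition holds."""
    condition: Callable[[Features], bool]
    intercept: float
    coefficients: Dict[str, float]

    def predict(self, x: Features) -> float:
        return self.intercept + sum(w * x[name] for name, w in self.coefficients.items())


def predict(rules: List[RegressionRule], x: Features) -> float:
    """Average the predictions of every rule whose condition matches x.

    Averaging overlapping rules is an assumption of this sketch.
    """
    matches = [rule.predict(x) for rule in rules if rule.condition(x)]
    if not matches:
        raise ValueError("no rule covers this input")
    return sum(matches) / len(matches)


# Toy rule set on a made-up variable "t" (not from the slides).
toy_rules = [
    RegressionRule(lambda x: x["t"] < 10.0, 1.0, {"t": 2.0}),
    RegressionRule(lambda x: x["t"] >= 10.0, 5.0, {"t": 0.5}),
]
```

With this structure, `predict(toy_rules, {"t": 4.0})` uses only the first rule and `predict(toy_rules, {"t": 20.0})` uses only the second.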

  7. First Results
     Cubist produced better accuracy, but the model was hard to understand.

  8. Varying the Cubist minimum rule cover parameter

  9. 2-rule Cubist model
     if PPT <= 25.457 then NDVI = -3.225 + 7.07 PPT + 0.0521 CDD - 84 AMI + 0.4 ln(PPT) + 0.0001 GDD
     if PPT > 25.457 then NDVI = 386.3 + 316 AMI + 0.0294 GDD - 0.99 PPT + 0.2 ln(PPT)
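For comparison with the scientists' model, the two Cubist rules on this slide can be evaluated the same way. This is a sketch under assumptions: the function name `cubist_two_rule_ndvi` and the constant `PPT_SPLIT` are hypothetical, the coefficients are copied from the slide, and the transcript does not define CDD or state units, so the inputs are treated as plain numbers with PPT positive.

```python
import math

PPT_SPLIT = 25.457  # threshold on PPT, copied from the slide


def cubist_two_rule_ndvi(ppt: float, cdd: float, ami: float, gdd: float) -> float:
    """Evaluate the 2-rule model from the slide (hypothetical wrapper)."""
    if ppt <= PPT_SPLIT:
        # Rule 1: lower-PPT locations.
        return (-3.225 + 7.07 * ppt + 0.0521 * cdd - 84 * ami
                + 0.4 * math.log(ppt) + 0.0001 * gdd)
    # Rule 2: higher-PPT locations (CDD does not appear in this rule).
    return 386.3 + 316 * ami + 0.0294 * gdd - 0.99 * ppt + 0.2 * math.log(ppt)
```

Unlike the scientists' model, which splits on GDD, this model splits on PPT, and both rules predict NDVI directly rather than ln(NDVI).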

  10. Visualization #1: Cubist model in one variable

  11. Visualization #2: Activity of Cubist Rules

  12. Visualization #2: Error of Cubist model

  13. Testing the model across years
      • We trained Cubist using one year’s data
      • We tested the resulting model on other years’ data
      • If it transfers, it’s useful for Earth scientists
      • If it sometimes doesn’t transfer, that could point to a scientific discovery
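The train-on-one-year, test-on-others procedure can be sketched as follows. This is not Cubist: a one-variable least-squares line stands in for the learned model so the example stays self-contained, and the data are synthetic with made-up year labels, with one year deliberately shifted to show how a failure to transfer appears as a large error. All function and variable names here are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (predictor value, observed target)


def fit_line(samples: List[Sample]) -> Tuple[float, float]:
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(model: Tuple[float, float], samples: List[Sample]) -> float:
    """Root-mean-square error of the line on the given samples."""
    slope, intercept = model
    return math.sqrt(sum((slope * x + intercept - y) ** 2 for x, y in samples) / len(samples))


def cross_year_errors(data_by_year: Dict[str, List[Sample]], train_year: str) -> Dict[str, float]:
    """Train on one year, then report the error on every other year."""
    model = fit_line(data_by_year[train_year])
    return {year: rmse(model, samples)
            for year, samples in data_by_year.items() if year != train_year}


# Synthetic data only: year_c has a deliberate offset of +3 in the target.
synthetic = {
    "year_a": [(float(x), 2.0 * x + 1.0) for x in range(5)],
    "year_b": [(float(x), 2.0 * x + 1.0) for x in range(5)],
    "year_c": [(float(x), 2.0 * x + 4.0) for x in range(5)],
}
errors = cross_year_errors(synthetic, "year_a")
```

In this toy setup the model transfers to `year_b` with essentially zero error, while `year_c` stands out, which is the kind of signal that would prompt a closer look at either the science or the data.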

  14. Discovery of an error in the data
      [Figure: two panels, “Cross-validate 1985” and “Train 1984, test 1985”]

  15. Related Work
      • Regression trees: Breiman et al.’s CART (1984)
      • Classification applied to Earth science: Brodley & Friedl (1999); Ester, Kriegel, & Xu (1996)
      • Visualizing classes on map: Brodley & Friedl (1999); Smyth, Ghil, & Ide (1999)
      • Detecting and correcting faulty class labels in data: John (1995); Brodley & Friedl (1999)
      • Detecting and correcting calibration problems in remote-sensing systems using predefined model: Chen (1997)

  16. Future Work
      • Cubist/NDVI work
        • Incorporate time explicitly
        • Include other variables (e.g. elevation)
        • Test understandability
      • Other work
        • Improve CASA model (next talk)
        • Implement an interactive system that lets scientists direct high-level search for improved ecosystem models

  17. Lessons Learned
      We’ve identified three problems that arise in scientific applications of ML, and proposed initial solutions:
      • Communicability: Use the same representation as the scientists.
      • Understandability: When using spatial data, spatially visualize the model’s errors and the activity of its components.
      • Quantitative errors: When using time-series data, quantitative errors can be identified by testing a model trained on one time period against data from other time periods.
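The understandability lesson (map where each rule is active and where the model errs) can be sketched on a small grid. This is an illustrative sketch only: the grids are nested lists standing in for gridded spatial data, the helper names are hypothetical, and the default threshold is the PPT split from the 2-rule Cubist model on slide 9. A real visualization would color a world map rather than print digits.

```python
from typing import List

Grid = List[List[float]]


def rule_activity_map(ppt_grid: Grid, split: float = 25.457) -> List[List[int]]:
    """Label each cell 1 if the PPT <= split rule fires there, else 2."""
    return [[1 if ppt <= split else 2 for ppt in row] for row in ppt_grid]


def error_map(predicted: Grid, observed: Grid) -> Grid:
    """Signed error (predicted minus observed) for each grid cell."""
    return [[p - o for p, o in zip(p_row, o_row)]
            for p_row, o_row in zip(predicted, observed)]


def render(label_grid: List[List[int]]) -> str:
    """Crude text rendering: one character per cell, one line per row."""
    return "\n".join("".join(str(v) for v in row) for row in label_grid)
```

For example, `render(rule_activity_map([[10.0, 30.0], [25.457, 40.0]]))` produces two rows of `12`, showing the first rule active in the left column and the second rule in the right.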
