
Computational Revision of Ecological Process Models
Nima Asgharbeygi, Pat Langley, Stephen Bay
Center for the Study of Language and Information, Stanford University
Kevin Arrigo
Department of Geophysics, Stanford University
Thanks to S. Dzeroski, J. Sanchez, K. Saito, J. Shrager, and L. Todorovski for their contributions to this research, which is funded by the US National Science Foundation.

Data Mining vs. Scientific Discovery
There exist two computational paradigms for discovering explicit knowledge from data.
The data mining movement develops computational methods that:
• induce predictive models from large (often business) data sets;
• represent models in notations invented by AI researchers.
In contrast, computational scientific discovery focuses on:
• constructing models from (often small) scientific data sets;
• stated in formalisms invented by scientists themselves.
This talk focuses on applications of the second framework to environmental and ecosystem modeling.

Observations from the Ross Sea

A Model of Ross Sea Ecosystem
model RossSeaEcosystem
  variables: phyto, zoo, nitro, residue
  observables: phyto, nitro
  d[phyto,t,1] = - 0.307 * phyto - 0.495 * zoo + 0.411 * phyto
  d[zoo,t,1] = - 0.251 * zoo + 0.615 * 0.495 * zoo
  d[residue,t,1] = 0.307 * phyto + 0.251 * zoo + 0.385 * 0.495 * zoo - 0.005 * residue
  d[nitro,t,1] = - 0.098 * 0.411 * phyto + 0.005 * residue

Inductive Revision of Ecosystem Models
[Diagram: the initial model (the RossSeaEcosystem equations) and the observations feed a Revision step, which outputs a revised model.]

A Space of Ecosystem Models
Model revision requires ways to constrain search through this space.

Phytoplankton Loss in Ross Sea Ecosystem
[Slide repeats the RossSeaEcosystem model with the phytoplankton-loss terms highlighted: - 0.307 * phyto in d[phyto,t,1] and + 0.307 * phyto in d[residue,t,1].]
Phytoplankton loss is a process that affects two variables; no model should include one influence without the other.

Grazing in the Ross Sea Ecosystem
[Slide repeats the RossSeaEcosystem model with the grazing terms highlighted: - 0.495 * zoo in d[phyto,t,1], + 0.615 * 0.495 * zoo in d[zoo,t,1], and + 0.385 * 0.495 * zoo in d[residue,t,1].]
We can view an ecosystem model as a set of processes that provide an alternative way to encode its assumptions.

Process Model of Ross Sea Ecosystem
model RossSeaEcosystem
  variables: phyto, zoo, nitro, residue
  observables: phyto, nitro
process phyto_loss
  equations: d[phyto,t,1] = - 0.307 * phyto
             d[residue,t,1] = 0.307 * phyto
process zoo_loss
  equations: d[zoo,t,1] = - 0.251 * zoo
             d[residue,t,1] = 0.251 * zoo
process zoo_phyto_grazing
  equations: d[zoo,t,1] = 0.615 * 0.495 * zoo
             d[residue,t,1] = 0.385 * 0.495 * zoo
             d[phyto,t,1] = - 0.495 * zoo
process nitro_uptake
  equations: d[phyto,t,1] = 0.411 * phyto
             d[nitro,t,1] = - 0.098 * 0.411 * phyto
process nitro_remineralization
  equations: d[nitro,t,1] = 0.005 * residue
             d[residue,t,1] = - 0.005 * residue
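The relation between the process view and the equation view can be sketched in a few lines of Python. This is only an illustration, not the authors' representation: the tuple encoding and the name `derivatives` are hypothetical, and the signs assume that loss, grazing on phyto, uptake from nitro, and remineralization from residue are negative contributions.

```python
from collections import defaultdict

# Each process lists (target, coefficient, source) terms, read as
# d[target,t,1] += coefficient * source. Coefficients come from the slide;
# the signs are an assumed reading (losses negative, gains positive).
PROCESSES = {
    "phyto_loss": [("phyto", -0.307, "phyto"), ("residue", 0.307, "phyto")],
    "zoo_loss": [("zoo", -0.251, "zoo"), ("residue", 0.251, "zoo")],
    "zoo_phyto_grazing": [
        ("zoo", 0.615 * 0.495, "zoo"),
        ("residue", 0.385 * 0.495, "zoo"),
        ("phyto", -0.495, "zoo"),
    ],
    "nitro_uptake": [("phyto", 0.411, "phyto"), ("nitro", -0.098 * 0.411, "phyto")],
    "nitro_remineralization": [("nitro", 0.005, "residue"), ("residue", -0.005, "residue")],
}


def derivatives(state, processes=PROCESSES):
    """Sum every process's contribution to each variable's rate of change."""
    rates = defaultdict(float)
    for terms in processes.values():
        for target, coefficient, source in terms:
            rates[target] += coefficient * state[source]
    return dict(rates)


state = {"phyto": 1.0, "zoo": 2.0, "nitro": 3.0, "residue": 4.0}  # arbitrary values
rates = derivatives(state)
```

Dropping a whole entry from `PROCESSES` removes both of its influences at once, which is the constraint the phytoplankton-loss slide asks for.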

Inductive Revision of Process Models
[Diagram: the initial model (RossSeaEcosystem), the observations, and a library of generic processes (exponential_growth, logistic_growth, constant_inflow, consumption, no_saturation, saturation) feed a Revision step, which outputs a revised model.]

Generic Processes for Aquatic Ecosystems
generic process exponential_loss
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = -1 * α * S
             d[D,t,1] = α * S
generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: π [0, 1]
  equations: d[N,t,1] = π * D
             d[D,t,1] = -1 * π * D
generic process grazing
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ * ρ * S1
             d[D,t,1] = (1 - γ) * ρ * S1
             d[S2,t,1] = -1 * ρ * S1
generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν
generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], β [0, 1], μ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ * S
             d[N,t,1] = -1 * β * μ * S

A Method for Process Model Revision
We have implemented RPM, an algorithm that revises an initial process model in four main stages:
1. Find all ways to instantiate available generic processes with specific variables, subject to type constraints;
2. Generate candidate model structures by deleting the current processes and adding new ones, subject to complexity limits;
3. For each generic model, carry out search through parameter space to find good coefficients [difficult];
4. Return a list of revised models ordered by their overall scores.
The evaluation metric can be squared error or description length based on error and distance from the initial model.
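Stages 1 and 2 can be illustrated with a small Python sketch. It is not RPM itself: the type table, the reduced generic-process signatures, and the limits `max_deleted` and `max_added` are assumptions made for the example, and equations and parameters are left out.

```python
from itertools import combinations, permutations

# Hypothetical type assignments for the model variables.
VARIABLE_TYPES = {"phyto": "species", "zoo": "species",
                  "nitro": "nutrient", "residue": "detritus"}

# Generic processes reduced to their typed variable slots (equations omitted).
GENERIC_SLOTS = {
    "exponential_loss": ("species", "detritus"),
    "grazing": ("species", "species", "detritus"),
    "remineralization": ("nutrient", "detritus"),
}


def instantiate(generic_slots, variable_types):
    """Stage 1: bind each generic process to distinct variables of matching type."""
    instances = []
    for name, slots in generic_slots.items():
        for binding in permutations(variable_types, len(slots)):
            if all(variable_types[v] == t for v, t in zip(binding, slots)):
                instances.append((name, binding))
    return instances


def candidate_structures(initial, instances, max_deleted=1, max_added=1):
    """Stage 2: delete up to max_deleted processes and add up to max_added."""
    addable = [p for p in instances if p not in initial]
    candidates = []
    for n_deleted in range(max_deleted + 1):
        for removed in combinations(initial, n_deleted):
            kept = [p for p in initial if p not in removed]
            for n_added in range(max_added + 1):
                for added in combinations(addable, n_added):
                    candidates.append(frozenset(kept) | frozenset(added))
    return candidates


instances = instantiate(GENERIC_SLOTS, VARIABLE_TYPES)
initial = [("exponential_loss", ("phyto", "residue")),
           ("remineralization", ("nitro", "residue"))]
structures = candidate_structures(initial, instances)
```

Even this toy library yields a dozen structures from a two-process initial model, which suggests why the complexity limits in stage 2 matter.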

Observations from the Ross Sea

Revised Model of Ross Sea Ecosystem
model RossSeaEcosystem
  variables: phyto, zoo, nitro, residue, light, G, growth_rate, nitro_rate, light_rate
  observables: phyto, nitro, light
  d[phyto,t,1] = - 0.307 * phyto - G * zoo + growth_rate * phyto
  d[zoo,t,1] = 0.615 * G * zoo
  d[residue,t,1] = 0.307 * phyto + 0.385 * G * zoo - 0.083 * residue
  d[nitro,t,1] = - 1 * n_to_c * growth_rate * phyto + 0.083 * n_to_c * residue
  G = 0.415 * (1 - exp(- 1 * 0.27 * phyto))
  growth_rate = r_max * min(nitro_rate, light_rate)
  nitro_rate = nitro / (nitro + 4.33)
  light_rate = light / (light + 11.67)
  n_to_c = 0.251, r_max = 0.194, remin_rate = 0.0676
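The algebraic part of the revised model is easy to check by hand. The following Python sketch evaluates the grazing term G and the growth rate using the constants shown on the slide; the function names are illustrative only.

```python
import math

R_MAX = 0.194  # r_max from the slide


def grazing_rate(phyto):
    """G = 0.415 * (1 - exp(-1 * 0.27 * phyto)); saturates at 0.415."""
    return 0.415 * (1.0 - math.exp(-1.0 * 0.27 * phyto))


def growth_rate(nitro, light):
    """growth_rate = r_max * min(nitro_rate, light_rate)."""
    nitro_rate = nitro / (nitro + 4.33)
    light_rate = light / (light + 11.67)
    return R_MAX * min(nitro_rate, light_rate)
```

Because of the `min`, growth is limited by whichever of nitrate and light is scarcer: with abundant light and nitro = 4.33, the rate is half of r_max.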

Initial Results on Ross Sea Training Data The best revised model reproduces the observations quite well.

Initial Results on Ross Sea Test Data But the model predicts nearly the same behavior for both years.

Revised Results on Ross Sea Test Data Refitting initial values for zooplankton gives better generalization.

Results on Data from Protist Study

Results on Data from Ringkøbing Fjord

Interfacing with Scientists
Because few scientists want to be replaced, we are developing PROMETHEUS, an interactive environment that lets users:
• specify a quantitative process model of the target system;
• display and edit the model's structure and details graphically;
• simulate the model's behavior over time and situations;
• compare the model's predicted behavior to observations;
• invoke a revision module in response to detected anomalies.
The environment offers computational assistance in forming and evaluating models but lets the user retain control.

Viewing and Editing a Process Model

Intellectual Influences
Our approach to computational discovery incorporates ideas from many traditions:
• computational scientific discovery (e.g., Langley et al., 1983);
• theory revision in machine learning (e.g., Towell, 1991);
• qualitative physics and simulation (e.g., Forbus, 1984);
• languages for scientific simulation (e.g., STELLA, MATLAB);
• interactive tools for data analysis (e.g., Shneiderman, 2001).
Our work combines ideas from machine learning, AI, programming languages, and human-computer interaction.

Directions for Future Research
Despite our progress to date, we need further work in order to:
• produce additional results on other ecosystem modeling tasks;
• develop improved methods for fitting model parameters;
• implement heuristic methods for searching the structure space;
• utilize knowledge of subsystems to further constrain search;
• augment the modeling environment to make it more usable.
Process modeling has great potential to aid model development in environmental science.

Contributions of the Research
In summary, our work on computational discovery has produced:
• a new formalism for representing scientific process models;
• an encoding for background knowledge as generic processes;
• an algorithm for revising process models with time-series data;
• an interactive environment for model construction/utilization.
We have demonstrated this approach to model revision on both ecosystem modeling and an environmental domain.
The PROMETHEUS modeling/revision environment is available at: http://www.isle.org/process.html

End of Presentation

The Challenge of Systems Science
Disciplines like Earth science differ from traditional disciplines by:
• focusing on synthesis rather than analysis in their operation;
• using computer modeling as one of their central methods;
• developing system-level models with many variables / relations;
• evaluating models on observational, not experimental, data.
Constructing such models is a complex task that would benefit from computational aids, but existing methods are insufficient.

Why Are Process Models Interesting?
Process models are a crucial target for machine learning because:
• they incorporate scientific formalisms rather than AI notations,
  which are easily communicable to scientists and engineers;
• they move beyond descriptive generalization to explanation,
  while retaining the modularity needed to support induction.
These reasons point to process models as an ideal representation for scientific and engineering knowledge.
Process models are an important alternative to formalisms used currently in machine learning.

Advantages of Quantitative Process Models
Process models offer scientists a promising framework because:
• they embed quantitative relations within qualitative structure,
  both of which refer to notations and mechanisms familiar to experts;
• they provide dynamical predictions of changes over time;
• they offer causal and explanatory accounts of phenomena,
  while retaining the modularity needed to support induction.
Quantitative process models provide an important alternative to formalisms used currently in ecosystem modeling.

Inductive Process Modeling
Our response is to design, construct, and evaluate computational methods for inductive process modeling, which:
• represent scientific models as sets of quantitative processes;
• use these models to predict and explain observational data;
• search a space of process models to find good candidates;
• utilize background knowledge to constrain this search.
This framework has great potential to aid environmental science, but it raises new computational challenges.

Challenges of Inductive Process Modeling
Process model induction differs from typical learning tasks in that:
• process models characterize behavior of dynamical systems;
• variables are continuous but can have discontinuous behavior;
• observations are not independently and identically distributed;
• models may contain unobservable processes and variables;
• multiple processes can interact to produce complex behavior.
Compensating factors include a focus on deterministic systems and the availability of background knowledge.

Generating Predictions and Explanations
To utilize or evaluate a given process model, we must simulate its behavior over time:
• specify initial values for input variables and time step size;
• on each time step, determine which processes are active;
• solve active algebraic/differential equations with known values;
• propagate values and recursively solve other active equations;
• when multiple processes influence the same variable, assume their effects are additive.
This performance method makes specific predictions that we can compare to observations.
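A minimal Python sketch of this simulation loop follows. It assumes forward-Euler integration with a fixed step and handles only differential equations (no algebraic propagation); the `(condition, effects)` encoding and the two-process toy model are hypothetical, not taken from the talk.

```python
def simulate(processes, state, steps, dt=1.0):
    """Forward-Euler simulation; each process is a (condition, effects) pair."""
    trajectory = [dict(state)]
    for _ in range(steps):
        rates = {name: 0.0 for name in state}
        for condition, effects in processes:
            if condition(state):  # only active processes contribute
                for variable, rate in effects(state).items():
                    rates[variable] += rate  # influences on a variable are additive
        state = {name: value + dt * rates[name] for name, value in state.items()}
        trajectory.append(dict(state))
    return trajectory


# Hypothetical toy model: loss is always active, uptake only while nitro > 0.1.
processes = [
    (lambda s: True,
     lambda s: {"phyto": -0.1 * s["phyto"], "residue": 0.1 * s["phyto"]}),
    (lambda s: s["nitro"] > 0.1,
     lambda s: {"phyto": 0.2 * s["phyto"], "nitro": -0.2 * s["phyto"]}),
]
trajectory = simulate(processes, {"phyto": 1.0, "nitro": 0.5, "residue": 0.0}, steps=5)
```

In this toy run the uptake process switches off once nitro falls to 0.1 or below, which gives the kind of discontinuous behavior in continuous variables that the challenges slide mentions.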

Generic Processes as Background Knowledge
Our framework casts background knowledge as generic processes that specify:
• the variables involved in a process and their types;
• the parameters appearing in a process and their ranges;
• the forms of conditions on the process; and
• the forms of associated equations and their parameters.
Generic processes are building blocks from which one can compose a specific process model.

Estimating Parameters in Process Models
To estimate the parameters for each generic model structure, the IPM algorithm:
1. Selects random initial values that fall within ranges specified in the generic processes;
2. Improves these parameters using the Levenberg-Marquardt method until it reaches a local optimum;
3. Generates new candidate values through random jumps along dimensions of the parameter vector and continues the search;
4. Restarts the search from a new random initial point if no improvement occurs after N jumps.
This multi-level method gives reasonable fits to time-series data from a number of domains, but it is computationally intensive.
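The control flow of this multi-level search can be sketched in Python for a one-parameter model. Note the substitution: a simple step-halving local search stands in for Levenberg-Marquardt, and the decay model, jump width, and restart counts are assumptions chosen only to make the example self-contained.

```python
import random


def squared_error(a, data):
    """Error of the model x[t] = (1 - a)**t against the observed series."""
    return sum(((1.0 - a) ** t - x) ** 2 for t, x in enumerate(data))


def local_search(a, data, low, high, step=0.1):
    """Stand-in for Levenberg-Marquardt: halve the step until no move helps."""
    best = squared_error(a, data)
    while step > 1e-6:
        moved = False
        for candidate in (a - step, a + step):
            if low <= candidate <= high:
                err = squared_error(candidate, data)
                if err < best:
                    a, best, moved = candidate, err, True
        if not moved:
            step /= 2.0
    return a, best


def estimate(data, low=0.0, high=1.0, restarts=3, max_failed_jumps=5, seed=0):
    rng = random.Random(seed)
    overall = (None, float("inf"))
    for _ in range(restarts):  # step 4: restart from a new random point
        a, err = local_search(rng.uniform(low, high), data, low, high)  # steps 1-2
        failed = 0
        while failed < max_failed_jumps:  # step 3: random jumps, then re-optimize
            jumped = min(high, max(low, a + rng.gauss(0.0, 0.2)))
            new_a, new_err = local_search(jumped, data, low, high)
            if new_err < err:
                a, err, failed = new_a, new_err, 0
            else:
                failed += 1
        if err < overall[1]:
            overall = (a, err)
    return overall


data = [(1.0 - 0.3) ** t for t in range(10)]  # synthetic series with a = 0.3
a_hat, error = estimate(data)
```

With a single well-behaved parameter the jumps and restarts are redundant; they matter when the error surface has many local optima, which is the situation the slide describes as computationally intensive.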

A Process Model for an Aquatic Ecosystem
model Ross_Sea_Ecosystem
  variables: phyto, nitro, residue, light, growth_rate, effective_light, ice_factor
  observables: phyto, nitro, light, ice_factor
process phyto_loss
  equations: d[phyto,t,1] = - 0.1 * phyto
             d[residue,t,1] = 0.1 * phyto
process phyto_growth
  equations: d[phyto,t,1] = growth_rate * phyto
process phyto_uptakes_nitro
  conditions: nitro > 0
  equations: d[nitro,t,1] = - 1 * 0.204 * growth_rate * phyto
process growth_limitation
  equations: growth_rate = 0.23 * min(nitrate_rate, light_rate)
process nitrate_availability
  equations: nitrate_rate = nitrate / (nitrate + 5)
process light_availability
  equations: light_rate = effective_light / (effective_light + 50)
process light_attenuation
  equations: effective_light = light * ice_factor


Inductive Process Modeling
[Diagram: training data and generic processes feed an Induction step, which outputs a process model.]
generic processes:
process exponential_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ∞] * P
process logistic_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ∞] * P * (1 - P / [0, 1, ∞])
process constant_inflow
  variables: I {inorganic_nutrient}
  equations: d[I,t] = [0, 1, ∞]
process consumption
  variables: P1 {population}, P2 {population}, nutrient_P2
  equations: d[P1,t] = [0, 1, ∞] * P1 * nutrient_P2,
             d[P2,t] = - [0, 1, ∞] * P1 * nutrient_P2
process no_saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P
process saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P / (P + [0, 1, ∞])
process model:
model AquaticEcosystem
  variables: nitro, phyto, zoo, nutrient_nitro, nutrient_phyto
  observables: nitro, phyto, zoo
process phyto_exponential_growth
  equations: d[phyto,t] = 0.1 * phyto
process zoo_logistic_growth
  equations: d[zoo,t] = 0.1 * zoo / (1 - zoo / 1.5)
process phyto_nitro_consumption
  equations: d[nitro,t] = -1 * phyto * nutrient_nitro,
             d[phyto,t] = 1 * phyto * nutrient_nitro
process phyto_nitro_no_saturation
  equations: nutrient_nitro = nitro
process zoo_phyto_consumption
  equations: d[phyto,t] = -1 * zoo * nutrient_phyto,
             d[zoo,t] = 1 * zoo * nutrient_phyto
process zoo_phyto_saturation
  equations: nutrient_phyto = phyto / (phyto + 0.5)

The NPPc Portion of CASA
NPPc = Σ_month max(E · IPAR, 0)
E = 0.56 · T1 · T2 · W
T1 = 0.8 + 0.02 · Topt – 0.0005 · Topt^2
T2 = 1.18 / [(1 + e^(0.2 · (Topt – Tempc – 10))) · (1 + e^(0.3 · (Tempc – Topt – 10)))]
W = 0.5 + 0.5 · EET / PET
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M if Tempc > 0
PET = 0 if Tempc < 0
A = 0.00000068 · AHI^3 – 0.000077 · AHI^2 + 0.018 · AHI + 0.49
IPAR = 0.5 · FPAR-FAS · Monthly-Solar · Sol-Conver
FPAR-FAS = min [(SR-FAS – 1.08) / SR (UMD-VEG), 0.95]
SR-FAS = (Mon-FAS-NDVI + 1000) / (Mon-FAS-NDVI – 1000)
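The efficiency terms of this model can be written out as a short Python sketch. It assumes the flattened slide text means squared and exponentiated terms (Topt squared, e raised to the bracketed expressions) and a sum over months of max(E · IPAR, 0); the function names are illustrative, and PET, IPAR, and the satellite-derived inputs are left out.

```python
import math


def t1(topt):
    """T1 = 0.8 + 0.02 * Topt - 0.0005 * Topt^2."""
    return 0.8 + 0.02 * topt - 0.0005 * topt ** 2


def t2(topt, tempc):
    """T2 = 1.18 / [(1 + e^(0.2 (Topt - Tempc - 10))) (1 + e^(0.3 (Tempc - Topt - 10)))]."""
    return 1.18 / ((1.0 + math.exp(0.2 * (topt - tempc - 10.0)))
                   * (1.0 + math.exp(0.3 * (tempc - topt - 10.0))))


def w(eet, pet):
    """W = 0.5 + 0.5 * EET / PET."""
    return 0.5 + 0.5 * eet / pet


def light_efficiency(topt, tempc, eet, pet):
    """E = 0.56 * T1 * T2 * W."""
    return 0.56 * t1(topt) * t2(topt, tempc) * w(eet, pet)


def nppc(monthly_e, monthly_ipar):
    """NPPc as the sum over months of max(E * IPAR, 0)."""
    return sum(max(e * ipar, 0.0) for e, ipar in zip(monthly_e, monthly_ipar))
```

At Tempc = Topt the T2 term is close to 0.99, so the constant 1.18 acts roughly as a normalizer for the two logistic factors.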

Results of Revising the NPP Model
Initial model:
  E = 0.56 · T1 · T2 · W
  T2 = 1.18 / [(1 + e^(0.2 · (Topt – Tempc – 10))) · (1 + e^(0.3 · (Tempc – Topt – 10)))]
  PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M
  SR ∈ {3.06, 4.35, 4.35, 4.05, 5.09, 3.06, 4.05, 4.05, 4.05, 5.09, 4.05}
  RMSE on training data = 465.212 and r^2 = 0.799
Revised model:
  E = 0.353 · T1^0.00 · T2^0.08 · W^0.00
  T2 = 0.83 / [(1 + e^(1.0 · (Topt – Tempc – 6.34))) · (1 + e^(1.0 · (Tempc – Topt – 11.52)))]
  PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M
  SR ∈ {0.61, 3.99, 2.44, 10.0, 2.21, 2.13, 2.04, 0.43, 1.35, 1.85, 1.61}
  Cross-validated RMSE = 397.306 and r^2 = 0.853 [15% reduction]

Generic Processes for Photosynthesis Regulation
generic process translation
  variables: P{protein}, M{mRNA}
  parameters: α [0, 1]
  equations: d[P,t,1] = α * M
generic process transcription
  variables: M{mRNA}, R{rate}
  parameters:
  equations: d[M,t,1] = R
generic process regulate_one
  variables: R{rate}, S{signal}
  parameters: μ [-1, 1]
  equations: R = μ * S
generic process regulate_two
  variables: R{rate}, S{signal}
  parameters: μ [-1, 1], ν [0, 1]
  equations: R = μ * S
             d[S,t,1] = -1 * ν * S
generic process automatic_degradation
  variables: C{concentration}
  conditions: C > 0
  parameters: α [0, 1]
  equations: d[C,t,1] = -1 * α * C
generic process controlled_degradation
  variables: D{concentration}, E{concentration}
  conditions: D > 0, E > 0
  parameters: α [0, 1]
  equations: d[D,t,1] = -1 * α * E
             d[E,t,1] = -1 * α * E
generic process photosynthesis
  variables: L{light}, P{protein}, R{redox}, S{ROS}
  parameters: α [0, 1], β [0, 1]
  equations: d[R,t,1] = α * L * P
             d[S,t,1] = β * L * P

A Process Model for Photosynthetic Regulation
model photo_regulation
  variables: light, mRNA, protein, ROS, redox, transcription_rate
  observables: light, mRNA
process photosynthesis
  equations: d[redox,t,1] = 0.0155 * light * protein
             d[ROS,t,1] = 0.019 * light * protein
process protein_translation
  equations: d[protein,t,1] = 7.54 * mRNA
process mRNA_transcription
  equations: d[mRNA,t,1] = transcription_rate
process regulate_one_1
  equations: transcription_rate = 0.99 * light
process regulate_two_2
  equations: transcription_rate = 1.203 * redox
             d[redox,t,1] = - 0.0002 * redox
process automatic_degradation_1
  conditions: protein > 0
  equations: d[protein,t,1] = - 1.91 * protein
process controlled_degradation_1
  conditions: redox > 0, ROS > 0
  equations: d[redox,t,1] = - 0.0003 * ROS
             d[ROS,t,1] = - 0.0003 * ROS
