Computational Discovery in Systems Science: Challenges and Opportunities

Challenges in the Computational Discovery of Scientific Knowledge
Pat Langley
Center for the Study of Language and Information, Stanford University, Stanford, California
http://cll.stanford.edu/~langley
langley@csli.stanford.edu
Thanks to K. Arrigo, S. Bay, L. Chrisman, D. George, A. Pohorille, C. Potter, J. Sanchez, K. Saito, and J. Shrager.

The Challenge of Systems Science Disciplines like Earth science and computational biology differ from traditional fields in that they: focus on synthesis rather than analysis in their operation; rely on computer modeling as one of their central methods; develop system-level models with many variables and relations; evaluate their models on observational, not experimental, data. Developing and testing such models are complex tasks that would benefit from computational aids. Our research goal is to design, construct, evaluate, and understand such computational tools for systems science.

The Data Mining Paradigm
One approach to computational discovery, known as data mining:
• emphasizes the availability of vast amounts of data;
• focuses on business data, with some scientific applications;
• uses formalisms like decision trees, association rules, and Bayesian networks to encode learned knowledge.
That is, data mining researchers favor their own formalisms over those used by scientists and engineers. As a result, their discoveries are seldom communicable to members of those communities.

Computational Scientific Discovery
An older paradigm, computational scientific discovery, instead:
• emphasizes use of heuristic search in the discovery process;
• focuses on discovery of knowledge in scientific domains;
• uses formalisms like numeric equations, structural models, and reaction pathways to describe regularities.
That is, researchers in this framework favor representations used by scientists and engineers. As a result, their systems’ discoveries are usually communicable to members of those communities.

Successes of Computational Scientific Discovery Over the past decade, systems of this type have helped discover new knowledge in many scientific fields: • qualitative chemical factors in mutagenesis (King et al., 1996) • quantitative laws of metallic behavior (Sleeman et al., 1997) • qualitative conjectures in number theory (Colton et al., 2000) • temporal laws of ecological behavior (Todorovski et al., 2000) • reaction pathways in catalytic chemistry (Valdes-Perez, 1994) Each has led to publications in the refereed scientific literature (e.g., Langley, 2000), but they did not focus on systems science.

Two Discovery Problems in Systems Science
Given: Data on climate and organism variables over space and time
Find: An ecosystem model that fits these data and explains them
Given: Time-series data on gene expressions for specific organisms
Find: A model of gene regulation that fits and explains these data
These problems raise new challenges that require advances in our methods for computational discovery.

Challenge 1: Representing Scientific Models To assist system scientists’ modeling efforts, we must first encode candidate models that: address observational rather than experimental data; deal with dynamic systems that change over time; have an explanatory rather than a descriptive character; are causal in that they describe chains of effects; contain quantitative relations and qualitative structure. We need some formal way to represent such models that can be interpreted computationally.

Why Are Existing Formalisms Inadequate?
[Figure: examples of four existing formalisms]
• regression trees (tests such as B>6, C>0, C>4, with numeric leaves 14.3, 18.7, 11.5, 16.9)
• systems of equations: d[ice_mass,t] = −(18 · heat) / 6.02, d[water_mass,t] = (18 · heat) / 6.02
• hidden Markov models (states annotated with variable values, transitions with probabilities such as 0.7, 0.3, and 1.0)
• Horn clause programs: gcd(X,X,X). gcd(X,Y,D) :- X<Y, Z is Y–X, gcd(X,Z,D). gcd(X,Y,D) :- Y<X, gcd(Y,X,D).

A Process Model for an Aquatic Ecosystem
model AquaticEcosystem
  variables: phyto, zoo, nitro, residue
  observables: phyto, nitro
  process phyto_exponential_decay
    equations: d[phyto,t,1] = −0.307 · phyto
               d[residue,t,1] = 0.307 · phyto
  process zoo_exponential_decay
    equations: d[zoo,t,1] = −0.251 · zoo
               d[residue,t,1] = 0.251 · zoo
  process zoo_phyto_predation
    equations: d[zoo,t,1] = 0.615 · 0.495 · zoo
               d[residue,t,1] = 0.385 · 0.495 · zoo
               d[phyto,t,1] = −0.495 · zoo
  process nitro_uptake
    conditions: nitro > 0
    equations: d[phyto,t,1] = 0.411 · phyto
               d[nitro,t,1] = −0.098 · 0.411 · phyto
  process nitro_remineralization
    equations: d[nitro,t,1] = 0.005 · residue
               d[residue,t,1] = −0.005 · residue

Advantages of Quantitative Process Models
Process models are a good target for discovery systems because:
• they embed quantitative relations within qualitative structure, using notations and mechanisms familiar to scientists;
• they provide dynamical predictions of changes over time;
• they offer causal and explanatory accounts of phenomena;
• they retain the modularity needed to support induction.
Quantitative process models provide an important alternative to the formalisms currently used in computational discovery.

Challenge 2: Making Predictions from Models
To utilize or evaluate a given process model, we must simulate its behavior over time:
• specify initial values for input variables and the time step size;
• on each time step, determine which processes are active;
• solve active algebraic/differential equations with known values;
• propagate values and recursively solve other active equations;
• when multiple processes influence the same variable, assume their effects are additive.
This performance method makes specific predictions that we can compare to observations.
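The simulation steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses forward-Euler integration, handles only differential equations (no algebraic propagation), and takes the coefficient signs from our reading of the aquatic ecosystem model. The names `Process`, `step`, and `simulate`, as well as the initial values and step size, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, float]


@dataclass
class Process:
    """One process: per-variable derivative contributions plus an optional condition."""
    name: str
    rates: Dict[str, Callable[[State], float]]
    condition: Optional[Callable[[State], bool]] = None

    def is_active(self, state: State) -> bool:
        return self.condition is None or self.condition(state)


def step(state: State, processes: List[Process], dt: float) -> State:
    """Advance one time step, summing the effects of all active processes."""
    deltas = {var: 0.0 for var in state}
    for proc in processes:
        if proc.is_active(state):
            for var, rate in proc.rates.items():
                deltas[var] += rate(state)  # effects on a variable are additive
    return {var: state[var] + dt * deltas[var] for var in state}


def simulate(initial: State, processes: List[Process], dt: float, n_steps: int) -> List[State]:
    """Return the trajectory of states, starting with the initial state."""
    trajectory = [dict(initial)]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1], processes, dt))
    return trajectory


# Aquatic ecosystem model; signs of the coefficients are assumed.
AQUATIC = [
    Process("phyto_exponential_decay", {
        "phyto": lambda s: -0.307 * s["phyto"],
        "residue": lambda s: 0.307 * s["phyto"]}),
    Process("zoo_exponential_decay", {
        "zoo": lambda s: -0.251 * s["zoo"],
        "residue": lambda s: 0.251 * s["zoo"]}),
    Process("zoo_phyto_predation", {
        "zoo": lambda s: 0.615 * 0.495 * s["zoo"],
        "residue": lambda s: 0.385 * 0.495 * s["zoo"],
        "phyto": lambda s: -0.495 * s["zoo"]}),
    Process("nitro_uptake", {
        "phyto": lambda s: 0.411 * s["phyto"],
        "nitro": lambda s: -0.098 * 0.411 * s["phyto"]},
        condition=lambda s: s["nitro"] > 0),
    Process("nitro_remineralization", {
        "nitro": lambda s: 0.005 * s["residue"],
        "residue": lambda s: -0.005 * s["residue"]}),
]

if __name__ == "__main__":
    start = {"phyto": 1.0, "zoo": 0.5, "nitro": 2.0, "residue": 0.0}  # illustrative values
    for t, s in enumerate(simulate(start, AQUATIC, dt=0.1, n_steps=3)):
        print(t, {k: round(v, 4) for k, v in s.items()})
```

Keeping each process as a separate object mirrors the modularity of the formalism: a process can be added, removed, or switched off by its condition without touching the others.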

Predictions from the Ecosystem Model

Challenge 3: Encoding Background Knowledge
To constrain candidate models, we can utilize available background knowledge about the domain. Previous work has cast background knowledge in terms of:
• Horn clause programs (e.g., Towell & Shavlik, 1990)
• context-free grammars (e.g., Dzeroski & Todorovski, 1997)
• prior probability distributions (e.g., Friedman et al., 2000)
However, none of these notations are familiar to domain scientists, which suggests the need for another approach.

Generic Processes as Background Knowledge Our framework casts background knowledge as generic processes that specify: the variables involved in a process and their types; the parameters appearing in a process and their ranges; the forms of conditions on the process; and the forms of associated equations and their parameters. Generic processes are building blocks from which one can compose a specific process model.
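The notion of a generic process described above can be sketched as a small data structure, together with a routine that finds every type-consistent way to bind its variables. This is an illustrative sketch only: the class `GenericProcess`, the function `instantiations`, the parameter name `alpha`, and the assignment of types to the ecosystem variables are assumptions, and the equation strings are descriptive rather than executable.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Dict, List, Tuple


@dataclass
class GenericProcess:
    name: str
    variables: List[Tuple[str, str]]            # (role, type) pairs
    parameters: Dict[str, Tuple[float, float]]  # parameter name -> (low, high)
    conditions: List[str]
    equations: List[str]


def instantiations(generic: GenericProcess,
                   typed_variables: Dict[str, str]) -> List[Dict[str, str]]:
    """Return all role -> variable bindings that satisfy the type constraints."""
    roles = [role for role, _ in generic.variables]
    types = [typ for _, typ in generic.variables]
    found = []
    for combo in permutations(typed_variables, len(roles)):
        if all(typed_variables[var] == typ for var, typ in zip(combo, types)):
            found.append(dict(zip(roles, combo)))
    return found


# Mirrors the exponential_decay generic process of the aquatic ecosystem example;
# the parameter name and equation signs are assumed.
EXPONENTIAL_DECAY = GenericProcess(
    name="exponential_decay",
    variables=[("S", "species"), ("D", "detritus")],
    parameters={"alpha": (0.0, 1.0)},
    conditions=[],
    equations=["d[S,t,1] = -alpha * S", "d[D,t,1] = alpha * S"],
)

# Assumed typing of the ecosystem variables.
TYPES = {"phyto": "species", "zoo": "species", "nitro": "nutrient", "residue": "detritus"}

if __name__ == "__main__":
    for binding in instantiations(EXPONENTIAL_DECAY, TYPES):
        print(EXPONENTIAL_DECAY.name, binding)
```

Typing the variables keeps the number of bindings small: here only the two species can fill the role S, and only the residue can fill the role D.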

Generic Processes for Aquatic Ecosystems
generic process exponential_decay
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = −1 · α · S
             d[D,t,1] = α · S
generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: π [0, 1]
  equations: d[N,t,1] = π · D
             d[D,t,1] = −1 · π · D
generic process predation
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ · ρ · S1
             d[D,t,1] = (1 − γ) · ρ · S1
             d[S2,t,1] = −1 · ρ · S1
generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν
generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], β [0, 1], μ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ · S
             d[N,t,1] = −1 · β · μ · S

Challenge 4: Inducing Process Models
[Diagram: training data and generic processes feed into Induction, which outputs a process model.]

Generic processes (input):
process exponential_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ∞] · P
process logistic_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ∞] · P · (1 − P / [0, 1, ∞])
process constant_inflow
  variables: I {inorganic_nutrient}
  equations: d[I,t] = [0, 1, ∞]
process consumption
  variables: P1 {population}, P2 {population}, nutrient_P2
  equations: d[P1,t] = [0, 1, ∞] · P1 · nutrient_P2
             d[P2,t] = −[0, 1, ∞] · P1 · nutrient_P2
process no_saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P
process saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P / (P + [0, 1, ∞])

Process model (output):
model AquaticEcosystem
  variables: nitro, phyto, zoo, nutrient_nitro, nutrient_phyto
  observables: nitro, phyto, zoo
  process phyto_exponential_growth
    equations: d[phyto,t] = 0.1 · phyto
  process zoo_logistic_growth
    equations: d[zoo,t] = 0.1 · zoo / (1 − zoo / 1.5)
  process phyto_nitro_consumption
    equations: d[nitro,t] = −1 · phyto · nutrient_nitro
               d[phyto,t] = 1 · phyto · nutrient_nitro
  process phyto_nitro_no_saturation
    equations: nutrient_nitro = nitro
  process zoo_phyto_consumption
    equations: d[phyto,t] = −1 · zoo · nutrient_phyto
               d[zoo,t] = 1 · zoo · nutrient_phyto
  process zoo_phyto_saturation
    equations: nutrient_phyto = phyto / (phyto + 0.5)

A Method for Process Model Induction
The IPM algorithm constructs process models from generic components in four stages:
1. Find all ways to instantiate known generic processes with specific variables, subject to type constraints;
2. Combine instantiated processes into candidate generic models that specify explanatory structures, with limits on the total number of processes;
3. For each generic model, carry out gradient descent search through parameter space to find good parameter values;
4. Return the parameterized model with the lowest description length: M_d = (M_v + M_c) · log(n) + n · log(M_e).
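The final selection stage can be illustrated with a short Python sketch of the description-length score. The slide does not define the symbols, so this sketch assumes that M_v is the number of variables, M_c the number of constants (parameters), n the number of observations, M_e the mean squared error, and that log is the natural logarithm. The names `Candidate`, `description_length`, and `select_model`, and the candidate values, are hypothetical.

```python
import math
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    n_variables: int   # assumed meaning of M_v
    n_constants: int   # assumed meaning of M_c
    mse: float         # assumed meaning of M_e (must be > 0)


def description_length(n_variables: int, n_constants: int, n_obs: int, mse: float) -> float:
    """M_d = (M_v + M_c) * log(n) + n * log(M_e), using the natural log."""
    return (n_variables + n_constants) * math.log(n_obs) + n_obs * math.log(mse)


def select_model(candidates: List[Candidate], n_obs: int) -> Candidate:
    """Return the candidate with the lowest description length."""
    return min(
        candidates,
        key=lambda c: description_length(c.n_variables, c.n_constants, n_obs, c.mse),
    )


if __name__ == "__main__":
    # Made-up candidates: the larger model fits slightly better but pays for extra constants.
    models = [
        Candidate("simple", n_variables=4, n_constants=6, mse=0.020),
        Candidate("complex", n_variables=4, n_constants=12, mse=0.019),
    ]
    for m in models:
        print(m.name, round(description_length(m.n_variables, m.n_constants, 500, m.mse), 2))
    print("selected:", select_model(models, n_obs=500).name)
```

The first term penalizes model size and the second rewards fit, so a more complex structure is preferred only when its error reduction outweighs the cost of its additional constants.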

Evaluation of the IPM Algorithm
To demonstrate IPM's ability to induce process models, we ran it on synthetic data for a known system:
1. We used the aquatic ecosystem model to generate data on 500 time steps for the variables nitro and phyto;
2. We replaced each ‘true’ value x with x · (1 + r · 0.05), where r came from a Gaussian distribution (μ = 0 and σ = 1);
3. We ran IPM on these noisy data, giving it type constraints and generic processes as background knowledge.
The IPM algorithm examined a space of 256 generic models, each with an embedded parameter optimization.
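The noise step in this evaluation is simple to reproduce. The sketch below applies x · (1 + r · 0.05) with r drawn from a standard Gaussian; the function name `add_relative_noise`, the seed handling, and the toy input series are assumptions made for illustration.

```python
import random
from typing import List, Optional


def add_relative_noise(values: List[float], level: float = 0.05,
                       seed: Optional[int] = None) -> List[float]:
    """Replace each value x with x * (1 + r * level), where r ~ N(0, 1)."""
    rng = random.Random(seed)
    return [x * (1.0 + rng.gauss(0.0, 1.0) * level) for x in values]


if __name__ == "__main__":
    clean = [1.0, 2.0, 4.0, 8.0]           # stand-in for simulated values
    noisy = add_relative_noise(clean, seed=42)
    for x, y in zip(clean, noisy):
        print(f"{x:.3f} -> {y:.3f}")
```

Because the noise is multiplicative, its magnitude scales with each value: the standard deviation of the perturbation is 5% of x.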

Predictions from Induced Ecosystem Model

Issues in Process Model Induction Inductive process modeling raises a number of issues that have clear analogues in other paradigms: identifying conditions on processes (parameter optimization) inferring initial values of unobservables (parameter optimization) keeping the search space tractable (typing on variables) reducing variance to mitigate overfitting (min. desc. length) We have demonstrated promising responses to these four problems within the IPM framework.

Best Model Fit to Data from Ross Sea

Best Model Fit to Data on Protozoan Predation

Collecting Data on Photosynthetic Processes
[Figure: a continuous culture (chemostat) is exposed to external stimuli (e.g., light); the health of the culture is tracked over time through an equilibrium period and an adaptation period, during which mRNA/cDNA is sampled and measured as a microarray trace. Image sources: www.affymetrix.com/, /wwwscience.murdoch.edu.au/teach]

Gene Expressions for Cyanobacteria

Generic Processes for Photosynthesis Regulation
generic process translation
  variables: P{protein}, M{mRNA}
  parameters: α [0, 1]
  equations: d[P,t,1] = α · M
generic process transcription
  variables: M{mRNA}, R{rate}
  parameters:
  equations: d[M,t,1] = R
generic process regulate_one
  variables: R{rate}, S{signal}
  parameters: β [−1, 1]
  equations: R = β · S
generic process regulate_two
  variables: R{rate}, S{signal}
  parameters: β [−1, 1], γ [0, 1]
  equations: R = β · S
             d[S,t,1] = −1 · γ · S
generic process automatic_degradation
  variables: C{concentration}
  conditions: C > 0
  parameters: δ [0, 1]
  equations: d[C,t,1] = −1 · δ · C
generic process controlled_degradation
  variables: D{concentration}, E{concentration}
  conditions: D > 0, E > 0
  parameters: δ [0, 1]
  equations: d[D,t,1] = −1 · δ · E
             d[E,t,1] = −1 · δ · E
generic process photosynthesis
  variables: L{light}, P{protein}, R{redox}, S{ROS}
  parameters: ρ [0, 1], σ [0, 1]
  equations: d[R,t,1] = ρ · L · P
             d[S,t,1] = σ · L · P

A Process Model for Photosynthetic Regulation
model photo_regulation
  variables: light, mRNA, protein, ROS, redox, transcription_rate
  observables: light, mRNA
  process photosynthesis
    equations: d[redox,t,1] = 0.0155 · light · protein
               d[ROS,t,1] = 0.019 · light · protein
  process protein_translation
    equations: d[protein,t,1] = 7.54 · mRNA
  process mRNA_transcription
    equations: d[mRNA,t,1] = transcription_rate
  process regulate_one_1
    equations: transcription_rate = 0.99 · light
  process regulate_two_2
    equations: transcription_rate = 1.203 · redox
               d[redox,t,1] = 0.0002 · redox
  process automatic_degradation_1
    conditions: protein > 0
    equations: d[protein,t,1] = −1.91 · protein
  process controlled_degradation_1
    conditions: redox > 0, ROS > 0
    equations: d[redox,t,1] = −0.0003 · ROS
               d[ROS,t,1] = −0.0003 · ROS

Predictions from Best Parameterized Model

Electric Power on the International Space Station

Telemetry Data from Space Station Batteries

Induced Process Model for Battery Behavior
model Battery
  variables: Rs, Vcb, soc, Vt, i, temperature
  observables: soc, Vt, i, temperature
  process voltage_charge
    conditions: i ≥ 0
    equations: Vt = Vcb + 6.105 · Rs · i
  process voltage_discharge
    conditions: i < 0
    equations: Vt = Vcb · 1.0 / (Rs + 1.0)
  process charge_transfer
    equations: d[soc,t,1] = i − Vcb/179.38
  process quadratic_influence_Vcb_soc
    equations: Vcb = 41.32 · soc · soc
  process linear_influence_Vcb_temp
    equations: Vcb = 0.2592 · temperature
  process linear_influence_Rs_soc
    equations: Rs = 0.03894 · soc

Results on Battery Test Data

Steps in Applying Computational Scientific Discovery
• problem formulation
• representation engineering
• algorithm manipulation
• data collection/manipulation
• algorithm invocation
• filtering and interpretation

Challenge 5: Interfacing with Scientists Because scientists do not want to be replaced, we are developing an interactive environment that lets users: specify a quantitative process model of the target system; display and edit the model’s structure and details graphically; simulate the model’s behavior over time and situations; compare the model’s predicted behavior to observations; invoke a revision module in response to detected anomalies. The environment offers computational assistance in forming and evaluating models but lets the user retain control.

Viewing and Editing a Process Model

Results of Revising the NPP Model
Initial model:
E = 0.56 · T1 · T2 · W
T2 = 1.18 / [(1 + e^(0.2 · (Topt – Tempc – 10))) · (1 + e^(0.3 · (Tempc – Topt – 10)))]
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M
SR ∈ {3.06, 4.35, 4.35, 4.05, 5.09, 3.06, 4.05, 4.05, 4.05, 5.09, 4.05}
RMSE on training data = 465.212 and r^2 = 0.799
Revised model:
E = 0.353 · T1^0.00 · T2^0.08 · W^0.00
T2 = 0.83 / [(1 + e^(1.0 · (Topt – Tempc – 6.34))) · (1 + e^(1.0 · (Tempc – Topt – 11.52)))]
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M
SR ∈ {0.61, 3.99, 2.44, 10.0, 2.21, 2.13, 2.04, 0.43, 1.35, 1.85, 1.61}
Cross-validated RMSE = 397.306 and r^2 = 0.853 [15% reduction]
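The reported reduction can be checked from the two RMSE values on this slide. The sketch below shows how RMSE is computed and how the relative reduction follows; the function names `rmse` and `relative_reduction` are hypothetical, and the comparison simply uses the two figures as reported (training RMSE for the initial model, cross-validated RMSE for the revised one).

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between observed and predicted values."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    # RMSE values reported for the initial and revised NPP models.
    print(f"{relative_reduction(465.212, 397.306):.1%}")
```

The computed reduction is about 14.6%, which the slide rounds to 15%.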

Intellectual Influences
Our approach to computational discovery incorporates ideas from many traditions:
• computational scientific discovery (e.g., Langley et al., 1983);
• theory revision in machine learning (e.g., Towell, 1991);
• qualitative physics and simulation (e.g., Forbus, 1984);
• languages for scientific simulation (e.g., STELLA, MATLAB);
• interactive tools for data analysis (e.g., Shneiderman, 2001).
Our work combines, in novel ways, insights from machine learning, AI, programming languages, and human-computer interaction.

Contributions of the Research
In summary, our work on computational scientific discovery has, in responding to five challenges, produced:
• a new formalism for representing scientific process models;
• a computational method for simulating these models’ behavior;
• an encoding for background knowledge as generic processes;
• an algorithm for inducing process models from time-series data;
• an interactive environment for model construction/utilization.
We have demonstrated this approach to model construction on four domains from Earth science, microbiology, and engineering.

Directions for Future Research
Despite our progress to date, we need further work in order to:
• produce additional results on other scientific data sets;
• develop more robust methods for fitting model parameters;
• extend the approach to handle data sets with missing values;
• implement heuristic methods for searching the model space;
• utilize knowledge of subsystems to further constrain search;
• augment the modeling environment to make it more usable.
Inductive process modeling has great potential to speed progress in systems science and engineering.

End of Presentation

In Memoriam
Two years ago, computational scientific discovery lost two of its founding fathers: Herbert A. Simon (1916 – 2001) and Jan M. Zytkow (1945 – 2001). Both contributed to the field in many ways: posing new problems, inventing methods, training students, and organizing meetings. Moreover, both were interdisciplinary researchers who contributed to computer science, psychology, philosophy, and statistics. Herb Simon and Jan Zytkow were excellent role models that we should all aim to emulate.

Data Mining vs. Scientific Discovery There exist two computational paradigms for discovering explicit knowledge from data: Data mining generates knowledge cast as decision trees, logical rules, or other notations invented by AI researchers; Computational scientific discovery instead uses equations, structural models, reaction pathways, or other formalisms invented by scientists and engineers. Both approaches draw on heuristic search to find regularities in data, but they differ considerably in their emphases.

Time Line for Research on Computational Scientific Discovery
[Figure: time line from 1979 to 2000 of discovery systems, with a legend distinguishing numeric laws, qualitative laws, structural models, and process models. Systems shown: Bacon.1–Bacon.5; Abacus, Coper; Fahrenheit, E*, Tetrad, IDSN; Hume, ARC; DST, GPN; LaGrange; SDS; SSF, RF5, LaGramge; AM; Glauber; NGlauber; IDSQ, Live; RL, Progol; HR; Dendral; Dalton, Stahl; Stahlp, Revolver; Gell-Mann; BR-3, Mendel; Pauli; BR-4; IE; Coast, Phineas, AbE, Kekada; Mechem, CDP; Astra, GPM.]

Why Are Process Models Interesting?
Process models are a crucial target for machine learning because:
• they incorporate scientific formalisms rather than AI notations, which makes them easily communicable to scientists and engineers;
• they move beyond descriptive generalization to explanation;
• they retain the modularity needed to support induction.
These reasons point to process models as an ideal representation for scientific and engineering knowledge. Process models are an important alternative to the formalisms currently used in machine learning.

Challenges of Inductive Process Modeling Process model induction differs from typical learning tasks in that: process models characterize behavior of dynamical systems; variables are mainly continuous and data are unsupervised; observations are not independently and identically distributed; process models contain unobservable processes and variables; multiple processes can interact to produce complex behavior. Compensating factors include a focus on deterministic systems and the availability of background knowledge.

Making Predictions with Process Models
1. Specify initial values for input variables and the size of the time steps.
2. On each time step, check conditions to decide which processes are active.
3. Solve algebraic and differential equations with known values.
4. Propagate values and recurse to solve other equations.
5. Add the effects of different processes on each variable.

Predictions from IPM’s Induced Model

Best Model Fit to Actual Nitrate Data

Best Model Fit to Actual Phytoplankton Data

Inductive Process Modeling
[Diagram: training data and background knowledge feed into Induction, which outputs a learned model.]
• training data: Observed values for a set of continuous variables as they vary over time or situations
• background knowledge: Generic processes that characterize causal relationships among variables in terms of conditional equations
• learned model: A specific process model that explains the observed values and predicts future data accurately

Inductive Process Modeling as Search
To construct a quantitative process model, we need an algorithm to search the space of models that assumes:
• an initial state from which to start search;
• some operators that generate new states;
• an evaluation function that selects among states;
• an overall control regime for the search; and
• a halting criterion for ending the search.
We have implemented a four-stage method that takes positions on these design decisions.
