Inducing Process Models from Continuous Data. Pat Langley, Institute for the Study of Learning and Expertise; Javier Sanchez, CSLI / Stanford University; Ljupco Todorovski and Saso Dzeroski, Jozef Stefan Institute.


Presentation Transcript


  1. Inducing Process Models from Continuous Data Pat Langley Institute for the Study of Learning and Expertise Javier Sanchez CSLI / Stanford University Ljupco Todorovski Saso Dzeroski Jozef Stefan Institute Supported by NTT Communication Science Laboratories, by Grant NCC 2-1220 from NASA Ames Research Center, and by EU Grant IST-2000-26469.

  2. Exploratory Research in Machine Learning Dietterich (1990) claims an exploratory research report should: define a challenging new problem for machine learning; show that established methods cannot solve the problem; present an initial approach that addresses the new task; and outline an agenda for future research efforts in the area. In this talk, we explore the problem of inducing process models from continuous data.

  3. Inductive Process Modeling. [Diagram: training data and background knowledge, in the form of generic processes, feed an induction component that produces learned knowledge, a process model. The diagram is illustrated with the aquatic ecosystem process model and the generic population-dynamics processes reproduced on slides 11 and 12.]

  4. Inductive Process Modeling. Training data: observed values for a set of continuous variables as they vary over time or situations. Background knowledge: generic processes that characterize causal relationships among variables in terms of conditional equations. Induction produces the learned model: a specific process model that explains the observed values and predicts future data accurately.
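  One convenient way to picture these three ingredients is as simple data structures. The sketch below is illustrative only and is not the authors' implementation; the class and field names are our own assumptions.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class GenericProcess:
        """Background knowledge: a process schema with typed variable slots
        and parameters constrained to lie in given ranges."""
        name: str
        variable_types: Dict[str, str]                 # slot name -> type, e.g. {"P": "population"}
        param_ranges: Dict[str, Tuple[float, float]]   # parameter -> (low, high)

    @dataclass
    class ProcessInstance:
        """A generic process bound to specific system variables and,
        after fitting, to numeric parameter values."""
        schema: GenericProcess
        bindings: Dict[str, str]                       # slot name -> concrete variable
        parameters: Dict[str, float] = field(default_factory=dict)

    @dataclass
    class ProcessModel:
        """Learned knowledge: a set of instantiated processes over the
        system variables, some of which are observable."""
        variables: List[str]
        observables: List[str]
        processes: List[ProcessInstance]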

  5. A Process Model of an Ice-Water System
  [Plot: temp, ice_mass, and water_mass plotted against time.]
  model WaterPhaseChange
    variables: temp, heat, ice_mass, water_mass
    observables: temp, heat, ice_mass, water_mass
    process ice-warming
      conditions: ice_mass > 0, temp < 0
      equations: d[temp,t] = heat / (0.00206 * ice_mass)
    process ice-melting
      conditions: ice_mass > 0, temp == 0
      equations: d[ice_mass,t] = -(18 * heat) / 6.02,
                 d[water_mass,t] = (18 * heat) / 6.02
    process water-warming
      conditions: ice_mass == 0, water_mass > 0, temp >= 0, temp < 100
      equations: d[temp,t] = heat / (0.004184 * water_mass)
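  To make the conditional structure concrete, here is a minimal forward simulation of this model using fixed-step Euler integration. It is a sketch under our own assumptions (constant heat input, an arbitrary step size, and a clamp to keep the temperature at zero while ice remains), not the authors' simulator.

    # Minimal Euler simulation of the WaterPhaseChange model (illustrative sketch).
    def simulate(temp, heat, ice_mass, water_mass, dt=0.1, steps=1000):
        trajectory = []
        for _ in range(steps):
            d_temp = d_ice = d_water = 0.0
            # ice-warming: active while ice remains and temperature is below 0
            if ice_mass > 0 and temp < 0:
                d_temp += heat / (0.00206 * ice_mass)
            # ice-melting: active while ice remains and temperature sits at 0
            if ice_mass > 0 and temp == 0:
                d_ice += -(18 * heat) / 6.02
                d_water += (18 * heat) / 6.02
            # water-warming: active once all ice is gone, below boiling
            if ice_mass == 0 and water_mass > 0 and 0 <= temp < 100:
                d_temp += heat / (0.004184 * water_mass)
            temp += dt * d_temp
            ice_mass = max(0.0, ice_mass + dt * d_ice)
            water_mass += dt * d_water
            # assumption: hold temp at 0 while ice remains, so the discrete-time
            # simulation can satisfy the exact condition temp == 0
            if ice_mass > 0 and temp > 0:
                temp = 0.0
            trajectory.append((temp, ice_mass, water_mass))
        return trajectory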

  6. Why Are Process Models Interesting? Process models are a crucial target for machine learning because: they incorporate scientific formalisms, rather than AI notations, and so are easily communicable to scientists and engineers; they move beyond descriptive generalization to explanation; and they retain the modularity needed to support induction. These reasons point to process models as an ideal representation for scientific and engineering knowledge, and an important alternative to the formalisms currently used in machine learning.

  7. Challenges of Inductive Process Modeling Process model induction differs from typical learning tasks in that: process models characterize behavior of dynamical systems; variables are mainly continuous and data are unsupervised; observations are not independently and identically distributed; process models contain unobservable processes and variables; multiple processes can interact to produce complex behavior. Compensating factors include a focus on deterministic systems and the availability of background knowledge.

  8. Can Existing Methods Induce Process Models? [Diagram contrasting existing paradigms and their outputs: regression trees, equation discovery (differential equations such as d[ice_mass,t] = -(18 * heat) / 6.02), explanation-based learning, hidden Markov models, and inductive logic programming (e.g., a Prolog program for gcd/3).]

  9. Facets of Inductive Process Modeling To describe a system that learns process models, we must specify: characteristics of the data (observations to be explained); a representation for background knowledge (generic processes); a representation for learned knowledge (process models); a performance element that makes predictions (a simulator); a learning method that induces process models. We will use an example from population dynamics to illustrate an initial approach to inductive process modeling.

  10. Data for an Aquatic Ecosystem

  11. Generic Processes for Population Dynamics
  process exponential_growth
    variables: P {population}
    equations: d[P,t] = [0, 1, ∞] * P
  process exponential_decay
    variables: P {population}
    equations: d[P,t] = -[0, 1, ∞] * P
  process logistic_growth
    variables: P {population}
    equations: d[P,t] = [0, 1, ∞] * P * (1 - P / [0, 1, ∞])
  process constant_inflow
    variables: I {inorganic_nutrient}
    equations: d[I,t] = [0, 1, ∞]
  process consumption
    variables: P1 {population}, P2 {population}, nutrient_P2 {number}
    equations: d[P1,t] = [0, 1, ∞] * P1 * nutrient_P2,
               d[P2,t] = -[0, 1, ∞] * P1 * nutrient_P2
  process no_saturation
    variables: P {number}, nutrient_P {number}
    equations: nutrient_P = P
  process saturation
    variables: P {number}, nutrient_P {number}
    equations: nutrient_P = P / (P + [0, 1, ∞])

  12. Process Model for an Aquatic Ecosystem
  model AquaticEcosystem
    variables: nitro, phyto, zoo, nutrient_nitro, nutrient_phyto
    observables: nitro, phyto, zoo
    process phyto_exponential_growth
      equations: d[phyto,t] = 0.1 * phyto
    process zoo_logistic_growth
      equations: d[zoo,t] = 0.1 * zoo * (1 - zoo / 1.5)
    process phyto_nitro_consumption
      equations: d[nitro,t] = -1 * phyto * nutrient_nitro,
                 d[phyto,t] = 1 * phyto * nutrient_nitro
    process phyto_nitro_no_saturation
      equations: nutrient_nitro = nitro
    process zoo_phyto_consumption
      equations: d[phyto,t] = -1 * zoo * nutrient_phyto,
                 d[zoo,t] = 1 * zoo * nutrient_phyto
    process zoo_phyto_saturation
      equations: nutrient_phyto = phyto / (phyto + 0.5)

  13. Making Predictions with Process Models. Specify initial values for input variables and the size of the time steps; on each time step, check conditions to decide which processes are active; solve algebraic and differential equations with known values; propagate values and recurse to solve other equations; and add the effects of different processes on each variable.
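  As a concrete illustration of the last point (additive effects), the snippet below evaluates one Euler step of d[phyto,t] in the aquatic ecosystem model from slide 12, using the initial values given later on slide 15; the step size of 0.1 is our own assumption.

    # One Euler step for phyto in the AquaticEcosystem model (illustrative).
    nitro, phyto, zoo, dt = 1.0, 0.01, 0.01, 0.1

    nutrient_nitro = nitro                       # phyto_nitro_no_saturation
    nutrient_phyto = phyto / (phyto + 0.5)       # zoo_phyto_saturation

    # Each active process contributes additively to d[phyto,t].
    d_phyto = (0.1 * phyto                       # phyto_exponential_growth
               + 1 * phyto * nutrient_nitro      # phyto_nitro_consumption
               - 1 * zoo * nutrient_phyto)       # zoo_phyto_consumption

    phyto_next = phyto + dt * d_phyto            # approx. 0.01108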

  14. The IPM Method for Process Model Induction. Find all ways to instantiate known generic processes with specific variables; combine subsets of instantiated processes into generic models; remove candidates that are too complex or not connected graphs; for each generic model, search for good parameter values; and return the parameterized model with the smallest error.

  15. Initial Evaluation of IPM Algorithm. To demonstrate IPM's functionality at inducing process models, we ran it on synthetic data for a known system. 1. We used the aquatic ecosystem model to generate data for 100 time steps, setting nitrogen = 1.0, phyto = 0.01, zoo = 0.01; 2. We replaced each ‘true’ value x with x(1 + r * 0.05), where r came from a Gaussian distribution (μ = 0 and σ = 1); 3. We ran IPM on these noisy data, giving it type constraints and generic processes as background knowledge. The IPM algorithm examined a space of 2196 generic models, each with an embedded parameter optimization.
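  A minimal sketch of the noise-injection step described in item 2, assuming the simulated trajectories are stored as a NumPy array; the array layout and the seed are our own choices.

    import numpy as np

    rng = np.random.default_rng(0)               # assumed seed, for reproducibility

    def add_noise(true_values: np.ndarray) -> np.ndarray:
        """Replace each true value x with x * (1 + 0.05 * r), where r ~ N(0, 1)."""
        r = rng.standard_normal(true_values.shape)
        return true_values * (1 + 0.05 * r)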

  16. Predictions from IPM’s Induced Model

  17. Process Model Generated by IPM
  model AquaticEcosystem
    variables: nitro, phyto, zoo, nutrient_nitro_1, nutrient_nitro_2, nutrient_phyto
    observables: nitro, phyto, zoo
    process phyto_exponential_growth
      equations: d[phyto,t] = 0.089 * phyto
    process zoo_logistic_growth
      equations: d[zoo,t] = 0.013 * zoo * (1 - zoo / 0.469)
    process phyto_nitro_consumption
      equations: d[nitro,t] = -1.174 * phyto * nutrient_nitro_1,
                 d[phyto,t] = 1.058 * phyto * nutrient_nitro_1
    process phyto_nitro_no_saturation
      equations: nutrient_nitro_1 = nitro
    process zoo_phyto_consumption
      equations: d[phyto,t] = -0.986 * zoo * nutrient_phyto,
                 d[zoo,t] = 1.089 * zoo * nutrient_phyto
    process zoo_phyto_saturation
      equations: nutrient_phyto = phyto / (phyto + 0.487)

  18. Process Model Generated by IPM (continued)
    process nitro_constant_inflow
      equations: d[nitro,t] = 0.067
    process zoo_nitro_consumption
      equations: d[nitro,t] = -0.470 * zoo * nutrient_nitro_2,
                 d[zoo,t] = 1.089 * zoo * nutrient_nitro_2
    process zoo_nitro_saturation
      equations: nutrient_nitro_2 = nitro / (nitro + 0.020)
  These extra processes complicate the model but have little effect on its behavior or its predictive accuracy.

  19. A Proposed Research Agenda. Future research on process modeling should explore methods that: reduce variance and overfitting (e.g., through pruning); determine the conditions on processes from training data; associate variables with physical entities to constrain search; use a taxonomy of process types to organize and limit search; use knowledge of dimensions and conservation to limit search; support the induction of qualitative process models; and revise existing process models rather than construct them from scratch. This work should draw on traditional induction methods, which offer many relevant ideas.

  20. Evaluation of Process Models Research on this new class of problems should follow the accepted standards; thus, papers should: make explicit claims about an induction method's abilities; support these claims with experimental or theoretical evidence; study behavior on natural data sets to ensure relevance; utilize synthetic data sets to vary dimensions of interest; and incorporate ideas from other tasks and utilize existing methods whenever sensible. In addition, the focus on communicability and use of background knowledge suggests collaborations with domain experts.

  21. Concluding Remarks. In this exploratory research contribution, we have: proposed a new problem that involves induction of process models from components to explain observations; argued that this task does not lend itself to established methods; proposed a formalism for models and background knowledge; presented an initial system that induces such process models; demonstrated its functionality in a population dynamics domain; and outlined an agenda for future research in this new area. Process model induction has great potential to aid the development of models in science and engineering.

  22. In Memoriam. Early last year, computational scientific discovery lost two of its founding fathers: Herbert A. Simon (1916 – 2001) and Jan M. Zytkow (1945 – 2001). Both contributed to the field in many ways: posing new problems, inventing methods, training students, and organizing meetings. Moreover, both were interdisciplinary researchers who contributed to computer science, psychology, philosophy, and statistics. Herb Simon and Jan Zytkow were excellent role models whom we should all aim to emulate.

  23. The LaGramge Discovery System Our approach to inductive process modeling builds on LaGramge (Todorovski & Dzeroski, 1997), a discovery system that: • specifies a space of abstract numeric equations in terms of a context-free grammar; • searches exhaustively through this space, to a given depth, to generate candidate abstract equations; • calls on established optimization techniques to determine the parameters for each equation; and • uses either squared error or minimum description length to select its final equations. LaGramge has rediscovered an impressive class of differential and algebraic equations from noisy data.
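  The grammar-based search that LaGramge performs can be illustrated with a toy enumerator: it expands a small context-free grammar over expressions up to a fixed depth, and each resulting structure would then be handed to a parameter optimizer. This is a sketch of the general idea only; the grammar, depth limit, and representation are our own assumptions and do not reproduce the actual LaGramge system.

    from itertools import product

    # Toy grammar for right-hand sides of d[P,t]:  E -> E + E | E * E | var | const
    VARIABLES = ["nitro", "phyto", "zoo"]

    def expand(depth):
        """Enumerate expression structures up to the given depth.
        'C' marks a constant whose value a later optimization step would fit."""
        if depth == 0:
            return VARIABLES + ["C"]
        subexprs = expand(depth - 1)
        exprs = list(subexprs)
        for op in ["+", "*"]:
            for left, right in product(subexprs, repeat=2):
                exprs.append(f"({left} {op} {right})")
        return exprs

    candidates = expand(1)
    print(len(candidates), candidates[:5])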

  24. Making Predictions with Process Models To simulate a given process model’s behavior over time, we can: specify initial values for input variables and time step size; on each time step, determine which processes are active; solve active algebraic/differential equations with known values; propagate values and recursively solve other active equations; when multiple processes influence the same variable, assume their effects are additive. This performance element makes specific predictions that we can compare to observations.

  25. A Method for Process Model Induction. We have implemented IPM, an algorithm that constructs process models from generic components in four stages: 1. Find all ways to instantiate known generic processes with specific variables; 2. Combine subsets of instantiated processes into generic models, each specifying an explanatory structure; 2a. Ensure that each candidate consists of a connected graph; 2b. Limit the maximum number of processes that can connect any two variables and the total number of processes; 3. Translate the candidate into a context-free grammar and invoke LaGramge to search for good parameter values; 4. Return the model with the least error produced by LaGramge.
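  A compact sketch of how stages 1 and 2 might be realized on top of the data structures sketched after slide 4; the helper names, the simplified connectivity test, and the size limit are our own assumptions, and stage 3 is left as a stub rather than an actual call to LaGramge.

    from itertools import combinations, product
    # uses GenericProcess and ProcessInstance from the earlier sketch

    def instantiate(schema, typed_variables):
        """Stage 1: bind each typed slot of a generic process to every
        compatible system variable (cartesian product over the slots)."""
        slots = list(schema.variable_types.items())
        choices = [[v for v, t in typed_variables.items() if t == slot_type]
                   for _, slot_type in slots]
        for combo in product(*choices):
            if len(set(combo)) == len(combo):      # do not bind two slots to one variable
                yield ProcessInstance(schema, dict(zip([s for s, _ in slots], combo)))

    def is_connected(instances, observables):
        """Stage 2a (simplified): every variable mentioned by some process
        must be reachable from the observables via shared processes."""
        reached, changed = set(observables), True
        while changed:
            changed = False
            for inst in instances:
                used = set(inst.bindings.values())
                if used & reached and not used <= reached:
                    reached |= used
                    changed = True
        return all(set(i.bindings.values()) <= reached for i in instances)

    def candidate_models(all_instances, observables, max_processes=8):
        """Stage 2: enumerate subsets of instantiated processes, keeping only
        structures that are small enough and connected."""
        for k in range(1, max_processes + 1):
            for subset in combinations(all_instances, k):
                if is_connected(subset, observables):
                    yield list(subset)

    def fit_parameters(model_structure, data):
        """Stage 3 stub: in IPM this step translates the structure into a
        context-free grammar and calls LaGramge to fit parameter values."""
        raise NotImplementedError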
