Pat Langley Computational Learning Laboratory Center for the Study of Language and Information

Knowledge, Data, and Search in Computational Discovery
Pat Langley
Computational Learning Laboratory
Center for the Study of Language and Information
Stanford University, Stanford, California
http://cll.stanford.edu/~langley
Thanks to Kevin Arrigo, Stuart Borrett, Will Bridewell, and Ljupco Todorovski for their contributions to this work, and to the National Science Foundation for funding.

Qualitative Laws of Intelligence
In their 1975 Turing Award speech, Newell and Simon claimed that intelligence depends on two factors:
• the ability to store, retrieve, and manipulate list structures, since computers are general symbol manipulators;
• the ability to solve novel problems by heuristic search, with problem spaces defined by states and operators.
Moreover, one can constrain search with knowledge that is cast as symbolic list structures. These insights underlie the fields of artificial intelligence and cognitive science.

Two Basic Claims
Newell and Simon's insights suggest the two claims of this talk:
• knowledge structures are important results of machine learning and discovery;
• knowledge structures are important inputs to machine learning and discovery.
In other words, knowledge plays as crucial a role as data in the automation of discovery. I will illustrate these ideas using recent work on induction of scientific process models.

The Mainstream View
[Figure: Training Data -> Learning/Discovery Process -> Predictive Model]
Nearly all current research in machine learning and data mining takes this perspective.

An Alternative View
[Figure: Training Data and Existing Knowledge -> Learning/Discovery Process -> Acquired Knowledge]
This perspective is now uncommon, but the ideas themselves are not new to machine learning and discovery.

Historical Landmarks in Machine Learning
1980 – Machine learning launched as an outgrowth of symbolic AI
1983 – Early emphasis on knowledge-guided approaches to learning
1986 – First issue of the journal Machine Learning published
1989 – Advent of UCI repository and routine experimental evaluation
1989 – Introduction of statistical methods from pattern recognition
1993 – Workshop on fielded applications of machine learning
1995 – First conference on knowledge discovery and data mining
1997 – Explosion of the Web and associated research on text mining
2001 – Strong focus on predictive accuracy over understandability
2004 – Prevalence of statistical methods over symbolic approaches

Knowledge as Output of Discovery Systems
Discovery systems produce models that are useful for prediction, but they should also produce models that:
• are stated in some declarative format that can be communicated clearly and precisely;
• help people understand observations in terms that they find plausible and familiar.
We typically refer to the content of such models as knowledge.

What is Knowledge?
Knowledge can be cast in many different formalisms, such as:
• criteria tables (M of N rules) in diagnostic medicine
• molecular structures and reaction pathways in chemistry
• qualitative causal models in biology and geology
• structural equations in economics and sociology
• differential equations in physics and ecology
Discovery systems should generate knowledge in a format that is familiar to domain users. Fortunately, computers can encode all such forms of knowledge.

Successes of Scientific Knowledge Discovery
Over the past decade, computational discovery systems have helped uncover new knowledge in many scientific fields:
• qualitative chemical factors in mutagenesis (King et al., 1996)
• quantitative laws of metallic behavior (Sleeman et al., 1997)
• qualitative conjectures in number theory (Colton et al., 2000)
• temporal laws of ecological behavior (Todorovski et al., 2000)
• reaction pathways in catalytic chemistry (Valdes-Perez, 1994)
Each has led to publications in the refereed scientific literature, the key measure of academic success. For a review of these scientific results, see Langley (IJHCS, 2000).

Description vs. Explanation
Traditional discovery systems have focused on descriptive models that summarize data and make accurate predictions. But many sciences are concerned with explanatory models that:
• move beyond superficial descriptive summaries to account for observations at a deeper theoretical level;
• are stated in terms of unobserved concepts and mechanisms that are familiar and plausible to domain experts.
Explanations may or may not have quantitative aspects, but they invariably have qualitative structure not captured by statistics.

Two Accounts of the Ross Sea Ecosystem
As phytoplankton takes up nitrogen, its concentration increases and nitrogen decreases. This continues until the nitrogen supply is exhausted, which leads to a phytoplankton die-off. This produces detritus, which gradually remineralizes to replenish the nitrogen. Zooplankton grazes on phytoplankton, which slows the latter's increase and also produces detritus.
d[phyto,t,1] = -0.307 * phyto - 0.495 * zoo + 0.411 * phyto
d[zoo,t,1] = -0.251 * zoo + 0.615 * 0.495 * zoo
d[detritus,t,1] = 0.307 * phyto + 0.251 * zoo + 0.385 * 0.495 * zoo - 0.005 * detritus
d[nitro,t,1] = -0.098 * 0.411 * phyto + 0.005 * detritus

Relating Equation Terms to Processes
As phytoplankton takes up nitrogen, its concentration increases and nitrogen decreases. This continues until the nitrogen supply is exhausted, which leads to a phytoplankton die-off. This produces detritus, which gradually remineralizes to replenish the nitrogen. Zooplankton grazes on phytoplankton, which slows the latter's increase and also produces detritus.
d[phyto,t,1] = -0.307 * phyto - 0.495 * zoo + 0.411 * phyto
d[zoo,t,1] = -0.251 * zoo + 0.615 * 0.495 * zoo
d[detritus,t,1] = 0.307 * phyto + 0.251 * zoo + 0.385 * 0.495 * zoo - 0.005 * detritus
d[nitro,t,1] = -0.098 * 0.411 * phyto + 0.005 * detritus

A Process Model for the Ross Sea
model Ross_Sea_Ecosystem
  variables: phyto, zoo, nitro, detritus
  observables: phyto, nitro
  process phyto_loss
    equations: d[phyto,t,1] = -0.307 * phyto
               d[detritus,t,1] = 0.307 * phyto
  process zoo_loss
    equations: d[zoo,t,1] = -0.251 * zoo
               d[detritus,t,1] = 0.251 * zoo
  process zoo_phyto_grazing
    equations: d[zoo,t,1] = 0.615 * 0.495 * zoo
               d[detritus,t,1] = 0.385 * 0.495 * zoo
               d[phyto,t,1] = -0.495 * zoo
  process nitro_uptake
    equations: d[phyto,t,1] = 0.411 * phyto
               d[nitro,t,1] = -0.098 * 0.411 * phyto
  process nitro_remineralization
    equations: d[nitro,t,1] = 0.005 * detritus
               d[detritus,t,1] = -0.005 * detritus
This model is equivalent to a standard differential equation model, but it makes explicit assumptions about which processes are involved. For completeness, we must also make assumptions about how to combine influences from multiple processes.
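
One simple way to combine influences is to add them. The sketch below assumes additive combination and uses the coefficients of the Ross Sea model; the function names and the Euler integration step are illustrative choices, not part of the original system.

```python
# Each process maps the current state to its contribution to each derivative.
# Coefficients come from the Ross Sea model; additive combination is an assumption.
PROCESSES = {
    "phyto_loss": lambda s: {"phyto": -0.307 * s["phyto"],
                             "detritus": 0.307 * s["phyto"]},
    "zoo_loss": lambda s: {"zoo": -0.251 * s["zoo"],
                           "detritus": 0.251 * s["zoo"]},
    "zoo_phyto_grazing": lambda s: {"zoo": 0.615 * 0.495 * s["zoo"],
                                    "detritus": 0.385 * 0.495 * s["zoo"],
                                    "phyto": -0.495 * s["zoo"]},
    "nitro_uptake": lambda s: {"phyto": 0.411 * s["phyto"],
                               "nitro": -0.098 * 0.411 * s["phyto"]},
    "nitro_remineralization": lambda s: {"nitro": 0.005 * s["detritus"],
                                         "detritus": -0.005 * s["detritus"]},
}


def derivatives(state):
    """Sum the influences of every process on each variable."""
    total = {name: 0.0 for name in state}
    for process in PROCESSES.values():
        for variable, rate in process(state).items():
            total[variable] += rate
    return total


def euler_step(state, dt):
    """Advance the state by one explicit Euler step of size dt."""
    rates = derivatives(state)
    return {name: value + dt * rates[name] for name, value in state.items()}
```

Summing the per-process terms reproduces the four differential equations given earlier, which is why the two accounts are equivalent under this assumption.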

Advantages of Process Models
Process models are a promising representational scheme because:
• they embed quantitative relations within qualitative structure, referring to notations and mechanisms familiar to experts;
• they provide dynamical predictions of changes over time;
• they offer causal and explanatory accounts of phenomena, while retaining the modularity that is needed for induction.
Quantitative process models provide an important alternative to formalisms typically used in modeling and discovery.

The Task of Inductive Process Modeling
We can use these ideas to reformulate the modeling problem:
Given: a set of variables of interest to the scientist;
Given: observations of how these variables change over time;
Given: background knowledge about plausible processes;
Find: a process model that explains these variations and that generalizes well to future observations.
The resulting model encodes new knowledge about the domain.

Challenges of Inductive Process Modeling
We can use ideas from machine learning to induce process models, but this differs from typical learning tasks in that:
• process models characterize behavior of dynamical systems;
• variables are continuous but can have discontinuous behavior;
• observations are not independently and identically distributed;
• models may contain unobservable processes and variables;
• multiple processes can interact to produce complex behavior.
Compensating factors include a focus on deterministic systems and ways to constrain the search for models.

Machine Learning as Heuristic Search
Heuristic search depends on ways to guide exploration of the space.

Knowledge as Input to Discovery Systems
One can also use knowledge to guide discovery mechanisms:
• by providing constraints on the space searched, as in work on declarative bias for induction;
• by providing operators used during search, as in ILP research on relational cliches;
• by providing a starting point for heuristic search, as in work on theory revision and refinement.
Using knowledge to influence discovery can not only reduce prediction error but also improve model understandability.

Background Knowledge as Constraints
We can use background knowledge about the domain to constrain search for candidate models. Previous work has encoded background knowledge in terms of:
• Horn clause programs (e.g., King et al., 1996)
• context-free grammars (e.g., Dzeroski & Todorovski, 1997)
• prior probability distributions (e.g., Friedman et al., 2000)
However, none of these notations is familiar to domain scientists, which suggests the need for another approach.

Generic Processes as Background Knowledge
We cast background knowledge as generic processes that specify:
• the variables involved in a process and their types;
• the parameters appearing in a process and their ranges;
• the forms of conditions on the process; and
• the forms of associated equations and their parameters.
Generic processes are building blocks from which one can compose specific process models.

Generic Processes for Aquatic Ecosystems
generic process exponential_loss
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = -1 * α * S
             d[D,t,1] = α * S
generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: π [0, 1]
  equations: d[N,t,1] = π * D
             d[D,t,1] = -1 * π * D
generic process grazing
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ * ρ * S1
             d[D,t,1] = (1 - γ) * ρ * S1
             d[S2,t,1] = -1 * ρ * S1
generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν
generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], β [0, 1], μ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ * S
             d[N,t,1] = -1 * β * μ * S
Our current library contains about 20 generic processes, including ones with alternative functional forms for loss and grazing processes.
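
To make the role of type constraints concrete, here is a minimal sketch of how a generic process might be represented and bound to specific variables. The class, the function, and the parameter names `rho` and `gamma` are hypothetical; they are not taken from the IPM implementation.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class GenericProcess:
    name: str
    roles: tuple        # (role name, required variable type) pairs
    param_ranges: dict  # parameter name -> (low, high)


def instantiate(generic, variable_types):
    """Yield every binding of distinct variables to roles that respects types."""
    for combo in permutations(variable_types, len(generic.roles)):
        if all(variable_types[var] == required
               for var, (_, required) in zip(combo, generic.roles)):
            yield {role: var for var, (role, _) in zip(combo, generic.roles)}


# Illustrative instance of the grazing process (parameter names assumed).
GRAZING = GenericProcess(
    name="grazing",
    roles=(("S1", "species"), ("S2", "species"), ("D", "detritus")),
    param_ranges={"rho": (0.0, 1.0), "gamma": (0.0, 1.0)},
)

VARIABLES = {"phyto": "species", "zoo": "species",
             "nitro": "nutrient", "detritus": "detritus"}
```

With these variables, `instantiate(GRAZING, VARIABLES)` yields two bindings, one with zoo grazing on phyto and one with the roles reversed; the data then decide between them.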

A Method for Process Model Construction
We have developed IPM, a system that constructs explanatory process models from generic components in four stages:
1. Find all ways to instantiate known generic processes with specific variables, subject to type constraints;
2. Combine instantiated processes into candidate generic models subject to additional constraints (e.g., number of processes);
3. For each generic model, carry out search through parameter space to find good coefficients;
4. Return the parameterized model with the best overall score.
Our typical evaluation metric is squared error, but we have also explored other measures of explanatory adequacy.
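
Stages 2 through 4 can be sketched as exhaustive enumeration followed by selection. This is only an illustration under stated assumptions: the only structural constraint is a hypothetical limit `max_processes`, and `fit` is a caller-supplied function that returns a parameter setting and its squared error for a given structure.

```python
from itertools import combinations


def candidate_structures(instantiated, max_processes):
    """Stage 2: yield every non-empty set of processes up to a size limit."""
    for size in range(1, max_processes + 1):
        yield from combinations(instantiated, size)


def best_model(instantiated, max_processes, fit):
    """Stages 3 and 4: fit each structure and return the lowest-error model.

    fit(structure) must return a (params, sse) pair.
    """
    scored = ((fit(structure), structure)
              for structure in candidate_structures(instantiated, max_processes))
    (params, sse), structure = min(scored, key=lambda item: item[0][1])
    return structure, params, sse
```

Because every structure is fit separately, the cost of this scheme grows with the number of structures times the cost of parameter search.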

Estimating Parameters in Process Models
To estimate the parameters for each generic model structure, the IPM algorithm:
1. Selects random initial values that fall within ranges specified in the generic processes;
2. Improves these parameters using the Levenberg-Marquardt method until it reaches a local optimum;
3. Generates new candidate values through random jumps along dimensions of the parameter vector and continues the search;
4. Restarts the search from a new random initial point if no improvement occurs after N jumps.
This multi-level method gives reasonable fits to time-series data from a number of domains, but it is computationally intensive.
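
The control structure of this multi-level search can be sketched as follows. To keep the example dependency-free, the local optimizer is a simple coordinate descent with a shrinking step, which stands in for Levenberg-Marquardt; the function names, the number of restarts, and the jump limit are all assumptions.

```python
import random


def local_search(sse, params, bounds, step=0.1, min_step=1e-4):
    """Stand-in for Levenberg-Marquardt: coordinate descent, halving the step."""
    params, best = list(params), sse(params)
    while step > min_step:
        improved = False
        for i, (low, high) in enumerate(bounds):
            for delta in (step, -step):
                trial = list(params)
                trial[i] = min(high, max(low, trial[i] + delta))
                score = sse(trial)
                if score < best:
                    params, best, improved = trial, score, True
        if not improved:
            step /= 2
    return params, best


def estimate(sse, bounds, max_jumps=5, restarts=3, rng=None):
    rng = rng or random.Random(0)
    best_params, best_score = None, float("inf")
    for _ in range(restarts):
        # Step 1: random initial values within the declared ranges.
        start = [rng.uniform(low, high) for low, high in bounds]
        # Step 2: improve to a local optimum.
        params, score = local_search(sse, start, bounds)
        # Step 3: random jumps along one dimension, then continue the search.
        failures = 0
        while failures < max_jumps:
            jumped = list(params)
            i = rng.randrange(len(bounds))
            jumped[i] = rng.uniform(*bounds[i])
            candidate, candidate_score = local_search(sse, jumped, bounds)
            if candidate_score < score:
                params, score, failures = candidate, candidate_score, 0
            else:
                failures += 1
        # Step 4: after max_jumps failed jumps, fall through to a new restart.
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In a real setting `sse` would simulate the process model with the given parameters and compare the trajectory to the observations, which is where most of the computation goes.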

Results on Training Data from Ross Sea
We provided IPM with 188 samples of phytoplankton, nitrogen, light, and ice measures for the Ross Sea. From 2035 distinct model structures, it found accurate models that limited phyto growth by the nitrate and the light available. Some high-ranking models incorporated zooplankton, whereas others did not.

Results on a Protist Ecosystem
We also ran the system on protist data from Veilleux (1979), using 54 samples of two variables (P. aurelia and D. nasutum). In this run, IPM considered a space of 470 distinct model structures and reproduced basic trends.

Results on Ringkøbing Fjord
Data from a Danish fjord included measurements of fjord height, sea level, water inflow, and wind direction and speed. We used 1100 samples for training and 551 samples for testing over a space of 32 model structures.

Results on Battery Data from Space Station
Data from the Space Station batteries included current, voltage, and temperature, with resistance and state of charge unobserved. We used 6000 samples for training and 2640 samples for testing over a space of 162 model structures.

Results on Biochemical Kinetics
We also ran IPM on 14 samples of six chemicals involved in glycolysis from a pulse response study. Here the system considered some 172 model structures. The best model fit the data but reproduced only part of the known pathway.

Hierarchical Induction of Process Models
Despite its success, we have observed IPM produce models that lack required components or include mutually exclusive ones. In response, we have developed an extended system, HIPM, that:
• organizes background knowledge into a hierarchy of processes;
• specifies required vs. optional components and mutual exclusion;
• associates variables with entities that occur in processes;
• carries out beam search through the resulting AND/OR space.
We hypothesized that this additional knowledge would reduce search effort and variance, thus reducing generalization error. For more details about HIPM, see Todorovski et al. (AAAI-2005).
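
The effect of required, optional, and mutually exclusive components on the size of the search space can be illustrated with a small enumeration. The hierarchy below and its process names are hypothetical, and the sketch enumerates the space exhaustively rather than using beam search.

```python
from itertools import product

# Hypothetical hierarchy: each slot lists mutually exclusive alternatives.
# A required slot must be filled; an optional slot may be left empty.
HIERARCHY = [
    {"name": "growth", "required": True,
     "alternatives": ["exponential_growth", "logistic_growth"]},
    {"name": "loss", "required": True,
     "alternatives": ["exponential_loss"]},
    {"name": "grazing", "required": False,
     "alternatives": ["linear_grazing", "saturating_grazing"]},
]


def valid_structures(hierarchy):
    """Yield every model structure that satisfies the hierarchy's constraints."""
    choices = []
    for slot in hierarchy:
        options = list(slot["alternatives"])
        if not slot["required"]:
            options.append(None)  # None marks an omitted optional component
        choices.append(options)
    for combo in product(*choices):
        yield tuple(process for process in combo if process is not None)
```

Here the five processes admit only six valid structures, compared with 31 non-empty subsets when no constraints apply, and none of the six omits a growth or loss process.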

HIPM Results on Ross Sea Data
HIPM examines fewer models and has better predictive accuracy.

Research on Theory Revision
We can also use background knowledge to specify initial models from which to start search. Research on theory revision has applied this idea to models cast as:
• Horn clause programs (e.g., Ourston & Mooney, 1990)
• diagnostic fault hierarchies (e.g., Langley et al., 1994)
• qualitative causal models (e.g., Bay et al., 2003)
• sets of quantitative equations (e.g., Todorovski et al., 2003)
This approach typically produces models that are more accurate and easier to comprehend than ones induced from scratch.

Inductive Revision of Process Models
[Figure: an initial model, generic processes, and observations feed a Revision step that outputs a revised model.]
initial model:
model RossSeaEcosystem
  variables: phyto, zoo, nitro, residue
  observables: phyto, nitro
  d[phyto,t,1] = -0.307 * phyto - 0.495 * zoo + 0.411 * phyto
  d[zoo,t,1] = -0.251 * zoo + 0.615 * 0.495 * zoo
  d[residue,t,1] = 0.307 * phyto + 0.251 * zoo + 0.385 * 0.495 * zoo - 0.005 * residue
  d[nitro,t,1] = -0.098 * 0.411 * phyto + 0.005 * residue
generic processes:
process exponential_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ∞] * P
process logistic_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ∞] * P * (1 - P / [0, 1, ∞])
process constant_inflow
  variables: I {inorganic_nutrient}
  equations: d[I,t] = [0, 1, ∞]
process consumption
  variables: P1 {population}, P2 {population}, nutrient_P2
  equations: d[P1,t] = [0, 1, ∞] * P1 * nutrient_P2,
             d[P2,t] = -[0, 1, ∞] * P1 * nutrient_P2
process no_saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P
process saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P / (P + [0, 1, ∞])

Comprehensible Bagging of Process Models
We have seen HIPM produce models that fit the training data but generalize poorly, so we created another system, FUSE, that:
• creates multiple training sets by sampling the original data;
• uses HIPM to induce one process model from each training set;
• creates a new model structure that includes common processes;
• estimates parameters for this structure from the original data.
We hypothesized that this method would reduce generalization error and, unlike standard bagging, keep the resulting model understandable. This shows one can combine ideas about knowledge and statistics. For more details about FUSE, see Bridewell et al. (ICML-2005).
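
The four steps can be sketched as below. Several details are assumptions made for illustration: sampling is done with replacement, a process counts as "common" when it appears in more than half of the induced models, and `induce` and `fit` are caller-supplied stand-ins for HIPM and for parameter estimation.

```python
import random
from collections import Counter


def fuse_structure(data, induce, n_sets=10, threshold=0.5, rng=None):
    """Return the processes that appear in more than `threshold` of the models."""
    rng = rng or random.Random(0)
    counts = Counter()
    for _ in range(n_sets):
        # Assumed sampling scheme: draw len(data) items with replacement.
        sample = [rng.choice(data) for _ in data]
        # induce(sample) returns the process names of one induced model.
        counts.update(set(induce(sample)))
    return sorted(p for p, c in counts.items() if c / n_sets > threshold)


def fuse(data, induce, fit, **options):
    """Build the combined structure, then fit it to the original data."""
    structure = fuse_structure(data, induce, **options)
    return structure, fit(structure, data)
```

The result is a single process model rather than an ensemble of predictors, which is what keeps it readable.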

FUSE Results on Ross Sea Data
[Figure: two panels plotting r2 and SSE against cross-validation fold.]
Five-fold cross validation on 188 measurements of two variables.

Process Modeling and Missing Data
Our initial algorithms assumed that the variables have no missing samples, so we have developed another extension to HIPM that:
1. replaces missing values with interpolated estimates;
2. uses HIPM to find the model that minimizes squared error;
3. replaces the estimated values with ones the model predicts;
4. returns to Step 2 if some values have changed.
Experiments suggest that this expectation-maximization variant reduces error substantially on unseen data. This shows another way to combine knowledge with statistics. For more details, see Bridewell et al. (submitted to ICML-2006).
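
The loop can be sketched for a single time series. Here a least-squares line (`fit_line`) is a deliberately simple stand-in for HIPM, missing samples are marked with `None`, and the tolerance and iteration limit are assumptions; the sketch also assumes at least two observed values.

```python
def interpolate(series):
    """Step 1: fill None entries by linear interpolation between neighbours."""
    known = [i for i, value in enumerate(series) if value is not None]
    filled = list(series)
    for i, value in enumerate(series):
        if value is None:
            left = max((k for k in known if k < i), default=None)
            right = min((k for k in known if k > i), default=None)
            if left is None:
                filled[i] = series[right]
            elif right is None:
                filled[i] = series[left]
            else:
                weight = (i - left) / (right - left)
                filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


def fit_line(values):
    """Toy model inducer: least-squares line through (index, value) pairs."""
    n = len(values)
    mean_x, mean_y = (n - 1) / 2, sum(values) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
             / sum((x - mean_x) ** 2 for x in range(n)))
    return lambda i: mean_y + slope * (i - mean_x)


def impute_and_fit(series, fit=fit_line, max_iters=20, tol=1e-6):
    missing = [i for i, value in enumerate(series) if value is None]
    current = interpolate(series)
    model = None
    for _ in range(max_iters):
        model = fit(current)                      # Step 2: fit the model
        updated = list(current)
        for i in missing:
            updated[i] = model(i)                 # Step 3: use its predictions
        changed = any(abs(a - b) > tol for a, b in zip(updated, current))
        current = updated
        if not changed:                           # Step 4: stop once stable
            break
    return model, current
```

Only the originally missing positions are ever overwritten, so the observed samples constrain the model in every iteration.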

Contributions of the Research
In summary, our work on computational discovery has produced:
• a formalism that states scientific knowledge as process models
• an encoding for background knowledge as generic processes
• a computational method for inducing process models
• a related technique for revising initial process models
• extended methods that combine knowledge with statistics
Inductive process modeling has great potential to help scientists construct explanatory models of dynamical systems.

Future Research on Process Modeling
Despite our progress to date, we need further work in order to:
• produce additional results on other scientific data sets
• develop more efficient methods for fitting model parameters
• extend the framework to handle partial differential equations
• explore evaluation metrics like match to trajectory shape
• introduce subsystems to support large-scale modeling
Taken together, these will make inductive process modeling a more robust approach to scientific knowledge discovery.

Relevance for Feature Selection
Knowledge can also assist in the search for useful features by:
• placing constraints on acceptable combinations of features
• providing an initial set of features from which to start search
• biasing selection to produce understandable models
We can apply these ideas to any representation of discovered knowledge, since each must include features as components.

Feature Selection in Process Modeling
We hope to extend our methods for inducing process models to:
• construct initial models that include only a few variables;
• use generic processes, type constraints, and available terms to expand the best-scoring models, either by adding new terms between ones in the current model or by adding new terms to the fringe of the current model;
• continue this forward selection scheme to construct ever more inclusive process models.
This strategy mirrors the incremental way that scientists improve their models over time.

Concluding Remarks
In summary, ideas from symbolic AI remain highly relevant to machine learning and discovery. These ideas revolve around using structural knowledge that can:
• serve as understandable results of discovery systems
• provide useful inputs to discovery systems that guide search
One can combine knowledge-based approaches with statistical techniques to gain the benefits of both paradigms. Taken together, they offer a balanced and productive approach to computational induction.

End of Presentation
