
Challenges in the Computational Modeling of Gene Regulation
Pat Langley
Institute for the Study of Learning and Expertise, Palo Alto, California
and
Center for the Study of Language and Information, Stanford University, Stanford, California
http://cll.stanford.edu/~langley
langley@csli.stanford.edu
Thanks to S. Bay, L. Chrisman, A. Grossman, L. MacIntosh, A. Pohorille, J. Shrager, and H. Spencer.

Themes in Computational Biology
Three distinctive themes in computational biology have been:
• manual development of knowledge bases about biological systems (e.g., E-Cell, EcoCyc, KEGG);
• automated analysis of available genomic and expression data (e.g., clustering, inferring regulatory networks);
• interactive tools for visualizing genomic and expression data (e.g., SpotFire).
However, each approach by itself is incomplete, and a complete solution must combine knowledge, data, and user interaction.


Knowledge, Data, and the Biologist
Experimental data: observed expression levels from cDNA microarrays or from other sources.
Domain knowledge: initial model of gene regulatory processes, gene ontologies, and biological constraints.
Discovery (with the biologist) yields an updated model: a revised model of gene regulatory processes that explains observed expression data.

Reasons for Studying Cyanobacteria
Cyanobacteria are 3.5 billion years old and created Earth’s early oxygen atmosphere. Algae and cyanobacteria produce most of the oxygen we breathe and fix most greenhouse carbon dioxide. Together they form the base of the marine ecosystem.

Collecting Data on Photosynthetic Processes
[Figure: timeline of a continuous culture (chemostat) experiment, with labels for the equilibrium period, stress (e.g., high light), adaptation period, sampling of mRNA/cDNA, and health of culture over time, alongside a microarray trace. Image sources: www.affymetrix.com/ and /wwwscience.murdoch.edu.au/teach]

A Biologist’s Depiction of Photosynthesis http://www.bio.ic.ac.uk/research/barber/photosystemII.html

Challenge 1: Representing Biological Models
To assist biologists in their modeling efforts, we must first encode candidate models; however, most biological models are:
• qualitative rather than quantitative;
• abstract in that they ignore many details;
• causal in that they describe chains of effects;
• process-based in that they refer to biological mechanisms.
We need some formal way to represent such models that can be interpreted computationally.

Some Representations of Biological Knowledge
• taxonomies
• differential equations
• Boolean networks
• Bayesian networks

An Abstract Qualitative Causal Model
How do plants modify their photosynthetic apparatus in high light?
[Figure: signed causal graph (links labeled + or -) over the variables Light, RR, NBLR, NBLA, dspA, cpcB, psbA1, psbA2, PBS, Photo, and Health]
This model is qualitative but relates continuous variables, much like formalisms from qualitative physics (e.g., Forbus, 1984).

Challenge 2: Making Predictions from Models
To evaluate a regulatory model, we need it to make predictions about quantitative measures of gene expression.
A qualitative model cannot predict numeric values but can predict:
• that some partial correlations will be zero;
• that some partial correlation products will be equal; and
• the signs of correlations between variables.
These predictions assume that each variable is a linear function of its causal parents, as in Glymour et al.’s (1987) Tetrad. Some models must also include statements that certain regulatory pathways dominate others.

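Implications of Three Causal Models
[Figure: three causal models over the variables X, Y, and Z, with the implications $\rho_{XZ.Y} = 0$, $\rho_{XZ.Y} \neq 0$, and $\rho_{XZ.Y} \neq 0$, respectively]
Note that these implications do not depend on the effect’s sign.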
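As a rough illustration of how a vanishing partial correlation separates causal structures, the sketch below simulates two linear models and estimates the partial correlation of X and Z given Y. The two structures (a chain X -> Y -> Z and a collider X -> Y <- Z), the coefficients, and all function names are assumptions chosen for illustration; they are not necessarily the three models shown on the slide.

```python
import math
import random


def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, z, y):
    """First-order partial correlation of x and z, controlling for y."""
    r_xz, r_xy, r_zy = corr(x, z), corr(x, y), corr(z, y)
    return (r_xz - r_xy * r_zy) / math.sqrt((1 - r_xy ** 2) * (1 - r_zy ** 2))


def simulate_chain(n, b_xy, b_yz, rng):
    """Linear chain X -> Y -> Z with unit-variance Gaussian noise."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [b_xy * x + rng.gauss(0, 1) for x in xs]
    zs = [b_yz * y + rng.gauss(0, 1) for y in ys]
    return xs, ys, zs


def simulate_collider(n, b_xy, b_zy, rng):
    """Collider X -> Y <- Z with unit-variance Gaussian noise."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    zs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [b_xy * x + b_zy * z + rng.gauss(0, 1) for x, z in zip(xs, zs)]
    return xs, ys, zs


rng = random.Random(0)
x, y, z = simulate_chain(5000, 0.8, 0.8, rng)
chain_pos = partial_corr(x, z, y)      # chain with two positive links
x, y, z = simulate_chain(5000, 0.8, -0.8, rng)
chain_neg = partial_corr(x, z, y)      # same chain, one link made negative
x, y, z = simulate_collider(5000, 0.8, 0.8, rng)
collider = partial_corr(x, z, y)       # conditioning on a common effect

print(f"chain (+,+): {chain_pos:.3f}")
print(f"chain (+,-): {chain_neg:.3f}")
print(f"collider:    {collider:.3f}")
```

In this sketch the chain gives a partial correlation near zero whether the second link is positive or negative, while the collider does not, which is consistent with the note that the implication does not depend on the effect’s sign.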

Challenge 3: Encoding Background Knowledge
To constrain candidate models, we must encode knowledge about biological entities and processes.
This background knowledge can take the form of:
• an initial qualitative model of gene regulation;
• genes that may be involved in the phenomena;
• a taxonomy of these relevant genes; and
• constraints on links between types of genes.
Analysis of biological data should take into account knowledge about the organism under study.

Some Constraints on Biological Models
[Figure: the signed causal graph over Light, RR, NBLR, NBLA, dspA, cpcB, psbA1, psbA2, PBS, Photo, and Health, with two links marked ×]
We can start with an initial causal model proposed by biologists. We can also forbid causal links between certain pairs of variables.
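One way such constraints might be encoded is as a filter on the links a search is allowed to propose. The sketch below is only an illustration: the variable names, the type labels ("signal", "regulator", "structural"), and the function `allowed_additions` are hypothetical and do not come from the original system.

```python
from itertools import permutations


def allowed_additions(var_type, allowed_type_pairs, forbidden_links, existing_links):
    """Return candidate links (parent, child) that satisfy the constraints.

    var_type           maps each variable to a type from a gene taxonomy
    allowed_type_pairs set of (parent_type, child_type) pairs that may be linked
    forbidden_links    set of (parent, child) pairs ruled out explicitly
    existing_links     set of (parent, child) pairs already in the model
    """
    candidates = []
    for parent, child in permutations(sorted(var_type), 2):
        link = (parent, child)
        if link in existing_links or link in forbidden_links:
            continue
        if (var_type[parent], var_type[child]) in allowed_type_pairs:
            candidates.append(link)
    return candidates


# Hypothetical variables and types, for illustration only.
var_type = {"stimulus": "signal", "regA": "regulator", "geneB": "structural"}
allowed_type_pairs = {("signal", "regulator"), ("regulator", "structural")}
existing = {("stimulus", "regA")}

unconstrained = allowed_additions(var_type, allowed_type_pairs, set(), existing)
constrained = allowed_additions(
    var_type, allowed_type_pairs, {("regA", "geneB")}, existing
)
print(unconstrained)
print(constrained)
```

Keeping the type-level rules separate from the explicit list of forbidden pairs lets a biologist state general knowledge once and still veto individual links.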

Challenge 4: Revising Models Given Expression Data
To revise a regulatory model, we must develop an algorithm that searches through the space of models.
This requires us to make design decisions about:
• the initial state from which to start search;
• the operators that generate new states;
• the evaluation function that selects among states;
• the overall control regime for the search; and
• the halting criterion for ending the search.
We have implemented a two-stage method to search the space of qualitative causal models of gene regulation.

Stage 1: Determining Model Structure
Our system carries out heuristic search through the space of causal model structures.
• Initial state: A preliminary model proposed by a biologist.
• Operators: Add a new link (constrained by variable types); Delete an existing link.
• Evaluation: Agreement with predicted relations among partial correlations, similar to those used in Tetrad.
• Control: Greedy search to select best structure on each round.
• Halting: Stop when there is no further improvement in the evaluation metric.
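The control regime described for Stage 1 can be sketched as a generic hill-climbing loop. This is a minimal sketch under stated assumptions: the function name `greedy_search` is hypothetical, and the toy `score` below (distance to a known target structure) is only a stand-in for the real evaluation, which compares predicted and observed relations among partial correlations.

```python
def greedy_search(initial_links, candidate_links, score):
    """Greedy search over model structures.

    Each step toggles one candidate link (adds it if absent, deletes it if
    present), keeps the best-scoring neighbor, and halts when no neighbor
    improves on the current score.
    """
    current = frozenset(initial_links)
    current_score = score(current)
    while True:
        best, best_score = None, current_score
        for link in sorted(candidate_links):
            neighbor = current.symmetric_difference([link])
            neighbor_score = score(neighbor)
            if neighbor_score > best_score:
                best, best_score = neighbor, neighbor_score
        if best is None:
            return current, current_score
        current, current_score = best, best_score


# Toy stand-in for the evaluation function: closeness to a known target.
target = frozenset({("A", "B"), ("B", "C")})


def score(links):
    return -len(links.symmetric_difference(target))


initial = {("A", "B"), ("A", "C")}
candidates = {("A", "B"), ("B", "C"), ("A", "C")}
final_links, final_score = greedy_search(initial, candidates, score)
print(sorted(final_links), final_score)
```

Restricting `candidate_links` to links permitted by the variable types is where the operator constraint from the slide would enter.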

Stage 2: Adding Signs to the Model
Our system carries out a second search through the space of signed qualitative models.
• Initial state: The unsigned model structure generated in Stage 1.
• Operators: Associate a sign (+ or –) with a given link; Label some pathways as dominant over others.
• Evaluation: Agreement with the signs of correlations computed from the data.
• Control: Exhaustive search for small models; Greedy search for more complex models.
• Halting: Stop when each link has an associated sign.
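For small models, the exhaustive variant of Stage 2 might look like the sketch below. It makes simplifying assumptions that should be read as such: the predicted sign between two variables is taken from directed paths only (common causes are ignored), conflicting paths are treated as making no prediction instead of being resolved by dominance labels, and the names `effect_sign` and `assign_signs` are hypothetical.

```python
from itertools import product


def effect_sign(signed_links, source, target):
    """Predicted sign of source's effect on target in an acyclic model.

    The sign of a path is the product of its link signs. Returns +1 or -1
    when all directed paths agree, and 0 when there is no path or the
    paths conflict.
    """
    def path_signs(node):
        if node == target:
            return [1]
        return [
            sign * rest
            for (parent, child), sign in signed_links.items()
            if parent == node
            for rest in path_signs(child)
        ]

    signs = set(path_signs(source))
    return signs.pop() if len(signs) == 1 else 0


def assign_signs(links, observed_signs):
    """Try every +/- labeling and keep the one matching the most observations."""
    best, best_score = None, -1
    for combo in product((1, -1), repeat=len(links)):
        signed = dict(zip(links, combo))
        matches = sum(
            1
            for (a, b), sign in observed_signs.items()
            if effect_sign(signed, a, b) == sign
        )
        if matches > best_score:
            best, best_score = signed, matches
    return best, best_score


links = [("A", "B"), ("B", "C")]
# Signs of correlations as they might be computed from expression data.
observed = {("A", "B"): 1, ("B", "C"): -1, ("A", "C"): -1}
best_signs, n_matched = assign_signs(links, observed)
print(best_signs, n_matched)
```

The loop visits 2^k labelings for k links, which is why an exhaustive pass is only practical for small models and a greedy pass is used for larger ones.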

Expression Data on Photosynthetic Regulation
The initial study produced four replications at each of five time steps.

A Revised Model of Photosynthesis Regulation
[Figure: the revised signed causal graph over Light, RR, NBLR, NBLA, dspA, cpcB, psbA1, psbA2, PBS, Photo, and Health, with two links marked ×]
Changes to the model improve its match to the expression data. Similar changes adapt the model to expression data from mutants.

Challenge 5: Dealing with Small Data Sets
Microarray technology provides many measurements, but it often gives very few samples.
To reduce variance and avoid overfitting these data, our method:
• starts from an initial model rather than from scratch;
• incorporates biological constraints on model revisions;
• uses bootstrap sampling to generate 20 data sets, then runs the revision method 20 times and retains only changes that occur in at least 75% of the runs.
Experimental studies suggest that these strategies reduce variance and produce more robust models.
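The bootstrap filter described above can be sketched as follows. The defaults of 20 runs and a 75% threshold come from the slide; the function name `robust_revisions`, the `revise` callback, and the toy revision procedure are hypothetical placeholders for the actual revision method.

```python
import random
from collections import Counter


def robust_revisions(data, revise, n_boot=20, threshold=0.75, seed=0):
    """Keep only the revisions proposed in at least `threshold` of the runs.

    data    list of samples (e.g., one expression profile per array)
    revise  callable that maps a data set to a set of proposed changes
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Bootstrap: resample the data with replacement, same size.
        sample = [rng.choice(data) for _ in data]
        counts.update(set(revise(sample)))
    return {change for change, count in counts.items() if count / n_boot >= threshold}


# Toy revision procedure: one change is always proposed, another only on
# every second run (10 of 20 runs, below the 75% threshold).
calls = {"n": 0}


def toy_revise(sample):
    calls["n"] += 1
    changes = {"add A->B"}
    if calls["n"] % 2 == 0:
        changes.add("delete C->D")
    return changes


kept = robust_revisions(list(range(10)), toy_revise)
print(kept)
```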

Experimental Studies with Synthetic Data
To evaluate our revision method, we used a target model to create synthetic data and systematically varied the distance of the initial model from that target.
The number of incorrect revisions seems unaffected by distance.

Challenge 5: Dealing with Temporal Phenomena
Many biological processes occur over extended periods of time; to deal with such phenomena, we need methods that:
• represent biological models with time-delayed effects;
• utilize these time-delayed models to make predictions;
• evaluate alternative models in terms of their fit to data;
• carry out search through the space of alternative models.
We have extended our framework to handle qualitative causal models with time delays and we have done initial evaluations.

A Regulatory Model with Time Delays
[Figure: the causal graph over Light, RR, NBLR, NBLA, dspA, cpcB, psbA1, psbA2, PBS, Photo, and Health, with each link labeled by a numeric time delay]
We can handle temporal phenomena by adding time delays to links. This model predicts the system’s qualitative behavior over time.
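To make the idea of delayed links concrete, the sketch below simulates a small linear model in which each link carries a weight and a delay. It is an illustration under assumptions: the linear update rule, the function name `simulate`, and the three-variable toy model are hypothetical, and the delays are not those in the figure.

```python
def simulate(links, inputs, steps):
    """Simulate a linear model whose links have time delays.

    links   list of (parent, child, weight, delay), each delay >= 1
    inputs  dict mapping each exogenous variable to a list of `steps` values
    Each child at time t is the weighted sum of its parents at time t - delay.
    """
    children = sorted({child for _, child, _, _ in links})
    series = {name: list(values) for name, values in inputs.items()}
    for child in children:
        series[child] = [0.0] * steps
    for t in range(steps):
        for child in children:
            series[child][t] = sum(
                weight * series[parent][t - delay]
                for parent, c, weight, delay in links
                if c == child and t - delay >= 0
            )
    return series


# Toy model: Light raises A after 2 steps; A lowers B after 1 more step.
links = [("Light", "A", 1, 2), ("A", "B", -1, 1)]
light = [0, 0, 1, 1, 1, 1, 1, 1]          # light switched on at t = 2
result = simulate(links, {"Light": light}, steps=8)
print(result["A"])
print(result["B"])
```

Requiring every delay to be at least one step means each value depends only on earlier time points, so the update order within a step does not matter.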

Synthetic Data from Time-Delay Model
[Figure: plot of synthetic time series for Light, NBLA, and Health, with time running to 100 on the horizontal axis and values up to 30 on the vertical axis]

A Method for Revising Time-Delay Models
Generalize correlation and partial correlation to the frequency domain.

A Reconstructed Model with Time Delays
[Figure: the reconstructed causal graph over Light, RR, NBLR, NBLA, dspA, cpcB, psbA1, psbA2, PBS, Photo, and Health, with links labeled by time delays and one link marked ×]
Our method reconstructs most of this model from synthetic data. Determining the link delays from time series seems tractable, but this requires a high sampling rate.
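A simple way to see why delay recovery depends on sampling rate is a time-domain lagged correlation, sketched below. This is a swapped-in technique for illustration, not the frequency-domain method the slides describe; the function names and the synthetic series are assumptions.

```python
import math
import random


def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def estimate_delay(x, y, max_lag):
    """Lag (in samples) at which x best predicts y, by absolute correlation."""
    n = len(x)

    def lagged(k):
        # Pair x at time t with y at time t + k.
        return corr(x[: n - k], y[k:])

    return max(range(max_lag + 1), key=lambda k: abs(lagged(k)))


# Synthetic series: y follows x with a delay of 3 samples and a negative sign.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(300)]
y = [0.0] * 3 + [-2.0 * value for value in x[:-3]]

delay = estimate_delay(x, y, max_lag=10)
print(delay)
```

Because the estimate is a whole number of samples, a delay shorter than the sampling interval cannot be resolved, which is one reason a high sampling rate matters.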

Challenge 6: Interfacing with Biologists
We are developing an environment that lets its biologist users:
• specify qualitative causal models of biological systems;
• display and edit a model’s structure and details graphically;
• incorporate knowledge and results from previous studies;
• evaluate the evidence in favor of specific hypotheses;
• propose revisions to the model in response to observations.
The environment will offer computational assistance in forming and evaluating models but let the biologist retain control.

An Interactive Environment for Biological Modeling

Additional Work on Biological Modeling
Our ongoing research on biological model revision has involved:
• developing other approaches to revising regulatory models, including Bayesian scoring and neural networks;
• introducing taxonomic knowledge about genes and biological processes to constrain the search process; and
• expanding the modeling formalism to represent biological mechanisms in addition to abstract processes.
Thus, we continue to explore ways to combine knowledge with data to aid the creation of biological models.

Additional Models and Data
We are also applying our biological modeling framework to:
• naturalistic data on photosynthesis regulation in Cyanobacteria in a setting that mimics the day/night cycle;
• testing if certain genes are targets of unobserved transcription factors, using time-series data on the yeast cell cycle;
• testing whether the transcription factor c-Jun is activated by anything other than Jnk2, using data on healthy lung tissue.
These efforts should further test the robustness of our approach and provide evidence of its generality.

Intellectual Influences
Our approach to computational biological discovery borrows ideas from many traditions:
• qualitative physics and simulation (e.g., Forbus, 1984);
• linear causal models and their inference (Glymour et al., 1987);
• computational scientific discovery (e.g., Langley et al., 1987);
• theory revision in machine learning (e.g., Towell, 1991);
• interactive tools for data analysis (e.g., Shneiderman, 2001).
Our work combines, in novel ways, insights from machine learning, knowledge representation, and human-computer interaction.

Contributions of the Research
In summary, our work on computational biological modeling and discovery responds to six major challenges:
• representing biological models that are qualitative and abstract;
• making testable predictions from such qualitative causal models;
• encoding knowledge about biological entities and processes;
• utilizing knowledge and data to revise initial process models;
• making revision methods robust despite small amounts of data;
• developing interactive tools that let biologists remain in control.
Taken together, our six responses constitute a novel and promising approach to elucidating biological models.

Revising Qualitative Models of Gene Regulation
Pat Langley and Jeff Shrager
Institute for the Study of Learning and Expertise, Palo Alto, California
and
Andrew Pohorille
Center for Computational Astrobiology, NASA Ames Research Center, Moffett Field, California
Thanks to S. Bay, L. Chrisman, A. Grossman, L. MacIntosh, and H. Spencer.

Greedy Search Through a Space of Models
[Figure: search tree that starts from the initial model and expands through candidate revisions 1.1 to 1.4, 2.1 to 2.4, and 3.1 to 3.4]

