
Methods for evaluating inference algorithms



  1. Methods for evaluating inference algorithms June, 2005 Omer Berkman Tel Aviv University, Israel.

  2. Introduction • A central goal of molecular biology is to understand the regulatory mechanisms of gene transcription and protein synthesis. • One major goal of functional genomics research is to take large sets of biological data and elucidate functional interactions between elements in a causal pathway or network. • The invention of DNA microarrays, which measure the abundance of thousands of mRNA targets simultaneously, has been an important milestone. • Many approaches to the reverse engineering of genetic regulatory networks from gene expression data have been explored.

  3. Reminder of the problem • Interactions between thousands of genes have to be learned from short time series of typically only about a dozen measurements. • When can genetic networks be identified from microarray data? According to Zak et al. (2002) – not any time soon: identifying the network architecture may only be possible when a rich microarray time course is coupled with a priori biological information.

  4. Is the inference possible? • It is not clear whether biological structures can be elicited from expression data at all (i.e. whether the posterior probabilities over structures can be expected to be sufficiently informative). • Most studies assess the inference results by comparing predicted regulatory interactions with those known from the biological literature. • This approach is controversial due to the absence of known gold standards, which renders the estimation unreliable and difficult.

  5. Biological literature – problem one • Having detected an interaction, authors tend to delve into the literature to confirm their findings. • They back up their results with known interactions between proteins similar to those encoded by the genes studied in the experiment. • First, sequence similarity does not necessarily imply similar function. • Second, there is an inherent arbitrariness in deciding what counts as similar. • One may therefore suspect that some of the reported true interactions are spurious and do not really support the interactions detected by the reverse engineering procedure.

  6. Biological literature – problem two • The second and more serious drawback is the difficulty in estimating the false detection rate. • This is because on predicting a gene interaction that is not supported by the literature, it is impossible to decide, without further expensive interventions in the form of multiple gene knock-out experiments, whether the algorithm has discovered a new, previously unknown interaction, or whether it has flagged a spurious edge.

  7. Simulation motivation • Some of the predicted interactions are biologically reasonable, some appear to be unreasonable, but the validity of the vast majority is difficult to assess. • Testing all of the predicted interactions experimentally would take decades. • No suitable approach had been formulated for evaluating the effectiveness of recovering models of complex biological systems from limited data. • To overcome this limitation, recent works have used biologically realistic simulated data.

  8. Simulation approach benefits • Enables evaluation of different algorithms. • Enables better understanding of the inference algorithms. • Enables setting parameters such as the noise level, network topology and sampling interval. • Enables learning with unlimited data. • Gives indications about environments that are not yet experimentally realistic. • Enables designing, a priori, experiments and data-collection protocols that are amenable to functional network inference.

  9. Simulation approach - illustration (Smith et al. 2002) • Build a known network. • Create a simulator in which we know all the rules. • Sample data from it. • Present the sampled data to the inference algorithm. • Evaluate the algorithm's output.

  10. Smith et al. (2002) - introduction • The first approach for evaluating the effectiveness of inference algorithms at recovering models from limited data. • They created a data simulator, sampled the simulated data, and used the sampled data to evaluate the effectiveness of the algorithm they developed. • Also the first application to modeling complex systems at multiple biological levels of organization, beyond genetic regulation alone.

  11. BRAINSIM – Smith et al. (2002) - Design • A simulator that models an electrophysiological activity signal (0-400 Hz) and the expression levels of 100 genes (0-50 arbitrary units). • 90% of the simulated variables are unregulated by other variables in the system and are included simply as distracters.

  12. BRAINSIM – Initialization • Activity begins at a random value. • The initial level of a gene is called its ‘target’ value, intended to correspond to its constitutive expression level. • This value is selected at random in the range 0-10 for up-regulated genes, 40-50 for down-regulated genes and 0-50 for the unregulated genes.
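A minimal Python sketch of this initialization, assuming the ranges above; the particular split into regulated and distracter genes is illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

N_GENES = 100
# Illustrative class assignment: a few regulated genes,
# the remaining 90% act as unregulated distracters.
up_regulated = [0, 1, 2, 3, 4]
down_regulated = [5, 6, 7, 8, 9]

# Activity begins at a random value in its 0-400 Hz range.
activity = rng.uniform(0, 400)

# Each gene's initial ('target') level depends on its class.
target = rng.uniform(0, 50, size=N_GENES)                             # unregulated: 0-50
target[up_regulated] = rng.uniform(0, 10, size=len(up_regulated))     # up-regulated: 0-10
target[down_regulated] = rng.uniform(40, 50, size=len(down_regulated))  # down-regulated: 40-50
levels = target.copy()
```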

  13. BRAINSIM – Dynamics rules (see the sketch below) • Degradation: each gene moves a constant amount, chosen to be 4, toward its target level (added or subtracted as needed). • Regulation: each regulated gene has a proportion, chosen to be 0.2, of its regulator's level added or subtracted. • Unregulated genes have a random number, chosen in the range 0-5, added or subtracted to simulate regulation by other, unmeasured processes. • Noise: a random amount, chosen in the range 0-6, is added to or subtracted from each gene to simulate stochasticity in gene expression.
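A sketch of one simulation step under these rules, using the constants quoted above. The assumption that the activity signal acts as the regulator is illustrative; BRAINSIM's actual wiring is richer:

```python
import numpy as np

def brainsim_step(levels, target, activity, reg_sign, regulated, rng):
    """One update under the three rules above; constants follow the slide.
    `regulated` is a boolean mask, `reg_sign` holds +1/-1 per gene."""
    new = levels.copy()

    # Degradation: every gene moves a constant amount (4) toward its target.
    new += 4.0 * np.sign(target - new)

    # Regulation: regulated genes add/subtract a proportion (0.2) of their
    # regulator's level (here assumed to be the activity signal).
    new[regulated] += reg_sign[regulated] * 0.2 * activity

    # Unregulated genes take a random +/- step of size 0-5 to mimic
    # regulation by unmeasured processes.
    unreg = ~regulated
    new[unreg] += rng.choice([-1.0, 1.0], unreg.sum()) * rng.uniform(0, 5, unreg.sum())

    # Noise: every gene gets a random +/- perturbation of size 0-6.
    new += rng.choice([-1.0, 1.0], new.size) * rng.uniform(0, 6, new.size)

    return np.clip(new, 0, 50)  # keep expression in its 0-50 range
```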

  14. BRAINSIM – Simulated data • Generated data for Activity, Gene1, Gene10 and Gene12; note the regulation effects and the time lags, as well as the random walk of the unregulated gene. • Data are sampled every 5 time steps (time 45-150), causing a loss of information. • Values are quartile-discretized, causing a further loss of information but making the data more robust (see the sketch below).
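A sketch of the subsampling and quartile discretization applied to a simulated (time × genes) trajectory; the array layout is an illustrative choice:

```python
import numpy as np

def sample_and_discretize(trajectory, start=45, stop=150, rate=5):
    """Subsample a (time x genes) trajectory every `rate` steps,
    then quartile-discretize each gene into 4 levels (0-3)."""
    sampled = trajectory[start:stop + 1:rate]
    quartiles = np.quantile(sampled, [0.25, 0.5, 0.75], axis=0)
    # Per gene: map each value to the number of quartile boundaries below it.
    discrete = np.stack([np.digitize(sampled[:, g], quartiles[:, g])
                         for g in range(sampled.shape[1])], axis=1)
    return discrete
```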

  15. NETWORKINFERENCE – Smith et al. (2002) - Design • Takes a collection of observed data as input. • Searches for Bayesian networks that are good at explaining the observed data without unnecessary complexity. • Designed in a manner that is capable of searching for dynamic Bayesian networks from temporal data.

  16. Bayesian networks definitions • Representation of a joint probability distribution which consists of two components: • G is a directed acyclic graph (DAG) whose vertices correspond to the random variables. • θ describes a conditional distribution for each variable, given its parents in G.

  17. Bayesian networks illustration • The network structure implies that the joint distribution has the product form P(A,B,C,D,E) = P(A)P(B|A,E)P(C|B)P(D|A)P(E) • A graph-based model that captures properties of conditional independence between variables. • Encodes the Markov assumption: Each variable is independent of its non-descendants, given its parents in the graph.
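In code, the factorization simply multiplies the local conditionals; a sketch for the five-variable example above, with placeholder conditional probability tables for binary variables:

```python
import numpy as np

# Placeholder local conditionals (indices 0/1 for off/on).
p_a = np.array([0.4, 0.6])                  # P(A)
p_e = np.array([0.7, 0.3])                  # P(E)
p_b_ae = np.random.rand(2, 2, 2)            # P(B|A,E), normalized below
p_b_ae /= p_b_ae.sum(-1, keepdims=True)
p_c_b = np.array([[0.8, 0.2], [0.3, 0.7]])  # P(C|B)
p_d_a = np.array([[0.5, 0.5], [0.1, 0.9]])  # P(D|A)

def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) P(E) P(B|A,E) P(C|B) P(D|A)."""
    return p_a[a] * p_e[e] * p_b_ae[a, e, b] * p_c_b[b, c] * p_d_a[a, d]

print(joint(1, 0, 1, 1, 0))
```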

  18. Learning networks from data • Goal: find the structure M (and parameters θ) that best explain the data D. • From Bayes' rule we have: P(M|D) = P(D|M)P(M) / P(D) ∝ P(D|M)P(M). • And the marginal likelihood integrates out the parameters: P(D|M) = ∫ P(D|M,θ) P(θ|M) dθ (see the sketch below).
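For discrete data with Dirichlet parameter priors, this integral has a closed form that decomposes over families (a variable plus its parents). A minimal Python sketch of the per-family term, assuming a symmetric Dirichlet pseudocount; the function name and `alpha` choice are illustrative, not from the papers:

```python
import numpy as np
from scipy.special import gammaln

def family_log_marginal(counts, alpha=1.0):
    """Closed-form log marginal likelihood of one family.
    counts[j, k] = number of cases with parent configuration j and
    child state k; a symmetric Dirichlet(alpha) prior sits on each row.
    The full log P(D|M) is the sum of this term over all variables."""
    counts = np.asarray(counts, dtype=float)
    n_states = counts.shape[1]
    row_totals = counts.sum(axis=1)
    return float(
        np.sum(gammaln(n_states * alpha) - gammaln(n_states * alpha + row_totals))
        + np.sum(gammaln(alpha + counts) - gammaln(alpha))
    )

# e.g. a binary child with one binary parent observed 30 times:
print(family_log_marginal([[12, 3], [2, 13]]))
```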

  19. Bayesian networks properties • Can capture many types of relationships between variables. • Owing to their probabilistic nature, BN algorithms are capable of handling noisy data. • Effective handling of hundreds of variables.

  20. BNs for modeling genetic networks • BNs were first applied to this problem by Friedman et al. (2000) and Hartemink et al. (2001). • At a qualitative level, the structure of a BN describes the relationships between variables in the form of conditional independence relations. • At a quantitative level, relationships between the interacting agents are described by conditional probability distributions. • The probabilistic nature of this approach makes it capable of handling the noise inherent in both the biological processes and the microarray experiments.

  21. Bayesian networks drawbacks • Networks with the same skeleton but different edge directions can have the same marginal likelihood P(D|M); thus it is impossible to distinguish between them on the basis of the data alone. • The acyclicity constraint rules out recurrent structures. However, feedback is an essential feature of biological systems.

  22. Dynamic Bayesian networks – illustration (Husmeier 2003) • An interaction between genes is not instantaneous: its effect occurs with a time delay after its cause. • [Figure: a recurrent network comprising two genes that interact with each other through feedback, and the equivalent DBN obtained by unfolding the recurrent network in time.]
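A sketch of the unfolding idea: each edge of the recurrent graph becomes an edge from time slice t to slice t+1. The plain edge-list representation is an illustrative choice:

```python
def unfold(recurrent_edges, n_slices):
    """Unroll a recurrent gene network into a DBN: each edge u -> v
    becomes an edge (u, t) -> (v, t+1), so cycles (feedback) are allowed
    in the recurrent graph while the unrolled graph stays acyclic."""
    return [((u, t), (v, t + 1))
            for t in range(n_slices - 1)
            for (u, v) in recurrent_edges]

# Two genes with mutual feedback: legal as a DBN even though cyclic.
print(unfold([("g1", "g2"), ("g2", "g1"), ("g1", "g1")], n_slices=3))
```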

  23. NETWORKINFERENCE – Search strategy • Identifying the highest-scoring network has been shown to be NP-hard (Chickering 1996). • NETWORKINFERENCE uses simulated annealing (with an extension for re-annealing to avoid becoming trapped in local maxima). • NETWORKINFERENCE also allows the user to modify the prior over network structures by specifying sets of links that are required to be present or required to be absent.

  24. Simulated annealing overview • The concept is based on the process of annealing: a melt is slowly cooled so that the system at any time is approximately in thermodynamic equilibrium. • If the change in energy is negative, the new configuration is accepted; if the change in energy is positive, it is accepted with a certain probability. This process is repeated enough times to give good sampling statistics at the current temperature; the temperature is then decremented and the process repeated until a frozen state is achieved. • If the initial temperature of the system is too low, or cooling is done insufficiently slowly, the system may become trapped in a local minimum energy state.
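A minimal sketch of the acceptance rule, phrased in terms of a network score (higher is better, i.e. the negative of the energy in the description above); the cooling schedule and the move set are left out:

```python
import math
import random

def accept(score_old, score_new, temperature):
    """Simulated-annealing acceptance: always take improvements;
    take a worse network with probability exp(delta / T), so large
    downhill moves become rare as the temperature T cools."""
    delta = score_new - score_old  # score = -energy, higher is better
    if delta >= 0:
        return True
    return random.random() < math.exp(delta / temperature)
```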

  25. Simulated annealing illustration • [Figure from the Computational Science Education Project, http://www.phy.ornl.gov/csep]

  26. NETWORKINFERENCE – Results • Sampling rate of 5: Recovery 89±0.1%, Correctness 98±0.1%. • Sampling rate of 10: Recovery 27±0.3%, Correctness 30±0.4%.

  27. Smith et al. (2002) - discussion • It is possible to sort out noise and distracters from a specific regulatory network of interest. • It is possible to evaluate, and thus develop, inference algorithms by using simulated data. • Shows how sampling influences the inference. • Simulator and inference are based on distinct models. • A very simple model was used for the simulator. • The inference procedure was tested on an unrealistically large training set. • The conclusions are not convincing enough.

  28. Husmeier (2003) - abstract • The objective of this study is to test the viability of the BN paradigm in a realistic simulation study. • First, gene expression data are simulated from a realistic biological network. • Then, interaction networks are inferred from these data using dynamic Bayesian networks (DBNs) and Bayesian learning with Markov chain Monte Carlo (MCMC).

  29. Inference - MCMC • Since exact identification of the highest-scoring structure is infeasible, one has to resort to heuristic methods. • Markov chain Monte Carlo (Chib and Greenberg 1995): given a network structure Mold, a new structure Mnew is proposed with proposal probability Q(Mnew|Mold). • The proposal is then accepted with the Metropolis-Hastings probability min{1, [P(D|Mnew)P(Mnew)Q(Mold|Mnew)] / [P(D|Mold)P(Mold)Q(Mnew|Mold)]}; otherwise the chain stays at Mold (see the sketch below). (Husmeier 2003)
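A sketch of one such MCMC step under the Metropolis-Hastings rule above; `log_score` and `propose` are placeholders for the log posterior and the structure proposal distribution:

```python
import math
import random

def mh_step(m_old, log_score, propose):
    """One Metropolis-Hastings move over network structures.
    log_score(M) should return log P(D|M) + log P(M);
    propose(M) returns (M_new, log Q(M_new|M), log Q(M|M_new))."""
    m_new, log_q_fwd, log_q_rev = propose(m_old)
    log_alpha = (log_score(m_new) - log_score(m_old)
                 + log_q_rev - log_q_fwd)
    if random.random() < math.exp(min(0.0, log_alpha)):
        return m_new  # accepted
    return m_old      # rejected: the chain stays where it is
```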

  30. DBN settings (Nachman 2004) • To avoid an explosion of the model complexity, parameters are tied such that the transition probabilities between time slices are the same. • Edges within a time slice are not allowed. • A limit was set on the maximum number of edges converging on a single node (the fan-in).

  31. Simulation study • Data are generated from a known Bayesian network. • New networks are learned and compared with the true network.

  32. Evaluating inference • Sensitivity is the proportion of true edges recovered: TP / (TP + FN). • Complementary specificity is the proportion of spurious edges erroneously recovered: FP / (TN + FP). • Plotting the ensuing sensitivity against the corresponding complementary specificity gives the receiver operating characteristic (ROC) curve (see the sketch below).
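A small sketch of these two quantities for a single learned network, with edges as Python sets; sweeping the threshold on the posterior edge probabilities and re-evaluating traces out the ROC curve:

```python
def roc_point(true_edges, learned_edges, all_possible_edges):
    """Sensitivity and complementary specificity for one learned network."""
    tp = len(learned_edges & true_edges)   # true edges recovered
    fp = len(learned_edges - true_edges)   # spurious edges recovered
    fn = len(true_edges - learned_edges)   # true edges missed
    tn = len(all_possible_edges) - tp - fp - fn
    sensitivity = tp / (tp + fn)
    comp_specificity = fp / (tn + fp)
    return sensitivity, comp_specificity
```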

  33. Evaluation on synthetic data - structure • Uses the structure of a sub-network of the yeast cell cycle, taken from Friedman et al. (2000). • In a second series of simulations, 38 redundant nodes were added to this network as confounders, giving a total of 50 nodes.

  34. Evaluation on synthetic data - distribution • First conditional probability distribution: noisy regulation according to a binomial distribution (see the tables sketched below). • Excitation: P(on|on) = 0.9, P(on|off) = 0.1. • Inhibition: P(on|on) = 0.1, P(on|off) = 0.9. • Noisy XOR co-regulation: P(on|on,on) = P(on|off,off) = 0.1, P(on|off,on) = P(on|on,off) = 0.9. • Second conditional probability distribution: stochastic interaction, where all parameters were chosen at random (binomial or trinomial).
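The first family of distributions can be written down directly as lookup tables; a sketch (state names and the sampling helper are illustrative):

```python
import random

# P(child = on | parent state(s)) for each interaction type above.
EXCITATION = {"on": 0.9, "off": 0.1}
INHIBITION = {"on": 0.1, "off": 0.9}
NOISY_XOR = {("on", "on"): 0.1, ("off", "off"): 0.1,
             ("on", "off"): 0.9, ("off", "on"): 0.9}

def sample_child(p_on):
    """Draw the child's binary state given P(on | parents)."""
    return "on" if random.random() < p_on else "off"

# A child excited by an 'on' parent is itself 'on' 90% of the time:
print(sample_child(EXCITATION["on"]))
```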

  35. Evaluation on synthetic data – results • [Figure: ROC curves for Structure 1 and Structure 2 at training set sizes T=100, T=30 and T=7. The solid line shows noisy regulation; the dashed line shows stochastic interaction.]

  36. And still, this is an upper bound • The model used for inference is identical to the simulator model. • The true continuous signals are typically sampled at discrete time points, which loses information, especially if the sampling intervals are not matched to the relaxation times of the true biological processes. • Gene expression ratios are typically discretized, which inevitably adds noise and causes further loss of information.

  37. Evaluation on realistic simulated data • The study applies the model regulatory network proposed by Zak et al. (2001) but, in contrast to the original, augments the system with 41 spurious, unconnected genes (giving a total of 50 genes), which were up- and down-regulated at random. (Husmeier 2003)

  38. Realistic simulated data – two experiments • The first experiment (closely following Zak's study): ligand was injected at time 1000 min; 12 data points were then collected over 4000 min at equidistant intervals. • The second experiment (a sampling strategy different from Zak's): focuses on the time immediately after the external perturbation, when the system is in disequilibrium. The sampler therefore collected 12 data points over a shorter interval of only 500 min immediately after ligand injection, between times 1100 and 1600 min. See the sketch of the two time grids below.
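For concreteness, a sketch of the two sampling grids; that the first experiment's 12 equidistant points span t = 1000-5000 min is an assumption, since the slide only gives the total duration:

```python
import numpy as np

# Experiment 1: 12 equidistant samples over 4000 min after injection
# at t = 1000 min (assumed endpoints).
grid_1 = np.linspace(1000, 5000, 12)

# Experiment 2: 12 samples packed into the 500 min of disequilibrium
# right after the perturbation (times 1100-1600 min, as on the slide).
grid_2 = np.linspace(1100, 1600, 12)
```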

  39. Inference results • The most restrictive prior (maximum fan-in = 2) was used. • The threshold θ was chosen so as to obtain the same number of edges between non-spurious nodes as in the true network (see the sketch below). • [Figure: true vs. learned network; solid arrows show true edges, dashed arrows represent spurious edges.]
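A sketch of the thresholding step, assuming the marginal edge posteriors estimated from the MCMC structure sample are collected into a matrix:

```python
import numpy as np

def edges_above_threshold(edge_posterior, theta):
    """Keep edge (i, j) if its posterior probability exceeds theta.
    `edge_posterior` is an (n, n) matrix of marginal edge probabilities
    estimated from the MCMC sample of network structures."""
    return {(int(i), int(j))
            for i, j in zip(*np.nonzero(edge_posterior > theta))}
```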

  40. Realistic simulated data – Results • [Figure: ROC curves for Experiment 1 and Experiment 2 at fan-in limits 2, 3 and 4, averaged over three MCMC simulations. The solid line shows the ROC curve obtained from gene expression data alone; the dash-dotted line shows the curve obtained when sequence information is included.]

  41. Realistic simulated data – Discussion • The inclusion of available prior knowledge improves the performance of the inference. • Sampling over a long time interval, which mostly covers the system in equilibrium, gives small areas under the ROC curves, and the low slope of the curves at the left-hand side implies that even the dominant true edges are obscured by a large proportion of spurious edges. • Sampling over a short time interval, when the system is in a perturbed non-equilibrium state, gives significantly larger areas, and the steeper slope of the curves implies that the dominant true edges are obscured by far fewer spurious edges.

  42. Discussion • The search for new genetic interactions is hard, but it is significantly more effective than a search from tabula rasa. • The simulations quantify how much can be learned from the data in the unfavourable situation described. • The study demonstrates how network inference performance varies with the training set size, the prior assumptions, the experimental sampling strategy and the inclusion of biological information.

  43. References • Friedman et al., “Learning the structure of dynamic probabilistic networks”, Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998. • Friedman et al., “Using Bayesian networks to analyze expression data”, Journal of Computational Biology, 2000. • Husmeier D., “Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks”, Bioinformatics, 2003. • Nachman et al., “Inferring quantitative models of regulatory networks from expression data”, Bioinformatics, 2004. • Smith et al., “Evaluating functional network inference using simulations of complex biological systems”, Bioinformatics, 2002. • Yu et al., “Advances to Bayesian network inference for generating causal networks from observational biological data”, Bioinformatics, 2004. • Zak et al., “Local identifiability: when can genetic networks be identified from microarray data?”, Proceedings of the Third International Conference on Systems Biology, 2002.
