Genetic Regulatory Network Inference

Genetic Regulatory Network Inference Russell Schwartz Department of Biological Sciences Carnegie Mellon University

Why Study Network Inference? It can help us understand how to interpret and when to trust biological networks It is a model for many kinds of complex inference problems in systems biology and beyond It is a great example of a machine learning problem, a kind of computer science central to much work in biology Network inference is a good way of thinking about issues in data abstraction central to all computational thinking

Our Assumptions + + + - + Cro cI - + * genes conditions We will focus specifically on transcriptional regulatory networks, assuming no cycles We will assume, at least initially, that our data source is a set of microarray gene expression values *Clustered gene expression data from NCBI Gene Expression Omnibus (GEO), entry GSE1037: M. H. Jones et al. (2004) Lancet 363(9411):775-81.

Intuition Behind Network Inference 1 1 1 1 1 genes 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 conditions 1 1 + - 1 + - 2 3 2 3 - - … 2 3 - 1 1 + + - - 4 2 3 2 3 - correlated expression implies common regulation that intuition still leaves a lot of ambiguity

Why Is Intuition Not Enough? … • Models are ambiguous: • Data are noisy: • Data are sparse: * 1 1 1 3 3 2 3 2 2 4 4 4 *Clustered gene expression data from NCBI Gene Expression Omnibus (GEO), entry GSE1037: M. H. Jones et al. (2004) Lancet 363(9411):775-81.

A Next Step Beyond Intuition: Assuming a Binary Input Matrix conditions gene 1 1 1 0 0 1 1 1 0 gene 2 1 1 1 1 0 0 1 0 gene 3 1 0 0 0 0 1 0 0 gene 4 0 0 0 0 0 1 0 1 1 4 We will assume for the moment that genes only have two possible states: 0 (off) or 1 (on) We will also assume that we want to find directionality but not strength of regulatory interactions: 2 3

Making it Even Simpler: Two Genes conditions gene 1 1 1 0 0 1 1 1 0 gene 2 1 1 1 0 0 1 0 1 2 1 2 1 1 2 model 2 “G2 regulates G1” model 3 “G1 and G2 are independent” model 1 “G1 regulates G2” Only three possible models to consider

Judging a Model: Likelihood Complicated inference problems like this are commonly described in terms of probabilities We want to infer a model (which we will call M) using a data set (which we will call D) Problems like this are commonly posed in terms of maximizing a likelihood function: We read this as “probability of the data given the model,” i.e., the probability that a given model would generate a given data set

What is the Probability of a Microarray? Pr{ }= Pr{ }x Pr{ }x Pr{ }x Pr{ }x Pr{ }x Pr{ }x Pr{ }x Pr{ } 1 1 0 0 1 1 1 0 1 1 1 0 0 1 1 0 We can describe the probability of a microarray as the product of the probabilities of all of its individual measurements:

What is the Probability of One Measurement on a Microarray? 1 0 1 0 Pr{ } =Pr{ }x Pr{ }x Pr{ }x Pr{ }x Pr{ }x Pr{ }x Pr{ }x Pr{ } =5/8 x 5/8 x 3/8 x 3/8 x 5/8 x 5/8 x 5/8 x 3/8 = 0.00503 1 1 0 0 1 1 1 0 1 1 1 0 0 1 1 0 • We can estimate Pr{ } and Pr{ } by counting how often each individual value occurs: • Pr{ } = 5/8 • Pr{ } = 3/8 • Therefore:

Evaluating One Model gene 1 1 1 0 0 1 1 1 0 data D = gene 2 1 1 1 0 0 1 0 1 model M = 1 2 Pr{D|M} = Pr{ } x Pr{ } = 0.00503 x 0.00503 = 2.5 x 10-5 1 1 0 0 1 1 1 0 1 1 1 0 0 1 0 1

Adding in Regulation gene 1 1 1 0 0 1 1 1 0 2 1 gene 2 1 1 1 1 0 0 1 0 Pr{G2= |G1= } = 1/5 Pr{G2= |G1= } = 2/3 0 1 0 0 Pr{G2= |G1= } = 4/5 Pr{G2= |G1= } = 1/3 0 1 1 1 How do we evaluate output probabilities for a regulated gene? We need the notion of conditional probability: evaluating the probability of gene 2’s output given that we know gene one’s output:

Evaluating Another Model gene 1 1 1 0 0 1 1 1 0 data D = gene 2 1 1 1 0 0 1 0 1 model M = 1 2 Pr{D|M} = Pr{ } x Pr{ | } = 0.00503 x (1/5 x 4/5 x 2/3 x 1/3 x 4/5 x 4/5 x 4/5 x 2/3) = 6.1 x 10-5 1 1 0 0 1 1 1 0 1 1 1 0 0 1 0 1 1 0 0 1 1 1 0 1

Evaluating Another Model gene 1 1 1 0 0 1 1 1 0 data D = gene 2 1 1 1 0 0 1 0 1 model M = 1 2 Pr{D|M} = Pr{ | } x Pr{ } = (1/5 x 4/5 x 2/3 x 1/3 x 4/5 x 4/5 x 4/5 x 2/3) x 0.00503 = 6.1 x 10-5 1 1 1 1 1 0 0 1 1 1 0 0 0 1 0 1 1 1 1 0 0 1 0 1

Comparing the Models for Two Genes 1 1 0 0 1 1 1 0 Pr{ | } = 2.5 x 10-5 1 2 1 1 1 0 0 1 0 1 1 1 0 0 1 1 1 0 Pr{ | } = 6.1 x 10-5 1 2 1 1 1 0 0 1 0 1 1 1 0 0 1 1 1 0 Pr{ | } = 6.1 x 10-5 1 2 1 1 1 0 0 1 0 1 Conclusion: Knowing the expression of gene 1 helps us predict the expression of gene 2 and vice versa; we can suggest there should be an edge between them but cannot decide the direction it should take

Generalizing to Many Genes 1 1 0 0 1 1 1 0 1 4 1 1 1 0 0 1 0 1 Pr{ | } = Pr{ } x Pr{ | } x Pr{ | , } x Pr{ | } 1 0 0 0 0 1 0 0 2 3 0 0 0 0 1 0 1 0 1 1 0 0 1 1 1 0 1 1 1 0 0 1 0 1 1 0 0 1 1 1 0 1 1 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0 1 1 1 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 The same basic concepts let us evaluate the plausibility of any regulatory model This is known as a Bayesian graphical model

Adding Prior Knowledge 1 1 0 0 1 1 1 0 1 1 4 1 1 1 0 0 1 0 1 Pr{ | } x Pr{ } x Pr{ } x Pr { } x Pr { } x … 1 0 0 0 0 1 0 0 2 3 2 0 0 0 0 1 0 1 0 1 1 2 4 3 3 We can also build in any prior knowledge we have about the proper model (e.g., from the literature) We can use that knowledge by simply multiplying each likelihood by our prior confidence in its validity:

Adding in Other Data Types 1 1 1 0 0 1 1 1 0 Pr{ , ACGATCTCA… | } = Pr{ | } x Pr{ACGATCTCA … | } 1 1 1 0 0 1 0 1 2 1 1 1 0 0 1 1 1 0 1 1 1 0 0 1 0 1 2 1 Evaluate as before 2 We can also incorporate other pieces of evidence in much the same way Example: suppose we have microarrays and TF binding site predictions: Evaluate by a binding site prediction method (e.g., PSSM)

Moving from Discrete to Real-Valued Data 1.5 0.4 -0.3 -1.2 -1 0 1 1.5 0.4 -0.3 -1.2 Pr{ } = We can also drop the need for discrete (on or off) data by making an assumption of how values vary in the absence of regulation, e.g., Gaussian:

Finding the Best Model • We now know how to compare different network models, but finding the best model is not easy; far too many possibilities to compare them all • Algorithms for model inference is a more complex topic than we can cover here, but there are some general approaches to be aware of • optimization: many specialized methods exist for finding the best model without trying everything; solving hard problems of this type is a core concern in computer science • sampling: there are also many specialized methods for randomly generating solutions likely to be “good” and seeing what model features are preserved across most solutions; this is a core concern of statisticians

Network Inference in Practice The methods covered here are the key ideas behind how people really infer networks from complex data The practice is usually more complicated, though: many kinds of data sources, specialized prior probabilities, lots of algorithmic tricks needed to get good results If you really want to know the details, these topics are typically covered in a class on machine learning

Genetic Regulatory Network Inference

Genetic Regulatory Network Inference

Presentation Transcript

Genetic network inference: from co-expression clustering to reverse engineering

PROTEIN INTERACTION NETWORK – INFERENCE TOOL

Genetic Regulatory Mechanisms

Motif-directed Network Component Analysis for Regulatory Network Inference

Bayesian network inference

Network Inference

Gene Regulatory Network Inference

Predicting Genetic Regulatory Response Using Classification

Gene regulatory network

Modeling Genetic Network: Boolean Network

Stochastic Simulations of Genetic Regulatory Networks: The Genetic Toggle Switch

Nonparametric Bayesian Methods for Genetic Inference

Inference of gene regulatory networks using regression based network method

Genetic Regulatory Mechanisms

Genetic Regulatory Networks

Dynamic Network Inference

Exploring Network Inference Models

Regulatory Guidance for Genetic Testing

Recombination and genetic variation – models and inference

Motif-directed Network Component Analysis for Regulatory Network Inference