Systems Biology: The inference of networks from high dimensional genomics data



  1. Systems Biology: The inference of networks from high dimensional genomics data Ka Yee Yeung Nov 3, 2011

  2. Systems Biology • “Systems biology is the study of the interactions between the components of a biological system, and how these interactions give rise to the function and behavior of that system” (Nir Friedman) • The goal is to construct models of complex biological systems and diseases (Trey Ideker)

  3. An iterative approach • High-throughput assays • Experiments • Data handling • Integration of multiple forms of experiments and knowledge • Mathematical modeling

  4. Multi-disciplinary Science • Biology • Biotechnology • Computer Science • Mathematics and Statistics • Physics and chemistry • Engineering…

  5. Networks as a universal language • We are caught in an inescapable network of mutuality. ... Whatever affects one directly, affects all indirectly. — Martin Luther King Jr. • Examples: the Internet, electronic circuits, social networks, gene regulatory networks • Science Special Online Collection: Complex systems and networks: http://www.sciencemag.org/complexity/

  6. Road Map • Definitions: graphical representation of networks • Different types of molecular networks • What can we do with networks? • Network construction methods • Co-expression networks • Bayesian networks • Regression-based methods

  7. Graphical Representation of Gene Networks • G=(V,E) where • V: set of nodes (vertices) • E: set of edges representing the relationships between the nodes • Directed vs. undirected • Network topology: connectivity structure • Modules: subsets of nodes that are more highly interconnected with each other than with the rest of the network (Figures: examples of an undirected and a directed graph; see also the sketch below)
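As a concrete illustration of the G=(V,E) idea, here is a minimal Python sketch of an adjacency-list representation; the gene names and edges are invented purely for the example.

```python
# A toy adjacency-list representation of G = (V, E); the gene names are made up.
from collections import defaultdict

def build_graph(edges, directed=False):
    """Return an adjacency list mapping each node to its set of neighbours."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        if not directed:
            adj[v].add(u)  # undirected: store the edge in both directions
    return adj

edges = [("geneA", "geneB"), ("geneB", "geneC"), ("geneA", "geneC")]
undirected = build_graph(edges)               # e.g. a co-expression-style network
directed = build_graph(edges, directed=True)  # e.g. a regulator -> target network
print(sorted(undirected["geneA"]))            # ['geneB', 'geneC']
```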

  8. Degree • The degree k of a node is the number of edges connected to it. • In a directed graph, each node has an in-degree and an out-degree. Courtesy of Bill Noble

  9. Degree distribution • The degree distribution plots the number of nodes that have a given degree k as a function of k. • The shape of the degree distribution allows us to distinguish among types of networks. Courtesy of Bill Noble
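A short sketch of how the degree distribution (number of nodes with each degree k) could be tabulated from an adjacency list; the hub-and-spoke graph below is a made-up toy example.

```python
from collections import Counter

def degree_distribution(adj):
    """Count how many nodes have each degree k (undirected graph)."""
    degrees = [len(neighbours) for neighbours in adj.values()]
    return Counter(degrees)  # maps degree k -> number of nodes with that degree

# Toy hub-and-spoke graph: one hub connected to four leaf nodes.
adj = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"},
}
print(degree_distribution(adj))  # Counter({1: 4, 4: 1})
```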

  10. Scale-free networks • Most nodes have only one connection; a few hub nodes are highly connected. • The degree distribution follows a power law, which yields a straight line on a log-log plot. • Most biological networks are scale-free. Courtesy of Bill Noble

  11. Molecular or Biochemical Pathways • A set of coupled chemical reactions or signaling events. • Nodes are molecules (often substrates) and edges represent chemical reactions. • Represent decades of work in which the underlying chemical reactions are validated. • Example: KEGG (Kyoto Encyclopedia of Genes and Genomes) • Contains 410 “pathways” that represent molecular interaction and reaction networks that are manually curated from 149,937 published references (as of 10/25/2011).

  12. Molecular Networks Constructed from High-throughput assays (1) • Physical interaction network: • A graphical representation of molecular binding interactions, such as a protein-protein interaction (PPI) network. • Nodes are molecules; edges represent physical interactions between molecules. • Example: the yeast PPI network, in which most interactions are derived from large-scale experiments such as yeast 2-hybrid data (high false-positive and false-negative rates)

  13. Molecular Networks Constructed from High-throughput assays (2) Correlation or co-expression network: A graphical representation that averages over observed expression data. Nodes are mRNA molecules, edges represent correlations between expression levels of connected nodes. Bayesian networks: A directed, graphical representation of the probabilities of one observation given another. Nodes represent mRNA molecules; edges represent the probability of a particular expression value given the expression values of the parent nodes.

  14. What can we do with these molecular networks? • Using the position in networks to describe function • Guilt by association • Finding the causal regulator (the "Blame Game") Courtesy of Mark Gerstein

  15. What can we do with these molecular networks? • Hubs tend to be essential! (Figure: power-law degree distribution on a log-log plot.) • Success stories: • Network modeling links breast cancer susceptibility and centrosome dysfunction. Pujana et al. Nature Genetics 2007 • Analysis of Oncogenic Signaling Networks in Glioblastoma Identifies ASPM as a Novel Molecular Target. PNAS 2006 Courtesy of Mark Gerstein

  16. What can we do with these molecular networks? Network-based drug discovery • Success stories: • Variations in DNA elucidate molecular networks that cause disease. Chen et al. Nature 2008. • Genetics of gene expression and its effect on disease. Emilsson et al. Nature 2008. Fig 4 Schadt et al. Nature Reviews Drug Discovery 2009

  17. Motivation of gene network inference • Using biochemical methods, it takes thousands of person-years to assign genes to pathways. • Even for well-studied genomes, the majority of genes are not mapped to known pathways. • As more genomes are sequenced and more genes are discovered, we need systematic methods to assign genes to pathways.

  18. A gene-regulation function describes how inputs, such as transcription factors and regulatory elements, are transformed into a gene’s mRNA level. Kim et al. Science 2009

  19. A gene-regulation function describes how inputs, such as transcription factors and regulatory elements, are transformed into a gene’s mRNA level. Modeling DNA sequence-based cis-regulatory gene networks. Bolouri & Davidson. 2002. Kim et al. Science 2009

  20. Network construction methods • Co-expression networks • Bayesian networks • Regression-based methods

  21. Goal: construct gene networks

  22. Early inference of transcriptional regulation: Clustering • Clustering: extract groups of genes that are tightly co-expressed over a range of different experiments. • Pattern discovery • No prior knowledge required • Applications: • Guilt by association (functional annotations) • Extraction of regulatory motifs • Molecular signatures for tissue sub-types

  23. Correlation: pairwise similarity • From the raw matrix (n genes × p experiments), compute a genes × genes similarity matrix of pairwise correlations between expression profiles • Example profiles: Correlation (X,Y) = 1, Correlation (X,Z) = -1, Correlation (X,W) = 1
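To make the raw-matrix-to-similarity-matrix step concrete, here is a sketch using NumPy's corrcoef; the toy expression values are invented so that W is a scaled copy of X and Z is X reversed, reproducing the example correlations above.

```python
import numpy as np

# Toy expression matrix: rows = genes, columns = experiments (values are made up).
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],   # gene X
    [2.0, 4.0, 6.0, 8.0],   # gene W: X scaled by 2  -> correlation +1 with X
    [4.0, 3.0, 2.0, 1.0],   # gene Z: X reversed     -> correlation -1 with X
])
genes = ["X", "W", "Z"]

corr = np.corrcoef(expr)    # genes x genes Pearson correlation (similarity) matrix
print(np.round(corr, 2))
# corr[0, 1] ~  1.0  (X vs W), corr[0, 2] ~ -1.0  (X vs Z)
```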

  24. Clustering algorithms • Inputs: • Similarity matrix • Number of clusters or some other parameters • Many different classifications of clustering algorithms: • Hierarchical vs partitional • Heuristic-based vs model-based • Soft vs hard

  25. Hierarchical Clustering • Agglomerative (bottom-up) • Algorithm: • Initialize: each item is its own cluster • Iterate: select the two most similar clusters and merge them • Halt: when the required number of clusters is reached • The merge history can be drawn as a dendrogram (see the sketch below)
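A minimal sketch of agglomerative clustering along these lines using SciPy; the random expression matrix, the correlation-based distance, the average-link choice, and the cut at four clusters are all illustrative assumptions, not values from the lecture.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
expr = rng.normal(size=(20, 6))        # toy matrix: 20 genes x 6 experiments

# Distance = 1 - Pearson correlation; 'average' gives average-link clustering.
Z = linkage(expr, method="average", metric="correlation")

labels = fcluster(Z, t=4, criterion="maxclust")   # halt at 4 clusters
print(labels)                                     # cluster id for each gene
```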

  26. Hierarchical: Single Link • Cluster similarity = similarity of the two most similar members • Pro: fast • Con: potentially long and skinny clusters

  27. Hierarchical: Complete Link • Cluster similarity = similarity of the two least similar members • Pro: tight clusters • Con: slow

  28. Hierarchical: Average Link • Cluster similarity = average similarity of all pairs • Pro: tight clusters • Con: slow

  29. Co-expression Networks • Co-expression networks • Aka: Correlation networks, association networks • Use microarray data only • Nodes are connected if they have a significant pairwise expression profile association across environmental perturbations • References: • A general framework for weighted gene co-expression network analysis (Zhang, Horvath SAGMB 2005) • WGCNA: an R package for weighted correlation network analysis. (Langfelder, Horvath BMC Bioinformatics 2008)

  30. Overview: gene co-expression network analysis Steps for constructing a co-expression network: • A) Microarray gene expression data • B) Measure concordance of gene expression with correlation • C) The Pearson correlation matrix is either thresholded to arrive at an adjacency matrix  unweighted network, or transformed continuously with the power adjacency function  weighted network (see the sketch below)
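A sketch of step (C) and its weighted counterpart; the hard threshold 0.8 and the soft-thresholding power β = 6 are illustrative choices (in WGCNA the power is typically chosen by a scale-free topology criterion), not values taken from the slide.

```python
import numpy as np

rng = np.random.default_rng(1)
expr = rng.normal(size=(50, 10))          # toy data: 50 genes x 10 arrays
corr = np.corrcoef(expr)                  # Pearson correlation matrix (genes x genes)

# Hard thresholding -> unweighted network (0/1 adjacency matrix)
tau = 0.8                                 # illustrative threshold
adj_unweighted = (np.abs(corr) >= tau).astype(int)
np.fill_diagonal(adj_unweighted, 0)       # no self-edges

# Soft thresholding with the power adjacency function -> weighted network
beta = 6                                  # illustrative soft-thresholding power
adj_weighted = np.abs(corr) ** beta
np.fill_diagonal(adj_weighted, 0)
```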

  31. Example: co-expression network • Correlation matrix with threshold t=1: no edges, so all nodes (A–J) have degree k = 0

  32. Example: co-expression network • Correlation matrix with threshold t=0.9: 4 edges, so 8 nodes have degree k = 1

  33. Example: co-expression network • Correlation matrix with threshold t=0.7: more edges appear; the resulting degree distribution is plotted as log(P(k)) vs. log(k)

  34. Bayesian networks • A directed acyclic graph (DAG) such that the nodes represent mRNA expression levels and the edges represent the probability of observing an expression value given the values of the parent nodes. • The probability distribution for a gene depends only on its regulators (parents) in the network. Example: G4 and G5 share a common regulator G2, i.e., they are conditionally independent given G2.  factorization of the full joint probability distribution into component conditional distributions. Needham et al. PLOS Comp Bio 2007

  35. Independent Events • If G1, …, G5 are independent, then the joint probability p(G1, G2, G3, G4, G5) = p(G1) p(G2) p(G3) p(G4) p(G5) • Example: K=“KaYee gives the lecture today”. R=“It is raining outside today”. Whether it is rain or shine outside doesn’t affect whether KaYee is giving the lecture today, so p(K,R) = p(K) * p(R)

  36. Conditional Probability Distributions • Conditional probability distributions: p(B|A) = the probability of B given A. Example: K=“KaYee gives the lecture today”. E=“today’s lecture contains equations” P(E, K) = Probability that Ka Yee gives the lecture today and today’s lecture contains equations = 0.05. P(K)=1/10 = 0.1. P(E|K) = P(E, K) / P(K) = 0.05/0.1 = 0.5. • Score a network (fit) in light of the data: p(M|D) where D=data, M=network structure  infer how well a particular network explains the observed data.

  37. Conditional Independence • Example: K=“KaYee gives the lecture today”. E=“today’s lecture contains equations”. C=“today’s slides are in Comic Sans font”. If Ka Yee is giving the lecture today, then whether today’s lecture contains equations doesn’t affect whether today’s slides are in Comic Sans: P(E|K,C) = P(E|K), i.e., E and C are conditionally independent given K. • In Bayesian networks, each node is independent of its non-descendants, given its parents in the DAG. • Using conditional independence between variables, the joint probability distribution of the model may be represented in a compact manner.

  38. Joint Probability Distribution p(G1, G2, G3, G4, G5) = p(G1) p(G2|G1) p(G3|G1) p(G4|G2) p(G5|G1, G2, G3)
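A toy numerical check of this factorization for binary on/off genes; every probability below is invented solely to show how the conditional tables multiply together.

```python
# Factorized joint for the DAG on the slide:
# p(G1,G2,G3,G4,G5) = p(G1) p(G2|G1) p(G3|G1) p(G4|G2) p(G5|G1,G2,G3)
# All probabilities are invented; states are 0 (off) / 1 (on).

p_G1 = {1: 0.3, 0: 0.7}
p_G2_given_G1 = {(1, 1): 0.8, (1, 0): 0.1}     # P(G2=1 | G1)
p_G3_given_G1 = {(1, 1): 0.6, (1, 0): 0.2}     # P(G3=1 | G1)
p_G4_given_G2 = {(1, 1): 0.9, (1, 0): 0.05}    # P(G4=1 | G2)
p_G5_given_G123 = {(1, 1, 1, 1): 0.95}         # P(G5=1 | G1=1,G2=1,G3=1)

def cond(table, value, *parents):
    """Look up P(node = value | parents) from a table keyed on (1, parents)."""
    p_one = table[(1, *parents)]
    return p_one if value == 1 else 1.0 - p_one

# Joint probability of the single configuration G1=G2=G3=G4=G5=1:
joint = (p_G1[1]
         * cond(p_G2_given_G1, 1, 1)
         * cond(p_G3_given_G1, 1, 1)
         * cond(p_G4_given_G2, 1, 1)
         * cond(p_G5_given_G123, 1, 1, 1, 1))
print(round(joint, 4))   # 0.3 * 0.8 * 0.6 * 0.9 * 0.95 = 0.1231
```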

  39. Constructing a Bayesian network • Define the variables (nodes in the graph). • Add edges to the graph by computing conditional probabilities that characterize the distribution of states of each node given the states of its parents. • The number of possible network structures grows exponentially with the number of nodes, so an exhaustive search of all possible structures to find the one best supported by the data is not feasible. • Markov chain Monte Carlo (MCMC) algorithm: • Start with a random network. • Make small random changes to the network by flipping, adding, or deleting individual edges. • Accept changes that improve the fit of the network to the data. (A simplified sketch follows below.)
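A heavily simplified sketch of the structure-search idea. It is closer to greedy hill climbing than a full MCMC sampler (a real MCMC would also accept some score-decreasing moves), and both the network score and the acyclicity check are caller-supplied placeholders rather than part of any specific library.

```python
import random

def random_edge_move(edges, nodes):
    """Propose a small change: add, delete, or flip one directed edge."""
    candidate = set(edges)
    u, v = random.sample(nodes, 2)          # nodes: a list of node names
    if (u, v) in candidate:
        candidate.discard((u, v))
        if random.random() < 0.5:
            candidate.add((v, u))           # flip the edge instead of deleting it
    else:
        candidate.add((u, v))
    return candidate

def search_structure(nodes, score, is_acyclic, n_steps=10_000):
    """Greedy structure search: keep proposals that improve the network score."""
    edges, best = set(), float("-inf")
    for _ in range(n_steps):
        proposal = random_edge_move(edges, nodes)
        if not is_acyclic(proposal):
            continue                        # Bayesian networks must remain a DAG
        s = score(proposal)
        if s > best:
            edges, best = proposal, s
    return edges

# Demo with stand-in functions: the "score" simply prefers fewer edges,
# and the acyclicity check is stubbed out; both are purely illustrative.
demo = search_structure(["G1", "G2", "G3"],
                        score=lambda e: -len(e),
                        is_acyclic=lambda e: True,
                        n_steps=200)
print(demo)
```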

  40. Bayesian networks • Advantages: • Compact and intuitive representation • Integration of prior knowledge • Probabilistic framework for data integration • Limitation: no feedback loop  dynamic Bayesian networks (variables are indexed by time and replicated in the network) • References: • Using Bayesian Network to Analyze Expression Data. Friedman et al. J. Computational Biology 7:601-620, 2000. • A Primer on Learning in Bayesian Networks for Computational Biology. Needham et al. PLOS Computational Biology 2007.

  41. What kinds of data contain potential information about gene networks? • Large expression sets: • Co-expression (correlation of expression levels) implies connectivity • But correlation ≠ causality • Adding causality: • Genetic perturbation: DNA variation at A influences RNA variation at B. • Time series: A goes up prior to B. • Prior knowledge

  42. Adding genetics data • Quantitative trait locus (QTL): a region of DNA that is associated with a particular trait (e.g., height) • QTL mapping (linkage analysis): correlate the genotypic and phenotypic variation

  43. Our data • Experimental design: BY (lab) × RM (wild) cross yielding 95 segregants; DNA genotype for each segregant; expression measured at 6 time points. • Time dependencies: ordering of regulatory events. • Genotype data: correlate DNA variations in the segregants to measured expression levels. • Phenotype: RNA levels in response to drug perturbation. Genetics of global gene expression. Rockman & Kruglyak. 2006. Experimental design: Roger Bumgarner, Kenneth Dombek, Eric Schadt, Jun Zhu.

  44. Supervised learning: integration of external data • External data sources: genome-wide binding data, literature, expression data, and other data (e.g., protein-protein interactions, genetic interactions, genotype). • These sources are used to estimate, for each (regulator, gene) pair, the probability that regulator R regulates gene g. • Candidate regulators are constrained by the external data sources; variable selection on the time series expression data then yields the gene regulatory network. Yeung et al. To appear in PNAS.

  45. Integration of external data • From the external data sources (genome-wide binding data, literature, expression data, other data), compute variables (Xi) that capture evidence of regulation for each (TF, gene) pair. • Training data: positive (Y=1) vs. negative (Y=0) training examples. • Apply logistic regression to determine the weights (ai’s) of the Xi’s, giving the probability that regulator R regulates gene g (see the sketch below).
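A sketch of this supervised step with scikit-learn's LogisticRegression; the features, labels, and (TF, gene) pairs below are synthetic placeholders for the real binding, literature, and expression evidence.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# One row per (TF, gene) pair; columns X_i = evidence of regulation
# (e.g. binding score, literature co-mention, expression correlation).
# Values and labels here are synthetic placeholders.
X_train = rng.normal(size=(200, 3))
y_train = rng.integers(0, 2, size=200)      # 1 = known regulation, 0 = negative example

model = LogisticRegression()
model.fit(X_train, y_train)                 # learns the weights a_i of the X_i

X_new = rng.normal(size=(5, 3))             # unlabelled (TF, gene) pairs
prior_prob = model.predict_proba(X_new)[:, 1]   # P(TF regulates gene | evidence)
print(np.round(prior_prob, 2))
```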

  46. Constraining candidate regulators • Graphical representation of the network as a set of nodes and edges. Goal: to infer the parent nodes (regulators) of each gene g using the time series expression data. • Without prior knowledge, every gene is a potential regulator of every other gene. We want to restrict the search to the most likely regulators. • For each gene g, we estimated a priori how likely it is that each regulator R regulates g, using the supervised framework and the external data sources.

  47. Regression-based approach • Use the expression levels at time (t-1) to predict the expression levels at time t in the same segregant. • Let X(g,t,s) = expression level of gene g at time t in segregant s. • For each gene g, apply variable selection over the potential regulators R (a sketch of the lag-1 setup follows below). Yeung et al. To appear in PNAS.
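A sketch of how the lag-1 design could be assembled from X(g, t, s); the array sizes and the candidate-regulator set are illustrative, and the plain least-squares fit at the end is only a stand-in for the variable selection (BMA) the slides describe.

```python
import numpy as np

# X[g, t, s] = expression of gene g at time t in segregant s (synthetic example).
n_genes, n_times, n_segs = 30, 6, 95
rng = np.random.default_rng(3)
X = rng.normal(size=(n_genes, n_times, n_segs))

g = 0                                   # target gene
regulators = [3, 7, 12]                 # candidate parents chosen a priori (illustrative)

# Response: expression of g at times 1..T-1, stacked over segregants.
y = X[g, 1:, :].ravel()

# Predictors: expression of each candidate regulator at times 0..T-2 (same segregant).
design = np.column_stack([X[r, :-1, :].ravel() for r in regulators])

# Ordinary least-squares fit; in the real method variable selection prunes these columns.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(coef, 3))
```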

  48. Bayesian Model Averaging (BMA) [Raftery 1995], [Hoeting et al. 1999] • BMA takes model uncertainty into account by averaging over the posterior distributions of a quantity of interest based on multiple models, weighted by their posterior model probabilities. • Output: posterior probabilities for selected genes and selected models (a sketch of the weighting idea follows below).
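A minimal sketch of the BMA weighting idea using the common BIC approximation to posterior model probabilities; the data and candidate models are synthetic, and the iterative BMA software used in practice does considerably more than this brute-force enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X = rng.normal(size=(n, p))                     # candidate predictors (synthetic)
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=n)

def bic(cols):
    """BIC of an ordinary least-squares model using the predictor subset `cols`."""
    design = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta) ** 2)
    return n * np.log(rss / n) + design.shape[1] * np.log(n)

# Enumerate all predictor subsets (feasible only for small p).
models = [cols for k in range(p + 1) for cols in itertools.combinations(range(p), k)]
bics = np.array([bic(m) for m in models])
weights = np.exp(-0.5 * (bics - bics.min()))
weights /= weights.sum()                        # approximate posterior model probabilities

# Posterior inclusion probability of a predictor = total weight of models containing it.
for j in range(p):
    incl = sum(w for m, w in zip(models, weights) if j in m)
    print(f"P(predictor {j} in model | data) ~ {incl:.2f}")
```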

  49. Assessment • Recovery of known regulatory relationships: we showed significant enrichment between our inferred network and the assessment criteria (e.g., child nodes of selected TFs compared against genes that respond to TF deletion under rapamycin perturbation, wild type vs. deletion strain). • Lab validation of selected sub-networks. • Comparison to other methods in the literature.
