
CS 5263 Bioinformatics


Presentation Transcript


  1. CS 5263 Bioinformatics Reverse-engineering Gene Regulatory Networks

  2. Genes and Proteins Gene (DNA) → [transcription (also called expression); controlled by transcriptional regulation] → mRNA (subject to mRNA degradation) → [translation; controlled by translational regulation] → Protein (subject to (de)activation; post-translational regulation)

  3. Gene Regulatory Networks • The functioning of the cell is controlled by interactions between genes and proteins • Genetic regulatory network: genes, proteins, and their mutual regulatory interactions • [Figure: a three-gene example network in which genes 1, 2, and 3 regulate one another through activator and repressor interactions]

  4. Reverse-engineering GRNs • GRNs are large, complex, and dynamic • Reconstruct the network from observed gene expression behaviors • Experimental methods focus on only a few genes • Computer-assisted analysis: large scale • Studied since the 1960s • Mostly theoretical at first • Attracting much attention since the advent of microarray technology • Emerging large-scale assay techniques are making it even more feasible (ChIP-chip, ChIP-seq, etc.)

  5. Problem Statement • Assumption: the expression value of a gene depends on the expression values of a set of other genes • Given: a set of gene expression values under different conditions • Goal: a function for each gene that predicts its expression value from the expression values of other genes • Probabilistic functions: Bayesian network • Boolean functions: Boolean network • Linear functions: linear model • Other possibilities, such as decision trees and SVMs

  6. Characteristics • Gene expression data is often noisy, with missing values • Only measures mRNA levels • Many genes are regulated not only at the transcriptional level • # genes >> # experiments: an underdetermined problem! • Correlation ≠ causality • Good news: network structure is sparse (scale-free)

  7. Methods for GRN inference • Directed and undirected graphs • E.g., KEGG, EcoCyc • Boolean networks • Kauffman (1969), Liang et al (1999), Shmulevich et al (2002), Lähdesmäki et al (2003) • Bayesian networks • Friedman et al (2000), Murphy and Mian (1999), Hartemink et al (2002) • Linear/non-linear regression models • D’Haeseleer et al (1999), Yeung et al (2002) • Differential equations • Chen, He & Church (1999) • Neural networks • Weaver, Workman and Stormo (1999)

  8. Boolean Networks • Genes are either on or off (expressed or not expressed) • The state of gene Xi at time t is a Boolean function of the states of some other genes at time t-1 • Example with three genes X, Y, Z (primes denote states at the next time step): X’ = Y AND (NOT Z), Y’ = X, Z’ = Y
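To make this concrete, here is a minimal Python sketch (mine, not from the slides) that simulates the three-gene network above under synchronous updates and reports the attractor cycle reached from each of the 2^3 initial states:

```python
def step(state):
    """One synchronous update of the (X, Y, Z) state:
    X' = Y AND (NOT Z), Y' = X, Z' = Y."""
    x, y, z = state
    return (y and not z, x, y)

def fmt(state):
    return ''.join('1' if b else '0' for b in state)

from itertools import product

for init in product((False, True), repeat=3):
    trajectory = [init]
    state = init
    while True:
        state = step(state)
        if state in trajectory:          # a repeated state closes a cycle
            cycle = trajectory[trajectory.index(state):]
            break
        trajectory.append(state)
    print(fmt(init), 'reaches attractor:', ' -> '.join(fmt(s) for s in cycle))
```

Every trajectory in a finite deterministic network ends in such a cycle; the set of start states that flow into the same cycle forms that attractor's basin, which is the subject of slides 12-13.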

  9. Learning Boolean Networks for Gene Expression • Assumptions: • Deterministic (the wiring does not change) • Synchronous updates • All Boolean functions are possible • Data needed: on the order of 2^N samples for N genes (in comparison, N suffice for linear models) • General technique: limit the # of inputs per gene to k; the data required drops to about 2^k · log(N)

  10. Learning Boolean Networks • Consistency problem • Given: examples S = {<In, Out>}, where In ∈ {0,1}^k and Out ∈ {0,1} • Goal: learn a Boolean function f such that for every <In, Out> ∈ S, f(In) = Out • Note: • Given the same input, the output must be unique • For k input variables, there are at most 2^k distinct input configurations • Example (inputs in binary, then in decimal): <001,1>, <101,1>, <110,1>, <010,0>, <011,0>, <101,0> — i.e., <1,1>, <5,1>, <6,1>, <2,0>, <3,0>, <5,0>

  11. Learning Boolean Networks • For the example <001,1>, <101,1>, <110,1>, <010,0>, <011,0>, <101,0>: input 101 appears with both outputs 1 and 0, a clash • No clash -> consistency • Question marks -> undetermined truth-table entries (input configurations never observed) • Checking a candidate input set takes O(M·k) time, where M is the # of examples • With N genes, choosing k inputs from N gives N · C(N, k) · O(M·k) overall • Best-fit problem: find a function f with the minimum # of errors • Limited error-size problem: find all functions with error size within a given maximum • Lähdesmäki et al, Machine Learning 2003;52:147-167 • (A runnable sketch of the consistency check follows)
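As an illustration of the consistency check, here is a short Python sketch (my own, not from the paper): it fills a 2^k truth table from the examples, marks undetermined entries with '?', and returns None on a clash:

```python
def check_consistency(examples, k):
    """examples: (input bitstring, output) pairs over k binary inputs.
    Returns the partial truth table, or None if the examples clash."""
    table = {}
    for inp, out in examples:
        if table.get(inp, out) != out:
            return None  # same input, different outputs: no consistent f
        table[inp] = out
    # '?' marks inputs never observed; any 0/1 choice there is consistent.
    return [table.get(format(i, f'0{k}b'), '?') for i in range(2 ** k)]

examples = [('001', 1), ('101', 1), ('110', 1), ('010', 0), ('011', 0)]
print(check_consistency(examples, 3))                  # partial truth table
print(check_consistency(examples + [('101', 0)], 3))   # clash -> None
```

Each '?' can be set to 0 or 1 independently, so the number of consistent functions is 2 raised to the number of undetermined entries.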

  12. State space and attractor basins

  13. What are some biological interpretations of basins and attractors?

  14. Linear Models • The expression level of a gene at time t depends linearly on the expression levels of some genes at time t-1 • Basic model: Xi(t) = Σj Wij Xj(t-1) • Continuous form: Xi’(t) = Σj Aij Xj(t), where Xi(t) can be measured and Xi’(t) can be estimated from Xi(t) • In matrix form: X’ = A·X, where X’ and X are N×M and A is N×N; M is the number of time points and N is the number of genes • [Figure: a three-gene network with weights Wij on the edges from states at time t-1 to states at time t]

  15. Linear Models (cont’d) • X’ = A·X • A (N×N): connectivity matrix; Aij describes the type and strength of the influence of the jth gene on the ith gene • To solve for A, we must solve M·N linear equations in N^2 unknowns • In general N^2 >> M·N, so the system is under-determined => an infinite number of solutions

  16. Getting Around the Curse of Dimensionality • Non-linear interpolation to increase the # of time points • Cluster genes to reduce the # of genes • Singular Value Decomposition (SVD) • The family of solutions is A = A0 + C·V^T, where C and V are N×N and Cij = 0 if j > M • Taking A0 as the solution guarantees the smallest sum of squares (see the numpy sketch below) • Robust regression • Minimize the # of edges in the network • Biological networks are sparse (scale-free)
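A minimal numpy sketch of the SVD step on toy data (my own construction, not the paper's code): np.linalg.pinv computes the pseudo-inverse via the SVD, so A0 = X’ · pinv(X) is exactly the particular solution with the smallest sum of squares:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 4                       # N genes >> M time points: under-determined
A_true = rng.normal(size=(N, N)) * (rng.random((N, N)) < 0.2)  # sparse truth
X = rng.normal(size=(N, M))        # measured expression snapshots
Xdot = A_true @ X                  # noiseless toy derivatives

A0 = Xdot @ np.linalg.pinv(X)      # minimum-norm (smallest sum of squares) A
print(np.allclose(A0 @ X, Xdot))   # True: A0 fits the observed data exactly
```

A0 fits the data but is generally dense and differs from A_true; that is what the sparsification step on the next slide addresses.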

  17. Robust Regression • A = A0 + C·V^T • Minimize the # of non-zero entries in A by selecting C • Setting A = 0 gives C·V^T = -A0; solve for C • Over-determined (N^2 equations, M·N free variables) • Robust regression: fit a hyperplane to a set of points so that it passes through as many points as possible
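As a simpler stand-in for the sparsification step (plainly not Yeung et al.'s exact robust-regression algorithm), an L1 penalty also drives entries of A toward zero; here each row of A is fit with scikit-learn's Lasso:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
N, M = 10, 4
A_true = rng.normal(size=(N, N)) * (rng.random((N, N)) < 0.2)
X = rng.normal(size=(N, M))
Xdot = A_true @ X

A_sparse = np.zeros((N, N))
for i in range(N):                       # one regression per gene
    model = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
    model.fit(X.T, Xdot[i])              # samples = time points
    A_sparse[i] = model.coef_

print((A_sparse != 0).sum(), 'of', N * N, 'entries nonzero')
```

The L1 route trades exact data fit for sparsity via the penalty weight alpha, whereas the slide's approach searches the exact-fit family A0 + C·V^T for a sparse member.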

  18. Simulation Experiments • [Figure: simulation results comparing SVD alone versus SVD + robust regression] • Yeung et al, PNAS. 2002;99:6163-8.

  19. Simulation Experiments (cont’d) • [Figure panels: a linear system, and a nonlinear system close to steady state] • Does not work for nonlinear systems that are not close to steady state • The scale-free property does not hold for small networks

  20. Bayesian Networks • A DAG G = (V, E), where • Vertex: a random variable • Edge: conditional distribution for a variable, given its parents in G • Markov assumption: for every i, I(Xi; non-descendants(Xi) | PaG(Xi)), e.g., I(X3; X4 | X2) and I(X1; X5 | X3) • Chain rule: P(X1, X2, …, Xn) = Πi P(Xi | PaG(Xi)) • Example DAG (X1, X2 roots; X3 with parents X1, X2; X4 with parent X2; X5 with parent X3): P(X1, X2, X3, X4, X5) = P(X1) P(X2) P(X3 | X1, X2) P(X4 | X2) P(X5 | X3) • Learning: argmaxG P(G | D), where P(G | D) = P(D | G) · P(G) / C
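A small Python sketch of this factorization on the example DAG; the conditional probability tables below are invented purely for illustration:

```python
# Roots: X1, X2. Parents: X3 <- (X1, X2), X4 <- X2, X5 <- X3.
p_x1 = {True: 0.6, False: 0.4}
p_x2 = {True: 0.3, False: 0.7}
p_x3 = {(True, True): 0.9, (True, False): 0.5,
        (False, True): 0.4, (False, False): 0.1}   # P(X3=1 | X1, X2)
p_x4 = {True: 0.8, False: 0.2}                      # P(X4=1 | X2)
p_x5 = {True: 0.7, False: 0.05}                     # P(X5=1 | X3)

def joint(x1, x2, x3, x4, x5):
    """P(X1..X5) = P(X1) P(X2) P(X3|X1,X2) P(X4|X2) P(X5|X3)."""
    p = p_x1[x1] * p_x2[x2]
    p *= p_x3[(x1, x2)] if x3 else 1 - p_x3[(x1, x2)]
    p *= p_x4[x2] if x4 else 1 - p_x4[x2]
    p *= p_x5[x3] if x5 else 1 - p_x5[x3]
    return p

print(joint(True, False, True, False, True))  # one cell of the joint
```

The point of the factorization is economy: five small tables replace a full joint over 2^5 configurations.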

  21. Bayesian Networks (Cont’d) • Equivalence classes of Bayesian networks • Same topology, different edge directions • Cannot be distinguished from observation alone • E.g., A→C→B, A←C→B, and A←C←B all encode I(A, B | C); the class is summarized by a PDAG • Causality • A Bayesian network does not directly imply causality • Causality can be inferred from observation under certain assumptions: • no hidden common cause (e.g., no hidden variable C with A←C→B) • ……

  22. Bayesian Networks for Gene Expression • Deals with noisy data well; reflects the stochastic nature of gene expression • Gives an indication of causality • Practical issues: • Learning is NP-hard • Over-fitting • Equivalence classes of graphs • Solutions: • Heuristic search, sparse candidate algorithm • Model averaging • Learning partial models • [Figure: an example network over Genes A-E; the local model θ(D | E) can be multinomial or linear] • Other variables can be added, such as promoter sequences, experimental conditions, and time.

  23. Learning Bayesian Nets • Find G to maximize Score(G | D), where • Score(G | D) = Σi Score(Xi, PaG(Xi) | D) • Hill-climbing (see the skeleton below) • Edge addition, edge removal, edge reversal • Divide-and-conquer • Solve for sub-graphs • Sparse candidate algorithm • Limit the number of candidate parents for each variable (biological implication: a sparse graph) • Iteratively modify the candidate set
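A hedged skeleton of score-based hill-climbing over DAG structures: the local moves are exactly the three listed above, with an acyclicity check before each move. The score here is a toy stand-in (agreement with a hidden target edge set); a real learner would use a decomposable data score such as BIC:

```python
def is_acyclic(n, edges):
    """Kahn-style check: can all n nodes be topologically ordered?"""
    indeg = [0] * n
    for u, v in edges:
        indeg[v] += 1
    stack = [v for v in range(n) if indeg[v] == 0]
    seen = 0
    while stack:
        u = stack.pop()
        seen += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    stack.append(b)
    return seen == n

def neighbors(n, edges):
    """All graphs one edge-addition, -removal, or -reversal away."""
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            if (u, v) in edges:
                yield edges - {(u, v)}                  # removal
                yield (edges - {(u, v)}) | {(v, u)}     # reversal
            else:
                yield edges | {(u, v)}                  # addition

TRUE_EDGES = frozenset({(0, 2), (1, 2), (1, 3), (2, 4)})  # toy target DAG
score = lambda e: -len(e ^ TRUE_EDGES)  # toy stand-in for a data score

n, current = 5, frozenset()
while True:
    candidates = [e for e in neighbors(n, current) if is_acyclic(n, e)]
    best = max(candidates, key=score)
    if score(best) <= score(current):
        break                            # local optimum reached
    current = best
print(sorted(current))                   # recovers the toy target DAG
```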

  24. Partial Models (Features) • Model averaging • Learn many models; common sub-graphs are more likely to be true • Confidence measure: # of times a sub-graph appears • Method: bootstrap (see the sketch below) • Markov relations • A is in B’s Markov blanket iff A and B participate in some joint biological interaction (directly linked as A→B or A←B, or co-parents of some C) • Order relations • A appears before B (as an ancestor) => A is a cause of B
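A minimal sketch of the bootstrap confidence measure: resample the experiments with replacement, relearn a network on each resample, and report how often each feature recurs. The learn_network function below is a hypothetical placeholder (a naive correlation threshold), not the actual Bayesian learner:

```python
import numpy as np

def learn_network(data):
    """Hypothetical stand-in learner: returns a set of edges (i, j)."""
    corr = np.corrcoef(data)
    return {(i, j) for i in range(len(corr)) for j in range(len(corr))
            if i != j and abs(corr[i, j]) > 0.7}

rng = np.random.default_rng(2)
data = rng.normal(size=(6, 40))                 # 6 genes x 40 experiments
data[1] = data[0] + 0.1 * rng.normal(size=40)   # gene 1 tracks gene 0

counts = {}
B = 200                                          # 200-fold bootstrap, as in slide 25
for _ in range(B):
    cols = rng.integers(0, data.shape[1], data.shape[1])
    for edge in learn_network(data[:, cols]):    # relearn on the resample
        counts[edge] = counts.get(edge, 0) + 1

confident = {e: c / B for e, c in counts.items() if c / B > 0.8}
print(confident)   # only the gene0-gene1 relation should survive
```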

  25. Experimental Results • Real biological data set: yeast cell cycle data • 800 genes, 76 experiments, 200-fold bootstrap • Features tested: Markov relations and order relations • Test for significance and robustness • More high-scoring features in real data than in randomized data • Order relations are more robust than Markov relations with respect to the local probability models • Friedman et al, J Comput Biol. 2000;7:601-20

  26. Transcriptional regulatory network • Who regulates whom? • When? • Where? • How? • [Figure: promoters of genes g1-g4 bound by TFs A and B, implementing different logic for RNA-Pol recruitment: A and not B, A or B, not (A and B), A and B] • PNAS 2003;100(9):5136-41

  27. Data-driven vs. model-driven methods • Data-driven: cluster the gene × condition expression matrix, then run motif finding (MF) on each cluster; descriptive • Model-driven: learn a model, then post-process it into biological insights; explanatory and predictive • A model: “A description of a process that could have generated the observed data”

  28. Data-driven approaches • Pipeline: cluster the genes × experiments matrix (hierarchical, K-means, …), then find motifs within each cluster (MEME, Gibbs sampler, AlignACE, …) • Assumption • Co-expressed genes are likely co-regulated: not necessarily true • Limitations: • Clustering is subjective • Statistically over-represented but non-functional “junk” motifs • Hard to find combinatorial motifs

  29. Model-based approaches • Intuition: find motifs that are not only statistically over-represented, but are also associated with the expression patterns • E.g., a motif appears in many up-regulated genes but very few other genes => real motif? • Model: gene expression = f (TF binding motifs, TF activities) • Goal: find the function that • Can explain the observed data and predict future data • Captures true relationships among motifs, TFs and expression of genes

  30. Transcription modeling • Assume that gene expression levels under a certain condition are a function of some TF binding motifs on the genes’ promoters: e = f (m1, m2, m3, m4) • [Figure: the promoters of genes g1-g8 annotated with motif occurrences (the variables m1-m4), and each gene’s expression label e]

  31. Different modeling approaches • Many different models, each with its own limitations • Classification models • Decision tree, support vector machine (SVM), naïve Bayes, … • Regression models • Linear regression, regression tree, … • Probabilistic models • Bayesian networks, probabilistic Boolean networks, …

  32. Decision tree • A tree for e = f (m1, m2, m3, m4): internal nodes test motifs (yes/no branches), and leaves assign genes to expression labels • [Figure: a tree splitting on m1, then m4 and m2, routing genes g1-g8 to leaves A-D, e.g., genes 1, 2, 5 in one leaf] • The tree structure is learned from data • Only relevant variables (motifs) are used • Many trees are possible; the smallest one is preferred (see the sketch after this list) • Advantages: • Easy to interpret • Can represent complex logic relationships
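A hedged scikit-learn sketch of this setting: rows are genes, columns are presence/absence of four motifs, and the label is the gene's expression class. The motif matrix and labels are invented for illustration; here the label happens to equal m1, so the learned tree is a single split, illustrating the preference for small trees:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# rows g1..g8; columns m1..m4 (1 = motif present in the promoter)
motifs = np.array([[1, 0, 0, 1],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1],
                   [0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 0, 0, 1],
                   [0, 1, 1, 0],
                   [1, 1, 1, 1]])
expressed = np.array([1, 1, 0, 0, 1, 0, 0, 1])  # e: up-regulated (1) or not (0)

tree = DecisionTreeClassifier(max_depth=3).fit(motifs, expressed)
print(export_text(tree, feature_names=['m1', 'm2', 'm3', 'm4']))
```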

  33. A real example: transcriptional regulation of yeast stress response • 52 genes up-regulated in heat shock (positive) • 156 random unresponsive genes (negative) • 356 known motifs • [Figure: the learned tree splits on RRPE, PAC, FHL1, and RAP1, with nearly pure leaves such as 151 (-) and 23 (+)] • Small tree: only 4 motifs used • All 4 motifs are well known to be stress-related • The RRPE-PAC combination is well known

  34. Application to yeast cell-cycle genes • [Figure: the network recovered by our method alongside the model network in Science, 2002;298(5594):799-804] • Ruan et al., BMC Genomics, 2009

  35. Regression tree • Similar to a decision tree • Difference: each terminal node predicts a range of real values instead of a class label • [Figure: a tree for e = f (m1, m2, m3, m4) over genes g1-g8, splitting on m1, m4, and m2, with leaves holding expression ranges such as 0 < e < 2 or e ≥ 2]

  36. Multivariate regression tree • Multivariate labels: use multiple experiments simultaneously • Use motifs to classify genes into co-expressed groups • Does not need clustering in advance • [Figure: a tree splitting on m1, m4, and m2 that groups genes g1-g8 by their expression profiles across experiments e1-e5] • Phuong, T., et al., Bioinformatics, 2004 • (A brief sketch follows)
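A brief sketch of the multivariate case: scikit-learn's DecisionTreeRegressor accepts multi-output targets natively, so one tree can map motif presence to a whole expression profile across several experiments. All numbers are invented for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
motifs = rng.integers(0, 2, size=(8, 4))   # genes g1..g8 x motifs m1..m4
profiles = rng.normal(size=(8, 5))         # expression over e1..e5 per gene

tree = DecisionTreeRegressor(max_depth=2).fit(motifs, profiles)
print(tree.predict(motifs[:2]))            # predicted 5-value profiles
```

Each leaf holds a group of genes with similar profiles, which is why no separate clustering step is needed.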

  37. Modeling with TF activities • Gene expression = f (binding motifs, TF activities) • For a target gene g: g = f (tf1, tf2, tf3, tf4), e.g., a tree that first tests whether tf1 ≤ 0 or tf1 > 0 • [Figure: the expression matrix over experiments e1-e5 is rotated so that the TF expression levels tf1-tf4 become predictor variables for g] • Soinov et al., Genome Biol, 2003

  38. A Decision Tree Model • A decision tree model of gene expression, using both gene and experiment attributes • Segal et al. Nat Genet. 2003,34(2):166-76.

  39. Algorithm BDTree • Gene expression = f (binding motifs, TF activities) • Ruan & Zhang, Bioinformatics 2006 • Basic idea: • Iteratively partition an expression matrix by splitting genes or experiments • Genes are split according to motif scores • Conditions are split according to TF expression levels • The algorithm decides the best motifs or TFs to use (see the sketch after this list)
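A hedged sketch of BDTree's core step (my own simplification, not the published implementation): given an expression block, try every gene-side split (by a binary motif) and every experiment-side split (by thresholding a TF's expression at 0), and keep whichever split most reduces the within-block variance:

```python
import numpy as np

def best_split(expr, motifs, tf_expr):
    """expr: genes x experiments; motifs: genes x motifs (0/1);
    tf_expr: TFs x experiments. Returns the best (axis, index, gain)."""
    base = expr.var() * expr.size            # total variance of the block
    best = (None, None, 0.0)
    for j in range(motifs.shape[1]):         # split genes by motif j
        has = motifs[:, j] == 1
        if 0 < has.sum() < len(has):
            gain = base - (expr[has].var() * expr[has].size
                           + expr[~has].var() * expr[~has].size)
            if gain > best[2]:
                best = ('motif', j, gain)
    for t in range(tf_expr.shape[0]):        # split experiments by TF t
        up = tf_expr[t] > 0
        if 0 < up.sum() < len(up):
            gain = base - (expr[:, up].var() * expr[:, up].size
                           + expr[:, ~up].var() * expr[:, ~up].size)
            if gain > best[2]:
                best = ('tf', t, gain)
    return best

rng = np.random.default_rng(4)
expr = rng.normal(size=(20, 12))
motifs = rng.integers(0, 2, size=(20, 6))
tf_expr = rng.normal(size=(5, 12))
print(best_split(expr, motifs, tf_expr))     # ('motif' or 'tf', index, gain)
```

Applying best_split recursively to each resulting sub-block yields the kind of partitioned expression matrix shown on slide 41.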

  40. Transcriptional regulation of yeast stress response • 173 experiments under ~20 stress conditions • 1411 differentially expressed genes • ~1200 putative binding motifs • Combination of ChIP-chip data, PWMs, and over-represented k-mers (k = 5, 6, 7) • 466 TFs

  41. [Figure: the partitioned genes × experiments expression matrix, with two blocks highlighted] • Genes with motif FHL1 but no RRPE are down-regulated when Ppt1 is down-regulated and Yfl052w is up-regulated • Genes with motifs RRPE & PAC are down-regulated when TFs Tpk1 & Kin82 are up-regulated

  42. Biological validation • Most motifs and TFs selected by the tree are well-known to be stress-related • E.g., motifs RRPE, PAC, FHL1, TFs Tpk1 and Ppt1 • 42 / 50 blocks are significantly enriched with some Gene Ontology (GO) functional terms • 45 / 50 blocks are significantly enriched with some experimental conditions

  43. [Figure: GO enrichment of the tree’s blocks] • RRPE & PAC: ribosome biogenesis (60/94, p < 1e-65) • RRPE only: ribosome biogenesis (28/99, p < 1e-18) • FHL1: protein biosynthesis (98/105, p < 1e-87) • STRE (agggg): carbohydrate metabolism (p < 1e-20) • PAC: nitrogen metabolism

  44. Relationship between methods • A, C: from promoter (motifs m1-m4) to expression • A: single condition • C: multiple conditions • B, D: from expression to expression • B: single gene • D: multiple genes • [Figure: a genes (g1-g8) × conditions (c1-c5) expression matrix plus motif variables m1-m4 and TF variables t1-t4, with arrows A-D marking the four mappings]
