Achim Tresch Computational Biology

‘Omics’ - Analysis of High-Dimensional Data

Different Kinds of Networks

Different Kinds of Network Models
• A model that explains the data merely finds associations. E.g.: Genetic Epidemiology (predict stroke risk from SNPs)
• A model that explains the mechanism generates hypotheses. E.g.: Physics, Systems Biology (predict the signal flow through a cascade of signalling molecules)

Network Reconstruction
Model types: qualitative, semiquantitative, quantitative
Which model can be inferred reliably from the data?
Which model has a biological interpretation?

The Bias/Variance Tradeoff
“Any method (or statistician) that takes a complex multivariate dataset and, from it, claims to identify one true model, is both naive and misleading.” David Edwards. Introduction to Graphical Modelling. Springer, 2000
[Figure taken from: C. Bishop. Pattern Recognition and Machine Learning (2006). Labels: Bias, Stability, Underfitting; Variance, Flexibility, Overfitting]
Aim: Good predictions for yet unknown data
The more complex the model . . .
• the better it fits the data (unbiasedness),
• the higher the variance.
The simpler the model . . .
• the less expressive, the more constrained (biased),
• the more stable it is.
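A minimal sketch of the tradeoff, not taken from the slides: polynomials of increasing degree are fitted to a small noisy sample. The sine curve, the noise level, the sample sizes and the degrees are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_sine(n):
    # Hypothetical data source: a sine curve plus Gaussian noise.
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y

x_train, y_train = noisy_sine(15)    # small training set
x_test, y_test = noisy_sine(200)     # "yet unknown" data

def mse(model, x, y):
    # Mean squared prediction error of a fitted polynomial.
    return float(np.mean((model(x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    # Least-squares fit; a higher degree means a more flexible model.
    model = Polynomial.fit(x_train, y_train, degree)
    errors[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(degree, errors[degree])
```

The training error can only shrink as the degree grows, because the models are nested. The error on new data usually stops improving and starts to rise once the model fits noise.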

Correlation Graphs
Assumption: Correlation ≈ Coregulation
Genes that show coordinated expression across many conditions are regulated by a common mechanism.
Method ("Relevance Networks"): Calculate the correlation coefficient r(i,j) for all gene pairs (i,j). Draw an edge whenever r(i,j) > cutoff value.
+ Estimation of correlation coefficients is easy, fast and stable
- No natural cutoff value for edges, meaning of edges unclear
Cancer Cell Line Data. Atul J. Butte et al., PNAS (2000)
Yeast Cell Cycle Data. Spellman, Mol. Biol. Cell (2001)
Multi-Species Comparison. Stuart et al., Science (2003)
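The relevance network rule can be sketched in a few lines. The function name, the simulated expression data and the cutoff 0.8 are assumptions for illustration; the rule itself (edge whenever r(i,j) > cutoff) is the one stated on the slide.

```python
import numpy as np

def relevance_network(expression, cutoff):
    """expression: genes x conditions array.
    Returns all gene pairs (i, j), i < j, with correlation r(i, j) > cutoff."""
    r = np.corrcoef(expression)          # pairwise Pearson correlations of the rows
    n = r.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n) if r[i, j] > cutoff]

# Simulated data: gene 1 follows gene 0 closely, gene 2 is independent.
rng = np.random.default_rng(1)
g0 = rng.normal(size=200)
g1 = g0 + 0.1 * rng.normal(size=200)
g2 = rng.normal(size=200)

edges = relevance_network(np.vstack([g0, g1, g2]), cutoff=0.8)
print(edges)
```

Thresholding the absolute value of r instead would also pick up anti-correlated pairs; that is a variation, not something the slide states.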

Correlation Graphs
Possible reasons for the correlation of three genes (x,y,z):
• It is impossible to distinguish direct from indirect dependence
Conclusion: A strong correlation is not strong evidence for regulatory dependence (lots of false positives). On the other hand, a low correlation is strong evidence for the absence of a regulatory edge (but that does not help much).
Remedies:
• search for correlations which cannot be explained by other variables
• generate interventional data, not only observational data (e.g. measure the effects of gene perturbations)

Partial Correlations
Income and foot size are correlated! … but gender explains this fact.
Correlation >> 0, Partial Correlation ≈ 0
[Figure: graph with nodes Foot size, Income, Gender]
The partial correlation of (x,y) with respect to z is essentially calculated as the correlation of (x,y) after linear correction for z.
- Estimation of partial correlations is difficult, slow and unstable
+ A positive partial correlation is strong evidence for regulatory dependence
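The "linear correction" can be sketched as follows: regress x and y on z, then correlate the residuals. The data are simulated under the assumption that gender shifts both foot size and income and that nothing else links the two; all numbers are made up.

```python
import numpy as np

def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z from both."""
    design = np.column_stack([np.ones_like(z), z])            # intercept + z
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

# Simulated example: gender (0/1) drives both variables.
rng = np.random.default_rng(2)
gender = rng.integers(0, 2, size=2000).astype(float)
foot_size = 5.0 * gender + rng.normal(size=2000)
income = 5.0 * gender + rng.normal(size=2000)

plain = float(np.corrcoef(foot_size, income)[0, 1])
partial = partial_correlation(foot_size, income, gender)
print(plain, partial)
```

In this simulation the plain correlation is large, while the partial correlation given gender is close to zero.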

Gaussian Graphical Models
Assumption: Partial Correlation ≈ Direct Interaction
Genes whose correlated expression cannot be explained by other variables interact with each other.
Method ("Gaussian Graphical Models"): Calculate the partial correlation coefficient p(i,j) for all gene pairs (i,j). Draw an edge whenever p(i,j) > cutoff value.
E. coli Microarray Data. Schäfer, Strimmer, SAGeMB (2005)
Meinshausen, Bühlmann. Ann. Statist. (2005)
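A sketch of the method for all pairs at once, assuming more samples than variables: the partial correlation of (i,j) given all other variables can be read off the inverse covariance matrix Ω as -Ω_ij / sqrt(Ω_ii Ω_jj). The function names, the simulated chain x → y → z and the cutoff 0.3 are assumptions for illustration.

```python
import numpy as np

def partial_correlation_matrix(data):
    """data: samples x variables. Partial correlations given all other variables."""
    precision = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def ggm_edges(data, cutoff):
    # Draw an edge whenever p(i, j) > cutoff, as on the slide.
    pcor = partial_correlation_matrix(data)
    n = pcor.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n) if pcor[i, j] > cutoff]

# Simulated chain x -> y -> z: x and z are correlated only through y.
rng = np.random.default_rng(3)
x = rng.normal(size=5000)
y = x + 0.5 * rng.normal(size=5000)
z = y + 0.5 * rng.normal(size=5000)
data = np.column_stack([x, y, z])

edges = ggm_edges(data, cutoff=0.3)
print(edges)
```

The indirect pair (x,z) is strongly correlated but gets no edge. With fewer samples than genes the sample covariance cannot be inverted, which is one reason the estimation is harder than for plain correlations.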

Bayesian Networks
Assumption: Edges in the model ≈ Causal Relations
Children depend on their parents!
Method ("Causal Graphical Models", "Bayesian Networks"): The distribution of a variable ("child") depends only on the values of its "parents".
Example:
C = "Baby is happy" (yes/no)
P1 = "Mum is around" (yes/no)
P2 = "Dad is around" (yes/no)
[Figure: P1 → C ← P2]
Local distribution: P(C | P1, P2). These values have to be estimated from the data! ("parameter identification")
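Parameter identification for the baby example can be sketched by counting. The observations are invented and the function name is hypothetical; the estimate is the plain relative frequency of "happy" for each parent configuration.

```python
from collections import Counter

def estimate_cpt(records):
    """records: iterable of (p1, p2, c) booleans.
    Returns {(p1, p2): estimated P(C = yes | P1 = p1, P2 = p2)}."""
    totals = Counter()
    happy = Counter()
    for p1, p2, c in records:
        totals[(p1, p2)] += 1
        if c:
            happy[(p1, p2)] += 1
    return {parents: happy[parents] / totals[parents] for parents in totals}

# Hypothetical observations: (mum around, dad around, baby happy)
records = [
    (True, True, True), (True, True, True), (True, True, True), (True, True, False),
    (True, False, True), (True, False, False),
    (False, True, True), (False, True, False), (False, True, False), (False, True, False),
    (False, False, False), (False, False, False),
]

cpt = estimate_cpt(records)
for parents, p_happy in sorted(cpt.items()):
    print(parents, p_happy)
```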

Structure Learning in BNs
Construction of larger networks: Global distribution = Product of local distributions
[Figure: network on x1, …, x6 with local distributions P(x1), P(x2), P(x3), P(x4 | x1), P(x5 | x1,x2), P(x6 | x3,x4,x5)]
P(x1,x2,…,x6) = P(x1) ∙ P(x2) ∙ P(x3) ∙ P(x4 | x1) ∙ P(x5 | x1,x2) ∙ P(x6 | x3,x4,x5)
+ Causal relations can be read off directly from the model (for most edges, Pearl "Causality" (2000))
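The factorisation can be checked numerically. In this sketch the parent sets are those of the six-node example; the variables are assumed to be binary and the local probabilities are random placeholders, not estimates from any data.

```python
from itertools import product
import random

# Parent sets of the example network: x4 | x1, x5 | x1,x2, x6 | x3,x4,x5
parents = {1: (), 2: (), 3: (), 4: (1,), 5: (1, 2), 6: (3, 4, 5)}

random.seed(0)
# Hypothetical local distributions: P(node = 1 | parent configuration)
cpt = {
    node: {cfg: random.random() for cfg in product((0, 1), repeat=len(pa))}
    for node, pa in parents.items()
}

def joint_probability(state):
    """state: dict node -> 0/1. Product of the six local distributions."""
    prob = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(state[p] for p in pa)]
        prob *= p_one if state[node] == 1 else 1.0 - p_one
    return prob

# Summing the product over all 2^6 states must give 1.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product((0, 1), repeat=6)
)
print(total)
```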

Structure Learning in BNs
For n nodes, there are about 2^(n·n/2) different BN structures (combinatorial explosion).
How to search the space of all possible networks? Markov Chain Monte Carlo (MCMC) techniques, greedy hill climbing
"MCMC Moves" from a graph Θ to some neighbours of Θ:
• delete an edge
• reverse an edge (if allowed)
• add an edge (if allowed)
[Figure: graph Θ on nodes x, y, z and some of its neighbours]
- Model space is vast (high variance!) and model identification is almost impossible
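A sketch of the three moves, reading "if allowed" as "the result must stay acyclic". The function names and the starting graph x → y → z are assumptions; the graph in the slide's figure is not recoverable.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Repeatedly strip nodes without incoming edges; a cycle leaves a remainder."""
    remaining = set(nodes)
    while remaining:
        sources = {
            v for v in remaining
            if not any(b == v and a in remaining for a, b in edges)
        }
        if not sources:
            return False
        remaining -= sources
    return True

def neighbours(nodes, edges):
    """All graphs reachable from `edges` by one delete, reverse or add move."""
    edges = frozenset(edges)
    result = []
    for a, b in edges:
        without = edges - {(a, b)}
        result.append(("delete", without))           # deleting never creates a cycle
        reversed_graph = without | {(b, a)}
        if is_acyclic(nodes, reversed_graph):
            result.append(("reverse", reversed_graph))
    for a, b in permutations(nodes, 2):
        if (a, b) not in edges and (b, a) not in edges:
            added = edges | {(a, b)}
            if is_acyclic(nodes, added):
                result.append(("add", added))
    return result

nodes = ["x", "y", "z"]
start = {("x", "y"), ("y", "z")}                     # hypothetical graph x -> y -> z
moves = neighbours(nodes, start)
for kind, graph in moves:
    print(kind, sorted(graph))
```

A greedy hill climber would score every neighbour and move to the best one; an MCMC sampler would pick one neighbour at random and accept or reject it.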

Application: Stroke Prediction
The best scoring model is almost definitely NOT the true model -> Most causal relations inferred from the best scoring model are not real. Dirk Husmeier, Bioinformatics (2003)
Data: 108 SNPs in each sickle cell patient (having an increased risk of stroke)
Training: 1398 individuals, 92 of them with stroke (about 94% without stroke).
Validation: 114 new individuals, 7 of them with stroke (about 94% without stroke).
Classification accuracy from BN: 98%. "Stupid" predictor: 94%
Stroke Network from SNP data. Sebastiani et al. Nature Genetics (2005)
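The 94% of the "stupid" predictor can be reproduced by simple arithmetic, under the assumption that the 7 validation individuals are the stroke cases and that the predictor always answers "no stroke".

```python
n_validation = 114
n_stroke = 7            # assumption: the slide's "7" counts the stroke cases

# A predictor that always says "no stroke" is right for everyone without stroke.
baseline_accuracy = (n_validation - n_stroke) / n_validation
print(f"baseline accuracy: {baseline_accuracy:.1%}")

# The reported 98% accuracy of the BN corresponds to roughly this many errors.
bn_errors = round((1 - 0.98) * n_validation)
print("errors: baseline", n_stroke, "vs. BN about", bn_errors)
```

Seen this way, the network improves on the trivial baseline by only a handful of validation individuals.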

Summary (1)
• Relevance Networks are easy to calculate, but difficult to interpret.
• Gaussian Graphical Models are more difficult to obtain, but more reliable.
• Bayesian Networks are computationally very hard, but they have a causal interpretation (if the model is "sufficiently unique").
• Bayesian Networks and Gaussian Graphical Models tend to outperform Relevance Networks. Dirk Husmeier, Bioinformatics (2006)
… but essentially, none of the above methods works (personal experience)

Ways out of the Dilemma
• Search Space Reduction: Not all graph structures in a Bayesian Network are a priori equally likely. This helps to reduce the search space considerably:
  • Some edges are already known to exist, others cannot exist (biological knowledge)
  • Biological networks tend to have few edges (sparsity/parsimony)
  • Some variables must lie "downstream" of others (hierarchical structure)
• Interventional Data: Bayesian Networks are able to model interventions of the observed system in a natural way. Interventions substantially facilitate the inference of the directionality of interactions.

A Model of Regulation
[Figure: two panels, "This is what we observe" and "This is the true model". Labels: Signals, signal graph Γ; Observables, effects graph Θ]
Both models explain the observations perfectly. What makes the right model (biologically) more plausible?
Signal transmission is expensive! Find a consistent model with a most parsimonious effects graph.

Nested Effects Models
NEMs (Markowetz, Bioinformatics 2005)
• Signals: signal graph, adjacency matrix Γ (with 1's in the diagonal)
• Effects: effects graph, adjacency matrix Θ
• Parsimony assumption: Each observable is linked to exactly one action
[Figure: signals, effects and predicted effects]
Properties of Nested Effects Models:
• Reduce Search Space
  • Biological networks tend to have few edges (sparsity/parsimony)
  • Some edges are already known to exist, others cannot exist (biological knowledge)
  • Some variables must lie "downstream" of others (hierarchical organisation)
• Use Interventional Data
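A small sketch of how the two adjacency matrices combine into predicted effects, using the product ΓΘ that a later slide names as the ideal measurements. The three-signal chain and the four observables are hypothetical, and the orientation (row = perturbed signal) is an assumption.

```python
import numpy as np

# Hypothetical signal graph s1 -> s2 -> s3, including the implied edge s1 -> s3.
# gamma[i, j] = 1 means: perturbing signal i also disturbs signal j.
gamma = np.array([
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
])

# Hypothetical effects graph: theta[j, a] = 1 means observable a is attached to signal j.
theta = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
])

# Parsimony assumption: every observable is attached to exactly one signal.
assert (theta.sum(axis=0) == 1).all()

# predicted[i, a] = 1 means: perturbing signal i changes observable a.
predicted = gamma @ theta
print(predicted)
```

The rows are nested: the effects of perturbing s3 are a subset of those of s2, which are a subset of those of s1. This nesting gives the model its name.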

Nested Effects Models
[Figure: signals s, effects a; R_{s,a} = effect of signal s on effect a; predicted effects compared with measured effects = R]
More favorable properties of Nested Effects Models:
• Very fast calculation of likelihoods (scoring of a model)
• Model identifiability can be proved (Tresch and Markowetz, SAGeMB 2008)
• NEMs are indeed a special instance of Bayesian networks (Zeller, EURASIP J. Bioinf. 2008)
• Powerful search heuristics can be applied (Expectation-Maximization (EM) algorithm, Niederberger, PLoS Comp Biol 2011)
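Why scoring is cheap can be illustrated with a simple error model. This sketch assumes binary measurements with a false positive rate alpha and a false negative rate beta; the rates, the function name and the example graphs are all assumptions, and this is not necessarily the scoring used in the cited papers.

```python
import numpy as np

def log_likelihood(gamma, theta, measured, alpha=0.1, beta=0.2):
    """Log-likelihood of binary measurements given a model (gamma, theta).
    alpha: assumed false positive rate, beta: assumed false negative rate."""
    predicted = (gamma @ theta) > 0
    measured = measured.astype(bool)
    prob = np.where(
        predicted,
        np.where(measured, 1 - beta, beta),      # an effect was predicted
        np.where(measured, alpha, 1 - alpha),    # no effect was predicted
    )
    return float(np.log(prob).sum())

# Hypothetical true model: chain s1 -> s2 -> s3, four observables.
gamma_true = np.array([[1, 1, 1], [0, 1, 1], [0, 0, 1]])
theta = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]])
measured = gamma_true @ theta                    # noise-free measurements

gamma_empty = np.eye(3, dtype=int)               # competing model without any edges

score_true = log_likelihood(gamma_true, theta, measured)
score_empty = log_likelihood(gamma_empty, theta, measured)
print(score_true, score_empty)
```

One matrix product and one sum score a model, so many candidate graphs can be compared quickly.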

NEMs - Simulations
R/Bioconductor packages: Nessy, nem
[Figure: true graphs Γ, Θ; ideal measurements (ΓΘ); simulated measurements (R)]

NEMs - Simulations
[Figure: true graph and estimated graph; distribution of the likelihoods]
12 edges, 2^12 = 4096 action graphs, ~4 seconds
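The count of 4096 is small enough for exhaustive enumeration. This sketch assumes that the 12 edges are the 4·3 possible directed edges between 4 signals, which the slide does not state explicitly.

```python
from itertools import product
import numpy as np

n_signals = 4   # assumption: 4 signals give 4 * 3 = 12 possible directed edges
slots = [(i, j) for i in range(n_signals) for j in range(n_signals) if i != j]

def all_action_graphs():
    """Yield every adjacency matrix with 1's on the diagonal."""
    for bits in product((0, 1), repeat=len(slots)):
        gamma = np.eye(n_signals, dtype=int)
        for (i, j), bit in zip(slots, bits):
            gamma[i, j] = bit
        yield gamma

count = sum(1 for _ in all_action_graphs())
print(len(slots), count)
```

An exhaustive search would score each of these graphs and keep the best one. Every additional signal multiplies the number of graphs, which is why larger networks need search heuristics.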

NEMs - Simulations
30 genes assigned, 60 genes unassigned
[Figure: true graph and estimated graph]
Very unstable estimation of the signals graph
Stable estimation of the effects graph

Application: Synthetic Lethality
[Figure: Pathway I and Pathway II, connected by synthetic lethality]
Hypotheses:
• SL between two genes occurs if the genes are located in different pathways
• Genes sharing the same synthetic lethality partners have an increased chance of being located in the same pathway
Ye, Bader et al., Mol. Sys. Biol. 2005, PNAS 2005
Consequence:
• A gene a whose SL partners are a subset of the SL partners of another gene b is likely to be located beneath b in the same pathway.
[Figure: nodes 1, 2, 3, a, b]
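The subset rule translates directly into set operations. The partner sets below are invented and the function name is hypothetical.

```python
def beneath_pairs(partners):
    """partners: dict gene -> set of synthetic lethality partners.
    Returns all (a, b) where the partners of a are a subset of those of b,
    i.e. a is a candidate for lying beneath b in the same pathway."""
    return [
        (a, b)
        for a in partners
        for b in partners
        if a != b and partners[a] <= partners[b]
    ]

# Hypothetical SL partner sets for two genes a and b
partners = {
    "a": {1, 2},
    "b": {1, 2, 3},
}

pairs = beneath_pairs(partners)
print(pairs)
```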

Application: Synthetic Lethality
7 of 10 genes directly linked to DNA repair

Application: Mediator Knockouts
3 mediator knockout experiments: Holstege, Larivière, Koschubs
[Figure: mediator; panels EM, Greedy, Random]

Acknowledgements
Florian Markowetz (Lewis-Sigler Institute, Princeton University)
Tim Beissbarth, Holger Fröhlich (German Cancer Research Center, Heidelberg)
Cordula Zeller (Johannes Gutenberg-University, Mainz)
Theresa Niederberger, Tobias Koschubs, Laurent Larivière, Dietmar Martin (Gene Center, LMU Munich)
