1 / 42

V11: Structure Learning in Bayesian Networks

V11: Structure Learning in Bayesian Networks. In lectures V9 and V10 we made the strong assumption that we know in advance the network structure . Today, we consider the task of learning in situations where we do not know the structure of the BN in advance .

brigid
Download Presentation

V11: Structure Learning in Bayesian Networks

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. V11: Structure Learning in Bayesian Networks In lectures V9 and V10 wemadethe strong assumption thatweknow in advancethenetworkstructure. Today, weconsiderthetaskoflearning in situations wherewe do not knowthestructureofthe BN in advance. Instead, we will applythe strong assumptionthatourdatasetisfullyobserved. As before, weassumethatthedata D aregenerated IID from an underlyingdistribution P*(X). Weassumethat P* isinducedbysomeBayesiannetwork G* over X. Mathematics of Biological Networks

  2. Example: cointoss Consider an experimentwherewetoss 2 standardcoins X and Y independently. Wearegiven a datasetwith 100 instancesofthisexperiment andwould like tolearn a modelforthisscenario. A „typical“ datasetmayhavetheentriesfor X and Y: 27 head/head 22 head/tail 25 tail/head 26 tail/tail In thisempiricaldistribution, the 2 coinsareapparently not fullyindependent. (If X = tail, thechancefor Y = tailappearshigherthanif X = head). This isquiteexpectedbecausetheprobabilityoftossing 100 pairsof fair coins andgettingexactly 25 outcomes in eachcategoryisonlyabout 1/1000. Thus, evenifthecoinsareindeedindepdendent, we do not expectthat theobservedempiricaldistribution will satisfyindependence. Mathematics of Biological Networks

  3. Example: cointoss Supposewegetthe same results in a very different situation. Say wescanthesportssectionofourlocalnewspaperfor 100 days andchoose an article at randomeachday. Wemark X = x1iftheword „rain“ appears in thearticleandX = x0otherwise. Similarly, Y denoteswhethertheword „football“ appears in thearticle. Ourintuitionwhetherthetworandom variables areindependentisunclear. Ifwegetthe same empiricalcountsas in thecoinsdescribedbefore, wemightsuspectthatthereissomeweakconnection. In otherwords, itishardtobesurewhetherthetrueunderlying modelhas an edgebetween X and Y or not. Mathematics of Biological Networks

  4. Goals ofthelearningprocess The goaloflearning G* fromthedataishardtoachievebecause • thedatasampledfrom P* aretypicallynoisyand • do not reconstruct P* perfectly. Wecannotdetectwithcompletereliabilitywhich independenciesarepresent in theunderlyingdistribution. Therefore, we must generallymake a decisionaboutourwillingness toinclude in ourlearnedmodeledgesaboutwhichwearelesssure. Ifweincludemoreoftheseedges, we will oftenlearn a modelthatcontainsspuriousedges. Ifweincludefewedges, wemaymiss independencies. The decisionwhichcompromiseisbetterdepends on theapplication. Mathematics of Biological Networks

  5. Independence in cointossexample? Itseemsthatifwe do makemistakes in thestructure, itisbettertohave toomanyratherthantoofewedges. But thesituationismorecomplex. Let‘sgo back tothecoinexampleandassumethatwehadonly 20 casesfor X/Y: 3 head/head 6 head/tail 5 tail/head 6 tail/tail Whenweassume X and Y ascorrelated, we find by MLE (wherewesimplycounttheobservednumbers): P(X = H) = 9/20 = 0.45 P(Y = H | X = H) = 3/9 = 1/3 P(Y = H | X = T) = 5/11 In theindependentstructure (noedgebetween X and Y), P(Y = H) = 8/20 = 0.4 Mathematics of Biological Networks

  6. Noisydata→ prefersparsemodels All oftheseparameterestimatesareimperfect, ofcourse. The ones in themorecomplexmodelaresignificantlymorelikelytobeskewed (dt: verzogen)becauseeachoneisestimatedfrom a muchsmallerdataset. E.g. P(Y = H | X = H) isestimatedfromonly 9 instances comparedto20 instancesfortheestimationof P(Y = H). Note thatthestandardestimationerrorof MLE is. Whendoingdensityestimationfrom limited data, itisoftenbettertoprefer a sparserstructure. Mathematics of Biological Networks

  7. Overviewofstructurelearningmethods Roughlyspeaking, thereare 3 approachestolearning without a prespecifiedstructure: • constraint-basedstructurelearning Finds a modelthatbestexplainsthedependencies/independencies in thedata. (2) Score-basedstucturelearning(today) Wedefine a hypothesisspaceof potential modelsand a scoringfunction thatmeasureshowwellthemodelfitstheobserveddata. Ourcomputationaltaskisthento find thehighest-scoringnetwork. (3) Bayesianmodelaveragingmethods Generates an ensembleofpossiblestructures. Mathematics of Biological Networks

  8. StructureScores We will discuss 2 obviouschoicesofscoringfunctions: • Maximum likelihoodparameters (today) • Bayesianscores (V12) Maximum likelihoodparameters This functionmeasurestheprobabilityofthedatagiven a model. → tryto find a modelthatwouldmakethedataas probable aspossible. A modelis a pair G, G. Ourgoalisto find both a graph G andparametersG thatmaximizethelikelihood. Mathematics of Biological Networks

  9. Maximum likelihoodparameters In V9 wedeterminedhowtomaximizethelikelihoodfor a givenstructure G. We will simplyusethesemaximumlikelihoodparametersforeach graph. To find themaximumlikelihood(G, G) pair, weshould find thegraphstructure G thatachievesthehighestlikelihoodwhenweusethe MLE parametersfor G. Wedefine whereisthelogarithmofthelikelihoodfunction andarethemaximumlikelihoodparametersfor G. Mathematics of Biological Networks

  10. Maximum likelihoodparameters Let usconsideragainthescenarioofthe 2 coins. In model G0, X and Y areassumedtobeindependent. In thiscase, weget In model G1, weassume a dependencymodelledbythearc X → Y. Weget whereisthemaximumlikelihoodestimatefor P(x) andisthemaximumlikelihoodestimatefor P(y | x). The scoresofthetwomodelsshare a commoncomponent, thefirstterm. Mathematics of Biological Networks

  11. Maximum likelihoodparameters Thus, wecanwritethedifferencebetweenthetwoscoresas: Bycountinghowmanytimeseachconditionalprobabilityparameterappears in thisterm, wecanchangethesummationindexandwritethisas: Letbetheempiricaldistributionobserved in thedata; thatisistheempiricalfrequencyofx,yin D. Thenwecanwriteand Moreover, and Mathematics of Biological Networks

  12. Mutual information We get whereisthemutual informationbetween X and Y in thedistribution. The likelihoodofmodel G1thusdepends on the mutual informationbetween X and Y. Note thathigher mutual informationimplies strongerdependency. Thus, strongerdependencyimpliesstrongerpreference forthemodelwhere X and Y depend on eachother. Can thisbegeneralizedtogeneralnetworkstructures? Mathematics of Biological Networks

  13. Decompositionof Maximum likelihoodscores Proposition: thelikelihood score decomposesasfollows: Proof: omitted The likelihoodof a networkmeasuresthestrength ofthedependenciesbetween variables andtheirparents. Weprefernetworkswheretheparentsofeach variable are informative about it. Mathematics of Biological Networks

  14. Maximum likelihoodparameters Wecan also express thisresult in a complementarymanner. Corollorary: Let X1, …, Xnbe an orderingofthe variables thatisconsistentwithedges in G. Then The firstterm on theright-hand-sidedoes not depend on thestructure, but thesecondtermdoes. The secondterminvolvesconditional mutual-information of variable Xiandthepreceding variables giventheparentsofXi. Mathematics of Biological Networks

  15. Limitationsofthemaximumlikelihood score The likelihood score is a goodmeasure ofthe fit oftheestimated BN andthetrainingdata. Important, however, istheperformanceofthelearnednetwork on newinstances. Itturns out thatthe MLE score neverprefers simpler networksovermorecomplexnetworks. The optimal ML network will alwaysbe a fullyconnectednetwork. Thus, the ML overfitsthetrainingdata. The modeloftendoes not generalizewelltonewdatacases. Mathematics of Biological Networks

  16. Structuresearch The inputtotheoptimizationproblemis: - trainingset D - scoringfunction (includingpriors, ifneeded) - a set G ofpossiblenetworkstructures Ourdesiredoutput: - a networkstructure (fromthesetofpossiblestructures) thatmaximizesthe score. Weassumethatwecandecomposethe score ofthenetworkstructure G: asthesumoffamilyscores. FamScore(X | U : D) is a score measuringhowwell a setof variables U servesasparentsof X in thedataset D. Mathematics of Biological Networks

  17. Tree-structurenetworks Structurelearningoftree-structurenetworksisthesimplestcase. Definition: in tree-structurednetworkstructuresG, each variable X has at mostoneparent in G. The advantageoftreesisthattheycanbelearnedefficiently in polynomial time. Learning a treemodelisoftenusedas a startingpoint forlearningmorecomplexstructures. Key propertiestobeused: - decomposabilityofthe score - restriction on thenumberofparents Mathematics of Biological Networks

  18. Score oftreestructure Instead ofmaximizingthe score of a treestructure G, we will trytomaximizethedifferencebetweenitsscore andthe score oftheemptystructure G0. issimply a sumofterrmsforeachXi. Thatisthe score ofXiifitdoes not haveanyparents. The score consistsofterms. Nowthereare 2 cases: If, thenthetermsforXiin bothscorescancel out. Mathematics of Biological Networks

  19. Structuresearch If weareleftwiththedifferencebetweenthe 2 terms. Ifwedefinetheweight thenweseethat (G) isthesumofweights on pairs, such that in G Wehavethustransformedourproblemtooneoffinding a maximumweightspanningforestin a directedweighted graph. Mathematics of Biological Networks

  20. General case For a generalmodeltopologythatmay also involvecycles, westartbyconsideringthesearchspace. Wecanthinkofthesearchspaceas a graph overcandidatesolutions(different networktopologies). Nodes areconnectedbyoperatorsthattransformonenetworkintotheotherone. Ifeachstatehasfewneighbors, thesearchprocedureonly hastoconsiderfewsolutions at eachpointofthesearch. However, thesearchfor an optimal (or high-scoring) solution maybelongandcomplex. On theotherhand, ifeachstatehasmanyneighbors, thesearchmayinvolveonly a fewsteps, but itmaybedifficulttodecide at eachstepwhichpointtotake. Mathematics of Biological Networks

  21. General case A good trade-off forthisproblemchoosesreasonablyfewneighborsforeachstate but ensuresthatthe „diameter“ ofthesearchspaceremainssmall. A naturalchoicefortheneighborsof a stateis a setofstructures thatareidenticaltoitexceptforsmall „local“ modifications. We will considerthefollowingsetofmodificationsofthissort: • Edge addition • Edge deletion • Edge reversal The diameterofthesearchspaceisthen at mostn2. (Wecouldfirstdelete all edges in G1that do not appear in G2andthenaddtheedgesof G2thatare not in G1. This isboundedbythe total numberofpossibleedgesn2). Mathematics of Biological Networks

  22. Examplerequiringedgedeletion • Original modelthatgeneratedthedata • and (c) Intermediate networksencounteredduringthesearch. Let‘sassumethat A ishighly informative about B and C. Whenstartingfrom an emptynetwork, edges A → B and A → C wouldbeaddedfirst. Sometimes, wemay also addtheedge A → D. Since A is informative about B and C (whicharetheparentsof D), A is also informative about D -> (b) Later, wemay also addtheedges B → D and C → D-> (c) Node A cannotprovide additional information on top ofwhat B and C convery. Deleting A → D thereforemakesthe score optimal. Mathematics of Biological Networks

  23. Examplerequiringedgereversal • Original network, (b) and (c) intermediate networksduringsearch. (d) Undesirableoutcome. Whenaddingedges, we do not knowtheirdirectionbecausebothdirectionsgivethe same score. This iswhereedgereversalhelps. In situation (c) werealizethat A and B togethershouldmakethebestpredictionof C. Therefore, weneedtoreversethedirectionofthearcpointing at A. Mathematics of Biological Networks

  24. Who controls thedevelopmental status of a cell? We will continuewithstructurelearning in V12 Mathematics of Biological Networks

  25. Transcription http://www.berkeley.edu/news/features/1999/12/09_nogales.html a Mathematics of Biological Networks

  26. Preferred transcription factor binding motifs DNA-binding domain of a glucocorticoid receptor from Rattus norvegicus with matching DNA fragment. www.wikipedia.de Chen et al., Cell 133, 1106-1117 (2008) Mathematics of Biological Networks

  27. Gene regulatory network around Oc4 controls pluripotency Tightly interwoven network of 9 transcription factors keeps ES cells in pluripotent state. 6632 human genes have binding site in their promoter region for at least one of these 9 TFs. Many genes have multiple motifs. 800 genes bind ≥ 4 TFs. Kim et al. Cell 132, 1049 (2008) Mathematics of Biological Networks

  28. Complex of transcription factors Oct4 and Sox2 Idea: Check forconserved transcriptionfactorbindingsites in mouseand human www.rcsb.org Mathematics of Biological Networks

  29. Combined binding of Oct4, Sox2 and Nanog The combination of OCT4, SOX2 and NANOG influences conservation of binding events. (A) Bars indicate the fraction of loci where binding of Nanog, Sox2, Oct4 or CTCF can be observed at the orthologous locus in mouse ES cells for all combinations of OCT4, SOX2 and NANOG in human ES cells as indicated by the boxes below. Dark boxes indicate binding, white boxes indicate no binding (‘‘AND’’ relation). Combinatorial binding of OCT4, SOX2 and NANOG shows the largest fraction of conserved binding for Oct4, Sox2 and Nanog in mouse. Göke et al., PLoSComputBiol 7, e1002304 (2011) SS 2014 - lecture 11 Mathematics of Biological Networks

  30. Increased Binding conservation in ES cells at developmental enhancers Fraction of loci where binding of Nanog, Sox2, Oct4 and CTCF can be observed at the orthologous locus in mouse ESC. Combinations of OCT4, SOX2 and NANOG in human ES cells are discriminated by developmental activity as indicated by the boxes below. Dark boxes : ‘‘AND’’ relation, light grey boxes with ‘‘v’’ : ‘‘OR’’ relation, ‘ ‘?’’ : no restriction. Combinatorial binding events at develop-mentally active enhancers show the highest levels of binding conservation between mouse and human ES cells. Göke et al., PLoS Comput Biol 7, e1002304 (2011) SS 2014 - lecture 11 Mathematics of Biological Networks

  31. Transcriptional activation looping factors Mediator DNA-looping enables interactions for the distal promotor regions, Mediator cofactor-complex serves as a huge linker Mathematics of Biological Networks

  32. coactivators corepressor TFs cis-regulatory modules TFs are not dedicated activators or respressors! It‘s the assembly that is crucial. IFN-enhanceosome from RCSB Protein Data Bank, 2010 Mathematics of Biological Networks

  33. Aim: identify Protein complexes involving transcription factors Borrow idea from ClusterOne method: Identify candidates of TF complexes in protein-protein interaction graph by optimizing the cohesiveness Thorsten Will, Master thesis (accepted for ECCB 2014) Mathematics of Biological Networks

  34. DDI model “domain-level“ network transition to domain-domain interactions domain interactions can reveal which proteins can bind simultaneously if interactions per domain are restricted “protein-level“ network Ozawa et al., BMC Bioinformatics, 2010 Ma et al., BBAPAP, 2012 SS 2014 - lecture 11 Mathematics of Biological Networks

  35. underlying domain-domain representation of PPIs Assumption: every domain can only participate in one interaction. Green proteins A, C, E form actual complex. Their red domains are connected by the two green edges. B and D are incident proteins. They could form new interactions (red edges) with unused domains (blue) of A, C, E Mathematics of Biological Networks

  36. data source used: Yeast Promoter Atlas, PPI and DDI Mathematics of Biological Networks

  37. Daco identifies far more TF complexes than other methods Mathematics of Biological Networks

  38. Examples of TF complexes – comparison with ClusterONE Green nodes: proteins in the reference that were matched by the prediction red nodes: proteins that are in the predicted complex, but not part of the reference. Mathematics of Biological Networks

  39. Performance evaluation Mathematics of Biological Networks

  40. Are target genes of TF complexes co-expressed? Mathematics of Biological Networks

  41. Functional role of TF complexes Mathematics of Biological Networks

  42. Complicated regulation of Oct4 Kellner, Kikyo, Histol Histopathol 25, 405 (2010) Mathematics of Biological Networks

More Related