
CSCE555 Bioinformatics

CSCE555 Bioinformatics. Lecture 23 Integrative Genomics II Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555. University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu.


Presentation Transcript


  1. CSCE555 Bioinformatics • Lecture 23 Integrative Genomics II Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555 University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu.

  2. Outline: Integrative Genomics II • Integrative Microarray Analysis • The Data Sources • Graph mining of co-expression relationships • Codense • Second order relationship analysis

  3. Genome sizes by year: 1995 Bacteria (~1.6K genes) • 1997 Eukaryote (~6K genes) • 1998 Animal (~20K genes) • 2001 Human (30~100K genes) • Objective: $1000 human genome. Information flow in cellular systems: DNA → RNA → Protein. Gene Expression Datasets (the Transcriptome): • Oligonucleotide Array • cDNA Array. Protein-level data: • Protein abundance measurement (Mass Spec) • Protein interactions (yeast 2-hybrid system, protein arrays) • Protein complexes (Mass Spec)

  4. Rapid accumulation of microarray data in public repositories • NCBI Gene Expression Omnibus: 137,231 experiments • EBI ArrayExpress: 55,228 experiments. The amount of public microarray data grows about threefold per year.

  5. Multiple Microarray Technology Platforms

  6. Graph-based Approach for the Integrative Microarray Analysis [Figure: Datasets (gene × experiments) • Coexpression Networks • Recurrent Patterns • Annotation: Functional Annotation (Gene Ontology) and Transcriptional Annotation (TF)]

  7. Frequent Subgraph Mining: the problem is hard! Problem formulation: given n graphs, identify subgraphs which occur in at least m graphs (m ≤ n). Our graphs are massive (>10,000 nodes and >1 million edges). The traditional pattern growth approach (expanding a frequent subgraph of k edges to k+1 edges) would not work, since the time and memory requirements increase exponentially with increasing pattern size and increasing number of networks.
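
The problem formulation can be made concrete with a minimal sketch. It assumes every node carries a unique label (as genes do in a co-expression network), so a pattern "occurs" in a graph when all of its edges are present there. The names `edge`, `support`, and `is_frequent` are hypothetical helpers, not part of any published implementation.

```python
from typing import FrozenSet, Iterable, List, Set

Edge = FrozenSet[str]  # undirected edge between two uniquely labeled nodes


def edge(u: str, v: str) -> Edge:
    """Build an undirected edge; (u, v) and (v, u) compare equal."""
    return frozenset((u, v))


def support(pattern: Iterable[Edge], graphs: List[Set[Edge]]) -> int:
    """Number of graphs that contain every edge of the pattern."""
    wanted = set(pattern)
    return sum(1 for g in graphs if wanted <= g)


def is_frequent(pattern: Iterable[Edge], graphs: List[Set[Edge]], m: int) -> bool:
    """A pattern is frequent when it occurs in at least m of the n graphs."""
    return support(pattern, graphs) >= m


# Toy input: n = 3 small graphs (assumed data, for illustration only).
graphs = [
    {edge("a", "b"), edge("b", "c"), edge("a", "c")},
    {edge("a", "b"), edge("b", "c")},
    {edge("a", "b"), edge("c", "d")},
]
triangle = [edge("a", "b"), edge("b", "c"), edge("a", "c")]
path = [edge("a", "b"), edge("b", "c")]

print(support(triangle, graphs), support(path, graphs))
```

Checking one pattern is cheap; the difficulty is the number of candidates. A graph with E edges has 2^E edge subsets, which is why enumerating and growing patterns edge by edge breaks down at the scale quoted on the slide.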

  8. Novel Algorithms to identify diverse frequent network patterns • CoDense (Hu et al. ISMB 2005): identify frequent coherent dense subgraphs across many massive graphs • Network Biclustering (Huang et al. ISMB 2007): identify frequent subgraphs across many massive graphs • Network Modules (NeMo) (Yan et al. ISMB 2007): identify frequent dense vertex sets across many massive graphs

  9. CODENSE: identify frequent coherent dense subgraphs across massive graphs

  10. Identify frequent co-expression clusters across multiple microarray data sets [Figure: several expression matrices (genes g1, g2, … as rows; conditions c1, c2, …, cm as columns), each shown with its co-expression network over genes a to k]
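
The slide does not say how each expression matrix is turned into a network. A minimal sketch, assuming the common choice of linking two genes when the Pearson correlation of their expression profiles reaches a threshold, could look like this. The function names and the 0.8 threshold are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # constant profile: correlation undefined
        return 0.0
    return sxy / sqrt(sxx * syy)


def coexpression_edges(expr, threshold=0.8):
    """Link gene pairs whose profiles correlate at or above the threshold.

    expr maps a gene name to its expression values across conditions.
    The threshold value is an assumption, not taken from the slides.
    """
    edges = set()
    for g1, g2 in combinations(sorted(expr), 2):
        if pearson(expr[g1], expr[g2]) >= threshold:
            edges.add((g1, g2))
    return edges


# Toy expression matrix: 3 genes x 4 conditions (assumed values).
expr = {
    "g1": [0.1, 0.2, 0.9, 0.8],
    "g2": [0.2, 0.3, 1.0, 0.9],
    "g3": [0.9, 0.1, 0.5, 0.2],
}
print(coexpression_edges(expr))
```

Running this once per data set yields one network per data set, which is the input that the frequent-pattern methods on the following slides expect.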

  11. The solution: CODENSE, a novel algorithm to mine frequent coherent dense subgraphs. The target subgraphs have three characteristics: • All edges occur in >= k graphs (frequency) • All edges should exhibit correlated occurrences in the given graph set (coherency) • The subgraph is dense: its density d is higher than a given threshold, where d = 2m/(n(n-1)), m: #edges, n: #nodes (density)
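
The density criterion is easy to check numerically. The sketch below simply evaluates d = 2m/(n(n-1)) from the slide; the function name `density` and the convention of returning 0 for graphs with fewer than two nodes are assumptions.

```python
def density(num_edges: int, num_nodes: int) -> float:
    """d = 2m / (n(n-1)): the fraction of possible node pairs that are linked."""
    if num_nodes < 2:
        return 0.0  # assumed convention: no node pairs, so no density
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


# A complete graph on 4 nodes has 6 edges, so every pair is linked.
print(density(6, 4))
# A path on 4 nodes has 3 edges, so half of the 6 possible pairs are linked.
print(density(3, 4))
```

A density of 1 corresponds to a clique, so the threshold controls how far from a clique an accepted subgraph may be.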

  12. CODENSE: Mine coherent dense subgraph (1) Builds a summary graph by eliminating infrequent edges [Figure: six graphs G1 to G6 over nodes a to i are combined into a summary graph Ĝ]
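
Step 1 can be sketched in a few lines: count in how many graphs each edge occurs and keep the edges that reach the frequency threshold k. The representation of an edge as a sorted tuple of node labels and the name `summary_graph` are assumptions.

```python
from collections import Counter


def summary_graph(graphs, k):
    """Keep every edge that occurs in at least k of the input graphs.

    Each graph is a set of edges; an edge is a tuple of two node labels
    in sorted order, so it is counted at most once per graph.
    """
    counts = Counter(e for g in graphs for e in g)
    return {e for e, c in counts.items() if c >= k}


# Toy input (assumed data): edge c-e occurs 3 times, c-f twice, the rest once.
graphs = [
    {("a", "b"), ("c", "e")},
    {("c", "e"), ("c", "f")},
    {("c", "e"), ("c", "f"), ("e", "f")},
]
print(sorted(summary_graph(graphs, k=2)))
```

The summary graph is a single graph regardless of how many networks went in, which is what makes the later dense-subgraph search tractable.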

  13. CODENSE: Mine coherent dense subgraph (2) Identify dense subgraphs of the summary graph. Step 2 (MODES): summary graph Ĝ → Sub(Ĝ). Observation: if a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the converse is not true.
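
Why the converse fails can be shown with a small counterexample. In the sketch below (toy data and helper names are assumptions), each edge of a triangle is frequent on its own, so the summary graph contains a perfectly dense triangle, yet no single input graph contains that triangle.

```python
from collections import Counter


def summary_graph(graphs, k):
    """Edges that occur in at least k graphs."""
    counts = Counter(e for g in graphs for e in g)
    return {e for e, c in counts.items() if c >= k}


def density(edges):
    """d = 2m / (n(n-1)) for the graph induced by the given edge set."""
    nodes = {v for e in edges for v in e}
    n, m = len(nodes), len(edges)
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def support(pattern, graphs):
    """Number of graphs that contain all edges of the pattern."""
    return sum(1 for g in graphs if set(pattern) <= g)


# Each triangle edge appears in two graphs, but never together with the others.
graphs = [
    {("a", "b")}, {("a", "b")},
    {("b", "c")}, {("b", "c")},
    {("a", "c")}, {("a", "c")},
]
summary = summary_graph(graphs, k=2)
print(density(summary), support(summary, graphs))
```

The triangle is dense in the summary graph but has support 0 as a whole, which is the gap the coherency test in the later steps is meant to close.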

  14. CODENSE: Mine coherent dense subgraph (3) Construct the edge occurrence profiles for each dense summary subgraph. Step 3: Sub(Ĝ) (nodes c, e, f, g, h, i) → edge occurrence profiles:

      E     G1  G2  G3  G4  G5  G6
      c-e    0   0   1   1   0   1
      c-f    0   1   0   1   1   1
      c-h    0   0   0   1   1   1
      c-i    0   0   1   1   1   0
      e-f    0   0   0   1   1   1
      …      …   …   …   …   …   …
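
Step 3 amounts to recording, for each edge of a dense summary subgraph, a 0/1 vector over the input graphs. In the sketch below the six toy graphs are reconstructed from the five rows listed on this slide (the elided rows are left out); the function name and the "u-v" string encoding of edges are assumptions.

```python
def edge_occurrence_profiles(sub_edges, graphs):
    """Map each edge to a 0/1 vector: does it occur in G1, G2, ..., Gn?"""
    return {e: [1 if e in g else 0 for g in graphs] for e in sub_edges}


# G1..G6 restricted to the five edges listed on the slide.
graphs = [
    set(),                                   # G1
    {"c-f"},                                 # G2
    {"c-e", "c-i"},                          # G3
    {"c-e", "c-f", "c-h", "c-i", "e-f"},     # G4
    {"c-f", "c-h", "c-i", "e-f"},            # G5
    {"c-e", "c-f", "c-h", "e-f"},            # G6
]
sub_edges = ["c-e", "c-f", "c-h", "c-i", "e-f"]

profiles = edge_occurrence_profiles(sub_edges, graphs)
for name in sub_edges:
    print(name, profiles[name])
```

Each row of the output corresponds to one row of the table on the slide.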

  15. CODENSE: Mine coherent dense subgraph (4) Builds a second-order graph for each dense summary subgraph. Step 4: edge occurrence profiles → second-order graph S:

      E     G1  G2  G3  G4  G5  G6
      c-e    0   0   1   1   1   1
      c-f    0   1   0   1   1   1
      c-h    0   0   0   1   1   1
      c-i    0   0   1   1   1   0
      e-f    0   0   0   1   1   1
      …      …   …   …   …   …   …

      [Figure: second-order graph S with vertices g-h, f-i, e-i, h-i, g-i, e-g, e-h, c-h, f-h, c-f, e-f, c-i, c-e]
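
In the second-order graph, every edge of the dense summary subgraph becomes a vertex, and two vertices are linked when their occurrence profiles are similar. The slide does not name the similarity measure, so the sketch below assumes Pearson correlation with a hypothetical cutoff of 0.7, applied to the five profiles listed on this slide.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def second_order_graph(profiles, min_corr=0.7):
    """Link two first-order edges when their occurrence profiles correlate.

    Vertices of the result are first-order edges (keys of `profiles`).
    The measure and the cutoff are assumptions made for this sketch.
    """
    return {
        (e1, e2)
        for e1, e2 in combinations(sorted(profiles), 2)
        if pearson(profiles[e1], profiles[e2]) >= min_corr
    }


profiles = {
    "c-e": [0, 0, 1, 1, 1, 1],
    "c-f": [0, 1, 0, 1, 1, 1],
    "c-h": [0, 0, 0, 1, 1, 1],
    "c-i": [0, 0, 1, 1, 1, 0],
    "e-f": [0, 0, 0, 1, 1, 1],
}
S = second_order_graph(profiles)
print(sorted(S))
```

With these assumed settings, c-h and e-f (identical profiles) are linked, while c-f and c-i (uncorrelated profiles) are not.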

  16. CODENSE: Mine coherent dense subgraph (5) Identify dense subgraphs of the second-order graph. Step 4: second-order graph S → Sub(S). [Figure: S has vertices g-h, f-i, e-i, h-i, g-i, e-g, e-h, c-h, f-h, c-f, e-f, c-i, c-e; Sub(S) keeps g-h, e-i, h-i, g-i, e-g, e-h, c-h, f-h, c-f, e-f, c-e] Observation: if a subgraph is coherent (its edges show high correlation in their occurrences across a graph set), then its 2nd-order graph must be dense.

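  17. CODENSE: Mine coherent dense subgraph (6) Identify the coherent dense subgraphs. Step 5: Sub(S) → Sub(G). [Figure: Sub(S) with vertices g-h, e-i, h-i, g-i, e-g, e-h, c-h, f-h, c-f, e-f, c-e is mapped back to Sub(G), drawn as two subgraphs on nodes {e, g, h, i} and {c, e, f, h}]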
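
The mapping back to first-order subgraphs can be sketched as follows. Each vertex of Sub(S) is read as an edge of the original graph, using the vertex labels from the figure on this slide. The slides use MODES for the final dense-subgraph search; this sketch swaps in plain maximal-clique enumeration (Bron-Kerbosch without pivoting) as a simpler stand-in, which only finds subgraphs of density 1.

```python
def restore_first_order(second_order_vertices):
    """Read each second-order vertex 'u-v' as the first-order edge (u, v)."""
    adj = {}
    for name in second_order_vertices:
        u, v = name.split("-")
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques (stand-in for MODES, not MODES itself)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


sub_s = ["g-h", "e-i", "h-i", "g-i", "e-g", "e-h",
         "c-h", "f-h", "c-f", "e-f", "c-e"]
adj = restore_first_order(sub_s)
for clique in maximal_cliques(adj):
    print(sorted(clique))
```

On these labels the stand-in recovers the two node sets drawn in the figure, {e, g, h, i} and {c, e, f, h}, which share the edge e-h.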

  18. CODENSE Solution • The identified subgraphs by definition satisfy the three criteria: • All edges occur in >= k graphs (frequency) • All edges should exhibit correlated occurrences in the given graph set (coherency) • The subgraph is dense: its density d is higher than a given threshold, where d = 2m/(n(n-1)), m: #edges, n: #nodes (density)

  19. CODENSE: Mine coherent dense subgraph [Figure: overview of all steps. Step 1: graphs G1 to G6 → summary graph Ĝ • Step 2 (MODES, Add/Cut): Ĝ → Sub(Ĝ) • Step 3: Sub(Ĝ) → edge occurrence profiles • Step 4: edge occurrence profiles → second-order graph S • Step 5 (MODES): S → Sub(S) • Step 6 (Restore G and MODES): Sub(S) → Sub(G)]

  20. CODENSE: the design of CODENSE addresses the scalability issue. Instead of mining each biological network individually, CODENSE compresses the networks into two meta-graphs (the summary graph and the second-order graph) and performs clustering on these two graphs only. Thus, CODENSE can handle a very large number of networks.

  21. Applying CoDense to 39 yeast microarray data sets [Figure: several expression matrices (genes g1, g2, … as rows; conditions c1, c2, …, cm as columns), each shown with co-expression networks over genes a to k]

  22. Functional annotation

  23. Functional Annotation (Validation) Method: a leave-one-out approach, masking a known gene as unknown and assigning its function based on the other genes in the subgraph pattern. Functional categories: 166 functional categories at GO level 6 or deeper. Results: 448 predictions with an accuracy of 50%.
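
The leave-one-out procedure can be sketched as below. The slide only says the masked gene's function is assigned "based on the other genes in the subgraph pattern"; this sketch assumes a simple majority vote and one function label per gene. Gene names, labels, and function names are hypothetical.

```python
from collections import Counter


def predict_function(gene, cluster, annotations):
    """Majority vote over the annotated genes in the cluster, excluding `gene`."""
    votes = Counter(
        annotations[g] for g in cluster if g != gene and g in annotations
    )
    if not votes:
        return None  # no annotated neighbours, so no prediction
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(clusters, annotations):
    """Mask each annotated gene in turn and check the predicted function."""
    correct = total = 0
    for cluster in clusters:
        for gene in cluster:
            if gene not in annotations:
                continue
            predicted = predict_function(gene, cluster, annotations)
            if predicted is None:
                continue
            total += 1
            correct += predicted == annotations[gene]
    return correct / total if total else 0.0


# Toy data (assumed): one subgraph pattern with four annotated genes.
clusters = [["g1", "g2", "g3", "g4"]]
annotations = {"g1": "A", "g2": "A", "g3": "A", "g4": "B"}
print(leave_one_out_accuracy(clusters, annotations))
```

In the toy cluster the three genes labeled A are predicted correctly and the single gene labeled B is not, giving an accuracy of 0.75.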

  24. Functional Annotation (Prediction) We made functional predictions for 169 genes, covering a wide range of functional categories, e.g. amino acid biosynthesis, ATP biosynthesis, ribosome biogenesis, vitamin biosynthesis, etc. A significant number of our predictions can be supported by literature.

  25. Functional annotation: We made functional predictions for 779 known and 116 unknown genes by random forest classification with 71% accuracy. Variables for random forest classification: • functional enrichment P-value • network topology score • network connectivity • pattern recurrence numbers • average node degree • unknown gene ratio • network size

  26. Reconstruct transcriptional cascades by second-order analysis Zhou et al. Nature Biotech 2005

  27. Frequently occurring tight clusters

  28. Transcription Factors Frequently occurring tight clusters

  29. Co-occurrence of tight clusters Coexpression network constructed with the dataset 1

  30. Co-occurrence of tight clusters Coexpression network constructed with the dataset 2

  31. Co-occurrence of tight clusters Coexpression network constructed with the dataset 3

  32. Co-occurrence of tight clusters Coexpression network constructed with the dataset 4

  33. Co-occurrence of tight clusters Coexpression network constructed with the dataset 5

  34. Cooperativity [Figure: Transcription Factor Set 1 • Transcription Factor Set 2 • Coexpression Networks]

  35. Relevance Networks

  36. Three types of transcription cascades [Figure: Type I: TF1, TF2, TF3 and genes 1 to 5 • Type II: TF1, TF2 and genes 1 to 5 • Type III: TF1, TF2 and genes 1 to 5. Legend: Transcription Regulation • Protein Interaction]

  37. Applying to 39 yeast microarray data sets • We identified 60 transcription modules. Among them, we found 34 pairs that showed high 2nd-order correlation. A significant portion (29%, p-value < 10^-5 by Monte Carlo simulation) of those module pairs are participants in transcription cascades: 2 pairs in Type I, 8 pairs in Type II, and 3 pairs in Type III cascades. In fact, these transcription cascades inter-connect into a partial cellular regulatory network.
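
The slides do not define 2nd-order correlation explicitly. A minimal sketch, assuming that a module's first-order profile is its average pairwise expression correlation in each data set and that the 2nd-order correlation of two modules is the Pearson correlation between their first-order profiles, might look like this. All names and toy values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def first_order_profile(datasets, module):
    """Average pairwise correlation of the module's genes in each data set."""
    profile = []
    for expr in datasets:
        pairs = list(combinations(module, 2))
        profile.append(sum(pearson(expr[a], expr[b]) for a, b in pairs) / len(pairs))
    return profile


def second_order_correlation(profile_a, profile_b):
    """Correlation between two first-order profiles across data sets."""
    return pearson(profile_a, profile_b)


# Two toy data sets: genes x and y are co-expressed in the first only.
datasets = [
    {"x": [1, 2, 3], "y": [2, 4, 6]},
    {"x": [1, 2, 3], "y": [3, 2, 1]},
]
print(first_order_profile(datasets, ["x", "y"]))

# Assumed first-order profiles of three modules over five data sets.
module_a = [0.9, 0.2, 0.8, 0.1, 0.7]
module_b = [0.8, 0.1, 0.9, 0.2, 0.6]
module_c = [0.1, 0.9, 0.2, 0.8, 0.3]
print(second_order_correlation(module_a, module_b) > 0.9)
print(second_order_correlation(module_a, module_c) < -0.9)
```

Under these assumptions, modules a and b are switched on in the same data sets and score a high 2nd-order correlation, while modules a and c are active in complementary data sets and score a strongly negative one.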

  38. [Figure: partial regulatory network of transcription factors and transcription modules. Node labels, in extraction order: HGH1 LOC1 NOC3 YDR152W YHL013C BUD19 RPL13A RPL13B RPL16A RPL24B RPL28 RPL2B RPL33B RPL39 RPL42B RPS16B RPS18A RPS21B RPS22A RPS23B RPS4B RPS6A BAT1 ILV2 LEU9 YBL113C YRF1-1 YRF1-7 RCS1 RGM1 MET10 MET14 MET28 SUL2 Leu3 GAL4 MET4 PDR1 ALD5 ARG1 ARG2 ARG3 ARG4 ARG5,6 ARO1 ARO3 ARO4 ASN1 BNA1 CPA2 DED81 ECM40 HIS1 HIS4 HOM3 LEU4 LEU9 LYS20 MET22 ORT1 PCL5 TEA1 TRP2 YBR043C YDR341C YHM1 YHR162W YJL200C YJR111C YBL113C YRF1-1 YRF1-3 YRF1-7 GCN4 YAP5 YBL113C YIL177C YJL225C YLR464W YML133C YRF1-1 YRF1-3 YRF1-4 YRF1-5 YRF1-6 YRF1-7 SWI6 GAT3 SWI5 MBP1 CDC45, RAD27, RNR1, SPT21, YPL267W MSN4 SWI4 NDD1 YEL077C YRF1-3 YRF1-7 CDC21 CDC45 CLB5 CLB6 CLN1 GIN4 IRR1 MCD1 MSH6 PDS5 RAD27 RNR1 SPO16 SPT21 SWE1 TOF1 YBR070C ALK1 BUD4 CDC20 CDC5 CLB2 HST3 SWI5 YIL158W YJL051W YOR315W CLB6 RNR1 SPT21 YPL267W GAS1 HTA1 HTB1] Legend: • Regulation of transcription modules by transcription factors, based on ChIP-chip data and supported by recurrent expression clusters • Regulation between transcription factors, based on ChIP-chip data and supported by 2nd-order expression correlation • Protein interactions between two transcription factors, based on experimental data and supported by 2nd-order expression correlation • Two transcription modules with high 2nd-order expression correlations

  39. Integrative Array Analyzer (iArray): a software package for cross-platform and cross-species microarray analysis Bioinformatics. 22(13):1665-7

  40. Acknowledgement: These slides are adapted from the following presentation: Xianghong Jasmine Zhou, Molecular and Computational Biology, University of Southern California. Functions, Networks, and Phenotypes by Integrative Genomics Analysis. IMS, NUS, Singapore, July 18, 2007
