
Network-based analysis of functional genomics data

Network-based analysis of functional genomics data. Jianhua Ruan, PhD Computer Science Department University of Texas at San Antonio http://www.cs.utsa.edu/~jruan. Final project. Final Project Report due Sat, Dec 15 Presentations: Mon, Dec 17, 8-10:30 pm 10 teams to present


Presentation Transcript


  1. Network-based analysis of functional genomics data Jianhua Ruan, PhD Computer Science Department University of Texas at San Antonio http://www.cs.utsa.edu/~jruan

  2. Final project • Final Project Report due Sat, Dec 15 • Presentations: Mon, Dec 17, 8-10:30 pm • 10 teams to present • Each team will have up to 13 minutes. (10 min presentation, 3 min questions) • Since time is limited, you don’t need to cover all the details in your presentation. • Focus on the most important concepts • More details in your project report

  3. Human: ~22,000 genes • Dog: ~20,000 genes • Rice: ~35,000 genes • Mouse: ~22,000 genes • C. elegans: ~20,000 genes • It is not just the genes, but the networks!

  4. Why networks? • For complex systems, the actual output may not be predictable by looking at only individual components: • The whole is greater than the sum of its parts • Studying genes/proteins on the network level allows us to: • Assess the role of individual genes/proteins in the overall pathway • Evaluate redundancy of network components • Identify candidate genes involved in genetic diseases • Set up a framework for mathematical models

  5. Graph model of biological networks • An abstraction of the complex relationships among molecules in the cell • Vertex: molecule • Gene, protein, metabolite, DNA, RNA • Edges: relationships • Physical interaction • Functional association • Share many common statistical properties with real-world networks • Small-world • Scale-free • Hierarchical • Modular (community structure) (Jeong et al., 2001)

  6. Research overview • Topics: community discovery; genetic network reverse-engineering; network-based disease studies • Methods: network analysis algorithms; data mining, tree models, DNA motif finding; data integration, classification, graph algorithms

  7. Agenda • Community discovery in biological networks • Network-based analysis of microarray data • Network-based biomarker discovery for metastatic breast cancer • Conclusions

  8. Network communities • Communities • Are relatively densely connected sub-networks (modules) • Appear in many types of networks • Independently studied in many fields: • Social science, Computer science, Physics, Biology, etc. • Significance • Biological systems are modular • Metabolic pathways • Protein complexes • Transcriptional regulatory modules • Biological systems are large and complex • Communities provide a high-level overview of the networks • Guilt-by-association • Predict gene functions based on community memberships

  9. Community discovery problem • Divide a network into relatively densely connected sub-networks • Similar to clustering, but # of clusters is determined automatically • [Figure: vertex reordering]

  10. Modularity function (Q) • Measures the strength of community structure • Newman, Phys Rev E, 2003 • [Figure labels: e11, e22, e33, e44, e55]
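The slide's formula did not survive extraction. A minimal sketch, assuming the standard Newman-Girvan definition Q = sum over communities i of (e_ii - a_i^2), where e_ii is the fraction of edges inside community i and a_i is the fraction of edge ends attached to it. The function name and the toy graph are illustrative and not taken from the slides.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping vertex -> community label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    intra = defaultdict(int)   # number of edges with both ends in community c
    degree = defaultdict(int)  # number of edge ends attached to community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "a" for v in range(6)}
q_two = modularity(edges, two_groups)  # 5/14, about 0.357
q_one = modularity(edges, one_group)   # 0.0: a single community has no structure
```

Putting every vertex in one community gives Q = 0, which is consistent with the Q = 0 case shown on the next slide.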

  11. [Figure: example partitions with Q = 0.45, Q = 0, Q = 0.40, Q = 0.56, Q = 0.54] • Modularity automatically determines # of communities!

  12. Methods for community discovery • Previous methods • Fast but inaccurate (CNM, Phys Rev E, 2004) • Accurate but slow (Guimera & Amaral, Nature 2005) • Relatively accurate, relatively fast (Ruan & Zhang, AAAI 2006, ICDM 2007) • Relatively accurate, fast, memory intensive (Newman, PNAS 2006) • Our new algorithm: Qcut/HQcut (Ruan & Zhang, Phys Rev E 2008) • Accurate, fast and memory friendly • HQcut solves the resolution limit of Q

  13. Algorithm Qcut • Recursive multi-way partitioning until Q is max (eig + kmeans) • Improve Q by efficient heuristic search • [Figure: accuracy vs. inter-community edge probability, our method vs. Newman’s]

  14. Resolution limit and HQcut • Q is known to have a resolution limit problem • Cannot detect small communities • Q slightly decreases if forced to split • HQcut solves this problem • Apply Qcut to get communities with largest Q • Recursively search for sub-communities within each community without dramatic change to Q • Statistical test as termination criterion • Ruan & Zhang, Physical Review E 2008

  15. Graphical user interface for Qcut/HQcut

  16. Application: protein complex prediction • Input: a yeast PPI network • Data from Krogan et al., Nature 2006;440:637-43 • 2708 vertices (proteins), 7123 interactions • 289 communities • Sizes range from 2 to 49 • Evaluation: compare communities with known complexes manually curated in the MIPS database

  17. Communities in a yeast PPI network • Small ribosomal subunit (90%) • RNA poly II mediator (83%) • Proteasome core (90%) • gamma-tubulin (77%) • Exosome (94%) • respiratory chain complex IV (82%)

  18. Communities vs. complexes • Communities and complexes have good one-to-one correspondence • Overall accuracy: 0.81 • Newman: 0.58 • Agreement between a predicted complex (P) and a known complex (K) = |P∩K| / sqrt(|P| × |K|)
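The agreement score on this slide is simple to compute directly. A minimal sketch, with complexes represented as sets of protein identifiers; the function name and the example identifiers are hypothetical, and the slides do not say how per-pair scores are combined into the overall accuracy, so that step is left out.

```python
import math

def agreement(predicted, known):
    """Agreement |P ∩ K| / sqrt(|P| * |K|) between two complexes given as sets."""
    p, k = set(predicted), set(known)
    if not p or not k:
        return 0.0  # convention chosen here for empty input
    return len(p & k) / math.sqrt(len(p) * len(k))

# Hypothetical protein identifiers, for illustration only.
predicted = {"A", "B", "C", "D"}
known = {"B", "C", "D", "E"}
score = agreement(predicted, known)  # 3 / sqrt(4 * 4) = 0.75
```

The score is 1 only when the two sets are identical and 0 when they share no proteins.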

  19. Work-in-progress: Random walk-based improvement • Motivation: • PPI networks often contain both false positive and false negative edges • Hub genes prevent good partitioning • Three goals: • Eliminate false positive edges • Predict missing links • Reduce the impact of hub genes • Intuition: • Two proteins with high topological similarity, whether or not they are connected, may belong to the same complex • Two proteins with a direct link but very different topological properties may belong to different complexes

  20. Method overview • Original network (adjacency matrix) -> initial prob vectors -> random walk with resistance -> equilibrium prob vectors -> distance calculation -> similarity matrix -> threshold (guided by topology) -> new network
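The slides do not define "random walk with resistance", so the sketch below swaps in a plain random walk with restart as a stand-in and uses cosine similarity between the equilibrium vectors. The restart probability, the similarity measure, the fixed threshold and all function names are assumptions made for illustration, not the authors' method. It also assumes every vertex has at least one neighbour.

```python
import math

def rwr_vectors(adj, restart=0.5, iters=100):
    """Equilibrium visiting probabilities of a random walk with restart,
    started from every vertex in turn. adj: dict vertex -> set of neighbours."""
    nodes = sorted(adj)
    index = {v: i for i, v in enumerate(nodes)}
    vectors = {}
    for source in nodes:
        p = [0.0] * len(nodes)
        p[index[source]] = 1.0
        for _ in range(iters):
            nxt = [0.0] * len(nodes)
            for u in nodes:
                # Each vertex passes its non-restarting mass equally to neighbours.
                share = (1 - restart) * p[index[u]] / len(adj[u])
                for w in adj[u]:
                    nxt[index[w]] += share
            nxt[index[source]] += restart  # restarting mass returns to the source
            p = nxt
        vectors[source] = p
    return vectors

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def rebuild_network(vectors, threshold):
    """Connect two vertices when their equilibrium vectors are similar enough."""
    nodes = sorted(vectors)
    return {(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]
            if cosine(vectors[u], vectors[v]) >= threshold}

# Toy network: two triangles joined by the bridge edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
vectors = rwr_vectors(adj)
new_edges = rebuild_network(vectors, threshold=0.2)
```

The rebuilt network can both add links between topologically similar vertices and drop existing links between dissimilar ones, which matches the goals listed on the previous slide.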

  21. Preliminary results on yeast PPI • Predicted PPIs have much higher functional relevance than removed PPIs, using several sources of evidence • Gene Ontology • Gene expression • etc. • New network significantly improved accuracy of protein complex predictions • Using HQcut: 0.50 to 0.55 • Using MCL: 0.48 to 0.59

  22. Agenda • Community structure in biological networks • Prediction of protein complexes • Network-based microarray data analysis • Network-based biomarker discovery for metastatic breast cancer • Conclusions

  23. Microarray data analysis • Gene network structure is unknown • Microarray measures gene expression (activity) level • Clustering is the most common analysis tool • Many clustering algorithms available • K-means • Hierarchical • Self organizing maps • Parameter (e.g., k) hard to guess • Does not consider network structure • Clustering genes (expression matrix: genes × conditions) • Common functions? • Common regulation? • Predict functions for unknown genes?

  24. Network-based microarray data analysis • Construct a co-expression network from the expression matrix (genes × samples) • Genes i and j are connected if their expression patterns are “sufficiently similar” • Pearson correlation coefficient > arbitrary threshold • K nearest neighbors (KNN) • Key: how to get the “best” network?
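Both construction rules on this slide can be sketched in a few lines. The code below assumes that "mKNN", used on later slides, means a mutual k-nearest-neighbour graph, where an edge is kept only if each gene is among the other's k most correlated genes. The gene names, expression values and function names are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / math.sqrt(sxx * syy)

def threshold_network(expr, cutoff):
    """Connect genes whose correlation exceeds a fixed cutoff."""
    return {(g, h) for g, h in combinations(sorted(expr), 2)
            if pearson(expr[g], expr[h]) > cutoff}

def mutual_knn_network(expr, k):
    """Connect genes that are in each other's k most correlated genes."""
    genes = sorted(expr)
    top = {}
    for g in genes:
        ranked = sorted((h for h in genes if h != g),
                        key=lambda h: pearson(expr[g], expr[h]), reverse=True)
        top[g] = set(ranked[:k])
    return {(g, h) for g, h in combinations(genes, 2)
            if h in top[g] and g in top[h]}

# Hypothetical expression profiles over four samples.
expr = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [1, 3, 2, 4],
}
by_threshold = threshold_network(expr, cutoff=0.9)
by_mknn = mutual_knn_network(expr, k=1)
```

Either rule still leaves one free parameter (the cutoff or k), which is the "best network" question the slide raises.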

  25. Motivation • Can we use the idea of community discovery for clustering microarray data? • Advantages: • Parameter free • Network topology considered • Constructed network may have other interesting applications beyond clustering

  26. Our idea (Ruan, ICDM 2009) • Intuition: the real network is naturally modular • Can be measured by modularity (Q) • If constructed right, should have the highest Q • Pipeline: microarray data -> similarity matrix -> network series (Net_1, most dense, …, Net_m, most sparse) -> Qcut on each network

  27. Our idea (cont’d) • [Figure: modularity vs. network density for the true network and a random network; ∆Q is the difference] • Therefore, use ∆Q to determine the best network parameter and obtain the best community structure
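The selection step can be sketched as picking the parameter that maximizes ∆Q = Q(true network) - Q(random network). The sketch assumes the Q values have already been computed for each candidate network and for a randomized counterpart of the same density. The numbers below are made up purely to illustrate the shape of the curves and are not results from the talk.

```python
def pick_best_parameter(q_real, q_random):
    """Return the parameter with the largest dQ = Q_real - Q_random,
    together with the full dQ table."""
    delta = {k: q_real[k] - q_random[k] for k in q_real}
    best = max(delta, key=delta.get)
    return best, delta

# Hypothetical Q values for networks built with k = 5, 10, 20, 40 neighbours.
# Sparse networks score a high Q even when random, so Q alone is misleading.
q_real = {5: 0.92, 10: 0.85, 20: 0.70, 40: 0.45}
q_random = {5: 0.80, 10: 0.55, 20: 0.30, 40: 0.15}
best_k, delta = pick_best_parameter(q_real, q_random)
```

In this toy table the raw Q is highest for the sparsest network, while ∆Q peaks at an intermediate density.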

  28. Results: synthetic data set 1 • High-dimensional data generated by synDeca. • 20 clusters of high-dimensional points, plus some scatter points • Clusters are of various shapes: ellipse, rectangle, random • [Figure: accuracy and ∆Q]

  29. Comparison • mKNN-HQcut with the optimum k • mKNN-HQcut with automatically determined k • K-means

  30. Results: synthetic data set 2 • Gene expression data • Thalamuthu et al., 2006 • 600 data sets • ~600 genes, 50 conditions, 15 clusters • 0 or 1x outliers • [Figure panels: without outliers, with outliers; mKNN-HQcut with optimal k, mKNN-HQcut with auto k]

  31. Comparison with other methods • Ruan et al., BioKDD 2010

  32. Results on yeast stress response data • 3000 genes, 173 samples • Best k = 140, resulting in 75 clusters

  33. Results on yeast stress response data • Enrichment of common functions • Cumulative hypergeometric test (Fisher’s exact test) • Protein biosynthesis (p < 10^-96) • Peroxisome (p < 10^-13) • Nuclear transport (p < 10^-50) • mt ribosome (p < 10^-63) • DNA repair (p < 10^-66) • RNA splicing (p < 10^-105) • Nitrogen compound metabolism (p < 10^-37) • [Figure axes: gene, GO function terms]
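The enrichment p-values above come from a cumulative hypergeometric test. A minimal sketch of that test with exact integer arithmetic; the function name, argument names and the small example counts are illustrative and not taken from the talk.

```python
from math import comb

def enrichment_p_value(total_genes, annotated, cluster_size, overlap):
    """P(X >= overlap) when cluster_size genes are drawn without replacement
    from total_genes genes, of which `annotated` carry the function term."""
    upper = min(annotated, cluster_size)
    favourable = sum(
        comb(annotated, i) * comb(total_genes - annotated, cluster_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, cluster_size)

# Small illustrative case: 10 genes, 4 annotated, a cluster of 3 containing 2.
p = enrichment_p_value(total_genes=10, annotated=4, cluster_size=3, overlap=2)
# (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40 / 120
```

A small p-value means the cluster contains more genes with the term than random sampling would be expected to produce.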

  34. Comparison with k-means • Using automatically determined k = 140 • [Figure: overall enrichment score, mkNN-HQcut vs. K-means]

  35. An interesting community • A 25-gene community missed by other methods • 4 telomere maintenance genes (p < 10^-7) • 16 unknown genes, all located in chromosome telomeric regions • 5 other genes at rim of the sub-network • 4 transcription factors regulate many genes in the community

  36. Application to Arabidopsis data • ~22000 genes, 1138 samples • 1150 singletons • 800 (300) modules of size >= 10 (20) • > 80% (90%) of modules have enriched functions • Much more significant than all five existing studies on the same data set • [Figure: top 40 most significant modules]

  37. Cis-regulatory network of Arabidopsis • [Figure: motif-module network] • Ruan et al., BMC Bioinfo, to appear

  38. Beyond gene clustering (1) • Gene specific studies • Collaborator is interested in gibberellins • A hormone important for the growth and development of plants • Commercially important • Biosynthesis and signaling well studied • Transcriptional regulation of biosynthesis and signaling not yet clear • 3 important gene families, GA20ox, GA3ox and GA2ox for biosynthesis • Receptor gene family: GID1A,B,C • Analyze the co-expression network around these genes

  39. [Figure: co-expression network; node labels: 20ox (20ox1, 20ox2, 20ox3, 20ox4, 20ox5), 3ox (3ox1, 3ox2, 3ox3, 3ox4), 2ox (2ox1, 2ox2, 2ox3, 2ox4, 2ox6, 2ox7, 2ox8), GID1A, GID1B, GID1C, GA3]

  40. Beyond gene clustering (2) • Cancer classification • Sample: patient or cell lines • [Figure: gene × sample matrix, sample × sample network, Qcut] • Alizadeh et al., Nature 2000

  41. Network of cell samples • Shape: cell line / cancer type • Color: clustering results • Figure labels: Transformed cell lines, Activated Blood B, Follicular lymphoma (FL), DLBCL, Resting Blood B, Blood T, Diffuse large B-cell Lymphoma (DLBCL), Chronic lymphocytic leukemia (CLL)

  42. Survival rate after chemotherapy • Survival rate: 40%, median survival time: 22.3 months • Survival rate: 73%, median survival time: 71.3 months • Survival rate: 20%, median survival time: 12.5 months • Cluster labels in figure: DLBCL-2, DLBCL-1, DLBCL-3

  43. Beyond gene clustering (3) • Topology vs. function • [Figure: % of essential proteins vs. number of connections] • Jeong et al., Nature 2001

  44. Community participation vs. essentiality (PPI) • [Figure: % essential vs. community participation, hub and non-hub]

  45. Community participation vs. essentiality (coexp) • [Figure: % essential vs. number of connections, participation < 0.2 and participation >= 0.2; % essential vs. community participation, hub and non-hub] • Key: how to systematically search for such relationships? • Data mining – association rule?

  46. Agenda • Community structure in biological networks • Prediction of protein complexes • Network-based microarray data analysis • Network-based biomarker discovery for metastatic breast cancer • Conclusions

  47. Background • Metastasis is the spread of cancer from one organ to another non-adjacent organ or part. • Challenge: predict metastasis • If metastasis is likely => aggressive adjuvant therapy • How to decide the likelihood? • Traditional predictive factors are not good enough

  48. Microarray-based marker discovery • Examine genome-wide expression profiles • Idea: score individual genes for how well they discriminate between different classes of disease • Establish gene expression signature • Limitations: • # genes >> # patients • Downstream effects • Individual variations not attributed to cancer • Consequences: • Low reproducibility across data sets • Missing biological insight
