
BFAM Project BF-S15T07 “Efficient clustering algorithms for genome-wide expression analysis”

Chair for Efficient Algorithms. Prof. E. W. Mayr. Institut für Informatik, Technische Universität München. BFAM Project BF-S15T07 “Efficient clustering algorithms for genome-wide expression analysis”. BFAM Project BF-S15T08 “Modeling and visualization of biochemical networks”.



  1. Chair for Efficient Algorithms, Prof. E. W. Mayr, Institut für Informatik, Technische Universität München
     BFAM Project BF-S15T07 “Efficient clustering algorithms for genome-wide expression analysis”
     BFAM Project BF-S15T08 “Modeling and visualization of biochemical networks”
     Sebastian Wernicke (wernicke@in.tum.de), Arno Buchner (buchner@in.tum.de), Jan Griebsch (griebsch@in.tum.de), Jens Ernst (ernstj@in.tum.de)
     Misc. projects in Bioinformatics
     Hanjo Täubig (taeubig@in.tum.de), Moritz Maass (maass@in.tum.de)

  2. Project I: Efficient Clustering Algorithms for Genome-wide Expression Analysis
     [Diagram labels: Gene Expression Data; Normalization; Expression Profiles; Similarity Measure; Clustering]

  3. 1. Retrospect: The SR-Algorithm
     • Powerful algorithm for similarity-based clustering
     • Based on methods of spectral graph theory, numerical linear algebra and randomization
     • Applicable not only to gene expression profiles but to any class of biological objects where pair-wise similarity is defined
     • Thoroughly mathematically analyzed with respect to noise-robustness and running time
     • Complexity: Θ(n²), and hence optimal
     • New: Parallelized version and optimized version for sparse similarity matrices.
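
The slide does not spell out the SR-algorithm itself, so the following is only a generic illustration of the spectral idea it builds on, not the authors' method. It assumes a symmetric similarity matrix given as a list of lists and splits the items into two groups by the sign pattern of the dominant eigenvector of the mean-centered matrix, found by plain power iteration. The function name `spectral_bipartition` and its parameters are hypothetical.

```python
def spectral_bipartition(sim, iterations=100):
    """Generic spectral split into two groups (illustration only, NOT the SR-algorithm).

    sim: symmetric similarity matrix as a list of lists.
    Returns a list of labels (0 or 1), with item 0 always labeled 0.
    """
    n = len(sim)
    mean = sum(sum(row) for row in sim) / (n * n)
    # Centering removes the near-constant leading direction of the raw matrix.
    centered = [[sim[i][j] - mean for j in range(n)] for i in range(n)]
    vec = [1.0] + [0.0] * (n - 1)  # deterministic start vector (assumption)
    for _ in range(iterations):
        nxt = [sum(centered[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:
            break  # start vector carries no signal; keep the previous vector
        vec = [x / norm for x in nxt]
    # Orient the eigenvector so that item 0 lands in group 0.
    sign = 1.0 if vec[0] >= 0 else -1.0
    return [0 if sign * x >= 0 else 1 for x in vec]


# Noise-free toy input: two blocks of three mutually similar items.
blocks = [[1 if (i < 3) == (j < 3) else 0 for j in range(6)] for i in range(6)]
labels = spectral_bipartition(blocks)
```

Each iteration is one dense matrix-vector product, which costs on the order of n² operations for n items; the sketch makes no attempt at the randomization or noise analysis mentioned on the slide.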

  4. 2. Tests on Synthetic Data (1): Output quality as a function of n and the amount of noise (false positive/false negative rate α). The number of clusters is specified to the algorithm.
     500–2000 genes forming 4 clusters with 20%–49% false positives/negatives.
     [Plot axes: n and α; extracted tick values 1.0 and 0.45]
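
The transcript does not say how the synthetic inputs were generated. A minimal sketch of one plausible setup, assuming a binary similarity matrix in which every pairwise entry of a planted clustering is flipped independently with probability α (so that within-cluster zeros are false negatives and between-cluster ones are false positives). The function name `planted_similarity` and the seeding are hypothetical.

```python
import random


def planted_similarity(n, k, alpha, seed=0):
    """Return (labels, sim) for n items in k planted clusters.

    Assumption: sim[i][j] is 1 for same-cluster pairs and 0 otherwise,
    and each off-diagonal pair is flipped with probability alpha.
    """
    rng = random.Random(seed)
    labels = [i % k for i in range(n)]  # round-robin cluster assignment
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1  # every item is similar to itself
        for j in range(i + 1, n):
            truth = 1 if labels[i] == labels[j] else 0
            value = 1 - truth if rng.random() < alpha else truth
            sim[i][j] = sim[j][i] = value  # keep the matrix symmetric
    return labels, sim


# Small instance in the spirit of the slide: 4 clusters, 45% noise.
labels, sim = planted_similarity(n=40, k=4, alpha=0.45)
```

At α = 0.5 the matrix would carry no information about the planted clusters, which is why noise rates just below that value are the hard end of the test range.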

  5. Tests on Synthetic Data (2): Output quality as a function of n and the amount of noise (false positive/false negative rate α). The number of clusters is found by the algorithm.
     500–4000 genes forming 4 clusters with 20%–45% false positives/negatives.
     [Plot axes: n and α; extracted tick values 1.0 and 0.45]

  6. Tests on Synthetic Data (3): Running time as a function of n and the amount of noise (false positive/false negative rate α) on a 1 GHz machine.
     5,000–30,000 genes, i.e. 25,000,000–900,000,000 similarity values.
     [Plots: time (s) against n (5000 to 30,000) and against α (up to 0.45); the largest time tick is 293 s]

  7. 4. Clustering Protein Interaction Networks
     • Experiments with a network from the STRING system provided by the Bork group at EMBL.
     • Data: Escherichia coli, orthologous group-based
     • Edge scores: Interaction intensities defined by
       score = 1 − (1 − neighborhoodscore) × (1 − fusionscore) × (1 − co-occurrencescore)
     [Courtesy of C. von Mering, Nucleic Acids Res. 2003 Jan 1;31(1):258-61]
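
The edge score above can be written as a small function. This sketch assumes each channel score lies in [0, 1]; the function and parameter names are hypothetical.

```python
def combined_score(neighborhood, fusion, cooccurrence):
    """Combine three channel scores as on the slide:
    score = 1 - (1 - neighborhood) * (1 - fusion) * (1 - cooccurrence).
    Assumes every input lies in the interval [0, 1].
    """
    return 1.0 - (1.0 - neighborhood) * (1.0 - fusion) * (1.0 - cooccurrence)


# Two moderately strong channels reinforce each other.
example = combined_score(0.5, 0.5, 0.0)
```

The combined score is never lower than the strongest single channel, and any channel at 1 forces the result to 1, so the three channels act like independent pieces of evidence for an interaction.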

  8. 4.1 Methods Currently Applied in STRING
     • Functional module extraction: Generic partition-based clustering methods (Single Linkage, Markov Clustering) have been applied to identify functional modules in the network.
     • However: Because the interaction score is defined as a combination of three different channels, multiple cluster structures are superimposed in this data set.
     • Generalized Clustering: Grouping such that any protein (or orthologous group) can belong to multiple clusters. The density of each cluster should be as high as possible, whereas the inter-cluster connectivity (excluding overlaps) should be minimized.

  9. 4.2. Schematic representation:
     [Figure panels: Interaction Matrix (original form); Interaction Matrix (permuted with respect to cluster structure); Cluster Structure with “Lsets” labeled 1; 2; 3; 4; 1,2; 1,3; 2,3; 2,4; 3,4; 1,2,3]

  10. 4.2. Construction of Intersecting Clusters:
      • Construction of elementary sets by SR-techniques
      • Result: A partition of the protein set into a fixed number k of elementary sets. The value of k may safely be overestimated.
      • Intra- and inter-Lset edge densities:
        k = 150
        Mean intra-Lset density: 0.309
        Inter-Lset connectivity: 0.024
      [Figure: Frequency distribution of edge densities within and between Lsets; annotation: Lsets belonging to the same cluster]
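
The slide reports densities without defining them. A minimal sketch under one common reading, which is an assumption here: the network is treated as an unweighted graph, the intra-set density of a set is the fraction of its node pairs joined by an edge (averaged over all sets with at least two nodes), and the inter-set connectivity is the fraction of cross-set node pairs joined by an edge. The function name `edge_densities` is hypothetical.

```python
from itertools import combinations


def edge_densities(adj, parts):
    """Return (mean intra-set density, inter-set connectivity).

    adj:   dict mapping each node to the set of its neighbours (undirected).
    parts: list of disjoint node lists (the partition).
    """
    intra = []
    for part in parts:
        pairs = list(combinations(part, 2))
        if pairs:  # singleton sets have no pairs and are skipped
            hits = sum(1 for u, v in pairs if v in adj[u])
            intra.append(hits / len(pairs))
    mean_intra = sum(intra) / len(intra) if intra else 0.0

    cross_hits = 0
    cross_pairs = 0
    for a, b in combinations(parts, 2):
        for u in a:
            for v in b:
                cross_pairs += 1
                if v in adj[u]:
                    cross_hits += 1
    inter = cross_hits / cross_pairs if cross_pairs else 0.0
    return mean_intra, inter


# Toy graph: a triangle {0,1,2}, an edge {3,4}, and one bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
mean_intra, inter = edge_densities(adj, [[0, 1, 2], [3, 4]])
```

A partition that reflects real structure shows the pattern reported on the slide: a within-set density well above the between-set connectivity.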

  11. Definition of the Lset-graph
      • Some pairs of Lsets are still highly connected. This is represented by a graph structure whose nodes are Lsets. Maximal cliques in this graph are macroscopic clusters, which can overlap.
      • Note: This means that the method self-corrects an over-estimated value of k.
      3. Construction of the intersecting clusters
      • The cliques are extracted using the Tsukiyama algorithm.
      • Result: 144 clusters
      • Intra-cluster density: 0.269
      • Inter-cluster connectivity: 0.020 (excl. overlaps)
      [Figure: Lset-graph with nodes labeled 1; 1,2; 1,2,3; 2; 2,3; 4; 3,4; 3; 1,3; 2,4]
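
The clique step can be sketched as follows. Two assumptions are made loudly: the pairs of Lsets that count as "highly connected" are passed in as given (the slide states no threshold), and the maximal cliques are enumerated with the Bron-Kerbosch algorithm with pivoting, swapped in here for the Tsukiyama algorithm named on the slide because it is shorter to write down. All function names are hypothetical.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch (pivoting variant).

    adj: dict mapping each node to the set of its neighbours.
    Returns a sorted list of sorted node lists.
    """
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))  # r cannot be extended: it is maximal
            return
        # Pivot on the node covering most candidates to prune branches.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


def overlapping_clusters(lsets, linked_pairs):
    """Merge the Lsets of each maximal clique of the Lset-graph.

    lsets:        dict mapping an Lset name to its list of members.
    linked_pairs: pairs of Lset names treated as highly connected (assumed given).
    """
    adj = {name: set() for name in lsets}
    for a, b in linked_pairs:
        adj[a].add(b)
        adj[b].add(a)
    return [sorted(set().union(*(lsets[name] for name in clique)))
            for clique in maximal_cliques(adj)]


# Lset B is linked to both A and C, so it ends up in two clusters.
lsets = {"A": [1, 2], "B": [3], "C": [4, 5]}
clusters = overlapping_clusters(lsets, [("A", "B"), ("B", "C")])
```

In the toy input, member 3 appears in both resulting clusters, which is the overlap the slide describes: an Lset shared by two cliques belongs to two macroscopic clusters.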

  12. Quality assessment based on biological expert knowledge: currently pending
      • The clusters are being compared with a known set of protein-to-pathway assignments.

  13. 5. Mathematical Result Evaluation in Comparative Analysis of Clustering Algorithms
      • Mathematical scoring scheme for clustering quality:
      • Suppose a clustering has induced the partition C = {C_1, C_2, …, C_k} of the set of genes {X_1, X_2, …, X_n}.
      • Denote the similarity between a pair of genes X_i, X_j with s(X_i, X_j).
      • Denote the cluster containing X_i with C(X_i) and the center of some cluster C with X_C.
      • Cluster Homogeneity:
      • Separation:
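
The two formulas are not preserved in this transcript. The sketch below uses one widely used pair of definitions that fits the notation above, and it should be read as an assumption rather than as the slide's formulas: homogeneity is the average similarity between each gene and the center of its cluster, and separation is the size-weighted average similarity between the centers of distinct clusters. The center is taken to be the mean profile, which is also an assumption; the similarity function is passed in, and all names are hypothetical.

```python
def centroid(profiles):
    """Mean profile of a cluster (assumed meaning of the 'center' X_C)."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def homogeneity_and_separation(clusters, sim):
    """Return (homogeneity, separation) for a partition.

    clusters: list of clusters, each a list of expression profiles.
    sim:      similarity function s(x, y) on two profiles.
    """
    centers = [centroid(cluster) for cluster in clusters]
    total = sum(len(cluster) for cluster in clusters)
    # Average similarity of every profile to its own cluster center.
    homogeneity = sum(sim(x, centers[i])
                      for i, cluster in enumerate(clusters)
                      for x in cluster) / total
    weight = 0.0
    score = 0.0
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            w = len(clusters[i]) * len(clusters[j])  # weight by cluster sizes
            weight += w
            score += w * sim(centers[i], centers[j])
    separation = score / weight if weight else 0.0
    return homogeneity, separation


def dot(a, b):
    """Toy similarity for the example: the plain dot product."""
    return sum(x * y for x, y in zip(a, b))


h, s = homogeneity_and_separation([[[1, 0], [1, 0]], [[0, 1]]], dot)
```

Under these definitions a good clustering has high homogeneity and a low separation value, since the latter measures how similar different cluster centers still are.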

  14. Remarks:
      • The cluster analysis was conducted in the form of a blind test. Use of expert knowledge or supervised learning techniques was not intended.
      • No prior selection of genes was asked for.
      • Neither normalization/standardization of the expression data nor a particular similarity or distance measure was explicitly required.
      • Choice of similarity measure s for the evaluation: Pearson Correlation Coefficient, chosen for its invariance under scaling and translation of expression profiles, transformations that some participants applied.
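
The invariance property named in the last remark can be checked numerically. The following is a plain textbook implementation of the Pearson correlation coefficient, not code from the project; the example profiles are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles.

    Undefined (division by zero) if either profile is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


profile_a = [1.0, 2.0, 4.0, 3.0]
profile_b = [2.0, 1.0, 5.0, 4.0]
original = pearson(profile_a, profile_b)
# Rescale and shift one profile: the coefficient should not change.
rescaled = pearson([10.0 * v + 7.0 for v in profile_a], profile_b)
# A negative scale factor flips the sign but keeps the magnitude.
mirrored = pearson([-2.0 * v for v in profile_a], profile_b)
```

The invariance holds for positive scale factors. A negative factor turns correlation into anti-correlation, which is the case the absolute value of the coefficient is meant to accommodate.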

  15. Homogeneity and Separation in the Clusterings (NRO)
      NRO Data Set (Pearson correlation)
      [Plot with axes Homogeneity and Separation; point labels as extracted: “Average” (2), “Average” (3), Kröger (10), “Binary” (16), (20), “SR”, “Ward” (2), “SOM” (2), “Optimum” (2)]

  16. Using |Pearson| to accommodate anti-correlation
      NRO Data Set (absolute Pearson correlation)
      [Plot with axes Homogeneity and Separation; point labels as extracted: “SOM” (2), “Ward” (2), Kröger (10), “Binary” (16), (16), (20), “Average” (2), “SR”, “Average” (3), (3), “Optimum”]

  17. An SR-Clustering with 16 Clusters on the NRO Data:

  18. The appropriately permuted similarity matrix
      [Figure annotations: Isolated clusters with high confidence. The gray off-diagonal blocks suggest some inter-cluster similarity. Cluster overlap is conceivable here.]

  19. 6. Cooperation within the BFAM Network:
      • Cooperation with Genomatix Software GmbH: Extension of cluster analysis by integration of information from biological databases and expert knowledge
      • Cooperation with Genomatix Software GmbH, Biomax Informatics GmbH, the group of Prof. Lasser and the group of Prof. Kriegel: Comparative analysis of clustering algorithms
      • Publications:
        [1] “Similarity-Based Clustering Algorithms for Gene Expression Profiles”, J. Ernst, Dissertation, Technische Universität München, 2002
        [2] “Generalized Clustering of Gene Expression Profiles – A Spectral Approach”, J. Ernst, Proc. of the Int. Conference on Bioinformatics, Bangkok, 2002
        [3] “The Complexity of Detecting Fixed-Density Clusters”, H. Täubig et al., Proc. of the 5th Italian Conference on Algorithms and Complexity, 2003

  20. Chair for Efficient Algorithms
      [Diagram labels: Algorithms for Bioinformatics; Project “Clustering”; Graph Theory; Combinatorial Optimization; Randomized Algorithms; Project “Biological Networks”; Algorithm Visualization; Complexity Theory; Computer Algebra; Misc. Bioinformatics Projects; Petri Nets; Scheduling]
