
341: Introduction to Bioinformatics


Presentation Transcript


  1. 341: Introduction to Bioinformatics Dr. Nataša Pržulj Department of Computing Imperial College London natasha@imperial.ac.uk Winter 2011

  2. Topics • Introduction to biology (cell, DNA, RNA, genes, proteins) • Sequencing and genomics (sequencing technology, sequence alignment algorithms) • Functional genomics and microarray analysis (array technology, statistics, clustering and classification) • Introduction to biological networks • Introduction to graph theory • Network properties: • Global: network/node centralities • Local: network motifs and graphlets • Network models • Network comparison/alignment • Software tools for network analysis • Network/node clustering • Interplay between topology and biology

  3. Why clustering? • A cluster is a group of related objects (informal def.) – see Lecture 6 • In biological nets, a group of “related” genes/proteins • Application in PPI nets: protein function prediction • Are you familiar with Gene Ontology?

  4. Clustering • Data clustering (Lecture 6) vs. Graph clustering

  5. Data Clustering – Review • Find relationships and patterns in the data • Gain insight into the underlying biology • Find groups of “similar” genes/proteins/samples • Deal with numerical values of biological data, which have many features (not just color)

  6. Data Clustering – Review Two criteria of a good clustering: homogeneity (within clusters) and separation (between clusters)

  7. Data Clustering – Review • There are many possible distance metrics between objects • Theoretical properties of a distance metric d: • d(a,b) >= 0 • d(a,a) = 0 • d(a,b) = 0 if and only if a = b • d(a,b) = d(b,a) – symmetry • d(a,c) <= d(a,b) + d(b,c) – triangle inequality

  8. Clustering Algorithms – Review Example distances: • Euclidean (L2) distance • Manhattan (L1) distance • Lm: (|x1-x2|^m + |y1-y2|^m)^(1/m) • L∞: max(|x1-x2|, |y1-y2|) • Inner product: x1·x2 + y1·y2 • Correlation coefficient • For simplicity, we will concentrate on Euclidean distance
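As a rough illustration (not part of the original slides), a minimal Python sketch of these example distances for 2-D points; the function names are ours:

```python
# Plain-Python versions of the example distances for 2-D points (x, y).
import math

def euclidean(p, q):       # L2 distance
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def manhattan(p, q):       # L1 distance
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def minkowski(p, q, m):    # Lm distance: (|x1-x2|^m + |y1-y2|^m)^(1/m)
    return (abs(p[0] - q[0]) ** m + abs(p[1] - q[1]) ** m) ** (1.0 / m)

def chebyshev(p, q):       # L-infinity distance
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)), chebyshev((0, 0), (3, 4)))
# 5.0 7 4
```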

  9. Clustering Algorithms – Review Distance/Similarity matrices: • Clustering is based on distances – distance/similarity matrix: • Represents the distance between objects • Only need half the matrix, since it is symmetric
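A small sketch of building such a matrix, assuming SciPy is available; the example data points are ours:

```python
# Building a distance matrix with SciPy; pdist stores only one triangle of the
# matrix (the "condensed" form) precisely because the matrix is symmetric.
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

condensed = pdist(points, metric="euclidean")   # pairwise distances, half the matrix
full = squareform(condensed)                    # symmetric n x n matrix, zero diagonal
print(full)
```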

  10. Clustering Algorithms – Review Hierarchical Clustering: • Scan the distance matrix for the minimum • Join items into one node • Update the matrix and repeat from step 1

  11. Clustering Algorithms – Review Hierarchical Clustering: • Distance between two points – easy to compute • Distance between two clusters – harder to compute: • Single-Link Method / Nearest Neighbor • Complete-Link / Furthest Neighbor • Average of all cross-cluster pairs
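For illustration, a hedged sketch of these three linkage methods using SciPy's hierarchical clustering; the toy points are an assumption, not from the lecture:

```python
# Single, complete and average linkage on a small toy data set.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])

Z_single   = linkage(points, method="single")    # nearest neighbour linkage
Z_complete = linkage(points, method="complete")  # furthest neighbour linkage
Z_average  = linkage(points, method="average")   # average of all cross-cluster pairs
# dendrogram(Z_single) would draw the resulting tree (requires matplotlib)
```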

  12. Clustering Algorithms – Review Hierarchical Clustering: • 1. Example: Single-Link (Minimum) Method: Resulting Tree, or Dendrogram:

  13. Clustering Algorithms – Review Hierarchical Clustering: • 2. Example: Complete-Link (Maximum) Method: Resulting Tree, or Dendrogram:

  14. Clustering Algorithms – Review Hierarchical Clustering: In a dendrogram, the length of each tree branch represents the distance between the clusters it joins. Different dendrograms may arise when different linkage methods are used.

  15. Clustering Algorithms – Review Hierarchical Clustering: How do you get clusters from the tree? Where to cut the tree?

  16. Clustering Algorithms – Review Hierarchical Clustering: How do you get clusters from the tree? Where to cut the tree?
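One possible way to "cut the tree" programmatically is SciPy's fcluster; the cut height and the requested number of clusters below are arbitrary choices for illustration:

```python
# Cutting a dendrogram into flat clusters, either at a given height or by asking
# for a fixed number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
Z = linkage(points, method="single")

labels_by_height = fcluster(Z, t=2.0, criterion="distance")  # cut the tree at height 2.0
labels_by_k      = fcluster(Z, t=2,   criterion="maxclust")  # or ask for 2 clusters
print(labels_by_height, labels_by_k)                         # e.g. [1 1 2 2] [1 1 2 2]
```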

  17. Clustering Algorithms – Review K-Means Clustering: • Basic idea: use cluster centroids (means) to represent a cluster • Assign data elements to the closest cluster (centroid) • Goal: minimize intra-cluster dissimilarity

  18. Clustering Algorithms – Review K-Means Clustering: • 1. Pick K objects as centers of K clusters and assign all the remaining objects to these centers • Each object is assigned to the center that has minimal distance to it • Break any ties randomly • 2. In each cluster C, find a new center X so as to minimize the total sum of distances between X and all other objects in C • 3. Reassign all objects to new centers as explained in step (1) • Repeat steps (2) and (3) until the algorithm converges
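A minimal NumPy sketch of these steps (essentially Lloyd's algorithm); it assumes no cluster becomes empty and is only an illustration, not the lecture's implementation:

```python
# K-means: pick K centers, assign points to the closest center, recompute centers,
# and repeat until the centers stop moving.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 1: pick K centers
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                            # assign to closest center
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                    # converged
            break
        centers = new_centers                                    # steps 2-3, repeated
    return labels, centers
```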

  19. Clustering Algorithms – Review K-Means Clustering Example:

  20. Clustering Algorithms – Review • Differences between the two clustering algorithms: • Hierarchical Clustering: • Need to select a linkage method • To perform any analysis, it is necessary to partition the dendrogram into k disjoint clusters by cutting it at some point; a limitation is that it is not clear how to choose this k • K-means: • Need to select K • In both cases: need to select a distance/similarity measure • K-medoids: • Centers are actual data points • Hierarchical and k-means clustering are implemented in Matlab

  21. Clustering Algorithms Nearest neighbours “clustering:”

  22. Clustering Algorithms Nearest neighbours “clustering” – pros and cons: • 1. No need to know the number of clusters to discover beforehand (unlike k-means and hierarchical clustering). • 2. We need to define the threshold. Example:
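One common reading of this scheme is to link every pair of points whose distance is below the threshold and report the connected components as clusters; a sketch under that assumption, using NetworkX (the function name and threshold are ours):

```python
# Threshold-based nearest-neighbour "clustering" as connected components of the
# graph that links all point pairs closer than the threshold.
import networkx as nx
import numpy as np

def threshold_clusters(X, threshold):
    G = nx.Graph()
    G.add_nodes_from(range(len(X)))
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if np.linalg.norm(X[i] - X[j]) < threshold:
                G.add_edge(i, j)                    # i and j are "near" each other
    return list(nx.connected_components(G))
```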

  23. Clustering Algorithms k-nearest neighbors “clustering” -- classification algorithm, but we use the idea here to do clustering: • For point v, create the cluster containing v and top k closest points to v, e.g., based on Euclidean distance. • Do this for all points v. • All of the clusters are of size k, but they can overlap. • The challenge: choosing k.
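A short sketch of this idea, assuming Euclidean distance and NumPy; each point contributes one (possibly overlapping) cluster:

```python
# k-nearest-neighbour "clustering": for every point v, form the cluster of v
# together with its k closest points; clusters may overlap.
import numpy as np

def knn_clusters(X, k):
    clusters = []
    for v in range(len(X)):
        dists = np.linalg.norm(X - X[v], axis=1)
        nearest = np.argsort(dists)[1:k + 1]        # skip index 0, which is v itself
        clusters.append({v, *nearest.tolist()})
    return clusters
```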

  24. k-Nearest Neighbours (k-NN) Classification Example: - The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles. - If k = 3 it is classified to the second class (2 triangles vs only 1 square). - If k = 5 it is classified to the first class (3 squares vs. 2 triangles). • An object is classified by a majority vote of its neighbors • It is assigned to the class most common amongst its k nearest neighbors
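A hedged sketch of k-NN classification by majority vote, matching the idea in the green-circle example (the function name is ours):

```python
# An object is assigned to the class most common among its k nearest neighbours,
# so the answer can change between k = 3 and k = 5 as on the slide.
from collections import Counter
import numpy as np

def knn_classify(train_X, train_y, x, k):
    dists = np.linalg.norm(train_X - x, axis=1)     # distances to all training samples
    nearest = np.argsort(dists)[:k]                 # indices of the k closest samples
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]               # majority class
```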

  25. What is Classification? The goal of data classification is to organize and categorize data into distinct classes. Applications: medical diagnosis, treatment effectiveness analysis, protein function prediction, interaction prediction, etc. • A model is first created based on the training data (learning). • The model is then validated on the testing data. • Finally, the model is used to classify new data. • Given the model, a class can be predicted for new data. Example:

  26. Classification = Learning the Model

  27. What is Clustering? There is no training data (objects are not labeled) We need a notion of similarity or distance Should we know a priori how many clusters exist?

  28. Supervised and Unsupervised Classification = Supervised approach • We know the class labels and the number of classes Clustering = Unsupervised approach • We do not know the class labels and may not know the number of classes

  29. Classification vs. Clustering (we can compute it without the need of knowing the correct solution)

  30. Graph clustering Overlapping terminology: • Clustering algorithm for graphs = “Community detection” algorithm for networks • Community structure in networks = Cluster structure in graphs • Partitioning vs. clustering • Overlap?

  31. Graph clustering • Decompose a network into subnetworks based on some topological properties • Usually we look for dense subnetworks

  32. Graph clustering • Why? • Protein complexes in a PPI network

  33. E.g., Nuclear Complexes

  34. Graph clustering Algorithms: • Exact: have proven solution quality and time complexity • Approximate: heuristics are used to make them efficient Example algorithms: • Highly connected subgraphs (HCS) • Restricted neighborhood search clustering (RNSC) • Molecular Complex Detection (MCODE) • Markov Cluster Algorithm (MCL)

  35. Highly connected subgraphs (HCS) • Definitions: • HCS - a subgraph with n nodes such that more than n/2 edges must be removed in order to disconnect it • A cut in a graph - partition of vertices into two non-overlapping sets • A multiway cut - partition of vertices into several disjoint sets • The cut-set - the set of edges whose end points are in different sets • Edges are said to be crossing the cut if they are in its cut-set • The size/weight of a cut - the number of edges crossing the cut • The HCS algorithm partitions the graph by finding the minimum graph cut and by repeating the process recursively until highly connected components (subgraphs) are found

  36. Highly connected subgraphs (HCS) • HCS algorithm: • Input: graph G • Does G satisfy a stopping criterion? • If yes: it is declared a “kernel” • Otherwise, G is partitioned into two subgraphs, separated by a minimum weight edge cut • Recursively proceed on the two subgraphs • Output: list of kernels that are the basis of possible clusters
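A hedged sketch of this recursion using NetworkX; the stopping criterion here is the "edge connectivity > n/2" definition from the previous slide, and the minimum kernel size is an assumption of ours:

```python
# Recursive HCS: stop when the subgraph is highly connected, otherwise split it
# along a minimum edge cut and recurse on the two sides.
import networkx as nx

def hcs(G, min_size=3):
    n = G.number_of_nodes()
    if n < min_size:
        return []                                   # too small to report as a kernel
    if not nx.is_connected(G):                      # recurse on each component
        return [k for part in nx.connected_components(G)
                for k in hcs(G.subgraph(part).copy(), min_size)]
    if nx.edge_connectivity(G) > n / 2:             # highly connected: declare a kernel
        return [set(G.nodes())]
    cut = nx.minimum_edge_cut(G)                    # minimum weight edge cut
    H = G.copy()
    H.remove_edges_from(cut)                        # split G into two subgraphs
    return [k for part in nx.connected_components(H)
            for k in hcs(G.subgraph(part).copy(), min_size)]
```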

  37. Highly connected subgraphs (HCS)

  38. Highly connected subgraphs (HCS) • Clusters satisfy two properties: • They are homogeneous, since the diameter of each cluster is at most 2 and each cluster is at least half as dense as a clique • They are well separated, since any non-trivial split by the algorithm happens on subgraphs that are likely to be of diameter at least 3 • Running time complexity of the HCS algorithm: • Bounded by 2N·f(n,m) • N is the number of clusters found (often N << n) • f(n,m) is the time complexity of computing a minimum edge cut of G with n nodes and m edges • The fastest deterministic min edge cut algorithm for unweighted graphs has time complexity O(nm); for weighted graphs it is O(nm + n^2 log n) More in survey chapter: N. Przulj, “Graph Theory Analysis of Protein-Protein Interactions,” a chapter in “Knowledge Discovery in Proteomics,” edited by I. Jurisica and D. Wigle, CRC Press, 2005

  39. Highly connected subgraphs (HCS) • Several heuristics are used to speed it up • E.g., removing low degree nodes • If an input graph has many low degree nodes (remember, biological networks have power-law degree distributions), one iteration of the minimum edge cut algorithm may only separate a low degree node from the rest of the graph, increasing computational cost while adding little information in terms of clustering • After clustering is over, singletons can be “adopted” by clusters, say by the cluster with which a singleton node has the most neighbors

  40. Restricted neighborhood search clust. (RNSC) • RNSC algorithm - partitions the set of nodes in the network into clusters by using a cost function to evaluate the partitioning • The algorithm starts with a random cluster assignment • It proceeds by reassigning nodes between clusters so as to improve the cost of the partitioning • At the same time, the algorithm keeps a list of already explored partitions to avoid reprocessing them • Finally, the clusters are filtered based on their size, density and functional homogeneity A. D. King, N. Przulj and I. Jurisica, “Protein complex prediction via cost-based clustering,” Bioinformatics, 20(17): 3013-3020, 2004.

  41. Restricted neighborhood search clust. (RNSC) • A cost function to evaluate the partitioning: • Consider node v in G and clustering C of G • αv is the number of “bad connections” incident with v • A bad connection incident with v is an edge that exists between v and a node in a different cluster from the one containing v, or a missing edge between v and a node u in the same cluster as v • The cost function is then: Cn(G,C) = ½ ∑_{v∈V} αv • There are other cost functions, too • Goal of each cost function: a clustering in which the nodes of a cluster are all connected to each other and there are no other connections A. D. King, N. Przulj and I. Jurisica, “Protein complex prediction via cost-based clustering,” Bioinformatics, 20(17): 3013-3020, 2004.
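A direct sketch of this naive cost function, assuming a NetworkX graph and a clustering given as a dict mapping each node to a cluster id (the toy graph at the end is ours):

```python
# Naive cost C_n(G, C) = 1/2 * sum_v alpha_v, where alpha_v counts the "bad
# connections" incident with v (edges leaving v's cluster plus missing edges inside it).
import networkx as nx

def naive_cost(G, clustering):
    total = 0
    for v in G:
        same_cluster = [u for u in G if u != v and clustering[u] == clustering[v]]
        cross_edges   = sum(1 for u in G[v] if clustering[u] != clustering[v])
        missing_edges = sum(1 for u in same_cluster if not G.has_edge(v, u))
        total += cross_edges + missing_edges        # alpha_v for this node
    return total / 2

G = nx.Graph([("a", "b"), ("b", "c"), ("c", "d")])
print(naive_cost(G, {"a": 0, "b": 0, "c": 1, "d": 1}))   # one cross edge -> cost 1.0
```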

  42. Molecular Complex Detection (MCODE) • Step 1: node weighting • Based on the core clustering coefficient • Clustering coefficient of a node: the density of its neighborhood • A graph is called a “k-core” if the minimal degree in it is k • “Core clustering coefficient” of a node: the density of the k-core of its immediate neighborhood • It increases the weights of heavily interconnected graph regions while giving small weights to the less connected vertices, which are abundant in scale-free networks • Step 2: the algorithm traverses the weighted graph in a greedy fashion to isolate densely connected regions • Step 3: the post-processing step filters or adds proteins based on connectivity criteria • Implementation available as a Cytoscape plugin: http://baderlab.org/Software/MCODE
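A rough sketch of Step 1 (node weighting) in the spirit of MCODE, using NetworkX; this simplification scores a node as k times the density of the highest k-core of its immediate neighbourhood, as in the worked example on the following slides (8 * 0.8 = 6.4 for Pti1), and is not the exact MCODE weighting:

```python
# Simplified MCODE-style node weighting based on the highest k-core of the
# node's immediate neighbourhood.
import networkx as nx

def mcode_weight(G, v):
    neighbourhood = G.subgraph(list(G[v]) + [v])               # v and its neighbours
    core_numbers = nx.core_number(neighbourhood)
    k = max(core_numbers.values())                             # highest k-core present
    core = neighbourhood.subgraph(
        [u for u, c in core_numbers.items() if c >= k])        # nodes of that k-core
    return k * nx.density(core)                                # k * density of the core
```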

  43. Molecular Complex Detection (MCODE) • Example: a yeast protein-protein interaction network (nodes include Pti1, Pta1, Ref2, Pap1, Rna14, Ssu72, Ysh1, Mpe1, Fip1, Yth1, Cft1, Cft2, Pfs2, Pcf11 and many others)

  44. Input Network: the same yeast protein-protein interaction network as on the previous slide

  45. Find the neighbors of Pti1: Tif11, Yor179c, Pta1, Hca4, Ref2, Pap1, Mpe1, Fip1, Yth1, Cft2, Cft1 and Pfs2

  46. Find the highest k-core of this neighborhood (an 8-core: Pta1, Pti1, Ref2, Pap1, Mpe1, Fip1, Yth1, Cft2, Cft1, Pfs2) • This removes low degree nodes, which are abundant in power-law networks

  47. Find the graph density of the k-core: density = (number of edges) / (number of possible edges) = 44/55 = 0.8

  48. Calculate the score for Pti1: score = highest k-core × density = 8 × 0.8 = 6.4

  49. Repeat for the entire network

  50. Find dense regions: • Pick the highest-scoring vertex • “Paint” outwards until a threshold score is reached (a percentage of the seed node’s score) • In this example the painted region around Pti1 includes Yor179c, Pta1, Hca4, Ref2, Pap1, Rna14, Ssu72, Ysh1, Mpe1, Fip1, Yth1, Cft2, Cft1, Pfs2 and Pcf11
