Create Presentation
Download Presentation

Download Presentation
## Gene Expression Data and Cluster Analysis

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Gene Expression Data and Cluster Analysis**http://staff.washington.edu/kayee/research.html**A gene expression data set**…….. • Snapshot of activities in the cell • Each chip represents an experiment: • time course • tissue samples (normal/cancer) p experiments n genes Xij**What is clustering?**• Group similar objects together • Objects in the same cluster (group) are more similar to each other than objects in different clusters • Data exploratory tool: to find patterns in large data sets • Unsupervised approach: do not make use of prior knowledge of data**Applications of clustering gene expression data**• Cluster the genes functionally related genes • Cluster the experiments discover new subtypes of tissue samples • Cluster both genes and experiments find sub-patterns**Examples of clustering algorithms**• Hierarchical clustering algorithms eg. [Eisen et al 1998] • K-means eg. [Tavazoie et al. 1999] • Self-organizing maps (SOM) eg. [Tamayo et al. 1999] • CAST [Ben-Dor, Yakhini 1999] • Model-based clustering algorithms eg. [Yeung et al. 2001]**Overview**• Similarity/distance measures • Hierarchical clustering algorithms • Made popular by Stanford, ie. [Eisen et al. 1998] • K-means • Made popular by many groups, eg. [Tavazoie et al. 1999] • Model-based clustering algorithms[Yeung et al. 2001]**How to define similarity?**Experiments X genes n 1 p 1 X • Similarity measures: • A measure of pairwise similarity or dissimilarity • Examples: • Correlation coefficient • Euclidean distance genes genes Y Y n n Raw matrix Similarity matrix**Similarity measures(for those of you who enjoy equations…)**• Euclidean distance • Correlation coefficient**Example**Correlation (X,Y) = 1 Distance (X,Y) = 4 Correlation (X,Z) = -1 Distance (X,Z) = 2.83 Correlation (X,W) = 1 Distance (X,W) = 1.41**Lessons from the example**• Correlation – direction only • Euclidean distance – magnitude & direction • Array data is noisy need many experiments to robustly estimate pairwise similarity**Clustering algorithms**• From pairwise similarities to groups • Inputs: • Raw data matrix or similarity matrix • Number of clusters or some other parameters**Hierarchical Clustering [Hartigan 1975]**• Agglomerative(bottom-up) • Algorithm: • Initialize: each item a cluster • Iterate: • select two most similarclusters • merge them • Halt: when required number of clusters is reached dendrogram**Hierarchical: Single Link**• cluster similarity = similarity of two most similar members -Potentially long and skinny clusters + Fast**Example: single link**5 4 3 2 1**Example: single link**5 4 3 2 1**Example: single link**5 4 3 2 1**Hierarchical: Complete Link**• cluster similarity = similarity of two least similar members +tight clusters - slow**Example: complete link**5 4 3 2 1**Example: complete link**5 4 3 2 1**Example: complete link**5 4 3 2 1**Hierarchical: Average Link**• cluster similarity = average similarity of all pairs +tight clusters - slow A 1 2 3**Software: TreeView[Eisen et al. 1998]**• Fig 1 in Eisen’s PNAS 99 paper • Time course of serum stimulation of primary human fibrolasts • cDNA arrays with approx 8600 spots • Similar to average-link • Free download at: http://rana.lbl.gov/EisenSoftware.htm**Overview**• Similarity/distance measures • Hierarchical clustering algorithms • Made popular by Stanford, ie. [Eisen et al. 1998] • K-means • Made popular by many groups, eg. [Tavazoie et al. 1999] • Model-based clustering algorithms[Yeung et al. 2001]**Details of k-means**• Iterate until converge: • Assign each data point to the closest centroid • Compute new centroid Objective function: Minimize**Properties of k-means**• Fast • Proved to converge to local optimum • In practice, converge quickly • Tend to produce spherical, equal-sized clusters • Related to the model-based approach • Gavin Sherlock’s Xcluster: http://genome-www.stanford.edu/~sherlock/cluster.html**What we have seen so far..**• Definition of clustering • Pairwise similarity: • Correlation • Euclidean distance • Clustering algorithms: • Hierarchical agglomerative • K-means • Different clustering algorithms different clusters • Clustering algorithms always split out clusters**Which clustering algorithm should I use?**• Good question • No definite answer: on-going research**Traditional hierarchical clustering tree**Networks with community structure Connecting the pair of nodes with strongest link and next strongest link …**Uncovering underlying modularity**No overlap Perfect overlap Using topological overlap # of nodes to which both i and j are linked (+1 if direct link)**Any better methods?**Using global informationnot local information (i.e. using dynamic & whole network information)**Locally not importantbecause degree=2**But globally, very importantbecause connecting two groups!**j**i 1 1 k • Betweenness Centrality (BC) [Freeman, 1977] bij(k) (fraction in the number of the shortest paths between i and j that pass through k.) “How much is the k-th node influential to the communication between i and j” • Example: the BC at k contributed by the communication from i to j is • Accumulate over all ordered pairs:**Algorithm for finding community(destructive way using global**information) • Calculate the betweeness for all edges in the network. • Remove the edge with the highest betweeness. • Recalculate betweeneess for edges affected by the removal. • Repeat from step 2 until no edges remain.**One cluster from M. pneumoniae(sugar import & DNA**replication)**Community type of ordering**Shell type of ordering When far from the root When close to the root**Taxonomy by using Metabolic networks**J. Podani, Z.N. Oltvai, H. Jeong, B. Tombor, A.-L. Barabasi, E. Szathmary [ Nature Genetics29 54 (2001). ]**What makes a eukaryote a eukaryote? The name means "true**nut" or "true kernel" in Greek; the "nut" is in fact the nucleus of the eukaryotic cell, a membrane sac that contains the cell's DNA. Unlike bacteria and archaea, eukaryotes have their DNA in linear pieces that are bound up with special proteins (histones) to make chromosomes (normally visible only in dividing cells). Eukaryote**Bacteria**Bacteria lack the membrane-bound nuclei of eukaryotes; transmission electron micrograph of a typical bacterium, E. coli**Archaea**Basic Archaeal Structure : The three primary regions of an archaeal cell are the cytoplasm, cell membrane, and cell wall. Above, these three regions are labelled, with an enlargement at right of the cell membrane structure. Archaeal cell membranes are chemically different from all other living things, including a "backwards" glycerol molecule and isoprene derivatives in place of fatty acids.**Prokaryotes**Diameter ~2 microns. DNA small and circular. No membrane-bound organelles RNA and protein synthesized in same compartment. No cytoskeleton Cell cycle ~20 min. Eukaryotes Diameter ~20 microns. DNA long, linear, and in chromosomes. Nucleus, mitochondria, +chloroplasts RNA synthesized in nucleus, proteins made in the cytoplasm. Cytoskeleton Cell cycle ~24 hr. Differences between “typical” prokaryotes and eukaryotes