Clustering Gene Expression Data


Clustering Gene Expression Data. EMBnet: DNA Microarrays Workshop, Mar. 4 – Mar. 8, 2002, UNIL & EPFL, Lausanne. Gaddy Getz, Weizmann Institute, Israel. Gene Expression Data. Clustering of Genes and Conditions. Methods: Agglomerative Hierarchical: Average Linkage; Centroids: K-Means

    Presentation Transcript
    1. Clustering Gene Expression Data • EMBnet: DNA Microarrays Workshop, Mar. 4 – Mar. 8, 2002, UNIL & EPFL, Lausanne • Gaddy Getz, Weizmann Institute, Israel • Gene Expression Data • Clustering of Genes and Conditions • Methods: • Agglomerative Hierarchical: Average Linkage • Centroids: K-Means • Physically motivated: Super-Paramagnetic Clustering • Coupled Two-Way Clustering

    2. Gene Expression Technologies • DNA chips (Affymetrix) and microarrays can measure the mRNA concentration of thousands of genes simultaneously • General scheme: extract RNA, synthesize labeled cDNA, hybridize with the DNA on the chip.

    3. Single Experiment • After hybridization: • Scan the chip and obtain an image file • Image analysis (find spots, measure signal and noise). Tools: ScanAlyze, Affymetrix, … • Output file: • Affymetrix chips: for each gene, a reading proportional to the concentration and a present/absent call (Average Difference, Absent Call). • cDNA microarrays: competitive hybridization of target and control; for each gene, the log ratio of target and control (CH1I-CH1B, CH2I-CH2B).

    4. Preprocessing: From one experiment to many • Chip and Channel Normalization • Aim: bring the readings of all experiments onto the same scale • Cause: different RNA amounts, labeling efficiencies and image acquisition parameters • Method: multiply the readings of each array/channel by a scaling factor, chosen in one of two ways: • so that the sum of the scaled readings is the same for all arrays, or • by a linear fit of the highly expressed genes • Note: in multi-channel experiments, normalize each channel separately.
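The first scaling rule on slide 4 (equal sums across arrays) can be sketched as follows. This is a minimal illustration, not the workshop's code: the function name `scale_to_common_sum` and the choice of the mean array sum as the common target are assumptions.

```python
def scale_to_common_sum(arrays, target_sum=None):
    """Scale each array so that all arrays have the same total signal.

    `arrays` is a list of lists, one list of readings per chip/channel.
    If no target is given, the mean of the array sums is used (an assumption;
    the slide only requires that the scaled sums be equal).
    """
    sums = [sum(a) for a in arrays]
    if target_sum is None:
        target_sum = sum(sums) / len(sums)
    # One multiplicative factor per array.
    factors = [target_sum / s for s in sums]
    scaled = [[x * f for x in a] for a, f in zip(arrays, factors)]
    return scaled, factors


# Two toy chips measuring the same sample at different overall intensity.
chips = [[100.0, 200.0, 700.0], [50.0, 100.0, 350.0]]
scaled, factors = scale_to_common_sum(chips)
print(factors)
print(scaled)
```

In a multi-channel experiment the same function would be applied to each channel separately, as the slide notes.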

    5. Preprocessing: From one experiment to many • Filtering of Genes • Remove genes that are absent in most experiments • Remove genes that are constant in all experiments • Remove genes with low readings, which are not reliable.
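The three filters on slide 5 might look like this in code. It is a sketch under stated assumptions: the function name `filter_genes` and all three thresholds (`min_present_frac`, `min_range`, `min_max_reading`) are hypothetical, since the slide gives no numeric cutoffs.

```python
def filter_genes(readings, present, min_present_frac=0.5,
                 min_range=1e-9, min_max_reading=20.0):
    """Return the names of genes that pass all three filters.

    readings: dict gene -> list of readings, one per experiment
    present:  dict gene -> list of booleans (present/absent calls)
    All thresholds are illustrative assumptions.
    """
    kept = []
    for gene, values in readings.items():
        calls = present[gene]
        # 1. Drop genes that are absent in most experiments.
        if sum(calls) / len(calls) < min_present_frac:
            continue
        # 2. Drop genes that are constant in all experiments.
        if max(values) - min(values) <= min_range:
            continue
        # 3. Drop genes whose readings are all low (unreliable).
        if max(values) < min_max_reading:
            continue
        kept.append(gene)
    return kept


readings = {
    "g1": [100.0, 400.0, 250.0],  # varies, present, well above noise
    "g2": [5.0, 8.0, 6.0],        # low readings
    "g3": [300.0, 300.0, 300.0],  # constant
    "g4": [90.0, 10.0, 15.0],     # absent in most experiments
}
present = {
    "g1": [True, True, True],
    "g2": [True, True, True],
    "g3": [True, True, True],
    "g4": [True, False, False],
}
print(filter_genes(readings, present))
```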

    6. Noise and Repeats • [Figure: log–log plot of repeated experiments] • >90% within 2 to 3 fold • Multiplicative noise • Repeat experiments • Log scale: dist(4,2) = dist(2,1)
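The identity dist(4,2) = dist(2,1) on slide 6 follows from taking logarithms, which turns a fold change into a difference. A small check, assuming base-2 logs (the slide does not name the base; any base gives the same equality):

```python
import math


def log_distance(a, b):
    """Distance between two positive readings on a log2 scale."""
    return abs(math.log2(a) - math.log2(b))


# A two-fold change has the same size wherever it occurs.
print(log_distance(4, 2), log_distance(2, 1))  # 1.0 1.0

# On the linear scale the same two pairs look different.
print(abs(4 - 2), abs(2 - 1))  # 2 1
```

This is why multiplicative noise is usually handled by clustering log-transformed readings.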

    7. We can ask many questions • Supervised methods (use predefined labels): • Which genes are expressed differently in two known types of conditions? • What is the minimal set of genes needed to distinguish one type of condition from the others? • Unsupervised methods (use only the data): • Which genes behave similarly in the experiments? • How many different types of conditions are there?

    8. Unsupervised Analysis • Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or to be co-regulated. • Goal B: Divide conditions into groups with similar gene expression profiles. Example: divide drugs according to their effect on gene expression. • Tool for both goals: clustering methods

    9. What is clustering?

    10. Cluster Analysis Yields Dendrogram • [Figure: dendrogram; axis: T (resolution)]

    11. What is clustering? More Mathematically • Input: N data points, Xi, i=1,2,…,N, in a D-dimensional space. • Goal: Find “natural” groups or clusters. Data points in the same cluster are “more similar” to each other. • Tasks: • Determine the number of clusters • Generate a dendrogram • Identify significant “stable” clusters

    12. Clustering is ill-posed • Problem-specific definitions are needed: • Similarity: which points should be considered close? • Correlation coefficient • Euclidean distance • Resolution: specified in advance, or hierarchical results • Shape of clusters: general or spherical.

    13. Similarity Measures • Centered correlation • Uncentered correlation • Absolute correlation • Euclidean distance
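The four measures on slide 13 can be written out as below. This sketch uses the standard textbook definitions (Pearson correlation for "centered", cosine of the angle between the raw vectors for "uncentered"); the function names are illustrative, and the correlations assume neither vector is constant or all-zero.

```python
import math


def centered_correlation(x, y):
    """Pearson correlation: subtract each vector's mean first."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    return uncentered_correlation(dx, dy)


def uncentered_correlation(x, y):
    """Cosine of the angle between the raw vectors (no mean subtraction)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm


def absolute_correlation(x, y):
    """Treats correlated and anti-correlated profiles as equally similar."""
    return abs(centered_correlation(x, y))


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


up = [1.0, 2.0, 3.0]
up_shifted = [11.0, 12.0, 13.0]   # same shape, higher baseline
down = [3.0, 2.0, 1.0]            # mirror image of `up`

print(centered_correlation(up, up_shifted))    # close to 1
print(uncentered_correlation(up, up_shifted))  # below 1: baseline matters
print(centered_correlation(up, down))          # close to -1
print(absolute_correlation(up, down))          # close to 1
print(euclidean_distance(up, up_shifted))
```

The example shows how the choice changes what "close" means: the shifted profile is identical under centered correlation but far away in Euclidean distance.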

    14. Agglomerative Hierarchical Clustering • Need to define the distance between the new cluster and the other clusters. • Single Linkage: distance between closest pair. • Complete Linkage: distance between farthest pair. • Average Linkage: average distance between all pairs, or distance between cluster centers. • The dendrogram induces a linear ordering of the data points. • [Figure: data points labeled 1 to 5 and the corresponding dendrogram; axis: distance between joined clusters]

    15. Agglomerative Hierarchical Clustering • Results depend on distance update method • Single Linkage: elongated clusters • Complete Linkage: sphere-like clusters • Greedy iterative process • NOT robust against noise • No inherent measure to choose the clusters
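The greedy merging process of slides 14 and 15 can be sketched in a few lines. This is a deliberately naive version for illustration (it recomputes all pairwise distances at every step), using the "average distance between all pairs" form of average linkage with Euclidean distance; the function names are assumptions.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(points):
    """Greedy agglomerative clustering with average linkage.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are tuples of point indices. The history is exactly the
    information a dendrogram displays.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def cluster_distance(c1, c2):
        # Average distance over all cross-cluster pairs of points.
        total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters (first pair wins on ties).
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


points = [(0.0,), (1.0,), (10.0,), (11.0,), (30.0,)]
for left, right, dist in average_linkage(points):
    print(left, right, dist)
```

Swapping the body of `cluster_distance` for a `min` or `max` over the pairs would give single or complete linkage. Note that nothing in the output says where to cut the tree, which is the "no inherent measure to choose the clusters" point on slide 15.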

    16. Centroid Methods - K-means • Start with random positions of K centroids. • Iterate until the centroids are stable: • Assign points to centroids • Move centroids to the center of their assigned points • [Figure: Iteration = 0]

    17. Centroid Methods - K-means • Start with random positions of K centroids. • Iterate until the centroids are stable: • Assign points to centroids • Move centroids to the center of their assigned points • [Figure: Iteration = 1]

    18. Centroid Methods - K-means • Start with random positions of K centroids. • Iterate until the centroids are stable: • Assign points to centroids • Move centroids to the center of their assigned points • [Figure: Iteration = 1]

    19. Centroid Methods - K-means • Start with random positions of K centroids. • Iterate until the centroids are stable: • Assign points to centroids • Move centroids to the center of their assigned points • [Figure: Iteration = 3]

    20. Centroid Methods - K-means • Result depends on initial centroids’ position • Fast algorithm: compute distances from data points to centroids • No way to choose K. • Example: 3 clusters / K=2, 3, 4 • Breaks long clusters
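The K-means loop from slides 16 to 19 can be sketched as follows. Two details are assumptions rather than something the slides specify: the initial centroids are drawn from the data points themselves, and "closest" means squared Euclidean distance. The function name and parameters are illustrative.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means on a list of equal-length tuples.

    Returns (centroids, labels). K must be chosen by the caller.
    """
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen data points.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Step 2: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty centroid
        # Stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On less cleanly separated data, rerunning with a different `seed` can give a different partition, which is the dependence on initial centroid positions noted on slide 20.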

    21. Super-Paramagnetic Clustering (SPC) • M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation • The idea behind SPC is based on the physical properties of dilute magnets. • Calculating the correlation between magnet orientations at different temperatures (T). • [Figure: T = Low]

    22. Super-Paramagnetic Clustering (SPC) • M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation • The idea behind SPC is based on the physical properties of dilute magnets. • Calculating the correlation between magnet orientations at different temperatures (T). • [Figure: T = High]

    23. Super-Paramagnetic Clustering (SPC) • M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation • The idea behind SPC is based on the physical properties of dilute magnets. • Calculating the correlation between magnet orientations at different temperatures (T). • [Figure: T = Intermediate]

    24. Super-Paramagnetic Clustering (SPC) • The algorithm simulates the magnets' behavior over a range of temperatures and calculates their correlations • The temperature (T) controls the resolution • Example: N=4800 points in D=2

    25. Output of SPC • A function χ(T) that peaks when stable clusters break • Size of the largest clusters as a function of T • Dendrogram • Stable clusters “live” over a large range of T

    26. Choosing a value for T

    27. Advantages of SPC • Scans all resolutions (T) • Robust against noise and initialization: calculates collective correlations. • Identifies “natural” (χ) and stable (range of T) clusters • No need to pre-specify the number of clusters • Clusters can be any shape

    28. Many clustering methods applied to expression data • Agglomerative Hierarchical: • Average Linkage (Eisen et al., PNAS 1998) • Centroid (representative): • K-Means (Golub et al., Science 1999) • Self-Organizing Maps (Tamayo et al., PNAS 1999) • Physically motivated: • Deterministic Annealing (Alon et al., PNAS 1999) • Super-Paramagnetic Clustering (Getz et al., Physica A 2000)

    29. Available Tools • Software packages: • M. Eisen’s programs for clustering and display of results (Cluster, TreeView) • Predefined set of normalizations and filtering • Agglomerative, K-means, 1D SOM • Web sites: • Coupled Two-Way Clustering (CTWC) website, http://ctwc.weizmann.ac.il (both CTWC and SPC) • http://ep.ebi.ac.uk/EP/EPCLUST/ • General mathematical tools: • MATLAB • Agglomerative, public m-files. • Statistical programs (SPSS, SAS, S-plus)

    30. Back to gene expression data • Two goals: cluster genes and cluster conditions • Two independent clusterings: • Genes are represented as vectors of expression in all conditions • Conditions are represented as vectors of expression of all genes
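The two representations on slide 30 are simply the rows and the columns of one expression matrix. A minimal sketch, with a made-up 3-gene by 4-condition matrix and a hypothetical `transpose` helper:

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


# Rows are genes, columns are conditions (toy values).
expression = [
    [1.0, 2.0, 8.0, 9.0],   # gene A
    [1.5, 2.5, 7.5, 9.5],   # gene B
    [9.0, 8.0, 1.0, 2.0],   # gene C
]

# Clustering genes: each data point is a row (D = number of conditions).
gene_vectors = expression

# Clustering conditions: each data point is a column (D = number of genes).
condition_vectors = transpose(expression)

print(len(gene_vectors), len(gene_vectors[0]))            # 3 4
print(len(condition_vectors), len(condition_vectors[0]))  # 4 3
```

Any of the clustering methods above can then be run once on `gene_vectors` and once on `condition_vectors`.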

    31. First clustering - Experiments 1. Identify tissue classes (tumor/normal)

    32. Second Clustering - Genes • 2. Find differentiating and correlated genes • [Figure labels: Ribosomal proteins; Cytochrome C metabolism; HLA2]

    33. Two-way Clustering

    34. Coupled Two-Way Clustering (CTWC) • G. Getz, E. Levine and E. Domany (2000) PNAS • Motivation: Only a small subset of genes plays a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players. Only a subset of the samples exhibits the expression patterns of interest. • New goal: use subsets of genes to study subsets of samples (and vice versa) • A non-trivial task: the number of subsets is exponential. • CTWC is a heuristic to solve this problem.

    35. Football Booing Cheering

    36. CTWC of colon cancer data • [Figure: (A) Tumor vs. Normal; (B) Protocol A vs. Protocol B]

    37. CTWC of Glioblastoma Data – S1(G5) • Godard, Getz, Kobayashi, Nozaki, Diserens, Hamon, Stupp, Janzer, Bucher, de Tribolet, Domany & Hegi (2002), submitted • Sample labels: Glioma cell line; Low grade astrocytoma; Secondary GBM; Primary GBM; p53 mutation • Sample clusters: S10, S11, S12, S13, S14 • Genes: • AB004904 STAT-induced STAT inhibitor 3 • M32977 VEGF (angiogenesis) • M35410 IGFBP2 • X51602 VEGFR1 (angiogenesis) • M96322 Gravin • AB004903 STAT-induced STAT inhibitor 2 • X52946 PTN • J04111 C-JUN • X79067 TIS11B

    38. Biological Work • Literature search for the genes • Genomics: search for a common regulatory signal upstream of the genes • Proteomics: infer functions. • Design the next experiment: get more data to validate the result. • Find what the sets of experiments/conditions have in common.

    39. Summary • Clustering methods are used to: • find genes from the same biological process • group the experiments into similar conditions • Different clustering methods can give different results. The physically motivated ones are more robust. • Focusing on subsets of the genes and conditions can uncover structure that is masked when using all genes and conditions • http://ctwc.weizmann.ac.il