
Advanced Methods of Data Analysis

Advanced Methods of Data Analysis. Course on Microarray Data Acquisition and Analysis Weizmann Institute of Science 16 May 2007 Presented by Tal Shay & Yuval Tabach Weizmann Institute of Science Rehovot, Israel. 9:00 - 10:00 CTWC 10:00 - 11:00 CTWC exercise 11:00 – 11:30 Break


Presentation Transcript


  1. Advanced Methods of Data Analysis. Course on Microarray Data Acquisition and Analysis, Weizmann Institute of Science, 16 May 2007. Presented by Tal Shay & Yuval Tabach, Weizmann Institute of Science, Rehovot, Israel • 9:00 - 10:00 CTWC • 10:00 - 11:00 CTWC exercise • 11:00 - 11:30 Break • 11:30 - 12:00 SPIN • 12:00 - 13:00 SPIN exercise

  2. Coupled Two-Way Clustering (CTWC). Course on Microarray Data Acquisition and Analysis, Weizmann Institute of Science, 16 May 2007. Presented by Tal Shay & Yuval Tabach, Weizmann Institute of Science, Rehovot, Israel • Gad Getz, Erel Levine, and Eytan Domany • Coupled two-way clustering analysis of gene microarray data. PNAS 97: 12079-12084

  3. Talk Aim: A guide to using the CTWC server to properly analyze microarray data.

  4. Motivation • Microarray experiments generate millions of numbers that contain a lot of biological information. • The problem: the data are very complicated and contain a large amount of noise. • How can we unravel the biological information that is masked by a mess of irrelevant information? • CTWC is a simple heuristic clustering procedure that was developed especially to cope with microarray data.

  5. Talk Outline • Preprocessing and filtering • Clustering of Genes and Conditions • Super-Paramagnetic Clustering (SPC) • Coupled Two-Way Clustering (CTWC) • CTWC server • Exercise

  6. Gene Expression Matrix – CTWC format The DB_NAME is used to link genes to a database

  7. Visualization of Expression Matrix • Column = chip (= sample) • Row = probeset • Color = expression level [Figure: expression matrix, genes × samples]

  8. Preprocessing • Select variable genes • Standardize [Figure: initial expression matrix, genes × samples]

  9. Preprocessing • Select variable genes • Standardize [Figure: the 1000 probesets with the highest standard deviation, genes × samples]

  10. Preprocessing • Select variable genes • Standardize [Figure: the 1000 probesets with the highest standard deviation, standardized, genes × samples]
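
The two preprocessing steps on slides 8-10 can be sketched in a few lines. This is only an illustration under stated assumptions, not the CTWC server's own code: the function names, the tiny 3 × 4 matrix and `k=2` (in place of the 1000 probesets on the slides) are made up for the example, and standardization is assumed to mean centering each probeset (row) to mean 0 and scaling it to standard deviation 1.

```python
import statistics

def select_variable_rows(matrix, k):
    """Return (indices, rows) of the k rows with the highest standard deviation."""
    ranked = sorted(range(len(matrix)),
                    key=lambda i: statistics.pstdev(matrix[i]),
                    reverse=True)
    keep = sorted(ranked[:k])  # keep the original row order
    return keep, [matrix[i] for i in keep]

def standardize_rows(matrix):
    """Center each row to mean 0 and scale it to standard deviation 1.

    Assumption: rows are probesets, columns are samples.
    Constant rows (standard deviation 0) are mapped to all zeros.
    """
    out = []
    for row in matrix:
        mean = statistics.fmean(row)
        sd = statistics.pstdev(row)
        if sd == 0:
            out.append([0.0] * len(row))
        else:
            out.append([(x - mean) / sd for x in row])
    return out

# Hypothetical toy data: 3 probesets measured on 4 samples.
expression = [
    [5.0, 5.1, 4.9, 5.0],  # nearly flat probeset, filtered out
    [1.0, 9.0, 1.0, 9.0],  # highly variable probeset
    [2.0, 4.0, 6.0, 8.0],  # variable probeset
]
kept, filtered = select_variable_rows(expression, k=2)
standardized = standardize_rows(filtered)
print(kept)  # [1, 2]
```

After standardization every kept probeset has the same scale, so distances between probesets reflect the shape of their expression profiles rather than their absolute levels.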

  11. Talk Outline • Preprocessing and filtering • Clustering of Genes and Conditions • Super-Paramagnetic Clustering (SPC) • Coupled Two-Way Clustering (CTWC) • CTWC server • Exercise

  12. What questions can we ask? Supervised methods, hypothesis testing (use predefined labels): • Which genes are expressed differently in two known types of samples? • What is the minimal set of genes needed to distinguish one type of sample from the others? Unsupervised methods, exploratory analysis (use only the data): • Which genes behave similarly in the experiments? • How many different types of samples are there?

  13. Clustering – unsupervised analysis [Figure: (1) filtering splits all genes into low-variation and high-variation genes; (2) clustering of the high-variation genes; (3) 3 clusters, each containing highly correlated genes]

  14. Unsupervised Analysis • Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and might be co-regulated. Learn about the biology, infer function. • Goal B: Divide conditions into groups with similar gene expression profiles. Examples: find sub-types of a disease, or group drugs according to their effect. Clustering Methods

  15. DEFINITION OF THE CLUSTERING PROBLEM Giraffe

  16. Dendrogram: How many clusters do we have? The answer depends on the resolution. Cluster analysis yields a dendrogram [Figure axis: T (resolution)]

  17. BUT WHAT ABOUT THE OKAPI? Giraffe + Okapi

  18. Clustering problem definition • Input: N data points, Xi, i=1,2,…,N in a D-dimensional space. • Goal: Find “natural” groups (clusters) of points. Points that belong to the same cluster are “more similar” to each other.

  19. Clustering is not well defined • Similarity: which points should be considered close? • Clustering method: • Resolution: specified in advance, or hierarchical results • Shape of clusters: general or spherical.

  20. Agglomerative Hierarchical Clustering • Results depend on the distance update method • Single Linkage: elongated clusters • Average Linkage: sphere-like clusters • Greedy iterative process • NOT robust against noise • Does not always find the “natural” clusters.
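
To make the dependence on the distance update method concrete, here is a minimal sketch of greedy agglomerative clustering with single and average linkage. It is a naive textbook version written for this transcript, not code from the course: the names `agglomerate` and `linkage_distance`, the five toy points and the cut-off `threshold` (standing in for the resolution at which the dendrogram is cut) are all assumptions.

```python
import math
from itertools import combinations

def linkage_distance(a, b, points, method):
    """Distance between clusters a and b (lists of point indices)."""
    dists = [math.dist(points[i], points[j]) for i in a for j in b]
    if method == "single":    # closest pair of members
        return min(dists)
    if method == "average":   # mean over all pairs of members
        return sum(dists) / len(dists)
    raise ValueError(f"unknown method: {method}")

def agglomerate(points, threshold, method="average"):
    """Greedily merge the two closest clusters until the closest pair
    is farther apart than `threshold`."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        (a, b), d = min(
            (((x, y), linkage_distance(clusters[x], clusters[y], points, method))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical toy data: four evenly spaced points on a line plus one outlier.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
single = agglomerate(points, threshold=1.5, method="single")
average = agglomerate(points, threshold=1.5, method="average")
print(single)   # [[0, 1, 2, 3], [4]]
print(average)  # [[0, 1], [2, 3], [4]]
```

At the same cut-off, single linkage chains the four neighbouring points into one elongated cluster, while average linkage stops at two compact pairs.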

  21. Stop … think • We want to identify the real (“natural”) clusters. • We should have a reliability parameter that will help us to distinguish between significant and non-significant clusters.

  22. Talk Outline • Preprocessing and filtering • Clustering of Genes and Conditions • Super-Paramagnetic Clustering (SPC) • Coupled Two-Way Clustering (CTWC) • CTWC server • Exercise

  23. Super-Paramagnetic Clustering (SPC). M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation • The idea behind SPC is based on the physical properties of dilute magnets. • Calculating the correlation between magnet orientations at different temperatures (T). [Figure: small elements (spins), T = low]

  24. Super-Paramagnetic Clustering (SPC). M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation • The idea behind SPC is based on the physical properties of dilute magnets. • Calculating the correlation between magnet orientations at different temperatures (T). [Figure: T = high]

  25. Super-Paramagnetic Clustering (SPC). M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation • The idea behind SPC is based on the physical properties of dilute magnets. • Calculating the correlation between magnet orientations at different temperatures (T). [Figure: T = intermediate]

  26. Phases of the Inhomogeneous Potts Ferromagnet: ferromagnetic (T = low), super-paramagnetic (T = intermediate), paramagnetic (T = high)

  27. Super-Paramagnetic Clustering (SPC) [Figure panels: T = low, T = intermediate, T = high]

  28. Super-Paramagnetic Clustering (SPC) • The algorithm simulates the magnets' behavior over a range of temperatures and decides which interactions to break. • The temperature (T) controls the resolution. Example: N=4800 points in D=2

  29. Identify the stable clusters [Figure: T = 16]

  30. Same data - Average Linkage

  31. Advantages of SPC • Scans all resolutions (T) • Robust against noise and initialization: it calculates collective correlations. • Identifies “natural” and stable clusters (T) • No need to pre-specify the number of clusters • Clusters can be any shape

  32. Inside SPC: dendrogram and stable clusters. Parameters: Min Cluster Size: 3, Stable Delta T: 14, Ignore dropout: 1 [Figure: dendrogram with a T axis]

  33. CTWC server - Setting the SPC parameters (for genes and for samples)

  34. Talk Outline • Preprocessing and filtering • Clustering of Genes and Conditions • Super-Paramagnetic Clustering (SPC) • Coupled Two-Way Clustering (CTWC) • CTWC server • Exercise

  35. Back to gene expression data • Two goals: cluster genes and cluster conditions • Two independent clusterings: • Genes are represented as vectors of expression in all conditions • Conditions are represented as vectors of expression of all genes

  36. First clustering - Experiments 1. Identify tissue classes (tumor/normal), D = 2000

  37. Second Clustering - Genes 2. Find differentiating and correlated genes, D = 62 [Figure: genes × samples]

  38. Two-way clustering: S1(G1), G1(S1)

  39. Two-way clustering, ordered: S1(G1), G1(S1)

  40. Football Song A Song B

  41. Coupled Two-Way Clustering (CTWC). G. Getz, E. Levine and E. Domany (2000) PNAS • Philosophy: Only a small subset of genes play a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players. Only a subset of the samples exhibit the expression patterns of interest. • New Goal: Use subsets of genes to study subsets of samples (and vice versa) • A non-trivial task: there is an exponential number of subsets. • CTWC is a heuristic to solve this problem.
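
The iteration described on this slide can be sketched as follows. This is a loose illustration of the idea only, not the published algorithm or the server's code: `toy_cluster` is a deliberately simple stand-in for SPC (connected components under a distance threshold, with no notion of cluster stability), and the function names, the threshold of 2.0, `max_iter` and the 4 × 4 matrix are all assumptions made for the example.

```python
import math
from itertools import product

def toy_cluster(vectors, threshold=2.0, min_size=2):
    """Stand-in for SPC (NOT the real algorithm): connected components of the
    graph linking vectors closer than `threshold`; small groups are dropped."""
    n = len(vectors)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in range(n):
                if j not in seen and math.dist(vectors[i], vectors[j]) < threshold:
                    seen.add(j)
                    stack.append(j)
        if len(comp) >= min_size:
            groups.append(sorted(comp))
    return groups

def ctwc(matrix, cluster_fn, max_iter=5):
    """Repeat two-way clustering on every new (gene subset, sample subset) pair
    until no new subsets appear or max_iter passes have been made."""
    all_genes = tuple(range(len(matrix)))
    all_samples = tuple(range(len(matrix[0])))
    gene_sets, sample_sets = [all_genes], [all_samples]
    done = set()
    for _ in range(max_iter):
        new_found = False
        for g, s in list(product(gene_sets, sample_sets)):
            if (g, s) in done:
                continue
            done.add((g, s))
            # Cluster the genes in g, using only the samples in s as features.
            gene_vecs = [[matrix[i][j] for j in s] for i in g]
            for grp in cluster_fn(gene_vecs):
                sub = tuple(g[k] for k in grp)
                if sub not in gene_sets:
                    gene_sets.append(sub)
                    new_found = True
            # Cluster the samples in s, using only the genes in g as features.
            sample_vecs = [[matrix[i][j] for i in g] for j in s]
            for grp in cluster_fn(sample_vecs):
                sub = tuple(s[k] for k in grp)
                if sub not in sample_sets:
                    sample_sets.append(sub)
                    new_found = True
        if not new_found:
            break
    return gene_sets, sample_sets

# Hypothetical toy data: genes 0-1 carry the signal, genes 2-3 act as noise.
matrix = [
    [0, 0, 2, 2],
    [0, 0, 2, 2],
    [0, 5, 0, 5],
    [5, 0, 5, 0],
]
gene_sets, sample_sets = ctwc(matrix, toy_cluster)
print(gene_sets)    # [(0, 1, 2, 3), (0, 1)]
print(sample_sets)  # [(0, 1, 2, 3), (0, 1), (2, 3)]
```

In this toy case the samples do not cluster when all four genes are used, because the two noisy genes dominate the distances. Once the gene cluster (0, 1) has been found, clustering the samples with only those genes reveals the groups (0, 1) and (2, 3), which is the effect the slide describes.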

  42. Inside CTWC: Iterations Two-way clustering

  43. CTWC server - Setting the coupled two-way clustering parameters, e-mail notification

  44. Coupled two-way clustering of colon cancer: tissues [Figure: tissues S1 clustered using gene clusters G4 and G12: S1(G4), S1(G12)]

  45. CTWC colon cancer - tissues. Coupled two-way clustering of colon cancer: tissues [Figure: S1(G4), S1(G12); labels: Tumor, Normal, S17, Protocol A, Protocol B]

  46. What kind of results do you wish to find? [Figure labels: colon cancer, carcinoma + adenoma, type A / type B, distance matrix]

  47. Talk Outline • Preprocessing and filtering • Clustering of Genes and Conditions • Super-Paramagnetic Clustering (SPC) • Coupled Two-Way Clustering (CTWC) • CTWC server • Exercise

  48. CTWC software • Web interface • ctwc.weizmann.ac.il • ctwc.bioz.unibas.ch • Standalone • Write to Assif.Yitzhaky@weizmann.ac.il

  49. CTWC standalone

  50. Sample Labels • Given as a binary file • For a cluster Gx, label L with values L1 and L2: • Purity(C1, L1) = (#L1 in C1) / |C1|: how much of C1 is composed of L1? • Efficiency(C1, L1) = (#L1 in C1) / |L1|: how much of L1 is contained in C1?
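
A small worked example of the two scores, following the definitions on this slide. The function names and the sample identifiers are made up for illustration, and both the cluster and the label group are assumed to be non-empty.

```python
def purity(cluster, label_members):
    """Fraction of the cluster whose members carry the label: (#L1 in C1) / |C1|."""
    cluster, label_members = set(cluster), set(label_members)
    return len(cluster & label_members) / len(cluster)

def efficiency(cluster, label_members):
    """Fraction of the labeled samples captured by the cluster: (#L1 in C1) / |L1|."""
    cluster, label_members = set(cluster), set(label_members)
    return len(cluster & label_members) / len(label_members)

# Hypothetical example: cluster C1 holds 4 samples, 3 of which carry label L1;
# L1 has 6 members overall.
c1 = {"s1", "s2", "s3", "s4"}
l1 = {"s1", "s2", "s3", "s7", "s8", "s9"}
print(purity(c1, l1))      # 0.75
print(efficiency(c1, l1))  # 0.5
```

A cluster can be pure without being efficient (it contains only L1 samples but misses most of them) and efficient without being pure (it captures all of L1 along with many other samples), so the two scores are read together.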
