
Advanced Methods of Data Analysis

Course on Microarray Data Acquisition and Analysis

Weizmann Institute of Science

16 May 2007

Presented by Tal Shay & Yuval Tabach

Weizmann Institute of Science

Rehovot, Israel

  • 9:00 - 10:00 CTWC

  • 10:00 - 11:00 CTWC exercise

  • 11:00 - 11:30 Break

  • 11:30 - 12:00 SPIN

  • 12:00 - 13:00 SPIN exercise


Coupled Two-Way Clustering (CTWC)

  • Gad Getz, Erel Levine, and Eytan Domany

  • Coupled two-way clustering analysis of gene microarray data. PNAS 97: 12079-12084 (2000)


Talk Aim

A guide to using the CTWC server to analyze microarray data properly.


Motivation

  • Microarray experiments generate millions of numbers containing a lot of biological information.

  • The problem: the data are very complicated and contain a large amount of noise.

  • How can we unravel the biological information that is masked by a mess of irrelevant information?

  • CTWC is a simple heuristic clustering procedure that was developed especially to cope with microarray data.


Talk Outline

  • Preprocessing and filtering

  • Clustering of Genes and Conditions

  • Super-Paramagnetic Clustering (SPC)

  • Coupled Two-Way Clustering (CTWC)

  • CTWC server

  • Exercise


Gene Expression Matrix – CTWC format

The DB_NAME is used to link genes to a database


Visualization of Expression Matrix

  • Column = chip (= sample)

  • Row = probeset

  • Color = expression level

[Figure: expression matrix heat map; rows = genes, columns = samples]


Preprocessing

  • Select variable genes

  • Standardize

[Figure: three heat maps (rows = genes, columns = samples): the initial expression matrix; the 1000 probesets with the highest standard deviation; the same 1000 probesets after standardization]
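
The two preprocessing steps can be sketched in a few lines of R. This is only an illustration on made-up data: the matrix `expr`, the probe and chip names, and `n_keep` are hypothetical, and "standardize" is assumed here to mean scaling each probeset (row) to mean 0 and standard deviation 1.

```r
# Toy expression matrix: rows = probesets, columns = samples (hypothetical data).
set.seed(1)
expr <- matrix(rnorm(50 * 6, mean = 8), nrow = 50,
               dimnames = list(paste0("probe", 1:50), paste0("chip", 1:6)))

# Step 1: keep the n_keep probesets with the highest standard deviation.
# The slide uses 1000; the toy matrix is small, so 10 are kept here.
n_keep <- 10
probe_sd <- apply(expr, 1, sd)
top <- names(sort(probe_sd, decreasing = TRUE))[seq_len(n_keep)]
filtered <- expr[top, , drop = FALSE]

# Step 2 (assumed meaning of "standardize"): scale each row to mean 0, sd 1.
# scale() works on columns, so transpose before and after.
standardized <- t(scale(t(filtered)))

rowMeans(standardized)        # approximately 0 for every probeset
apply(standardized, 1, sd)    # approximately 1 for every probeset
```

Filtering before standardizing matters: once every row has standard deviation 1, the rows can no longer be ranked by their variability.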



What questions can we ask?

Supervised Methods: Hypothesis Testing (use predefined labels)

  • Which genes are expressed differently in two known types of samples?

  • What is the minimal set of genes needed to distinguish one type of sample from the others?

Unsupervised Methods: Exploratory Analysis (use only the data)

  • Which genes behave similarly in the experiments?

  • How many different types of samples are there?
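
To make the contrast concrete, here is a minimal R sketch on made-up data. The labels, the planted difference in `gene1`, and the choice of a per-gene t-test and average-linkage clustering are illustrative assumptions, not methods prescribed by the course.

```r
set.seed(2)
labels <- rep(c("A", "B"), each = 5)   # hypothetical predefined sample types
expr <- matrix(rnorm(20 * 10), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("s", 1:10)))
# Plant a difference between the two types in one gene.
expr["gene1", labels == "B"] <- expr["gene1", labels == "B"] + 5

# Supervised: use the labels and test each gene for a difference.
p_values <- apply(expr, 1, function(x) {
  t.test(x[labels == "A"], x[labels == "B"])$p.value
})
head(sort(p_values), 3)

# Unsupervised: ignore the labels and group the samples by similarity.
sample_tree <- hclust(dist(t(expr)), method = "average")
cutree(sample_tree, k = 2)
```

With thousands of genes the per-gene p-values would also need a multiple-testing correction, for example `p.adjust(p_values, method = "BH")`.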


Clustering – unsupervised analysis

[Figure: all genes (rows) by samples (columns) are filtered into low-variation and high-variation genes; clustering the high-variation genes yields 3 clusters (1, 2, 3), each containing highly correlated genes]


Unsupervised Analysis

  • Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and might be co-regulated. Learn about the biology, infer function.

  • Goal B: Divide conditions into groups with similar gene expression profiles. Examples: find sub-types of a disease, or group drugs according to their effect.

Clustering Methods


DEFINITION OF THE CLUSTERING PROBLEM

[Image: giraffe]


CLUSTER ANALYSIS YIELDS DENDROGRAM

How many clusters do we have?

The answer depends on the resolution.

[Figure: dendrogram; axis: T (resolution)]
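
A small R sketch of why the number of clusters depends on the resolution. The points are made up, and average-linkage `hclust` is used here only as a convenient way to get a dendrogram; the cut height plays the role of the resolution.

```r
# Three hypothetical groups of points in 2-D: two close together, one far away.
set.seed(3)
pts <- rbind(matrix(rnorm(20, mean = 0,  sd = 0.2), ncol = 2),
             matrix(rnorm(20, mean = 3,  sd = 0.2), ncol = 2),
             matrix(rnorm(20, mean = 20, sd = 0.2), ncol = 2))

tree <- hclust(dist(pts), method = "average")

# Cutting the same dendrogram at different heights gives different answers.
coarse <- cutree(tree, h = 10)  # low resolution: the two nearby groups merge
fine   <- cutree(tree, h = 2)   # higher resolution: the nearby groups separate

length(unique(coarse))
length(unique(fine))
```

Neither answer is wrong; each is the clustering at its own resolution.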


BUT WHAT ABOUT THE OKAPI?

[Image: giraffe and okapi]


Clustering problem definition

  • Input: N data points, Xi, i=1,2,…,N in a D-dimensional space.

  • Goal: Find “natural” groups (clusters) of points. Points that belong to the same cluster are more similar to each other than to points in other clusters.


Clustering is not well defined

  • Similarity: which points should be considered close?

  • Clustering method:

    • Resolution: specified in advance, or hierarchical results

    • Shape of clusters: general or spherical


Agglomerative Hierarchical Clustering

  • Results depend on distance update method

    • Single Linkage: elongated clusters

    • Average Linkage: sphere-like clusters

  • Greedy iterative process

  • NOT robust against noise

  • Does not always find the “natural” clusters.
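
The dependence on the distance update method can be seen on a toy example in R. The two parallel lines of points are made up for illustration; `hclust` and `cutree` are from base R's stats package.

```r
# Two hypothetical elongated clusters: parallel lines of 10 points each.
line_a <- cbind(x = 0:9, y = 0)
line_b <- cbind(x = 0:9, y = 3)
pts <- rbind(line_a, line_b)
truth <- rep(c("A", "B"), each = 10)

d <- dist(pts)
single  <- cutree(hclust(d, method = "single"),  k = 2)
average <- cutree(hclust(d, method = "average"), k = 2)

# Cross-tabulate the true lines against each clustering.
table(truth, single)
table(truth, average)
```

Single linkage follows chains of near neighbours, so it recovers the two lines. Average linkage favours compact groups and may instead cut across the lines; the second table shows what it does on this data.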


Stop … think

  • We want to identify the real (“natural”) clusters.

  • We should have a reliability parameter that will help us to distinguish between significant and non-significant clusters.



Super-Paramagnetic Clustering (SPC)

M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation

  • The idea behind SPC is based on the physical properties of dilute magnets.

  • Calculating the correlation between magnet orientations at different temperatures (T).

[Figures: small elements (spins) at T=Low, T=High and T=Intermediate]


Phases of the Inhomogeneous Potts Ferromagnet

  • T=Low: Ferro (ferromagnetic)

  • T=Intermediate: Super-Para (super-paramagnetic)

  • T=High: Para (paramagnetic)

Super-Paramagnetic Clustering (SPC)

[Figure: panels labelled T=Low, T=Low, T=Intermediate, T=High]


Super-Paramagnetic Clustering (SPC)

  • The algorithm simulates the magnets' behavior over a range of temperatures and decides which interactions to break.

  • The temperature (T) controls the resolution

Example: N=4800 points in D=2


Identify the stable clusters

T=16


Same data - Average Linkage


Advantages of SPC

  • Scans all resolutions (T)

  • Robust against noise and initialization: it calculates collective correlations.

  • Identifies “natural” and stable clusters (T)

  • No need to pre-specify number of clusters

  • Clusters can be any shape


Inside SPC: dendrogram and stable clusters

Min Cluster Size: 3

Stable Delta T: 14

Ignore dropout: 1

[Figure: dendrogram; axis: T]


CTWC server - Setting the SPC parameters

[Screenshot: SPC parameter settings for Genes and Samples]



Back to gene expression data

  • 2 Goals: Cluster Genes and Conditions

  • 2 independent clusterings:

    • Genes are represented as vectors of expression in all conditions

    • Conditions are represented as vectors of expression of all genes
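
The two independent clusterings amount to clustering the rows and then the columns of the same matrix. A minimal R sketch on made-up data follows. The distance (1 minus Pearson correlation) and average-linkage `hclust` are assumptions chosen for brevity; the course itself uses SPC for the clustering step.

```r
set.seed(4)
expr <- matrix(rnorm(30 * 8), nrow = 30,
               dimnames = list(paste0("gene", 1:30), paste0("sample", 1:8)))

# Genes as vectors over all conditions: each row is a point in 8 dimensions.
gene_dist <- as.dist(1 - cor(t(expr)))
gene_tree <- hclust(gene_dist, method = "average")

# Conditions as vectors over all genes: each column is a point in 30 dimensions.
sample_dist <- as.dist(1 - cor(expr))
sample_tree <- hclust(sample_dist, method = "average")

# Reorder rows and columns by the two dendrograms for display.
ordered <- expr[gene_tree$order, sample_tree$order]
dim(ordered)
```

The dimension D of each clustering problem is the length of the vectors: the number of conditions when clustering genes, and the number of genes when clustering conditions.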


First clustering - Experiments

1. Identify tissue classes (tumor/normal)

D = 2000


Second Clustering - Genes

2. Find differentiating and correlated genes

D = 62

[Figure: heat map; rows = genes, columns = samples]


Two-way clustering

[Figure: TWO-WAY CLUSTERING: S1(G1) and G1(S1)]


Two-way clustering - ordered

[Figure: TWO-WAY CLUSTERING: S1(G1) and G1(S1), with rows and columns reordered]


Football

Song A

Song B


Coupled Two-Way Clustering (CTWC)

G. Getz, E. Levine and E. Domany (2000) PNAS

  • Philosophy: Only a small subset of genes plays a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players. Only a subset of the samples exhibits the expression patterns of interest.

  • New Goal: Use subsets of genes to study subsets of samples (and vice versa)

  • A non-trivial task – exponential number of subsets.

  • CTWC is a heuristic to solve this problem.
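
The coupling idea can be sketched loosely in R: cluster a submatrix, add any new gene and sample clusters to the pool, then use every new pair of subsets as the next submatrix. This is not the CTWC implementation. It swaps in average-linkage clustering with a fixed cut into `k` groups for SPC and its stability criterion, and all names (`find_clusters`, `add_new`, `ctwc_sketch`, `rounds`, `min_size`) and the data are hypothetical.

```r
set.seed(5)
expr <- matrix(rnorm(40 * 12), nrow = 40,
               dimnames = list(paste0("g", 1:40), paste0("s", 1:12)))
expr[1:8, 1:6] <- expr[1:8, 1:6] + 4   # planted signal in a gene/sample subset

# Stand-in for SPC (assumption): average-linkage clustering on correlation
# distance, cut into k groups; groups smaller than min_size are dropped.
find_clusters <- function(m, k = 2, min_size = 3) {
  tree <- hclust(as.dist(1 - cor(t(m))), method = "average")
  sets <- split(rownames(m), cutree(tree, k = k))
  Filter(function(s) length(s) >= min_size, unname(sets))
}

# Add clusters not seen before to a named list (G2, G3, ... or S2, S3, ...).
add_new <- function(known, found, prefix) {
  for (s in found) {
    seen <- any(vapply(known, function(old) setequal(old, s), logical(1)))
    if (!seen) known[[paste0(prefix, length(known) + 1)]] <- s
  }
  known
}

ctwc_sketch <- function(expr, rounds = 2, min_size = 3) {
  genes   <- list(G1 = rownames(expr))   # G1 = all genes
  samples <- list(S1 = colnames(expr))   # S1 = all samples
  done <- character(0)
  for (r in seq_len(rounds)) {
    for (g in names(genes)) for (s in names(samples)) {
      pair <- paste(g, s)
      if (pair %in% done) next           # each pair is clustered only once
      done <- c(done, pair)
      sub <- expr[genes[[g]], samples[[s]], drop = FALSE]
      if (nrow(sub) < 2 * min_size || ncol(sub) < 2 * min_size) next
      # Cluster the genes of this submatrix, then its samples.
      genes   <- add_new(genes,   find_clusters(sub,    min_size = min_size), "G")
      samples <- add_new(samples, find_clusters(t(sub), min_size = min_size), "S")
    }
  }
  list(genes = genes, samples = samples)
}

result <- ctwc_sketch(expr)
lengths(result$genes)     # sizes of the gene clusters found
lengths(result$samples)   # sizes of the sample clusters found
```

Only pairs of previously found clusters are ever examined, which is how the heuristic avoids enumerating the exponential number of subsets.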


Inside CTWC: Iterations

Two-way clustering


CTWC server - Setting the coupled two-way clustering parameters

E-mail notification


COUPLED TWO-WAY CLUSTERING OF COLON CANCER: TISSUES

[Figure: gene clusters G4 and G12; sample clusterings S1(G4) and S1(G12)]


CTWC colon cancer - tissues

COUPLED TWO-WAY CLUSTERING OF COLON CANCER: TISSUES

[Figure: sample clusterings S1(G4) and S1(G12); labels: Tumor, Normal, S17, Protocol A, Protocol B]


Colon cancer: carcinoma + adenoma

What kind of results do you wish to find?

type A / type B

distance matrix



CTWC software

  • Web interface

    • ctwc.weizmann.ac.il

    • ctwc.bioz.unibas.ch

  • Standalone

    • Write to [email protected]


CTWC standalone


Sample Labels

  • Given as a binary file

  • For a cluster Gx, label L with values L1 and L2:

  • Purity(C1, L1) – how much of C1 is composed of L1?

    $\mathrm{Purity}(C_1, L_1) = \dfrac{\#\,L_1 \text{ in } C_1}{|C_1|}$

  • Efficiency(C1, L1) – how much of L1 is contained in C1?

    $\mathrm{Efficiency}(C_1, L_1) = \dfrac{\#\,L_1 \text{ in } C_1}{|L_1|}$
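
Purity and efficiency can be computed directly from a label vector and a cluster's member list. The R sketch below uses made-up sample names and labels; the function names `purity` and `efficiency` are illustrative and are not part of the CTWC software.

```r
# Hypothetical example: 10 samples, a binary label, and one sample cluster C1.
labels  <- c(s1 = "L1", s2 = "L1", s3 = "L1", s4 = "L1", s5 = "L2",
             s6 = "L2", s7 = "L2", s8 = "L2", s9 = "L2", s10 = "L1")
cluster <- c("s1", "s2", "s3", "s5")   # members of C1

purity <- function(cluster, labels, value) {
  # fraction of the cluster's members that carry the label value
  sum(labels[cluster] == value) / length(cluster)
}

efficiency <- function(cluster, labels, value) {
  # fraction of all samples with the label value that fall inside the cluster
  sum(labels[cluster] == value) / sum(labels == value)
}

purity(cluster, labels, "L1")      # 3 of the 4 members are L1: 0.75
efficiency(cluster, labels, "L1")  # 3 of the 5 L1 samples are in C1: 0.6
```

A cluster that matches a label well scores high on both: high purity alone can come from a tiny cluster, and high efficiency alone from a cluster that contains almost everything.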


Biological Work

  • Literature search for information on interesting genes.

  • Annotation analysis: classify the genes according to their function.

  • Find whether there is a common function or biological meaning for clusters of interest.

  • Find what sets of experiments/conditions have in common.

  • Genomics analysis: search for a common regulatory signal upstream of the genes.

  • Design the next experiment: get more data to validate the result.

Remember: most of your work starts here - understanding the biology behind your results.


Summary

  • Clustering methods are used to

    • find genes from the same biological process

    • group the experiments into similar conditions

  • Focusing on subsets of the genes and conditions can unravel structure that is masked when using all genes and conditions

ctwc.weizmann.ac.il

or

[email protected]


Exercise - Course Experiment

At time 0 a treatment is given.

For D8, treatment suppresses mutp53.

For D11, treatment does not.


The Data

Save and backup the CEL files!


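R Code – From CEL to Excel

> library(affy)  # Bioconductor package for Affymetrix arrays
> A = ReadAffy()  # reads all CEL files in the working directory
> rma_data = rma(A)  # RMA expression values (log2 scale)
> write.exprs(rma_data, file='rma_expression.txt')
> mas5_data = mas5(A)  # MAS5 expression values (not log-transformed)
> write.exprs(mas5_data, file = 'mas5_expression')
> mas5_calls = mas5calls(A)  # MAS5 detection calls (present/marginal/absent)
> write.exprs(mas5_calls, file = 'mas5_detection')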


The EXCEL

Filter the genes – do not cluster all probesets on the chip!


Edit the EXCEL for CTWC

Title #1: U133_AFFX

Title #2: NAME

Column #2: Probeset info

Make the chip names clear!


Samples distance matrix
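
A samples distance matrix of this kind can be computed in a few lines of R. The sketch uses made-up data, Euclidean distance, and an average-linkage ordering; all three are assumptions for illustration, and the CTWC server may use a different distance and ordering.

```r
set.seed(6)
expr <- matrix(rnorm(25 * 6), nrow = 25,
               dimnames = list(paste0("probe", 1:25), paste0("chip", 1:6)))

# dist() measures distances between rows, so transpose to compare samples.
sample_dist <- as.matrix(dist(t(expr), method = "euclidean"))

# Reorder samples by a dendrogram so that similar chips sit next to each other.
ord <- hclust(as.dist(sample_dist), method = "average")$order
ordered <- sample_dist[ord, ord]
round(ordered, 2)
```

After reordering, groups of similar samples show up as blocks of small distances along the diagonal.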

