
Microarray Data Analysis


- Data preprocessing and visualization
- Supervised learning
- Machine learning approaches

- Unsupervised learning
- Clustering and pattern detection

- Gene regulatory region prediction based on co-regulated genes
- Linkage between gene expression data and gene sequence/function databases
- …

- Supervised methods
- Can only validate or reject hypotheses
- Cannot lead to discovery of unexpected partitions

- Unsupervised learning
- No prior knowledge is used
- Explore the structure of the data on the basis of correlations and similarities

DEFINITION OF THE CLUSTERING PROBLEM

Eytan Domany

CLUSTER ANALYSIS YIELDS DENDROGRAM

T (RESOLUTION)


BUT WHAT ABOUT THE OKAPI?


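Data points at $X_i$, $i = 1, \dots, N$

Centroids at $Y_\nu$, $\nu = 1, \dots, K$

Assign data point $i$ to centroid $\nu$: $S_i = \nu$

Cost $E$:

$E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \lVert X_i - Y_{S_i} \rVert^2$

Minimize $E$ over $S_i$, $Y_\nu$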
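To see why alternating two simple steps can only lower the cost, it may help to write out the two partial minimizations. This is a sketch that assumes $E$ is the usual sum of squared distances from each point $X_i$ to its assigned centroid; the index symbol $\nu$ and the count $N_\nu$ (number of points assigned to centroid $\nu$) are notation introduced here, not taken from the slide.

```latex
\begin{aligned}
\text{centroids fixed:}\quad
  S_i &= \arg\min_{\nu} \,\lVert X_i - Y_\nu \rVert^2 \\
\text{assignments fixed:}\quad
  \frac{\partial E}{\partial Y_\nu} &= -2 \sum_{i:\,S_i=\nu} \left( X_i - Y_\nu \right) = 0
  \;\Rightarrow\;
  Y_\nu = \frac{1}{N_\nu} \sum_{i:\,S_i=\nu} X_i
\end{aligned}
```

Neither step can increase $E$, so repeating them drives the cost down until it stops changing, although the end point may be only a local minimum.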


- “Guess” K=3

- Start with random positions of centroids.
- Assign each data point to the closest centroid.
- Move each centroid to the center of its assigned points.
- Iterate until the cost no longer decreases.

- Fast algorithm: only the distances from data points to centroids are computed
- Result depends on the initial positions of the centroids
- Must preset K
- Fails for “non-spherical” distributions
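The iteration described above, and its sensitivity to the starting centroids, can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the four corner points, and the two sets of starting centroids are hypothetical, and the starting centroids are passed in explicitly (rather than drawn at random) so that the outcome is reproducible.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update from given starting centroids.

    Returns (assignments, centroids, cost), where cost is the sum of squared
    distances from each point to its assigned centroid.
    """
    centroids = [tuple(c) for c in centroids]
    assignments = None
    for _ in range(max_iter):
        # Assign each data point to its closest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments are stable, so the cost can no longer decrease
        assignments = new_assignments
        # Move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, s in zip(points, assignments) if s == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(x) / len(members) for x in zip(*members))
    cost = sum(math.dist(p, centroids[s]) ** 2 for p, s in zip(points, assignments))
    return assignments, centroids, cost


# Hypothetical data: the four corners of a 4-by-2 rectangle, with K = 2.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]

# Starting centroids on the short sides: left pair versus right pair.
good = kmeans(points, [(0.0, 1.0), (4.0, 1.0)])

# Starting centroids on the long sides: the iteration gets stuck grouping
# the bottom pair versus the top pair, at a higher cost.
bad = kmeans(points, [(2.0, 0.0), (2.0, 2.0)])

print("cost from first start: ", good[2])
print("cost from second start:", bad[2])
```

Both runs stop at a fixed point of the iteration, yet the second one ends with four times the cost of the first. In practice this is why the procedure is often restarted from several initial positions and the lowest-cost result is kept.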


Initially, each point is its own cluster.

At each step, merge the pair of nearest clusters.

Need to define the distance between the new cluster and the other clusters.

Single Linkage: distance between closest pair.

Complete Linkage: distance between farthest pair.

Average Linkage: average distance between all pairs, or distance between cluster centers.

Dendrogram (vertical axis: distance between joined clusters)

The dendrogram induces a linear ordering of the data points.
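The merging procedure and the three linkage rules can be sketched as follows. This is a deliberately naive illustration in plain Python, assuming Euclidean distance by default; the function name `agglomerate`, its parameters, and the one-dimensional sample points are hypothetical. The "distance between cluster centers" variant is not implemented here.

```python
import math


def agglomerate(points, linkage="average", metric=math.dist):
    """Naive agglomerative clustering.

    Returns (merges, order): merges lists (members_a, members_b, distance)
    in the order the joins happen, and order is a leaf ordering in which
    every cluster formed along the way occupies adjacent positions.
    """
    combine = {
        "single": min,                            # distance between closest pair
        "complete": max,                          # distance between farthest pair
        "average": lambda ds: sum(ds) / len(ds),  # mean over all pairs
    }[linkage]

    def cluster_distance(a, b):
        return combine([metric(points[i], points[j]) for i in a for j in b])

    clusters = [[i] for i in range(len(points))]  # initially each point is a cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of nearest clusters under the chosen linkage.
        d, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]  # concatenation keeps members adjacent
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges, clusters[0]


# Hypothetical one-dimensional data: two tight pairs and one outlier.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]

for name in ("single", "complete", "average"):
    merges, order = agglomerate(points, linkage=name)
    print(name, [d for _, _, d in merges], order)
```

The join heights differ between the three rules even though the merge order is the same for this data, which is one way to see that results depend on the distance update method. A different point-to-point distance can be supplied through `metric`, for example a Manhattan distance written as `lambda p, q: sum(abs(x - y) for x, y in zip(p, q))`.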


- Results depend on distance update method
- Greedy iterative process
- NOT robust against noise
- No inherent measure to identify stable clusters
- Average Linkage – the most widely used clustering method in gene expression analysis

Heat map

- Samples should cluster together according to the experimental design
- Often a way to catch labelling errors or heterogeneity in samples

Correlation coefficient

Normalized across each gene

- Pearson distance

- Euclidean distance
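The practical difference between the two measures can be seen on a small example. The sketch below assumes the common convention that Pearson distance is one minus the Pearson correlation coefficient, which the slide does not spell out; the function names and the three expression profiles are hypothetical.

```python
import math


def pearson_distance(x, y):
    """1 - r, where r is the Pearson correlation coefficient of x and y.

    Undefined (division by zero) if either profile is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def euclidean_distance(x, y):
    return math.dist(x, y)


# Hypothetical expression profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, ten times the amplitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # similar magnitude to gene_a, opposite trend

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(
        f"gene_a vs {name}: "
        f"Pearson distance = {pearson_distance(gene_a, other):.3f}, "
        f"Euclidean distance = {euclidean_distance(gene_a, other):.3f}"
    )
```

Pearson distance treats `gene_a` and `gene_b` as essentially identical because it compares the shape of the profiles, while Euclidean distance places `gene_a` much closer to `gene_c` because it compares magnitudes. Which behaviour is appropriate depends on whether co-variation or absolute expression level matters for the question being asked.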

Figure: genes g1, g2, g3, g4

- Use the average linkage algorithm with Manhattan distance.

- A lot of clustering algorithms
- A lot of distance/similarity metrics
- Which clustering algorithm runs faster and uses less memory?
- How many clusters after all?
- Are the clusters stable?
- Are the clusters meaningful?

- What is the biological question?
- Do I have a preconceived notion of how many clusters there should be?
- How strict do I want to be? Split or join?
- Can a gene be in multiple clusters?
- Hard or soft boundaries between clusters

- Thank you for taking this course. Bioinformatics is a very diverse and fascinating subject. We hope you all decide to continue your pursuit of it.
- We will be very glad to answer your emails or schedule appointments to talk about any bioinformatics related questions you might have.
- We wish you all a wonderful summer break!