Microarray data analysis

Microarray Data Analysis

  • Data preprocessing and visualization

  • Supervised learning

    • Machine learning approaches

  • Unsupervised learning

    • Clustering and pattern detection

  • Prediction of gene regulatory regions based on co-regulated genes

  • Linkage between gene expression data and gene sequence/function databases


Unsupervised learning

  • Supervised methods

    • Can only validate or reject hypotheses

    • Cannot lead to discovery of unexpected partitions

  • Unsupervised learning

    • No prior knowledge is used

    • Explores the structure of the data on the basis of correlations and similarities


DEFINITION OF THE CLUSTERING PROBLEM

Eytan Domany


CLUSTER ANALYSIS YIELDS DENDROGRAM

[Figure: dendrogram, with axis T (resolution)]


BUT WHAT ABOUT THE OKAPI?

Centroid methods – K-means

Data points at $X_i$, $i = 1, \dots, N$

Centroids at $Y_\nu$, $\nu = 1, \dots, K$

Assign data point $i$ to centroid $\nu$: $S_i = \nu$

Cost $E$:

$$E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \lVert X_i - Y_{S_i} \rVert^2$$

Minimize $E$ over $S_i$, $Y_\nu$


K-means

  • “Guess” K=3

  • Start with random positions of centroids (iteration 0).

  • Assign each data point to the closest centroid (iteration 1).

  • Move centroids to the center of their assigned points (iteration 2).

  • Iterate until the cost no longer decreases (iteration 3 and onward).
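The K-means steps can be sketched in plain Python as follows. This is a minimal illustration, not the presenter's implementation. The function names, the `seed` and `max_iter` parameters, and the choice to start the centroids at randomly chosen data points are all assumptions. The cost E is assumed to be the usual K-means objective, the sum of squared Euclidean distances from each point to its assigned centroid.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (assignment, centroids, cost) for a list of point tuples."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each data point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the cost cannot decrease further
        assignment = new_assignment
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, s in zip(points, assignment) if s == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    cost = sum(squared_distance(p, centroids[s]) for p, s in zip(points, assignment))
    return assignment, centroids, cost
```

Calling `kmeans(points, 3)` with different `seed` values is a simple way to see how the result depends on the initial centroid positions.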


K-means - Summary

  • Fast algorithm: compute distances from data points to centroids

  • Result depends on initial centroids’ position

  • Must preset K

  • Fails for “non-spherical” distributions


Agglomerative Hierarchical Clustering

Initially, each point is its own cluster.

At each step, merge the pair of nearest clusters.

Need to define the distance between the new cluster and the other clusters.

Single Linkage: distance between closest pair.

Complete Linkage: distance between farthest pair.

Average Linkage: average distance between all pairs, or distance between cluster centers.

[Figure: five data points labeled 1 to 5 and the resulting dendrogram; the dendrogram axis shows the distance between joined clusters]

The dendrogram induces a linear ordering of the data points.
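The merging procedure and the three linkage rules can be sketched as below. This is a deliberately naive illustration that recomputes all pairwise distances at each step. The function names, the tuple-of-indices cluster representation, and the default Euclidean distance are assumptions. "Average" here means the average over all pairs, not the distance between cluster centers.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cluster_distance(c1, c2, points, linkage, dist):
    """Distance between two clusters, each given as a tuple of point indices."""
    pair_dists = [dist(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(pair_dists)  # closest pair
    if linkage == "complete":
        return max(pair_dists)  # farthest pair
    if linkage == "average":
        return sum(pair_dists) / len(pair_dists)  # mean over all pairs
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, linkage="average", dist=euclidean):
    """Return the list of merges as (cluster_a, cluster_b, height) tuples."""
    clusters = [(i,) for i in range(len(points))]  # initially each point = cluster
    merges = []
    while len(clusters) > 1:
        # Greedy step: find and merge the pair of nearest clusters.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], points, linkage, dist),
        )
        height = cluster_distance(a, b, points, linkage, dist)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges
```

The heights in the returned merges correspond to the distance between joined clusters on the dendrogram. Concatenating the two clusters of the final merge gives one linear ordering of the data points. Passing a different `dist` function (for example a Manhattan distance) changes the metric without touching the merging logic.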


Hierarchical Clustering - Summary

  • Results depend on distance update method

  • Greedy iterative process

  • NOT robust against noise

  • No inherent measure to identify stable clusters

  • Average Linkage – the most widely used clustering method in gene expression analysis


Nature 2002 breast cancer

Heat map


Cluster both genes and samples

  • Samples should cluster together according to the experimental design

    • Often a way to catch labelling errors or heterogeneity in samples


Epinephrine Treated Rat Fibroblast Cell


Heat map

Correlation coefficient

Normalized across each gene
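The slide does not say which normalization was applied. One common choice before drawing a heat map is to standardize each gene (row) to mean 0 and standard deviation 1 across samples. The sketch below assumes that choice. The function name and the handling of constant rows are illustrative assumptions.

```python
from statistics import mean, pstdev


def standardize_rows(matrix):
    """Scale each gene (row) to mean 0 and standard deviation 1 across samples."""
    scaled = []
    for row in matrix:
        mu = mean(row)
        sigma = pstdev(row)  # population standard deviation of the row
        if sigma == 0:
            # Assumption: a gene with constant expression maps to all zeros.
            scaled.append([0.0 for _ in row])
        else:
            scaled.append([(x - mu) / sigma for x in row])
    return scaled
```

After this step, two genes whose profiles differ only in scale and offset produce identical rows, so the heat-map colors reflect the pattern across samples rather than the absolute expression level.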


Distance Issues

  • Euclidean distance

  • Pearson distance

[Figure: genes g1, g2, g3 and g4]
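The difference between the two distances can be illustrated with a small sketch. It assumes the common convention that Pearson distance is 1 minus the Pearson correlation coefficient. The three profiles are made-up values for illustration only and are not the genes from the slide.

```python
from math import sqrt


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to the magnitude of expression."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson correlation (assumed convention); sensitive to shape only."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / sqrt(var_a * var_b)


# Hypothetical expression profiles across four samples.
low = [1.0, 2.0, 3.0, 4.0]           # rising, low magnitude
high = [10.0, 20.0, 30.0, 40.0]      # same rising shape, ten times larger
reversed_low = [4.0, 3.0, 2.0, 1.0]  # falling, same magnitude as `low`

for name, other in [("high", high), ("reversed_low", reversed_low)]:
    print(name, euclidean_distance(low, other), pearson_distance(low, other))
```

With these profiles, Euclidean distance places `low` closer to `reversed_low` (similar magnitude), while Pearson distance places it closer to `high` (same shape). Which behavior is wanted depends on whether the biological question is about expression level or about co-regulation pattern.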


Exercise

  • Use Average Linkage Algorithm and Manhattan distance.


Issues in Cluster Analysis

  • A lot of clustering algorithms

  • A lot of distance/similarity metrics

  • Which clustering algorithm runs faster and uses less memory?

  • How many clusters after all?

  • Are the clusters stable?

  • Are the clusters meaningful?


Which Clustering Method Should I Use?

  • What is the biological question?

  • Do I have a preconceived notion of how many clusters there should be?

  • How strict do I want to be? Split or join?

  • Can a gene be in multiple clusters?

  • Hard or soft boundaries between clusters?


The End

  • Thank you for taking this course. Bioinformatics is a very diverse and fascinating subject. We hope you all decide to continue your pursuit of it.

  • We will be very glad to answer your emails or schedule appointments to talk about any bioinformatics related questions you might have.

  • We wish you all a wonderful summer break!

