Microarray data analysis
PowerPoint Slideshow about ' Microarray Data Analysis' - zia-velasquez


Microarray Data Analysis

  • Data preprocessing and visualization

  • Supervised learning

    • Machine learning approaches

  • Unsupervised learning

    • Clustering and pattern detection

  • Gene regulatory region prediction based on co-regulated genes

  • Linkage between gene expression data and gene sequence/function databases


Unsupervised learning

  • Supervised methods

    • Can only validate or reject hypotheses

    • Cannot lead to the discovery of unexpected partitions

  • Unsupervised learning

    • No prior knowledge is used

    • Explore the structure of the data on the basis of correlations and similarities



Figure: cluster analysis yields a dendrogram (axis label: T, resolution).

Eytan Domany



Centroid methods – K-means

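Data points at $X_i$, $i = 1, \dots, N$

Centroids at $Y_\nu$, $\nu = 1, \dots, K$

Assign data point $i$ to centroid $\nu$: $S_i = \nu$

Cost $E$:

$$E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \lVert X_i - Y_{S_i} \rVert^2$$

Minimize $E$ over $S_i$, $Y_\nu$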

Eytan Domany
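The right-hand side of the cost did not survive in this transcript. K-means is commonly defined with the sum of squared Euclidean distances between each point and its assigned centroid, and the sketch below assumes that form. The function names and the toy data are hypothetical, chosen only for illustration.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans_cost(points, assignments, centroids):
    """Cost E: sum over points of the squared distance to the assigned centroid.

    assignments[i] plays the role of S_i (index of the centroid for point i).
    """
    return sum(
        squared_distance(x, centroids[s]) for x, s in zip(points, assignments)
    )


# Hypothetical toy data: four 2-D points and two centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]
assignments = [0, 0, 1, 1]

print(kmeans_cost(points, assignments, centroids))  # 4.0
```

Swapping the assignments so that each point goes to the far centroid raises the cost sharply, which is why minimizing E over the assignments pulls each point toward its nearest centroid.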


K-means

  • “Guess” K=3

  • Start with random positions of centroids.

  • Assign each data point to the closest centroid.

  • Move centroids to the center of their assigned points.

  • Iterate until the cost reaches a minimum.

Figure frames: Iteration = 0, 1, 2, 3

Eytan Domany
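The steps listed on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the presenter's code: it assumes squared Euclidean distance, draws the random starting centroids from the data points themselves, and stops when the assignments no longer change. The `init`, `seed`, and `max_iter` parameters and the toy data are assumptions added for the example.

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Return (assignments, centroids, cost) for a list of points."""
    # Start with random positions of centroids (here: k distinct data points),
    # unless explicit starting centroids are passed in `init`.
    if init is None:
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = list(init)
    assignments = None
    for _ in range(max_iter):
        # Assign each data point to the closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once nothing changes: the cost cannot decrease any further.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each centroid to the center of its assigned points.
        for c in range(k):
            members = [p for p, s in zip(points, assignments) if s == c]
            if members:  # a centroid with no points keeps its old position
                centroids[c] = mean_point(members)
    cost = sum(
        squared_distance(p, centroids[s]) for p, s in zip(points, assignments)
    )
    return assignments, centroids, cost


# Hypothetical toy data: three well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
assignments, centroids, cost = kmeans(points, k=3)
print(assignments, cost)
```

With a random start the cluster labels, and sometimes the partition itself, can differ between seeds, so the printed result is not fixed.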


K-means - Summary

  • Fast algorithm: compute distances from data points to centroids

  • Result depends on initial centroids’ position

  • Must preset K

  • Fails for “non-spherical” distributions
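The dependence on the initial centroid positions can be seen on a very small example. The sketch below is an illustration under assumptions, not part of the original slides: it uses squared Euclidean distance, four hypothetical points at the corners of a long rectangle, and a hypothetical helper `kmeans_from` that starts K-means from centroids supplied by the caller.

```python
def sqdist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans_from(points, centroids, max_iter=100):
    """K-means started from the given centroids; returns (assignments, cost)."""
    centroids = [tuple(c) for c in centroids]
    assignments = None
    for _ in range(max_iter):
        # Assign each point to its closest centroid.
        new = [
            min(range(len(centroids)), key=lambda c: sqdist(p, centroids[c]))
            for p in points
        ]
        if new == assignments:
            break
        assignments = new
        # Move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, s in zip(points, assignments) if s == c]
            if members:
                centroids[c] = tuple(
                    sum(v) / len(members) for v in zip(*members)
                )
    cost = sum(sqdist(p, centroids[s]) for p, s in zip(points, assignments))
    return assignments, cost


# Corners of a 4 x 1 rectangle: the natural split is left pair vs right pair.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

print(kmeans_from(points, [(0.0, 0.5), (4.0, 0.5)]))  # ([0, 0, 1, 1], 1.0)
print(kmeans_from(points, [(2.0, 0.0), (2.0, 1.0)]))  # ([0, 1, 0, 1], 16.0)
```

The second start converges to a bottom/top split that no single iteration can improve, even though its cost is sixteen times higher. Running several starts and keeping the lowest-cost result is a common workaround.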


Agglomerative Hierarchical Clustering

  • Initially, each point is its own cluster.

  • At each step, merge the pair of nearest clusters.

  • Need to define the distance between the new cluster and the other clusters:

    • Single Linkage: distance between the closest pair.

    • Complete Linkage: distance between the farthest pair.

    • Average Linkage: average distance between all pairs, or distance between cluster centers.

  • The dendrogram induces a linear ordering of the data points.

Figure: five data points (labeled 1 to 5) and their dendrogram (axis label: distance between joined clusters).

Eytan Domany
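The merge procedure and the three linkage rules on this slide can be sketched as follows. This is a simple, unoptimized illustration that assumes Euclidean distance between points; the function names and the toy data are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_distance(a, b, points, linkage):
    """Distance between clusters a and b, given as lists of point indices."""
    pair_distances = [euclidean(points[i], points[j]) for i in a for j in b]
    if linkage == "single":      # closest pair
        return min(pair_distances)
    if linkage == "complete":    # farthest pair
        return max(pair_distances)
    if linkage == "average":     # average over all pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, linkage="average"):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    merges = []
    while len(clusters) > 1:
        # Find the pair of nearest clusters under the chosen linkage.
        dist, p, q = min(
            (cluster_distance(clusters[p], clusters[q], points, linkage), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        merges.append((clusters[p], clusters[q], dist))
        merged = clusters[p] + clusters[q]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return merges


# Hypothetical toy data: four points on a line.
points = [(0, 0), (1, 0), (3, 0), (7, 0)]
for linkage in ("single", "complete", "average"):
    heights = [d for _, _, d in agglomerate(points, linkage)]
    print(linkage, heights)
```

The merge distances are the heights at which branches join in the dendrogram. On this toy data all three linkages merge the same clusters in the same order, but at different heights.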


Hierarchical Clustering - Summary

  • Results depend on distance update method

  • Greedy iterative process

  • NOT robust against noise

  • No inherent measure to identify stable clusters

  • Average Linkage – the most widely used clustering method in gene expression analysis



Cluster both genes and samples

  • Samples should cluster together based on experimental design

    • Often a way to catch labelling errors or heterogeneity in samples


Epinephrine Treated Rat Fibroblast Cell

  • Heat map: correlation coefficient

  • Heat map: normalized across each gene

Distance Issues

  • Euclidean distance

Figure labels: g1, g2, g3, g4
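The figure for this slide is not recoverable, so the following is only a guess at the kind of issue it illustrated. One frequently discussed concern is that Euclidean distance reacts to the magnitude of expression while a correlation-based distance reacts to the shape of the profile. The sketch compares Euclidean, Manhattan, and one minus Pearson correlation on hypothetical profiles. The profile names are invented and do not correspond to g1 through g4.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def correlation_distance(x, y):
    """1 - Pearson correlation; assumes neither profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


# Hypothetical profiles over four samples.
low = [1.0, 2.0, 3.0, 4.0]
high = [10.0, 20.0, 30.0, 40.0]    # same shape as `low`, ten times larger
reversed_ = [4.0, 3.0, 2.0, 1.0]   # similar magnitude, opposite trend

for name, other in (("high", high), ("reversed", reversed_)):
    print(
        name,
        round(euclidean(low, other), 2),
        manhattan(low, other),
        round(correlation_distance(low, other), 2),
    )
```

Under Euclidean and Manhattan distance `low` is closer to the reversed profile, while under the correlation-based distance it is closer to the scaled one. Which behavior is wanted depends on the biological question.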


Exercise

  • Use Average Linkage Algorithm and Manhattan distance.
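The data for this exercise is not preserved in the transcript. As a sketch of how such an exercise could be worked through, the code below builds a Manhattan distance matrix for hypothetical points and then merges clusters by updating the matrix with the size-weighted average rule, which reproduces average linkage. All names and values are assumptions.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def average_linkage_by_matrix(points):
    """Average linkage via distance-matrix updates.

    Returns the merges as (members_a, members_b, distance) tuples.
    """
    sizes = {i: 1 for i in range(len(points))}   # cluster id -> size
    labels = {i: (i,) for i in sizes}            # cluster id -> member points
    dist = {
        frozenset((i, j)): manhattan(points[i], points[j])
        for i in sizes for j in sizes if i < j
    }
    merges = []
    next_id = len(points)
    while len(sizes) > 1:
        # Pick the closest pair of clusters (ties broken by cluster ids).
        pair = min(dist, key=lambda key: (dist[key], sorted(key)))
        a, b = sorted(pair)
        d_ab = dist.pop(pair)
        merges.append((labels[a], labels[b], d_ab))
        # Distance from the merged cluster to every other cluster k is the
        # size-weighted average of the distances from a and from b to k.
        for k in sizes:
            if k in (a, b):
                continue
            d_ak = dist.pop(frozenset((a, k)))
            d_bk = dist.pop(frozenset((b, k)))
            dist[frozenset((next_id, k))] = (
                sizes[a] * d_ak + sizes[b] * d_bk
            ) / (sizes[a] + sizes[b])
        sizes[next_id] = sizes.pop(a) + sizes.pop(b)
        labels[next_id] = labels.pop(a) + labels.pop(b)
        next_id += 1
    return merges


# Hypothetical points standing in for the missing exercise data.
points = [(0, 0), (1, 1), (4, 0), (6, 3)]
for members_a, members_b, d in average_linkage_by_matrix(points):
    print(members_a, members_b, d)
```

Updating the matrix this way avoids recomputing all pairwise distances after each merge, and it mirrors how the exercise would be done by hand on a small distance table.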



Issues in Cluster Analysis

  • A lot of clustering algorithms

  • A lot of distance/similarity metrics

  • Which clustering algorithm runs faster and uses less memory?

  • How many clusters after all?

  • Are the clusters stable?

  • Are the clusters meaningful?


Which Clustering Method Should I Use?

  • What is the biological question?

  • Do I have a preconceived notion of how many clusters there should be?

  • How strict do I want to be? Split or join?

  • Can a gene be in multiple clusters?

  • Hard or soft boundaries between clusters


The End

  • Thank you for taking this course. Bioinformatics is a very diverse and fascinating subject. We hope you all decide to continue your pursuit of it.

  • We will be very glad to answer your emails or schedule appointments to talk about any bioinformatics related questions you might have.

  • We wish you all a wonderful summer break!

