Microarray Data Analysis





Microarray Data Analysis. Data preprocessing and visualization; supervised learning (machine learning approaches); unsupervised learning (clustering and pattern detection); gene regulatory region prediction based on co-regulated genes.


PowerPoint Slideshow about 'Microarray Data Analysis' - zia-velasquez


Presentation Transcript
Microarray Data Analysis
  • Data preprocessing and visualization
  • Supervised learning
    • Machine learning approaches
  • Unsupervised learning
    • Clustering and pattern detection
  • Gene regulatory region prediction based on co-regulated genes
  • Linkage between gene expression data and gene sequence/function databases
Unsupervised learning
  • Supervised methods
    • Can only validate or reject hypotheses
    • Cannot lead to the discovery of unexpected partitions
  • Unsupervised learning
    • No prior knowledge is used
    • Explore the structure of the data on the basis of correlations and similarities
Cluster Analysis Yields Dendrogram

[Figure: dendrogram; axis label: T (resolution). Credit: Eytan Domany]

Centroid methods – K-means

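Data points at $X_i$, $i = 1, \ldots, N$

Centroids at $Y_\nu$, $\nu = 1, \ldots, K$

Assign data point $i$ to centroid $\nu$: $S_i = \nu$

Cost $E$:

$$E(S_1, S_2, \ldots, S_N; Y_1, \ldots, Y_K) = \sum_{i=1}^{N} \lVert X_i - Y_{S_i} \rVert^2$$

Minimize $E$ over $S_i$, $Y_\nu$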
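As a minimal sketch of the cost being minimized, assuming the usual K-means choice of the sum of squared Euclidean distances between each data point and its assigned centroid, the function below evaluates E for a given assignment. The name `kmeans_cost` and the toy data are illustrative assumptions, not part of the original slides.

```python
def kmeans_cost(points, centroids, assignment):
    """Sum of squared Euclidean distances from each point to its assigned centroid.

    points:     list of coordinate tuples X_i
    centroids:  list of coordinate tuples Y_k
    assignment: list S where S[i] is the index of the centroid assigned to point i
    """
    total = 0.0
    for x, s in zip(points, assignment):
        y = centroids[s]
        total += sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return total


# Hypothetical toy data: three 2-D points and two centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centroids = [(0.0, 1.0), (10.0, 0.0)]
assignment = [0, 0, 1]

cost = kmeans_cost(points, centroids, assignment)
print(cost)  # 1 + 1 + 0 = 2.0
```

Minimizing this quantity over both the assignments and the centroid positions is what the iterative procedure alternates between.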


K-means
  • “Guess” K=3
  • Start with random positions of centroids (Iteration = 0)
  • Assign each data point to the closest centroid (Iteration = 1)
  • Move centroids to the center of their assigned points (Iteration = 2)
  • Iterate until the cost is minimal (Iteration = 3)
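The K-means steps (random start, assign, move, iterate) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the presenter's implementation: the function name `kmeans`, the choice of K distinct data points as the random starting centroids, the stopping rule (assignments no longer change), and the toy data are all assumptions.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Start with random centroid positions (here: k distinct data points).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each data point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing, so further iterations change nothing
        assignment = new_assignment
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, s in zip(points, assignment) if s == c]
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[c] = tuple(
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                )
    return centroids, assignment


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
```

On data this well separated the two groups are recovered whichever points are drawn as the start; on less clear-cut data the result can depend on the initial centroids, so it is common to rerun with several seeds.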

K-means - Summary
  • Fast algorithm: compute distances from data points to centroids
  • Result depends on initial centroids’ position
  • Must preset K
  • Fails for “non-spherical” distributions
Agglomerative Hierarchical Clustering
  • Initially, each point is its own cluster
  • At each step, merge the pair of nearest clusters
  • Need to define the distance between the new cluster and the other clusters:
    • Single Linkage: distance between closest pair
    • Complete Linkage: distance between farthest pair
    • Average Linkage: average distance between all pairs, or distance between cluster centers
  • The dendrogram induces a linear ordering of the data points

[Figure: data points labeled 1 to 5 and the corresponding dendrogram; axis: distance between joined clusters]
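The merge procedure and the three linkage rules can be illustrated with a small, unoptimized sketch. It assumes Euclidean distance between points and takes "average linkage" to mean the average over all cross-cluster pairs; the function names and the 1-D toy data are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters, each given as a tuple of point indices."""
    pair_distances = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair
        return min(pair_distances)
    if linkage == "complete":    # farthest pair
        return max(pair_distances)
    if linkage == "average":     # average over all pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, linkage="average"):
    """Return the list of merges as (cluster_a, cluster_b, merge_distance)."""
    clusters = [(i,) for i in range(len(points))]  # initially each point is a cluster
    merges = []
    while len(clusters) > 1:
        # Greedy step: merge the pair of nearest clusters.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], points, linkage),
        )
        height = cluster_distance(a, b, points, linkage)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges


# Hypothetical 1-D data: two tight pairs far apart from each other.
points = [(0.0,), (1.0,), (5.0,), (6.5,)]
for linkage in ("single", "complete", "average"):
    print(linkage, agglomerate(points, linkage)[-1])
```

The merge distances returned here are the heights at which branches join in the dendrogram. In this toy example the final merge happens at 4.0 for single linkage, 6.5 for complete linkage and 5.25 for average linkage, which shows how the result depends on the distance update rule.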

Hierarchical Clustering - Summary
  • Results depend on distance update method
  • Greedy iterative process
  • NOT robust against noise
  • No inherent measure to identify stable clusters
  • Average Linkage – the most widely used clustering method in gene expression analysis
Cluster both genes and samples
  • Samples should cluster together according to the experimental design
    • Often a way to catch labelling errors or heterogeneity in samples
Heat Map
  • Normalized across each gene

[Figure: heat map; label: correlation coeff.]

Distance Issues
  • Euclidean distance
  • Pearson distance

[Figure: genes g1, g2, g3 and g4]
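To make the contrast between the two distances concrete, here is a small sketch. It assumes the common definition of Pearson distance as one minus the Pearson correlation coefficient; the three expression profiles are made-up values and are not the g1 to g4 of the figure.

```python
import math


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)


# Hypothetical expression profiles across four samples.
profile_a = [1.0, 2.0, 3.0, 4.0]      # rising
profile_b = [10.0, 20.0, 30.0, 40.0]  # same shape as a, ten times the magnitude
profile_c = [4.0, 3.0, 2.0, 1.0]      # same magnitude as a, opposite trend

for name, other in (("b", profile_b), ("c", profile_c)):
    print(name,
          round(euclidean_distance(profile_a, other), 3),
          round(pearson_distance(profile_a, other), 3))
```

Euclidean distance puts `profile_a` much closer to `profile_c` (similar magnitude) than to `profile_b`, while Pearson distance does the opposite because it compares only the shape of the profiles. Which behavior is wanted depends on whether absolute expression level or expression pattern matters for the question at hand.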

Exercise
  • Use Average Linkage Algorithm and Manhattan distance.
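The data for this exercise is not part of the transcript, so the following only sketches the two ingredients on made-up points: the Manhattan (city-block) distance between two points, and the average-linkage distance between two clusters, assumed here to be the mean over all cross-cluster pairs. The helper names and coordinates are hypothetical.

```python
def manhattan(a, b):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def average_linkage(cluster_a, cluster_b):
    """Mean Manhattan distance over all pairs (one point from each cluster)."""
    total = sum(manhattan(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Hypothetical clusters of 2-D points.
cluster_a = [(0, 0), (1, 1)]
cluster_b = [(4, 0), (5, 2)]

# Pairwise distances: 4, 7, 4, 5, so the average is 20 / 4 = 5.0
d = average_linkage(cluster_a, cluster_b)
print(d)
```

Working the exercise by hand then amounts to repeating this calculation for every pair of current clusters and merging the pair with the smallest value.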
Issues in Cluster Analysis
  • A lot of clustering algorithms
  • A lot of distance/similarity metrics
  • Which clustering algorithm runs faster and uses less memory?
  • How many clusters after all?
  • Are the clusters stable?
  • Are the clusters meaningful?
Which Clustering Method Should I Use?
  • What is the biological question?
  • Do I have a preconceived notion of how many clusters there should be?
  • How strict do I want to be? Split or join?
  • Can a gene be in multiple clusters?
  • Hard or soft boundaries between clusters
The End
  • Thank you for taking this course. Bioinformatics is a very diverse and fascinating subject. We hope you all decide to continue your pursuit of it.
  • We will be very glad to answer your emails or schedule appointments to talk about any bioinformatics-related questions you might have.
  • We wish you all a wonderful summer break!