
Lab 3 DAVID, Clustering and Classification



Yang Li

Lin Liu

Feb 10 & Feb 11, 2014

http://david.abcc.ncifcrf.gov/summary.jsp

Biological processes

Molecular function

Cellular component

GSEA http://www.broadinstitute.org/gsea/index.jsp

GSA http://statweb.stanford.edu/~tibs/GSA/

GOrilla http://cbl-gorilla.cs.technion.ac.il/

Panther http://www.pantherdb.org/pathway/

Finding structures

Key: distance metric/divergence

- Repeatedly
- Merge two nodes (either a gene or a cluster) that are closest to each other
- Re-calculate the distance from newly formed node to all other nodes
- Branch length represents distance

- Linkage: distance from newly formed node to all other nodes

Single: minimum of pairwise distances

Complete: maximum of pairwise distances

Average: mean of pairwise distances
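To make the effect of the linkage rule concrete, here is a small sketch on toy data. The five values and the names `x`, `hc.single`, `hc.complete` and `hc.average` are made up for illustration and are not part of the lab data set.

```r
# Toy data: five "genes" measured in one sample (hypothetical values)
x <- matrix(c(0, 1, 3, 7, 12), ncol = 1,
            dimnames = list(paste0("g", 1:5), "expr"))
D <- dist(x, method = "euclidean")

# Same distance matrix, three linkage rules
hc.single   <- hclust(D, method = "single")    # minimum pairwise distance
hc.complete <- hclust(D, method = "complete")  # maximum pairwise distance
hc.average  <- hclust(D, method = "average")   # mean pairwise distance

# Merge heights (the branch lengths in the dendrogram) depend on the linkage
hc.single$height
hc.complete$height
hc.average$height
```

Calling `plot(hc.average)` draws the dendrogram; the merge order is the same for all three rules here, but the heights differ.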

- Disjoint groups
- From hierarchical clustering:
- Cut the dendrogram from hierarchical clustering with a horizontal line
- By varying the cut height, we can produce any number of clusters
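Cutting a tree into disjoint groups can be sketched with `cutree()` from base R. The toy values and the object names below are assumptions for illustration only.

```r
# Toy data (hypothetical values), clustered with average linkage
x <- matrix(c(0, 1, 3, 7, 12), ncol = 1,
            dimnames = list(paste0("g", 1:5), "expr"))
hc <- hclust(dist(x), method = "average")

# Either ask for a number of clusters ...
groups.k <- cutree(hc, k = 2)
# ... or cut at a given height; a lower cut gives more clusters
groups.h <- cutree(hc, h = 3)

groups.k
groups.h
```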

- Choose K centroids at random
- Assign object i to closest centroid
- Recalculate centroid based on current cluster assignment
- Repeat until assignments stabilize

[Figure: scatter plots with axes Expression in Sample1 and Expression in Sample2, showing the cluster assignments at Iteration = 0, 1, 2 and 3]
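The K-means steps can be written out in a few lines of R. This is a minimal sketch, not the algorithm used inside `kmeans()` (which defaults to Hartigan-Wong); `simple.kmeans` is a hypothetical helper and the data are simulated.

```r
# Minimal K-means (Lloyd's algorithm) sketch; simple.kmeans is a hypothetical helper
simple.kmeans <- function(x, k, max.iter = 100) {
  # Choose K centroids at random (here: K distinct rows of x)
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max.iter)) {
    # Assign each object to its closest centroid (squared Euclidean distance)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    new.assignment <- max.col(-d2, ties.method = "first")
    # Stop once the assignments no longer change
    if (all(new.assignment == assignment)) break
    assignment <- new.assignment
    # Recalculate each centroid as the mean of its current members
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids, iterations = iter)
}

# Simulated data: two well-separated groups of 10 objects each
set.seed(42)
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
fit <- simple.kmeans(x, k = 2)
table(fit$cluster, rep(c("A", "B"), each = 10))
```

Because the starting centroids are random, the cluster labels (1 or 2) may swap between runs, and on less cleanly separated data different starts can give different final clusters.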

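# Euclidean distance between samples (columns of ld)
D = dist(t(ld), method = "euclidean")

# Hierarchical clustering with average linkage
hc = hclust(D, method = "average")

# K-means clustering of the samples into 2 clusters
kmean.cluster <- kmeans(t(ld), 2)

# Cluster assignment of each sample
kmean.cluster$cluster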

Euclidean distance

Hamming distance (binary)

Correlation (Pearson r ranges over [-1, 1]; the distance 1 - |r| ranges over [0, 1])

Mahalanobis distance
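The first three measures can be computed directly in base R. The vectors below are made up for illustration, and using 1 - r as the correlation distance is one common choice among several.

```r
# Hypothetical expression profiles
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 9)

# Euclidean distance; dist(rbind(x, y)) gives the same value
euclid <- sqrt(sum((x - y)^2))

# Hamming distance between two binary vectors: number of differing positions
a <- c(1, 0, 1, 1, 0)
b <- c(1, 1, 0, 1, 0)
hamming <- sum(a != b)

# Correlation-based distance: 0 for perfectly correlated profiles
cor.dist <- 1 - cor(x, y)

c(euclid = euclid, hamming = hamming, cor.dist = cor.dist)
```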

- RNA-Seq example: (1, 0, 0) -> (0, q1, q2)
- Jensen-Shannon divergence
- JSD(P, Q) = ½ (D(P||M) + D(Q||M))
- D(A||B) is Kullback-Leibler divergence
- M = ½ (P + Q)
- Used in RNA-Seq analysis
- Problems with JSD: highly abundant rows will dominate the analysis; it is not a metric (consider taking the square root, which is a metric)
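The JSD definition translates almost line by line into R. This is a sketch using log base 2 (so the result is in bits); `kl.div` and `jsd` are hypothetical helper names, and the inputs are assumed to be non-negative vectors of equal length.

```r
# Kullback-Leibler divergence D(p || q) in bits; terms with p = 0 contribute 0
kl.div <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log2(p[keep] / q[keep]))
}

# Jensen-Shannon divergence: average KL divergence to the mixture M
jsd <- function(p, q) {
  p <- p / sum(p)          # normalize to probability vectors
  q <- q / sum(q)
  m <- 0.5 * (p + q)
  0.5 * (kl.div(p, m) + kl.div(q, m))
}

# A gene expressed only in tissue 1 vs. one expressed only in tissues 2 and 3
jsd(c(1, 0, 0), c(0, 0.5, 0.5))   # disjoint support gives the maximum, 1 bit
jsd(c(1, 2, 3), c(1, 2, 3))       # identical profiles give 0
```

Unlike the KL divergence, JSD stays finite when one distribution has zeros where the other does not, because M is positive wherever P or Q is.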

- Mahalanobis distance
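- Rectify the problem of JSD by normalizing using the entire covariance matrix S: $d(x, y) = \sqrt{(x - y)^\top S^{-1} (x - y)}$
- With a diagonal covariance matrix (per-dimension variances $s_i^2$): $d(x, y) = \left(\sum_i (x_i - y_i)^2 / s_i^2\right)^{1/2}$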
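A short sketch comparing the full-covariance and variance-only forms, using `mahalanobis()` from base R (which returns the squared distance). The simulated matrix `X` and all object names are assumptions for illustration.

```r
# Simulated data: two features on very different scales
set.seed(1)
X <- cbind(a = rnorm(50, sd = 1), b = rnorm(50, sd = 10))
S <- cov(X)
x <- X[1, ]
y <- X[2, ]

# Full covariance matrix; mahalanobis() returns the SQUARED distance
d.full <- sqrt(mahalanobis(x, center = y, cov = S))

# The same quantity written out as a quadratic form
d.manual <- sqrt(drop(t(x - y) %*% solve(S) %*% (x - y)))

# Variance-only version: each squared difference is divided by s_i^2
d.diag <- sqrt(sum((x - y)^2 / diag(S)))

c(full = d.full, manual = d.manual, diagonal = d.diag)
```

The diagonal version ignores correlations between features, so it matches the full version only when the off-diagonal covariances are zero.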

MIC (Reshef, Reshef et al., Science 2011): Maximal Information Coefficient

Principal Component Analysis

Kernel PCA

LDA

Isomap

Laplacian eigenmap

Manifold learning

…

pc.cr= prcomp(t(d[genes.set_a,]))

summary(pc.cr)

biplot(pc.cr)
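To see what `prcomp()` returns, here is a sketch on simulated data with one dominant direction of variation. The matrix `X` and the names `pc`, `var.explained` and `scores` are made up for illustration.

```r
# Simulated data: 12 samples (rows), 3 genes (columns); g1 and g2 share a latent signal
set.seed(1)
n <- 12
latent <- rnorm(n)
X <- cbind(g1 = latent + rnorm(n, sd = 0.1),
           g2 = 2 * latent + rnorm(n, sd = 0.1),
           g3 = rnorm(n, sd = 0.1))

pc <- prcomp(X)   # columns are centered by default, not scaled

# Proportion of variance explained by each component (what summary(pc) reports)
var.explained <- pc$sdev^2 / sum(pc$sdev^2)
round(var.explained, 3)

# The scores in pc$x are the centered data projected onto the loadings
scores <- scale(X, center = TRUE, scale = FALSE) %*% pc$rotation
```

Here nearly all of the variance falls on the first component, since g1 and g2 move together and g3 is only noise.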

Key difference between LDA and PCA?

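- R code:
- library(MASS)
- lda.fit = lda(t(d[genes.set_a,]), grouping=c(rep('Normal',4), rep('Cancer',8)), subset=1:12)  # train on samples 1-12; named lda.fit so it does not mask lda()
- predict(lda.fit, t(d[genes.set_a, 13:14]))  # predict the class of samples 13-14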

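# Euclidean distance between samples (columns of d)
D = dist(t(d), method = "euclidean")

# Classical multidimensional scaling into 2 dimensions
mds = cmdscale(D, k = 2)

plot(mds[,1], mds[,2], type="p", main="Clustering using MDS", xlab = 'mds1', ylab = 'mds2')

# Label each point with its sample name
text(mds, row.names(mds))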
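One useful fact to check numerically: classical MDS on Euclidean distances gives the same coordinates as the PCA scores, up to the sign of each axis. The sketch below uses simulated data; `X`, `mds.toy` and `pc.toy` are hypothetical names.

```r
# Simulated data: 12 samples (rows), 3 genes (columns)
set.seed(1)
n <- 12
latent <- rnorm(n)
X <- cbind(g1 = latent + rnorm(n, sd = 0.1),
           g2 = 2 * latent + rnorm(n, sd = 0.1),
           g3 = rnorm(n, sd = 0.1))

mds.toy <- cmdscale(dist(X, method = "euclidean"), k = 2)
pc.toy  <- prcomp(X)

# Largest difference between the two embeddings after ignoring axis signs
max(abs(abs(mds.toy) - abs(pc.toy$x[, 1:2])))
```

The equivalence holds only for Euclidean distances; with another distance (for example a correlation-based one), MDS and PCA generally give different pictures.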

Classification is equivalent to prediction with categorical (here binary) outcomes

Machine learning cares more about prediction than statistics does

Machine learning is statistics with a focus on prediction, scalability and high dimensional problems

But clustering and classification are interconnected

library('e1071')

model1 = svm(t(d[,1:12]), factor(c(rep('Normal',4), rep('Cancer',8))), type='C-classification', kernel='linear')

predict(model1,t(d[,13:14]))

#KNN k = 1

class::knn(t(ld[,1:12]), t(ld[,13:14]), c(rep('Normal',4), rep('Cancer',8)), k=1)

#KNN k = 3

class::knn(t(ld[,1:12]), t(ld[,13:14]), c(rep('Normal',4), rep('Cancer',8)), k=3)
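What `class::knn` does can be sketched in base R as a majority vote among the k nearest training samples. `simple.knn` is a hypothetical helper and the training points are made up. Unlike `class::knn`, which breaks ties at random, this sketch breaks ties by taking the first factor level.

```r
# Minimal k-nearest-neighbour sketch; simple.knn is a hypothetical helper
simple.knn <- function(train, test, cl, k = 1) {
  cl <- factor(cl)
  apply(test, 1, function(z) {
    # Euclidean distance from the test sample to every training sample
    d <- sqrt(rowSums(sweep(train, 2, z)^2))
    nearest <- order(d)[seq_len(k)]
    # Majority vote among the k nearest neighbours
    votes <- table(cl[nearest])
    names(votes)[which.max(votes)]
  })
}

# Toy training set: 4 'Normal' samples near (0, 0), 4 'Cancer' samples near (5, 5)
train <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1),
               c(5, 5), c(5, 6), c(6, 5), c(6, 6))
labels <- c(rep('Normal', 4), rep('Cancer', 4))
test <- rbind(c(0.5, 0.2), c(5.5, 5.8))

pred <- simple.knn(train, test, labels, k = 3)
pred
```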

Borrowed from Manolis Kellis’s course slides


For the graduate-level question, think about removing batch effects using PCA

For the ComBat software, search for “sva bioconductor” on Google (ComBat is part of the Bioconductor sva package).