Lab 3: DAVID, Clustering and Classification

Yang Li

Lin Liu

Feb 10 & Feb 11, 2014


DAVID (gene set analysis)

http://david.abcc.ncifcrf.gov/summary.jsp

Biological process

Molecular function

Cellular component


Other gene set analysis tools

GSEA http://www.broadinstitute.org/gsea/index.jsp

GSA http://statweb.stanford.edu/~tibs/GSA/

GOrilla http://cbl-gorilla.cs.technion.ac.il/

Panther http://www.pantherdb.org/pathway/


Clustering

Finding structure in unlabeled data

Key ingredient: the distance metric or divergence


Hierarchical Clustering

  • Repeat until one cluster remains:

    • Merge the two closest nodes (each node is either a gene or a cluster)

    • Recalculate the distance from the newly formed node to all other nodes

  • Branch length in the dendrogram represents distance

  • Linkage: how the distance from a newly formed node to all other nodes is defined


Hierarchical Clustering Linkage

Single: minimum pairwise distance between members of the two clusters

Complete: maximum pairwise distance

Average: mean of all pairwise distances
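In R, the linkage is selected through hclust's method argument; a quick hedged illustration, reusing the distance object D computed later in this lab:

# Same distances, three different linkages (D is a 'dist' object)
hc.single <- hclust(D, method = "single")      # single linkage
hc.complete <- hclust(D, method = "complete")  # complete linkage
hc.average <- hclust(D, method = "average")    # average linkage (UPGMA)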


Partitional Clustering

  • Disjoint groups

  • From hierarchical clustering:

    • Cut the dendrogram at a chosen height

    • By varying the cut height, we can produce any number of clusters (see the sketch after this list)
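A minimal sketch of the cut: assuming hc is the hclust object fit later in this lab, cutree produces the disjoint groups:

# Cut the dendrogram into disjoint clusters
cutree(hc, k = 3)    # request exactly 3 clusters
cutree(hc, h = 1.5)  # or cut all branches at height 1.5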


K-means Algorithm

  • Choose K centroids at random

  • Assign each object to its closest centroid

  • Recalculate each centroid from the current cluster assignment

  • Repeat until the assignments stabilize

[Scatter plots of Expression in Sample 1 vs. Expression in Sample 2, showing the cluster assignments at iterations 0 through 3]
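To make the four steps concrete, here is a minimal from-scratch sketch (illustration only: it ignores empty-cluster edge cases; in practice use stats::kmeans, as later in this lab):

# x: objects in rows; k: number of clusters
kmeans.sketch <- function(x, k, max.iter = 100) {
  # Step 1: choose K centroids at random
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- integer(nrow(x))
  for (iter in seq_len(max.iter)) {
    # Step 2: assign each object to its closest centroid
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k, drop = FALSE]
    new.assignment <- apply(d, 1, which.min)
    # Step 4: stop once the assignments stabilize
    if (all(new.assignment == assignment)) break
    assignment <- new.assignment
    # Step 3: recalculate each centroid from its current cluster
    for (j in seq_len(k))
      centroids[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
  }
  list(cluster = assignment, centers = centroids)
}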


Let’s look at the data first



Hierarchical clustering

# Euclidean distances between samples (the columns of the log-expression matrix ld)
D = dist(t(ld), method = "euclidean")

# Agglomerative clustering with average linkage; plot(hc) draws the dendrogram
hc = hclust(D, method = "average")


K-Means clustering (Mixture model)

# Partition the samples (columns of ld) into 2 clusters
kmean.cluster <- kmeans(t(ld), centers = 2)

# Cluster assignment of each sample
kmean.cluster$cluster


Distance metric

Euclidean distance

Hamming distance (for binary data)

Correlation-based distance, e.g. 1 − r (the correlation r ranges over [−1, 1])

Mahalanobis distance (see the sketch below for computing each of these in R)
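A hedged sketch of computing each listed distance in base R, on a hypothetical toy matrix x with objects in rows (note that mahalanobis() returns squared distances):

x <- matrix(rnorm(30), nrow = 10)        # 10 objects, 3 features

dist(x, method = "euclidean")            # Euclidean distance

hamming <- function(a, b) sum(a != b)    # Hamming distance on binary vectors
hamming(x[1, ] > 0, x[2, ] > 0)

as.dist(1 - cor(t(x)))                   # correlation-based distance 1 - r, in [0, 2]

mahalanobis(x, colMeans(x), cov(x))      # squared Mahalanobis distance to the centroid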


How to choose distance: context specific

  • RNA-Seq example: (1, 0, 0) -> (0, q1, q2)

  • Jensen-Shannon divergence

    • JSD(P, Q) = ½ (D(P||M) + D(Q||M))

    • D(A||B) is Kullback-Leibler divergence

    • M = ½ (P + Q)

    • Used in RNA-Seq analysis

    • Problems with JSD: highly abundant rows dominate the analysis, and JSD itself is not a metric (consider taking its square root, which is a metric); see the sketch after this list


  • Mahalanobis distance

    • Rectifies the dominance problem of JSD by normalizing with the entire covariance matrix: d(x, y) = ((x − y)ᵀ Σ⁻¹ (x − y))^(1/2)

    • With a diagonal Σ this reduces to d(x, y) = (Σᵢ (xᵢ − yᵢ)² / sᵢ²)^(1/2)
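A minimal sketch of JSD in R, assuming p and q are probability vectors summing to 1 (zero entries are handled by the 0 · log 0 = 0 convention):

kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))  # Kullback-Leibler D(A||B)

jsd <- function(p, q) {
  m <- (p + q) / 2  # the mixture M
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

jsd(c(0.5, 0.3, 0.2), c(0.2, 0.3, 0.5))  # not a metric itself; sqrt(jsd(...)) is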


Nonparametric correlation

MIC, the Maximal Information Coefficient (Reshef et al., Science, 2011)


Dimension reduction

Principal Component Analysis

Kernel PCA

LDA (Linear Discriminant Analysis)

Isomap

Laplacian eigenmap

Manifold learning


Principal Component Analysis

# PCA of the samples, using the genes in set A
pc.cr = prcomp(t(d[genes.set_a, ]))

# Proportion of variance explained by each principal component
summary(pc.cr)

# Joint plot of samples and gene loadings on the first two PCs
biplot(pc.cr)


Fisher’s LDA

Key difference between LDA and PCA? LDA is supervised: it uses class labels to find the directions that best separate the classes, whereas PCA is unsupervised and finds the directions of maximal variance.


Fisher’s LDA

  • R code:

    • library(MASS)

    • # fit LDA on the 12 training samples (avoid naming the object 'lda', which would shadow MASS::lda)

    • lda.fit = lda(t(d[genes.set_a, ]), grouping = c(rep('Normal', 4), rep('Cancer', 8)), subset = 1:12)

    • predict(lda.fit, t(d[genes.set_a, 13:14]))  # classify the held-out samples 13-14


Multidimensional scaling

# Euclidean distances between samples
D = dist(t(d), method = "euclidean")

# Classical multidimensional scaling down to k = 2 dimensions
mds = cmdscale(D, k = 2)

plot(mds[, 1], mds[, 2], type = "p", main = "Clustering using MDS", xlab = "mds1", ylab = "mds2")

text(mds[, 1], mds[, 2], labels = row.names(mds))


Classification

Classification is supervised prediction with categorical (often binary) outcomes

Machine learning can be viewed as statistics with a stronger emphasis on prediction, scalability, and high-dimensional problems

Nonetheless, clustering (unsupervised) and classification (supervised) are closely interconnected



SVM

# Support vector machine from the e1071 package
library('e1071')

# Train a linear-kernel C-classification SVM on samples 1-12
model1 = svm(t(d[, 1:12]), c(rep('Normal', 4), rep('Cancer', 8)), type = 'C', kernel = 'linear')

# Classify the held-out samples 13-14
predict(model1, t(d[, 13:14]))


K-Nearest Neighbor

# KNN with k = 1: each test sample takes the label of its single nearest training sample
class::knn(t(ld[, 1:12]), t(ld[, 13:14]), c(rep('Normal', 4), rep('Cancer', 8)), k = 1)

# KNN with k = 3: majority vote among the 3 nearest training samples
class::knn(t(ld[, 1:12]), t(ld[, 13:14]), c(rep('Normal', 4), rep('Cancer', 8)), k = 3)


Naïve Bayes

Borrowed from Manolis Kellis’s course slides (a sequence of figure slides, not reproduced here)
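Since the borrowed slides are figures, here is a hedged sketch of running Naïve Bayes on the same lab data: the e1071 package (already loaded for SVM above) provides naiveBayes, which fits one Gaussian per gene per class and assumes genes are conditionally independent given the class:

library('e1071')

labels <- factor(c(rep('Normal', 4), rep('Cancer', 8)))

# Fit on the 12 training samples
nb.model <- naiveBayes(t(d[genes.set_a, 1:12]), labels)

# Predict the class of the held-out samples 13-14
predict(nb.model, t(d[genes.set_a, 13:14]))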


Hint for HW1 Problem 2

For the graduate-level question, think about removing batch effects using PCA (a sketch follows below)

For the ComBat software, look for the Bioconductor sva package (e.g. search "sva bioconductor" on Google)
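A hedged sketch of the PCA idea, assuming the batch effect is captured by the first principal component (check this first, e.g. by verifying that PC1 separates the batches):

# d is the genes-by-samples expression matrix from the lab
pc <- prcomp(t(d), center = TRUE)

# Reconstruct the data without PC1, then restore the column means
recon <- pc$x[, -1] %*% t(pc$rotation[, -1])
d.nobatch <- t(sweep(recon, 2, pc$center, "+"))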

