
Lab 3: DAVID, Clustering and Classification

Yang Li

Lin Liu

Feb 10 & Feb 11, 2014



DAVID (gene set analysis)

http://david.abcc.ncifcrf.gov/summary.jsp

GO annotation categories reported by DAVID:

Biological process

Molecular function

Cellular component



Other gene set analysis tools

GSEA http://www.broadinstitute.org/gsea/index.jsp

GSA http://statweb.stanford.edu/~tibs/GSA/

GOrilla http://cbl-gorilla.cs.technion.ac.il/

Panther http://www.pantherdb.org/pathway/



Clustering

Goal: finding structure (groups of similar genes or samples) in the data

Key ingredient: the distance metric / divergence



Hierarchical Clustering

  • Repeatedly

    • Merge the two nodes (each either a gene or a cluster) that are closest to each other

    • Re-calculate the distance from the newly formed node to all other nodes

    • Branch length represents distance

  • Linkage: how the distance from the newly formed node to all other nodes is defined



Hierarchical Clustering Linkage

Single: minimum pairwise distance between the two clusters

Complete: maximum pairwise distance

Average: mean of pairwise distances
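
A minimal sketch comparing the three linkage rules in R, assuming the genes-by-samples matrix ld used later in this lab:

D <- dist(t(ld), method = "euclidean")
plot(hclust(D, method = "single"))      # single linkage
plot(hclust(D, method = "complete"))    # complete linkage
plot(hclust(D, method = "average"))     # average linkage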



Partitional Clustering

  • Disjoint groups

  • From hierarchical clustering:

    • Cut the dendrogram from hierarchical clustering at a chosen height

    • By varying the cut height, we can produce an arbitrary number of clusters (see the cutree sketch below)
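
A minimal sketch with cutree, assuming the hclust object hc built later in this lab (the cut height of 10 is hypothetical):

clusters.k2 <- cutree(hc, k = 2)    # cut the dendrogram into exactly 2 clusters
clusters.h  <- cutree(hc, h = 10)   # or cut at a chosen height
table(clusters.k2)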



K-means Algorithm

  • Choose K centroids at random

[Scatter plot: expression in Sample 1 vs. expression in Sample 2; iteration = 0]



K-means Algorithm

  • Choose K centroids at random

  • Assign object i to closest centroid

[Scatter plot: iteration = 1]



K-means Algorithm

  • Choose K centroids at random

  • Assign object i to closest centroid

  • Recalculate centroid based on current cluster assignment

[Scatter plot: iteration = 2]



K-means Algorithm

  • Choose K centroids at random

  • Assign object i to closest centroid

  • Recalculate centroid based on current cluster assignment

  • Repeat until assignments stabilize (see the sketch below)

[Scatter plot: iteration = 3]
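
A minimal from-scratch sketch of these steps on a hypothetical two-dimensional matrix X with K = 2; in practice use kmeans(), as shown later in this lab:

set.seed(1)
X <- matrix(rnorm(100), ncol = 2)                    # 50 points in 2 dimensions
K <- 2
centroids <- X[sample(nrow(X), K), , drop = FALSE]   # choose K centroids at random
for (iter in 1:10) {
  # assign each point to its closest centroid (squared Euclidean distance)
  d2 <- sapply(1:K, function(k) rowSums(sweep(X, 2, centroids[k, ])^2))
  cluster <- max.col(-d2)
  # recalculate each centroid from its current cluster members
  centroids <- t(sapply(1:K, function(k) colMeans(X[cluster == k, , drop = FALSE])))
}
table(cluster)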



Let’s look at the data first



Some review



Hierarchical clustering

# ld: log-expression matrix with genes in rows; cluster the samples (columns)

D=dist(t(ld), method="euclidean")

hc=hclust(D, method="average")

plot(hc)   # dendrogram of the samples



K-Means clustering (Mixture model)

# K-means with K = 2 on the samples; the initial centroids are random, so set a seed

set.seed(1)

kmean.cluster <- kmeans(t(ld), 2)

kmean.cluster$cluster   # cluster assignment for each sample



Distance metric

Euclidean distance

Hamming distance (binary)

Correlation-based distance, e.g. 1 − r (the correlation r itself ranges over [−1, 1])

Mahalanobis distance
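
A minimal sketch of the first three distances on hypothetical vectors (Mahalanobis distance is sketched after its own slide below):

x <- c(1, 2, 3, 4); y <- c(2, 2, 5, 1)
sqrt(sum((x - y)^2))                    # Euclidean distance
sum(c(1, 0, 1, 1) != c(1, 1, 1, 0))     # Hamming distance between two binary vectors
1 - cor(x, y)                           # correlation-based distance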



How to choose distance: context specific

  • RNA-Seq example: (1, 0, 0) -> (0, q1, q2)

  • Jensen-Shannon divergence

    • JSD(P, Q) = ½ (D(P||M) + D(Q||M))

    • D(A||B) is Kullback-Leibler divergence

    • M = ½ (P + Q)

    • Used in RNA-Seq analysis

    • Problems with JSD? Highly abundant rows dominate the analysis; JSD itself is not a metric (its square root is, so consider taking the square root); see the R sketch below
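
A minimal sketch of JSD for two probability vectors in R (hypothetical helper functions, not from the lab code):

kl  <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))   # Kullback-Leibler divergence
jsd <- function(p, q) { m <- (p + q) / 2; (kl(p, m) + kl(q, m)) / 2 }
p <- c(0.5, 0.3, 0.2); q <- c(0.1, 0.3, 0.6)
jsd(p, q)
sqrt(jsd(p, q))   # the square root of JSD is a proper metric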



  • Mahalanobis distance

    • Rectifies the problem of JSD (dominance of highly abundant rows) by normalizing with the entire covariance matrix

    • d(x, y) = sqrt( (x − y)^T Σ^(−1) (x − y) ); with a diagonal Σ this reduces to d(x, y) = sqrt( Σ_i (x_i − y_i)^2 / s_i^2 ) (see the R sketch below)
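
A minimal sketch using the base-R mahalanobis() function on a hypothetical observation matrix X (rows = observations):

set.seed(1)
X  <- matrix(rnorm(300), ncol = 3)                        # 100 observations, 3 variables
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))  # squared distances to the centroid
head(sqrt(d2))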



Nonparametric correlation

MIC, the Maximal Information Coefficient (Reshef, Reshef, et al., Science 2011)
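
MIC itself is not in base R (an external package would be needed); as a minimal related sketch, a rank-based nonparametric correlation on hypothetical data:

set.seed(1)
x <- rnorm(100)
y <- exp(x) + rnorm(100, sd = 0.1)   # monotone but nonlinear relationship
cor(x, y, method = "pearson")        # typically lower here
cor(x, y, method = "spearman")       # rank-based (nonparametric) correlation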



Dimension reduction

Principal Component Analysis

Kernel PCA

LDA

Isomap

Laplacian eigenmap

Manifold learning



Principal Component Analysis

# PCA on the samples, restricted to the genes in genes.set_a (d: genes-by-samples matrix)

pc.cr = prcomp(t(d[genes.set_a, ]))

summary(pc.cr)   # variance explained by each principal component

biplot(pc.cr)    # samples and gene loadings on the first two PCs



Fisher’s LDA

Key difference between LDA and PCA? LDA is supervised: it uses class labels to find directions that best separate the classes; PCA is unsupervised: it finds directions of maximal variance.



Fisher’s LDA

  • R code:

    • library(MASS)

    • # train on the first 12 samples (4 Normal, 8 Cancer); avoid naming the fitted object `lda`, which would mask the function

    • lda.fit = lda(t(d[genes.set_a, 1:12]), grouping=c(rep('Normal',4), rep('Cancer',8)))

    • predict(lda.fit, t(d[genes.set_a, 13:14]))



Multidimensional scaling

# classical MDS: embed the samples in 2 dimensions from the Euclidean distance matrix

D=dist(t(d), method="euclidean")

mds = cmdscale(D, k = 2)

plot(mds[,1], mds[,2], type="p", main="Clustering using MDS", xlab = 'mds1', ylab = 'mds2')

text(mds, row.names(mds))



Classification

Classification is prediction with categorical (often binary) outcomes

Machine learning puts more emphasis on prediction than classical statistics does

Machine learning is statistics with a focus on prediction, scalability and high-dimensional problems

But there are close connections between clustering and classification



Like this



SVM

library('e1071')

# linear-kernel SVM: train on the first 12 samples (4 Normal, 8 Cancer)

model1 = svm(t(d[,1:12]), factor(c(rep('Normal',4), rep('Cancer',8))), type='C-classification', kernel='linear')

predict(model1, t(d[,13:14]))



K-Nearest Neighbor

#KNN k = 1

class::knn(t(ld[,1:12]), t(ld[,13:14]), c(rep('Normal',4), rep('Cancer',8)), k=1)

#KNN k = 3

class::knn(t(ld[,1:12]), t(ld[,13:14]), c(rep('Normal',4), rep('Cancer',8)), k=3)


Naïve Bayes

Borrowed from Manolis Kellis's course slides
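
A minimal sketch of a naive Bayes classifier with e1071 in R, reusing the hypothetical Normal/Cancer matrix d from the SVM example above:

library(e1071)
labels <- factor(c(rep('Normal', 4), rep('Cancer', 8)))
nb.fit <- naiveBayes(t(d[, 1:12]), labels)   # train on the first 12 samples
predict(nb.fit, t(d[, 13:14]))               # predict the held-out samples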



Hint for hw1 problem 2

For the graduate-level question, think about removing batch effects using PCA.

For the ComBat software, search for "sva bioconductor" (ComBat is distributed in the Bioconductor sva package).

