
Model-based Clustering in R



  1. Model-based Clustering in R Tianxi Dong

  2. Theoretical model

  3. Heuristic methods • How many clusters do we need? • How do we compare performance between methods? • How do we deal with outliers in heuristic methods? Solution?

  4. Model-based Method • Assume that the data come from a mixture of different probability models; • Assign each of the N items to the distribution it most likely belongs to; • Clustering performance is evaluated. Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf

  5. Model-based Clustering • We define the density of a mixture of g distributions as the weighted average Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf
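In standard notation, and assuming Gaussian components, the weighted average can be written as follows. The symbols p_k (mixing proportions) and m_k (means) match those used on the next slide; Σ_k for the component covariance matrices and φ for the Gaussian density are assumed names, not taken from the slide.

```latex
f(x) \;=\; \sum_{k=1}^{G} p_k \,\phi(x \mid m_k, \Sigma_k),
\qquad p_k \ge 0, \quad \sum_{k=1}^{G} p_k = 1
```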

  6. Model-based Clustering • Find the values of the parameters by maximizing the likelihood (usually the log-likelihood) of the observations: max log f(x1, …, xN) over m1, …, mG, Σ1, …, ΣG and p1, …, pG • where N is the number of observations • This is a hard nonlinear optimization problem, which is usually solved with the Expectation-Maximization (EM) algorithm Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf
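To make the EM idea concrete, here is a minimal sketch in base R for a one-dimensional Gaussian mixture. The helper name `em_gmm_1d`, the quantile-based starting values and the simulated data are all assumptions made for illustration; this is not how mclust is implemented, which handles multivariate data and uses a different initialization.

```r
# Minimal EM sketch for a one-dimensional Gaussian mixture with G components.
# em_gmm_1d is a hypothetical helper written for illustration only.
em_gmm_1d <- function(x, G = 2, max_iter = 200, tol = 1e-8) {
  n <- length(x)
  # Crude starting values (an assumption): spread the means over quantiles of x
  m <- as.numeric(quantile(x, probs = (seq_len(G) - 0.5) / G))
  s <- rep(sd(x), G)
  p <- rep(1 / G, G)
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # E-step: weighted component densities, one column per component
    dens <- sapply(seq_len(G), function(k) p[k] * dnorm(x, m[k], s[k]))
    z <- dens / rowSums(dens)          # posterior membership probabilities
    # Log-likelihood at the parameters entering this iteration
    loglik <- sum(log(rowSums(dens)))
    # M-step: update proportions, means and standard deviations
    nk <- colSums(z)
    p <- nk / n
    m <- colSums(z * x) / nk
    s <- sqrt(colSums(z * outer(x, m, "-")^2) / nk)
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(p = p, m = m, s = s, z = z, loglik = loglik, iterations = iter)
}

# Simulated data: two well-separated groups (illustrative only)
set.seed(1)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))
fit <- em_gmm_1d(x, G = 2)
fit$p   # estimated mixing proportions
fit$m   # estimated component means
```

Each row of `fit$z` holds the posterior probabilities that an observation belongs to each component; assigning every observation to its largest entry gives the clustering.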

  7. Covariance Structure • The covariance structure can be constrained in a variety of ways. • Example: model VVV with G = 5 components and p = 8 variables: number of parameters = ? • The best covariance structure is chosen using BIC. Source: http://ms.mcmaster.ca/canty/seminars/paulmcnicholas.pdf
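The parameter count for the unconstrained VVV model can be worked out by hand: G − 1 free mixing proportions, G·p mean parameters, and p(p + 1)/2 free entries in each of the G symmetric covariance matrices. The helper `n_params_vvv` below is a hypothetical name used only for this sketch.

```r
# Number of free parameters in a G-component Gaussian mixture with
# unconstrained covariance matrices (the VVV model) in p dimensions.
# n_params_vvv is a hypothetical helper, not part of mclust.
n_params_vvv <- function(G, p) {
  n_mixing <- G - 1                 # proportions sum to 1, so one is not free
  n_means  <- G * p                 # one mean vector per component
  n_cov    <- G * p * (p + 1) / 2   # one symmetric p x p matrix per component
  n_mixing + n_means + n_cov
}

n_vvv <- n_params_vvv(G = 5, p = 8)
n_vvv   # 4 + 40 + 180 = 224
```

With the mclust package loaded, `nMclustParams("VVV", d = 8, G = 5)` should report the same count.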

  8. Overfitting • Occurs when a model is excessively complex (too many parameters) • An overfitted model has poor predictive performance • Figure: training error is shown in blue, validation error in red http://en.wikipedia.org/wiki/Overfitting

  9. Recall: BIC • BIC = 2 loglikM(x, θ) − (# params)M log(N) (higher is better) • BIC = −2 loglikM(x, θ) + (# params)M log(N) (lower is better) • loglikM(x, θ): the maximized log-likelihood for model M and the data • (# params)M: the number of independent parameters to be estimated in model M • N: the number of observations in the data • The first form is used in Mclust. Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf
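The two sign conventions can be checked numerically. The sketch below fits a single normal distribution to simulated data (an illustrative assumption, not the presentation's data), computes both forms by hand, and compares the lower-is-better form with base R's `BIC()`.

```r
# Both BIC conventions for a single normal distribution fitted by maximum likelihood.
set.seed(42)
x <- rnorm(100, mean = 3, sd = 2)   # simulated data, for illustration only
N <- length(x)

mu_hat    <- mean(x)
sigma_hat <- sqrt(mean((x - mu_hat)^2))   # ML estimate (divides by N, not N - 1)
loglik    <- sum(dnorm(x, mu_hat, sigma_hat, log = TRUE))
n_par     <- 2                            # mean and standard deviation

bic_mclust  <-  2 * loglik - n_par * log(N)   # convention used by Mclust: higher is better
bic_classic <- -2 * loglik + n_par * log(N)   # usual convention: lower is better

# Base R's BIC() uses the lower-is-better form; an intercept-only lm is the same model
all.equal(bic_classic, BIC(lm(x ~ 1)))
```

The two values differ only in sign, so ranking models by the largest Mclust-style BIC is the same as ranking them by the smallest classical BIC.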

  10. R Procedure – MCLUST

  11. MCLUST Package • MCLUST is probably the best-known model-based clustering package in the literature. • http://cran.r-project.org/web/packages/mclust/index.html

  12. MCLUST Package Syntax: Mclust(data, G=NULL, modelNames=NULL, prior=NULL, warn=FALSE, ...) http://cran.r-project.org/web/packages/mclust/mclust.pdf

  13. Parameters to define Mclust • G • An integer vector specifying the possible numbers of mixture components (clusters) for which the BIC is to be calculated. The default is G=1:9. • modelNames • A vector of character strings indicating the models to be fitted in the maximization phase of clustering. • prior • By default no prior is used; optionally, a conjugate prior on the means and variances can be specified. http://cran.r-project.org/web/packages/mclust/mclust.pdf
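A short sketch of how these arguments might be used together. It assumes the mclust package is installed and uses R's built-in `faithful` data set in place of the presentation's data; the choice of `G = 1:5` and of the three model names is arbitrary.

```r
library(mclust)

# Restrict the search to 1-5 components and three covariance structures
fit <- Mclust(faithful, G = 1:5, modelNames = c("EII", "EEE", "VVV"))
summary(fit)      # selected model, number of components, cluster sizes
fit$G             # number of components chosen by BIC
fit$modelName     # covariance structure chosen by BIC

# Same search over G with mclust's default conjugate prior
fit_prior <- Mclust(faithful, G = 1:5, prior = priorControl())
summary(fit_prior)
```

Leaving `G` and `modelNames` at `NULL` lets Mclust search its default range of component counts and all covariance structures available for the data.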

  14. Real Example

  15. Dataset • Same as Hal’s dataset • GP_PER: gross profit % • ROA: return on assets • ROE: return on equity • SK_RET: % stock return • B_TO_M: book to market • NL_ASSETS: log of assets, for size control • NL_SALARY: log of CEO salary • NL_SALE: log of sales, for size control Source: http://www.r-bloggers.com/r-tutorial-series-exploratory-factor-analysis/

  16. Model-based Clustering – Mclust • This procedure cannot handle missing data natively. • If there are missing values, drop the incomplete rows first: • datam <- na.omit(data), where data is the data frame containing the missing values
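A minimal illustration of the `na.omit` step in base R. The data frame `toy` is made up for this sketch and only borrows two column names from the dataset slide.

```r
# Hypothetical toy data frame with missing values (not the presentation's data)
toy <- data.frame(ROA = c(0.05, NA, 0.12, 0.08),
                  ROE = c(0.10, 0.15, NA, 0.20))

colSums(is.na(toy))            # number of missing values per column
toy_complete <- na.omit(toy)   # keep only rows with no missing values
nrow(toy_complete)             # 2 complete rows remain
```

Dropping rows is the simplest option; it discards every observation that has a missing value in any selected column, so it is worth checking how many rows are lost.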

  17. Model-based Clustering – Mclust

  18. Posterior Probability

  19. Classification using Model-based Clustering • Discriminant analyses • Test significance of a set of discriminant functions • Categories are known before classification
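When the categories are known in advance, mclust offers model-based discriminant analysis through `MclustDA`. The sketch below assumes the mclust package is installed and uses the built-in `iris` data with an arbitrary 100/50 train/test split; it is not the analysis shown in the presentation.

```r
library(mclust)

set.seed(123)
train_idx <- sample(nrow(iris), 100)   # arbitrary split, for illustration only
train_x <- iris[train_idx, 1:4]
train_y <- iris$Species[train_idx]
test_x  <- iris[-train_idx, 1:4]
test_y  <- iris$Species[-train_idx]

# Fit a Gaussian mixture within each known class
da_fit <- MclustDA(train_x, class = train_y)
summary(da_fit)

# Classify the held-out observations and compare with the true labels
pred <- predict(da_fit, newdata = test_x)
table(Predicted = pred$classification, Actual = test_y)
accuracy <- mean(as.character(pred$classification) == as.character(test_y))
accuracy
```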

  20. Classification

  21. BIC Plot

  22. BIC Table

  23. Means for each cluster

  24. Conclusion

  25. R Code • library(mclust) • data <- read.csv("compsetrex.csv") • salary <- data[, c(1, 2, 3, 4, 5, 7, 8, 10)] • salaryMclust <- Mclust(salary) • mysummary <- summary(salaryMclust) • # classification (cluster label for each observation) • mysummary$classification • # BIC matrix: compute it before summarizing or plotting • salaryMclustBIC <- mclustBIC(salary) • salaryMclustBIC • BICSummary <- summary(salaryMclustBIC, data = salary) • BICSummary • # BIC plot • plot(salaryMclustBIC) • # posterior probabilities • salaryMclust$z • # cluster means • salaryMclust$parameters$mean
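Since compsetrex.csv is not distributed with the slides, the same workflow can be tried on a built-in data set. The sketch below assumes the mclust package is installed and substitutes R's `faithful` data; the object names are illustrative.

```r
library(mclust)

fit <- Mclust(faithful)           # default search over G = 1:9 and all models

head(fit$classification)          # cluster label for each observation
head(round(fit$z, 3))             # posterior probabilities, one column per cluster
fit$parameters$mean               # cluster means, one column per cluster

bic <- mclustBIC(faithful)        # BIC for every model and number of components
summary(bic)                      # top models according to BIC
plot(bic)                         # BIC plot
```

Each row of `fit$z` sums to 1, and the classification is the column with the largest posterior probability in that row.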
