
Feature Selection, Dimensionality Reduction, and Clustering


Presentation Transcript


  1. Summer Course: Data Mining. Feature Selection, Dimensionality Reduction, and Clustering. Presenter: Georgi Nalbantov. August 2009

  2. Structure • Dimensionality Reduction • Principal Components Analysis (PCA) • Nonlinear PCA (Kernel PCA, CatPCA) • Multi-Dimensional Scaling (MDS) • Homogeneity Analysis • Feature Selection • Filtering approach • Wrapper approach • Embedded methods • Clustering • Density estimation and clustering • K-means clustering • Hierarchical clustering • Clustering with Support Vector Machines (SVMs)

  3. Feature Selection, Dimensionality Reduction, and Clustering in the KDD Process. U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)

  4. Feature Selection • In the presence of millions of features/attributes/inputs/variables, select the most relevant ones. Advantages: build better, faster, and easier-to-understand learning machines. [Figure: data matrix X of n samples by m features, reduced to m' selected features]

  5. Feature Selection • Goal: select the two best features individually • Any reasonable objective J will rank the features • J(x1) > J(x2) = J(x3) > J(x4) • Thus, the chosen feature pair would be [x1, x2] or [x1, x3]. • However, x4 is the only feature that provides complementary information to x1

  6. Feature Selection • Filtering approach: ranks features or feature subsets independently of the predictor (classifier). • …using univariate methods: consider one variable at a time • …using multivariate methods: consider more than one variable at a time • Wrapper approach: uses a classifier to assess (many) features or feature subsets. • Embedded approach: uses a classifier to build a (single) model with a subset of features that are internally selected.

  7. Feature Selection: univariate filtering approach • Issue: determine the relevance of a given single feature. [Figure: class-conditional densities P(Xi | Y) of a single feature xi, with their means and standard deviations]

  8. Feature Selection: univariate filtering approach • Issue: determine the relevance of a given single feature. • Under independence: P(X, Y) = P(X) P(Y) • Measure of dependence (Mutual Information): MI(X, Y) = ∫ P(X, Y) log [ P(X, Y) / (P(X) P(Y)) ] dX dY = KL( P(X, Y) || P(X) P(Y) )
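A minimal sketch of this mutual-information-based filtering in practice, assuming scikit-learn is available; the synthetic dataset and all variable names are illustrative only.

```python
# Univariate filtering: rank features by estimated MI(Xi, Y).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: 5 informative features out of 20 (illustrative only)
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)  # one MI estimate per feature
ranking = np.argsort(mi)[::-1]                  # highest MI first
print("Top 5 features by mutual information:", ranking[:5])
```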

  9. Feature Selection: univariate filtering approach • Correlation and MI • Note: Correlation is a measure of linear dependence

  10. Feature Selection: univariate filtering approach • Correlation and MI under the Gaussian distribution

  11. Feature Selection: univariate filtering approach. Criteria for measuring dependence.

  12. Feature Selection: univariate filtering approach [Figure: class-conditional densities P(Xi | Y=-1) and P(Xi | Y=1) of a feature xi, with means m-, m+ and standard deviations s-, s+; legend: Y=1, Y=-1]

  13. Feature Selection: univariate filtering approach [Figure: two density plots contrasting the cases P(Xi | Y=1) = P(Xi | Y=-1) and P(Xi | Y=1) ≠ P(Xi | Y=-1); legend: Y=1, Y=-1]

  14. Feature Selection: univariate filtering approach • Is the distance between the class means m- and m+ significant? Use a T-test. • Normally distributed classes, equal variance s^2 unknown; estimated from the data as s^2_within. • Null hypothesis H0: m+ = m- • T statistic: if H0 is true, then t = (m+ - m-) / ( s_within * sqrt(1/n+ + 1/n-) ) ~ Student(n+ + n- - 2 d.f.), where n+ and n- are the two class sample sizes.
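A minimal sketch of t-test based feature ranking, assuming SciPy is available; the synthetic two-class data are illustrative only.

```python
# Univariate filtering: rank features by a two-sample t-test per feature.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 samples, 10 features
y = np.repeat([1, -1], 50)
X[y == 1, 0] += 1.0                     # shift the mean of feature 0 for class +1

# Pooled-variance t-test for each feature, as on the slide
t_stats, p_values = ttest_ind(X[y == 1], X[y == -1], axis=0, equal_var=True)
ranking = np.argsort(p_values)          # most significant features first
print("Features ranked by t-test p-value:", ranking[:3])
```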

  15. Feature Selection: multivariate filtering approach Guyon-Elisseeff, JMLR 2004; Springer 2006

  16. Feature Selection: search strategies (Kohavi & John, 1997). N features, 2^N possible feature subsets!

  17. Feature Selection: search strategies • Forward selection or backward elimination. • Beam search: keep the k best paths at each step. • GSFS: generalized sequential forward selection. When (n-k) features are left, try all subsets of g features; more trainings at each step, but fewer steps. • PTA(l, r): plus l, take away r. At each step, run SFS l times, then SBS r times. • Floating search: one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far.
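As one concrete instance of these strategies, here is a hedged sketch of sequential forward selection (SFS) with cross-validated scoring, assuming scikit-learn; the classifier, the subset-size limit, and the stopping rule are illustrative choices.

```python
# Greedy forward selection: add the feature that most improves the CV score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=15,
                           n_informative=4, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
for _ in range(5):                       # grow the subset up to 5 features
    scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:             # stop when no candidate improves the score
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best
print("Selected features:", selected, "CV accuracy:", round(best_score, 3))
```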

  18. Feature Selection: filters vs. wrappers vs. embedding • Main goal: rank subsets of useful features

  19. Feature Selection: feature subset assessment (wrapper) [Figure: data matrix of M samples and N variables/features, split into three parts m1, m2, m3] Split the data into 3 sets: training, validation, and test. • 1) For each feature subset, train the predictor on the training data. • 2) Select the feature subset which performs best on the validation data. • Repeat and average if you want to reduce variance (cross-validation). • 3) Test on the test data. • Danger of over-fitting with intensive search!
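A minimal sketch of this wrapper procedure with an explicit training/validation/test split, assuming scikit-learn; the candidate subsets are enumerated exhaustively only because the example is tiny.

```python
# Wrapper assessment: pick the subset that performs best on validation data.
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8,
                           n_informative=3, random_state=1)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                            random_state=1)

candidates = [list(c) for c in combinations(range(8), 3)]  # all 3-feature subsets

def val_score(subset):
    # 1) train on the training set, 2) assess on the validation set
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, subset], y_tr)
    return clf.score(X_val[:, subset], y_val)

best = max(candidates, key=val_score)
# 3) report the winning subset on the untouched test set
final = LogisticRegression(max_iter=1000).fit(X_tr[:, best], y_tr)
print("Best subset:", best,
      "test accuracy:", round(final.score(X_te[:, best], y_te), 3))
```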

  20. Feature Selection via Embedded Methods: L1-regularization [Figure: two plots with the L1 penalty sum(|beta|) on the horizontal axis]
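A minimal sketch of embedded selection via L1-regularization (here the Lasso), assuming scikit-learn; the data and the penalty strength alpha are illustrative.

```python
# Embedded selection: the L1 penalty on sum(|beta|) drives many coefficients
# exactly to zero, so the surviving features are selected during model fitting.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=4,
                       noise=5.0, random_state=0)

model = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(model.coef_)   # features with non-zero coefficients
print("Features kept by the L1 penalty:", selected)
```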

  21. Feature Selection: summary • Linear, univariate: T-test, AUC, feature ranking • Linear, multivariate: RFE with linear SVM or LDA • Non-linear, univariate: Mutual information feature ranking • Non-linear, multivariate: Nearest Neighbors, Neural Nets, Trees, SVM

  22. Dimensionality Reduction • In the presence of many features, select the most relevant subset of (weighted) combinations of features. [Figure: Feature Selection vs. Dimensionality Reduction applied to the data matrix]

  23. Dimensionality Reduction: (Linear) Principal Components Analysis • PCA finds a linear mapping of dataset X to a dataset X’ of lower dimensionality. The variance of X that is retained in X’ is maximal. • Dataset X is mapped to dataset X’, here of the same dimensionality. The first dimension in X’ (= the first principal component) is the direction of maximal variance. The second principal component is orthogonal to the first.
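A minimal sketch of linear PCA, assuming scikit-learn; the correlated 2D data mirror the slide's same-dimensionality illustration, and all names are illustrative.

```python
# Linear PCA: rotate the data so the first axis is the direction of maximal variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated 2D data: most variance lies along one diagonal direction
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])

pca = PCA(n_components=2)
X_prime = pca.fit_transform(X)   # first column = projection on the first PC
print("Fraction of variance per component:", pca.explained_variance_ratio_)
```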

  24. Dimensionality Reduction: Nonlinear (Kernel) Principal Components Analysis • Original dataset X • Map X to a HIGHER-dimensional space, and carry out LINEAR PCA in that space • (If necessary,) map the resulting principal components back to the original space
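A minimal sketch of kernel (nonlinear) PCA, assuming scikit-learn; the RBF kernel, its gamma value, and the two-circles data are illustrative choices.

```python
# Kernel PCA: data are implicitly mapped to a higher-dimensional space via the
# kernel, linear PCA is carried out there, and (optionally) approximate
# pre-images are mapped back to the original input space.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0,
                 fit_inverse_transform=True)
X_kpca = kpca.fit_transform(X)            # nonlinear principal components
X_back = kpca.inverse_transform(X_kpca)   # approximate mapping back to input space
```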

  25. Dimensionality Reduction: Multi-Dimensional Scaling • MDS is a mathematical dimension reduction technique that maps the distances between observations from the original (high) dimensional space into a lower (for example, two) dimensional space. • MDS attempts to retain the pairwise Euclidean distances in the low-dimensional space. • The error of the fit is measured using a so-called “stress” function • Different choices for the stress function are possible
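A minimal sketch of metric MDS, assuming scikit-learn; the blob data and the target dimensionality are illustrative.

```python
# Metric MDS: embed 10-dimensional points in 2D while trying to preserve
# the pairwise Euclidean distances; mds.stress_ reports the remaining error.
from sklearn.datasets import make_blobs
from sklearn.manifold import MDS

X, _ = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

mds = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
X_low = mds.fit_transform(X)
print("Final stress of the embedding:", round(mds.stress_, 2))
```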

  26. Dimensionality Reduction: Multi-Dimensional Scaling • Raw stress function (identical to PCA): • Sammon cost function:
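For reference, the two cost functions written out under the usual notation; this is a reconstruction assuming the standard definitions, with d_ij the original pairwise distance between objects i and j and y_i the low-dimensional coordinates of object i.

```latex
% Raw stress (metric MDS) and Sammon cost -- standard forms, notation assumed.
\phi_{\text{raw}}(Y)    = \sum_{i<j} \bigl( d_{ij} - \lVert y_i - y_j \rVert \bigr)^2
\qquad
\phi_{\text{Sammon}}(Y) = \frac{1}{\sum_{i<j} d_{ij}}
                          \sum_{i<j} \frac{\bigl( d_{ij} - \lVert y_i - y_j \rVert \bigr)^2}{d_{ij}}
```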

  27. Dimensionality Reduction: Multi-Dimensional Scaling (Example) [Figure: example input and the resulting low-dimensional MDS output]

  28. Dimensionality Reduction: Homogeneity analysis • Homals finds a lower-dimensional representation of a categorical data matrix X. It may be considered a type of nonlinear extension of PCA.

  29. Clustering, Classification, and Regression [Figure: three scatter plots over X1 and X2, one for each task] • Clustering: k-th Nearest Neighbour, Parzen Window, Unfolding, Conjoint Analysis, Cat-PCA • Classification: Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS • Regression: Classical Linear Regression, Ridge Regression, NN, CART

  30. Clustering • Clustering is an unsupervised learning technique. • Task: organize objects into groups whose members are similar in some way • Clustering finds structures in a collection of unlabeled data • A cluster is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters

  31. Density estimation and clustering [Figure: estimated densities with the (optimal) Bayesian separation curve]

  32. Clustering: K-means clustering • Minimizes the sum of the squared distances to the cluster centers (reconstruction error) • Iterative process: • Estimate the current assignments (construct a Voronoi partition) • Given the new cluster assignments, set each cluster center to the center of mass of its assigned points
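A minimal sketch of exactly this two-step iteration using only NumPy; the three synthetic blobs, k = 3, and the fixed number of iterations are illustrative.

```python
# K-means: alternate between assigning points to the nearest center and
# moving each center to the center of mass of its assigned points.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [4, 0], [2, 3])])      # three blobs
k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers

for _ in range(20):
    # Step 1: assign each point to its nearest center (Voronoi partition)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 2: move each center to the mean of its assigned points
    centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])

print("Final cluster centers:\n", centers)
```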

  33. Clustering: K-means clustering [Figure: Steps 1, 2, 3, and 4 of the k-means iteration]

  34. Clustering: Hierarchical clustering • Clustering based on (dis)similarities. Multilevel clustering: level 1 has n clusters, level n has one cluster • Agglomerative HC: starts with n clusters and combines clusters iteratively • Divisive HC: starts with one cluster and divides iteratively • Disadvantage: a wrong division (or merge) cannot be undone [Figure: dendrogram]
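A minimal sketch of agglomerative hierarchical clustering with SciPy; the data, the linkage choice, and the cut into two clusters are illustrative.

```python
# Agglomerative hierarchical clustering: build the full merge hierarchy,
# then cut it at a chosen number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(30, 2))
               for c in ([0, 0], [3, 3])])

Z = linkage(X, method="average")                   # multilevel merge hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
# dendrogram(Z) would draw the multilevel tree if matplotlib is available
print("Cluster sizes:", np.bincount(labels)[1:])
```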

  35. Clustering: Nearest Neighbor algorithm for hierarchical clustering 1. Nearest Neighbor, Level 2, k = 7 clusters. 2. Nearest Neighbor, Level 3, k = 6 clusters. 3. Nearest Neighbor, Level 4, k = 5 clusters.

  36. Clustering: Nearest Neighbor algorithm for hierarchical clustering 4. Nearest Neighbor, Level 5, k = 4 clusters. 5. Nearest Neighbor, Level 6, k = 3 clusters. 6. Nearest Neighbor, Level 7, k = 2 clusters.

  37. Clustering: Nearest Neighbor algorithm for hierarchical clustering 7. Nearest Neighbor, Level 8, k = 1 cluster.

  38. Clustering: Similarity measures for hierarchical clustering

  39. Clustering: Similarity measures for hierarchical clustering • Pearson Correlation: Trend Similarity

  40. Clustering: Similarity measures for hierarchical clustering • Euclidean Distance

  41. Clustering: Similarity measures for hierarchical clustering • Cosine Correlation: −1 ≤ cosine correlation ≤ +1

  42. Clustering: Similarity measures for hierarchical clustering • Cosine Correlation: Trend + Mean Distance
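A minimal sketch comparing the three similarity measures named on these slides; the two example profiles share the same trend but differ in mean, so Pearson correlation sees them as identical while Euclidean distance and cosine correlation register the shift. All numbers are illustrative.

```python
# Pearson correlation, Euclidean distance, and cosine correlation
# for two profiles with the same trend but different means.
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = a + 10.0                     # same trend, shifted mean

pearson = np.corrcoef(a, b)[0, 1]                          # trend similarity only
euclid = np.linalg.norm(a - b)                             # sensitive to the mean shift
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # lies in [-1, +1]

print(f"Pearson: {pearson:.2f}, Euclidean: {euclid:.2f}, Cosine: {cosine:.3f}")
```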

  43. Clustering: Similarity measures for hierarchical clustering

  44. Clustering: Similarity measures for hierarchical clustering Similar?

  45. Clustering: Grouping strategies for hierarchical clustering [Figure: three clusters C1, C2, C3. Which pair of clusters should be merged?]

  46. Clustering: Grouping strategies for hierarchical clustering • Single Linkage: dissimilarity between two clusters = the minimum dissimilarity between the members of the two clusters [Figure: clusters C1 and C2] • Tends to generate “long chains”

  47. Clustering: Grouping strategies for hierarchical clustering • Complete Linkage: dissimilarity between two clusters = the maximum dissimilarity between the members of the two clusters [Figure: clusters C1 and C2] • Tends to generate “clumps”

  48. Clustering: Grouping strategies for hierarchical clustering • Average Linkage: dissimilarity between two clusters = averaged distances of all pairs of objects (one from each cluster) [Figure: clusters C1 and C2]

  49. Clustering: Grouping strategies for hierarchical clustering • Average Group Linkage: dissimilarity between two clusters = the distance between the two cluster means [Figure: clusters C1 and C2]
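A minimal sketch contrasting these grouping strategies via SciPy's linkage method argument; mapping average group linkage to SciPy's "centroid" method (distance between cluster means) is my own assumption, and the data are illustrative.

```python
# Compare single, complete, average, and centroid linkage on the same data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(40, 2))
               for c in ([0, 0], [4, 1], [2, 4])])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut into 3 clusters
    print(f"{method:>9}: cluster sizes {np.bincount(labels)[1:]}")
```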

  50. Clustering: Support Vector Machines for clustering • The noise-free (“not-noisy”) case • Objective function: see the standard form below • Ben-Hur, Horn, Siegelmann and Vapnik, 2001
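A hedged reconstruction of the standard objective from the cited paper: in the noise-free case, support vector clustering looks for the smallest sphere in the kernel-induced feature space Φ(·) that encloses all mapped data points.

```latex
% Support vector clustering, noise-free case (standard form, assumed):
% smallest enclosing sphere with centre a and radius R in feature space.
\min_{R,\,a}\; R^2
\quad \text{subject to} \quad
\lVert \Phi(x_j) - a \rVert^2 \le R^2 \quad \text{for all } j .
```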
