
Generalized Model Selection For Unsupervised Learning in High Dimension

Vaithyanathan and Dom, IBM Almaden Research Center, NIPS '99


Presentation Transcript


  1. Generalized Model Selection For Unsupervised Learning in High Dimension. Vaithyanathan and Dom, IBM Almaden Research Center, NIPS '99

  2. Abstract
  • Bayesian approach to model selection in unsupervised learning.
  • Proposes a unified objective function whose arguments include both the feature space and the number of clusters.
  • Determines the feature set (dividing the feature set into noise features and useful features).
  • Determines the number of clusters.
  • Marginal likelihood under a Bayesian scheme vs. cross-validation (cross-validated likelihood).
  • DC (distributional clustering of terms) for initial feature selection.

  3. Model Selection in Clustering
  • Bayesian approaches 1), cross-validation techniques 2), MDL approaches 3).
  • Need for a unified objective function:
  • the optimal number of clusters depends on the feature space in which the clustering is performed.
  • cf. feature selection in clustering.

  4. Model Selection in Clustering (Cont'd)
  • Generalized model for clustering:
  • data D = {d1, …, dn}, feature space T with dimension M;
  • maximization of the likelihood P(D^T | Θ, Ω), where Ω (with parameter vectors Θ) is the structure of the model: the number of clusters, the partitioning of the feature set into U (useful set) and N (noise set), and the assignment of patterns to clusters.
  • Bayesian approach to model selection:
  • regularization using the marginal likelihood.

  5. Bayesian Approach to Model Selection for Clustering
  • Data:
  • data D = {d1, …, dn}, feature space T with dimension M.
  • Clustering D:
  • finding the Ω and Θ that maximize the likelihood P(D^T | Θ, Ω),
  • where Ω is the structure of the model and Θ is the set of all parameter vectors;
  • the model structure Ω consists of the number of clusters, the partitioning of the feature set, and the assignment of patterns to clusters.

  6. Lack of regularization
  • Maximizing the likelihood alone does not penalize complexity; use the marginal (integrated) likelihood instead.
  • Assumptions:
  • 1. The feature sets of T represented by U and N are conditionally independent.
  • 2. The data D = {d1, …, dn} are i.i.d.

  7. Marginal likelihood
  • 3. All parameter vectors are independent.
  • Marginal likelihood: P(D^T | Ω) = ∫ P(D^T | Θ, Ω) π(Θ | Ω) dΘ (the factored form is sketched below).
  • Computationally very expensive: prune the search space by reducing the number of feature partitions considered.
  • Approximations to Marginal Likelihood / Stochastic Complexity (model complexity).
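Under assumptions 1-3, the integrated likelihood factors into one term for the noise features and one term per cluster for the useful features. A reconstruction in LaTeX, sketched from the definitions above (the notation is assumed; the transcript does not preserve the paper's exact equations):

\[
P(D^T \mid \Omega) = \int P(D^N \mid \theta_N)\, \pi(\theta_N)\, d\theta_N \;\prod_{k=1}^{K} \int P(D_k^U \mid \theta_k)\, \pi(\theta_k)\, d\theta_k
\]

where D^N is the data restricted to the noise features, D_k^U is the data of cluster k restricted to the useful features, and K is the number of clusters.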

  8. Document Clustering
  • Marginal likelihood (equation (11)): adapting multinomial models, using term counts as the features.
  • Assuming Dirichlet priors π(·), conjugate to the multinomial distribution.
  • NLML (Negative Log Marginal Likelihood).
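The multinomial/Dirichlet pairing gives the marginal likelihood a closed form in terms of gamma functions. A minimal sketch of a per-block negative log marginal likelihood, assuming a symmetric Dirichlet prior and dropping the data-dependent multinomial coefficients (constant across the models being compared); this is an illustration, not the paper's exact equation (11):

import numpy as np
from scipy.special import gammaln

def block_nlml(term_counts, alpha=1.0):
    # term_counts: aggregated counts of each term within one block of data,
    # e.g. one cluster restricted to the useful features, or the whole
    # collection restricted to the noise features.
    t = np.asarray(term_counts, dtype=float)
    V = t.size  # vocabulary size of the block
    log_ml = (gammaln(V * alpha) - gammaln(V * alpha + t.sum())
              + np.sum(gammaln(alpha + t) - gammaln(alpha)))
    return -log_ml

def total_nlml(noise_counts, cluster_counts, alpha=1.0):
    # NLML of one candidate structure: one term for the noise features
    # plus one term per cluster over the useful features.
    return block_nlml(noise_counts, alpha) + sum(
        block_nlml(c, alpha) for c in cluster_counts)

Comparing total_nlml across candidate structures (lower is better) plays the role of the unified objective for a fixed clustering and feature partition.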

  9. Document Clustering (Cont'd)
  • Cross-validated likelihood.
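The cross-validated likelihood scores a candidate model structure by held-out fit instead of the integrated likelihood. A generic v-fold harness, where fit_model and heldout_loglik are hypothetical placeholders for fitting a multinomial mixture on the training folds and scoring the held-out documents:

import numpy as np

def cross_validated_loglik(docs, fit_model, heldout_loglik, v=5, seed=0):
    # docs: (n_docs, n_terms) document-term matrix (or any row-indexable array).
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(docs)), v)
    scores = []
    for i in range(v):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(v) if j != i])
        model = fit_model(docs[train])
        scores.append(heldout_loglik(model, docs[test]))
    return float(np.mean(scores))  # higher is better across candidate structures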

  10. Distributional clustering for feature subset selection
  • A heuristic method to obtain a subset of tokens that are topical and can be used as features in the bag-of-words model to cluster documents.
  • Reduces the feature size from M to C by clustering words based on their distributions over the documents.
  • A histogram for each token (see the sketch below):
  • first bin: number of documents with zero occurrences of the token;
  • second bin: number of documents with a single occurrence of the token;
  • third bin: number of documents with two or more occurrences of the token.
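A minimal sketch of the three-bin histogram construction, assuming the corpus is available as a dense document-term count matrix:

import numpy as np

def token_histograms(dtm):
    # dtm: (n_docs, n_terms) matrix of raw term counts.
    zero = (dtm == 0).sum(axis=0)   # documents where the token is absent
    one  = (dtm == 1).sum(axis=0)   # documents with exactly one occurrence
    more = (dtm >= 2).sum(axis=0)   # documents with two or more occurrences
    hist = np.stack([zero, one, more], axis=1).astype(float)
    return hist / hist.sum(axis=1, keepdims=True)  # one distribution per token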

  11. DC for feature subset selection (Cont'd)
  • Measure of similarity of the histograms:
  • relative entropy, or the K-L distance D(·||·);
  • e.g. for two terms with probability distributions p1(·) and p2(·): D(p1 || p2) = Σ_x p1(x) log( p1(x) / p2(x) ).
  • k-means DC: a k-means-style algorithm over the token histograms using the K-L distance.
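A compact sketch of k-means-style distributional clustering over the token histograms, using the K-L distance to the centroids; the epsilon guard against zero bins is an implementation detail assumed here, not stated on the slide:

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D(p || q) = sum_x p(x) log(p(x) / q(x)), taken over the last axis.
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return np.sum(p * np.log(p / q), axis=-1)

def kmeans_dc(histograms, n_clusters=3, n_iter=50, seed=0):
    # histograms: (n_terms, n_bins) rows, one distribution per token.
    rng = np.random.default_rng(seed)
    centroids = histograms[rng.choice(len(histograms), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each token to the centroid with the smallest K-L distance.
        dists = np.stack([kl_divergence(histograms, c) for c in centroids], axis=1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean histogram of its cluster.
        centroids = np.array([
            histograms[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)])
    return labels, centroids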

  12. Experimental Setup
  • AP Reuters Newswire articles from TREC-6:
  • 8235 documents from the routing track, 25 classes, documents assigned to multiple classes disregarded;
  • 32450 unique terms (after discarding terms that appeared in fewer than 3 documents).
  • Evaluation measure of clustering: MI (mutual information between cluster labels and class labels).
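Mutual information between the induced cluster labels and the known class labels is the standard reading of "MI" here; a minimal sketch computed from the joint contingency table (natural log):

import numpy as np

def mutual_information(cluster_labels, class_labels):
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    n = len(cluster_labels)
    clusters = np.unique(cluster_labels)
    classes = np.unique(class_labels)
    # Joint distribution over (cluster, class) pairs.
    pxy = np.array([[np.sum((cluster_labels == k) & (class_labels == c))
                     for c in classes] for k in clusters], dtype=float) / n
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))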

  13. Results of Distributional Clustering
  • Cluster the 32450 tokens into 3, 4, or 5 clusters.
  • Eliminate function words.
  • Figure 1. Centroid of a typical high-frequency function-words cluster.

  14. Finding the Optimum Features and Document Clusters for a Fixed Number of Clusters
  • Now apply the objective function (11) to the feature subsets selected by DC.
  • EM/CEM (Classification EM: a hard-assignment version of EM) 1).
  • Initialization: k-means algorithm.
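A minimal sketch of CEM for a multinomial mixture over documents, assuming a dense document-term matrix restricted to the useful features, a fixed number of clusters, and an initial hard assignment (e.g. from k-means); the smoothing constant is an assumption, not a detail from the slides:

import numpy as np

def cem_multinomial(dtm, n_clusters, init_labels, n_iter=50, smooth=1e-2):
    # dtm: (n_docs, n_terms) term counts; init_labels: initial cluster of each doc.
    labels = np.asarray(init_labels).copy()
    for _ in range(n_iter):
        # M-step: re-estimate mixing weights and multinomial parameters.
        pis = np.array([(labels == k).mean() for k in range(n_clusters)])
        thetas = np.array([dtm[labels == k].sum(axis=0) + smooth
                           for k in range(n_clusters)])
        thetas /= thetas.sum(axis=1, keepdims=True)
        # E-step + C-step: hard-assign each document to its most likely cluster.
        log_post = dtm @ np.log(thetas).T + np.log(pis + 1e-12)
        new_labels = log_post.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

The resulting hard assignments can then be scored with the objective (e.g. the total_nlml sketch above) for each feature subset produced by DC.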

  15. Comparison of feature-selection heuristics
  • FBTop20: removal of the top 20% most frequent terms.
  • FBTop40: removal of the top 40% most frequent terms.
  • FBTop40Bot10: removal of the top 40% most frequent terms, plus removal of all tokens that do not appear in at least 10 documents.
  • NF: no feature selection.
  • CSW: common stop words removed.
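A small sketch of the frequency-based baselines, assuming a dense document-term matrix and returning boolean masks over the vocabulary (the stop-word list used for CSW is not reproduced here):

import numpy as np

def fb_top(dtm, top_frac):
    # Drop the top `top_frac` fraction of terms by total corpus frequency.
    freq = dtm.sum(axis=0)
    n_drop = int(round(top_frac * dtm.shape[1]))
    keep = np.ones(dtm.shape[1], dtype=bool)
    keep[np.argsort(freq)[::-1][:n_drop]] = False
    return keep

def fb_top_bot(dtm, top_frac, min_docs):
    # Additionally drop tokens appearing in fewer than `min_docs` documents.
    doc_freq = (dtm > 0).sum(axis=0)
    return fb_top(dtm, top_frac) & (doc_freq >= min_docs)

# FBTop20: fb_top(dtm, 0.20); FBTop40: fb_top(dtm, 0.40)
# FBTop40Bot10: fb_top_bot(dtm, 0.40, 10); NF: all-ones mask (no selection)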
