A Semi-supervised Document Clustering Algorithm based on EM


Presentation Transcript


  1. A Semi-supervised Document Clustering Algorithm based on EM • Leonardo Rigutini and Marco Maggini • Department of Information Engineering, University of Siena – Siena – Italy • {rigutini,maggini}@dii.unisi.it

  2. Outline • Document clustering and semi-supervised clustering • EM algorithm and limitations • Using feature selection filtering to improve the EM algorithm • The proposed algorithm • Experimental results • Conclusions L. Rigutini, M. Maggini - A Semi-supervised Document Clustering Algorithm based on EM - WI 2005

  3. Document Clustering • Document clustering is a very hard task in Automatic Text Processing • It requires extracting regular patterns from a document collection without a priori knowledge of the category structure • It is a difficult task even for humans • Many different but valid partitions may exist for the same collection • Lack of information about categories • Difficulty in using effective feature selection techniques to reduce the noise in the representation of texts

  4. Semi-supervised clustering • It sits between automatic categorization and self-organization of data • The supervisor is not required to specify a set of classes, only to split a set of examples into groups • The initial examples are very few documents (1 to 10 at most) for each group • The initial examples could also be sets of keywords describing the desired groups

  5. Feature Selection • Document Clustering • It is impossible to use global information to filter words, because no information on classes is available: • IG, TS, DotRatio are not usable • Feature selection is a very important issue in text representation • The representation space has very high dimensionality • Distances between documents are very similar • Semi-supervised Clustering • An initial filtering can be performed using the small amount of initial information

  6. EM Algorithm • A general algorithm to adjust the parameters of the model to the data distribution • E step: the unlabeled data are labeled by the classifier, assuming the current configuration to be correct • M step: the parameters of the classifier are re-estimated using the data labeled at the previous E step, assuming the labels to be correct • The procedure is iterated until convergence is reached
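
The E and M steps above can be sketched as a hard-assignment loop. This is only an illustrative sketch, not the authors' implementation: it assumes a nearest-centroid classifier with cosine similarity over term-count dictionaries, and the names `cosine`, `e_step`, `m_step` and `hard_em` are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def e_step(docs, centroids):
    """Label each document with its most similar centroid (current model assumed correct)."""
    return [max(centroids, key=lambda c: cosine(d, centroids[c])) for d in docs]

def m_step(docs, labels, old_centroids):
    """Re-estimate each centroid from the documents just assigned to it (labels assumed correct)."""
    sums = {c: Counter() for c in old_centroids}
    for d, c in zip(docs, labels):
        sums[c].update(d)  # adds term counts; cosine ignores the overall scale
    # Assumption: a cluster that received no documents keeps its old centroid.
    return {c: (dict(s) if s else old_centroids[c]) for c, s in sums.items()}

def hard_em(docs, centroids, max_iter=50):
    """Alternate E and M steps until the labeling stops changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = e_step(docs, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = m_step(docs, labels, centroids)
    return labels, centroids
```

A call such as `hard_em(docs, {"auto": {"car": 1}, "hardware": {"cpu": 1}})` starts from keyword-style seeds; each centroid is then replaced by the summed term counts of the documents assigned to it.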

  7. EM algorithm: limitations • The initialization of the classifier strongly affects the final cluster composition • If the initial centroids are not distributed as the final user would like, the algorithm can form clusters whose semantics do not match the user's criteria • The iterative form of the EM algorithm produces a reinforcement effect on the badly labeled data • If at time t some documents are badly classified in the expectation step (E), these data influence the re-estimation step (M), and at time t+1 other documents will be badly classified • This effect grows with successive iterations of the E and M steps

  8. Distribution of distances • The distance between two similar documents is very close to the distance between two dissimilar documents • It is very probable that the E step badly labels some boundary documents • EM very often reaches a trivial solution: • A large central cluster including most of the documents • Various small peripheral clusters including outliers
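
The following sketch illustrates why distances become hard to tell apart in a very high-dimensional space. It uses synthetic uniform random vectors rather than real documents, so it is only an analogy for the effect described above; `relative_contrast` is a hypothetical helper.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point.

    Values near zero mean that near and far neighbours are almost equally distant.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)

if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

On this synthetic data the contrast shrinks as the dimension grows, which is consistent with boundary documents being easy to mislabel in the E step.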

  9. Feature Selection • At each iteration of EM, the badly labeled data influence the re-estimation of the parameters, moving the centroids in the wrong direction • We can reduce the influence of badly labeled documents in the M step by using a feature selection filter inside the EM algorithm • We use the labeled dataset produced by the E step to filter out the words that are not significant for each class • In this way, the noisy words introduced by the documents badly classified in the E step will not contribute to the M step
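
A filter of this kind could be built on Information Gain, the measure named elsewhere in these slides. The sketch below assumes the standard presence/absence definition, IG(w) = H(C) - [P(w) H(C|w) + P(not w) H(C|not w)], computed over documents represented as sets of words; the slides do not give the exact formula used, and `entropy`, `information_gain` and `top_k_words` are hypothetical names.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if not total:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels):
    """Information gain of each word's presence/absence with respect to the labels.

    docs: list of sets of words; labels: list of class names of the same length.
    """
    n = len(docs)
    class_counts = Counter(labels)
    h_class = entropy(class_counts.values())
    # For each word, count the documents of each class that contain it.
    present = {}
    for words, c in zip(docs, labels):
        for w in set(words):
            present.setdefault(w, Counter())[c] += 1
    gains = {}
    for w, with_w in present.items():
        n_w = sum(with_w.values())
        without_w = [class_counts[c] - with_w[c] for c in class_counts]
        h_cond = (n_w / n) * entropy(with_w.values()) \
            + ((n - n_w) / n) * entropy(without_w)
        gains[w] = h_class - h_cond
    return gains

def top_k_words(docs, labels, k):
    """Keep the k words with the highest information gain."""
    gains = information_gain(docs, labels)
    return set(sorted(gains, key=gains.get, reverse=True)[:k])
```

A word that appears in every class equally often scores zero and is dropped first, so vocabulary contributed only by a few misclassified documents tends to fall below the cut-off.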

  10. The proposed algorithm

  11. The algorithm • The small initial labeled dataset is used to initialize the parameters of the classifier in the EM algorithm • An Information Gain filter IG1 is used to extract the most significant words from the training dataset • Once the unlabeled data have been labeled, the Information Gain filter IG2 prevents wrongly labeled documents from influencing the re-estimation step • The algorithm ends when the confusion matrix does not change between two successive iterations
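
The control flow described above might look like the sketch below. It is written under several assumptions that the slides do not confirm: the classifier and the IG filter are passed in as hypothetical callables (`train`, `classify`, `select_words`), the M step re-estimates the model from the newly labeled documents only, and the stopping test is read as "the labeling is identical in two successive iterations".

```python
def restrict(doc, vocabulary):
    """Keep only the terms of a term-count dict that belong to the vocabulary."""
    return {t: n for t, n in doc.items() if t in vocabulary}

def semi_supervised_em(seed_docs, seed_labels, unlabeled,
                       train, classify, select_words,
                       k1=100, k2=1000, max_iter=50):
    """EM loop with a feature selection filter before each (re-)estimation.

    train(docs, labels) -> model, classify(model, doc) -> label and
    select_words(docs, labels, k) -> set of words are supplied by the caller.
    """
    # IG1: pick the k1 most informative words from the few seed documents.
    vocab = select_words(seed_docs, seed_labels, k1)
    model = train([restrict(d, vocab) for d in seed_docs], seed_labels)
    labels = None
    for _ in range(max_iter):
        # E step: label the unlabeled documents with the current model.
        new_labels = [classify(model, restrict(d, vocab)) for d in unlabeled]
        if new_labels == labels:  # assumed stopping test: labeling is stable
            break
        labels = new_labels
        # IG2: re-select k2 words from the freshly labeled data, so that words
        # brought in by misclassified documents are filtered out.
        vocab = select_words(unlabeled, labels, k2)
        # M step: re-estimate the model on the filtered representation.
        model = train([restrict(d, vocab) for d in unlabeled], labels)
    return labels, model
```

The defaults `k1=100` and `k2=1000` mirror the values reported in the experiments; keeping the filter as a parameter makes it easy to compare the loop with and without feature selection.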

  12. Experimental results • Dataset: • We downloaded about 24,000 messages from English newsgroups • Three different groups • Auto • Hardware • Sport • We divided the dataset into 2 subsets • Initial repository: the pool from which the starting documents are picked • Unlabeled data: the documents to cluster

  13. Experimental results • We tested the algorithm with 4 different initial configurations: 1, 3, 5 and 7 starting documents randomly sampled from the initial dataset • All results are averaged over a ten-fold cross-validation • Baseline: • K-means on the unlabeled data, initialized with the initial dataset • Proposed algorithm • To speed up the clustering task, we ran the algorithm on a subset of the unlabeled data and then used the trained classifier to categorize the remaining unlabeled data • Two sizes for the small unlabeled subset: 100 and 300 documents

  14. Baseline experiment • K-means on the unlabeled dataset, initialized with 1, 3, 5 and 7 documents • The poor performance arises because no regularization can be applied in the k-means algorithm: assigning a document to the wrong cluster moves the centroids of the two clusters involved, which reinforces the wrong assignment

  15. Proposed algorithm: test 1 • Proposed algorithm • 1, 3, 5 and 7 documents to initialize the classifier • k1=100 and k2=1000 for the IG filters • 100 documents in the unlabeled dataset

  16. Proposed algorithm: test 2 • Proposed algorithm • 1, 3, 5 and 7 documents to initialize the classifier • k1=100 and k2=1000 for the IG filters • 300 documents in the unlabeled dataset

  17. Conclusions • We presented a semi-supervised version of the EM algorithm for document clustering • It uses a small initial amount of knowledge to guide the EM algorithm in forming the clusters • The system partitions a large collection of documents given a small initial amount of information about the clusters (for example, some keywords describing each cluster), and it shows quite good results • The main novelty is the use of a regularization step that exploits a feature selection technique inside the EM algorithm • With a different initialization technique that does not require the supervision of a human expert, the algorithm could be completely unsupervised
