Clustering tagged documents with labeled and unlabeled documents

Presenter: Jian-Ren Chen
Authors: Chien-Liang Liu*, Wen-Hoar Hsaio, Chia-Hoang Lee, Chun-Hsien Chen
2013, IPM

Outlines • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments

Motivation • Tags can provide semantic information about resources, and they can help machines perform classification or clustering tasks accurately. • Probabilistic latent semantic analysis (PLSA) - aspect model - statistical clustering model

Objectives • This study employs Constrained-PLSA to cluster tagged documents with a small number of seeds (labeled documents). • Constrained-PLSA is based on the statistical clustering model rather than the aspect model.

Methodology - PLSA • E-step • M-step (figure labels: terms (keywords) of the document collection; documents)
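The E-step and M-step equations on this slide did not survive extraction. As a rough illustration only, the sketch below runs EM for the textbook aspect-model form of PLSA on a documents x terms count matrix. The function name `plsa`, its parameters, and the random initialisation are assumptions made for this example and are not taken from the slides or the paper.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit aspect-model PLSA by EM on a (docs x terms) count matrix (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = counts.shape
    eps = 1e-12  # guards against division by zero

    # Random, row-normalised starting points for P(z|d) and P(w|z).
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_terms))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]  # (docs, topics, terms)
        post /= post.sum(axis=1, keepdims=True) + eps

        # M-step: re-estimate both tables from expected counts n(d,w) * P(z|d,w).
        expected = counts[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

    return p_z_d, p_w_z
```

Each row of the returned `p_z_d` is a distribution over latent topics for one document, so `p_z_d.argmax(axis=1)` can serve as a hard cluster assignment. The dense (docs, topics, terms) array keeps the sketch short but would not scale to a large collection.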

Methodology - Constrained-PLSA • E-step • M-step
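The update rules for Constrained-PLSA are not recoverable from this slide. The sketch below shows one common way to use seeds in EM for a statistical clustering model (a mixture of multinomials in which each document belongs to a single cluster): the posterior of every labeled document is clamped to its known cluster after each E-step. This is an assumption about the general idea, not the paper's exact formulation. The name `seeded_mixture_em`, the `seeds` dictionary, and the smoothing constant `alpha` are hypothetical.

```python
import numpy as np

def seeded_mixture_em(counts, n_clusters, seeds, n_iter=30, alpha=1.0):
    """EM for a multinomial mixture; `seeds` maps doc index -> known cluster (sketch)."""
    n_docs, n_terms = counts.shape
    one_hot = np.eye(n_clusters)

    # Start from uniform responsibilities, then clamp the labeled documents.
    resp = np.full((n_docs, n_clusters), 1.0 / n_clusters)
    for d, z in seeds.items():
        resp[d] = one_hot[z]

    for _ in range(n_iter):
        # M-step: cluster priors P(z) and smoothed term distributions P(w|z).
        p_z = resp.sum(axis=0) / n_docs
        term_mass = resp.T @ counts + alpha  # (clusters, terms)
        p_w_z = term_mass / term_mass.sum(axis=1, keepdims=True)

        # E-step in log space: log P(z) + sum_w n(d,w) * log P(w|z).
        log_post = np.log(p_z + 1e-12) + counts @ np.log(p_w_z).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

        # Constraint: labeled documents keep their given cluster.
        for d, z in seeds.items():
            resp[d] = one_hot[z]

    return resp.argmax(axis=1), resp
```

A call such as `seeded_mixture_em(counts, 2, seeds={0: 0, 3: 1})` treats documents 0 and 3 as the labeled seeds. Because the seeds anchor each cluster to a known label, the returned cluster indices line up with the label indices, which plain unsupervised EM does not guarantee.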

Experiments - Data set A (CiteULike)

Experiments (Data set A)

Experiments - Data set B (CiteULike)

Experiments (Data set B)

Conclusions • The performance of the "tags as words" representation scheme is more stable than that of the "words + tags" representation scheme. • Unsupervised learning methods fail to function properly on the data set with noisy information, but Constrained-PLSA functions properly and remains stable even though only a small amount of labeled data is available.

Comments • Advantages - Constrained-PLSA outperforms the other methods • Disadvantages - too much manual processing in the experiments • Applications - text mining - tagged document clustering
