Clustering tagged documents with labeled and unlabeled documents Presenter: Jian-Ren Chen Authors: Chien-Liang Liu*, Wen-Hoar Hsaio, Chia-Hoang Lee, Chun-Hsien Chen 2013, IPM
Outlines • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments
Motivation Tags can provide semantic information about resources, and they can help machines perform classification or clustering tasks accurately. Probabilistic latent semantic analysis (PLSA) - aspect model - statistical clustering model
Objectives • This study employs Constrained-PLSA to cluster tagged documents using a small number of seeds (labeled documents). • Constrained-PLSA is based on the statistical clustering model rather than the aspect model.
Methodology - PLSA • E-step • M-step • Terms (keywords) of the document collection • Documents
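The update equations for this slide did not survive extraction, so the following is only a minimal sketch of the textbook aspect-model PLSA, not necessarily the exact formulation used by the authors. It assumes a document-by-term count matrix and alternates the E-step, P(z | d, w) ∝ P(z | d) P(w | z), with the M-step that re-estimates P(w | z) and P(z | d) from the expected counts. The function name `plsa` and its parameters are hypothetical.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Illustrative aspect-model PLSA fitted with EM.

    counts: (n_docs, n_terms) array of term counts n(d, w).
    Returns P(z | d) with shape (n_docs, n_topics) and
    P(w | z) with shape (n_topics, n_terms).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = counts.shape

    # Random starting parameters, normalised so every row is a distribution.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_terms))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]  # (docs, topics, terms)
        post /= post.sum(axis=1, keepdims=True) + 1e-12

        # Expected counts n(d, w) * P(z | d, w).
        weighted = counts[:, None, :] * post

        # M-step: re-estimate P(w | z) and P(z | d) from the expected counts.
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12

    return p_z_d, p_w_z

# Toy collection: six documents over a four-term vocabulary.
toy_counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [4, 1, 0, 0],
    [0, 0, 2, 3],
    [0, 0, 3, 2],
    [0, 0, 1, 4],
])
doc_topics, topic_terms = plsa(toy_counts, n_topics=2)
```

In the aspect model every document receives a mixture over latent topics, so a hard clustering has to be read off afterwards, for example by taking the most probable topic per document.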
Methodology - Constrained-PLSA • E-step • M-step
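The Constrained-PLSA equations are likewise missing from the slide, so the sketch below should be read as an assumption about the general idea rather than the authors' algorithm. It uses a statistical clustering model (a mixture of multinomials in which each document belongs to exactly one cluster) and treats the seeds as constraints by clamping the cluster posterior of every labeled document to its known label in each E-step. The name `constrained_mixture`, the `seeds` dictionary, and the smoothing constant `alpha` are all hypothetical.

```python
import numpy as np

def constrained_mixture(counts, n_clusters, seeds, n_iter=50, alpha=0.01):
    """Illustrative seed-constrained mixture of multinomials fitted with EM.

    counts: (n_docs, n_terms) array of term (or tag) counts.
    seeds:  dict mapping a document index to its known cluster label.
    Returns hard labels and the posterior P(c | d) for every document.
    """
    n_docs, n_terms = counts.shape
    seed_docs = np.array(list(seeds.keys()), dtype=int)
    seed_labels = np.array(list(seeds.values()), dtype=int)

    def clamp(resp):
        # Constraint: a labeled document stays in its given cluster.
        resp[seed_docs] = 0.0
        resp[seed_docs, seed_labels] = 1.0
        return resp

    # Unlabeled documents start uniform; the seeds break the symmetry.
    resp = clamp(np.full((n_docs, n_clusters), 1.0 / n_clusters))

    for _ in range(n_iter):
        # M-step: cluster priors P(c) and smoothed term distributions P(w | c).
        p_c = resp.sum(axis=0) / n_docs
        p_w_c = resp.T @ counts + alpha
        p_w_c /= p_w_c.sum(axis=1, keepdims=True)

        # E-step in log space: log P(c) + sum_w n(d, w) * log P(w | c).
        log_post = np.log(p_c + 1e-12) + counts @ np.log(p_w_c).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        resp = clamp(resp)

    return resp.argmax(axis=1), resp

# Toy collection with one seed document per cluster.
toy_counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [4, 1, 0, 0],
    [0, 0, 2, 3],
    [0, 0, 3, 2],
    [0, 0, 1, 4],
])
labels, posterior = constrained_mixture(toy_counts, n_clusters=2, seeds={0: 0, 3: 1})
```

Starting the unlabeled documents from a uniform posterior keeps the sketch deterministic, but it also means that without any seeds the clusters would never separate; a real implementation would need some other initialisation in that case.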
Conclusions • The "tags as words" representation scheme performs more stably than the "words + tags" representation scheme. • Unsupervised learning methods fail to function properly on data sets with noisy information, whereas Constrained-PLSA functions properly and remains stable even when only a small amount of labeled data is available.
Comments • Advantage - Constrained-PLSA outperforms the other methods • Disadvantage - the experiments involve too much manual processing • Applications • text mining • tagged document clustering