Incorporating User Provided Constraints into Document Clustering


### Incorporating User Provided Constraints into Document Clustering

Yanhua Chen, Manjeet Rege, Ming Dong, Jing Hua, Farshad Fotouhi

Department of Computer Science

Wayne State University

Detroit, MI 48202

{chenyanh, rege, mdong, jinghua, fotouhi}@wayne.edu

Outline

- Introduction
- Overview of related work
- Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
- Theoretical result for SS-NMF
- Experiments and results
- Conclusion

What is clustering?

- Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
- Intra-cluster distances are minimized
- Inter-cluster distances are maximized

Document Clustering

- Grouping of text documents into meaningful clusters (for example, Science and Arts) in an unsupervised manner.

Semi-supervised clustering: problem definition

- Input:
- A set of unlabeled objects
- A small amount of domain knowledge (labels or pairwise constraints)
- Output:
- A partitioning of the objects into k clusters
- Objective:
- Maximum intra-cluster similarity
- Minimum inter-cluster similarity
- High consistency between the partitioning and the domain knowledge

Semi-Supervised Clustering

- Depending on the domain knowledge given:
- Users provide class labels (seeded points) a priori for some of the documents
- Users know which few documents are related (must-link) or unrelated (cannot-link)
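
The two kinds of domain knowledge are related: a handful of seeded labels implies a set of pairwise constraints. The following is a minimal sketch of that conversion, not something shown in the slides; the function name `constraints_from_seeds` and the example document ids are hypothetical.

```python
from itertools import combinations


def constraints_from_seeds(seed_labels):
    """Turn {doc_id: class_label} for seeded documents into pairwise constraints.

    Hypothetical helper: two seeds with the same label give a must-link pair,
    two seeds with different labels give a cannot-link pair.
    """
    must_link, cannot_link = [], []
    for (i, label_i), (j, label_j) in combinations(sorted(seed_labels.items()), 2):
        if label_i == label_j:
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link


# Three seeded documents out of a larger unlabeled collection.
seeds = {0: "science", 3: "science", 5: "arts"}
ml, cl = constraints_from_seeds(seeds)
print(ml)  # pairs that must share a cluster
print(cl)  # pairs that must be separated
```

The reverse direction does not hold in general: pairwise constraints say nothing about which class a cluster corresponds to, which is why the two settings are usually treated separately.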

Why semi-supervised clustering?

- Large amounts of unlabeled data exist
- More is being produced all the time
- Expensive to generate labels for data
- Usually requires human intervention
- Use human input to provide labels for some of the data
- Improve existing naive clustering methods
- Use labeled data to guide clustering of unlabeled data
- End result is a better clustering of data
- Potential applications
- Document/word categorization
- Image categorization
- Bioinformatics (gene/protein clustering)

Outline

- Introduction
- Overview of related work
- Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
- Theoretical work for SS-NMF
- Experiments and results
- Conclusion

Clustering Algorithm

- Document hierarchical clustering
- Bottom-up, agglomerative
- Top-down, divisive
- Document partitioning (flat clustering)
- K-means
- Probabilistic clustering using the Naïve Bayes or Gaussian mixture model, etc.
- Document clustering based on graph model

Semi-supervised Clustering Algorithm

- Semi-supervised Clustering with labels (partial label information is given):
- SS-Seeded-Kmeans (Sugato Basu, et al. ICML 2002)
- SS-Constraint-Kmeans (Sugato Basu, et al. ICML 2002)
- Semi-supervised Clustering with Constraints (pairwise constraints, must-link and cannot-link, are given):
- SS-COP-Kmeans (Wagstaff et al. ICML 2001)
- SS-HMRF-Kmeans (Sugato Basu, et al. ACM SIGKDD 2004)
- SS-Kernel-Kmeans (Brian Kulis, et al. ICML 2005)
- SS-Spectral-Normalized-Cuts (X. Ji, et al. ACM SIGIR 2006)

Overview of K-means Clustering

- K-means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into k clusters.
- Objective function: locally minimizes the sum of squared distances between the data points and their corresponding cluster centers:

$$J = \sum_{h=1}^{k} \sum_{x_i \in f_h} \| x_i - m_h \|^2$$

where $m_h$ is the center of cluster $f_h$.

Algorithm:

Initialize k cluster centers randomly. Repeat until convergence:

- Cluster Assignment Step: assign each data point $x_i$ to the cluster $f_h$ such that the distance of $x_i$ from the center of $f_h$ is minimum
- Center Re-estimation Step: re-estimate each cluster center as the mean of the points in that cluster
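
The two alternating steps can be sketched in a few lines of NumPy. This is an illustrative sketch under simple assumptions (Euclidean distance, centers initialized from randomly chosen data points, an empty cluster keeps its previous center); the function name `kmeans` and the toy data are not from the slides.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate cluster assignment and center re-estimation."""
    rng = np.random.default_rng(seed)
    # Initialize the k centers with k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Cluster assignment step: nearest center by squared Euclidean distance.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: mean of the points assigned to each cluster.
        new_centers = centers.copy()
        for h in range(k):
            members = X[labels == h]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[h] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged to a local minimum
        centers = new_centers
    return labels, centers


# Two well-separated groups of 2-D points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans(X, k=2)
print(labels)
```

Because the objective is only locally minimized, the result depends on the initialization; in practice the procedure is often restarted from several random seeds.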

Semi-supervised Kernel K-means (SS-KK) [Brian Kulis, et al. ICML 2005]

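- Semi-supervised Kernel K-means algorithm:

$$J_{SS\text{-}KK} = \sum_{h=1}^{k} \sum_{x_i \in f_h} \| \phi(x_i) - m_h \|^2 - \sum_{(x_i, x_j) \in C_{ML},\; l_i = l_j} w_{ij} + \sum_{(x_i, x_j) \in C_{CL},\; l_i = l_j} w_{ij}$$

where $\phi$ is the kernel mapping function, $m_h$ is the centroid of cluster $f_h$, $l_i$ is the cluster label of $x_i$, and $w_{ij}$ is the cost of violating the constraint between two points $x_i$ and $x_j$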

- First term: kernel k-means objective function
- Second term: reward function for satisfying must-link constraints
- Third term: penalty function for violating cannot-link constraints
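
To make the three terms concrete, the sketch below evaluates such an objective for a given labeling. It assumes one common form of the objective (kernel k-means distortion, minus a weight for every satisfied must-link pair, plus a weight for every cannot-link pair placed in the same cluster) and a single uniform weight `w`; the function name and the toy data are hypothetical.

```python
import numpy as np


def ss_kernel_kmeans_objective(K, labels, must_link, cannot_link, w=1.0):
    """Evaluate a three-term semi-supervised kernel k-means objective.

    K is the kernel (Gram) matrix. A uniform constraint weight w is an
    assumption made for this sketch.
    """
    labels = np.asarray(labels)
    # First term: sum of squared feature-space distances to the cluster
    # centroids, written using kernel entries only:
    #   sum_{i in c} K_ii - (1/|c|) * sum_{i,j in c} K_ij
    distortion = 0.0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        Kc = K[np.ix_(idx, idx)]
        distortion += np.trace(Kc) - Kc.sum() / len(idx)
    # Second term: reward for each must-link pair placed in the same cluster.
    reward = sum(w for i, j in must_link if labels[i] == labels[j])
    # Third term: penalty for each cannot-link pair placed in the same cluster.
    penalty = sum(w for i, j in cannot_link if labels[i] == labels[j])
    return distortion - reward + penalty


# Four 1-D points with a linear kernel K = X X^T.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
K = X @ X.T
must_link, cannot_link = [(0, 1)], [(1, 2)]
good = ss_kernel_kmeans_objective(K, [0, 0, 1, 1], must_link, cannot_link)
bad = ss_kernel_kmeans_objective(K, [0, 1, 1, 1], must_link, cannot_link)
print(good, bad)
```

The labeling that respects both constraints has the lower objective value, which is the behavior the reward and penalty terms are meant to produce.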

Overview of Spectral Clustering

- Spectral clustering is a graph-theoretic clustering algorithm

- Weighted graph G = (V, E, A)
- Minimize between-cluster similarities (weights: $A_{ij}$)

Spectral Normalized Cuts

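- Minimize the similarity between the two clusters
- Balance the cluster weights
- Encode the partition with a cluster indicator vector
- The graph partition becomes an optimization problem over the cluster indicator
- The solution is an eigenvector of the resulting eigenvalue problem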
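
For a two-way split, a standard relaxation of the normalized cut takes the eigenvector with the second-smallest eigenvalue of the generalized problem $(D - A) q = \lambda D q$, where $D$ is the diagonal degree matrix, and thresholds it. The sketch below follows that textbook formulation rather than the exact formulas of the slide; the function name and the toy graph are assumptions.

```python
import numpy as np


def normalized_cut_bipartition(A):
    """Two-way normalized cut via the second-smallest generalized eigenvector.

    Solves (D - A) q = lambda * D q through the symmetric normalized Laplacian
    D^{-1/2} (D - A) D^{-1/2}. Assumes A is symmetric with positive degrees.
    """
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigvecs = np.linalg.eigh(L_sym)
    # Map the second eigenvector back to the generalized problem: q = D^{-1/2} z.
    q = d_inv_sqrt * eigvecs[:, 1]
    # Threshold the relaxed indicator at zero to get the two clusters.
    return (q > 0).astype(int)


# Two triangles joined by one weak edge between nodes 2 and 3.
A = np.array([
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
])
parts = normalized_cut_bipartition(A)
print(parts)
```

Which triangle receives label 0 and which receives label 1 depends on the sign of the eigenvector, so only the grouping is meaningful.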

Semi-supervised Spectral Normalized Cuts (SS-SNC)[X. Ji, et al. ACM SIGIR 2006]

- Semi-supervised Spectral Learning algorithm, with a three-term objective function:

- First term: spectral normalized cut objective function
- Second term: reward function for satisfying must-link constraints
- Third term: penalty function for violating cannot-link constraints

Outline

- Introduction
- Related work
- Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
- NMF review
- Model formulation and algorithm derivation
- Theoretical result for SS-NMF
- Experiments and results
- Conclusion

Non-negative Matrix Factorization (NMF)

- NMF decomposes a matrix into two parts (D. Lee et al., Nature 1999):

$$X \approx F G^T, \qquad \min \| X - F G^T \|^2$$

- Symmetric NMF for clustering (C. Ding et al. SIAM ICDM 2005):

$$A \approx G S G^T, \qquad \min \| A - G S G^T \|^2$$
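
As an illustration of the first factorization, the sketch below runs the well-known multiplicative updates of Lee and Seung for the squared-error objective. The function name `nmf`, the small constant `eps` that guards against division by zero, and the toy matrix are assumptions of this sketch.

```python
import numpy as np


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate a non-negative X (m x n) as F @ G.T with F, G >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k)) + eps
    G = rng.random((n, k)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative and do not
        # increase the squared error ||X - F G^T||^2.
        F *= (X @ G) / (F @ (G.T @ G) + eps)
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)
    return F, G


# A small non-negative matrix with two obvious blocks.
X = np.array([
    [3.0, 3.0, 0.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 4.0, 4.0, 4.0],
])
F, G = nmf(X, k=2)
print(np.linalg.norm(X - F @ G.T))
```

The rows of `G` can be read as soft cluster memberships of the columns of `X`, which is what makes this kind of factorization usable for clustering.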

SS-NMF

- Incorporate prior knowledge into NMF based framework for document clustering.
- Users provide pairwise constraints:
- Must-link constraints $C_{ML}$: two documents $d_i$ and $d_j$ must belong to the same cluster.
- Cannot-link constraints $C_{CL}$: two documents $d_i$ and $d_j$ must belong to different clusters.

- Constraints are defined by the associated violation cost matrix W:
- $W_{reward}$: cost of violating the must-link constraint between documents $d_i$ and $d_j$, if such a constraint exists.
- $W_{penalty}$: cost of violating the cannot-link constraint between documents $d_i$ and $d_j$, if such a constraint exists.
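
One way to picture how the cost matrices could enter an NMF-based framework is sketched below. Everything beyond the two matrices themselves is an assumption of this sketch rather than the authors' stated algorithm: the constraints are folded into the similarity matrix as `A + W_reward - W_penalty` (clipped at zero to stay non-negative), a uniform weight `w` is used, and the factorization uses generic multiplicative updates for $\min \| \tilde{A} - G S G^T \|^2$. All function names are hypothetical.

```python
import numpy as np


def constraint_matrices(n, must_link, cannot_link, w=1.0):
    """Build symmetric cost matrices from pairwise constraints (uniform weight w)."""
    W_reward = np.zeros((n, n))
    W_penalty = np.zeros((n, n))
    for i, j in must_link:
        W_reward[i, j] = W_reward[j, i] = w
    for i, j in cannot_link:
        W_penalty[i, j] = W_penalty[j, i] = w
    return W_reward, W_penalty


def ss_nmf_sketch(A, must_link, cannot_link, k, w=1.0, n_iter=300, seed=0, eps=1e-9):
    """Factorize a constraint-adjusted similarity matrix as G S G^T (sketch)."""
    n = A.shape[0]
    W_reward, W_penalty = constraint_matrices(n, must_link, cannot_link, w)
    # Assumption: raise similarity for must-link pairs, lower it for
    # cannot-link pairs, and clip at zero so the matrix stays non-negative.
    A_tilde = np.clip(A + W_reward - W_penalty, 0.0, None)
    rng = np.random.default_rng(seed)
    G = rng.random((n, k)) + eps
    R = rng.random((k, k)) + eps
    S = (R + R.T) / 2.0  # symmetric start
    for _ in range(n_iter):
        # Generic multiplicative updates for min ||A_tilde - G S G^T||^2.
        S *= (G.T @ A_tilde @ G) / (G.T @ G @ S @ G.T @ G + eps)
        G *= (A_tilde @ G @ S) / (G @ S @ G.T @ G @ S + eps)
    return G, S


# Document-document similarities for four documents (made-up values).
A = np.array([
    [1.0, 0.4, 0.3, 0.1],
    [0.4, 1.0, 0.5, 0.2],
    [0.3, 0.5, 1.0, 0.6],
    [0.1, 0.2, 0.6, 1.0],
])
W_reward, W_penalty = constraint_matrices(4, [(0, 1)], [(1, 2)])
G, S = ss_nmf_sketch(A, must_link=[(0, 1)], cannot_link=[(1, 2)], k=2)
print(G.argmax(axis=1))  # hard cluster assignment read off the indicator G
```

Larger values of `w` make the constraints dominate the original similarities, so the weight acts as a knob for how much the user input is trusted.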

Outline

- Introduction
- Overview of related work
- Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
- Theoretical result for SS-NMF
- Experiments and results
- Conclusion

Algorithm Correctness and Convergence

- Using constrained optimization theory and the auxiliary function approach, we can prove the following for SS-NMF:
- 1. Correctness: the solution converges to a local minimum
- 2. Convergence: the iterative algorithm converges

(Details in papers [1], [2])

[1] Y. Chen, M. Rege, M. Dong and J. Hua, “Incorporating User provided Constraints into Document Clustering”, Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular Paper, acceptance rate 7.2%)

[2] Y. Chen, M. Rege, M. Dong and J. Hua, “Non-negative Matrix Factorization for Semi-supervised Data Clustering”, Journal of Knowledge and Information Systems, to appear, 2008.

SS-NMF: General Framework for Semi-supervised Clustering

Orthogonal symmetric semi-supervised NMF is equivalent to Semi-supervised Kernel K-means (SS-KK) and Semi-supervised Spectral Normalized Cuts (SS-SNC)!

Proof: (1), (2), (3)

Outline

- Introduction
- Overview of related work
- Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
- Theoretical result for SS-NMF
- Experiments and results
- Artificial Toy Data
- Real Data
- Conclusion

Experiments on Toy Data

1. Artificial toy data consisting of two natural clusters

Results on Toy Data (SS-KK and SS-NMF)

- Hard Clustering: each object belongs to a single cluster
- Soft Clustering: each object is probabilistically assigned to clusters

Table (right): Difference between the cluster indicator G of SS-KK (hard clustering) and SS-NMF (soft clustering) for the toy data
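
The distinction can be illustrated on a small non-negative indicator matrix. The values of `G` below are made up for illustration and are not the values from the table; normalizing each row to sum to one is one simple way to read the entries as membership degrees.

```python
import numpy as np

# Hypothetical non-negative cluster indicator: one row per object,
# one column per cluster.
G = np.array([
    [0.90, 0.10],
    [0.60, 0.20],
    [0.05, 0.70],
    [0.00, 0.40],
])

# Soft clustering: normalize each row so the entries act as membership degrees.
soft = G / G.sum(axis=1, keepdims=True)

# Hard clustering: keep only the cluster with the largest indicator value.
hard = G.argmax(axis=1)

print(soft)
print(hard)
```

A soft indicator carries more information than a hard one: the second object above is assigned to the first cluster in both views, but only the soft view shows that a quarter of its membership lies with the second cluster.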

Results on Toy Data (SS-SNC and SS-NMF)

(a) Data distribution in the SS-SNC subspace of the first two singular vectors. There is no relationship between the axes and the clusters.

(b) Data distribution in the SS-NMF subspace of two column vectors of G. The data points from the two clusters get distributed along the two axes.

Time Complexity Analysis

Figure (above): Computational speed comparison for SS-KK, SS-SNC and SS-NMF

Experiments on Text Data

2. Summary of data sets [1] used in the experiments.

[1] http://www.cs.umn.edu/~han/data/tmdata.tar.gz

- Evaluation Metric:

$$AC = \frac{1}{n} \sum_{i=1}^{n} \delta(y_i, \hat{y}_i)$$

where n is the total number of documents in the experiment, δ is the delta function that equals one if $\hat{y}_i = y_i$ and zero otherwise, $\hat{y}_i$ is the estimated label, and $y_i$ is the ground truth.
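
A small sketch of this accuracy measure follows. Cluster ids produced by a clustering algorithm are arbitrary, so the sketch first searches for the one-to-one mapping from cluster ids to class labels that gives the most matches; that mapping step is an assumption added here, and the brute-force search over permutations is only practical for a handful of clusters. The function name is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_labels):
    """Fraction of documents whose mapped cluster id equals the true label."""
    n = len(true_labels)
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    # Pad with None so that every cluster can be mapped even when there are
    # more clusters than classes (a None-mapped cluster never counts as correct).
    padded = classes + [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(padded, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(1 for y, c in zip(true_labels, cluster_labels)
                      if mapping[c] == y)
        best = max(best, correct)
    return best / n


# Cluster ids are swapped relative to the classes but the grouping is perfect.
print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))
# One document of class "b" ends up in the wrong cluster.
print(clustering_accuracy(["a", "a", "b", "b"], [0, 0, 0, 1]))
```

For larger numbers of clusters the same mapping is usually found with an assignment solver instead of enumerating permutations.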

Results on Text Data (Compare with Unsupervised Clustering)

- (1) Comparison with unsupervised clustering approaches:

Note: SS-NMF adds 3% constraints

Results on Text Data (Before Clustering and After Clustering)

(a) Typical document-document matrix before clustering

(b) Document-document similarity matrix after clustering with SS-NMF (k=2)

(c) Document-document similarity matrix after clustering with SS-NMF (k=5)

Results on Text Data (Clustering with Different Constraints)

Left Table:

Comparison of the confusion matrix C and the normalized cluster centroid matrix S of SS-NMF for different percentages of pairwise-constrained documents

Results on Text Data (Compare with Semi-supervised Clustering)

- (2) Comparison with SS-KK and SS-SNC

(a) Graft-Phos

(b) England-Heart

(c) Interest-Trade

Results on Text Data (Compare with Semi-supervised Clustering)

- Comparison with SS-KK and SS-SNC (Fbis2, Fbis3, Fbis4, Fbis5)

Experiments on Image Data

3. Image data sets [2] used in the experiments.

Figure (above): Sample images for image categorization (from top to bottom: O-Owls, R-Roses, L-Lions, E-Elephants, H-Horses)

[2] http://kdd.ics.uci.edu/databases/CorelFeatures/CorelFeatures.data.html

Results on Image Data (Compare with Unsupervised Clustering)

- (1) Comparison with unsupervised clustering approaches:

Table (above): Comparison of image clustering accuracy between KK, SNC, NMF and SS-NMF with only 3% pairwise constraints on the images.

It shows that SS-NMF consistently outperforms other well-established unsupervised image clustering methods.

Results on Image Data (Compare with Semi-supervised Clustering)

- (2) Comparison with SS-KK and SS-SNC:

Left Figure:

Comparison of image clustering accuracy between SS-KK, SS-SNC, and SS-NMF for different percentages of image pairs constrained: (a) O-R, (b) L-H, (c) R-L, (d) O-R-L.

Results on Image Data (Compare with Semi-supervised Clustering)

- (2) Comparison with SS-KK and SS-SNC:

Left Figure:

Comparison of image clustering accuracy between SS-KK, SS-SNC, and SS-NMF for different percentages of image pairs constrained: (e) L-E-H, (f) O-R-L-E, (g) O-L-E-H, (h) O-R-L-E-H.

Outline

- Introduction
- Related work
- Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
- Theoretical result for SS-NMF
- Experiments and results
- Conclusion

Conclusion

- Semi-supervised clustering:
  - has many real-world applications
  - outperforms traditional clustering algorithms
- The semi-supervised NMF algorithm provides a unified mathematical framework for semi-supervised clustering.
- Many existing semi-supervised clustering algorithms can be extended to multi-type object co-clustering tasks.

Reference

[1] Y. Chen, M. Rege, M. Dong and F. Fotouhi, “Deriving Semantics for Image Clustering from Accumulated User Feedbacks”, Proc. of ACM Multimedia, Germany, 2007.

[2] Y. Chen, M. Rege, M. Dong and J. Hua, “Incorporating User provided Constraints into Document Clustering”, Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular Paper, acceptance rate 7.2%)

[3] Y. Chen, M. Rege, M. Dong and J. Hua, “Non-negative Matrix Factorization for Semi-supervised Data Clustering”, Journal of Knowledge and Information Systems, invited as a best paper of ICDM 07, to appear 2008.
