SE Dept. • Adaptive Spectral Clustering of Text Information • Students: Amit Sharabi, Irena Gorlik • Supervisors: Dr. Orly Yahalom, Prof. Zeev Volkovich • Date: January 2012
Concise description • In our project we implement clustering of text information using the spectral clustering approach. • The project is based on the NGW (Ng, Jordan and Weiss) spectral clustering algorithm. • By using Brent’s method to modify the NGW algorithm, we obtain better clustering results.
In this presentation • Project description • Intro • Short reminder • Bag of words method • Brent’s method • Clustering quality function • NGW Algorithm – in detail • The Software Engineering (SE) design • Activity diagram • GUI • Results and conclusions • Algorithm execution results • Conclusions based on the results
Intro • As there is no exact definition of what a good clustering is, a clustering algorithm that yields good results for one dataset may not fit another one. • Our purpose is to calibrate the NGW algorithm by fine tuning it, i.e., finding the optimum scaling factor σ.
Short Reminder - Bag of words method • Our dataset will be derived from the text using the Bag of words method. • In this method, a text is represented as an unordered collection of words, disregarding grammar and even word order.
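The bag-of-words representation can be sketched in a few lines; the function name and tokenization rule below are our own illustration, not part of the project code:

```python
from collections import Counter
import re

def bag_of_words(text):
    """Represent text as an unordered collection of word counts,
    discarding grammar and word order (illustrative sketch)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

bow = bag_of_words("The dog chased the cat.")
# "the" appears twice; the original word order is gone.
```

Each book part then becomes a vector of word frequencies, which is the dataset the clustering operates on.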
Brent’s method • Brent's method is a numerical optimization algorithm which combines the inverse parabolic interpolation and the golden section search.
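As a sketch of one of Brent's two ingredients, here is the golden-section component on its own (a minimal illustration, not the project's implementation; full Brent's method adds the inverse parabolic interpolation step for faster convergence):

```python
import math

def golden_section_min(f, a, b, tol=1e-8):
    """Golden-section search: shrink the bracket [a, b] around the
    minimum of a unimodal function f (the bracketing half of Brent)."""
    invphi = (math.sqrt(5) - 1) / 2  # 1/phi ≈ 0.618
    while abs(b - a) > tol:
        c = b - invphi * (b - a)  # interior probe points
        d = a + invphi * (b - a)
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2

# the minimum of (x - 2)^2 on [0, 5] is at x = 2
x_min = golden_section_min(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
```

In the project, the function being minimized is the clustering quality as a function of the scaling factor σ.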
Clustering quality function • The quality function is defined over two sets of point pairs: the pairs of points lying in the same cluster, and the pairs of points lying in different clusters. • We attempt to find the minimum value of this function by using Brent’s method.
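The slide does not show the exact quality function, but a common choice built from those two sets of pairs is the ratio of mean within-cluster distance to mean between-cluster distance (lower is better); the sketch below is our hypothetical stand-in, not the project's actual function:

```python
import numpy as np

def cluster_quality(points, labels):
    """Hypothetical quality function: mean pairwise distance inside
    clusters divided by mean pairwise distance across clusters.
    Lower values indicate tighter, better-separated clusters."""
    same, diff = [], []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(points[i] - points[j])
            (same if labels[i] == labels[j] else diff).append(d)
    return np.mean(same) / np.mean(diff)

pts = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 0.0], [5.0, 0.1]])
good = cluster_quality(pts, [0, 0, 1, 1])  # matches the true grouping
bad = cluster_quality(pts, [0, 1, 0, 1])   # splits each tight pair
```

A correct grouping yields a much smaller value than a wrong one, which is what makes minimizing such a function over σ meaningful.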
The scaling factor • The scaling factor σ controls how rapidly the affinity A_ij falls off with the distance between the points s_i and s_j. • The NGW algorithm does not specify a value for σ.
NGW algorithm • This is a spectral clustering algorithm that clusters points using eigenvectors of matrices derived from the data. • The algorithm steps: • Given a set of points S = {s_1, …, s_n}. • Form the affinity matrix A, with A_ij = exp(−‖s_i − s_j‖² / 2σ²) for i ≠ j, and A_ii = 0. • Define the diagonal matrix D with D_ii = Σ_j A_ij, and form the matrix L = D^(−1/2) A D^(−1/2). • Stack the k largest eigenvectors of L to form the columns of the new matrix X. • Form matrix Y by renormalizing each of X’s rows to have unit length. • Cluster the rows of Y as points in R^k with k-means or PAM. • Assign the original point s_i to cluster j if and only if row i of Y was assigned to cluster j.
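The steps above can be sketched with NumPy; this is a minimal illustration under our own choices (a simple farthest-point-initialized k-means instead of the project's k-means/PAM options, and synthetic 2-D points instead of text vectors):

```python
import numpy as np

def ngw_spectral_clustering(S, k, sigma, n_iter=50):
    """Sketch of the NGW steps: affinity, normalization, spectrum, k-means."""
    # Step 1: affinity matrix A with zero diagonal.
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Step 2: L = D^(-1/2) A D^(-1/2), D diagonal with row sums of A.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Step 3: k largest eigenvectors of L as columns of X.
    w, V = np.linalg.eigh(L)
    X = V[:, np.argsort(w)[-k:]]
    # Step 4: renormalize rows of X to unit length -> Y.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Step 5: cluster rows of Y (farthest-point init, then Lloyd updates).
    centers = [Y[0]]
    for _ in range(1, k):
        d2 = np.min([((Y - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((Y[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Y[labels == j].mean(axis=0)
    return labels

# Two well-separated blobs of 10 points each; each blob should become one cluster.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(3.0, 0.1, (10, 2))])
labels = ngw_spectral_clustering(pts, k=2, sigma=1.0)
```

Each original point inherits the cluster of its row in Y, exactly as in the final step of the algorithm.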
Flowchart: Data → Affinity matrix → Eigenvectors (spectrum) → Clustering, with Brent’s method minimizing f(σ) over this pipeline.
Execution 1: • Input Books: • The New Testament. • Harry Potter and the Goblet of Fire. • Harry Potter and the Sorcerer’s Stone. • The Dead Zone – Stephen King. • Each book divided to: 10 parts • Input parameters: • Clustering Algorithm: K-Means • Number of Clusters: 3 • Number of Runs: 1 • Brent’s Method Tolerance: 1e-6 • Results: • Cramer’s V: 0.707 • Scaling Factor: 0.1
Execution 2: • Input Books: • The New Testament. • Harry Potter and the Goblet of Fire. • Harry Potter and the Sorcerer’s Stone • The Dead Zone – Stephen King • Each book divided to: 10 parts • Input parameters: • Clustering Algorithm: K-Means • Number of Clusters: 3 • Number of Runs: 10 • Brent’s Method Tolerance: 1e-6 • Results: • Cramer’s V: 1 • Scaling Factor: 0.182
Execution 3: • Input Books: • Harry Potter and the Chamber of Secrets. • Harry Potter and the Deathly Hallows. • Each Book Divided to: 10 parts • Input parameters: • Clustering Algorithm: PAM • Number of Clusters: 2 • Number of Runs: 10 • Brent’s Method Tolerance: 1e-6 • Results: • Cramer’s V: 0.905 • Scaling Factor: 0.113
Execution 4: • Input Books: • Harry Potter and the Goblet of Fire • Harry Potter and the Chamber of Secrets • The Stars, Like Dust – Isaac Asimov • The Dead Zone – Stephen King • Each Book Divided to: 10 parts • Input parameters: • Clustering Algorithm: PAM • Number of Clusters: 4 • Number of Runs: 10 • Brent’s Method Tolerance: 1e-6 • Results: • Cramer’s V: 0.825 • Scaling Factor: 0.136
Conclusion: • Our experiments led us to the following conclusions: • The algorithm shows quite accurate and reliable results. • The proposed learning process significantly improves the quality of the results. • During the experiments we found that dividing each book into ten parts gave the best results.
Conclusion (Cont.): • The algorithm can distinguish between different books written by the same author. • To avoid numerical instabilities, we need to normalize the dataset. • The PAM algorithm is more stable, whereas the K-means algorithm sometimes converges to a local optimum.
Final Remark • In this project we present an algorithm capable of producing optimal clustering results for a given dataset by tuning the scaling factor σ of the algorithm. • The uniqueness of our approach is its ability to adapt itself to the dataset.