Students: Amit Sharabi Irena Gorlik

# Students: Amit Sharabi Irena Gorlik - PowerPoint PPT Presentation

##### Presentation Transcript

1. SE Dept. • Adaptive Spectral Clustering of Text Information • Students: Amit Sharabi, Irena Gorlik • Supervisors: Dr. Orly Yahalom, Prof. Zeev Volkovich • Date: January 2012

2. Concise description • In our project we implement clustering of text information using the spectral clustering approach. • The project is based on the NGW (Ng, Jordan and Weiss) spectral clustering algorithm. • By using Brent’s method to tune the NGW algorithm, we obtain better clustering results.

3. In this presentation • Project description • Intro • Short reminder • Bag of words method • Brent’s method • Clustering quality function • NGW algorithm in detail • The Software Engineering (SE) design • Activity diagram • GUI • Results and conclusions • Algorithm execution results • Conclusions based on the results

4. Intro • As there is no exact definition of what a good clustering is, a clustering algorithm that yields good results for one dataset may not fit another one. • Our purpose is to calibrate the NGW algorithm by finding the optimal scaling factor $\sigma$.

5. Short Reminder - Bag of words method • Our dataset is derived from the text using the bag-of-words method. • In this method, a text is represented as an unordered collection (multiset) of its words, disregarding grammar and word order.
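As a minimal sketch of the bag-of-words representation described above, the following turns a few texts into word-count vectors over a shared vocabulary. The function names and the tokenization rule (lowercased runs of letters and apostrophes) are assumptions made for illustration; the slides do not say how the project tokenizes its input.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    # Assumed tokenization: runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def to_vectors(texts):
    """Turn texts into count vectors over a shared, sorted vocabulary."""
    bags = [bag_of_words(t) for t in texts]
    vocabulary = sorted(set().union(*bags))
    # A Counter returns 0 for words that a text does not contain.
    vectors = [[bag[w] for w in vocabulary] for bag in bags]
    return vocabulary, vectors


vocabulary, vectors = to_vectors(["the cat sat", "the cat saw the dog"])
print(vocabulary)
print(vectors)
```

Each part of a book would become one such vector, so word order inside a part has no effect on the resulting data point.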

6. Brent’s method • Brent's method is a numerical optimization algorithm that combines inverse parabolic interpolation with golden-section search.
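To make one of the two ingredients concrete, here is a sketch of plain golden-section search for a one-dimensional minimum. It is not Brent's method itself: Brent's method additionally tries parabolic interpolation steps and falls back to golden-section steps when those are not acceptable. The function name, the test function and the bracket `[0, 1]` are illustrative assumptions.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # 1/phi, about 0.618


def golden_section_minimize(f, a, b, tol=1e-6):
    """Minimize a unimodal function f on [a, b] by golden-section search."""
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = f(c), f(d)
    while (b - a) > tol:
        if fc < fd:
            # The minimum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = f(c)
        else:
            # The minimum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = f(d)
    return (a + b) / 2


# Hypothetical objective with its minimum at x = 0.3.
x_min = golden_section_minimize(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
print(x_min)
```

Only one new function evaluation is needed per iteration, which matters when each evaluation means running a full clustering.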

7. Clustering quality function • The quality function is defined in terms of two sets: the set of pairs of points lying in the same cluster and the set of pairs of points lying in different clusters. • We attempt to find the minimum value of the function by using Brent’s method.

8. The scaling factor • The scaling factor $\sigma$ controls how rapidly the affinity $A_{ij}$ falls off with the distance between $s_i$ and $s_j$. • The NGW algorithm does not specify a value for $\sigma$.
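The effect of the scaling factor can be seen in a small sketch. It assumes the standard Gaussian affinity of the Ng, Jordan and Weiss formulation, $A_{ij} = \exp(-\|s_i - s_j\|^2 / (2\sigma^2))$ with a zero diagonal; the function name and the sample points are made up for illustration.

```python
import math


def affinity_matrix(points, sigma):
    """Gaussian affinities exp(-||s_i - s_j||^2 / (2 sigma^2)), zero diagonal."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((p - q) ** 2 for p, q in zip(points[i], points[j]))
            A[i][j] = A[j][i] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return A


# Two nearby points and one distant point (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
for sigma in (0.1, 0.5, 2.0):
    A = affinity_matrix(points, sigma)
    print(sigma, A[0][1], A[0][2])
```

With a small scaling factor only very close points keep a noticeable affinity, while a large one makes all points look similar, which is why its value has to be fitted to the dataset.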

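9. NGW algorithm • This is a spectral clustering algorithm that clusters points using eigenvectors of matrices derived from the data. • The algorithm steps: • Given a set of points $S = \{s_1, \dots, s_n\}$ in $\mathbb{R}^l$ to be clustered into $k$ subsets: • Form the affinity matrix $A \in \mathbb{R}^{n \times n}$, $A_{ij} = \exp(-\|s_i - s_j\|^2 / (2\sigma^2))$ for $i \neq j$, $A_{ii} = 0$. • Define the diagonal matrix $D$ with $D_{ii} = \sum_j A_{ij}$ and form the matrix $L = D^{-1/2} A D^{-1/2}$. • Stack the eigenvectors of $L$ that belong to its $k$ largest eigenvalues to form the columns of the new matrix $X$. • Form the matrix $Y$ by renormalizing each of $X$’s rows to unit length, $Y_{ij} = X_{ij} / (\sum_j X_{ij}^2)^{1/2}$. • Cluster the rows of $Y$, as points in $\mathbb{R}^k$, with k-means or PAM. • Assign the original point $s_i$ to cluster $j$ if and only if row $i$ of $Y$ was assigned to cluster $j$.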
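The steps up to the final clustering can be sketched with NumPy as follows. This is an illustration of the standard Ng, Jordan and Weiss formulation, not the project's own code; the function name `ngw_embedding`, the sample points and the value of `sigma` are assumptions. The last step, running k-means or PAM on the rows of `Y`, is left to any clustering routine.

```python
import numpy as np


def ngw_embedding(points, k, sigma):
    """Return the row-normalized spectral embedding Y (n x k) of the points."""
    S = np.asarray(points, dtype=float)
    # Pairwise squared Euclidean distances.
    sq_dists = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    # Gaussian affinity matrix with a zero diagonal.
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # L = D^{-1/2} A D^{-1/2}, where D holds the row sums of A.
    # Assumes every point has some nonzero affinity (no zero row sums).
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the k largest.
    _, eigenvectors = np.linalg.eigh(L)
    X = eigenvectors[:, -k:]
    # Renormalize each row of X to unit length.
    return X / np.linalg.norm(X, axis=1, keepdims=True)


# Two well-separated groups of three points each (hypothetical data).
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
Y = ngw_embedding(points, k=2, sigma=0.5)
print(np.round(Y @ Y.T, 3))
```

For well-separated groups, rows of `Y` from the same group point in almost the same direction and rows from different groups are almost orthogonal, which is what makes the final k-means or PAM step easy.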

10. Flowchart • Data • Affinity matrix • Brent’s method for f() • Clustering • Eigenvectors • Spectrum

11. Preliminary SE documents

12. Activity diagram

13. GUI – Main Menu Tab

14. GUI – Cluster Results Tab

15. Results and Conclusions

16. Execution 1: • Input books: • The New Testament • Harry Potter and the Goblet of Fire • Harry Potter and the Sorcerer’s Stone • The Dead Zone – Stephen King • Each book divided into 10 parts • Input parameters: • Clustering Algorithm: K-Means • Number of Clusters: 3 • Number of Runs: 1 • Brent’s Method Tolerance: 1e-6 • Results: • Cramer’s V: 0.707 • Scaling Factor: 0.1

17. Execution 2: • Input books: • The New Testament • Harry Potter and the Goblet of Fire • Harry Potter and the Sorcerer’s Stone • The Dead Zone – Stephen King • Each book divided into 10 parts • Input parameters: • Clustering Algorithm: K-Means • Number of Clusters: 3 • Number of Runs: 10 • Brent’s Method Tolerance: 1e-6 • Results: • Cramer’s V: 1 • Scaling Factor: 0.182

18. Execution 3: • Input books: • Harry Potter and the Chamber of Secrets • Harry Potter and the Deathly Hallows • Each book divided into 10 parts • Input parameters: • Clustering Algorithm: PAM • Number of Clusters: 2 • Number of Runs: 10 • Brent’s Method Tolerance: 1e-6 • Results: • Cramer’s V: 0.905 • Scaling Factor: 0.113

19. Execution 4: • Input books: • Harry Potter and the Goblet of Fire • Harry Potter and the Chamber of Secrets • The Stars, Like Dust – Isaac Asimov • The Dead Zone – Stephen King • Each book divided into 10 parts • Input parameters: • Clustering Algorithm: PAM • Number of Clusters: 4 • Number of Runs: 10 • Brent’s Method Tolerance: 1e-6 • Results: • Cramer’s V: 0.825 • Scaling Factor: 0.136
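The executions above report Cramér's V as the quality score. As a reminder of what that number measures, here is a sketch that computes it from a contingency table, assuming rows are the true books and columns are the assigned clusters; the slides do not state how the project builds its table, and the two tables below are made-up examples rather than the project's data.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts.

    Assumes every row and every column has a nonzero total.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    smaller_dim = min(len(row_totals), len(col_totals))
    return math.sqrt(chi2 / (n * (smaller_dim - 1)))


# Every part of each book lands in its own cluster (hypothetical table).
perfect = [[10, 0], [0, 10]]
# Parts of each book are spread evenly over both clusters (hypothetical table).
unrelated = [[5, 5], [5, 5]]
print(cramers_v(perfect), cramers_v(unrelated))
```

A value of 1 means the cluster assignment is fully determined by the book, and a value of 0 means the two are unrelated.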

20. Conclusion: • Our experiments led us to the following conclusions: • The algorithm shows quite accurate and reliable results. • The proposed learning process significantly influences and improves the quality of results. • During the experiments we concluded that the best division of a book is into ten parts.

21. Conclusion (Cont.): • The algorithm can distinguish between different books written by the same author. • To avoid numerical instabilities we need to normalize the dataset. • The PAM algorithm is more stable, whereas the K-means algorithm sometimes converges to a local optimum.

22. Final Remark • In this project we present an algorithm capable of producing optimal clustering results for a given dataset by tuning the scaling factor of the algorithm. • The uniqueness of our approach is its ability to adapt to the dataset.

23. Questions?