SE Dept. • Adaptive Spectral Clustering of Text Information • Students: Amit Sharabi, Irena Gorlik • Supervisors: Dr. Orly Yahalom, Prof. Zeev Volkovich • Date: January 2012
Concise description • In our project we implement clustering of text information using the spectral clustering approach. • The project is based on the NGW (Ng, Jordan and Weiss) spectral clustering algorithm. • By using Brent’s method to modify the NGW algorithm, we obtain better clustering results.
In this presentation • Project description • Intro • Short reminder • Bag of words method • Brent’s method • Clustering quality function • NGW Algorithm – in detail • The Software Engineering (SE) design • Activity diagram • GUI • Results and conclusions • Algorithm execution results • Conclusions based on the results
Intro • As there is no exact definition of what a good clustering is, a clustering algorithm that yields good results for one dataset may not fit another one. • Our purpose is to calibrate the NGW algorithm by fine tuning it, i.e., finding the optimum scaling factor σ.
Short Reminder - Bag of words method • Our dataset will be derived from the text using the Bag of words method. • In this method, a text is represented as an unordered collection of words, disregarding grammar and even word order.
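The bag-of-words representation can be sketched in a few lines; the function name and tokenization rule below are our own illustration, not part of the project code:

```python
from collections import Counter
import re

def bag_of_words(text):
    """Represent text as an unordered collection of word counts,
    discarding grammar and word order (illustrative sketch)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

bow = bag_of_words("The dog chased the cat.")
# "the" appears twice; the original word order is gone.
```

Each book part then becomes a vector of word frequencies, which is the dataset the clustering operates on.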
Brent’s method • Brent's method is a numerical optimization algorithm which combines the inverse parabolic interpolation and the golden section search.
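As a sketch of one of Brent's two ingredients, here is the golden-section component on its own (a minimal illustration, not the project's implementation; full Brent's method adds the inverse parabolic interpolation step for faster convergence):

```python
import math

def golden_section_min(f, a, b, tol=1e-8):
    """Golden-section search: shrink the bracket [a, b] around the
    minimum of a unimodal function f (the bracketing half of Brent)."""
    invphi = (math.sqrt(5) - 1) / 2  # 1/phi ≈ 0.618
    while abs(b - a) > tol:
        c = b - invphi * (b - a)  # interior probe points
        d = a + invphi * (b - a)
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2

# the minimum of (x - 2)^2 on [0, 5] is at x = 2
x_min = golden_section_min(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
```

In the project, the function being minimized is the clustering quality as a function of the scaling factor σ.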
Clustering quality function • The quality function is defined over two sets of point pairs: the pairs of points lying in the same cluster, and the pairs of points lying in different clusters. • We attempt to find the minimum value of this function by using Brent’s method.
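The slide does not show the exact quality function, but a common choice built from those two sets of pairs is the ratio of mean within-cluster distance to mean between-cluster distance (lower is better); the sketch below is our hypothetical stand-in, not the project's actual function:

```python
import numpy as np

def cluster_quality(points, labels):
    """Hypothetical quality function: mean pairwise distance inside
    clusters divided by mean pairwise distance across clusters.
    Lower values indicate tighter, better-separated clusters."""
    same, diff = [], []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(points[i] - points[j])
            (same if labels[i] == labels[j] else diff).append(d)
    return np.mean(same) / np.mean(diff)

pts = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 0.0], [5.0, 0.1]])
good = cluster_quality(pts, [0, 0, 1, 1])  # matches the true grouping
bad = cluster_quality(pts, [0, 1, 0, 1])   # splits each tight pair
```

A correct grouping yields a much smaller value than a wrong one, which is what makes minimizing such a function over σ meaningful.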
The scaling factor • The scaling factor σ controls how rapidly the affinity A_ij falls off with the distance between the points s_i and s_j. • The NGW algorithm does not specify a value for σ.
NGW algorithm • This is a spectral clustering algorithm that clusters points using eigenvectors of matrices derived from the data. • The algorithm steps: • Given a set of points S = {s_1, …, s_n}. • Form the affinity matrix A, with A_ij = exp(−‖s_i − s_j‖² / 2σ²) for i ≠ j, and A_ii = 0. • Define the diagonal matrix D with D_ii = Σ_j A_ij, and form the matrix L = D^(−1/2) A D^(−1/2). • Stack the k largest eigenvectors of L to form the columns of the new matrix X. • Form matrix Y by renormalizing each of X’s rows to have unit length. • Cluster the rows of Y as points in R^k with k-means or PAM. • Assign the original point s_i to cluster j if and only if row i of Y was assigned to cluster j.
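The steps above can be sketched with NumPy; this is a minimal illustration under our own choices (a simple farthest-point-initialized k-means instead of the project's k-means/PAM options, and synthetic 2-D points instead of text vectors):

```python
import numpy as np

def ngw_spectral_clustering(S, k, sigma, n_iter=50):
    """Sketch of the NGW steps: affinity, normalization, spectrum, k-means."""
    # Step 1: affinity matrix A with zero diagonal.
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Step 2: L = D^(-1/2) A D^(-1/2), D diagonal with row sums of A.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Step 3: k largest eigenvectors of L as columns of X.
    w, V = np.linalg.eigh(L)
    X = V[:, np.argsort(w)[-k:]]
    # Step 4: renormalize rows of X to unit length -> Y.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Step 5: cluster rows of Y (farthest-point init, then Lloyd updates).
    centers = [Y[0]]
    for _ in range(1, k):
        d2 = np.min([((Y - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((Y[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Y[labels == j].mean(axis=0)
    return labels

# Two well-separated blobs of 10 points each; each blob should become one cluster.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(3.0, 0.1, (10, 2))])
labels = ngw_spectral_clustering(pts, k=2, sigma=1.0)
```

Each original point inherits the cluster of its row in Y, exactly as in the final step of the algorithm.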
Flowchart: Data → Affinity matrix → Eigenvectors (spectrum) → Clustering, with Brent’s method minimizing f(σ) over this pipeline.
Execution 1: • Input Books: • The New Testament. • Harry Potter and the Goblet of Fire. • Harry Potter and the Sorcerer’s Stone. • The Dead Zone – Stephen King. • Each book divided to: 10 parts • Input parameters: • Clustering Algorithm: K-Means • Number of Clusters: 3 • Number of Runs: 1 • Brent’s Method Tolerance: 1e-6 • Results: • Cramer’s V: 0.707 • Scaling Factor: 0.1
Execution 2: • Input Books: • The New Testament. • Harry Potter and the Goblet of Fire. • Harry Potter and the Sorcerer’s Stone • The Dead Zone – Stephen King • Each book divided to: 10 parts • Input parameters: • Clustering Algorithm: K-Means • Number of Clusters: 3 • Number of Runs: 10 • Brent’s Method Tolerance: 1e-6 • Results: • Cramer’s V: 1 • Scaling Factor: 0.182
Execution 3: • Input Books: • Harry Potter and the Chamber of Secrets. • Harry Potter and the Deathly Hallows. • Each Book Divided to: 10 parts • Input parameters: • Clustering Algorithm: PAM • Number of Clusters: 2 • Number of Runs: 10 • Brent’s Method Tolerance: 1e-6 • Results: • Cramer’s V: 0.905 • Scaling Factor: 0.113
Execution 4: • Input Books: • Harry Potter and the Goblet of Fire • Harry Potter and the Chamber of Secrets • The Stars, Like Dust – Isaac Asimov • The Dead Zone – Stephen King • Each Book Divided to: 10 parts • Input parameters: • Clustering Algorithm: PAM • Number of Clusters: 4 • Number of Runs: 10 • Brent’s Method Tolerance: 1e-6 • Results: • Cramer’s V: 0.825 • Scaling Factor: 0.136
Conclusion: • Our experiments led us to the following conclusions: • The algorithm shows quite accurate and reliable results. • The proposed learning process significantly improves the quality of the results. • During the experiments we found that dividing each book into ten parts gave the best results.
Conclusion (Cont.): • The algorithm can distinguish between different books written by the same author. • To avoid numerical instabilities, we need to normalize the dataset. • The PAM algorithm is more stable, whereas the K-means algorithm sometimes converges to a local optimum.
Final Remark • In this project we present an algorithm capable of producing optimal clustering results for a given dataset by tuning the scaling factor σ of the algorithm. • The uniqueness of our approach is its ability to adapt itself to the dataset.