
Presentation Transcript


  1. Clustering Categorical Data: An Approach Based on Dynamical Systems (1998) • David Gibson, Jon Kleinberg, Prabhakar Raghavan • VLDB Journal: Very Large Data Bases • Aaron Sherman

  2. Presentation • What is this presentation about? • Definitions and Algorithms • Evaluations with Generated Data • Real-World Test • Conclusions + Q&A

  3. Categorize this! • Categorizing integers is easy, but what about words like “red,” “blue,” “august,” and “Moorthy”? • STIRR – Sieving Through Iterated Relational Reinforcement

  4. Why is STIRR Better? • No A Priori Quantization • Correlation vs. Categorical Similarity • New Methods for Hypergraph Clustering

  5. Definitions • Table of relational data – a set T of tuples • Set of k fields (columns), each with many possible values • Abstract node – one node for each possible value of each field • Tuple Γ ∈ T – consists of one node from each field • Configuration – an assignment of a weight w_v to each node v; w denotes the vector of all weights • N(w) – normalization function – rescales the weights within each field so their squares sum to 1 • Dynamical system – repeated application of the update function f • Fixed point – a point u where f(u) = u
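
To make these definitions concrete, here is a minimal sketch in Python of one way to represent the objects above. The toy table, the (field, value) node keys, and the per-field normalization are illustrative assumptions, not code from the paper.

```python
import math

# Toy relational table: each tuple has one value per field, and every
# distinct (field, value) pair becomes an abstract node.
tuples = [
    ("red",  "august", "moorthy"),
    ("blue", "august", "moorthy"),
    ("red",  "june",   "smith"),
]
num_fields = len(tuples[0])

# Configuration: a weight w_v for each node v, keyed by (field index, value).
weights = {(f, v): 1.0 for t in tuples for f, v in enumerate(t)}

def normalize(weights, num_fields):
    """N(w): rescale the weights within each field so their squares sum to 1."""
    for field in range(num_fields):
        norm = math.sqrt(sum(w * w for (f, _), w in weights.items() if f == field))
        if norm > 0:
            for node in [n for n in weights if n[0] == field]:
                weights[node] /= norm

normalize(weights, num_fields)
```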

  6. Where is all this going?

  7. Weighting Scheme • To update the weight w_v: • For each tuple Γ = {v, u_1, …, u_{k−1}} containing v: • x_Γ ← ⊕(w_{u_1}, …, w_{u_{k−1}}) • w_v ← Σ_Γ x_Γ • Finally normalize: w ← N(w); one full pass over all nodes is the function f
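
A sketch of one iteration of this update, reusing the toy table and `normalize` helper from the previous sketch; `combine` stands in for the combining operator ⊕ defined on the next slides.

```python
def stirr_iteration(tuples, weights, combine, normalize, num_fields):
    """One application of f: recompute every node weight, then normalize."""
    new_w = dict.fromkeys(weights, 0.0)
    for t in tuples:
        nodes = list(enumerate(t))                  # the k nodes of this tuple
        for v in nodes:
            # x_Gamma: combine the weights of the other nodes in the tuple,
            # then accumulate into w_v (the sum over all tuples containing v)
            others = [weights[u] for u in nodes if u != v]
            new_w[v] += combine(others)
    weights.update(new_w)
    normalize(weights, num_fields)                  # w <- N(w)
    return weights
```

For example, `stirr_iteration(tuples, weights, sum, normalize, num_fields)` runs one step with the addition operator; iterating until the weights stop changing approximates a fixed point.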

  8. Combining Operator Π • Product operator Π: ⊕(w_1, …, w_k) = w_1 w_2 ⋯ w_k • Non-linear – encodes co-occurrence strongly • Does not converge • Relatively small number of large basins of attraction • Very useful data in early iterations

  9. Combining Operator + • Addition operator +: ⊕(w_1, …, w_k) = w_1 + w_2 + … + w_k • Linear • Converges well

  10. Combining Operator Sp • S_p combining rule: ⊕(w_1, …, w_k) = (w_1^p + w_2^p + … + w_k^p)^{1/p} • Non-linear – encodes co-occurrence strongly • Converges well

  11. Combining Operator Sω • S_ω – the limiting version of S_p as p → ∞ • Takes the largest value among the weights • Easy to compute, with sum-like properties • Converges best of all the options shown
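
The four combining operators from slides 8–11 are easy to sketch in Python. The S_p version below assumes non-negative weights (which holds under uniform initialization); handling signed weights would need a signed p-th power.

```python
import operator
from functools import reduce

def product_rule(ws):
    # Pi: strongly non-linear, encodes co-occurrence, but need not converge
    return reduce(operator.mul, ws, 1.0)

def sum_rule(ws):
    # +: linear, converges well
    return sum(ws)

def make_sp(p):
    # S_p: (w_1^p + ... + w_k^p)^(1/p), assuming non-negative weights
    def sp(ws):
        return sum(w ** p for w in ws) ** (1.0 / p)
    return sp

def s_omega(ws):
    # S_omega: the p -> infinity limit of S_p, i.e. the largest weight
    return max(ws)
```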

  12. Initial Configuration • Uniform initialization – all weights set to 1 • Random initialization – independently choose a random value for each weight, then normalize • Some operators are more sensitive to the initial configuration than others • Masking / modification – explicitly set the weights of certain nodes higher or lower
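
A sketch of the three initialization schemes; the choice of random distribution and the masking parameters (`seeds`, `high`, `low`) are illustrative assumptions, since the slide does not pin them down.

```python
import random

def uniform_init(nodes):
    # uniform initialization: every weight starts at 1 (N(w) then rescales)
    return {v: 1.0 for v in nodes}

def random_init(nodes):
    # random initialization: an independent random value per weight,
    # to be normalized afterwards with N(w)
    return {v: random.random() for v in nodes}

def masked_init(nodes, seeds, high=1.0, low=0.01):
    # masking / modification: pin chosen seed nodes above the background
    return {v: (high if v in seeds else low) for v in nodes}
```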

  13. Run Time – Linear • Each iteration makes a single pass over the tuples, so the run time is linear in the size of the table

  14. Quasi-Random Input • Create semi-random data, then add tuples to the data to create artificial clusters • Use this to test whether STIRR works • Questions: • How many iterations are needed? • How dense must the cluster be relative to the background?
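
A toy generator for such quasi-random input. The planting scheme (cluster tuples drawn only from the first few values of each field) and all parameter names are assumptions, not the paper's exact setup.

```python
import random

def quasi_random_table(n_background, n_cluster, num_fields,
                       values_per_field, cluster_width):
    # Background: tuples with each field value chosen uniformly at random.
    background = [tuple(random.randrange(values_per_field)
                        for _ in range(num_fields))
                  for _ in range(n_background)]
    # Planted cluster: tuples drawn only from the first `cluster_width`
    # values of each field, so those nodes co-occur far above chance.
    planted = [tuple(random.randrange(cluster_width)
                     for _ in range(num_fields))
               for _ in range(n_cluster)]
    return background + planted
```

Varying `n_cluster` relative to `n_background` probes the cluster-to-background density question above.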

  15. How well does STIRR distill a cluster of nodes with above-average co-occurrence? • Measured against the number of iterations • Measured by the purity of the cluster

  16. How well does STIRR separate distinct planted clusters? • Will the data partition? How long does partitioning take? • S(A, B) = (|a_0 − b_0| + |a_1 − b_1|) / (total number of nodes) • For clusters A and B: a_0 is the number of A's nodes at one end of the weight ordering, a_1 the number at the other end (likewise b_0, b_1 for B)
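
One plausible reading of this measure in code: sort the nodes by weight and count how each cluster's nodes split between the two ends of the ordering. How many nodes count as an "end" (`end_size`) is an assumption.

```python
def separation(weights, cluster_a, cluster_b, end_size):
    # cluster_a, cluster_b: sets of node keys; weights: node -> weight dict.
    # Sort nodes from the heaviest to the lightest weight.
    order = sorted(weights, key=weights.get, reverse=True)
    top, bottom = set(order[:end_size]), set(order[-end_size:])
    a0, a1 = len(cluster_a & top), len(cluster_a & bottom)   # A's nodes per end
    b0, b1 = len(cluster_b & top), len(cluster_b & bottom)   # B's nodes per end
    return (abs(a0 - b0) + abs(a1 - b1)) / len(order)
```

Larger values mean the two clusters ended up at opposite ends of the weight ordering; values near 0 mean they did not separate.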

  17. How well does STIRR cope with clusters in a few columns with the rest random? • Want to mask irrelevant factors (columns)

  18. Effect of the Combining Operator on Convergence • The max function (S_ω) works best • The product rule does not converge • The sum rule is good, but slow

  19. Real-World Data • Papers on theory and database systems • Tuples of the form (Author 1, Author 2, Journal, Year) • The two sets of papers were clearly separated in the STIRR representation • Done using S_p • Grouped most theoretical papers around 1976

  20. Login Data from IBM Servers • Masked out one user who logged in and out very frequently • The four highest-weight (most similar) users: root, help, and two administrators' names • Logins between 8pm and 12am were very similar

  21. Conclusion • A powerful technique for clustering categorical data • A relatively fast algorithm – O(n) • Questions?

  22. Additional References • Periklis Andritsos, “Data Clustering Techniques,” Qualifying Oral Examination Paper • http://www.cs.toronto.edu/~periklis/pubs/depth.pdf
