
Presentation Transcript


  1. Clustering Categorical Data: An Approach Based on Dynamical Systems (1998) • David Gibson, Jon Kleinberg, Prabhakar Raghavan • VLDB Journal: Very Large Data Bases • Aaron Sherman

  2. Presentation • What is this presentation about? • Definitions and Algorithms • Evaluations with Generated Data • Real-World Test • Conclusions + Q&A

  3. Categorize this! • Categorizing integers is easy, but what about words like “red,” “blue,” “august,” and “Moorthy”? • STIRR – Sieving Through Iterated Relational Reinforcement

  4. Why is STIRR Better? • No A Priori Quantization • Correlation vs. Categorical Similarity • New Methods for Hypergraph Clustering

  5. Definitions • Table of relational data – a set T of tuples • Set of k fields (columns), each with many possible values • Abstract node – one node for each possible value of each field • Tuple Γ ∈ T – consists of one node from each field • Configuration – an assignment of a weight w_v to each node v; w denotes the vector of all weights • N(w) – normalization function – rescales the weights within each field so their squares sum to 1 • Dynamical system – repeated application of the update function f • Fixed point – a point u where f(u) = u
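
To make these definitions concrete, here is a minimal sketch in Python of one way to represent the objects above. The toy table, the (field, value) node keys, and the per-field normalization are illustrative assumptions, not code from the paper.

```python
import math

# Toy relational table: each tuple has one value per field, and every
# distinct (field, value) pair becomes an abstract node.
tuples = [
    ("red",  "august", "moorthy"),
    ("blue", "august", "moorthy"),
    ("red",  "june",   "smith"),
]
num_fields = len(tuples[0])

# Configuration: a weight w_v for each node v, keyed by (field index, value).
weights = {(f, v): 1.0 for t in tuples for f, v in enumerate(t)}

def normalize(weights, num_fields):
    """N(w): rescale the weights within each field so their squares sum to 1."""
    for field in range(num_fields):
        norm = math.sqrt(sum(w * w for (f, _), w in weights.items() if f == field))
        if norm > 0:
            for node in [n for n in weights if n[0] == field]:
                weights[node] /= norm

normalize(weights, num_fields)
```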

  6. Where is all this going?

  7. Weighting Scheme • To update the weight w_v: • For each tuple Γ = {v, u_1, …, u_{k−1}} containing v: • x_Γ ← ⊕(w_{u_1}, …, w_{u_{k−1}}) • w_v ← Σ_Γ x_Γ • Finally normalize: w ← N(w); one full pass over all nodes is the function f
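
A sketch of one iteration of this update, reusing the toy table and `normalize` helper from the previous sketch; `combine` stands in for the combining operator ⊕ defined on the next slides.

```python
def stirr_iteration(tuples, weights, combine, normalize, num_fields):
    """One application of f: recompute every node weight, then normalize."""
    new_w = dict.fromkeys(weights, 0.0)
    for t in tuples:
        nodes = list(enumerate(t))                  # the k nodes of this tuple
        for v in nodes:
            # x_Gamma: combine the weights of the other nodes in the tuple,
            # then accumulate into w_v (the sum over all tuples containing v)
            others = [weights[u] for u in nodes if u != v]
            new_w[v] += combine(others)
    weights.update(new_w)
    normalize(weights, num_fields)                  # w <- N(w)
    return weights
```

For example, `stirr_iteration(tuples, weights, sum, normalize, num_fields)` runs one step with the addition operator; iterating until the weights stop changing approximates a fixed point.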

  8. Combining Operator Π • Product operator Π: ⊕(w_1, …, w_k) = w_1 w_2 ⋯ w_k • Non-linear – encodes co-occurrence strongly • Does not converge • Relatively small number of large basins of attraction • Very useful data in early iterations

  9. Combining Operator + • Addition operator +: ⊕(w_1, …, w_k) = w_1 + w_2 + … + w_k • Linear • Converges well

  10. Combining Operator Sp • S_p combining rule: ⊕(w_1, …, w_k) = (w_1^p + w_2^p + … + w_k^p)^{1/p} • Non-linear – encodes co-occurrence strongly • Converges well

  11. Combining Operator Sω • S_ω – the limiting version of S_p as p → ∞ • Takes the largest value among the weights • Easy to compute, with sum-like properties • Converges best of all the options shown
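
The four combining operators from slides 8–11 are easy to sketch in Python. The S_p version below assumes non-negative weights (which holds under uniform initialization); handling signed weights would need a signed p-th power.

```python
import operator
from functools import reduce

def product_rule(ws):
    # Pi: strongly non-linear, encodes co-occurrence, but need not converge
    return reduce(operator.mul, ws, 1.0)

def sum_rule(ws):
    # +: linear, converges well
    return sum(ws)

def make_sp(p):
    # S_p: (w_1^p + ... + w_k^p)^(1/p), assuming non-negative weights
    def sp(ws):
        return sum(w ** p for w in ws) ** (1.0 / p)
    return sp

def s_omega(ws):
    # S_omega: the p -> infinity limit of S_p, i.e. the largest weight
    return max(ws)
```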

  12. Initial Configuration • Uniform initialization – all weights set to 1 • Random initialization – independently choose a random value for each weight, then normalize • Some operators are more sensitive to the initial configuration than others • Masking / modification – explicitly set the weights of certain nodes higher or lower
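
A sketch of the three initialization schemes; the choice of random distribution and the masking parameters (`seeds`, `high`, `low`) are illustrative assumptions, since the slide does not pin them down.

```python
import random

def uniform_init(nodes):
    # uniform initialization: every weight starts at 1 (N(w) then rescales)
    return {v: 1.0 for v in nodes}

def random_init(nodes):
    # random initialization: an independent random value per weight,
    # to be normalized afterwards with N(w)
    return {v: random.random() for v in nodes}

def masked_init(nodes, seeds, high=1.0, low=0.01):
    # masking / modification: pin chosen seed nodes above the background
    return {v: (high if v in seeds else low) for v in nodes}
```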

  13. Run Time – Linear • Each iteration makes a single pass over the tuples, so the run time is linear in the size of the table

  14. Quasi-Random Input • Create semi-random data, then add tuples to the data to create artificial clusters • Use this to test whether STIRR works • Questions: • How many iterations are needed? • How dense must the cluster be relative to the background?
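
A toy generator for such quasi-random input. The planting scheme (cluster tuples drawn only from the first few values of each field) and all parameter names are assumptions, not the paper's exact setup.

```python
import random

def quasi_random_table(n_background, n_cluster, num_fields,
                       values_per_field, cluster_width):
    # Background: tuples with each field value chosen uniformly at random.
    background = [tuple(random.randrange(values_per_field)
                        for _ in range(num_fields))
                  for _ in range(n_background)]
    # Planted cluster: tuples drawn only from the first `cluster_width`
    # values of each field, so those nodes co-occur far above chance.
    planted = [tuple(random.randrange(cluster_width)
                     for _ in range(num_fields))
               for _ in range(n_cluster)]
    return background + planted
```

Varying `n_cluster` relative to `n_background` probes the cluster-to-background density question above.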

  15. How well does STIRR distill a cluster of nodes with above-average co-occurrence? • Measured against the number of iterations • Measured by the purity of the cluster

  16. How well does STIRR separate distinct planted clusters? • Will the data partition? How long does partitioning take? • S(A, B) = (|a_0 − b_0| + |a_1 − b_1|) / (total number of nodes) • For clusters A and B: a_0 is the number of A's nodes at one end of the weight ordering, a_1 the number at the other end (likewise b_0, b_1 for B)
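
One plausible reading of this measure in code: sort the nodes by weight and count how each cluster's nodes split between the two ends of the ordering. How many nodes count as an "end" (`end_size`) is an assumption.

```python
def separation(weights, cluster_a, cluster_b, end_size):
    # cluster_a, cluster_b: sets of node keys; weights: node -> weight dict.
    # Sort nodes from the heaviest to the lightest weight.
    order = sorted(weights, key=weights.get, reverse=True)
    top, bottom = set(order[:end_size]), set(order[-end_size:])
    a0, a1 = len(cluster_a & top), len(cluster_a & bottom)   # A's nodes per end
    b0, b1 = len(cluster_b & top), len(cluster_b & bottom)   # B's nodes per end
    return (abs(a0 - b0) + abs(a1 - b1)) / len(order)
```

Larger values mean the two clusters ended up at opposite ends of the weight ordering; values near 0 mean they did not separate.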

  17. How well does STIRR cope with clusters in a few columns with the rest random? • Want to mask irrelevant factors (columns)

  18. Effect of the Combining Operator on Convergence • The max function (S_ω) works best • The product rule does not converge • The sum rule is good, but slow

  19. Real-World Data • Papers on theory and database systems • Tuples of the form (Author 1, Author 2, Journal, Year) • The two sets of papers were clearly separated in the STIRR representation • Done using S_p • Grouped most theoretical papers around 1976

  20. Login Data from IBM Servers • Masked out one user who logged in and out very frequently • The four highest-weight (most similar) users: root, help, and two administrators' names • Logins between 8pm and 12am were very similar

  21. Conclusion • A powerful technique for clustering categorical data • A relatively fast algorithm – O(n) • Questions?

  22. Additional References • Periklis Andritsos, “Data Clustering Techniques,” Qualifying Oral Examination Paper • http://www.cs.toronto.edu/~periklis/pubs/depth.pdf
