
Fast Effective Clustering for Graphs and Documents



Presentation Transcript


  1. Fast Effective Clustering for Graphs and Documents William W. Cohen Machine Learning Dept. and Language Technologies Institute School of Computer Science Carnegie Mellon University Joint work with: Frank Lin and Ramnath Balasubramanyan

  2. Introduction: trends in machine learning 1 • Supervised learning: given data (x1,y1),…,(xn,yn), learn to predict y from x • y is a real number or member of a small set • x is a (sparse) vector • Semi-supervised learning: given data (x1,y1),…,(xk,yk),xk+1,…,xn learn to predict y from x • Unsupervised learning: given data x1,…,xn find a “natural” clustering

  3. Introduction: trends in machine learning 2 • Supervised learning: given data (x1,y1),…,(xn,yn), learn to predict y from x • y is a real number or member of a small set • x is a (sparse) vector • x’s are all i.i.d., independent of each other • y depends only on the corresponding x • Structured learning: x’s and/or y’s are related to each other

  4. Introduction: trends in machine learning 2 • Structured learning: x’s and/or y’s are related to each other • General: x and y are in two parallel 1d arrays • x’s are words in a document, y is POS tag • x’s are words, y=1 if x is part of a company name • x’s are DNA codons, y=1 if x is part of a gene • … • More general: x’s are nodes in a graph, y’s are labels for these nodes

  5. Examples of classification in graphs • x is a web page, edge is hyperlink, y is topic • x is a word, edge is co-occurrence in similar contexts, y is semantics (distributional clustering) • x is a protein, edge is interaction, y is subcellular location • x is a person, edge is email message, y is organization • x is a person, edge is friendship, y=1 if x smokes • … • x,y are anything, edge from x1 to x2 indicates similarity between x1 and x2

  6. Examples: Zachary’s karate club, political books, protein-protein interactions, ….

  7. Political blog network Adamic & Glance “Divided They Blog:…” 2004

  8. Outline • Spectral methods • Variant: Power Iteration Clustering [Lin & Cohen, ICML 2010] • Variant: PIC for document clustering • Stochastic block models • Mixed-membership sparse block model [Parkinnen et al, 2007] • Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents

  9. This talk: • Typical experiments: • For networks with known “true” labels … • can unsupervised learning recover these labels?

  10. Spectral Clustering: Graph = Matrix [figure: a small example graph on nodes A through J and its adjacency matrix]

  11. Spectral Clustering: Graph = Matrix; Transitively Closed Components = “Blocks” [figure: the example graph’s adjacency matrix with its diagonal blocks highlighted] Of course we can’t see the “blocks” unless the nodes are sorted by cluster…

  12. Spectral Clustering: Graph = Matrix; Vector = Node → Weight [figure: a vector v assigning a weight to each node of the example graph, shown next to the matrix M]

  13. Spectral Clustering: Graph = Matrix; M*v1 = v2 “propagates weights from neighbors” [figure: the matrix-vector product M * v1 = v2 on the example graph]

  14. Spectral Clustering: Graph = Matrix; W*v1 = v2 “propagates weights from neighbors” • W * v1 = v2 • W: normalized so columns sum to 1 [figure: the product W * v1 = v2 on the example graph]
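A minimal numpy sketch of the normalization on this slide: building a column-stochastic W from a toy adjacency matrix A. The graph here is made up for illustration, not the one in the figure.

    import numpy as np

    # Toy undirected graph (illustrative only).
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    d = A.sum(axis=0)                # degree of each node (column sums)
    W = A / d[np.newaxis, :]         # divide column j by the degree of node j
    print(W.sum(axis=0))             # every column of W now sums to 1

With this W, (W v)[i] collects a 1/degree share of each neighbor’s weight, which is the “propagates weights from neighbors” picture on the slide.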

  15. Spectral Clustering: Graph = Matrix; W*v1 = v2 “propagates weights from neighbors” Q: How do I pick v to be an eigenvector for a block-stochastic matrix?

  16. Spectral Clustering: Graph = Matrix; W*v1 = v2 “propagates weights from neighbors” How do I pick v to be an eigenvector for a block-stochastic matrix?

  17. Spectral Clustering: Graph = Matrix; W*v1 = v2 “propagates weights from neighbors” [figure: the eigenvalue spectrum λ1, λ2, λ3, λ4, λ5,6,7,…, with the “eigengap” separating the leading eigenvalues from the rest, and the corresponding eigenvectors e1, e2, e3] [Shi & Meila, 2002]

  18. Spectral Clustering: Graph = Matrix; W*v1 = v2 “propagates weights from neighbors” [figure: scatter plot of the nodes in the (e2, e3) eigenvector coordinates; points from the three clusters (marked x, y, z) fall into well-separated groups] [Shi & Meila, 2002]

  19. Spectral Clustering: Graph = Matrix; W*v1 = v2 “propagates weights from neighbors” • If W is connected but roughly block diagonal with k blocks then • the top eigenvector is a constant vector • the next k eigenvectors are roughly piecewise constant with “pieces” corresponding to blocks

  20. Spectral Clustering: Graph = Matrix; W*v1 = v2 “propagates weights from neighbors” • If W is connected but roughly block diagonal with k blocks then • the “top” eigenvector is a constant vector • the next k eigenvectors are roughly piecewise constant with “pieces” corresponding to blocks • Spectral clustering: • Find the top k+1 eigenvectors v1,…,vk+1 • Discard the “top” one • Replace every node a with the k-dimensional vector xa = <v2(a),…,vk+1(a)> • Cluster with k-means
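A short numpy/scikit-learn sketch of the recipe above. To keep the eigendecomposition simple it uses the symmetrically normalized adjacency D^(-1/2) A D^(-1/2), so that eigh applies, rather than the column-stochastic W; that substitution and all names are mine, not the slide’s. It assumes no isolated nodes.

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_cluster(A, k):
        # Symmetrically normalized adjacency; assumes every node has degree > 0.
        d = A.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        S = D_inv_sqrt @ A @ D_inv_sqrt
        # eigh returns eigenvalues in ascending order; keep the k+1 largest eigenvectors.
        vals, vecs = np.linalg.eigh(S)
        top = vecs[:, -(k + 1):]
        X = top[:, :-1]                    # discard the "top" (roughly constant) eigenvector
        # Each node is now a k-dimensional point; cluster those points.
        return KMeans(n_clusters=k, n_init=10).fit_predict(X)

Usage would be something like labels = spectral_cluster(A, k=2) for a dense numpy adjacency matrix A.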

  21. Spectral Clustering: Pros and Cons • Pros: • Elegant and well-founded mathematically • Tends to avoid local minima • Optimal solution to a relaxed version of the mincut problem (Normalized Cut, aka NCut) • Works quite well when relations are approximately transitive (like similarity, social connections) • Cons: • Expensive for very large datasets: computing eigenvectors is the bottleneck, and approximate eigenvector computation is not always useful • Noisy datasets sometimes cause problems: picking the number of eigenvectors and k is tricky, “informative” eigenvectors need not be in the top few, and performance can drop suddenly from good to terrible

  22. Experimental results: best-case assignment of class labels to clusters

  23. Spectral Clustering: Graph = Matrix; M*v1 = v2 “propagates weights from neighbors” [figure: the matrix-vector product M * v1 = v2 on the example graph]

  24. Repeated averaging with neighbors as a clustering method • Pick a vector v0 (maybe at random) • Compute v1 = Wv0 • i.e., replace v0[x] with weighted average of v0[y] for the neighbors y of x • Plot v1[x] for each x • Repeat for v2, v3, … • Variants widely used for semi-supervised learning • clamping of labels for nodes with known labels • Without clamping, will converge to constant vt • What are the dynamics of this process?
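A minimal numpy sketch of this loop; the function name and the row-normalized W are my choices, since the slide leaves both open.

    import numpy as np

    def repeated_averaging(A, t, v0=None):
        d = A.sum(axis=1)
        W = A / d[:, np.newaxis]              # W[x, y] = A[x, y] / degree(x)
        v = np.random.rand(A.shape[0]) if v0 is None else v0.astype(float)
        history = [v]
        for _ in range(t):
            v = W @ v                          # replace v[x] with the weighted average over x's neighbors
            history.append(v)
        return history                         # plot history[1], history[2], ... to watch the dynamics

Without renormalization or clamping the iterates drift toward a constant vector, which is exactly the convergence behavior the slide points out.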

  25. Repeated averaging with neighbors on a sample problem… • Create a graph, connecting all points in the 2-D initial space to all other points • Weighted by distance • Run power iteration for 10 steps • Plot node id x vs v10(x) • nodes are ordered by actual cluster number [figure: plot of v10(x) against node id for a three-cluster (blue/green/red) problem; nodes from the same cluster take similar values]

  26. Repeated averaging with neighbors on a sample problem… [figure: plots of vt(x) for the blue/green/red problem at successive iterations, with annotations marking which differences are becoming larger and which smaller]

  27. Repeated averaging with neighbors on a sample problem… [figure: plots of vt(x) for the blue/green/red problem at further iterations]

  28. Repeated averaging with neighbors on a sample problem… [figure: after many iterations the remaining differences between vt values are very small]

  29. PIC: Power Iteration Clustering: run power iteration (repeated averaging w/ neighbors) with early stopping • v0: random start, or “degree matrix” D, or … • Easy to implement and efficient • Very easily parallelized • Experimentally, often better than traditional spectral methods • Surprising since the embedded space is 1-dimensional!
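A rough, hedged sketch of PIC as summarized here and in [Lin & Cohen, ICML 2010]; the paper’s exact start vector, per-iteration normalization, and stopping threshold differ in details, and the names below are mine.

    import numpy as np
    from sklearn.cluster import KMeans

    def pic(A, k, max_iter=1000, tol=1e-5):
        d = A.sum(axis=1)
        W = A / d[:, np.newaxis]                   # row-normalized affinity matrix
        v = d / d.sum()                            # degree-based start, one of the options above
        prev_delta = None
        for _ in range(max_iter):
            v_new = W @ v
            v_new = v_new / np.abs(v_new).sum()    # keep the vector from collapsing toward zero
            delta = np.abs(v_new - v)
            # Early stopping: quit once the per-node changes have (almost) stopped changing.
            if prev_delta is not None and np.max(np.abs(delta - prev_delta)) < tol:
                v = v_new
                break
            prev_delta, v = delta, v_new
        # The whole embedding is the single vector v; k-means on it gives the clusters.
        return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))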

  30. Experiments • “Network” problems: natural graph structure • PolBooks: 105 political books, 3 classes, linked by copurchaser • UMBCBlog: 404 political blogs, 2 classes, blogroll links • AGBlog: 1222 political blogs, 2 classes, blogroll links • “Manifold” problems: cosine distance between classification instances • Iris: 150 flowers, 3 classes • PenDigits01,17: 200 handwritten digits, 2 classes (0-1 or 1-7) • 20ngA: 200 docs, misc.forsale vs soc.religion.christian • 20ngB: 400 docs, misc.forsale vs soc.religion.christian • 20ngC: 20ngB + 200 docs from talk.politics.guns • 20ngD: 20ngC + 200 docs from rec.sport.baseball

  31. Experimental results: best-case assignment of class labels to clusters

  32. Outline • Spectral methods • Variant: Power Iteration Clustering [Lin & Cohen, ICML 2010] • Experiments • Analysis • Variant: PIC for document clustering • Stochastic block models • Mixed-membership sparse block model [...] • Variants: BlockLDA etc

  33. Analysis: why is this working?

  34. Analysis: why is this working?

  35. Analysis: why is this working? [figure: annotated equations comparing the spectral and PIC distances; annotations read “L2 distance scaling?”, “differences might cancel?”, and “'noise' terms”]
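For readers without the slide images, the mechanism is ordinary power-iteration algebra (a sketch, assuming W is diagonalizable with a real eigenbasis; the exact quantities on the slides may be organized differently). Writing v^0 = \sum_i c_i e_i in the eigenbasis of W with eigenvalues \lambda_1 \ge \lambda_2 \ge \dots:

    v^t = W^t v^0 = \sum_i c_i \lambda_i^t e_i,
    so   v^t(a) - v^t(b) = \sum_i c_i \lambda_i^t [ e_i(a) - e_i(b) ].

Because e_1 is (roughly) constant, the i = 1 term cancels in the difference; the terms with the “large” eigenvalues \lambda_2,…,\lambda_k carry the cluster structure, and the terms with \lambda_{k+1},… are the “noise” terms that shrink as t grows, which is why early stopping keeps the signal.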

  36. Analysis: why is this working? • If • eigenvectors e2,…,ek are approximately piecewise constant on blocks; • λ2,…, λk are “large” and λk+1,… are “small”; • e.g., if matrix is block-stochastic • the ci’s for v0 are bounded; • for any a,b from distinct blocks there is at least one ei with ei(a)-ei(b) “large” • Then there exists an R such that • spec(a,b) small ⟹ R*pic(a,b) small

  37. Analysis: why is this working? • Sum of differences vs sum-of-squared differences • “soft” eigenvector selection

  38. Outline • Spectral methods • Variant: Power Iteration Clustering [Lin & Cohen, ICML 2010] • Variant: PIC for document clustering • Stochastic block models • Mixed-membership sparse block model [...] • Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents

  39. Motivation: Experimental Datasets are… • “Network” problems: natural graph structure • PolBooks: 105 political books, 3 classes, linked by copurchaser • UMBCBlog: 404 political blogs, 2 classes, blogroll links • AGBlog: 1222 political blogs, 2 classes, blogroll links • Also: Zachary’s karate club, citation networks, ... • “Manifold” problems: cosine distance between all pairs of classification instances • Iris: 150 flowers, 3 classes • PenDigits01,17: 200 handwritten digits, 2 classes (0-1 or 1-7) • 20ngA: 200 docs, misc.forsale vs soc.religion.christian • 20ngB: 400 docs, misc.forsale vs soc.religion.christian • … Gets expensive fast

  40. Lazy computation of distances and normalizers • Recall PIC’s update is vt = W * vt-1 = D-1A * vt-1 • …where D is the [diagonal] degree matrix: D = A*1, and 1 is a column vector of 1’s • My favorite distance metric for text is length-normalized TFIDF: • Def’n: A(i,j) = <vi,vj> / (||vi|| * ||vj||), where <u,v> is the inner product and ||u|| is the L2-norm • Let N(i,i) = ||vi|| … and N(i,j) = 0 for i != j • Let F(i,k) = TFIDF weight of word wk in document vi • Then: A = N-1FTFN-1

  41. Lazy computation of distances and normalizers Equivalent to using TFIDF/cosine on all pairs of examples, but requires only sparse matrices • Recall PIC’s update is vt = W * vt-1 = D-1A * vt-1 • …where D is the [diagonal] degree matrix: D = A*1 • Let F(i,k) = TFIDF weight of word wk in document vi • Compute N(i,i) = ||vi|| … and N(i,j) = 0 for i != j • Don’t compute A = N-1FTFN-1 • Let D(i,i) = [N-1FTFN-1*1](i), where 1 is an all-1’s vector • Computed as D = N-1(FT(F(N-1*1))) for efficiency • New update: vt = D-1A * vt-1 = D-1N-1FTFN-1 * vt-1
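A scipy/numpy sketch of the lazy update on these two slides; F is a sparse docs-by-words TFIDF matrix (e.g. a scipy.sparse.csr_matrix), and all names are mine. With rows-as-documents the cosine-similarity matrix comes out as N-1FFTN-1; the slide writes N-1FTFN-1, which is the same construction under the other indexing convention for F.

    import numpy as np

    def lazy_pic_updates(F, t):
        # N(i,i) = ||f_i||, the L2 norm of document i's TFIDF row.
        row_norms = np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
        n_inv = 1.0 / row_norms
        ones = np.ones(F.shape[0])
        # D = diag(A * 1) with A = N^-1 F F^T N^-1, computed right-to-left as sparse mat-vec products.
        deg = n_inv * (F @ (F.T @ (n_inv * ones)))
        v = deg / deg.sum()                         # degree-based start vector
        for _ in range(t):
            Av = n_inv * (F @ (F.T @ (n_inv * v)))  # A * v without ever materializing A
            v = Av / deg                            # v <- D^-1 A v
            v = v / np.abs(v).sum()                 # renormalize to avoid underflow
        return v

Each update costs two sparse matrix-vector products plus elementwise work, so the dense n-by-n similarity matrix is never built.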

  42. Experimental results • RCV1 text classification dataset • 800k+ newswire stories • Category labels from industry vocabulary • Took single-label documents and categories with at least 500 instances • Result: 193,844 documents, 103 categories • Generated 100 random category pairs • Each is all documents from two categories • Range in size and difficulty • Pick category 1, with m1 examples • Pick category 2 such that 0.5m1<m2<2m1

  43. Results • NCUTevd: NCut with exact eigenvectors • NCUTiram: NCut with the implicitly restarted Arnoldi method • No statistically significant differences between NCUTevd and PIC

  44. Results

  45. Results

  46. Results

  47. Outline • Spectral methods • Variant: Power Iteration Clustering [Lin & Cohen, ICML 2010] • Variant: PIC for document clustering • Stochastic block models • Mixed-membership sparse block model [Parkinnen et al, 2007] • Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents

  48. Question: How to model this? MMSBM of Airoldi et al • Draw K2 Bernoulli distributions • Draw a θi for each protein • For each entry i,j in the matrix • Draw zi* from θi • Draw z*j from θj • Draw mij from a Bernoulli associated with the pair of z’s.
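A small numpy sketch that follows the generative steps listed on this slide; the Beta(1,1) and symmetric Dirichlet priors, the function name, and the parameters are illustrative choices of mine rather than the paper’s.

    import numpy as np

    def sample_mmsbm(n, K, alpha=0.1, seed=0):
        rng = np.random.default_rng(seed)
        B = rng.beta(1.0, 1.0, size=(K, K))             # the K^2 Bernoulli parameters
        theta = rng.dirichlet(alpha * np.ones(K), n)    # a membership distribution theta_i per protein
        M = np.zeros((n, n), dtype=int)
        for i in range(n):
            for j in range(n):
                zi = rng.choice(K, p=theta[i])          # z_i* drawn from theta_i
                zj = rng.choice(K, p=theta[j])          # z_*j drawn from theta_j
                M[i, j] = rng.binomial(1, B[zi, zj])    # m_ij ~ Bernoulli for that pair of z's
        return M, theta, B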

  49. Question: How to model this? MMSBM of Airoldi et al • Draw K2 Bernoulli distributions • Draw a θi for each protein • For each entry i,j in the matrix • Draw zi* from θi • Draw z*j from θj • Draw mij from a Bernoulli associated with the pair of z’s. [figure: the protein-protein interaction matrix, indexed by protein 1 and protein 2; a marked cell means “p1, p2 do interact”]

  50. Question: How to model this? …we prefer the sparse block model of Parkinnen et al, 2007 • Draw K multinomial distributions β (these define the “blocks”) • For each row in the link relation: • Draw a block pair (zL, zR) from a multinomial over block pairs • Draw a protein i from the left multinomial associated with zL • Draw a protein j from the right multinomial associated with zR • Add i,j to the link relation [figure: the interaction matrix again, indexed by protein 1 and protein 2; a marked cell means “p1, p2 do interact”]
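For contrast, a matching numpy sketch of the sparse block model’s generative steps as listed on this slide; the symmetric Dirichlet priors and names are illustrative, and the multinomial over block pairs stands in for the distribution whose symbol did not survive extraction from the slide text.

    import numpy as np

    def sample_sparse_block_links(n_links, K, n_proteins, seed=0):
        rng = np.random.default_rng(seed)
        pair_probs = rng.dirichlet(np.ones(K * K))       # distribution over (zL, zR) block pairs
        beta = rng.dirichlet(np.ones(n_proteins), K)     # K multinomials over proteins: the "blocks"
        links = []
        for _ in range(n_links):                          # one draw per row of the link relation
            pair = rng.choice(K * K, p=pair_probs)
            zL, zR = divmod(pair, K)
            i = rng.choice(n_proteins, p=beta[zL])        # protein i from the multinomial for zL
            j = rng.choice(n_proteins, p=beta[zR])        # protein j from the multinomial for zR
            links.append((i, j))
        return links

Only observed links are generated, which is why this style of model suits sparse interaction data, whereas the MMSBM has to account for every (i, j) cell of the matrix.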
