1 / 32

Detecting Community Structure in Network

2004 summer intensive studies on complex networks. Detecting Community Structure in Network. Seung Woo Son KAIST. 2004. 8. 11. http://cnrl.snu.ac.kr/. metric. o. D-dimensional vector space. Clustering of data. Partitional clustering methods Important technique in data analysis

caine
Download Presentation

Detecting Community Structure in Network

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 2004 summer intensive studies on complex networks Detecting Community Structure in Network Seung Woo Son KAIST 2004. 8. 11.

  2. http://cnrl.snu.ac.kr/

  3. metric o D-dimensional vector space Clustering of data • Partitional clustering methods • Important technique in data analysis • Divide the data according to natural classes • Pattern recognition, learning, astrophysics, and network analysis N multivariable data points

  4. applications Web page having same topic, hidden social relationship, distribute processes to processors in a parallel computer, etc. On network N vertices (nodes) No prior information Only know the edge (link) connectivity : Structural information How can we divide the network into several parts? =How can we find the “community” structure?

  5. Community, cluster • Functional modules in cellular and genetic network • P. Holme, M. Huss, and H. Jeong, Bioinformatics 19, 532 (2003). • D. Wilkinson and B. A. Huberman, Proc. Natl. Acad. Sci. USA 10.1073/pnas.0307740100 (2004). • A. Vespignani, Nature Genetics 35, 118 (2003). • Cultural society or important source of a person’s identity in social network • J. Scott, Social Network Analysis: A Handbook, Sage Publications 2nd ed. (2000). • A bundle of web pages on common topics etc. • Community, module, (cohesive) subgroup, cluster, clique etc. • Computer science, mathematics, sociology, biology, and physics are related in this community finding problem.

  6. Modularity Structural definition of community • Groups of vertices within which connections are dense, but between which connections are sparser. • Because we don’t have any prior information about network.

  7. Key points (Highlight) • What property or measure of network is used in this algorithm or method? • eigenvalue and eigenvector, spectrum of adjacency matrix. • Edge betweenness, information centrality. • Distance, dissimilarity index, edge clustering coefficient, etc. • Agglomerativeor divisive? • What is the required prior information here? • Whether there is community or not. • How many modules are there. • Performance of partitioning results and computational complexity. • We will review about 11 different methods recently studied.  If you are boring, ask me a question. Physical meaning?

  8. 3 1 4 2 5 is always eigenvector with eigenvalue 0. 1. Spectral bisection (old one) M. Fiedler, Czrch. Math. J. 23, 298 (1973)A. Pothen, H. Simon, and K.-P. Liou, SIAM J. Matrix Anal. Appl. 11, 430 (1990)F. R. K. Chung, Spectral Graph Theory, Amer. Math. Soc. (1997) http://www.cs.berkeley.edu/~demmel/cs267/lecture20/lecture20.html LaplacianL of n-vertex undirected graph G - D is the diagonal matrix of vertex degree k. - A is the adjacency matrix.

  9. 3 1 The spectral bisection method is reasonably fast.General n by n matrix case, O(n3) time complexity. However, sparse matrix case, Lancozos methodreduces it to approximately . 4 2 5 Bisect ! Algebraic connectivity The eigenvector corresponding to the lowest eigenvalue must haveboth positive and negative elements. : How good the split is, with smaller values corresponding to better splits. G. H. Golub and C. F. Van Loan, Matrix computations. Johns Hopkins University Press, Baltimore, MD (1989)

  10. A B 2. The Kernighan-Lin (KL) algorithm B. W. Kernighan and S. Lin, Bell System Technical Journal 49, 291 (1970)http://www.cs.berkeley.edu/~demmel/cs267/lecture18/lecture18.html Benefit function Q The number of edges that lie within the two groups minus the number that lie between them. 1. We should Specify the size of the two groups. N(A), N(B)2. Calculate the ∆Q for all possible exchange pair from A and B. 3. Choose the pair that maximizes the change of Q. (greedy algorithm) 4. Repeat 2 & 3 until all vertices have been swapped once. (any vertex that has been swapped is never swapped. ) 5. Go back over the sequence of swaps and find the highest Q. Bisect ! - This algorithm requires a priori what the size of the groups will be. - It runs moderately quickly, in worst case time O(n2). However, if we don’t know the size, It will increase to O(n3). - The best values of Q are always achieved for very asymmetric trivial division.

  11. Generally the number of ways to divide n vertices into g non-emptygroups is given by the Stirling number of the second kind S(n,g),and hence the number of distinct community divisions is . Modularity 3. Newman fast algorithm M. E. J. Newman, cond-mat/0309508 (PRE in press) Maximize Q by greedy algorithm ! • Separate each vertex solely into n community. • Calculate the increase of Q for all possible community pairs. • Choose the mergence of the greatest increase in Q. • Repeat 2 & 3 until the modularity Q reaches the maximal value. Time Complexity - O(mn) O(n2) on sparse graph. Agglomerative hierarchical clustering method!

  12. 4. q-state Potts method or RB method (Reihardt-Bornholdt method) J. Reichardt and S. Bornholdt, cond-mat/0402349 (2004) q-state Potts model on network Hamiltonian : Nearest neighbor ferromagnetic interaction of the Potts model : homogeneous distribution of spin Diversity : global anti-ferromagnetic interaction. q = N/5 is reasonable for application. Monte-Carlo heat-bath algorithm and simulated annealing magnetization

  13. 128 nodes computer-generated (proposed by Newman) network, 4 groups of 32 nodes each. Average of 16 links ( zin+zout=16 )

  14. 5. Hierarchical clustering Dendrogram metric • Measure of similarity xijbetween pairs (i,j) of vertices. • Single linkage, complete linkage, or average linkage. Structural equivalence : Two vertices are said to be structurally equivalent if they have the same set of neighbours. How many same friends they have. Pearson correlation Euclidean distance K-components : Two vertices in the same community have at least k independent paths between them. The count of edge-independent path (max-flow) betweenvertices. Time complexity Max(O(mn), O(n2logn) ) because of the sorting of n2 similarity.

  15. 6. Zhou dissimilarity index method H. Zhou, Phys. Rev. E 67, 061901 (2003) H. Zhou, Phys. Rev. E 67, 041908 (2003) The distance dij from vertex i to vertex j is defined as the average number of steps needed for a Brownian particle on this network to move from vertex i to vertex j. Transfer matrix(jumping probability) I is N by N identity matrix. Distance B(j) is equals to P except that Blj(j) = 0 for all l. Dissimilarity index

  16. A B 7. Girvan-Newman (GN) algorithm M. Girvan and M. E. J. Newman, PNAS 99, 7821 (2002) M. E. J. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004) The few edges that lie between communities can be thought of as forming “bottlenecks” between the communities. Betweenness and edge betweenness : The number of geodesic (i.e., shortest) paths between vertex pairs that run along the edge in question, summed over all vertex pairs. Edge removal : After calculating the betweenness of all edges in the network, remove the one with highest betweenness. Recalculate after edge removal and repeat it until the modularity Q is maximum. Time complexity O(m2n)

  17. 8. Tyler-Wilkinson-Huberman (TWM) method J. R. Tyler, D. M. Wilkinson, and B. A. Huberman, cond-mat/0303264 (2003) Variation of Girvan-Newman algorithm to improve the calculating speed. Tyler et al. suggest instead summing up over all node only a subset of vertices i be summed over, giving partial betweenness score for all edges; if a random sample is chosen, this will give a Monte Carlo estimate of betweenness. The number of vertices sampled is chosen so as to make the betweenness of at least one edge in the network greater than a certain threshold. This stochastic approach reduces the time complexity from O(m2n) to O(m2)

  18. Definition of community in a strong sense : Definition of community in a weak sense : i j 6 5 i j 6 5 9. RCCLP method or Parisi method (Radicchi-Castellano-Cecconi-Loreto-Parisi method) F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, PNAS 101, 2658 (2004) Edge clustering coefficient : the number of triangles built on that edge. Edge coefficient of order g :

  19. Edge clustering coefficient is strongly negatively correlated with edge betweenness. Time complexity O(m4/n2) ~ O(n2) This algorithm relies on the presence of triangles in the network. Clearly if a network has few triangles in the first place, then the edge clustering coefficient will be small for all edges, and the algorithm will be fail to find the communities.

  20. 10. Information centrality method (Fortunato-Latora-Marchiori method) S. Fortunato, V. Latora, and M. Marchiori, cond-mat/042522 (2004) Network efficiency E Information centrality CI Time complexity O(m3n) Iterative removal of the edges with the highest information centrality 64 nodes computer-generated network. 256 edges, 4 groups of 16 nodes each.

  21. 11. Flake’s max-flow method (Flake-Lawrence-Giles-Coetzee method) G. W. Flake, S. R. Lawrence, C. L. Giles, and F. M. Coetzee, IEEE Computer 35, 66 (2002) Web community Find the boundary of community using max-flow and min-cut. Without the text information only link information.Ex) Page Rank, Hyperlink Induced Topic Search(HIT) Starting page orseed Web sites

  22. Simple example of max-flow, min-cut

  23. Optimization Approach :Hamiltonian, benefit function, or modularity Q Spectral analysis : eignevalue and eigenvector of Laplacian or transfer matrix Edge removal :betweenness, information centrality, clustering coefficient, etc. Hierarchical clustering :metric ( Euclidian, correlation, similarity, etc. )

  24. 12. ESMS method or K. Sneppen method ( Eriksen-Simonsen-Maslov-Sneppen method ) K. A. Eriksen, I. Simonsen, S. Maslov, and K. Sneppen, Phys. Rev. Lett. 90, 148701 (2003)

  25. 13. CSCC method or Capocci method ( Capocci-Servedio-Caldarelli-Colaiori method ) A. Capocci, V. D. P. Servedio, G. Caldarelli, and F. Colaiori, cond-mat/0402499

  26. 14. Donetti-Muñoz (DM) method L. Donetti and M. A. Muñoz, cond-mat/0404652

  27. 15. Wu-Huberman (WH) method F. Wu and B. A. Huberman, cond-mat/310600

  28. 16. Costa’s Hub-based flooding method L. F. Costa, cond-mat/0405022 (2004)

More Related