Graph Partitioning and Clustering for Community Detection

Graph Partitioning and Clustering for Community Detection Presented By: Group One

Outline • Introduction: Hong Hande • Graph Partitioning: Muthu Kumar C and Xie Shudong • Partitional Clustering: Agus Pratondo • Spectral Clustering: Li Furong and Song Chonggang • Summary and Applications of Community Detection: Aleksandr Farseev

Introduction by Hong Hande

Facebook Group https://www.facebook.com/thebeatles?rf=111113312246958

Flickr group http://www.flickr.com/groups/49246928@N00/pool/with/417646359/#photo_417646359

CS6234 Advanced Algorithms Whole class as a community Sub-community

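Graph construction from web data (1) • Webpage www.x.com: href = “www.y.com”, href = “www.z.com” • Webpage www.y.com: href = “www.x.com”, href = “www.a.com”, href = “www.b.com” • Webpage www.z.com: href = “www.a.com” • Resulting graph: nodes x, y, z, a, b, with one edge per hyperlink (x → y, x → z, y → x, y → a, y → b, z → a)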
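The construction on this slide can be written down as a directed adjacency list. The following is a minimal sketch, assuming each page is identified by its short name (x, y, z, a, b) and each hyperlink becomes one directed edge. The names `links` and `edge_list` are illustrative, and pages a and b are assumed to have no outgoing links because the slide shows none.

```python
# Outgoing hyperlinks per page, as read from the slide.
# Assumption: pages "a" and "b" have no outgoing links (none are shown).
links = {
    "x": ["y", "z"],
    "y": ["x", "a", "b"],
    "z": ["a"],
    "a": [],
    "b": [],
}


def edge_list(links):
    """Return the directed edges (source, target) implied by the hyperlinks."""
    return [(src, dst) for src, targets in links.items() for dst in targets]


edges = edge_list(links)
print(edges)
```

Each href contributes exactly one edge, so the graph has as many edges as there are links on the three pages.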

Graph construction from web data (2)

Web pages as a graph Cnn.com Lots of links, lots of images. (1316 tags) http://www.aharef.info/2006/05/websites_as_graphs.htm

Internet as a graph • nodes = service providers • edges = connections • hierarchical structure. S. Carmi, S. Havlin, S. Kirkpatrick, Y. Shavitt, E. Shir. A model of Internet topology using k-shell decomposition. PNAS 104 (27), pp. 11150-11154, 2007

Emerging structures • Graphs (from the web, daily life) present certain structural characteristics • Groups of nodes interacting with each other • Dense inter-connections • Functional/topical associations • Community, a.k.a. group, subgroup, module, cluster

Community Types • Explicit • The result of conscious human decision • Implicit • Emerging from the interactions & activities of users • Need special methods to be discovered

Defining Communities • Often communities are defined with respect to a graph, G = (V,E), representing a set of objects (V) and their relations (E). • Even if such a graph is not explicit in the raw data, it is usually possible to construct one, e.g. feature vectors → distances → graph
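One possible way to go from feature vectors to a graph is to connect every pair of objects whose distance falls below a threshold. This is only a sketch under assumptions: the slide does not say which distance or rule to use, so Euclidean distance, the `threshold` parameter and the sample `points` are all hypothetical choices.

```python
import math


def distance_graph(points, threshold):
    """Connect each pair of points whose Euclidean distance is <= threshold.

    Returns a dict mapping an undirected edge (u, v) to its distance.
    """
    edges = {}
    names = sorted(points)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            d = math.dist(points[u], points[v])
            if d <= threshold:
                edges[(u, v)] = d
    return edges


# Hypothetical 2-D feature vectors: two tight pairs that are far apart.
points = {"p1": (0.0, 0.0), "p2": (1.0, 0.0), "p3": (5.0, 5.0), "p4": (5.0, 6.0)}
graph = distance_graph(points, threshold=2.0)
print(graph)
```

Other rules (k-nearest neighbours, similarity-weighted complete graphs) fit the same pipeline; the threshold rule is simply the shortest to write.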

Communities and graphs • Given a graph, a community is defined as a set of nodes that are more densely connected to each other than to the rest of the network. [Figure labels: internal edge, external edge]

Graph cuts • A cut is a partition of the vertices of a graph into two disjoint subsets. • The cut-set of the cut is the set of edges whose end points are in different subsets of the partition.
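The cut-set definition translates directly into a filter over the edge list. A small sketch, assuming an undirected graph stored as a list of vertex pairs; `cut_set` and the two-triangle example graph are illustrative, not taken from the slides.

```python
def cut_set(edges, side):
    """Return the edges with exactly one endpoint in `side`."""
    side = set(side)
    return [(u, v) for u, v in edges if (u in side) != (v in side)]


# Two triangles, {1, 2, 3} and {4, 5, 6}, joined by the single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
crossing = cut_set(edges, {1, 2, 3})
print(crossing)  # [(3, 4)]
```

Cutting between the triangles crosses only one edge, which matches the intuition that each triangle is a community.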

Community detection methods • Graph partitioning • Node clustering • K-means clustering • Spectral clustering

Graph Partitioning by Muthu Kumar C

Graph Partitioning • Dividing vertices into groups of predefined size. • Given a graph G = (V, E, WE), with vertices V, edges E and edge weights WE, choose a partition such that: • V = V1 ∪ V2 ∪ … ∪ Vp • Vi ∩ Vj = Ø for all i ≠ j • Bisectioning: partitioning into two equal-sized groups of vertices.
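The two partition conditions can be checked mechanically. The sketch below takes disjointness to mean that no vertex appears in two groups; the helper names `is_partition` and `is_bisection` are assumptions made for illustration.

```python
def is_partition(vertices, groups):
    """True if the groups are pairwise disjoint and together cover `vertices`."""
    seen = set()
    for group in groups:
        group = set(group)
        if seen & group:
            return False  # a vertex already appeared in an earlier group
        seen |= group
    return seen == set(vertices)


def is_bisection(vertices, groups):
    """True if `groups` is a partition into exactly two equal-sized groups."""
    return (
        len(groups) == 2
        and is_partition(vertices, groups)
        and len(set(groups[0])) == len(set(groups[1]))
    )


print(is_bisection(range(1, 9), [{1, 2, 3, 4}, {5, 6, 7, 8}]))
```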

How many partitions? • There exist many possible partitionings to search. • Just to divide n vertices into 2 equal-sized partitions there are $\frac{1}{2}\binom{n}{n/2}$ ways, which is exponential in n. • Choosing an optimal partitioning is NP-complete. [Figure: several alternative bisections of the same 8-node graph.]
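To get a feel for how fast the search space grows, the number of ways to split n vertices into two equal halves can be computed directly. This sketch assumes the two halves are unlabeled, so each split is counted once; `count_bisections` is an illustrative name.

```python
from math import comb


def count_bisections(n):
    """Number of ways to split n vertices into two unlabeled halves of size n/2."""
    if n % 2:
        raise ValueError("n must be even for a bisection")
    # comb(n, n // 2) picks one half; dividing by 2 ignores which half is 'first'.
    return comb(n, n // 2) // 2


for n in (8, 16, 32, 64):
    print(n, count_bisections(n))
```

Even for the 8-node example used later in the deck there are 35 candidate bisections, and the count grows exponentially from there.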

Kernighan/Lin Algorithm [1] • An iterative, 2-way, balanced partitioning (bi-sectioning) heuristic. • The algorithm can also be extended to solve more general partitioning problems. • Given $G = (V, E, W_E)$ with $|V| = 2n$, find a partition $V = A \cup B$ with $A \cap B = \emptyset$ and $|A| = |B| = n$ such that: • Cutsize T between A and B is minimized, where $T = \sum_{a \in A,\, b \in B} w(a, b)$. [1] Kernighan, B. W., & Lin, S. (1970). An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49(2), 291-307.

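Kernighan-Lin: Definitions • Let $a \in A$ and $b \in B$ be two vertices. • External cost: $E(a) = \sum_{v \in B} w(a, v)$ • Internal cost: $I(a) = \sum_{v \in A,\, v \neq a} w(a, v)$ • Moving node a from A to B increases T by $I(a)$ and decreases T by $E(a)$. • This is measured as $D(a) = E(a) - I(a)$. • $E(b)$, $I(b)$ and $D(b)$ are defined analogously for b in B.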

K/L Algorithm: Swap [Figure: partitions A and B with their cutsize, before and after swapping nodes a and b.]
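The effect of a swap on the cutsize can be checked numerically. The sketch below uses the standard Kernighan-Lin gain g(a, b) = D(a) + D(b) - 2·w(a, b), with D(v) taken as external cost minus internal cost. The small weighted graph `w` and all helper names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def weight(w, u, v):
    """Symmetric edge-weight lookup; a missing edge has weight 0."""
    return w.get((u, v), w.get((v, u), 0))


def cutsize(w, A, B):
    """T: total weight of the edges running between A and B."""
    return sum(weight(w, a, b) for a in A for b in B)


def d_value(w, v, own, other):
    """D(v) = E(v) - I(v), where v belongs to `own`."""
    external = sum(weight(w, v, u) for u in other)
    internal = sum(weight(w, v, u) for u in own if u != v)
    return external - internal


def swap_gain(w, a, b, A, B):
    """Reduction in cutsize obtained by swapping a (in A) with b (in B)."""
    return d_value(w, a, A, B) + d_value(w, b, B, A) - 2 * weight(w, a, b)


# Hypothetical weighted graph on four vertices.
w = {(1, 2): 1, (1, 3): 2, (2, 4): 1, (3, 4): 3}
A, B = {1, 2}, {3, 4}

before = cutsize(w, A, B)
after = cutsize(w, {3, 2}, {1, 4})  # partitions after swapping 1 and 3
print(before, after, swap_gain(w, 1, 3, A, B))
```

The gain predicted from the D values equals the actual change in cutsize, so a sweep never needs to recompute T from scratch to rank candidate swaps.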

Kernighan-Lin Algorithm (page 1 of 2)
• Compute T = Cost(A,B) for initial A, B
• Repeat // sweep begins
    • Compute costs D(v) for all v in V
    • Unmark all vertices in V
    • While there are unmarked nodes
        • Find an unmarked pair (ai,bi) with maximal gain g(ai,bi)
        • Mark ai and bi (but do not swap them)
        • Update D(v) for all unmarked v, as though ai and bi had been swapped
    • Endwhile
• Each sweep greedily computes |V|/2 possible X ⊂ A, Y ⊂ B to swap, and picks a sequence of the best such swaps.

Kernighan-Lin Algorithm (page 2 of 2)
    • We have now computed:
        • a sequence of pairs (a1,b1), …, (ak,bk), and
        • gains g(1), …, g(k), where k = |V|/2, numbered in the order in which we marked them
    • Pick m ≤ k which maximizes Gain
    • Gain = $\sum_{i=1}^{m} g(i)$ // Gain is the reduction in cost from swapping (a1,b1) through (am,bm)
    • If Gain > 0 then // it is worth swapping
        • Update newA = (A - { a1,…,am }) ∪ { b1,…,bm }
        • Update newB = (B - { b1,…,bm }) ∪ { a1,…,am }
        • Update T = T - Gain
    • endif
• Until Gain <= 0 // sweep ends
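The two pages of pseudocode can be sketched in Python as follows. This is a minimal, unoptimized illustration rather than the authors' implementation: it assumes the standard gain g(a, b) = D(a) + D(b) - 2·w(a, b), stores weights in a dictionary keyed by vertex pairs, and searches all unmarked pairs naively. The function names and the two-triangle test graph are hypothetical.

```python
from itertools import product


def weight(w, u, v):
    """Symmetric edge-weight lookup; a missing edge has weight 0."""
    return w.get((u, v), w.get((v, u), 0))


def cutsize(w, A, B):
    return sum(weight(w, a, b) for a, b in product(A, B))


def kl_pass(w, A, B):
    """One sweep. Returns (gain, newA, newB); A and B are unchanged if gain <= 0."""
    A, B = set(A), set(B)
    D = {}
    for own, other in ((A, B), (B, A)):
        for v in own:
            D[v] = (sum(weight(w, v, u) for u in other)
                    - sum(weight(w, v, u) for u in own if u != v))
    free_a, free_b = set(A), set(B)
    pairs, gains = [], []
    while free_a and free_b:
        # Find the unmarked pair with maximal gain (naive search).
        a, b = max(product(sorted(free_a), sorted(free_b)),
                   key=lambda p: D[p[0]] + D[p[1]] - 2 * weight(w, *p))
        gains.append(D[a] + D[b] - 2 * weight(w, a, b))
        pairs.append((a, b))
        # Mark a and b, then update D as though they had been swapped.
        free_a.remove(a)
        free_b.remove(b)
        for x in free_a:
            D[x] += 2 * weight(w, x, a) - 2 * weight(w, x, b)
        for y in free_b:
            D[y] += 2 * weight(w, y, b) - 2 * weight(w, y, a)
    # Pick the prefix length m with the largest cumulative gain.
    best_gain, best_m, running = 0, 0, 0
    for i, g in enumerate(gains, start=1):
        running += g
        if running > best_gain:
            best_gain, best_m = running, i
    if best_gain > 0:
        moved_a = {a for a, _ in pairs[:best_m]}
        moved_b = {b for _, b in pairs[:best_m]}
        A, B = (A - moved_a) | moved_b, (B - moved_b) | moved_a
    return best_gain, A, B


def kernighan_lin(w, A, B):
    """Repeat sweeps until a sweep yields no positive gain."""
    T = cutsize(w, A, B)
    while True:
        gain, A, B = kl_pass(w, A, B)
        if gain <= 0:
            return A, B, T
        T -= gain


# Two unit-weight triangles {1,2,3} and {4,5,6} joined by edge (3,4),
# started from a deliberately poor bisection.
w = {e: 1 for e in [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]}
A, B, T = kernighan_lin(w, {1, 2, 4}, {3, 5, 6})
print(sorted(A), sorted(B), T)
```

On this graph the first sweep swaps vertices 4 and 3, and the second sweep finds no positive gain, so the loop stops.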

Kernighan/Lin Example
• Edges are unweighted in this example. [Figure: 8-node example graph (nodes 1 to 8), redrawn after each candidate swap.]
• Legend: D(v) is the cost of node v; g_i is the gain after the i-th node swap; G_i is the cumulative gain in the current pass.
• Initial state: cut cost 9; unmarked: 1,2,3,4,5,6,7,8.
• Step 1: calculate D values to find the best pair. D(1) = 1, D(2) = 1, D(3) = 2, D(4) = 1, D(5) = 1, D(6) = 2, D(7) = 1, D(8) = 1. Nodes 3 and 5 lead to the maximum gain: g1 = 2+1-0 = 3. Mark the pair as a candidate swap (3,5); G1 = g1 = 3. New cut cost: 6; unmarked: 1,2,4,6,7,8.
• Step 2: D(1) = -1, D(2) = -1, D(4) = 3, D(6) = 2, D(7) = -1, D(8) = -1. g2 = 3+2-0 = 5. Swap (4,6); G2 = G1+g2 = 8. New cut cost: 1; unmarked: 1,2,7,8.
• Step 3: D(1) = -3, D(2) = -3, D(7) = -3, D(8) = -3. g3 = -3-3-0 = -6. Swap (1,7); G3 = G2+g3 = 2. New cut cost: 7; unmarked: 2,8.
• Step 4: D(2) = -1, D(8) = -1. g4 = -1-1-0 = -2. Swap (2,8); G4 = G3+g4 = 0. New cut cost: 9; unmarked: none.
• Maximum positive gain: Gm = 8 with m = 2.
• Since Gm > 0, the first m = 2 swaps, (3,5) and (4,6), are executed.
• Since Gm > 0, more passes are needed until Gm ≤ 0.
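The choice of m in the example is a running-sum maximization over the gains g1 to g4. A short sketch of that step, using the gains from the slides; `best_prefix` is an illustrative helper name, and the function assumes a non-empty gain list.

```python
from itertools import accumulate


def best_prefix(gains):
    """Return (m, G_m): the prefix length with the largest cumulative gain."""
    totals = list(accumulate(gains))          # G_1, G_2, ..., G_k
    m = max(range(len(totals)), key=totals.__getitem__) + 1
    return m, totals[m - 1]


gains = [3, 5, -6, -2]                        # g1..g4 from the worked example
m, G = best_prefix(gains)
print(m, G)  # 2 8
```

The cumulative gains are 3, 8, 2, 0, so only the first two swaps are kept. The final total is 0 because swapping every pair just exchanges the two sides.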

Escaping Local Minima • Gains do not increase monotonically: within the sequence of m swaps chosen, some individual gains may be negative. • Accepting such swaps lets the algorithm possibly escape “local minima”. • But there is no guarantee of an optimal solution.

Demerits • Bi-sectioning does not generalize well to k-way partitioning. • Partitioning into groups of predefined sizes limits its utility to niche applications.

Analysis of K/L Algorithm by Xie Shudong

K/L Algorithm: Analysis
• Compute T = Cost(A,B) for initial A, B
• Repeat
    • Compute costs D(v) for all v in V
    • Unmark all vertices in V
    • While there are unmarked nodes
        • Find an unmarked pair (a, b) with maximal g(a, b)
        • Mark ‘a’ and ‘b’ (but do not swap them)
        • Update D(v) for all unmarked v, as though ‘a’ and ‘b’ had been swapped
    • Endwhile
    • Pick m maximizing Gain G = g(1) + g(2) + … + g(m)
    • If G > 0 then // it is worth swapping
        • Update newA = (A - { a1, …, am }) ∪ { b1, …, bm }
        • Update newB = (B - { b1, …, bm }) ∪ { a1, …, am }
        • Update T = T - G
    • endif
• Until G <= 0

K/L Algorithm: Analysis (cost of each step)
• Compute T = Cost(A,B) for initial A, B: O(|V|²). A and B hold |V|/2 nodes each, so there are up to |V|/2 * |V|/2 = |V|²/4 external edges.
• Compute costs D(v) for all v in V: O(|V|²). For one node a, D(a) = E(a) - I(a) takes O(|V|); for all |V| nodes, O(|V|²).
• Unmark all vertices in V: O(|V|).
• While there are unmarked nodes: |V|/2 pairs to be found, O(|V|³) in total. At the start of the (i+1)-th loop there are |V| - 2i unmarked nodes.
    • Find an unmarked pair (a, b) with maximal g(a, b): O(|V|²).
    • Mark ‘a’ and ‘b’ (but do not swap them): O(1).
    • Update D(v) for all unmarked v, as though ‘a’ and ‘b’ had been swapped: O(|V|), since each update newD(a’) = D(a’) + 2*w(a’, a) - 2*w(a’, b) is O(1).
• Pick m maximizing G = g(1) + g(2) + … + g(m): O(|V|), by scanning the running sums g(1), g(1) + g(2), …, g(1) + g(2) + … + g(|V|/2).
• Update newA and newB: O(|V|) each.
• Update T = T - G: O(1).
• Repeat … Until Gain <= 0 (p iterations): O(p |V|³) overall.
• How many iterations? Empirical testing by Kernighan and Lin on small graphs (|V| <= 360) showed convergence after 2 to 4 passes.
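The cubic cost of one pass comes from the pair search inside the while loop. The sketch below counts how many candidate pairs a naive search inspects in one pass, assuming an even number of vertices and that every unmarked pair is examined each time; `pairs_examined` is an illustrative name.

```python
def pairs_examined(num_vertices):
    """Candidate pairs inspected by the naive pair search during one pass."""
    half = num_vertices // 2
    # With i pairs already marked, (half - i) nodes remain unmarked on each side,
    # so the search looks at (half - i)**2 candidate pairs.
    return sum((half - i) ** 2 for i in range(half))


for n in (8, 16, 32, 64):
    print(n, pairs_examined(n))
```

The count equals k(k+1)(2k+1)/6 with k = |V|/2, so doubling |V| multiplies the work by a factor that approaches 8, which is the signature of O(|V|³) growth.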

K-means Clustering by Agus Pratondo

Graph in R^n [Figure: example graph with nodes a, b, c, d, e drawn in the x-y plane, with numeric labels.]
