Analysis of Large Graphs: Overlapping Communities


Presentation Transcript


  1. Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org Analysis of Large Graphs: Overlapping Communities. Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University, http://www.mmds.org

  2. Identifying Communities Can we identify node groups? (communities, modules, clusters) Nodes: Football Teams Edges: Games played J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  3. NCAA Football Network NCAA conferences Nodes: Football Teams Edges: Games played

  4. Protein-Protein Interactions Can we identify functional modules? Nodes: Proteins Edges: Physical interactions

  5. Protein-Protein Interactions Functional modules Nodes: Proteins Edges: Physical interactions

  6. Facebook Network Can we identify social communities? Nodes: Facebook Users Edges: Friendships

  7. Facebook Network Social communities: High school, Summer internship, Stanford (Basketball), Stanford (Squash) Nodes: Facebook Users Edges: Friendships

  8. Overlapping Communities • Non-overlapping vs. overlapping communities

  9. Non-overlapping Communities [Figure: a network and its adjacency matrix (nodes × nodes)]

  10. Communities as Tiles! • What is the structure of community overlaps: Edge density in the overlaps is higher! Communities as “tiles”

  11. Recap so far… This is what we want! Communities in a network

  12. Plan of attack • 1) Given a model, we generate the network • 2) Given a network, find the “best” model [Figure: generative model for networks, on nodes A to H]

  13. Model of networks • Goal: Define a model that can generate networks • The model will have a set of “parameters” that we will later want to estimate (and detect communities) • Q: Given a set of nodes, how do communities “generate” edges of the network? [Figure: generative model for networks, on nodes A to H]

  14. Community-Affiliation Graph • Generative model $B(V, C, M, \{p_c\})$ for graphs: • Nodes $V$, Communities $C$, Memberships $M$ • Each community $c$ has a single probability $p_c$ • Later we fit the model to networks to detect communities [Figure: the model (communities $C$ with probabilities $p_A$, $p_B$; memberships $M$; nodes $V$) and the network it generates]

  15. AGM: Generative Process • AGM generates the links: For each pair of nodes in community $A$, we connect them with prob. $p_A$ • The overall edge probability is: $P(u,v) = 1 - \prod_{c \in M_u \cap M_v} (1 - p_c)$ • $M_u$ … set of communities node $u$ belongs to • If $u$, $v$ share no communities: $P(u,v) = \varepsilon$ • Think of this as an “OR” function: If at least 1 community says “YES” we create an edge [Figure: community affiliations (communities $C$, memberships $M$, nodes $V$) and the resulting network]
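
The generative process on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the toy memberships, the community probabilities, the function names and the default value of `eps` (the background probability for pairs that share no community) are all assumptions made for the example.

```python
import random
from itertools import combinations


def agm_edge_prob(memberships, p, u, v, eps=1e-4):
    """P(u,v) = 1 - prod over shared communities c of (1 - p_c).

    Returns eps (an assumed small background probability) when u and v
    share no community.
    """
    shared = memberships[u] & memberships[v]
    if not shared:
        return eps
    prob_no_edge = 1.0
    for c in shared:
        prob_no_edge *= 1.0 - p[c]  # community c does NOT create the edge
    return 1.0 - prob_no_edge       # "OR": at least one community says yes


def sample_agm(memberships, p, eps=1e-4, seed=0):
    """Flip one biased coin per node pair and return the sampled edge set."""
    rng = random.Random(seed)
    edges = set()
    for u, v in combinations(sorted(memberships), 2):
        if rng.random() < agm_edge_prob(memberships, p, u, v, eps):
            edges.add((u, v))
    return edges


# Hypothetical toy model: four nodes, two overlapping communities.
memberships = {"u": {"A"}, "v": {"A", "B"}, "w": {"B"}, "x": {"A", "B"}}
p = {"A": 0.5, "B": 0.2}

print(agm_edge_prob(memberships, p, "v", "x"))  # v and x share both A and B
print(sample_agm(memberships, p))
```

Nodes that share more communities get a higher edge probability, which is why the overlaps of communities come out denser than the rest.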

  16. Recap: AGM networks [Figure: a model and the network it generates]

  17. AGM: Flexibility • AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested

  18. How do we detect communities with AGM?

  19. Detecting Communities • Detecting communities with AGM: Given a graph $G$, find the model: 1) Affiliation graph $M$ 2) Number of communities $C$ 3) Parameters $p_c$ [Figure: example network on nodes A to H]

  20. Maximum Likelihood Estimation • Maximum Likelihood Principle (MLE): • Given: Data $X$ • Assumption: Data is generated by some model $f(\Theta)$ • $f$ … model • $\Theta$ … model parameters • Want to estimate $P_f(X|\Theta)$: the probability that our model $f$ (with parameters $\Theta$) generated the data • Now let’s find the most likely model that could have generated the data: $\arg\max_{\Theta} P_f(X|\Theta)$

  21. Example: MLE • Imagine we are given a set of coin flips • Task: Figure out the bias of a coin! • Data: Sequence of coin flips: $X = [x_1, \dots, x_n]$, $x_i \in \{0, 1\}$ • Model: $f(\Theta)$ = return 1 with prob. $\Theta$, else return 0 • What is $P_f(X|\Theta)$? Assuming coin flips are independent, $P_f(X|\Theta) = \prod_{i} P_f(x_i|\Theta)$ • What is $P_f(1|\Theta)$? Simple, $P_f(1|\Theta) = \Theta$ • Then, $P_f(X|\Theta) = \Theta^{k} (1-\Theta)^{n-k}$, where $k$ is the number of 1s in $X$ • For example: … • What did we learn? Our data was most likely generated by a coin with bias $\Theta = k/n$
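
A small numeric check of the coin example, as a sketch. The flip sequence below is a hypothetical stand-in (the values shown on the slide did not survive in this transcript), and the function names are illustrative. The code evaluates the likelihood of the flips for a given bias and compares the closed-form estimate (fraction of 1s) with a brute-force search over a grid of candidate biases.

```python
def likelihood(flips, theta):
    """P(X | theta) for independent flips: theta per 1, (1 - theta) per 0."""
    result = 1.0
    for x in flips:
        result *= theta if x == 1 else 1.0 - theta
    return result


def mle_bias(flips):
    """Closed-form maximizer of the likelihood: the fraction of 1s."""
    return sum(flips) / len(flips)


# Hypothetical data: 3 ones out of 8 flips.
flips = [1, 0, 0, 0, 1, 0, 0, 1]

theta_hat = mle_bias(flips)

# Brute-force check: the best bias on a fine grid should agree with theta_hat.
grid = [i / 1000 for i in range(1, 1000)]
best_on_grid = max(grid, key=lambda t: likelihood(flips, t))

print(theta_hat, best_on_grid)
print(likelihood(flips, 0.5), likelihood(flips, theta_hat))
```

A fair coin explains this data less well than a coin biased toward 0, which is the point of the slide: the estimate is whichever parameter makes the observed data most probable.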

  22. MLE for Graphs • How do we do MLE for graphs? • Model generates a probabilistic adjacency matrix: for every pair of nodes $u, v$, AGM gives the prob. $P(u,v)$ of them being linked • We then flip biased coins for all the entries of the probabilistic matrix to obtain the binary adjacency matrix $A$ • The likelihood of AGM generating graph $G$: $P(G|\Theta) = \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} (1 - P(u,v))$
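
The graph likelihood is a product with one factor per node pair: the pair's link probability if the edge is present, and one minus that probability if it is absent. A minimal sketch, assuming the probabilistic adjacency matrix is supplied as a dictionary keyed by node pair; the numbers and names are made up for illustration. Working in log space avoids underflow once the graph has more than a handful of pairs.

```python
import math
from itertools import combinations


def graph_log_likelihood(nodes, edges, prob):
    """log P(G | model) for an undirected graph.

    nodes: sorted list of node ids
    edges: set of (u, v) pairs with u < v
    prob:  dict mapping (u, v) with u < v to the model's link probability
    """
    ll = 0.0
    for u, v in combinations(nodes, 2):
        p_uv = prob[(u, v)]
        if (u, v) in edges:
            ll += math.log(p_uv)        # edge present
        else:
            ll += math.log(1.0 - p_uv)  # edge absent
    return ll


# Hypothetical probabilistic adjacency matrix for three nodes.
nodes = ["a", "b", "c"]
prob = {("a", "b"): 0.9, ("a", "c"): 0.1, ("b", "c"): 0.1}
edges = {("a", "b")}

ll = graph_log_likelihood(nodes, edges, prob)
print(ll, math.exp(ll))
```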

  23. Graphs: Likelihood P(G|Θ) • Given graph $G(V,E)$ and $\Theta$, we calculate the likelihood that $\Theta$ generated $G$: $P(G|\Theta)$, where $\Theta = B(V, C, M, \{p_c\})$ [Figure: model $\Theta$, graph $G$ with adjacency matrix $A$, and the likelihood $P(G|\Theta)$]

  24. MLE for Graphs • Our goal: Find $\Theta = B(V, C, M, \{p_c\})$ such that: $\arg\max_{\Theta} P(G|\Theta)$ • How do we find $B(V, C, M, \{p_c\})$ that maximizes the likelihood?

  25. MLE for AGM • Our goal is to find $B(V, C, M, \{p_c\})$ such that: $\arg\max_{B(V, C, M, \{p_c\})} \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} (1 - P(u,v))$ • Problem: Finding $B$ means finding the bipartite affiliation network. • There is no nice way to do this. • Fitting $B(V, C, M, \{p_c\})$ is too hard, let’s change the model (so it is easier to fit)!

  26. From AGM to BigCLAM • Relaxation: Memberships have strengths • $F_{uA}$: The membership strength of node $u$ to community $A$ ($F_{uA} = 0$: no membership) • Each community $A$ links nodes independently: $P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$

  27. Factor Matrix • Community membership strength matrix $F$ (rows: nodes, columns: communities) • $F_u$ … vector of community membership strengths of $u$ • $F_{uA}$ … strength of $u$’s membership to $A$ • $P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$: the probability of connection grows with the product of strengths • Notice: If one node doesn’t belong to the community ($F_{uC} = 0$) then $P_C(u,v) = 0$ • Prob. that at least one common community $C$ links the nodes: $P(u,v) = 1 - \prod_{C} (1 - P_C(u,v))$

  28. From AGM to BigCLAM • Community $A$ links nodes $u, v$ independently: $P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$ • Then prob. at least one common $C$ links them: $P(u,v) = 1 - \prod_{C} (1 - P_C(u,v)) = 1 - \exp(-\sum_{C} F_{uC} \cdot F_{vC}) = 1 - \exp(-F_u \cdot F_v^T)$ • Example matrix $F$ (node community membership strengths): … Then: … And: … But: …
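
To make the step from per-community probabilities to the combined edge probability concrete, here is a small sketch with made-up membership strengths (the example matrix on the slide is not preserved in this transcript, so the rows below are purely illustrative). It checks numerically that "at least one community links the pair" collapses to $1 - \exp(-F_u \cdot F_v^T)$.

```python
import math


def community_edge_prob(f_uc, f_vc):
    """P_C(u,v) = 1 - exp(-F_uC * F_vC) for a single community C."""
    return 1.0 - math.exp(-f_uc * f_vc)


def edge_prob(F_u, F_v):
    """P(u,v) = 1 - exp(-F_u . F_v): at least one community links u and v."""
    return 1.0 - math.exp(-sum(a * b for a, b in zip(F_u, F_v)))


def edge_prob_via_product(F_u, F_v):
    """Same quantity computed as 1 - prod_C (1 - P_C(u,v))."""
    none_links = 1.0
    for a, b in zip(F_u, F_v):
        none_links *= 1.0 - community_edge_prob(a, b)
    return 1.0 - none_links


# Hypothetical strengths for three nodes over four communities.
F_u = [0.0, 1.2, 0.0, 0.2]
F_v = [0.5, 0.0, 0.0, 0.8]
F_w = [0.0, 1.8, 1.0, 0.0]

print(edge_prob(F_u, F_v))  # u and v overlap only weakly
print(edge_prob(F_u, F_w))  # u and w share a strong community
print(edge_prob(F_v, F_w))  # v and w share no community
```

Pairs with no common community get probability exactly 0 under this relaxation, and the probability rises with the dot product of the two strength vectors.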

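  29. BigCLAM: How to find F • Task: Given a network $G(V,E)$, estimate $F$ • Find $F$ that maximizes the likelihood: $\arg\max_{F} \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} (1 - P(u,v))$ • where: $P(u,v) = 1 - \exp(-F_u \cdot F_v^T)$ • Many times we take the logarithm of the likelihood, and call it log-likelihood: $l(F) = \log P(G|F)$ • Goal: Find $F$ that maximizes $l(F)$: $l(F) = \sum_{(u,v) \in E} \log(1 - \exp(-F_u F_v^T)) - \sum_{(u,v) \notin E} F_u F_v^T$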
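
The log-likelihood has one term per node pair: $\log(1 - \exp(-F_u F_v^T))$ for an edge, and $-F_u F_v^T$ for a non-edge (because $\log(1 - P(u,v)) = -F_u F_v^T$). A direct, unoptimized sketch follows; the toy graph and both candidate matrices are assumptions chosen for illustration, and the function assumes every edge has a strictly positive dot product so the logarithm is defined.

```python
import math
from itertools import combinations


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_likelihood(F, edges):
    """l(F) for an undirected graph; edges holds (u, v) index pairs with u < v."""
    ll = 0.0
    for u, v in combinations(range(len(F)), 2):
        s = dot(F[u], F[v])
        if (u, v) in edges:
            ll += math.log(1.0 - math.exp(-s))  # needs s > 0
        else:
            ll -= s                             # log(exp(-s))
    return ll


# Hypothetical graph: nodes 0 and 1 are linked, node 2 is isolated.
edges = {(0, 1)}

# Candidate A puts 0 and 1 in the same community and 2 in its own.
F_good = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Candidate B groups 1 with 2 instead, which contradicts the observed edges.
F_bad = [[1.0, 0.1], [0.1, 1.0], [0.1, 1.0]]

print(log_likelihood(F_good, edges), log_likelihood(F_bad, edges))
```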

  30. BigCLAM: V1.0 • Compute gradient of a single row $F_u$ of $F$: $\nabla l(F_u) = \sum_{v \in N(u)} F_v \frac{\exp(-F_u F_v^T)}{1 - \exp(-F_u F_v^T)} - \sum_{v \notin N(u)} F_v$ • $N(u)$ … set of outgoing neighbors of $u$ • Coordinate gradient ascent: • Iterate over the rows of $F$: • Compute gradient $\nabla l(F_u)$ of row $u$ (while keeping others fixed) • Update the row $F_u$: $F_u \leftarrow F_u + \eta \nabla l(F_u)$ • Project $F_u$ back to a non-negative vector: If $F_{uC} < 0$: $F_{uC} = 0$ • This is slow! Computing $\nabla l(F_u)$ takes linear time (in the number of nodes)!
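
One row update of the V1.0 procedure might look like the sketch below: compute the gradient of row $u$ with every other row held fixed, take a step of size `eta`, and clip negative entries to zero. The toy matrix, the neighbor sets, the step sizes and the function names are assumptions for illustration, and the code assumes neighboring rows have a strictly positive dot product.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_likelihood(F, neighbors):
    """l(F) for an undirected graph given symmetric neighbor sets."""
    ll = 0.0
    for u in range(len(F)):
        for v in range(u + 1, len(F)):
            s = dot(F[u], F[v])
            ll += math.log(1.0 - math.exp(-s)) if v in neighbors[u] else -s
    return ll


def grad_row(F, neighbors, u):
    """Gradient of l(F) with respect to row F_u (naive: visits every node)."""
    g = [0.0] * len(F[u])
    for v in range(len(F)):
        if v == u:
            continue
        if v in neighbors[u]:
            e = math.exp(-dot(F[u], F[v]))
            weight = e / (1.0 - e)   # neighbor term
        else:
            weight = -1.0            # non-neighbor term
        for c, f_vc in enumerate(F[v]):
            g[c] += weight * f_vc
    return g


def update_row(F, neighbors, u, eta):
    """Gradient step on row u followed by projection onto non-negative values."""
    g = grad_row(F, neighbors, u)
    return [max(0.0, f + eta * gc) for f, gc in zip(F[u], g)]


# Hypothetical toy problem: nodes 0 and 1 are linked, node 2 is isolated.
F = [[1.0, 0.2], [0.8, 0.1], [0.1, 0.9]]
neighbors = {0: {1}, 1: {0}, 2: set()}

before = log_likelihood(F, neighbors)
F[0] = update_row(F, neighbors, 0, eta=0.01)
after = log_likelihood(F, neighbors)
print(before, after)
```

A full fit would sweep over all rows repeatedly until the log-likelihood stops improving. The loop over every `v` inside `grad_row` is the linear-time cost the slide calls slow.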

  31. BigCLAM: V2.0 • However, we notice: $\sum_{v \notin N(u)} F_v = \sum_{v} F_v - F_u - \sum_{v \in N(u)} F_v$ • We cache $\sum_{v} F_v$ • So, computing $\sum_{v \notin N(u)} F_v$ now takes linear time in the degree $|N(u)|$ of $u$ • In networks the degree of a node is much smaller than the total number of nodes in the network, so this is a significant speedup!
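
The caching trick can be checked directly: the sum over non-neighbors equals the cached sum over all rows, minus the node's own row, minus the rows of its neighbors. A minimal sketch with a made-up matrix and made-up neighbor sets; the function names are illustrative.

```python
def column_sums(F):
    """Sum of all rows of F; this is the quantity worth caching."""
    return [sum(col) for col in zip(*F)]


def non_neighbor_sum_naive(F, neighbors, u):
    """Visits every node: cost grows with the number of nodes."""
    total = [0.0] * len(F[u])
    for v in range(len(F)):
        if v != u and v not in neighbors[u]:
            for c, f_vc in enumerate(F[v]):
                total[c] += f_vc
    return total


def non_neighbor_sum_cached(F, neighbors, u, cached_total):
    """Visits only the neighbors of u: cost grows with the degree of u."""
    result = [t - f for t, f in zip(cached_total, F[u])]
    for v in neighbors[u]:
        for c, f_vc in enumerate(F[v]):
            result[c] -= f_vc
    return result


# Hypothetical membership strengths and neighbor sets for five nodes.
F = [[1.0, 0.0], [0.7, 0.2], [0.0, 1.5], [0.3, 0.3], [0.0, 0.9]]
neighbors = {0: {1, 3}, 1: {0}, 2: {4}, 3: {0}, 4: {2}}

cached_total = column_sums(F)
print(non_neighbor_sum_naive(F, neighbors, 0))
print(non_neighbor_sum_cached(F, neighbors, 0, cached_total))
```

Whenever a row $F_u$ is updated, the cached total has to be adjusted by the difference between the new and old row so that it stays consistent.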

  32. BigCLAM: Scalability • BigCLAM takes 5 minutes for 300k-node networks • Other methods take 10 days • Can process networks with 100M edges!

  33. Extension: Directed memberships

  34. Extension: Beyond Clusters

  35. Extension: Directed AGM • Extension: Make community membership edges directed! • Outgoing membership: Node “sends” edges • Incoming membership: Node “receives” edges

  36. Example: Model and Network

  37. Directed AGM • Everything is almost the same except now we have 2 matrices: $F$ and $H$ • $F$ … out-going community memberships • $H$ … in-coming community memberships • Edge prob.: $P(u,v) = 1 - \exp(-F_u H_v^T)$ • Network log-likelihood: $l(F,H) = \sum_{(u,v) \in E} \log(1 - \exp(-F_u H_v^T)) - \sum_{(u,v) \notin E} F_u H_v^T$, which we optimize the same way as before
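
With separate outgoing and incoming strengths, the edge probability is no longer symmetric. The sketch below uses a hypothetical single community in which node 0 only sends and nodes 1 and 2 only receive; the matrices, the edge set and the function names are assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def directed_edge_prob(F, H, u, v):
    """P(u -> v) = 1 - exp(-F_u . H_v): u's outgoing times v's incoming strengths."""
    return 1.0 - math.exp(-dot(F[u], H[v]))


def directed_log_likelihood(F, H, edges):
    """l(F, H) summed over all ordered pairs (u, v) with u != v."""
    ll = 0.0
    for u in range(len(F)):
        for v in range(len(H)):
            if u == v:
                continue
            s = dot(F[u], H[v])
            if (u, v) in edges:
                ll += math.log(1.0 - math.exp(-s))  # needs s > 0
            else:
                ll -= s
    return ll


# Hypothetical one-community model: node 0 sends, nodes 1 and 2 receive.
F = [[1.5], [0.0], [0.0]]  # outgoing memberships
H = [[0.0], [1.0], [1.0]]  # incoming memberships
edges = {(0, 1), (0, 2)}

print(directed_edge_prob(F, H, 0, 1))  # 0 -> 1 is likely
print(directed_edge_prob(F, H, 1, 0))  # 1 -> 0 has probability zero
print(directed_log_likelihood(F, H, edges))
```

Communities of this shape, where one group of nodes links to another but not among themselves, cannot be expressed with a single symmetric membership matrix.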

  38. Predator-prey Communities

  39. More details at… • Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2013. • Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2014. • Community Detection in Networks with Node Attributes by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference On Data Mining (ICDM), 2013.