
Size matters: 1) Cluster structure of large networks 2) Searching the world’s social network

Jure Leskovec (jure@cs.stanford.edu), Computer Science Department, Cornell University / Stanford University. Joint work with: Eric Horvitz, Michael Mahoney, Kevin Lang, Anirban Dasgupta.


Presentation Transcript


  1. Jure Leskovec (jure@cs.stanford.edu), Computer Science Department, Cornell University / Stanford University • Joint work with: Eric Horvitz, Michael Mahoney, Kevin Lang, Anirban Dasgupta • Size matters: 1) Cluster structure of large networks 2) Searching the world’s social network

  2. Rich data: Networks • Large on-line computing applications have detailed records of human activity: • On-line communities: Facebook (120 million) • Communication: Instant Messenger (~1 billion) • News and Social media: Blogging (250 million) • We model the data as a network (an interaction graph) • Can observe and study phenomena at scales not possible before [Figure: communication network]

  3. Outline • The Small-world experiment: • On a 240 million node communication network of Microsoft Instant Messenger • Small vs. large networks: • Modeling community (cluster) structure of large networks [Figures: Zachary’s karate club (N=34); tiny part of a large social network]

  4. How expressed are communities? • How community-like is a set of nodes? • Idea: Use approximation algorithms for NP-hard graph partitioning problems as experimental probes of network structure. • Conductance (normalized cut) of a node set S: Φ(S) = # edges cut / # edges inside • Small Φ(S) corresponds to more community-like sets of nodes [Figure labels: S, S’]

  5. Community score (quality) What is “best” community of 5 nodes? • Score: Φ(S) = # edges cut / # edges inside

  6. Community score (quality) What is “best” community of 5 nodes? Bad community Φ=5/6 = 0.83 • Score: Φ(S) = # edges cut / # edges inside

  7. Community score (quality) What is “best” community of 5 nodes? Bad community Φ=5/7 = 0.7 Better community Φ=2/5 = 0.4 • Score: Φ(S) = # edges cut / # edges inside

  8. Community score (quality) What is “best” community of 5 nodes? Bad community Φ=5/7 = 0.7 Best community Φ=2/8 = 0.25 Better community Φ=2/5 = 0.4 • Score: Φ(S) = # edges cut / # edges inside
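The score on slides 5 to 8 can be computed directly from an edge list. The following is a minimal sketch, assuming an undirected graph; the toy graph and the names `build_adjacency` and `community_score` are illustrative and do not come from the talk. It follows the ratio written on the slides (edges cut / edges inside); note that conductance is more commonly defined with the sum of degrees of the set in the denominator.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def community_score(adj, nodes):
    """Slide score: (# edges cut) / (# edges inside) for the node set."""
    s = set(nodes)
    cut = 0
    inside_endpoints = 0  # every inside edge is counted from both ends
    for u in s:
        for v in adj[u]:
            if v in s:
                inside_endpoints += 1
            else:
                cut += 1
    inside = inside_endpoints // 2
    # A set with no inside edges gets the worst possible score.
    return float("inf") if inside == 0 else cut / inside


# Hypothetical toy graph: two triangles joined by the single edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
adj = build_adjacency(edges)
print(community_score(adj, {"a", "b", "c"}))  # 1 cut edge / 3 inside edges
print(community_score(adj, {"a", "b"}))       # 2 cut edges / 1 inside edge
```

A whole triangle scores lower (better) than a pair of its nodes, which matches the slides' reading that a small Φ(S) means a more community-like set.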

  9. Network Community Profile Plot • We define: Network community profile (NCP) plot • Plot the score of the best community of size k, Φ(k), against k • Examples: k=5 gives Φ(5)=0.25, k=7 gives Φ(7)=0.18 [Figure axes: community size, log k (x); log Φ(k) (y)]
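For a very small graph the NCP values can be found by exhaustive search. This is only a sketch under that assumption: it enumerates every node set of each size, which is exponential, whereas slide 4 describes using approximation algorithms for graph partitioning on real networks. The toy graph and function names are hypothetical, and the score is the slides' ratio of edges cut to edges inside.

```python
from itertools import combinations


def community_score(adj, nodes):
    """Slide score: (# edges cut) / (# edges inside) for the node set."""
    s = set(nodes)
    cut = sum(1 for u in s for v in adj[u] if v not in s)
    inside = sum(1 for u in s for v in adj[u] if v in s) // 2
    return float("inf") if inside == 0 else cut / inside


def ncp_brute_force(adj, max_size):
    """Best (lowest) score over all node sets of each size k = 1..max_size."""
    nodes = sorted(adj)
    return {
        k: min(community_score(adj, subset) for subset in combinations(nodes, k))
        for k in range(1, max_size + 1)
    }


# Hypothetical toy graph: two triangles joined by the single edge c-d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
ncp = ncp_brute_force(adj, max_size=3)
for k, phi in ncp.items():
    print(k, phi)
```

Plotting `phi` against `k` on log-log axes would give the NCP plot for this toy graph; the minimum sits at k=3, the size of one triangle.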

  10. NCP plot: Low-dimensional and random graphs [Figure panels: hierarchically nested clusters; d-dimensional meshes]

  11. NCP plot: Zachary’s karate club • Zachary’s university karate club social network • During the study the club split into two • The split (squares vs. circles) corresponds to cut B

  12. NCP plot: Network Science • Collaborations between scientists in Networks [Newman, 2005]

  13. Present work: Large networks • Previous work mostly focused on community structure of small networks (~100 nodes) • We examined 108 different large networks

  14. Example of a large network • Typical example: General relativity collaboration network (4,158 nodes, 13,422 edges)

  15. More NCP plots of networks

  16. NCP: LiveJournal (N=5M, E=42M) • Better and better communities up to the minimum • Best community has ~100 nodes • Beyond that, communities get worse and worse [Figure axes: k, community size (x); Φ(k), conductance (y)]

  17. Explanation: Downward part • Small clusters on the edge of the network are responsible for the downward part of the NCP plot [Figure: NCP plot with the best cluster marked]

  18. Explanation: Upward part • Each additional edge inside the cluster costs more: Φ=1/3 = 0.33, Φ=2/4 = 0.5, Φ=8/6 = 1.3, Φ=64/14 = 4.5 • Each node has twice as many children [Figure: NCP plot]

  19. Suggested network structure • Network structure: Core-periphery (jellyfish, octopus) • Denser and denser core of the network • Core contains ~60% nodes and ~80% edges • Whiskers are responsible for good communities

  20. What is a good model? • What is a good model that explains such network structure? [Figure panels, NCP shapes: Flat and Down; Flat; Down and Flat. Models shown: Geometric Pref. Attachment; Small World; Pref. Attachment]

  21. Forest Fire model works • Forest Fire [LKF05]: connections spread like a fire • New node joins the network • Selects a seed node • Connects to some of its neighbors • Continue recursively • As community grows it blends into the core of the network • Notes: • Preferential attachment flavor - second neighbor is not uniform at random. • Copying flavor - since we burn the seed’s neighbors. • Hierarchical flavor - seed is parent. • “Local” flavor - burn “near” (in a diffusion sense) the seed vertex.
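The four steps on slide 21 can be sketched as a short generator. This is a simplified, undirected variant written for illustration, not the model as specified in [LKF05] (which is directed and distinguishes forward from backward burning). The function name, the single burning probability `p`, and the geometric draw for how many neighbors to burn are assumptions of this sketch.

```python
import random


def forest_fire_graph(n, p, seed=0):
    """Grow an undirected graph on nodes 0..n-1 by a simplified forest fire.

    Each new node picks a seed node uniformly at random, then the fire
    spreads: from every burning node a geometric number of its unburned
    neighbors (P(x = k) = (1 - p) * p**k) catches fire, recursively.
    The new node links to every burned node.
    """
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        seed_node = rng.randrange(new)          # select a seed node
        burned = {seed_node}
        frontier = [seed_node]
        while frontier:                         # continue recursively
            w = frontier.pop()
            x = 0
            while rng.random() < p:             # geometric number of links
                x += 1
            candidates = sorted(u for u in adj[w] if u not in burned)
            rng.shuffle(candidates)
            for u in candidates[:x]:            # burn some of w's neighbors
                burned.add(u)
                frontier.append(u)
        adj[new] = set(burned)                  # connect to everything burned
        for u in burned:
            adj[u].add(new)
    return adj


g = forest_fire_graph(n=200, p=0.35, seed=1)
num_edges = sum(len(nbrs) for nbrs in g.values()) // 2
print(len(g), "nodes,", num_edges, "edges")
```

Because a new node always links to its seed, the graph stays connected; larger values of `p` make fires spread further and the graph denser.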

  22. Forest Fire NCP plot [Figure legend: rewired network]

  23. Typical cluster size • How does the size of best cluster scale with the size of the network?

  24. Size of best cluster over time • Cluster size remains constant (even if one allows nesting) over time [Figure: LinkedIn network over time]

  25. Cluster size vs. network size • Each dot is a different network

  26. Connections • The Dunbar number • 150 individuals is maximum community size • What edges “mean” and community identification • Using node and edge types/attributes • Implications for machine learning • No large clusters • No/little (assortative) hierarchical structure • Can’t be well embedded – no underlying geometry

  27. The small-world of the MSN Instant Messenger • Joint work with Eric Horvitz, Microsoft Research

  28. The Small-world experiment • The Small-world experiment [Milgram ’67, Dodds-Muhamad-Watts ’03] • People send letters from Nebraska to Boston • How many steps does it take? • 6.2 on average, thus “6 degrees of separation” [Figure: Milgram’s small world experiment]

  29. The Small-world experiment • 1) Short paths exist in a social network • 2) People are able to find them (using only partial knowledge of the network) • Local search: forwarding a message from source s to target t, where d(s,t)=h • Good nodes: neighbors at distance d=h-1 from the target • Bad nodes: neighbors at distance d≥h from the target
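The good/bad distinction on this slide can be made concrete with a breadth-first search from the target. A minimal sketch, assuming an undirected, connected graph given as an adjacency dict; the toy graph and the names `bfs_distances` and `split_neighbors` are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def split_neighbors(adj, s, t):
    """Return (h, good, bad): h = d(s, t); good neighbors have d = h - 1."""
    dist = bfs_distances(adj, t)
    h = dist[s]
    good = sorted(v for v in adj[s] if dist[v] == h - 1)
    bad = sorted(v for v in adj[s] if dist[v] >= h)
    return h, good, bad


# Hypothetical toy graph: s-a-t is the short path; b and c lead away from t.
adj = {
    "s": ["a", "b"],
    "a": ["s", "t"],
    "t": ["a"],
    "b": ["s", "c"],
    "c": ["b"],
}
h, good, bad = split_neighbors(adj, "s", "t")
print(h, good, bad)
```

The share of good nodes among a node's neighbors, `len(good) / len(adj[s])`, is the chance that forwarding to a uniformly random neighbor moves the message one hop closer.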

  30. Our dataset: Instant Messaging • Contact (buddy) list • Messaging window

  31. MSN communication • We collected the data for June 2006 • 4.5 TB of compressed data: • 245 million users logged in • 180 million users engaged in conversations • 255 billion exchanged messages • 1 billion conversations / day

  32. MSN network • The network: 180M nodes, 1.3B undirected edges

  33. MSN: path lengths • MSN Messenger network: number of steps between pairs of people • Avg. path length 6.6 • 90% of the people can be reached in < 8 hops
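Statistics of this kind (average path length, share of people within a given number of hops) come from a hop-count histogram. A minimal sketch, assuming breadth-first search from a set of source nodes; on a network of this size one would presumably sample sources rather than use all of them, which is an assumption here and not something the slide states. The toy path graph and the function names are illustrative.

```python
from collections import Counter, deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def hop_histogram(adj, sources):
    """Count (source, other node) pairs at each hop distance."""
    hist = Counter()
    for s in sources:
        for node, d in bfs_distances(adj, s).items():
            if node != s:
                hist[d] += 1
    return hist


def mean_hops(hist):
    """Average hop distance over all counted pairs."""
    return sum(d * c for d, c in hist.items()) / sum(hist.values())


def fraction_within(hist, max_hops):
    """Share of counted pairs that are at most `max_hops` apart."""
    return sum(c for d, c in hist.items() if d <= max_hops) / sum(hist.values())


# Hypothetical toy graph: a path 0-1-2-3, using every node as a source.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
hist = hop_histogram(adj, sources=list(adj))
print(dict(hist), mean_hops(hist), fraction_within(hist, 2))
```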

  34. Degree distribution: A node that exchanged messages with ~2 million people

  35. Robustness of shortest paths • Short paths exist and they are robust [Figure legend: both-way links; all links; randomized network (same degree distr.)]

  36. Learning to search in a network • What is the decision function that makes me forward the message to the target? • What are the characteristics of shortest paths? • How hard is it to find them? • Setup: source s, target t, d(s,t)=h • Good nodes: d=h-1 • Bad nodes: d≥h

  37. Does geography help?

  38. Does geography help?

  39. How hard is it to find a good node?

  40. How hard is it to find a good node? • Probability of success if we forward to a random neighbor

  41. Algorithm accuracy at hops

  42. Algorithm accuracy at hops • Use a decision tree to learn a classifier: • Model: 0.4128 • Random: 0.0207

  43. The learned model Green bar is prob. that node is good

  44. Comparing search heuristics • Pick a pair of nodes: start at s • Walk until we hit the target t, where the next node is chosen by one of these heuristics:

      Search alg.   % found   Mean path length
      Random        0.0008    3,709
      MinGeoDist    0.0282    778
      MaxDeg        0.0158    4,964
      Deg/Geo2      0.1446    2,676
      Cntry         0.0108    402
      Cntry*Deg     0.1313    3,114
      Lang          0.0055    1,699
      Lang*Deg      0.0496    3,163
      Age           0.0012    2,890
      Age*Deg       0.0203    5,324

      • It works! (in a network with 180 million nodes) • Milgram’s path completion is 29% • Dodds, Muhamad, Watts: 0.015% comp
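The walk procedure on this slide can be sketched as a loop with a pluggable rule for choosing the next node. The slide does not say how each heuristic breaks ties, handles revisits, or when a walk is abandoned, so everything below is an assumption: `pick_random` stands in for the Random row, `pick_by_degree` is a degree-weighted random choice used as a stand-in for the degree-based rows, and `max_steps` is a hypothetical step budget.

```python
import random


def pick_random(adj, current, rng):
    """Forward to a uniformly random neighbor."""
    return rng.choice(adj[current])


def pick_by_degree(adj, current, rng):
    """Forward to a neighbor with probability proportional to its degree."""
    neighbors = adj[current]
    weights = [len(adj[v]) for v in neighbors]
    return rng.choices(neighbors, weights=weights, k=1)[0]


def walk(adj, s, t, pick, rng, max_steps=10_000):
    """Number of steps taken to reach t from s, or None if the budget runs out."""
    current, steps = s, 0
    while current != t and steps < max_steps:
        current = pick(adj, current, rng)
        steps += 1
    return steps if current == t else None


# Hypothetical toy graph: a 6-cycle with one extra chord 0-2.
adj = {
    0: [1, 5, 2],
    1: [0, 2],
    2: [1, 3, 0],
    3: [2, 4],
    4: [3, 5],
    5: [4, 0],
}
rng = random.Random(7)
for name, pick in [("Random", pick_random), ("Degree-weighted", pick_by_degree)]:
    lengths = [walk(adj, 0, 3, pick, rng) for _ in range(200)]
    found = [n for n in lengths if n is not None]
    print(name, len(found) / len(lengths), sum(found) / max(len(found), 1))
```

Running many walks per heuristic and recording the share that reach the target and their mean length yields the same two columns as the table above, here for a toy graph only.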

  45. Conclusions and reflections • Why are networks the way they are? • Only recently have basic properties been observed on a large scale • Confirms social science intuitions; calls others into question • Benefits of working with large data • Observe structures not visible at smaller scales
