
Dynamics of networks






Presentation Transcript


  1. Dynamics of networks • Jure Leskovec, Machine Learning Department, Carnegie Mellon University

  2. Networks: Rich data • Today: Large on-line systems have detailed records of human activity • On-line communities: • Facebook (64 million users, billion dollar business) • MySpace (300 million users) • Communication: • Instant Messenger (~1 billion users) • News and Social media: • Blogging (250 million blogs world-wide, presidential candidates run blogs) • On-line worlds: • World of Warcraft (internal economy 1 billion USD) • Second Life (GDP of 700 million USD in ‘07) Opportunities for impact in science and industry

  3. The networks: a) World wide web, b) Internet (AS), c) Social networks, d) Communication, e) Citations, f) Protein interactions

  4. Networks: What do we know? • We know lots about the network structure: • Properties: Scale free [Barabasi ’99], 6-degrees of separation [Milgram ’67], Navigation [Adamic-Adar ’03, Liben-Nowell ’05], Bipartite cores [Kumar et al. ’99], Network motifs [Milo et al. ‘02], Communities [Newman ‘99], Conductance [Mihail-Papadimitriou-Saberi ‘06], Hubs and authorities [Page et al. ’98, Kleinberg ‘99] • Models: Preferential attachment [Barabasi ’99], Small-world [Watts-Strogatz ‘98], Copying model [Kleinberg et al. ’99], Heuristically optimized tradeoffs [Fabrikant et al. ‘02], Congestion [Mihail et al. ‘03], Searchability [Kleinberg ‘00], Bowtie [Broder et al. ‘00], Transit-stub [Zegura ‘97], Jellyfish [Tauro et al. ‘01] • We know much less about processes and dynamics of networks

  5. My research: Network dynamics • Network dynamics: • Network evolution • How does network structure change as the network grows and evolves? • Diffusion and cascading behavior • How do rumors and diseases spread over networks?

  6. My research: Scale matters • We need massive network data for the patterns to emerge: • MSN Messenger network [WWW ’08, Nature ‘08] (the largest social network ever analyzed) • 240M people, 255B messages, 4.5 TB data • Product recommendations [EC ‘06] • 4M people, 16M recommendations • Blogosphere [work in progress] • 60M posts, 120M links

  7. My research: The structure

  8. My research: The structure

  9. Diffusion and Cascades • Behavior that cascades from node to node like an epidemic • News, opinions, rumors • Word-of-mouth in marketing • Infectious diseases • As activations spread through the network they leave a trace, a cascade [Figure: a network and the resulting cascade (propagation graph)]

  10. [w/ Adamic-Huberman, EC ’06] Setting 1: Viral marketing • People send and receive product recommendations, purchase products • Data: large online retailer: 4 million people, 16 million recommendations, 500k products [Figure: recommendation incentives of 10% credit and 10% off]

  11. [w/ Glance-Hurst et al., SDM ’07] Setting 2: Blogosphere • Bloggers write posts and refer (link) to other posts, and the information propagates • Data: 10.5 million posts, 16 million links

  12. [w/ Kleinberg-Singh, PAKDD ’06] Q1) What do cascades look like? • Are they stars? Chains? Trees? [Figure: the most frequent cascade shapes, ordered by frequency, for information cascades (blogosphere) and viral marketing (DVD recommendations)] • Viral marketing cascades are more social: • Collisions (no summarizers) • Richer non-tree structures

  13. My research: The structure

  14. Cascade & outbreak detection • Blogs – information epidemics • Which are the influential/infectious blogs? • Viral marketing • Who are the trendsetters? • Influential people? • Disease spreading • Where to place monitoring stations to detect epidemics?

  15. [w/ Krause-Guestrin et al., KDD ’07] The problem: Detecting cascades (best student paper) • How to quickly detect epidemics as they spread? [Figure: cascades c1, c2, c3 spreading over a network]

  16. [w/ Krause-Guestrin et al., KDD ’07] Two parts to the problem (best student paper) • Cost: • Cost of monitoring is node dependent • Reward: • Minimize the number of affected nodes: • If A are the monitored nodes, let R(A) denote the number of nodes we save • We also consider other rewards: • Minimize time to detection • Maximize number of detected outbreaks

  17. Optimization problem • Given: • Graph G(V,E), budget M • Data on how cascades C1, …, Ci, …, CK spread over time • Select a set of nodes A maximizing the reward (the sum over cascades of the reward for detecting cascade i) subject to cost(A) ≤ M • Solving the problem exactly is NP-hard • Max-cover [Khuller et al. ’99]
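
As a concrete illustration of the objective, here is a minimal Python sketch (not the paper's implementation) of evaluating a detection reward over recorded cascades. The cascade format (node mapped to infection time) and the "fraction of nodes saved" reward are illustrative simplifications; all names are mine.

```python
# A minimal, illustrative reward evaluation over recorded cascades (not the paper's
# implementation). Each cascade is a dict mapping node -> time the cascade reached it;
# the reward is the average fraction of nodes infected only after detection, i.e. "saved".

def detection_time(cascade, monitored):
    """Earliest time at which any monitored node is hit, or None if undetected."""
    hits = [t for node, t in cascade.items() if node in monitored]
    return min(hits) if hits else None

def reward(cascades, monitored):
    total = 0.0
    for cascade in cascades:
        t_detect = detection_time(cascade, monitored)
        if t_detect is None:
            continue                      # undetected cascade contributes nothing
        saved = sum(1 for t in cascade.values() if t > t_detect)
        total += saved / len(cascade)
    return total / len(cascades)

# Two toy cascades over nodes 'a'..'d'; monitoring node 'b' detects only the first one.
cascades = [{"a": 0, "b": 1, "c": 2, "d": 3}, {"c": 0, "d": 1, "a": 2}]
print(reward(cascades, monitored={"b"}))   # 0.25
```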

  18. [w/ Krause-Guestrin et al., KDD ’07] Detection: Solution outline (best student paper) • Problem structure: Submodularity of the reward functions • CELF: algorithm with an approximation guarantee • Speed up: Lazy evaluation to speed up CELF • We extend the result of [Kempe-Kleinberg-Tardos ’03]

  19. [w/ Krause-Guestrin et al., KDD ’07] Reward functions are submodular (best student paper) • Theorem: The reward function R is submodular (diminishing returns; think of it as "concavity" for set functions): for placements A ⊆ B and a new monitored node u, R(A ∪ {u}) - R(A) ≥ R(B ∪ {u}) - R(B) • The gain of adding a node to a small set is at least the gain of adding it to a large set [Figure: adding sensor S’ to placement A = {S1, S2} helps a lot; adding S’ to placement B = {S1, S2, S3, S4} helps very little]

  20. [w/ Krause-Guestrin et al., KDD ’07] Solution: CELF Algorithm (best student paper) • We develop the CELF (cost-effective lazy forward-selection) algorithm: • Two independent runs of a modified greedy • Solution set A’: ignore cost, greedily optimize reward • Solution set A’’: greedily optimize the reward/cost ratio • Pick the better of the two: arg max(R(A’), R(A’’)) • Theorem: If R is submodular then CELF is near optimal • CELF achieves a ½(1-1/e) factor approximation • For the size of our problems the naïve CELF is too slow
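
The two-pass structure described on this slide can be sketched as follows. This is a simplified illustration assuming a generic monotone submodular reward oracle and per-node costs, not the optimized KDD '07 implementation; all function names and the toy coverage reward are assumptions for illustration.

```python
# A simplified sketch of CELF's outer structure: two greedy passes under budget M,
# then keep whichever solution has the higher reward.
# `reward` is a monotone submodular set function, `cost` maps node -> cost.

def greedy(nodes, reward, cost, budget, use_benefit_cost_ratio):
    chosen, spent = set(), 0.0
    while True:
        best_node, best_score = None, 0.0
        for v in nodes - chosen:
            if spent + cost[v] > budget:
                continue                       # node does not fit in the remaining budget
            gain = reward(chosen | {v}) - reward(chosen)
            score = gain / cost[v] if use_benefit_cost_ratio else gain
            if score > best_score:
                best_node, best_score = v, score
        if best_node is None:
            return chosen                      # nothing affordable improves the reward
        chosen.add(best_node)
        spent += cost[best_node]

def celf(nodes, reward, cost, budget):
    a_unit = greedy(nodes, reward, cost, budget, use_benefit_cost_ratio=False)  # ignore cost
    a_ratio = greedy(nodes, reward, cost, budget, use_benefit_cost_ratio=True)  # reward/cost
    return max((a_unit, a_ratio), key=reward)  # pick the better of the two solutions

# Toy usage: a coverage-style reward where each "blog" covers a set of stories.
covers = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
reward = lambda A: len(set().union(*[covers[v] for v in A]))
cost = {"a": 1.0, "b": 1.0, "c": 2.0}
print(celf(set(covers), reward, cost, budget=2.0))   # e.g. {'a', 'b'}
```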

  21. Scaling up: Lazy evaluation • Observation: Submodularity guarantees that marginal rewards decrease with the solution size • Idea: • Use the marginal reward from the previous step as an upper bound on the current marginal reward [Figure: bars showing the marginal rewards of nodes a–e]

  22. CELF: Lazy evaluation • CELF algorithm: • Keep an ordered list of the marginal rewards ri from the previous step • Re-evaluate ri only for the top node [Figure: marginal-reward bars for nodes a–e]

  23. CELF: Lazy evaluation • CELF algorithm: • Keep an ordered list of the marginal rewards ri from the previous step • Re-evaluate ri only for the top node [Figure, continued: the re-evaluated top node is re-inserted into the ordered list]

  24. CELF: Lazy evaluation • CELF algorithm: • Keep an ordered list of the marginal rewards ri from the previous step • Re-evaluate ri only for the top node [Figure, continued: the process repeats; most nodes are never re-evaluated]
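
A minimal sketch of the lazy-evaluation idea using a priority queue: by submodularity, cached marginal gains are upper bounds on the true gains, so only the node currently at the top of the queue ever needs to be recomputed. Costs are ignored here for brevity, and all names are illustrative.

```python
import heapq

# Lazy greedy sketch: cached marginal gains are upper bounds (by submodularity),
# so we only re-evaluate the node currently at the top of the priority queue.

def lazy_greedy(nodes, reward, k):
    chosen, base = set(), 0.0
    # Max-heap entries: (-cached_gain, node, iteration_when_cached)
    heap = [(-reward({v}), v, 0) for v in nodes]
    heapq.heapify(heap)
    for it in range(1, k + 1):
        while True:
            neg_gain, v, stamp = heapq.heappop(heap)
            if stamp == it:                     # cached gain is fresh: v really is best
                chosen.add(v)
                base = reward(chosen)
                break
            gain = reward(chosen | {v}) - base  # stale bound: recompute and push back
            heapq.heappush(heap, (-gain, v, it))
    return chosen

# Toy usage with a coverage-style reward.
covers = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
reward = lambda A: len(set().union(*[covers[v] for v in A]))
print(lazy_greedy(set(covers), reward, k=2))   # e.g. {'a', 'b'}
```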

  25. Blogs: Information epidemics • Which blogs should one read to catch big stories? • For more info see our website: www.blogcascade.org [Figure: reward (higher is better) vs. number of selected blogs (sensors); CELF outperforms heuristics that rank blogs by in-links (used by Technorati), out-links, number of posts, or random selection]

  26. CELF: Scalability • CELF runs 700x faster than the simple greedy algorithm [Figure: run time in seconds (lower is better) vs. number of selected blogs (sensors) for exhaustive search, greedy, and CELF]

  27. [w/ Krause et al., J. of Water Resource Planning] Same problem: Water network • Given: • a real city water distribution network • data on how contaminants spread over time • Place sensors (to save lives) • Problem posed by the US Environmental Protection Agency [Figure: water network with contaminants c1, c2 and sensor locations S]

  28. [w/ Ostfeld et al., J. of Water Resource Planning] Water network: Results • Our approach performed best at the Battle of Water Sensor Networks competition [Figure: population saved (higher is better) vs. number of placed sensors; CELF outperforms heuristics based on degree, random placement, population, and flow]

  29. My research: The structure ✓

  30. Background: Network models • Empirical findings on real graphs led to new network models • Such models make assumptions/predictions about other network properties • What about network evolution? [Figure: log-log plot of the degree distribution (log probability vs. log degree); the power-law degree distribution is explained by the preferential attachment model]
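
As a toy illustration of how preferential attachment yields a heavy-tailed (power-law-like) degree distribution, the sketch below attaches each new node to an existing node with probability proportional to its degree. This is a simplified one-edge-per-node variant of the [Barabasi '99] model, not a fit to any real dataset.

```python
import random
from collections import Counter

# Toy preferential attachment: each new node links to an endpoint drawn uniformly
# from the list of all existing edge endpoints, so the attachment probability is
# proportional to degree. The resulting degrees are heavy-tailed (roughly power-law).

def preferential_attachment(n_nodes, seed=0):
    random.seed(seed)
    edges = [(0, 1)]            # start from a single edge 0--1
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = random.choice(endpoints)   # degree-proportional choice
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges

degrees = Counter(v for e in preferential_attachment(10_000) for v in e)
print(degrees.most_common(5))   # a handful of hubs with very large degree
```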

  31. [w/ Kleinberg-Faloutsos, KDD ’05] Q4) Network evolution (best research paper) • What is the relation between the number of nodes and the number of edges over time? • Prior work assumes constant average degree over time • Networks are denser over time • Densification Power Law: E(t) ∝ N(t)^a, where a is the densification exponent (1 ≤ a ≤ 2) [Figure: log-log plots of E(t) vs. N(t); Internet: a = 1.2, Citations: a = 1.6]
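
Fitting the densification exponent a in E(t) ∝ N(t)^a reduces to a least-squares line fit in log-log space. The sketch below uses made-up (nodes, edges) snapshots purely for illustration; it is not the paper's fitting code.

```python
import math

# Estimate the densification exponent a in E(t) ~ N(t)^a by a least-squares line fit
# in log-log space. The (nodes, edges) snapshots below are made up for illustration.
snapshots = [(1_000, 5_000), (3_000, 19_000), (10_000, 80_000), (30_000, 300_000)]

xs = [math.log(n) for n, _ in snapshots]
ys = [math.log(e) for _, e in snapshots]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(f"densification exponent a = {a:.2f}")  # a = 1: constant average degree; a = 2: clique-like
```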

  32. [w/ Kleinberg-Faloutsos, KDD ’05] Q4) Network evolution (best research paper) • Prior models and intuition say that the network diameter slowly grows (like log N, or log log N) • Instead, the diameter shrinks over time: • as the network grows, the distances between the nodes slowly decrease [Figure: diameter vs. size of the graph (Internet) and diameter vs. time (Citations)]

  33. Q5) Generating realistic graphs • Want to generate realistic networks: • Why synthetic graphs? • Anomaly detection, simulations, predictions, null-model, sharing privacy-sensitive graphs, … • Q: What is a good model that we can fit to the data? • A: Next slide. :) • Q: Which network properties do we care about? • A: Don’t commit, let’s match adjacency matrices [Diagram: given a real network, generate a synthetic network, and compare graph properties, e.g., the degree distribution]

  34. [w/ Chakrabarti-Kleinberg-Faloutsos, PKDD ’05] The model: Kronecker graphs • Start from a small initiator matrix of edge probabilities pij and take repeated Kronecker products of graph adjacency matrices: (3x3) → (9x9) → (27x27) → … • We prove Kronecker graphs mimic real graphs: • Power-law degree distribution, densification, shrinking/stabilizing diameter, spectral properties
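
A minimal sketch of generating a stochastic Kronecker graph: take repeated Kronecker products of a small initiator matrix of edge probabilities and flip a coin for every potential edge. The 3x3 initiator values below are made up, and materializing the full probability matrix is only feasible for small graphs (the papers sample edges directly for large ones).

```python
import numpy as np

# Stochastic Kronecker graph sketch: P is the k-fold Kronecker power of a small
# initiator matrix of edge probabilities, and each edge (i, j) is included with
# probability P[i, j].

initiator = np.array([[0.9, 0.5, 0.2],
                      [0.5, 0.3, 0.1],
                      [0.2, 0.1, 0.3]])

def kronecker_probabilities(theta, k):
    P = theta.copy()
    for _ in range(k - 1):
        P = np.kron(P, theta)        # (3^k x 3^k) matrix of edge probabilities
    return P

P = kronecker_probabilities(initiator, 4)                 # 81 x 81
adjacency = (np.random.rand(*P.shape) < P).astype(int)    # flip a coin per potential edge
print(adjacency.shape, int(adjacency.sum()), "edges sampled")
```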

  35. [w/ Faloutsos, ICML ’07] Kronecker graphs: Estimation • Maximum likelihood estimation: arg max over the Kronecker initiator Θ of P(G | Θ) • Naïve estimation takes O(N! N²): • N! for summing over the different node labelings • Our solution: Metropolis sampling reduces N! (big) to a constant number of samples • N² for traversing the graph adjacency matrix • Our solution: exploit the Kronecker product structure and sparsity (E << N²), reducing N² to E • Do stochastic gradient descent • We estimate the model in O(E)
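
A rough sketch of the Metropolis walk over node labelings: propose swapping two labels and accept with probability min(1, L(sigma')/L(sigma)). The log-likelihood is left abstract here; in the actual method it would come from the current Kronecker parameters, and all names below are illustrative.

```python
import math
import random

# Metropolis walk over node permutations (sketch). `log_likelihood` stands in for
# the graph likelihood under the current Kronecker parameters and is left abstract.

def metropolis_permutation(n_nodes, log_likelihood, n_steps, seed=0):
    random.seed(seed)
    perm = list(range(n_nodes))
    cur_ll = log_likelihood(perm)
    for _ in range(n_steps):
        i, j = random.sample(range(n_nodes), 2)
        perm[i], perm[j] = perm[j], perm[i]            # propose: swap two node labels
        new_ll = log_likelihood(perm)
        if math.log(random.random()) < new_ll - cur_ll:
            cur_ll = new_ll                            # accept the swap
        else:
            perm[i], perm[j] = perm[j], perm[i]        # reject: undo the swap
    return perm

# Toy usage with a dummy likelihood that prefers labels close to the identity ordering.
dummy_ll = lambda perm: -sum(abs(i - p) for i, p in enumerate(perm))
print(metropolis_permutation(20, dummy_ll, n_steps=2_000)[:10])
```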

  36. [w/ Faloutsos, ICML ’07] Estimation: Epinions (N = 76k, E = 510k) • We search the space of ~10^1,000,000 permutations • Fitting takes 2 hours • Real and Kronecker are very close [Figure: real vs. Kronecker fit for the degree distribution (probability vs. node degree), path lengths (# reachable pairs vs. number of hops), and “network” values (network value vs. rank)]

  37. My research: The structure ✓ (A2, A3: CELF) ✓

  38. Other work • P1) Query Projections: • Predicting search result quality without page content • P2) MSN Instant Messenger: • Social network of the whole world

  39. [w/ Dumais-Horvitz, WWW ’07] P1) Query Projections • User types in a query to a search engine • Search engine returns results • Is this a good set of search results? [Figure: results returned by the search engine, hyperlinks, and non-search-result nodes connecting the graph]

  40. Query projections: Idea • Pipeline: Query → Results → Projection on the web graph → Query projection graph → Query connection graph → Generate graphical features → Construct case library → Predictions [Figure: a query Q, its result list, and the corresponding projection on the web graph]

  41. Extracting graph features • The idea: • Find features that describe the structure of the graph • Then use the features for machine learning • Want features that describe: • Connectivity of the graph • Centrality of projection and connector nodes • Clustering and density of the core of the graph [Figure: two contrasting example projection graphs]

  42. Search results quality • Dataset: • Graph: 40 million nodes, 720 million edges • 30,000 queries with top 20 results for each • Human assigned relevance. 6-point scale: Perfect to Bad • Task: • Predict the highest rating in the set of results • 2-class problem: “Good” (top 3 ratings) vs. “Poor” (bottom 3) • Result: • Predict search result quality with 80% accuracy just from the connection patterns between the results

  43. [w/ Dumais-Horvitz, WWW ’07] Query Projections: Intuition [Figure: two example query projection graphs; the connection patterns suggest predicting “Good” for one and “Poor” for the other]

  44. Search quality: The model • Good result sets have: • Search result nodes that are hub nodes in the graph • Small connector-node degrees • A big connected component • Few isolated nodes in the projection graph • Few connector nodes
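
An illustrative sketch (using networkx) of extracting the kinds of connection-pattern features listed above from a query projection graph; the feature names and the toy graph are mine, not the ones used in the WWW '07 paper.

```python
import networkx as nx

# Illustrative extraction of connection-pattern features from a query projection graph.
# `result_nodes` are the search results; all other nodes are connector nodes.

def projection_features(G, result_nodes):
    connectors = set(G.nodes) - set(result_nodes)
    components = list(nx.connected_components(G))
    largest = max(components, key=len) if components else set()
    return {
        "num_connector_nodes": len(connectors),
        "mean_connector_degree": (sum(G.degree(v) for v in connectors) / len(connectors)
                                  if connectors else 0.0),
        "largest_component_frac": len(largest) / G.number_of_nodes(),
        "isolated_result_nodes": sum(1 for v in result_nodes if G.degree(v) == 0),
        "mean_result_degree": sum(G.degree(v) for v in result_nodes) / len(result_nodes),
    }

# Toy usage: three connected result nodes, one connector "x", one isolated result "r4".
G = nx.Graph([("r1", "x"), ("x", "r2"), ("r2", "r3")])
G.add_node("r4")
print(projection_features(G, result_nodes=["r1", "r2", "r3", "r4"]))
# Feature vectors like this can be fed to any off-the-shelf classifier
# to predict "Good" vs. "Poor" result sets.
```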

  45. [w/ Horvitz, WWW ’08, Nature ‘08] P2) Planetary look on a small world • Small-world experiment [Milgram ‘67]: • People send letters from Nebraska to Boston • How many steps does it take? • Messenger social network: the largest social network analyzed • 240M people, 30B conversations, 4.5TB data • Lots of engineering effort and good hardware: • a 4 dual-core Opteron server • 48GB memory • 6.8TB of fast SCSI disks • How to learn short paths? [Figures: Milgram’s small-world experiment; the MSN Messenger network; distribution of the number of steps between pairs of people (i.e., hops + 1)]

  46. My research: The structure ✓ Big data matters ✓

  47. Future direction 1: Diffusion • How do news and information spread? • New ranking and influence measures for blogs • Sentiment analysis from cascade structure • Crawling 2 million blogs/day (with Spinn3r) [Figure: an obscure technology story moves from a small tech blog through Slashdot, Wired, and New Scientist to the New York Times, CNN, and BBC]

  48. Future direction 1: Diffusion • Models of information diffusion • When, where and what post will create a cascade? • Where should one tap the network to get the desired effect? • Social media marketing • How to handle richer classes of graphs • Interaction of network structure and node/edge attributes

  49. Future direction 2: Evolution • Why are networks the way they are? • Continue work on fundamental properties of networks • Build models and understanding • Network community structure • Health of a social network • Steer the network evolution • Adoption of social networking services (with Facebook) • Predictive modeling of large communities • Online multi-player games are closed worlds with detailed traces of human activity • Predict success of guilds, battles, elections (with EVE online)

  50. Future direction 3: Scaling up • Map-reduce for massive graphs • Algorithms and techniques for massive graphs • With Yahoo! Research: 4000 node Hadoop cluster • world’s 50th fastest supercomputer • 3 TB memory • 1.5 PB storage • We already have new ways to peek into the structure of massive networks
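
As a toy example of the kind of massive-graph primitive such a cluster runs, here is a Hadoop-streaming-style mapper/reducer that computes node degrees from an edge list. This is a sketch, not the actual cluster code; the file format and invocation are assumptions for illustration.

```python
import sys

# Toy Hadoop-streaming-style job: compute node degrees from an edge list.
# Mapper: for each edge "src dst", emit both endpoints with count 1.
# Reducer: sum counts per node (its input arrives sorted by key, as in Hadoop streaming).

def mapper(lines):
    for line in lines:
        if not line.strip():
            continue
        src, dst = line.split()
        print(f"{src}\t1")
        print(f"{dst}\t1")

def reducer(lines):
    current, total = None, 0
    for line in lines:
        node, count = line.split("\t")
        if node != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = node, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # Local dry run mimicking Hadoop streaming:
    #   python degree.py map < edges.tsv | sort | python degree.py reduce
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if role == "map" else reducer)(sys.stdin)
```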
