E N D
1. Jun 23, 2009 CPM 1 Modeling and Algorithmic Challenges in Online Social Networks Ravi Kumar
Yahoo! Research
Sunnyvale, CA
ravikumar@yahoo-inc.com
2. Jun 23, 2009 CPM 2 Online social networks Social network: Graph that represents relationships between people
Social connections (Facebook, MySpace, LiveJournal, Orkut, LinkedIn, Twitter)
Messaging (IM, chat, address book)
Social bookmarking (Digg, Delicious)
Content sharing (Flickr, YouTube)
3. Jun 23, 2009 CPM 3 A social network dear to you!
4. Jun 23, 2009 CPM 4 Analyzing social networks More and more of people’s interactions are happening online
The resulting data is a goldmine for sociological studies
Massive amounts, even hundreds of millions of people
Ability to measure, record, and track individual activities at the finest resolution (eg, when did she befriend another, when did he buy a dvd, when did she tag a photo, when did he tweet)
A double-edged sword
Data is inherently noisy
Cyber-world connections do not always represent real-world friends and vice versa
Large scale of data leads to algorithmic challenges
5. Jun 23, 2009 CPM 5 Through the ages Social network analysis goes back to almost half a century of work in sociology and social psychology
6. Jun 23, 2009 CPM 6 Static analysis: Aggregate views
7. Jun 23, 2009 CPM 7 Dynamic analysis: Temporal views Social networks evolve with additions and deletions of nodes and edges
Examine the dynamics of/in social networks
How do social networks evolve in terms of (graph) properties?
What are the individual processes behind the evolution and how can they be modeled?
What are the dynamics of processes that take place in social networks, such as the spread of information, influence of one or more individual over the actions of others, etc?
8. Jun 23, 2009 CPM 8 Dynamic analysis via snapshots Evolution of web pages and content
Fetterly, Manasse, Najork, Wiener 2004
Ntoulas, Cho, Olston 2004
Adar, Teevan, Dumais, Elsas 2009
Social network evolution
Kossinetts, Watts 2006
Citation graphs, autonomous systems
Katz 2005
Leskovec, Kleinberg, Faloutsos 2005
Densification and shrinking diameters
Process dynamics
Backstrom, Huttenlocher, Kleinberg, Lan 2006
Leskovec, Adamic, Huberman 2006
9. Jun 23, 2009 CPM 9 Outline of the talk
10. Jun 23, 2009 CPM 10 Flickr: Photosharing network
11. Jun 23, 2009 CPM 11 A photographer’s delight Link to slideshow – make point that quality content emergesLink to slideshow – make point that quality content emerges
12. Jun 23, 2009 CPM 12 Degree distribution: The usual
13. Jun 23, 2009 CPM 13 Density and diameter over time
14. Jun 23, 2009 CPM 14 Components: A grand canyon view
15. Jun 23, 2009 CPM 15 Dynamics of the components How do components consolidate?
Singletons merge with non-GC and GC
Non-GCs merge with GC
Almost never a non-GC merges with another non-GC
Why is singleton attracted to a non-GC?
Is there a special attractor in a non-GC?
16. Jun 23, 2009 CPM 16 Structure of non-GC: Stars Informal definition of star
There are one or more centers (high degree nodes)
There are many degree-one nodes (twinkles) connected to these centers
17. Jun 23, 2009 CPM 17 Structure of GC: Core There is a small core of very high connectivity inside GC
The core is not comprised of star centers
GC connectivity does not depend on star centers
18. Jun 23, 2009 CPM 18 Informal model with user types At each time step
A person joins the network and is chosen to be one of three types: passive user, inviter, linker
Few friendships (ie, edges) arrive
Source of edge chosen from inviters/linkers with degree-biased probabilities (ie, preferential attachment)
If source = inviter, destination = a new passive user
If source = linker, destination chosen from linkers and inviters, degree-biased
Empirically, this model generates the observed temporal characteristics (fraction of components, stars, core)
Open question. Can we say anything formal?
19. Jun 23, 2009 CPM 19 The Gnp model A random graph model with parameter p
Connect each pair of nodes with probability p
Erdos-Renyi model, studied almost half-century ago
If p = O(c/n), then the graph is sparse
Many properties are well understood
Threshold phenomenon: eg, giant component, connectivity
Global properties such as diameter, eigen-values, coloring
Probability that a node has degree k is Poisson
P(k) = exp(-?) ?k / k!
Evolving version: A new node picks an existing node uniformly at random to add an edge
20. Jun 23, 2009 CPM 20 Preferential attachment model Barabasi, Albert, Jeong 1999
Observation: Rich-get-richer
Popular papers get more citations
Popular people get more friends
Each step has one new incoming node along with an edge
Probability this new node links to a pre-existing node is proportional to how popular is the latter, ie, its degree
Pr[new node links to node i] = di / ? dj
Theorem. In this model, b = 3
Intuitive proof. ?di / ?t = di / (2t)
If node i was added at time ti, then di(t) = (t/ti)0.5
Pr[di(t) < k] = Pr[ti > t/k2] = 1/k2
21. Jun 23, 2009 CPM 21 Other models Copying model Kumar et al 2000
Observation: People copy their friend’s webpage when creating a new one or copy their friend’s contacts when joining a social network
When a new node arrives, it copies edges from a pre-existing node with probability 1 - ?
The degree distribution is a power-law with exponent
(2 - ?)/(1 - ?)
Can explain communities: The number of dense bipartite cliques in this model is large
Forest-fire model Leskovec, Kleinberg, Faloutsos 2005
An iterated version of the copying model
In addition to the above, leads to densification and shrinking diameters
Affiliation networks Lattanzi Sivakumar 2009
22. Jun 23, 2009 CPM 22 A microscopic look at evolution We are starting to have more nuanced models
Processes such as preferential attachment are assumed without actually being observed
There are many ways to generate power-law degrees
What is the best way to compare two graph models?
Maximum-likelihood (standard tool in ML)
Efficiency issues: Bezakova, Kalai, Santhanam 2006
We have edge-by-edge arrival information, so can take an edge and compare the likelihood of its existence in competing models
23. Jun 23, 2009 CPM 23 How does the network evolve? Three processes govern the evolution
Node arrival process. Nodes enter the network
Edge initiation process. Each node decides when to initiate an edge
Edge destination process. Determines destination after a node decides to initiate
We will present a complete model of network evolution by mining the node and edge creation data
24. Jun 23, 2009 CPM 24 Node arrivals: Rate
25. Jun 23, 2009 CPM 25 Node arrivals: Lifetime Lifetime a: time between node’s first and last edge
26. Jun 23, 2009 CPM 26 How are edges initiated? Edge gap ?(d): time between dth and d+1st edge
27. Jun 23, 2009 CPM 27 How do ? and ? change with degree?
28. Jun 23, 2009 CPM 28 What we know so far
29. Jun 23, 2009 CPM 29 Does preferential attachment happen? We unroll the true network edge arrivals and measure node degrees where edges attach
30. Jun 23, 2009 CPM 30 How local are the added edges? Just before edge (u,v) is placed, how far are u and v?
Normalize this by number of nodes at that hop distance
31. Jun 23, 2009 CPM 31 New triangle-closing edge (u,w) appears next
We model this as
Choose u’s neighbor v
Choose v’s neighbor w
Add edge (u,w)
25 strategies for choosing v and then w
Random, degree preferentially, number of common friends, time of last activity, combination
Can compute likelihood of each strategy
Under random-random: p(u,w) = 1/6*1/2+1/6*1/4 Closing triangles
32. Jun 23, 2009 CPM 32 Triangle closing strategies Log-likelihood improvement over the baseline
33. Jun 23, 2009 CPM 33 What we know so far
34. Jun 23, 2009 CPM 34 Semi-simulation: Closer to truth Take the network at T/2 and evolve it using preferential attachment (PA) and random-random (RR) for edge addition events
35. Jun 23, 2009 CPM 35 The complete network model Nodes arrive using the arrival rate
Node u arrives
It has lifetime a ? ? exp(-?a)
Adds first edge to node v with probability proportional to degree of v
A node u with degree d has gap ? ? ?-? exp(-?d ?) and goes to sleep for ? time steps
When u wakes up, if its lifetime still valid, creates a random-random triangle-closing edge
36. Jun 23, 2009 CPM 36 An analysis Theorem. The out-degrees are distributed according to a power-law with exponent
1 + (? / ?) ? (?(2-?) / ?(1-?))
For Flickr, true exponent = 1.73, ? = 0.0092, ? = 0.84, ? = 0.002, calculated exponent = 1.74
Analogous results hold for del.icio.us, Yahoo! Answers, LinkedIn
Interesting as temporal behavior leads to power-law degree distribution
37. Jun 23, 2009 CPM 37 Social influence and correlation How similar is the behavior of connected users?
Why is it important?
Analysis: Predicting the dynamics of a system
Whether a new norm of behavior, technology, or idea can diffuse like an epidemic
Detect disease outbreaks early
Identifying influencers
Design: For designing a system to induce a particular behavior
Eg, vaccination strategies (random, targeting a demographic group, random acquaintances), viral marketing campaigns, whom to give free products to achieve maximum sales, where to place sensors to detect contamination early
Word of mouth effect
38. Jun 23, 2009 CPM 38 Correlation: Health (obesity) Logistic regression, taking many attributes into account (eg, age, gender, education level, smoking cessation)
Taking advantage of data that is available over time via the edge-reversal test
Having an obese friend increases chance of obesity by 57%, obese sibling by 40%, obese spouse by 37%
39. Jun 23, 2009 CPM 39 Obesity study
40. Jun 23, 2009 CPM 40 Correlation: Happiness Happiness = answers to four questions
People who are surrounded by happy people and who are central in the network are likely to become happy
A friend who lives within a mile and who becomes happy increases the probability of becoming happy by 25%
Co-resident spouses (8%)
Siblings in a mile (14%)
Next door neighbors (34%)
Happiness affects one’s friends’ friends’ friend
41. Jun 23, 2009 CPM 41 Correlation: Community membership
42. Jun 23, 2009 CPM 42 Correlation: Tag overlap
43. Jun 23, 2009 CPM 43 Three sources of social correlation Influence: An individual performing an action can cause her contacts to do the same
By providing information
By increasing the value of the action to them
Homophily: Similar individuals are more likely to be friends
Two stringologists are more likely to know each other than two random people
Environment: Influence from external elements
Friends are more likely to live in the same area, thus attend and take pictures of similar events, and tag them with similar tag
44. Jun 23, 2009 CPM 44 Social influence Focus on a particular action A
Eg, buying a product, joining a community, publishing in a conference, using a particular tag, getting premier membership, …
A person who performs A is called active
x has influence over y if x performing action A increases the likelihood that y also performs action A
45. Jun 23, 2009 CPM 45 Game-theoretic models
Each individual modeled as a player in a game and the utility that a person derives depends on what her friends do
Individuals decide whether to become active to maximize their utility
Example: adoption of a common technology, eg, cell phone, IM, facebook
Morris 2000, Immorlica et al 2007
Probabilistic models
Independent cascade model: Every neighbor u of v who becomes active gets an independent chance to influence v with probability puv
Linear threshold model: Each person has a threshold, becomes active if sum of weights of active friends exceeds threshold
Kempe, Kleinberg, Tardos 2003
Ising-type models from physics Two schools of models
46. Jun 23, 2009 CPM 46 Probabilistic models Probabilistic models are more predictive
Allows optimization (find the best seed set)
Allows fitting the data to estimate parameters of the system
Can also include the element of time
47. Jun 23, 2009 CPM 47 Influence model Graph (static or dynamic)
Edge (u, v): node u has a chance to influence node v
At any time period a number of individuals can become active
For each t, every inactive node becomes active with probability p(a), where a is the number of its active contacts/friends
Let W be the set of active nodes at the end
48. Jun 23, 2009 CPM 48 Influence model: Choosing p(a) Natural choice for p(a): logistic regression function
with ln(a+1) as the explanatory variable, ie,
Coefficient ? measures
social correlation
49. Jun 23, 2009 CPM 49 Measuring social correlation Given time-stamped data, we compute the maximum likelihood estimate for parameters ?, ?
Let Ya = number of pairs (user u, time t) where u is not active and has a active friends at the beginning of time step t, and becomes active in this step
Let Na = … does not become active in this step
Find ?, ? to maximize the likelihood function
50. Jun 23, 2009 CPM 50 Influence vs non-influence models Influence model
First G is selected
Set of active nodes (W) picked from a distribution depending on G
Non-influence models
Homophily: First W is picked, then G is picked from a distribution that depends on W
Environment: Both G and W are picked from distributions that depend on a separate external variable
We consider this general non-influence (correlation) model
(G,W) are selected from a joint distribution
Each person in W picks an activation time i.i.d. from a distribution on [0,T]
51. Jun 23, 2009 CPM 51 The influence tests Given a sequence of time-stamped node activations
Assume edge (u, v), where u, v become active in some order
Intuition: If influence exists, u is expected to become active before v and if there is no influence, each is equally likely to become active first
Even though an individual’s probability of activation can depend on friends, her timing of activation is independent
Shuffle test: Randomly shuffle the time-stamp of all actions and re-estimate the coefficient ?
Edge-reversal test: Reverse the direction of all edges, and re-estimate ?
If re-estimated ? is different from original ?, then social influence cannot be ruled out
52. Jun 23, 2009 CPM 52 An analysis Theorem. If the graph is large enough, shuffle test rules out general models of correlation
Proof. In the original graph we compute Ya, Na and the model parameters ?, ? via maximum-likelihood
In the shuffled graph we compute Y’a, N’a and the model parameters ?’, ?’ via maximum-likelihood
By assumption of the model, E[Ya] = E[Ya’], E[Na] = E[Na’]
1. Show concentration: Ya ˜ E[Ya] ˜ Ya’ (use martingales)
2. Show if Ya ˜ Ya’ and Na ˜ Na’ then ? ˜ ?’
Open question. A converse statement?
Open question. A similar statement about edge-reversal test?
53. Jun 23, 2009 CPM 53 Simulations Run the tests on randomly generated action data on flickr network (action = using a tag)
Baseline: No-correlation model, actions generated randomly at each time step
Influence model: Each inactive node flips a coin to see if it can become active, for different (®,¯) values
Correlation model: Given L, pick random centers and add the union of balls of radius 2 around these centers to W, and repeat until |W| = L
54. Jun 23, 2009 CPM 54 Simulations: Shuffle test
55. Jun 23, 2009 CPM 55 Simulations: Edge-reversal test
56. Jun 23, 2009 CPM 56 Influence tests on tag usage
57. Jun 23, 2009 CPM 57 Other aspects: Compressibility The web graph can be stored using 2-3 bits per edge Boldi, Vigna 2004
How compressible are social networks?
Chierichetti, Kumar, Lattanzi, Mitzenmacher, Panconesi, Raghavan 2009
May not be very compressible!
A combinatorial formulation for graph compressibility
Heuristics that work in practice
A fresh avenue with lots of open questions
58. Jun 23, 2009 CPM 58 Concluding remarks: Future directions Understanding why networks look the way they do
Pin down the underlying sociological forces
Algorithmic exploitation of network characteristics
Short paths, compression, I/O efficient storage
Dynamics in networks
We need a better understanding of influence
A positive test for influence is desirable
Recurring motifs/patterns in network evolution
Distinguish real-world networks
Different networks have different characteristics
Can we predict the future of a network, based on its structure or dynamics?
Can we engineer success?
59. Jun 23, 2009 CPM 59 Concluding remarks: The bigger picture Large amounts of social network data at the finest granularity are increasingly available to the research community
To understand the underlying processes, we need fine-grained mathematical modeling
Existing insights from sociology are crucial
To analyze and mine this data, we need efficient methods
Algorithmic thinking is indispensable
60. Jun 23, 2009 CPM 60 Thank you! ravikumar@yahoo-inc.com