Modeling and Algorithmic Challenges in Online Social Networks

1. Jun 23, 2009 CPM 1 Modeling and Algorithmic Challenges in Online Social Networks Ravi Kumar Yahoo! Research Sunnyvale, CA ravikumar@yahoo-inc.com

2. Jun 23, 2009 CPM 2 Online social networks Social network: Graph that represents relationships between people Social connections (Facebook, MySpace, LiveJournal, Orkut, LinkedIn, Twitter) Messaging (IM, chat, address book) Social bookmarking (Digg, Delicious) Content sharing (Flickr, YouTube)

3. Jun 23, 2009 CPM 3 A social network dear to you!

4. Jun 23, 2009 CPM 4 Analyzing social networks More and more of people�s interactions are happening online The resulting data is a goldmine for sociological studies Massive amounts, even hundreds of millions of people Ability to measure, record, and track individual activities at the finest resolution (eg, when did she befriend another, when did he buy a dvd, when did she tag a photo, when did he tweet) A double-edged sword Data is inherently noisy Cyber-world connections do not always represent real-world friends and vice versa Large scale of data leads to algorithmic challenges

5. Jun 23, 2009 CPM 5 Through the ages Social network analysis goes back to almost half a century of work in sociology and social psychology

6. Jun 23, 2009 CPM 6 Static analysis: Aggregate views

7. Jun 23, 2009 CPM 7 Dynamic analysis: Temporal views Social networks evolve with additions and deletions of nodes and edges Examine the dynamics of/in social networks How do social networks evolve in terms of (graph) properties? What are the individual processes behind the evolution and how can they be modeled? What are the dynamics of processes that take place in social networks, such as the spread of information, influence of one or more individual over the actions of others, etc?

8. Jun 23, 2009 CPM 8 Dynamic analysis via snapshots Evolution of web pages and content Fetterly, Manasse, Najork, Wiener 2004 Ntoulas, Cho, Olston 2004 Adar, Teevan, Dumais, Elsas 2009 Social network evolution Kossinetts, Watts 2006 Citation graphs, autonomous systems Katz 2005 Leskovec, Kleinberg, Faloutsos 2005 Densification and shrinking diameters Process dynamics Backstrom, Huttenlocher, Kleinberg, Lan 2006 Leskovec, Adamic, Huberman 2006

9. Jun 23, 2009 CPM 9 Outline of the talk

10. Jun 23, 2009 CPM 10 Flickr: Photosharing network

11. Jun 23, 2009 CPM 11 A photographer�s delight Link to slideshow � make point that quality content emergesLink to slideshow � make point that quality content emerges

12. Jun 23, 2009 CPM 12 Degree distribution: The usual

13. Jun 23, 2009 CPM 13 Density and diameter over time

14. Jun 23, 2009 CPM 14 Components: A grand canyon view

15. Jun 23, 2009 CPM 15 Dynamics of the components How do components consolidate? Singletons merge with non-GC and GC Non-GCs merge with GC Almost never a non-GC merges with another non-GC Why is singleton attracted to a non-GC? Is there a special attractor in a non-GC?

16. Jun 23, 2009 CPM 16 Structure of non-GC: Stars Informal definition of star There are one or more centers (high degree nodes) There are many degree-one nodes (twinkles) connected to these centers

17. Jun 23, 2009 CPM 17 Structure of GC: Core There is a small core of very high connectivity inside GC The core is not comprised of star centers GC connectivity does not depend on star centers

18. Jun 23, 2009 CPM 18 Informal model with user types At each time step A person joins the network and is chosen to be one of three types: passive user, inviter, linker Few friendships (ie, edges) arrive Source of edge chosen from inviters/linkers with degree-biased probabilities (ie, preferential attachment) If source = inviter, destination = a new passive user If source = linker, destination chosen from linkers and inviters, degree-biased Empirically, this model generates the observed temporal characteristics (fraction of components, stars, core) Open question. Can we say anything formal?

19. Jun 23, 2009 CPM 19 The Gnp model A random graph model with parameter p Connect each pair of nodes with probability p Erdos-Renyi model, studied almost half-century ago If p = O(c/n), then the graph is sparse Many properties are well understood Threshold phenomenon: eg, giant component, connectivity Global properties such as diameter, eigen-values, coloring Probability that a node has degree k is Poisson P(k) = exp(-?) ?k / k! Evolving version: A new node picks an existing node uniformly at random to add an edge

20. Jun 23, 2009 CPM 20 Preferential attachment model Barabasi, Albert, Jeong 1999 Observation: Rich-get-richer Popular papers get more citations Popular people get more friends Each step has one new incoming node along with an edge Probability this new node links to a pre-existing node is proportional to how popular is the latter, ie, its degree Pr[new node links to node i] = di / ? dj Theorem. In this model, b = 3 Intuitive proof. ?di / ?t = di / (2t) If node i was added at time ti, then di(t) = (t/ti)0.5 Pr[di(t) < k] = Pr[ti > t/k2] = 1/k2

21. Jun 23, 2009 CPM 21 Other models Copying model Kumar et al 2000 Observation: People copy their friend�s webpage when creating a new one or copy their friend�s contacts when joining a social network When a new node arrives, it copies edges from a pre-existing node with probability 1 - ? The degree distribution is a power-law with exponent (2 - ?)/(1 - ?) Can explain communities: The number of dense bipartite cliques in this model is large Forest-fire model Leskovec, Kleinberg, Faloutsos 2005 An iterated version of the copying model In addition to the above, leads to densification and shrinking diameters Affiliation networks Lattanzi Sivakumar 2009

22. Jun 23, 2009 CPM 22 A microscopic look at evolution We are starting to have more nuanced models Processes such as preferential attachment are assumed without actually being observed There are many ways to generate power-law degrees What is the best way to compare two graph models? Maximum-likelihood (standard tool in ML) Efficiency issues: Bezakova, Kalai, Santhanam 2006 We have edge-by-edge arrival information, so can take an edge and compare the likelihood of its existence in competing models

23. Jun 23, 2009 CPM 23 How does the network evolve? Three processes govern the evolution Node arrival process. Nodes enter the network Edge initiation process. Each node decides when to initiate an edge Edge destination process. Determines destination after a node decides to initiate We will present a complete model of network evolution by mining the node and edge creation data

24. Jun 23, 2009 CPM 24 Node arrivals: Rate

25. Jun 23, 2009 CPM 25 Node arrivals: Lifetime Lifetime a: time between node�s first and last edge

26. Jun 23, 2009 CPM 26 How are edges initiated? Edge gap ?(d): time between dth and d+1st edge

27. Jun 23, 2009 CPM 27 How do ? and ? change with degree?

28. Jun 23, 2009 CPM 28 What we know so far

29. Jun 23, 2009 CPM 29 Does preferential attachment happen? We unroll the true network edge arrivals and measure node degrees where edges attach

30. Jun 23, 2009 CPM 30 How local are the added edges? Just before edge (u,v) is placed, how far are u and v? Normalize this by number of nodes at that hop distance

31. Jun 23, 2009 CPM 31 New triangle-closing edge (u,w) appears next We model this as Choose u�s neighbor v Choose v�s neighbor w Add edge (u,w) 25 strategies for choosing v and then w Random, degree preferentially, number of common friends, time of last activity, combination Can compute likelihood of each strategy Under random-random: p(u,w) = 1/6*1/2+1/6*1/4 Closing triangles

32. Jun 23, 2009 CPM 32 Triangle closing strategies Log-likelihood improvement over the baseline

33. Jun 23, 2009 CPM 33 What we know so far

34. Jun 23, 2009 CPM 34 Semi-simulation: Closer to truth Take the network at T/2 and evolve it using preferential attachment (PA) and random-random (RR) for edge addition events

35. Jun 23, 2009 CPM 35 The complete network model Nodes arrive using the arrival rate Node u arrives It has lifetime a ? ? exp(-?a) Adds first edge to node v with probability proportional to degree of v A node u with degree d has gap ? ? ?-? exp(-?d ?) and goes to sleep for ? time steps When u wakes up, if its lifetime still valid, creates a random-random triangle-closing edge

36. Jun 23, 2009 CPM 36 An analysis Theorem. The out-degrees are distributed according to a power-law with exponent 1 + (? / ?) ? (?(2-?) / ?(1-?)) For Flickr, true exponent = 1.73, ? = 0.0092, ? = 0.84, ? = 0.002, calculated exponent = 1.74 Analogous results hold for del.icio.us, Yahoo! Answers, LinkedIn Interesting as temporal behavior leads to power-law degree distribution

37. Jun 23, 2009 CPM 37 Social influence and correlation How similar is the behavior of connected users? Why is it important? Analysis: Predicting the dynamics of a system Whether a new norm of behavior, technology, or idea can diffuse like an epidemic Detect disease outbreaks early Identifying influencers Design: For designing a system to induce a particular behavior Eg, vaccination strategies (random, targeting a demographic group, random acquaintances), viral marketing campaigns, whom to give free products to achieve maximum sales, where to place sensors to detect contamination early Word of mouth effect

38. Jun 23, 2009 CPM 38 Correlation: Health (obesity) Logistic regression, taking many attributes into account (eg, age, gender, education level, smoking cessation) Taking advantage of data that is available over time via the edge-reversal test Having an obese friend increases chance of obesity by 57%, obese sibling by 40%, obese spouse by 37%

39. Jun 23, 2009 CPM 39 Obesity study

40. Jun 23, 2009 CPM 40 Correlation: Happiness Happiness = answers to four questions People who are surrounded by happy people and who are central in the network are likely to become happy A friend who lives within a mile and who becomes happy increases the probability of becoming happy by 25% Co-resident spouses (8%) Siblings in a mile (14%) Next door neighbors (34%) Happiness affects one�s friends� friends� friend

41. Jun 23, 2009 CPM 41 Correlation: Community membership

42. Jun 23, 2009 CPM 42 Correlation: Tag overlap

43. Jun 23, 2009 CPM 43 Three sources of social correlation Influence: An individual performing an action can cause her contacts to do the same By providing information By increasing the value of the action to them Homophily: Similar individuals are more likely to be friends Two stringologists are more likely to know each other than two random people Environment: Influence from external elements Friends are more likely to live in the same area, thus attend and take pictures of similar events, and tag them with similar tag

44. Jun 23, 2009 CPM 44 Social influence Focus on a particular action A Eg, buying a product, joining a community, publishing in a conference, using a particular tag, getting premier membership, � A person who performs A is called active x has influence over y if x performing action A increases the likelihood that y also performs action A

45. Jun 23, 2009 CPM 45 Game-theoretic models Each individual modeled as a player in a game and the utility that a person derives depends on what her friends do Individuals decide whether to become active to maximize their utility Example: adoption of a common technology, eg, cell phone, IM, facebook Morris 2000, Immorlica et al 2007 Probabilistic models Independent cascade model: Every neighbor u of v who becomes active gets an independent chance to influence v with probability puv Linear threshold model: Each person has a threshold, becomes active if sum of weights of active friends exceeds threshold Kempe, Kleinberg, Tardos 2003 Ising-type models from physics Two schools of models

46. Jun 23, 2009 CPM 46 Probabilistic models Probabilistic models are more predictive Allows optimization (find the best seed set) Allows fitting the data to estimate parameters of the system Can also include the element of time

47. Jun 23, 2009 CPM 47 Influence model Graph (static or dynamic) Edge (u, v): node u has a chance to influence node v At any time period a number of individuals can become active For each t, every inactive node becomes active with probability p(a), where a is the number of its active contacts/friends Let W be the set of active nodes at the end

48. Jun 23, 2009 CPM 48 Influence model: Choosing p(a) Natural choice for p(a): logistic regression function with ln(a+1) as the explanatory variable, ie, Coefficient ? measures social correlation

49. Jun 23, 2009 CPM 49 Measuring social correlation Given time-stamped data, we compute the maximum likelihood estimate for parameters ?, ? Let Ya = number of pairs (user u, time t) where u is not active and has a active friends at the beginning of time step t, and becomes active in this step Let Na = � does not become active in this step Find ?, ? to maximize the likelihood function

50. Jun 23, 2009 CPM 50 Influence vs non-influence models Influence model First G is selected Set of active nodes (W) picked from a distribution depending on�G Non-influence models Homophily: First W is picked, then G is picked from a distribution that depends on W Environment: Both G and W are picked from distributions that depend on a separate external variable We consider this general non-influence (correlation) model (G,W) are selected from a joint distribution Each person in W picks an activation time i.i.d. from a distribution on [0,T]

51. Jun 23, 2009 CPM 51 The influence tests Given a sequence of time-stamped node activations Assume edge (u, v), where u, v become active in some order Intuition: If influence exists, u is expected to become active before v and if there is no influence, each is equally likely to become active first Even though an individual�s probability of activation can depend on friends, her timing of activation is independent Shuffle test: Randomly shuffle the time-stamp of all actions and re-estimate the coefficient ? Edge-reversal test: Reverse the direction of all edges, and re-estimate ? If re-estimated ? is different from original ?, then social influence cannot be ruled out

52. Jun 23, 2009 CPM 52 An analysis Theorem. If the graph is large enough, shuffle test rules out general models of correlation Proof. In the original graph we compute Ya, Na and the model parameters ?, ? via maximum-likelihood In the shuffled graph we compute Y�a, N�a and the model parameters ?�, ?� via maximum-likelihood By assumption of the model, E[Ya] = E[Ya�], E[Na] = E[Na�] 1. Show concentration: Ya � E[Ya] � Ya� (use martingales) 2. Show if Ya � Ya� and Na � Na� then ? � ?� Open question. A converse statement? Open question. A similar statement about edge-reversal test?

53. Jun 23, 2009 CPM 53 Simulations Run the tests on randomly generated action data on flickr network (action = using a tag) Baseline: No-correlation model, actions generated randomly at each time step Influence model: Each inactive node flips a coin to see if it can become active, for different (�,�) values Correlation model: Given L, pick random centers and add the union of balls of radius 2 around these centers to W, and repeat until |W| = L

54. Jun 23, 2009 CPM 54 Simulations: Shuffle test

55. Jun 23, 2009 CPM 55 Simulations: Edge-reversal test

56. Jun 23, 2009 CPM 56 Influence tests on tag usage

57. Jun 23, 2009 CPM 57 Other aspects: Compressibility The web graph can be stored using 2-3 bits per edge Boldi, Vigna 2004 How compressible are social networks? Chierichetti, Kumar, Lattanzi, Mitzenmacher, Panconesi, Raghavan 2009 May not be very compressible! A combinatorial formulation for graph compressibility Heuristics that work in practice A fresh avenue with lots of open questions

58. Jun 23, 2009 CPM 58 Concluding remarks: Future directions Understanding why networks look the way they do Pin down the underlying sociological forces Algorithmic exploitation of network characteristics Short paths, compression, I/O efficient storage Dynamics in networks We need a better understanding of influence A positive test for influence is desirable Recurring motifs/patterns in network evolution Distinguish real-world networks Different networks have different characteristics Can we predict the future of a network, based on its structure or dynamics? Can we engineer success?

59. Jun 23, 2009 CPM 59 Concluding remarks: The bigger picture Large amounts of social network data at the finest granularity are increasingly available to the research community To understand the underlying processes, we need fine-grained mathematical modeling Existing insights from sociology are crucial To analyze and mine this data, we need efficient methods Algorithmic thinking is indispensable

60. Jun 23, 2009 CPM 60 Thank you! ravikumar@yahoo-inc.com

Modeling and Algorithmic Challenges in Online Social Networks

Modeling and Algorithmic Challenges in Online Social Networks

Presentation Transcript

Online Social Networks

Online Social Networks:

Measurement Challenges in Online Social Networks

Online Social Networks and Media

Online Social Networks and Media

Privacy in Online Social Networks

Online Social Networks and Media

Online Social Networks and Media

Online Social Networks and Social Integration

Online Social Networks and Media

Online Social Networks and Media

Online Social Networks and Media

Online Social Networks and Media

Analysis and Modeling of Social Networks

Online Social Networks and More!

Online Social Networks

Algorithmic Performance in Complex Networks

Measurement Challenges in Online Social Networks