1 stanford university 2 mpi for biological cybernetics 3 california institute of technology
Download
1 / 26

Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3 - PowerPoint PPT Presentation


  • 116 Views
  • Uploaded on

1 Stanford University 2 MPI for Biological Cybernetics 3 California Institute of Technology. Inferring Networks of Diffusion and Influence. Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3. Networks and Processes.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Manuel Gomez Rodriguez 1,2 Jure Leskovec 1 Andreas Krause 3' - suzuki


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
1 stanford university 2 mpi for biological cybernetics 3 california institute of technology
1 Stanford University2 MPI for Biological Cybernetics3 California Institute of Technology Inferring Networks of Diffusion and Influence Manuel Gomez Rodriguez1,2Jure Leskovec1Andreas Krause3


Networks and processes
Networks and Processes

Many times hard to directly observe a social or information network:

Hidden/hard-to-reach populations:

Drug injection users

Implicit connections:

Network of information sharing in online media

Often easier to observe results of the processes taking place on such (invisible) networks:

Virus propagation:

People get sick, they see the doctor

Information networks:

Blogs mention information


Information diffusion network
Information Diffusion Network

Information diffuses through the network

  • We only see who mentions but not where they got the information from

  • Can we reconstruct the (hidden) diffusion network?


More examples
More Examples

Virus propagation

Word of mouth & Viral marketing

Viruses propagate through the network

We only observe when people get sick

But NOT who infected them

  • Recommendations and influence propagate

  • We only observe when people buy products

  • But NOT who influenced them

Process

We observe

It’s hidden

Can we infer the underlying social or information network?


Inferring the network
Inferring the Network

  • There is ahiddendirected network:

  • We only see times when nodes get infected:

    • Contagion c1: (a, 1), (c, 2), (b, 3), (e, 4)

    • Contagion c2: (c, 1), (a, 4), (b, 5), (d, 6)

  • Want to infer the who-infects-whom network

a

a

a

b

b

b

d

d

c

c

c

e

e


Our problem formulation
Our Problem Formulation

Plan for the talk:

Define a continuous time model of diffusion

Define the likelihood of the observed propagation data given a graph

Show how to efficiently compute the likelihood

Show how to efficiently optimize the likelihood

Find a graph G that maximizes the likelihood

There is a super-exponential number of graphs G

Our method finds a near-optimal solution in O(N2)!


Cascade generation model

Cascade generation model:

Cascade reaches u at tu, and spreads to u’s neighbors v:

with probability β cascade propagates along (u, v) and tv = tu + Δ, with Δ~ f()

Cascade Generation Model

ta

tb

tc

te

tf

Δ1

Δ2

Δ3

Δ4

a

a

a

b

b

b

c

c

c

d

We assume each node v has only one parent!

e

e

f

f


Likelihood of a cascade
Likelihood of a Cascade

  • If u infected v in a cascade c, its transmission probability is:

  • Pc(u, v)  f(tv - tu)with tv > tuand (u, v) are neighbors

a

a

b

b

  • To model that in reality any node v in a cascade can have been infected by an external influence m:Pc(m, j) =ε

d

c

c

m

  • Prob. that cascade c propagates in a tree T:

ε

e

e

ε

ε

Tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8)


Finding the diffusion network
Finding the Diffusion Network

There are many possible propagation trees:

c: (a, 1), (c, 2), (b, 3), (e, 4)

Need to consider all possible propagation trees T supported by G:

Good news

Computing P(c|G) is tractable:

  • Bad news

    • We actually want to find:

    • We have a super-exponential number of graphs!

a

a

a

a

a

a

b

b

b

b

b

b

For each c, consider all O(nn) possible transmission trees of G.

Matrix Tree Theorem can compute this sum in O(n3)!

d

d

d

c

c

c

c

c

c

e

e

e

e

e

e

  • Likelihood of a set of cascades C on G:

  • Want to find:


An alternative formulation
An Alternative Formulation

We consider only the most likely tree

Maximum log-likelihood for a cascade c under a graph G:

Log-likelihood of G given a set of cascades C:

The problem is NP-hard: MAX-k-COVER

Our algorithm can do it near-optimally in O(N2)


Max directed spanning tree
Max Directed Spanning Tree

Given a cascade c,

What is the most likely propagation tree?

where

  • A maximum directed spanning tree (MDST):

    • Just need to compute the MDST of a the sub-graph of G induced by c (i.e., a DAG)

    • For each node, just picks an in-edge ofmax-weight:

Subgraph of G induced by c doesn’t have loops (DAG)

a

a

3

5

b

b

2

c

c

4

6

1

d

d

Local greedy selection gives optimal tree!


Great news submodularity
Great News: Submodularity

Theorem:

Log-likelihood Fc(G) is monotonic, and submodular in the edges of the graph G

Fc(A  {e}) – Fc (A) ≥ Fc (B  {e}) – Fc (B)

Gain of adding an edge to a “small” graph

Gain of adding an edge to a “large“ graph

A B  VxV

  • Proof:

  • Single cascade c, edge e with weight x

A

r

i

i

w

x

  • Let w be max weight in-edge of s in A

s

k

k

  • Let w’ be max weight in-edge of s in B

B

  • We know: w ≤ w’

j

w’

  • Now: Fc(A  {e}) – Fc(A) = max (w, x) – w

    ≥ max (w’, x) – w’ = Fc(B {e}) – Fc(B)

o

a

Then, log-likelihood FC(G) is monotonic, and submodular too


Finding the diffusion graph
Finding the Diffusion Graph

Use the greedy hill-climbing to maximize FC(G):

At every step, pick the edge that maximizes the marginal improvement

Benefits

  • 1. Approximation guarantee (≈ 0.63 of OPT)

  • 2. Tight on-line bounds on the solution quality

  • 3. Speed-ups:

  • Lazy evaluation (by submodularity)

  • Localized update (by the structure of the problem)

Marginal gains

b

a

: 12

a

c

a

: 3

b

d

a

: 6

a

b

: 20

: 17

c

b

: 18

Localized update

d

Localized update

: 2

: 1

d

b

: 4

c

: 1

: 3

e

b

: 5

a

c

: 15

Localized update

b

c

: 6

: 8

e

b

d

: 16

Localized update

c

d

: 7

: 8

e

d

: 8

: 10

b

e

: 7

d

e

: 13


Experimental setup
Experimental Setup

We validate our method on:

Synthetic data

Generate a graph G on k edges

Generate cascades

Record node infection times

Reconstruct G

Real data

MemeTracker: 172m news articles

Aug ’08 – Sept ‘09

343m textual phrases (quotes)

  • How well do we optimize Fc(G)?

  • How many edges of G can we find?

    • Precision-Recall

    • Break-even point

  • How fast are we?

  • How many cascades do we need?


Small Synthetic Example

True network

Baseline network

Our method

  • Pick kstrongest edges:

  • Small synthetic network:

15


Synthetic networks
Synthetic Networks

1024 node hierarchical Kronecker exponential transmission model

1000 node Forest Fire (α = 1.1) power law transmission model

  • Our performance does not depend on the network structure:

    • Synthetic Networks: Forest Fire, Kronecker, etc.

    • Prob. of transmission: Exponential, Power Law

  • Break-even points of > 90% when the baseline gets 30-50%!


How good is our graph
How good is our graph?

We achieve ≈ 90 % of the best possible network!


How many cascades do we need
How many cascades do we need?

With 2x as many infections as edges, the break-even point is already 0.8 - 0.9!


Running time
Running Time

Lazy evaluation and localized updates speed up 2 orders of magnitude!


Real data
Real Data

MemeTracker dataset:

172m news articles

Aug ’08 – Sept ‘09

343m textual phrases (quotes)

Times tc(w) when site w mentions phrase (quote) c

http://memetracker.org

  • Given times when sites mention phrases

  • We infer the network of information diffusion:

    • Who tends to copy (repeat after) whom


Real network
Real Network

  • We use the hyperlinks in the MemeTracker dataset to generate the edges of a ground truth G

  • From the MemeTracker dataset, we have the timestamps of:

  • 1. cascades of hyperlinks:

  • sites link other sites

  • 2.cascades of (MemeTracker) quotes

  • sites copy quotes from other sites

e

a

c

f

Are they correlated?

e

a

c

f

Can we infer the hyperlinks network from…

…cascades of hyperlinks?

…cascades of MemeTracker quotes?


Real network1
Real Network

500 node hyperlink network using hyperlinks cascades

500 node hyperlink network using MemeTracker cascades

  • Break-even points of 50% for hyperlinks cascades and 30% for MemeTracker cascades!


Diffusion Network

Blogs

Mainstream media

  • 5,000 news sites:


Diffusion Network (small part)

Blogs

Mainstream media


Networks and processes1
Networks and Processes

We infer hidden networks based on diffusion data (timestamps)

Problem formulation in a maximum likelihood framework

NP-hard problem to solve exactly

We develop an approximation algorithm that:

It is efficient -> It runs in O(N2)

It is invariant to the structure of the underlying network

It gives a sub-optimal network with tight bound

Future work:

Learn both the network and the diffusion model

Extensions to other processes taking place on networks

Applications to other domains: biology, neuroscience, etc.


Thanks
Thanks!

For more (Code & Data):http://snap.stanford.edu/netinf


ad