Inferring Networks of Diffusion and Influence

Manuel Gomez Rodriguez 1,2, Jure Leskovec 1, Andreas Krause 3

1 Stanford University, 2 MPI for Biological Cybernetics, 3 California Institute of Technology


Hidden and implicit networks

Many social or information networks are implicit or hard to observe:

  • Hidden/hard-to-reach populations:

    • Network of needle sharing between drug injection users

  • Implicit connections:

    • Network of information propagation in online news media

But we can observe results of the processes taking place on such (invisible) networks:

  • Virus propagation:

    • Drug users get sick, and we observe when they see the doctor

  • Information networks:

    • We observe when media sites mention information

  • Question: Can we infer the hidden networks?


Inferring the Network

  • There is a directed social network over which diffusions take place:

[Figure: directed network over nodes a, b, c, d, e]

  • But we do not observe the edges of the network

  • We only see the time when a node gets infected:

    • Cascade c1: (a, 1), (c, 2), (b, 6), (e, 9)

    • Cascade c2: (c, 1), (a, 4), (b, 5), (d, 8)

  • Task: inferring the underlying network
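To make the input concrete, here is a minimal Python sketch of how the two cascades above could be stored and turned into candidate edges. The names `cascades`, `candidate_edges` and `support` are illustrative assumptions, not part of the original talk; the only fact used is that an edge (u, v) can explain a cascade only if u was infected before v.

```python
from collections import defaultdict

# Cascades from the slide: lists of (node, infection time) pairs.
cascades = {
    "c1": [("a", 1), ("c", 2), ("b", 6), ("e", 9)],
    "c2": [("c", 1), ("a", 4), ("b", 5), ("d", 8)],
}

def candidate_edges(cascade):
    """Return every pair (u, v) with t_u < t_v, i.e. edges that could have carried the cascade."""
    return [(u, v) for u, tu in cascade for v, tv in cascade if tu < tv]

# Count in how many cascades each directed pair is consistent with the time order.
support = defaultdict(int)
for events in cascades.values():
    for edge in candidate_edges(events):
        support[edge] += 1

for edge, count in sorted(support.items()):
    print(edge, count)
```

Pairs that are time-consistent in many cascades are only candidates; the rest of the talk is about scoring and selecting among them.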


Examples and Applications

  • Virus propagation:

    • Process: viruses propagate through the network

    • We observe: when people get sick

    • It’s hidden: who infected whom

  • Word of mouth & viral marketing:

    • Process: recommendations and influence propagate

    • We observe: when people buy products

    • It’s hidden: who influenced whom

Can we infer the underlying network?


Our Problem Formulation

Plan for the talk:

Define a continuous time model of diffusion

Define the likelihood of the observed cascades given a network

Show how to efficiently compute the likelihood of cascades

Show how to efficiently find a graph G that maximizes the likelihood

Note:

There is a super-exponential number of graphs, $O(N^{N \cdot N})$

Our method finds a near-optimal graph in $O(N^2)$!


Cascade Diffusion Model

Continuous time cascade diffusion model:

  • Cascade c reaches node u at time $t_u$ and spreads to u’s neighbors:

    • With probability β the cascade propagates along edge (u, v)

    • We then determine the infection time of node v: $t_v = t_u + \Delta$, e.g. Δ ~ Exponential or Power-law

[Figure: example cascade over nodes a, u, b, c, d with infection times $t_u$, $t_b$, $t_c$ and delays $\Delta_1$, $\Delta_2$]

We assume each node v has only one parent!
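The model can be simulated in a few lines. The following is a sketch under stated assumptions rather than the authors' implementation: the graph, the function name `simulate_cascade`, the parameter values, and the choice of an exponential delay with mean `alpha` are all illustrative. A node keeps the earliest transmission that reaches it, which matches the single-parent assumption.

```python
import heapq
import random

def simulate_cascade(graph, source, beta=0.5, alpha=1.0, seed=0):
    """Simulate one cascade; returns ({node: infection time}, {node: parent})."""
    rng = random.Random(seed)
    times, parent = {}, {}
    # Heap of tentative infections: (time, node, node that transmitted).
    heap = [(0.0, source, None)]
    while heap:
        t, v, u = heapq.heappop(heap)
        if v in times:
            continue  # already infected earlier, so v keeps its single parent
        times[v], parent[v] = t, u
        for w in graph.get(v, []):
            # The cascade crosses edge (v, w) with probability beta.
            if w not in times and rng.random() < beta:
                delta = rng.expovariate(1.0 / alpha)  # delay with mean alpha
                heapq.heappush(heap, (t + delta, w, v))
    return times, parent

# Hypothetical directed graph, given as adjacency lists.
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": ["e"], "d": ["e"], "e": []}
times, parent = simulate_cascade(graph, "a", beta=0.8, alpha=1.0, seed=1)
for node in sorted(times, key=times.get):
    print(node, round(times[node], 3), parent[node])
```

Only `times` would be visible to the inference method; `parent` is the hidden information that the method tries to recover.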


Likelihood of a Single Cascade

  • Probability that cascade c propagates from node u to node v is:

    $P_c(u, v) \propto P(t_v - t_u)$ with $t_v > t_u$

  • Since not all nodes get infected by the diffusion process, we introduce the external influence node m: $P_c(m, v) = \varepsilon$

  • Prob. that cascade c propagates in a tree pattern T:

[Figure: tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8), with ε-weighted edges from the external node m]
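As a rough illustration of these quantities, the sketch below evaluates unnormalized edge terms for the exponential and power-law cases and multiplies them along a tree. Several things here are assumptions made for the example: the parameter `alpha`, the value of ε, the simple product form (which ignores any factor for edges that did not transmit), and the tree itself, since the drawing from the slide is not preserved.

```python
import math

def transmission_likelihood(delta, model="exponential", alpha=1.0):
    """Unnormalized P(t_v - t_u); zero when v was not infected after u."""
    if delta <= 0:
        return 0.0
    if model == "exponential":
        return math.exp(-delta / alpha)
    if model == "power-law":
        return delta ** (-alpha)
    raise ValueError(f"unknown model: {model}")

def tree_likelihood(times, tree_edges, epsilon=1e-3, model="exponential", alpha=1.0):
    """Product of per-edge terms over a tree; 'm' denotes the external influence node."""
    prob = 1.0
    for u, v in tree_edges:
        if u == "m":
            prob *= epsilon
        else:
            prob *= transmission_likelihood(times[v] - times[u], model, alpha)
    return prob

# Infection times from the slide's cascade.
times = {"a": 1, "b": 2, "c": 4, "e": 8}
# Hypothetical tree (an assumption): m -> a -> b -> c -> e
tree = [("m", "a"), ("a", "b"), ("b", "c"), ("c", "e")]
print(tree_likelihood(times, tree))
```

Swapping `model="power-law"` changes only the per-edge term, which is why the rest of the method can stay agnostic to the transmission time distribution.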


Finding the Diffusion Network

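There are many possible propagation trees that are consistent with the observed data:

c: (a, 1), (c, 2), (b, 3), (e, 4)

[Figure: alternative propagation trees over the network nodes a, b, c, d, e that are consistent with cascade c]

  • Need to consider all possible propagation trees T supported by the graph G:

    $P(c \mid G) = \sum_{T \in \mathcal{T}_c(G)} P(c \mid T)$, where $\mathcal{T}_c(G)$ is the set of propagation trees of c supported by G

Good news: computing $P(c \mid G)$ is tractable. Even though there are $O(n^n)$ possible propagation trees, the Matrix Tree Theorem can compute this in $O(n^3)$!

  • Likelihood of a set of cascades C:

    $P(C \mid G) = \prod_{c \in C} P(c \mid G)$

  • Want to find a graph:

    $\hat{G} = \arg\max_{|G| \le k} P(C \mid G)$

Bad news: we actually want to search over graphs, and there is a super-exponential number of graphs!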
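The tractability of summing over propagation trees can be checked numerically on the cascade from this slide. The sketch below makes several assumptions: all infection times are distinct, G contains every time-consistent edge, and the edge weight is an illustrative exponential term. Under these assumptions every choice of one earlier parent per node is a valid tree, so the sum over trees factorizes into a product over nodes. This is a shortcut for the time-ordered case, not the determinant computation that the Matrix Tree Theorem refers to.

```python
import itertools
import math

def edge_weight(tu, tv, alpha=1.0):
    """Illustrative exponential edge term for t_u < t_v (an assumption)."""
    return math.exp(-(tv - tu) / alpha)

def tree_sum_bruteforce(times):
    """Enumerate every assignment of one earlier parent per node and sum the products."""
    nodes = sorted(times, key=times.get)
    rest = nodes[1:]  # the earliest node is the root and has no parent
    choices = [[u for u in nodes if times[u] < times[v]] for v in rest]
    total = 0.0
    for parents in itertools.product(*choices):
        total += math.prod(
            edge_weight(times[u], times[v]) for u, v in zip(parents, rest)
        )
    return total

def tree_sum_closed_form(times):
    """Same quantity as a product over nodes of the summed incoming weights."""
    nodes = sorted(times, key=times.get)
    return math.prod(
        sum(edge_weight(times[u], times[v]) for u in nodes if times[u] < times[v])
        for v in nodes[1:]
    )

times = {"a": 1, "c": 2, "b": 3, "e": 4}  # cascade c from the slide
print(tree_sum_bruteforce(times), tree_sum_closed_form(times))
```

The brute-force version visits 1 · 2 · 3 = 6 trees for this four-node cascade, while the closed form needs only one pass over pairs of nodes.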


An Alternative Formulation

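We consider only the most likely tree

Maximum log-likelihood for a cascade c under a graph G:

$F_c(G) = \max_{T \in \mathcal{T}_c(G)} \log P(c \mid T)$, where $\mathcal{T}_c(G)$ is the set of propagation trees of c supported by G

Log-likelihood of G given a set of cascades C:

$F_C(G) = \sum_{c \in C} F_c(G)$

  • The problem is still intractable (NP-hard)

  • But we present an algorithm that finds near-optimal networks in $O(N^2)$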


Max Directed Spanning Tree

Given a cascade c, what is the most likely propagation tree?

  • A maximum directed spanning tree (MDST):

    • The sub-graph of G induced by the nodes in the cascade c is a DAG

      • Because edges point forward in time

    • For each node, just pick an in-edge of max weight

Greedy parent selection of each node gives a globally optimal tree!
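Greedy parent selection is short enough to sketch directly. In the Python example below, the edge set, the exponential log-weight, the value of ε and the function name `most_likely_tree` are illustrative assumptions; the infection times are those of cascade c1 from the earlier slide. Each node independently keeps its heaviest incoming edge that points forward in time, falling back to the external node m.

```python
import math

def most_likely_tree(times, edges, alpha=1.0, epsilon=1e-3):
    """For each node, pick the max-weight in-edge among graph edges that point forward in time."""
    parent = {}
    for v in times:
        # Fallback: the external node 'm' explains v with log-weight log(epsilon).
        best_u, best_w = "m", math.log(epsilon)
        for u in times:
            if (u, v) in edges and times[u] < times[v]:
                w = -(times[v] - times[u]) / alpha  # log of exp(-delta / alpha)
                if w > best_w:
                    best_u, best_w = u, w
        parent[v] = (best_u, best_w)
    return parent

times = {"a": 1, "c": 2, "b": 6, "e": 9}
# Hypothetical graph edges (an assumption for this example).
edges = {("a", "b"), ("a", "c"), ("c", "b"), ("b", "e"), ("c", "e")}
tree = most_likely_tree(times, edges)
for v, (u, w) in tree.items():
    print(f"{u} -> {v}  log-weight {w:.3f}")
```

Because the induced sub-graph is a DAG, these independent choices can never form a cycle, which is why no global search over trees is needed.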


Objective function is Submodular

Theorem:

Log-likelihood $F_c(G)$ of cascade c is monotonic and submodular in the edges of the graph G:

$F_c(A \cup \{e\}) - F_c(A) \ge F_c(B \cup \{e\}) - F_c(B)$ for $A \subseteq B \subseteq V \times V$

  • Left-hand side: gain of adding an edge to a “small” graph

  • Right-hand side: gain of adding an edge to a “large” graph

  • Proof:

    • We use greedy parent selection and the definition of submodularity

    • Log-likelihood $F_C(G)$ is a sum of submodular functions, so it is submodular too
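The diminishing-returns property can be checked by brute force on a tiny cascade. This is a sanity check under assumptions, not a proof: the per-cascade objective is modeled as the sum over nodes of the best incoming log-weight (with the external node as fallback), and the exponential log-weight and the value of ε are illustrative choices.

```python
import itertools
import math

times = {"a": 1, "c": 2, "b": 6, "e": 9}
LOG_EPS = math.log(1e-3)  # assumed log-weight of the external node's edges
candidates = [(u, v) for u in times for v in times if times[u] < times[v]]

def F(edge_set):
    """Per-cascade objective: each node takes its best in-edge or the external fallback."""
    total = 0.0
    for v in times:
        best = LOG_EPS
        for (u, x) in edge_set:
            if x == v:
                best = max(best, -(times[v] - times[u]))
        total += best
    return total

def subsets(items):
    """Yield every subset of items as a frozenset."""
    for r in range(len(items) + 1):
        for combo in itertools.combinations(items, r):
            yield frozenset(combo)

violations = 0
for B in subsets(candidates):
    for A in subsets(sorted(B)):  # every A that is a subset of B
        for e in candidates:
            if e in B:
                continue
            gain_small = F(A | {e}) - F(A)
            gain_large = F(B | {e}) - F(B)
            if gain_small < gain_large - 1e-12:
                violations += 1
print("violations:", violations)
```

The check runs over all pairs A ⊆ B of the six candidate edges, so it is only practical for very small cascades.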


Finding the Diffusion Graph

Use the greedy hill-climbing to maximize $F_C(G)$:

  • For i = 1…k: at every step, pick the edge that maximizes the marginal improvement

Benefits:

    1. Approximation guarantee (≈ 0.63 of OPT)

    2. Tight on-line bounds on the solution quality

    3. Speed-ups:

      • Lazy evaluation (by submodularity)

      • Localized update (by the structure of the problem)

[Figure: worked example listing the marginal gain of each candidate edge among nodes a, b, c, d, e as edges are added]
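Greedy hill-climbing with lazy evaluation can be sketched as follows. This is an illustration under assumptions, not the released NetInf code: the objective uses an exponential log-weight and an assumed ε, it is recomputed from scratch on each evaluation (so the localized-update speed-up is not shown), and the names `objective` and `lazy_greedy` are hypothetical. The lazy part relies on submodularity: a gain computed earlier is an upper bound on the current gain, so only the top of the queue needs refreshing.

```python
import heapq
import math

LOG_EPS = math.log(1e-3)  # assumed log-weight of the external node's edges

def objective(edges, cascades):
    """F_C(G): per cascade, every node takes its best in-edge or the external fallback."""
    total = 0.0
    for times in cascades:
        for v, tv in times.items():
            best = LOG_EPS
            for u, tu in times.items():
                if tu < tv and (u, v) in edges:
                    best = max(best, -(tv - tu))
            total += best
    return total

def lazy_greedy(cascades, k):
    """Pick up to k edges, re-evaluating a marginal gain only when it reaches the top."""
    candidates = {
        (u, v) for times in cascades for u in times for v in times
        if times[u] < times[v]
    }
    chosen, current = set(), objective(set(), cascades)
    # Max-heap via negated gains; the stamp records |chosen| when the gain was computed.
    heap = [(-(objective({e}, cascades) - current), e, 0) for e in candidates]
    heapq.heapify(heap)
    while heap and len(chosen) < k:
        neg_gain, e, stamp = heapq.heappop(heap)
        if stamp == len(chosen):
            if -neg_gain <= 0:
                break  # no remaining edge improves the objective
            chosen.add(e)
            current += -neg_gain
        else:
            # Stale bound: recompute against the current solution and re-queue.
            gain = objective(chosen | {e}, cascades) - current
            heapq.heappush(heap, (-gain, e, len(chosen)))
    return chosen

cascades = [
    {"a": 1, "c": 2, "b": 6, "e": 9},  # cascade c1 from the earlier slide
    {"c": 1, "a": 4, "b": 5, "d": 8},  # cascade c2
]
print(lazy_greedy(cascades, k=2))
```

With many candidate edges, most stale gains never reach the top of the queue and are never recomputed, which is where the saving comes from.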


Experimental Setup

We validate our method on:

  • Synthetic data:

    • Generate a graph G on k edges

    • Generate cascades

    • Record node infection times

    • Reconstruct G

  • Real data:

    • MemeTracker: 172m news articles from Aug ’08 to Sept ’09

    • 343m textual phrases (quotes)

  • How many edges of G can we find?

    • Precision-Recall

    • Break-even point

  • How well do we optimize the likelihood Fc(G)?

  • How fast is the algorithm?

  • How many cascades do we need?
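The edge-recovery metrics can be sketched in a few lines of Python. The function names and the toy edge lists are assumptions made for this example. The break-even point is taken here as the precision obtained when exactly as many edges are kept as the true graph has, at which point precision and recall coincide; the sketch assumes the ranked list has no duplicates and at least that many entries.

```python
def precision_recall(inferred, true_edges):
    """Precision and recall of an inferred edge set against the ground truth."""
    inferred, true_edges = set(inferred), set(true_edges)
    hits = len(inferred & true_edges)
    precision = hits / len(inferred) if inferred else 0.0
    recall = hits / len(true_edges) if true_edges else 0.0
    return precision, recall

def break_even_point(ranked_edges, true_edges):
    """Precision (equal to recall) when exactly |true_edges| top-ranked edges are kept."""
    k = len(set(true_edges))
    return precision_recall(ranked_edges[:k], true_edges)[0]

# Toy ground truth and a hypothetical ranking of inferred edges, best first.
true_edges = {("a", "b"), ("a", "c"), ("c", "b"), ("b", "e")}
ranked = [("a", "b"), ("a", "c"), ("b", "d"), ("b", "e"), ("c", "b")]

print(precision_recall(ranked[:2], true_edges))
print(break_even_point(ranked, true_edges))
```

Sweeping the cut-off over the ranked list traces the full precision-recall curve, of which the break-even point is a single summary value.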


Small Synthetic Example

  • Small synthetic network:

[Figure panels: True network; Baseline network (pick k strongest edges); Our method]


Synthetic Networks

[Figure panels: 1024-node hierarchical Kronecker network, exponential transmission model; 1000-node Forest Fire network (α = 1.1), power-law transmission model]

  • Performance does not depend on the network structure:

    • Synthetic Networks: Forest Fire, Kronecker, etc.

    • Transmission time distribution: Exponential, Power Law

  • Break-even point of > 90%


How good is our graph?

We achieve ≈ 90 % of the best possible network!


How many cascades do we need?

With 2x as many infections as edges, the break-even point is already 0.8 - 0.9!


Running Time

Lazy evaluation and localized updates give a speed-up of two orders of magnitude!

Can infer a network of 10k nodes in several hours


Real Data: Information diffusion

MemeTracker dataset:

  • 172m news articles from Aug ’08 to Sept ’09

  • 343m textual phrases (quotes)

  • Want to infer the network of information diffusion

  • We use the hyperlinks between sites to generate the edges of a ground truth G

  • From the MemeTracker dataset, we have the timestamps of:

    1. cascades of hyperlinks: time when a site creates a link

    2. cascades of (MemeTracker) textual phrases: time when a site mentions the information

[Figure: example cascades over sites a, c, e, f]


Real Network: Performance

[Figure panels: 500-node hyperlink network using hyperlink cascades; 500-node hyperlink network using MemeTracker cascades]

  • Break-even points of 50% for hyperlink cascades and 30% for MemeTracker cascades!


Diffusion Network

  • 5,000 news sites:

[Figure: inferred diffusion network; legend: Blogs, Mainstream media]


Diffusion Network (small part)

[Figure: small part of the inferred diffusion network; legend: Blogs, Mainstream media]


Networks and Processes

We infer hidden networks based on diffusion data (timestamps)

  • Problem formulation in a maximum likelihood framework

  • NP-hard problem to solve exactly

We develop an approximation algorithm that:

  • Is efficient: it runs in $O(N^2)$

  • Is invariant to the structure of the underlying network

  • Gives a near-optimal network with a tight bound

Future work:

  • Learn both the network and the diffusion model

  • Applications to other domains: biology, neuroscience, etc.


Thanks!

For more (Code & Data): http://snap.stanford.edu/netinf

