Inferring Networks of Diffusion and Influence
Manuel Gomez Rodriguez (1,2), Jure Leskovec (1), Andreas Krause (3)
(1) Stanford University, (2) MPI for Biological Cybernetics, (3) California Institute of Technology
Hidden and implicit networks
Many social or information networks are implicit or hard to observe:
Hidden/hard-to-reach populations:
Network of needle sharing among injection drug users
Implicit connections:
Network of information propagation in online news media
But we can observe results of the processes taking place on such (invisible) networks:
Virus propagation:
Drug users get sick, and we observe when they see the doctor
Information networks:
We observe when media sites mention information
Virus propagation:
Process: viruses propagate through the network
We observe: when people get sick
It's hidden: who infected whom
Word of mouth & viral marketing:
Process: recommendations and influence propagate
We observe: when people buy products
It's hidden: who influenced whom
Can we infer the underlying network?
Plan for the talk:
Define a continuous time model of diffusion
Define the likelihood of the observed cascades given a network
Show how to efficiently compute the likelihood of cascades
Show how to efficiently find a graph G that maximizes the likelihood
Note:
There is a super-exponential number of graphs, O(N^(N*N))
Our method finds a near-optimal graph in O(N^2)!
Continuous time cascade diffusion model:
Cascade c reaches node u at time t_u
and spreads to u's neighbors:
With probability β, the cascade propagates along edge (u, v)
and we determine the infection time of node v:
t_v = t_u + Δ
e.g., Δ ~ Exponential or Power-law
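The model above can be sketched as a small simulation. This is a minimal illustration, not the authors' code: the function name `simulate_cascade`, the adjacency-dict graph format, and the parameters `beta` and `rate` are assumptions made for the example, and only the exponential delay is shown.

```python
import heapq
import random


def simulate_cascade(graph, source, beta=0.5, rate=1.0, rng=None):
    """Simulate one cascade; returns {node: infection time}.

    graph  : dict mapping each node to a list of its out-neighbors (assumed format)
    beta   : probability that the cascade propagates along an edge
    rate   : rate of the exponential delay Δ (assumed parameterization)
    """
    rng = rng or random.Random()
    times = {}
    heap = [(0.0, source)]  # pending infection attempts, ordered by time
    while heap:
        t_u, u = heapq.heappop(heap)
        if u in times:
            continue  # u was already infected earlier: one parent per node
        times[u] = t_u
        for v in graph.get(u, []):
            if v not in times and rng.random() < beta:
                delta = rng.expovariate(rate)  # Δ ~ Exponential
                heapq.heappush(heap, (t_u + delta, v))  # t_v = t_u + Δ
    return times


chain = {"a": ["b"], "b": ["c"], "c": []}
print(simulate_cascade(chain, "a", beta=1.0, rng=random.Random(0)))
```

Only the infection times are returned, which mirrors the setting of the talk: the parent of each node is discarded and has to be inferred.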
We assume each node v has only one parent!
P_c(u, v) ∝ P(t_v - t_u), with t_v > t_u
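As a rough sketch, the edge transmission likelihood can be written as a function of the time difference only. The unnormalized forms exp(-Δ/α) and Δ^(-α), the parameter name `alpha`, and the function name are assumptions chosen for illustration; the slides only name the two families.

```python
import math


def transmission_likelihood(t_u, t_v, model="exponential", alpha=1.0):
    """Unnormalized likelihood that u (infected at t_u) infected v (at t_v)."""
    delta = t_v - t_u
    if delta <= 0:
        return 0.0  # a node cannot be infected by a later (or simultaneous) node
    if model == "exponential":
        return math.exp(-delta / alpha)  # assumed form: exp(-Δ/α)
    if model == "power_law":
        return delta ** (-alpha)  # assumed form: Δ^(-α)
    raise ValueError(f"unknown model: {model}")


print(transmission_likelihood(1.0, 2.0))
print(transmission_likelihood(1.0, 3.0, model="power_law", alpha=2.0))
```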
Tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8)
There are many possible propagation trees that are consistent with the observed data:
c: (a, 1), (c, 2), (b, 3), (e, 4)
Good news
Computing P(c|G) is tractable:
even though there are O(n^n) possible propagation trees,
the Matrix Tree Theorem can compute this in O(n^3)!
Bad news
We actually want to search over graphs:
There is a super-exponential number of graphs!
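To illustrate why summing over all propagation trees of a cascade can be tractable, here is a small sketch. It does not implement the Matrix Tree Theorem itself. It assumes that P(c|G) is a sum over trees of products of edge likelihoods and that edges only point forward in time, in which case every node picks its parent independently and the sum factorizes into a product of per-node sums. The brute-force enumeration is included only to check that shortcut on a toy cascade; all names and weights are hypothetical.

```python
import itertools
import math


def candidate_parents(times, weights):
    """For each node, the in-neighbors that were infected strictly earlier."""
    return {
        v: [u for (u, w) in weights if w == v and u in times and times[u] < times[v]]
        for v in times
    }


def sum_over_trees_bruteforce(times, weights):
    """Enumerate every parent assignment (tree) and add up the tree likelihoods."""
    root = min(times, key=times.get)
    others = [v for v in times if v != root]
    parents = candidate_parents(times, weights)
    total = 0.0
    for choice in itertools.product(*(parents[v] for v in others)):
        total += math.prod(weights[(u, v)] for u, v in zip(choice, others))
    return total


def sum_over_trees_product(times, weights):
    """Same quantity as a product over nodes of summed incoming likelihoods."""
    root = min(times, key=times.get)
    parents = candidate_parents(times, weights)
    return math.prod(
        sum(weights[(u, v)] for u in parents[v]) for v in times if v != root
    )


# Toy cascade (a, 1), (c, 2), (b, 3), (e, 4) with made-up edge likelihoods.
times = {"a": 1, "c": 2, "b": 3, "e": 4}
weights = {("a", "c"): 0.5, ("a", "b"): 0.2, ("c", "b"): 0.3,
           ("a", "e"): 0.1, ("c", "e"): 0.4, ("b", "e"): 0.6}
print(sum_over_trees_bruteforce(times, weights))
print(sum_over_trees_product(times, weights))
```

The brute-force version grows exponentially with the cascade size, while the product form needs one pass over the edges.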
We consider only the most likely tree
Maximum log-likelihood for a cascade c under a graph G:
F_c(G) = max over propagation trees T in G of log P(c | T)
Log-likelihood of G given a set of cascades C:
F_C(G) = sum over cascades c in C of F_c(G)
Given a cascade c, what is the most likely propagation tree?
Greedy parent selection of each node gives the globally optimal tree!
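Greedy parent selection can be sketched as follows, assuming the tree likelihood is a product of edge likelihoods so that each node's best parent can be chosen independently. The function name, the `(u, v) -> likelihood` dictionary, and the toy numbers are hypothetical; nodes without any earlier in-neighbor are simply treated as roots here.

```python
import math


def most_likely_tree(times, weights):
    """Pick, for every node, the earlier in-neighbor with the highest likelihood.

    times   : {node: infection time}
    weights : {(u, v): transmission likelihood, assumed > 0}
    Returns (list of tree edges, log-likelihood of the tree).
    """
    tree, loglik = [], 0.0
    for v, t_v in times.items():
        best = None
        for (u, w), p in weights.items():
            if w == v and u in times and times[u] < t_v and p > 0:
                if best is None or p > weights[best]:
                    best = (u, w)
        if best is not None:  # nodes with no valid parent are treated as roots
            tree.append(best)
            loglik += math.log(weights[best])
    return tree, loglik


times = {"a": 1, "c": 2, "b": 3, "e": 4}
weights = {("a", "c"): 0.5, ("a", "b"): 0.2, ("c", "b"): 0.3,
           ("a", "e"): 0.1, ("c", "e"): 0.4, ("b", "e"): 0.6}
print(most_likely_tree(times, weights))
```

Because parents must be infected strictly earlier, any combination of per-node choices is acyclic, which is why the local choices add up to a globally optimal tree.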
Theorem:
Log-likelihood F_c(G) of cascade c is monotonic and submodular in the edges of the graph G:
F_c(A ∪ {e}) - F_c(A) ≥ F_c(B ∪ {e}) - F_c(B), for A ⊆ B ⊆ V × V
Left-hand side: gain of adding an edge to a "small" graph
Right-hand side: gain of adding an edge to a "large" graph
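The diminishing-returns property can be checked numerically on a toy cascade. The objective below is an assumed stand-in for F_c: each node takes the best improvement, in log-likelihood, of an available earlier parent over a fallback edge of likelihood `EPS`. The value of `EPS`, the function names, and the edge likelihoods are all illustrative assumptions.

```python
import itertools
import math

EPS = 1e-3  # assumed likelihood of the fallback edge


def cascade_gain(edge_set, times, weights):
    """Assumed F_c(A): per-node best log-likelihood improvement over the fallback."""
    total = 0.0
    for v in times:
        best = 0.0  # using the fallback edge gives zero improvement
        for (u, w) in edge_set:
            if w == v and times[u] < times[v]:
                best = max(best, math.log(weights[(u, w)]) - math.log(EPS))
        total += best
    return total


def has_diminishing_returns(times, weights, tol=1e-9):
    """Check F(A + e) - F(A) >= F(B + e) - F(B) for all A ⊆ B and e not in B."""
    edges = list(weights)
    for size_b in range(len(edges) + 1):
        for big in itertools.combinations(edges, size_b):
            for size_a in range(size_b + 1):
                for small in itertools.combinations(big, size_a):
                    for e in edges:
                        if e in big:
                            continue
                        gain_small = (cascade_gain(small + (e,), times, weights)
                                      - cascade_gain(small, times, weights))
                        gain_big = (cascade_gain(big + (e,), times, weights)
                                    - cascade_gain(big, times, weights))
                        if gain_small < gain_big - tol:
                            return False
    return True


times = {"a": 1, "c": 2, "b": 3, "e": 4}
weights = {("a", "c"): 0.5, ("a", "b"): 0.2, ("c", "b"): 0.3,
           ("a", "e"): 0.1, ("c", "e"): 0.4, ("b", "e"): 0.6}
print(has_diminishing_returns(times, weights))
```

The exhaustive check is exponential in the number of edges, so it is only meant for tiny examples like this one.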
Use greedy hill-climbing to maximize F_C(G):
For i = 1…k:
pick the edge that maximizes the marginal improvement
Benefits:
1. Approximation guarantee (1 - 1/e ≈ 0.63 of OPT)
2. Tight on-line bounds on the solution quality
3. Speed-ups:
Lazy evaluation (by submodularity)
Localized update (by the structure of the problem)
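Greedy hill-climbing with lazy evaluation can be sketched as below. This is a generic illustration for any monotone submodular `objective`, not the authors' implementation: the function name, the heap layout, and the toy coverage objective are assumptions. In the setting of the talk, the candidates would be edges and the objective would be F_C.

```python
import heapq


def lazy_greedy(candidates, objective, k):
    """Select up to k candidates, re-evaluating marginal gains only when needed.

    By submodularity a gain computed for an earlier, smaller selection is an
    upper bound on the current gain, so an up-to-date entry at the top of the
    heap is guaranteed to be the best choice.
    """
    selected = []
    current = objective(selected)
    # Entries: (-gain, tie-breaker, candidate, size of selection when gain was computed)
    heap = [(-(objective([e]) - current), i, e, 0) for i, e in enumerate(candidates)]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, i, e, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            selected.append(e)  # gain is up to date: take it
            current -= neg_gain
        else:
            gain = objective(selected + [e]) - current  # refresh the stale gain
            heapq.heappush(heap, (-gain, i, e, len(selected)))
    return selected


# Toy stand-in objective: number of items covered by the chosen sets.
sets = {"x": {1, 2, 3}, "y": {3, 4}, "z": {1, 2}, "w": {5}}


def coverage(chosen):
    return len(set().union(*(sets[name] for name in chosen)))


print(lazy_greedy(list(sets), coverage, k=2))
```

After the first pick, only the candidates that reach the top of the heap are re-evaluated; the rest keep their old gains as upper bounds.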
Marginal gains (figure: candidate edges among nodes a-e, each labeled with its marginal gain)
We validate our method on:
Synthetic data
Generate a graph G with k edges
Generate cascades
Record node infection times
Reconstruct G
Real data
MemeTracker: 172m news articles
Aug ’08 – Sept ’09
343m textual phrases (quotes)
Small Synthetic Example
Panels: true network, baseline network, our method
Baseline: pick the k strongest edges
1024-node hierarchical Kronecker network, exponential transmission model
1000-node Forest Fire network (α = 1.1), power-law transmission model
We achieve ≈ 90 % of the best possible network!
With 2x as many infections as edges, the break-even point is already 0.8 - 0.9!
Lazy evaluation and localized updates give a speed-up of two orders of magnitude!
We can infer a network of 10k nodes in several hours
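For reference, a reconstruction can be scored against the true graph with edge-level precision and recall. This sketch assumes the break-even point is read in the usual sense, the point where precision equals recall, which happens when as many edges are inferred as the true graph contains. The function name and the toy edge lists are hypothetical.

```python
def precision_recall(true_edges, inferred_edges):
    """Edge-level precision and recall of an inferred network."""
    true_edges, inferred_edges = set(true_edges), set(inferred_edges)
    hits = len(true_edges & inferred_edges)
    precision = hits / len(inferred_edges) if inferred_edges else 0.0
    recall = hits / len(true_edges) if true_edges else 0.0
    return precision, recall


true_edges = [("a", "b"), ("b", "c"), ("c", "d")]
inferred = [("a", "b"), ("b", "c"), ("a", "d")]  # same number of edges as the truth
print(precision_recall(true_edges, inferred))
```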
MemeTracker dataset:
172m news articles from Aug ’08 to Sept ’09
343m textual phrases (quotes)
1. Cascades of hyperlinks:
time when a site creates a link
2. Cascades of (MemeTracker) textual phrases:
time when a site mentions the information
500-node hyperlink network using hyperlink cascades
500-node hyperlink network using MemeTracker cascades
Diffusion Network (figure legend: blogs, mainstream media)
Diffusion Network, small part (figure legend: blogs, mainstream media)
We infer hidden networks based on diffusion data (timestamps)
Problem formulation in a maximum likelihood framework
NP-hard problem to solve exactly
We develop an approximation algorithm that:
Is efficient: it runs in O(N^2)
Is invariant to the structure of the underlying network
Gives a near-optimal network with a tight bound on solution quality
Future work:
Learn both the network and the diffusion model
Applications to other domains: biology, neuroscience, etc.
For more (Code & Data): http://snap.stanford.edu/netinf