Information dynamics in the networked world

Lada Adamic, HP Labs, Palo Alto, CA

Talk outline

Information flow through blogs

Information flow through email

Search through email networks

Search within the enterprise

Search in an online community

Blog use:

Record real-world and virtual experiences

Note and discuss things “seen” on the net

Blog structure: blog-to-blog linking

Use + Structure

Great to track “memes” (catchy ideas)

Patterns of information flow

How does the popularity of a topic evolve over time?

Who is getting information from whom?

Ranking algorithms that take advantage of transmission patterns

[Figure: popularity over time, illustrating the Slashdot effect and the BoingBoing effect]

Blogdex, BlogPulse, etc. track the most popular links/phrases of the day

[Figure: % of hits received on each day since first appearance (y-axis 0 to 1; x-axis days 5, 10, 15), in four panels: major-news site (editorial content), back of the paper; products, etc.; Slashdot postings; front-page news]

What do we need to track specific info ‘epidemics’?

Timings

Underlying network

[Diagram: blogs b1, b2, b3 along a time-of-infection axis from t0 to t1]

Challenges

Root may be unknown

Multiple possible paths

Uncrawled space, alternate media (email, voice)

No links

[Diagram: blogs b1, b2, b3, …, bn with unknown (“?”) links, along a time-of-infection axis from t0 to t1]

Explicit blog to blog links (easy)

Via links are even better

Implicit/Inferred transfer (harder)

Use ML algorithm for link inference problem

Support Vector Machine (SVM)

Logistic Regression

What we can use

Full text

Blogs in common

Links in common

History of infection
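The slides name SVMs and logistic regression as candidate classifiers for the link-inference problem. A rough sketch of the logistic-regression variant follows, assuming each blog pair has already been reduced to a small numeric feature vector. The feature encoding, the toy training data, and the helper names are all hypothetical and not taken from the talk.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights (bias first) by plain batch gradient descent."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            err = p - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

def link_probability(w, x):
    """Estimated probability that information passed between the pair."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

# Hypothetical features per (source, target) blog pair, each scaled to [0, 1]:
# [full-text similarity, blogs in common, links in common, history of infection]
TRAIN_X = [
    [0.9, 0.8, 0.7, 0.9],
    [0.8, 0.6, 0.9, 0.7],
    [0.7, 0.9, 0.6, 0.8],
    [0.1, 0.2, 0.1, 0.3],
    [0.2, 0.1, 0.3, 0.1],
    [0.3, 0.2, 0.2, 0.2],
]
TRAIN_Y = [1, 1, 1, 0, 0, 0]  # 1 = known link, 0 = known non-link (made up)

weights = train_logistic(TRAIN_X, TRAIN_Y)
likely = link_probability(weights, [0.85, 0.7, 0.8, 0.8])
unlikely = link_probability(weights, [0.15, 0.1, 0.2, 0.2])
```

In this toy setup, pairs that resemble the positive examples score above 0.5 and pairs that resemble the negatives score below it. A real classifier would need labeled pairs drawn from explicit links.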

Zoomgraph tool

Using GraphViz (by AT&T) layouts

Simple algorithm

If single, explicit link exists, draw it

Otherwise use ML algorithm

Pick the most likely explicit link

Pick the most likely possible link
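One possible reading of the simple algorithm, sketched under the assumption that a classifier score is available for every candidate source blog. The function name, the score table, and the tie-breaking behavior are hypothetical.

```python
def choose_parent(explicit, candidates, score):
    """Pick the most plausible source for one infected blog.

    explicit   -- blogs the infected blog links to explicitly
    candidates -- other blogs that were infected earlier (possible links)
    score      -- callable returning the classifier's likelihood for a blog
    """
    if len(explicit) == 1:
        return explicit[0], "explicit"           # single explicit link: draw it
    if explicit:
        return max(explicit, key=score), "explicit"   # most likely explicit link
    if candidates:
        return max(candidates, key=score), "inferred"  # most likely possible link
    return None, "none"

# Hypothetical classifier scores for candidate sources of one infected blog.
SCORES = {"b1": 0.9, "b2": 0.4, "b3": 0.7}

single = choose_parent(["b2"], ["b1", "b3"], SCORES.get)
several = choose_parent(["b2", "b3"], ["b1"], SCORES.get)
none_explicit = choose_parent([], ["b1", "b2"], SCORES.get)
```

Here a lone explicit link wins even when an inferred candidate scores higher, which matches the order of the steps on the slide.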

Tool lets you zoom around space, control threshold, link types, etc.

http://www-idl.hpl.hp.com/blogstuff

Giant Microbes epidemic visualization

[Figure legend: blog; explicit link; via link; inferred link]

Find early sources of good information

using inferred information paths or timing

[Diagram: blogs b1 (true source), b2 (popular site), b3, b4, b5, …, bn]

- Draw a weighted edge for all pairs of blogs that cite the same URL
- higher weight for mentions closer together
- run PageRank
- control for ‘spam’
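The ranking steps listed above can be sketched roughly as follows. The edge direction (from the later citing blog to the earlier one), the weight formula 1 / (1 + gap in hours), the damping factor, and the toy citation data are all assumptions made for illustration. The slides do not specify them, and the spam control step is not implemented here.

```python
from collections import defaultdict
from itertools import combinations

def build_edges(citations, hour=3600.0):
    """citations: {url: [(blog, unix_time), ...]} -> {later: {earlier: weight}}"""
    edges = defaultdict(lambda: defaultdict(float))
    for mentions in citations.values():
        ordered = sorted(mentions, key=lambda m: m[1])
        for (early, t1), (late, t2) in combinations(ordered, 2):
            if early == late:
                continue
            # Assumed weighting: mentions closer together in time count more.
            edges[late][early] += 1.0 / (1.0 + (t2 - t1) / hour)
    return edges

def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            out = edges.get(n, {})
            total = sum(out.values())
            if total == 0.0:
                # Dangling node: spread its rank evenly over all nodes.
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
            else:
                for m, w in out.items():
                    new[m] += damping * rank[n] * w / total
        rank = new
    return rank

# Hypothetical data: times are seconds since the first mention.
CITATIONS = {
    "url_a": [("b1", 0), ("b2", 3600), ("b3", 7200)],
    "url_b": [("b1", 0), ("b3", 1800), ("b4", 9000)],
}
ranks = pagerank(build_edges(CITATIONS))
top_blog = max(ranks, key=ranks.get)
```

With this toy data, b1 cites both URLs first, so every other blog points to it and it ends up with the highest rank.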

[Diagram axis: time of infection, from t0 to t1]

02:00 AM Friday Mar. 05, 2004 PST, Wired publishes:

“Warning: Blogs Can Be Infectious.”

7:25 AM Friday Mar. 05, 2004 PST, Slashdot posts:

“Bloggers' Plagiarism Scientifically Proven”

9:55 AM Friday Mar. 05, 2004 PST, Metafilter announces:

“A good amount of bloggers are outright thieves.”

Information flow in social groups

Fang Wu, Bernardo Huberman, Lada Adamic, Joshua Tyler

Spread of disease is affected by the underlying network

[Diagram: social network with nodes labeled mike, mom, college, friend, and three co-workers]

Spread of computer viruses is affected by the underlying network

[Diagram: social network with nodes labeled mike, mom, college, friend, and three co-workers]

Difference between information flow and disease/virus spread

Viruses (computer and otherwise) are shared indiscriminately (involuntarily)

Information is passed selectively from one host to another based on knowledge of the recipient’s interests

Spread of information is affected by its content, potential recipients, and network topology

[Diagram: social network with nodes labeled mike, mom, college, friend, and three co-workers]

homophily: individuals with like interests associate with one another

[Figure: personal homepages at Stanford; distance between personal homepages, m = 0, 1, 2]

The Model:

Decay in transmission probability as a function of the distance m between potential target and originating node

$T(m) = (m+1)^{-\beta}\, T$

power-law implies slowest decay
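A small numeric illustration of the decay, reading the garbled exponent in the model as a power law in (m + 1). The base probability, the exponent value, and the exponential form used for comparison are assumptions chosen only to show why a power law decays slowly.

```python
import math

def power_law_decay(base_t, m, beta):
    """Transmission probability at distance m: (m + 1) ** -beta times base_t."""
    return base_t * (m + 1) ** -beta

def exponential_decay(base_t, m, beta):
    """Comparison only; this alternative form is an assumption, not from the slides."""
    return base_t * math.exp(-beta * m)

BASE_T = 0.5   # hypothetical transmission probability at distance 0
BETA = 1.0     # hypothetical decay exponent

power_law = [power_law_decay(BASE_T, m, BETA) for m in range(6)]
exponential = [exponential_decay(BASE_T, m, BETA) for m in range(6)]
```

With these values both forms start at 0.5 for m = 0, and at every larger distance the power-law value stays above the exponential one.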

Virus, information transmission on a scale free network

[Figure: P(k) vs. outdegree k; degree distribution of all senders of email passing through the HP email server]

Wu et al. (2004)

Newman (2002)

Pastor-Satorras & Vespignani (2001)

epidemics on scale free graphs

$10^6$ nodes, epidemic if 1% ($10^4$) infected

[Figure: critical threshold (0 to 1) vs. α (1 to 4), with curves for κ = ∞, β = 0; κ = 100, β = 0; κ = 100, β = 1]

Study of the spread of URLs and attachments

40 participants (30 within HPL, 10 elsewhere in HP & other orgs)

6370 URLs and 3401 attachments cryptographically hashed

Question: How many recipients in our sample did each item reach?

caveats:

messages are deleted (still, the median number of messages > 2000)

non-uniform sample
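A minimal sketch of how hashed items could be counted per recipient to answer the question above. SHA-256, the message layout, and the sample data are assumptions; the slides say only that items were cryptographically hashed.

```python
import hashlib
from collections import Counter, defaultdict

def item_key(item):
    """One-way fingerprint, so the raw URL or attachment need not be stored."""
    return hashlib.sha256(item.encode("utf-8")).hexdigest()

def recipients_per_item(messages):
    """messages: iterable of (recipient, [items]) -> {hash: distinct recipients}"""
    seen = defaultdict(set)
    for recipient, items in messages:
        for item in items:
            seen[item_key(item)].add(recipient)
    return {key: len(people) for key, people in seen.items()}

def recipient_histogram(counts):
    """Number of items that reached exactly n recipients, for each n."""
    return Counter(counts.values())

# Hypothetical sample of received messages.
MESSAGES = [
    ("alice", ["http://example.com/a", "http://example.com/b"]),
    ("bob", ["http://example.com/a"]),
    ("carol", ["http://example.com/a", "report.pdf"]),
    ("alice", ["http://example.com/a"]),  # a repeat recipient counts once
]
counts = recipients_per_item(MESSAGES)
histogram = recipient_histogram(counts)
```

In practice an attachment would be hashed by its bytes rather than its name; strings are used here only to keep the example short.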

[Diagram labels: forwarded message; forwarded URLs]

Only forwarded messages are counted

[Figure: log-log plot of the number of items with so many recipients vs. number of recipients; fits: email attachments $x^{-4.1}$, URLs $x^{-3.6}$]

Results

average = 1.1 for attachments, and 1.2 for URLs

[Figure annotations: short term expense control; ads at the bottom of hotmail & yahoo messages]

Simulate transmission on email log

each message has a probability p of transmitting information from an infected individual to the recipient

date        time      sender  recipient
02/19/2003  15:45:33  I-1     I-2
02/19/2003  15:45:33  I-1     I-3
02/19/2003  15:45:40  E-1     I-4
02/19/2003  15:45:52  I-5     E-2
02/19/2003  15:45:55  E-3     I-6
02/19/2003  15:45:58  I-7     I-8
02/19/2003  15:46:00  E-4     I-9
02/19/2003  15:46:05  I-10    I-11
02/19/2003  15:46:10  I-12    I-13
02/19/2003  15:46:10  I-12    I-14
02/19/2003  15:46:10  I-12    I-15
02/19/2003  15:46:14  I-16    E-5
. . . .

I = internal node, E = external node

Simulation of information transmission on the actual HP Labs email graph

an individual is infected if they receive a particular piece of information

individuals remain infected for 24 hours

start by infecting one individual at random

every time an infected individual sends an email they have a probability p of infecting the recipient

track epidemic over the course of a week, most run their course in 1-2 days
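The simulation rules above can be sketched as a single pass over a time-ordered log. The whitespace-separated line layout, the function names, the toy log, and the choice that an individual is never reinfected are assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta

def parse_log(lines):
    """Each line: 'MM/DD/YYYY HH:MM:SS sender recipient' (assumed layout)."""
    events = []
    for line in lines:
        date, clock, sender, recipient = line.split()
        when = datetime.strptime(date + " " + clock, "%m/%d/%Y %H:%M:%S")
        events.append((when, sender, recipient))
    return sorted(events)

def simulate(events, seed_node, p, rng, infectious=timedelta(hours=24)):
    """Return the set of everyone who was ever infected."""
    infected_at = {seed_node: events[0][0]}   # seed is infected at the start
    for when, sender, recipient in events:
        start = infected_at.get(sender)
        if start is None or when - start > infectious:
            continue  # sender was never infected, or is no longer infectious
        if recipient not in infected_at and rng.random() < p:
            infected_at[recipient] = when
    return set(infected_at)

# Hypothetical log in the same spirit as the excerpt on the slide.
LOG = [
    "02/19/2003 15:45:33 I-1 I-2",
    "02/19/2003 15:45:40 I-2 I-3",
    "02/19/2003 15:45:52 I-4 I-5",
    "02/21/2003 09:00:00 I-1 I-6",
]
events = parse_log(LOG)
everyone = simulate(events, "I-1", p=1.0, rng=random.Random(0))
nobody_else = simulate(events, "I-1", p=0.0, rng=random.Random(0))
```

With p = 1 the item reaches I-2 and I-3 but not I-6, because I-1 is no longer infectious two days later. To follow the slide's random start, the seed could be drawn with `rng.choice` over the senders in the log.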

Introduce a decay in the transmission probability based on the hierarchical distance

[Diagram: organizational hierarchy with individuals A and B; links labeled distance 1, distance 1, distance 2, distance 2; $h_{AB} = 5$]

[Figure labels: 7119 potential recipients; p0]

Conclusions on info flow in social groups

Information spread typically does not reach epidemic proportions

Information is passed on to individuals with matching properties

The likelihood that properties match decreases with distance from the source

Model gives a finite threshold

Results are consistent with observed URL & attachment frequencies in a sample

Simulations following real email patterns also consistent

How to search in a small world

[Map labels: NE (Nebraska), MA (Massachusetts)]

Milgram’s experiment:

Given a target individual and a particular property, pass the message to a person you correspond with who is “closest” to the target.

Small world experiment at Columbia

Dodds, Muhamad, Watts, Science 301, (2003)

email experiment conducted in 2002

18 targets in 13 different countries

24,163 message chains

384 reached their targets

average path length 4.0

Why study small world phenomena?

Curiosity:

Why is the world small?

How are people able to route messages?

Social Networking as a Business:

Friendster, Orkut, MySpace

LinkedIn, Spoke, VisiblePath

Six degrees of separation - to be expected

Pool and Kochen (1978) - average person has 500-1500 acquaintances

Ignoring clustering, other redundancy …

~ $10^3$ first neighbors, $10^6$ second neighbors, $10^9$ third neighbors

But networks are clustered:

my friends’ friends tend to be my friends

Watts & Strogatz (1998) - a few random links in an otherwise clustered graph give an average shortest path close to that of a random graph

But how are people able to find short paths?

How to choose among hundreds of acquaintances?

Strategy:

Simple greedy algorithm - each participant chooses the correspondent who is closest to the target with respect to the given property
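A minimal sketch of the greedy strategy on a toy network. The ring-with-shortcut graph, the distance function, and the rule that a node is never revisited are assumptions chosen for illustration.

```python
def greedy_route(graph, distance, source, target, max_steps=100):
    """Forward to the unvisited neighbor closest to the target; None if stuck."""
    path = [source]
    visited = {source}
    current = source
    while current != target and len(path) <= max_steps:
        options = [n for n in graph[current] if n not in visited]
        if not options:
            return None  # dead end: every neighbor has already held the message
        current = min(options, key=lambda n: distance(n, target))
        visited.add(current)
        path.append(current)
    return path if current == target else None

# Hypothetical network: 12 people on a ring, plus one long-range tie 0 <-> 6.
SIZE = 12
GRAPH = {i: {(i - 1) % SIZE, (i + 1) % SIZE} for i in range(SIZE)}
GRAPH[0].add(6)
GRAPH[6].add(0)

def ring_distance(a, b):
    """The 'given property' here is position on the ring."""
    return min((a - b) % SIZE, (b - a) % SIZE)

route = greedy_route(GRAPH, ring_distance, source=0, target=7)
```

Node 0 hands the message to 6 because 6 is closest to the target, and 6 hands it to 7. Each participant uses only knowledge of its own contacts.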

Models

geography

Kleinberg (2000)

hierarchical groups

Watts, Dodds, Newman (2001), Kleinberg (2001)

high degree nodes

Adamic, Puniyani, Lukose, Huberman (2001), Newman (2003)

Spatial search

Kleinberg (2000)

“The geographic movement of the [message] from Nebraska to Massachusetts is striking. There is a progressive closing in on the target area as each new person is added to the chain”

S. Milgram, ‘The small world problem’, Psychology Today 1, 61 (1967)

nodes are placed on a lattice and connect to nearest neighbors

additional links placed with $f(d) \sim d(u,v)^{-r}$

if r = 2, can search in polylog ($< (\log N)^2$) time
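A rough sketch of the lattice model, assuming a two-dimensional n-by-n grid with Manhattan distance and exactly one long-range link per node. The grid size, the number of long-range links, and the helper names are assumptions, not details from the slides.

```python
import random

def lattice_distance(u, v):
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

def link_probabilities(u, n, r):
    """Probability of a long-range link u -> v, proportional to d(u, v) ** -r."""
    others = [(x, y) for x in range(n) for y in range(n) if (x, y) != u]
    weights = [lattice_distance(u, v) ** -r for v in others]
    total = sum(weights)
    return {v: w / total for v, w in zip(others, weights)}

def build_network(n, r, rng):
    """Nearest-neighbor lattice links plus one sampled long-range link per node."""
    net = {}
    for x in range(n):
        for y in range(n):
            near = [(x + dx, y + dy)
                    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= x + dx < n and 0 <= y + dy < n]
            probs = link_probabilities((x, y), n, r)
            far = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
            net[(x, y)] = near + [far]
    return net

def greedy_steps(net, source, target):
    """Hops taken when each node forwards to its contact nearest the target."""
    steps, current = 0, source
    while current != target:
        current = min(net[current], key=lambda v: lattice_distance(v, target))
        steps += 1
    return steps

probs = link_probabilities((0, 0), 8, 2)
net = build_network(8, 2, random.Random(0))
steps = greedy_steps(net, (0, 0), (7, 7))
```

Because every node keeps a lattice neighbor that is one step closer to the target, the greedy walk always terminates, and the long-range links can only shorten it.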

Kleinberg: searching hierarchical structures

‘Small-World Phenomena and the Dynamics of Information’, NIPS 14, 2001

Hierarchical network models:

h is the distance between two individuals in hierarchy with branching b

$f(h) \sim b^{-\alpha h}$

If α = 1, can search in O(log n) steps

Group structure models:

q = size of smallest group that two individuals belong to

$f(q) \sim q^{-\alpha}$

If α = 1, can search in O(log n) steps

Identity and search in social networks

Watts, Dodds, Newman (2001)

individuals belong to hierarchically nested groups

multiple independent hierarchies coexist

$p_{ij} \sim \exp(-\alpha x)$

Identity and search in social networks

Watts, Dodds, Newman (2001)

There is an attrition rate r

Network is ‘searchable’ if a fraction q of messages reach the target

[Figure legend: N = 102400, N = 204800, N = 409600]

High degree search

Adamic et al., Phys. Rev. E 64, 046135 (2001)

[Cartoon: Mary, Bob, and Jane; “Who could introduce me to Richard Gere?”]

[Figure: number of nodes found, power-law graph vs. Poisson graph]

Scaling of search time with size of graph

Sharp cutoff at $k \sim N^{1/\alpha}$, 2nd degree neighbors

[Figure: log-log plot of cover time for half the nodes vs. size of graph; random walk: α = 0.37 fit; degree sequence: α = 0.24 fit]

Testing the models on social networks

(w/Eytan Adar)

Use a well defined network:

HP Labs email correspondence over 3.5 months

Edges are between individuals who sent at least 6 email messages each way

Node properties specified:

degree

geographical location

position in organizational hierarchy
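The edge rule above can be sketched as a filter over (sender, recipient) pairs. The function name and the sample messages are hypothetical; only the threshold of 6 messages each way comes from the slides.

```python
from collections import Counter

def filtered_edges(messages, threshold=6):
    """Keep undirected pairs that exchanged at least `threshold` messages each way.

    messages -- iterable of (sender, recipient) tuples, one per email
    """
    sent = Counter(messages)
    edges = set()
    for (a, b), count in sent.items():
        if a != b and count >= threshold and sent.get((b, a), 0) >= threshold:
            edges.add(frozenset((a, b)))
    return edges

# Hypothetical traffic: A and B correspond heavily in both directions,
# while C rarely answers A.
MESSAGES = (
    [("A", "B")] * 6
    + [("B", "A")] * 7
    + [("A", "C")] * 10
    + [("C", "A")] * 2
)
edges = filtered_edges(MESSAGES)
```

Only the A-B pair survives, since C sent A fewer than 6 messages. Requiring traffic in both directions drops one-way mailings such as announcements.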

Can greedy strategies work?

Strategy 1: High degree search

[Figure: degree distribution (outdegree) of all senders of email passing through the HP email server]

Filtered network (6 messages sent each way)

Degree distribution no longer power-law, but Poisson

450 users

median degree = 10

mean degree = 13

average shortest path = 3

High degree search performance (poor):

median # steps = 16

mean = 40
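A minimal sketch of a high-degree search, assuming each person knows their own contacts and those contacts' degrees, hands the message to the best-connected contact who has not yet seen it, and passes it back on a dead end. The backtracking rule, the step counting, and the toy graph are assumptions rather than the procedure used in the study.

```python
def high_degree_search(graph, source, target, max_steps=1000):
    """Return the number of hops taken to reach target, or None on failure."""
    seen = {source}
    trail = [source]  # current chain of holders, used for backtracking
    steps = 0
    while trail and steps <= max_steps:
        current = trail[-1]
        if current == target:
            return steps
        if target in graph[current]:
            nxt = target  # the target is a direct contact: deliver
        else:
            fresh = [n for n in graph[current] if n not in seen]
            if not fresh:
                trail.pop()  # dead end: hand the message back
                steps += 1
                continue
            nxt = max(fresh, key=lambda n: len(graph[n]))
        seen.add(nxt)
        trail.append(nxt)
        steps += 1
    return None

# Hypothetical network: H is a hub; S knows both H and the poorly connected A.
GRAPH = {
    "S": {"A", "H"},
    "A": {"S"},
    "H": {"S", "B", "C", "T"},
    "B": {"H"},
    "C": {"H"},
    "T": {"H"},
}
hops = high_degree_search(GRAPH, "S", "T")
```

In this toy graph the message goes from S to the hub H and then to T. The strategy pays off when a few hubs reach most of the network, which fits the poor performance reported above for the Poisson-like filtered network.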

Strategy 2: Geography

Communication across corporate geography

87% of the 4000 links are between individuals on the same floor

[Figure: communication between floors 1U, 1L, 2U, 2L, 3U, 3L, 4U]

Cubicle distance vs. probability of being linked

[Figure annotation: optimum for search]

Finding someone in a sea of cubicles

median = 7

mean = 12

Strategy 3: Organizational hierarchy

[Figure panels: actual email correspondence; email correspondence scrambled]

Example of search path

[Diagram: links labeled distance 1, distance 1, distance 2, distance 1; hierarchical distance = 5, search path distance = 4]

Probability of linking vs. distance in hierarchy

in the ‘searchable’ regime: 0 < α < 2 (Watts 2001)

Results

Group size vs. probability of linking

[Figure: probability of linking vs. group size g; annotation: optimum for search (Kleinberg 2001)]

Search Conclusions

- Individuals associate on different levels into groups.
- Group structure facilitates decentralized search using social ties.
- HP Labs as a social network is searchable but not quite optimal.
  - searching using the organizational hierarchy is faster than using physical location
- A fraction of ‘important’ individuals are easily findable
- Humans may be much more resourceful in executing search tasks:
  - making use of weak ties
  - using more sophisticated strategies

PeopleFinder2 – a search engine for HP people

Extract & disambiguate names from publicly available documents

Enrich information available about individuals

Search for them by topic

Identify knowledge communities from co-occurrence of names

Live Demo

If live demo fails:

Current PeopleFinder functionality

PeopleFinder2 info on a person

Extracted topics for a person

Social network

Social network visualization

Search for individuals by topic

Visualize knowledge network

Find social network paths to experts

To find out more:

(papers, slides, other research in the group)

Information dynamics group (IDL) at HP Labs:

http://www.hpl.hp.com/research/idl

List of publications

http://www.hpl.hp.com/personal/Lada_Adamic/research.html