Slide1 l.jpg

Information dynamics in the networked world

Lada Adamic, HP Labs, Palo Alto, CA


Slide2 l.jpg

Talk outline

Information flow through blogs

Information flow through email

Search through email networks

Search within the enterprise

Search in an online community


Implicit structure and dynamics of blogspace eytan adar li zhang lada adamic rajan lukose l.jpg

Blog use:

Record real-world and virtual experiences

Note and discuss things “seen” on the net

Blog structure: blog-to-blog linking

Use + Structure

Great to track “memes” (catchy ideas)

Implicit Structure and Dynamics of BlogSpace

Eytan Adar, Li Zhang, Lada Adamic, & Rajan Lukose


Approaches and uses of blog analysis l.jpg

Patterns of information flow

How does the popularity of a topic evolve over time?

Who is getting information from whom?

Ranking algorithms that take advantage of transmission patterns

Approaches and uses of blog analysis


Tracking popularity over time l.jpg

Slashdot Effect

BoingBoing Effect

Tracking popularity over time

Popularity

Time

Blogdex, BlogPulse, etc. track the most popular links/phrases of the day


Different kinds of information have different popularity profiles l.jpg

Different kinds of information have different popularity profiles

[Figure: four panels showing % of hits received on each day since first appearance (y-axis 0 to 1, x-axis days 5 to 15). Panels: major-news site (editorial content), back of the paper; products, etc.; Slashdot postings; front-page news]
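A popularity profile of the kind plotted on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `popularity_profile`, the 15-day window, and the sample dates are all assumptions made for the example.

```python
from collections import Counter
from datetime import date

def popularity_profile(hit_dates, max_days=15):
    """Fraction of all hits received on each day since first appearance."""
    first = min(hit_dates)
    # Count hits by the number of whole days elapsed since the first hit.
    counts = Counter((d - first).days for d in hit_dates)
    total = len(hit_dates)
    return [counts.get(day, 0) / total for day in range(max_days + 1)]

# Hypothetical citation dates for one URL (not data from the talk).
hits = [date(2004, 3, 5)] * 6 + [date(2004, 3, 6)] * 3 + [date(2004, 3, 8)]
profile = popularity_profile(hits)
```

A profile that is concentrated in the first day or two would correspond to the sharply peaked panels, while a flatter list would correspond to the slowly decaying ones.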


Micro example giant microbes l.jpg

Micro example: Giant Microbes


Microscale dynamics l.jpg

Microscale Dynamics

What do we need to track specific info ‘epidemics’?

Timings

Underlying network

[Diagram: blogs b1, b2, b3 along a time-of-infection axis from t0 to t1]


Microscale dynamics9 l.jpg

Microscale Dynamics

Challenges

Root may be unknown

Multiple possible paths

Uncrawled space, alternate media (email, voice)

No links

[Diagram: blogs b1, b2, b3, ... bn along a time-of-infection axis from t0 to t1, with unknown (?) transmission paths]


Microscale dynamics who is getting info from whom l.jpg

Explicit blog to blog links (easy)

Via links are even better

Implicit/Inferred transfer (harder)

Use ML algorithm for link inference problem

Support Vector Machine (SVM)

Logistic Regression

What we can use

Full text

Blogs in common

Links in common

History of infection

Microscale Dynamics who is getting info from whom


Visualization l.jpg

Zoomgraph tool

Using GraphViz (by AT&T) layouts

Simple algorithm

If single, explicit link exists, draw it

Otherwise use ML algorithm

Pick the most likely explicit link

Pick the most likely possible link

Tool lets you zoom around space, control threshold, link types, etc.

Visualization

http://www-idl.hpl.hp.com/blogstuff


Slide12 l.jpg

Giant Microbes epidemic visualization

via link

inferred link

blog

explicit link


Irank l.jpg

Find early sources of good information

using inferred information paths or timing

iRank

b1

True source

b2

Popular site

b3

b4

b5

bn


Irank algorithm l.jpg

iRank Algorithm

  • Draw a weighted edge for all pairs of blogs that cite the same URL

  • higher weight for mentions closer together

  • run PageRank

  • control for ‘spam’

[Diagram: time-of-infection axis from t0 to t1]
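The iRank steps listed on this slide can be sketched roughly as below. Several choices here are assumptions rather than details from the talk: edges point from the later citer to the earlier one (so that rank accumulates at early sources), the weight is taken as 1 / (1 + time gap), mentions at identical times are skipped, and the spam control step is omitted. PageRank is written out as a plain power iteration so the block has no dependencies.

```python
from itertools import combinations

def build_edges(citations):
    """citations: {url: [(blog, day), ...]}. Returns {(later, earlier): weight}."""
    edges = {}
    for mentions in citations.values():
        for (b1, t1), (b2, t2) in combinations(mentions, 2):
            if b1 == b2 or t1 == t2:
                continue
            later, earlier = (b1, b2) if t1 > t2 else (b2, b1)
            # Assumed weighting: mentions closer together in time count more.
            w = 1.0 / (1.0 + abs(t1 - t2))
            edges[(later, earlier)] = edges.get((later, earlier), 0.0) + w
    return edges

def pagerank(edges, damping=0.85, iters=100):
    """Weighted PageRank by power iteration; dangling mass is spread evenly."""
    nodes = sorted({n for e in edges for n in e})
    out = {n: 0.0 for n in nodes}
    for (src, _), w in edges.items():
        out[src] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        dangling = sum(rank[n] for n in nodes if out[n] == 0.0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new = {n: base for n in nodes}
        for (src, dst), w in edges.items():
            new[dst] += damping * rank[src] * w / out[src]
        rank = new
    return rank

# Hypothetical data: b1 mentions both URLs first.
citations = {
    "url1": [("b1", 0), ("b2", 1), ("b3", 2)],
    "url2": [("b1", 0), ("b3", 1)],
}
ranks = pagerank(build_edges(citations))
```

In this toy input the earliest citer, b1, ends up with the highest rank, which is the behavior the "find early sources" goal asks for.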


Do bloggers kill kittens l.jpg

Do Bloggers Kill Kittens?

02:00 AM Friday Mar. 05, 2004 PST, Wired publishes:

"Warning: Blogs Can Be Infectious."

7:25 AM Friday Mar. 05, 2004 PST, Slashdot posts:

"Bloggers' Plagiarism Scientifically Proven"

9:55 AM Friday Mar. 05, 2004 PST, Metafilter announces:

"A good amount of bloggers are outright thieves."


Slide16 l.jpg

Information flow in social groups

Fang Wu, Bernardo Huberman, Lada Adamic, Joshua Tyler


Slide17 l.jpg

Spread of disease is affected by the underlying network

[Diagram: social network around mike: mom, college friend, co-workers]


Slide18 l.jpg

Spread of computer viruses is affected by the underlying network

[Diagram: social network around mike: mom, college friend, co-workers]


Slide19 l.jpg

Difference between information flow and disease/virus spread

Viruses (computer and otherwise) are shared indiscriminately (involuntarily)

Information is passed selectively from one host to another based on knowledge of the recipient’s interests


Slide20 l.jpg

Spread of information is affected by its content, potential recipients, and network topology

[Diagram: social network around mike: mom, college friend, co-workers]


Slide21 l.jpg

homophily: individuals with like interests associate with one another

personal homepages at Stanford

distance between personal homepages


Slide22 l.jpg

The Model:

Decay in transmission probability as a function of the distance m between potential target and originating node

$T(m) = (m+1)^{-\beta}\, T$

power-law implies slowest decay

[Diagram: nodes at distances m = 0, 1, 2 from the originating node]
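As a small worked example of the decay law, reading the slide's formula as T(m) = (m+1)^(-β) T: the function name and the sample values T = 0.5 and β = 1 below are assumptions chosen only for illustration.

```python
def transmissibility(m, base_t, beta):
    """T(m) = (m + 1) ** (-beta) * T for a target at distance m from the source."""
    return (m + 1) ** (-beta) * base_t

# Illustrative values only: T = 0.5, beta = 1.
probs = [transmissibility(m, base_t=0.5, beta=1.0) for m in range(3)]
```

With β = 0 the function returns the undecayed T at every distance, which recovers the ordinary epidemic case where transmission does not depend on how far the target is from the originator.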


Slide23 l.jpg

Virus, information transmission on a scale free network

P(k)

outdegree k

Degree distribution of all senders of email passing through the HP email server


Slide24 l.jpg

epidemics on scale free graphs

10^6 nodes, epidemic if 1% (10^4) infected

[Figure: critical threshold (0 to 1) vs. exponent α (1 to 4) for three cases: κ = ∞, β = 0; κ = 100, β = 0; κ = 100, β = 1. Curve labels: Pastor-Satorras & Vespignani (2001), Newman (2002), Wu et al. (2004)]


Slide25 l.jpg

Study of the spread of URLs and attachments

40 participants (30 within HPL, 10 elsewhere in HP & other orgs)

6370 URLs and 3401 attachments cryptographically hashed

Question: How many recipients in our sample did each item reach?

caveats:

messages are deleted (still, the median number of messages > 2000)

non-uniform sample


Slide26 l.jpg

forwarded

message

forwarded URLs

Only forwarded messages are counted


Slide27 l.jpg

Results

average = 1.1 for attachments, and 1.2 for URLs

[Figure: log-log plot of number of items with so many recipients vs. number of recipients. Fits: x^-4.1 for email attachments, x^-3.6 for URLs. Annotations: "short term expense control"; ads at the bottom of hotmail & yahoo messages]


Slide28 l.jpg

Simulate transmission on email log

each message has a probability p of transmitting information from an infected individual to the recipient

02/19/2003 15:45:33 I-1 I-2
02/19/2003 15:45:33 I-1 I-3
02/19/2003 15:45:40 E-1 I-4
02/19/2003 15:45:52 I-5 E-2
02/19/2003 15:45:55 E-3 I-6
02/19/2003 15:45:58 I-7 I-8
02/19/2003 15:46:00 E-4 I-9
02/19/2003 15:46:05 I-10 I-11
02/19/2003 15:46:10 I-12 I-13
02/19/2003 15:46:10 I-12 I-14
02/19/2003 15:46:10 I-12 I-15
02/19/2003 15:46:14 I-16 E-5
. . . .
. . . .

[Labels: I-n = internal node, E-n = external node]


Slide29 l.jpg

Simulation of information transmission on the actual HP Labs email graph

an individual is infected if they receive a particular piece of information

individuals remain infected for 24 hours

start by infecting one individual at random

every time an infected individual sends an email they have a probability p of infecting the recipient

track epidemic over the course of a week, most run their course in 1-2 days
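The simulation rules on this slide can be sketched as a single pass over a time-sorted email log. This is an illustrative reading, not the authors' code: the tuple layout (timestamp, sender, recipient), the function name `simulate`, and the choice that an individual is infected at most once are assumptions. The seed is passed in explicitly; choosing it at random, as the slide describes, is left to the caller.

```python
import random
from datetime import datetime, timedelta

def simulate(log, seed_node, p, rng, infectious=timedelta(hours=24)):
    """log: time-sorted (timestamp, sender, recipient) tuples.
    Returns {node: time of infection}."""
    infected_at = {seed_node: log[0][0]}
    for when, sender, recipient in log:
        t_inf = infected_at.get(sender)
        if t_inf is None or when - t_inf > infectious:
            continue  # sender was never infected, or is no longer infectious
        # Each message from an infected sender transmits with probability p.
        if recipient not in infected_at and rng.random() < p:
            infected_at[recipient] = when
    return infected_at

fmt = "%m/%d/%Y %H:%M:%S"
log = [
    (datetime.strptime("02/19/2003 15:45:33", fmt), "I-1", "I-2"),
    (datetime.strptime("02/19/2003 15:45:33", fmt), "I-1", "I-3"),
    (datetime.strptime("02/19/2003 15:45:40", fmt), "E-1", "I-4"),
]
result = simulate(log, "I-1", p=1.0, rng=random.Random(0))
```

Running the function many times with different seeds and values of p, and recording the size of `result`, would give the kind of outbreak-size curve the talk reports for the HP Labs graph.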


Slide30 l.jpg

Introduce a decay in the transmission probability based on the hierarchical distance

[Diagram: organizational hierarchy with nodes A and B; segments labeled distance 1, distance 1, distance 2, distance 2; h_AB = 5]


Slide31 l.jpg

7119 potential recipients

p0


Slide32 l.jpg

Conclusions on info flow in social groups

Information spread typically does not reach epidemic proportions

Information is passed on to individuals with matching properties

The likelihood that properties match decreases with distance from the source

Model gives a finite threshold

Results are consistent with observed URL & attachment frequencies in a sample

Simulations following real email patterns also consistent


Slide33 l.jpg

How to search in a small world

Milgram’s experiment:

Given a target individual and a particular property, pass the message to a person you correspond with who is “closest” to the target.

[Map: message chains from Nebraska (NE) to Massachusetts (MA)]


Slide34 l.jpg

Small world experiment at Columbia

Dodds, Muhamad, Watts, Science 301, (2003)

email experiment conducted in 2002

18 targets in 13 different countries

24,163 message chains

384 reached their targets

average path length 4.0


Slide35 l.jpg

Why study small world phenomena?

Curiosity:

Why is the world small?

How are people able to route messages?

Social Networking as a Business:

Friendster, Orkut, MySpace

LinkedIn, Spoke, VisiblePath


Slide36 l.jpg

Six degrees of separation - to be expected

Pool and Kochen (1978) - average person has 500-1500 acquaintances

Ignoring clustering, other redundancy …

~ 10^3 first neighbors, 10^6 second neighbors, 10^9 third neighbors

But networks are clustered:

my friends’ friends tend to be my friends

Watts & Strogatz (1998) - a few random links in an otherwise clustered graph give an average shortest path close to that of a random graph
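The back-of-the-envelope estimate that ignores clustering can be written out as a one-line calculation. The figure of 1000 acquaintances is a round number within the 500-1500 range quoted on the slide, chosen here for illustration.

```python
def neighbors_at_distance(acquaintances, steps):
    """Upper bound on people reached at exactly `steps` hops, assuming no overlap."""
    return acquaintances ** steps

# With about 1000 acquaintances each and no shared friends:
counts = [neighbors_at_distance(1000, s) for s in (1, 2, 3)]
```

Because real friendship circles overlap heavily, these counts are upper bounds, which is why the random-shortcut argument of Watts & Strogatz is needed to explain short paths in clustered networks.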


Slide37 l.jpg

But how are people able to find short paths?

How to choose among hundreds of acquaintances?

Strategy:

Simple greedy algorithm - each participant chooses the correspondent who is closest to the target with respect to the given property

Models

geography

Kleinberg (2000)

hierarchical groups

Watts, Dodds, Newman (2001), Kleinberg(2001)

high degree nodes

Adamic, Puniyani, Lukose, Huberman (2001), Newman(2003)
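The greedy strategy described on this slide can be sketched generically: the holder of the message forwards it to whichever neighbor is closest to the target under some distance function. The graph representation, the step limit, and the ring-with-shortcut example are assumptions for illustration; the `distance` argument stands in for geography, hierarchy, or any other property.

```python
def greedy_search(graph, source, target, distance, max_steps=100):
    """graph: {node: set of neighbors}; distance(a, b) -> number.
    Each holder forwards to the neighbor closest to the target."""
    path = [source]
    current = source
    while current != target and len(path) <= max_steps:
        if not graph[current]:
            return None  # dead end
        current = min(graph[current], key=lambda nb: distance(nb, target))
        path.append(current)
    return path if current == target else None

# Hypothetical example: a ring of 12 nodes plus one long-range shortcut.
n = 12
graph = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
graph[0].add(5)
graph[5].add(0)

def ring_distance(a, b):
    return min((a - b) % n, (b - a) % n)

path = greedy_search(graph, 0, 6, ring_distance)
```

The step limit is there because a purely greedy rule is not guaranteed to make progress on an arbitrary graph; on the ring it always does, since some neighbor is strictly closer to the target.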


Slide38 l.jpg

Spatial search

Kleinberg (2000)

“The geographic movement of the [message] from Nebraska to Massachusetts is striking. There is a progressive closing in on the target area as each new person is added to the chain”

S. Milgram, ‘The small world problem’, Psychology Today 1, 61, 1967

nodes are placed on a lattice and connect to nearest neighbors

additional links placed with $f(d) \sim d(u,v)^{-r}$

if r = 2, can search in polylog ($< (\log N)^2$) time
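Placing the additional links can be sketched as weighted sampling over lattice distance, reading the slide's rule as link probability proportional to d(u,v)^(-r). The square-grid layout, the Manhattan distance, and the function names are assumptions for this example.

```python
import random

def lattice_distance(u, v):
    """Manhattan distance between two points on a square lattice."""
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

def sample_long_range_link(u, side, r, rng):
    """Pick one long-range contact for u with probability proportional to d(u, v) ** -r."""
    others = [(x, y) for x in range(side) for y in range(side) if (x, y) != u]
    weights = [lattice_distance(u, v) ** -r for v in others]
    return rng.choices(others, weights=weights, k=1)[0]

# One sampled contact for the corner node of a 5 x 5 lattice, with r = 2.
contact = sample_long_range_link((0, 0), side=5, r=2, rng=random.Random(0))
```

Setting r = 0 makes every other node equally likely, while a large r concentrates the links on immediate neighbors; r = 2 on a two-dimensional lattice is the intermediate case the slide singles out.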


Slide39 l.jpg

Kleinberg: searching hierarchical structures

‘Small-World Phenomena and the Dynamics of Information’, NIPS 14, 2001

Hierarchical network models:

h is the distance between two individuals in hierarchy with branching b

$f(h) \sim b^{-\alpha h}$

If $\alpha = 1$, can search in O(log n) steps

Group structure models:

q = size of smallest group that two individuals belong to

$f(q) \sim q^{-\alpha}$

If $\alpha = 1$, can search in O(log n) steps
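For the hierarchical model, a small sketch can make the distance concrete. It assumes that individuals are the leaves of a complete b-ary tree numbered left to right, and that h is the height of their lowest common ancestor; the function names and numbering scheme are illustrative choices.

```python
def lca_height(v, w, b):
    """Height of the lowest common ancestor of leaves v and w in a complete b-ary tree."""
    h = 0
    while v != w:
        # Move both leaves up one level: the parent index is the child index // b.
        v, w = v // b, w // b
        h += 1
    return h

def link_weight(v, w, b, alpha):
    """Unnormalized link probability f(h) ~ b ** (-alpha * h)."""
    return b ** (-alpha * lca_height(v, w, b))

# Leaves 0 and 3 of a binary tree share a grandparent, so h = 2.
weight = link_weight(0, 3, b=2, alpha=1.0)
```

Normalizing these weights over all other leaves gives the probability with which a given individual links to each of the others.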


Slide40 l.jpg

Identity and search in social networks

Watts, Dodds, Newman (2001)

individuals belong to hierarchically nested groups

multiple independent hierarchies coexist

$p_{ij} \sim \exp(-\alpha x)$


Slide41 l.jpg

Identity and search in social networks

Watts, Dodds, Newman (2001)

There is an attrition rate r

Network is ‘searchable’ if a fraction q of messages reach the target

N=102400

N=204800

N=409600


Slide42 l.jpg

High degree search

Adamic et al., Phys. Rev. E 64, 046135 (2001)

[Diagram: "Who could introduce me to Richard Gere?" with contacts Mary, Bob, Jane]
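A high-degree search can be sketched as follows, under simplifying assumptions that are not taken from the paper: each node knows its neighbors and their degrees, the message never revisits a node, and the search stops as soon as the target is a neighbor of the current holder. The toy graph reuses the names from the slide and adds hypothetical ones (Ann, Richard, me).

```python
def high_degree_search(graph, source, target, max_steps=1000):
    """Pass the message to the highest-degree neighbor not yet visited.
    Returns the number of steps taken, or None if the search gets stuck."""
    visited = {source}
    current = source
    steps = 0
    while steps < max_steps:
        if current == target:
            return steps
        if target in graph[current]:
            return steps + 1  # the current holder knows the target directly
        candidates = [nb for nb in graph[current] if nb not in visited]
        if not candidates:
            return None
        current = max(candidates, key=lambda nb: len(graph[nb]))
        visited.add(current)
        steps += 1
    return None

# Hypothetical acquaintance graph.
graph = {
    "me": {"Bob", "Jane"},
    "Bob": {"me"},
    "Jane": {"me", "Mary", "Ann"},
    "Mary": {"Jane", "Richard"},
    "Ann": {"Jane"},
    "Richard": {"Mary"},
}
steps = high_degree_search(graph, "me", "Richard")
```

The strategy pays off when a few nodes have very many contacts, as in a power-law graph, because reaching a hub exposes a large share of the network in one step.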


Slide43 l.jpg

[Figure: power-law graph; nodes labeled with number of nodes found: 67, 63, 54, 1, 94, 6, 2]

Slide44 l.jpg

[Figure: Poisson graph; nodes labeled with number of nodes found: 19, 15, 11, 7, 3, 1, 93]


Slide45 l.jpg

Scaling of search time with size of graph

Sharp cutoff at $k \sim N^{1/\alpha}$, 2nd degree neighbors

[Figure: log-log plot of cover time for half the nodes (10^0 to 10^3) vs. size of graph (10^1 to 10^5). Series: random walk, α = 0.37 fit; degree sequence, α = 0.24 fit]


Slide46 l.jpg

Testing the models on social networks

(w/Eytan Adar)

Use a well defined network:

HP Labs email correspondence over 3.5 months

Edges are between individuals who sent at least 6 email messages each way

Node properties specified:

degree

geographical location

position in organizational hierarchy

Can greedy strategies work?
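The edge rule on this slide (at least 6 messages each way) can be sketched as a filter over (sender, recipient) pairs. The input format and the function name are assumptions for the example; the threshold of 6 is the one stated on the slide.

```python
from collections import Counter

def build_filtered_graph(messages, min_each_way=6):
    """messages: iterable of (sender, recipient). Returns a set of undirected edges."""
    sent = Counter(messages)
    edges = set()
    for (a, b), count in sent.items():
        # Keep the pair only if both directions meet the threshold.
        if a != b and count >= min_each_way and sent.get((b, a), 0) >= min_each_way:
            edges.add(frozenset((a, b)))
    return edges

# Hypothetical log: A and B correspond heavily; A mostly broadcasts to C.
messages = [("A", "B")] * 6 + [("B", "A")] * 7 + [("A", "C")] * 10 + [("C", "A")] * 2
edges = build_filtered_graph(messages)
```

Requiring reciprocity drops one-way traffic such as announcements and mailing lists, leaving only pairs with an ongoing two-way correspondence.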


Slide47 l.jpg

Strategy 1: High degree search

Degree distribution of all senders of email passing through the HP email server

outdegree


Slide48 l.jpg

Filtered network (6 messages sent each way)

Degree distribution no longer power-law, but Poisson

450 users

median degree = 10

mean degree = 13

average shortest path = 3

High degree search performance (poor):

median # steps = 16

mean = 40


Slide49 l.jpg

Strategy 2:

Geography


Slide50 l.jpg

Communication across corporate geography

87% of the 4000 links are between individuals on the same floor

[Figure: communication between floors labeled 1U, 1L, 2U, 2L, 3U, 3L, 4U]


Slide51 l.jpg

optimum for search

Cubicle distance vs. probability of being linked


Slide52 l.jpg

Finding someone in a sea of cubicles

median = 7

mean = 12


Slide53 l.jpg

Strategy 3: Organizational hierarchy


Slide54 l.jpg

Email correspondence scrambled


Slide55 l.jpg

Actual email correspondence


Slide56 l.jpg

Example of search path

hierarchical distance = 5

search path distance = 4

[Diagram: search path through the hierarchy with segments labeled distance 1, distance 1, distance 2, distance 1]


Slide57 l.jpg

Probability of linking vs. distance in hierarchy

in the ‘searchable’ regime: 0 < α < 2 (Watts 2001)


Slide58 l.jpg

Results


Slide59 l.jpg

Group size vs. probability of linking


Slide60 l.jpg

Group size and probability of linking

[Figure: probability of linking vs. group size g, with the optimum for search (Kleinberg 2001) marked]


Slide61 l.jpg

Search Conclusions

  • Individuals associate on different levels into groups.

  • Group structure facilitates decentralized search using social ties.

  • HP Labs as a social network is searchable but not quite optimal.

  • Searching using the organizational hierarchy is faster than using physical location.

  • A fraction of ‘important’ individuals are easily findable.

  • Humans may be much more resourceful in executing search tasks:

    • making use of weak ties

    • using more sophisticated strategies


Slide62 l.jpg

PeopleFinder2 – a search engine for HP people

Extract & disambiguate names from publicly available documents

Enrich information available about individuals

Search for them by topic

Identify knowledge communities from co-occurrence of names

Live Demo

If live demo fails:

Current PeopleFinder functionality

PeopleFinder2 info on a person

Extracted topics for a person

Social network

Social network visualization

Search for individuals by topic

Visualize knowledge network

Find social network paths to experts


Slide63 l.jpg

To find out more:

(papers, slides, other research in the group)

Information dynamics group (IDL) at HP Labs:

http://www.hpl.hp.com/research/idl

List of publications

http://www.hpl.hp.com/personal/Lada_Adamic/research.html

