Ranking web sites with real user traffic

Ranking Web Sites with Real User Traffic. Mark Meiss Filippo Menczer Santo Fortunato Alessandro Flammini Alessandro Vespignani. Web Search and Data Mining Stanford, California February 11, 2008. Outline. Data collection Structural properties Behavioral patterns PageRank validation

Presentation Transcript

Ranking Web Sites with Real User Traffic

Mark Meiss

Filippo Menczer

Santo Fortunato

Alessandro Flammini

Alessandro Vespignani

Web Search and Data Mining

Stanford, California

February 11, 2008


Outline

  • Data collection

  • Structural properties

  • Behavioral patterns

  • PageRank validation

  • Temporal patterns




Sources for Ranking Data: Dynamic Sources

  • Network flow data

  • Web server logs

  • Toolbars and plugins



Sources for Ranking Data: Packet Inspection

(Diagram labels: ISP; ~100 K users)


Data Collection

(Diagram of the collection pipeline. Labels: HTTP (80); 30% @ peak; anonymizer; GET; fields Host, Path, Referer, User-Agent, Timestamp; data sets HUMAN (h/p/r/a/t) and FULL (h/p/r/a/t); annotation: requests from IU only.)
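
As a rough illustration of the fields named on this slide, the sketch below pulls host, path, referer, agent, and timestamp out of one raw HTTP GET request. Reading "h/p/r/a/t" as those five fields is an assumption, as are the function name `parse_get_request` and the record layout; the anonymizer step is not modeled, and this is not the authors' collection code.

```python
from typing import Optional


def parse_get_request(raw: str, timestamp: float) -> Optional[dict]:
    """Return a host/path/referer/agent/timestamp record for one GET request.

    `raw` is the request line plus headers, separated by CRLF.
    Returns None for anything that is not a well-formed GET request line.
    """
    lines = raw.split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3 or parts[0] != "GET":
        return None

    headers = {}
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header section
        name, sep, value = line.partition(":")
        if sep:
            # header names are case-insensitive, so normalize to lowercase
            headers[name.strip().lower()] = value.strip()

    return {
        "host": headers.get("host", ""),
        "path": parts[1],
        "referer": headers.get("referer", ""),
        "agent": headers.get("user-agent", ""),
        "timestamp": timestamp,  # supplied by the capture layer (hypothetical)
    }
```

A request with no Referer header yields an empty `referer` field, which is how a jump (rather than a followed link) would show up in such a record.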





Caveat: Sampling Bias


Structural properties: Strength (Site Traffic)


Structural properties: Weights (Link Traffic)
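
To make the two quantities concrete, here is a minimal sketch that builds a weighted host graph from click records and derives link weights and site strengths. It assumes each record is a dict with `referer` (a URL) and `host` keys; the function names are hypothetical, and skipping requests without a referer is a simplification of this sketch, not something the slides specify.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_graph(records):
    """Count clicks on each directed edge (referring host -> requested host)."""
    weights = Counter()
    for rec in records:
        src = urlsplit(rec["referer"]).hostname  # None when the referer is empty
        dst = rec["host"]
        if src and dst:
            weights[(src, dst)] += 1
    return weights


def strengths(weights):
    """Sum edge weights per site: in-strength (arriving) and out-strength (leaving)."""
    s_in, s_out = Counter(), Counter()
    for (src, dst), w in weights.items():
        s_out[src] += w
        s_in[dst] += w
    return s_in, s_out
```

Here a weight is the traffic on one link and a strength is the total traffic into or out of one site along links.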




Behavioral patterns (HUMAN)

(Proportion of total out-strength)


Ratios are stable

(Plot axis: Requests (× 10^6))




Validation of PageRank

  • PageRank (PR) is the stationary distribution of visit frequency for a modified random walk (with jumps) on the Web graph

  • Compare with actual site traffic (in-strength)

  • From an application perspective, we care about the resulting ranking of sites rather than the actual values
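
The comparison described above can be sketched as follows: compute PageRank by power iteration, then compare the ranking it induces with the ranking by traffic. The damping value 0.85, the use of Kendall's tau (the simple tau-a variant, without tie correction), and all names here are assumptions of this sketch; the slides do not state which correlation measure or parameters were used.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration on a dict {node: [out-neighbors]}.

    `damping` is the probability of following a link; 1 - damping is the
    probability of a uniform jump. Dangling nodes jump uniformly.
    """
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        dangling = sum(pr[u] for u in nodes if not links.get(u))
        nxt = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u, outs in links.items():
            for v in outs:
                nxt[v] += damping * pr[u] / len(outs)
        pr = nxt
    return pr


def kendall_tau(x, y, keys):
    """Kendall tau-a between two score dicts over the same keys."""
    concordant = discordant = 0
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            a = x[keys[i]] - x[keys[j]]
            b = y[keys[i]] - y[keys[j]]
            if a * b > 0:
                concordant += 1
            elif a * b < 0:
                discordant += 1
    pairs = len(keys) * (len(keys) - 1) / 2
    return (concordant - discordant) / pairs
```

A tau near 1 would mean PageRank orders sites the same way traffic does; a value near 0 means the two rankings are essentially unrelated. Only the order matters here, not the score values.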



PageRank Assumptions

  • Equal probability of teleporting to each of the nodes

  • Equal probability of teleporting from each of the nodes

  • Equal probability of following each link from any given node
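
For reference, the three assumptions above are visible in the standard PageRank recurrence. The symbols are not from the slides and are assumed here: d is the probability of following a link, N the number of nodes, and k_j^out the number of out-links of node j (dangling nodes are ignored for simplicity).

```latex
\mathrm{PR}(i) \;=\; \frac{1-d}{N} \;+\; d \sum_{j \to i} \frac{\mathrm{PR}(j)}{k^{\mathrm{out}}_{j}}
```

The term 1/N gives every node the same chance of being a teleport destination; the single constant d makes every node equally likely to be a teleport source; and the factor 1/k_j^out splits the traffic leaving j equally among its links.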



Local Link Heterogeneity

Herfindahl-Hirschman (HH) index of concentration or disparity

(Plot annotations: perfect concentration; perfect homogeneity)
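
The slide's formula is not preserved in this transcript. The standard Herfindahl-Hirschman index is the sum of squared shares; applying it to the traffic on one site's outgoing links, as below, is an assumption of this sketch, and the function name is hypothetical.

```python
def hh_index(out_weights):
    """Sum of squared traffic shares over one site's outgoing links.

    Equals 1.0 when all traffic uses a single link (perfect concentration)
    and 1/k when traffic is split evenly over k links (perfect homogeneity).
    """
    total = sum(out_weights)
    if total <= 0:
        raise ValueError("site has no outgoing traffic")
    return sum((w / total) ** 2 for w in out_weights)
```

Under PageRank's equal-link assumption every site with k out-links would score exactly 1/k, so values well above 1/k indicate that users favor a few links.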



Teleportation Source Heterogeneity (“hubness”)

(Plot annotations: s_out > s_in, popular hubs; s_out < s_in, teleport sources, browsing sinks; -2)


Navigation vs. Jumps: Sources of Popularity




Temporal patterns

How predictable are traffic patterns?

  • Cache refreshing (e.g. proxies)

  • Capacity allocation (e.g. peering and provisioning for spikes)

  • Site design (e.g. expose content based on time of day)



Temporal patterns

  • Predict future host graph (clicks) from current one, as a function of delay

  • Generalized temporal precision and recall:
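
The formulas from this slide are not preserved in the transcript. One natural way to generalize precision and recall to weighted click graphs, assumed here and not confirmed by the source, is to count the per-edge overlap min(w_now, w_later) as the correctly predicted clicks. The function name and the graph representation (a dict from edge to click count) are likewise hypothetical.

```python
def temporal_precision_recall(current, future):
    """Score how well the current host graph predicts a later one.

    Both arguments map an edge (source host, target host) to a click count.
    Precision: share of currently observed clicks that recur later.
    Recall: share of later clicks that were already present now.
    """
    overlap = sum(min(w, future.get(edge, 0)) for edge, w in current.items())
    predicted = sum(current.values())
    actual = sum(future.values())
    precision = overlap / predicted if predicted else 0.0
    recall = overlap / actual if actual else 0.0
    return precision, recall
```

Evaluating this for graphs taken at increasing time offsets would give precision and recall as a function of delay.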


HUMAN host graph (FULL is about 10% more predictable)


Summary

  • Heterogeneity: incoming and outgoing site traffic, link traffic

  • Less than half of traffic is from following links

  • Only 5% of traffic is directly from search engines

  • High temporal regularity

  • PageRank is a poor predictor of traffic: random walk and random teleportation assumptions violated


Next

  • Sampling bias and search bias

  • From host graph to page graph

  • Modeling traffic: Beyond random walk?



CNLL

THANKS!
