Ranking web sites with real user traffic

Ranking Web Sites with Real User Traffic. Mark Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini, Alessandro Vespignani. Web Search and Data Mining, Stanford, California, February 11, 2008. Outline: data collection, structural properties, behavioral patterns, PageRank validation.


Presentation Transcript


Ranking Web Sites with Real User Traffic

Mark Meiss

Filippo Menczer

Santo Fortunato

Alessandro Flammini

Alessandro Vespignani

Web Search and Data Mining

Stanford, California

February 11, 2008


Outline

  • Data collection

  • Structural properties

  • Behavioral patterns

  • PageRank validation

  • Temporal patterns


Sources for Ranking Data: The Link Graph


Sources for Ranking Data: Dynamic Sources

  • Network flow data

  • Web server logs

  • Toolbars and plugins


Sources for Ranking Data: Packet Inspection

[Diagram labels: ISP; ~100 K users]


Data Collection

[Diagram labels: HTTP (80); 30% @ peak; anonymizer; GET; recorded fields: Host, Path, Referer, User-Agent, Timestamp (h/p/r/a/t); data sets: HUMAN (h/p/r/a/t) and FULL (h/p/r/a/t); requests from IU only]
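The fields recorded for each request (Host, Path, Referer, User-Agent, Timestamp) are enough to turn a request log into a weighted host graph. The following is a minimal sketch of that idea, not the authors' pipeline: the record layout (a dict with `host`, `path`, `referer`, `agent`, `time` keys), the function names, and the choice to count requests without a usable referer as jumps are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def referer_host(referer):
    """Return the host part of a Referer URL, or None if there is none."""
    if not referer:
        return None
    return urlsplit(referer).hostname


def build_host_graph(records):
    """Count clicks per directed edge (referring host -> requested host).

    Requests with no usable referer are tallied separately as jumps.
    Both conventions are assumptions, not taken from the slides.
    """
    weights = Counter()  # (source_host, target_host) -> number of clicks
    jumps = Counter()    # target_host -> requests that arrived without a referer
    for rec in records:
        src = referer_host(rec["referer"])
        dst = rec["host"]
        if src is None:
            jumps[dst] += 1
        else:
            weights[(src, dst)] += 1
    return weights, jumps


# Hypothetical sample records in h/p/r/a/t order.
records = [
    {"host": "a.example", "path": "/", "referer": "", "agent": "Mozilla/5.0", "time": 1},
    {"host": "b.example", "path": "/x", "referer": "http://a.example/", "agent": "Mozilla/5.0", "time": 2},
    {"host": "b.example", "path": "/y", "referer": "http://a.example/p", "agent": "Mozilla/5.0", "time": 3},
]
weights, jumps = build_host_graph(records)
```

Here `weights` holds the link traffic and `jumps` holds arrivals that did not come from following a link.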




Structural properties: Degree


Caveat: Sampling Bias


Structural properties: Strength (Site Traffic)


Structural properties: Weights (Link Traffic)
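The three structural quantities named on these slides (degree, strength, and link weight) can be read off a weighted edge list. The sketch below assumes the host graph is stored as a dict mapping `(source, target)` pairs to click counts; that representation and the function name are illustrative choices, not something the slides specify.

```python
from collections import defaultdict


def degree_and_strength(weights):
    """Return in/out degree and in/out strength for every host.

    Degree counts distinct neighbors; strength sums the traffic (weights)
    on the incident edges.
    """
    k_in, k_out = defaultdict(int), defaultdict(int)
    s_in, s_out = defaultdict(int), defaultdict(int)
    for (src, dst), w in weights.items():
        k_out[src] += 1   # one more distinct outgoing link
        k_in[dst] += 1    # one more distinct incoming link
        s_out[src] += w   # traffic leaving src by clicks
        s_in[dst] += w    # traffic arriving at dst by clicks
    return k_in, k_out, s_in, s_out


# Hypothetical toy graph: edge -> number of clicks.
edge_weights = {("a", "b"): 5, ("a", "c"): 1, ("b", "c"): 2}
k_in, k_out, s_in, s_out = degree_and_strength(edge_weights)
```

In this toy graph, host `a` has out-degree 2 but out-strength 6, which shows how the two measures can diverge when traffic is unevenly spread over links.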




Behavioral patterns (HUMAN)

(Proportion of total out-strength)


Ratios are stable

[Two plots; axis label: Requests (× 10^6)]




Validation of PageRank

  • PageRank (PR) is the stationary distribution of visit frequency for a modified random walk (with jumps) on the Web graph

  • Compare with actual site traffic (in-strength)

  • From an application perspective, we care about the resulting ranking of sites rather than the actual values
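The random walk with jumps described above can be sketched as a power iteration. This is a generic textbook formulation, not code from the talk: the jump probability `alpha=0.15`, the iteration count, and the uniform handling of hosts with no out-links are conventional assumptions, and the toy graph is hypothetical.

```python
def pagerank(out_links, alpha=0.15, iterations=100):
    """Approximate PageRank by power iteration.

    out_links maps each node to the list of nodes it links to.
    alpha is the probability of jumping to a uniformly random node.
    """
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new = {u: alpha / n for u in nodes}  # uniform teleportation share
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                # follow each outgoing link with equal probability
                share = (1 - alpha) * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # assumption: a node without out-links jumps uniformly
                share = (1 - alpha) * rank[u] / n
                for v in nodes:
                    new[v] += share
        rank = new
    return rank


# Hypothetical three-site graph.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
ordering = sorted(ranks, key=ranks.get, reverse=True)
```

Since only the resulting ranking matters for the comparison, `ordering` (sites sorted by score) is what would be set against the ordering of sites by measured in-strength.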


Kendall’s Rank Correlation
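Kendall's rank correlation compares two orderings by counting pairs of items that the two orderings rank the same way (concordant) or opposite ways (discordant). The sketch below implements the simple tau-a variant in pure Python; the slides do not say which variant or tie treatment was used, so that choice is an assumption.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a for two equal-length lists of scores.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four sites: PageRank values vs. measured traffic.
tau = kendall_tau([0.4, 0.3, 0.2, 0.1], [900, 400, 700, 50])
```

A value of 1 means the two rankings agree on every pair, -1 means they are exact reverses, and values near 0 mean the rankings are largely unrelated. This quadratic-time loop is fine for small examples; large site lists would need a faster algorithm.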


PageRank Assumptions

  • Equal probability of teleporting to each of the nodes

  • Equal probability of teleporting from each of the nodes

  • Equal probability of following each link from any given node
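The three assumptions above correspond to the three uniform terms in the standard PageRank equation. The notation here is an assumption chosen for illustration (the slides give no formula): p(i) is the PageRank of node i, N the number of nodes, α the jump probability, and k_out(j) the number of out-links of node j.

```latex
p(i) \;=\; \frac{\alpha}{N} \;+\; (1-\alpha) \sum_{j \to i} \frac{p(j)}{k_{\mathrm{out}}(j)}
```

The factor 1/N makes every node an equally likely teleportation target, the single constant α gives every node the same probability of being a teleportation source, and the factor 1/k_out(j) makes every link out of j equally likely to be followed.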


Kendall’s Rank Correlation


Local Link Heterogeneity

HH (Herfindahl-Hirschman) index of concentration or disparity

[Plot annotations: perfect concentration; perfect homogeneity]
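The Herfindahl-Hirschman index of a node is the sum of the squared shares of its traffic carried by each outgoing link. A minimal sketch follows; the function name and the list-of-weights input are illustrative assumptions.

```python
def hh_index(out_weights):
    """Herfindahl-Hirschman index of a node's outgoing link traffic.

    out_weights lists the traffic on each outgoing link of one node.
    Returns the sum of squared traffic shares.
    """
    total = sum(out_weights)
    if total <= 0:
        raise ValueError("node has no outgoing traffic")
    return sum((w / total) ** 2 for w in out_weights)


homogeneous = hh_index([5, 5, 5, 5])     # traffic split evenly over 4 links
concentrated = hh_index([20])            # all traffic on a single link
skewed = hh_index([97, 1, 1, 1])         # one link dominates
```

With k links carrying equal traffic the index equals 1/k (perfect homogeneity); when one link carries everything it equals 1 (perfect concentration). The equal-probability link assumption of PageRank corresponds to the 1/k case.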


Teleportation Target Heterogeneity


Teleportation Source Heterogeneity (“hubness”)

[Plot annotations: s_out > s_in: popular hubs; s_out < s_in: teleport sources, browsing sinks; -2]


Navigation vs. Jumps: Sources of Popularity




Temporal patterns

How predictable are traffic patterns?

  • Cache refreshing (e.g., proxies)

  • Capacity allocation (e.g., peering and provisioning for spikes)

  • Site design (e.g., expose content based on time of day)


Temporal patterns

  • Predict future host graph (clicks) from current one, as a function of delay

  • Generalized temporal precision and recall:
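The formula that accompanied this bullet did not survive in the transcript. One plausible way to generalize precision and recall to weighted click graphs is to treat the current edge weights as the prediction, the later edge weights as the truth, and the per-edge minimum as the correctly predicted traffic. The sketch below implements that reading; it is an assumption for illustration and may differ from the authors' definition.

```python
def temporal_precision_recall(current, future):
    """Weighted precision and recall of `current` as a predictor of `future`.

    Both arguments map an edge (source, target) to its click count.
    The overlap on each edge is the smaller of the two counts.
    """
    predicted = sum(current.values())
    actual = sum(future.values())
    if predicted == 0 or actual == 0:
        raise ValueError("both graphs must carry some traffic")
    overlap = sum(min(w, future.get(edge, 0)) for edge, w in current.items())
    precision = overlap / predicted  # share of predicted clicks that occurred
    recall = overlap / actual        # share of later clicks that were predicted
    return precision, recall


# Hypothetical click graphs observed at time t and at time t + delay.
now = {("a", "b"): 4, ("a", "c"): 2}
later = {("a", "b"): 2, ("b", "c"): 2}
precision, recall = temporal_precision_recall(now, later)
```

Repeating the computation for increasing delays would give precision and recall as functions of delay, which is the kind of curve the bullet above refers to.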


HUMAN host graph (FULL is about 10% more predictable)


Summary

  • Heterogeneity: incoming and outgoing site traffic, link traffic

  • Less than half of traffic is from following links

  • Only 5% of traffic is directly from search engines

  • High temporal regularity

  • PageRank is a poor predictor of traffic: random walk and random teleportation assumptions violated


Next

  • Sampling bias and search bias

  • From host graph to page graph

  • Modeling traffic: Beyond random walk?


CNLL

THANKS!

?

Mark Meiss

Filippo Menczer

Santo Fortunato

Alessandro Flammini

Alessandro Vespignani

