Ranking web sites with real user traffic


Ranking Web Sites with Real User Traffic. Mark Meiss Filippo Menczer Santo Fortunato Alessandro Flammini Alessandro Vespignani. Web Search and Data Mining Stanford, California February 11, 2008. Outline. Data collection Structural properties Behavioral patterns PageRank validation


Presentation Transcript


Ranking Web Sites with Real User Traffic

Mark Meiss

Filippo Menczer

Santo Fortunato

Alessandro Flammini

Alessandro Vespignani

Web Search and Data Mining

Stanford, California

February 11, 2008


Outline

  • Data collection

  • Structural properties

  • Behavioral patterns

  • PageRank validation

  • Temporal patterns


Sources for Ranking Data: The Link Graph


Sources for Ranking Data: Dynamic Sources

  • Network flow data

  • Web server logs

  • Toolbars and plugins


Sources for Ranking Data: Packet Inspection

[Diagram labels: ISP, ~100 K users]


Data Collection

[Diagram labels: HTTP (80), 30% @ peak; anonymizer; GET request fields Host, Path, Referer, User-Agent, Timestamp (h/p/r/a/t); data sets HUMAN (h/p/r/a/t) and FULL (h/p/r/a/t); "requests from IU only"]
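As a rough illustration of how the recorded fields (Host, Path, Referer, User-Agent, Timestamp) could be reduced to a host-level click, here is a minimal sketch. The `Click` type and the `parse_request` function are hypothetical names introduced for this example, not part of the authors' pipeline, and the rule that separates HUMAN from FULL is not given on the slide, so it is not modeled here.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class Click(NamedTuple):
    source: Optional[str]  # referring host, or None when the Referer is empty
    target: str            # requested host
    timestamp: float       # seconds since the epoch (assumed unit)


def parse_request(host: str, path: str, referer: str,
                  agent: str, timestamp: float) -> Click:
    """Reduce one h/p/r/a/t record to a host-to-host click.

    Path and User-Agent are accepted for completeness but ignored,
    because this sketch only keeps host-level information.
    """
    # urlsplit(...).hostname is lowercased and is None if no host is present.
    ref_host = urlsplit(referer).hostname if referer else None
    return Click(ref_host, host.lower(), timestamp)
```

A request with an empty Referer yields `source=None`, which a later stage could treat as a jump rather than a followed link.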




Structural properties: Degree


Caveat: Sampling Bias


Structural properties: Strength (Site Traffic)


Structural properties: Weights (Link Traffic)
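The three structural quantities named on these slides (degree, strength, weights) can be computed from a list of host-to-host clicks. The following is a minimal sketch under the assumption that each click is a `(source, target)` pair; `host_graph` is a hypothetical helper name.

```python
from collections import Counter


def host_graph(clicks):
    """Build a weighted host graph from (source, target) click pairs.

    Returns link weights (link traffic), in/out strength (site traffic)
    and in/out degree (number of distinct neighbors).
    """
    weights = Counter(clicks)  # weight of a link = number of clicks on it
    s_in, s_out = Counter(), Counter()
    k_in, k_out = Counter(), Counter()
    for (src, dst), w in weights.items():
        s_out[src] += w   # traffic leaving src through links
        s_in[dst] += w    # traffic arriving at dst through links
        k_out[src] += 1   # one more distinct outgoing link
        k_in[dst] += 1    # one more distinct incoming link
    return weights, s_in, s_out, k_in, k_out
```

For example, two clicks from `a` to `b` and one from `a` to `c` give `a` an out-degree of 2 and an out-strength of 3.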




Behavioral patterns (HUMAN)

(Proportion of total out-strength)


Ratios are stable

[Figure axis: Requests (x 10^6)]




Validation of PageRank

  • PR is a stationary distribution of visit frequency by a modified random walk (with jumps) on the Web graph

  • Compare with actual site traffic (in-strength)

  • From an application perspective, we care about the resulting ranking of sites rather than the actual values
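To make the "random walk with jumps" concrete, here is a minimal power-iteration sketch of standard PageRank on an unweighted graph. The damping value 0.85 is the conventional default and an assumption here; the slides do not state which value was used. The function name and the dict-of-lists graph format are also choices made for this example.

```python
def pagerank(links, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank.

    links maps each node to the list of nodes it links to.
    Returns a dict of scores that sum to 1.
    """
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Nodes without out-links hand their mass to all nodes uniformly.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u, outs in links.items():
            for v in outs:
                # Each out-link is followed with equal probability.
                new[v] += damping * rank[u] / len(outs)
        if sum(abs(new[u] - rank[u]) for u in nodes) < tol:
            return new
        rank = new
    return rank
```

Sorting the nodes by score gives the ranking that would then be compared with the ranking by in-strength.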


Kendall’s Rank Correlation
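For readers unfamiliar with the measure, the sketch below computes Kendall's tau in its simplest form (tau-a, which assumes no tied scores). Which variant the authors used is not stated on the slide, so treat this as an illustration of the idea rather than their exact procedure.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a for two equal-length score lists without ties.

    +1 means the two rankings agree on every pair, -1 means they
    disagree on every pair.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Here `x` could hold PageRank scores and `y` the measured in-strength of the same sites, listed in the same order. This quadratic loop is fine for small examples; large site lists would need an O(n log n) implementation.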


PageRank Assumptions

  • Equal probability of teleporting to each of the nodes

  • Equal probability of teleporting from each of the nodes

  • Equal probability of following each link from any given node
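One way to see what these three assumptions mean is to write a walk in which each of them becomes an explicit parameter. The sketch below is an illustration only, not the authors' model: `weights`, `teleport_to` and `jump_prob` are hypothetical names for the link traffic, the teleportation target distribution, and the per-node probability of jumping away. Setting all three to uniform values recovers standard PageRank.

```python
def traffic_walk(weights, teleport_to, jump_prob, iters=200):
    """Random walk with non-uniform links, jump targets and jump sources.

    weights[u][v]  : traffic on link u -> v (relaxes equal link choice)
    teleport_to[v] : probability of landing on v after a jump; sums to 1
    jump_prob[u]   : probability of jumping away from u
    """
    nodes = list(teleport_to)
    p = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        new = dict.fromkeys(nodes, 0.0)
        for u in nodes:
            out = weights.get(u, {})
            total = sum(out.values())
            # A node without outgoing traffic always jumps.
            jump = jump_prob[u] if total > 0 else 1.0
            for v in nodes:
                new[v] += p[u] * jump * teleport_to[v]
            for v, w in out.items():
                new[v] += p[u] * (1.0 - jump) * w / total
        p = new
    return p
```

Replacing the uniform inputs with measured link weights or jump frequencies shows how far the resulting distribution moves from plain PageRank.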


Kendall’s Rank Correlation


Local Link Heterogeneity

HH Index of concentration or disparity

[Figure labels: perfect concentration, perfect homogeneity]
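Assuming "HH" refers to the Herfindahl-Hirschman index, the disparity of a site's outgoing links is the sum of squared traffic shares. A minimal sketch, with `hh_index` as a hypothetical helper name:

```python
def hh_index(link_weights):
    """Herfindahl-Hirschman index of a list of positive link weights.

    Equals 1 when all traffic uses a single link (perfect concentration)
    and 1/k when traffic is split evenly over k links (perfect homogeneity).
    """
    total = sum(link_weights)
    return sum((w / total) ** 2 for w in link_weights)
```

For a site with four equally used links the index is 0.25; if one of two links carries 90% of the traffic it rises to 0.82.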


Teleportation Target Heterogeneity


Teleportation Source Heterogeneity (“hubness”)

[Figure labels: s_out > s_in: popular hubs; s_out < s_in: teleport sources, browsing sinks; annotation: -2]


Navigation vs. Jumps: Sources of Popularity




Temporal patterns

How predictable are traffic patterns?

  • Cache refreshing (e.g. proxies)

  • Capacity allocation (e.g. peering and provisioning for spikes)

  • Site design (e.g. expose content based on time of day)


Temporal patterns

  • Predict future host graph (clicks) from current one, as a function of delay

  • Generalized temporal precision and recall:
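The formula for this slide did not survive in the transcript. As a purely illustrative stand-in, the sketch below uses one plausible weighted formulation: the overlap between two host graphs is the sum over links of the smaller of the two click counts, precision divides that overlap by the traffic in the current graph, and recall divides it by the traffic in the future graph. This definition is an assumption and may differ from the authors' own.

```python
def temporal_precision_recall(current, future):
    """Weighted precision and recall of `current` as a predictor of `future`.

    Both arguments map a link (source, target) to its click count.
    ASSUMPTION: overlap is the per-link minimum of the two counts.
    """
    overlap = sum(min(w, future.get(link, 0)) for link, w in current.items())
    precision = overlap / sum(current.values())
    recall = overlap / sum(future.values())
    return precision, recall
```

Evaluating this for graphs separated by increasing delays would give precision and recall as a function of delay, as described in the first bullet.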


HUMAN host graph (FULL is about 10% more predictable)


Summary

  • Heterogeneity: incoming and outgoing site traffic, link traffic

  • Less than half of traffic is from following links

  • Only 5% of traffic is directly from search engines

  • High temporal regularity

  • PageRank is a poor predictor of traffic: random walk and random teleportation assumptions violated


Next

  • Sampling bias and search bias

  • From host graph to page graph

  • Modeling traffic: Beyond random walk?


CNLL

THANKS!

?


