A Measurement Study of Peer-to-Peer File Sharing Systems

A Measurement Study of Peer-to-Peer File Sharing Systems Presented by Cristina Abad

Motivation • In a P2P file sharing system, peers are usually in the “edge” of the network • Does this affect/limit the quality of the infrastructure? • What are the characteristics of hosts that choose to participate? • Solution: Measure Gnutella and Napster traffic to help understand these issues

Napster

Gnutella

Methodology • Crawler periodically takes “snapshot” of Napster/Gnutella • capture basic info (peers, files shared, …) • For peers discovered • measure bottleneck bandwidth • measure latency • track content and degree of sharing • Measure lifetime • track availability of peers (at P2P and IP level)

Crawling Napster • Peers can only be discovered by querying index • Crawler issues queries with names of popular song artists • Query responses contain • IP, reported bandwidth, files shared (number, names and sizes) • Results: • Captured 40-60% of Napster hosts (contributing to 80-95% of total files) • Could not capture peers that do not share files

Crawling Gnutella • Crawler uses ping/pong to discover peers • Each crawl captured aprox. 10000 peers

Measuring bandwidth • Reported bandwidth may not be accurate (ignorance or lies) • Use bottleneck bandwidth as approximation to available bandwidth • capacity of slowest host along path between two hosts • Used SProbe to actively measure both upstream and downstream bottleneck bandwidth • Similar to “packet pair” technique

Packet Pair Technique • Two packets queued next to each other at bottleneck link exit the link t seconds apart: • Then, Kevin Lai and Mary Baker. “Measuringbandwidth”. In Proceedings of IEEE INFOCOM '99. 1999. s2: size of second packet bbnl: bottleneck bandwidth

How many peers are server-like? 8% have upstream bb  10Mbps • High-bandwidth, low latency, high availability

Availability – Host uptimes

Availability – Session duration

Free-riders

Is Gnutella robust?

Measurement, Modeling, and Analysis of a Peer-to-Peer File-Sharing Workload Presented by Cristina Abad

Three-tiered approach • Analyze 200-day trace of Kazaa traffic • Considered only traffic going from U. Washington to the outside • Develop a model of multimedia workloads • Analyze and confirm hypothesis • Explore potential impact of locality -awareness in Kazaa

Contributions • Obtained some useful characterizations of Kazaa’s traffic • Showed that Kazaa’s workload is not Zipf • Showed that other workloads (multimedia) may not be Zipf either • Presented a model of P2P file-sharing workloads based on their trace results • Validated the model through simulations that yielded results very similar to those from traces • Proved the usefulness of exploiting locality-aware request routing

Measurement results • Users are patient • Users slow down as they age • Kazaa is not one workload • Kazaa clients fetch objects at-most-once • Popularity of objects is often short-lived • Kazaa is not Zipf

User characteristics (1) • Users are patient

User characteristics (2) • Users slow down as they age • clients “die” • older clients ask for less each time they use system

User characteristics (3) • Client activity • Tracing used could only detect users when their clients transfer data • Thus, they only report statistics on client activity, which is a lower bound on availability • Avg session lengths are typically small (median: 2.4 mins) • Many transactions fail • Periods of inactivity may occur during a request if client cannot find an available server with the object

Object characteristics (1) • Kazaa is not one workload

Object characteristics (2) • Kazaa object dynamics • Kazaa clients fetch objects at most once • Popularity of objects is often short-lived • Most popular objects tend to be recently born objects • Most requests are for old objects

Object characteristics (3) • Kazaa is not Zipf • Web access patterns are Zipf: small number of objects are extremely popular, but there is a long tail of unpopular requests. • Zipf’s law: popularity of ith-most popular object is proportional to i-α, (α: Zipf coefficient) • (Zipf) looks linear on log-log scale

Model of P2P file-sharing workloads • On average, a client requests 2 objects/day • P(x): probability that a user requests an object of popularity rank x  Zipf(1) • Adjusted so that objects are requested at most once • A(x): probability that a newly arrived object is inserted at popularity rank x  Zipf(1) • All objects are assumed to have same size • Use caching to observe performance changes (effectiveness  hit rate)

Model – Simulation results • File-sharing effectiveness diminishes with client age • System evolves towards one with no locality and objects chosen at random from large space • New object arrivals improve performance • Arrivals replenish supply of popular objects • New clients cannot stabilize performance • Can’t compensate for increasing number of old clients • Overall bandwidth increases in proportion to population size

Model validation • By tweaking the arrival rate of of new objects, were able to match trace results (with 5475 new arrivals per year)

Exploring locality-awareness • Currently organizations shape or filter P2P traffic • Alternative strategy: exploit locality in file-sharing workload • Caching; or, • Use content available within organization to substantially decrease external bandwidth usage • Result: 86% of externally downloaded bytes could be avoided by using an organizational proxy

Questions?

Analysis • How can results obtained be used when evaluating P2P schemes? • Are any of the measurements obtained biased? • Peers are heterogeneous • Incentives • Enforcement (e.g. super-peers in Kazaa)

SProbe • Works in uncooperative environments • Works on asymmetric network paths • Exploit properties of TCP protocol • Send SYN packet with large payload; then, measure time dispersion of received RST packet

Zipf • Linguist George Kingsley Zipf observed that for many frequency distributions, the n-th largest frequency is proportional to a negative power of the rank order n • "Zipf's law" is also sometimes used to refer to the corresponding probability distribution • Is an instance of a power law • Zipf's law is often demonstrated by plotting the data, with the axes being log(rank order) and log(frequency). If the points are close to a single straight line, the distribution follows Zipf's law.

A Measurement Study of Peer-to-Peer File Sharing Systems