Interest-Based Self-Organization of Peer-to-Peer Networks: a Club Economics Approach

Interest-Based Self-Organization of Peer-to-Peer Networks: a Club Economics Approach Atip Asvanund, Michael D. Smith, Rahul Telang H. John Heinz III School of Public Policy and Management Carnegie Mellon University December, 2003 WITS

Outline • Introduction to P2P • Research Questions • Game Formulation • Data Collection • Simulation Results • Conclusion

Original P2P Architectures

Current P2P Architecture: Hybrid Architecture • Ultrapeers act like local central servers and shield leaf nodes from network traffic • Leaf nodes do not interface with the rest of the network; ultrapeers issue and forward queries for them, and TTL still applies • Ultrapeers keep the content hashes of their leaf nodes • High-capacity peers can volunteer to become ultrapeers • Examples: Gnutella 0.6, Kazaa

Problems with Current P2P Architecture • Random connection establishment • Content discovery can be more efficient if peers who can satisfy each other’s needs (services and content) are located “closer” to each other • 50% of cross-backbone network traffic is P2P (Sandvine 2002) • But identifying peers who can satisfy each other’s needs is nontrivial • Requires knowledge of future queries; incurs cost overheads • No unique identifiers for content in P2P networks (unlike, e.g., ISBNs for books); content naming schemes in P2P are unstructured • Example: Star Wars Episode 1 may be identified as “Star Wars: The Phantom Menace”, “Phantom Menace, The – Star Wars”, or “Lucas’s Star Wars” • Free riding • Worsens with network size (Adar and Huberman 2000, Asvanund et al. 2002, Krishnan et al. 2002)

Research Questions • Assess the feasibility of P2P networks as distributed IR systems • Can performance be improved by organizing peers into interest-based communities? • Inefficient topology formation • Free riding • Can the overhead incurred by this method be offset by the improvement? • Reducing search expansion • Reducing time-to-live • Interpolating the number of connections • While still retaining the same performance in the improved network

Community (Club) Formation • Model an ultrapeer and its leaf nodes as a club (Buchanan 1965) • A leaf node seeks membership in the right club to maximize its private utility • An ultrapeer accepts the right leaf nodes into its club to maximize the total utility of its members (club utility) • An ultrapeer also seeks connections to the right adjacent ultrapeers to maximize the club utility

Game Setup • Game formulation • Autonomous peers • Limited information set • Information set • A peer knows its own content hash • A content hash is a word frequency list (histogram) of the words in the file names of its content • A peer can obtain the content hash of another peer in the network • A peer, however, does not know its future queries • And a peer cannot find out the future queries of other peers in the network

Utility Function • The utility function is a proxy for how likely a peer is to satisfy another peer’s information needs • Based on the available information set • We posit that peers with similar content are highly likely to satisfy each other’s queries • A peer’s content represents the peer’s long-term interests • A peer’s content builds over time as its queries lead to downloads • A peer’s future queries reflect the peer’s interests, which should have some correlation with its long-term interests (content) • Supported by our empirical data: a peer can respond to 22% of its own queries, but only 3% of other peers’ queries • U(i, j) = CONTENT_SIMILARITY(i, j) * CONTENT_SIZE(j) • Considers both content similarity and content size

IR Similarity Measure • IR measures work with unstructured text • Jensen-Shannon divergence method (Dhillon et al. 2002) • Compares similarity of word frequencies • Does not require knowledge of the global distribution • Also tried other methods • TF-IDF Cosine • KL Divergence (Cover 1991) • Gives [0, 1] range
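
The utility U(i, j) = CONTENT_SIMILARITY(i, j) * CONTENT_SIZE(j) with a Jensen-Shannon-based similarity can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: similarity is taken here as 1 minus the base-2 Jensen-Shannon divergence (which is bounded in [0, 1]), content size is taken as the number of shared files, and the function names `content_hash`, `js_divergence`, and `utility` are hypothetical.

```python
import math
from collections import Counter

def content_hash(file_names):
    """Word-frequency histogram over the words in a peer's file names."""
    words = []
    for name in file_names:
        words.extend(name.lower().split())
    return Counter(words)

def js_divergence(hist_p, hist_q):
    """Jensen-Shannon divergence with base-2 logs, which lies in [0, 1]."""
    total_p = sum(hist_p.values())
    total_q = sum(hist_q.values())
    if total_p == 0 or total_q == 0:
        return 1.0  # assumption: a peer with no content is maximally dissimilar
    divergence = 0.0
    for word in set(hist_p) | set(hist_q):
        p = hist_p.get(word, 0) / total_p
        q = hist_q.get(word, 0) / total_q
        m = 0.5 * (p + q)  # mixture of the two distributions
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence

def utility(files_i, files_j):
    """U(i, j): similarity of i and j, scaled by how much content j offers."""
    similarity = 1.0 - js_divergence(content_hash(files_i), content_hash(files_j))
    return similarity * len(files_j)  # assumption: size = number of files
```

Under these assumptions a free rider (a peer sharing no files) yields zero utility to every other peer, because the content-size factor is zero.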

Strategy Set • Leaf node • Chooses the set of ultrapeers to connect to in order to maximize its “private” utility [sum of UC(L, UP)], constrained by the number of connections allowed • Ultrapeer • Chooses the set of leaf nodes to accept into its club in order to maximize its “club” utility [sum of UC(UP, L)], constrained by the number of connections allowed • Chooses the set of ultrapeers to connect to in order to maximize its “club” utility [sum of UC(UP, UP)], constrained by the number of connections allowed

Peer’s Algorithm • Complexity of a peer’s decision • A peer must perform a protocol handshake to learn the content hashes of other peers • A peer may find it beneficial to connect to another peer, but the latter may refuse the connection in favor of a different peer • Algorithm for leaf nodes • A leaf node L discovers a set of foreign ultrapeers through the Gnutella node discovery protocol (pong messages and host catcher) • For each ultrapeer UP discovered, L sends its content hash to UP • UP calculates UC(UP, L) • If UP is not at connection capacity, it will accept L unconditionally • Otherwise, if UC(UP, L) is better than that of its worst member, UP will accept L • Otherwise, UP will reject L • If UP chooses to accept, it will send an acceptance along with UC(L, UP) • Leaf node L will then adjust its currently connected set of UPs to the best set of UPs available
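
The ultrapeer's side of this handshake can be sketched as a two-step decision: evaluate the request, then commit only once the leaf node confirms. This is a hedged sketch; the function names `evaluate_request` and `confirm_join`, and the representation of a club as a dict from leaf name to UC(UP, leaf), are assumptions made for illustration.

```python
def evaluate_request(members, capacity, candidate_utility):
    """Decide whether an ultrapeer would admit a leaf node.

    members maps each current leaf node to UC(UP, leaf).
    Returns (accept, displaced): displaced is the worst member that
    would have to leave, or None if no one is displaced.
    """
    if len(members) < capacity:
        return True, None           # spare slot: accept unconditionally
    if not members:
        return False, None          # zero-capacity club admits no one
    worst = min(members, key=members.get)
    if candidate_utility > members[worst]:
        return True, worst          # candidate beats the worst member
    return False, None              # otherwise reject

def confirm_join(members, candidate, candidate_utility, displaced):
    """Commit the join after the leaf node accepts the offer."""
    if displaced is not None:
        del members[displaced]      # make room by dropping the worst member
    members[candidate] = candidate_utility
```

Splitting the decision from the commit reflects that an accepted leaf node may still decline the offer if it has better ultrapeers available.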

Leaf Node Algorithm Example • Leaf node L is currently connected to up1, up2, and up3 • UC(L, up1) = 3 • UC(L, up2) = 2 • UC(L, up3) = 1 • Leaf node L discovers upA and upB • Leaf node L sends inquiries to upA and upB • upA finds that UC(upA, L) > UC(upA, L’), where L’ is the worst member of upA • upA sends L an acceptance along with UC(L, upA) = 2 • upB finds that UC(upB, L) < UC(upB, L’’), where L’’ is the worst member of upB • upB sends L a rejection • Leaf node L will drop its connection to up3 and accept the connection to upA • upA will now drop its connection to L’ to make room for L • Incentive-reinforcing structure • An ultrapeer’s algorithm is similarly defined
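
The leaf node's choice in this example can be reproduced with a short sketch. The utilities are the ones given on the slide; the helper name `best_ultrapeer_set` and the limit of 3 connections (implied by L holding three connections before and after) are assumptions.

```python
def best_ultrapeer_set(current, offers, max_connections):
    """Keep the highest-utility ultrapeers among current links and new offers."""
    candidates = {**current, **offers}
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return set(ranked[:max_connections])

# UC(L, up) for the ultrapeers L is connected to.
current = {"up1": 3, "up2": 2, "up3": 1}
# Only upA sent an acceptance (with UC(L, upA) = 2); upB rejected L.
offers = {"upA": 2}

chosen = best_ultrapeer_set(current, offers, max_connections=3)
dropped = set(current) - chosen
print(sorted(chosen), sorted(dropped))
```

Here `chosen` contains up1, up2, and upA, and `dropped` contains up3, matching the outcome described in the example.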

Interpolation of Number of Connections • Discrete search expansion • Increasing time-to-live is polynomial • Add linear scaling to search expansion by interpolating the number of connections • An ultrapeer UP knows UC(UP, UP’) for all connected UP’ • Interpolate by relaying queries to the top N UP’ • For example, a UP connected to 3 other UPs may forward a query to only its top 2 UPs • We will show that our enhanced protocol can reduce network load by reducing the search expansion • Reducing time-to-live • Interpolating the number of connections • While retaining the same performance in the improved network
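
Relaying a query to only the top N adjacent ultrapeers can be sketched as a ranking by UC(UP, UP’). The function name `forward_targets` and the sample utility values are hypothetical; only the selection rule comes from the slide.

```python
def forward_targets(neighbor_utility, n):
    """Return the n adjacent ultrapeers with the highest UC(UP, UP')."""
    ranked = sorted(neighbor_utility, key=neighbor_utility.get, reverse=True)
    return ranked[:n]

# An ultrapeer connected to 3 others forwards a query to only its top 2.
neighbors = {"upX": 0.9, "upY": 0.4, "upZ": 0.7}
targets = forward_targets(neighbors, 2)
print(targets)
```

Varying N between 1 and the club degree gives intermediate steps in search expansion between two adjacent time-to-live values.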

Performance Measures • Recall is the standard IR performance measure • A centralized system would have 100% recall • We calculate the recall of the “dumb” Gnutella and the “enhanced” Gnutella • We also calculate the cost in terms of information flow across both network architectures
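
Recall for a single query can be sketched as the share of all matching items in the network that the query actually reached. The function name `recall` and the use of item identifiers in sets are assumptions for illustration; the slides do not specify how matches were counted.

```python
def recall(reached_matches, all_matches):
    """Fraction of all matching items in the network that a query reached."""
    all_matches = set(all_matches)
    if not all_matches:
        return None  # recall is undefined when nothing in the network matches
    return len(set(reached_matches) & all_matches) / len(all_matches)

# A query that reaches 2 of the 4 matching files has 50% recall.
print(recall({"f1", "f2"}, {"f1", "f2", "f3", "f4"}))
```

A centralized index sees every matching item, so under this definition its recall is 1.0 for any query that has matches.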

Data • Collected data from the Gnutella v0.6 network for 3 weeks in September 2002 • Collected 10,533 unique hosts • Collected host IP, content, queries • 42% free riders • Non-free riders share 270 files on average (long right tail) • Backbone distribution corresponds to current backbone shares (tracert)

Simulation • Simulate networks of 1000, 2000, and 4000 peers by seeding peers with query and content data, retaining the actual correlation • 200 clubs • Club degree: 3 • Time-to-live: 0, 1, 2, 3 • Connection interpolation: 1, 2, 3 • Simulate each setup 20 times to evaluate the statistical significance of the findings

Experimental Design • 3 network evolution methods • Intraclub first and then interclub (efficient) • 1 intraclub evolution = every leaf node has moved once on average • 1 interclub evolution = every ultrapeer has moved once on average • Every peer gets an equal chance to move (realistic) • Random selection with replacement • Random selection without replacement • 1 evolution = every peer has moved once on average

Performance Evaluation • Maximum performance achievable • Run network evolution model for 10 evolutions • Performance does not improve much after 2 evolutions • All network evolution methods arrive at the same performance • Sensitivity testing across all network sizes also yields the same result • Displaying results for the case where each peer gets an equal chance to move, randomly selecting peers without replacement, network size of 2000

Performance after 10 evolutions

Statistical Significance

Performance Summary • Recall improves for all time-to-live values • Improvement diminishes with increasing time-to-live • Top clubs (top 50% and top 25%) exhibit more improvement • Incentive-reinforcing structure, as peers are given incentives to provide more content • Improvement for top clubs is two-fold • Placing themselves among clubs with higher provision • Placing themselves among clubs with similar content • Improvement for all clubs • Placing themselves among clubs with similar content • Effectiveness of our community formation • Suggests that the base-case Gnutella network has much to gain by considering community formation • Our utility model approximates the true benefits (recall) well

Overhead Cost • The cost overhead in our club formulation is the transmission of content hashes for utility calculation (fixed) • However, the resulting improved network may realize cost savings if it can use a lower TTL or connection interpolation to relay queries while retaining benefits (variable) • Displaying results for the case where each peer gets an equal chance to move, randomly selecting peers without replacement, network size of 2000 • Most realistic case • “Intraclub first and then interclub” is, however, more efficient

Overhead

Overhead • Evolving the community for 1 evolution requires 21,330,000 bytes for sending content hashes • Using a TTL of 3 (full connection) requires 1,242 bytes to relay each query • Using a TTL of 2 (full connection) requires 243 bytes to relay each query • Thus about 21,351 queries (21,330,000 / (1,242 - 243)) must be relayed before the cost reduction justifies the cost overhead • This translates to each peer making about 11 queries • Not all setups allow us to reduce time-to-live or to interpolate the number of connections, but in all cases it is possible for the top 50% and top 25% clubs • Even in cases where it is not possible, the overhead is a sunk cost
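
The break-even figures on this slide can be checked with a few lines of arithmetic. The byte counts are taken from the slide; the network size of 2000 peers is assumed from the displayed simulation case, and the variable names are illustrative.

```python
overhead_bytes = 21_330_000      # one evolution of content-hash exchange
bytes_per_query_ttl3 = 1_242     # relaying a query at TTL 3, full connection
bytes_per_query_ttl2 = 243       # relaying a query at TTL 2, full connection
network_size = 2_000             # assumption: the displayed simulation case

# Bytes saved on every query by dropping from TTL 3 to TTL 2.
saving_per_query = bytes_per_query_ttl3 - bytes_per_query_ttl2

# Queries needed before the cumulative saving covers the fixed overhead.
break_even_queries = overhead_bytes / saving_per_query

# Average number of queries each peer must issue to reach break-even.
queries_per_peer = break_even_queries / network_size

print(saving_per_query, int(break_even_queries), round(queries_per_peer))
```

The saving is 999 bytes per query, so break-even falls at roughly 21,351 queries, or about 11 queries per peer in a 2000-peer network.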

Conclusion • Summary • Performance of the Gnutella network can be improved by considering community formation based on interests • Inefficient topology formation • Free riding • Our utility model is effective • IR measure handles the unstructured content naming scheme • Approximates interests from content (utilizing users’ information retrieval patterns) • Cost overhead can be justified • Combines IR and economics to solve fundamental problems in extant P2P networks • Simulates our formulation on empirical data • Most work in the field presents analytical treatments • Augments the existing protocol
