This article examines the functionality of Hybrid Peer-to-Peer (P2P) systems, highlighting their advantages and disadvantages in data sharing. It discusses the significant computing power and network bandwidth improvements gained through P2P architecture while recognizing the challenges posed by decentralization, as seen in systems like Gnutella and Napster. The article evaluates different server architectures, such as chained, full replication, hash, and unchained models, considering their impact on query efficiency, indexing, and user connectivity. It aims to enhance understanding of P2P systems' operational intricacies and optimization strategies.
Comparing Hybrid Peer-to-Peer Systems. Based on an article by Hector Garcia-Molina and Beverly Yang. By Tudor Balan.
P2P short survey • P2P advantages • The resources of many computers can be pooled into large stores of information and significant computing power. • Network bandwidth use improves significantly because computers communicate directly. • P2P drawbacks • Problems stem from the decentralized nature. • Ex. Gnutella (network flooding, no scalability) • Improvements • Ex. Napster (restricted server search, fractional indexing) • Goal • Study the functionality of P2P systems in order to understand their tradeoffs. • Concentrate on data sharing and hybrid P2P systems.
Data sharing overview • Data sharing systems divide into pure data sharing systems and hybrid data sharing systems. • Hybrid data sharing systems are hugely popular, but … are they also well studied? • What is the best way to organize server nodes? • Should indexes be replicated? • Which queries do users commonly ask? • How should disconnected users be treated?
Problem analysis and treatment • Present several architectures for P2P data sharing systems, either already in use or yet to be deployed. • Give a probabilistic model for user queries and for the result size. • Present a model for evaluating system performance. • Based on the above, compare the architectures.
Server architecture: General concepts • Login • library • connecting • metadata upload • index • connection information (client IP, line speed) • local server • remote servers • local users • Query • list of desired words • satisfied (maximum number of results reached) • query processing (retrieve and intersect the lists for each query word) • Download • notification that the library has grown • index update • server notified when a file is removed or the user logs off
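The query-processing step named on this slide (retrieve the list for each query word, then intersect the lists) can be sketched as follows. This is a minimal illustration, not the system described in the article: the function names `build_index` and `run_query`, the `(user, title)` posting format, and the sample libraries are all assumptions.

```python
from collections import defaultdict

def build_index(libraries):
    """Map each word to the set of (user, title) pairs whose title contains it."""
    index = defaultdict(set)
    for user, titles in libraries.items():
        for title in titles:
            for word in title.lower().split():
                index[word].add((user, title))
    return index

def run_query(index, words, max_results):
    """Intersect the lists of all query words; stop at max_results entries."""
    lists = [index.get(w.lower(), set()) for w in words]
    if not lists:
        return []
    # Shortest list first keeps the intermediate intersections small.
    lists.sort(key=len)
    matches = set.intersection(*lists)
    return sorted(matches)[:max_results]

# Hypothetical libraries uploaded at login.
libraries = {
    "alice": ["Blue Moon", "Moon River"],
    "bob": ["Blue Moon", "Blue Suede Shoes"],
}
index = build_index(libraries)
results = run_query(index, ["blue", "moon"], max_results=10)
```

Here `results` holds the two copies of "Blue Moon", one per user; lowering `max_results` to 1 makes the query "satisfied" after the first hit.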
Server architectures: Login policies • Batch • Login: the entire library metadata is uploaded • Logoff: the entire library metadata is removed • Index = {metadata of active users} • Advantages • Small index • Increased query efficiency • Disadvantages • Intensive and expensive metadata updates • Incremental • Metadata persists between sessions • Only the differences are uploaded • Advantages • Less effort at login/logoff • Disadvantages • Increased memory requirements • Penalty on query efficiency • The user (sometimes) needs to connect to the same server
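The difference between the two login policies can be sketched as two small classes. This is an illustrative model only: the class names, the per-user dictionary, and the choice to count "metadata items transferred" as the login cost are assumptions, not the article's cost formulas.

```python
class BatchServer:
    """Batch policy: the index holds metadata of active users only."""

    def __init__(self):
        self.index = {}  # user -> set of file names

    def login(self, user, library):
        self.index[user] = set(library)  # entire library uploaded
        return len(library)              # metadata items transferred

    def logoff(self, user):
        self.index.pop(user, None)       # entire library removed


class IncrementalServer:
    """Incremental policy: metadata persists, only differences are sent."""

    def __init__(self):
        self.index = {}      # user -> set of file names, kept after logoff
        self.active = set()  # users currently logged in

    def login(self, user, library):
        old = self.index.get(user, set())
        new = set(library)
        self.index[user] = new
        self.active.add(user)
        return len(old ^ new)            # only added/removed items transferred

    def logoff(self, user):
        self.active.discard(user)        # metadata stays in the index


batch, incremental = BatchServer(), IncrementalServer()
for server in (batch, incremental):
    server.login("alice", ["a.mp3", "b.mp3", "c.mp3"])
    server.logoff("alice")
batch_cost = batch.login("alice", ["a.mp3", "b.mp3", "c.mp3", "d.mp3"])
incremental_cost = incremental.login("alice", ["a.mp3", "b.mp3", "c.mp3", "d.mp3"])
```

On the second login the batch server receives all four items again, while the incremental server receives only the one new item, at the price of keeping the metadata of logged-off users in memory.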
Server architectures • Chained Architecture • Linked server nodes • Login • Metadata uploaded to the local server • Other server nodes unaffected • Query • Submitted to the local server • While (not enough results AND not all servers have received and serviced the query) • local server contacts other servers • End While • Performance • Efficient login and downloads (local server conversation only) • Expensive query processing (query forwarding, multiple query execution, results retrieval) • Full Replication Architecture • Intended to overcome the previous disadvantages • Each server contains a complete index • Advantages • Single server queried • Login at any server (even under the incremental policy) • Disadvantages • Logins sent to all servers • Highly sensitive to login/logoff frequency
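The chained query loop on this slide can be sketched as below: the local server answers first, and further servers are contacted only while results are still missing and some server has not yet seen the query. The function name, the list-of-lists server model, and the `matches` predicate are assumptions made for illustration.

```python
def chained_query(servers, local, matches, max_results):
    """Run a query along a chain of servers.

    servers: one list of file names per server (hypothetical model).
    local: position of the user's local server in the chain.
    matches: predicate deciding whether a file satisfies the query.
    Returns (results, number of servers that processed the query).
    """
    results = []
    contacted = 0
    n = len(servers)
    for step in range(n):                    # at most every server once
        server = servers[(local + step) % n]
        contacted += 1
        for item in server:
            if matches(item):
                results.append(item)
                if len(results) == max_results:
                    return results, contacted  # query satisfied, stop forwarding
    return results, contacted                # all servers serviced the query


servers = [["a1", "b1"], ["a2"], ["a3", "a4"]]
starts_with_a = lambda name: name.startswith("a")
found, used = chained_query(servers, local=0, matches=starts_with_a, max_results=2)
```

With these sample servers the query needs two servers to collect two results, which is the kind of count the slides later call ExServ.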
Server architectures • Hash Architecture • Login • Metadata words are hashed to servers • A given server maintains the complete lists for a subset of all words • Query • Addressed to only one server • The addressed server asks other servers for the lists of the words it does not hold • The addressed server merges all lists • Advantages • Limited number of servers involved in processing each query • Limited number of servers updating metadata • No results traffic (only lists) • Disadvantages • High bandwidth for list manipulation • Unchained architecture • Set of independent servers • Login • To one isolated server • No other servers are affected • Query • Handled by the server the user is logged on to • Advantage • Scalability • Disadvantage • Partial functionality • Limited query results
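A minimal sketch of the hash architecture follows: each word is hashed to one server, which keeps the complete list for that word, and the addressed server fetches the lists it lacks before merging. The class `HashCluster`, the use of `zlib.crc32` as the hash function, and the in-process "servers" are assumptions chosen only to keep the example self-contained.

```python
import zlib
from collections import defaultdict

def server_for(word, num_servers):
    """Stable word -> server mapping (CRC32 is an arbitrary choice here)."""
    return zlib.crc32(word.lower().encode("utf-8")) % num_servers

class HashCluster:
    def __init__(self, num_servers):
        self.num_servers = num_servers
        # One word -> set of (user, title) table per server.
        self.lists = [defaultdict(set) for _ in range(num_servers)]

    def add_file(self, user, title):
        """Update only the servers that own a word of the title."""
        touched = set()
        for word in title.lower().split():
            s = server_for(word, self.num_servers)
            self.lists[s][word].add((user, title))
            touched.add(s)
        return touched

    def query(self, words, addressed):
        """Merge the lists at the addressed server; count lists fetched remotely."""
        remote_lists = 0
        merged = None
        for word in words:
            w = word.lower()
            s = server_for(w, self.num_servers)
            if s != addressed:
                remote_lists += 1            # list shipped between servers
            postings = self.lists[s].get(w, set())
            merged = postings if merged is None else merged & postings
        return sorted(merged or set()), remote_lists


cluster = HashCluster(num_servers=4)
touched = cluster.add_file("alice", "Blue Moon")
cluster.add_file("bob", "Blue Suede Shoes")
hits, shipped = cluster.query(["blue", "moon"], addressed=0)
```

Only whole lists travel between servers, never final results, which matches both the advantage and the bandwidth drawback listed on the slide.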
Query model • Needed for comparing the systems • Goals • Estimate the number of query results • Estimate the number of servers needed to process a query • Initial computations for the chained architecture (more complex) • Subsequent computations derived from them (by relaxing or specializing the chained architecture conditions)
Query model (continued): Chained architecture • Assume a query universe q1, q2, … • g = the probability function that describes query popularity, i.e. g(i) is the probability that a submitted query happens to be query i • f = the probability density function that describes query selection power. If we take a given file in a user's library, it will match query i with probability f(i)
Query model (continued) • Full replication • ExServ=1 (all results are local) • ExRemoteResults=0 • Unchained • ExServ=1 (all results are local) • ExRemoteResults=0 • Hash • ExRemoteResults=0
Particularization • In the case of music sharing, g and f may realistically be taken as exponential distributions.
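The formulas shown on this slide did not survive in the text. A later slide states that f and g are exponential for music, so the sketch below assumes the standard exponential density with a mean parameter; the function name `exp_density` and the two mean values are hypothetical and are not taken from the article.

```python
import math

def exp_density(i, mean):
    """Exponential density (1/mean) * exp(-i/mean), evaluated at query rank i."""
    return math.exp(-i / mean) / mean

# Hypothetical means, chosen only for illustration.
MEAN_G = 50.0    # query popularity
MEAN_F = 200.0   # query selection power

ranks = range(1, 6)
g = [exp_density(i, MEAN_G) for i in ranks]
f = [exp_density(i, MEAN_F) for i in ranks]
```

Both lists decrease with the rank i, so under this assumption the most popular queries are also the ones that match the most files.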
Performance model • Illustrates the way to measure the performance of a P2P system • NumServers (LAN, WAN) • Users (LAN, WAN) • {LAN,WAN} X {LAN, WAN} • Compute action costs in terms of: • CPU cycles • Interserver communication bandwidth • User-server communication bandwidth
CPU consumption • CPU cost variations for the chained architecture (batch and incremental) • Interpretation • CPU cost variations for the other architectures (relative to the chained one) • Unchained & Full replication • the query cost formula (batch & incremental) is the same • …but ExServ=1 and ExRemoteResults=0 • Hash • additional query cost for list transfer
Network consumption • Client-server byte costs • Interserver byte costs • Full replication • each server sees each Login, AddFile, RemoveFile • LAN: each message is broadcast once • WAN: each message is sent NumServers-1 times by the local server • Hash • each selected server sees each Login, AddFile, RemoveFile • LAN: each message is broadcast once • WAN: AddFile is sent once to each server that holds lists for words in the file name • Unchained • no interserver communication • 0 login costs • Chained • interserver communication for queries • no interserver communication for logins • 0 login costs
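The interserver message counts for a single Login listed on this slide can be written down as a small function. It covers only the cases the slide states outright (full replication, chained, unchained); the function name, the string codes, and the treatment of a single-server system are assumptions.

```python
def login_interserver_messages(architecture, num_servers, network):
    """Interserver messages caused by one Login, per the slide's rules.

    architecture: "FR" (full replication), "CHN" (chained) or "UCH" (unchained).
    network: "LAN" or "WAN" (only relevant for full replication).
    """
    if num_servers < 1:
        raise ValueError("num_servers must be at least 1")
    if architecture in ("CHN", "UCH"):
        return 0                      # logins stay on the local server
    if architecture == "FR":
        if num_servers == 1:
            return 0                  # assumption: a lone server notifies nobody
        if network == "LAN":
            return 1                  # one broadcast reaches every server
        if network == "WAN":
            return num_servers - 1    # one copy per remote server
        raise ValueError("network must be 'LAN' or 'WAN'")
    raise ValueError("unknown architecture")


wan_cost = login_interserver_messages("FR", num_servers=5, network="WAN")
lan_cost = login_interserver_messages("FR", num_servers=5, network="LAN")
```

On a WAN the login traffic of full replication grows linearly with the number of servers, which is why that architecture is sensitive to login frequency.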
Overall performance • Hypothesis: the formulas for each action cost are known • Performance metric: UsersPerServer • How to compute a global formula for UsersPerServer? (Directly? Too complex) • For each resource i • Assume infinite resources of the other 2 types • Compute UsersPerServer for the current resource (UsersPerServer_i) • Take min(UsersPerServer_i)
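The bottleneck procedure on this slide can be sketched as follows. The sketch makes a strong simplifying assumption that is not in the slides: each user consumes a fixed amount of every resource, so the per-resource limit is a plain division. The resource names follow the performance model slide; all capacities and per-user costs are hypothetical numbers.

```python
def users_per_server(capacity, cost_per_user):
    """Per-resource user limits and their minimum (the bottleneck).

    Assumes a linear cost model: each user consumes a fixed amount
    of each resource. The article's cost formulas are more detailed.
    """
    per_resource = {r: capacity[r] / cost_per_user[r] for r in capacity}
    bottleneck = min(per_resource, key=per_resource.get)
    return per_resource[bottleneck], bottleneck, per_resource


# Hypothetical capacities and per-user costs, in arbitrary units.
capacity = {"cpu": 1000.0, "interserver_bw": 500.0, "user_bw": 2000.0}
cost_per_user = {"cpu": 0.5, "interserver_bw": 0.25, "user_bw": 4.0}
limit, bottleneck, detail = users_per_server(capacity, cost_per_user)
```

Each entry of `detail` is the limit obtained by pretending the other two resources are infinite; the smallest of them is the number of users the server can actually support.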
Experiments • Results of performance studies • Music sharing systems • Sharing systems for domains other than music • Maximum number of users (throughput, not response time) • Architectures={CHN, FR, HASH, UCH} • Login policies={batch, incremental} • Strategies=Architectures × Login policies
Music sharing systems behaviour • For MaxResults=100: • QueryLoginRatio • nr of logins/sec • users supported • available files • expected nr of results • Ex: For QueryLoginRatio=1: • Incremental FR=54203 • Batch FR=7281 • As QueryLoginRatio increases, logins/sec decrease and performance increases • Incremental strategies outperform their batch counterparts • CHN & UCH better than FR & HASH • Moving from UCH to CHN conserves performance but increases returned results • Paradox: UCH more used than CHN • QueryLoginRatio sensitivity
Memory analysis • No previous treatment of memory implications • Batch strategies better than their incremental counterparts • Memory=f(NumServers, ActiveFrac) • For FR, memory grows with NumServers • Mem of incremental = (1/ActiveFrac) × Mem of batch • As ActiveFrac increases, incremental strategies come closer to batch • Memory prices may eliminate worries about memory limitations • Small analysis • Ex1. QueryLoginRatio=.75 (incr & batch CHN comparison): (69708,26828) vs (12268,28828), take batch • Ex2. QueryLoginRatio=.25 (incr & batch CHN comparison): (52088,9190) vs (12268,9190), take incremental
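The relation between the memory of the two policies stated on this slide (memory of incremental equals memory of batch divided by ActiveFrac) is easy to tabulate. The function name and the sample batch memory of 100 units are hypothetical.

```python
def incremental_memory(batch_memory, active_frac):
    """Memory needed under the incremental policy, from the slide's relation.

    active_frac: fraction of registered users that are logged in, in (0, 1].
    """
    if not 0 < active_frac <= 1:
        raise ValueError("active_frac must be in (0, 1]")
    return batch_memory / active_frac


# Hypothetical batch index of 100 memory units.
table = {frac: incremental_memory(100.0, frac) for frac in (0.25, 0.5, 1.0)}
```

With a quarter of the users active the incremental index is four times the batch index; when every user is active the two coincide, which is the sense in which incremental strategies come closer to batch as ActiveFrac grows.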
Beyond music… • We can generally compute • Expected nr of results of a query • Expected nr of servers needed to satisfy the query • …using • g(): distribution of query frequency • f(): distribution of selection power • f and g are inputs to the general query model • For music, f and g are exponential and positively correlated (the more popular a query is, the greater its selection power); all the preceding results rely on this • What if we have a stock database? • Select * from Product where price>10 (rare query) returns as many results as • Select * from Product (common query) • No correlation • What about an archive-driven company? • Rare queries (for old articles) return many results • Frequent queries (for new articles) return few results • Negative correlation
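The effect of the correlation between g and f on the expected number of results can be illustrated with a small calculation. It assumes, as a simplification not spelled out in the slides, that each of `num_files` indexed files matches query i independently with probability f(i) and that no result cap applies; all numbers are hypothetical.

```python
def expected_results(g, f, num_files):
    """Expected matches of a random query: sum over i of g(i) * num_files * f(i)."""
    return num_files * sum(g_i * f_i for g_i, f_i in zip(g, f))


# Hypothetical three-query universe, most popular query first.
g = [0.5, 0.3, 0.2]
selection = [0.10, 0.05, 0.01]
mean_selection = sum(selection) / len(selection)

positive = expected_results(g, selection, 1000)              # music-like
uncorrelated = expected_results(g, [mean_selection] * 3, 1000)
negative = expected_results(g, selection[::-1], 1000)        # archive-like
```

The same selection powers give the most results when popular queries are also the selective ones and the fewest when the order is reversed, which is why the music conclusions do not carry over automatically to the stock and archive examples.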
Final Conclusions • Chained • Best for music today • Good login, least memory • Poor if many servers are involved • Full replication • Potentially good in the future, once connections are more stable • Hash • Has high bandwidth requirements • Good in the future, or in systems where servers must not exchange large amounts of metadata • Unchained • Not recommended • Few results for only a small performance improvement • Good when the number of results is not important • Incremental policy • Good for systems with negative correlation