Comparing Hybrid Peer-to-Peer Systems

Based on an article by Hector Garcia-Molina and Beverly Yang

Presentation by Tudor Balan


P2P short survey

P2P advantages

  • The resources of many computers can be pooled to form large stores of information and significant computing power.

  • Network bandwidth improves significantly because computers communicate directly.

P2P drawbacks

  • Problems arise from the decentralized nature.

    • Ex. Gnutella (network flooding, no scalability)

  • Improvements

    • Ex. Napster (restricted server search, fractional indexing)

Goal

  • Study the functionality of P2P systems in order to understand their tradeoffs.

  • Concentrate on data sharing and hybrid P2P systems.


Data sharing overview

Data sharing systems

  • Pure data sharing systems

  • Hybrid data sharing systems

Hybrid data sharing systems are hugely popular, but are they also well studied?

  • What is the best way to organize server nodes?

  • Should indexes be replicated?

  • What are the common queries asked by users?

  • How should disconnected users be treated?


Problem analysis and treatment

  • Present several architectures for P2P data sharing systems, either already in use or proposed.

  • Give a probabilistic model for user queries and for the result size.

  • Present a model for evaluating system performance.

  • Based on the above, compare the architectures.


Server architecture: General concepts

  • Login

    • library

    • connecting

    • metadata upload

    • index

    • connection information (client IP, line speed)

    • local server

    • remote servers

    • local users

  • Query

    • list of desired words

    • satisfied when the maximum number of results is reached

    • query processing: retrieve the list for each query word and intersect the lists

  • Download

    • the user notifies the server that a file was added to the library

    • index update

    • the server is also notified when a file is removed or the user logs off
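The query processing step above (retrieve and intersect the lists for each query word) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names `build_index` and `run_query`, the tokenization by whitespace, and the sample libraries are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(libraries):
    """Map each word to the set of (user, file name) pairs whose name contains it."""
    index = defaultdict(set)
    for user, files in libraries.items():
        for name in files:
            for word in name.lower().split():
                index[word].add((user, name))
    return index

def run_query(index, words, max_results):
    """Intersect the lists of all query words and return at most max_results."""
    lists = [index.get(w.lower(), set()) for w in words]
    if not lists:
        return []
    # Sorting by length keeps the intermediate intersections small.
    lists.sort(key=len)
    matches = set.intersection(*lists)
    return sorted(matches)[:max_results]

# Hypothetical libraries uploaded at login.
libraries = {
    "alice": ["blue moon", "moon river"],
    "bob": ["blue sky", "blue moon"],
}
index = build_index(libraries)
results = run_query(index, ["blue", "moon"], max_results=10)
```

Only files whose names contain every query word survive the intersection, and `max_results` plays the role of the "satisfied" threshold.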


Server architectures: Login policies

  • Batch

    • Login: the entire library metadata is uploaded

    • Logoff: the entire library metadata is removed

    • Index = {metadata of active users}

    • Advantages

      • Small index size

      • Increased query efficiency

    • Disadvantages

      • Intense and expensive metadata updates

  • Incremental

    • Metadata is kept on the server between sessions

    • Only the differences are uploaded at login

    • Advantages

      • Less effort at login/logoff

    • Disadvantages

      • Increased memory requirements

      • Penalty on query efficiency

      • The user (sometimes) needs to connect to the same server
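The difference between the two login policies can be sketched as follows. This is a simplified illustration under stated assumptions: the class names `BatchServer` and `IncrementalServer` are hypothetical, metadata is reduced to a set of file names, and the return value of `login` counts the metadata items transferred.

```python
class BatchServer:
    """Batch policy: the index holds metadata of active users only."""

    def __init__(self):
        self.index = {}  # user -> set of file names

    def login(self, user, library):
        self.index[user] = set(library)  # full upload on every login
        return len(self.index[user])     # metadata items transferred

    def logoff(self, user):
        self.index.pop(user, None)       # all of the user's metadata is removed


class IncrementalServer:
    """Incremental policy: metadata persists, only differences are sent."""

    def __init__(self):
        self.index = {}      # user -> set of file names, for all known users
        self.active = set()  # users currently logged in

    def login(self, user, library):
        old = self.index.get(user, set())
        new = set(library)
        changes = old ^ new              # files added or removed since last session
        self.index[user] = new
        self.active.add(user)
        return len(changes)

    def logoff(self, user):
        self.active.discard(user)        # metadata stays in the index
```

In this sketch the incremental index keeps growing with every user ever seen, which mirrors the memory and query-efficiency penalties listed above, while repeat logins transfer far less data.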


Server architectures

  • Chained Architecture

    • Linked server nodes

    • Login

      • Metadata is uploaded to the local server

      • Other server nodes are unaffected

    • Query

      • Submitted to the local server

      • While (not enough results AND not all servers have received and serviced the query)

        • the local server contacts other servers

      • End While

    • Performance

      • Efficient logins and downloads (conversation with the local server only)

      • Expensive query processing (query forwarding, multiple query executions, result retrieval)

  • Full Replication Architecture

    • Intended to overcome the previous disadvantages

    • Each server contains a complete index

    • Advantages

      • A single server is queried

      • Login at any server (even under the incremental policy)

    • Disadvantages

      • Logins are sent to all servers

      • Highly sensitive to login/logoff frequency
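The chained query loop can be sketched as follows: the local server answers first, and the query moves along the chain only while more results are needed and unvisited servers remain. The function name `chained_query`, the ring-style visiting order, and the sample data are assumptions for illustration; the slides do not specify how the chain is traversed.

```python
def chained_query(servers, local, matches, max_results):
    """Forward a query along the chain until enough results are collected.

    servers: ordered list of server ids forming the chain
    local:   id of the server the user is logged on to
    matches: dict mapping server id -> list of results held by that server
    Returns (results, number of servers that processed the query).
    """
    start = servers.index(local)
    results, contacted = [], 0
    for step in range(len(servers)):
        if len(results) >= max_results:
            break  # enough results: stop forwarding
        server = servers[(start + step) % len(servers)]
        results.extend(matches.get(server, []))
        contacted += 1
    return results[:max_results], contacted

# Hypothetical chain of three servers.
servers = ["s1", "s2", "s3"]
matches = {"s1": ["a"], "s2": ["b", "c"], "s3": ["d"]}
few = chained_query(servers, "s1", matches, max_results=2)
many = chained_query(servers, "s1", matches, max_results=10)
```

The second value returned corresponds to the number of servers that had to execute the query, which is what makes chained queries expensive when results are scarce.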


Server architectures

  • Hash Architecture

    • Login

      • Metadata words are hashed to server numbers

      • A given server maintains the complete lists for a subset of all words

    • Query

      • Addressed to only one server

      • The addressed server asks other servers for the lists of the words it doesn’t have

      • The addressed server merges all lists

    • Advantages

      • Limited number of servers involved in processing each query

      • Limited number of servers update metadata

      • No results traffic (only lists)

    • Disadvantages

      • High bandwidth for list manipulation

  • Unchained Architecture

    • Set of independent servers

    • Login

      • To one isolated server

      • No other servers are affected

    • Query

      • Sent only to the server the user is logged on to

    • Advantage

      • Scalability

    • Disadvantages

      • Partial functionality

      • Limited query results
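The word-to-server assignment in the hash architecture can be sketched as follows. The slides do not name a hash function; CRC32 is used here only as a deterministic stand-in, and the helper names `server_for`, `servers_for_file`, and `remote_servers_for_query` are hypothetical.

```python
import zlib

def server_for(word, num_servers):
    """Deterministically map a word to a server number (CRC32 is an assumption)."""
    return zlib.crc32(word.lower().encode("utf-8")) % num_servers

def servers_for_file(name, num_servers):
    """Servers that must update a list when a file with this name is added."""
    return {server_for(word, num_servers) for word in name.split()}

def remote_servers_for_query(words, addressed, num_servers):
    """Servers the addressed server must ask for lists it does not hold."""
    return {server_for(word, num_servers) for word in words} - {addressed}
```

A file name with k distinct words touches at most k servers, which is why only a limited number of servers is involved in each update and each query.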


Query model

  • Needed for comparing systems

  • Goals

    • Estimate the number of query results

    • Estimate the number of servers needed to process a query

  • Initial computations are done for the chained architecture (the more complex case)

  • The other cases are derived from it (by relaxing or specializing the chained architecture conditions)


Query model (continued): Chained architecture

  • Assume a query universe q1, q2, …

  • g = the probability function that describes query popularity, i.e. g(i) is the probability that a submitted query happens to be query i

  • f = the probability density function that describes the selection power of a query. If we take a given file in a user’s library, it matches query i with probability f(i)
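The roles of g and f can be illustrated with a small calculation. This is a minimal sketch, not the paper's formulas: it assumes a finite query universe, that each file matches a query independently of the others, and it ignores any cap on the number of results. The function names and the sample values are hypothetical.

```python
def expected_results(g, f, num_files):
    """Expected number of files matching a randomly submitted query.

    g[i]: probability that the submitted query is query i (must sum to 1)
    f[i]: probability that a given file matches query i
    """
    if abs(sum(g) - 1.0) > 1e-9:
        raise ValueError("g must sum to 1")
    return num_files * sum(gi * fi for gi, fi in zip(g, f))

def prob_no_match(f_i, num_files):
    """Probability that query i matches none of num_files independent files."""
    return (1.0 - f_i) ** num_files

# Hypothetical three-query universe.
g = [0.5, 0.3, 0.2]
f = [0.10, 0.05, 0.01]
avg = expected_results(g, f, num_files=1000)
```

With these sample values the popular queries are also the selective ones, so the average is dominated by the first term of the sum.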


Query model (continued)

  • Full replication

    • ExServ = 1, since all results are local

    • ExRemoteResults = 0

  • Unchained

    • ExServ = 1, since all results are local

    • ExRemoteResults = 0

  • Hash

    • ExRemoteResults = 0


Specialization

In the case of music sharing, g and f may realistically be taken as exponential distributions.


Performance model

  • Illustrates the way to measure the performance of a P2P system

  • NumServers (LAN, WAN)

  • Users (LAN, WAN)

  • {LAN,WAN} X {LAN, WAN}

  • Compute action costs in terms of:

    • CPU cycles

    • Interserver communication bandwidth

    • User-server communication bandwidth


CPU consumption

CPU cost variations for the chained architecture (batch and incremental)

Interpretation

  • CPU cost variations for the other architectures (relative to the chained one)

  • Unchained & Full replication

    • the query cost formula (batch & incremental) is the same

    • …but ExServ = 1 and ExRemoteResults = 0

  • Hash

    • additional cost for list transfer (in the query costs)


Network consumption

Client-Server byte costs

Interserver byte costs

  • Full replication

    • each server sees each Login, AddFile, RemoveFile

    • LAN: each message is broadcast once

    • WAN: each message is sent NumServers-1 times by the local server

  • Hash

    • each selected server sees each Login, AddFile, RemoveFile

    • LAN: each message is broadcast once

    • WAN: AddFile is sent once to each server containing lists for the words in the name of the file

  • Unchained

    • no interserver communication

    • 0 login costs

  • Chained

    • interserver communication for queries

    • no interserver communication for logins

    • 0 login costs
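The interserver rules above can be condensed into a small function that counts how many times a single update message (Login, AddFile, or RemoveFile) is sent between servers. This is a sketch under stated assumptions: the function name, the architecture labels, and the `word_servers` parameter (the number of servers holding lists for the words in the file name) are hypothetical, and message sizes are ignored.

```python
def interserver_sends(architecture, network, num_servers, word_servers=0):
    """Number of interserver transmissions for one update message.

    architecture: "chained", "unchained", "full_replication" or "hash"
    network:      "LAN" (broadcast available) or "WAN"
    word_servers: for hash on a WAN, how many servers hold lists for the
                  words in the file name (an input, not computed here)
    """
    if architecture in ("chained", "unchained"):
        return 0  # updates stay on the local server
    if network == "LAN":
        return 1  # one broadcast reaches every server that needs it
    if architecture == "full_replication":
        return num_servers - 1
    if architecture == "hash":
        return word_servers
    raise ValueError(f"unknown architecture: {architecture}")
```

For example, with five servers on a WAN, full replication sends each update four times, while hash sends it only to the servers responsible for the words involved.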


Overall performance

  • Hypothesis: the formulas for each action cost are known

  • Performance metric: UsersPerServer

  • How to compute a global formula for UsersPerServer? (Directly? Too complex.)

  • For each resource i

    • Assume infinite resources of the other 2 types

    • Compute UsersPerServer for the current resource (UsersPerServer_i)

  • Compute min(UsersPerServer_i)
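The bottleneck computation above amounts to taking the minimum over the per-resource limits. A minimal sketch follows; the function name and the three numeric limits are made up for illustration and are not results from the study.

```python
def users_per_server(per_resource_limits):
    """Return (bottleneck resource, overall UsersPerServer).

    per_resource_limits maps each resource to the number of users a server
    could support if the other resources were unlimited.
    """
    bottleneck = min(per_resource_limits, key=per_resource_limits.get)
    return bottleneck, per_resource_limits[bottleneck]

# Hypothetical per-resource limits.
limits = {
    "cpu": 40000,
    "interserver_bandwidth": 90000,
    "user_server_bandwidth": 65000,
}
bottleneck, capacity = users_per_server(limits)
```

The resource that yields the minimum is the one that saturates first, so it also indicates where an upgrade would help most.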


Experiments

  • Results of performance studies

  • Music sharing systems

  • Sharing systems for domains other than music

  • Maximum number of users (throughput, not response time)

  • Architectures = {CHN, FR, HASH, UCH}

  • Login policies = {batch, incremental}

  • Strategies = Architectures X Login policies


  • For MaxResults = 100:

  • QueryLoginRatio

    • nr of logins/sec

    • users supported

    • available files

    • expected nr of results

Music sharing systems behaviour

  • Ex: For Query/Login ratio = 1:

    • Incremental FR = 54203

    • Batch FR = 7281

  • As QueryLoginRatio increases, logins/sec decrease and performance increases

  • Incremental strategies outperform their batch counterparts

  • CHN & UCH perform better than FR & HASH

  • Moving from UCH to CHN conserves performance but increases the returned results

  • Paradox: UCH is more widely used than CHN

  • Sensitivity to QueryLoginRatio


Memory analysis

Memory implications have not been treated so far

Batch strategies are better than their incremental counterparts

Memory = f(NumServers, ActiveFrac)

As NumServers increases, memory increases (for FR)

Memory of incremental = (1/ActiveFrac) × memory of batch

As ActiveFrac increases, incremental strategies come closer to batch

Memory prices may eliminate worries about memory limitations
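The relation between incremental and batch memory stated above (memory of incremental = memory of batch divided by ActiveFrac) can be worked through numerically. The function name and the sample figures are hypothetical and serve only as an arithmetic illustration.

```python
def incremental_memory(batch_memory, active_frac):
    """Memory needed by an incremental strategy, given the batch requirement.

    active_frac is the fraction of known users that are active, in (0, 1].
    """
    if not 0 < active_frac <= 1:
        raise ValueError("active_frac must be in (0, 1]")
    return batch_memory / active_frac

# With a quarter of the users active, incremental needs four times the memory.
quarter_active = incremental_memory(100.0, 0.25)
all_active = incremental_memory(100.0, 1.0)
```

As the active fraction approaches 1, the two requirements coincide, which is the sense in which incremental strategies come closer to batch.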

Small analysis

Ex. 1: QueryLoginRatio = .75 (incremental & batch CHN comparison): (69708, 26828) vs (12268, 28828); take batch

Ex. 2: QueryLoginRatio = .25 (incremental & batch CHN comparison): (52088, 9190) vs (12268, 9190); take incremental


Beyond music…

  • In general we can compute

    • the expected number of results of a query

    • the expected number of servers needed to satisfy the query

  • …using

    • g(): distribution of query frequency

    • f(): distribution of selection power

  • f and g are the inputs to the general query model

  • For music, f and g are exponential (positively correlated), which gives all the preceding results (the more popular a query is, the greater its selection power)

  • What if we have a product inventory?

    • Select * from Product where price>10 (rare query) may return as many results as

    • Select * from Product (common query)

    • No correlation

  • What about an archive-driven company?

    • Rare queries (for old articles) return many results

    • Frequent queries (for new articles) return few results

    • Negative correlation
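The effect of correlation between g and f can be seen in a small numeric sketch. It pairs the same selection-power values with query popularity in two different orders; the values and the function name are hypothetical, and the per-file match probability is used as a simple proxy for result size.

```python
def expected_match_fraction(g, f):
    """Probability that a given file matches a randomly submitted query."""
    return sum(gi * fi for gi, fi in zip(g, f))

g = [0.5, 0.3, 0.2]              # query popularity, most popular first

f_positive = [0.10, 0.05, 0.01]  # popular queries match the most files
f_negative = [0.01, 0.05, 0.10]  # popular queries match the fewest files

positive = expected_match_fraction(g, f_positive)
negative = expected_match_fraction(g, f_negative)
```

With identical f values, the positively correlated pairing yields more matches per query than the negatively correlated one, so under negative correlation more servers would typically have to be consulted to reach the same number of results.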


Performance variation as function of correlation


Final Conclusions

  • Chained

    • Best for music today

    • Good login performance, least memory

    • Poor if many servers are involved

  • Full replication

    • Potentially good in the future, when connections are more stable

  • Hash

    • Has high bandwidth requirements

    • Good in the future, or in systems where servers do not have to exchange large amounts of metadata

  • Unchained

    • Not recommended

    • Returns few results for only a small performance improvement

    • Good when the number of results is not important

  • Incremental policy

    • Good for systems with negative correlation

