Database Systems & Peer-2-Peer Networks

Database Systems & Peer-2-Peer Networks By Chibuike Muoh

Introduction • What is a Peer-2-Peer? According to [1] P2P is • a network that relies on the computing power and bandwidth of the participants (referred to as nodes) in the network rather than concentrating it in a relatively low number of servers. • Examples of P2P and its applications include • Content sharing networks such as Gnutella, Bittorrent etc. • Distributed processing such as computing clusters etc.

Introduction contd. • In essence, P2P networks are an alternative to the client/server architecture because they provide scalability and fault tolerance unlike the client/server model that is hierarchical and introduces a single point of failure for the system. • Other definitions of P2P • A P2P system is a collection of distributed computing nodes of equal roles or capabilities, exchanging information and services directly with each other

Introduction: Advantages of P2P • The P2P system of distributed computing is very desirable because: • Ad-hoc and dynamic membership means P2P systems can easily scale up/down as the number of peers grow (self organizing) • Better utilizes bandwidth since the peers communicate directly with each other • Inherent robustness and load balancing • Fault tolerant since no one node is more important the other • Richness and diversity in services/resource available in the network.

Types of P2P Systems • Currently there are (2) main architecture of content sharing P2P networks • Pure P2P • Peers have equal role and capabilities (act as both the client and server in the network) • Employs no centralized servers for global indexing and searching. • Conversely there is no concept of login • Unrestrictive membership qualification • Generally more dynamic and less venerable to downtimes. • E.g. Gnutella and Freenet

Types of P2P Systems • Hybrid P2P • Employs a system of central server(s) solely for the purpose of global indexing of resources in the system • Typically has restrictive membership usually requiring registration and logins • Yield better performance at the expense of the system being limited by the availability of the indexing servers • The peers in the system are responsible for hosting the resources available in the system (servers only index the resources) • E.g. Napster

P2P vs. Client server • For the purpose of this presentation, the key difference to note between P2P and the client server model is the way both systems handle query evaluation • In traditional client server model: the query must be either posted to the data store which then post back a response (e.g. traditional database systems, warehouses) or the data store could be posted to the query originator to evaluate for results (e.g. sensor and filter networks) • In the case of P2P system: each peer receives a copy of the query, evaluates against its own data store and post back the results. The data originator then aggregates the responses to the original query

Technical Challenges for P2P Systems • It is difficult to predict or reason about the location/quality of the systems resources due to its dynamicity • In the case of a Hybrid system, there is the risk of information explosion due to the fact the system has to maintain a global catalog/index of an increasingly changing number of available resources

Technical Challenges for P2P Systems • Queries in P2P systems are not flexible in that they only allow searches or lookup of resources by name or identifier. This limiting semantics of the system does not allow the user to fully explore the full breadth of the system’s resources. • Updates to a resource go unannounced to the peers with copies of the original data thus littering the system with inconsistent copies

Technical Challenges for P2P Systems • Data placement and availability is another area of problem in current P2P systems. • In current P2P system data placement is mostly “demand driven” and as such network bandwidth is utilized inefficiently (hot spots).

Data placement in P2P systems

Data placement (a.k.a)Rendezvous problem [4] • In a P2P system with nodes connected by a network with limited bandwidth on each link. Peers are typically engaged in the following roles • Data originator/Service provider: provides resources, services or stores content in the P2P system. • Query evaluator/Rendezvous peer: a peer that stores materialized views and/or evaluates queries • Query initiator: a peer that acts a client in the network and poses new queries in request for a service or resource (a node may initiate new queries on behalf of a query it is attempting to evaluate)

Data placement contd. • As you can reason, there is an associated cost of answering queries in a P2P system. These cost include • Transfer cost of query to storage provider(s)/data originator(s) • Resource utilization cost at query evaluator and other participating nodes (workload) • Cost to transfer the results back to the query initiator • More formally, • The Data placement problem in P2P is to distribute data and work so the full query workload is answered with lowest possible cost under existing resource and bandwidth constraints

Data placement in P2P • Thus in traditional P2P, to determine which nodes contain the matches for our query typically the best solution was to flood the whole network • thereby exacerbating the data placement problem in P2P.

Data placement in P2P contd. • On of the more popular strategy to minimize data placement cost in P2P is Distributed Hash tables • DHT forms an abstraction to overlay networks that provide a key-based routing scheme i.e. exact-match search (e.g. given an object identifier oidathe query q(oa) at node njreturns object oa). • Using DHT, the recipient of a query/message is determined dynamically by a distributed routing algorithm • One of the main drawbacks of DHT is that they only support exact-match search using identifiers rather than keyword search (although that functionality can be built atop it).

Static data placement problem • Given a graph G describing a network of peers, the static data placement problem is to perform data placement with optimal cost, where queries are zero-cost object lookups. • Theorem [3]: In a given graph G, the static placement problem is NP-complete, even if all the queries in the workload in G are object queries. • Proof of this theorem is based on a reduction from the vertex-cover problem.

What can P2P learn from database systems • Some of the problems inherent in P2P today are similar to issues addressed (or are being addressed) by the data management community • This includes issues such as • Semantics, • view materialization • query evaluation • data transformation and relationships

What can P2P learn from database systems • One of the main problem at the heart of any P2P system is the cost of data placement • The cost associated with data placement is more like the issue of view materialization in database (warehouses) which is an hotly research topic in the database community. P2P system design would benefit greatly from research in this area. • Also beneficial to reducing data placement cost are lessons learned from the area of multi-query optimization in distributed database. But there some main differences such as the absence of a sole administrator/controller and no centralize database schema in P2P.

What can P2P learn from database systems • Accessing data in hierarchical granularity levels is another area from which P2P system could learn from database systems

What can P2P learn from database systems • Currently, data within a P2P system can only be accessed at the atomic granularity level e.g. a complete audio MP3 file. • But in reality, multiple MP3 files can be grouped into an album, and albums into collections • Thus the data placement problem at this level is that current P2P system only allow placing a single file or an entire album as atomic object at a peer • This is unlike database systems where data can be expressed as a relation between other groups of data or as an aggregation before being displayed

Other Issues in P2P system design • In P2P systems any one peer can originate data/resource. This high level of heterogeneity influences the degree to which the P2P system can ensure uniform global semantics for the data/resources. • One possible solution to this problem might be use a moderator system much like in bittorrent. • This system/model guarantees that data/resource injected into the system are consistent and ensures integrity of the resources in the system

Other Issues in P2P system design • Freshness and update consistency in P2P is an issue in P2P design there is no way of propagating updates from the data originator to storage providers that have materialized view of the data. • Some possible solution would be a timeout/expiration-based protocol much like in DNS system or web caches. • Has lower overhead then invalidation messages (client/server).

Other Issues in P2P system design • Extent of knowledge sharing is another issue in in P2P. It relates to the question of how much knowledge is available to the system during its query optimization process i.e. choosing a query evaluation strategy that can identify nodes that have materialized views which can speed up query processing. • A simple strategy would be to use a centralized catalog (much like Napster’s hybrid P2P system). But this would introduce a single point of failure in the system • An alternative solution might to arrange the P2P system in a hierarchical organization as in DNS or LDAP with each peer holding some fragment of the global catalog. • The basic challenge in this scheme is to achieve a degree of consistency due to the dynamicity of the system.

Resources [1] http://en.wikipedia.org/wiki/Peer-to-peer [2] BeverlyYang, Hector Garcia-Molina Comparing Hybrid Peer-to-Peer Systems. VLDB 2001 [3] Steven Gribble, Alon Halevy, Zachary Ives, Maya Rodrig, Dan Suciu. What Can Databases do for Peer-to-Peer? WebDB 2001. [4] Wesley W. Terpstra, Stefan Behnel, Ludger Fiege, Jussi Kangasharju, and Alejandro Buchmann. Bit Zipper Rendezvous [5] Zheng Zhang, Mallik Mahalingam, Zhichen Xu, Wenting Tang. Scalable, Structured Data Placement over P2P Storage Utilities [6] http://en.wikipedia.org/wiki/Distributed_hash_table

Database Systems & Peer-2-Peer Networks