
Distributed Databases




  1. Distributed Databases John Ortiz

  2. Distributed Databases • Distributed Database (DDB) is a collection of interrelated databases interconnected by a computer network • Distributed Database Management System (DDBMS) is software which manages a distributed database • World Wide Web technology does not yet constitute a DDB by our definition

  3. Advantages of a DDB • Supports various levels of transparency • Distribution (network) transparency • Degree to which user is unaware of the networked nature of the DB • Replication transparency • Degree to which user is unaware of copies of the DB • Fragmentation transparency • Degree to which user is unaware the DB is broken into pieces

  4. Advantages of a DDB • Increased Reliability and Availability • Reliability – probability a system is running at a particular point in time • Availability – probability a system is continuously available during a time interval

  5. Advantages of a DDB • Improved Performance • Supports data localization – data is kept near where it is most often used to reduce the effects of network delay • Easier Expansion • Adding more data, increasing DB size, and adding resources are easier • Reduced Operation Costs (when compared with a mainframe system) • Cheaper to add workstations than a new mainframe computer

  6. Advantages of a DDB • No Single Point of Failure • When one computer fails, others can take its place

  7. Disadvantages of a DDB • Significant increase in complexity • Normalization, query optimization, security, transaction processing, concurrency control, crash recovery, etc. ALL become much more difficult to handle • Increased storage requirements • Since multiple copies of various portions of the DB exist, more storage space is required

  8. Data Fragmentation • Fragmentation is the division of the database into pieces stored at different sites • Horizontal Fragmentation – a subset of tuples in a particular relation • The result of a query which SELECTS some tuples, but not others, is a horizontal “fragment” • In a DDB, the output of such a query may be stored as a separate DB at a separate site • Requires a UNION to recombine information
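A minimal sketch of horizontal fragmentation, using one in-memory SQLite database as a stand-in for several sites. The EMPLOYEE column layout, the sample rows, and the fragment names EMP_DEPT5 and EMP_OTHER are assumptions made for illustration; they are not from the slides.

```python
import sqlite3

# One in-memory database stands in for the whole DDB; in a real DDB each
# fragment table would be stored at a different site (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE (E_Id INTEGER PRIMARY KEY, E_Name TEXT, DeptNum INTEGER)"
)
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [(1, "Ada", 5), (2, "Bob", 4), (3, "Cy", 5), (4, "Dee", 1)],
)

# Each horizontal fragment is the result of a SELECT that keeps some tuples.
conn.execute("CREATE TABLE EMP_DEPT5 AS SELECT * FROM EMPLOYEE WHERE DeptNum = 5")
conn.execute("CREATE TABLE EMP_OTHER AS SELECT * FROM EMPLOYEE WHERE DeptNum <> 5")

# A UNION of the fragments recombines the original relation.
recombined = conn.execute(
    "SELECT * FROM EMP_DEPT5 UNION SELECT * FROM EMP_OTHER ORDER BY E_Id"
).fetchall()
original = conn.execute("SELECT * FROM EMPLOYEE ORDER BY E_Id").fetchall()
```

The two selection conditions are chosen so that every tuple lands in exactly one fragment; with overlapping conditions the UNION would still rebuild the relation, but some tuples would be stored twice.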

  9. Data Fragmentation • Vertical Fragmentation – a subset of attributes of a particular relation • The result of a query which PROJECTS certain, specific attributes • Requires an outer join (or an outer union) to recombine information • Hybrid Fragmentation – can you guess? • Includes both horizontal and vertical fragmentation • Complete fragmentation simply means all tuples/attributes are in the result
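A small sketch of vertical fragmentation in plain Python. The sample tuples and the helper names `project` and `recombine` are hypothetical. The sketch also assumes that every vertical fragment carries the key attribute (here E_Id), which is the usual way to make recombination possible but is not stated on the slide.

```python
# Hypothetical EMPLOYEE tuples; E_Id is assumed to be the primary key.
employees = [
    {"E_Id": 1, "E_Name": "Ada", "DeptNum": 5, "Salary": 90},
    {"E_Id": 2, "E_Name": "Bob", "DeptNum": 4, "Salary": 70},
]


def project(rows, attrs):
    """Vertical fragment: keep only the listed attributes of every tuple."""
    return [{a: row[a] for a in attrs} for row in rows]


# Both fragments carry the key so the tuples can be matched up again.
frag_names = project(employees, ["E_Id", "E_Name"])
frag_pay = project(employees, ["E_Id", "DeptNum", "Salary"])


def recombine(left, right, key):
    """Outer join on the key: a tuple present in only one fragment is kept."""
    merged = {}
    for row in left + right:
        merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


rebuilt = recombine(frag_names, frag_pay, "E_Id")
```

A hybrid fragment would apply a selection first and then `project` the surviving tuples.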

  10. Data Fragmentation • A fragmentation schema is a definition of the set of fragments that includes all attributes and tuples sufficient to reconstruct the DB • An allocation schema describes which fragments are at what sites

  11. Data Replication • Replication is the creation of copies of the DB • A DDB may be fully replicated (a copy of the entire DB is made at each site) • Why would you want to make a full copy of a DDB? • A DDB may have no replication (each fragment is stored at one and only one site) • Naturally, a DDB may be partially replicated • A replication schema is a description of what pieces are copied at which sites
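One possible way to write down an allocation or replication schema is a mapping from each fragment to the sites that store it. The sketch below classifies such a mapping as full, none, or partial replication. The fragment names, site names, and the function `replication_kind` are all hypothetical.

```python
def replication_kind(allocation, sites):
    """Classify an allocation as 'full', 'none', or 'partial' replication.

    allocation maps each fragment name to the list of sites storing a copy.
    """
    if all(set(placed) == set(sites) for placed in allocation.values()):
        return "full"      # every fragment is stored at every site
    if all(len(set(placed)) == 1 for placed in allocation.values()):
        return "none"      # every fragment is stored at exactly one site
    return "partial"


sites = ["site1", "site2", "site3"]
allocation = {
    "EMP_DEPT5": ["site1"],
    "EMP_OTHER": ["site2", "site3"],
    "DEPARTMENT": ["site1", "site2", "site3"],
}
kind = replication_kind(allocation, sites)  # "partial"
```

With a single site the two extremes coincide; this sketch reports that case as "full".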

  12. Data Replication • Replication creates new consistency and redundancy problems • Every piece of data that is replicated is redundant, and therefore subject to inconsistency • These copies may be updated separately, which causes inconsistency • How much inconsistency is acceptable?

  13. Synchronization • Synchronization is the process of updating the individual replicas • Since pieces are stored in different places, the DDB must periodically be made consistent • Synchronization can be expensive in terms of network resources and time • It is not simply copying one replica to another – the most recent updates on both copies being synchronized must be accounted for • P.775 - 778 in the text has an example of a DDB
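The point that synchronization is more than copying one replica over another can be illustrated with a small merge. The sketch assumes each entry carries a timestamp and resolves conflicts by keeping the newer entry (a "last writer wins" policy). That policy, the function name `synchronize`, and the sample address-book entries are assumptions for illustration; the slides do not say which policy is used, and deletions are not handled here.

```python
def synchronize(replica_a, replica_b):
    """Return the merged state; both replicas should then be set to it.

    Each replica maps key -> (value, timestamp). For every key the entry with
    the larger timestamp wins, so newer updates on either side survive.
    On a timestamp tie the entry from replica_a is kept.
    """
    merged = dict(replica_a)
    for key, (value, stamp) in replica_b.items():
        if key not in merged or stamp > merged[key][1]:
            merged[key] = (value, stamp)
    return merged


# Hypothetical address-book replicas at two bases.
base1 = {"ortiz": ("ortiz@base1.example", 10), "smith": ("smith@base1.example", 12)}
base2 = {"ortiz": ("ortiz@base2.example", 15), "lee": ("lee@base2.example", 11)}

merged = synchronize(base1, base2)
```

Copying `base1` over `base2` would have lost both the newer `ortiz` entry and the `lee` entry; the merge keeps updates from both sides.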

  14. US Air Force Email • We have noted in the past that there are many types of databases such as spreadsheets, address books, and even documents (such as MS Word) • Consider the AF with approximately 500,000 people who all have email addresses and need to communicate • They have constructed a global email address book and make use of replication • The AF is divided into levels: global, command, base

  15. US Air Force Email • Initially the bases were each set up with email and interconnected via the network • However, you had to know the email address of anyone at a different base • Eventually, each command (a group of related bases) set up an address book covering all of its bases • Each base maintains a complete replica of the entire command's address book • Why not just a piece?

  16. US Air Force Email • The DB is synchronized each night • So, when someone moves, their email address is removed from the local copy • All the other bases will still have that “old” email address until the next day, at which point the DDB is consistent again • I believe that now the entire AF address book is available at each base • Not sure how often it is synchronized, perhaps weekly

  17. US Air Force Email • Search for an email address is quick since a local copy is kept • This reduces network traffic considerably compared with everyone having to search a centralized DB for email addresses

  18. Query Processing in DDB • When we looked at query processing before, the largest delay was with the disk • Now, that same concept is extended to include network delay – which can be much longer • Suppose the EMPLOYEE DB (10,000 records, 100 bytes each) is at site 1, and the DEPARTMENT DB (100 records, 35 bytes each) is at site 2 • YOU are at site 3 • Assume result is 400,000 bytes

  19. Query Processing in DDB • SELECT E_Name FROM EMPLOYEE WHERE DeptNum = 5 • There are 3 strategies: • 1) Transfer both DBs to site 3 and perform the query there (1,003,500 bytes transferred) • 2) Transfer EMPLOYEE to site 2, perform the query, transfer the result to site 3 (1,400,000 bytes transferred) • 3) Transfer DEPARTMENT to site 1, perform the query, transfer the result to site 3 (403,500 bytes transferred)
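The byte counts for the three strategies follow directly from the relation sizes given earlier (10,000 records of 100 bytes, 100 records of 35 bytes, and a 400,000-byte result). The short calculation below reproduces them; the variable names are illustrative only.

```python
EMPLOYEE_BYTES = 10_000 * 100   # 1,000,000 bytes stored at site 1
DEPARTMENT_BYTES = 100 * 35     # 3,500 bytes stored at site 2
RESULT_BYTES = 400_000          # assumed size of the query result

# Bytes that must cross the network under each strategy.
strategies = {
    "1: ship both relations to site 3": EMPLOYEE_BYTES + DEPARTMENT_BYTES,
    "2: ship EMPLOYEE to site 2, result to site 3": EMPLOYEE_BYTES + RESULT_BYTES,
    "3: ship DEPARTMENT to site 1, result to site 3": DEPARTMENT_BYTES + RESULT_BYTES,
}

best = min(strategies, key=strategies.get)
for name, cost in strategies.items():
    print(f"{name}: {cost:,} bytes")
```

Strategy 3 moves the least data because the small relation is the one that is shipped.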

  20. Query Processing using Semijoin • Rather than sending the entire set of records to be joined, we could just send the joining attribute(s) • Then the join is performed, and the join attributes, as well as the projected attributes, can be transferred to the requesting site • The semijoin is symbolized as: R ⋉ S • NOTE: R ⋉ S ≠ S ⋉ R • Substantially reduces amount of data transferred
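A minimal sketch of the semijoin idea in plain Python. The relations R and S, their attributes (including D_Name), and the sample rows are hypothetical. Only the set of join-attribute values is treated as crossing the network, which is where the savings come from.

```python
# Hypothetical relations: R is stored at site 1, S at site 2.
R = [  # EMPLOYEE(E_Name, DeptNum)
    {"E_Name": "Ada", "DeptNum": 5},
    {"E_Name": "Bob", "DeptNum": 4},
    {"E_Name": "Cy", "DeptNum": 9},    # no matching department
]
S = [  # DEPARTMENT(DeptNum, D_Name)
    {"DeptNum": 4, "D_Name": "Sales"},
    {"DeptNum": 5, "D_Name": "Research"},
    {"DeptNum": 7, "D_Name": "Legal"},  # no employees
]


def semijoin(left, right, attr):
    """left semijoin right: tuples of left that have a match in right on attr."""
    keys = {row[attr] for row in right}  # only this set crosses the network
    return [row for row in left if row[attr] in keys]


r_semi_s = semijoin(R, S, "DeptNum")  # reduced R: only tuples that will join
s_semi_r = semijoin(S, R, "DeptNum")  # reduced S: a different relation

# The full join now only needs the reduced R shipped to the site holding S.
joined = [{**r, **s} for r in r_semi_s for s in S if r["DeptNum"] == s["DeptNum"]]
```

The two results differ both in attributes and in tuples, which is why the semijoin is not commutative.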

  21. Concurrency Control and Recovery • Dealing with multiple copies • Failure of individual sites • Failure of network • Distributed commit is more complicated • Deadlock is more difficult to detect and prevent • A number of techniques have been proposed to deal with these problems

  22. Distinguished Copy • The locks for a data item are associated with the distinguished copy • There are several distinguished copy variations: • Primary site (with backup) • One site is the chosen one and coordinates locking activities (centralized locking) • Primary copy • Various fragments at different sites are chosen as the distinguished copy – this distributes the locking problem
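The difference between the primary site and primary copy variations can be sketched as a routing table from data items to the site holding each distinguished copy. The class below is a single-process simulation with exclusive locks only; `LockCoordinator`, the item names, and the site names are hypothetical, and real systems would also need shared locks, backups, and messaging between sites.

```python
class LockCoordinator:
    """Routes every lock request to the site holding the distinguished copy."""

    def __init__(self, distinguished_site):
        # item -> site that holds its distinguished copy (hypothetical layout)
        self.distinguished_site = distinguished_site
        self.holders = {}  # (site, item) -> transaction holding the lock

    def lock(self, txn, item):
        """Try to take an exclusive lock; return True if txn now holds it."""
        site = self.distinguished_site[item]
        holder = self.holders.setdefault((site, item), txn)
        return holder == txn  # False means another transaction holds it

    def unlock(self, txn, item):
        """Release the lock if txn is the current holder."""
        site = self.distinguished_site[item]
        if self.holders.get((site, item)) == txn:
            del self.holders[(site, item)]


# Primary site: one site coordinates the locks for every item.
primary_site = LockCoordinator({"X": "site1", "Y": "site1"})

# Primary copy: distinguished copies, and so the locking load, are spread out.
primary_copy = LockCoordinator({"X": "site1", "Y": "site2"})

first = primary_copy.lock("T1", "X")    # granted
second = primary_copy.lock("T2", "X")   # refused while T1 holds the lock
primary_copy.unlock("T1", "X")
third = primary_copy.lock("T2", "X")    # granted after the release
```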

  23. Distributed Recovery • Very complex • Suppose that X sends a request to Y – there may be a number of reasons the request was not granted • Message was never delivered • Site Y is down • Site Y sent a response but the response was not delivered
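A toy model of why these cases are hard to handle: from site X's point of view, all three failures look the same. The function `observe` and its boolean parameters are hypothetical and only model what X can see after waiting for a reply.

```python
def observe(request_delivered, y_is_up, response_delivered):
    """What site X sees after sending a request to site Y and waiting."""
    if request_delivered and y_is_up and response_delivered:
        return "response"
    return "timeout"  # X cannot tell which of the failures occurred


cases = {
    "message never delivered": observe(False, True, True),
    "site Y is down": observe(True, False, True),
    "response lost": observe(True, True, False),
}
```

In the last case site Y may already have acted on the request, so X cannot simply assume that nothing happened.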

  24. Summary • Re-read the first 23 slides! • Advantages/Disadvantages of a DDB • The 3 Transparencies: network, replication, fragmentation • Fragmentation • Replication and Synchronization • Query Processing in a DDB • Semijoin • Concurrency Control and Recovery

  25. Primary Site Technique
