Distributed Database Systems

Distributed Database Systems

What is a Distributed Database System ? A distributed database is a collection of databases which are distributed over different computers of a computer network. • Each site has autonomous processing capability and can perform local applications. • Each site also participates in the execution of at least one global application which requires accessing data at several sites.

Access Processor Access Processor Access Processor Multiprocessor Database Computers T T T Application (front-end) computer Interface Processor What we miss here is the existence of local applications, in the sense that the integration of the system has reached the point where no one of the computers (i.e., IFPs & ACPs) is capable of executing an application by itself.

Why Distributed Databases ? • Local Autonomy: permits setting and enforcing local policies regarding the use of local data (suitable for organization that are inherently decentralized). • Improved Performance: The regularly used data is proximate to the users and given the parallelism inherent in distributed systems. • Improved Reliability/Availability: • Data replication can be used to obtain higher reliability and availability. • The autonomous processing capability of the different sites ensures a graceful degradation property. • Incremental Growth: supports a smooth incremental growth with a minimum degree of impact on the already existing sites. • Shareability: allows preexisting sites to share data. • Reduced Communication Overhead: The fact that many applications are local clearly reduces the communication overhead with respect to centralized databases.

Disadvantages of DDBSs • Cost: replication of effort (manpower). • Security: More difficult to control • Complexity: • The possible duplication is mainly due to reliability and efficiency considerations. Data redundancy, however, complicates update operations. • If some sites fail while an update is being executed, the system must make sure that the effects will be reflected on the data residing at the failing sites as soon as the system can recover from the failure. • The synchronization of transactions on multiple sites is considerably harder than for a centralized system.

Distributed DBMS Architecture

NetworkTransparancy • The user should be protected from the operational details of the network. • It is desirable to hide even the existence of the network, if possible. • Location transparency: The command used is independent of the system on which the data is stored. • Naming transparency: a unique name is provided for each object in the database.

Replication & Fragmentation Transparancy • The user is unaware of the replication of framents • Queries are specified on the relations (rather than the fragments). Site A Copy 1 of R1 Copy 1 of R2 Relation R Fragment R1 Site B Copy 2 of R1 Fragment R2 Fragment R3 Fragment R4 Site C Copy 2 of R2

ANSI/SPARC Architecture External view External view External view External Schema Conceptual view Conceptual Schema Internal view Internal Schema Internal view: deals with the physical definition and organization of data. Conceptual view: abstract definition of the database. It is the “real world” view of the enterprise being modeled in the database. External view: individual user’s view of the database.

A Taxonomy of Distributed Data Systems A distributed database can be defined as a logically integrated collection of shared data which is physically distributed across the nodes of a computer network. Distributed data systems Heterogeneous (Multidatabase) Homogeneous Unfederated (no local users) Federated Loosely coupled (interoperable DB systems using export schema) Tightly coupled (/w global schema)

Architecture of a Homogeneous DDBMS Global user view 1 Global user view n A homogeneous DDBMS resembles a centralized DB, but instead of storing all the data at one site, the data is distributed across a number of sites in a network. Global Schema Fragmentation Schema Allocation Schema Local conceptual schema 1 Local conceptual schema n Local internal schema 1 Local internal schema n Local DB 1 Local DB n

Fragmentation Schema & Allocation Schema Fragmentation Schema: describes how the global relations are divided into fragments. Allocation Schema: specifies at which sites each fragment is stored. Example: Fragmentation of global relation R. A B To materialize R, the following operations are required: R=(A B) U ( C D) U E C D E

Homogeneous vs. Heterogeneous Global user • Homogeneous DDBMS • No local users • Most systems do not have local schemas • Heterogeneous DDBMS • There are both local and global users • Multidatabase systems are split into: • Tightly Coupled Systems: have a global schema • Loosely Coupled Systems: do not have a global schema. Local user Local user Multidatabase Management system DBMS DBMS DBMS DBMS

Schema Architecture of a Tightly-Coupled System An individual node’s participation in the MDB is defined by means of a participation schema. Global user view 1 Global user view n Global Conceptual Schema Auxiliary Schema 1 Local Participation Schema 1 Local Participation Schema 1 Auxiliary Schema 1 Local user view 1 Local Conceptual Schema 1 Local Conceptual Schema 1 Local user view 1 Local user view 2 Local Internal Schema 1 Local Internal Schema 1 Local user view 2 Local DB 1 Local DB 1

Auxiliary Schema (1) Auxiliary schema describes the rules which govern the mappings between the local and global levels. • Rules for unit conversion: may be required when one site expresses distance in kilometers and another in miles, … • Rules for handling null values: may be necessary where one site stores additional information which is not stored at another site. • Example: One site stores the name, home address and telephone number of its employees, whereas another just stores names and addresses.

Auxiliary Schema (2) • Rules for naming conflicts: naming conflicts occur when: • semantically identical data items are named differently • DNAME  Department name (at Site 1) • DEPTNAME  Department name (at Site 2) • semantically different data items are named identically. • NAME  Department name (at Site 1) • NAME  Manager name (at Site 2) • Rules for handling data representation conflicts: Such conflicts occur when semantically identical data items are represented differently in different data source. • Example: Data represented as a character string in one database may be represented as a real number in the other database.

Auxiliary Schema (3) • Rules for handling data scaling conflicts: Such conflicts occur when semantically identical data items stored in different databases using different units of measure. • Example: “Large”, “New”, “Good”, etc. These problems are called domain mismatch problems

Loosely-Coupled Systems(Interoperable Database Systems) Global user view 1 Global user view 2 Global user view 3 Local user view 1 Local Conceptual schema 1 Local Conceptual Schema 2 Local Conceptual Schema n Local user view 2 Local internal schema 1 Local internal Schema 2 Local internal Schema n Local DB n Local DB 1 Local DB 2

Loosely-Coupled Systems Global user view 1 Global user view 2 Global user view m Export Schema n Export schema 1 Export schema 2 Export Schema 3 Local user view 1 Local Conceptual schema 1 Local Conceptual Schema 2 Local Conceptual Schema n Local user view 2 Local internal schema 1 Local internal Schema 2 Local internal Schema n Global user views are constructed using powerful query language such as MSQL Local DB n Local DB 1 Local DB 2

Integration of Heterogeneous Data Models • Provide bidirectional translators between all pairs of models • Advantage: no need to learn another data model and language • Disadvantage: requires n(n-1) translators, where n is the number of different models. • Adopt a single model (called canonical model) at the global level and map all the local models onto this model • Advantage: requires only 2n translators • Disadvantage: translations must go through the global model. (The 2nd approach is more widely used)

Distributed Database Design • Top-Down Approach:The database system is being designed from scratch. • Issues: fragmentation & allocation • Bottom-up Approach: Integrating existing databases into one database • Issues: Design of the export and global schemas.

TOP-DOWN DESIGN PROCESS Requirements Analysis Defining the interfaces for end users Entity analysis + functional analysis System Requirements (Objectives) Conceptual design View integration View Design External Schema Definitions Global conceptual schema Access information Distribution Design Fragmentation & allocation Local Conceptual Schemas Maps the local conceptual schemas to the physical storage devices. Physical Design Physical Schema

Design Consideration (1) • The organization of distributed systems can be investigated along three dimensions: • Level of sharing • No sharing: Each application and its data execute at one site. • Data sharing: Programs are replicated at all sites, but data files are not. • Data + Program Sharing: Both data and programs may be shared.

Design Consideration (2) • Assess Pattern • Static: Access patterns do not change. • Dynamic: Access patterns change over time. • Level of Knowledge • No information • Partial information: Access patterns may deviate from the predictions. • Complete information: Access patterns can reasonably be predicted.

Fragmentation Alternatives Vertical Partitioning J JNO JNAME BUDGET LOC J1 Instrumental 150,000 Montreal J2 Database Dev. 135,000 New York J3 CAD/CAM 250,000 New York J4 Maintenance 350,000 Paris JNO BUDGET J1 150,000 J2 135,000 J3 250,000 J4 310,000 Horizontal Partitioning JNO JNAME LOC J1 Instrumentation Montreal J2 Database Devl New York J3 CAD/CAM New York J4 Maintenance Paris J1 JNO JNAME BUDGET LOC J1 Instrumental 150,000 Montreal J2 Database Dev. 135,000 New York J2 JNO JNAME BUDGET LOC J3 CAD/CAM 150,000 Montreal J4 Maintenance. 310,000 Paris

Why fragment at all? • Reasons: • Interquery concurrency • Intraquery concurrency • Disadvantages: • Vertical fragmentation may incur overhead. • Attributes participating in a dependency may be allocated to different sites. • Integrity checking is more costly.

Degree of Fragmentation • Application views are usually subsets of relations. Hence, it is only natural to consider subsets of relations as distribution units. • The appropriate degree of fragmentation is dependent on the applications.

Correctness Rules • Vertical Partitioning • Lossless decomposition • Dependency preservation • Horizontal Partitioning • Disjoint fragments • Allocation Alternatives • Partitioning: No replication • Partial Replication: Some fragments are replicated • Full Replication: Database exists in its entirety at each site

Notations S Title SAL L1 E J LOC ENO ENAME TITLE JNO JNAME BUDGET L2 L3 G ENO JNO RESP DUR L1: 1-to-many relationship Owner(L1) = S = Source relation Member(L1) = E = Target relation

Simple Predicates • Given a relation R(A1, A2, …, An) where Ai has domain Di, a simple predicate pj defined on R has the form • pj: AiValue • where and Value Example: J JNO JNAME BUDGET LOC J1 Instrumental 150,000 Montreal J2 Database Dev. 135,000 New York J3 CAD/CAM 250,000 New York J4 Maintenance 350,000 Orlando • Simple predicates: • p1: JNAME = “Maintenance” • P2: BUDGET = “200,000” • Note: A simple predicate defines a data fragment

MINTERM PREDICATE • Given a set of simple predicates for relation R. • P = {p1, p2, …, pm} • The set of minterm predicates • M = {m1, m2, …, mn} • is defined as • M = {mi | mi = } • where Possible simple predicates: P1: TITLE=“Elect. Eng.” P2: TITLE=“Syst. Analy” P3: TITLE=“Mech. Eng.” P4: TITLE=“Programmer” P5: SAL<=35,000 P6: SAL > 35,000 Some corresponding minterm predicates: A minterm predicate defines a data fragment

Primary Horizontal Fragmentation A primary horizontal fragmentation is defined by a selection operation on the owner relations of a database schema. E J ENO ENAME TITLE JNO JNAME BUDGET LOC L2 L3 G Owner(L3) = J ENO JNO RESP DUR A possible fragmentation of J is defined as follows:

Horizontal Fragments Thus, a horizontal fragment Ri of relation R consists of all the tuples of R that satisfy a minterm predicate mi. There are as many horizontal fragments (also called minterm fragments) as there are minterm predicates.

Completeness (1) A set of simple predicate Pr is said to be complete if and only if there is an equal probability of access by every application to any two tuples belonging to any minterm fragment that is defined according to Pr. Simple Predicates Minterm Fragments Applications A1 ≥ k1 A2 = k2 A3 ≤ k3 A4=k4 p1 F1 A1 p1 p3 A2 p3 F2 A3 F3 A4 The fragments look homogeneous

Completeness (2) A set of simple predicate Pr is said to be complete if and only if there is an equal probability of access by every application to any two tuples belonging to any minterm fragment that is defined according to Pr. Case 1: The only application that accesses J wants to access the tuples according to the location. The set of simple predicates LOC=“Montreal”, Pr = LOC=“New York”, LOC=“Orlando” LOC=“Montreal” J1 LOC=“New York” J J2 is complete because each tuple of each fragment has the same probability of being accessed. J3 LOC=“Orlando”

Completeness (3) Example: J1 JNO JNAME BUDGET LOC 001 Instrumental 150,000 Montreal JNO JNAME BUDGET LOC 004 GUI 135,000 New York 007 CAD/CAM 250,000 New York Note: Completeness is a desirable property because a complete set defines fragments that are not only logically uniform in that they all satisfy the minterm predicate, but statistically homogeneous. J2 J3 JNO JNAME BUDGET LOC 003 Database Dev. 310,000 Orlando • Case 2: There is a second application which accesses only those project tuples where the budget is less than $200,000. • Since tuple “004” is accessed more frequently than tuple “007”, Pr is not complete. • To make the the set complete, we need to add (BUDGET< 200,000) to Pr.

Minimality • Relevant: • Let mi and mjbe two almost identical minterm predicates: • mi = p1Λp2Λ p3 fragment fi • mj = p1Λ¬ p2Λ p3 fragment fj p2 is relevant if and only if Access frequency Cardinality • That is, there should be at least one application that accesses fi and fj differently. • i.e., The simple predicate pi should be relevant in determining a fragmentation. • Minimal • If all the predicates of a set Pr are relevant, Pr is minimal.

A Complete and Minimal Example • Two applications: • One application accesses the tuples according to location. • Another application accesses only those project tuples where the budget is less than $200,000. • Case 1: Pr={Loc=“Montreal”, Loc=“New York”, Loc=“Orlando”, BUDGET<=200,000,BUDGET>200,000} is • complete and minimal. • Case 2: If, however, we were to add the predicate JNAME= “Instrumentation” to Pr, the resulting set would not be minimal since the new predicate is not relevant with respect to the applications.

BUDGET<=200,000 JNAME = “Instrument” J11 J121 LOC=“Montreal” J12 J122 J1 BUDGET>200,000 JNAME!  “Instrument” J LOC=“New York” BUDGET<=200,000 J2 J21 J22 LOC=“Orlando” J3 BUDGET>200,000 [ JNAME = “Instrument” ] is not relevant. BUDGET<=200,000 J31 J32 BUDGET>200,000 Relevant Irrelevant

Application Information • Qualification Information • The fundamental qualification information consists of the predicates used in user queries (i.e., “where” clauses in SQL). • 80/20 rule: 20% of user queries account for 80% of the total data access.  One should investigate the more important queries. • Quantitative Information • Minterm Selectivity sel(mi): number of tuples that would be accessed by a query specified according to a given minterm predicate. • Access Freequency acc(qi): the access frequency of queries in a given period. Qualitative information guides the fragmentation activity. Quantitative information guides the allocation activity.

Determine the set of meaningful minterm predicates • Applications: • Take the salary and determine a raise accordingly. • The employee records are managed in two places, one handling the records of those with salary less than or equal to $30,000 and the other handling the records of those who earn more than $30,000. Pr={p1: SAL<=30,000, p2: SAL>30,000} is complete and minimal. The minterm predicates: is contradictory is contradictory Therefore, we are left with M = {m2, m3} Implications:

Invalid Implications J JNO JNAME BUDGET LOC J1 Instrumental 150,000 Montreal J2 Database Dev. 135,000 New York J3 CAD/CAM 250,000 New York J4 Maintenance 350,000 Orlando Simple predicates p1: LOC=“Montreal” p2: LOC=“New York” p3: LOC=“Orlando” p4: BUDGET<=200,000 p5: BUDGET>200,000 VALID Implications INVALID Implications Implications should be defined according to the semantics of the database, not according to the current values.

Compute Complete & Minimal Set Rule:a relation or fragment is partitioned “into at least two parts which are accessed differently by at least one application. Relevant:a simple predicate which satisfies the above rule, is relevant. • Repeat until the predicate set is complete • Find a simple predicate pi that is relevant • Determine minterm fragment fi according to pi • Accept pi and fi • Remove any pj and fj from acceptance list if pj becomes nonrelevant /* the list is minimal */ • Determine the set of minterm predicates M (using the acceptance list) • Determine the set of implications I (among the acceptance list) • For each mi in M, remove mi if it is contradictory according to I

Derived Horizontal Fragmentation Derived fragmentation is used to facilitate the join between fragments. In some cases, the horizontal fragmentation of a relation cannot be based on a property of its own attributes, but is derived from the horizontal fragmentation of another relation.

Benefits of Derived Fragmentation Primary Fragmentation: PAY (TITLE, SAL) EMP (ENO, ENAME, TITLE) EMP1 = EMP SJ PAY1 EMP2 = EMP SJ PAY2 • Using Derived Fragmentation: EMP1 PAY1 EMPi and PAYi can be allocated to the same site. EMP2 PAY2 Not using derived fragmentation: one can divide EMP into EMP1 and EMP2 based on TITLE and divide PAY into PAY1, PAY2, PAY3 based on SAL. To join EMP and PAY, we have the following scenarios. EMP1 EMP2 PAY1 PAY2 More communication overhead ! PAY3

Derived Fragmentation • How do we fragment EMP_PROJ ? • Semi-Join with EMP, or • Semi-Join with PROJ • Criterion: Suport the more-frequent join operation. EMP (ENO, ENAME, TITLE) PROJ (PNO, PNAME, BUDGET) EMP_PROJ (ENO, PNO, RESP, DUR)

Star Relationships • Design the primary horizontal fragmentation for SPJ. • Derive the derived fragmentation designs for S, P, and J accordingly. • Si = S SJSNAM SPJi • Pi = P SJPNAM SPJi • Ji = J SJSNAM SPJi S (SNUM, …) P (PNUM, …) SPJ (SNUM, PNUM, JNUM, …) J (JNUM, …)

Chain Relationships • Design the primary fragmenation for R1. • Derive the derived fragmentation for Rk as follows: • Rk = Rk SJRKFK=R(k-1)PK R(k-1) • for 2  k  n in that order. R1 (R!PK, …) R2 (R2PK, R1FK, …) R3 (R3PK, R2FK, …) . . .

VERTICAL FRAGMENTATION Purpose: Identify fragments Ri such that many applications can be executed using just one fragment. Advantage: When many applications which use R1 and many applications which use R2 are issued at different sites, fragmenting R avoids communication overhead. • Vertical partitioning is more complicated than horizontal partitioning: • Vertical Partitioning: The number of possible fragments is equal to where m is the number of nonprimary key attributes • Horizontal Partitioning: 2n possible minterm predicates can be defined, where n is the number of simple predicates in the complete and minimal set Pr.

Vertical Fragmentation Approaches • Greedy Heuristic Approaches: • Split Approach: Global relations are progressively split into fragments. • Grouping Approach: Attributes are progressively aggregated to constitute fragments. • Correctness: • Each attribute of R belongs to at least one fragment. • Each fragment includes either a key of R or a “tuple identifier”.

Distributed Database Systems