Distributed Database Systems: Integration vs. Centralization

Parallel and Distributed Database Systems
Univ.-Prof. Dr. Peter Brezany, Institut für Scientific Computing, Universität Wien
Tel. 4277 39425
Office hours: Tue, 13.00-14.00
Course portal: www.par.univie.ac.at/~brezany/teach/gckfk/300658.html

Motivation (figure labels): business; medicine; scientific experiments; simulations; Earth observations; data and data exploration cloud.

The Knowledge Discovery Process (figure): Cleaning and Integration -> Data Warehouse -> Selection and Transformation -> Data Mining -> Evaluation and Presentation -> Knowledge. Also labeled in the figure: OLAP, OLAP Queries, Online Analytical Mining.

Data Preprocessing Fig. 3.1

EcoGRID Sketch (figure labels): Distributed Data; Distributed Applications; Distributed Data Mining; Reporting; Biodiversity; Waste; Popular Presentation; Statistics; Air; Soil; Flow Analysis; Prediction Models; Emissions; Water; Geostatistics; …; Forests; Common Ontology.

Management of TBI Patients
Traumatic brain injuries (TBIs) typically result from accidents in which the head strikes an object. The treatment of TBI patients is very resource-intensive. The trajectory of TBI patient management:
• Trauma event
• First aid
• Transportation to hospital
• Acute hospital care
• Home care
All of these phases involve collecting data into databases, which are currently managed by individual hospitals.
Usage of mobile communication devices

Data Mining Accuracy vs. Data Size (figure): assumed accuracy, up to 100%, plotted against data size, with the sampled data size and the available data size marked.

The GridMiner Project in Vienna
GridMiner: a knowledge discovery Grid infrastructure (http://www.gridminer.org/)
• OGSA-based architecture
• Workflow management
• Grid-aware data preprocessing and data mining services
• Data mediation service
• OLAP service
• GUI
• Current implementation on top of Globus Toolkit 3.2
• Applications: exploration of ecological data, management of patients with traumatic brain injuries
• Research exhibition available

Literature: listed on the course web page.

Distributed Memory Architecture (Shared Nothing) (figure): four CPUs, each with its own local memory, connected by an interconnection network.

DMM: Shared Disk Architecture (figure): four CPUs, each with its own local memory, connected by an interconnection network and sharing a global shared disk subsystem.

Shared Memory Architecture (Shared Everything, SMP) (figure): four CPUs connected by an interconnection network to a global shared memory.

Cluster of SMPs (figure): four 4-CPU SMP nodes (16 CPUs in total) connected by an interconnection network.

High-Performance I/O Systems

Note: RAID technology is introduced in a separate scriptum.

Main literature: Principles of Distributed Database Systems

Distributed Database System (DDBS) Technology - Introduction
DDBS technology is the union of what appear to be two diametrically opposed approaches to data processing: database systems and computer network technologies. Database systems have taken us from a paradigm of data processing in which each application defined and maintained its own data (figure follows) to one in which the data is defined and administered centrally (figure follows). The result is data independence: the application programs are immune to changes in the logical or physical organization of the data, and vice versa. One of the major motivations is the desire to integrate the operational data of an enterprise and to provide centralized, and thus controlled, access to that data.

DDBS - Introduction (cont.)
The technology of computer networks promotes a mode of work that goes against all centralization efforts. How can these two contrasting approaches be synthesized to produce a technology that is more powerful and more promising than either one alone? The key is the realization that the most important objective of database technology is integration, not centralization. Neither of these terms necessarily implies the other. It is possible to achieve integration without centralization, and that is exactly what distributed database technology attempts to achieve.

Distributed Database System Technology - Introduction

Central Database on a Network - Example (figure): sites in Boston, Edmonton, Paris, and San Francisco connected by a communication network.

Distributed Database System (DDBS) - Definitions
Definition 1: Distributed database. A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network.
Definition 2: Distributed database management system (distributed DBMS). The software system that permits the management of the distributed database and makes the distribution transparent to the users.
A DDBS is not a "collection of files" that can be individually stored at each node of a computer network. To form a DDBS, files should not only be logically related, but there should also be structure among the files, and access should be via a common interface. The physical distribution of data is very important: it creates problems that are not encountered when the databases reside in the same computer system.

Promises of DDBSs: 1. Transparent Management of Distributed and Replicated Data
Transparency refers to the separation of the higher-level semantics of a system from lower-level implementation issues; a transparent system "hides" the implementation details from the user. Example (next slide): Consider an engineering firm that has offices in several cities. It is preferable to localize the data so that data about the employees in the Edmonton office are stored in Edmonton, and so forth. The same applies to the project information. In this process we partition each of the relations and store each partition at a different site; this is known as fragmentation. It may also be preferable to duplicate some of this data at other sites for performance and reliability reasons. The result is a distributed database that is fragmented and replicated. Fully transparent access means that the users can still pose queries in the same form as to a centralized system, without paying any attention to the fragmentation, location, or replication of data, and let the system worry about resolving these issues.

Distributed Database System Environment - Example (figure): four sites connected by a communication network, each storing the following fragments.
• Edmonton: Edmonton employees, Paris projects, Edmonton projects
• Boston: Boston employees, Paris employees, Boston projects
• Paris: Paris employees, Paris projects, Boston employees, Boston projects
• San Francisco: San Francisco employees, San Francisco projects
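
The fragmentation and replication idea from this example can be sketched in a few lines of Python. This is a minimal illustration, not how a real distributed DBMS is built: the employee rows, the `PLACEMENT` table, and all function names are hypothetical, and the "sites" are plain dictionaries in one process. The placement loosely follows the employee fragments of the example.

```python
from collections import defaultdict

# Hypothetical employee relation: (name, office).
EMPLOYEES = [
    ("Ada", "Edmonton"),
    ("Ben", "Boston"),
    ("Chloe", "Paris"),
    ("Dev", "San Francisco"),
    ("Eve", "Paris"),
]

# Assumed placement: site -> employee fragments stored at that site.
# The Paris and Boston fragments are replicated at two sites each.
PLACEMENT = {
    "Edmonton": ["Edmonton"],
    "Boston": ["Boston", "Paris"],
    "Paris": ["Paris", "Boston"],
    "San Francisco": ["San Francisco"],
}


def fragment(rows):
    """Horizontal fragmentation: one fragment per office."""
    fragments = defaultdict(list)
    for name, office in rows:
        fragments[office].append((name, office))
    return dict(fragments)


def build_sites(rows, placement):
    """Store a copy of each fragment at every site the placement lists for it."""
    fragments = fragment(rows)
    return {
        site: {office: list(fragments.get(office, [])) for office in offices}
        for site, offices in placement.items()
    }


def query_employees(sites, predicate):
    """Transparent access: the caller never names a site, fragment, or replica."""
    seen_fragments = set()
    result = []
    for site in sorted(sites):
        for office, rows in sorted(sites[site].items()):
            if office in seen_fragments:
                continue  # a replica of this fragment was already read
            seen_fragments.add(office)
            result.extend(row for row in rows if predicate(row))
    return sorted(result)


sites = build_sites(EMPLOYEES, PLACEMENT)
paris_staff = query_employees(sites, lambda row: row[1] == "Paris")
```

The caller of `query_employees` writes the same predicate it would use against a single centralized table; which sites hold which fragments, and how many copies exist, is resolved inside the function.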

Promises of DDBSs: 2. Reliability Through Distributed Transactions
Distributed DBMSs are intended to improve reliability, since they have replicated components and thereby eliminate single points of failure. The failure of a single site, or the failure of a communication link that makes one or more sites unreachable, is not sufficient to bring down the entire system. In the case of a distributed database, this means that some of the data may be unreachable, but with proper care, users may be permitted to access other parts of the distributed database. The "proper care" comes in the form of support for distributed transactions.
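
As a rough illustration of why a single site failure need not bring down the whole system, the sketch below checks which fragments stay reachable when some sites fail. The `REPLICAS` table and the function names are hypothetical. It covers read availability only; keeping replicas consistent under updates is the part that requires distributed transactions and is not shown here.

```python
# Hypothetical placement: fragment -> sites that hold a copy of it.
REPLICAS = {
    "employees_paris": ["Paris", "Boston"],
    "employees_boston": ["Boston", "Paris"],
    "employees_edmonton": ["Edmonton"],
}


def reachable_site(fragment, failed_sites):
    """Return the first live site holding the fragment, or None if all copies are down."""
    for site in REPLICAS[fragment]:
        if site not in failed_sites:
            return site
    return None


def availability(failed_sites):
    """For every fragment, report where it can still be read (None = unreachable)."""
    return {name: reachable_site(name, failed_sites) for name in REPLICAS}


# With Paris down, its data is still served from the Boston replica.
paris_down = availability({"Paris"})
# With Edmonton down, the unreplicated Edmonton fragment is unreachable,
# but every other fragment remains accessible.
edmonton_down = availability({"Edmonton"})
```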

Promises of DDBSs: 3. Improved Performance
A distributed DBMS fragments the conceptual database, enabling data to be stored in close proximity to its points of use. The inherent parallelism of distributed systems may be exploited for inter-query and intra-query parallelism. Inter-query parallelism results from the ability to execute multiple queries at the same time. Intra-query parallelism is achieved by breaking up a single query into a number of subqueries, each of which is executed at a different site and accesses a different part of the distributed database.
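
The two kinds of parallelism can be sketched with Python's standard `concurrent.futures` module. This is a toy under stated assumptions: worker threads stand in for sites, and the `FRAGMENTS` data and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fragments of a projects relation, one per site: (project, budget).
FRAGMENTS = {
    "Edmonton": [("P1", 120), ("P2", 80)],
    "Boston": [("P3", 200)],
    "Paris": [("P4", 50), ("P5", 150)],
}


def local_subquery(site, min_budget):
    """Subquery executed against one site's fragment only."""
    return [row for row in FRAGMENTS[site] if row[1] >= min_budget]


def intra_query(min_budget):
    """Intra-query parallelism: one query split into per-site subqueries."""
    sites = sorted(FRAGMENTS)
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        partials = pool.map(lambda site: local_subquery(site, min_budget), sites)
        # Merge the partial results into the answer to the original query.
        return sorted(row for part in partials for row in part)


def inter_query(thresholds):
    """Inter-query parallelism: several independent queries run at the same time."""
    with ThreadPoolExecutor(max_workers=len(thresholds)) as pool:
        return list(pool.map(intra_query, thresholds))


big_projects = intra_query(100)
both_answers = inter_query([100, 200])
```

Each subquery touches a different part of the data, so the subqueries do not need to coordinate until their partial results are merged.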

Promises of DDBSs: 4. Easier System Expansion
In a distributed environment, it is much easier to accommodate increasing database sizes. Major system overhauls are seldom necessary; expansion can usually be handled by adding processing and storage power to the network. It may not be possible to obtain a linear increase in "power", since this also depends on the overhead of distribution. Even so, it normally costs much less to put together a system of smaller computers with the equivalent power of a single big machine.

Problem Areas
• Distributed database design
• Distributed query processing
• Distributed directory management
• Distributed concurrency control
• Distributed deadlock management
• Heterogeneous databases

Distributed DBMS Architecture
The architecture of a system defines its structure. This means that the components of the system are identified, the function of each component is specified, and the interrelationships and interactions among these components are defined. In this part we classify DBMS architectures. These are idealized views; many research and commercially available systems may deviate from them. We use a classification (next slides) that organizes the systems with respect to (1) the autonomy of local systems, (2) their distribution, and (3) their heterogeneity.

Autonomy
Autonomy refers to the distribution of control, not of data. It indicates the degree to which individual DBMSs can operate independently. Requirements of an autonomous system:
• The local operations of the individual DBMSs are not affected by their participation in a multidatabase system.
• The manner in which the individual DBMSs process and optimize queries should not be affected by the execution of global queries that access multiple databases.
• System consistency or operation should not be compromised when individual DBMSs join or leave the multidatabase confederation.

Distribution
Whereas autonomy refers to the distribution of control, the distribution dimension of the taxonomy deals with data. There are a number of ways DBMSs have been distributed. We abstract two alternative classes:
• client/server distribution
• peer-to-peer distribution (or full distribution)

Heterogeneity
Heterogeneity may occur in different forms:
• hardware
• data models
• query languages
• transaction management protocols

Architecture Model

DBMS Architectures
• Client/server architecture (not of interest for this course)
• Distributed database architecture
• Multidatabase architecture

Client/Server Architecture
Here there is typically one central database server and a larger number of networked workstations that store no relevant data. The user at a workstation sees the full functionality of the DBMS. The system behaves like a centralized database system; the communication is transparent to the user.

Client/Server Architecture (cont.)

Distributed Database System
Here there are several database servers, and particular data may be stored on only one machine or on several (replicated). The result is a virtual database whose components are physically mapped onto a number of different, actually existing DBMSs. In this case, transactions can span several DBMSs. A distributed database is a collection of data that
• belong to the same system because of shared, linking properties,
• are distributed over different computers in the network,
• where each computer has its own database
• and can carry out tasks locally and autonomously.

Distributed Database System (cont.)
• Simultaneous use of the computing power of several computers
• The data-access bottleneck of centralized database systems is avoided because the data are distributed (and possibly replicated)
• The data are managed by one database system
• Distribution transparency
• Basis: 4-level schema architecture

Repetition: ANSI/SPARC Architecture
Figure: users access external views (external schema), which are defined over the conceptual view (conceptual schema), which is mapped to the internal view (internal schema).
• External view: concerned with how users view the database. An individual user's view represents the portion of the database that will be accessed by that user, as well as the relationships that the user would like to see among the data. A view can be shared among a number of users.
• Conceptual view: the conceptual schema is an abstract definition of the database; it is the "real view" of the enterprise being modeled in the database. The requirements of individual applications or the restrictions of the physical storage media are not considered.
• Internal view: deals with the physical definition and organization of data. The location of data on different storage devices and the access mechanisms used to reach and manipulate data are the issues dealt with at this level.
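
A loose analogy for the three levels can be built with Python's standard `sqlite3` module: a table plays the role of the conceptual schema, an index stands in for an internal-level access mechanism, and a SQL view acts as an external view. The table, index, and view names are hypothetical, and SQLite is used only for illustration; it is not an ANSI/SPARC reference implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual level: the enterprise-wide definition of the data.
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, office TEXT, salary INTEGER)"
)

# Internal level: physical organization and access mechanisms, here an index.
conn.execute("CREATE INDEX idx_employee_office ON employee (office)")

# External level: the portion of the database that one group of users sees.
conn.execute(
    "CREATE VIEW paris_staff AS SELECT name FROM employee WHERE office = 'Paris'"
)

conn.executemany(
    "INSERT INTO employee (name, office, salary) VALUES (?, ?, ?)",
    [("Chloe", "Paris", 70), ("Ben", "Boston", 65), ("Eve", "Paris", 80)],
)


def paris_staff(connection):
    """Query the external view; the caller knows nothing about tables or indexes."""
    rows = connection.execute("SELECT name FROM paris_staff ORDER BY name")
    return [row[0] for row in rows]


before = paris_staff(conn)
# Changing the internal level (dropping the index) leaves the external view intact.
conn.execute("DROP INDEX idx_employee_office")
after = paris_staff(conn)
```

The view returns the same rows before and after the index is dropped, which is the kind of data independence the layered architecture is meant to provide.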

Distributed Database System (cont.): 4-Level Schema Architecture (figure)
• External schema 1 ... external schema N
• Global conceptual schema
• Local conceptual schemas (one per site)
• Local internal schemas (one per site)

Functional Schematic of an Integrated Distributed DBMS
The global directory/dictionary (GD/D) permits the required global mappings. Local mappings are performed by a local directory/dictionary (LD/D).

Components of a Distributed DBMS
User processor
• The user interface handler is responsible for interpreting user commands and formatting the result data.
• The semantic data controller uses the integrity constraints and authorizations that are defined as part of the global conceptual schema to check whether the user query can be processed.
• The global query optimizer and decomposer determines an execution strategy that minimizes a cost function, and translates the global queries into local ones using the global and local conceptual schemas as well as the global directory.
• The distributed execution monitor coordinates the distributed execution of the user request.
Data processor
• The local query optimizer is responsible for choosing the best access path to access any data item. (The term access path refers to the data structures and algorithms that are used to access data. A typical access path is an index on one or more attributes of a relation.)
• The local recovery manager is responsible for making sure that the local database remains consistent.
• The run-time support processor physically accesses the database according to the physical commands in the schedule generated by the query optimizer.
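
To make the role of the global directory and the cost function more concrete, here is a deliberately simplified sketch: for each fragment a global query needs, it looks up the candidate sites in a directory and picks the one that is cheapest to reach from the site where the query was issued. The directory contents, the link costs, and the function names are all hypothetical. A real global optimizer weighs far more than transfer cost, such as join order and local processing.

```python
# Hypothetical global directory: fragment -> sites that hold a copy.
GLOBAL_DIRECTORY = {
    "employees_paris": ["Paris", "Boston"],
    "projects_boston": ["Boston", "Paris"],
    "projects_edmonton": ["Edmonton"],
}

# Hypothetical symmetric cost of shipping one fragment between two sites.
LINK_COST = {
    frozenset(["Boston", "Paris"]): 5,
    frozenset(["Boston", "Edmonton"]): 3,
    frozenset(["Edmonton", "Paris"]): 7,
}


def cost(query_site, data_site):
    """Cost function: local access is free, remote access pays the link cost."""
    if query_site == data_site:
        return 0
    return LINK_COST[frozenset([query_site, data_site])]


def decompose(fragments, query_site):
    """Map each fragment of a global query to its cheapest site.

    Returns the plan (fragment -> chosen site) and its total cost.
    """
    plan = {}
    for name in fragments:
        candidates = GLOBAL_DIRECTORY[name]
        plan[name] = min(candidates, key=lambda site: cost(query_site, site))
    total = sum(cost(query_site, site) for site in plan.values())
    return plan, total


plan, total = decompose(["employees_paris", "projects_edmonton"], "Boston")
```

Issued from Boston, the query reads the Paris employees from the local Boston replica and has to fetch only the Edmonton projects remotely.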

Multidatabase System (MDBS)
- An MDBS is a federation of several database systems.
- The conceptual schema represents only the part of the data that the local DBMSs want to share.
- Local applications can access each DBS.
- Each DBS may contain data that have no relationship to the data of the other DBSs.

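Multidatabase System: model with a global conceptual schema (figure). Labels: GES (global external schemas), LES (local external schemas), GKS (global conceptual schema), LKS 1 ... LKS n (local conceptual schemas), LIS 1 ... LIS n (local internal schemas).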

Multidatabase System (cont.): model without a global conceptual schema (figure). Multidatabase layer: ES 1, ES 2, ..., ES n (external schemas). Local system layer: LKS 1, LKS 2, LKS 3 (local conceptual schemas) and LIS 1, LIS 2, LIS 3 (local internal schemas).

Components of an MDBS

Directory Management Strategies - Alternatives

