THIRD GENERATION DATABASE SYSTEM MANIFESTO The Committee for advanced DBMS Function
INTRODUCTION • First Generation DB systems: Network and hierarchical systems present in the 70s(Offering data definition and manipulation languages) • Second Generation DB systems: Non-procedural data manipulation language and data independence (DB2,INGRES,NON_STOP SQL, ORACLE,..) Focused on business data processing. Inadequate for CAD,CASE, hypertext applications.
INTRODUCTION • Consider a publishing application. Client wishes to arrange the layout of a newspaper.Storing text segments, graphics and icons is required. Supporting such data is difficult in 2nd generation systems. • 2nd generations systems sometimes do not support business data processing adequately either.Consider an insurance application that processes claims. Requires traditional data like names, ages etc. But also requires photographs of the events to which claims are related.
THIRD GENERATION DBMS SYSTEMS • PRINCIPLE 1: Besides traditional data management services, third generation DBMS will provide support for richer object structures and rules.
PRINCIPLE 1 • Richer object structures to store and manipulate text and spatial data are required. • Capability of specifying rules should be given to the designer. • Referential integrity in a relational context is an example for a rule
PRINCIPLE 1 • Ex: Newspaper application • Text,Icons,maps,advertisement copy(Rich object structures) • Rate,Number of days the adv. will run, classification(Traditional data) • Can`t put advertisement of Sun on the same page as Microsoft.(Rule)
PRINCIPLE 1 • Ex: Insurance application: • Need to store photographs • Making changes to insurance coverage is a standard transaction. • Rules such as: Cancel the coverage of any customer who has had a claim of type Y over value X Escalate any claim older than N days.
THIRD GENERATION DBMS SYSTEMS • PRINCIPLE 2: Third generation DBMSs must subsume second generation DBMSs
PRINCIPLE 2 • Non procedural access and data independence must not be compromised by third generation systems. • Query language is a necessity. • Ex: mechanical CAD systems. Store the parts which compose a product. • Along with spatial geometry of data, store cost of the part, color,meantime to failure... • Query Example: How much does the cost of my product if supplier X raises prices by Y percent.
PRINCIPLE 2 • Second advance is data independence. Second generation systems maintain the consistency of all access paths to data and a query optimizer chooses the best way to access the data. Also, second generation systems provide views whereby a user can be insulated from the changes underlying the set of collections.
THIRD GENERATION DBMS SYSTEMS • PRINCIPLE 3: Third generation DBMSs must be open to other subsystems.
PRINCIPLE 3 • Must have 4th generation language(4GL), various decision support tools, access from many PLs, access to popular systems like LOTUS, graphical interfaces, distributed DBMS. • Must allow access from additional tools running in different environments
THIRD GENERATION DBMS SYSTEMS • There are three groups of propositions which should be followed by third generation DBMSs. • Propositions which result from Principle1 and refine the requirements of object & rule management. • Propositions which claim that a 3rd generation DBMS should subsume a 2nd generation DBMS • Propositions which result from the requirement that a third generation system be open
Propositions Concerning Object and Rule Management • DBMSs cannot anticipate all the kinds of data elements that an application might want. Most people think time is measured in seconds and days but the day ends at 16.30 at most banks and yesterday may need to skip the weekends and holidays. Hence it is imperative that a third generation system manage a diversity of objects.
1.1 A Third Generation DBMS must have a rich type system • An abstract data type to construct new ones • An array constructor • A Sequence type constructor • A record type constructor • A set type constructor
1.1: A Third generation DBMS must have a rich type system:(cont.d) • Functions as a type • A union type constructor • Recursive composition of the above constructors • In addition to implementing these type constructors, the underlying query language must be extended.
1.1: A Third generation DBMS must have a rich type system • The first mechanism allows one to construct new base types in addition to the standard integers, floats andcharacter strings available in most systems. These include bit strings, points, lines, complex numbers, etc. • The second mechanism allows one to have arrays of data elements, such as found in many scientific applications.Arrays normally have the property that a new element cannot be inserted into the middle of thearray and cause all the subsequent members to have their position incremented. In some applications suchas the lines of text in a document, one requires this insertion property, and the third type constructor supportssuch sequences.
1.1: A Third generation DBMS must have a rich type system • The fourth mechanism allows one to group data elements into records. Using thistype constructor one could form, for example, a record of data items for a person who is one of the "oldguard" of a particular university. • The fifth mechanism is required to form unordered collections of data elementsor records. For example, the set type constructor is required to form the set of all the old guard. The next mechanism allows one to construct a data element whichcan take a value from one of several types. • The last mechanism allows type constructors to be recursively composed to support complexobjectswhich have internal structure such as documents, spatial geometries, etc.
1.2 Inheritance is a good idea • Allowing types to be organized into an inheritance hierarchy is a good idea. • If only single inheritance is supported then there are too many situations that can not be adequately modeled. So, multiple inheritance is better.
1.3 Functions, including database procedures and methods and encapsulation are a good idea • 2nd generation systems support functions restrictedly(SQL supports -create, alter,drop- which are functions of tables.) • Encapsulation: Users should not be allowed direct access to EMPLOYEE collection but given funcions HIRE(EMPLOYEE), FIRE(EMPLOYEE), RAISE_SAL(EMPLOYEE) • Encapsulation encourages modularity and registers functions along with the data
1.3 Functions, including database procedures and methods and encapsulation are a good idea • Encapsulation has performance advantages in a protected or distributed system. • Ex: function HIRE(EMPLOYEE). If specified as a function to be executed internally, only one round trip message btw DBMS and application is required. If runs in a user program, one round trip message for each access.
1.3 Functions, including database procedures and methods and encapsulation are a good idea • Such functions can be inherited and overridden in inheritance hierarchy. • HIRE(EMPLOYEE) can be applied to STUDENT EMPLOYEE by being overridden. • Users should write functions in a high level language and obtain access through a non-procedural access language. • Functions should run queries and not perform their own navigation using lower-level DBMS calls.
1.3 Functions, including database procedures and methods and encapsulation are a good idea • Some OODB enthisuasts claim that the only way to access a data type be via functions. For example, access EMPLOYEE via HIRE(EMPLOYEE). But select * from EMPLOYEE where salary > 1000 needs to access all data elements. So data elemets inside should be accessed with accessor functions. An authorization system is required to control this access. • Functions should be inherited. Functions stored in the database with DBMS calls in the query language are very common.
1.4 Unique Identifiers(UIDs) for records should be assigned by the DBMS only if a user-defined primary key is not available • 2nd generation systems support primary keys. Its value never changes and it is also human-readable. • If no primary key is available, the system assigns a UID
1.5 Rules(triggers,constraints) will become a major feature in future systems. They should not be associated with a specific function or collection • OODB researchers ignore the importance of rules. They say rules are implemented by adding some code to a function which operates on the collection. • Ex: Rule: No employee wins greater than his manager.Insert code for this into RAISE_SAL and HIRE_EMPLOYEE functions • But, this way whenever a new function is added some code must be included by it. Also, same code is added to two different places which consumes programming effort(replication).
1.5 Rules(triggers,constraints) will become a major feature in future systems. They should not be associated with a specific function or collection • Example Rules: • Whenever Hasan gets a salary adjustment, propagate the change to Can. • Whenever Can gets a salary adjustment, propagate the change to Osman. • Also, the users might ask queries about the rules applied . If they are embedded in functions,it will be hard to do this. It will be hard too, when a rule is deleted. • So, rules must be enforced by the DBMS but not be bound to any function or collection.
3.2 Propositions Concerning Increasing DBMS Function • 3rd generation systems must subsume all capabilities of 2nd generation systems. • These are query languages, specification of sets of data elements and data independence
2.1 Essentially all programatic access to a database should be through a non-procedural high-level access language. • The expressive power of a query language must be present in every programatic interface. • Long term this can be provided by adding query language constructs to programming languages • Short term,this can be provided by embedding a query language into conventional programming languages.
2.1 Essentially all programatic access to a database should be through a non-procedural high-level access language. • Many OODB researchers state that they wish to navigate to desired data using a low-level procedural interface • But this kind of interface is undesirable and shouldn`t be used • When the programmer navigates to data in this fashion, he replaces the function of query optimizer. A well written optimizer can always do better than a programmer.
2.1 Essentially all programatic access to a database should be through a non-procedural high-level access language. • Second problem is schema evolution. If the number of indexes change or the physical access paths to data changes, the programmer must modify his program. On the other hand, query optimizer simply produces a new plan for the new environment.
2.1 Essentially all programatic access to a database should be through a non-procedural high-level access language. • Consider a user who is browsing a DB ie, navigating from one record to the other. “select * from collection where collection.key=value” Although there is little to optimize in this query, one is still insulated from required program maintenance in the event of schema changes. • One does not obtain this service if a lower level interface is used such as “dereference(pointer)
2.1 Essentially all programatic access to a database should be through a non-procedural high-level access language. • A majority of OODB researchers suggest that a pointer be soft, ie its value not change even if the data it points to is moved. This characteristic, location independence, is desirable because it allows data elements to be moved without compromising the structure of DB. • There is little performance benefit to using the lower level interface when a single data is returned and there is performance degrading when multiple data elements are returned.
2.1 Essentially all programatic access to a database should be through a non-procedural high-level access language. • OODB enthusiasts claim that CAD programmers want to perform their own navigation and therefore need lower level interface. This is like resisting to move from assembly language to high level programming languages. These programmers require education. • All DBMS access should be specified by queries in a non-procedural high-level access notation.
2.2 There should be at least two ways to specify collections, one using enumeration of members and one using the query language to specify membership • The OODB literature suggests specifying sets by enumerating members by means of a linked list or array of identifiers. This is an inferior choice because: ALUMNI(name,age,address) GROUPS(g-name, composition) Here we have a collection of alumni of a university along with collection of groups of alumni. Each group has a name and composition fiels indicates members of each group. • Composition can be specified as an array of pointers to alumni but this is inefficient because sets are quite large and they overlap. Also., when a new person is added to ALUMNI collection, application programmer should add him to all appropriate groups.
2.2 There should be at least two ways to specify collections, one using enumeration of members and one using the query language to specify membership • It is also possible to represent GROUPS as follows: GROUPS(g-name,min-age,max-age,composition). • In this specification, groups are parametrized by age and set membership becomes automatic. When new alumnus are added, they are added to the appropriate groups. Also look at the following query Select g-name from GROUPS where composition.name=‘Hasan’
2.2 There should be at least two ways to specify collections, one using enumeration of members and one using the query language to specify membership • If an array of pointers is used, query optimizer will scan all records in GROUPS and then find ‘Hasan’. On the other hand if automatic membership method is used, the optimizer will transform the query into: “ select g-name from GROUPS,ALUMNI where ALUMNI.name=‘Hasan’ and Alumni.age > GROUPS.min_age and Alumni.age < GROUPS.max_age” • If there is an index on GROUPS.min-age or GROUPS.max-age or ALUMNI.name, this query will perform better.
2.2 There should be at least two ways to specify collections, one using enumeration of members and one using the query language to specify membership • In summary, there are two ways to specify collections such as sets, arrays and sequences. They can be specified either extensionally through collections of pointers or intensionally through expressions which also maintains automatic membership. • Also with an intensional specification optimizer is free to use any access path, not limited to use any pointer structures. • Both representations are required but intensional should be favored.
2.3 Updatable views are essential • Most of the databases are dynamic,when a set of collections changes, the DB needs maintenance. • Encapsulation of DB access into functions and encapsulation of functions with a single collection is helpful in identifying the functions to be changed. But if a change is made to schema, it may take too long to change the affected functions.
2.3 Updatable views are essential • A better approach is to support virtual collections(views). • In second generation systems, it is not possible to update relational views. • The traditional way to support view updates is to perform command transformations along the lines. To disambiguate view updates additional semantic information must be provided. • One approach is to require that each collection be opaque which might become a view at a later time. This way there is a group of functions for all accesses and view definer must perform program maintenance on these functions.
2.3 Updatable views are essential • Alternately, a suitable rules system can be used to provide the necessary semantics.This way, only one(or a small number of) rules need to be specified to provide view update semantics.
2.4 Performance indicators have nothing to do with data models and must not appear in them. • Main determiners of performance are: The amount of performance tuning done on DBMS Usage of compilation techniques by the DBMS Location of the buffer pool(in client or DBMS) Kind of indexing available Performance of the client-DBMS interface Clustering performed. • Such issues have nothing to do with the data model or usage of a high-level language versus a lower level interface.
2.4 Performance indicators have nothing to do with data models and must not appear in them. • Clustering related objects together has been highlighted as an important OODB feature. But this is a physical representation issue and has nothing to do with the data model. • Some nontraditional problems such as CAD need a database environment which is optimized for the specific application like An access method for spatial data A buffer pool in the engineers WS. • But these are all performance issues and have nothing to do with the data model.
3.3 Propositions that result from the necessity of an open system • Up to now, we have discussed the characteristics of third generation DBMS`s. Now we turn our attention to the API, through which a user program will communicate with the DBMS.
3.1 Third generation DBMSs must be accesible from multiple HLLs. • Some designers wrongly claim that, a function should yield the same result if it is executed in the user space on transient data or inside the DBMS on persistent data. This can happen only if the execution model of DBMS is identical to that HLL. • There is no aggrement on the single HLL. • An open DBMS must be multi-lingual. It must allow access from a variety of externally written application systems.
3.2 Persistent X for a variety of Xs is a good idea. They will all be supported on top of a single DBMS by compiler extensions and a(more or less) complex run time system • Second generation systems were interfaced to programming languages using a preprocessor partly because early DBMS developers did not have the cooperation of compiler developers. But the resulting interfaces were not very friendly and were characterized as “like glueing an apple on a pancake”. Obviously it is possible to provide the same level of elegance for general purpose programming languages. • It is crucial to have a closer match between type systems. Also, it will be nice to allow any variable in a user`s program to be optionally persistent(ie remembered after the program terminates).
3.2 Persistent X for a variety of Xs is a good idea. They will all be supported on top of a single DBMS by compiler extensions and a(more or less) complex run time system • In order to perform well, persistent X must maintain a cache of data elements and manage this cache using some replacement algorithm. A user space cache will improve program performance 100-1000 times. • Functions should be coded by including calls to the DBMS expressed in the query language. Hence, persistent X requires some way to express queries. Such queries can be expressed in a notation appropriate to the HLL in question. The run-time system for the HLL must accept and process such queries. • Such a run-time system will be difficult to build depending on the HLL.
3.3 For better or worse, SQL is intergalactic dataspeak. • SQL is the universal way of expressing queries today. Although it has a variety of well known minor problems, it is necessary for commercial viability.
3.4 Queries and their resulting answers should be the lowest level of communication between a client and a server. • In an environment where a user interacts with a remote DB, there is a question about the protocol to be used. OODB enthisuasts debate whether requests should be for single records or pages. • Expressions in the query language should be the lowest level unit of communication. If queries can be packed into a function, then the user can use RPC to cause function execution on the server. • If a lower level specification is used, such as record or page transfers, the protocol design will be harder because of machine dependencies and states.
Summary We agree with OODB enthisuasts on : • Rich type system, functions, inheritance and encapsulation. • Disagree on object management issues. We propose a larger issue of support for data, rule and object management. • DBMS access should only occur through HLLs. • Use of automatic collections should be encouraged whenever possible. Persistence may be added to HLLs. Unique identifiers should be used either user-defined or system-defined.