Object Relational Databases

Object Relational Databases Ioan Despi

Motivation & Politics In the early 80’s, it became clear that relational systems were not robust enough for non-administrative data-intensive applications of the day: CAD/CAM CASE GIS etc. Two buzz-phrases began to emerge: "Object-Oriented" and "Extensible" Much vision & politics ensued: Various data models (NF2, ER, Functional, Semantic) Object-Oriented DB System Manifesto (OO-ness) Third-Generation DB System Manifesto (Extensibility) Many query languages proposed Systems were built, companies started, etc.

Today, the field has settled down into two arenas: Persistent OO PL systems (e.g. EXODUS, ObjectStore, Objectivity, Versant, etc.) Query-based systems with OO features (e.g. Starburst, Postgres, Illustra, Informix & ORacle "Universal Servers", DB/2 UDB) Almost nobody does both well Few people continue to argue in terms of the paradigms. Market: $10 billion/year for RDBMS $200 million/year for OODBMS

Obvious & common idea: • To extend an existing relational data base management system: • keep the basic relational tables & query language • add object flavours: • user - extensible type system • encapsulation • inheritance • polymorphism • dynamic binding of methods • complex objects (non-first normal form objects) • object identity

Terminology: Extended Relational DBMS (ERDBMS) -- original term Object - Relational DBMS (ORDBMS) -- more descriptive Universal Server, Universal DBMS (UDBMS)-- recently Oracle Informix have all extended their systems to become ORDBMSs IBM SQL3 standardize extensions to the relational model and query language

Advantages: resolves many of the weaknesses of the relational model extends RDBMS with reuse and sharing comes from the ability to extend the database server to perform standard functionality centrally, rather than having it coded in each application preserves the knowledge and experience from relational model Disadvantages: complexity and associated increased costs simplicity and purity of the relational model are lost object-orinted purists: the used terminology is wrong

The Third - Generation Database Manifestos 1989 Atkinson et al. : Object -Oriented Database System Manifesto: 1. Complex objects must be supported 2. Object identity must be supported 3. Encapsulation must be supported 4. Types or classes must be supported 5. Types or classes must be able to inherit from ancestors 6. Dynamic binding must be supported 7. The DML must be computationally complete 8. The set of data types must be extensible 9. Data persistence must be provided

10. Support for very large databases must be provided 11. DBMS must support concurrent users 12. DBMS must be capable of recovery 13. DBMS must provide a simple way of querying data 1990 Stonebraker et al. : Committee for Advanced DBMS Function The Third-Generation Database Systems Manifeso 1. Athird-generation DBMS must have a rich type system 2. Inheritance is a good idea 3. Functions, including database procedures, methods and encapsulation, are a good idea 4. Unique identifiers for records should be assigned by the DBMS only if a user-defined primary key is not available

5. Rules (triggers, constraints) will become a major feature. They should not be associated with a specific function or collection. 6. The programmatic access to a database should be through a non-procedural high-level access language 7. At least two ways to specify collections: using enumaration of members using QL to specify membership 8. Updateable views are essential 9. Performance indicators must not appear in the data model 10. DBMS must be accessible from multiple languages 11. On top of the DBMS: a variety of high-level languages 12. For better or worse, SQL is ‘intergalactic dataspeak’ 13. Queries and their resulting answers should be the lowest level of communication between a server and a client

1995, 1998: Darwen & Date: attempt to defend the relational data model, as described in their book (1992). The Third Manifesto: certain oo features are desirable but they must be orthogonal to the relational model the realtional model needs no extension, no correction, no subsumtion, no perversion SQL is the biggest perversion--> it is rejected from the model instead: language D a front-end layer is provided to D that allows SQL to be used (a migration path for existing SQL users)

The D language is subject to: 1. Prescriptions that arise from the relational model (RM Prescriptions) 2. Prescriptions that do not arise from the realtional model, called Other Orthogonal prescriptions (OO Prescriptions) 3. Proscriptions that arise from the relational model (RM Proscriptions) 4. Proscriptions that do not arise from the relational model (OO Proscriptions) 5. Very strong suggestions (RM and OO)

RM Prescriptions 14. Base vs. derivedrelvars 15. Database variables (dbvars) 16. Transactions and dbvars 17. Create/destroy operations 18. Relational algebra 19. Relvar names and explicit values 20. Relation functions 21. Relation and tuple assignement 22. Comparisons 23. Integrity constraints 24. Relation and database predicates 25. Catalog 26. Language design 1. Domanins 2. Typed scalars 3. Scalar operators 4. Actual representation 5. Truth values 6. Type constructor TUPLE 7. Type constructor RELATION 8. Equality operator 9. Tuples 10. Relations 11. Scalar variables 12. Tuple variables 13. Realtion variables (relvars)

RM Proscriptions OO Proscriptions 1. No attribute ordering 2. No tuple ordering no duplicate tuples 4. No nulls 5. No nullogical mistakes 6. No internal-level constructs 7. No tuple-level operations 8. No composite columns 9. No domain check override 10. Not SQL 1. Relvars are not domains 2. No object ids 3. No ‘public instance variables’ 4. No ‘prottected instance variables’ or friends OO Prescriptions 1. Compile-time type checking 2. Single inheritance ( conditional) 3. Multiple inheritance (conditional) 4. Computational completeness 5. Explicit transactions boundaries 6. Nested transactions 7. Aggregates and empty sets

RM Very strong suggestions OO Very strong suggestions 1. Candidate keys for derived relvars 2. System-generated keys 3. Referential integrity 4. Candidate key inference 5. Quota queries 6. Transitive closure of a relation 7. Tuple and relation parameters 8. Default values 9. SQL migration 1. Typew inheritance 2. Collection type constructors 3. Conversion to/from relations 4. Single- level store

The primary object in the propopsal is the domain: a named set of encapsulated values, of arbitrary complexity equivalent to a data type or object class Domain values: scalars, can be manipulated only by means of operators defined for the domain. The language D comes with some build-in domains(ex: truth values) The equals (=) comparison operator is defined for every domain, returning a boolean value Relations, tuples and tuple headings have their normal meaning with the introduction of RELATION and TUPLE type constructors for these objects

The following variables are defined: • Scalar variable of type V--variable whose permitted values are scalars from a specified domain V • Tuple variable of type H-- variable whose permitted values are tuples with a specified tuple heading H • Relation variable (relvar) of type H -- variable whose permitted values are relations with a specified relation heading H • Database variable (dbvar) -- a named set of relvars. Every dbvar is subject to a set of named integrity constraints and has an associated self- describing catalog

Stonebraker’s matrix: upper right is growing, Query RDBMS ORDBMS No Query File Sys. OODBMS Simple Complex Data Data

Systems History Three influential research systems: Starburst (IBM Almaden) POSTGRES (Berkeley) EXODUS (Wisconsin) Others include O2 (Altair), ORION (MCC), Iris (HP), Genesis (Texas)

EXODUS EXODUS was intended to be both a persistent PL system and a query system with "Toolkit" extensibility Query processing engine never got built, though the EXODUS optimizer architecture was influential (Graefe & DeWitt) Ended up focusing on OODB stuff SHORE (follow-on to EXODUS) is delivering on EXODUS promises, but rather late. Persistent C++ Query Processing (w/ GIS features): Paradise Extensible Optimizer: Opt++ Parallelism We'll see Exodus/SHORE work on pointer swizzling, client-server caching

POSTGRES Stonebraker, Rowe, a few staff and many students, 1986-1994. Post-INGRES. The Postgres Data Model 1.Co-opt the OO terminology class = relation instance = tuple object-id = tuple-id method = attribute or function of attributes 2.Support extensible ADTs extensible procedures using C functions binary operators, which interface to extensible AM

3.Support type constructors trick: use queries columns can be parameterized Postquel functions (returns setof, or tuple) queries can live in fields of a tuple (returns setof or tuple) another exploitation of the view paradigm! these derived objects can optionally be cached (never implemented) nested-dots used to traverse complex object structures leverages EXISTING techniques for relational processing added array support directly 4.Added class inheritance (gives method inheritance and collection hierarchies)

Starburst Original goal: build a nice playpen for whatever comes next. Extensible "in-house". Not by users! No one survey paper seems to capture the work they did. Best bet: "Starburst Mid-Flight: As The Dust Clears", Haas, et al., TKDE 1990 Plumbing: clean internal query representation (QGM). Key to Query Rewrite! non-normalized catalogs for efficiency – normalized view for users WAL instead of shadow pages B+-tree compression Buffer Pool Manager accepts hints from optimizer (a la DBMIN)

Extensibility features: • User-defined functions: • table expressions: queries or C functions • scalar functions • no dynamic linking • Rule-based query rewrite engine • a little rule system with QGM as "working memory" • conditions and actions are C functions that check and • change QGM • some nifty rule control mechanisms (rule classes, rule • budgets, multiple conflict res.) • Extensible access methods (as in POSTGRES) • "Attachments": routines to be automatically called before/after • dealing with an access method • used by Starburst Rule System to generate transition logs • used to implement pre-computed joins • Complex objects implemented in a "wrapper" (SQL-XNF), • translated down to Starburst SQL

Hot Applications: Web servers, full-text collections Time-series data "Asset Management" GIS image DB Players: Informix Universal Server (head of Illustra, body of Informix). Shipping now. IBM DB2 UDB (head of Starburst, body of DB2). Extensibility features coming along. UniSQL (Won Kim of ORION fame). Went out of business recently. Oracle Universal Server (marketing-ware). Shipping now. NCR (Teradata) bought Wisconsin's Paradise ORDBMS (and DeWitt/Naughton/students) Other big R vendors are late (Sybase, Tandem, etc.)

Object -Relational Databases Concepts • I. Nested relations • II. Complex types and object orientation • III. Querying with complex types • IV. Creation of complex values and objects • V. Comparison of OO and OR databases

Object -Relational Data Models 1. Extend the relational data model by including object orientation and constructs to deal with added data types 2. Allow attributes of tuples to have complex types, including non-atomic values such as nested relations 3. Preserve relational foundations (e.g. declarative access to data) while extending modeling power 4. Upward compatibility with existing relational languges

I. Nested Relations 1. Permit non-atomic domains (e.g. set of integers, set of tuples,…) --> violate 1NF (that all attributes have atomic (indivisible) domains) 2. Allow more intuitive modelling for applications with complex data a complex object may be represented by a single tuple ---> 1:1 correspondence between data items and objects 3. Retain mathematical foundation of relational model Intuitive definition: Nested relations allow: 1. Relations wherever we allow atomic (scalar) values 2. Relations within relations

Example. Suppose the information to be stored consists of: 1. document title 2. author_list (set of authors) 3. date (day, month, year) 4. key_word_list (list of keywords) stored in a non-1NF document relation, doc: title keyword_list author_list date day month year salesplan stat.report {Smith, Jones} {Jones, Frick} 17 april 93 28 june 98 {profit, strategy} {personnel, profit}

The doc relation can be represented in 1NF as doc’ but awkward. If we asssume the following MVDs hold: title --> author title --> keyword title --> day month year we can decompose the relation in the following 4NF relations: R1(title, author) R2(title, keyword) R3(title, day, month, year)

A relation is in 5NF if no reamining nonloss projections are possible, except the trivial one in which the key appears in each projection. A relation is in 4NF iff it is in BCNF and there are no nontrivial MVD A relation is in BCNF iff every determinant is a candidate key. A relation is in 3NF iff it is in 2NF and no nonkey attribute is transitively dependent on the key A relation is 2NF iff it is in 1NF and all the non-key attributes are fully FD on the key A relation is 1NF iff every attribute is single-valued for each tuple. The non-1NF representation may be an easier-to-understand model (closer to user’s view) The 4NF design would require users to include joins in their queries, thereby complicating interaction with the system We could define a view, but we lose the 1:1 correspondence between tuples and documents.

II. Complex Types and Object Orientation Again: extensions to relational model include: nested relations complex types specialization (IS_A hierarchies) inheritance object identity

A. Structured and collection types Define a relation doc with complex attributes: sets and structured attributes: create type MyString char varying create type MyDate (day integer, monthchar(10), year char(10)) create type Document (name MyString, author_listsetof(MyString), date MyDate, keyword_listsetof (MyString)) create table doc of type Document

Unlike table definitions in ordinary relational databases, the doc table definition allows attributes that are sets and structured attributes (see: MyDate) • Allow composite attributes and multivalued attributes of ER diagrams to be represented directly. • The types created using the above statements are recorded in the schema stored in the database • Can create tables directly: • create table doc • (name MyString, author_list setof (MyString), date MyDate, keyword_list setof (MyString)) Complex type systems usually support other collection types, as arrays: author_array MyString[10] //presents an ordered list of authors multisets: print_runs multiset(integer) //presents the number of copies in each //printing run

B. Inheritance Inheritance can be at the level of types or at the level of tables 1. Inheritance of types create type Person (name MyString, social_security_no integer) create type Student (degree MyString, department MyString) underPerson create type Teacher (salary integer, department MyString) underPerson

1’. Multiple inheritance of types Definiton of the type TeachingAssistant as a subtype of both Teacher andStudent. Since name and social_security_no are inherited from a common source, Person, there is no conflict by inheriting them . However, department is defined separately in Student and Teacher and one can rename them to avoid conflict, by using an as clause: create type TeachingAssistant underStudent with (department as student_dept), under Teacher with (department as teacher_dept)

2. Inheritance of tables To avoid creation of too many subtypes: one approach is to allow an object to have multiple types without having a most specific type OR databases can model such a feature by using inheritance at the level of tables, rather than types and allowing an entity(object) to exist in more than one table at once. create table people (name MyString, social_security_no integer) create table teachers (salary integer, department MyString) underpeople create table students (degree MyString, department MyString) underpeople

2’. Table inheritance: Roles Table inheritance permits an object to have multiple types, without having a most-specific type (unlike type inheritance) Example: an object can be in the studentsand teachers subtables simultaneously, without having to be in a subtable student_teachers that is under both students and teachers Object can gain/ lose roles: corresponds to inserting /deleting object from a subtable. 2’’. Table inheritance: Consistency Requirements 1. Each tuple of supertable people can correspond to (i.e. having the same values for all inherited attributes as) at most one tuple in each of the tables studentsand teachers (WHY?) 2. Each tuple in students and teachers must have exactly one corresponding tuple in people. (WHY?)

Subtables can be stored in an efficient manner without replication of all inherited fields: inherited attributes other than the primary key of the supertable need not be stored and can be derived by means of a join with the supertable, based on the primary key As with types, multiple inheritance is possible with tables: a TeachingAssistant can simply belong to the table students as well as to the table teachers. However, if ew want, we can create a table for TeachingAssistant entities. Based on the consistency requirements for subtables, if an entity is present in the teaching_assistants table, it is also present in the teachers and in the students table.

3 . Inheritance: Conclusion Inheritance: makes schema definition natural ensures referential and cardinality constraints enables the use of functions defined for supertypes on objects belonging to subtypes allows the orderly extension of a database system to incorporate new types

C. Reference Types Object - oriented languages provide the ability to create and refer to objects. 1.To refer to objects = an attribute of a type can be a reference to an object of a specified type. Example: redefine the author_list field of the type Document as: author_listsetof ( ref (Person)) Now author_list is a set of references to Person objects 2. Tuples of a table can also have references to them. Example: ref(people) It can be implemented using the primary key or tuple-id. 3. SQL3 uses identity (for tuples) and oid (for objects).

III. Quering with Complex Types A. Relation-Valued Attributes Extended SQL allows an expression evaluating to a relation to appear anywhere the relation name may appear.==> can take advantage of the structure of nested relations Example: consider the following relation pdoc: create tablepdoc name MyString author_list setof (ref people)), date MyDate, keyword_list setof (MyString))

1. Find all documents having the word “database” as one of their keywords: select name from pdoc where “database” in keyword_list 2. Find all pairs of the form “doc_name, author_name” for each document and each author of the document: select B.name, Y.name from pdoc as B, B.author_listas Y 3. Find the name and the number of authors for each document (aggregate functions can be applied to any relation-valued expression) select name, count(author_list) from pdoc

B. Path Expressions The dot notation for referring to composite attributes can be used with references. Example: student.advisor.name References can be used to hide join operations --> they simplify the query considerably. Example: Consider the previous table people and a table phd_student: create table phd_students (advisor (ref(people)) under people

Find the names of the advisers of all PhD students: select phd_student.advisor.name from phd_students In general, attributes used in a path expression can be a collection, such as a set or a multiset. For example, to get the names of all authors of documents in pdoc relation: select Y.name from pdoc.author_list asY

C. Nesting and Unnesting Unnesting = the transformation of a nested relation into 1NF converts a nested relation into a single flat relation with no nested relations or structured types as attributes create table doc (name MyString, author_list setof (MyString), date MyDate, keyword_list setof (MyString)) author_list, keyword_list: nested relations name, date: are not nested select name, A as author, date.day, date.month, date.year, K as keyword from doc as B, B.author_list as A, B.keyword_list as K B is declared to range over doc, A ranges over the authors in author_list for that document, K ranges over the keywords in the keyword_list of the document

Nesting = the reverse operation of transformation of a 1NF relation into a nested relation can be carried out by an extension of grouping in SQL Example: nest the relation flat_doc on the attribute keyword: selecttitle, author, (day, month, year) as date, set (keyword) as keyword_list from flat_doc groupby title, author, date date title keyword_list author_list day month year Smith Jones Jones Frick 17 april 93 17 april 93 28 june 98 28 june 98 {profit, strategy} {profit, strategy} {personnel, profit} {personnel, profit} salesplan salesplan stat.report stat_report

D. Functions Object - relational systems allow functions to be defined by users Functions can be defined in a DML, such as extended SQL Example: define a function that, given a document, return the count of the number of authors: create function author_count (one_doc Document) returns integer as select count (author_list) from one_doc the function can be used in a query , to find the names of all documents that have more than one author

select name from doc where author_count (doc) > 1 Notes: 1.Although doc refers to a relation in the from clause, it is treated as tuple variable in the where clause, and can therefore be used as an argument to the author_count function. 2. A select statement can return a collection of values. If the return type of a function is a collection type, the result of the function is the entire collection. However, if the return type is not a collection type, the collection generated by SQL shoul contain only one tuple. Otherwise, a system may have two choices: flag an error or select an arbitrary one from the collection.

Functions can also be defined in a programming language as C, C++, Java. It can be more efficient and handle more complex computations than that defined using SQL. Since the code needs to be loaded and executed within the database system code, it may carry the risk of: integrity: a bug in the program can corrupt the database internal structure security: it can by-pass the access control functionality of the database management system Embedded SQL is different from C++ code functions: in SQL the query is passed by the user program to the database system to run. User-written code never needs to access to the database itself. The operating system thus can protect the database from access by any user process.

Object Relational Databases