SQL Unit 18: Data Management: Databases and Organizations Richard Watson

SQL Unit 18: Data Management: Databases and OrganizationsRichard Watson Summary of Selections from Chapters 9, 10 prepared by Kirk Scott

Chapter 9, The Relational Model and Relational Algebra • Generally speaking, the contents of this chapter should not be too difficult • The idea is that most of the information has been introduced inductively in the foregoing sections • This chapter puts some of the earlier information into context and sums up the idea of relational databases

Background • Databases existed before the development of the relational model • They were based on networks or hierarchies • In other words, their implementation was based on linked data structures • These kinds of databases were not easy to understand or code • Note how everything is cyclical: O-O databases are basically modern hierarchical databases

The general idea of storing data in tables was an obvious alternative from the beginning. • However, it was unclear whether the relational data model was a practical alternative for large data sets. • The main apparent problem was performance. • Linked code can run quickly. • Performing joins, for example, by traversing two tables is not very efficient.

More basic questions included, “What is a database?” and, “What should the user interface be like?” • E. F. Codd is recognized as the main figure in the development of the relational model as a practical alternative to existing dbms’s. • He and others addressed all of these questions.

Normalization was one of the results of this work. • It illustrates the value of theory. • Garden variety idiots think they understand tables. • “What is there not to understand?” they think. • Without a theoretical understanding there was no clarity about tables or how to use them.

The two remaining areas of development were related. • Efficient algorithms for performing relational operations were necessary. • Would the user be exposed to the implementations of the operations? • Or would the user be given a different language as an interface?

Codd made the following observations about existing systems: • 1. They forced programmers to write low level code • This meant that queries were more difficult to write, took longer to write, and typically required debugging because they were error prone.

2. No commands were available for processing multiple records at a time. • Existing systems used procedural algorithms which used loops to traverse linked data structures. • Linked list traversal can be efficient, but the code can be difficult to write • By definition, inside the loop one record at a time was accessed.

The relational model was inherently set-based • It would be desirable to give the user a set-based interface • That would require the implementation of set level commands in the db internals. • Efficient implementations would be needed before the relational model could be adopted

3. The existing systems were not amenable to ad hoc querying. • Trained programmers are needed in order to write procedural code. • SQL is simple enough that an end user can learn it (maybe). • Also, the development time for an SQL query is short enough that it becomes practical to write one-time queries, not suites of programs.

Observations about existing systems and the contrasts with the relational model led Codd to these three goals for a database management system: • 1. Data independence • 2. Communicability • 3. Set processing

1. Data independence • The users of databases should not have to worry about how the data was physically stored. • They should be free to envision the data simply as a collection of related tables, regardless of the physical implementation. • Any physical level questions would be at the operating system or database administrator level.

2. Communicability • The basic idea here is that the relational model, based on tables, records, keys, and values, is relatively easily understood by both users and programmers, making it easier for clients and developers to work together. • This is in marked contrast to earlier database models.

3. Set processing • This is basically just a repetition of information given above. • The beauty of the relational model is that it allows queries to be non-procedural and still supports the retrieval of multiple records. • The model is “tell what you want” rather than “tell how to get it”.

The Major Components of the Relational Model • The relational data model has three major components: • Data structures • Integrity rules • Operators used to retrieve, derive, or modify data

Data Structures • The following terms summarize the data structures that the relational model is based on: • Domains (fields) • Relations (collections of fields) • Primary key • Candidate key = Alternate key • Foreign key • Relational database (relations in a primary to foreign key relationship)

Integrity Rules • These are the integrity rules of the relational model: • Entity integrity • The primary key is unique and not null • Referential integrity • Every foreign key value has to have a matching primary key value

Operators = Manipulation Languages • A complete dbms has to support two kinds of functionality • The two kinds of functionality be together in one language or they may be implemented in separate forms: • DDL = data definition language = defining the database tables • DML = data management language = inserting, updating, and deleting data

There are essentially four language or manipulation options when it comes to relational databases: • Relational calculus • Relational algebra • SQL • QBE

Relational Calculus • Relational calculus is based on the mathematical underpinnings of the relational model • It has never been implemented as a language in a widely accepted dbms product • Relational calculus will not be pursued at all

Relational Algebra • Relational algebra also emphasizes the mathematical underpinnings of the relational model • The query language for Postgres, Quel, was based on relational algebra • In the marketplace, it has largely been superseded by SQL

Relational algebra will be pursued for two reasons: • It provides a useful vocabulary for talking about queries • Even without delving into the theory, it is possible to make some useful observations about the necessary contents of a query language based on relational algebra concepts

Relational algebra is fundamentally based on 8 operations: • 1. Restrict (select): This picks a subset of rows from a table • 2. Project: This picks a subset of columns from a table: • 3. Product: This forms all possible pairings of the rows of two tables

4. Union: This forms a vertical combination of the rows of two tables • 5. Intersect: This finds the rows that appear in both of two tables • 6. Difference: This finds the rows that appear in one table but not another • 7. Join: this finds a subset of rows of a product, typically where corresponding field values match

8. Divide: • Relational divide is not as simple as the other concepts and has not been fully explained yet • For the sake of completeness it will be explained in the following overheads • After explaining division, the discussion will return to relational concepts in general

Relational Division • Divide should be (and is) the converse of product • Division was mentioned in passing in the unit on SQL querying that covered double NOT EXISTS • In the context of the products of relations, the logical concept of FOR ALL is closely linked to the concept of division • Practically speaking, division will be accomplished using double NOT EXISTS

Relational Division Example • The plan for this section is to explain relational division with the help of a few examples. • These examples are actually the last four questions on the assignment for this unit. • The answers to these questions will be given here as part of the explanation.

If you choose to do the assignment, your goal should not be to copy the answers given. • Instead, after having read the explanatory material, hopefully enough of it will stick in your memory that you can come up with the correct answer on your own. • If not, you can refer back to the explanations again.

TableX, TableY, and TableZ are given for the questions/examples. • TableY is the table in the middle in a m – n relationship between TableX and TableZ • TableY contains a subset of the Cartesian product of the pk (id) fields of TableX and TableY • The tables are shown on the following overhead.

Two tables are needed in order to do division. • In this example we are interested in the quotient of TableY and TableZ • In other words, we’re interested in finding TableY DIVIDED BY TableZ. • TableX is included in the example in order to help visualize the relationship between TableY and TableZ.

Dividing TableY by TableZ won’t yield TableX • It will yield a subset of TableX • This is because TableY isn’t the full Cartesian product of TableX and TableZ • If TableY were the full Cartesian product of TableX and TableZ, then TableY DIVIDED BY TableZ would give TableX

Part of the goal of this discussion is to show how relational division can be accomplished in SQL. • Division in SQL is done by means of double NOT EXISTS queries • The familiar structure of such queries consists of double nesting with three tables • Therefore, it’s convenient to have TableX available along with TableY and TableZ

Now consider TableY and TableZ. • The first column of TableY is the field xid. • The second column of TableY is the field zid. • TableY and TableZ have the field zid in common. • The second field of TableZ, field zone, does not play a role in the division. • The division of the two tables is based on the common field, zid. • The result of the division will be in terms of the first field in TableY, xid.

The definition of relational division can be explained using these two tables as an example. • The verbal expression of what TableY DIVIDED BY TableZ is supposed to produce as a result is this: • It should find all of those values of xid, the first field in TableY, where those values of xidare matched with every value ofzid, the common field, that appears in TableZ.

The verbal expression can be restated in this way: • The division of the two tables should find those values of xid in TableYthat are in a Cartesian product with the values of zid in TableZ. • The division operation will not include in the results any values of xid in TableYthat are not matched with every value of zid in TableZ.

TableY divided by TableZ on the fields TableY.zid and TableZ.zid, respectively, gives a one column result table containing xid values taken from TableY. • TableX, TableY, and TableZ are repeated on the next overhead. • The result of dividing TableY by TableZ is shown on the overhead following that one.

Another Example • In order to help the idea stick, another example is explained here verbally without completely illustrating it with tables. • Suppose some TableR was the full Cartesian product of the xid values in TableX and the zid values in TableZ. • What would the result be of dividing TableR by TableZ on their common field zid?

Except for the fact that it's stated verbally rather than completely illustrated, this question is easier than the first one. • In this example TableR replaces TableY. • If TableR is the Cartesian product of TableX.xid and TableZ.zid, then every xid value in TableR will be in the result of TableR divided by TableZ. • In other words, the actual results of the division would be the table shown on the next overhead.

Relational Division Using SQL • From a mathematical point of view, relational division is a binary operation. • Using SQL syntax, relational division can be accomplished with double NOT EXISTS. • Double NOT EXISTS on three different tables is easier to keep track of than double NOT EXISTS on two tables, where one table appears once and the other table appears twice in the query. • That’s why TableX is included in the example.

Relationally, the operation of interest is TableY divided by TableZ • Operationally, this means finding those values of TableX.xid which are paired in TableY with all of the values of TableZ.zid • In other words, find those TableX.xid values where there does not exist a TableZ.zid value that it's not matched with in TableY. • The desired results can be phrased using universal quantification, all, or double negation.

This is the indication that in SQL the desired result can be obtained with a double NOT EXISTS query. • If this query is written correctly, the result set of TableX.xid values will equal the set of TableY.xid values that would result from dividing TableY by TableZ on the fields TableY.zid and TableZ.zid, respectively. • The desired query is shown on the next overhead.

SELECT xid • FROM TableX • WHERE NOT EXISTS • (SELECT * • FROM TableZ • WHERE NOT EXISTS • (SELECT * • FROM TableY • WHERE TableX.xid = TableY.xid • AND TableY.zid = TableZ.zid));

Phrased informally, as was done in the unit that covered the double not exists queries, this query asks for those values of xid in TableX where there is not a zid value in TableZ that it's not matched with, through the table in the middle, TableY.

Notice that this query follows the pattern for double NOT EXISTS queries • The first query opens the left base table • The second query opens the right base table • The third query opens the table in the middle. • For reasons of scope, both of the joining conditions are in the third query.

It is also possible to write such a query using just the two tables that are involved in the division. • When considering the double NOT EXISTS query an example was given where all of the relevant fields were in the table in the middle and it could be opened three times with aliases in order to achieve the desired results.

SQL Unit 18: Data Management: Databases and Organizations Richard Watson