A Brief History of DB Theory

A Brief History of DB Theory • Functional dependencies --- Normalization • More dependencies: multivalued, general • Universal relations • Acyclic hypergraphs • Logical query languages: Datalog

Functional Dependencies • Setting: A relation = table with column headers (schema ) and a finite number of rows (tuples ). A B C 0 1 2 0 3 4

FD X ->Y • A statement about which instances (finite sets of tuples) are “legal” for a given relation. • If two tuples agree in all the attributes of set X , then then must also agree in all attributes of set Y. • Common case: X is the key of the relation, Y is the other attributes.

Issue #1: Inference • Armstrong’s Axioms • X ->Y if Y is a subset of X • X ->Y implies XZ ->YZ (XZ , etc. = “union”) • X ->Y and YZ ->W implies XZ ->W • Closure test (Bernstein): given FD’s and a possible consequent X ->Y , use the FD’s to see what X determines, and then see if Y is contained therein.

Issue #2: Redundancy • Certain combinations of FD’s lead to redundancy and other “anomalies.” • Example: B ->C is the only FD: A B C 0 1 2 3 1 ? • The FD lets us predict the value of ?.

Issue #3: Normalization • Eliminate redundancy by splitting schemas. • A FD X ->Y allows us to split schema XYZ into XY and XZ . • Without the FD, the decomposition would lack a lossless join : the ability to reconstruct the original XYZ relation from the decomposed relations.

How I Met Moshe & Learned Database Theory Tsichritzis Bernstein Beeri Ullman Vardi

Why I Like Working With Vardi • Taste, selection of issues. • Best inventor of constructions. • Name comes after mine in alphabet.

Lossless Joins • When relations are decomposed, do the pieces allow reconstruction of the original? • Only way: join the projected relations. • You always get back what you started with. • Bad case is when you get more.

Example A B C 0 1 2 3 1 4 A B A B C B C 0 1 0 1 4 1 2 3 1 3 1 2 1 4 0 1 2 3 1 4

More Kinds of Dependencies • In essence, a dependency is any predicate that tells whether a given set of tuples is OK for a given schema. • The language chosen determines what constraints can be expressed and what we can decide about relations. • FD’s are just one, simple example

Multivalued Dependencies • These occur when a relation tries to connect one class of objects to independent sets from two other classes.. • Notation: X ->->Y means: • If two tuples agree on X , then we may swap the Y components and get two tuples that are also in the relation.

Example • EmpID ->-> Phone EmpID Addr Phone Project If: 123 a1 p1 j1 123 a2 p2 j2 Then: 123 a1 p2 j1 123 a2 p1 j2

MVD Origins • Independent work of Delobel, Fagin, Zaniolo • Inference of MVD’s + FD’s: Beeri, Fagin, and Howard.

Generalized Dependencies • Equality-generating dependencies: • If tuples with this pattern of equal symbols appear in a relation instance, then certain symbols must also be equal. • Generalizes FD’s. • Tuple-generating dependencies: • If tuples with this pattern of equal symbols appear in a relation instance, then a certain tuple must also appear. • Generalizes MVD’s.

Example: EGD • FD A ->B in schema ABC : A B C a b1 c1 (Hypotheses) a b2 c2 b1 = b2 (Conclusion)

Example: TGD • MVD A ->->B in schema ABC : A B C a b1 c1 (Hypotheses) a b2 c2 a b1 c2 (Conclusion)

Full Versus Embedded GD’s • Full GD’s have a conclusion with only symbols that appear in the hypotheses. • Embedded GD’s may have new symbols in the conclusion. • Existentially quantified. • Embedded EGD makes no sense, but embedded TGD’s are quite interesting.

The Chase for Inferring GD’s • Test whether a GD G follows from given GD’s. • Start with R = the hypotheses of G . • Apply GD H by mapping the hypotheses of H to some rows of R . • If so, infer the (mapped) conclusion of H --- equate two symbols of R or add a tuple to R . • If you eventually infer the conclusion of G , then G follows, else it does not.

Example: If A ->B Then A ->->B • R has tuples (a,b1,c1) and (a,b2,c2). Does it have (a,b1,c2)? • Apply A ->B . • It’s hypotheses are the same as the two tuples of R , so surely they map. • Lets us conclude b1=b2. • Thus, since R has (a,b2,c2), It surely has (a,b1,c2).

Decision Properties of GD’s • If all dependencies are full, no new symbols ever appear, so the chase must terminate. • Either we prove the desired conclusion, or R becomes a relation that provides a counterexample --- it satisfies all the given GD’s, but not the one we were trying to infer.

Undecidability of Embedded GD’s • First undecidability results involved the inference of untyped GD’s (symbols may appear in several columns). • Beeri and Vardi • Chandra, Lewis, and Makoswky • Tough result is undecidability even when GD’s are typed (one column per symbol). • Vardi • Gurevich and Lewis

History of GD’s • Similar ideas were developed independently by several people: • Yannakakis and Papadimitriou • Beeri and Vardi • Fagin • Sadri and Ullman

The Universal Relation • Idea: attributes carry information regardless of how they are placed in relation schemas. • One of the cool things about Moshe Vardi is the collection he keeps of re-inventions of this universal relation concept. • Typically touted by some HCI type as a “wonderful new interface to databases.”

UR Query Systems Query UR Stored Relations

Example • Stored relations: ES(Emp,Sal), ED(Emp,Dept), DM(Dept,Mgr) • Universal relation: U(Emp,Sal, Dept,Mgr) • UR query: SELECT Mgr WHERE Emp=a • Translated to: SELECT Mgr FROM ED, DM WHERE Emp=a AND ED.Dept=DM.Dept • U != join(ES, ED, DM); empty ES proves it.

Theory of Universal Relations • Some were quite upset by the UR idea. • Objection: the informal or ad-hoc way queries were translated from UR to stored relations. • Codd: “Neither the collection of all base relations nor the collection of all views should be cast by the DBMS in the form of a `universal relation’ (in the Stanford University sense) [Vardi, 1988].”

Some Early Approaches to the UR • Selected joins of stored relations. • Representative instance = pad stored relations with nulls in missing attributes; then chase with given dependencies. • Contributions by Vassiliou, Honeyman, Mendelzon, Sagiv, Yannakakis. • Maximal objects = union of joins within “acyclic hypergraphs.” • Maier, Ullman

Window Functions • Two stage process: 1. If query involves set of attributes X , compute the window function [X ] = some relation with schema X derived from the UR. 2. Apply the query to [X ]. • Different UR definitions give different window functions, e.g., join-based, rep.-instance, maximal-object. • From Maier, Ullman, Vardi [1984].

Acyclic Hypergraphs • Several different applications led to the identification of a class of database schemas (collection of relation schemas) with useful properties. • Sensible connections among stored relations in maximal-object theory of UR. • Joins of many relations in time polynomial in the size of the input and output (Yannakakis). • Etc., etc.

Hypergraphs • Nodes (typically attributes) • (Hyper)edges = sets of any number of nodes. A B C D E F

GYO Reduction • Graham, Yu-Ozsuyoglu, independently defined acyclic hypergraphs thusly: • Two transformations: 1. Delete a node that is in only one hyperedge. 2. Delete a hyperedge contained in another. • Hypergraph is acyclic iff the result is a single, empty edge. • Limit is unique for any hypergraph.

Example • GYO steps: delete C; delete D; delete {B}; delete F; delete {B,E}; delete A; delete B; delete E; A B C D E F

History • Acyclic hypergraph idea from Bernstein and Goodman. • But the definition is rather different. • Hypergraph version from Fagin, Mendelzon, Beeri, Maier, others. • And don’t forget the GYO folks. • Fagin defined several related-but-distinct concepts of “acyclic.” • This one is “alpha-acyclic.”

Logical Query Languages • Collections of Horn-clause rules without function symbols behave almost like SQL, but can do recursion. • Conjunctive queries = single Horn-clause rules w/o function symbols have decidable containment (Chandra and Merlin).

Conjunctive Queries • Example: path(X,Y) :- edge(X,Z) & edge(Z,Y) • Atoms in the body (hypothesis) refer to stored relations (EDB or “Extensional Database”). • Atom in head (conclusion) is the result. • Q1 is contained in Q2 if for every database D, Q1(D ) is a subset of Q2(D ).

Containment Mappings • Mapping from the variables of Q2 to the variables of Q1 that: 1. Turns the head of Q2 into the head of Q1. 2. Turns every atom in the body of Q2 into some atom in the body of Q1. • Containment mapping exists iff Q1 is contained in Q2.

Example Q1: answer(X,Y) :- e(X,Y) & e(Y,X) Q2: answer(X,Y) :- e(X,W) & e(W,Z) & e(Z,Y)

Datalog Programs • Collection of conjunctive queries. • Some predicates are EDB (stored). • Other predicates are IDB (intensional database; defined by the CQ’s only). • One IDB predicate is the answer = least fixedpoint of the CQ’s.

Example path(X,Y) :- edge(X,Y) path(X,Y) :- path(X,Z) & edge(Z,Y) • Value of path is all pairs of nodes such that there is a path from the first to the second according to EDB predicate edge .

Optimization of Datalog Programs • Problem: often the query asks for only a fraction of the answer. • e.g., find path(0,Y) for fixed node 0. • Bottom-up methods (“seminaive”) evaluate the whole least-fixedpoint, throw most away. • Top-down (goal seeking) can get stuck in left-recursion.

Linear-Recursion Methods • Recursion is linear if at most one IDB atom in the body. • Henschen-Naqvi. • Magic-sets for linear (Bancilhon, Maier, Sagiv, Ullman). • Left- and right-linear special cases (Naughton). • Conversion of nonlinear to linear (Ramakrishnan, Sagiv, Ullman, Vardi).

Magic Sets • Several similar techniques for rewriting or executing Datalog programs in a way that avoids generation of useless facts. • Rohmer, Lescoeur, and Kerisit (earliest exposition). • Beeri and Ramakrishnan (reordering of atoms). • Sacca and Zaniolo (simplification of rules). • Dietrich and Warren (tabulation). • Vielle (query-subquery).

Bounded Recursion • Sometimes a recursive Datalog program is equivalent to a nonrecursive program. • Sufficient conditions (Naughton, Ioannidis, Naughton and Sagiv). • Undecidability (Gaifman, Mairson, Sagiv, and Vardi). • Decidable cases (Vardi, Cosmodakis, Gaifman, Kanellakis, and Vardi)

Containment for Datalog • Undecidable if one Datalog program is contained in another (Shmueli). • NP-complete whether a CQ is contained in a Datalog program. • Triply exponential whether a Datalog program is contained in a CQ (Chaudhuri and Vardi).

What Is It All Good For? • Dependencies and normalization are now familiar to most CS graduates. • Difference between BCNF and 3NF. • Tradeoffs: lossless joins versus maintaining dependencies. • Magic-sets, recursion used in IBM’s DB/2. • Universal-relation interfaces despite Codd.

More “What Is It All Good For?” • Information integration . While logical query languages have not caught on, logic has been used to specify how legacy databases are combined into a uniform whole. • Tsimmis (Papakonstantinou, Vassalos). • Information Manifold (Levy). • Infomaster (Duschka, Genesereth).

A Brief History of DB Theory