Design Guidelines for Effective Database Schemas

CS 728Advanced Database SystemsChapter 10 Functional Dependencies & Normalization

Design Guidelines for Relation Schemas • 1. Design relation schemas so that their attributes will have clear meanings (semantics) and related attributes are grouped into single entities. • 2. Design relation schemas in such a way to avoid update anomalies by reducing redundant data. • 3. Avoid (minimize) NULL values. • 4. Design schemas so that when relations of such schemas are joined no spurious tuples will be generated.

Semantics of Relation Attributes • Guideline 1: • Design a relation schema so that it is easy to explain its meaning. Do not combine attributes from multiple entity types and relationship types into a single relation. • Design I: STUDENT(STNO, Name, Address, ANO) ADVISOR(ANO, Name, Address, Dept) • Design II: Student-Advisor(STNO, Name, Address, ANO, A-name, A-address, Dept) • Design I is better when compared with Design II.

Update Anomalies • Insertion anomalies: • In design II, if we add a new student to Student-Advisor, we have to add data related to that student’s advisor. This information should be consistent with all other occurrences of that advisor. Note that in design II all data related to a particular advisor is repeated a number of times which equals to the number of students supervised by that advisor. In design I, only advisor number is repeated. • It is difficult to add a new advisor who have no students yet to the database. This is because we have to assign nulls to STNO and STNO is the primary key for Student-Advisor.

Update Anomalies • Deletion Anomalies • If we delete the last student associated with a particular advisor, then that advisor cannot exist in the database in design II any more. • Modification Anomalies • If an advisor changes his/her address, say, then we have to modify his/her address in all tuples. • If we miss some tuples, then we will have several addresses for the same advisor.

Update Anomalies • Guideline 2: • Design the base relation schemas so that no insertion, deletion, or modification anomalies occur. • If any anomalies are present, note them clearly so that the programs that update the database will operate correctly.

NULL values in Tuples • Guideline 3: • Avoid placing attributes in a base relation whose values may be null. • If nulls are unavoidable, make sure that they apply in exceptional cases only and do not apply to majority of tuples in the relation.

NULL values in Tuples • Problems with Nulls: • Waste storage space. • Have multiple interpretations (not-applicable, not-known,…). • Create ambiguities with aggregate functions (count, avg, …) • Create ambiguities with joins.

NULL values in Tuples • Example: • If only 10% of employees have phones, then Employee(SSN, Name,…., Office-phone) is a poor design, because 90% of the last column values will be nulls; • But: • Employee(SSN, Name, …) • Phone(SSN, Office-phone) is a better design.

Spurious Tuples • Guideline 4: • Design relation schemas so that they can be joined with equality conditions on attributes that are either primary keys or foreign keys in a way which guarantees that no spurious tuples are generated.

Spurious Tuples SSN Pno Hours Ename Pname Plocatoin 11 p1 20 Alex X Irbid 22 p1 20 John X Irbid 22 p2 25 John Y Amman 33 p2 25 Bruce Y Amman 33 p3 30 Bruce Z Amman • Emp-Proj(SSN, Pno, Hours, Ename, Pname, Plocation)

Spurious Tuples Ename Plocation Alex Irbid John Irbid John Amman Bruce Amman • Emp-locs(Ename, Plocation) • Emp-Proj1(SSN, Pno, hours, Pname, Plocation)

Spurious Tuples SSN Pno Hours Pname Plocatoin 11 p1 20 X Irbid 22 p1 20 X Irbid 22 p2 25 Y Amman 33 p2 25 Y Amman 33 p3 30 Z Amman Emp-Proj(SSN, Pno, Hours, Pname, Plocation)

Spurious Tuples Ename Plocation SSN Pno Hours Pname Alex Irbid 11 p1 20 X Alex Irbid 22 p1 20 X John Irbid 11 p1 20 X John Irbid 22 p1 20 X • Result  Emp-Locs  Emp-proj1 • Then: Result(Ename, Plocation, SSN, Pno, Hours, Pname)

Spurious Tuples • If we combine Emp-Locs and Emp_Proj1 on Plocation attribute, we will get spurious tuples as you have noticed in the previous slide. This is because Plocation is the attribute which combines the two relations and it is neither a primary keynor a foreign key in either Emp-Locs or Emp-Proj1

Bad Tables (1) • Alternative designs for a product database: • (1) • Products(Prod_no, Prod_name, Price, Manu_id) • Manufacturers (Manu_id, Manu_name, Address) • (2) • Prod_Manu (Prod_no, Prod_name, Price, Manu_id, Manu_name, Address)

Bad Tables (2) • Problems with the second design: • Redundancy --- the name and address of each manufacturer will be repeated once for each product made by the manufacturer. • more storage space needed • potential inconsistency (update anomalies)

Bad Tables (3) • Insertion anomalies --- a manufacturer's name and address cannot be recorded in the database if it does not make at least one product (because Prod_no is part of the primary key). • Deletion anomalies --- If we delete all products made by a manufacturer, we will unintentionally lose track of the manufacturer's name and address. • The first design does not have similar problems. • The challenge is to identify bad relations and convert them into good relations.

Functional Dependencies • R: Relation schema. • r(R): Relation instance • A1, A2, …, An: Attributes which belong to universal relation. • X, Y: Sets of attributes.

Functional Dependencies • A functional dependency (X  Y), between X and Y specifies a constraint on the possible tuples that can form a relation instance r of R. • The constraint states that for any two tuples t1 and t2 in r such that t1[X] = t2[X], then t1[Y] = t2[Y] • This means that the Y component of a tuple in rdepends on (or determined by) the values of the X component of that tuple in r. • Or the values of the X component of a tuple in r uniquely or functionally determines the values of Y component.

Functional Dependencies • X is said to functionally determine Y (or Y is functionally dependent on X) if for every legal relation instance r(R), for any two tuples t1 and t2 in r(R), we have • t1[X] = t2[X] , then t1[Y] = t2[Y] • X  R denotes that X is a subset of the attributes of R • X Y denotes that X functionally determines Y.

Functional Dependencies • Note that: • if X is a candidate key then X  Y is correct for any subset of attributes of R. • if X  Y, this does not say whether Y X is correct or not. • Student(SSN, STNO, Name, Major) • SSN  Name • t1(91910, 980090012, Ahmed, …..) • t2(91910, 980090012, Ahmed, …..) • STNO  Major • t10((980090012,…, Math) • t12(980090012, …, Math) • SSN  {SSN, Name, STNO, Major}

Functional Dependencies 1 Manage 1 Dept Manager • Manager(SSN, Name, …. Dept) • SSN  {Name, Dept} (correct) • SSN uniquely determines the Name and Dept • Dept  SSN (correct) • Dept uniquely determines the SSN • Name  SSN (incorrect)

Functional Dependencies • FD is a property of the meaning or semantics of attributes • FD is specified as a constraint on R, all extensions of R (i.e. r(R)) should specify that constraint (legal extensions); otherwise r(R) are called illegal extensions • Note that we cannot infer FDs from r(R)

Diagrammatic Representation of FDs • SSN  STNO, NAME, MAJOR • STNO  SSN, NAME, MAJOR Student(SSN, STNO, Name, Major) FD 1 FD 2

Inference Rules for Functional Dependencies • Let • F: set of functional dependencies defined on R • F+ (Closure of F): is the set of all functional dependencies that can be defined on R • The closure of F is the set of all FDs that are logically implied by F • The closure of F is denoted by F+ • F+ = { X  Y | F ╞X  Y} • A BIG F+ may be derived from a small F • For R(A, B, C) and F = {A  B, B  C} • F+ = {A  B, B  C, A  C, A  A, B  B,C  C, AB  AB, AB  A, AB  B, ... }

Inference Rules for Functional Dependencies • Emp-Dept(SSN, Ename, Bdate, Address, Dnumber, Dname, MGR-SSN) F = { SSN  {Ename,Bdate,Address,Dnumber}, Dnumber  {Dname, MGR-SSN}} • We can infer the following FDs: • SSN  SSN (Reflexive) • SSN  Ename (Decomposition) • SSN {Dname, MGR-SSN} (Transitive) • We usually write F ╞ X  Y to denote that the FD X Y is inferred from F • X, Y are subsets of attributes

Functional Dependency • Several equivalent definitions: • X  Y in R ifffor any t1, t2 in r(R), if t1 and t2 have the same X-value, then t1 and t2 also have the same Y-value • X  Y in R iffthere exist no t1, t2 in r(R) such that t1 and t2 have the same X-value but different Y-values • X  Y in R ifffor each X-value, there corresponds to a unique Y-value.

Functional Dependency • Question: if all X-values are different in all possible r(R), does X  Y in R? • Theorem 1: • If X is a superkey of R and Y is any subset of R, then X  Y in R • Note that X  Y in R is a property that must be true for all possible legal r(R), not just for the present r(R)

Functional Dependency Example: which is true? A B C D A  B a1 b1 c1 d1 A  C a1 b2 c1 d2 C  A a2 b2 c2 d2 A  D a2 b3 c2 d3 B  D a3 b3 c2 d4 AB  D AB  C AB  CD

Identify Functional Dependency • FD created by assertions. • Employees(SSN, Name, Years_of_emp, Salary, Bonus) • Assertion: • Employees hired the same year have the same salary • This assertion implies: • Years_of_emp  Salary

Inference Rules • IR1: Reflexive Rule ( منعكس) • Y  X, then X Y • e.g. • SSN  SSN • {P#, S#, Qty}  Qty • IR2: Augmentation Rule زيادة)) • { X  Y} ╞ XZ  YZ • e.g.: F = {SSN  Address} • F ╞ {SSN, Name}  {Address, Name}

Inference Rules • IR3: Transitive rule: (متعدّي) • {X  Y, Y  Z} ╞ X  Z • e.g. • F = {SSN  Dnumber, Dnumber  Dname} • F ╞ SSN  Dname • IR4: Decomposition Rule: (التّحلّل) • X  {Y, Z} ╞ X  Y and ╞ X  Z • e.g. • F = {SSN  {Ename,BDate,Address, Dnumber}} • F ╞ SSN  Ename • ╞ SSN  Bdate • ╞ SSN  Address • ╞ SSN  Dnumber

Inference Rules • IR5: Union (Additive) Rule: • {X  Y, X  Z} ╞ X  {Y, Z} • e.g. • F = {SSN  Ename, SSN  Bdate} • F ╞ SSN  {Ename, Bdate} • IR6: Pseudo-transitive Rule: • {X  Y, WY  Z} ╞ WX  Z • e.g. • F = { SSN  STNO, {Major, STNO}  Name} • F ╞ {Major, SSN}  Name

Closure of Attributes • How to determine if F ╞ X  Y is true? • Method 1: • Compute F+ • If X Y  F+, then F ╞ X Y • Problem: • Computing F+ could be very expensive!

Closure of Attributes • Method 2: • Compute X+ : the closure of X under F • X+ denotes the set of attributes that are functionally determined by X under F. • X+ = { Y | X  Y  F+ } • Theorem: • X Y  F+ if and only if Y  X+

Algorithm for Computing X+ • Input: • a set of FDs F, a set of attributes X in R • Output: • X+ • Begin • X+ = X; • Repeat • oldX+ = X+ • for each FD Y Z in F do • if Y  X+ then X+ = X+Z; • until (X+ = oldX+ ) • end

Algorithm for Computing X+ • Example: • R(A, B, C, G, H, I) = ABCGHI • X = AG • F = {A  B, CG  HI, B  H, A  C } • Compute X+ = (AG)+ • Initialization: • X+ = AG;

Algorithm for Computing X+ • 1st iteration: • consider A  B: • since A is a subset of X+, X+ = ABG; • consider CG  HI: • since CG is not a subset of X+, X+ = ABG; • consider B  H: • since B is a subset of X+, X+ = ABGH; • consider A C: • since A is a subset of X+, X+ = ABCGH; • X+ is changed from AG to ABCGH

Algorithm for Computing X+ • 2nd iteration: • consider A  B: • since A is a subset of X+, X+ = ABCGH; • consider CG  HI: • since CG is a subset of X+, X+ = ABCGHI; • consider B  H: • since B is a subset of X+, X+ = ABCGHI; • consider A  C: • since A is a subset of X+, X+ = ABCGHI; • X+ is changed from ABCGH to ABCGHI

Algorithm for Computing X+ • 3rd iteration: • consider each FD in F again, but there is no change to X+, exit • Result: • (AG)+ = ABCGHI. • The performance of the algorithm is sensitive to the order of FDs in F

Algorithm for Computing X+ • Theorem: • Given R(A1, ..., An) and a set of FDs F in R, K  R is a • superkey if K+ = {A1, ..., An}; • candidate key if K is a superkey and for any proper subset X of K, X+{A1, ..., An}.

Algorithm for Computing X+ • Continue the above example: • AG is a superkey of R since • (AG)+ = ABCGHI. • Since A+ = ABCH, G+ = G, neither A nor G is a superkey. • Hence, AG is a candidate key

Normal Forms Based on Primary Keys • Normalization of data: • is a process during which unsatisfactory relation schemas are decomposed by breaking up their attributes into smaller relation schemas that posses desirable properties • We normalize data for several reasons one of them is to avoid update anomalies • We have • First Normal Form (1NF) • Second Normal Form (2NF) • Third Normal Form (3NF) • Boyce-Codd Normal Form (BCNF) (a stronger definition of 3NF) • All the above normal forms are based on functional dependencies.

Normal Forms Based on Primary Keys • Forth Normal Form (4NF) Based on multivalued dependencies. • Fifth Normal Form (5NF) based on join dependencies. • Aside: • Student-Adv(STNO, StName, Major,…, Ano, Aname, …) • Has several problems • Student(STNO, StName, Major, …, Ano) • Advisor(Ano, Aname, ….) • Pay the price of expensive joins

Basic Definitions • R = {A1, …, An) • Superkey: is set of attributes S  R and t1[S]  t2[S] i =1, ..., n. S may contain redundant attributes. • Key (K): is a superkey with no redundant attributes, i.e. removal of any attribute from K will no longer make it a superkey. • Candidate Key: if a relation has more than one key, each is called a candidate key. One of these keys is arbitrarily chosen as a primary key. • Prime Attribute: is an attribute which is a member of any key (primary or candidate); other attributes are called nonprime attributes.

Basic Definitions • Student(SSN, STNO, Name, Address, Salary) • Superkeys • {SSN,Name}/{SSN,STNO,Name,Address,Salary} • Candidate keys • {SSN, STNO}; • Primary Key • SSN • Prime Attribute: • SSN and STNO • Nonprime Attributes: • {Name, Address, Salary}

1NF (First Normal Form) • A relation schema R is in 1NF if every attribute of R takes only single and atomic values. • Domains of attributes must include only atomic values and that the value of any attribute in a tuple must be a single value from the domain of that attribute. • In other words, multivalued and composite attributes are disallowed.

1NF (First Normal Form) • Student(STNO, StName, {Course(CNO, Ctitle)} • The set braces {} identify the attribute Course as multivalued • The set braces () identify the attribute Course as a composite attribute • Because of the last attribute (Course), the Student relation schema is not in 1NF.

1NF (First Normal Form) • Student(STNO, StName, {Course(CNO, Ctitle)} • To normalize it to 1NF: • Student(STNO, StName, CNO, Ctitle) • this is not a good representation because it has many disadvantages. Replication of Course information and student information. Combining attributes which belong to two separate entities (namely student and course) into a single relation schema.

Design Guidelines for Effective Database Schemas

Design Guidelines for Effective Database Schemas

Presentation Transcript

Advanced Database Systems

Advanced Database Systems

Advanced Database Systems

Advanced Database Systems

CS 728 Advanced Database Systems Chapter 16

Advanced Database Systems

CS 435 Database Systems

Advanced Database Systems

Advanced Database Systems

CS 306 Database Systems

Advanced Database Systems

CS 370 Database Systems

CS 728 Advanced Database Systems Chapter 18

CS 370 Database Systems

CS 370 Database Systems

Advanced Database Systems

CS 370 Database Systems

Database Systems CS 311

CS 435 Database Systems

CS 728 Advanced Database Systems Chapter 20 Foundation of Database Transaction Processing