Functional Dependencies and Normalization for Relational Databases

Functional Dependencies and Normalization for Relational Databases

Relational Database Designs • There are many different relational database "designs" (schemas) that can be used to store the relevant mini-world information. • Can one schema be much better than another?

Example: Shipment of Parts • Consider a database containing shipments of parts from suppliers. • A supplier belongs to a particular city and a status is associated with each city. • A part has a name, color, quantity and weight and is shiped on particular date • Proposed relation to capture this information: Ship(S#, sname, status, city, P#, pname, colour, weight, qty, date)

Example: Shipment of Parts Ship(S#, sname, status, city, P#, pname, colour, weight, qty, date) Problems ?? • Redundant Data • Inability to represent certain information • Loss of information

Example: Shipment of Parts Ship(S#, sname, status, city, P#, pname, colour, weight, qty, date) • INSERT • We can not enter a new supplier until the supplier ships a part. • We can not enter a new part until it is shipped by a supplier. • DELETE • If we delete the only tuple of a particular supplier, we destroy not only the shipment information but also the information that the supplier is from a particular city. (example S5 and P1) • UPDATE • The city of a supplier may appear many times in Ship. • If we change the value of the city in one tuple but not all the other, it creates inconsistency in the database.

Design Anomalies Ship(S#, sname, status, city, P#, pname, colour, weight, qty, date) • This Apply relation exhibits three types of anomalies: • Redundancy • Update Anomaly • Deletion Anomaly • Good designs: • No anomalies • Can reconstruct all original information

Example: Shipment of Parts

Example: Shipment of Parts Supplier(S#, sname, status, city) Parts(P#, pname, color, weight) Shipment(S#, P#, qty, date) • INSERT • We can not enter the fact that a particular city has a particular status until we have a supplier from that city • DELETE • If we delete the only tuple with a particular city, we destroy not only the supplier information but also the information that that city has that particular status. • UPDATE • The city of a supplier may appear many times in Supplier (redundant information). • If we change the status for Paris, we have to change it for all tuples his Paris as the city or we face inconsistency. Problems ??

Example: Shipment of Parts

Informal Design Guidelines for Relational Databases • We first discuss informal guidelines for good relational design • Then we discuss formal concepts of functional dependencies and normal forms

Semantics of the Relation Attributes • Each tuple in a relation should represent one entity or relationship instance. EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours) • Attributes of different entities should not be mixed in the same relation • Only foreign keys should be used to refer to other entities • Entity and relationship attributes should be kept apart as much as possible.

Redundant Information in Tuples and Update Anomalies EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours) • Information is stored redundantly • Wastes storage • Causes problems with update anomalies • Insertion anomalies • Deletion anomalies • Modification anomalies

EXAMPLE : DESIGN ANOMALIES EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours) • Update Anomaly: • Changing the name of project number P1 from “Billing” to “Customer-Accounting” may cause this update to be made for all 100 employees working on project P1. • Insert Anomaly: • Cannot insert a project unless an employee is assigned to it. • Cannot insert an employee unless a he is assigned to a project.

EXAMPLE OF AN DELETE ANOMALY EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours) • Delete Anomaly: • When a project is deleted, it will result in deleting all the employees who work on that project. • Alternately, if an employee is the sole employee on a project, deleting that employee would result in deleting the corresponding project.

Two relation schemas suffering from update anomalies

Base RelationsEMP_DEPT and EMP_PROJ formed after a Natural Join : with redundant information

Redundant Information in Tuples and Update Anomalies • Design a schema that does not suffer from • insertion, • deletion and • update anomalies. • If there are any anomalies present, then note them so that applications can be made to take them into account. • Use anomaly free relations and define Views to join attributed frequently referenced in queries

Null Values in Tuples • Relations should be designed such that their tuples will have as few NULL values as possible • Attributes that are NULL frequently could be placed in separate relations (with the primary key) • Reasons for nulls: • Attribute not applicable or invalid • Attribute value unknown (may exist) • Value known to exist, but unavailable

Relation Decomposition • In order to solve the previous problems we decomposed the relations into smaller relations. • Relation decomposition can be dangerous we can loose information. • Loss-less join decomposition • If we decompose a relation R into smaller relations, the join of those relations should return R, not more, not less.

Example: Relation Decomposition • Redundant Information • Eliminates redundancy • To get back to the original relation use join

Bad decomposition • Association between CID and grade is lost • Join returns more rows than the original relation

Unnecessary decomposition • Join returns the original relation • Unnecessary: no redundancy is removed, and now SID is stored twice!

Spurious Tuples • Bad designs for a relational database may result in erroneous results for certain JOIN operations • The relations should be designed to satisfy the lossless join condition. • No spurious tuples should be generated by doing a natural-join of any relations.

Functional Dependencies

Motivation • How do we tell if a design is bad? • This design has redundancy and update anomalies • How about a systematic approach to detecting and removing redundancy in designs? • Dependencies, decompositions, and normal forms

Functional Dependencies • Functional dependencies play an important role in database design. • They describe relationships between attributes. • FDs are derived from the real-world constraints on the attributes

Functional Dependencies • A functional dependency (FD) has the form X → Y, where X and Y are sets of attributes in a relation R • X → Y means that whenever two tuples in R agree on all the attributes in X, they must also agree on all attributes in Y • We say X functionally determines Y • X -> Y in R specifies a constraint on all relation instances r(R) • For any two tuples t1 and t2 • in any relation instance r(R): • If t1[X]=t2[X], then t1[Y]=t2[Y]

Example: Student Grade Info • StudentGrade(SID, name, email, CID, grade) • SID → name, email • email → SID • SID, CID → grade • Not a good design

EXAMPLE: ADDRESS Address (street_address, city, state, zip) • street_address, city, state → zip • zip → city, state • zip, state → zip? • This is a trivial FD • Trivial FD: LHS ⊇ RHS • zip → state, zip? • This is non-trivial, but not completely non-trivial • Completely non-trivial FD: LHS ∩ RHS = ∅

Examples of FD constraints • If K is a key of R, then K functionally determines all attributes in R • SSN -> ENAME • PNUMBER -> {PNAME, PLOCATION} • {SSN, PNUMBER} -> HOURS

FD is a property of relational schema R. Must be define by someone who knows the semantic of attributes of R. FD’s are a property of the meaning of data and hold at all times: certain FD’s can be ruled out based on a given state of the database

Functional Dependencies • Given a relation R and a set of FD’s F • Does another FD follow from F? • Are some of the FD’s in F redundant (i.e., they follow from the others)? • Consider schema Emp_Dept(ename, ssn, bdate, address, dnumber, dname, dmgrssn) • F ={ssn -> {ename, bdate, address, dnumber), dnumber ->{dname, dmgrssn) } • Additional FD that can be infer from F • ssn-> dname, dmgrssn • ssn-> ssn • dnumber -> dname

Inference Rules for FDs • Given a set of FDs F, we can infer additional FDs that hold whenever the FDs in F hold • Armstrong's inference rules: • IR1. (Reflexive) If Y  X, then X -> Y • IR2. (Augmentation) If X -> Y, then XZ -> YZ (XZ stands for X U Z) • IR3. (Transitive) If X -> Y and Y -> Z, then X -> Z • IR1, IR2, IR3 form a sound and complete set of inference rules

Inference Rules for FDs • Some additional inference rules that are useful: • Decomposition: If X -> YZ, then X -> Y and X -> Z • Union: If X -> Y and X -> Z, then X -> YZ • Pseudotransitivity: If X -> Y and WY -> Z, then WX -> Z • The last three inference rules, as well as any other inference rules, can be deduced from IR1, IR2, and IR3 (completeness property)

If X -> YZ, then X -> Y and X -> Z • YZ -> Y (IR1) • X -> YZ (given) • X -> Y (transitivity IR3) • If X -> Y and X -> Z, then X -> YZ • X ->XY (augmenting X in X-> Y) • XY -> YZ (augmenting Y in X-> Z) • X -> YZ • If X -> Y and WY -> Z, then WX -> Z • WX->WY (augmenting) • WY -> Z (given) • WX -> Z (transitivity)

Inference Rules for FDs • Closure of a set F of FDs is the set F+ of all FDs that can be inferred from F • Closure of a set of attributes X with respect to F is the set X+ of all attributes that are functionally determined by X • X+ can be calculated by repeatedly applying IR1, IR2, IR3 using the FDs in F

Algorithm for computing Attribute Closure • The closure of X (denoted X +) with respect to F is the set of all attributes functionally determined by X • Algorithm for computing the closure X+ = X repeat OldX+ =X+ for each functional dependency Y → Z in F do if Y X+ then X+ := X+ Z until (X + = oldX+ )

Example 1: Closure Algorithm Consider following schema EMP_PROJ(Ssn, Ename, Pno, Pname, Hours) And FDs • F = {Ssn -> Ename, Pno -> {Pname, Plocation}, {Ssn, Pno} -> Hours } • Find the following Closure Sets • {Ssn}+ = • {Pno} + = • {Ssn, Pno} + =

Example 2: Closure Algorithm StudentGrade (SID, name, email, CID, grade) F includes: • SID → name, email • email → SID • SID, CID → grade • { CID, email }+ = ? • X+ = {CID, email} Consider email → SID • X+ = {SID, CID, email} Consider SID → name, email • X+ = {SID, CID, email, name} Consider SID, CID → grade • X+ = {SID, CID, email, name, grade}

Using rules of FD’s • Given a relation R and set of FD’s F • Does another FD X → Y follow from F? • Use the rules to come up with a proof • Example: • F includes: • SID → name, email; email → SID; SID, CID → grade • CID, email → grade? • email → SID (given in F) • CID, email → CID, SID (augmentation) • SID, CID → grade (given in F) • CID, email → grade (transitivity)

Equivalence of Sets of FDs • Two sets of FDs F and G are equivalent if: • Every FD in F can be inferred from G, and • Every FD in G can be inferred from F • Hence, F and G are equivalent if F+ =G+ • Covers: • F covers G if every FD in G can be inferred from F • (i.e., if G+ F+) • F and G are equivalent if F covers G and G covers F • There is an algorithm for checking equivalence of sets of FDs

Minimal Sets of FDs • A set of FDs is minimal if it satisfies the following conditions: • Every dependency in F has a single attribute for its RHS. • We cannot remove any dependency from F and have a set of dependencies that is equivalent to F. • We cannot replace any dependency X -> A in F with a dependency Y -> A, Y  X and still have a set of dependencies that is equivalent to F. • Every set of FDs has an equivalent minimal set • There can be several equivalent minimal sets

Algorithm for finding Minimal Cover F for a set of FDs E • Set F:= E • Replace each FD X-> {A1, A2, …, An} in F by n FDs X-> A1, X->A2, … , X-> An • For each FD X-> A in F For each attribute B of X if {{ F- { X -> A }} U { (X- B) -> A}} = F then replace X->A with (X- B) -> A in F • For each remaining FD X->A in F if {F – {X->A} } = F then remove X->A from F

Computing the Minimal Sets of FDs Consider set of FDs E : {B → A, D → A, AB → D}. We have to find the minimum cover of E. • Step1: Ensure above dependencies are in canonical form • Step 2: Determine if AB → D has any redundant attribute, that is, can it be replaced by B → D or A → D? Consider B → A BB → AB (augmenting with B -- IR2) B → AB AB → D (given) We get B → D ( transitivity – IR3) Hence, AB → D may be replaced by B → D. • We now have a set equivalent to E, say E′ : {B → A, D → A, B → D}. • Step 3: Look for a redundant FD in E′ Using the transitive rule on B → D and D → A, we get B → A. Hence B → A is redundant in E’ and can be eliminated. • The minimum cover of E is {B → D, D → A}.

Algorithm for finding Minimal Cover F for a set of FDs E • Set F:= E • Replace each FD X-> {A1, A2, …, An} in F by n FDs X-> A1, X->A2, … , X-> An • For each FD X-> A in F For each attribute B of X if {{ F- { X -> A }} U { (X- B) -> A}} = F then replace X->A with (X- B) -> A in F • For each remaining FD X->A in F if {F – {X->A} } = F then remove X->A from F

Assignment 3- Functional Dependencies • Before the start of next class • 10.21 • 10.24 • 10.26

Normalization

Normalization of Relations • Normalization: • The process of decomposing unsatisfactory "bad" relations by breaking up their attributes into smaller relations • Normal form: • Condition using keys and FDs of a relation to certify whether a relation schema is in a particular normal form

Normalization of Relations • 2NF, 3NF, BCNF • based on keys and FDs of a relation schema • 4NF • based on keys, multi-valued dependencies : MVDs; 5NF based on keys, join dependencies : JDs (Chapter 11) • Additional properties may be needed to ensure a good relational design (lossless join, dependency preservation; Chapter 11)

Practical Use of Normal Forms • Normalization is carried out in practice so that the resulting designs are of high quality and meet the desirable properties • The database designers need not normalize to the highest possible normal form (usually up to 3NF, BCNF or 4NF) • Denormalization: The process of storing the join of higher normal form relations as a base relation—which is in a lower normal form

Functional Dependencies and Normalization for Relational Databases