Create Presentation
Download Presentation

Download Presentation

Normalization Hussam Eldin Najib CS Department An-Najah National University

Normalization Hussam Eldin Najib CS Department An-Najah National University

138 Views

Download Presentation
Download Presentation
## Normalization Hussam Eldin Najib CS Department An-Najah National University

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**NormalizationHussam Eldin NajibCS Department An-Najah**National University HUSSAM NAJIB- CS DEPT- AN-NAJAH NATIONAL UN.**Introduction**• The purpose of normalization is to design “good” database schemas; normalized schemas avoid certain types of anomaly • If we use methodologies based on developing a definition of the schema in terms of the ER model and then mapping such schema in a relational schema, most of these anomalies do not arise • Very often, however, the normalization process results in a large number of relations with few attributes each; notice as this may result in performance inefficiency, due to the increased need of join operations**Redundancy and anomalies**job budget name salary project Red 2 DB 300 technician Green 3 WEB 500 designer Green 3 CAD 1500 designer Black 7 CAD 1500 director Black 7 WEB 500 consultant Black 7 DB 300 consultant Mori 6 DB 300 director Mori 6 CAD 1500 designer White 6 CAD 1500 designer White 6 DB 300 designer**Redundancy and anomalies**1) redundancies: • The information concerning the salary of each employee is repeated several times • The information concerning the budget of each project is repeated several times • The values of columns ‘name’ and ‘project’ are repeated and therefore each one of these two columns cannot be taken as key • The key is thus (name, project): we do not have repetitions of information related to the relationship project-employee**Redundancy and anomalies**2) updates: • Because the information about the salary of each employee is repeated multiple times, if the salary of an employee is updated, such update must be propagated to all the tuples concerning such employee • The same applies to updates to the budget of projects**Redundancy and anomalies**3) deletions: • Suppose that an employee resigns from the company or does not participate on any of the company project, then data about some projects can be lost if this employee was the only one working on these projects • The same happens for employee data if a project is deleted • If the key is (project, name) in both cases of deletions the key would have null values, which is a situation not admitted**Redundancy and anomalies**4) insertions: • Because the key is (project, name) it is not possible to insert the data of an employee if such employee has not been assigned to at least a project. • The same holds for projects. It is not possible to insert the data about a project, if this project does not have at least an employee working on it • Accepting the insertion of a project (without an employee) or an employee (without a project) results in introducing null values (not compatible with the key)**Redundancy and anomalies**• Such anomalies do not in general arise if one follows a good database design methodology: design of an ER schema and mapping of this schema on the relational model • It can anyhow happen that deficiencies in the specification or errors in the conceptual shema may result in relations with anomalies • The most frequent cases are when dealing with database that are not well documented or when trying to understand the nature of the data from documents that describe the information based on paper forms**Redundancy and anomalies**• ER correct • Mapping onto the relational model employees (name,salary) projects (project,budget) Works-in (name,project,job) jobs name project (0,n) (0,n) projects employees budget salary job**Problem**• Given a relational schema: • How can we determine if there the schema has anomalies by only analyzing the schema? • In the case in which the schema has anomalies, how can we transform this schema in a schema which is equivalent and free of anomalies?**Approach**• Formalization of some knowledge about the domain through some special integrity constraints • Functional dependencies • Characterization of the properties of the schemas that do not have anomalies • Normal forms • Algorithms to translate schemas into normalized schemas • There is one such algorithm for each normal form**Functional dependencies**• A functional dependency is an integrity constraint for the relational model • In the employees-projects example, each employee has a unique salary • In the table, given an employee, each time such employee appears in a tuple, the value of the salary column is the same • We can say that the value of the column ‘name’ determines the value of the column ‘salary’, that is: • There exists a function that associates with each value of the domain of the column ‘name’ one and only one value from the domain of the column ‘salary’**Functional dependencies**• let R(A1,A2,....An) be a relational schema, and let X and Y be subsets of {A1,A2,....An}; we say that X functionally determines Y, denoted by X Y, if for each state r (or extension) of R, it is not possible that two tuples exist in r having equal values for all attributes in X, and different values for the attributes in Y (in other words, if t1, t2 r and t1.X=t2.X, then t1.Y=t2.Y) • In our example: name salary project budget name project job**Functional dependencies - Example**• Consider a database with the following schema: Members (Name, Address, Balance) Orders (Order_no, Name, Item, Quantity) Suppliers(Sname, Saddress, Item, Price) • Suppose that each customer has a unique address and balance; suppose that there are no two customers with the same name. Such assumptions can be formalized by the following functional dependency Name Address Balance**Functional dependencies - Example**• In the Orders relation we can assume that the order_no determines all other attributes, in that different orders have different order numbers; thus Order_no Name Item Quantity • In the Suppliers relation, if we assume that a supplier has a unique address and that a supplier supplies a given item at a unique price, we can observe the following dependencies: Sname Saddress Sname Item Price**Functional dependencies - Example**• A question is whether dependencies such as: Saddress Sname or Address Name are valid dependencies • The reply depends from the data semantics: if it is possible that two different customers have the same address, then the dependency Address Name is not valid (given an address, we can find two different customers with different names associated with this address)**Functional dependencies - Example**• In addition some trivial dependencies hold, such as: 1) Name Name 2) Sname Item Item • such dependencies state that: 1) if two tuples have the same value for Name, they have the same value for Name 2) two tuples having the same values for Item and Sname have the same value for Item**Functional dependencies - Example**• Other non trivial dependencies can be derived from the dependencies we have already specified • Sname Item Saddress Price • This dependency states that two tuples, having the same values for Sname and Item, must have the same values for both attributes Saddress and Price • This dependency is obtained by observing that Sname by itself functionally determines Saddress, whereas Sname and Item together functionally determine Price • Thus if we fix a value of Sname and a value of Item, we find a single value for Saddress and for Price**Functional dependencies - Closure**• Let F be a set of functional dependencies defined over a relational schema R, and let X Y be a functional dependency, we say that F logically implies X Y, denoted as F |= X Y, if each relation r of schema R that verifies the functional dependencies in F, also verifies X Y • Example: {A B , B C} |= A C • The closure of a set F of functional dependencies, denoted as F+, is defined as the set of functional dependencies logically implied by F F+ = {X Y | F |= X Y}**Functional dependencies - Closure**Example • Notation: in what follows we will use the compact notation A B C to denote {A B, A C} • R(A,B,C) and F = {AB, BC} • F+ consists of all the functional dependencies XY that verify one of the following three conditions • X contains A, • X contains B but not A and Y does not contain A, • X Y is the dependency C C • F+ contains for example ABC BC, AB BC, A C, BC B**Functional dependencies - Closure**• A set of inference rules has been defined that allows the computation of all the dependencies in F+ • Such rules are sound and complete, that is, they generate all and only the dependencies in F+ • Such set of rules is known as Armstrong’s axioms (from the name of the researcher that defined it in 1974)**Armstrong axioms**• Let Dbe a set of attributes and F be a set of functional dependencies defined over the attributes in D, the inference rules are as follows: • A1: (reflexivity) If Y X D, then X Y is logically implied by F such rule generates trivial dependencies, that is dependencies in which the set of attributes in the right hand side of the dependency is a subset of the set of attributes on the left hand side of the dependency; the use of such rule does not depend from F • A2: (additivity ) If X Y, and Z D, then XZ YZ is logically implied by F • A3: (transitivity ) If X Y, and Y Z, then X Z is logically implied by F**Rules derived from Armstrong axioms**• It is possible to derive additional rules from the Armstrong axioms a)Union rule {X Y, X Z} |= X YZ b) Pseudotransitivity rule {X Y, WY Z} |= XWZ c) Decomposition rule If Z Y, XY |= X Z**Keys**• The notion of key, that we have already seen, can be formalized by using the functional dependencies • Let R be a relational schema with attributes A1 A2.....An and F be a set of functional dependencies. Let X be a subset of R. X is a key of R if: • X A1 A2.....An is in F+ that is, the functional dependency of each attribute of R from X is either given or logically follows from the given functionally dependencies • There is no Y, proper subset of X, such that Y A1 A2.....An is in F+ (minimality condition) • We use the term superkey to denote a superset of the key**Keys - Example**• Person(SSN, FName, LName, BirthDate) • F={SSN FName LName BirthDate} (such dependency specifies that the SSN is unique for each person, that is, no two or more tuples exist that have the same value for SSN and different values for the other attributes) • It is easy to see that SSN is a key for Person, where as (SSN, LastName) is not, in that a proper subset exists (SSN) that functionally determines all the attributes**Keys - Example**• R(A,B,C) and F = {AB, BC} • The key is A because: AA, AB, AC (for the transitivity rule) or equivalently AABC • In addition no proper subset X (not including A) of the attributes of R exists such that XABC**Closure of a set of attributes**• Let F be a set of functional dependencies defined on a set D of attributes; let X be a subset of D. The closure of X with respect to F is denoted by X+ and is defined as the set of attributes {A / F |= XA} = {A / XA follows from F by application of the Armstrong axioms} • XYF+ if and only if Y X+ with respect to F**Closure of a set of attributes**• The computation of F+ is very expensive • In the worst case it is exponential in the number of the attributes of the schema • Computing the transitive closure with respect to a set X of attributes is less expensive • Very often we are not interested in computing F+ but only in verifying if F+ includes a certain dependency**Closure of a set of attributes - algorithm**• Input: A finite set D of attributes, a set F of functional dependencies, and a set X D • Output: X+, closure of X with respect to F • Approach: A sequence of attribute sets X(0), X(1),....... is computed by using the following rules • X(0) is X • X(i+1) = X(i) È A , where A is a set of attributes such that: • A dependency YZ is in F • A is in Z • Y X(i) • because X=X(0) ... X(i) … D and D is finite, at the end we obtain an index i such that X(i)=X(i+1) • If follows that X(i)=X(i+1)=X(i+2)=....... The algorithm then terminates when X(i)=X(i+1) • X+ is X(i), where i is such that X(i)=X(i+1)**Closure of a set of attributes - example**• Let F be the following set and suppose we would like to compute the (BD)+ ABC CA BCD ACDB DEG BEC CGBD CEAG • X(0)=BD • X(1): the dependency DEG is used X(1)=BDEG • X(2): the dependency BECis used X(2)=BCDEG • X(3): we consider CA, BCD, CGBD e CEAG X(3)=ABCDEG • (BD)+=X(3)**Equivalence of sets of functional dependencies**• Let F and G be setof functional dependencies, we say that F and G are equivalent if F+=G+ • To determine the equivalence we proceed as follows: • For each dependency YZ in F, we test if YZ is in G+ by using the attribute closure algorithm to compute Y+ and then verifying if Z Y+ • If some dependency YZ in F is not in G+, then F+G+ • We then check that each functional dependency in G is in F+**Minimal sets**• A set F of functional dependencies is minimal if: • The right hand side of each dependency in Fis a single attribute • For no dependency XA in F the set F \ {XA} is equivalent to F • For no dependency XA in F and no subset Z of X, the set F \ {XA} U {ZA} is equivalent to F • Intuitively: • condition (2) assures that F does not contain redundant dependencies • condition (3) assures that the left hand side of each dependency does not have redundant attributes • Because on the right hand side of each rule, there is a single attribute (because of condition (1)), certainly there is no rule having redundancies in the right hand side • Each set F of functional dependencies is equivalent to a minimal set F'**Minimal sets**• To determine the minimal set we proceed as follows: • We generate F' from F by replacing each dependency XY where Y has the form A1....An with a set of dependencies of the form XA1, ....., XAn • We consider each dependency in F' having more that one attribute on the left hand side; if it is possible to eliminate an attribute from the left hand side and still have an equivalent set of functional dependencies, such attribute is eliminated (approach: B in X is redundant with respect to the dependency XA if (X\{B})+ computed with respect to F' contains A) • Let F" be the set generated by the previous step. We consider each dependency XA in F" according to some order and if F"\ {XA} is equivalent to F", we remove XA from F" (approach: computer X+ with respect to F"\{XA} and if X+ includes A, then XA is redundant)**Minimal sets**• Note that at step (3) the order according to which the functional dependencies are analyzed may result in different minimal sets: AB AC BA CA BC it is possible to eliminate BA and AC, or BC but not all three • Note that at step(3) the order according to which the attributes are analyzed influences the obtained result ABC AB BA we can eliminate either A or B from ABC but not both**Minimal sets**• Let F be the following set ABC CA BCD ACDB DEG BEC CGBD CEAG • By simplifying the right hand sides we obtain ABC CA BCD ACDB DE DG BEC CGB CGD CEA CEG • Notice that: • CEA is redundant in that it is implied by CA • CGB è is redundant in that CG+ = CGADBE CGD, CA, and ACDB imply CGB as it can be verified by computing (CG)+ • In addition ACDB can be replaced by CDB because CA • result: ABC CA BCD CDB DE DG BEC CGD CEG**Minimal sets**• It is possible to obtain an other minimal set by eliminating: CEA, CGD and ACDB • The resulting minimal set is: ABC CA BCD DE DG BEC CGB CEG**Boyce-Codd normal form**• A relational schema R with a set F of functional dependencies is in Boyce-Codd normal form (BCNF) if: • For each X A in F, such that A is not in X, then X is a superkey of R (that is X is a key or includes the key) • In other words: all non trivial dependencies for R are dependencies in which a key functionally determines all the other attributes**Boyce-Codd normal form**In our example, the key is (name, project) • name, project function verifies the BCNF, • name salary and project budget are partial dependencies that cause anomalies • The redundancies and anomalies arise because of dependencies X Y that introduce repetitions in the relation**Boyce-Codd normal form**The schema CSZ with dependencies F={CS Z, Z C} is not in BCNF Key CS: CS+ = CSZ Key ZS: ZS+=ZSC The dependency Z C violates the BCNF: Z+ = {Z,C} and thus because Z+ does not include S, Z is not a key**Boyce-Codd normal form**The schemas Members (Name, Address, Balance) Orders (Order_no, Name, Item, Quantity) with the functional dependencies Name Address Balance Order_no Name Item Quantity are in BCNF**Third normal form**• there exists a normal form weaker than BCNF; such normal form is known as third normal form (3NF) • an attribute A of a relational schema is called prime if it is an element of some key of R (note: R can have several keys) • Example: CSZ with F = {CS Z, Z C} all attributes are prime in that in both CS and SZ are keys • Example: ABCD with F = {AB C, B D, BC A} the keys are AB and BC, thus A,B,C are prime, whereas D is not prime**Third normal form**• A relation schema R with dependencies F is in third normal form (3NF) if: • For each X A in F, such that A is not in X, then: (i) X is a superkey of R (that is X is a key or includes a key) or (ii) A is prime • The main difference is that the 3NF admits functional dependencies that do not have a key on the left hand side, as long as the right hand side only has prime attributes**Example**• CSZ with F = {CS Z, Z C} • is in 3NF in that all attributes are prime • SAIP with F ={ SI P, S A} • The only key for this relation is SI • The attributes prime are {S,I} • The dependency SI P does not violate the 3NF in that it verifies condition (i) • The dependency S A violate the 3NF in that S is not a superkey and A is not prime**Example**• Schema R = (S, I, D, M) S = Store I=Item D=Department M=Manager • functional dependencies: • SI D each item is sold by at most one department per store • SD M each department has only one manager • the key is SI: it is determined by computing (SI)+ • the dependency SD M violates the 3NF in that SD is not a superkey for R and M is not prime**Decomposition of relational schemas**• The decomposition of a relational schema R={A1, A2,...,An} is the substitution of R with a collection C={R1, R2,...,Rk} of subsets of the set of attributes of R such that R = R1 U R2 U... U Rk • The various schemas Ri, i=1,...,k, are not typically disjoint • If we have a schema which is not normalized, it is possible to decompose it into normalized subschemas**Decomposition of relational schemas - examples**• A decomposition (in BCNF) of the example schema is employees (name,salary) projects (project,budget) worksin (name,project,function) • a decomposition (in BCNF) of supplier (sname, saddress, item, price) with functional dependencies sname saddress and sname item price is SA (sname, saddress) and SIP(sname , item, price)**Decomposition of relational schemas**• It is important however that decompositions have certain properties • wrong decompositions may generate relations that, combined with the join, produce relations with additional data • We have thus information loss • Consider the following example: • Branches (name, project, branch) • with dependencies: • name branch and • project branch**Decomposition of relational schemas**The key of Branches is: name, project project branch name Red DB Milan Green WEB Turin Green SE Turin White CAD Milan Black SE Turin If we decompose according to the two dependencies:**Decomposition of relational schemas**employees projects name branch project branch Red Milan DB Milan Green Turin WEB Turin White Milan SE Turin Black Turin CAD Milan The join on the common attribute: SELECT E.name, P.project, P.branch FROM employees E, projects P WHERE E.branch = P. branch**Decomposition of relational schemas**project branch name Red DB Milan It creates tuples that did not exist! Red CAD Milan Green SE Turin Green WEB Turin White DB Milan White CAD Milan Black SE Turin Black WEB Turin