Normalization Good database design ensures: that the data that we want to store is consistent. and can be retrieved or updated in the most efficient manner. Database design deals with fitting of a given piece of data into an appropriate logical organization. Database design considers the following factors: How many tables should we have? How should they be interrelated? How many columns should each table have? Which columns should be key columns? What sort of constraints should we have?
Functional Dependency: Functional dependency is very important in formulating database design. For two columns A and B of a table, we can define the functional dependency between them as: Column B is said to be functionally dependent on column A if, given A we can precisely determine B.
E.g. Consider the following Employee table: Is there any functional dependence between any two columns of the table? Consider a combination of Emp_id & Emp_name. Are any of the following statements true? Given an Emp_id we can certainly determine a precise Emp_name. Given an Emp_name, we can certainly determine a precise Emp_id.
The first statement is true, but second one is not. Emp_id is the P.K. of the table and therefore is always unique. Emp_name can repeat and therefore can be duplicated in the table. Therefore we can determine Employee name given employee id. Thus Emp_name is functionally dependent on Emp_id. Symbolically: Emp_id Emp_name This should be read as Emp_id functionally determines Emp_name.
In general if column A determines the value of column B, then we write it as: A B This should be read as: A functionally determines B There can be several functional dependencies between various columns of a table. E.g. Emp_id Age
Expanded Employee Table: List of functional dependencies: Emp_id Emp_name Emp_id Age Emp_id Salary Emp_id Project_id Emp_id Completion_date
List that cannot be termed as functional dependencies: Age NOT Emp_id (because more than one employee can have same age). Salary NOT Emp_id (because more than one employee can have the same salary.
Decomposition: Decomposition refers to the breaking down of one table into multiple tables. It is not much different from RDBMS projection operation. During projection, we choose only the needed columns of the table, discarding the rest. In decomposition we do the same. Discarding means not deletingforever but it means they are placed in another table where they should logically belong. Decomposition is done to reduce redundancy. Redundancy means unnecessary and uncontrolled duplication of data. Redundancy leads to loss of data integrity and data consistency.
Redundancy exists in the table. Student names appear three times in the table. Similarly with the subject. To minimize the redundancy following rule is to be followed: Keep one fact in one place. Decomposing the Examination table breaks it into two tables:
Student Rno Name Rno Name Sub_id Sub_name Mks Sub_id Sub_name Mks Subject
After decomposition: Student’s information in Student Table. Subject’s information in Subject table. Rnoprimary key in Student table. Sub_idprimary key in Subject table. For a given student there will be only one entry in student table and therefore student information cannot repeat. For a given subject there will also be exactly one entry in subject table and therefore subject information cannot repeat. Hence the problem of redundancy is solved.
Although the problem of redundancy is solved another undesirable problem has been introduced. Where do we now have the information as to which student has obtained how many marks in which subject? We have lost it. LossofinformationduetodecompositioniscalledLossyDecomposition. Lossy decomposition is not acceptable in any case and cannot be used to reduce redundancy.
How to figure out lossy Decomposition? Perform reversibility of decomposition i.e. recomposition. Check whether we can reassemble the data in the original form. If we are not able to take back the data in original form, it means we have lost some data from the original database design. On the other hand, if we are able to successfully reassemble the split tables so that we can recreate the original data, then the decomposition is desirable and successful. When all information found in the original database is preserved after decomposition, it is called lossless decomposition or non - lossy decomposition.
Student Rno Name Sub_id Sub_name Mks Rno Name Rno Sub_id Mks Result Modification in the previous table designs: Sub_id Sub_name Subject
Three tables are created as a result of the decomposition. The student and the subject tables continue to exist as before, but the structure of the student table is changed. The marks column no longer exists. The newly created table is the Result table. This table shows the marks obtained by a student in a particular subject. If we observe carefully, there are referential integrity relationships as follows: Between Student and result tables based on the Rno. Between the Subject and the result tables, based on the Sub_id.
Student Table Rno Result Table Rno Sub_id Subject Table Sub_id
If we join these three tables to perform a recomposition, we would get the following columns (eliminate duplicates): Rno Name Sub_id Sub_name Mks This is precisely what was contained in the original Examination table. This is an example of losslessdecomposition.
In lossydecomposition, we lose some of the functional dependency relationships. In losslessdecomposition, all functional dependency relationships are preserved.
Normalization: Normalization is the process of successive reduction of a given set of relations to a better form. It is the process of efficiently organizing data in a database. In the field of relational database design, normalization is a systematic way of ensuring that: a database structure is suitable for general-purpose querying, and free of certain undesirable characteristics as insertion, update, and deletion anomalies—that could lead to a loss of data integrity. The end result of normalization is that redundant data is eliminated and only data related to the attribute is stored within a table.
E.F. Codd, the inventor of the relational model, introduced the concept of normalization. Goals of the normalization process: 1. Eliminating redundant data (for example, storing the same data in more than one table) 2.Ensuring data dependencies(only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.
Advantages of Normalization: 1. Data integrity: • The primary thing that normalization gives you is data integrity. • With a fully normalized database design, you're assured that your transactions will completed or roll backed. • It also assures that inserts, updates, and deletes i.e., any type of modification to the data can't compromise your data.
2. Data Retrieval: • A fully normalized database will have many tables, each with a handful of columns. • to retrieve any single piece of data, then, requires joining together multiple tables. • With a fully normalized database, you can retrieve any piece of data with a single query. • And generally, the queries aren't that complex.
3. Performance: • A fully normalized database exhibits greater performance because: • Queries are shorter and simpler • Tables have fewer records since duplicate information has been eliminated through the normalization process, and • Indexes are more efficient because each table only contains a limited amount of non-redundant information.
Disadvantages: 1. Normalization Procedures: • The proper normalization of a database is a challenging process. • An improperly normalized database can cause significant problems. • Furthermore, planning is even more important because databases don't lend themselves to rapid development.
2. Database restructuring: • Proper database design is also important because it can be very difficult to modify the structure of a database after. • Modifying a properly normalized database is easier than modifying an improperly normalized database . • Database restructuring is also problematic since the database generally serves as the foundation of an larger application. • Changes to the database risk breaking the application.
3. Data insertion and Maintenance: • A fully normalized database makes it easy to retrieve data, it can be a serious pain to insert new records and update existing records. • This is because inter-table dependencies (i.e., they have referential integrity on themselves). • These tables are locked down so tightly that you can only insert, update, or delete data by acting on all related records. • It is impossible to update individual records.
Normal Forms: • The database community has developed a series of guidelines for ensuring that databases are normalized. • These are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF). • Normal forms are guidelines and guidelines only. • Sometimes it becomes necessary to stray from them to meet practical business requirements.
These would repeat for as many items as we have for a given order E.g.: Order Table: Order_number Order_date Customer_number Item_number Item_name Quantity Unit_price Bill_amount
First Normal Form: A table is in the first normal form(1NF) if it does not contain any repeating columns or repeating groups of columns. Order table does not conform to the first normal form as it can contain duplicate items. To bring it into 1NF it is required to perform lossless decomposition. This is done by following a rule: Move all the repeating columns to another table.
The repeating columns in the Order table are: Item_number Item_name Quantity Unit_price Bill_amount These columns can repeat many times in the same order. These columns are to be moved to another table. Thus we have two table:
Table 1: Order Order_number Order_date Customer_number Table 2 : Order_item Item_number Item_name Quantity Unit_price Bill_amount
For joining them there is a need for some commonality between these two tables. There is no such commonality. This is an example of lossy decomposition. One way to achieve commonality is to add column Order_number to Order_item table. It will link item sold to a particular order. This will create referential integrity relationship based on this column, to be established between the two tables. Order_number will be Primary Key in the Order table and Foreign Key in the Order_item table.
Table 1: Order Order_number Order_date Customer_number Table 2 : Order_item Order_number Item_number Item_name Quantity Unit_price Bill_amount
This results into lossless decomposition. What should be the primary key for Order_item table. It cannot be Order_number as there can be multiple rows for the same Order. Therefore here the primary key should be combination of the Order_number and the Item_number. Thus the primary key is a composite or concatenated primary key.
Second Normal Form: A table is in the second normal form (2NF) if it is in the first normal form and if all non key columns in the table depend on the entire primary key. Let us examine the above tables for 2NF: i) it is in the first normal form: First condition is already satisfied. ii) if all non key columns in the table depend on the entire primary key:
Order Table: Order_number (PK) Order_date Customer_number From Order_number we can derive the other non-key columns i.e. Order_date and Customer_number. These non key columns do not depend on each other at all. Thus this table is in 2NF.
Primary Key (Composite Key Order_item Table: Order_number Item_number Item_name Quantity Unit_price Bill_amount Can we determine the values of non key columns using Composite primary key? Not Quite. Item_name is base on Item_number alone, Order_number is not required. Unit_price column is determined with Item_number only. Thus all non key columns in the table do not depend on the entire primary key, but some of them depend on the part of the primary key. 2NF principle is violated.
TO bring it into 2NF: Move columns that do not depend on the entire primary key to another table. The columns that do not depend on the entire primary key are Item_name and Unit_price. They depend only on the Item_number. Move these two columns to a new table called Item. We also need Item_number (PK) in this table. The table would now look as:
Table 1: Order Order_number (PK) Order_date Customer_number Table 2 : Order_item Order_number (FK) Item_number (FK) Quantity Bill_amount Table 3 : Item Item_number (PK) Item_name Unit_price
Third Normal Form (3NF): A table is in the third normal form if it is in the second normal form and if all non key columns in the table depend non – transitively on the entire primary key. In other words, 3NF requires a table: to be in the second normal form and if all non-key column in the table must be independent of all other non-key columns.
Transitive Dependency: A B C B Functionally depends on A C Functionally depends on B C Transitively depends on A 3NF states to identify all transitive dependencies and get rid of them.
Possible transitive dependencies in current table structure: Order Table: Order_number (PK) Order_date Customer_number No such dependency as Order_date and Customer_number are functionally dependent on Order_number.
Order_item Table: Order_number (FK) Item_number (FK) Quantity Bill_amount Item_number and Quantity are functionally dependent on Order_number. Bill_amount does not depend directly on the PK of the Order_item table. It is calculated from: Quantity (Order_item) * Unit_price ( Item) This is the case of transitive dependency and have to be get rid of it.
Item Table: Item_number (PK) Item_name Unit_price Here the non key columns Item_name and Unit_price functionally depend on the primary key of the table, that is Item_number. Hence no transitive dependency in this table.
In summary we need to get rid of the Bill_amount column from the Order_item table in order to bring the table into the 3NF. DO we loose any information by doing so? No really, as value for Bill_amount can be calculated from quantity and Unit_price columns.
Table 1: Order Order_number (PK) Order_date Customer_number Table 2 : Order_item Order_number (FK) Item_number (FK) Quantity Table 3 : Item Item_number (PK) Item_name Unit_price
Boyce – Codd Normal Form (BCNF): The first three normal forms provide means to ensure that a table is reasonably normalized. The next normal form is Boyce-Codd normal form (BCNF), named after the creators of this normal form. Determinant: Functional dependency states that if B is functionally dependent on A, that is, A functionally determines B, then A B.
Based on this definition, let us define BCNF: A table is in BCNF if the only determinants in the table are the candidate keys. In other words: A table is in BCNF if every column, on which some other column is fully functionally dependent, is also a candidate for the primary key of the table.
Example: The relation student (sno, sname, cno, cname) has all attributes participating in candidate keys since all the attributes are assumed to be unique. We therefore had the following candidate keys: (sno, cno)(sno, cname)(sname, cno)(sname, cname) Since the relation has no non-key attributes, the relation is in 2NF and also in 3NF,