Normalization

Normalization

Normalization • Normalization is a procedure within relational database design to ensure that • Each unit of data (value that a property takes on) is accessible to querying • Relationships are properly modeled • Data redundancy (unnecessary repetition) is minimized.

Complementary Approaches • ER modeling is more of a top-down approach to database design. • One starts with an overview of the most generic categories of data. • Normalization is more of a bottom-up approach. • One starts with a collection of attributes and sees how they break into various tables.

Unique Data Specification • A piece of data should be identifiable by specifying first its Table, then its column and its row within that table No further specification should be required.

The data “Bart” is found in the Character Table in the FirstName Column in the 78th row (better yet in the row with CharacterID SC0003).

Rule 1: No multi-valued fields • A “cell” in the table should hold only one unit of data. • If a cell contained more than one unit of data, there would be no way to specify one of these units within the relational (table) approach. • (Multi-valued fields can also cause storage problems, which are hidden from us when we work at the 4GL level.)

No multi-valued fields!

Flattening the file (table) • The easy way to eliminate multi-valued fields is to introduce multiple records, one for each value of the multi-valued field • All of the other fields would be the same. • This is called flattening.

1st First Normal Form • If all of the tables in a database are set up such that each entry (cell) of each table contains at most one piece of data (i.e. is single-valued), then the database is said to be in the First Normal Form(1NF). • There is a series of criteria to apply, each more stringent than the next. There will be a second normal form, a third normal form and so on.

What has been gained? • The advantage of flattening is that each piece of data is now accessible without further parsing. • “Barney Gumble” is no longer part of an entry but is its own entry.

What has been lost? • The obvious price of flattening a file is redundancy (repetition of data). • In addition to the storage issues that repeated data causes (which are hidden from us), there are the so-called data anomalies. • These make maintaining the data’s integrity (correctness) more difficult.

What’s the key? • Recall the key is used to identify (uniquely) a record and is important to querying. • Note that personID was the key to the table that had multi-valued fields. • But in this new table personID alone cannot be a key.

Key consideration • The character field can be the key if we do not include the producers (who do not voice characters). • Or we could have a composite key. • One could introduce a noCharacter character and a noActor person to solve the NULL problem. • Note that no part of a composite key should be NULL.

Insertion Anomaly • Ignoring the issue of Maggie (who no one voices) and the producers (who do not voice characters) there’s another problem. • Since the character and real person information are both needed: • We cannot add a character without knowing the real person who provides the voice. • We cannot add a person without knowing at least one character he/she voices. • This type of problem is known as an insertion anomaly.

Deletion Anomaly • The way the data is arranged now, if Dan Castellenata is fired and eliminated from the database, so too are all of the characters he voices eliminated. • This feature is sometimes desired and can be built in. • When this feature is not desired, but occurs because of the database design, it is known as a deletion anomaly.

Update Anomaly D’oh • Isn’t his name Castellaneta, not Castellenata? • Because of all of the repetition caused by the flattening, the name must be changed in many places. • One either does a search to find all occurrences. • Or one runs the risk of having inconsistent data and consequently queries that don’t turn up complete and accurate information. • Inaccurate data caused by the inconsistent updating of redundant information is known as an update anomaly.

More Tables, Less Redundancy • One can eliminate the data redundancy by introducing more tables. • This will simplify updating but can complicate querying. • The repeated data will be moved to its own table and in its place will be inserted a foreign key to identify the appropriate row of the new table.

Splitting • The table • PeopleVoicingCharacters(personID, characterID, personFName, personLName, regular,characterFName, characterLName, description) becomes two tables • RealPerson(personID, fName,lName, regular) • Character(characterID, fName, lName, personID, description)

Repetition cannot be completely eliminated • When PeopleVoicingCharacters became RealPerson and Character, the personID field appeared in both tables. This repetition is required to maintain the relationship between a character and a real person. • The decomposition minimized (but did not eliminate) the repetition. • Another feature of the split is that we can add unvoiced characters to the Character table and people who are not actors to the RealPerson table.

Important Properties • Two important properties of decomposition: • One should be able to put the original table back together by using a join, thus no information is lost by the split. This is called the Lossless-join property. • Any constraint on the data imposed on the original table should be imposed on the resulting tables. This is called theDependency preservation property.

How to do the splits? • We would like to have a systematic approach for splitting tables in a way that minimizing redundancy. • This is where the notion of functional dependence comes in. • How the fields depend on each other determines where they end up when the table splits.

Functional dependence • If for all of the records in a table, one can determine what value field B takes on simply by knowing what value field A has, then field B is said to be functionally dependent on field A. • Either of the fields above may be composite, that is, comprised of other fields (e.g. name is a composite field comprised of fName and lName fields).

Derived attributes are functionally dependent • We called an attribute “derived” if it was determined by another attribute. • E.g. age or ageCategory can be derived from dateOfBirth • Thus age is functionally dependent on dateOfBirth. • Functional dependence is not necessarily a two-way street • You can determine ageCategory from dateOfBirth but you cannot determine dateOfBirth from ageCategory.

Other functional dependencies • A dateOfBirth is functionally dependent on a socSecNum. • There’s no mathematical formula relating the two. • But the socSecNum identifies a person, and that person as a unique dateOfBirth. • Knowing the socSecNum, one can determine the dateOfBirth.

All fields depend on the primary key. • Recall that a table’s primary key is a field or fields that can be used to identify uniquely each record belonging to that table. • If a field can be uniquely identified by the primary key then it is functionally dependent on the primary key.

Composite primary keys • When the primary key is composite, the question arises: • Is a field uniquely determined by (functionally dependent on) part of the primary key or the whole primary key? • Data that is dependent on part of a primary key can be split off into another table. • Known as partial dependencies.

Splitting off fields that depend on part of a key. • Imagine a MovieDirector table MovieDirector(movieID, directorID, movieTitle, movieType, movieDescription, directorFirstName, directorLastName, directorDateOfBirth) • It would have a composite primary key comprised of movieID and directorID.

Depending on parts • The movieTitle, movieType, movieDescription fields depend only on the movieID portion of the primary key. • The directorFirstName, directorLastName and directorDateOfBirth fields depend only on the directorID portion of the primary key.

MovieDirector: Three-Way Split • We split off any fields that depend only on movieID. • Partial dependence (dependent on a part) • We split off any fields that depend only on directorID. • Partial dependence • We are left with any fields that depend on both. • Full dependence

MovieDirector: Three-Way Split • One gets: • Movie(movieID, movieTitle, movieType, movieDescription) • Director(directorID, directorFirstName, directorLastName, directorDateOfBirth) • MovieDirector2(movieID, directorID) • (If you assume that a movie has one director, then the directorID would be functionally dependent on the movieID, and you would only need two tables above.)

Two or Three-Way Split • In the Simpsons example, the PeopleVoicingCharacters became two tables, while in film example, the MovieDirector table became three tables. • The difference is that personID was functionally dependent on the characterID, whereas there was no dependency between movieId and directorID. • This agrees with the notion that many-to-many relationships between entities require their own tables.

2nd Second Normal Form • If a table has no multi-valued fields AND does not have any non-key fields which are functionally dependent on only part of the primary key, then such a table is said to be in the second normal form (2NF). • The higher normal forms will be presented next time.

References • Database Systems, Rob and Coronel • Database Systems, Connolly and Begg • Fundamentals of Relational Databases, Mata-Toledo and Cushman • Concepts of Database Management, Pratt and Adamski

Normalization

Normalization

Presentation Transcript

Normalization

Normalization

Normalization

Normalization

Normalization

Normalization

Normalization

Normalization

Normalization

Normalization

Normalization

Normalization

Normalization

Normalization

Normalization

NORMALIZATION

Normalization

Normalization

Normalization

Normalization