CM20145 Further DB Design – Normalization

Dr Alwyn Barry Dr Joanna Bryson CM20145Further DB Design –Normalization

Last Time • Database design is an ongoing, iterative process. • Requirements come from data, user demands, design issues. • Change occurs: • Corporations & technologies grow. • Programmers & users learn. • Views / security. • Lossless-join decomposition Now: Science for improving design.

Design Process & Normalization • We assume a schema R is given. • R could have been generated when converting E-R diagram to a set of tables. • R could have been a single relation containing all attributes that are of interest (called universal relation). • Normalization breaks R into smaller relations. R could be the result of any ad hoc design of relations, which we then test & convert to normal form.

Overview • First Normal Form. • Functional Dependencies. • Second Normal Form. • Third Normal Form. • Boyce-Codd Normal Form. • Fourth Normal Form. • Fifth Normal Form. • Domain Key / Normal Form. • Design Process & Problems.

First Normal Form – 1NF • You aren’t supposed to have more than one value per attribute of a tuple. • All tuples have the same number of attributes. • Necessary for a relational database. BAD

Getting Caught Out With 1NF • A domain is atomic if its elements are considered to be indivisible units. • Examples of non-atomic domains: • Set-valued attributes, composite attributes. • Identifiers like CS101 that can be broken up into parts. • A relational schema R is in first normal form if the domains of all attributes of R are atomic. • Non-atomic values: • complicate storage, • encourage redundancy, • Depend on interpretation built into application programs.

Are You Atomic? • Atomicity is not an intrinsic property of the elements of the domain. • Atomicity is a property of how the elements of the domain are used. • E.g. strings containing a possible delimiter (here: a space) • cities = “Melbourne Sydney” (non-atomic: space separated list) • surname = “Fortescue Smythe” (atomic: compound surname) • E.g. strings encoding two separate fields • bucs_login = cssjjb • If the first two characters are extracted to find the department, the domain bucs_login is not atomic. • Leads to encoding of information in application program rather than in the database.

Second Normal Form (2NF) • Violated when a nonkey column is a fact about part of the primary key. • A column is not fully functionally dependent on the full primary key. • CUSTOMER-CREDIT in this case: From Watson BAD FIX ITEM *itemid … ORDER quantity … CUSTOMER *customerid customer-credit …

Def: Functional Dependency • Let R be a relation schema   R and   R • The functional dependency (FD)  holds on R (“ is FD on ”) iff for any legal relations r(R): • whenever any two tuples t1and t2 of r agree on the attributes  • they also agree on the attributes . • i.e. (t1) = (t2)   (t1) =  (t2) • Example: Consider r(A,B) with the following instance of r: • AB does NOT hold, but BA does hold

Functional Dependencies: Uses • Way to encode “business rules”. • Specify constraints on the set of legal relations. • We say that Fholds onR if all legal relations on R satisfy the set of FDs F. • Test relations to see if they are legal under a given set of FDs. • If a relation r is legal under a set F of FDs, we say that rsatisfies F.

Functional Dependencies • An FD is an assertion about a schema, not an instance. • If we only consider an instance or a few instances, we can’t tell if an FD holds. • Inspecting only a few bird species (e.g. crows, cardinals and canaries) we might conclude colour  species. • However, this would be a bad FD as we would find out if we found some ravens. • Thus, identifying FDs is part of the data modelling process.

Trivial Functional Dependencies • An FD is trivial if it is satisfied by all instances of a relation • E.g. • customer-name, loan-number customer-name • customer-name customer-name • In general,   is trivial if    • Permitting such FDs makes certain definitions and algorithms easier to state.

Functional Dependency vs Key • FDs can express the same constraints we could express using keys: • Superkeys: • K is a superkey for relation schema R if and only if K R • Candidate keys: • K is a candidate key for R if and only if • K R, and • there is no K’  K such that K’ R • Of course, which candidate key becomes the primary key is arbitrary.

FDs <> Keys • FDs can represent more information than keys can on their own. • Consider the following Loan-info-schema: Loan-info-schema = (customer-name, loan-number,branch-name, amount). We expect these FDs to hold: loan-numberamount loan-number  branch-name We could try to express this by making loan-number the key,however the following FD does not hold: loan-number customer-name • Incidentally, this isn’t a very good table! (¬2NF)

FD Closure • Given a set F of FDs, other FDs are logically implied. • E.g. If A  B and B  C, we can infer that A  C • The set of all FDs implied by F is the closure of F, written F+ . • Find F+by applying Armstrong’s Axioms: • if   , then   (reflexivity) • if  , then    (augmentation) • if  , and   , then   (transitivity) • Additional rules (derivable from Armstrong’s Axioms): • If   and  holds, then    holds (union) • If    holds, then   holds and  holds (decomposition) • If   holds and   holds, then   holds (pseudotransitivity)

Bad Decomposition Example(From Last Time) • A Non Lossless-Join Decomposition R = (A, B) R1 = (A), R2 = (B) A B A B A B     1 2 1 2    1 2 1   1 2 B(r) A(r) r A (r) ⋈ B (r) • Thus, r is different to A (r) ⋈ B (r) • So A,B is not a lossless-join decomposition of R.

FDs & Lossless Decomposition • All attributes of an original schema (R) must appear in the decomposition (R1, R2): R = R1  R2 • Lossless-join decomposition.For all possible relations r on schema R r = R1 (r) ⋈ R2 (r) • A decomposition of R into R1 and R2 is lossless-join if and only if at least one of the following dependencies is in F+: • R1 R2R1 • R1 R2R2

Second Normal Form (2NF) • Violated when a nonkey column is a fact about part of the primary key. • A column is not fully functionally dependent on the full primary key. • CUSTOMER-CREDIT in this case: From Watson BAD FIX ITEM *itemid … ORDER quantity … CUSTOMER *customerid customer-credit …

Third Normal Form (3NF) • Violated when a nonkey column is a fact about another nonkey column. • A column is not fully functionally dependent on the primary key. • R is 3NF iff R is 2NF and has no transitive dependencies. • EXCHANGE RATE violates this. FIX BAD

Boyce-Codd (BCNF) • Arises when a table: • has multiple candidate keys, • the candidate keys are composite, • the candidate keys overlap. • R is BCNF iff every determinant is a cand. key. • E.g. Assume one consultant per problem per client, and one problem per consultant. • If client-problem is the primary key, how do you add a new consultant? • Like 3NF but now worry about all fields. BAD FIX

Design Goals & their discontents • Goals for a relational database design: • eliminate redundancies by decomposing relations, • must be able to recover original data using lossless joins, • prefer not to loose dependencies. • BCNF: • no redundancies, • no guarantee of dependency preservation. • 3NF: • dependency preservation, • but possible redundancies.

Fourth normal form (4NF) • A row should not contain two or more independent multivalued facts. • 4NF iff BCNF & no non-trivial multi-valued dependencies. • Multivalued dependency means the value of one attributed determines a set of values for another. BAD FIX

Fifth normal form (5NF) • 5NF iff a relation has no join dependency. • The schemas R1, R2,.., Rn have a join dependency over R if they define a lossless-join decomposition over R. • This is way too complicated, don’t worry about it.

Domain Key Normal Form • Every constraint on the relation must be a logical consequence of the domain constraints and the key constraints that apply to the relation. • Key: unique identifier. • Constraint: rule governing attribute values. • Domain: set of values of the same data type. • No known algorithm gives DK/NF.

E-R Model and Normalization • When an E-R diagram is carefully designed, identifying all entities correctly, the tables generated should not need further normalization. • However, in a real (imperfect) design there can be FDs from non-key attributes of an entity to other attributes of the entity. • The keys identified in E-R diagrams might not be minimal - FDs can help us to identify minimal keys. • FDs from non-key attributes of a relationship set are possible, but rare.

Denormalization & Performance • May want to use non-normalized schema for performance. • E.g. displaying customer-name along with account-number and balance requires join of account with depositor. • Alternative 1: Use denormalized relation containing attributes of account as well as depositor. • faster lookup. • extra space and extra execution time for updates. • extra coding work for programmer and possibility of error in extra code. • Alternative 2: use a materialized view defined as account ⋈ depositor • as above, except less extra coding, errors.

Limits of Normalization • Examples of bad database design, not caught by normalization. • Good: • earnings(company-id, year, amount) • Bad: • earnings-2000, earnings-2001, earnings-2002, etc., on (company-id, earnings) • all are BCNF, but querying across years difficult • needs a new table each year • company-year(company-id, earnings-2000,earnings-2001, earnings-2002) • in BCNF, but querying across years difficult • requires new attribute each year

Summary 1 – Rules to Watch • 1NF: attributes not atomic. • 2NF: non-key attribute FD on part of key. • 3NF: one non-key attribute FD on another. • Boyce-Codd NF: overlapping but otherwise independent candidate keys. • 4NF: multiple, independent multi-valued attributes. • 5NF: join dependency. • Domain Key / NF: all constraints either domain or key

Summary 2 – Concepts • Functional Dependencies: • Axioms & Closure. • Lossless-join decomposition. • Design Process. • Normalization Problems. Next: Interfaces and Architectures

Reading & Exercises • Reading • Connolly & Begg Chapter (13, 14) • Silberschatz Chapters 7. • Any other book, the design/normalization chapter. • Exercises: • Silberschatz • 7.1, 7.2, 7.16, 7.23, 7.24, 7.27-29

Next Week • Architectures and Implementations • Integrity and Security

Slides after and including this one you are not responsible for, but I am saving in case I decide to use them in the future.

Goal: Formalize “Good Design” • Process: • Decide whether a particular relation R is in “good” form. • In the case that a relation R is not in “good” form, decompose it into a set of relations {R1, R2, ..., Rn} such that: • each relation is in good form, • the decomposition is a lossless-join decomposition. • Theory: • Constraints on the set of legal relations. • Require that the value for a certain set of attributes determines uniquely the value for another set of attributes – functional dependencies.

CM20145 Further DB Design – Normalization

CM20145 Further DB Design – Normalization

Presentation Transcript

Educating Designers and Design Educators about Universal Design

Database Lecture Notes Normalization 1 – Functional Dependencies

The PCB Design Process

Component-Level Design

SQL Unit 15 Normalization

Digital IC Design Flow: A quick look

Design Considerations (Web)

Design Patterns

Introduction to Design Patterns

Design Pattern ： An Introduction

Perpetual Pavement Design

PART 2

Java Design Patterns

Schema Refinement and Normal Forms

Functional Dependency Armstrong’s Axioms Closure of FDs Closure of attribute(X) – Find Keys

DATABASE DESIGN

Agenda for Design Activity

OpAmp (OTA) Design

MYP Design Cycle