slide1 n.
Skip this Video
Loading SlideShow in 5 Seconds..
CMSC 424 Database Design Section 401 Dr. David Kuijt PowerPoint Presentation
Download Presentation
CMSC 424 Database Design Section 401 Dr. David Kuijt

Loading in 2 Seconds...

play fullscreen
1 / 211

CMSC 424 Database Design Section 401 Dr. David Kuijt - PowerPoint PPT Presentation

  • Uploaded on

CMSC 424 Database Design Section 401 Dr. David Kuijt. Contact Info. Professor: David Kuijt Office: AVW 3205 Phone: 5-0534 Email: Office Hours: T/Th 3:15-4:00 (or by appointment) TA: Debbie Heisler Office: AVW 3270 Phone: 405-7027 Email:

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'CMSC 424 Database Design Section 401 Dr. David Kuijt' - johana

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
CMSC 424

Database Design

Section 401

Dr. David Kuijt

contact info
Contact Info

Professor: David Kuijt

  • Office: AVW 3205
  • Phone: 5-0534
  • Email:
  • Office Hours: T/Th 3:15-4:00 (or by appointment)

TA: Debbie Heisler

  • Office: AVW 3270
  • Phone: 405-7027
  • Email:
  • Office Hours: TBA
basic information
Basic Information
  • Required text:

Korth & Silberschatz Database System Concepts, Fourth Edition, McGraw Hill 2001.

  • Warnings:
    • Late homework or projects are not acceptable. Hand in what you have finished. Exceptions will be made only for emergencies or medical reasons with a doctor's note.
    • No makeup exams. Exceptions as above.
    • Cheating will result in an immediate grade of XF ("failure through academic dishonesty" -- this goes on your permanent transcript), and may result in suspension or expulsion from University. This is your only warning. Don't do it.

We live in a database world. The simplest acts are tied to databases. The last time I called out for delivery pizza, it involved at least four enormous databases. What were they?

  • Pizza Hut knew what I had ordered before -- they asked if I wanted the same pizza as last time. They probably store lots more information than that -- perhaps all my old order information. They could use this information to make corporate decisions (quantities of materials to order; forecasting pizza trends) as well as a local aid.
motivation 2
Motivation (2)
  • Every delivery food place around here uses caller ID. That's a relatively simple database, just giving names and incoming telephone number, but it helps them avoid some types of fraudulent orders and errors when writing down names, addresses, and stuff.
  • I used a telephone. When you use a telephone, all the details about the call are stored in a database. Call length, what number you called, time of call, billing information, and so on. Cell phone databases are even more complex.
motivation 3
Motivation (3)
  • I paid by credit card. Huge databases are involved.
  • Every time somebody runs your credit card through a swipe reader or types in the number, they're checking information in a database.
    • Is this a valid credit account?
    • Does it have enough money to cover the bill?
    • Is the credit card stolen?
    • Debit the merchant account, credit the consumer account
  • Whether the transaction is accepted or rejected, all the details are recorded in a database somewhere.
more motivation
More Motivation
  • Every time you go to an ATM, use a credit card, buy something with a UPC bar code at a supermarket or department store, go to a movie, concert, or Caps game, register for classes, or get a parking ticket in Lot 4, you are working with databases.
  • Everything in your wallet that isn't a photograph is an entry in a database somewhere
    • medical plan cards
    • credit cards
    • student ID
    • driver's license
    • membership cards in clubs or interest groups
    • everything. Even the currency!
final motivation
Final Motivation
  • Databases are all around us.
    • Knowledge is power.
    • Databases give us power (the ability to do things we couldn’t otherwise do)
    • they give other people power over us, and knowledge about what we are doing.
    • This class is about Databases.

So what is a Database?

what is a database
What is a Database?

At the simplest level, a database architecture has two components.

(1) Data.

  • Usually a whole lot of it
  • Representing multiple types of different objects
  • Each type may be related to itself and to other types in multiple ways

(2) A set of methods to access and manipulate the data.

  • For any reasonable-size database the data may be quite complex.
  • It is an attempt to record or model all the aspects of the real world that are important to one specific purpose -- telephone calls, for example, or credit card accounts.
  • Lots of different objects need to be stored as data, and they need to be stored in such a way as to reflect the ways that the objects can interact with each other.
database example
Database Example
  • For example, consider a local telephone system database. Types of data stored include:
    • Account information:
      • customers: individuals, groups, companies that have leased numbers;
      • billing addresses, payment history, calling plans and billing contracts;
    • Hardware information:
      • network structure (call routing),
      • hardware age, reliability, and maintenance information,
      • system load tracking,
      • network billing pattern (what numbers are long distance from what other numbers, and what ones are local)
database example 2
Database Example (2)
  • Local telephone system database continued. Additional types of data would include:
    • Call information:
      • start and end time,
      • telephone number that initiated the call;
      • telephone number(s) that received the call
  • All that information could be stored in files with much less fuss and bother. Why use a database? Why not just store the information in flat files?
why not flat files
Why Not Flat Files?
  • Why use a database? Why not just use a flat file?
  • Databases have a number of advantages over flat files.
    • Data Access. The set of programs that provide access to a database allow much more complex and flexible queries to the database with greater efficiency and convenience.
    • Reduced duplication and better control over data consistency. Data redundancy is bad. Which item to change in an update? How do you know that you've found all the copies? Data inconsistency (disagreement between various copies of the same data) is a serious problem.
    • Integrity Constraints can be enforced inside a database -- telephone numbers all 10 digits; phone numbers in Maryland all have the first three numbers being 410 or 301.
why not flat files 2
Why Not Flat Files? (2)
  • Uniform access and control of data using a standard language
  • Data Independence. We want the data to be independent of the representation chosen for it within the system. Tying the data to a given representation is what caused the Y2K fuss -- only two digits were used for a "year" field.
  • Concurrency control. Multiple users on a single database is a big advantage.
  • Recovery.
  • Security. Different users of the database may need different levels of access to information.
  • Centralized Control
  • Platform independence (portability). Since the internal file structure and access program details are hidden from the user, it is much easier to use the database on multiple platforms.
data abstraction
Data Abstraction
  • Most users don't need to understand all the details of the implementation and data design of a complex database. To make a database convenient to use, the system provides users with an abstract view of the data, limiting the information available to them. There are usually three levels of data abstraction.
    • Physical Level
    • Conceptual Level
    • View Level
data abstraction 2
Data Abstraction (2)
  • Physical level. The actual implementation details of low-level data structures are described at this level.
  • Conceptual level. This level describes all the different data types that exist by defining a relatively small number of simple structures, including all the relationships that these data types have with each other. Implementation of these objects might be complex, but it is hidden from the user at this level. Database administrators are usually the only ones who have access at this level.
data abstraction 3
Data Abstraction (3)
  • View level. There may be multiple different views, each of which represents a simpler subset of the functions and data available at the conceptual level. Different user types may require different parts of the database (for example, a bank account database might be accessed by cashiers, account holders, credit card companies, and the bank's payroll manager. Each of them can only access a small part of the full database of bank account information). Creating a number of restricted views makes the database more useful for the individual user types, giving each type access according to the needs of that type.
data abstraction 4
Data Abstraction (4)
  • Definition: a Schema is a specification of a particular database using a particular data model.

The three levels of data abstraction are often referred to as:

  • External Schema(s) (for the view level(s)).
  • Conceptual Schema (for the conceptual level)
  • Internal Schema (for the physical level)
database as model
Database as Model
  • A model represents a perception of a real system
  • Models help us manage or understand the real world system they represent.
  • When modeling a system we select aspects and characteristics we want to represent; we abstract them to form a simple(r) system
    • examples: a map, an airplane flight simulator, computer weather analysis program
  • A database is a model of reality
data models underlying the database
Data Models underlying the Database
  • The data model is a collection of conceptual tools for describing data and its attributes
    • data objects
    • interrelationships of the data
    • data semantics and consistency constraints
  • There are two well-established data models used in database design
    • Entity-Relationship (E-R) model
    • Relational model
    • older methods included the Network and Hierarchical data models
      • Each was tied closely to the underlying implementation, which made it more difficult to model data and to modify or update the database. As a result they aren’t much used any more
entity relationship model
Entity-Relationship Model
  • Diagram based model
  • Two primitives
    • Entities -- each represents a unique real-world object
    • Relationships -- each represents an association among several entities
    • Each are associated in sets of the same type (for example, one entity set might be customer, representing the set of all entities that represent customers at a given bank)
  • Third important notion: Attributes
    • Entities are associated with a set of attributes
  • Entity: a distinguishable object we want to model
    • e.g., room CSI 3120, Celine Dion, Elizabeth I of England
  • Entities have attributes (single-valued properties)
    • e.g., a person has a name, SSN#, gender, …
    • if an attribute has more than a single value, we should model it as an Entity
  • Entity Set: a set of entities of the same type
  • Entity Sets may overlap
    • CSI 3120 is a member of CLASSROOMs and also a member of CSI BUILDING ROOMs.
  • Relationship is an association among entities
    • David Kuijt teaches-in CSI 3120
  • Relationship Set is a collection of relationships of the same type
    • FACULTY teach-in CLASSROOMs
  • Relationships may also have attributes
    • e.g., the relationship teach-in has an attribute “weekday” and another attribute “time” to store the day and time in which a given Entity of the set FACULTY teaches in a given Entity of the type CLASSROOM
example database design 1
Example Database Design (1)
  • Application: library database. Authors have written books about various subjects; different libraries in the system may carry these books.
  • Entities (with attributes in parentheses):
    • Authors (SS#, name, tel, birthdate)
    • Books (ISDN, title)
    • Subjects (sname)
    • Libraries (lname)
  • Relations [associating entities in square brackets]:
    • Wrote-on [Authors, Subjects]
    • Carry [Libraries, Subjects]
    • Index [Subjects, Books]
poor initial design
Poor Initial Design
  • Our first design is a poor model of the real-world system we are examining. Problems in our first design:
    • no relationship associating authors and books
    • no relationship associating libraries and books
    • common queries will be complex and difficult:
      • Q: what libraries carry books by a given author?
      • Q: what books has a given author written?
      • Q: who is the author of a given book?
      • Q: how many copies of a given book exist at each library?
      • Q: what edition of a book does the library have?
example database design 2
Example Database Design (2)
  • Application: library database as before
  • Entities (with attributes in parentheses):
    • Authors (SS#, name, tel, birthdate)
    • Books (ISDN, title)
    • Subjects (sname)
    • Libraries (lname)
  • Relations [associating entities in square brackets] (attributes in parentheses):
    • Wrote [Authors, Books]
    • In-stock [Libraries, Books] (quantity, edition)
    • Index [Subjects, Books]
  • Fundamental concept for databases
  • Must be able to uniquely identify things within a database (in the E-R model, Entities and Relationships)
    • Avoid duplication of results in a search; identify data redundancy in other operations
    • Halt search on positive results
    • Quick lookup in underlying data structures used at the Physical Level of abstraction
  • Examples of possible keys
    • Student ID number (SS#) is used as a key for most UMD databases having to do with students
entity keys
Entity Keys
  • Superkey: set of attributes whose values uniquely identify the entity
  • candidate key: a minimal superkey (a minimal subset of a superkey whose values still uniquely identify the entity)
  • primary key: if there is more than one possible candidate key, one is chosen as the primary one used for most entity-identification purposes
  • weak entity:has no primary key; instead it depends upon another strong entity’s primary key to exist
    • e.g., CHILDren of EMPLOYEEs are weak; the primary key of EMPLOYEE in addition to the attributes of the CHILD are used for identification
    • weak entities are “existent dependent” on a strong entity -- when the strong entity gets deleted, so does the weak one
relationship keys
Relationship Keys
  • Depend upon the entity mapping of the relationship
    • one-one: the primary key of any of the entities can be used to uniquely distinguish a given relationship between two unique entities.
    • one-many: the primary key of the “many” entity, plus possibly a subset of the attributes of the relationship, will uniquely identify a given relationship
      • e.g., MOTHER gave-birth-to CHILD; to identify a specific gave-birth-to relationship requires the primary key of MOTHER and possibly the (date) and (time) attributes of gave-birth-to
    • many-many: the union of the primary keys of the entities associated, plus possibly a subset of the attributes of the relationship, will uniquely identify a given relationship
      • e.g., PERSON married PERSON; SS# of both and possibly date
special cases
Special Cases
  • Relationships may associate different entities of the same type
  • Ternary versions of the above
  • M-N relationships: many-one mappings are often more useful in practice than many-many mappings.
    • DUMMY Entities can be used to convert an M-N mapping relationship to a pair of relationships, one M-1 and one N-1.
specialization generalization isa hierarchy
Specialization-Generalization(ISA Hierarchy)
  • This is a way to represent entity complexity
  • specialization: top-down refinement of entities with distinct attributes
    • Entity type BANK ACCOUNT might be subdivided into related but different types CHECKING ACCT and SAVINGS ACCT
  • generalization: bottom-up abstraction of common attributes
    • Course types DATABASE, SYSTEM, and NETWORK all have common attribute (project). From them we can abstract a new course type PRACTICAL COURSE
    • other common course attributes are included (e.g., course number)
isa hierarchy example top down refinement
ISA Hierarchy Example: Top-down Refinement
  • Account entity with attributes balance and number
  • additional complexity: we want to represent two subtypes of account
    • Savings Account with attribute Interest Rate
    • Checking Account with attribute Overdraft Limit
isa hierarchy example bottom up abstraction
ISA Hierarchy Example: Bottom-up Abstraction
  • Three related entities with similar attribute project
  • we abstract a new type of super entity Practical Course and link the three entities as subtypes
  • other shared attributes (e.g., course number) are also promoted to the upper level entity
aggregation part of hierarchy
Aggregation(Part-of Hierarchy)
  • This is a way to represent relationship complexity
    • relationships among relationships are not supported by the E-R model
    • often we want to model lower-level relationships differently
  • Groups of entities and relationships can be abstracted into higher level entities
part of hierarchy example
Part-of Hierarchy Example
  • Entities driver, car, tires, doors, engine, seats, piston, valves
  • Relationship drives is insufficient to model the complexity of this system
  • Part-of relationships allow abstraction into higher level entities (piston and valves as parts of engine; engine, tires, doors, seats aggregated into car)
mapping an e r schema to tables
Mapping an E-R Schema to Tables
  • Motivation - translating E-R database designs into Relational designs
    • Both models are abstract, logical representations of a real-world enterprise
    • Both models employ similar design principles
    • Converting an E-R diagram to tables is the way we translate an E-R schema to a Relational schema.
    • Later on we’ll examine how to convert a Relational schema to an E-R schema
mapping an e r schema to tables 2
Mapping an E-R Schema to Tables (2)
  • Strong Entity E with primary key PK and attributes A, B, … ==> E(PK, A, B, …)
  • Weak Entity F with (non-primary) key WK and attributes C, D, … depending upon E above for primary key ==> F(PK, WK, C, D, …)
  • Relationship R with attributes L, M, … and associating Entities E (with primary key PK), E2 (PK2), E3 (PK3), … ==> R(PK, PK2, PK3, …, L, M, …)
  • Relationships between weak entities and the strong one on which they are dependent usually do not require representation because it is usually a many-one relationship with no attributes on the relationship (they are on the weak entity) and so the resulting table R(PK, WK) is a subset of the weak entity itself.
table details
Table Details
  • The whole table represents a single Entity Set or Relationship Set.
  • Each entry (row) in the table corresponds to a single instance (member in that set)
  • For an Entity Set each column in the table represents an attribute in the E-R diagram
  • For a Relationship Set each column in the table represents either an attribute of the Relationship or one of the parts of the primary key of the Entity Sets it associates
mapping an e r schema to tables 3
Mapping an E-R Schema to Tables (3)
  • ISA relationships: choose either to
    • Represent the super class entity, then represent each subclass with the primary key of the super class and its own attribute set. This is very similar to the way weak entities are treated.
    • Or, map the subclasses to separate relations and ignore the whole super class. This is good when the subclasses partition the whole superclasses between them (the subclasses are disjoint and the union of the subclasses covers the whole super class).
  • Aggregate (part-of) relationship
    • Translation is straightforward -- just treat the aggregate as an entity and use the methods defined above.
  • With last week’s lecture, this covers the material of chapter 2.
relational database model
Relational Database Model
  • Most popular logical data model
  • Relations (also called tables) represent both Entity Sets and Relationship Sets.
    • Attributes form the columns of the table (column and attribute are synonymous)
    • Each row represents a single entity or relationship (called a row or tuple)
  • Each instance of an attribute takes values from a specific set called the domain of the column (the domain defines the type)
relational database model cont
Relational Database Model (cont)
  • A relation schema is made up of the name and attributes of a relation with their underlying domains
  • A database schema is a set of all relation schemas.
  • The notions of keys, primary keys, superkeys are all as previously described
query languages
Query Languages
  • a language in which a user requests information from the database
    • a higher level language than standard programming languages
  • query languages may be procedural or non-procedural
    • procedural languages specify a series of operations on the database to generate the desired result
    • non-procedural languages do not specify how the information is generated
    • most commercial relational database systems offer a query language that includes procedural and non-procedural elements
relational algebra
Relational Algebra
  • procedural query language
  • set of operators that map one or more relations into another relation
  • closed algebraic system
    • best feature - operations on operations
    • form relational algebraic expressions
  • two types of operations: set-theoretic and database specific
relational algebra operations
Relational Algebra Operations
  • database specific:
    • (horizontal) selection ()
    • (vertical) projection ()
    • join
    • outer join
    • semijoin
    • division
  • set operators
    • union
    • difference
    • intersection
    • cartesian (cross) product
example relations
EMP ename salary dept

Gary 30K toy

Shirley 35K candy

Christos 37K shoe

Robin 22K toy

Uma 30K shoe

Tim 12K (null)

DEPT dept floor mgr

candy 1 Irene

toy 2 Jim

men 2 John

shoe 1 George

Example Relations
database specific operators
Database Specific Operators
  • (horizontal) selection ()
    • picks a subset of the rows
  • (vertical) projection ()
    • picks a subset of the columns
  • join
    • creates a new relation (table) out of two
      • equijoin (based upon equality of attributes)
      • natural join (equijoin plus projection to eliminate duplicated columns)
set operators
Set Operators
  • union
    • both relations must be union-compatible -- same degree and same domains
  • set difference
    • both relations must be union-compatible as above
  • intersection
    • same deal
  • cartesian (cross) product
    • note similarity to join operation; join can be defined as a cross product followed by a selection criteria
more operators
More Operators
  • rename ()
    • results of operations in the relational algebra do not have names
    • it is often useful to be able to name such results for use in further expressions later on
    • conceptually similar to an assignment operator in most programming languages
  • semijoin
    • very useful in practical implementation of large queries
    • semijoin of R and S is equivalent to the join of R and S projected onto the attributes of R.
still more operators outer join
Still More Operators: Outer Join
  • outer join is an extension of the join operation to deal with missing information
    • three types: left outer join, right outer join, and full outer join
    • left outer join computes the natural join, then takes all tuples (rows) in the left relation that did not match on the join attribute and includes them in the result, with all attributes of the right relation padded with null values
    • right outer join is the same, except non-matching tuples in the right relation are included in the result padded with null values
    • full outer join includes all non-matching tuples of both relations appropriately padded
    • see examples in text, p108-109
still more operators
Still More Operators
  • division
    • R/S: given R(A,B) and S(B), then a given tuple t is in R/S if for all s in S there exists an r in R such that r.B=s.B and t.A=r.A.
    • So tuple t with attribute t.A is in the result if and only if R contained tuples (t.A, B1), (t.A, B2), (t.A, B3), … for every possible value Bi contained in S.
    • Note that S must be defined on a subset of the attributes of R for the operation to be meaningful.
a short interlude integrity
A Short Interlude: Integrity
  • the preceding slides covered chapter three up to section 3.3
  • before attacking chapter 4 (SQL), we’re going to make a brief excursion up to chapter 6, touching sections 6.1 - 6.4
  • Integrity constraints attempt to enforce data consistency and prevent accidental damage to the database during updates
  • We’ve already seen two forms of integrity constraints:
    • key declarations (stipulating that certain attributes form a candidate key for a given entity set)
    • mapping form of a relationship (one-one, one-many, many-many)
integrity constraints
Integrity Constraints
  • Domain Constraints
    • simplest form of integrity constraint
    • type declarations are one such domain constraint (e.g., integer, floating point, double-precision, fixed length character string).
    • domains can be further restricted (e.g., check clause in SQL can ensure that hourly wages are  4.00 dollars)
    • easily tested whenever a new data item is entered into the database
    • extensions like date or currency can be easily supported on a strongly typed programming language
    • Null values can be useful for values to be filled in later, but some attributes may need to be specified as “not Null” (e.g., primary keys cannot have a null value)
integrity constraints 2
Integrity Constraints (2)
  • Key Constraints
    • keys must have unique values
    • primary key -- a candidate key declared primary
    • unique key -- a candidate key
    • foreign key -- a set of attributes that are a primary key for some other relations
      • foreign keys are an important concept because we need to treat foreign keys differently from other attributes (for example, protecting their uniqueness and insuring referential integrity) even though they aren’t a primary key in the current relation
referential integrity
Referential Integrity
  • We often want to be able to ensure that an attribute value in a tuple of a relation appears in at least one tuple of another relation. For example:
        • EMP(eno, ename, salary)
        • DEPT(dno, dname, floor)
        • WORKS-IN(eno,dno, hours)
  • note that eno is a foreign key in WORKS-IN
  • We want the following to be true:
    • eno(WORKS-IN)  eno(EMP) (every eno is a real employee)
    • dno(WORKS-IN)  dno(DEPT) (every dno is a real department)
  • SQL allows the declaration of domain/key/referential integrity constraints with the clause check in its DDL
referential integrity sql ddl example
Create table customer

(cust-name char(20) not null,

cust-street char(30),

cust-city char(30),

primary key (cust-name))

Create table branch

(branch-name char(15) not null,

branch-city char(30),

assets number,

primary key (branch-name),

check (assets 0))

Create table account

(account-no char(10) not null,

branch-name char(15),

balance number,

primary key (account-no),

foreign key (branch-name) references branch,

check (balance 0))

Create table depositor

(cust-name char(20) not null,

account-no char(10) not null,

primary key (cust-name),

foreign key (cust-name) references customer,

foreign key (account-no) references account)

Referential Integrity: SQL DDL Example
referential integrity and database modifications
Referential Integrity and Database Modifications
  • Database modifications may violate referential integrity
  • Insertion: inserting a value into the referencing relation that is not in the referenced relation
  • Deletion: deleting the last example of a given value in the referenced relation and leaving that value in the referencing one
    • proper handling may lead to cascading deletions
  • Update to the referencing relation (constraints as Insertion)
  • Update to the referenced relation (constraints as Deletion)
  • An assertion is an arbitrary expression that the database must always satisfy
    • e.g., student GPA > 2.8, or sum(all-charges) < credit-line
    • Domain constraints and referential integrity constraints are special forms of assertion that are easy to test
    • SQL supports assertions as follows:

create assertion <assertion-name> check <predicate>

    • When an assertion is made the system checks it for validity. If it is validated, every future modification of the database is checked against the assertion and allowed only if it is not violated.
    • This can be very expensive if assertions are complex or numerous
  • A trigger is a statement that the system executes automatically as a side effect of an update to the database.
  • A trigger has two parts:
    • condition under which it is executed
    • actions to be taken if it is executed
  • Example: instead of having an assertion “balance 0” for a checking account, use a trigger on negative balances that sets the balance to zero and creates a new loan for the amount of the overdraft
  • Triggers make the system reactive
  • Triggers are also called active rules
  • Like Assertions, Triggers can be very expensive.
trigger example
Trigger Example

define trigger overdraft on update of account T

(ifnew T.balance < 0 then (insert into borrow values

(, T.account-number,

T-customer-name, - new T.balance)

update deposit S

set S.balance = 0

where S.account-number = T.account-number))

(note: SQL syntax given here is slightly different from that in the text, p235)

sql structured query language astrahan gray lindsay selinger
SQL (Structured Query Language)(Astrahan, Gray, Lindsay, Selinger, …)
  • Most common and influential commercial query language; well established as the industry standard query language for relational databases
  • Developed (as “Sequel”) at the IBM Research Lab in San Jose in the early 70s
  • Four basic commands
    • select
    • insert
    • delete
    • update
  • Result of each query is a relation
sql example
SQL Example


from emp e

where e.age > 30;

  • e is a tuple variable ranging over the emp relation
  • a tuple variable followed by a “.” and an attribute is an indexed tuple variable and specifies the corresponding attribute of the tuple, very similarly to in many programming languages
  • what follows the keyword select is the target list
  • what follows from is the tuple variable list and consists of a list of relations and variable names
  • what follows where is the qualification clause; an arbitrary boolean expression
  • Basic format of the select command

select [distinct] target_list

from tuple_variable_list

where qualification

[order by target_list_subset];

  • Semantics
    • evaluate qualification: select the subset of the cartesian product of the ranges of the tuple variables that satisfy the qualification
    • evaluate target list: eliminate columns that are not in the target list
    • prepare the result as a relation with columns according to the target list
    • if distinct is used, eliminate duplicate tuples
    • if order by is used, sort the result accordingly
sql some example queries
SQL: some example queries
  • We will give a number of simple query examples using the following relational schema:

sailors(sid, sname, rating)

boats(bid, bname, colour)

reserve(sid, bid, date)

(1) Find the names of sailors who have reserved boat #2

select s.sname

from sailors s, reserve r

where s.sid=r.sid and

sql example queries 2
SQL: example queries (2)

(2) Find the names of sailors who have reserved a red boat

select s.sname

from sailors s, reserve r, boats b

where s.sid=r.sid and and b.colour=“red”

(3) Find the colours of all boats reserved by Pat

select b.colour

from sailors s, reserve r, boats b

where s.sname=“Pat” and s.sid=r.sid and

sql example queries 3
SQL: example queries (3)

(4) Find the names of sailors who have reserved at least one boat

select s.sname

from sailors s, reserve r

where s.sid=r.sid

(5) Find the names of sailors who have reserved a red or a green boat

select s.sname

from sailors s, reserve r, boats b

where s.sid=r.sid and and

(b.colour=“red” or b.colour=“green”)

sql example queries 4
SQL: example queries (4)

(6) Find the names of sailors who have reserved a red and a green boat

select s.sname

from sailors s, reserve r, boats b, reserve r2, boats b2

where s.sid=r.sid and and b.colour=“red”

and s.sid=r2.sid and and


Note: in the above query if sailor Pat has reserved one green boat and two red ones, the name Pat will appear twice in the results. To avoid that, use the keyword distinct in the select line, as in:

select distinct s.sname

  • Basic format of the select command

select [distinct] target_list

from tuple_variable_list

where qualification

[order by target_list_subset];

  • Simple query examples use this relational schema:

sailors(sid, sname, rating)

boats(bid, bname, colour)

reserve(sid, bid, date)

sql target list
SQL: target list
  • “*” is an abbreviation for all attributes in the from list

select *

from sailors s

where order by s.rating

  • Each item in the target list can be as general as attribute_name =expression, where the expression is any arithmetic or string expression over indexed tuple variables and constants. It can also contain some built-in functions like sqrt, sin, mod, etc. as well as aggregates (coming up later)
sql target list expression example
SQL: target list expression example

With rating an integer from 1 to 10, this query gives a rating bonus to sailors who sailed two different boats on the same day.

select s.sid, s.sname, rating=s.rating+2

from sailors s, reserve r, reserve r2

where s.sid=r.sid and s.sid=r2.sid and and !=

What’s wrong with the above?

  • What happens if s.rating = 9 before this query?
  • Domain constraints might take care of this, but we need to be careful

Qualifications: each item in a qualification (where clause) can be as general as expression=expression


select name1 = s1.sname, name2 = s2.sname

from sailors s1, sailors s2

where 2*s1.rating = s2.rating-1


Further elaboration:

  • tuple variables can be implicit if the system can figure out which relation each attribute belongs to
  • table names can be used as tuple variables

Example: find names, ages, and departments of employees who are over 40 and work on the first floor.

select ename, age, emp.dname

from emp, dept

where age>40 and floor=1 and emp.dname=dept.dname


SQL provides set operators: union, intersect, and minus

Example: find the names of employees who work in the toy department and make at most 60K

(select ename

from emp

where dname=“toy”)


(select ename

from emp

where sal>60K)


Note that it is usually possible to phrase a single query in multiple ways. The previous query could have also been written:

(select ename

from emp

where dname=“toy”)


(select ename

from emp

where sal60K)


Or also (even simpler):

select ename

from emp

where dname=“toy” and sal60K

Writing a query in different ways will usually change how efficient the query is -- the above query is very likely to be faster than the example using intersects, and that one is likely to be faster than the one using minus.


SQL also provides set operators: contains (a set being a superset of another) and exists (a set not being empty). Both return Boolean results, so may be negated (using not).


Example: find the names of employees who manage all the departments on the first floor.

select mgr

from dept d1

where (select d2.dname

from dept d2

where d1.mgr=d2.mgr)


(select dname

from dept

where floor=1)


SQL allows nested queries using the keyword in

Example: find the names of employees who work on the first floor.

select ename

from emp

where dname in

(select dname

from dept

where floor = 1)

The same query in flat form is

select dname

from emp, dept

where emp.dname=dept.dname and floor=1


The connective in tests for set membership. Similar connectives are:

  • not in (set non membership)
  • op any (op relationship with some tuple in the set)
  • op all (op relationship with all tuples in the set)

where op is one of (=, !=, <, >, <=, >=)

Example: find the names of employees who make more than everybody on the first floor.

select ename

from emp

where sal > all

(select sal

from emp, dept

where emp.dname=dept.dname and floor = 1)


Scoping of variables works exactly as in Pascal or C

Example: find the names of students who take a course from their advisor.

select sname

from student

where s# in

(select s#

from enroll

where c# in

(select c#

from class

where prof=student.advisor))

recap sql
Recap: SQL

Four basic commands

  • select
  • insert
  • delete
  • update
sql insert
SQL Insert
  • Insert command format:

insert into relation_name values (value_list)


insert into relation_name select_statement

  • Semantics of insert
    • format one: add the tuple corresponding to value_list into relation_name
    • format two: execute the select statement, then add all the resulting tuples into relation_name


insert into student values (1, “Carey”, “CS”, “Stonebraker”)

sql insert1
SQL Insert

Example: relation register(S#, name, paid)

in which registered students are recorded. After the end of registration week, we execute:

insert into student

select r.s#,

from register r

where r.paid=“yes”

sql delete
SQL Delete
  • Delete command format:

delete relation_name where qualification

  • Semantics of delete: execute the corresponding select command:

select full_target_list (or “*”)

from relation_name

where qualification

and then remove the resulting tuples from relation_name

sql delete1
SQL Delete

Example: with the following schema

student(s#, name, major, advisor)

enroll(s#, c#, grade)

course(c#, dept)

The following command expels CS majors who received a grade of less than 2.5 in a CS course:

delete student

where major=“CS” and s# in

(select s#

from enroll, course

where enroll.s#=student.s# and grade<2.5

and enroll.c#=course.c# and dept=“CS”)

sql update
SQL Update
  • Update format

update relation_name

set target_list

where qualification

  • Semantics of update: it is equivalent to executing:
    • insert into del_temp

select *

from relation_name

where qualification

sql update1
SQL Update
  • Semantics of update (cont): … then executing
    • insert into app_temp

select ext_target_list

from relation_name

where qualification

    • delete the tuples in del_temp from relation_name
    • add the tuples in app_temp to relation_name

Ext_target_list is identical to target_list in the original update command, but augmented with tuple_variable.attribute_name for all attributes of the range of tuple_variable that don’t appear in target_list.

sql update2
SQL Update

Example: give a 10% grade raise to every CS major in CS564

update enroll

set grade=1.1*grade

where c#=“CS564” and s# in

(select s#

from student

where major=“CS”)

sql update3
Which is equivalent to:

insert into del_temp

select s#, c#, grade

from enroll

where c#=“CS564” and s# in

(select s#

from student

where major=“CS”)

insert into app_temp

select s#, c#, grade=1.1*grade

from enroll

where c#=“CS564” and s# in

(select s#

from student

where major=“CS”)

SQL Update
sql aggregates
SQL Aggregates

Aggregate functions are functions that take a collection of values as input and return a single value. SQL supports five built-in aggregate functions:

  • average: avg
  • minimum: min
  • maximum: max
  • total: sum
  • cardinality: count

using distinct to aggregate only unique values is often important with avg, sum, and count

sql aggregates1
SQL Aggregates

Example: find the number of students

select num_of_students = count(s#)

from student

why do we not need to use distinct in this example?

Example: find the number of employee records

select count (*)

from emp

if an employee appears more than once in the emp relation, for example if he had switched jobs or had two jobs, then this command would count that employee once for each record

sql aggregates2
SQL Aggregates
  • Qualified Aggregates:

Example: find the average age of employees in the toy department

select avg(age)

from emp

where dname=“toy”

sql group by clause
SQL: Group By clause
  • Group aggregates: groups of tuples are computed using the group by clause
    • the attributes given in the clause are used to form groups
    • typles with the same value on all attributes in the clause are placed in one group

Example: in each department, find the minimum age of employees who make more than 50K

select dname, min(age)

from emp

where sal>50K

group by dname

sql having clause
SQL: Having clause
  • Sometimes it is useful to state a condition that applies to groups in group by rather than to tuples. We do that in SQL with the having clause. SQL applies predicates of having after groups have been formed.

Example: find the average salary for employees under 30 for each department having more than 10 such employees

select dname, avg(sal)

from emp

where age<30

group by dname

having count(*)>10

sql multiple group bys
SQL: Multiple Group Bys

Example: using relation emp(ss#, ename, dept, cat, sal)

Count the employees and average monthly salary for each employee category in each department

select dept, cat, count(*), avg(sal)/12

from emp

group by dept, cat

sql multiple group bys1
Select … from emp

group by cat

Select … from emp

group by dept

SQL: Multiple Group Bys
sql multiple group by
SQL: Multiple Group By

Select … from emp

group by dname, cat

note that some dname/cat groups are empty.

sql examples on having
SQL: Examples on Having

Find the average salary of employees under 30 for each department with more than 10 such employees

select dname, avg(sal)

from emp

where age<30 (employee age under 30)

group by dname (group by department)

having 10<count(*) (group size > 10)

sql examples on having1
SQL: Examples on Having

Find the average salary of employees under 30 for each department with more than 10 employees

select e.dname, avg(e.sal)

from emp e

where e.age<30 (employee age under 30)

group by e.dname (group by department)

having 10<any

(select count(ee.ename) (number of employees in group)

from emp ee

where ee.dname=e.dname) (… from the same dept as e)

(why is this query different from the previous one?)

sql examples on having2
SQL: Examples on Having

Find categories of employees whose average salary exceeds that of programmers

select cat, avg(sal)

from emp

group by cat

having avg(sal)> (select avg(sal)

from emp

where cat=“programmer”)

sql examples on having3
SQL: Examples on Having

Find all departments with at least two clerks

select dname

from emp

where job=“clerk”

group by dname

having count(*) >= 2

sql examples
SQL: Examples

Find the names of sailors with the highest rating

select sname

from sailors

where rating = (select max(rating)

from sailors)

sql examples1
SQL: Examples

For each boat, find the number of sailors of rating >7 who have reserved this boat

select bid, bname, count(s.sid)

from sailors s, boats b, reserve r

where s.sid=r.sid and and rating>7

group by

sql examples2
SQL: Examples

For each red boat, find the number of sailors who have reserved this boat

select bid, bname, count(s.sid)

from sailors s, boats b, reserve r

where s.sid=r.sid and

group by

having colour=“red”

sql examples3
SQL: Examples

Difference between the last two queries?

  • First one gave a qualification on the tuples

(take all tuples of the multijoin

discard tuples that do not fulfill ratings>7

then group them by boat id

then find the cardinality of each group)

  • Second one gave a qualification for the groups

(take all tuples of the multijoin

group them by boat id

discard groups representing boats that are non-red

find the cardinality of remaining groups)

and now for something completely different
And Now, For SomethingCompletely Different...
  • The recent SQL material largely covers chapter 4, at least sections 4.1 through 4.6 and some of 4.9.
  • Earlier we examined Relational Algebra, covering sections 3.1 through 3.3
  • Now we leave chapter 4 and head back to examine sections 3.6 and 3.7, covering Relational Calculi
    • based upon predicate calculus
    • non-procedural query languages (descriptive rather than prescriptive)
    • we will examine two relational calculi: tuple calculus and domain calculus
tuple calculus
Tuple Calculus

Query: {t | P(t)} P: is a predicate associated with some relation R

t: is a tuple variable ranging over the relation R

t[A]: is the value of attribute A in tuple t

  • students in CMSC 424
    • {t | t  enroll  t[course#] = CMSC424}
  • students in CMSC 424 conforming with the CMSC-420 prerequisite
    • {t | t  enroll   s  enroll  t[course#] = CMSC424 

s[course#] = CMSC420  t[ss#] = s[ss#]}

tuple calculus1
Tuple Calculus
  • Quantifiers and free variables
    • ,  quantify the variables following them, binding them to some value. (in the previous slide, s was bound by )
    • A tuple variable that is not quantified by  or  is called a free variable. (in the previous slide, t was a free variable)
  • Atoms
    • R(t) where t is a tuple variable
    • t[x]  s[y] where t,s are tuple variables and

  {, , , , , }

tuple calculus2
Tuple Calculus
  • Formulas
    • an Atom is a Formula
    • If P and Q are Formulas, then so are (P), P, PQ, PQ, and PQ
    • If P(t) is a Formula, then so are t P(t) andt P(t)
  • Equivalences
    • (P  Q)  P   Q
    • (P  Q)  P   Q
    • t P(t)   (t ( P(t)))
    •  t P(t)   ( t ( P(t)))
tuple calculus3
Tuple Calculus
  • Safety
    • Math is too powerful; we can easily phrase expressions that describe infinite sets

{t | t  enroll}

    • These expressions are called unsafe
    • When we are dealing with finite sets, unsafe expressions happen in expressions that involve negation ()
    • We can avoid this problem by using an entirely positive (non-negated) scope as the first operand of any conjunction where we use negation. The first operand establishes the scope and the second one filters the established scope.

{t | t  enroll  t[course#]  CMSC-420}

domain calculus
Domain Calculus
  • Another form of relational calculus
  • Uses domain variables that take values from an attribute’s domain, rather than values representing an entire tuple
  • Closely related to tuple calculus
  • Domain Calculus serves as the theoretical basis for the query language QBE, just as the relational algebra we examined earlier forms the basis for SQL
  • Expressions are of the form:

{< x1, x2, x3, ..., xn> | P( x1, x2, x3, ..., xn) }

domain calculus1
Domain Calculus
  • Atoms
    • < x1, x2, x3, ..., xn>  R
    • x  y where x,y are domain variables and

  {, , , , , }

    • x  c where c is a constant
  • Formulas
    • an atom is a formula
    • If P and Q are formulas, then so are (P), P, PQ, PQ, and PQ
    • If P(x) is a formula and x is a domain variable, then x P(x) andx P(x) are also formulas
domain calculus2
Domain Calculus
  • Queries are of the form:

{< x1, x2, x3, ..., xn> | P( x1, x2, x3, ..., xn) }

  • Examples

{<ss#, course#, semester> | Enroll(ss#, course#, semester)}

{<x, y, z> | Enroll(x, y, z)  y = CMSC-424}

reductions of relational algebra and calculi
Reductions of Relational Algebra and Calculi
  • Relational Algebra (sections 3.2-3.5), Tuple Calculus (section 3.6), and Domain Calculus (section 3.7) can be reduced to each other: they have equivalent expressive power. For every expression in one, we can compute an equivalent expression in the others.
functional dependencies
Functional Dependencies
  • Important concept in differentiating good database designs from bad ones
  • FD is a generalization of the notion of keys
  • An FD is a set of attributes whose values uniquely determine the values of the remaining attributes.

Emp(eno, ename, sal) key FDs: eno => ename

Dept(dno, dname, floor) eno => sal

Works-in(eno,dno, hours) (eno,dno) => hours

dno => dname

dno => floor

functional dependencies1
Functional Dependencies
  • If   R and   R, then  =>  holds in the extension r(R) of R iff for any pair t1 and t2 tuples of r(R) such that t1()=t2() , then it is also true that t1() = t2()
  • We can use FDs as
    • constraints we wish to enforce (e.g., keys)
    • for checking to see if the FDs are satisfied within the database

R( A B C D)

1 1 1 1 A => B satisfied? no

1 2 1 2 A => C satisfied? yes

2 2 2 2 C => A satisfied? no

2 3 2 3 AB => D satisfied? yes

3 3 2 4

functional dependencies2
Functional Dependencies
  • Trivial dependencies:  => 

 =>  if   

  • Closure
    • we need to consider all FDs
    • some are implied by others; e.g., FDs are transitive; if A=>B and B=>C, then A=>C
    • Given F = set of FDs, we want to find F’ (the closure of all FDs logically implied by F)
armstrong s axioms
Armstrong’s Axioms
  • Reflexivity if    then  => 
  • Augmentation if  =>  then   
  • Transitivity if  =>  and  =>  then  => 

Armstrong’s Axioms can be used to derive three additional useful rules:

  • Union rule if  =>  and  =>  then  => 
  • Decomposition rule if  => then  =>  and  => 
  • Pseudotransitivity rule if  =>  and  =>  then  => 
fd example
FD Example

R(A, B, C, G, H, I)

F=( A=>B


CG => H

CG => I

B => H )

F+=( A => H transitivity: A => B => H

CG => HI union rule: CG => H, CG => I

AG => I augmentation: A=> C, AG => CG => I

AG => H ) augmentation: A=> C, AG => CG => H

closure of attribute sets
Closure of Attribute Sets
  • Useful to test if an attribute set is a superkey
  • the closure + of a set of attributes  under F is the set of all attributes that are functionally determined by 
  • there is an algorithm to compute the closure

R(A, B, C, G, H, I) F=(A=>B, A=>C, CG => H, CG => I, B => H )

Example: Algorithm to compute (AG)+

starts with result=(AG)

A=>B expands result=(AGB)

A=>C expands result=(AGBC)

CG=>H expands result=(AGBCH)

CG=>I expands result=(AGBCHI)

B=>H causes no more expansion

uses of attribute closure
Uses of Attribute Closure
  • Testing for superkey
    • to test if  is a superkey, compute + and determine if + contains all attributes of R
  • Testing functional dependencies
    • to check if a functional dependency  =>  holds, (is in F+) just check to see if   +
    • in other words, we compute + using attribute closure, and then check if the result contains .
    • Simple, cheap, and useful test.
relational database design
Relational Database Design
  • A major goal in designing a database is to have a schema that
    • makes queries simpler (easy to phrase)
    • avoids redundancies and update anomalies (about which more later)
schema and query simplicity 1
Schema and Query Simplicity (1)

Example Schema 1: EMP(eno, ename, sal, dno)

DEPT(dno, dname, floor, mgr)

Query 1: find all employees that make more than their manager

select e.ename from EMP e, EMP m, DEPT d

where e.dno = m.dno and d.mgr=m.eno and e.sal>m.sal

Query 2: for each department, find the maximum salary

select d.dname, max(e.sal) from EMP e, DEPT d

where e.dno = d.dno group by d.dno

Q1 requires two joins; Q2 requires a join and a group-by.

schema and query simplicity 2
Schema and Query Simplicity (2)

Example Schema 2: (a single relation)

ED(eno, ename, sal, dno, dname, floor, mgr)

Query 1: find all employees that make more than their manager

select e.ename from ED e, ED m

where e.mgr=m.eno and e.sal>m.sal

Query 2: for each department, find the maximum salary

select d.dname, max(sal) from ED e

group by dno

Q1 requires one join; Q2 requires just a group-by.

schema and query simplicity 3
Schema and Query Simplicity (3)
  • How did we get simpler queries?
  • Schema 2 was a more complicated relation with more information; in essence ED was EMP and DEPT from Schema 1 with the join pre-computed
  • Should we just precompute the joins and store bigger relations?
  • Taken to the extreme, we could compute the universal relation with all attributes inside it and null values for those values that make no sense
  • Why wouldn’t we want to do that?
  • Problems with too-complex relations: repetition of information (data redundancy) and inability to represent certain information (update anomalies)
db design redundancy and anomalies
DB Design: Redundancy and Anomalies
  • Redundancy (repetition of information)
    • each department is repeated for each employee in it
    • great risk of inconsistencies -- suppose the department is moved to a new floor?
    • A simple update (change in mgr name, department floor, etc) in Schema 1 becomes multiple updates in Schema 2
  • Anomalies (inability to represent some types of information)
    • departments can’t exist without employees. A department cannot exist until the first employee is inserted, and it can no longer exist when the last employee is deleted from the ED relation
db design dealing with anomalies
DB Design: Dealing with Anomalies
  • So complex relations make for simpler queries, but have the disadvantages of data redundancy and creation of anomalies. How do we balance the two objectives? We want:
    • simple queries
    • no anomalies; minimize data redundancy
  • If we start with Schema 2 and discover anomalies we can decompose the relation(s) until the problems go away. This process is called normalization.
objectives of db design normalization
Objectives of DB Design (Normalization)
  • no redundancy
    • for space efficiency and to reduce the potential for inconsistencies
  • update integrity
    • avoid update anomalies
  • linguistic efficiency
    • simpler queries are much better for the application programmer and for the query optimizer
  • good performance
    • smaller relations imply more joins (bad)
lossy decompositions
Lossy Decompositions
  • Not all decompositions are reversible (lossless)


Shipment(S#, P#, J#) decomposed into SP(S#, P#) and SJ(S#, J#)

s1 p1 j1 s1 p1 s1 j1

s2 p2 j1 s2 p2 s2 j1

s2 p3 j2 s2 p3 s2 j2

s3 p3 j3 s3 p3 s3 j3

s4 p4 j3 s4 p4 s4 j3

lossy decompositions1
Lossy Decompositions

Shipment(S#, P#, J#) decomposed into SP(S#, P#) and SJ(P#, J#)

s1 p1 j1 s1 p1 p1 j1

s2 p2 j1 s2 p2 p2 j1

s2 p3 j2 s2 p3 p3 j2

s3 p3 j3 s3 p3 p3 j3

s4 p4 j3 s4 p4 p4 j3

If we join SP and SJ again into SP-PJ(S#, P#, P#, J#) we get:

s1 p1 p1 j1

s2 p2 p2 j1

s2 p3 p3 j2 from the joined tuples we cannot

s2 p3 p3 j3 deduce the original form of the data.

s3 p3 p3 j2 this is called the connection trap

s3 p3 p3 j3 and the decomposition is lossy

s4 p4 p4 j3

example of lossy join decomposition
Example of Lossy Join Decomposition
  • Lossy-join decompositions result in information loss
  • Example: decomposition of R=(A,B) into R1=(A) and R2=(B)

R= (A, B) R1= (A) R2= (B)

 1  1

 2  2

 1

R1 X R2 = (A, B)

 1

 2

 1

 2

decomposition continued
Decomposition Continued
  • Decompose the relation schema
  • All attributes of an original schema (R) must appear in the decomposition (R1, R2)
  • Lossless (reversible) join decomposition: for all possible relations r on schema R, the decomposition into (R1, R2) is lossless if

r = R1(r) R2 (r)

  • The decomposition of R into R1 and R2 is lossless if and only if at least one of the following dependencies is in F+:

R1  R2  R1

R1  R2  R2

lossless join decomposition and functional dependencies
Lossless Join Decomposition and Functional Dependencies
  • So FDs can help determine whether a decomposition is lossless
  • R is a relation schema and F its FDs. Then a decomposition

R = R1  R2

is lossless if at least one of the following dependencies holds

R1  R2  R1

R1  R2  R2

  • either of the above FDs guarantees uniqueness in the mapping (and therefore that the decomposition is lossless)
dependency preservation
Dependency Preservation
  • Dependencies are preserved in a decomposition if we do not need to join in order to enforce FDs -- all FDs remain intra-relational and do not become inter-relational
  • To check if a decomposition is dependency preserving, we need to examine all FDs in F+
  • There is an algorithm for testing dependency preservation (requires the computation of F+)
goals of normalization
Goals of Normalization
  • Decide whether a particular relation R is in “good” form
  • if it is not in “good” form, decompose it into a set of relations (R1, R2, R3, …, Rn) such that:
    • each relation is in “good” form
    • the decomposition is a lossless-join decomposition, based upon functional dependencies
  • Types of FDs in R(A, B, C, D) with (A, B) a candidate key:
    • trivial: AB ==> A
    • partial: A ==> C (C depends upon a part of the key)

TEACH(student, teacher, subject)

student, subject ==> teacher (students not allowed in the same subject

of two different teachers)

teacher ==> subject (each teacher teaches only one subject)

    • transitive: A ==> C ==> D

ED(eno, ename, sal, dno, dname, floor, mgr)

eno ==> dno ==> mgr

normalization using fds
Normalization using FDs
  • When we decompose a relation schema R with a set of functional dependencies F into R1, R2, R3, …, Rn we want:
    • lossless-join decomposition: otherwise the decomposition results in loss of information relative to the original schema R
    • no redundancy: the relations Ri should be in either BCNF (Boys-Codd Normal Form) or 3NF (Third Normal Form) (about which more in a slide or two)
    • Dependency preservation: let Fi be the set of dependencies in F+ that include only attributes in Ri:
      • preferably the decomposition should be dependency perserving. That is, F1  F2  F3  …  Fn = F+
      • Otherwise checking updates for violation of FDs may require computing joins, which is expensive
the normal forms
The Normal Forms
  • 1NF: every attribute has an atomic value
  • 2NF: 1NF and no partial dependencies
  • 3NF: 2NF and no transitive dependencies.

Equivalently (text definition): if for each FD X==> Y either

      • it is trivial, or
      • X is a superkey, or
      • Y-X is a proper subset of a candidate key (each attribute in Y that isn’t in X is contained in some candidate key)
  • BCNF: if for each FD X==> Y either
      • it is trivial, or
      • X is a superkey
distinguishing examples
Distinguishing Examples
  • 1NF but not 2NF: SUPPLY(sno, pno, jno, scity, jcity, qty)
    • (sno, pno, jno) is the candidate key
    • sno ==> scity, jno ==> jcity are both partial dependencies
  • 2NF but not 3NF: ED( eno, ename, sal, dno, dname, floor, mgr)
    • transitive FD: eno ==> dno ==> dname
  • 3NF but not BCNF: TEACH(student, teacher, subject)
    • student, subject ==> teacher
    • teacher ==> subject
boyce codd normal form
Boyce-Codd Normal Form

BCNF is perhaps the most useful Normal Form for database design

A relation schema R is in BCNF with respect to a set F of functional dependancies if for all functional dependancies in F+ of the form X==> Y where XR, YR; at least one of the following holds:

  • X ==>Y is trivial (that is, Y  X)
  • X is a superkey for R
bcnf example
BCNF Example
  • R = (A, B, C)
  • F = (A==> B,

B==> C)

  • R is not in BCNF
  • Decomposition R1 = (A, B), R2 = (B, C)
    • R1 and R2 are in BCNF
    • Lossless-join decomposition
    • Dependency preserving
third normal form motivation
Third Normal Form: Motivation
  • There are some situations where
    • BCNF is not dependency preserving, and
    • efficient checking for FD violation on updates is important
    • In these cases BCNF is too severe and a looser Normal Form would be useful
  • Solution: define a weaker Normal Form, called Third Normal Form, where
    • FDs can be checked on individual relations without performing a join (no inter-relational FDs)
    • There is always a lossless-join, dependency-preserving decomposition
third normal form
Third Normal Form
  • A relation schema R is in 3NF with respect to a set F of functional dependancies if for all functional dependancies in F+ of the form X==> Y where XR, YR; at least one of the following holds:
    • X ==>Y is trivial (that is, Y  X)
    • X is a superkey for R
    • Each attribute A in X==>Y is contained in a candidate key for R

(note: possibly in different candidate keys)

  • A relation in BCNF is also in 3NF
  • 3NF is a minimal relaxation of BCNF to ensure dependency preservation
3nf example
3NF Example
  • R = (J, K, L)
  • F = (JK==> L,

L==> K)

  • Two candidate keys: JK and JL
  • R is in 3NF
    • JK==>L JK is a superkey
    • L==>K K is contained in a candidate key
  • BCNF decomposition has R1 = (J, L), R2 = (J, K)
    • testing for JK==>L requires a join
  • There is some redundancy in this schema
testing for 3nf
Testing for 3NF
  • Optimization: need to check only FDs in F, need not check all FDs in F+
  • Use attribute closure to check, for each dependency X==>Y, if X is a superkey
  • If X is not a superkey, we have to verify if each attribute in Y is contained in a candidate key of R
    • This test is rather more expensive, since it involves finding candidate keys
    • Testing for 3NF has been shown to be NP-hard
    • Interestingly, decomposition into 3NF can be done in polynomial time (testing for 3NF is harder than decomposing into 3NF!)
comparison of bcnf and 3nf
Comparison of BCNF and 3NF
  • It is always possible to decompose a relation into relations in 3NF such that:
    • the decomposition is lossless
    • the dependencies are preserved
  • It is always possible to decompose a relation into relations in BCNF such that:
    • the decomposition is lossless
    • but it may not be possible to preserve dependencies
bcnf and 3nf comparison cont
BCNF and 3NF Comparison (cont.)

Example of problems due to redundancy in 3NF

  • R = (J, K, L) J L K

F = (JK==> L, L==> K) j1 l1 k1

j2 l1 k1

j3 l1 k1

null l2 k2

A schema that is in 3NF but not BCNF has the problems of:

  • repetition of information (e.g., the relationship between l1 and k1)
  • need to use null values (e.g., to represent the relationship between l2 and k2 when there is no corresponding value for attribute J)
design goals
Design Goals
  • Goal for a relational database design is:
    • BCNF
    • Lossless Join
    • Dependency Preservation
  • If we cannot achieve this, we accept one of
    • lack of dependency preservation (or use of more expensive inter-relational methods to preserve dependencies)
    • data redundancy due to use of 3NF
  • Interestingly, SQL does not provide a direct way of specifying functional dependencies other than superkeys
    • can specify FDs using assertions, but they are expensive to test
    • Even if we have a dependency preserving decomposition, using SQL we cannot efficiently test an FD whose left hand side is not a key
bcnf and over normalization
BCNF and Over-normalization
  • Goal is to obtain schemas that are:
    • BCNF
    • Lossless Join
    • Dependency Preserving
  • but sometimes we have to look at the meaning, too

Example: TEACH(student, teacher, subject)

student, subject ==> teacher (students not allowed in the

same subject of two teachers)

teacher ==> subject (each teacher teaches one subject)

  • This 3NF has anomalies:
    • Insertion: cannot insert a teacher until we have a student taking his subject
    • Deletion: if we delete the last student of a teacher, we lose the subject he teaches
bcnf and over normalization 2
BCNF and Over-normalization (2)
  • What is the problem? Schema overload. We are trying to capture two meanings:
    • 1) subject X can be taught by teacher Y
    • 2) student Z takes subject W from teacher V

It makes no sense to say we lose the subject he teaches when he does not have a student. Who is he teaching the subject to?

  • Normalizing this schema to BCNF cannot preserve dependencies, so we better stay with the 3NF TEACH and another (BCNF) relation SUBJECT-TAUGHT (teacher, subject) to capture the meaning of the real-world environment more effectively.
getting physical storage and file structure chapter 11
Getting Physical: Storage and File Structure (Chapter 11)
  • Up until now we have examined database design from a high-level conceptual view, passing over actual implementation and underlying hardware.
    • Appropriate focus for database users
    • But hardware does have an influence on implementation, and implementation does have an influence on what conceptual designs will be more efficient and useful
  • Now we get physical -- examine physical storage media to give a background for later focus on implementation of the data models and languages already described
chapter 11
Chapter 11

At this point we are focussing on the following sections

  • 11.1 Overview of Physical Storage Media
  • 11.2 Magnetic Disks
  • 11.3 RAID (very briefly)
  • 11.4 Tertiary Storage
  • 11.5 Storage Access
  • 11.6 File Organization
  • 11.7 Organization of Records in Files
  • 11.8 Data-Dictionary Storage
classification of physical storage media
Classification of Physical Storage Media
  • Media are classified according to three characteristics:
    • speed of access
    • cost per unit of data
    • reliability
      • data loss on power failure or system crash
      • physical failure of the storage device
  • We can also differentiate storage as either
    • volatile storage
    • non-volative storage
physical storage media overview 11 1
Physical Storage Media Overview (11.1)
  • Typical media available are:
    • Cache
    • Main memory
    • Flash memory
    • Mag disk
    • Optical storage (CD or DVD)
    • Tape storage
physical storage media cache and main memory
Physical Storage Media -- Cache and Main Memory
  • Cache
    • fastest and most costly form of storage
    • volatile
    • managed by computer system hardware
  • Main memory
    • fast access (10s to 100s of nanoseconds)
    • generally too small or expensive to hold the entire database
      • current capacities commonly used are up to a few Gigabites
      • capacities have gone up and per-byte costs have decreased steadily, roughly a factor of 2 every 2-3 years
    • volatile
physical storage media flash memory
Physical Storage Media -- Flash Memory
  • Also known as EEPROM -- Electrically Erasable Programmable Read-Only Memory
  • non-volatile
  • reading data is comparable to main memory speeds
  • writing is more complex
    • can’t overwrite a single location -- a whole bank of memory must be erased to permit writing within that bank.
    • erasing is only supported a limited number of times -- 10,000 to one million erase cycles
    • writes are slow (a few microseconds), and erases are slower
  • cost comparable to main memory
  • widely used in computer systems embedded in other devices, such as digital cameras and hand-held computers
physical storage media magnetic disk
Physical Storage Media -- Magnetic Disk
  • data is stored on a spinning disk and read/written magnetically
  • primary medium for long-term storage of data
  • typically stores entire database
  • data must be moved from disk to main memory for access, and written back for storage
    • much slower access than main memory (about which more later)
  • direct access -- possible to read data on disk in any order, unlike magnetic tape
  • capacities up to 100 gig
    • much larger capacity and cheaper cost/byte than main memory or flash memory
    • capacity doubles every two or three years
  • survives power failures and system crashes
    • disk failure can destroy data, but this is more rare than system crashes
physical storage media optical storage
Physical Storage Media -- Optical Storage
  • Non-volatile; data is read optically from a spinning disk using a laser
  • CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms
  • Write-once, Read-many (WORM) optical disks used for archival storage (CD-R and DCD-R)
  • Multiple-write versions also available (CD-RW, DVD-RW, and DVD-RAM)
  • Reads and writes are slower than with magnetic disk
  • Juke-box systems available for storing large volumes of data
    • large numbers of removable disks
    • several drives
    • mechanism for automatic loading/unloading of disks
physical storage media tape storage
Physical Storage Media -- Tape Storage
  • Non-volatile
  • used primarily for backup (to recover from disk failure) and for archival data
  • sequential access -- much slower than disk
  • very high capacity (40-300 GB tapes available)
  • tape can be removed from drive; storage costs much cheaper than disk, but drives are expensive; data is read optically from a spinning disk using a laser
  • Juke-box systems available for storing large volumes of data
    • e.g., remote sensing data, possibly hundreds of terabytes (1012 bytes) or even a petabyte (1015 bytes)
storage hierarchy
Storage Hierarchy
  • Primary storage: fastest media but volatile
    • cache
    • main memory
  • secondary storage: next level in hierarchy; moderately fast access time, non-volatile
    • also called on-line storage
    • flash memory, magnetic disks
  • tertiary storage: lowest level in hierarchy; slower access time, non-volatile
    • also called off-line storage
    • optical storage, magnetic tape
magnetic disks 11 2
Magnetic Disks (11.2)
  • Read-write head
    • positioned very close to the platter surface (almost touching it)
    • Reads or writes magnetically coded information
  • Surface of platter divided into circular tracks
    • over 16,000 tracks per platter on typical hard disks
  • Each track is divided into sectors
    • a sector is the smallest unit of data that can be read or written
    • sector size is typically 512 bytes
    • typically 200 (on inner tracks) to 400 (outer tracks) sectors per track
magnetic disks cont
Magnetic Disks (cont)
  • To read/write a sector
    • disk arm swings to position head on the right track
    • platter spins continually; data is read/written as sector passes under head
  • Head-disk assemblies
    • multiple disk platters on a single spindle (typically 2 to 4)
    • one head per platter, mounted on a common arm
  • Cylinder i consists of ith track of all the platters
magnetic disks cont1
Magnetic Disks (cont)
  • Earlier generation disks were susceptible to head crashes
    • disk spins constantly at 60, 120, even 250 revolutions per second
    • head is very close to the surface; if it touches the surface it can scrape the recording medium off the surface, wiping out data and causing the removed medium to fly around, causing more head crashes
    • newer disks have less friable material; less subject to head crashes
magnetic disks cont2
Disk controller -- interfaces between the computer system and the disk drive

accepts high-level commands to read or write a sector

initiates actions such as moving the disk arm to the right track and actually reading or writing the data

computes and attaches checksums to each sector to verify that data is read back correctly

Ensures successful writing by reading back sector after writing it

Performs remapping of bad sectors

Multiple disks are connected to a computer system through a controller

controllers functionality (checksum, bad sector remapping) often carried out by individual disks, reducing load on controller

Two disk interface standards are ATA (AT attachment) and SCSI (Small Computer System Interconnect)

Magnetic Disks (cont)
disk performance measures
Disk Performance Measures
  • Access time -- the time it takes from when a read or write request is issued to when data transfer begins. Consists of:
    • Seek time -- time it takes to reposition the arm over the correct track
      • average seek time is 1/2 the worst case seek time
      • 4 to 10 milliseconds on typical disks
    • Rotational latency -- time it takes for the sector to appear under the head
      • average latency is 1/2 the worst case latency
      • 4 to 11 milliseconds on typical disk (5400 to 15000 rpm)
  • Data Transfer Rate -- rate of retrieval from or storage to disk
    • 4 to 8 MB per second is typical
    • Multiple disks may share a controller, so the rate that controller can handle is also important. E.G., ATA-5: 66MB/sec, SCSI-3: 40MB/sec, Fiber Channel: 256 MB/s
disk performance measures cont
Disk Performance Measures (cont.)
  • Mean time to failure (MTTF) - the average time the disk is expected to run continuously without any failure.
    • Typically 3-5 years
    • Sounds good, but if you have 1500 disks then 300 per year will fail, or about 1 per day
    • MTTF decreases as the disk ages
  • RAID (Redundant Arrays of Independent Disks) (11.3)
    • disk organization techniques that manage a large number of disks, providing a view of a single disk of
      • high capacity and high speed by using multiple disks in parallel
      • high reliability by storing data redundantly, so that data can be recovered even if a disk fails
      • MTTdata loss can be as high as 500,000 to 1,000,000 hours on a RAID
optimization of disk block access motivation
Optimization of Disk-Block Access: Motivation
  • Requests for disk I/O are generated both by the file system and by the virtual memory manager
  • Each request specifies the address on the disk to be referenced in the form of a block number
    • a block is a contiguous sequence of sectors from a single track on one platter
    • block sizes range from 512 bytes to several K (4 -- 16K is typical)
    • smaller blocks mean more transfers from disk; larger blocks makes for more wasted space due to partially filled blocks
    • block is the standard unit of data transfer between disk to main memory
  • Since disk access speed is much slower than main memory access, methods for optimizing disk-block access are important
optimization of disk block access methods
Optimization of Disk-Block Access: Methods
  • Disk-arm Scheduling: requests for several blocks may be speeded up by requesting them in the order they will pass under the head.
    • If the blocks are on different cylinders, it is advantageous to ask for them in an order that minimizes disk-arm movement
    • Elevator algorithm -- move the disk arm in one direction until all requests from that direction are satisfied, then reverse and repeat
    • Sequential access is 1-2 orders of magnitude faster; random access is about 2 orders of magnitude slower
optimization of disk block access methods1
Optimization of Disk-Block Access: Methods
  • Non-volatile write buffers
    • store written data in a RAM buffer rather than on disk
    • write the buffer whenever it becomes full or when no other disk requests are pending
    • buffer must be non-volatile to protect from power failure
      • called non-volatile random-access memory (NV-RAM)
      • typically implemented with battery-backed-up RAM
    • dramatic speedup on writes; with a reasonable-sized buffer write latency essentially disappears
    • why can’t we do the same for reads? (hints: ESP, clustering)
optimization of disk block access methods2
Optimization of Disk-Block Access: Methods
  • File organization (Clustering): reduce access time by organizing blocks on disk in a way that corresponds closely to the way we expect them to be accessed
    • sequential files should be kept organized sequentially
    • hierarchical files should be organized with mothers next to daughters
    • for joining tables (relations) put the joining tuples next to each other
    • over time fragmentation can become an issue
      • restoration of disk structure (copy and rewrite, reordered) controls fragmentation
optimization of disk block access methods3
Optimization of Disk-Block Access: Methods
  • Log-based file system
    • does not update in-place, rather writes updates to a log disk
      • essentially, a disk functioning as a non-volatile RAM write buffer
    • all access in the log disk is sequential, eliminating seek time
    • eventually updates must be propogated to the original blocks
      • as with NV-RAM write buffers, this can occur at a time when no disk requests are pending
      • the updates can be ordered to minimize arm movement
    • this can generate a high degree of fragmentation on files that require constant updates
      • fragmentation increases seek time for sequential reading of files
storage access 11 5
Storage Access (11.5)
  • Basic concepts (some already familiar):
    • block-based. A block is a contiguous sequence of sectors from a single track; blocks are units of both storage allocation and data transfer
    • a file is a sequence of records stored in fixed-size blocks (pages) on the disk
    • each block (page) has a unique address called BID
    • optimization is done by reducing I/O, seek time, etc.
    • database systems seek to minimize the number of block transfers between the disk and memory. We can reduce the number of disk accesses by keeping as many blocks as possible in main memory.
    • Buffer - portion of main memory used to store copies of disk blocks
    • buffer manager - subsystem responsible for allocating buffer space in main memory and handling block transfer between buffer and disk
buffer management
Buffer Management
  • The buffer pool is the part of the main memory alocated for temporarily storing disk blocks read from disk and made available to the CPU
  • The buffer manager is the subsystem responsible for the allocation and the management of the buffer space (transparent to users)
  • On a process (user) request for a block (page) the buffer manager:
    • checks to see if the page is already in the buffer pool
    • if so, passes the address to the process
    • if not, it loads the page from disk and then passes the address to the process
    • loading a page might require clearing (writing out) a page to make space
  • Very similar to the way virtual memory managers work, although it can do a lot better (why?)
buffer replacement strategies
Buffer Replacement Strategies
  • Most operating systems use a LRU replacement scheme. In database environments, MRU is better for some common operations (e.g., join)
    • LRU strategy: replace the least recently used block
    • MRU strategy: replace the most recently used block
  • Sometimes it is useful to fasten or pin blocks to keep them available during an operation and not let the replacement strategy touch them
    • pinned block is thus a block that is not allowed to be written back to disk
  • There are situations where it is necessary to write back a block to disk even though the buffer space it occupies is not yet needed. This write is called the forced output of a block; useful in recovery situations
  • Toss-immediate strategy: free the space occupied by a block as soon as the final tuple of that block has been processed
buffer replacement strategies1
Buffer Replacement Strategies
  • Most recently used (MRU) strategy: system must pin the block currently being processed. After the final tuple of that block has been processed the block is unpinned and becomes the most recently used block. This is essentially “toss-immediate” with pinning, and works very well with joins.
  • The buffer manager can often use other information (design or statistical) to predict the probability that a request will reference a particular page
    • e.g., the data dictionary is frequently accessed -- keep the data dictionary blocks in main memory buffer
    • if several pages are available for overwrite; choose the one that has the lowest number of recent access requests to replace
buffer management cont
Buffer Management (cont)
  • Existing OS affect DBMS operations by:
    • read ahead, write behind
    • wrong replacement strategies
    • Unix is not good for DBMS to run on top
    • Most commercial systems implement their own I/O on a raw disk partition
  • Variations of buffer allocation
    • common buffer pool for all relations
    • separate buffer pool for each relation
    • as above but with relations borrowing space from each other
    • prioritized buffers for very frequently accessed blocks, e.g. data dictionary
buffer management cont1
Buffer Management (cont)
  • For each buffer the manager keeps the following:
    • which disk and which block it is in
    • whether the block is dirty (has been modified) or not (why?)
    • information for the replacement strategy
      • last time block was accessed
      • whether it is pinned
      • possible statistical information (access frequency etc.)
buffer management and disk block access optimization end
Buffer Management and Disk-block Access Optimization (end)
  • Disk-block access methods must take care of some information within each block, as well as information about each block:
    • allocate records (tuples) within blocks
    • support record addressing by address and by value
    • support auxiliary (secondary indexing) file structures for more efficient processing
  • These concerns are linked in to our next topic:

file organization.

file organization
File Organization
  • The database is stored logically as a collection of files.
  • Each file is a sequence of records
  • A record is a sequence of fields
  • Easy so far, but as we just finished discussing, anything stored on disk is stored in blocks, which are a physical constraint unrelated to the storage system used for files.
  • So how do we organize a file into blocks and records?
    • formatting fields within a record
    • formatting records within a block
    • assigning records to blocks.
fixed length records
Fixed-Length Records
  • Simplest approach. We know the length of each record, and they are all the same
    • store record i starting from byte n * (i - 1), where n is record length
    • record access is simple
fixed length records1
Fixed-Length Records
  • Problems:
    • records may cross blocks
      • normal modification: don’t permit records to cross block boundaries
    • deletion of record i leaves a gap, which requires some way of dealing with the empty space.
    • E.G., record 2 (A-215) is deleted from the example block on the right
fixed length records deletion
Fixed-Length Records: Deletion
  • One simple fix is to shift all the records down to fill the gap, as shown on the right.
  • This involves a lot of work, so it might be slow
fixed length records more deletion
Fixed-Length Records: more Deletion
  • Another fix would be to shift the last record (record n) to the deleted position I
  • Much less work
  • Not useful if the records are stored in order (sorted)
fixed length records still more deletion
Fixed-Length Records: still more Deletion
  • Another possibility is to not move records at all
  • Maintain a header at the beginning of the file
  • Store a link to the list of addresses of deleted records
  • Use each deleted record to store the link to the next deleted record
  • Essentially a linked list, often called a free list
variable length records
Variable Length Records
  • Can occur in a database system in several ways
    • storage of multiple record types in a file
    • record types that allow variable lengths for one or more fields
    • record types that allow repeating fields
  • Several ways to store variable length records
    • attach an “end of record” symbol (delimiter) to mark the end of each record
    • store the length of the record at the beginning of each record
    • store header information at the beginning of the file with the location and length of each record
    • these techniques can be applied at the file, block, or record level
variable length records1
Variable Length Records
  • Files:
    • delimit each record within the file
    • store a length field at the beginning of each record
    • store header information at the beginning of the file with the location and length of each record within the file
  • Blocks:
    • delimit each record within the block
    • store a length field at the beginning of each record
    • store header information at the beginning of the block with the location and length of each record inside the block
  • Records:
    • delimit each field within the record
    • store a length field at the beginning of each field
    • store header information at the beginning of the record with the location and length of each field
variable length records2
Variable Length Records

Two more techniques for storing variable-length records

  • use fixed-length fields
    • reserve space -- if there is a maximum space, just reserve that, and mark unused space with a special null (end-of-record) symbol
    • wasteful if the maximum record length is much larger than the average record length
  • list representation
    • represent variable-length records by lists of fixed-length records, chained together by pointers
    • useful for variable-length records caused by repeating same-length fields
    • we don’t want a single field of the variable-length record to cross the boundary of two fixed-length records in its representation, so this can also be wasteful of space
organizing records in a block
Organizing Records in a Block
  • Two major ways we can organize records within a block (disk page)
    • fixed-packed (contiguous storage)
    • slotted page structure (indexed heap)

1) fixed-packed -- records are stored contiguously

    • highly inflexible
    • records may span over block boundary
    • fragmentation with deletions and insertions
    • external pointers may prevent internal block reorganization -- records are pinned to their address in the block
organizing records in a block1
Organizing Records in a Block

2) slotted page structure

  • an initial block header storing block address and offset is used to reference the record
  • records are indexed within a block
  • insertions and deletions are easy (there are no assumptions about contiguity of records and record-address startpoints to deal with)
  • records may be rearranged within the block without concerns about external pointers
  • records are not pinned within the block
organizing records in a files
Organizing Records in a Files
  • Given a set of records, how do we organize them in a file? Three possible methods are:
    • 1. Heap -- no order at all. A record can be placed anywhere in the file where there is space
    • 2. Sequential -- records are stored in a sorted order according to the value of a search key
    • 3. Hashing -- a hash function computed on some attribute of each record is used to specify in which block of the file the record should be placed
    • Records of each relation are often stored in separate files. Sometimes it is useful to use a clustering file organization, where records of several relations might be stored in a single file.
heap file organization
Heap File Organization
  • Heap -- no order at all. A record can be placed anywhere in the file where there is space
    • easy insert, easy delete
    • lack of any structure makes queries (including finding a particular record) very difficult
    • not usually useful for anything except very small relations
sequential file organization
Sequential File Organization
  • Sequential -- records are stored in a sorted order according to the value of a search key
    • designed for efficient queries in sorted order
    • very suitable for applications that require sequential processing of the entire file
    • difficult to maintain sorted order with insert/delete
    • deletions can use a free list (pointer chain) to mark empty space as previously described
sequential file organization1
Sequential File Organization
  • Insertions use the following method:
    • locate the location to be inserted
    • if there is space there, insert with no more work
    • otherwise insert the record in an overflow block
    • in either case the pointer chain must be updated
  • Every so often we need to reorganize the whole file to restore sequential order
clustering file organization
Clustering File Organization
  • Simple file structure stores each relation in a separate file
    • tuples can be represented as fixed-length records
    • easy to implement
    • well-suited to small databases
  • Large databases often attempt to store many relations in one file using a clustering file organization
  • E.G., relations customer and depositor shown to the right:
clustering file organization1
Clustering File Organization
  • depositor relation stores the different accounts that a particular customer has
  • customer relation stores the address information of a given customer
  • Both relations use customer-name as a key
  • Some common queries on the two relations join them based on the customer-name attribute
clustering file organization2
Clustering File Organization
  • Storing the two relations together, sorted on customer-name, allows the join to be computed much more quickly
  • There is a price to pay -- some operations are now more expensive (slower)
  • for example, consider

select *

from customer

  • sequential pass through the customer relation is now hard
clustering file organization3
Clustering File Organization
  • To allow sequential access through all tuples of the customer relation, we chain together all the tuples of that relation using pointers
  • clustering results in variable-size records
  • Careful use of clustering can produce significant performance gains
data dictionary
Data Dictionary

The data dictionary (also called system catalog) stores metadata -- data about data. For example:

  • Information about relations
    • names of relations
    • names and types of attributes of each relation
    • names and definitions of views
    • integrity constraints
  • User and account information, including passwords
data dictionary1
Data Dictionary
  • Statistical and descriptive data
    • number of tuples in each relation
  • Physical file organization information
    • how relation is stored (sequential, hash, clustered, etc.)
    • physical location of relation -- operating system file name or disk addresses of blocks containing records of the relation
  • Information about indices
    • (about which more after the midterm, when we cover chapter 12)
data dictionary2
Data Dictionary
  • In effect, the data dictionary is a mini database. The data within can either be stored as:
    • special-purpose data structures and code to access it, or
    • a set of relations, using the existing database structures and code to access it (most common solution)
data dictionary3
Data Dictionary

Example of a possible system catalog representation:

Relation-metadata = (relation-name, number-of-attributes,

storage-organization, location)

Attribute-metadata = (attribute-name, relation-name, domain-type, position, length)

User-metadata = (user-name, encrypted-password, group)

Index-metadata = (index-name, relation-name, index-type,


View-metadata = (view-name, definition)

reading and review
Reading and Review

Upcoming dates:

  • Mar 11: HW#2 is due
  • Mar 11 and 13: midterm review
  • Mar 18: midterm

Text Sections we’ve covered:

  • Chapter 1: Introduction
  • Chapter 2: E-R data model
  • Chapter 3 (except 3.4 & 3.5): Relational model
  • Chapter 4: SQL
  • Chapter 6: Integrity and Security
  • Chapter 7 up to and including 7.7: Relational-Database Design
  • Chapter 11 (except 11.3 and 11.9): Storage and File Structure
reading and review chapter 1 introduction
Reading and ReviewChapter 1: Introduction

1.1: Applications and motivation

1.2: Database systems vs. file systems

1.3: Views

1.4: Data models

1.5: Database languages

1.6: Database users and administrators

1.7-1.10: other stuff

1.11: Summary

reading and review chapter 2 entity relationship model
Reading and ReviewChapter 2: Entity-Relationship Model

2.1: Basic concepts: entities (entity-sets), relationship (set)s, and attributes

2.2: Constraints and mapping

2.3: Keys

2.4: Design issues

2.5: E-R Diagram

2.6: Weak entity sets

2.7: Extended E-R features (specialization, aggregation)

2.8: Design of an E-R schema

2.9: Reduction of an E-R schema to tables

2.10: (ignore)

2.11: Summary

reading and review chapter 3 relational model
Reading and ReviewChapter 3: Relational Model

3.1: Structure of relational databases

3.2: The relational algebra

3.3: Extended relational-algebra operations

3.4, 3.5: (ignore)

3.6: Tuple relational calculus

3.7: Domain relational calculus

3.8: Summary

reading and review chapter 4 sql
Reading and ReviewChapter 4: SQL

4.1: Background

4.2: Basic structure

4.3: Set operations

4.4: Aggregate functions

4.5: Null values

4.6: Nested subqueries

4.7, 4.8: (ignore)

4.9: Modification of the database

4.10: (ignore)

4.11-4.14: (really ignore)

4.15: Summary

reading and review chapter 6 integrity and security
Reading and ReviewChapter 6: Integrity and Security

6.1: Domain Constraints

6.2: Referential Integrity

6.3: Assertions

6.4: Triggers

6.5: Security and Authorization

6.6, 6.7: (ignore)

6.8: Summary

reading and review chapter 7 relational database design
Reading and ReviewChapter 7: Relational-Database Design

7.1: First Normal Form

7.2: Pitfalls in Relational-Database Design

7.3: Functional Dependencies

7.4: Decomposition

7.5: Desirable Properties of Decomposition

7.6: BCNF

7.7: Third Normal Form

7.8-7.10: (ignore)

7.11: Summary

reading and review chapter 11 storage and file structure
Reading and ReviewChapter 11: Storage and File Structure

11.1: Overview of physical storage media

11.2: Magnetic Disks

11.3: RAID (not responsible for this)

11.4: Tertiary Storage

11.5: Storage Access

11.6: File Organization

11.7: Organization of Records in Files

11.8: Data Dictionary Storage

11.9: (ignore)

11.10: Summary