Fundamentals of Relational Database Design and Database Planning
J. Trumbo, Fermilab CSS-DSG

Outline:
Definitions
Selecting a DBMS
Selecting an application layer
Relational design
Planning
A very few words about replication
Space

Definitions: What is a database?
A database is the implementation of freeware or commercial software that provides a means to organize and retrieve data. The database is the set of physical files in which all the objects and database metadata are stored. These files can usually be seen at the operating system level. This talk will focus on the organize aspect of data storage and retrieval.
Commercial vendors include Microsoft and Oracle.
Freeware products include MySQL and PostgreSQL.
For this discussion, all points/issues apply to both commercial and freeware products.
A database instance, or an ‘instance’ is made up of the background processes needed by the database software.
These processes usually include a process monitor, session monitor, lock monitor, etc. They will vary from database vendor to database vendor.
A SCHEMA IS NOT A DATABASE, AND A DATABASE IS NOT A SCHEMA.
A database instance controls 0 or more databases.
A database contains 0 or more database application schemas.
A database application schema is the set of database objects that apply to a specific application. These objects are relational in nature and are related to each other within a database to serve a specific functionality, for example payroll, purchasing, calibration, or trigger. A database application schema is not a database. Usually several schemas coexist in a database.
A database application is the code base to manipulate and retrieve the data stored in the database application schema.
Entity Relationship Diagram or ER is a pictorial representation of the application schema.
Constraints are rules residing in the database’s data dictionary governing relationships and dictating the ways records are manipulated, what is a legal move vs. what is an illegal move. These are of the utmost importance for a secure and consistent set of data.
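As a minimal sketch of a constraint doing its job, using Python's sqlite3 module for illustration (the table and the rule here are hypothetical, not from the talk):

```python
import sqlite3

# Hypothetical table: the rule (gain must be positive) lives in the data
# dictionary, so every application path is governed by it.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE channel (
    channel_id INTEGER PRIMARY KEY,
    gain       REAL NOT NULL CHECK (gain > 0))""")

conn.execute("INSERT INTO channel VALUES (1, 1.25)")      # a legal move
try:
    conn.execute("INSERT INTO channel VALUES (2, -3.0)")  # an illegal move
except sqlite3.IntegrityError:
    print("rejected by the database, not by application code")
```

The point is that the rejection happens in the database itself, so no buggy API path can sneak bad data past it.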
Data Manipulation Language, or DML: SQL statements that insert, update, or delete data in a database.
Data Definition Language, or DDL: SQL used to create and modify database objects used in an application schema.
A transaction is a logical unit of work that contains one or more SQL statements. A transaction is an atomic unit: the effects of all the SQL statements in a transaction are either all committed (applied to the database) or all rolled back (undone from the database), ensuring data consistency.
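A sketch of this all-or-nothing behavior, again with sqlite3 (the account table and transfer function are hypothetical):

```python
import sqlite3

# Hypothetical schema: move money between two accounts as one atomic unit.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, amount, fail=False):
    """Both UPDATEs commit together, or neither does."""
    try:
        conn.execute("UPDATE account SET balance = balance - ? "
                     "WHERE name = 'alice'", (amount,))
        if fail:
            raise RuntimeError("simulated crash mid-transaction")
        conn.execute("UPDATE account SET balance = balance + ? "
                     "WHERE name = 'bob'", (amount,))
        conn.commit()
    except RuntimeError:
        conn.rollback()  # undo the partial debit

transfer(conn, 60, fail=True)
print(dict(conn.execute("SELECT name, balance FROM account")))
# {'alice': 100, 'bob': 0} -- the half-finished transfer was rolled back
```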
Database triggers are PL/SQL, Java, or C procedures that run implicitly whenever a table or view is modified or when some user actions or database system actions occur. Database triggers can be used in a variety of ways for managing your database. For example, they can automate data generation, audit data modifications, enforce complex integrity constraints, and customize complex security authorizations. Trigger methodology differs between databases.
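A small sketch of an audit trigger, in SQLite syntax for illustration (table names are hypothetical; as noted above, trigger syntax differs by vendor):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calib (id INTEGER PRIMARY KEY, value REAL);
CREATE TABLE calib_audit (id INTEGER, old_value REAL, new_value REAL);

-- Fires implicitly on every UPDATE of calib, recording the change.
CREATE TRIGGER calib_update AFTER UPDATE ON calib
BEGIN
    INSERT INTO calib_audit VALUES (OLD.id, OLD.value, NEW.value);
END;
""")
conn.execute("INSERT INTO calib VALUES (1, 2.0)")
conn.execute("UPDATE calib SET value = 2.5 WHERE id = 1")
print(conn.execute("SELECT * FROM calib_audit").fetchall())  # [(1, 2.0, 2.5)]
```

The application never touched the audit table; the database did it implicitly, which is exactly what makes triggers useful for auditing.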
Replication is the process of copying and maintaining database objects, such as tables, in multiple databases that make up a distributed database system.
Backups are copies of the database data in a format specific to the database. Backups are used to recover one or more files that have been physically damaged as the result of a disk failure. Media recovery requires the restoration of the damaged files from the most recent operating system backup of a database. It is of the utmost importance to perform regularly scheduled backups.
Mission Critical Applications
An application is defined as mission critical, imho, if
1. there are legal implications or financial loss to the institution if the data is lost or unavailable.
2. there are safety issues if the data is lost or unavailable.
3. no data loss can be tolerated.
4. uptime must be maximized (98%+).
‘large’ or ‘very large’ or ‘a lot’
Seems odd, but 'large' is a hard definition to pin down. VLDB is an acronym for very large database. Its definition varies depending on the database software one selects. Very large normally indicates data that is reaching the limits of capacity for the database software, or data for which extraordinary measures need to be taken for operations such as backup, recovery, storage, etc.
Commercial databases do not have a practical limit on the size of the load. The issue will be backup strategies for large databases.
Freeware does limit the size of the database and the number of users. Documentation on these limits varies widely from the freeware sites to the user sites. MySQL supposedly can support 8 TB and 100 users; however, you will find arguments on the user lists that these numbers cannot be met.
Many options, many decisions, planning, costs, criticality.
For lots of good information, please refer to the URLs on the last slides, which include many examples of people choosing products.
Which database product is appropriate for my application? You must make a requirements assessment.
Does your database need 24x7 availability?
Is your database mission critical, and no data loss can be tolerated?
Is your database large? (backup recovery methods)
What data types do I need? (binary, large objects?)
Do I need replication? What level of replication is required? Read only? Read/Write? Read/Write is very expensive, so can I justify it?
If your answer to any of the above is 'yes', I would strongly suggest purchasing and using a commercial database with vendor support.
Freeware is an alternative for applications. However, be forewarned: support for these databases is done via email to an ad hoc support group. The level of support via these groups may vary over the life of your database. Be prepared. Also expect less functionality than any commercial product. See http://www-css.fnal.gov/dsg/external/freeware/
Freeware is free.
Freeware is open source.
Freeware functionality is improving.
Freeware is good for smaller non-mission critical applications.
Again, planning takes center stage. In the end you want stability and dependability.
Application maintenance issues
Misc. application definitions…
This presentation is not an application presentation, but I will mention a few terms you may hear.
SQL: the query language for relational databases. A must-learn.
ODBC, open database connectivity. The software that allows a database to talk to an application.
JDBC, java database connectivity.
The design of the application schema will determine the usability and queryability of the application. Done incorrectly, the application and its users will suffer until someone else is forced to rewrite it.
The database group has a standard 3-tier infrastructure for developing and deploying production databases and applications. This infrastructure provides 3 database instances: development, integration, and production. This infrastructure is applicable to any application schema, mission critical or not. It is designed to ensure development, testing, feedback, signoff, and a protected production environment.
Each of these instances contains 1 or more applications.
The 3 instances are used as follows:
1. The development instance is the developers' work area, where the application schema and code are created and changed freely.
2. The integration instance is used for moving what is thought to be 'complete' functionality to a pre-production implementation. Power users and developers work in concert in integration to make sure the specs were followed. The users should use integration as their sign-off area. Cuts from dev to int are frequent and common, to maintain the newest releases in int for user testing.
3. The production instance: real data. Needs to be kept pure. NO testing allowed. Very few logons. The optimal setup of a production database server machine has ~3 operating system logons: root, the database logon (i.e., oracle), and a monitoring tool. In a critical 24x7 supported database, developers, development tools, web servers, and log files should all be kept off the production database server.
Let’s talk about mission critical & 24x7 a bit.
All database applications and database software require modifications. Most times these modifications require downtime, because the schema or data modifications need to lock entire tables exclusively. If you are sharing your database instance with many other applications, and 1 of those applications needs the database for an upgrade, all apps may have to take the downtime. Avoid this by ensuring your 24x7 database application is segregated from all other software that is not absolutely needed. That way you ensure any downtimes are specific to your cause.
A CPU can host 1 or more instances. An instance may contain 1 or more databases. A database can accommodate 1 or more application schemas (e.g., sam, runs, calib).
One implements a schema by running scripts. These scripts can be run against multiple servers and should be archived.
Using your design tool, you will begin by relating objects that will eventually become tables. All the other schema objects will fall out of this design.
You will spend LOADS of time in your design tool, honing, redoing, reacting to modifications, etc.
The end users and the designers need to be working almost at the same desk for this process. If the end user is the designer, the end user should involve additional users to ensure an unbiased and general design.
It is highly suggested that the design be kept up to date for future documentation and maintainers.
Tables are related, most frequently in a 0-to-many relationship. For example, 1 run will result in 0 or more events. Analyzing and defining these relationships results in an application schema.
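The run/event example above can be sketched as DDL, using sqlite3 for illustration (the column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE run (
    run_id     INTEGER PRIMARY KEY,
    start_time TEXT
);
CREATE TABLE event (
    event_id INTEGER PRIMARY KEY,
    run_id   INTEGER NOT NULL REFERENCES run(run_id)  -- each event belongs to one run
);
""")
# A run with no events yet is perfectly legal: the relationship is 0-to-many.
conn.execute("INSERT INTO run VALUES (42, '2004-02-01')")
n = conn.execute("SELECT COUNT(*) FROM event WHERE run_id = 42").fetchone()[0]
print(n)  # 0
```

The foreign key is the relationship: the parent can exist alone, but a child row must point at an existing parent.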
I am afraid the 80% planning 20% implementation rule applies. Gather requirements.
You will not be able to anticipate all requirements, but a document will be a start. A well designed schema naturally allows for additional functionality.
Who are the users? What is their mission?
Identify objects that need to be stored/tracked.
Think about how objects relate to each other.
Do not be afraid to argue/debate the relationships with others.

Relational Design: Let's get started
Design tools are available; however, they do not think for you. They will give you a clue that you are doing something stupid, but they won't stop you. It is highly recommended you use a design tool.
A picture says 1000 words. Create ER, entity relationship, diagrams.
Get a commitment from the developer(s) to see the application through to implementation. We have seen several applications redone multiple times. A string of developers tried, left the project, and left a mess. A new developer started from scratch because there was no documentation or design.
Adhere to the recommendations of your database vendor for setup and architecture.
Don’t be afraid to ask for help or to see other examples.
Don’t be afraid to pilfer others design work, if it is good, if it closely fits your requirements, then use it.
Ask questions, schedule reviews with experts and users.
Work with your hardware system administrators to ensure you have the hardware you need for the proposed job.
Mistakes we see ALL the time
USE DATABASE CONSTRAINTS!!!!!!
We have examples where constraints were not used but were 'implemented' via the API. Bugs in the API allowed data to be deleted that should not have been deleted; constraints would have prevented the error. We have also seen APIs fail with 'cannot delete' errors: they were trying to force an invalid delete, and luckily the database constraints saved the data.
For example, a calibration type table might have 3 rows: drift, pedestal, and gain. This is a parent table. Each calibration record will be defined by drift, pedestal, or gain, in addition to start and end times. This is a child table.
You have now created 3 different children, all reporting the same information, when 1 child would suffice. Code will have to be written, tested, and maintained for 4 tables now instead of 2.
Now you have created 3 different applications, using 6 tables. All of which could be managed with 2 tables.
Extra code, extra testing, extra maintenance.
AHHH, back to normal, or normalization as we refer to it.
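A sketch of the normalized version: one parent lookup table and one child table, instead of one child table per calibration type (sqlite3 for illustration; table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE calibration_type (
    type_id INTEGER PRIMARY KEY,
    name    TEXT UNIQUE NOT NULL        -- drift, pedestal, gain
);
CREATE TABLE calibration (
    calib_id   INTEGER PRIMARY KEY,
    type_id    INTEGER NOT NULL REFERENCES calibration_type(type_id),
    start_time TEXT,
    end_time   TEXT
);
""")
conn.executemany("INSERT INTO calibration_type VALUES (?, ?)",
                 [(1, "drift"), (2, "pedestal"), (3, "gain")])
# Adding a (hypothetical) 4th type is a row insert, not a new table
# plus new code to write, test, and maintain:
conn.execute("INSERT INTO calibration_type VALUES (4, 'timing')")
```

Two tables handle any number of calibration types; the unnormalized design needs a new table and new code for each one.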
TO SAVE TIME AND PRECIOUS PEOPLE RESOURCES!
Personnel consistency does not exist; application developers come and go regularly. The documentation that a design product provides will give the next developer an immediate understanding of the application in picture format.
Application sharing is enhanced when others can look at your design and determine whether the application is reusable in their environment. Sam is a good example of an application that 3 experiments are now using.
When an application is under construction, the ER diagram goes to every application meeting, and quite possibly the wallet of the application leader. It is the pictorial answer to many issues.
Planning for disk space has been an issue, the designer tool should assist with this task.
What do I need to plan for?
People, hardware, software, obsolescence, maintenance, emergencies.
How far out do I need to plan?
Initially 2-4 years.
How often do I need to review the plans?
What if my plan fails or looks undoable?
Nip it in the bud, be proactive, come up with options.
Will user requirements influence your hardware & software decisions?
Do you need replication?
What architecture is your api going to be?
How many users will be loading the database and hardware?
Backup and recovery procedures for VLDBs (very large databases) are difficult at best. A VLDB is normally defined as a multi-gigabyte or terabyte database. This is probably the most sensitive area when choosing a freeware database.
Hardware plays a part here as well. Ensure that when planning for hardware there is a plan for backup and recovery. Disk and tape may be needed.
Make a standards document and enforce its use. When dbas and developers are always on the same page, life is easier for both. Expectations are clear and defined. Anger and disappointment are lessened.
System as well as database standards need to be followed and enforced.
Yikes, we are down!
Everyone always wants 24x7 scheduled uptime. Until they see the cost.
Make anyone who insists on real 100% uptime justify it (and pay for it?). 98-99% uptime can be realized at a much lower cost.
Uptime requirements will influence, possibly dictate, database choices, hardware choices, fte requirements.
The cheapest method of addressing a failure is proactive planning.
Make sure your database and database software are backed up. Unless you are using a commercial database with roll-forward recovery, assume you will lose all DML since your last backup if you need to recover. This should dictate your backup schedule.
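To make the idea concrete, here is a sketch using SQLite's online backup API. In production you would use the vendor's tool (e.g. Oracle RMAN, mysqldump, pg_dump) and write to separate media; an in-memory target is used here only to keep the example self-contained.

```python
import sqlite3

# Source database with a little data in it.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE t (x INTEGER)")
src.execute("INSERT INTO t VALUES (1)")
src.commit()

# Copy the whole database, consistent as of this moment.
dst = sqlite3.connect(":memory:")
src.backup(dst)
print(dst.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 1
```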
Do not forget tape backups as a catastrophic recovery method.
Practice recovery on your integration and development databases. Practice different scenarios, delete a datafile, delete the entire database.
Replication is the process of copying and maintaining database objects in multiple databases that make up a distributed database system. Replication can improve the performance and protect the availability of applications because alternate data access options exist.
READ ONLY snapshot replication from a Sun box to Sun & Linux box(es) is being done in CDF. When a replica is under maintenance, there is failover to another replica. The replicas are up and running in read-only mode if the master is down for maintenance.
Oracle master to master replication allows for updates on both the master and replica sides.
Master-to-master is a complex and high-maintenance form of replication. It seems to be the 1st option the unwitting opt for. Both CERN and Fermilab DBAs have requested firm justification before considering this type of replication request.
Every link in the multi-master setup would need to be fully staffed, as downtime will be critical.
MySQL has replication in the last stable version (3.23.32, v4.1 is out). It is master-slave replication using binary log of operations on the server side. It is possible to build star or chain type structures.
There is a PostgreSQL replication tool. We have not tested it yet.
Space is the 1 area consistently underestimated in every application I have seen. In my opinion, initial data volume estimates have consistently been undersized by a factor of 2 or 3. For example, RunII events were estimated at 1 billion rows; this estimate was surpassed in Feb. 2004, and we will probably end up with 4-5 billion event rows. That is a lot of disk space.
Disk hardware becomes unsupported, and obsolete in what seems to be a blink of an eye.
All databases use disk to store data.
Good rule of thumb:
You need 10x the disk to hold a given amount of data in an RDB.
You will use as much disk space as you purchase, and then some.
Database indices will take AT LEAST as much space as the tables themselves, probably considerably more.
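A back-of-envelope application of the 10x rule of thumb above (all inputs are hypothetical assumptions, not measurements):

```python
# Hypothetical inputs: 5 billion event rows at an assumed 100 bytes per row.
rows          = 5_000_000_000
bytes_per_row = 100

raw_gb  = rows * bytes_per_row / 1e9   # raw data volume in GB
plan_gb = raw_gb * 10                  # the slide's 10x rule of thumb
print(f"raw data: {raw_gb:.0f} GB, plan for: {plan_gb:.0f} GB")
# raw data: 500 GB, plan for: 5000 GB
```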
Give WIDE lead time to purchase disk storage. New disks are not installed and configured over night. They require planning, downtime and $.
**WARNING: some of these may be database-specific.