From LOGICAL to PHYSICAL

From LOGICAL to PHYSICAL The steps needed to turn a logical data model into a ready-to-implement physical data model for an operational database. François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL - Topics • Logical names to physical names • Logical rules to physical rules • Logical structure to physical structure • Validation rules to reference data • Implementation strategies • Performance tuning • Data model maintenance François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Names • DBMS limitation: need for abbreviations • Naming standards & classwords • Physical naming conventions • List of acronyms and abbreviations • Automated name conversion • Impact of legacy • Data model maintenance François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL - Names DBMS Limitations DBMS limitations (length, etc.) on names for: • Columns • Tables & Views • Indexes • Constraints • Schemas & Databases • Triggers & Stored Procedures François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL - Names Naming Standards (1) • Definition A naming standard is a standard by which you can measure the quality of the names that have been coined or chosen for model components: subject areas, entities and tables, attributes and column-names, domains, validation rules, relationships, etc., in regards to applicable criteria: industry conventions, common usage, ease of understanding, relating and search, reduced homonymy and antinomy, identified synonyms, environmental limitations. François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL - Names Naming Standards (2) • Definition (continued) A naming convention is a documented practice of what kind of words are used and where in a name for what kind of object or concept. Examples of naming conventions are “major keywords” and “classwords”. The applicability of a naming convention is dependent and may vary on the type of the target metadata, i.e. entity names should use a different set of classwords than the ones used for attribute names. François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Names Entity/Table Classwords (partial list) François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Names Attribute/Column Classwords (partial list) François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Names Physical Naming Conventions • Name formats by model objects • Model object codification • Rules regarding prefixing • Rules regarding suffixing • Rules regarding appending sequence • Rules regarding when abbreviating is mandatory, optional, or excluded François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Names Acronyms and abbreviations lists • List of proposed acronyms • List of proposed abbreviations • List of adopted (standard) acronyms • List of adopted (standard) abbreviations • List of legacy acronyms • List of legacy abbreviations • List of “do not abbreviate” words François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Names Automated name conversion Naming Standard Metafile (NSM): • Procedure for maintaining the NSM • Treatment of special characters • Need to keep versions of the NSM • Which model uses which version • Procedure for hardening names François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Names Impact of legacy • Inventory of already implemented names (names from 3rd party vendor products may be excluded) • Inventory of already implemented words, acronyms and abbreviations • Highlight synonyms and homonyms • Decide which is part of the standard/legacy/DNA • Document cases of improper naming François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Names Data model maintenance • Reverse engineer DBMS generated names back into the physical model • Document & resolve differences • Harden implemented names • Names of model objects that have not been implemented yet should stay soft until frozen • Point to latest NSM François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL - Rules • Data format rules, domains • Optionallity: NULL or default value? • Value constraint: CHECK or F.K.? • Relationship cardinality to R.I. (triggers?) • Referential Integrity parent/child actions • Substitute key generation rules • DBMS maintained vs. Process maintained François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Rules Data format rules, domains • What is wrong with using NUMBER? Hint: a NUMBER PK for a table with a quintillion (1018) rows needs only to have NUMBER(18) • NUMBER(n): should n always be odd? • What is the maximum n for VARCHAR(n)? • Rules for BLOB’s, CLOB’s, etc. • INTEGER vs. NUMBER(9) • Unique datatypes from domains? • User defined logical datatypes • Using a Datatype Standard Metafile (DSM) François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Rules Optionallity: NULL or default value? • What does “NULL” mean? (“?” or “N/A”?) • Use of NULL makes joins & coding more “complicated” (elaborate – e.g. for DB2) • For some DBMS (but not for DB2), recursive relationships require the availability of NULL • Default for dates: 9999/12/31 • Default for codes: “ZZZ”, “U”, “0”, “ ” … • Default for text: “ ” • Default for amounts: problem! • Disallowed optionallity forces higher degree of normalization François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Rules Value constraint: CHECK or F.K.? • Answer: it depends on number of values and how often the constraint can change • Flags: use the CHECK constraint • More than “n” values: use reference table • CHECK constraint more performant • Reference table more flexible • Reference table documents, provided a name or description is always associated with the value • Don’t put a validation rule in the code, only the consequence of a violation! François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL - Rules Relationship cardinality to R.I. (triggers?) • Some DBMS don’t like generated R.I. triggers • R.I constraints do not replace R.I. triggers • R.I. triggers to not replace R.I. constraints • Especially needed to enforce mutual exclusivity where subtyping structure is preserved François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Rules Referential Integrity parent/child actions • 3 types of R.I. action: CASCADE, RESTRICT, SET NULL - Applies on parent/child insert, update, or delete • Using RESTRICT only, forces any other needed action to be specified into the code: lost opportunity • SET NULL is great for optional FK’s • CASCADE is dangerous if impact not fully analyzed, but performant – simplifies code – and mostly applicable to supertype-subtype relationships • Guidelines must be set do indicate when to use, or when not to use, any of these 3 actions, depending on cardinality case & other factors • Implement the guidelines in the model R.I. parameters François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Rules Substitute key generation rules • DBMS will maintain next available value for a sequence – some numbers may be skipped • Multiple sequences (one per zone) can be maintained for the same PK for multiple applications, but zones may eventually collide • Sequentially generated PK’s can cause index page lock and page split problems which a randomly generated PK doesn’t have • How to guarantee uniqueness of a randomly generated PK? A: Concatenation of flipped timestamp concatenated and random number • Sequences are O.K. for OID’s (OID’s should never be PK’s) François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Rules DBMS maintained vs. Process maintained • Data related rules vs. process related rules • Multiple paths to the data: consistent application of the rules require their maintenance at one, and only one, place • Code maintenance gradually more expensive • Data driven processes are more flexible than program code driven processes and easier to change (more modular), but require more programming • Alternate: business rules engine where rules are codified as data driving the processes of the engine. • Alternate: data stream transformed by stored procedures into O-O message stream to be processed by methods François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL - Structure • Tradeoffs: space vs. flexibility vs. performance • Impact of disallowing NULL values • Normalization vs. denormalization • Natural keys vs. Substitute Keys • Data correction vs. data change • Current vs. Historical • Transforms on super/sub-types • Rarely used related columns in a child table • Views François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Structure Space vs. Flexibility vs. Performance • DB design tradeoff: gains in performance can only be obtained at the expense of optimal space utilization, with less flexibility, or both: denormalization, adding redundant data, reduce reference tables • DB design tradeoff: optimization in space utilization can only be obtained at the expense of performance, with less flexibility, or both: normalization up to 4th normal form, compression • DB design tradeoff: greater flexibility can only be obtained at the expense of optimal space utilization, and/or performance: normalization beyond 4th normal form, views • Concentrating only on performance yields short term benefits has hidden long term costs François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Structure Impact of disallowing NULL values • Increase in the number of tables to be maintained • Disallows direct recursive R.I. (except for DB2) • Increase in the number of joins in queries • Forces creating an artificial code in each reference table that corresponds to “N/A”, or to give a 2nd meaning to the code that corresponds to “Unknown”. • Forces R.I. check every time an FK is reset where a SET NULL would have been in force. • Disallows upper transforms on super/sub-types relationships • Disallows outer joins in views François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Structure Normalization vs. denormalization Normalized structure: • Reduces redundant data • Optimizes space utilization • Increases flexibility • Performance lost on increased joins • Natural for logical data models Denormalized structure: • Improves performance on reducing joins & fetches • Additional processes needed to maintain redundant data (added risk of failure) • Loss of flexibility, structure less resilient Bottom line: denormalization should be done judiciously, only where warranted, and the justification should be fully documented. François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Structure Natural keys vs. Substitute Keys • The habit of blindly making every PK a substitute key is very costly. It increases the need for joins, thus reducing performance, the number of indexes that are not really needed, thus reducing performance, and the number of sequences the system has to maintain, thus reducing performance. • Substitute keys should be used only when there is a need for a primary key, and: • When there is no natural key, or • The access to the natural key needs to be restricted, or • When the natural key is getting too long • OID’s (Object Identifiers) are a form of substitute key, but should never be used as PK’s (always AK’s) François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Structure Data correction vs. data change • Both will alter the data, but the meaning of the alteration will not be the same • A change corresponds to altering the data to match the altered reality. The change is effective as of when the reality was altered, or as of when the upstream business became aware of it. The alteration can in the future (planned). • A correction adjusts the data to match the existing reality. It is retroactive as to when the data had ceased to match reality. • Data corrections and data changes must be treated differently (i.e. update vs. insert) François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Structure Current vs. Historical How often have you heard: “We don’t need to keep the history of the data {now}*, we only need {now}* to know what the current value is,” during system of record design? *(most of the time, left out of the conversation) • Truth: having to deal with history, where not immediately needed, will increase the size and cost of a project • Truth: the “we” may exclude people not part of the project for whom history would have been useful downstream • Truth: not dealing now with history, where not immediately needed, safely, requires immediate availability through the data warehouse • Truth: In a logical data model, where the value of an attribute can change with time, that attribute is placed in a time dimensioned dependent entity • Truth: Keeping only current data corresponds to: a) denormalizing, b) getting rid of a table, c) getting rid of data, d) losing the distinction between changes and corrections • Truth: if we are just talking about reference data, it is probably O.K. François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Structure Transforms on super/sub-types • 3 types of logical-to-physical transforms of supertypes and their suptypes: • Identity transform (simplest), the physical structure reflects the logical structure, R.I. reflects the logical relationships • Up transform, all the subtypes are merged into the supertype • Down transform, the supertype is merged into each subtype • Up & down transforms are denormalizations • An up transform will cause loss of business rules (e.g. an attribute is mandatory or absent for a given subtype) • A down transform will cause increase in constraints (each child relationship the supertype has must be carried to each of the subtypes), and a common PK must be maintained. Any parent relationship the supertype has is problematical. François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Structure Rarely used related columns • Optional columns that are given a value in rare cases can be thrown in a child table of their own, with a (1-1) to (0-1) relationship, if the original table is sizeable • Two parameters need to be determined: • How big the parent table needs to be • How infrequent the use of the column(s) need to be • The code that updated the parent table now needs to update the new child table • A read-only view can be created that joins parent and child table François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Structure Views • The most maligned data base object: most of the negative press that has been said about views is either old info that no longer applies, or has never been true. • Performance hit: minimal (5-7%) for well built views • Views have multiple uses and can be used for multiple reasons: • To protect the underlying structure and sensitive data • To reduce impact of structural changes to processes • To reduce the complexity of processes • To narrow area of query • To provide derivable data • And soon: PK for R.I. constraints based on subset of underlying table PK (would solve up/down transform impact on constraints) François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Ref. Data • External standard codes • List of valid values in a validation rule • List of valid values in a spreadsheet • Shared reference data • Forbid reference code reuse François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Ref. Data External standard codes • There are codes out there that are already part of national/international standard • Banks • Currency • Gender • Geographical locations • Industry types • Languages • If there is more than one standard and none of the codes are guaranteed not to change (mutable PK), then create your own internal code and associate it to the external ones used for B2B (same recommendation as for external identifiers) • Otherwise, borrow one (of them) with the highest compatibility; there is no need to reinvent a convention that would require API’s to communicate with external systems François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Ref. Data List of valid values in a validation rule • Documents the meaning of reference code values with the data model • Can be used to highlight misuse of a code, or lack of completeness, when reported upon • Can be used to automatically generate initial load INSERT statements • Good only for simple reference entities (3 attributes, or less) François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Ref. Data List of valid values in a spreadsheet • Better solution for reference entities • That are recursive • 4 attributes or more • More than x rows (e.g. where x > 25) • Easy to generate initial load INSERT statements from a spreadsheet from a formula • Validation rule can mention spreadsheet file location • Forward engineering script can include the name of a file containing pre-generated INSERT statements François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Ref. Data Shared reference data • Policy: shared reference codes are never physically deleted • R.I. can exist between parent replicated shared reference tables and their logical children across systems • Shared reference data maintained in one place which is the source of replicated reference data • Replication of reference data is quick, and refresh rate controllable by target • Centralization of shared reference data maintenance has no impact on applications, eliminates need for API, and insures all systems have the same codes, with same meaning, at the same time (or close) • Change to centralized shared reference data must be tightly controlled due to the spreading of the impact on multiple systems and will involve more than one data steward. François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Ref. Data Forbid reference code reuse • “We ran out of codes, and this code doesn’t seem to be used any longer for anything; let’s pick that one.” • For an operational database with no history kept, no impact • For data warehouse, data marts: nightmare • That is just postponing code length expansion: at that point, it will eventually happen; better plan to expand/replace than to reuse before it becomes prohibitively expensive; use views making the new code look like the old one to ease in the new code. • Forbid reference code reuse (a form of procrastination), especially for shared reference codes where the chance reuse would seriously impact a system is greater François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Strategies • Implementation in Phases • Status, audit columns & shadow tables • O-O services through stored procedures • Filter sensitive data through views • Logical deletion + physical purging • Preservation of information reliability • Procedures for data stewardship transfer François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Strategies Implementation in Phases • Segmentation of projects: early deliverables, project more manageable • Use a subject area per implementation phase • Implementing all reference tables first is a workable strategy: • easy to set, • allows programmers what to expect at their fingertips, • user interfaces for reference maintenance can be built • Lays a foundation for the rest of the system being built • Caution: logical modeling work is not compatible with physical implementation phases François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Strategies Status, audit columns & shadow tables • Audit columns: when and who created & last updated, • A “shadow” table contains triggered result of any non-read action against the table it shadows; it is a copy of that table + action code + timestamp • Physical-only model objects that can be script-generated and the result DDL reverse engineered back • Physical-only: to keep the logical model clutter-clean • If the model is small, the tables/columns can be copied instead • Shadow tables are for complete audit tracking only • For installations without data warehouse, shadow tables are a very good seed investment • Most audit columns are commonly absent from views François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Strategies Filter sensitive data through views • Sensitive data: identity information • Grants on tables different from views • Special grants on views with sensitive columns • Using an encrypting/decrypting view on encrypted table columns: even the programmer doesn’t know what encryption algorithm is being used. • Regular views are without the sensitive columns François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Strategies Logical deletion + physical purging • A physical deletion is a more expensive transaction than an update • All fundamental & associative tables are given a status code column • If STATUS_CDE = “D”, the instance is considered logically deleted from the table • If the instance was deleted in error, the user can “undo” the deletion, and the previous status is restored. • After an instance has been deleted for a certain period of time (to be determined), the instance can be purged • A background process can run periodically to physically delete instances that have been logically deleted for longer than the determined amount of time François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Strategies Preservation of information reliability • How reliable is the data of a given table instance? Possible strategies: • Include a “verified flag” • Include a “verifier id” and a “verification date/time” • Include a “data provider id” for which the data provider instance has a “reliability ratio” • Same as above, but the “reliability ratio” is also dependent on the type of information provided François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Strategies Procedures for data stewardship transfer • Establish and document data stewardship transfer protocols for shared data • Diagram the state flow for the shared data showing data stewardship transfer points • Document case scenarios of data stewardship transfer points showing various state flow progressions • Add state flow column to shared tables • Add validation rule including valid precursor/successor states • Relate specific states to exclusive/shared data stewardship • Add stored procedures to update the state column François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Tuning • Non-unique indexes: peanut butter layer • Partitioning and load balancing • Some tables loaded in core memory • Transaction performance analysis • Materialization of some views • Summary columns to reduce recalculation • Volume (“stress”) testing François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Tuning Non-unique indexes: peanut butter layer • Operational databases: heavy on inserts, updates, and deletes • Every time a row is inserted or deleted, all the indexes are updated, every time a index member column is updated, that index is updated • Indexes make queries faster, and everything else slower. Doubling the number of indexes on a table is like doubling the thickness of a peanut butter layer on a slice of bread (toasted, or not) François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Tuning • Non-unique indexes: peanut butter layer • Partitioning and load balancing • Some tables loaded in core memory • Transaction performance analysis • Materialization of some views • Summary columns to reduce recalculation • Volume (“stress”) testing François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Maintenance • Logical  Physical  Database • Database  Physical  Logical • Document justification for alterations • Impact analysis • Data Police • Continue transaction performance analysis • Volumetrics for planned growth • Document semantic shifts François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL – Discussion • Questions • Comments • Personal experiences • Suggestions • Thank you François C. Cartier, e-Modelers

From LOGICAL to PHYSICAL