data warehousing n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Data Warehousing PowerPoint Presentation
Download Presentation
Data Warehousing

Loading in 2 Seconds...

play fullscreen
1 / 26

Data Warehousing - PowerPoint PPT Presentation


  • 86 Views
  • Uploaded on

Data Warehousing. DSCI 4103 Dr. Mennecke. Introduction and Chapter 1. Introduction:. Definitions Legacy Systems Dimensions Data Dependencies Model Dimensional Model. Ship Type. Shipper. Ship To. Product. District Credit. Order Item. ContactLocat. Product Line. Sales Order.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

Data Warehousing


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
data warehousing

Data Warehousing

DSCI 4103

Dr. Mennecke

Introduction and Chapter 1

introduction
Introduction:
  • Definitions
    • Legacy Systems
    • Dimensions
    • Data Dependencies Model
    • Dimensional Model
an er model

ShipType

Shipper

Ship To

Product

DistrictCredit

OrderItem

ContactLocat.

ProductLine

SalesOrder

Cust.Locat.

ProductGroup

Contract

ContractType

Customer

Contact

SalesRep

SalesDistrict

SalesRegion

SalesDivision

An ER Model
a dimensional model
A Dimensional Model

Time

Market

Product

why data warehouses
Why Data Warehouses?
  • To meet the long sought after goal of providing the user with more flexible data bases containing data that can be accessed “every which way.”
oltp vs olap
OLTP vs. OLAP
  • OLTP (Online transaction processing) has been the standard reason for IS and DP for the last thirty years. Most legacy systems are quite good at capturing data but do not facilitate data access.
  • OLAP (Online analytical processing) is a set of procedures for defining and using a dimension framework for decision support
the goals for and characteristics of a dw
The Goals for and Characteristics of a DW
  • Make organizational data accessible
  • Facilitate consistency
  • Adaptable and yet resilient to change
  • Secure and reliable
  • Designed with a focus on supporting decision making
the goals for and characteristics of a dw1
The Goals for and Characteristics of a DW
  • Generate an environment in which data can be sliced and diced in multiple ways
  • It is more than data, it is a set of tools to query, analyze, and present information
  • The DW is the place where operational data is published (cleaned up, assembled, etc.)
basic elements of the data warehouse
Basic elements of the data warehouse

Operational Source Systems

DataStaging Area

DataPresentation Area

DataAccessTools

  • Services:
    • Clean, combine, and standardizeConform DimensionsNo user query services
  • Data Store:
    • Flat files and relational tables
  • Processing:
    • Sorting and sequential processing
  • Data Mart #1
    • DimensionalAtomic and summary dataBased on a single business process

Ad hoc query tools

Report Writers

Analytical Applications

Modeling:

Forecasting

Scoring

Data Mining

Extract

Load

Access

DW Bus:Conformed facts and dimensions

Extract

  • Data Mart #2
    • Similar design

Extract

Load

Access

data staging area
Data Staging Area
  • Extract-Transformation-Load
    • Extract: Reading the source data and copying the data to the staging area
    • Transformation:
      • Cleaning
      • Combining
      • Duplicating
      • Assigning keys
    • Load: present data to the bulk loading facilities of the data mart
organization of data in the presentation area of the data warehouse
Organization of data in the presentation area of the data warehouse
  • Data in the warehouse are dimensional, not normalized relations
    • However, data that are ultimately presented in the data warehouse will often be derived directly from relational DBs
  • Data should be atomic someplace in the warehouse; even if the presentation is aggregate
  • Uses the bus architecture to support a decentralized set of data marts
updates to a data warehouse
Updates to a data warehouse
  • For many years, the dogma stated that data warehouses are never updated.
  • This is unrealistic since labels, titles, etc. change.
  • Some components will, therefore, be changed; albeit, via a managed load (as opposed to transactional updates)
dimensional modeling terms and concepts
Dimensional Modeling Terms and Concepts
  • Fact table
  • Dimension tables
fact tables
Fact Tables
  • Fact table: a table in the data warehouse that contains
    • Numerical performance measures
    • Foreign keys that tie the fact table to the dimension tables
fact tables1
Fact Tables
  • Each row records a measurement describing a transaction
    • Where?
    • When?
    • Who?
    • How much?
    • How many?
  • The level of detail represented by this data is referred to as the grain of the data warehouse
    • Questions can only be asked down to a level corresponding with the grain of the data warehouse
fact tables2
Fact Tables
  • Fact tables contain numeric data that can be one of three types
    • Additive
    • Semi-additive
    • Non-additive
  • Fact tables contain foreign keys
    • A group of foreign keys will be used to create a concatenated primary key
  • Fact tables generally don’t contain textual data
dimension tables
Dimension tables
  • Tables containing textual descriptors of the business
    • Dimension tables are usually wide (e.g., 100 columns)
    • Dimension tables are usually shallow (100s of thousand or a few million rows)
    • Values in the dimensions usually provide
      • Constraints on queries (e.g., view customer by region)
      • Report headings
dimension tables1
Dimension tables
  • The quality of the dimensions will determine the quality of the data warehouse; that is, the DW is only as good as its dimension attributes
  • Dimensions are often split into hierarchical branches (i.e., snowflakes) because of the hierarchical nature of organizations
    • Product part  Product  Brand
  • Dimensions are usually highly denormalized
dimension tables2
Dimension tables
  • The dimension attributes define the constraints for the DW. Without good dimensions, it becomes difficult to narrow down on a solution when the DW is used for decision support
bringing together facts and dimensions building the dimensional model
Bringing together facts and dimensions – Building the dimensional Model
  • Start with the normalized ER Model
  • Group the ER diagram components into segments based on common business processes and model each as a unit
  • Find M:M relationships in the model with numeric and additive non-key facts and include them in a fact table
  • Denormalize the other tables as needed and designate one field as a primary key
a dimensional model1

Sales Fact

Time Dimension

Product Dimension

time_key

product_key

store_key

dollars_sold

units_sold

dollars_cost

time_key

day_of_Week

month

quarter

year

holiday_flag

product_key

description

brand

category

Store Dimension

store_key

store_name

address

floor_plan_type

A Dimensional Model
so what is a dw
So, What is a DW?
  • A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management’s decisionsW.H. Inmon (the father of DW)
subject oriented
Subject Oriented
  • Data in a data warehouse are organized around the major subjects of the organization
integrated
Integrated
  • Data from multiple sources are standardized (scrubbed, cleansed, etc.) and brought into one environment
non volatile
Non-Volatile
  • Once added to the DW, data are not changed (barring the existence of major errors)
time variant
Time Variant
  • The DW captures data at a specific moment, thus, it is a snap-shot view of the organization at that moment in time. As these snap-shots accumulate, the analyst is able to examine the organization over time (a time series!)
  • The snap-shot is called a production data extract