Intro to data warehouse
This presentation is the property of its rightful owner.
Sponsored Links
1 / 49

Intro. to Data Warehouse PowerPoint PPT Presentation


  • 55 Views
  • Uploaded on
  • Presentation posted in: General

Intro. to Data Warehouse. รศ.ดร. วรพจน์ กรีสุระเดช Worapoj Kreesuradej , Ph.D. Ass ociate Professor Data Mining & Data Exploration Laboratory (DME Lab), Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang , Web: www.it.kmitl.ac.th/dme

Download Presentation

Intro. to Data Warehouse

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Intro to data warehouse

Intro. to Data Warehouse

รศ.ดร. วรพจน์ กรีสุระเดช

WorapojKreesuradej, Ph.D.

Associate Professor

Data Mining & Data Exploration Laboratory (DME Lab),

Faculty of Information Technology,

King Mongkut's Institute of Technology Ladkrabang,

Web: www.it.kmitl.ac.th/dme

Email: [email protected]


Intro to data warehouse

Book

  • PaulrajPonniah, Data Warehousing Fundamentals, John Wiley & Sons, 2001.

  • Ralph Kimbal and Margy Ross, The Data Warehouse Toolkit, John Wiley and Sons, 2002.


Definition of dw

Definition of DW

  • “A collection of integrated, subject-oriented databases designed to supply the information required for decision-making.”- W. Inmon

  • A decision support database that is maintained separately from the organization’s operational databases.

  • A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format –E. Turban and etc.


R kimball s definition of a dw

R. Kimball’s definition of a DW

  • A data warehouse is a copy of transactional data specifically structured for querying and analysis.


Problem data management in large enterprises

Problem: Data Management in Large Enterprises

  • Vertical fragmentation of informational systems

  • Result of application (user)-driven development of operational systems

Sales Planning

Suppliers

Num. Control

Stock Mngmt

Debt Mngmt

Inventory

...

...

...

Sales Administration

Finance

Manufacturing

...


Problem data management in large enterprises1

?

Source

Source

Problem: Data Management in Large Enterprises

  • Two Approaches for accessing data:

    • Query-Driven (Lazy)

    • Warehouse (Eager)


The need for dw

The Need for DW

  • Query-driven (lazy, on-demand)

Clients

Metadata

Integration System

. . .

Wrapper

Wrapper

Wrapper

. . .

Source

Source

Source


Disadvantages of query driven approach

Disadvantages of Query-Driven Approach

  • Delay in query processing

  • Inefficient and potentially expensive for frequent queries

  • Competes with local processing at sources


The warehousing approach

The Warehousing Approach

  • Information integrated in advance

  • Stored in wh for direct querying and analysis

Clients

Data

Warehouse

Metadata

Integration System

. . .

Extractor/

Monitor

Extractor/

Monitor

Extractor/

Monitor

. . .

Source

Source

Source


Advantages of warehousing approach

Advantages of Warehousing Approach

  • High query performance

  • Doesn’t interfere with local processing at sources

  • Information copied at warehouse

    • Can modify, annotate, summarize, restructure, etc.

    • Can store historical information

    • Security, no auditing


Characteristics of dw

Characteristics of DW


Subject oriented

Subject Oriented

  • Data Warehouse is designed around

    “subjects” rather than processes

  • A company may have

    • Retail Sales System

    • Outlet Sales System

    • Catalog Sales System

  • DW will have a Sales Subject Area


Subject oriented1

OLTP Systems

Catalog Sales

System

Retail Sales System

Outlet Sales

System

Sales Subject Area

Data Warehouse

Subject-Oriented Sales Information

Subject Oriented


Integrated

Integrated

  • Heterogeneous Source Systems

  • Need to Integrate source data

  • For Example: Product codes could be different in different systems

  • Arrive at common code in DW


Integrated1

Integrated

  • Information integrated in advance

  • Stored in DW for direct querying and analysis

Clients

Data

Warehouse

Metadata

Integration System

. . .

Extractor/

Monitor

Extractor/

Monitor

Extractor/

Monitor

. . .

Source

Source

Source


Non volatile

Non-Volatile

  • Operational update of data does not occur in the data warehouse environment.

    • Does not require transaction processing, recovery, and concurrency control mechanisms

    • Requires only two operations in data accessing:

      • initial loading of data and access of data.


Non volatile read mostly

Non-Volatile(Read-Mostly)

Write

OLTP

USER

Read

DW

USER

Read


Time variant

Time Variant

  • The time horizon for the data warehouse is significantly longer than that of operational systems.

    • Operational database: current value data.

    • Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)


Time variant1

Time Variant

  • Most business analysis has a time component

  • Trend Analysis (historical data is required)

Sales

2001 2002 2003 2004


Data warehousing process overview

Data Warehousing Process Overview


Data warehousing process overview1

Data Warehousing Process Overview

  • The major components of a data warehousing process

    • Data sources

    • Data extraction

    • Data loading

    • Comprehensive Database/Data Store

    • Data Mart

    • Metadata

    • Middleware tools /information delivery tools


Intro to data warehouse

ETL

  • Data Extraction

  • Data Cleaning and Transformation

    • Convert from legacy/host format to warehouse format

  • Load

    • Sort, summarize, consolidate, compute views, check integrity, build indexes, partition


The etl process

The ETL Process

Source

Systems

Staging

Area

DW Database

Transform

Extract

Load


Intro to data warehouse

Data Staging Area

  • A storage area where extracted data is cleaned, transformed and deduplicated.

  • Initial storage for data

  • Need not be based on Relational model

  • Mainly sorting and Sequential processing

  • Does not provide data access to users

  • Analogy – kitchen of a restaurant


Intro to data warehouse

ETL Process

Issues & Challenges

  • Consumes 70-80% of project time

  • Heterogeneous Source Systems

  • Little or no control over source systems

  • Source systems scattered

  • Different currencies, measurement units

  • Ensuring data quality


Comprehensive database data store

Comprehensive Database/Data Store

  • Mostly a relational DB

    • Oracle, DB2, Sybase, SQL Server

  • New DB design for special purpose of DW (e.g., scale up, speed up, parallel processing)


Data warehouse design

Data Warehouse Design

  • OLTP Systems are Data Capture Systems

  • “DATA IN” systems

  • DW are “DATA OUT” systems

OLTP

DW


Dimensional modeling

Dimensional Modeling

  • Facts are stored in FACT Tables

  • Dimensions are stored in DIMENSION tables

  • Dimension tables contains textual descriptors of business

  • Fact and dimension tables form a Star Schema

  • “BIG” fact table in center surrounded by “SMALL” dimension tables


Star schema

Star Schema


Intro to data warehouse

Star Schema


Data mart

Data mart

  • Data mart = subset of DW for community users, e.g. accounting department

  • Sometimes exist as Multidimensional Database

  • Info mart = summarized data + report for community users


Meta data

Meta Data

  • Data about data

  • Needed by both information technology personnel and users

  • IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc.

  • Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.


Information delivery tools

Information Delivery Tools

  • Tools

    • Query & reporting

    • OLAP

    • Data mining, visualization, segmentation, clustering

    • New developments: text mining, web mining & personalization

    • Mining multimedia data


Information delivery tools1

Information Delivery Tools

  • Commercial tools

    • Crystal Report, Impromptu, WebFocus

  • Increasingly common mode of delivery: Web-enabled


Data warehouse architecture

Data Warehouse Architecture

  • Data Flow Architecture

  • System Architecture


Data flow architecture

Data Flow Architecture


Data flow architecture1

Data Flow Architecture


Data flow architecture2

Data Flow Architecture

  • Operational data stores (ODS)

    A type of database often used as an interim area for a data warehouse, especially for customer information files

  • MDB=Multidimensional databases


System architectures

System Architectures

  • Three parts of the data warehouse

    • The data warehouse that contains the data and associated software

    • Data acquisition (back-end) software that extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse

    • Client (front-end) software that allows users to access and analyze data from the warehouse


System architectures1

System Architectures


System architectures2

System Architectures


System architecture

System Architecture


System architecture1

System Architecture


Data warehouse development

Data Warehouse Development

  • Data warehouse development approaches

    • Inmon Model: EDW approach, Enterprise-wide warehouse, top down

    • Kimball Model: Data mart approach, Data mart, bottom up

  • Which model is best?

    • There is no one-size-fits-all strategy to data warehousing

    • When properly executed, both result in an enterprise-wide data warehouse, but with different architectures


The data mart strategy

The Data Mart Strategy

  • The most common approach

  • Begins with a single mart and architected marts are added over time for more subject areas

  • Relatively inexpensive and easy to implement

  • Can be used as a proof of concept for data warehousing

  • Can perpetuate the “silos of information” problem

  • Can postpone difficult decisions and activities

  • Requires an overall integration plan


The enterprise wide strategy

The Enterprise-wide Strategy

  • A comprehensive warehouse is built initially

  • An initial dependent data mart is built using a subset of the data in the warehouse

  • Additional data marts are built using subsets of the data in the warehouse

  • Like all complex projects, it is expensive, time consuming, and prone to failure

  • When successful, it results in an integrated, scalable warehouse


Dw lifecycle ralph kimball

DW Lifecycle (Ralph Kimball )


Data warehouse development1

Data Warehouse Development

  • Some best practices for implementing a data warehouse (Weir, 2002):

    • Project must fit with corporate strategy and business objectives

    • There must be complete buy-in to the project by executives, managers, and users

    • It is important to manage user expectations about the completed project

    • The data warehouse must be built incrementally

    • Build in adaptability


Data warehouse development2

Data Warehouse Development

  • Some best practices for implementing a data warehouse (Weir, 2002):

    • The project must be managed by both IT and business professionals

    • Develop a business/supplier relationship

    • Only load data that have been cleansed and are of a quality understood by the organization

    • Do not overlook training requirements

    • Be politically aware


  • Login