Intro to data warehouse
Download
1 / 49

Intro. to Data Warehouse - PowerPoint PPT Presentation


  • 92 Views
  • Uploaded on

Intro. to Data Warehouse. รศ.ดร. วรพจน์ กรีสุระเดช Worapoj Kreesuradej , Ph.D. Ass ociate Professor Data Mining & Data Exploration Laboratory (DME Lab), Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang , Web: www.it.kmitl.ac.th/dme

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Intro. to Data Warehouse' - sanne


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Intro to data warehouse

Intro. to Data Warehouse

รศ.ดร. วรพจน์ กรีสุระเดช

WorapojKreesuradej, Ph.D.

Associate Professor

Data Mining & Data Exploration Laboratory (DME Lab),

Faculty of Information Technology,

King Mongkut's Institute of Technology Ladkrabang,

Web: www.it.kmitl.ac.th/dme

Email: [email protected]


Book

  • PaulrajPonniah, Data Warehousing Fundamentals, John Wiley & Sons, 2001.

  • Ralph Kimbal and Margy Ross, The Data Warehouse Toolkit, John Wiley and Sons, 2002.


Definition of dw
Definition of DW

  • “A collection of integrated, subject-oriented databases designed to supply the information required for decision-making.” - W. Inmon

  • A decision support database that is maintained separately from the organization’s operational databases.

  • A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format –E. Turban and etc.


R kimball s definition of a dw
R. Kimball’s definition of a DW

  • A data warehouse is a copy of transactional data specifically structured for querying and analysis.


Problem data management in large enterprises
Problem: Data Management in Large Enterprises

  • Vertical fragmentation of informational systems

  • Result of application (user)-driven development of operational systems

Sales Planning

Suppliers

Num. Control

Stock Mngmt

Debt Mngmt

Inventory

...

...

...

Sales Administration

Finance

Manufacturing

...


Problem data management in large enterprises1

?

Source

Source

Problem: Data Management in Large Enterprises

  • Two Approaches for accessing data:

    • Query-Driven (Lazy)

    • Warehouse (Eager)


The need for dw
The Need for DW

  • Query-driven (lazy, on-demand)

Clients

Metadata

Integration System

. . .

Wrapper

Wrapper

Wrapper

. . .

Source

Source

Source


Disadvantages of query driven approach
Disadvantages of Query-Driven Approach

  • Delay in query processing

  • Inefficient and potentially expensive for frequent queries

  • Competes with local processing at sources


The warehousing approach
The Warehousing Approach

  • Information integrated in advance

  • Stored in wh for direct querying and analysis

Clients

Data

Warehouse

Metadata

Integration System

. . .

Extractor/

Monitor

Extractor/

Monitor

Extractor/

Monitor

. . .

Source

Source

Source


Advantages of warehousing approach
Advantages of Warehousing Approach

  • High query performance

  • Doesn’t interfere with local processing at sources

  • Information copied at warehouse

    • Can modify, annotate, summarize, restructure, etc.

    • Can store historical information

    • Security, no auditing



Subject oriented
Subject Oriented

  • Data Warehouse is designed around

    “subjects” rather than processes

  • A company may have

    • Retail Sales System

    • Outlet Sales System

    • Catalog Sales System

  • DW will have a Sales Subject Area


Subject oriented1

OLTP Systems

Catalog Sales

System

Retail Sales System

Outlet Sales

System

Sales Subject Area

Data Warehouse

Subject-Oriented Sales Information

Subject Oriented


Integrated
Integrated

  • Heterogeneous Source Systems

  • Need to Integrate source data

  • For Example: Product codes could be different in different systems

  • Arrive at common code in DW


Integrated1
Integrated

  • Information integrated in advance

  • Stored in DW for direct querying and analysis

Clients

Data

Warehouse

Metadata

Integration System

. . .

Extractor/

Monitor

Extractor/

Monitor

Extractor/

Monitor

. . .

Source

Source

Source


Non volatile
Non-Volatile

  • Operational update of data does not occur in the data warehouse environment.

    • Does not require transaction processing, recovery, and concurrency control mechanisms

    • Requires only two operations in data accessing:

      • initial loading of data and access of data.


Non volatile read mostly
Non-Volatile(Read-Mostly)

Write

OLTP

USER

Read

DW

USER

Read


Time variant
Time Variant

  • The time horizon for the data warehouse is significantly longer than that of operational systems.

    • Operational database: current value data.

    • Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)


Time variant1
Time Variant

  • Most business analysis has a time component

  • Trend Analysis (historical data is required)

Sales

2001 2002 2003 2004


Data warehousing process overview
Data Warehousing Process Overview


Data warehousing process overview1
Data Warehousing Process Overview

  • The major components of a data warehousing process

    • Data sources

    • Data extraction

    • Data loading

    • Comprehensive Database/Data Store

    • Data Mart

    • Metadata

    • Middleware tools /information delivery tools


ETL

  • Data Extraction

  • Data Cleaning and Transformation

    • Convert from legacy/host format to warehouse format

  • Load

    • Sort, summarize, consolidate, compute views, check integrity, build indexes, partition


The etl process
The ETL Process

Source

Systems

Staging

Area

DW Database

Transform

Extract

Load


Data Staging Area

  • A storage area where extracted data is cleaned, transformed and deduplicated.

  • Initial storage for data

  • Need not be based on Relational model

  • Mainly sorting and Sequential processing

  • Does not provide data access to users

  • Analogy – kitchen of a restaurant


ETL Process

Issues & Challenges

  • Consumes 70-80% of project time

  • Heterogeneous Source Systems

  • Little or no control over source systems

  • Source systems scattered

  • Different currencies, measurement units

  • Ensuring data quality


Comprehensive database data store
Comprehensive Database/Data Store

  • Mostly a relational DB

    • Oracle, DB2, Sybase, SQL Server

  • New DB design for special purpose of DW (e.g., scale up, speed up, parallel processing)


Data warehouse design
Data Warehouse Design

  • OLTP Systems are Data Capture Systems

  • “DATA IN” systems

  • DW are “DATA OUT” systems

OLTP

DW


Dimensional modeling
Dimensional Modeling

  • Facts are stored in FACT Tables

  • Dimensions are stored in DIMENSION tables

  • Dimension tables contains textual descriptors of business

  • Fact and dimension tables form a Star Schema

  • “BIG” fact table in center surrounded by “SMALL” dimension tables




Data mart
Data mart

  • Data mart = subset of DW for community users, e.g. accounting department

  • Sometimes exist as Multidimensional Database

  • Info mart = summarized data + report for community users


Meta data
Meta Data

  • Data about data

  • Needed by both information technology personnel and users

  • IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc.

  • Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.


Information delivery tools
Information Delivery Tools

  • Tools

    • Query & reporting

    • OLAP

    • Data mining, visualization, segmentation, clustering

    • New developments: text mining, web mining & personalization

    • Mining multimedia data


Information delivery tools1
Information Delivery Tools

  • Commercial tools

    • Crystal Report, Impromptu, WebFocus

  • Increasingly common mode of delivery: Web-enabled


Data warehouse architecture
Data Warehouse Architecture

  • Data Flow Architecture

  • System Architecture




Data flow architecture2
Data Flow Architecture

  • Operational data stores (ODS)

    A type of database often used as an interim area for a data warehouse, especially for customer information files

  • MDB=Multidimensional databases


System architectures
System Architectures

  • Three parts of the data warehouse

    • The data warehouse that contains the data and associated software

    • Data acquisition (back-end) software that extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse

    • Client (front-end) software that allows users to access and analyze data from the warehouse






Data warehouse development
Data Warehouse Development

  • Data warehouse development approaches

    • Inmon Model: EDW approach, Enterprise-wide warehouse, top down

    • Kimball Model: Data mart approach, Data mart, bottom up

  • Which model is best?

    • There is no one-size-fits-all strategy to data warehousing

    • When properly executed, both result in an enterprise-wide data warehouse, but with different architectures


The data mart strategy
The Data Mart Strategy

  • The most common approach

  • Begins with a single mart and architected marts are added over time for more subject areas

  • Relatively inexpensive and easy to implement

  • Can be used as a proof of concept for data warehousing

  • Can perpetuate the “silos of information” problem

  • Can postpone difficult decisions and activities

  • Requires an overall integration plan


The enterprise wide strategy
The Enterprise-wide Strategy

  • A comprehensive warehouse is built initially

  • An initial dependent data mart is built using a subset of the data in the warehouse

  • Additional data marts are built using subsets of the data in the warehouse

  • Like all complex projects, it is expensive, time consuming, and prone to failure

  • When successful, it results in an integrated, scalable warehouse


Dw lifecycle ralph kimball
DW Lifecycle (Ralph Kimball )


Data warehouse development1
Data Warehouse Development

  • Some best practices for implementing a data warehouse (Weir, 2002):

    • Project must fit with corporate strategy and business objectives

    • There must be complete buy-in to the project by executives, managers, and users

    • It is important to manage user expectations about the completed project

    • The data warehouse must be built incrementally

    • Build in adaptability


Data warehouse development2
Data Warehouse Development

  • Some best practices for implementing a data warehouse (Weir, 2002):

    • The project must be managed by both IT and business professionals

    • Develop a business/supplier relationship

    • Only load data that have been cleansed and are of a quality understood by the organization

    • Do not overlook training requirements

    • Be politically aware


ad