Fundamentals of data warehousing
This presentation is the property of its rightful owner.
Sponsored Links
1 / 20

Fundamentals of Data Warehousing PowerPoint PPT Presentation


  • 92 Views
  • Uploaded on
  • Presentation posted in: General

Fundamentals of Data Warehousing. Dr. Akhtar Ali School of Computing, Engineering and Information Sciences. Lecture Outline. Inmon’s Four Characteristics of a DW Subject-Oriented, Integrated, Time Variant and Non-Volatile Some Useful Definitions Issues in Data Warehousing DW Design

Download Presentation

Fundamentals of Data Warehousing

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Fundamentals of data warehousing

Fundamentals of Data Warehousing

Dr. Akhtar Ali

School of Computing, Engineering and Information Sciences


Lecture outline

Lecture Outline

  • Inmon’s Four Characteristics of a DW

    • Subject-Oriented, Integrated, Time Variant and Non-Volatile

  • Some Useful Definitions

  • Issues in Data Warehousing

    • DW Design

    • Extraction

    • Integration

    • Creation and Loading

    • DW Maintenance

    • Querying and Optimization


1 1 subject oriented

Customer

Financial

Information

Savings

Data Warehouse

Subject Area

1.1 Subject-Oriented

Data is categorized and stored by business subject rather than

by application.

Equity

Plans

Shares

Loans

Insurance

Operational Systems


1 2 integrated

1.2 Integrated

Data on a given subject is defined and stored once.

Savings

Application

No

Application

Flavor

Current

Accounts

Application

Loans

Application

Subject = Customer

Data Warehouse

Operational Environment


1 3 time variant

Time

Data

01/03

Data for January

02/03

Data for February

03/03

Data for March

1.3 Time Variant

Data is stored as a series of snapshots, each representing a

period of time.

Data

Warehouse


1 4 non volatile

Operational Databases

Warehouse Database

1.4 Non-Volatile

Typically data in the data warehouse is not directly updated or deleted.

Load

INSERT Read (e.g. SELECT)

UPDATE

DELETE

Read


2 1 what is an operational data store

2.1 What is an Operational Data Store?

  • An operational data store (ODS) is the point of integration for operational/transaction-oriented systems.

    • For example Banks typically have several independent systems set up to support different financial products e.g. loans, checking accounts, savings accounts etc.

    • The advent of ATMs helped push many banks to create an ODS to integrate current balances and recent transactional data from these separate accounts under one customer number.

    • Such ODSs are normally kept separate from a DW.

  • An ODS may be seen as a lowest layer of a DW for lower-management to access detailed as well as integrated data.

    • This means that an ODS may be seen as a front edge of a DW.

    • Such ODSs are normally kept as part of a DW.


2 2 what is a data mart

2.2 What is a Data Mart?

  • A data mart (DMT) is a logical subset of a complete DW.

    • A DMT is a complete “pie-wedge” of the overall DW pie.

    • A DW is a made up of the union of all its DMTs.

    • Some people take the definition literally. They create several independent DMTs to meet the needs of several departments.

    • Will everyone be happy? Well, maybe. There may be serious issues of integrating these DMTs together.

  • A DMT is an extension of a DW.

    • Data is integrated as it enters the DW. DMTs then derive data from the central source, the DW.

    • Each department gets its own DMT.

    • Each department determines which of the data warehouse contents are of interest .

    • These subject areas are then replicated into the smaller and local DMT so that users can get to the data they want with less interference from other departments.


2 3 what is on line analytical processing olap

2.3 What is On-Line Analytical Processing (OLAP) ?

  • OLAP is complementary to data warehousing.

  • OLAP embodies general activities of querying and presenting text and number data from DWs.

  • OLAP is based on dimensional modelling as opposed to entity-relationship (ER) modelling.

  • A dimensional model may contain the same information as an ER model but packages data in a symmetric form.

  • A dimensional model is geared towards user understandability and high performance query processing.

  • ROLAP (Relational OLAP)

    • A set of user interfaces and applications that give a relational database (RDB) a dimensional flavour.

  • MOLAP (Multi-dimensional OLAP)

    • A set of user interfaces, applications and proprietary database technologies that have a strong dimensional flavour.


2 4 what is data mining dm

2.4 What is Data Mining (DM) ?

  • DM is often defined as “finding hidden information” in a database.

  • Alternatively, DM is “exploratory data analysis, data driven discovery, and deductive learning”.

  • Data mining software is a class of tools that apply artificial intelligence techniques to the analysis of data.

  • Given access to data, DM tools dig through the data looking for patterns and discovering relationships that the user might have never suspected.

  • DM tools work against an operational database or a DW.

  • Since data in a DW is usually integrated and summarized it may be more efficient to use it for DM.

  • But a DM tool may find more useful information from an operational database (compared to a DW) as a DW usually hosts data to support anticipated DSS and may miss out data useful for DM.


3 0 generic data warehouse architecture

3.0 Generic Data WarehouseArchitecture


3 1 warehouse design

3.1 Warehouse Design

  • Influenced by both maintenance and querying

  • Many trade-offs

    • Space vs. update time vs. query performance

  • Logical model of data

    • ER vs. Dimensional, Relational vs. OR vs. OO, ROLAP vs. MOLAP

  • Identify sources of data

  • Identify warehouse data – what to materialize?

    • Which summary tables?

    • Which fact/dimensional tables?

    • Which indices?

  • Choose software and hardware


3 2 data extraction

3.2 Data Extraction

  • Selecting relevant data from data sources (DSs) and moving it into DW.

  • DS types

    • Database (e.g. relational), flat file, WWW, XML, COBOL, etc

  • How to obtain the data?

    • Using data replication servers/tools

    • Dump file or Export tools

    • ODBC/JDBC/CORBA/RMI/COM and DCOM

    • Third party Wrappers/Middleware/Agents

  • Other activities: data transformation, change detection (monitoring), cleansing etc.


3 2 1 monitors

3.2.1 Monitors

  • Detecting changes (of interest to a DW) in data sources and propagate to DW.

  • How?

    • Triggers

    • Replication servers/tools

    • Log Sniffer

    • Compare query results

    • Compare snapshots/dumps/exported data


3 2 1 data cleansing

3.2.1 Data Cleansing

  • Finds and removes duplicate tuples

    • For example Judie Harris Morris vs. Judie H. Morris

  • Detect inconsistent or wrong data

    • Attribute values that do not match (because of wrong data types or violating certain constraints e.g. for Gender attribute a value of ‘N’ meaning neither may be rejected).

  • Unreadable or incomplete data

  • Notify DSs of errors found during the cleansing process


3 3 data integration

3.3 Data Integration

  • Receive data (changes) from multiple wrappers/monitors/data cleansers and integrate into DW.

  • Often Rule-based

  • Actions

    • Resolve inconsistencies

    • Eliminate duplicates

    • Integrate into DW

    • Summarize data

    • Fetch more data from DSs

    • Notify users that DW is now up-to-date


3 4 1 warehouse loading

3.4 .1 Warehouse Loading

  • Includes all of the previous processes

  • Similar to loading/populating a database but complex due to heterogeneity of data and dealing with multiple DSs, possibly remote and external.

    • Building indices

    • Checking integrity constraints, etc.

  • Issue: huge volumes of data but small time window to complete the process.

  • Computation of additional data

    • Auxiliary data to facilitate DW maintenance and support/speed up querying and analysis.


3 4 2 warehouse creation

3.4.2 Warehouse Creation

  • A DW can be seen as a collection of materialized views (MVs) over DSs.

    • Contains a copy of data (collected from DSs) tailored to end-users.

  • Steps

    • Create DW schema (e.g. creating fact and dimensional tables, defining MVs)

    • Load warehouse

    • Start monitoring for changes at DSs

    • Update/Maintain DW as needed.


3 5 dw maintenance

3.5 DW Maintenance

  • DSs on which a DW is based may change over time.

  • Changes at DSs may require changes at a DW.

  • How often to propagate changes to a DW?

    • At night, weekly/fortnightly/monthly, immediately, etc.

  • Off-line or on-line

    • Most current vendor products take a warehouse off-line during maintenance

  • How to propagate changes to a DW?

    • Completely re-build all affected tables at the DW (easy but inefficient)

    • Apply changes to affected tables incrementally (efficient but difficult)

  • Read my paper about MOVIE or wait until we discuss this topic in detail.


3 6 querying and optimization

3.6 Querying and Optimization

  • Queries are long-running and complex

    • Multiple/Nested joins and aggregation

    • Usually “touch all tuples” kind of queries

  • Query language or analysis tools must support multi-dimensional operations

    • Pivot, Slice/Dice, Rollup, Percentile, etc

    • Standard SQL does not provide adequate operations

  • Solution: pre-compute partial answers and reuse

    • Drawback: it may increase DW maintenance

  • Emergence of warehouse management systems (WHMS e.g. ADMS, WHIPS)


  • Login