Data warehouse
Download
1 / 27

Data Warehouse - PowerPoint PPT Presentation


  • 96 Views
  • Uploaded on

Data Warehouse. Chapter 11. Multiple Files Problem. Added complexity of multiple source files Start simple. Logic to detect Correct source. Multiple Source files. Extracted data. Transforming Data from Multiple files. File. File. File. File. File. File. File. File. File.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Data Warehouse' - kairos


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Data warehouse

Data Warehouse

Chapter 11


Multiple files problem
Multiple Files Problem

  • Added complexity of multiple source files

  • Start simple

Logic to detect

Correct source

Multiple

Source files

Extracted data


Transforming data from multiple files
Transforming Data from Multiple files

File

File

File

File

File

File

File

File

File


Missing values problem
Missing Values Problem

Solution

  • Ignore

  • Wait

  • Mark rows

  • Extract when time-stamped

If NULL then

Field=‘A’


Duplicate value problem
Duplicate Value Problem

Solution

  • SQL self-join techniques

  • RDMBS constrains utilities

SELECT…

FROM table_a, table_b

WHERE table_a.key(+)=table_b.key

UNION

SELECT…

FROM table_a, table_b

WHERE table_a.key=table_b.key(+)

ACME Inc

ACME Inc

ACME Inc

ACME Inc

ACME Inc


Element names problem
Element Names Problem

Solution

  • CTAS

  • SQL*Loader

Customer

Customer

Client

Contact

Name


Element meaning problem
Element Meaning Problem

  • Avoid misinterpretation

  • Complex solution

  • Document meaning in metadata

All customer

details

All details

Except name

Customer’s

name


Input format problem
Input Format Problem

EBCDIC

ASCII

12373

“123-73”


Referential integrity problem
Referential Integrity Problem

Solution

  • SQL anti-join

  • Server constraints

  • Dedicated tools

  • Emp Name Department

  • Smith 10

  • Jones 20

  • Doe 50

  • 6787 Harris 60

Department

10

20

30

40


Name and address problem
Name and Address Problem

  • No unique key

  • Missing values

  • Personal and commercial names mixed

  • Different addresses for same member

  • Different names and spelling for same number

  • Many names on one line

  • One name on two lines


Name and address problem1
Name and Address Problem

  • Single-field format

  • Multiple-field format

Mr.J.Smith, 100 Main St., Bigtown, County Luth, 23565


Clean and organize
Clean and Organize

  • Create atomic values.

  • Standardize formats.

  • Verify data accuracy.

  • Match with other records.

  • Identify private and commercial addresses and inhabitants.

  • Document in metadata.

    Requires sophisticated tools and techniques


Merging data
Merging Data

  • Operational transactions do not usually map one-to-one with warehouse data

  • Data for the warehouse is merged to provide information for analysis

Pizza sales/return by day, hour, seconds

Sale 1/2/98 12:00:01 Ham Pizza $10.00

Sale 1/2/98 12:00:02 Cheese Pizza $15.00

Sale 1/2/98 12:00:02 Anchovy Pizza $12.00

Return 1/2/98 12:00:03 Ham Pizza -$12.00

Sale 1/2/98 12:00:04 Sausage Pizza $11.00


Merging data1
Merging Data

Sale 1/2/98 12:00:01 Ham Pizza $10.00

Sale 1/2/98 12:00:02 Cheese Pizza $15.00

Sale 1/2/98 12:00:02 Anchovy Pizza $12.00

Return 1/2/98 12:00:03 Ham Pizza -$12.00

Sale 1/2/98 12:00:04 Sausage Pizza $11.00

Sale 1/2/98 12:00:01 Ham Pizza $10.00

Sale 1/2/98 12:00:02 Cheese Pizza $10.00

Sale 1/2/98 12:00:04 Sausage Pizza $11.00


Adding a date stamp
Adding a Date Stamp

  • Enables time analysis

  • Label loaded data with a date stamp

  • Add time to fact and dimension data


Adding a date stamp1
Adding a Date Stamp

Product Table

Product_id

Time_key

Product_desc

Store Table

Store_id

District_id

Time_key

Sales Fact Table

Item_id

Store_id

Time_key

Sales_dollars

Sales_units

Time Table

Week_id

Period_id

Year_id

Time_key

Item_Table

Item_id

Dept_id

Time_key


Adding a date stamp2
Adding a Date Stamp

  • Fact table

    - Add triggers

    - Recode applications

    - Compare tables

  • Dimension table

  • Time representation

    - Point in time

    - Time span


Adding keys to data
Adding Keys to Data

#1 Sale 1/2/98 12:00:01 Ham Pizza $10.00

#2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00

#3 Sale 1/2/98 12:00:02 Anchovy Pizza $12.00

#4 Sale 1/2/98 12:00:03 Ham Pizza -$12.00

#5 Sale 1/2/98 12:00:04 Sausage Pizza $11.00

Data values or artificial keys

#dw1 Sale 1/2/98 12:00:01 Ham Pizza $10.00

#dw2 Sale 1/2/98 12:00:02 Cheese Pizza $10.00

#dw3 Sale 1/2/98 12:00:04 Sausage Pizza $11.00


Summarizing data
Summarizing Data

  • During extraction on staging area

  • After loading onto the warehouse server

Warehouse

database

Operational

databases

Staging

area


Maintaining transformation metadata
Maintaining Transformation Metadata

Contains transformation rules, algorithms, and routines

Sources Stages Rules Publish

Extract Transform Load Query


Transformation timing and location
Transformation Timing and Location

  • Transformation is performed:

    - Before load

    - In parallel

  • May be initiated at different points

Unlikely

Probable

Possible


Choosing a transformation point
Choosing a Transformation Point

* Workload * Network bandwidth

* Environment * Parallel execution

* CPU use * Load window time

* Disk space * User information

needs


Monitoring and tracking
Monitoring and Tracking

Transformations should:

  • Be self-documenting

  • Provides summary statistics

  • Handle process exceptions


Designing transformation processes
Designing Transformation Processes

  • Analysis:

    - Sources and target mappings, business

    rules

    - Key users, metadata, grain

  • Design options: PL/SQL, replication, custom, third-party tools

  • Design issues:

    - Performance

    - Size of the staging area

    - Exception handling, integrity maintenance


Transformation tools
Transformation Tools

  • Purchased

  • SQL*Loader

  • In-house developed


Data management quality and auditing tools
Data Management, Quality, and Auditing Tools

  • Data management:

    - Innovative Systems

    - Postalsoft

    - Vality Technology

  • Data quality and auditing:

    - Innovative Systems

    - Vality Technology


Summary
Summary

This lesson discussed the following topics:

  • Importance of data quality

  • Transformation processes

  • Data transformation issuess

  • Data anomalies

  • Name and address management

  • Tools


ad