Data warehousing mining comp 150 data warehousing design not in book
Download
1 / 39

Data Warehousing - PowerPoint PPT Presentation


  • 79 Views
  • Uploaded on

Data Warehousing/Mining Comp 150 Data Warehousing Design (not in book). Instructor: Dan Hebert. Warehouse Design. What to materialize in the warehouse Which source data? Which summary tables? Which indices? Influenced by both querying and maintenance

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Data Warehousing' - dagmar


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Data warehousing mining comp 150 data warehousing design not in book

Data Warehousing/MiningComp 150 Data Warehousing Design(not in book)

Instructor: Dan Hebert


Warehouse design
Warehouse Design

  • What to materialize in the warehouse

    • Which source data?

    • Which summary tables?

    • Which indices?

  • Influenced by both querying and maintenance

  • Trade storage space and update time for query speed


Designing a data warehouse
Designing a Data Warehouse

  • Data models designed to support DW require optimization strategies for DSS

  • Design option

    • Relational model in DW - ROLAP Servers for analysis

    • Special-purpose multi-dimensional data model in DW (MDDB) - MOLAP Servers for analysis


Why is dw design different
Why is DW Design Different?

  • DSS: few transactions, each accessing a large number of records

  • Typical ER designs tend to be complex and difficult to navigate


Multi dimensional data
Multi-Dimensional Data

  • Measures - numerical data being tracked

  • Dimensions - business parameters that define a transaction

  • Example: Analyst may want to view sales data (measure) by geography, by time, and by product (dimensions)

  • Dimensional modeling is a technique for structuring data around the business concepts

  • ER models describe “entities” and “relationships”

  • Dimensional models describe “measures” and “dimensions”


Dimensional modeling using relational dbms
Dimensional Modeling Using Relational DBMS

  • Special schema design: star, snowflake

  • Special indexes: bitmap, multi-table join

  • Special tuning: maximize query throughput

  • Proven technology (relational model, DBMS), tend to outperform specialized MDDB especially on large data sets

  • Products

    • IBM DB2, Oracle, Sybase IQ, RedBrick, Informix


Dimensional modeling using special purpose model mddb
Dimensional Modeling Using Special-Purpose Model (MDDB)

  • Facts stored in multi-dimensional arrays

  • Dimensions used to index array

  • Sometimes on top of relational DB

  • Products

    • Pilot, Arbor Essbase, Gentia


Example

“Sales by product line over the past six months”

“Sales by account between 1990 and 1995”

Example

Account Info

Key columns joining fact table

to dimension tables

Numerical Measures

Prod Code Time Code Acct Code Sales Qty

Fact table for

measures

Product Info

Dimension tables

Time Info

. . .


Dimensional modeling

Dimensions are organized into hierarchies

E.g., Time dimension: days  weeks  quarters

E.g., Product dimension: product  product line  brand

Dimensions have attributes

Physical architecture describe by Star Schema

Dimensional Modeling


Example cont d
Example Cont’d

Geography

Time

Geography Code

Region Code

Region Mgr

City Code

City Name

Time Code

Quarter Code

Quarter Name

Week Code

Day Code

Day name

Sales

Geography Code

Time Code

Account Code

Product Code

Dollar Amount

Units

Product

Account

Product Code

Product Name

Brand Mgr

Brand Code

Prod. Line Code

Prod. Line Name

Prod. Name

...

Account Code

Key Account Code

Account Name

Account Type

Account Market


Dimensional modeling cont d
Dimensional Modeling Cont’d

  • Fact tables are fully normalized

  • Dimension tables are denormalized

    • Repetitively stored for sake of simplicity and performance


Extending dimensional modeling
Extending Dimensional Modeling

  • Some instances when star schema is not ideal

    • Denormalized schema may require too much storage

    • Very large dimension tables are affecting performance negatively

  • “Snowflake schema”

    • Normalized dimensions


Advantages of dimensional modeling
Advantages of Dimensional Modeling

  • Define complex, multi-dimensional data with simple model

  • Reduces the number of phycial joins a query has to process

  • Allows the data warehouse to evolve with rel. low maintenance

  • HOWEVER! Star schema and rel. DBMS are not the magic solution

    • Query optimization is still problematic


Index structures
Index Structures

  • Traditional access methods

    • B-trees, hash tables, grid files, etc.

  • Popular in warehouses

    • Inverted indexes (lists)

    • Bit map indexes

    • Join indexes


Inverted index
Inverted Index

  • Index for every keyword

  • Query:

    • “Get people with age =20 and name =‘Fred’”

  • (1) Use age index and retrieve ids: r4,r18,r34,r35

  • (2) Use name index and retrieve ids: r18,r52

  • (3) Answer is intersection: r18


Bit map index
Bit Map Index

  • Developed for Model 204 DBMS in 1987

1

1

18

0

19

1

1

18

0

20

0

23

20

0

21

0

22

1

0

0

23

0

25

0

26


Using bit maps
Using Bit Maps

  • Query:

    • “Get people with age=20 and name =‘Fred’”

  • (1) Bit map for age =20: 1101100

  • (2) Bit map for name=‘Fred’: 0100000

  • (3) Answer is intersection: 0100000

  • Good if domain cardinality is small

  • Bit vectors can be compressed


Join index
Join Index

  • Index on one table for a quantity that involves a column value of a different table


Aggregation
Aggregation

  • Process by which low-level data is summarized in advanced and placed into intermediate tables

  • Speeds up query processing, less ad-hoc

    • “Show me total US sales for 1990”

  • How much to aggregate?

  • Data cube data model

    • All possible aggregations along all dimensions

    • Cells contain aggregated values

    • How much of the cells in cube should be pre-computed?


Aggregation cont d
Aggregation Cont’d

  • Special operators to navigate the hierarchies

    • Roll-up: remove a dimension element

    • e.g., Roll-up products to brands

    • Drill-down (opposite of roll-up),

    • Slice (defines a subcube)

    • Various visualization ops (e.g., pivot)


Example1
Example

roll-up to region

Dimensions:

Time, Product, Geography

Attributes:

Product (upc, price, …)

Geography …

Hierarchies:

Product  Brand  …

Day  Week  Quarter

City  Region  Country

Geography

NY

SF

roll-up to brand

LA

10

34

56

32

12

56

Juice

Milk

Coke

Cream

Soap

Bread

Product

roll-up to week

M T W Th F S S

Time

56 units of bread sold in LA on M


Warehouse dbms buzzwords
Warehouse DBMS—Buzzwords

  • Used primarily for decision support(DSS)

    • A.K.A. On-Line Analytical Processing (OLAP)

    • Complex queries, substantial aggregation

    • TPC-D benchmark

  • Multidimensional data model

    • Can be implemented either using rel. model or proprietary data model

    • Multi-dimensional database (MDDB)

  • Aggregation: Data Cube

    • All possible groupings and aggregations


Warehouse dbms buzzwords 2
Warehouse DBMS — Buzzwords (2)

  • ROLAP vs. MOLAP

  • Special purpose OLAP servers that directly implement multidimensional data and operations

    • Roll-up = aggregate on some dimension

    • Drill-down = deaggregate on some dimension

  • ROLAP: Oracle, Sybase IQ, RedBrick

  • MOLAP: Pilot, Essbase, Gentia


Warehouse dbms buzzwords 3
Warehouse DBMS - Buzzwords (3)

  • Clients:

    • Query and reporting tools

    • Analysis tools

    • Data mining: discovering patterns of various forms

  • Poses many new research issues in:

    • Query processing and optimization

    • Database design

    • View management



Common design activities oltp
Common Design Activities – OLTP

  • Schema design (base tables)

    • Normalization (3NF, BCNF, …)

  • Schema design (views)

    • Mostly for convenience, security

    • Usually NOT for performance

    • Exception: View indexing [Roussopolous 1982]

      • Materialize pointers to tuples instead of tuples themselves

  • Index selection

    • In practice, use rules of thumb

    • Tool: DBDSGN [IBM Almaden], RDT for System R


Relational views
Relational Views

  • Part of the ANSI/SPARC architecture

  • Derived, virtual table

  • View definition is an SQL query statement

  • View update problem

  • Good for logical data independence, security

  • How to implement a view for querying

    • Query modification: modify view query into a query on the underlying base tables

    • View materialization: physically implementing view as table


View indexing
View Indexing ...

  • In general, no need for materialized views in OLTP systems

    • Increase in performance through indexing

    • Secondary storage space used to be expensive

  • New idea (N. Roussopolous 1982) - view index

  • Store index whose elements point to tuples which comprise view

  • View selection problem: Find a subset of views, which, when indexed, minimizes the total cost of answering all queries as well as cost of maintaining the view structures


View indexing1
… View Indexing …

  • Assume N views to consider, 2N subsets

  • Can’t do simple enumeration (cost to answer all queries in a given subset)

    • NP-complete problem

  • Solution uses search algorithm to approximate the optimal view selection

    • Potential exponential worst case

    • Only subset of views needs to be considered

  • Cost function which computes for each state (set of views + remaining storage)

    (1) Cost to compute queries, maintenance of current index set +

    (2) Estimate of incremental cost that must be incurred in extending view set (upper bound on actual cost)


View indexing2
… View Indexing

  • But ...

  • Algorithm does not consider index selection on views (view indexes)

    • Indexes have impact on which view indexes to choose

  • Very simple cost model (maintenance cost ~ size of view)

    • Problem: Cost of maintaining view is a complex query optimization problem

    • Cannot be estimated without knowing which subview indexes are chosen

  • Good first treatment of subject


Indexing
Indexing ...

  • Which type of index structure, which attribute(s) to index on

    • “Access path selection” -> DBA

    • Many choices, depend on many factors

    • Space-time trade-off

  • Index selection problem: Which ordering rule for stored records and which non-clustered indices

  • Database practitioners use rules/guidelines (e.g., SYBASE manual)

  • Design tools available

    • Support dba during creation and maintenance of database, i.e., solve the index-selection problem


Factors that influence index selection
Factors that Influence Index Selection

  • Maintenance

  • Storage cost

  • Global solution depends on index selection of all tables combined


Example2
Example

ORDERS (OrderNo, SuppNo, PartNo, Date, Qty)

PARTS (PartNo, Descrip, SuppNo, QtyOnhand, Color, …)

Query:

SELECT O.SuppNo

FROM PARTS P, ORDERS O

WHERE O.PartNo = P.PartNo AND O.SuppNo = 15 AND P.QtyOnHand BETWEEN 100 AND 150

  • Situation 1: Assume PARTS clustered on Descrip and non-clustered index on PartNo

    Then: Best clustered index for ORDERS SuppNo

  • Situation 2: Assume PARTS clustered on PartNo

    Then: Best clustered index for ORDERS PartNo


Data warehouse design
Data Warehouse Design

  • Schema design (base tables)

    • Star schema (dimensions, measures)

  • Schema design (view/index selection)

    • Mostly for performance enhancement

  • Physical warehouse design. Balance three costs:

    (1) The cost of answering queries using warehouse relations and additional structures

    (2) Cost of maintaining additional structures

    (3) Cost of secondary storage


Wh schema design
WH Schema Design

  • Tables must map efficiently to the operational requests

  • OLTP: maximize concurrency, optimize insert/update/delete performance

  • OLAP: Queries large, complex, ad-hoc, data-intensive, no updates

  • Query centric view -> Star schema (facts, dimensions)

    • Widely accepted, intuitive, easy to navigate (query formulation)

  • Problem: Poor performance on OLTP db engines

    • Join processing (pair-wise join problem)

    • Number of pair-wise joins for N tables = N!

      • e.g., 7 tables -> 5,040 combinations, 5 different join algorithms -> 25,200 combinations


Star schema join problem
Star Schema Join Problem

  • Heuristic: “pick directly related tables” doesn’t work in star schema

  • Options:

    • Join unrelated tables (Cartesian product)

    • Parallelism (speed-up, scale-up)

    • New join techniques (e.g., bit vector star joins) in combination with new indexing schemes (e.g., bit maps, variant indexes)


Warehouse access path physical design problem
Warehouse Access Path (Physical) Design Problem

  • Materialize user queries as views (reduces cost 1)

  • How to reduce cost 2 and 3?

  • “View Index Selection Problem VIS”

    • Choose a set of supporting views and a set of indexes to materialize such that the total maintenance cost for the warehouse is minimized (cost 2 & 3)


Solutions relational db design practices
Solutions - Relational DB Design Practices

  • Rel. DB design algorithms must be adapted

    • View index approach has no index selection, simple cost model (cannot achieve global solution by locally optimizing each materialized subview)

    • Index selection approach can be extended - but trouble ahead

      • Algorithms require queries and frequencies as input


Solutions rule condition maintenance
Solutions - Rule Condition Maintenance

  • Work on rule condition evaluation

    • How to evaluate trigger conditions for rules efficiently ( ~ view maintenance problem: rule is triggered whenever view that satisfies its condition becomes non-empty)

    • Discrimination networks for each rule (view)

      • RETE model materializes selection and join nodes

      • TREAT materializes only selection nodes

    • Incremental evaluation techniques

  • Recommendations not generally applicable


ad