Parallel star join dataindexes efficient query processing in data warehousing and olap
Download
1 / 21

Parallel Star Join + DataIndexes : Efficient Query Processing in Data Warehousing and OLAP - PowerPoint PPT Presentation


  • 128 Views
  • Uploaded on

Parallel Star Join + DataIndexes : Efficient Query Processing in Data Warehousing and OLAP. Anindya Datta Debra VanderMeer Krithi Ramamritham Presented by – Ashutosh Joshi. Motivation. OLAP involves efficient retrieval of data from data warehouses for decision-support purposes

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Parallel Star Join + DataIndexes : Efficient Query Processing in Data Warehousing and OLAP' - carina


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Parallel star join dataindexes efficient query processing in data warehousing and olap

Parallel Star Join + DataIndexes : Efficient Query Processing in Data Warehousing and OLAP

Anindya Datta

Debra VanderMeer

Krithi Ramamritham

Presented by –

Ashutosh Joshi


Motivation
Motivation Processing in Data Warehousing and OLAP

  • OLAP involves efficient retrieval of data from data warehouses for decision-support purposes

  • Data Warehouses are extremely large and queries are highly computationally expensive

  • DataIndex is a storage structure serving as both index and data

  • Parallel Star Join (PSJ) is an efficient algorithm for performing star join in parallel


The road map
The Road Map Processing in Data Warehousing and OLAP

  • A physical design principle for exploiting parallelism

  • Parallel Star Join algorithm

  • Experiment results


The star schema
The Star Schema Processing in Data Warehousing and OLAP

Dimension Table

PART

CUSTOMER

Fact Table

PartKey4

Name 55

Mfgr 25

Brand 10

Type 25

Size 4

Others... 41

164

CustKey 4

Name 25

Address 40

Nation 25

Region 25

Phone 15

AcctBal 8

MktSegment 10

Comment 117

269

SALES

PartKey 4

SuppKey 4

CustKey 4

Quantity 8

ExtPrice 8

Discount 8

Tax 8

RetFlag 1

Status 1

ShipDate 2

CommitDate 2

ReceiptDate 2

ShipInstruct 25

ShipMode 10

Comment 44

137

200,000

SUPPLIER

150,000

SuppKey 4

Name 25

Address 40

Nation 25

Region 25

Phone 15

AcctBal 8

Comment 101

243

TIME

TimeKey 2

Alpha 10

Year 4

Month 4

Week 4

Day 4

28

6,000,000

2,557

10,000


A physical design principle
A Physical Design Principle Processing in Data Warehousing and OLAP

  • DataIndexes

    • Serve as both index as well as data

    • Based on vertical partitioning of tables

    • Two types

      • Projection Index (PI)

      • Join Index (JI)


Projection index

CustKey Processing in Data Warehousing and OLAP

CK1

CK2

CK3

CK4

Projection Index

Base Table

CustKey

Qty

ExtPrice

Discount

CK1

Q1

E1

D1

CK2

Q2

E2

D2

CK3

Q3

E3

D3

CK4

Q4

E4

D4

PI

PI

PI

Qty

ExtPrice

Discount

Q1

E1

D1

Q2

E2

D2

Q3

E3

D3

Q4

E4

D4


Join index

RIDs Processing in Data Warehousing and OLAP

RID1

RID2

RID3

RID3

Join Index

Base Dimension Table

Base Fact Table

Name

Address

CustKey

CustKey

Tax

ExtPrice

Discount

N1

A1

CK1

CK1

T1

E1

D1

N2

A2

CK2

CK2

T2

E2

D2

N3

A3

CK3

CK3

T3

E3

D3

CK3

T4

E4

D4

PI

PI

PI

JI

PI

PI

Name

Address

CustKey

Tax

ExtPrice

Discount

N1

A1

CK1

T1

E1

D1

N2

A2

CK2

T2

E2

D2

N3

A3

CK3

T3

E3

D3

T4

E4

D4


The principle
The Principle Processing in Data Warehousing and OLAP

  • Each foreign key column in the fact table is stored as Join Index (JI)

  • Rest of the columns (for both dimension as well as fact table) are stored as Projection Index (PI)


Parallel star join
Parallel Star Join Processing in Data Warehousing and OLAP

  • Data placement strategy

    • Based on shared nothing architecture with N processors

    • Assume a d dimensional data warehouse

    • Partition N processors into d+1 groups

    • Assign to each group j, dimension table Djand Jj , the fact table join index

    • Assign metric PIs to the group d+1


Processor group partitioning
Processor Group Partitioning Processing in Data Warehousing and OLAP

  • Number of processors is governed by the size of dimension table Dj

  • Size of jth processor group

  • Size of metric group


Physical data placement
Physical Data Placement Processing in Data Warehousing and OLAP

  • Horizontally partition JI’s across all processors

  • Replicate PI’s on all processors

  • Use round-robin strategy for partitioning JI’s


The parallel star join algorithm
The Parallel Star Join Algorithm Processing in Data Warehousing and OLAP

  • A general k- dimensional star join query

    • Select AdP, AmP

      from F, D1, … , Dk

      where Pjoin and Pselect

  • The algorithm has three phases

    • Local rowset generation

    • Global rowset synthesis

    • Output preparation


  • Local rowset generation

    1 Processing in Data Warehousing and OLAP

    25

    0

    5

    0

    7

    15

    1

    Local Rowset generation

    • Load PI fragment

    Pc

    P1

    P2

    PI fragment

    PI fragment

    PI fragment

    Qty > 10

    PI fragment

    Rowset fragment


    Local rowset generation contd
    Local Rowset Generation (contd) Processing in Data Warehousing and OLAP

    • Merge dimension rowset fragments

    • Distribute dimension rowset

    Rowset

    fragment

    P1

    P2

    P3

    P4

    OR

    Rdim,i


    Local rowset generation contd1

    RIDs Processing in Data Warehousing and OLAP

    RID1

    1

    1

    RID2

    0

    0

    RID3

    0

    0

    RID3

    0

    1

    Local Rowset Generation (contd)

    • Load JI fragment

    • Merge partial fact rowsets

    Rfact,i

    Rdim,i

    JIi


    Global rowset synthesis
    Global Rowset Synthesis Processing in Data Warehousing and OLAP

    • Merge local fact rowsets

    • Distribute global rowset to groups participating in the output phase

    Rfact,2

    G1

    G2

    Rfact,1

    G3

    G4

    AND

    Rglobal


    Output preparation

    Output Processing in Data Warehousing and OLAP

    CustKey

    RIDs

    CK1

    CK1

    RID1

    1

    CK2

    CK2

    RID2

    1

    CK3

    RID3

    0

    CK4

    RID3

    0

    Output Preparation

    • Distribute global rowset to individual processors

    • Load PI columns necessary for output

    • Merge output

    JIi

    Rglobal

    PIi


    Performance comparison
    Performance Comparison Processing in Data Warehousing and OLAP

    • The PSJ algorithm was compared with Bitmapped Join Index algorithm and the Pipelined Hash join algorithm

    • Two performance metrics used

      • Response time in block access (RTBA)

      • Aggregate Data Transmission (ADT)


    Scalability experiments
    Scalability Experiments Processing in Data Warehousing and OLAP

    • The curves rise as the scale factor and number of processors increase

    • PSJ cost is much lower than BJI and HASH costs

    • At large memory sizes, PSJ approaches “near-perfect” scalability


    Scalability experiments contd
    Scalability Experiments(contd) Processing in Data Warehousing and OLAP

    • Transmission costs for PSJ and BJI are the same

    • Both curves exhibit imperfect scalability

    • HASH has substantially higher transmission costs than PSJ


    Conclusion
    Conclusion Processing in Data Warehousing and OLAP

    • DataIndex is a physical design strategy which provides efficient partitioning of the schema

    • Parallel Star Join algorithm provides a means to perform star join in parallel

    • PSJ algorithm performs better than BJI and HASH algorithms in terms of I/O and transmission costs