
Scatter-Gather-Merge Algorithm

Scatter-Gather-Merge Algorithm - Shourie Boddupalli. Data parallelism is a form of parallelization that distributes computing across multiple processors in a parallel computing environment.


Presentation Transcript


  1. Scatter-Gather-Merge Algorithm - Shourie Boddupalli

  2. Data Parallelism • Data parallelism is a form of parallelization that distributes computing across multiple processors in a parallel computing environment. • A data-parallel framework is attractive for large-scale data processing because it lets an application process a huge amount of data on commodity machines with little effort.

  3. Data Warehouse • A data warehouse is an online repository for decision-support applications that answer business queries in a short time. • Where can data parallelism be used in a warehouse? - Star schema - Star-join query

  4. Approaches to Processing Star-Joins • Data-parallel frameworks (e.g., Hive, CloudBase) - No need for up-to-date hardware and software - Fault tolerance is provided, with its complexity hidden from the user. • However, for join-query processing, their computational efficiency is still immature.

  5. Warehouse Example

  6. Example Query
     SELECT D_YEAR, S_NATION, P_CATEGORY
     FROM DATE, CUSTOMER, SUPPLIER, PART, LINEORDER
     WHERE LO_CUSTKEY = C_CUSTKEY
       AND LO_SUPKEY = S_SUPKEY
       AND LO_PARTKEY = P_PARTKEY
       AND C_REGION = 'AMERICA'
     GROUP BY D_YEAR, S_NATION, P_CATEGORY;

  7. Execution plan for the query

  8. Scatter-Gather-Merge • As the name indicates, this algorithm has three phases: Scatter, Gather, and Merge. • Key manipulation technique: the basic idea is to join the fact table with n dimension tables within three computational phases.

  9. Example of Database and Star-Join Query

  10. Contd. • During the Scatter phase: 1) If the input is a tuple of the fact table (FT), the tuple is transformed into one key-value pair per foreign key (two in this example). 2) If the input is a tuple of a dimension table, the tuple is transformed into a single new key-value pair. • The Gather phase aggregates the pairs according to their key. • The Merge phase produces the final results of the star-join query.

  11. Algorithm
     Algorithm 1 (Key manipulation algorithm of Scatter-Gather-Merge)
     Scatter(r)
     Input: r is a record.
       if (r is a record of the fact table F) then
         for each fki do
           Turn input tuple (fk1, fk2, . . . , fkn, rF) into key-value pair ((fki, i), (fk1, fk2, . . . , fkn, rF)).
           Store ((fki, i), (fk1, fk2, . . . , fkn, rF)).
         endfor
       endif
       if (r is a record of dimension table Di) then
         Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
         Store and Distribute ((pki, i), rDi).
       endif
     Gather(k, v)
     Input: k is a key (join key). v is a set of records that have the same join key.
       Match all ((pki, i), rDi) with all ((fki, i), (fk1, fk2, . . . , fkn, rF)).
       Make an output ((fk1, fk2, . . . , fkn), (rDi, rF)).
       Store and Distribute ((fk1, fk2, . . . , fkn), (rDi, rF)).
     Merge(k, v)
     Input: k is a key (fk1, fk2, . . . , fkn). v is a set of records that have the key.
       Aggregate every record with all ((fk1, fk2, . . . , fkn), (rDi, rF)) where 1 ≤ i ≤ n.
       Make an output ((fk1, fk2, . . . , fkn), (rD1, rD2, . . . , rDn, rF)).
       Store ((fk1, fk2, . . . , fkn), (rD1, rD2, . . . , rDn, rF)). // final output
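
The three functions of Algorithm 1 can be tried out on a small in-memory example. The following is only a sketch under stated assumptions, not the implementation behind the slides: the toy tables, the "F"/"D" tags used to tell fact rows from dimension rows, the `group_by_key` helper (a stand-in for the framework's distribution step), and the choice to drop fact tuples that miss a dimension match are all hypothetical. It also assumes that (fk1, . . . , fkn) is unique in the fact table.

```python
from collections import defaultdict

# Hypothetical toy data: two dimension tables and one fact table.
# Dimension tuple: (pk, r_D); fact tuple: (fk_1, ..., fk_n, r_F).
DIMENSIONS = {
    1: [(10, "cust-A"), (11, "cust-B")],
    2: [(20, "supp-X"), (21, "supp-Y")],
}
FACTS = [(10, 20, "order-1"), (11, 21, "order-2"), (10, 21, "order-3")]
N = len(DIMENSIONS)


def scatter_fact(fact):
    """Emit one ((fk_i, i), fact) pair per foreign key of a fact tuple."""
    return [((fact[i - 1], i), ("F", fact)) for i in range(1, N + 1)]


def scatter_dim(i, row):
    """Emit ((pk_i, i), r_Di) for a tuple of dimension table D_i."""
    pk, r_d = row
    return [((pk, i), ("D", r_d))]


def group_by_key(pairs):
    """Assumed stand-in for distribution: collect all values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def gather(key, values):
    """Match dimension rows with fact rows sharing the join key (pk_i, i)."""
    _, i = key
    dims = [v for tag, v in values if tag == "D"]
    facts = [v for tag, v in values if tag == "F"]
    # New key is (fk_1..fk_n); value keeps i so Merge can order the parts.
    return [(f[:N], (i, d, f[N])) for d in dims for f in facts]


def merge(key, values):
    """Combine the n partial matches of one fact tuple into a joined row."""
    by_dim = {i: d for i, d, _ in values}
    if len(by_dim) < N:  # assumption: a missing dimension drops the tuple
        return []
    r_f = values[0][2]
    return [(key, tuple(by_dim[i] for i in range(1, N + 1)) + (r_f,))]


def star_join():
    scattered = [p for f in FACTS for p in scatter_fact(f)]
    for i, rows in DIMENSIONS.items():
        for row in rows:
            scattered += scatter_dim(i, row)
    gathered = []
    for key, values in group_by_key(scattered).items():
        gathered += gather(key, values)
    joined = []
    for key, values in group_by_key(gathered).items():
        joined += merge(key, values)
    return sorted(joined)
```

Calling `star_join()` returns one joined row per fact tuple, keyed by (fk1, fk2). Note that every fact tuple is scattered n times, which is the intermediate-result cost the IO reduction technique targets.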

  12. Notation Used • Di has the primary key PKi, which is associated with the foreign key FKi of F, where i is the dimension identification number of Di. • Each tuple of Di is (pki, rDi), where pki is the value of the primary key PKi and rDi is a vector that contains the other attribute values. • Each tuple of F is (fk1, fk2, . . . , fkn, rF), where fki is the value of the foreign key FKi and rF is a vector that contains the other attribute values. Either the vector (fk1, fk2, . . . , fkn) is unique in the fact table, or rF contains the primary key.

  13. IO Reduction Technique • With the key manipulation technique, n intermediate results are generated to produce one final query result, and this number needs to be reduced. • Bloom filters are introduced to reduce the number of intermediate results.

  14. Algorithm for IO Reduction
     Algorithm 2 (Scatter-Gather-Merge algorithm)
     Filter-Construction(r)
     Input: r is a record. BFi is a Bloom filter of Di.
       if (r is a record of dimension table Di and r satisfies CDi) then
         Store and Distribute r.
         Add pki to BFi.
       endif
     Scatter(r)
     Input: r is a record.
       if (r is a record of the fact table F) then
         for each fki do
           if fki is not contained in the corresponding BFi then return
           endif
         endfor
         for each fki do
           Turn input tuple (fk1, fk2, . . . , fkn, rF) into key-value pair ((fki, i), (fk1, fk2, . . . , fkn, rF)).
           Store and Distribute ((fki, i), (fk1, fk2, . . . , fkn, rF)).
         endfor
       endif
       if (r is a record of dimension table Di) then
         Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
         Store and Distribute ((pki, i), rDi).
       endif
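
To make the filtering step of Algorithm 2 concrete, here is a minimal sketch of a Bloom filter and a filtered scatter function. It is an illustration under assumptions, not the slides' implementation: the `BloomFilter` class, its sizing (1024 bits, 3 hashes), the SHA-256-based hashing, and the `conditions` mapping that plays the role of CDi are all hypothetical choices.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def build_filters(dimensions, conditions):
    """Filter-Construction: keep rows satisfying C_Di, add their pk to BF_i."""
    filters, kept = {}, {}
    for i, rows in dimensions.items():
        filters[i] = BloomFilter()
        kept[i] = [row for row in rows if conditions[i](row)]
        for pk, _ in kept[i]:
            filters[i].add(pk)
    return filters, kept


def scatter_fact(fact, filters):
    """Emit key-value pairs only if every foreign key passes its filter."""
    n = len(filters)
    if any(fact[i - 1] not in filters[i] for i in range(1, n + 1)):
        return []  # cannot join with every dimension: drop before any I/O
    return [((fact[i - 1], i), fact) for i in range(1, n + 1)]
```

Because a Bloom filter can report false positives, a fact tuple that passes the check may still fail to match in the Gather phase; the filter only reduces intermediate results and never removes a tuple that would have joined.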

  15. Map-Reduce based Scatter-Gather-Merge Algorithm • Three phases - Filter Construction - Scatter and Gather - Merge

  16. Contd.

  17. Map-Reduce based Scatter-Gather-Merge Algorithm
     < The Filter-Construction Phase >
     Map(k, v)
     Input: k is a key. v is a record of each participating dimension table that the star-join query has restrictions on.
       if (v is a record of dimension table Di and v satisfies CDi) then
         Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
         Emit ((pki, i), rDi).
       endif
     Reduce(k, v)
     Input: (k, v) is a filtered record of each dimension table. BF(i,j) is a Bloom filter of Di for the j-th Reduce process.
       Emit ((pki, i), rDi).
       Add pki to BF(i,j).
     < The Scatter-and-Gather Phase >
     Map(k, v) // scatter function
     Input: k is a key. v is a record of the fact table and every participating dimension table.
       if (v is a record of the fact table F) then
         for each fki do
           if fki is not contained in the corresponding BF(i,j) then return
           endif
         endfor
         for each fki do
           Turn input tuple (fk1, fk2, . . . , fkn, rF) into key-value pair ((fki, i), (fk1, fk2, . . . , fkn, rF)).
           Emit ((fki, i), (fk1, fk2, . . . , fkn, rF)).

  18. Contd.
         endfor
       endif
       if (v is a record of dimension table Di) then
         if (there are restrictions on Di) then
           Emit ((pki, i), rDi).
         else
           Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
           Emit ((pki, i), rDi).
         endif
       endif
     Reduce(k, v) // gather function
     Input: k is a key (join key). v is a set of records that have the same join key.
       Match all ((pki, i), rDi) with all ((fki, i), (fk1, fk2, . . . , fkn, rF)) where pki = fki.
       Make an output ((fk1, fk2, . . . , fkn), (rDi, rF)).
       Emit ((fk1, fk2, . . . , fkn), (rDi, rF)).
     < The Merge Phase >
     Map(k, v)
     Input: k is a key (fk1, fk2, . . . , fkn) and v is a value (rDi, rF).
       Emit ((fk1, fk2, . . . , fkn), (rDi, rF)).
     Reduce(k, v)
     Input: k is a key (fk1, fk2, . . . , fkn). v is a set of records that have the key.
       Aggregate every record with all ((fk1, fk2, . . . , fkn), (rDi, rF)) where 1 ≤ i ≤ n.
       Make an output ((fk1, fk2, . . . , fkn), (rD1, rD2, . . . , rDn, rF)).
       Emit ((fk1, fk2, . . . , fkn), (rD1, rD2, . . . , rDn, rF)).
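
The Filter-Construction phase differs from Algorithm 2 in that each Reduce process j builds its own filter BF(i,j). The sketch below simulates that phase in a single process. The slides do not say how a map task in the next phase locates "the corresponding BF(i,j)", so the `partition` function and the lookup in `passes_filters` are assumptions made for illustration, as are `NUM_REDUCERS`, the `conditions` mapping, and the use of a Python `set` in place of a real Bloom filter.

```python
from collections import defaultdict

NUM_REDUCERS = 2  # hypothetical number of Reduce processes


def partition(key):
    """Assumed partitioner: route key (pk, i) to Reduce process j."""
    return hash(key) % NUM_REDUCERS


def filter_map(table_id, row, conditions):
    """Map: emit ((pk_i, i), r_Di) only for rows that satisfy C_Di."""
    pk, r_d = row
    if conditions[table_id](row):
        yield (pk, table_id), r_d


def filter_construction(dim_records, conditions):
    """Run the phase; return the emitted pairs and the filters BF(i, j)."""
    per_reducer = defaultdict(list)
    for table_id, row in dim_records:
        for key, value in filter_map(table_id, row, conditions):
            per_reducer[partition(key)].append((key, value))  # shuffle
    emitted = []
    filters = defaultdict(set)  # a set stands in for each Bloom filter
    for j, pairs in per_reducer.items():  # Reduce on process j
        for (pk, i), r_d in pairs:
            emitted.append(((pk, i), r_d))
            filters[(i, j)].add(pk)
    return emitted, filters


def passes_filters(fact, n, filters):
    """Scatter-side check: every fk_i must be in its assumed BF(i, j)."""
    return all(
        fact[i - 1] in filters.get((i, partition((fact[i - 1], i)), ), set())
        for i in range(1, n + 1)
    )
```

Under this assumption the Scatter map function would call `passes_filters` before emitting any pair for a fact tuple, and dimension tables with restrictions would be read from the phase's emitted output rather than from the original tables.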

  19. Experimental Results • The experiments show that query performance was better with the Scatter-Gather-Merge algorithm using Bloom filters than with the same algorithm without Bloom filters. • The same result was obtained when the warehouse size was increased.
