
Twister4Azure





Presentation Transcript


  1. Twister4Azure: Iterative MapReduce for Windows Azure Cloud. Thilina Gunarathne (tgunarat@indiana.edu), Indiana University. http://salsahpc.indiana.edu/twister4azure

  2. Twister4Azure – Iterative MapReduce
  • Decentralized iterative MapReduce architecture for clouds
  • Utilizes highly available and scalable cloud services
  • Extends the MapReduce programming model:
  •   Multi-level data caching
  •   Cache-aware hybrid scheduling
  •   Multiple MapReduce applications per job
  •   Collective communication primitives
  • Outperforms Hadoop on a local cluster by 2 to 4 times
  • Dynamic scheduling, load balancing, fault tolerance, monitoring, and local testing/debugging
  http://salsahpc.indiana.edu/twister4azure/

  3. MRRoles4Azure

  4. MRRoles4Azure uses Azure Queues for scheduling, Tables to store metadata and monitoring data, and Blobs for input/output/intermediate data storage.

  5. Data-Intensive Iterative Applications
  • k ← 0
  • MAX_ITER ← maximum iterations
  • δ[0] ← initial delta value
  • while (k < MAX_ITER && !converged(δ[k], δ[k-1]))
  •   foreach datum in data
  •     β[datum] ← process(datum, δ[k])
  •   end foreach
  •   δ[k+1] ← combine(β[])
  •   k ← k+1
  • end while
  • Growing class of applications: clustering, data mining, machine learning, and dimension-reduction applications
  • Driven by the data deluge and emerging computation fields
  • Many scientific applications
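The pseudocode above can be sketched as a small runnable Python driver. Here `process`, `combine`, and the convergence test are illustrative stand-ins (not Twister4Azure code): `process` scales a datum by the current delta, `combine` averages the betas, and `converged` compares successive deltas against a tolerance.

```python
MAX_ITER = 10       # maximum iterations (assumed value)
EPSILON = 1e-3      # convergence tolerance (assumed value)

def process(datum, delta):
    # Placeholder per-datum computation
    return datum * delta

def combine(betas):
    # Placeholder reduction producing the next loop-variant delta
    return sum(betas) / len(betas)

def converged(current, previous):
    return abs(current - previous) < EPSILON

def run(data):
    k = 0
    delta = [1.0]  # delta[0] <- initial delta value
    # Loop until the iteration cap or until successive deltas converge
    while k < MAX_ITER and not (k > 0 and converged(delta[k], delta[k - 1])):
        betas = [process(datum, delta[k]) for datum in data]
        delta.append(combine(betas))
        k += 1
    return delta, k
```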

  6. Data-Intensive Iterative Applications
  [Diagram: each iteration alternates compute and communication phases – a reduce/barrier step, then a broadcast of the smaller loop-variant data into the new iteration, while the larger loop-invariant data stays in place]
  • Growing class of applications: clustering, data mining, machine learning, and dimension-reduction applications
  • Driven by the data deluge and emerging computation fields

  7. Iterative MapReduce
  • MapReduceMerge
  • Extensions to support additional broadcast (and other) input data:
  •   Map(<key>, <value>, list_of <key,value>)
  •   Reduce(<key>, list_of <value>, list_of <key,value>)
  •   Merge(list_of <key, list_of <value>>, list_of <key,value>)
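A minimal sketch of the extended signatures, assuming each phase receives the loop-invariant broadcast data as an extra list of (key, value) pairs. The word-count workload and the use of the broadcast list as a stop-word set are illustrative choices, not part of the framework:

```python
from collections import defaultdict

def map_fn(key, value, broadcast):
    # Emit (word, 1) pairs; the broadcast list carries a stop-word set here
    stop = {k for k, _ in broadcast}
    return [(w, 1) for w in value.split() if w not in stop]

def reduce_fn(key, values, broadcast):
    return (key, sum(values))

def merge_fn(reduce_outputs, broadcast):
    # Merge sees every reduce output plus the broadcast data
    return dict(reduce_outputs)

def run_mr(inputs, broadcast):
    # In-process stand-in for the Map -> Shuffle -> Reduce -> Merge pipeline
    shuffled = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value, broadcast):
            shuffled[k].append(v)
    reduced = [reduce_fn(k, vs, broadcast) for k, vs in shuffled.items()]
    return merge_fn(reduced, broadcast)
```

For example, `run_mr([(1, "a b a"), (2, "b c")], [("c", None)])` counts the words of both documents while skipping the broadcast stop word "c".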

  8. Merge Step
  • Extension to the MapReduce programming model to support iterative applications
  • Map -> Combine -> Shuffle -> Sort -> Reduce -> Merge
  • Receives all the Reduce outputs and the broadcast data for the current iteration
  • The user can add a new iteration or schedule a new MapReduce job from the Merge task
  • Serves as the "loop test" in the decentralized architecture:
  •   Number of iterations
  •   Comparison of results from the previous and current iterations
  • The output of Merge can become the broadcast data of the next iteration
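The loop-test role of Merge can be sketched as follows. The cap of 30 iterations, the tolerance, and the element-wise comparison are all assumptions for illustration; the slide only states that Merge checks the iteration count and compares results across iterations:

```python
def merge_step(reduce_outputs, broadcast, iteration, max_iter=30, tol=1e-6):
    """Decentralized 'loop test' performed in the Merge task (sketch)."""
    current = dict(reduce_outputs)
    previous = dict(broadcast)
    # Compare the current result against the previous iteration's result
    diff = max((abs(current[k] - previous.get(k, 0.0)) for k in current),
               default=0.0)
    # Add a new iteration only while under the cap and not yet converged
    add_iteration = iteration + 1 < max_iter and diff > tol
    # The merge output becomes the broadcast data of the next iteration
    next_broadcast = list(current.items())
    return add_iteration, next_broadcast
```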

  9. Multi-Level Caching
  • In-memory/disk caching of static data
  • Caching loop-invariant data in memory:
  •   Direct in-memory
  •   Memory-mapped files
  • Caching BLOB data on disk
  • Programming model extensions to support broadcast data
  • Merge step
  • Hybrid intermediate data transfer
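A minimal two-tier cache sketch along these lines: a direct in-memory tier, and a disk tier read through memory-mapped files (standing in for BLOB data cached on local disk). The class and method names are invented for illustration:

```python
import mmap
import tempfile

class MultiLevelCache:
    """Two tiers: in-memory dict first, then memory-mapped local files."""

    def __init__(self):
        self.memory = {}   # direct in-memory tier
        self.mapped = {}   # key -> (file handle, mmap) disk tier

    def put_in_memory(self, key, data):
        self.memory[key] = data

    def put_on_disk(self, key, data):
        # Write the bytes to a local file and map it into memory
        f = tempfile.NamedTemporaryFile(delete=False)
        f.write(data)
        f.flush()
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        self.mapped[key] = (f, m)

    def get(self, key):
        if key in self.memory:      # fastest tier first
            return self.memory[key]
        if key in self.mapped:      # read through the memory mapping
            _, m = self.mapped[key]
            return m[:]
        return None                 # miss: a real worker would fetch the blob
```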

  10. Cache-Aware Hybrid Scheduling
  [Diagram: a new job enters a scheduling queue; worker roles run map workers (Map 1 … Map n) and reduce workers (Red 1 … Red m) over an in-memory/disk data cache and a map-task metadata cache, with leftover tasks and a job bulletin board holding entries such as "Job 1, iteration 2, bcast…" and "Job 2, iteration 26, bcast…" to start a new iteration]
  • Decentralized
  • Fault tolerant
  • Multiple MapReduce applications within an iteration
  • Load balancing
  • Multiple waves
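The cache-aware part of the scheduling can be sketched as a worker's task-pick policy: prefer a map task whose input data is already in the local cache, and fall back to the shared pool of leftover tasks otherwise. The task representation and function name are assumptions:

```python
def pick_task(worker_cache, tasks):
    """Pick the next map task for this worker (illustrative policy).

    worker_cache: set of input names cached on this worker
    tasks: shared list of pending tasks, e.g. {"id": 1, "input": "blob-1"}
    """
    for i, task in enumerate(tasks):
        if task["input"] in worker_cache:
            return tasks.pop(i)   # cache hit: run it locally, no data fetch
    # No cached input available: take the oldest leftover task instead
    return tasks.pop(0) if tasks else None
```

Because any worker can take any leftover task, the scheme load-balances naturally while still exploiting locality when the data cache allows it.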

  11. Data Transfer
  • Iterative vs. traditional MapReduce:
  •   Iterative computation tasks are finer grained
  •   Intermediate data are relatively smaller
  • Hybrid data transfer based on the use case:
  •   Blob + Table storage based transport
  •   Direct TCP transport
  • Push data from Map to Reduce
  • Optimized data broadcasting
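One way to picture the hybrid choice is a simple policy function. The 64 KB threshold and the exact decision rule are assumptions for illustration; the slides only say the transport is chosen per use case:

```python
def choose_transport(size_bytes, persistent_required, threshold=64 * 1024):
    """Pick a transfer channel for one intermediate-data item (sketch).

    Direct TCP is fastest but non-persistent; Azure Tables suit small
    items, and Blobs suit large ones when durability is required.
    """
    if not persistent_required:
        return "tcp"      # push straight from Map to Reduce
    return "table" if size_bytes <= threshold else "blob"
```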

  12. Fault Tolerance for Iterative MapReduce
  • Iteration level: roll back iterations
  • Task level: re-execute failed tasks
  • Hybrid data communication utilizing a combination of faster non-persistent and slower persistent media:
  •   Direct TCP (non-persistent), with blob uploading in the background
  • Decentralized control avoiding single points of failure
  • Duplicate execution of slow tasks
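The two levels above can be sketched together: each iteration runs against a checkpoint of the previous iteration's state, failed work is re-executed, and if retries are exhausted the iteration rolls back rather than corrupting state. The retry count and function shape are assumptions:

```python
def iterate(initial, step, max_iter=5, attempts=3):
    """Run `step` for `max_iter` iterations with rollback (sketch).

    step(state, k) computes iteration k from the previous state; on
    failure the iteration is re-executed from the checkpoint.
    """
    state = initial
    for k in range(max_iter):
        checkpoint = state                   # state of the last completed iteration
        for _ in range(attempts):
            try:
                state = step(checkpoint, k)  # re-execute the whole iteration
                break
            except Exception:
                state = checkpoint           # roll back to the checkpoint
        else:
            raise RuntimeError(f"iteration {k} failed after {attempts} attempts")
    return state
```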

  13. Collective Communication Primitives for Iterative MapReduce
  • Supports common higher-level communication patterns
  • Performance: the framework can optimize these operations transparently to the user
  • Multi-algorithm: avoids unnecessary steps of traditional and iterative MapReduce
  • Ease of use: users do not have to implement this logic manually (e.g., in Reduce and Merge tasks)
  • Preserves the Map and Reduce APIs
  • AllGather
  • OpReduce: MDS StressCalc, fixed-point calculations, PageRank with a shared PageRank vector, descendant query
  • Scatter: PageRank with a distributed PageRank vector
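The semantics of the AllGather pattern named above can be sketched in a few lines: every task contributes a partial block, and every task receives the concatenation of all blocks, ordered by task id. This in-process model is only a semantic sketch, not the framework's distributed implementation:

```python
def all_gather(partials):
    """AllGather over a dict mapping task id -> partial result block."""
    # Concatenate all partial blocks in task-id order
    ordered = [partials[tid] for tid in sorted(partials)]
    gathered = [x for block in ordered for x in block]
    # Every participant receives the same complete result
    return {tid: gathered for tid in partials}
```

This is the pattern a user would otherwise have to emulate with a hand-written Reduce and Merge that collect and rebroadcast the partial results.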

  14. AllGather Primitive
  • AllGather: MDS BCCalc, PageRank (with in-links matrix)

  15. K-means Clustering
  [Charts: performance with and without data caching, and the speedup gained using the data cache – the first iteration performs the initial data fetch, causing overhead between iterations; task execution time histogram; number of executing map tasks histogram; strong scaling with 128M data points; weak scaling; scaling speedup with increasing number of iterations – scales better than Hadoop on bare metal]

  16. Multi-Dimensional Scaling
  [Diagram: each iteration chains three MapReduce applications – BC: calculate BX; X: calculate invV(BX); calculate stress – each as Map/Reduce/Merge, feeding the new iteration. Charts: data size scaling and weak scaling, with performance adjusted for the sequential performance difference]
  Scalable Parallel Scientific Computing Using Twister4Azure. Thilina Gunarathne, Bingjing Zhang, Tak-Lon Wu, and Judy Qiu. Submitted to the Journal of Future Generation Computer Systems. (Invited as one of the best 6 papers of UCC 2011)

  17. Multi Dimensional Scaling

  18. Performance Comparisons
  [Charts: BLAST sequence search; Smith-Waterman sequence alignment; Cap3 sequence assembly]
  MapReduce in the Clouds for Science. Thilina Gunarathne, et al. CloudCom 2010, Indianapolis, IN.

  19. K-means Clustering Demo
