
MIddleware for Data-intensive Analysis and Science (MIDAS)

This presentation discusses the middleware system MIDAS, which supports analytical libraries by providing resource management, coordination and communication, and file and storage abstractions, to enable data-intensive analysis and science.


Presentation Transcript


  1. MIddleware for Data-intensive Analysis and Science (MIDAS) Shantenu Jha, Andre Luckow, Ioannis Paraskevakos RADICAL, Rutgers http://radical.rutgers.edu

  2. The Convergence of HPC and "Data-Intensive" Computing
  • The convergence is happening at multiple levels: applications, micro-architecture ("near-data computing" processors), macro-architecture (e.g., file systems) and the software environment (e.g., analytical libraries).
  • Objective: bring ABDS capabilities to HPDC.
  • HPC: simple functionality, complex stack, high performance.
  • ABDS: advanced functionality.
  • See "A Tale of Two Data-Intensive Paradigms: Data Intensive Applications, Abstractions and Architectures", in collaboration with Geoffrey Fox (Indiana): http://arxiv.org/abs/1403.1528

  3. MIDAS: Middleware for Data-intensive Analysis and Science
  • When an application is integrated deeply with the infrastructure: great for performance, but bad for extensibility and flexibility.
  • With multiple levels of functionality, indirection and abstraction: performance is often difficult.
  • Challenge: how to find the "sweet spot"?
  • MIDAS aims to be the "neck of the hourglass" for multiple applications and infrastructures.

  4. MIDAS: Middleware for Data-intensive Analysis and Science
  MIDAS is the middleware that supports analytical libraries by providing:
  • Resource management: Pilot-Hadoop for managing ABDS frameworks on HPC.
  • Coordination and communication: Pilot In-Memory for supporting iterative analytical algorithms; addresses heterogeneity at the infrastructure level.
  • File and storage abstractions: flexible and multi-level compute-data coupling.
  • MIDAS must have a well-defined API and semantics that can then be used by applications and the SPIDAL library/layer.

  5. Application Integration with MIDAS & SPIDAL: A Perspective (recap)
  • Type 1: Some applications will require libraries before they need performance/scalability; the advantages are functionality and commonality.
  • Type 2: Some applications are already developed and have the necessary functionality, but are stymied by lack of scalability; these integrate with MIDAS directly for performance.
  • Type 3: Once application libraries have been developed, make them high-performance by integrating the libraries with the underlying capabilities.

  6. MIDAS: Middleware for Data-intensive Analysis and Science
  • MIDAS provides interoperability between ABDS and HPC: a fast track to using Spark and similar frameworks on HPC via its API.
  • MIDAS supports parallelism of applications and SPIDAL that is not currently supported by ABDS (e.g., trajectory analysis, supported in concept but not in practice).
  • MIDAS supports SPIDAL directly, i.e., without ABDS, complementing the capabilities in ABDS; this raises issues of granularity, ease of development, etc.
  • These goals are progressively more difficult; they are the annual objectives.

  7. Part II: Pilot-based Runtime for Data Analytics

  8. 2.1 Introduction: Pilot Abstraction
  Working definition: a system that generalizes a placeholder job to provide multi-level scheduling, allowing application-level control over the system scheduler via a scheduling overlay.
  [Figure: in user space, the application submits to a Pilot-Job system (with its own policies), which places Pilot-Jobs through the system-space resource manager onto Resources A-D.]
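A minimal sketch of the multi-level scheduling idea behind the working definition above: the placeholder job (pilot) acquires resources from the system scheduler once, and the application then schedules its own tasks onto that allocation. All class and method names below are illustrative only, not the actual Pilot-Job API.

```python
# Toy illustration of multi-level scheduling with a pilot; names are
# illustrative, not an actual Pilot-Job API.

class Pilot:
    """Placeholder job: holds a resource allocation obtained from the
    system scheduler (e.g. SLURM) and exposes it to the application."""

    def __init__(self, cores):
        self.cores = cores   # cores acquired once from the system scheduler
        self.queue = []      # application-level task queue

    def submit(self, task, cores=1):
        # Application-level scheduling: tasks are placed onto the pilot's
        # allocation without going back through the system scheduler.
        self.queue.append((task, cores))

    def run(self):
        for task, cores in self.queue:
            print(f"running {task} on {cores} of {self.cores} pilot cores")

# Usage: one batch job (the pilot) serves many small application tasks.
pilot = Pilot(cores=64)
for i in range(4):
    pilot.submit(f"analysis-task-{i}", cores=16)
pilot.run()
```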

  9. 2.1 Motivation: Pilot-Abstraction
  The Pilot-Abstraction provides a well-defined resource management layer for MIDAS:
  • Application-level scheduling is well suited to the fine-grained data parallelism of data-intensive applications.
  • Data-intensive applications are more heterogeneous and thus more demanding with respect to their resource management needs.
  • Application-level scheduling enables the implementation of a data-aware resource manager for analytics applications.
  • It serves as an interoperability layer between Hadoop (the Apache Big Data Stack, ABDS) and HPC.

  10. 2.1 Motivation: Hadoop and Spark
  • De-facto standard for industry analytics.
  • Rich ecosystem with many different analytics tools, e.g. Spark MLlib and H2O (collectively referred to as the Apache Big Data Stack, ABDS).
  • Novel, high-level abstractions: SQL, DataFrames, data pipelines, machine learning.
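As a concrete illustration of those high-level abstractions, the short PySpark snippet below combines the DataFrame and machine-learning APIs; it is a generic Spark example (not MIDAS-specific) and assumes a working Spark installation.

```python
# Generic PySpark example of the high-level abstractions mentioned above
# (DataFrames plus machine learning); not MIDAS-specific.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("abds-example").getOrCreate()

# DataFrame abstraction: a small set of 2-D feature vectors.
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),),
     (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),)],
    ["features"])

# Machine-learning abstraction: cluster the vectors with KMeans.
model = KMeans(k=2, seed=1).fit(df)
print(model.clusterCenters())

spark.stop()
```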

  11. 2.1 HPC and ABDS Interoperability.

  12. 2.2 Pilot-Abstraction on Hadoop

  13. 2.3 Pilot-Hadoop: ABDS on HPC
  • A Pilot-Job is used to manage the Hadoop cluster.
  • The Pilot-Agent is responsible for managing the Hadoop resources: CPU cores, nodes and memory.
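A rough sketch of the Pilot-Hadoop idea: once the HPC batch system has granted a set of nodes to the pilot, the agent bootstraps the Spark/Hadoop daemons on that allocation. The helper below is hypothetical; the file layout and environment variable follow Spark's standalone deployment, but paths and details differ per site and version.

```python
# Hypothetical sketch: the pilot agent starts a standalone Spark cluster on
# the nodes granted by the HPC scheduler. Paths and names are illustrative.
import os
import subprocess

def start_spark_on_allocation(nodes, spark_home="/opt/spark"):
    """Write the allocated node list into Spark's configuration and start
    the standalone master/worker daemons on it."""
    with open(os.path.join(spark_home, "conf", "slaves"), "w") as f:
        f.write("\n".join(nodes[1:]) + "\n")      # first node acts as master
    env = dict(os.environ, SPARK_MASTER_HOST=nodes[0])
    subprocess.run([os.path.join(spark_home, "sbin", "start-all.sh")],
                   env=env, check=True)
    return "spark://%s:7077" % nodes[0]           # master URL for applications

# The node list would normally come from the batch system (e.g. SLURM_NODELIST):
# master_url = start_spark_on_allocation(["node001", "node002", "node003"])
```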

  14. 2.4 Pilot-Memory for Iterative Processing
  • Provides a common API for distributed cluster memory.
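What such a common API might look like is sketched below; the interface and class names are purely illustrative, not the actual Pilot-Memory API. The point is that an iterative kernel programs against one small interface while the backend (e.g. Redis, Spark RDDs, a plain in-process store) can be swapped underneath.

```python
# Hypothetical sketch of a common in-memory API over different backends;
# class and method names are illustrative, not the actual Pilot-Memory API.
class InMemoryBackend:
    """Minimal interface an iterative analytics kernel needs: put a named
    dataset into cluster memory and get its partitions back."""
    def put(self, name, partitions):
        raise NotImplementedError
    def get(self, name):
        raise NotImplementedError

class LocalDictBackend(InMemoryBackend):
    """Trivial single-process stand-in; a real backend could wrap Redis or
    Spark RDDs behind the same interface."""
    def __init__(self):
        self._store = {}
    def put(self, name, partitions):
        self._store[name] = list(partitions)
    def get(self, name):
        return self._store[name]

# The analysis code only sees the interface, so backends are interchangeable.
backend = LocalDictBackend()
backend.put("points", [[(0.0, 0.0), (1.0, 1.0)], [(9.0, 8.0)]])
for partition in backend.get("points"):
    print(partition)
```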

  15. 2.5 Abstraction in Action
  1. Run Spark or Hadoop on a local machine, HPC or cloud resource.
  2. Get seamless access to native Spark features and libraries.
  3. Use the Pilot-Data API.
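A hedged sketch of steps 1 and 2 from the application side: once Spark is available on the target resource, the application uses native PySpark directly; nothing below is specific to the Pilot-Data API.

```python
# Once Spark is up on the target resource (local machine, HPC allocation or
# cloud), the application uses native PySpark directly. A pilot-launched
# cluster would supply its own master URL; "local[4]" is used here only so
# the snippet runs stand-alone.
from pyspark import SparkContext

sc = SparkContext(master="local[4]", appName="pilot-spark-demo")

rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.map(lambda x: x * x).sum())   # native Spark transformation/action

sc.stop()
```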

  16. 3. Validation
  3.1 Overhead of the Pilot-Abstraction
  3.2 HPC vs. ABDS Filesystems
  3.3 Pilot-Data on Different Backends
  3.4 KMeans on Pilot-Memory

  17. 3.1 Pilot-Abstraction Overhead

  18. 3.2 HPC vs. ABDS Filesystems
  Lustre vs. HDFS on up to 32 nodes on Stampede:
  • Lustre is good for medium-sized data.
  • Writes are faster on Lustre; the gap decreases with data size.
  • Parallel reads are faster with HDFS.
  • The HDFS in-memory option provides a slight advantage.

  19. 3.3 Pilot-Data on Different Backends
  Managing heterogeneous HDFS backends with Pilot-Data on different XSEDE resources.

  20. 3.4 KMeans on Pilot-Memory

  21. 4. Conclusion and Future Work
  • Big Data applications are very heterogeneous.
  • A complex infrastructure landscape with many layers of scheduling requires higher-level abstractions for reasoning.
  • Next steps:
  • Applications: graph analytics (Leaflet Finder).
  • Application profiling and scheduling.
  • Work-in-progress paper: http://arxiv.org/abs/1501.05041

  22. Part III: All-pairs Hausdorff Comparison Acknowledgement: Collaboration with Oliver Beckstein and team.

  23. Problem Definition
  • Calculate the geometric similarity between 192 all-atom trajectories of a protein structure.
  • The geometric similarity is computed using the Hausdorff distance between two trajectories.
  • The Hausdorff distance is the greatest of all the distances from a point in one trajectory to the closest point in the other.
  • Each trajectory file is 4 MB in size and contains an array of size T*3N, where T is the number of time steps and 3N holds the positions of the N atoms in space.
  • Runs use the original trajectories (short) as well as double-length (medium) and quadruple-length (long) versions.
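The symmetric Hausdorff distance described above can be computed with SciPy; below is a minimal sketch for two trajectories stored as T x 3N NumPy arrays, treating each frame as a point in 3N-dimensional configuration space (one plausible reading of the slide; the production analysis may use a different per-frame metric).

```python
# Minimal sketch of the symmetric Hausdorff distance between two trajectories,
# each stored as a (T, 3N) NumPy array so every frame is a point in
# 3N-dimensional configuration space.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(traj_a, traj_b):
    """Greatest distance from a frame in one trajectory to the closest
    frame in the other, symmetrized over both directions."""
    d_ab = directed_hausdorff(traj_a, traj_b)[0]
    d_ba = directed_hausdorff(traj_b, traj_a)[0]
    return max(d_ab, d_ba)

# Toy example: two short "trajectories" of 100 frames, 10 atoms each (3N = 30).
rng = np.random.default_rng(0)
a = rng.normal(size=(100, 30))
b = rng.normal(size=(100, 30)) + 0.5
print(hausdorff(a, b))
```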

  24. All-Pairs Pattern
  The All-Pairs pattern provides a template for a comparison that is applied to all unique combinations of the elements of a set.
  [Figure: element_initialization generates the set elements; one element_comparison task is created for each unique pair, e.g. (1st, 2nd), (1st, 3rd), ..., (N-1th, Nth).]

  25. All-Pairs Pattern Implementation
  • Initially there are N(N-1)/2 unique comparisons, where N is the number of elements of the set; each comparison defines a task.
  • Map the initial set to a smaller set with k = N/n1 elements, where n1 is a divisor of N, by grouping the trajectories into blocks of n1.
  • Use the All-Pairs pattern over the new set. The number of tasks is k(k+1)/2; each task performs the comparisons between two blocks of n1 elements of the initial set.
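The grouping step is easy to check numerically: with N = 192 and n1 = 12 there are k = 16 blocks, k(k+1)/2 = 136 tasks, and together they cover all N(N-1)/2 = 18,336 unique pairs. The sketch below (function names are illustrative) enumerates the tasks and verifies both counts.

```python
# Sketch of the All-Pairs grouping: map N trajectories into k = N/n1 blocks
# and create one task per unique pair of blocks (a block paired with itself
# included), giving k(k+1)/2 tasks covering all N(N-1)/2 unique comparisons.
from itertools import combinations_with_replacement

def make_tasks(n_elements, block_size):
    blocks = [list(range(i, i + block_size))
              for i in range(0, n_elements, block_size)]
    return list(combinations_with_replacement(blocks, 2))

tasks = make_tasks(192, 12)
print(len(tasks))                               # 136 tasks

# Count the unique comparisons covered by all tasks.
total = 0
for block_a, block_b in tasks:
    if block_a is block_b:
        total += len(block_a) * (len(block_a) - 1) // 2   # within-block pairs
    else:
        total += len(block_a) * len(block_b)              # cross-block pairs
print(total)                                    # 18336 == 192 * 191 // 2
```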

  26. Experiment Setup
  • The initial set contains 192 trajectories, giving a total of 18,336 unique comparisons.
  • With n1 = 12 we create 136 tasks; each task calculates the Hausdorff distances between two blocks of 12 trajectories.
  • Execute using RADICAL-Pilot on Stampede with 16, 32, 64 and 128 cores and measure the time to completion.

  27. Time to Completion @ Stampede

  28. Conclusions and Future Work
  • Balanced the workload of each task in order to increase task-level parallelism.
  • Able to provide linear speedup.
  • Next steps:
  • Ongoing experimentation to find the dependency on n1.
  • Compare with an ABDS method? If so, which one?

  29. Thank you
