

Presentation Transcript


  1. Distributed Applications: Examining the Past, Understanding the Present, Preparing for the Future(Grid)
  Shantenu Jha, Director, Cyber-Infrastructure Development, CCT; Computer Science; e-Science Institute, Edinburgh
  http://www.cct.lsu.edu/~sjha | http://saga.cct.lsu.edu

  2. Outline
  • Critical Perspective on Large-Scale Distributed Applications and Production Cyber-Infrastructure (CI)
  • Understanding Distributed Applications (DA)
  • How DA differ from HPC or parallel applications; the challenges of DA
  • DA Development Objectives (IDEAS)
  • Understanding SAGA
  • Using SAGA to develop Distributed Applications
  • Frameworks
  • Abstractions for Dynamic Execution
  • Data-Intensive Applications
  • Discuss how the IDEAS objectives are met
  • Derive (Initial) User Requirements/Requests for FutureGrid

  3. Critical Perspectives
  • Distributed CI: Is the whole greater than the sum of the parts?
  • Several BIG projects have success stories on TG
  • But REAL science happens at ALL SCALES
  • Are there tools for individual users to innovate and develop?
  • Infrastructure capabilities and policy determine application development, deployment and execution:
  • The proportion of applications that utilize multiple distributed sites sequentially, concurrently or asynchronously is low (~5%)
  • Not referring to tightly-coupled runs across multiple sites
  • TG has (exclusively) supported legacy, static execution models
  • Move data to computing → Compute where the data is?
  • Distributed Data/Jobs vs Bringing it all into the Cloud
  • What novel applications & science has Distributed CI fostered?

  4. Understanding Distributed Applications: Development Challenges
  • Fundamentally a hard problem:
  • Dynamic, heterogeneous resources
  • Variable control (or lack thereof)
  • Add to it: complex underlying infrastructure provisioning
  • Programming systems for Distributed Applications:
  • Incomplete? Customization? Extensibility?
  • Computational models of distributed computing
  • Design points: more than (peak) performance
  • Primary role of usage modes
  • Wide range of DA, no clear taxonomy

  5. Understanding Distributed Applications: Development Challenges
  • Distributed Applications require:
  • Coordination over multiple & distributed sites:
  • Scale-up and scale-out
  • Logically or physically distributed
  • 1st generation of Peta/Exa/Zetta/Yotta-scale applications requiring multiple runs, ensembles, workflows..
  • Core characteristics and challenges of logically and physically distributed applications are the SAME
  • Inter-play of Requirements, Infrastructure, Usage Mode
  The ability to develop simple, novel or effective distributed applications lags behind other aspects of CI. General-purpose distributed application development is lacking in NSF/OCI's portfolio…

  6. Understanding Distributed Applications: Development Objectives
  • Interoperability: Ability to work across multiple distributed resources
  • Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently
  • Extensibility: Support for new patterns/abstractions, different programming systems, functionality & infrastructure
  • Adaptivity: Response to fluctuations in dynamic resources and the availability of dynamic data
  • Simplicity: Accommodate the above distributed concerns at different levels, easily…
  Challenge: How to develop DA effectively and efficiently with the above as first-class objectives?

  7. SAGA: Basic Philosophy
  • There is a lack of programmatic approaches that:
  • Provide general-purpose, common grid functionality for applications, and thus hide underlying complexity and varying semantics..
  • Provide the building blocks upon which to construct "consistent" higher levels of functionality and abstraction
  • Hide "bad" heterogeneity, while providing the means to address "good" heterogeneity
  • Meet the needs of a broad spectrum of applications:
  • Simple scripts, Gateways, Smart Applications, Production-Grade Tooling, Workflow…
  • Simple, integrated, stable, uniform and high-level interface
  • Simple and Stable: 80:20 restricted scope, and a Standard
  • Integrated: Similar semantics & style across packages
  • Uniform: Same interface for different distributed systems
  • SAGA provides application* developers with the basic units required to compose high-level functionality across (distinct) distributed systems
  (*) One person's application is another person's tool

  8. SAGA: The Standard Landscape

  9. SAGA: In a thousand words..

  10. SAGA: Job Submission and the Role of Adaptors (middleware binding)
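The adaptor idea on this slide can be sketched in a few lines: one uniform API call is bound at runtime to different middleware depending on the URL scheme. This is an illustrative stand-in, not the actual SAGA adaptor machinery; the adaptor classes and messages are hypothetical.

```python
# Sketch of middleware binding via adaptors: the JobService interface
# stays the same, while the backend is picked from the URL scheme.
from urllib.parse import urlparse

class ForkAdaptor:
    """Hypothetical local-execution adaptor."""
    def submit(self, executable):
        return f"fork: started {executable} locally"

class GramAdaptor:
    """Hypothetical Globus GRAM adaptor."""
    def submit(self, executable):
        return f"gram: submitted {executable} via Globus GRAM"

ADAPTORS = {"fork": ForkAdaptor, "gram": GramAdaptor}

class JobService:
    """One uniform interface; the middleware binding happens here."""
    def __init__(self, url):
        scheme = urlparse(url).scheme
        self._adaptor = ADAPTORS[scheme]()   # late binding by scheme

    def submit(self, executable):
        return self._adaptor.submit(executable)

print(JobService("fork://localhost").submit("/bin/date"))
print(JobService("gram://qb.loni.org").submit("/bin/date"))
```

The application code above never names a middleware system; swapping `fork://` for `gram://` changes the backend without changing the program, which is the "hide bad heterogeneity" point of the previous slide.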

  11. SAGA Job API: Example
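The example slide did not survive extraction, so here is a runnable sketch of the canonical SAGA job API call sequence (Service, Description, create_job, run, wait). The classes below are minimal local stand-ins that mimic the lifecycle; the real API lives in the SAGA C++/Python bindings.

```python
# SAGA-style job lifecycle sketch: build a Description, get a Job from
# a Service, then run/wait and inspect the state. Local stand-ins only.
import subprocess
import sys

class Description:
    def __init__(self):
        self.executable = None
        self.arguments = []

class Service:
    def __init__(self, url):
        self.url = url                    # e.g. "fork://localhost"
    def create_job(self, jd):
        return Job(jd)

class Job:
    def __init__(self, jd):
        self.jd, self.state = jd, "New"
    def run(self):
        self.state = "Running"
        self._proc = subprocess.Popen([self.jd.executable] + self.jd.arguments)
    def wait(self):
        self._proc.wait()
        self.state = "Done" if self._proc.returncode == 0 else "Failed"

js = Service("fork://localhost")
jd = Description()
jd.executable = sys.executable            # portable: run Python itself
jd.arguments = ["-c", "print('hello from a SAGA-style job')"]
job = js.create_job(jd)
job.run()
job.wait()
print(job.state)
```

With the real library, only the three class names change; the call sequence is the part the API standardizes across middleware.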

  12. SAGA: Other Packages

  13. SAGA and Distributed Applications

  14. SAGA-based Frameworks: Types
  • Frameworks: Logical structure for capturing application requirements, characteristics & patterns
  • Runtime and/or Application Frameworks
  • Application Frameworks are designed around either:
  • Patterns: Commonly recurring modes of computation
  • Programming, deployment, execution, data-access..
  • MapReduce, Master-Worker, H-J Submission
  • Abstractions: Mechanisms to support patterns and application characteristics
  • Runtime Frameworks:
  • Load-balancing – compute and data distribution
  • SAGA-based frameworks are infrastructure-independent
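The Master-Worker pattern named above can be made concrete with a stdlib-only sketch: a master fills a task queue, a fixed pool of workers drains it, and a poison pill shuts each worker down. In a SAGA-based framework the workers would be remote jobs; here they are threads.

```python
# Minimal master-worker sketch: queue-fed workers, poison-pill shutdown.
import queue
import threading

def worker(tasks, results):
    while True:
        n = tasks.get()
        if n is None:              # poison pill: this worker is done
            break
        results.put(n * n)         # stand-in for real work

tasks, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for w in workers:
    w.start()

for n in range(10):                # master distributes the work
    tasks.put(n)
for _ in workers:                  # one pill per worker
    tasks.put(None)
for w in workers:
    w.join()

print(sorted(results.queue))       # results arrive in any order
```

The point of the framework layer is that only `worker` and the task payload change between applications; the coordination skeleton stays fixed.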

  15. Abstractions for Dynamic Execution (1): Container Task
  Adaptive replica-exchange:
  • Type A: Fix the number of replicas; vary the cores assigned to each replica
  • Type B: Fix the size of each replica; vary the number of replicas (Cool Walking)
  -- Same temperature range (adaptive sampling)
  -- Greater temperature range (enhanced dynamics)
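The two adaptive policies reduce to a small allocation calculation, sketched below. The policy names follow the slide; the core counts are illustrative.

```python
# Type A vs Type B adaptivity as allocation functions returning
# (number_of_replicas, cores_per_replica) for a given core budget.
def type_a(total_cores, n_replicas):
    """Type A: replica count fixed; cores per replica vary."""
    return n_replicas, total_cores // n_replicas

def type_b(total_cores, cores_per_replica):
    """Type B: replica size fixed; replica count varies."""
    return total_cores // cores_per_replica, cores_per_replica

print(type_a(256, 8))    # 8 replicas share the 256 cores
print(type_b(256, 16))   # as many 16-core replicas as fit
```

Type A trades per-replica speed against a fixed sampling width; Type B keeps per-replica cost fixed and widens (or narrows) the temperature range as resources fluctuate.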

  16. Abstractions for Dynamic Execution (2): SAGA Pilot-Job (BigJob)

  17. Coordinate Deployment & Scheduling of Multiple Pilot-Jobs
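The coordination problem on this slide can be sketched simply: each pilot is a placeholder job that has already acquired cores on some resource, and application sub-tasks are placed into whichever pilot has free capacity, without another trip through the batch queue. The resource names and sizes below are illustrative.

```python
# Greedy sub-task scheduling across multiple pilots.
class Pilot:
    def __init__(self, resource, cores):
        self.resource, self.free = resource, cores
        self.tasks = []

def schedule(task_cores, pilots):
    """Place a sub-task on the first pilot with enough free cores."""
    for p in pilots:
        if p.free >= task_cores:
            p.free -= task_cores
            p.tasks.append(task_cores)
            return p.resource
    return None            # no capacity anywhere: the task must wait

pilots = [Pilot("ranger", 64), Pilot("qb", 32)]
placements = [schedule(c, pilots) for c in [48, 32, 16, 32]]
print(placements)          # ['ranger', 'qb', 'ranger', None]
```

A real coordinator would also grow or shrink the pilot set as load changes; the decoupling of resource acquisition (pilots) from task execution (sub-tasks) is the essential idea.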

  18. Distributed Adaptive Replica Exchange (DARE): Scale-Out, Dynamic Resource Allocation and Aggregation

  19. Multi-Physics Runtime Frameworks: Extensibility
  • Coupled multi-physics requires two distinct, but concurrent, simulations
  • Can co-scheduling be avoided? With an adaptive execution model: Yes
  • Load-balancing is required, and the Pilot-Job facilitates it!
  • Across sites? (open question)
  • First demonstrated multi-platform Pilot-Job: MPI-based TG and Condor GI

  20. Dynamic Execution: Reduced Time to Solution

  21. Ensemble Kalman Filters: Heterogeneous Sub-Tasks
  • Ensemble Kalman filters (EnKF) are recursive filters for handling large, noisy data; we use the EnKF for history matching and reservoir characterization
  • EnKF is a particularly interesting case of irregular, hard-to-predict run-time characteristics

  22. Results: Scale-Out Performance (Khamra & Jha, GMAC, ICAC’09)
  • Using more machines decreases the TTC and the variation between experiments
  • Using BQP decreases the TTC & variation between experiments further
  • The lowest time-to-completion is achieved when using BQP and all available resources
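The mechanism behind the BQP result can be sketched in one function: choose the resource that minimizes predicted queue wait plus estimated runtime, rather than runtime alone. The timings below are made up for illustration, not BQP output.

```python
# Queue-wait-aware resource selection, the idea underlying the BQP
# experiments: total time-to-completion = predicted wait + runtime.
def pick_resource(predicted_wait_s, est_runtime_s):
    """Both arguments map resource name -> seconds."""
    return min(predicted_wait_s,
               key=lambda r: predicted_wait_s[r] + est_runtime_s[r])

wait = {"ranger": 3600, "abe": 300, "qb": 900}   # illustrative waits
run  = {"ranger": 600,  "abe": 1200, "qb": 800}  # illustrative runtimes
print(pick_resource(wait, run))
```

Note that the fastest machine (ranger, 600 s runtime) loses here because of its queue: wait-time prediction is what lets the scheduler avoid that trap, which is why BQP reduces both the mean TTC and its variance.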

  23. But Why Does BQP Help? The Case for System Sensors

  24. Autonomic Integration of HPC Grids-Clouds: EnKF Extensibility and Interoperability (work with M. Parashar et al., accepted for e-Science 2009)
  • Application Objectives:
  • Acceleration
  • Resilience
  • Conservation
  • Pull vs Push task map

  25. Application-Level Interoperability: Cloud-Cloud; Cloud-Grid
  • Application-level (ALI) vs. System-level Interoperability (SLI)
  • Infrastructure independence is a pre-requisite for ALI
  • The case for both Grids AND Clouds:
  • Hybrid & heterogeneous workloads: data-compute affinity differs
  • Availability zones, data-transfer costs..
  • Complex data-flow dependencies: need runtime determination
  • Just because you can use Grids AND Clouds, should you? Important research question: When should you?
  • Runtime decision: what mechanism determines when/if?
  • Should be influenced by application objectives
  • The programming model should be infrastructure-independent
  • Same application, priced differently, for the same performance
  • Same application, priced the same, for different performance
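The runtime decision the slide calls for can be sketched as a function of the application objective. The three objectives (acceleration, conservation, resilience) come from the previous slide; the selection rules are a deliberate simplification, not the published decision logic.

```python
# Objective-driven grid-vs-cloud selection at runtime.
def choose_infrastructure(objective, deadline_s, grid_wait_s,
                          cloud_cost, budget):
    if objective == "acceleration":
        # cloud instances start almost immediately; use the grid only
        # if its predicted queue wait still fits the deadline
        return "grid" if grid_wait_s <= deadline_s else "cloud"
    if objective == "conservation":
        # conserve money: avoid the cloud when it would bust the budget
        return "grid" if cloud_cost > budget else "cloud"
    if objective == "resilience":
        return "grid+cloud"        # replicate across both, keep first result
    return "grid"

print(choose_infrastructure("acceleration", 600, 3600, 1.0, 5.0))
print(choose_infrastructure("resilience", 600, 0, 1.0, 5.0))
```

Because the decision lives above an infrastructure-independent programming model, the same application code runs unchanged whichever branch is taken; only the pricing and performance differ, as the last two bullets note.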

  26. SAGA-based Frameworks: Examples
  • SAGA-based Pilot-Job Framework (FAUST)
  • Extended to support load-balancing for multi-component applications
  • SAGA MapReduce Framework:
  • Controls the distribution of tasks (workers)
  • Master-Worker: file-based and/or stream-based
  • Data-locality optimization using SAGA's replica API
  • SAGA NxM Framework:
  • Compute matrix elements, each of which is a task
  • All-to-all sequence comparison
  • Controls the distribution of tasks and data
  • Data-locality optimization via an external (runtime) module

  27. Distributed Data-Intensive Applications: Research Challenges
  • Goal: Develop DDI scientific applications that utilize a broad range of distributed systems, without vendor lock-in or disruption, yet with the flexibility and performance that scientific applications demand
  • Frameworks as possible solutions:
  • Frameworks address some primary challenges in developing distributed DI applications
  • Coordination of distributed data & computing
  • Runtime (dynamic) scheduling, placement
  • Fault-tolerance
  • Many challenges in developing such frameworks:
  • What are the components? How are they coupled? How is functionality expressed/exposed? Coordination?
  • Layering, ordering, encapsulation of components
  • "Just because you can't use MPI (on distributed systems), doesn't mean you can't use other approaches"

  28. Frameworks: Logical ordering SAGA

  29. Frameworks: Logical ordering

  30. SAGA-MapReduce (Miceli, Jha et al., CCGrid’09; Merzky, Jha et al., GPC’09)
  • Interoperability: use multiple infrastructures concurrently
  • Control the NW placement
  • Simple staging of data
  • SAGA-Sphere-Sector:
  • Open Cloud Consortium
  • Stream-processing model
  • Ongoing work
  • Apply to all elements (files) in a data-set (stream)
  Ts: time-to-solution, including data-staging, for SAGA-MapReduce (simple file-based mechanism)
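To make the pattern behind SAGA-MapReduce concrete, here is a self-contained word count: map each input chunk to (word, 1) pairs, shuffle by key, reduce by summing. In SAGA-MapReduce the map and reduce workers would be SAGA jobs, possibly on several backends at once; here they are plain function calls.

```python
# Minimal MapReduce word count: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in this chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

chunks = ["the grid the cloud", "the cloud"]          # two input "files"
pairs = [p for chunk in chunks for p in map_phase(chunk)]
print(reduce_phase(shuffle(pairs)))  # {'the': 3, 'grid': 1, 'cloud': 2}
```

The interoperability claim on the slide amounts to running the map workers on one infrastructure and the reduce workers on another, with the same three-phase structure unchanged.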

  31. Controlling Relative Compute-Data Placement

  32. SAGA All-Pairs: Runtime Data Placement
  • Classical: Place tasks on 4 LONI machines (512-processor Dell clusters)
  • Simple data staging
  • "Intelligent": Map a task to a resource based upon cost
  • Cost = data dependency + transfer times (latency)
  • "Ignoring intelligent mapping is no longer an option" (quote from Miceli, an undergraduate)
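The cost model on the slide can be sketched directly: map each task to the resource minimizing (transfer time if the data is not already there) plus latency. The cost formula follows the slide; the resource names and all timings below are illustrative.

```python
# Cost-based task-to-resource mapping: data-local placement wins.
def placement_cost(task_data, resource, transfer_s, latency_s, holds):
    """Cost = data-dependency term (transfer if absent) + latency."""
    transfer = 0 if task_data in holds[resource] else transfer_s[resource]
    return transfer + latency_s[resource]

def place(task_data, resources, transfer_s, latency_s, holds):
    return min(resources,
               key=lambda r: placement_cost(task_data, r,
                                            transfer_s, latency_s, holds))

resources = ["eric", "oliver"]                 # two illustrative clusters
holds     = {"eric": {"seqA"}, "oliver": set()}  # who already has the data
transfer  = {"eric": 120, "oliver": 120}         # seconds to stage seqA
latency   = {"eric": 5, "oliver": 2}
print(place("seqA", resources, transfer, latency, holds))
```

Even though "oliver" has the lower latency, "eric" wins because the data dependency is already satisfied there, which is exactly the behaviour the "intelligent" mapping delivers over classical staging.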

  33. Understanding Distributed Applications: Development Objectives Redux
  • Interoperability: Ability to work across multiple distributed resources
  • SAGA: middleware-agnostic
  • Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently
  • Support for multiple Pilot-Jobs: Ranger, Abe, QB
  • Extensibility: Support for new patterns/abstractions, different programming systems, functionality & infrastructure
  • Pilot-Job also coupled CFD-MD; integrated BQP
  • Adaptivity: Response to fluctuations in dynamic resources and the availability of dynamic data
  • Simplicity: Accommodate the above distributed concerns at different levels, easily…

  34. Does SAGA Provide A Fresh Perspective?

  35. Early User: An Environment that Supports
  • Echo what Andrew Grimshaw said!!
  • e.g., a test-bed for Standards interoperation
  • Trivial remarks:
  • Not obsessed with system utilization like TG
  • Policies that support IDEAS as first-class concerns
  • Support for dynamic, first-principles, explicitly distributed applications
  • Dynamic, adaptive applications:
  • Dynamic resource utilization:
  • e.g., BQP (Jha et al., GMAC, ICAC Barcelona 2009)
  • Grid Observatory (EGEE) – all kinds of traces
  • Dynamic, adaptive data:
  • Network-aware applications (Jha et al., IEEE eScience ’07)
  • Data scheduler: big data, frequent data

  36. Early User: An Environment that Supports
  • Autonomic computational science applications
  • Support for the tuning of, and by, applications
  • A platform for developing (SAGA) application and runtime frameworks
  • Design, stand up, and experiment with frameworks
  • e.g., a load-balancer for dynamic resource allocation
  • SAGA-MapReduce, NxM
  • e.g., control the relative placement of data/compute
  • Supporting distributed abstractions – at the development, deployment and execution levels
  • A controlled but realistic environment
  • RAIN – dynamic provisioning (provide a clean API)
  • (Reproducible) Experiment Manager, VAMPIR
  • [Connection with the Grid Observatory]

  37. SAGA-based Tools and Projects: One person's tool is another person's application
  • DESHL: DEISA-based shell and workflow library
  • JSAGA from IN2P3 (Lyon): http://grid.in2p3.fr/jsaga/index.html
  • GANGA-DIANE: gLite
  • XtreemOS (based upon SAGA for the distribution)
  • NAREGI/KEK:
  • SD Specification
  • With gLite adaptors
  The advantage of Standards

  38. Acknowledgements
  SAGA Team and DPA Team, and the UK-EPSRC (UK EPSRC: DPA, OMII-UK, OMII-UK PAL)
  People:
  SAGA D&D: Hartmut Kaiser, Ole Weidner, Andre Merzky, Joohyun Kim, Lukasz Lacinski, João Abecasis, Chris Miceli, Bety Rodriguez-Milla
  SAGA Users: Andre Luckow, Yaakoub el-Khamra, Kate Stamou, Cybertools (Abhinav Thota, Jeff, N. Kim), Owain Kenway
  Google SoC: Michael Miceli, Saurabh Sehgal, Miklos Erdelyi
  Collaborators and Contributors: Steve Fisher & Group, Sylvain Renaud (JSAGA), Go Iwai & Yoshiyuki Watase (KEK)
  DPA: Dan Katz, Murray Cole, Manish Parashar, Omer Rana, Jon Weissman

  39. Abstractions for Distributed Applications and Systems: A Computational Science Perspective
  Authors: S. Jha, D. Katz, M. Parashar, O. Rana, J. Weissman
  Upcoming book by Wiley (Summer 2010)

  40. SAGA: Building the abstractions to Bridge the Infrastructure-Applications Gap Focus on Application Development and Characteristics, not infrastructure details

  41. Interoperability

  42. DAG-based Workflow Applications: Extensibility Approach
  • Application Development Phase
  • Generation & Execution Planning Phase
  • Execution Phase

  43. SAGA-based DAG Execution: Preserving Performance
