1 / 76

SAGA: An Overview

SAGA: An Overview. Shantenu Jha for the SAGA Team http:// saga.cct.lsu.edu. Overview. The SAGA Philosophy A Fresh Perspective on Distributed Applications and CI SAGA in a Nutshell SAGA Landscape Individual APIs OGF standard SAGA in action Applications

brick
Download Presentation

SAGA: An Overview

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. SAGA: An Overview Shantenu Jha for the SAGA Team http://saga.cct.lsu.edu

  2. Overview • The SAGA Philosophy • A Fresh Perspective on Distributed Applications and CI • SAGA in a Nutshell • SAGA Landscape • Individual APIs • OGF standard • SAGA in action • Applications • Tools, Frameworks, Gateways, Access Layers.. • Uptake and Roadmap

  3. Ability to develop simple, novel or effective distributed applications lags behind other aspects of CI • Distributed CI: Is the whole > than the sum of the parts? • Infrastructure capabilities (tools, programming systems) and policy determine applications, type development & execution: • Proportion of App. that utilize multiple distributed sites sequentially, concurrently or asynchronously is low • Not referring to tightly-coupled across multiple-sites • Focus on extending legacy, static execution models • Scale-Out of Simulations? Compute where the data is? • What novel applications & science has Distributed CI fostered • Distinguish challenges of provisioning Distributed CI versus support for application development Critical Perspectives

  4. Critical PerspectivesQuick Analysis • Several Factors responsible for perceived & actual lack of DA • Developing Distributed Applications is fundamentally hard! • Coordination across multiple distinct resources • Range of tools, prog. systems and environments large • Interoperability and extensibility become difficult • Commonly accepted abstractions not available • E.g. Pilot-Job powerful, but no “unifying” tool on TG • Deployment and execution challenges disjoint from the development process • Generally good idea, but application development often influences where and how it can be deployed/executed

  5. Distributed Applications Development Challenges • Developing Distributed Applications is fundamentally hard • Intrinsic: • Design Points: Dynamical and Heterogeneous resources and Variable Control (or lack thereof) • Coordination over Multiple & Distributed sites • Scale-up and Scale-out • Models of Distributed Applications: • More than (peak) performance • Primary role of Usage Modes • Extrinsic: • (Complex) Underlying infrastructure & its provisioning • Programming systems, tools and environments

  6. Distributed ApplicationsDevelopment Challenges • Dist. Application and Programming Systems and Tools • Incompleteness and/or out-of-phase: • Need X and Y, but only X or Y available, • e.g., Master-Worker paradigm supported, but no FT. • Customization: • Works well with tool A but not B, • e.g., Pegasus-DAGMAN-Condor • Robustness and Scalability: • Works well in small or controlled environment, but not at-Scale • e.g., SAGA–based Montage

  7. Distributed Applications and PGILessons from Applications • There exist distributed applications that aim to utilize multiple resources: • Complex Coordination, Data Mgmt and FT issues • Complex structures at different phases • Emerging Infrastructure present operational challenges • Don’t always provide the application requirement • Missing abstractions • Development, Deployment and Execution • e.g. Coding against low-level middleware • Issues of policy and infrastructure design decisions • e.g. Co-allocation supported or not • e.g. Narrow versus broad Grids

  8. Distributed Application and PGI (2) • Lack of explicit support for addressing these challenges becomes self-fulfilling and self-perpetrating • Most PGI not designed to support distributed applications, increasing effort to develop/deploy/execute • Small number of distributed applications on TG leads to focus on single-site applications • (Ironically) Most applications have been developed to hide from heterogeneity and dynamism; not embrace them • Good heterogeneity vs Bad heterogeneity • Dynamism: Performance Advantages • Applications have been brittle and not extensible: • Tied to specific tool or prog. system (& thus PGI)

  9. Distributed Application and PGI (3) • Successful Evolution of Supported Capabilities and PGI • OSG support of HTC • Condor from scavenging system to building block • Condor Flocking • TeraGrid and Gateways (User-level Abstraction) • Several new capabilities for new communities • Not so Successful Evolution of Capabiliites on PGI • Co-scheduling on PGI • Both technical and policy issues • Scale-out of DAGs/Loosely-coupled ensembles • Execution is logically or physically distributed

  10. Distributed Application and PGI (4) • Supporting applications that have “come-of-age” useful • Coupling real-time simulations to live sensor data (LEAD) • Currently neither OSG nor TG can support DDDAS • Often the underlying infrastructure and capabilities change quicker than application • Infrastructure: Grids and Clouds • Many challenges remain the same, e.g., requirements of distribution coordination of data and computation • Role for abstractions • Applications still around: SFExpress, Netsolve->GridSolve

  11. Distributed Application and PGI (5)Miscellaneous Factors • Role of Funding Agencies: • Changed vision, funding landscape • TG: Run anywhere with tools/prog systems to support primarily static HPC applications • Focus on static as opposed to distributed scale-out robust workflows • http://www.cct.lsu.edu/~sjha/presentations/panel_discussion/punch_counterpunch.pdf • Role of Standard: • Lack of standards impedes interoperation • No standards; • Chicken-and-egg situation • Simple standards have been effective, but with limited impact, eg., troika of JSDL, HPC-BP, OGSA-BES

  12. Distributed Applications How do they differ from traditional HPC applications? PGI design needs to be reflect some or all of these: • Performance Models: • Not just “peak utilization”; e.g., HPC & HTC (# of jobs) • Usage Modes: • The same application has multiple usage modes • How applications are developed, deployed and executed is often determined by the infrastructure • Static vs Dynamic Execution: • Static applications is not enough; varying resource conditions, application requirements • Skillful Decomposition vs Aggregation • Primacy of Coordination across distributed resources

  13. Understanding Distributed ApplicationsIDEAS: First Principles Development Objectives • Interoperability:Ability to work across multiple distributed resources • Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently • Extensibility:Support new patterns/abstractions, different programming systems, functionality & Infrastructure • Adaptivity:Response to fluctuations in dynamic resource and availability of dynamic data • Simplicity:Accommodate above distributed concerns at different levels easily… Challenge: How to develop DA effectively and efficiently with the above as first-class objectives?

  14. Overview • The SAGA Philosophy • A Fresh Perspective on Distributed Applications and CI • SAGA in a Nutshell • SAGA Landscape • Individual APIs • OGF standard • SAGA in action • Applications • Tools, Frameworks, Gateways, Access Layers.. • Uptake and Roadmap

  15. SAGA: In a nutshell • There exists a lack of Programmatic approaches that: • Provide general-purpose, basic & common grid functionality for applications and thus hide underlying complexity, varying semantics.. • The building blocks upon which to construct “consistent” higher-levels of functionality and abstractions • Hides “bad” heterogeneity, means to address “good” heterogeneity • Meets the need for a Broad Spectrum of Application: • Simple scripts, Gateways, Smart Applications and Production Grade Tooling, Workflow… • Simple, integrated, stable, uniform and high-level interface • Simple and Stable: 80:20 restricted scope and Standard • Integrated: Similar semantics & style across • Uniform: Same interface for different distributed systems • SAGA: Provides Application* developers with units required to compose high-level functionality across (distinct) distributed systems • (*) One Person’s Application is another Person’s Tool

  16. SAGA: The Standard Landscape Text

  17. SAGA: Specification Landscape Blue lines show which packages have input in the Experience document

  18. SAGA: In a thousand words..

  19. SAGA: Job Submission Role of Adaptors (middleware binding)‏ SAGA: Role of Adaptors Text

  20. SAGA Job API: Example

  21. SAGA Job Package

  22. SAGA File Package

  23. File API: Example

  24. SAGA Advert

  25. SAGA Advert API: Example

  26. SAGA: Other Packages

  27. SAGA Task Model • All SAGA objects implement the task model • Every method has three “flavors” • Synchronous version - the implementation • Asynchronous version - synchronous version wrapped in a task (thread) and started • Task version - synchronous version wrapped in a task but not started (task handle returned) • Adaptor can implement own async. version

  28. SAGA Implementation: Extensibility • Horizontal Extensibility – API Packages • Current packages: • file management, job management, remote procedure calls, replica management, data streaming • Steering, information services, checkpoint… • Vertical Extensibility – Middleware Bindings • Different adaptors for different middleware • Set of ‘local’ adaptors • Extensibility for Optimization and Features • Bulk optimization, modular design

  29. SAGA C++ quick tour • Open Source - released under the Boost Software License 1.0 • Implemented as a set of libraries • SAGA Core - A light-weight engine / runtime that dispatches calls from the API to the appropriate middle-ware adaptors • SAGA functional packages - Groups of API calls for: jobs, files, service discovery, advert services, RPC, replicas, CPR, ... (extensible) • SAGA language wrappers - Thin Python and C layers on top of thenative C++ API • SAGA middleware adaptors - Take care of the API call execution on the middleware • Can be configured / packaged to suit your individual needs!

  30. SAGA: Available Adaptors • Job Adaptors • Fork (localhost), SSH, Condor, Globus GRAM2, OMII GridSAM, Amazon EC2, Platform LSF • File Adaptors • Local FS, GlobusGridFTP, Hadoop Distributed Filesystem (HDFS),CloudStore KFS, OpenCloud Sector-Sphere • Replica Adaptors • PostgreSQL/SQLite3, Globus RLS • Advert Adaptors • PostgreSQL/SQLite3, Hadoop H-Base, Hypertable

  31. SAGA: Available Adaptors (2) • Other Adaptors • Default RPC / Stream / SD • Planned Adaptors • CURL file adaptor, gLite job adaptor (Ole), ….. • Open issues • We’re in the process of consolidating the adaptor code base and adding rigorous tests in order to improve adaptor quality • Capability Provider Interface (CPI - the ‘Adaptor API’) is notdocumented or standardized (yet), but looking at existing adaptor code should get you started if you want to develop your own adaptor.

  32. SAGA API: Towards a StandardStandards promote Interoperability • The need for standard programming interface • Trade-off “Go it alone” versus “Community” model • Reinventing the wheel again, yet again, & then again • MPI a useful analogy of community standard • Vendors (Resource Provider), Software developers, users.. • social/historic parallels also important • Time to adoption, after specification .... • OGF the natural choice (SAGA-RG, SAGA-WG) • Spin-off of the Applications Research Group • Driven by UK, EU (German/Dutch), US • Design derived from 23 Use Cases • different projects, applications and functionality • biological, coastal modelling, visualization • Will discuss the advantage of SAGA as a standard specification

  33. Overview • The SAGA Philosophy • A Fresh Perspective on Distributed Applications and CI • SAGA in a Nutshell • SAGA Landscape • Individual APIs • OGF standard • SAGA in Action • Applications • Tools, Frameworks, Gateways. Access Layers.. • Uptake and Roadmap

  34. SAGA and Distributed Applications

  35. Understanding Distributed Applications Implicit vs Explicitly Distributed ? • Which approach (implicit vs. explicit) is used depends: • How the application is used? • Need to control/marshal more than one resource? • Why distributed resources are being used? • How much can be kept out of the application? • Can’t predict in advance? • Not obvious what to do, application-specific metric • If possible, Applications should not be explicitly distributed • GATEWAYS approach: • Implicit for the end-users • Supporting Applications? Or Application Usage Modes?

  36. Taxonomy of Distributed Application Development • Example of Distributed Execution Mode: • Implicitly Distributed • HTC of HTC: 1000 job submissions of NAMD the TG/LONI • SAGA shell example (cf DESHL) • Example of Explicit Coordination and Distribution • Explicitly Distributed • DAG-based Workflows (example of Higher-level API) • EnKF-HM application • Example of SAGA-based Frameworks • Pilot-Jobs, Fault-tolerant Autonomic Framework • MapReduce, All-Pairs • Note: An application can belong to more than one type

  37. DNA Energy Levels: HTC of HPC (Bishop; Tulane)

  38. Montage: DAG-based Workflow Application Exemplar

  39. DAG based Workflow ApplicationsExtensibility and Higher-level API Application Development Phase Generation & Exec. Planning Phase Execution Phase

  40. SAGA-based DAG ExecutionPreserving Performance

  41. Ensemble Kalman FiltersHeterogeneous Sub-Tasks • Ensemble Kalman filters (EnKF), are recursive filters to handle large, noisy data; use the EnKF for history matching and reservoir characterization • EnKF is a particularly interesting case of irregular, hard-to-predict run time characteristics:

  42. Results: Scale-Out Performance • Using more machines decreases the TTC and variation between experiments • Using BQP decreases the TTC & variation between experiments further • Lowest time to completion achieved when using BQP and all available resources Khamra & Jha, GMAC, ICAC’09

  43. Extreme Distribution: Frameworks? • History match on a 1 million grid cell problem, with a thousands of ensemble members • The entire system will have a few billion degrees of freedom • This will increase the need for scale-out, autonomy, fault tolerance, self healing etc... 43

  44. Extreme Distribution: Frameworks? • History match on a 1 million grid cell problem, with a thousand ensemble members • The entire system will have a few billion degrees of freedom • This will increase the need for scale-out, autonomy, fault tolerance, self healing etc... 44

  45. Extreme Distribution: Frameworks? • History match on a 1 million grid cell problem, with a thousand ensemble members • The entire system will have a few billion degrees of freedom • This will increase the need for scale-out, autonomy, fault- tolerance, self healing etc... 45

  46. SAGA-based Frameworks: Types • Frameworks: Logical structure for Capturing Application Requirements, Characteristics & Patterns • Runtime and/or Application Framework • Application Frameworks designed to either: • Pattern: Commonly recurring modes of computation • Programming, Deployment, Execution, Data-access.. • MapReduce, Master-Worker, H-J Submission • Abstraction: Mechanism to support patterns and application characteristics • Runtime Frameworks: • Load-Balancing – Compute and Data Distribution • SAGA-based Framework: Infrastructure-independent

  47. SAGA-based Frameworks: Examples • SAGA-based Pilot-Job Framework (FAUST) • Extend to support Load-balancing for multi-components • SAGA MapReduce Framework: • Control the distribution of Tasks (workers) • Master-Worker: File-Based &/or Stream-Based • Data-locality optimization using SAGA’s replica API • SAGA NxM Framework: • Compute Matrix Elements, each is a Task • All-to-All Sequence comparison • Control the distribution of Tasks and Data • Data-locality optimization via external (runtime) module

  48. Abstractions for Dynamic Execution Container Task Adaptive: Type A: Fix number of replicas; vary cores assigned to each replica. Type B: Fix the size of replica, vary number of replicas (Cool Walking) -- Same temperature range (adaptive sampling) -- Greater temperature range (enhanced dynamics)

  49. Abstractions for Dynamic ExecutionSAGA Pilot-Job (BigJob)

  50. Coordinate Deployment & Scheduling of Multiple Pilot-Jobs

More Related