
DRM/Computational Grids

DRM/Computational Grids. Bill DeSalvo August 18, 2004. Computational Grids. Definitions…. Cluster : An arbitrary collection of distributed IT resources organized as a management domain… a single system environment.


Presentation Transcript


  1. DRM/Computational Grids. Bill DeSalvo, August 18, 2004

  2. Computational Grids

  3. Definitions… • Cluster: An arbitrary collection of distributed IT resources organized as a management domain… a single system environment • Grid: Transparent, secure, coordinated resource sharing across one or more sites… a cluster of clusters

  4. Grid Drivers • Optimize Capabilities: Virtual Organizations (new infrastructure enables new org structures; collaborative computing); New Class of Capabilities (potential to solve very large problems); New Business Models (outsourcing of computing tasks; utility computing; peak load support) • Optimize Infrastructure: Resource Optimization (maximize return on capital equipment); Resource Access (provide mechanisms to share resources across organizational boundaries); Cost Sharing (allow multiple groups to contribute resources to a project while maintaining control of those resources); Improved Management Model (incorporate multiple systems into an organization under a single unified systems model) • Source: IDC

  5. Ian Foster’s Three-Point Grid Checklist • Coordinates resources • Not subject to centralized control • One or more (virtual) organizations • Geographic distribution of users/resources is common • Standard, open, general-purpose protocols and interfaces • Delivers nontrivial qualities of service • SLAs vs. policies vs. QoS • Translates business objectives into IT objectives • Enables effective utilization, resource aggregation, and remote access to specialized resources. Callout: A cluster is a local-area, logical arrangement of independent entities that collectively provide a service. Clusters are NOT grids!

  6. Virtual Organizations

  7. Evolution of the Grid

  8. Everyone’s Aware of “The Grid”

  9. Platform Grid Competencies • Resource Leasing • Job Forwarding • Account Mapping • Grid Fairshare Scheduling • Advance Reservations • User Authentication • Reliable Data Transfer. An outgrowth of Platform’s experience in Grid and Distributed Computing.

  10. Platform MultiCluster

  11. Three-Point Grid Checklist & Platform MultiCluster • Coordinates resources • Not subject to centralized control • ‘Single’ organization (“Enterprise Grid”) • Geographic distribution of users/resources is common • Proprietary protocols and interfaces • Delivers nontrivial qualities of service • SLAs vs. policies • Common queues • Advance reservation • Resource leasing • Fairshare • SLAs • Translates business objectives into IT objectives • Enables effective utilization, resource aggregation, and remote access to specialized resources

  12. Why MultiCluster: Global Sharing, Local Ownership (“politics of the grid”) • Providing … while maintaining … • Local Autonomy • Increased Capacity • Increased Capability • Increased Scalability • Growing Computational Needs [Diagram: Dept A, Dept B, Dept C, Dept D]

  13. Job Forwarding Model • “HPC Center” Configuration • Enhanced transparency: FCFS guarantee, pending reason support, chunk jobs, host type/queue status aware scheduling, checkpoint/migration [Diagram: HPC Center, Cluster A, Cluster B, Cluster C]

  14. Job Forwarding Model • You submit • We do: job transfer, data staging, account mapping, accounting [Diagram: send queue, receive queue, and compute servers at Site A and Site B]
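The division of labor on this slide (the user submits, the system transfers the job, maps the account, and records accounting) can be illustrated with a small Python sketch. It is not Platform LSF MultiCluster code: the `Site` class, the `ACCOUNT_MAP` table, and the site and user names are all hypothetical, and data staging is left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    user: str      # account name at the site where the job currently sits
    command: str


# Hypothetical account map: (submitting site, local user) -> user at the execution site.
ACCOUNT_MAP = {("siteA", "alice"): "alice_b"}


class Site:
    def __init__(self, name: str):
        self.name = name
        self.send_queue = deque()     # jobs waiting to be forwarded elsewhere
        self.receive_queue = deque()  # jobs accepted from other sites
        self.accounting = []          # (job_id, origin site, mapped user) records

    def submit(self, job: Job) -> None:
        """The user's only step: submit to the local send queue."""
        self.send_queue.append(job)

    def forward_all(self, remote: "Site") -> None:
        """Forward queued jobs in order, mapping accounts and logging accounting."""
        while self.send_queue:
            job = self.send_queue[0]
            mapped_user = ACCOUNT_MAP.get((self.name, job.user))
            if mapped_user is None:
                # Leave the job queued rather than losing it.
                raise PermissionError(f"no account mapping for {job.user} at {remote.name}")
            self.send_queue.popleft()
            remote.receive_queue.append(Job(job.job_id, mapped_user, job.command))
            remote.accounting.append((job.job_id, self.name, mapped_user))
```

A short usage example: `a, b = Site("siteA"), Site("siteB")`, then `a.submit(Job(1, "alice", "./run.sh"))` followed by `a.forward_all(b)` leaves the job in `b.receive_queue` under the mapped account `alice_b`.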

  15. Resource Leasing Model • Accelerating Enterprise Grid Adoption • Single system image, ease of administration, scalability • Enable fairshare, preemption, pending reason support, chunk jobs, advance reservation, interactive jobs, parallel jobs, … across clusters

  16. Advance Reservation • Reserve nodes for exclusive access for user or user group • Ensures critical work is done without interference • Useful for benchmarking or system maintenance • One-time and recurring reservation • Administrator defines reservation for users [Diagram: nodes dedicated to User A for time duration]
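As a rough illustration of exclusive access during a reserved window, including the one-time and recurring cases, here is a minimal Python sketch. The `Reservation` type and the `may_run` and `recurring` helpers are assumptions made for this example and do not reflect how Platform LSF stores or enforces reservations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Reservation:
    owner: str         # user or user group holding the reservation
    hosts: frozenset   # hosts dedicated to the owner
    start: datetime
    end: datetime

    def active_at(self, when: datetime) -> bool:
        return self.start <= when < self.end


def may_run(user: str, host: str, when: datetime, reservations) -> bool:
    """A host under an active reservation is usable only by the reservation owner."""
    for r in reservations:
        if host in r.hosts and r.active_at(when):
            return r.owner == user
    return True  # unreserved hosts are open to everyone


def recurring(owner: str, hosts, first_start: datetime,
              duration: timedelta, every: timedelta, count: int):
    """Expand a recurring reservation into `count` one-time reservations."""
    return [
        Reservation(owner, frozenset(hosts),
                    first_start + i * every,
                    first_start + i * every + duration)
        for i in range(count)
    ]
```

For instance, `recurring("userA", ["n1", "n2"], datetime(2004, 8, 18, 9, 0), timedelta(hours=2), timedelta(days=1), 3)` yields three daily two-hour windows; inside any of them `may_run` admits only `userA` on `n1` and `n2`.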

  17. Use Cases

  18. DoD HPCMP Grid • Challenge: initiative to share HPCMP’s resources easily & transparently across centers (SMDC, TACOM, NRL, NAVO, WSMR, …); build a meta-queuing system to integrate the centers • Primary Benefit: the capability to submit a job to a single, common queue, which is then sent to the best available computer in the Grid

  19. DoD HPCMP Grid (DOD HPCMO) • Requirement: Fire and Forget; Full Kerberos 5 Support; Reliable, Secure File Transfer • Solution: Platform LSF MultiCluster; resource reservation protocol; transparent job control; accounting; Kerberized client-server interactions; ticket forwarding/renewal; multi-realm support; account mapping; Platform FTA (Kerberized, fault tolerant)

  20. DoD HPCMP Grid • GRID Challenges • Logistics / Coordination • People • User Accounts • Geographic locations • Site configurations • Time zones / schedules • Network Security / Firewalls • Intro of batch queuing systems to environments • Training & skills transfer [Diagram: sites on DREN: TACOM/TARDEC Onyx2, 32 PEs; NRL Origin 2000, 128 PEs; AFFTC Origin 3000, 64 PEs; AEDC Origin 2000, 64 PEs; WSMR Origin 2000, 64 PEs; RTTC Origin 2000, 32 PEs; NAVO SUN E10K, 64 PEs; SSCSD HP Superdome, 44 PEs; SMDC Origin 2000, 64 PEs]

  21. External Grids/Portal SHARCNET

  22. SHARCNET • The network is no longer ‘passive plumbing’ • True resource that can be managed in real time – with guaranteed QoS • Potential projects • -based resource leasing, advance reservation • IP-based topology awareness • Enables new classes of Grid applications • Operational results • Real-time, remote visualization • Virtual storage • Persistent/pervasive • On demand

  23. The Globus Toolkit V2

  24. Sharing pains…physical login • You have to • Get and maintain multiple accounts • Use different batch systems • No consolidated accounting • Manual file movement [Diagram: compute servers at Site A and Site B]

  25. The Globus Toolkit™ Version 2 (GT2) • A software toolkit that addresses key technical problems in the development of Grid-enabled tools, services, and applications • Offers a modular “bag of technologies” • Enables incremental development of grid-enabled tools and applications • Implements standard Grid protocols and APIs • Made available under liberal Open Source license • Provided by The Globus Alliance http://www.globus.org

  26. Globus Toolkit: Evaluation (+) • Good technical solutions for key problems, e.g. • Authentication and authorization • Resource discovery and monitoring • Reliable remote service invocation • High-performance remote data access • This & good engineering is enabling progress • Good quality reference implementation, multi-language support, interfaces to many systems, large user base, industrial support • Growing community code base built on tools

  27. Globus Toolkit: Evaluation (-) • Protocol deficiencies, e.g. • Heterogeneous basis: HTTP, LDAP, FTP • No standard means of invocation, notification, error propagation, authorization, termination, … • Significant missing functionality, e.g. • Databases, sensors, instruments, workflow, … • Virtualization of end systems (hosting envs.) • Little work on total system properties, e.g. • Dependability, end-to-end QoS, … • Reasoning about system properties • Scalability

  28. LSF MC & Globus compared with Globus • LSF MC & Globus: MC provides transparent, dynamic, intelligent, scalable inter-cluster sharing; user does not need to know about clusters (total transparency); MC dynamically chooses the “best cluster” to run the job • Globus: user chooses which cluster to submit job to via Globus interface; static, non-intelligent sharing; lacks transparency [Diagram: Cluster A, Cluster B, Cluster C, inter-cluster protocols]
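To make the idea of dynamically choosing the “best cluster” concrete, the following Python sketch picks a target from current cluster state. The selection policy (matching host type, most free slots, shortest pending queue as tie-breaker) is an assumption chosen for illustration; the slides do not state which criteria MultiCluster actually applies, and the `Cluster` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    name: str
    host_type: str
    free_slots: int
    pending_jobs: int


def best_cluster(clusters: list, host_type: str, slots: int) -> Optional[Cluster]:
    """Pick the eligible cluster with the most spare capacity (illustrative policy)."""
    eligible = [c for c in clusters
                if c.host_type == host_type and c.free_slots >= slots]
    if not eligible:
        return None  # the job would stay pending until some cluster can take it
    # Prefer more free slots; break ties by the shorter pending queue.
    return max(eligible, key=lambda c: (c.free_slots, -c.pending_jobs))
```

The user never names a cluster here: the caller passes only the job's requirements, which is the transparency contrast the slide draws with choosing a cluster by hand.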

  29. Globus Toolkit 3 (OGSA)

  30. Open Grid Services Architecture (OGSA) • Next-generation architecture • Consequence of technology refresh (i.e., refactoring the Globus Toolkit) and research into Autonomic Computing • Convergence of Grid Computing and Web Services • Globus Toolkit • Access services – e.g., CLIs, GUIs, portals and CoGs • Resource and allocation management • Monitoring and discovery services – e.g., sensing and indexing • Data management services – e.g., file transfer, replica management, etc. • Security – e.g., the Grid Security Infrastructure • Initially SOAP, WSDL and WS-Inspection • The Global Grid Forum (GGF) serves as the standards authority • Two layers • Core Grid platform – OGSA platform interfaces and models • Core Grid infrastructure – Open Grid Services Infrastructure (OGSI) http://www.gridforum.org http://www.globus.org/ogsa

  31. Importance of OGSA to Customers • Grid-enabled Web Services transforming IT • Analyst feedback (e.g., Gartner) • Customer experience • Customers demand standards-compliant products, solutions and services – why? • Vendors guilty of over-promising and under-delivering • Avoid single-vendor lock-in • Proprietary implementations based on open standards • Seek multi-vendor deliverables • Framework for partner collaboration • Demanding professionalism in software engineering • Seek to be engaged in the process

  32. Platform Embraces Open Standards • Platform developing software for over 11 years • Standards efforts are recent activities • Existing implementations are proprietary • Platform is an NPi founder • NPi merged with GGF (4/02) • NPi being leveraged in OGSA • Platform committed to open standards • Proprietary implementations based on open standards • Platform experienced in Open Source arena • Offering Linux solutions for over 6 years • Offering Globus Toolkit solutions for about 2 years • Source-code available for components of Platform LSF

  33. Platform and Globus

  34. Platform Globus Toolkit • One step installation • Connectors for 3rd party workload management systems (e.g., SGE, PBS) • Native command line interface support • Open Source • CSF Plus: advanced CSF-based metascheduler; job persistence; enhanced scalability (6x GT 3); cluster load balancing and host type matching (LSF only) • Community Scheduler Framework (CSF): round robin job scheduling; advance reservation booking, query, & control; reservation based scheduling; job throttling for increased reliability [Layer diagram: Platform Globus Toolkit, Platform Enhancements, Community Scheduler Framework (CSF), Globus Toolkit 3]
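Round robin job scheduling combined with job throttling can be sketched in a few lines of Python. This is a generic illustration of the two techniques, not CSF source code; the function name and the `max_in_flight` parameter are hypothetical.

```python
from itertools import cycle


def round_robin_dispatch(jobs, clusters, max_in_flight):
    """Assign jobs to clusters in turn, never exceeding max_in_flight per cluster.

    Returns (assignments, deferred): assignments maps cluster -> list of jobs,
    deferred holds jobs held back by the throttle for a later pass.
    """
    assignments = {c: [] for c in clusters}
    deferred = []
    turn = cycle(clusters)
    for job in jobs:
        # Try each cluster at most once, starting from the next one in rotation.
        for _ in range(len(clusters)):
            cluster = next(turn)
            if len(assignments[cluster]) < max_in_flight:
                assignments[cluster].append(job)
                break
        else:
            deferred.append(job)  # every cluster is at its limit
    return assignments, deferred
```

Deferring rather than rejecting over-limit jobs is one plausible reading of throttling "for increased reliability": no resource manager is flooded, and the held jobs can be retried once earlier ones finish.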

  35. CSF

  36. What is CSF? • CSF (Community Scheduler Framework) • Not a Platform product • Contributed as the industry’s 1st open source meta-scheduler enhancement to Globus Toolkit V3.X • Developed with the latest version of OGSI, the grid guideline being developed with the Global Grid Forum • Open source “meta-scheduler” framework • Provides basic protocols and interfaces to help resources work together in heterogeneous environments • Enables global access and maintains local control of resources

  37. Key Benefits of OGSA Compliance • Future-proof & protect grid investment using standards-based • solutions • Standardized approach to access Platform LSF • Interoperate with 3rd party systems

  38. Metaschedulers • Scheduler that coordinates communication between heterogeneous schedulers that operate at a local level • Enables global access and coordination while maintaining local control and ownership of resources • Future: possible to schedule not only workload execution but also storage, network bandwidth, etc.

  39. CSF Grid Services • Job Service creates, monitors and controls compute jobs • Reservation Service guarantees resources are available for running a job • Queueing Service provides a service where administrators can customize and define scheduling policies at the VO level and/or at the different resource manager level • RM Adaptor Service provides a Grid service interface that bridges the Grid service protocol and resource managers (LSF, PBS, SGE, Condor and other RMs)
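The relationship between the Job Service and the RM Adaptor Service can be pictured as a common interface with one implementation per resource manager. The Python sketch below assumes that shape; the class and method names are hypothetical and are not the CSF or Globus Toolkit API, and the in-memory backend stands in for a real LSF, PBS, SGE, or Condor adapter.

```python
from abc import ABC, abstractmethod


class RMAdapter(ABC):
    """Hypothetical uniform interface over one local resource manager."""

    @abstractmethod
    def submit(self, command: str) -> str:
        """Submit a job and return the resource manager's own job identifier."""

    @abstractmethod
    def status(self, local_id: str) -> str:
        """Return a normalized state such as 'PENDING', 'RUNNING' or 'DONE'."""


class InMemoryAdapter(RMAdapter):
    """Stand-in backend for demonstration; a real adapter would call the RM itself."""

    def __init__(self, prefix: str):
        self.prefix = prefix
        self.jobs = {}

    def submit(self, command: str) -> str:
        local_id = f"{self.prefix}-{len(self.jobs) + 1}"
        self.jobs[local_id] = "PENDING"
        return local_id

    def status(self, local_id: str) -> str:
        return self.jobs[local_id]


class JobService:
    """Creates and monitors jobs through whichever adapter the caller names."""

    def __init__(self, adapters: dict):
        self.adapters = adapters
        self.jobs = {}  # grid job id -> (resource manager name, local id)

    def create(self, rm_name: str, command: str) -> int:
        local_id = self.adapters[rm_name].submit(command)
        grid_id = len(self.jobs) + 1
        self.jobs[grid_id] = (rm_name, local_id)
        return grid_id

    def monitor(self, grid_id: int) -> str:
        rm_name, local_id = self.jobs[grid_id]
        return self.adapters[rm_name].status(local_id)
```

With `svc = JobService({"lsf": InMemoryAdapter("lsf"), "pbs": InMemoryAdapter("pbs")})`, callers use one grid-level job id regardless of which resource manager runs the job, which is the bridging role the slide assigns to the adapter layer.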

  40. CSF Architecture [Diagram labels: Third Party Workload Management System (twice), Platform LSF (twice), User, Meta-scheduler Plugin, Globus Toolkit User, LSF, Grid Service Hosting Environment, Meta-Scheduler, Global Information Service, Job Service, Reservation Service, Queuing Service, GRAM SGE, GRAM PBS, RM Adapter, RIPS] • RIPS = Resource Information Provider Services • GRAM = Grid Resource & Allocation Management

  41. [Chart: Profile (Low to High) against Awareness/Knowledge, Liking/Preference/Conviction, Commitment; labels: OMII, Grid Canada]

  42. What are the Multi-Domain Tools and What Do They Do? • Platform MultiCluster: enables global access and coordination while maintaining local control and ownership of resources; joins geographically dispersed clusters; production quality solution to build enterprise grids; Platform proprietary solution that is standards-based & OGSA compliant • Globus Toolkit: tools to join geographically dispersed clusters; a bunch of “bricks” to build grids (that’s why it’s called a toolkit); users have to specify which cluster they would like their job to be sent to, so it is not transparent; open source solution; Platform adds commercial support: documentation, training, tech support, professional services

  43. Summary

  44. Summary • OGSA applies to e-Science and e-Business • Rich architectural framework • Existing, emerging and planned specifications • Ultimately resulting in Open Standards • Existing, emerging and planned implementations • The Community Scheduler Framework • Standards-based • Choice of implementations • Ushers existing grids towards OGSA compliance • Spectrum of potential use cases

  45. Thank you.
