
Grid Computing - A Primer

Grid Computing - A Primer. Sridhara Dasu, Department of Physics, U. Wisconsin. Grid Computing What is the buzz all about? What is the promise? My Perspective What is in it for me? How is it working for us? In UW-Madison And, beyond … Conclusion Why should you be interested?


Presentation Transcript


  1. Grid Computing - A Primer. Sridhara Dasu, Department of Physics, U. Wisconsin • Grid Computing • What is the buzz all about? • What is the promise? • My Perspective • What is in it for me? • How is it working for us? • In UW-Madison • And, beyond … • Conclusion • Why should you be interested? • What are the consequences for you? Acknowledgements: Condor Team, Globus Team, I. Foster/Argonne, M. Livny/Wisconsin, D. Bradley/Wisconsin

  2. Grid Computing is in the News …

  3. Grid Projects Are Ubiquitous

  4. The Opportunity (or Challenge): Computational Cornucopia • Abundant computation, data, bandwidth • In many fields, too much data, not too little • Simulations of unprecedented accuracy • Ubiquitous internet → distance not a barrier • But as a consequence • Rate of change accelerates • Complex problems → multidisciplinary distributed teams & sharing of resources & expertise • Without infrastructure, you can’t compete

  5. Why Distributed Teams Are Important • Increasingly challenging & complex problems • Particle physics, Global change, Cosmology, Life sciences • Manufacturing, Mineral exploration • Film production, Game development, … • Required expertise & resources also distributed • People • Computational capability • Data • Sensors

  6. The Grid “Resource sharing & coordinated problem solving in dynamic … virtual organizations” http://www.mkp.com/mk/default.asp?isbn=1558609334 • Enable integration of distributed service & resources • Using general-purpose protocols & infrastructure • To achieve useful qualities of service “The Anatomy of the Grid”, Foster, Kesselman, Tuecke, 2001

  7. What is a Grid? • The key criteria: • Coordinates distributed resources … • Uses standard, open, general-purpose protocols and interfaces … • Delivers non-trivial qualities of service. • What is not a Grid? • A cluster, a network attached storage device, a scientific instrument, a network, etc. • Each is an important component of a Grid, but by itself does not constitute a Grid

  8. Why Should You Care? 1) Grid is a promising technology [Vision] • It ushers in a virtualized, collaborative, distributed world 2) Grids are being commissioned now [Reality] • Grids are built (not bought), but are delivering real benefits in academic and commercial settings 3) An open Grid is to your advantage [Future] • Standards are being defined now that will determine the future of this technology

  9. The Power Grid: On-Demand Access to Electricity • Decouple production & consumption, enabling • On-demand access • Economies of scale • Consumer flexibility • New devices [Figure: quality and economies of scale versus time]

  10. But Computing Isn’t Really Like Electricity! • How about “access computing resources like we access Web content”? • We have no idea where a website is, or on what computer or operating system it runs • Two interrelated opportunities 1) Enhance economy, flexibility, access by virtualizing computing resources 2) Deliver entirely new capabilities by integrating distributed resources

  11. Automatically connect applications to services • Dynamic & intelligent provisioning Application Virtualization Infrastructure Virtualization • Dynamic & intelligent provisioning • Automatic failover Virtualization Applications: Delivery Application Services: Distribution Servers: Execution Source: The Grid: Blueprint for a New Computing Infrastructure (2nd Edition), 2004

  12. Local Clusters to Global Grids • Cluster Grid • Enterprise Grid • Global Grid

  13. Grid Deployment Trends [Figure labels: Mission Criticality (Scientific, Corporate); Department, Enterprise, Collaboration, Internet]

  14. Transparent Service • Utility Computing • Grid • Autonomic Computing • Service-Oriented Architecture • Webster says: Autonomic = acting or occurring involuntarily <autonomic reflexes>

  15. Layers of Grid Architecture

  16. Multidisciplinary Teams: Problem Solving in the 21st Century • Teams organized around common goals • Communities: “Virtual organizations” • With diverse membership & capabilities • Heterogeneity is a strength not a weakness • And geographic and political distribution • No location/organization possesses all required skills and resources • Must adapt as a function of the situation • Adjust membership, reallocate responsibilities, renegotiate resources

  17. Challenging Technical Requirements • Dynamic formation and management of virtual organizations • Discovery & online negotiation of access to services: who, what, why, when, how • Configuration of applications and systems able to deliver multiple qualities of service • Autonomic management of distributed infrastructures, services, and applications • Management of distributed state • Open, extensible, evolvable infrastructure

  18. The Globus Project™: Making Grid computing a reality (since 1996) • Close collaboration with real Grid projects in science and industry • The Globus Toolkit®: Open source software base for building Grid infrastructure and applications • Development and promotion of standard Grid protocols to enable interoperability and shared infrastructure • Development and promotion of standard Grid software APIs to enable portability and code sharing • Global Grid Forum: We co-founded GGF to foster Grid standardization and community

  19. Globus Toolkit 2: Key Protocols • The Globus Toolkit v2 (GT2) centers around four key protocols • Connectivity layer: • Security: Grid Security Infrastructure (GSI) • Resource layer: • Resource Management: Grid Resource Allocation Management (GRAM) • Information Services: Grid Resource Information Protocol (GRIP) • Data Transfer: Grid File Transfer Protocol (GridFTP) • Also key collective layer protocols • Info Services, Replica Management, etc.

  20. Condor: High Throughput Computing, Resource Management (Est. 1986) • UW Condor Project - Miron Livny’s group (http://www.cs.wisc.edu/condor) • Predates Globus • High throughput computing on commodity resources • Successful enterprise level deployment • UW Computer Science Condor pool • UW Condor pools in other departments • INFN/Italy pools • Inter-pool flocking • … • Also, some industrial users • …

  21. The Layers of Condor: Complete solution for resource management • Submit (client): Application, Application Agent, Customer Agent • Matchmaker • Execute (service): Owner Agent, Remote Execution Agent, Local Resource Manager, Resource
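
The Matchmaker in the layer diagram sits between the submit side and the execute side and pairs jobs with resources. The Python sketch below illustrates that idea only; it is not Condor's matchmaking code. The dictionary fields (`memory_mb`, `owner`) and the `match` function are hypothetical names chosen for the example, and each "requirements" entry stands in for the constraints that a job or a machine would advertise.

```python
def match(jobs, machines):
    """Pair each job with the first free machine where both sides agree.

    A job's "requirements" callable inspects the machine, and a machine's
    "requirements" callable inspects the job, so the match is two-sided.
    """
    assignments = {}
    free = list(machines)
    for job in jobs:
        for machine in free:
            if job["requirements"](machine) and machine["requirements"](job):
                assignments[job["name"]] = machine["name"]
                free.remove(machine)  # a machine serves one job at a time here
                break
    return assignments


# Hypothetical advertisements from the execute side (owner agents).
machines = [
    {"name": "m1", "memory_mb": 512,
     "requirements": lambda job: job["owner"] != "guest"},
    {"name": "m2", "memory_mb": 2048,
     "requirements": lambda job: True},
]

# Hypothetical advertisements from the submit side (customer agents).
jobs = [
    {"name": "j1", "owner": "frieda",
     "requirements": lambda machine: machine["memory_mb"] >= 1024},
    {"name": "j2", "owner": "guest",
     "requirements": lambda machine: True},
]

assignments = match(jobs, machines)
print(assignments)
```

In this toy run, j1 needs more memory than m1 offers and lands on m2, while j2 stays unmatched because the only remaining machine refuses guest jobs.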

  22. A Grid Job • Must be able to run in the background: no interactive input, windows, GUI, etc. • Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices • Organize data files, input/output

  23. Condor Universes • The Standard Universe • Checkpoints executable state • Job migration to other resources to continue execution • Transparent IO redirection to user submit machines • Robust against resource preemption for higher priority tasks + resource failures • Limitations on applications (e.g., shared libraries, multi-threading) • The Vanilla Universe • Traditional batch jobs with no limitations • External solutions for IO redirection • Not robust against preemption or resource failures • The Globus Universe (new) • Adapted to emerging Grid standards • Part of Globus Toolkit

  24. Condor-G: Globus + Condor • Globus • middleware deployed across entire Grid • remote access to computational resources • dependable, robust data transfer • Condor • job scheduling across multiple resources • strong fault tolerance with checkpointing and migration • layered over Globus as “personal batch system” for the Grid

  25. [Figure labels: User/Application; Grid: Condor-G, Globus Toolkit, Condor; Fabric (processing, storage, communication)]

  26. Creating a Submit Description File • A plain ASCII text file • Tells Condor-G about your job: • Which executable, grid site, input, output and error files to use, command-line arguments, environment variables, etc. • Can describe many jobs at once (a “cluster”) each with different input, arguments, output, etc.

  27. Simple Submit Description File
      # Simple condor_submit input file
      # (Lines beginning with # are comments)
      # NOTE: the words on the left side are not
      #       case sensitive, but filenames are!
      Universe        = globus
      GlobusScheduler = host.domain.edu/jobmanager
      Executable      = my_job
      Queue
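
As a rough illustration of how one submit description can cover a whole cluster of jobs, the Python sketch below generates such a file as text. The helper `render_submit_file` and the file names (`in.*`, `out.*`, `err.*`, `my_job.log`) are assumptions made for the example; the `Universe`, `GlobusScheduler`, and `Executable` values are the ones shown on the slide. It relies on the `$(Process)` macro, which Condor expands to 0, 1, 2, … for each queued job.

```python
def render_submit_file(executable, scheduler, n_jobs):
    """Return submit-description text that queues n_jobs jobs in one cluster."""
    lines = [
        "# Generated submit description (illustrative sketch)",
        "Universe        = globus",
        f"GlobusScheduler = {scheduler}",
        f"Executable      = {executable}",
        # Per-job values: $(Process) gives each job its own files and argument.
        "Arguments       = $(Process)",
        "Input           = in.$(Process)",
        "Output          = out.$(Process)",
        "Error           = err.$(Process)",
        "Log             = my_job.log",
        f"Queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


submit_text = render_submit_file("my_job", "host.domain.edu/jobmanager", 3)
print(submit_text)
```

Writing `submit_text` to a file and passing that file to condor_submit would be the next step; whether a given Grid site accepts these per-job input and output settings depends on the local setup.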

  28. Running condor_submit • You give condor_submit the name of the submit file you have created • condor_submit parses the file, checks for errors, and creates a “ClassAd” that describes your job(s) • Sends your job’s ClassAd(s) and executable to the Condor-G schedd, which stores the job in its queue • Atomic operation, two-phase commit • View the queue with condor_q

  29. Condor-G condor_submit sequence [Figure labels: condor_submit, condor_q, Condor-G, Gate Keeper, Local Job Scheduler, Globus Resource]

  30. Running condor_submit
      % condor_submit my_job.submit-file
      Submitting job(s).
      1 job(s) submitted to cluster 1.
      % condor_q
      -- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
       ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
       1.0     frieda   6/16 06:52  0+00:00:00 I  0   0.0  my_job
      1 jobs; 1 idle, 0 running, 0 held
      %

  31. DAGMan • Directed Acyclic Graph Manager • DAGMan allows you to specify the dependencies between your Condor-G jobs, so it can manage them automatically for you. • (e.g., “Don’t run job “B” until job “A” has completed successfully.”)

  32. What is a DAG? • A DAG is the data structure used by DAGMan to represent these dependencies. • Each job is a “node” in the DAG. • Each node can have any number of “parent” or “children” nodes – as long as there are no loops! [Figure: Job A, Job B, Job C, Job D]

  33. Defining a DAG • A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
      # diamond.dag
      Job A a.sub
      Job B b.sub
      Job C c.sub
      Job D d.sub
      Parent A Child B C
      Parent B C Child D
      • Each node will run the Condor-G job specified by its accompanying Condor submit file [Figure: Job A, Job B, Job C, Job D]
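
To make the "no loops" rule and the parent/child ordering concrete, here is a small Python sketch that reads the diamond.dag text from the slide and derives one valid run order. It is an illustration only, not how DAGMan itself is implemented: `parse_dag` and `run_order` are hypothetical helpers, and they understand only the `Job` and `Parent … Child …` lines shown above. The ordering comes from Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter

DIAMOND_DAG = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""


def parse_dag(text):
    """Return (jobs, parents): node -> submit file, node -> set of parents."""
    jobs = {}
    parents = {}
    for raw in text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = words[0].lower()
        if keyword == "job":
            name, submit_file = words[1], words[2]
            jobs[name] = submit_file
            parents.setdefault(name, set())
        elif keyword == "parent":
            split = [w.lower() for w in words].index("child")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return jobs, parents


def run_order(parents):
    """Return one order in which every node follows all of its parents."""
    # static_order() raises CycleError if the dependencies contain a loop.
    return list(TopologicalSorter(parents).static_order())


jobs, parents = parse_dag(DIAMOND_DAG)
order = run_order(parents)
print(order)
```

For the diamond, A always comes first and D last; B and C have no dependency on each other, so either may precede the other and a scheduler is free to run them at the same time.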

  34. What about Data? Data Placement* (DaP) must be an integral part of the end-to-end solution • Stork (another UW Computer Science product) • Schedules, runs, monitors, and manages Data Placement (DaP) jobs in a heterogeneous Grid environment & ensures that they complete. • Stork is to DaP jobs what Condor(-G) is to computational jobs. • Just submit a bunch of DaP jobs and then relax. • Interoperates with various storage services * Space management and Data transfer

  35. Full Condor-G Capabilities [Figure labels: Planner(s), DAGMan, Condor-G (compute), Stork (DaP), Gate Keeper, StartD, RFT, GridFTP, SRM, SRB, NeST]

  36. UW “Enterprise Level” Grid • Condor pool at CS • 1000 ~1 GHz Intel CPUs • Condor pools at various departments • 100 ~2.4 GHz Intel CPUs at Physics, etc. • New: Grid Laboratory of Wisconsin • Condor jobs flock from various departments to CS Pool as needed • Excellent utilization • Especially when the Condor Standard Universe is used • Preemption, Checkpointing, Job Migration

  37. Grid Laboratory of Wisconsin • 2003 Initiative funded by NSF/UW • Six GLOW Sites • Computational Genomics, Chemistry • Amanda, Ice-cube, Physics/Space Science • High Energy Physics/CMS, Physics • Materials by Design, Chemical Engineering • Radiation Therapy, Medical Physics • Computer Science • Phase-1 already has ~300 Xeon CPUs • Expect to grow to about 700 CPUs + 100 TB disk

  38. Condor/GLOW Ideas • Exploit commodity hardware for high throughput computing • The base hardware is the same at all sites • Local configuration optimization as needed • e.g., Number of CPU elements vs storage elements • Must meet global requirements • It turns out that our initial assessment calls for almost identical configuration at all sites • Managed locally at 6 sites • Shared globally across all sites • Higher priority for local jobs

  39. The Large Hadron Collider

  40. The Large Hadron Collider Building and commissioning the accelerator and detectors, and extracting interesting physics out of this massive data sample is a big challenge.

  41. Event Filtering Before Archival • Output: 1 MB/event at 100 Hz, about a petabyte per year
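
The petabyte-per-year figure can be checked with a few lines of arithmetic. The slide gives the event size and the output rate but not how many seconds per year the experiment records data; the sketch below assumes 10^7 seconds of data taking per year, which is an assumption introduced here and not a number from the presentation.

```python
EVENT_SIZE_MB = 1.0             # from the slide: 1 MB per archived event
OUTPUT_RATE_HZ = 100.0          # from the slide: 100 events per second
LIVE_SECONDS_PER_YEAR = 1.0e7   # ASSUMPTION: data-taking time per year

MB_PER_PB = 1.0e9               # decimal units: 1 PB = 10^9 MB

mb_per_second = EVENT_SIZE_MB * OUTPUT_RATE_HZ
pb_per_year = mb_per_second * LIVE_SECONDS_PER_YEAR / MB_PER_PB

print(f"{mb_per_second:.0f} MB/s, about {pb_per_year:.1f} PB per year")
```

With a different live-time assumption the result scales linearly; a full calendar year of continuous running, for instance, would give roughly three times as much.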

  42. Analysis Teams + Resources • Input: ~10^9 events (petabyte databases) • Complex algorithms developed by collaborating physicists • Output: Publications with ~100s of selected events

  43. Simulation: Early Grid Deployment • Detailed simulations necessary • Large numbers of background events need to be simulated • Dominated by fluctuations of tails • Computation scale • Background events occur on every crossing - 40 MHz • Up to 10 minutes on a 1 GHz CPU to simulate full event • 2 x 10^9 s CPU time to simulate 1 s of LHC operation • Requires 1000 CPUs running for 1 month • CMS has large number of detector channels, 10^8 • Each event requires 1-10 MB storage space • 32-320 TB needed for 1 s of LHC operation • Optimizing CPU and data storage • Simulate in bins and reuse some data • Pleasantly parallel application • Ideal Grid testbed candidate • Used UW “enterprise level” classic Condor grid successfully • With Grid2003 used nationwide Globus/Condor-G based true grid
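
The computation-scale figures on this slide can be cross-checked against each other. The Python sketch below uses only numbers quoted on the slide (40 MHz crossing rate, 2 x 10^9 s of CPU time, 1000 CPUs, 1 to 10 MB per event) and derives what they imply; the variable names are illustrative, and the implied per-event average is a consequence of those quoted figures, not a number stated in the talk.

```python
CROSSING_RATE_HZ = 40e6        # background events per second of LHC operation
TOTAL_CPU_SECONDS = 2e9        # quoted CPU time to simulate 1 s of operation
N_CPUS = 1000
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1e6                # decimal units: 1 TB = 10^6 MB

# Average CPU time per event implied by the quoted total.
avg_cpu_seconds_per_event = TOTAL_CPU_SECONDS / CROSSING_RATE_HZ

# Wall-clock time if the work is spread evenly over 1000 CPUs.
wall_clock_days = TOTAL_CPU_SECONDS / N_CPUS / SECONDS_PER_DAY

# Storage for one event per crossing at 1 MB and at 10 MB per event.
storage_tb_low = CROSSING_RATE_HZ * 1 / MB_PER_TB
storage_tb_high = CROSSING_RATE_HZ * 10 / MB_PER_TB

print(f"implied average: {avg_cpu_seconds_per_event:.0f} s per event")
print(f"wall clock on {N_CPUS} CPUs: {wall_clock_days:.1f} days")
print(f"storage: {storage_tb_low:.0f} to {storage_tb_high:.0f} TB")
```

The wall-clock result, a little over three weeks, is consistent with the "1000 CPUs running for 1 month" bullet, and the implied average per event sits well under the 10-minute upper bound. The storage estimate comes out at the same order of magnitude as the 32-320 TB quoted; the slide does not say which event count its figure assumes.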

  44. Tapping UW “Enterprise Level” Grid • We tapped resources on the UW campus opportunistically. • We produced more events in 2003 than most other CMS collaborators, because we used our UW enterprise level grid and the Condor standard universe! • 2004 numbers are through March, and we are also running our new C++ simulation code, which is a factor of 2 slower. • We have typically used less than 50% of available resources and ran for about 30% of the year.

  45. Tapping Global Grid : Grid3

  46. Cost Savings from Grids • The size of cost savings from grids will come in two waves: • First from the adoption of clusters • Then from the adoption of Enterprise Grids • Firms using Clusters estimate that cost savings will be small at first, but will grow to 15% to 30% savings in IT Costs in 2005-2008. • Firms planning to use Enterprise Grids estimate that they will experience a second wave of benefits. Savings will grow to 15% to 30% by 2007-2010. Source: Robert Cohen, “Grid Computing: Projected Impact on North Carolina’s Economy & Broadband Use through 2010,” Rural Internet Access Authority, September 2003. http://www.e-nc.org

  47. Grid drawbacks being addressed now • Low utilization of enterprise resources • High cost of provisioning for peak demand • Inadequate resources prevent use of advanced applications • Lack of information integration

  48. Cyberinfrastructure & VOs Relevance Far Beyond Science 1) Virtualization of information technology • From vertical silos to on-demand access • Improve efficiency of delivery, increase flexibility of use • E.g., financial services, e-commerce 2) New applications, products, & services enabled by much computation & data • Media, life sciences, manufacturing, seismic exploration, online gaming, etc., etc., etc.

  49. The Value of Grid Computing: IBM Perspective • Increased Efficiency • Higher Quality of Service • Increased Productivity & ROI • Reduced Complexity & Cost • Improved Resiliency

  50. Grids: HP Perspective [Figure labels: switch fabric, compute, storage; value axis from today to shared, traded resources; Tru64, HP-UX, Linux clusters; Open VMS clusters, TruCluster, MC ServiceGuard; grid-enabled systems; programmable data center; UDC; virtual data center; computing utility or GRID]
