  1. Grid computing: an introduction. Lionel Brunie, Institut National des Sciences Appliquées, Lyon, France

  2. Hansel and Gretel are lost in the forest of the definitions • Distributed system • Parallel system • Cluster computing • Meta-computing • Grid computing • Peer to peer • Global computing • Internet Computing • Network computing

  3. Distributed system • n autonomous computers (sites): n administrators, n data/control flows • an interconnection network • User view: one single (virtual) system • "Traditional" programmer view: client-server

  4. Parallel system • 1 computer, n nodes: one administrator, one scheduler, one power source • memory: shared or distributed, depending on the architecture • Programmer view: one single machine executing parallel codes. Various programming models (message passing, distributed shared memory, data parallelism…)

  5. Cluster computing • Use of PCs interconnected by a (high-performance) network as a cheap parallel machine • Two main approaches: • dedicated network (based on a high-performance network: Myrinet, SCI, Fibre Channel...) • non-dedicated network (based on a (good) LAN)

  6. Network computing • From LAN (cluster) computing to WAN computing • Set of machines distributed over a MAN/WAN that are used to execute loosely coupled parallel codes • Depending on the infrastructure (software and hardware), network computing takes the form of Internet computing, P2P, grid computing, etc.

  7. Meta computing • Definitions become fuzzy... • A meta computer = a set of (widely) distributed (high-performance) processing resources that can be associated for processing a parallel, not-so-loosely coupled code • A meta computer = a parallel virtual machine over a distributed system • [Figure: a meta computer spanning clusters of PCs (SAN/LAN), a supercomputer, and a WAN interconnect; visualization shown as one of its uses]

  8. Grid computing (1) “Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations” (I. Foster)

  9. Grid computing (2) • Information grid: wide access to distributed data (e.g., the Web) • Data grid: management and processing of very large distributed data sets • Computing grid ~ meta computer • Ex: Globus, Legion

  10. Internet computing • Use of (idle) computers interconnected by the Internet for processing high-throughput applications • Ex: SETI@HOME, Décrypthon, RSA-155 • Programmer view: a single master, n servants
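
A minimal sketch of this master/servants model, using Python's multiprocessing pool as a local stand-in for Internet-wide distribution; the work-unit function and the chunking are invented for illustration:

    # Master/servant model in miniature: the master splits a large search space
    # into independent work units; idle "servants" process them and return
    # results. Internet computing (e.g. SETI@HOME) replaces the local pool with
    # volunteer machines, but the programming model is the same.
    from multiprocessing import Pool

    def work_unit(chunk):                       # hypothetical task: scan a range
        lo, hi = chunk
        return [k for k in range(lo, hi) if k % 999983 == 0]  # toy "hit" test

    if __name__ == "__main__":
        chunks = [(i * 10**6, (i + 1) * 10**6) for i in range(8)]  # units
        with Pool() as servants:                # the master farms units out
            results = servants.map(work_unit, chunks)
        print(sum(results, []))                 # master collects and merges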

  11. Global computing • Internet computing on a pool of sites • Meta computing with loosely coupled codes • Grid computing with poor communication facilities • Ex: Condor

  12. Peer-to-peer computing • A site is both client and server: a "servent" • Dynamic servent discovery by "contamination" (query flooding) • 2 approaches: • centralized management: Napster • distributed management: Gnutella, Kazaa • Application: file sharing
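
A toy sketch of discovery by "contamination" (Gnutella-style query flooding) over an in-memory overlay; the overlay graph, holdings and TTL are illustrative, not any real protocol's wire format:

    # Toy Gnutella-style flooding: a servent forwards a query to its neighbours,
    # decrementing a TTL, until the resource is found or the TTL expires.
    # 'overlay' (who knows whom) and 'holdings' (who stores what) are made up.
    overlay = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
    holdings = {"a": set(), "b": set(), "c": {"song.mp3"}, "d": set()}

    def flood(origin, resource, ttl=3):
        seen, frontier, hits = {origin}, [origin], []
        while frontier and ttl > 0:
            nxt = []
            for servent in frontier:
                for peer in overlay[servent]:
                    if peer in seen:
                        continue
                    seen.add(peer)
                    if resource in holdings[peer]:
                        hits.append(peer)       # query hit routed back to origin
                    nxt.append(peer)
            frontier, ttl = nxt, ttl - 1
        return hits

    print(flood("a", "song.mp3"))               # ['c']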

  13. Grid computing

  14. Data-intensive physical sciences • High-energy & nuclear physics • Simulation • Earth observation, climate modeling • Geophysics, earthquake modeling • Fluids, aerodynamic design • Pollutant dispersal scenarios • Astronomy: digital sky surveys: the planned Large Synoptic Survey Telescope will produce over 10 petabytes per year by 2008! • Molecular genomics • Medical images

  15. A Brain is a Lot of Data! (Mark Ellisman, UCSD) • We need to get to one micron to know the location of every cell; we are just now starting to get to 10 microns • And comparisons must be made among many brains

  16. Performance evolution of computer components • Network vs. computer performance: • computer speed doubles every 18 months • network speed doubles every 9 months • disk capacity doubles every 12 months • 1986 to 2000: computers x 500, networks x 340,000 • 2001 to 2010 (projected): computers x 60, networks x 4,000 • (Sanity check: doubling every 18 months over the 14 years from 1986 to 2000 gives 2^(14 x 12 / 18) ≈ 650, consistent with the x 500 figure.) • Moore's law vs. storage improvements vs. optical improvements; graph from Scientific American (Jan. 2001) by Cleo Vilett, source Vinod Khosla, Kleiner Perkins Caufield & Byers.

  17. Partial conclusion • It is not a fantasy! • Real need for very high-performance infrastructures • Basic idea: share computing resources

  18. Back to roots (routes) • Railways, telephone, electricity, roads, the banking system • Complexity, standards, distribution, integration (large/small) • Impact on society: how the US grew • Big differences: • clients (the citizens) are NOT providers (the state or companies) • small number of actors/providers • small number of applications • strong supervision/control

  19. Computational grid • "HW and SW infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities" (I. Foster & C. Kesselman) • Performance criteria: • security • reliability • computing power • latency • services • throughput

  20. Applications • Distributed supercomputing • High throughput computing • On demand (real time) computing • Data intensive computing • Collaborative computing

  21. An example virtual organization: CERN's Large Hadron Collider • 1800 physicists, 150 institutes, 32 countries • 100 PB of data by 2010; 50,000 CPUs?

  22. Grid communities & applications: data grids for high-energy physics • Online system: a "bunch crossing" every 25 nsecs, 100 "triggers" per second, each triggered event ~1 MByte in size; ~PBytes/sec raw, ~100 MBytes/sec to the Tier 0 centre • Tier 0: CERN computer centre, offline processor farm ~20 TIPS, HPSS mass storage • Tier 1 (~622 Mbits/sec links, or air freight, deprecated): regional centres (France, Germany, Italy, FermiLab ~4 TIPS), HPSS • Tier 2 (~622 Mbits/sec links): centres of ~1 TIPS each (e.g. Caltech) • Institutes (~0.25 TIPS): physics data cache, ~1 MBytes/sec • Tier 4: physicist workstations (Pentium II 300 MHz) • Physicists work on analysis "channels"; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server • www.griphyn.org, www.ppdg.net, www.eu-datagrid.org

  23. Levels of cooperation • End system (computer, disk, sensor…) • multithreading, local I/O • Cluster (heterogeneous) • synchronous communications, DSM, parallel I/O • parallel processing • Intranet • heterogeneity, distributed admin, distributed FS and databases • low supervision, resource discovery • high throughput • Internet • no control, collaborative systems, (international) WAN • brokers, negotiation

  24. Basic services • Authentication • Authorization • Activity control • Resource information • Resource brokering • Scheduling • Job submission, data access/migration and execution • Accounting

  25. Layered grid architecture (by analogy to the Internet architecture) • Application • Collective, "coordinating multiple resources": ubiquitous infrastructure services, app-specific distributed services • Resource, "sharing single resources": negotiating access, controlling use • Connectivity, "talking to things": communication (Internet protocols) & security • Fabric, "controlling things locally": access to, and control of, resources • Internet counterparts: Application / Transport, Internet / Link • From I. Foster

  26. Aspects of the Problem • Need for interoperability when different groups want to share resources • Diverse components, policies, mechanisms • E.g., standard notions of identity, means of communication, resource descriptions • Need for shared infrastructure services to avoid repeated development, installation • E.g., one port/service/protocol for remote access to computing, not one per tool/application • E.g., Certificate Authorities: expensive to run • A common need for protocols & services From I. Foster

  27. Basic services • Authentication • Authorization • Activity control • Resource information • Resource brokering • Scheduling • Job submission, data access/migration and execution • Accounting

  28. Security: why grid security is hard • Resources being used may be extremely valuable & the problems being solved extremely sensitive • Resources are often located in distinct administrative domains • Each resource may have its own policies & procedures • Users may be different • The set of resources used by a single computation may be large, dynamic, and/or unpredictable • Not just client/server • It must be broadly available & applicable • Standard, well-tested, well-understood protocols • Integration with a wide variety of tools

  29. Grid security: various views • User view: 1) easy to use 2) single sign-on 3) run applications: ftp, ssh, MPI, Condor, Web, … 4) user-based trust model 5) proxies/agents (delegation) • Resource owner view: 1) specify local access control 2) auditing, accounting, etc. 3) integration with local systems: Kerberos, AFS, license managers 4) protection from compromised resources • Developer view: API/SDK with authentication, flexible message protection, flexible communication, delegation, ...; direct calls to various security functions (e.g. GSS-API), or security integrated into higher-level SDKs (e.g. GlobusIO, Condor)

  30. Grid security : requirements • Authentication • Authorization and delegation of authority • Assurance • Accounting • Auditing and monitoring • Integrity and confidentiality

  31. Resources • Description • Advertising • Cataloging • Matching • Claiming • Reserving • Checkpointing

  32. Resource layers • Application layer: tasks, resource requests • Application resource management layer: intertask resource management, execution environment • System layer: resource matching, global brokering • Owner layer: owner policy (who may use what) • End-resource layer: end-resource policy (e.g. the OS)

  33. Resource management (1) • Services and protocols depend on the infrastructure • Some parameters: • stability of the infrastructure (same set of resources or not) • freshness of the resource availability information • reservation facilities • multiple-resource or single-resource brokering • Example request: "I need from 10 to 100 computing elements, each with at least 128 MB of RAM and a computing power of 50 MIPS"
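
A minimal sketch of how a broker might match such a request against advertised resources, in the spirit of Condor's classified ads; the attribute names and the pool contents are invented for illustration:

    # Toy resource matching: each resource advertises its attributes; the
    # broker returns between 'lo' and 'hi' computing elements satisfying the
    # constraints. Attribute names (ram_mb, mips) and the pool are hypothetical.
    pool = [
        {"name": "ce01", "ram_mb": 256, "mips": 80},
        {"name": "ce02", "ram_mb": 128, "mips": 55},
        {"name": "ce03", "ram_mb": 64,  "mips": 120},   # too little RAM
        {"name": "ce04", "ram_mb": 512, "mips": 40},    # too slow
    ]

    def match(pool, lo, hi, min_ram_mb, min_mips):
        hits = [r for r in pool
                if r["ram_mb"] >= min_ram_mb and r["mips"] >= min_mips]
        if len(hits) < lo:
            return None                  # the request cannot be satisfied
        return hits[:hi]                 # claim at most 'hi' matching resources

    print(match(pool, lo=2, hi=100, min_ram_mb=128, min_mips=50))
    # -> the ce01 and ce02 advertisements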

  34. Resource management (2) • [Figure: the structure of a resource management system]

  35. Resource management and scheduling (1) • Levels of scheduling: • job scheduling (global level; performance metric: throughput) • resource scheduling (metrics: fairness, utilization) • application scheduling (metrics: response time, speedup, produced data…) • Mapping/scheduling: • resource discovery and selection • assignment of tasks to computing resources • data distribution • task scheduling on the computing resources • (communication scheduling) • Individual performance objectives are not necessarily consistent with the global (system) performance!

  36. Resource management and scheduling (2) • Grid problems: • predictions are not definitive: dynamicity! • heterogeneous platforms • checkpointing and migration

  37. A resource management system example (Globus) • The application expresses its needs in RSL • A broker specializes the RSL, using queries to and information from the Information Service • A co-allocator turns the specialized RSL into simple "ground" RSL requests • Each ground request goes to a GRAM, which hands the job to a local resource manager (LSF, Condor, NQE…)

  38. Resource information (1) • What is to be stored? • Organizations, people, computing resources, software packages, communication resources, event producers, devices… • what about data??? • A key issue in such dynamic environments • A first approach: a (distributed) directory (LDAP): • easy to use • tree structure • distribution • static • mostly read; not efficient for updating • hierarchical • poor procedural language
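
A sketch of querying an LDAP-style resource directory of this kind; the server address, base DN, object class and attribute names are all invented, and the ldap3 package stands in for whatever client library is at hand:

    # Querying a (hypothetical) LDAP resource directory for compute elements.
    from ldap3 import Server, Connection, ALL

    server = Server("ldap://mds.grid.example:389", get_info=ALL)
    conn = Connection(server, auto_bind=True)       # anonymous bind

    # LDAP filters handle simple attribute predicates well...
    conn.search(
        search_base="dc=grid,dc=example",
        search_filter="(&(objectClass=computeElement)(freeCpus>=10))",
        attributes=["cn", "ramMB", "freeCpus"],
    )
    for entry in conn.entries:
        print(entry.cn, entry.ramMB, entry.freeCpus)
    # ...but joins across entries and frequent writes are where the directory
    # model starts to hurt, which motivates the database approach of slide 39.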

  39. Resource information (2) • But: • dynamicity • complex relationships • frequent updates • complex queries • A second approach: a (relational) database
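
For contrast, a minimal relational sketch using sqlite3 from Python's standard library; the schema and data are illustrative only:

    # The same resource information kept relationally: updates are cheap and
    # queries can join across entities, which a directory tree does poorly.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE site(id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE ce(id INTEGER PRIMARY KEY,
                        site_id INTEGER REFERENCES site(id),
                        ram_mb INTEGER, free_cpus INTEGER);
    """)
    db.execute("INSERT INTO site VALUES (1, 'Lyon'), (2, 'CERN')")
    db.execute("INSERT INTO ce VALUES (1, 1, 256, 12), (2, 2, 128, 3)")

    db.execute("UPDATE ce SET free_cpus = 0 WHERE id = 2")  # frequent updates: fine

    # A "complex query": sites that can still offer a 10-CPU, 128 MB slot.
    rows = db.execute("""SELECT site.name FROM site
                         JOIN ce ON ce.site_id = site.id
                         WHERE ce.ram_mb >= 128 AND ce.free_cpus >= 10""").fetchall()
    print(rows)   # [('Lyon',)]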

  40. Data management • It was long forgotten!!! • Though it is a key issue! • Issues: • indexing • retrieval • replication • caching • traceability • (auditing) • And security!!!

  41. The replica management problem • Maintain a mapping between logical names for files and collections and one or more physical locations • Decide where and when a piece of data must be replicated • Important for many applications • Example: CERN high-level trigger data: • multiple petabytes of data per year • a copy of everything at CERN (Tier 0) • subsets at national centres (Tier 1) • smaller regional centres (Tier 2) • individual researchers will have copies • Even more complex with sensitive data such as medical data!!!
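
A minimal sketch of the logical-to-physical mapping at the heart of the problem; the file names and URLs are invented, and real systems (e.g. the Globus replica catalog) add consistency, security and replica-selection logic on top:

    # Toy replica catalog: one logical file name (LFN) maps to several
    # physical file names (PFNs). All names below are made up.
    from collections import defaultdict

    catalog = defaultdict(set)        # LFN -> set of PFNs

    def register(lfn, pfn):           # a new replica appears
        catalog[lfn].add(pfn)

    def locate(lfn):                  # where can this logical file be read?
        return sorted(catalog.get(lfn, set()))

    register("lhc/run42/events.dat", "gsiftp://cern.example/t0/events.dat")
    register("lhc/run42/events.dat", "gsiftp://lyon.example/t1/events.dat")
    print(locate("lhc/run42/events.dat"))
    # A real replica manager must also keep this mapping consistent as
    # replicas are created and deleted, and pick the "best" replica to use.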

  42. Programming on the grid : potential programming models • Message passing (PVM, MPI) • Distributed Shared Memory • Data Parallelism (HPF, HPC++) • Task Parallelism (Condor) • Client/server - RPC • Agents • Integration system (Corba, DCOM, RMI)
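
As an illustration of the first model, a minimal message-passing sketch with mpi4py (one Python binding for MPI, assuming an MPI installation is available):

    # Message passing in miniature: rank 0 scatters work, every process
    # computes locally, rank 0 gathers the results.
    # Run with e.g.: mpiexec -n 4 python sketch.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    chunks = [list(range(i, 100, size)) for i in range(size)] if rank == 0 else None
    mine = comm.scatter(chunks, root=0)        # each process gets its chunk
    partial = sum(x * x for x in mine)         # local computation
    total = comm.reduce(partial, op=MPI.SUM, root=0)
    if rank == 0:
        print(total)                           # sum of squares 0..99 = 328350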

  43. Program execution : issues • Parallelize the program with the right job structure, communication patterns/procedures, algorithms • Discover the available resources • Select the suitable resources • Allocate or reserve these resources • Migrate the data • Initiate computations • Monitor the executions ; checkpoints ? • React to changes • Collect results
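
These steps sketch naturally as a driver loop. The toy sketch below uses a local process pool as the "grid", which collapses discovery, selection, reservation and data migration into pool creation; every name in it is illustrative, not a real grid API:

    # Schematic driver for the execution steps above, against a toy "grid".
    from concurrent.futures import ProcessPoolExecutor, as_completed

    def task(x):                                # the parallelized unit of work
        return x * x

    def run_on_grid(inputs):
        with ProcessPoolExecutor() as grid:     # discover/allocate resources
            handles = [grid.submit(task, x) for x in inputs]   # initiate
            results = []
            for h in as_completed(handles):     # monitor the executions
                results.append(h.result())      # collect results
        return results

    if __name__ == "__main__":
        print(sorted(run_on_grid(range(10))))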

  44. The Legion system • University of Virginia • Object-oriented approach: objects = data, applications, sensors, computing resources, codes…: everything is an object! • Loosely coupled codes • Single naming space • Reuse of existing OSes and protocols; definition of message formats and high-level protocols • Core objects: naming, binding, object creation/activation/deactivation/destruction • Methods: described via an IDL • Security: in the hands of the users • Resource allocation: a site can define its own policy

  45. The Globus toolkit • A set of integrated services for the grid • Services: • resource management (GRAM, DUROC) • communication (Nexus, MPICH-G2, globus_io) • information (MDS) • data management (replica catalog) • security (GSI) • monitoring (HBM) • remote data access (GASS, GridFTP, RIO) • executable management (GEM) • execution • Commodity Grid Kits (Java, Python, Corba, Matlab…)

  46. High-throughput computing: Condor • A high-throughput computing platform for mapping many tasks to idle computers • Since 1986! • Major components: • a central manager manages pool(s) of (distributively owned or dedicated) computers; a central manager = scheduler + coordinator • DAGMan manages user task pools • the matchmaker schedules tasks to computers using classified ads • checkpointing and process migration • no simple communications • Parameter studies, data analysis • Condor married Globus: Condor-G • More than 150 Condor pools in the world; or on your own machine!

  47. Defining a DAG • A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D

  • Each node will run the Condor job specified by its accompanying Condor submit file • [Figure: the diamond DAG: A at the top, B and C in the middle, D at the bottom] • From the Condor tutorial
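
For concreteness, a minimal submit description of the kind a.sub might contain; the executable and file names are placeholders, not from the tutorial:

    # a.sub : a minimal Condor submit description (contents hypothetical)
    universe   = vanilla
    executable = node_a            # placeholder program for node A
    output     = a.out
    error      = a.err
    log        = diamond.log
    queue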

  48. Conclusion • Just a new toy for scientists or a revolution ? • Complexity from heterogeneity, wide distribution, security, dynamicity • Many approaches • Still much work to do !!! • A global framework for grid computing, pervasive computing and Web services ?

  49. Functional view of grid data management • The application consults a Metadata Service (location based on data attributes) and a Replica Location Service (location of one or more physical replicas), with Information Services providing the state of grid resources, performance measurements and predictions • A Planner handles data location, replica selection, and the selection of compute and storage nodes • An Executor initiates data transfers and computations, via Data Movement and Data Access services over the compute and storage resources • Security and policy apply throughout

  50. Components in Globus Toolkit 3.0 • Security: GSI, WS-Security • Data management: WU GridFTP, RFT (OGSI), RLS • Resource management: Pre-WS GRAM, WS GRAM (OGSI) • Information services: MDS2, WS-Index (OGSI) • WS Core: Java WS Core (OGSI), OGSI C Bindings
