
EGEE is a project funded by the European Union under contract INFSO-RI-508833

Practical approaches to Grid workload management in the EGEE project. Massimo Sgaravatto, INFN Padova, on behalf of the EGEE JRA1 IT-CZ cluster. CHEP 2004. www.eu-egee.org


Presentation Transcript


  1. Practical approaches to Grid workload management in the EGEE project • Massimo Sgaravatto, INFN Padova, on behalf of the EGEE JRA1 IT-CZ cluster • CHEP 2004 • www.eu-egee.org

  2. EGEE project • Aim: build a consistent, robust and secure Grid infrastructure • Focus first on two pilot application areas (HENP, biomedical applications) • But the goal is to take in other researchers in academia and industry • Middleware activity (JRA1) • Re-engineer Grid software to provide production-quality middleware • Evolution towards emerging standards, based on Service Oriented Architectures • Taking into account application requirements and production/deployment/management needs • See talk #247 (E. Laure)

  3. Workload management • Grid workload and resource management is one of the key functionalities of Grid middleware • How to efficiently schedule a large number of different data-intensive jobs, submitted by a distributed community of users, to a Grid encompassing many heterogeneous resources • Progress was made in various projects with different integrated software solutions: • DataGrid Workload Management System • Condor • EuroGrid-Unicore resource broker • … • Still a lot to do • Scalability, reliability • Identification and handling of failures originating from different software layers, and possibly from 'foreign' Grid systems and resources • Distributed (hierarchical?) super-scheduling • Proper semantics of resource information collection and distribution (push, pull, index, cache, refresh) • …

  4. Workload Management System • Provision of Grid Workload Management System services assigned to the “EGEE JRA1 Italian Czech cluster” • CESNET • Datamat S.p.A. • INFN • Architecture of the EGEE WMS designed and being implemented • Taking into account feedback and requirements from reference applications and deployment/production/management activities • Taking into account previous experiences from other Grid projects (in particular the DataGrid WMS) • Set of Grid services • Workload Manager (WM) • Computing Element (CE): Resource access • Logging & Bookkeeping (L&B) • Job Provenance (JP) • Grid Accounting service • Interoperating among themselves and with other EGEE Grid Services

  5. Workload Manager

  6. Workload Manager • Job management requests (submission, cancellation) expressed via a Job Description Language (JDL)

  7. Workload Manager • Keeps submission requests • Requests are kept for a while if no matching resources are available

  8. Workload Manager • Repository of resource information available to the matchmaker • Updated via notifications and/or active polling of sources

  9. Workload Manager • Finds an appropriate CE for each submission request, taking into account job requests and preferences, Grid status, and utilization policies on resources
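The matchmaking step described on this slide can be illustrated with a minimal Python sketch. It is not the EGEE implementation: the `CE` and `Job` classes, their fields, and the `match` function are hypothetical names chosen for illustration, and the hard constraint and preference are modelled as plain Python callables rather than JDL expressions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CE:
    """Hypothetical snapshot of what the WM knows about a Computing Element."""
    name: str
    free_cpus: int
    queued_jobs: int
    os: str


@dataclass
class Job:
    """Hypothetical job request: a hard constraint plus a preference."""
    job_id: str
    requirements: Callable[[CE], bool]  # CE must satisfy this to be a candidate
    rank: Callable[[CE], float]         # higher value means more preferred


def match(job: Job, ces: List[CE]) -> Optional[CE]:
    """Return the most preferred CE that satisfies the job, or None."""
    candidates = [ce for ce in ces if job.requirements(ce)]
    if not candidates:
        return None  # nothing suitable right now; the request would be kept
    return max(candidates, key=job.rank)


ces = [
    CE("ce-a", free_cpus=0, queued_jobs=12, os="linux"),
    CE("ce-b", free_cpus=4, queued_jobs=3, os="linux"),
    CE("ce-c", free_cpus=8, queued_jobs=0, os="other"),
]
job = Job(
    "job-1",
    requirements=lambda ce: ce.os == "linux",
    rank=lambda ce: -ce.queued_jobs,  # prefer shorter queues
)
best = match(job, ces)
unmatched = match(Job("job-2", lambda ce: ce.free_cpus > 100, lambda ce: 0.0), ces)
```

Here `ce-c` is excluded by the requirement even though it is idle, and the rank then selects the shorter of the two remaining queues.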

  10. Scheduling policies • Different possible policies • Eager scheduling: a job is bound to a resource as soon as possible • Job is then forwarded to that CE, where it will very likely end up in a queue • Lazy scheduling: job held by the WM until a resource becomes available • Job then forwarded to that CE for immediate execution • WM architecture able to accommodate both models (and intermediate solutions) • Eager scheduling: matching a job against multiple resources • Lazy scheduling: matching a resource against multiple jobs • The strengths and weaknesses of the different policies in different scenarios still need to be investigated • Evaluation of relevant metrics, covering both resource utilization and user satisfaction
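The contrast between the two policies (one job matched against many resources versus one resource matched against many held jobs) can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the WM code: `Resource`, `Job`, `eager_schedule`, and `lazy_schedule` are hypothetical names, suitability is reduced to a CPU count, and the lazy policy picks held jobs in FIFO order.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    name: str
    cpus: int
    queue: List[str] = field(default_factory=list)  # job ids waiting at the CE


@dataclass
class Job:
    job_id: str
    min_cpus: int


def eager_schedule(job: Job, resources: List[Resource]) -> Optional[Resource]:
    """Eager: match ONE job against MANY resources and bind it immediately.

    The job is appended to the chosen resource's queue, where it may wait.
    """
    suitable = [r for r in resources if r.cpus >= job.min_cpus]
    if not suitable:
        return None
    target = min(suitable, key=lambda r: len(r.queue))
    target.queue.append(job.job_id)
    return target


def lazy_schedule(free_resource: Resource, held_jobs: List[Job]) -> Optional[Job]:
    """Lazy: match ONE free resource against MANY held jobs.

    The first held job that fits is removed from the list and returned.
    """
    for index, job in enumerate(held_jobs):
        if free_resource.cpus >= job.min_cpus:
            return held_jobs.pop(index)
    return None


resources = [Resource("ce-a", cpus=4, queue=["j0"]), Resource("ce-b", cpus=8)]
eager_target = eager_schedule(Job("j1", min_cpus=2), resources)

held = [Job("j2", min_cpus=16), Job("j3", min_cpus=4)]
picked = lazy_schedule(Resource("ce-c", cpus=4), held)
```

In the eager case the decision uses the Grid state at submission time; in the lazy case it is deferred until a resource announces that it is free, so `j2` simply stays held until a large enough resource appears.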

  11. Computing Element • Service representing a computing resource • Main functionality: job management • Run jobs • Cancel jobs • Suspend and resume jobs • Provide info on “quality of service” • How many resources match the job requirements? • What is the estimated time until the job starts executing? • … • … • Used by the WM or by any other client (e.g. end-user) • CE architecture designed to support both the push and pull models • Push model: the job is pushed to the CE by the WM • Pull model: the CE asks the WM for jobs • These two models are somewhat mirrored in the resource information flow • In order to 'pull' a job a resource must choose where to 'push' information about itself
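The push and pull models differ only in who initiates the transfer of a job. A minimal Python sketch, assuming hypothetical `WorkloadManager` and `ComputingElement` classes and method names (none of them taken from the EGEE interfaces), might look like this:

```python
from collections import deque
from typing import Deque, List, Optional


class WorkloadManager:
    """Hypothetical WM that can push jobs or hand them out on request."""

    def __init__(self) -> None:
        self.held: Deque[str] = deque()  # job ids waiting to be pulled

    def push_job(self, job_id: str, ce: "ComputingElement") -> None:
        """Push model: the WM picks the CE and sends the job to it."""
        ce.accept(job_id)

    def hold(self, job_id: str) -> None:
        """Keep a job at the WM until some CE asks for work."""
        self.held.append(job_id)

    def request_job(self, ce_name: str) -> Optional[str]:
        """Pull model: a CE identifies itself (ce_name) and asks for a job.

        This sketch ignores ce_name and hands out held jobs in FIFO order.
        """
        return self.held.popleft() if self.held else None


class ComputingElement:
    def __init__(self, name: str) -> None:
        self.name = name
        self.running: List[str] = []

    def accept(self, job_id: str) -> None:
        self.running.append(job_id)

    def pull_from(self, wm: WorkloadManager) -> bool:
        """Ask the WM for work; return True if a job was received."""
        job_id = wm.request_job(self.name)
        if job_id is None:
            return False
        self.accept(job_id)
        return True


wm = WorkloadManager()
ce = ComputingElement("ce-a")
wm.push_job("j1", ce)          # push: WM-initiated
wm.hold("j2")
pulled = ce.pull_from(wm)      # pull: CE-initiated
pulled_again = ce.pull_from(wm)
```

Note how the pull call carries the CE's own identity to the WM, which mirrors the slide's remark that a resource must push information about itself in order to pull a job.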

  12. CE Architecture (diagram) • Client operations: JobSubmit, JobAssess, JobKill, JobSuspend, JobResume, JobGetStatus • Web service accepting job management requests • CE Mon • LSF, PBS, ? • Worker Nodes
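The six operations named on this slide suggest a small job state machine behind the CE interface. The sketch below is an in-memory stand-in, not the CE Web service: only the operation names come from the slide, while the `JobState` values, the slot-based meaning given to `job_assess`, and the allowed transitions are assumptions made for illustration.

```python
from enum import Enum
from typing import Dict, Set


class JobState(Enum):
    # Assumed states; the slide does not list them.
    RUNNING = "running"
    SUSPENDED = "suspended"
    KILLED = "killed"


class ComputingElementStub:
    """In-memory stand-in exposing the operations named on the slide."""

    def __init__(self, free_slots: int) -> None:
        self.free_slots = free_slots
        self._state: Dict[str, JobState] = {}
        self._slots: Dict[str, int] = {}

    def job_assess(self, slots_needed: int) -> bool:
        """Assumed meaning: could this job start right now?"""
        return slots_needed <= self.free_slots

    def job_submit(self, job_id: str, slots_needed: int = 1) -> None:
        if not self.job_assess(slots_needed):
            raise RuntimeError("not enough free slots")
        self.free_slots -= slots_needed
        self._slots[job_id] = slots_needed
        self._state[job_id] = JobState.RUNNING

    def _move(self, job_id: str, allowed: Set[JobState], new: JobState) -> None:
        current = self._state[job_id]
        if current not in allowed:
            raise ValueError(f"cannot go from {current.value} to {new.value}")
        self._state[job_id] = new

    def job_suspend(self, job_id: str) -> None:
        self._move(job_id, {JobState.RUNNING}, JobState.SUSPENDED)

    def job_resume(self, job_id: str) -> None:
        self._move(job_id, {JobState.SUSPENDED}, JobState.RUNNING)

    def job_kill(self, job_id: str) -> None:
        self._move(job_id, {JobState.RUNNING, JobState.SUSPENDED}, JobState.KILLED)
        self.free_slots += self._slots.pop(job_id)  # release the job's slots

    def job_get_status(self, job_id: str) -> JobState:
        return self._state[job_id]


ce = ComputingElementStub(free_slots=2)
ce.job_submit("j1")
ce.job_suspend("j1")
ce.job_resume("j1")
status_after_resume = ce.job_get_status("j1")
ce.job_kill("j1")
```

Rejecting invalid transitions (for example resuming a killed job) in one helper keeps each public operation a one-liner.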

  13. CE Architecture (diagram) • Client • Notifications: asynchronous notifications about job/CE events • Job requests (for a CE working in pull mode) • CE Mon • LSF, PBS, ? • Worker Nodes

  14. Logging & Bookkeeping • Collects and manages job-related events (e.g. submission, suitable CE found, start of execution, …) from the WMS components • Processes these events to give a higher-level view of job states • Both job states and raw data available to users • Also via Web Service interface • Possible to subscribe to receive notifications on particular job state changes • L&B event trail can be analyzed to identify problems with resources ("black holes", unusual failure rates, etc.) • See poster #419 for more details
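The step from raw events to a higher-level job state can be sketched in Python as a fold over a time-ordered event trail. The event kinds, state names, and the `Event` and `job_states` definitions below are hypothetical; the real L&B event types and state machine are not given on the slide.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed mapping from event kind to the job state it implies.
EVENT_TO_STATE = {
    "submitted": "SUBMITTED",
    "ce_matched": "SCHEDULED",
    "started": "RUNNING",
    "finished": "DONE",
    "failed": "FAILED",
}


@dataclass(frozen=True)
class Event:
    job_id: str
    kind: str
    timestamp: float


def job_states(events: List[Event]) -> Dict[str, str]:
    """Derive the current state of each job from its raw events.

    Events from different WMS components may arrive out of order, so they
    are sorted by timestamp before being applied.
    """
    states: Dict[str, str] = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        states[event.job_id] = EVENT_TO_STATE[event.kind]
    return states


events = [
    Event("j1", "started", 3.0),      # arrives before the earlier events
    Event("j1", "submitted", 1.0),
    Event("j1", "ce_matched", 2.0),
    Event("j2", "submitted", 1.5),
    Event("j2", "failed", 4.0),
]
states = job_states(events)
```

Because the raw events are kept alongside the derived states, the same trail can later be re-read for other purposes, such as counting failures per resource.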

  15. Job Provenance • Keeps track of definition of submitted jobs, execution conditions and job life cycle for a long time • Job life logs (JDL, timestamps, jobids, …) • Executable and input/output files • Execution environment (OS, installed software version, …) • Custom data provided by user • Used for • Debugging • Post-mortem analysis • Comparison of job executions in an evolving environment • Service components • Primary Storage Server • Keeps data in the most compact and economical form • Index Servers • Configured to support a set of queryable attributes • See poster #419 for more details

  16. Grid Accounting • Accumulates information about the usage of Grid resources by users / groups (e.g. VOs) • To be used • To track resource usage • To discover abuses (and help avoid them) • Also possible to charge users for the resources they have used • Allows implementation of submission policies based on resource usage • Exchange market among Grid users and Grid resource owners, which should result in market equilibrium • → Load balancing on the Grid
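Accumulating usage per user or per VO, and flagging heavy consumers for a usage-based policy, can be illustrated with a short Python sketch. The `UsageRecord` fields, the CPU-seconds metric, and the quota check are assumptions for illustration only; the slide does not specify the record format or any policy.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class UsageRecord:
    """Hypothetical metering record produced for one finished job."""
    user: str
    vo: str
    resource: str
    cpu_seconds: float


def usage_by(records: Iterable[UsageRecord], key: str) -> Dict[str, float]:
    """Sum CPU seconds grouped by 'user', 'vo', or 'resource'."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cpu_seconds
    return dict(totals)


def over_quota(totals: Dict[str, float], quota: float) -> List[str]:
    """Names whose accumulated usage exceeds the quota (sorted)."""
    return sorted(name for name, used in totals.items() if used > quota)


records = [
    UsageRecord("alice", "vo-a", "ce-a", 3600.0),
    UsageRecord("bob", "vo-a", "ce-b", 1800.0),
    UsageRecord("alice", "vo-a", "ce-b", 600.0),
    UsageRecord("carol", "vo-b", "ce-a", 7200.0),
]
by_user = usage_by(records, "user")
by_vo = usage_by(records, "vo")
heavy_users = over_quota(by_user, quota=4000.0)
```

A submission policy could consult `heavy_users` before accepting new jobs; charging would additionally need a price per resource, which is outside this sketch.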

  17. Accounting architecture (diagram) • Resource metering: getting info about resource usage • Components shown: Accounting, Storage Element, Computing Element

  18. Accounting architecture (diagram) • Reports about resource usage per user / VO / resource • Components shown: Accounting, Storage Element, Computing Element

  19. Accounting architecture (diagram) • Resource pricing • Components shown: Accounting, Storage Element, Computing Element, Resource owner

  20. Accounting architecture (diagram) • Resource pricing • Cost computation • Components shown: Accounting, Storage Element, Computing Element, Resource owner

  21. Status • Workload Manager, Logging & Bookkeeping, Grid Accounting software inherited from the DataGrid WMS software • Being revised and complemented according to the new architecture • E.g. Information Supermarket, TaskQueue new developments • Web services interfaces • First implementation already deployed in the EGEE gLite prototype testbed • Computing Element • New developments • CEMon prototype already implemented • Job Provenance • New component being implemented

  22. Links • EGEE JRA1 IT-CZ cluster homepage • http://egee-jra1-wm.mi.infn.it/egee-jra1-wm • EGEE JRA1 (middleware activity) homepage • http://egee-jra1.web.cern.ch/egee-jra1 • EGEE project homepage • http://www.eu-egee.org
