
The gLite Workload Management System Alessandro Maraschini alessandro.maraschini@datamat.it

Overview of the gLite Workload Management System, its architecture, job types, and latest features and functionalities. Also covers testing activities and future plans.


Presentation Transcript


  1. The gLite Workload Management System • Alessandro Maraschini • alessandro.maraschini@datamat.it • OGF20, Manchester, UK, May 10th, 2007

  2. Contents • WMS • System overview, partners, task, components • JDL • Language overview • Job types: single / compound • News • New Functionalities • Latest Activities • Future Plans • Future Implementations & Activities • WMS and WfMS • Tests • Middleware Testing Activities & Results

  3. Introduction: gLite WMS • Workload Management System (WMS) • Italian and Czech clusters • Part of Joint Research Activity 1 (JRA1) • Partners involved • INFN • Datamat • CESNET • Provides distribution and management of tasks across the resources available on a Grid • Accepts a job execution request from a client • Finds appropriate resources to satisfy the job • Follows the job until completion

  4. WMS Architecture: core components • WMProxy • Accepts requests from the user • Checks authentication/authorization • Sets up the local file system • Forwards the request to the WM • Workload Manager (WM) • Looks for the appropriate Computing Element (CE) • Matchmaking operation • Forwards the request to CondorC • Logging & Bookkeeping (LB) • Tracks jobs in terms of events gathered from various gLite components • Processes the incoming events to give a higher-level view of the job states • Recent introduction of LBProxy • Lightweight local LB service “dedicated” to WMS components • Asynchronously logs info to the actual (usually remote) LB
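
The matchmaking operation listed for the Workload Manager can be pictured as filtering Computing Elements by a job's requirements and then ranking the survivors. The sketch below is only an illustration under stated assumptions: the `ComputingElement` fields, the host names, and the `matchmake` helper are hypothetical and are not taken from the gLite code base.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class ComputingElement:
    # Hypothetical attributes, chosen for illustration only.
    name: str
    free_cpus: int
    max_wallclock_min: int


def matchmake(
    ces: Iterable[ComputingElement],
    requirements: Callable[[ComputingElement], bool],
    rank: Callable[[ComputingElement], float],
) -> Optional[ComputingElement]:
    """Return the highest-ranked CE that satisfies the requirements, or None."""
    candidates = [ce for ce in ces if requirements(ce)]
    if not candidates:
        return None
    return max(candidates, key=rank)


# Made-up resources for the example.
ces = [
    ComputingElement("ce-a.example.org", free_cpus=4, max_wallclock_min=60),
    ComputingElement("ce-b.example.org", free_cpus=32, max_wallclock_min=720),
    ComputingElement("ce-c.example.org", free_cpus=0, max_wallclock_min=1440),
]

# The job needs a free CPU and at least two hours of wall-clock time;
# among the matching CEs it prefers the one with the most free CPUs.
best = matchmake(
    ces,
    requirements=lambda ce: ce.free_cpus > 0 and ce.max_wallclock_min >= 120,
    rank=lambda ce: ce.free_cpus,
)
```

Separating the boolean requirements from the numeric rank mirrors the "requirements/preferences about resources" split that the JDL overview slide mentions; a real matchmaker would evaluate these against published resource information rather than an in-memory list.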

  5. WMS Architecture overview • [Diagram; component labels: User Interface, WMProxy, Workload Manager, LB Proxy, LB Server, Job Controller, CondorG, CondorC, Log Monitor, LCG CE, gLite CE, gLite WMS]

  6. JDL: overview • Job Description Language (JDL) • gLite approach to Request Description • Lets the user provide the information needed to execute a job • Characteristics of the application • Requirements/preferences about resources • Customized hints for gLite WMS on how to handle the application • Supported Job Types • Single Jobs • Compound Jobs • Workflows (DAGs) • Collections, Parametric Jobs

  7. JDL: Single Types • Single Jobs • Normal: a single, simple batch job with no peculiar requirements • MPICH: a parallel application to be run on the nodes of a cluster using the MPICH implementation of the Message Passing Interface • Support for new MPI flavours planned • Interactive: a job whose standard streams are forwarded to the submitting client, which can interact with and steer the job execution by providing real-time input • Previously Supported Job Types • Not supported anymore: • Checkpointable Jobs • Partitionable Jobs • Deprecated due to lack of feedback from users • It seems they are not used at all • Focus on improving support for “really used” job types

  8. JDL: Compound Jobs • Definition • Aggregation of Single/Normal Jobs • Benefits • One-shot submission for (up to thousands of) jobs • Single call to WMProxy server • Single AuthN and AuthZ process • Submission time reduction • Single identifier to manage all jobs (father job) • Not an actual job, used to monitor the whole bunch • Sharing of files between jobs

  9. JDL: Compound Types • Compound Jobs: Workflows • Implemented as Directed Acyclic Graphs (DAGs) • Set of jobs where the input, output or execution of one or more jobs may depend on one or more other jobs • Dependencies represent time constraints: a child cannot start before all parents have successfully completed • [Diagram: Father job with nodes nodeA through nodeI]
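
The rule that a child cannot start before all of its parents have completed is a topological ordering constraint. A minimal sketch using Python's standard `graphlib` module follows; the dependency map is a made-up example (the slide's actual graph edges are not recoverable from the transcript), and `run_order` is a hypothetical helper, not part of any gLite API.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each node maps to the set of parents it waits on.
dependencies = {
    "nodeA": set(),
    "nodeB": {"nodeA"},
    "nodeC": {"nodeA"},
    "nodeD": {"nodeB", "nodeC"},
}


def run_order(deps):
    """Group nodes into waves; every node in a wave has all parents done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes whose parents are all done
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave completed successfully
    return waves


waves = run_order(dependencies)
```

Nodes in the same wave have no mutual dependencies, so a scheduler is free to run them concurrently; the sketch marks every node as successful, whereas a real workflow engine would also have to handle failed parents.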

  10. JDL: Compound Types • Compound Jobs: Parametric Jobs • Parameterized description of a Job • Parameter sweep usage • Automatically converted on WMS side • Generates a (possibly) huge number of (similar) jobs • No dependencies between nodes • [Diagram: Father job with nodes nodeA through nodeE]
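
The conversion of one parameterized description into many similar jobs can be sketched as a template expansion. This is an illustration only: the dictionary layout, the `expand_parametric` helper, the file names, and the use of `_PARAM_` as the placeholder token are assumptions made for the example, and the actual expansion performed on the WMS side may differ.

```python
def expand_parametric(template, values, placeholder="_PARAM_"):
    """Return one job description per value, substituting the placeholder."""
    jobs = []
    for value in values:
        # Build a fresh dict per job so the template itself is never modified.
        job = {key: text.replace(placeholder, str(value))
               for key, text in template.items()}
        jobs.append(job)
    return jobs


# Hypothetical parameterized description (all attribute values are strings).
template = {
    "Executable": "simulate.sh",
    "Arguments": "--seed _PARAM_",
    "StdOutput": "run-_PARAM_.out",
}

# Sweep the parameter over 0, 2, 4: three independent, similar jobs.
jobs = expand_parametric(template, range(0, 6, 2))
```

Because each generated job differs only in the substituted value, the resulting nodes have no dependencies on one another, which matches the slide's description.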

  11. JDL: Compound Types • Compound Jobs: Collections • A set of possibly heterogeneous jobs that can be specified within a single JDL description • No dependencies among specified jobs • [Diagram: Father job with nodes nodeA through nodeE]

  12. New Functionalities: WMProxy • WMProxy server • Replaces the old C++ based socket connection service • Implements an interoperable interface • Web Service based • WS-I compliant • WMProxy client • Provides a C++ based WMS command-line User Interface (UI), which executes all the needed operations automatically • Provides APIs in multiple languages (C++, Java and Python)

  13. New Functionalities: ICE-CREAM • CREAM service: Computing Resource Execution And Management service • CE with Web service interface • WMS requests are directly forwarded to CREAM-based CEs through ICE • ICE: Interface to CREAM Environment • Basically reproduces the Job Controller / CondorC / Log Monitor layer needed by gLite/LCG CEs

  14. New Functionalities: ICE-CREAM • [Diagram; component labels: User Interface, WMProxy, Workload Manager, LB Proxy, LB Server, ICE, CREAM, Job Controller, CondorG, CondorC, Log Monitor, LCG CE, gLite CE, gLite WMS]

  15. New Functionalities: Sandbox Files • Sandbox Archiving and Sharing • Job sandbox files can be automatically compressed • Different jobs can share the same sandbox • Dramatically reduces network traffic • Allows the user to save time and bandwidth • Sandbox Remote Specification • User can store files directly on a remote machine • No intermediate copies: JobWrapper will download directly from WorkerNode • Reduces server load • Supported File Transfer • Full support (input & output files) for protocols: • gridftp • https

  16. New Functionalities: Bulk-MM • Bulk-MatchMaking • Natural completion of the Bulk Submission • Allows a single matchmaking of similar jobs in one shot • Job equivalence is based on specified “significant attributes” • Jobs whose significant attributes are literally equal are equivalent • Target Jobs: Bunch of Independent Jobs • Mainly Collections and Parametric Jobs • Originally managed with Condor DAGMan • Allows Submission to CREAM-based CEs • Provides additional boost in WMS performance • Saves time & resources
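
The equivalence rule behind bulk matchmaking (jobs are equivalent when their significant attributes are literally equal) amounts to grouping jobs by a key built from those attributes, so that matchmaking runs once per group rather than once per job. The sketch below assumes a hypothetical job representation as dictionaries holding attribute expressions as plain strings; `group_equivalent` and the attribute values are illustrative and not gLite code.

```python
from collections import defaultdict


def group_equivalent(jobs, significant):
    """Group job ids by the literal values of their significant attributes."""
    groups = defaultdict(list)
    for job in jobs:
        # Literal comparison: the raw expression strings must match exactly.
        key = tuple(job.get(attr) for attr in significant)
        groups[key].append(job["id"])
    return dict(groups)


# Hypothetical collection: three jobs share requirements and rank.
jobs = [
    {"id": "job1", "Requirements": "free_cpus > 0", "Rank": "free_cpus"},
    {"id": "job2", "Requirements": "free_cpus > 0", "Rank": "free_cpus"},
    {"id": "job3", "Requirements": "memory >= 2048", "Rank": "free_cpus"},
    {"id": "job4", "Requirements": "free_cpus > 0", "Rank": "free_cpus"},
]

groups = group_equivalent(jobs, significant=("Requirements", "Rank"))
# Matchmaking would now be needed once per group instead of once per job.
matchmaking_runs = len(groups)
```

Comparing the expressions as strings keeps the grouping cheap, at the cost of treating semantically identical but differently written expressions as distinct.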

  17. Other New Functionalities • Service Discovery • Provides service information by querying external databases of different kinds (RGMA, BDII) • Client side • Queries for available WMProxy endpoints on the net • No manual reconfiguration of user commands needed • Server side • Queries for available LB servers on which to log job information • Job Files Perusal • Monitors the actual output files produced by a job during its lifecycle • Adds important information that simple status monitoring does not provide and that was previously available only at job completion

  18. gLite New Activities • New platforms widely deployed on the infrastructure • In particular Scientific Linux 4 and 64-bit architectures • Migration to ETICS build system • High flexibility • Addresses multiple platform support • Almost impossible using the old gLite build system • Build achieved for all WMS components • WMS ongoing activity: integration/deployment • Software not yet fully deployed • Client-side manual installation fully working • Server-side installation not yet available (almost achieved)

  19. gLite WMS Ongoing Restructuring • gLite Restructuring: • All new feature development stopped for 6 months • Improving usability & portability • Multi-platform (structural changes needed) • Cleaning up sections that cause build and porting difficulties • Removing/reducing dependencies on external software to ease installation and deployment • Goals: • Easier service maintenance and usage • Will increase stability and throughput • Toward a “gLighter” User Interface • Identify and remove all unnecessary dependencies

  20. gLite WMS Future • Improving Logging and Error Reporting • Common syslog-like logging format • Windows working prototype • gLite porting on Microsoft Windows platforms • Improving interoperability • Supercomputing 06 • Working Prototype for Demo • Basic Execution Service (BES) • Job Submission Description Language (JSDL)

  21. WfMS and gLite WMS • Possible “external integration” with existing external workflow frameworks • Still to be discussed and planned • A proposal for a Workflow Management System integrated within WMS is under discussion • Running on top of gLite middleware • Abstract and generic representation of workflows • Internal use of a Petri net model • External translation mechanisms from different language front ends

  22. Test & Result • Intense testing and constant bug fixing have been carried out in recent months • Improved job submission rate • Improved service stability • New functionalities tested and adopted by experiments • Production-quality test results: • 16K jobs/day over one week of submissions • No manual intervention on server • Stable memory usage • 0.3% of jobs in non-final state • Aborted jobs mostly due to expired user credentials • Was about 5% before Bulk-MM support

  23. Some Links • WMS • http://egee-jra1-wm.mi.infn.it/egee-jra1-wm/ • WMProxy • http://trinity.datamat.it/projects/EGEE/wiki/wiki.php • LB • http://egee.cesnet.cz/en/JRA1/index.html • CREAM • http://grid.pd.infn.it/cream • JDL • http://edms.cern.ch/document/590869/1
