ToPoS: High-Throughput Parallel Processing Pipelines on the Grid

Pieter van Beek, SARA Computing and Networking Services, High Performance Computing and Visualization, e-Science Support



  1. ToPoS: High-Throughput Parallel Processing Pipelines on the Grid
     Pieter van Beek
     SARA Computing and Networking Services
     High Performance Computing and Visualization
     e-Science Support

  2. User experiences with gLite
     • Overhead for starting jobs is considerable
     • Determining the best chunk size is difficult
        • Too small -> large overhead
        • Too large -> timeouts and throughput problems
     • Resource brokering is far from optimal
     • Jobs often fail, and users create their own tools for administrative tasks

  3. Resource Brokering
     • Submitted jobs are sent to a CE (Computing Element) immediately
     • When another CE becomes available later, jobs that were already submitted do not use it automatically

  4. Failing Jobs (1)
     • Common experiences:
        • Sorry, an Incomprehensible Error occurred
        • Your VOMS Credential has expired
        • What Job?
        • Success! (but there’s no output)
        • Failure! (but it ran just fine)
        • Out of Wall-time (but no CPU-time?)
     • A lot of “monitoring and resubmission” software is created again and again by many users

  5. Failing Jobs (2)
     • A real-world example:
        • 27,000 jobs
        • Duration: approx. 4 hrs per job
        • Approx. 280 WNs (worker nodes)
     • Theoretical duration: 16 days
     • But with a success rate of 70% …
        • Approx. 9 resubmissions
        • “Practical” duration: >2 months
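
The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes that every job takes the full 4 hours, that all 280 worker nodes are busy the whole time, and that each submission round fails independently at the same 30% rate; none of these assumptions is stated in the slides. It counts the submission rounds needed until fewer than one job is expected to remain, which lands on the slide's figure of about nine (whether the first submission counts as one of them is a matter of interpretation). It does not try to reproduce the ">2 months" figure, which depends on queueing and monitoring delays the slides do not quantify.

```python
import math

# Figures taken from the slide.
jobs = 27_000
hours_per_job = 4
worker_nodes = 280
success_rate = 0.70

# Ideal case: every worker node is busy all the time and nothing fails.
ideal_hours = jobs * hours_per_job / worker_nodes
ideal_days = ideal_hours / 24

# Assumption: each round fails independently at the same rate, so after
# k rounds about jobs * failure_rate**k jobs are still unfinished.
failure_rate = 1 - success_rate
rounds = math.ceil(math.log(jobs) / math.log(1 / failure_rate))

print(f"ideal duration: {ideal_days:.1f} days")
print(f"rounds until fewer than one job is expected to remain: {rounds}")
```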

  6. Pilot Jobs
     • “Normal” jobs
     • Pilot jobs

  7. Simplest possible solution: Topos I
     • An online counter, like a “page views” counter
     • Numbers are “leased” for some period
     • Leases must be renewed
     • Interfaced with HTTP (REST web service)
     • Can be used with any HTTP client (wget, browsers)
     • As little security as possible
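
The leasing idea on this slide can be illustrated with a small in-memory model. This is only a sketch of the concept, not the Topos implementation or its HTTP interface: the class name `LeasedCounter`, its methods, and the injectable clock are all hypothetical names introduced here. A number is handed out with a lease; if the lease is neither renewed nor the number deleted before it expires, the number becomes available again.

```python
import time


class LeasedCounter:
    """Hypothetical in-memory model of a counter whose numbers are leased."""

    def __init__(self, size, clock=time.monotonic):
        self._clock = clock
        # number -> time at which its lease expires (0.0 = never leased)
        self._expiry = {n: 0.0 for n in range(size)}

    def lease(self, seconds):
        """Return a number without a live lease, or None if there is none."""
        now = self._clock()
        for number, expires in self._expiry.items():
            if expires <= now:
                self._expiry[number] = now + seconds
                return number
        return None

    def renew(self, number, seconds):
        """Extend the lease of a number that is still in the pool."""
        if number in self._expiry:
            self._expiry[number] = self._clock() + seconds

    def delete(self, number):
        """Remove a number for good once its task has finished."""
        self._expiry.pop(number, None)


# Demo with a fake clock so that lease expiry is deterministic.
now = [0.0]
pool = LeasedCounter(2, clock=lambda: now[0])
first = pool.lease(60)       # number 0
second = pool.lease(60)      # number 1
exhausted = pool.lease(60)   # None: both numbers are leased
pool.delete(first)           # the task for number 0 finished
now[0] = 61.0                # the lease on number 1 ran out (pilot died)
again = pool.lease(60)       # number 1 is handed out a second time
```

The automatic resubmission comes from the expiry alone: a pilot that crashes simply stops renewing, and its number is leased to the next client that asks.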

  8. Pilot job flow
     • Submit pilot job -> running pilot job
     • Get unused token -> pilot job with token
     • Execute token task
     • Finished?
        • no -> affirm token use
        • yes -> delete token
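
The flow on this slide can be sketched as a loop. The code below is an illustration under stated assumptions rather than the actual pilot script: `TokenPool`, `run_pilot`, and `expire_leases` are hypothetical names, the pool lives in memory instead of behind HTTP, and the periodic "affirm token use" step during a long-running task is left out for brevity. A token is deleted only after its task succeeds; a failed task leaves the token leased, so it returns to the pool once the lease expires.

```python
from collections import deque


class TokenPool:
    """Hypothetical in-memory stand-in for the token pool server."""

    def __init__(self, tokens):
        self.unused = deque(tokens)
        self.leased = set()

    def get_unused(self):
        """Hand out an unused token, or None when none are available."""
        if not self.unused:
            return None
        token = self.unused.popleft()
        self.leased.add(token)
        return token

    def delete(self, token):
        """Remove a token whose task completed."""
        self.leased.discard(token)

    def expire_leases(self):
        """Simulate lease expiry: leased tokens become unused again."""
        self.unused.extend(sorted(self.leased))
        self.leased.clear()


def run_pilot(pool, run_task):
    """Fetch and execute tokens until the pool has no unused token left."""
    completed = []
    while True:
        token = pool.get_unused()
        if token is None:
            return completed
        if run_task(token):
            pool.delete(token)
            completed.append(token)
        # On failure the token is not deleted; its lease simply runs out.


# Demo: the task for token "b" fails on its first attempt only.
attempts = {}


def task(token):
    attempts[token] = attempts.get(token, 0) + 1
    return not (token == "b" and attempts[token] == 1)


pool = TokenPool(["a", "b", "c"])
first_pilot = run_pilot(pool, task)    # completes "a" and "c"
pool.expire_leases()                   # the lease on "b" runs out
second_pilot = run_pilot(pool, task)   # a later pilot completes "b"
```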

  9. Advantages
     • Simple design and use
        • Using HTTP REST
     • Automatic resubmissions
     • Less overhead for a large number of jobs: one pilot job can execute several tasks in sequence
     • Improved scheduling
     • Easy job administration by querying the Token Pool Server
        • Progress
        • Fail rate

  10. Topos I screenshots

  11. Topos 2.x
     • Interfaced by WebDAV instead of plain HTTP
     • Tokens are files, i.e. they have
        • identity
        • content
        • mime-type
        • properties
     • Token pools are directories
     • Tokens can be moved between directories
     • Allows users to build pipelines and workflows (high-level colored Petri nets)
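
The "tokens are files, pools are directories" model can be tried out on a local file system. The sketch below is an assumption-laden illustration, not Topos 2 code: local directories stand in for WebDAV collections, `os.replace` stands in for a WebDAV MOVE request, and the pool names (`todo`, `aligned`, `done`) and the helpers `move_token` and `run_stage` are made up for this example. Each pipeline stage reads the tokens in one pool, processes their content, and moves them to the next pool.

```python
import os
import tempfile
from pathlib import Path


def move_token(token: Path, target_pool: Path) -> Path:
    """Move a token file into another pool directory."""
    destination = target_pool / token.name
    os.replace(token, destination)
    return destination


def run_stage(source: Path, target: Path, transform) -> int:
    """Apply transform to every token in source, then move it to target."""
    moved = 0
    # sorted() materialises the listing before any file is moved.
    for token in sorted(source.iterdir()):
        token.write_text(transform(token.read_text()))
        move_token(token, target)
        moved += 1
    return moved


# Three pools form a two-stage pipeline: todo -> aligned -> done.
root = Path(tempfile.mkdtemp())
todo, aligned, done = (root / name for name in ("todo", "aligned", "done"))
for pool in (todo, aligned, done):
    pool.mkdir()
(todo / "token1").write_text("seq-a")
(todo / "token2").write_text("seq-b")

run_stage(todo, aligned, str.upper)             # stage 1
run_stage(aligned, done, lambda s: s + " ok")   # stage 2
result = sorted(p.name for p in done.iterdir())
```

Because a token keeps its name and content as it moves, the directory it currently sits in records how far it has progressed through the pipeline.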

  12. Topos 2 screenshot

  13. “Portfolio”
     • SciaGrid
        • Collaboration between SRON, KNMI, NIKHEF and SARA
        • Website where users can select
           • satellite data (Sciamachy)
           • data processors
     • Arnold Kuzniar and Jack Leunissen (WUR)
        • BLAST protein sequence alignment
     • Bas Dutilh (CMBI)
        • HAMMER sequence alignment (?)
     • Jan Bot (TUD)

  14. Future directions
     • Documentation
     • Atom/RSS instead of WebDAV
     • Back to numbers instead of files
     • TODO

  15. pieterb@sara.nl
