
Batch Software at JLAB


Presentation Transcript


  1. Batch Software at JLAB Ian Bird Jefferson Lab CHEP2000 7-11 February, 2000

  2. Introduction • Environment • Farms • Data flows • Software • Batch systems • JLAB software • LSF vs. PBS • Scheduler • Tape software • File pre-staging/caching CHEP 2000

  3. Environment • Computing facilities were designed to: • Handle a data rate of close to 1 TB/day • 1st level reconstruction only (2 passes) • Match the average data rate • Some local analysis, but mainly export of vastly reduced summary DSTs • Originally estimated requirements: • ~ 1000 SI95 • 3 TB online disk • 300 TB tape storage – 8 Redwood drives CHEP 2000

  4. Environment - real • After 1 year of production running of CLAS (the largest experiment) • The detector is far cleaner than anticipated, which means: • Data volume is lower, ~ 500 GB/day • Data rate is 2.5x the anticipated rate (2.5 kHz) • Fraction of good events is larger • DST sizes are the same as raw data (!) • Per-event processing time is much longer than the original estimates • Most analysis is done locally – no one is really interested in huge data exports • Other experiments also have large data rates (for short periods) CHEP 2000

  5. Computing implications • CPU requirement is far greater • Current farm is 2650 SI95 and will double this year • Farm has a big mixture of work • Not all production – “small” analysis jobs too • We make heavy use of LSF hierarchical scheduling • Data access demands are enormous • DSTs are huge, many people, frequent accesses • Analysis jobs want many files • Tape access became a bottleneck • The farm's demand can no longer be satisfied CHEP 2000

  6. JLab Farm Layout (Plan - FY 2000): diagram of the farm systems (dual PII/PIII Linux nodes, 300-650 MHz, on Cisco 2900 Fast Ethernet switches), the cache file servers (dual Sun Ultra2s with 400 GB UWD disk each), the work file servers (MetaStor SH7400 servers with 3 TB UWD each), and the mass storage servers (quad Sun E4000 and E3000 with STK Redwood and 9840 tape drives plus stage disk), interconnected by Gigabit Ethernet through a Cisco Catalyst 5500.

  7. Other farms • Batch farm • 180 nodes -> 250 • Lattice QCD • 20-node Alpha (Linux) cluster • Parallel application development • Plans (proposal) for a large 256-node cluster • Part of a larger collaboration • Group wants a “meta-facility” • Jobs run on the least loaded cluster (wide-area scheduling) CHEP 2000
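
The “meta-facility” idea is to send each job to whichever cluster in the collaboration is currently least loaded. A minimal sketch of that selection step, assuming a simple numeric load metric per cluster; the cluster names, the metric, and the class name are invented for illustration.

```java
// Hypothetical sketch of wide-area scheduling: pick the least loaded cluster.
import java.util.Map;

class MetaFacility {
    /** Return the name of the cluster reporting the lowest load. */
    static String leastLoaded(Map<String, Double> clusterLoad) {
        String best = null;
        double bestLoad = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : clusterLoad.entrySet()) {
            if (e.getValue() < bestLoad) {   // lower reported load wins
                bestLoad = e.getValue();
                best = e.getKey();
            }
        }
        return best;                         // the job would then be forwarded here
    }

    public static void main(String[] args) {
        // Invented cluster names and load values, for illustration only.
        System.out.println(leastLoaded(Map.of("local-lqcd", 0.7, "remote-lqcd", 0.3)));
    }
}
```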

  8. Additional requirements • Ability to handle and schedule parallel jobs (MPI) • Allow collaborators to “clone” the batch systems and software • Allow inter-site job submission • LQCD is particularly interested in this • Remote data access CHEP 2000

  9. Components • Batch software • Interface to underlying batch system • Tape software • Interface to OSM, overcome limitations • Data caching strategies • Tape staging • Data caching • File servers CHEP 2000

  10. Batch software • A layer over the batch management system • Allows replacement of the underlying batch system: LSF, PBS (DQS) • Constant user interface no matter what the underlying system is • Batch farm can still be managed by the management system (e.g. LSF) • Build in a security infrastructure (e.g. GSI) • Particularly to allow secure remote access CHEP 2000
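
To make the idea of a constant user interface over interchangeable batch systems concrete, here is a minimal sketch in Java (the language the JLAB client software is written in). The interface and class names (BatchSystem, JobRequest, LsfBatchSystem, PbsBatchSystem) are hypothetical illustrations, not the actual JLAB API; only standard LSF/PBS commands (bsub, bjobs, qsub, qstat) are assumed.

```java
// Hypothetical sketch of a batch-system abstraction layer: the user-facing
// interface stays the same while the back end (LSF, PBS, ...) is swapped.
import java.io.IOException;

interface BatchSystem {
    /** Submit a job and return a back-end job identifier. */
    String submit(JobRequest request) throws IOException;
    /** Query the current state of a job. */
    String status(String jobId) throws IOException;
}

/** A user's job description, independent of the underlying batch system. */
class JobRequest {
    final String queue;
    final String command;
    JobRequest(String queue, String command) {
        this.queue = queue;
        this.command = command;
    }
}

/** LSF back end: translates the generic request into bsub/bjobs invocations. */
class LsfBatchSystem implements BatchSystem {
    public String submit(JobRequest r) throws IOException {
        // A real implementation would capture and parse the bsub output
        // for the job id; here we only show where the translation happens.
        new ProcessBuilder("bsub", "-q", r.queue, r.command).inheritIO().start();
        return "lsf-job";
    }
    public String status(String jobId) throws IOException {
        new ProcessBuilder("bjobs", jobId).inheritIO().start();
        return "see bjobs output";
    }
}

/** PBS back end: same interface, qsub/qstat instead of bsub/bjobs. */
class PbsBatchSystem implements BatchSystem {
    public String submit(JobRequest r) throws IOException {
        new ProcessBuilder("qsub", "-q", r.queue, r.command).inheritIO().start();
        return "pbs-job";
    }
    public String status(String jobId) throws IOException {
        new ProcessBuilder("qstat", jobId).inheritIO().start();
        return "see qstat output";
    }
}
```

Swapping PBS in for LSF then only means constructing a different BatchSystem implementation; user jobs, queries, and the bookkeeping database are untouched.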

  11. Batch system - schematic: user processes submit jobs and issue query and statistics requests through the submission and query interfaces; the job submission system records everything in a database and hands jobs to the batch control system (LSF, PBS, DQS, etc.), which runs them on the batch processors.

  12. Existing batch software • Has been running for 2 years • Uses LSF • Multiple jobs – parameterized jobs (LSF now has job arrays, PBS does not have this) • Client is trivial to install on any machine with a JRE – no need to install LSF, PBS etc. • Eases licensing issues • Simple software distribution • Remote access • Standardized statistics and bookkeeping outside of LSF • MySQL based CHEP 2000
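
The parameterized-job feature (one submission expanded into many numbered jobs) maps naturally onto an LSF job array (bsub -J "name[first-last]"), but has to be expanded into individual submissions for a back end without job arrays, as PBS was at the time. A hypothetical sketch of that expansion, reusing the BatchSystem and JobRequest types from the earlier sketch; the class and method names are invented.

```java
// Hypothetical sketch: a parameterized job (e.g. command template "recon run %d")
// is expanded into individual submissions for a back end without job arrays.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ParameterizedSubmitter {
    /** Submit indices first..last of a parameterized job and return the ids. */
    static List<String> submitRange(BatchSystem backend, String queue,
                                    String commandTemplate, int first, int last)
            throws IOException {
        List<String> jobIds = new ArrayList<>();
        for (int i = first; i <= last; i++) {
            // Each index becomes an ordinary job; the ids are collected so a
            // bookkeeping database can track the whole parameterized set.
            jobIds.add(backend.submit(new JobRequest(queue,
                    String.format(commandTemplate, i))));
        }
        return jobIds;
    }
}
```

The statistics and bookkeeping layer could then record one row per index, so accounting stays uniform whichever back end actually ran the jobs.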

  13. Existing software cont. • Farm can be managed by LSF • Queues, hosts, scheduler etc. • Rewrite in progress to: • Add PBS interface (and DQS?) • Security infrastructure to permit authenticated remote access • Clean up CHEP 2000

  14. PBS as an alternative to LSF • PBS (Portable Batch System – NASA) • Actively developed • Open, freely available • Handles MPI (PVM) • User interface very familiar to NQS/DQS users • The problem (for us) was the lack of a good scheduler • PBS provides only a trivial scheduler, but • Provides a mechanism to plug in another • We were using hierarchical scheduling in LSF CHEP 2000

  15. PBS scheduler • Multiple stages (6) that can each be enabled or not as required, in arbitrary order • Match making – matches job requirements to system resources • System priority (e.g. data available) • Queue selection (which queue runs next) • User priority • User share: which user runs next, based on user and group allocations and usage • Job age • The scheduler has been provided to the PBS developers for comment – and is under test CHEP 2000
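
The sketch below illustrates the stage structure described on this slide: a pipeline of pluggable stages that filter or reorder the candidate jobs, enabled and ordered as required. It is an illustration only, written in Java and using invented names (Job, SchedulerStage, StagedScheduler); it is not the actual scheduler module that was provided to the PBS developers.

```java
// Hypothetical illustration of a multi-stage scheduler: each stage filters or
// reorders the candidate jobs; stages can be enabled and ordered as required.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class Job {
    String user;
    String requiredResource;   // e.g. a host type or a staged input file
    double userShareUsed;      // fraction of the user's allocation already used
    long queuedSeconds;        // how long the job has been waiting
}

interface SchedulerStage {
    /** Return the (possibly filtered and reordered) candidate list. */
    List<Job> apply(List<Job> candidates);
}

/** Match making: drop jobs whose required resource is not available. */
class MatchMaking implements SchedulerStage {
    private final List<String> availableResources;
    MatchMaking(List<String> availableResources) { this.availableResources = availableResources; }
    public List<Job> apply(List<Job> candidates) {
        List<Job> matched = new ArrayList<>();
        for (Job j : candidates)
            if (availableResources.contains(j.requiredResource)) matched.add(j);
        return matched;
    }
}

/** User share: reorder so users who have used less of their allocation come first. */
class UserShare implements SchedulerStage {
    public List<Job> apply(List<Job> candidates) {
        candidates.sort(Comparator.comparingDouble((Job j) -> j.userShareUsed));
        return candidates;
    }
}

/** Job age: reorder so the longest-queued jobs come first. */
class JobAge implements SchedulerStage {
    public List<Job> apply(List<Job> candidates) {
        candidates.sort(Comparator.comparingLong((Job j) -> -j.queuedSeconds));
        return candidates;
    }
}

class StagedScheduler {
    private final List<SchedulerStage> stages;
    StagedScheduler(List<SchedulerStage> stages) { this.stages = stages; }

    /** Run the enabled stages in the configured order and pick the next job.
     *  Because List.sort is stable, a later reordering stage becomes the primary
     *  ordering, with earlier orderings acting as tie-breakers. */
    Job selectNext(List<Job> candidates) {
        List<Job> remaining = new ArrayList<>(candidates);
        for (SchedulerStage stage : stages) remaining = stage.apply(remaining);
        return remaining.isEmpty() ? null : remaining.get(0);
    }
}
```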

  16. Mass storage • Silo – 300 TB Redwood capacity • 8 Redwood drives • 5 (+5) 9840 drives • Managed by OSM • Bottleneck: • Limited to a single data mover • That node has no capacity for more drives • 1 TB tape staging RAID disk • 5 TB of NFS work areas/caching space CHEP 2000

  17. Solving tape access problems • Add new drives – 9840s • Requires a 2nd OSM instance • Transparent to the user • Eventual replacement of OSM • Transparent to the user • File pre-staging to the farm • Distributed data caching (not NFS) • Tools to allow user optimization • Charge for (prioritize) mounts CHEP 2000

  18. OSM • OSM has several limitations (and is no longer supported) • The single mover node is the most serious • No replacement is possible yet • Local tape-server software solves many of these problems for us • Simple remote clients (Java based) – OSM is not needed except on the server CHEP 2000

  19. Tape access software • Simple put/get interface • Handles multiple files, directories, etc. • Can have several OSM instances behind a single file catalog, transparent to the user • The system fails over between servers • The only way to bring the 9840s online • Data transfer is a network (socket) copy in Java • Allows a scheduling/user-allocation algorithm to be added to tape access • Will permit “transparent” replacement of OSM CHEP 2000
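
A minimal sketch of the "get" side of such a client: try each tape server in turn (failover) and stream the file over a plain socket to local disk, as the slide describes. The server names, port number, and wire format (file name on one line, then raw bytes) are invented for illustration and are not the actual JLAB protocol.

```java
// Hypothetical sketch of a "get" client: try each tape server in turn
// (failover), send the requested file name, and stream the bytes to disk.
import java.io.*;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

class TapeGetClient {
    // Invented server names and port, for illustration only.
    private static final String[] SERVERS = { "tapesrv1.example.org", "tapesrv2.example.org" };
    private static final int PORT = 5501;

    /** Fetch one file from the first server that answers. */
    static void get(String remoteName, Path localFile) throws IOException {
        IOException lastError = null;
        for (String server : SERVERS) {
            try (Socket socket = new Socket(server, PORT);
                 OutputStream out = socket.getOutputStream();
                 InputStream in = socket.getInputStream()) {
                // Invented wire format: the file name on one line, then raw bytes.
                out.write((remoteName + "\n").getBytes(StandardCharsets.US_ASCII));
                out.flush();
                Files.copy(in, localFile, StandardCopyOption.REPLACE_EXISTING);
                return;                       // success: stop trying servers
            } catch (IOException e) {
                lastError = e;                // fail over to the next server
            }
        }
        throw new IOException("all tape servers failed", lastError);
    }

    public static void main(String[] args) throws IOException {
        get(args[0], Paths.get(args[1]));     // e.g. get <remote name> <local file>
    }
}
```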

  20. Data pre-fetching & caching • Currently • Tape – stage disk – network copy to farm node local disk • Tape – stage disk – NFS cache – farm • But this can cause NFS server problems • Plan: • Dual Solaris nodes with • ~ 350 GB disk (RAID 0) • Gigabit Ethernet • Provides a large cache for farm input • Stage out entire tapes to the cache • Cheaper than staging space, better performance than NFS • Scalable as the farm grows CHEP 2000
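
The plan is to stage out entire tapes onto the cache nodes so that farm jobs find their inputs there instead of hitting NFS or the stage disk. A hypothetical sketch of the cache-first lookup that a job (or the job server) might perform: serve the file from cache if present, otherwise stage the whole tape that holds it. All names and the cache path layout are invented.

```java
// Hypothetical sketch of the cache-first access policy: a requested file is
// taken from a cache node if present; otherwise the whole tape holding it is
// staged to the cache so that related files arrive together.
import java.util.*;

class CacheManager {
    private final Set<String> cachedFiles = new HashSet<>();   // files already on cache disks
    private final Map<String, String> fileToTape;              // file catalog: file -> tape volume

    CacheManager(Map<String, String> fileToTape) { this.fileToTape = fileToTape; }

    /** Return the cache location of the file, staging its tape first if needed. */
    String locate(String file) {
        if (!cachedFiles.contains(file)) {
            stageWholeTape(fileToTape.get(file));   // one mount brings over every file on the tape
        }
        return "/cache/" + file;                    // invented cache path layout
    }

    private void stageWholeTape(String tape) {
        // Placeholder for the real tape-to-cache copy: mark every file on the
        // tape as cached once the copy completes.
        for (Map.Entry<String, String> e : fileToTape.entrySet())
            if (e.getValue().equals(tape))
                cachedFiles.add(e.getKey());
    }
}
```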

  21. JLab Farm Layout (Plan - FY 2000): the same farm layout diagram as slide 6, shown again for reference.

  22. File pre-staging • Scheduling for pre-staging is done by the job server software • Splits/groups jobs by tape (could be done by user) • Makes a single tape request • Holds jobs while files are staged • Implemented by batch jobs that release held jobs • Released jobs with data available get high priority • Reduces job slots blocked by jobs waiting for data CHEP 2000
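
A hypothetical sketch of the grouping and release step described on this slide: held jobs are grouped by the tape their input lives on, one stage request is issued per tape, and the jobs are released (at high priority) once their files are on disk. The class and method names are invented; the release and stage calls are placeholders for the real batch-system operations.

```java
// Hypothetical sketch of pre-staging: held jobs are grouped by input tape,
// one stage request is made per tape, and jobs are released once staged.
import java.util.*;

class PreStager {
    /** A held batch job and the input file it is waiting for. */
    static class HeldJob {
        final String jobId;
        final String inputFile;
        HeldJob(String jobId, String inputFile) { this.jobId = jobId; this.inputFile = inputFile; }
    }

    /** Group held jobs by the tape volume that holds their input file. */
    static Map<String, List<HeldJob>> groupByTape(List<HeldJob> held,
                                                  Map<String, String> fileToTape) {
        Map<String, List<HeldJob>> byTape = new HashMap<>();
        for (HeldJob job : held)
            byTape.computeIfAbsent(fileToTape.get(job.inputFile), t -> new ArrayList<>())
                  .add(job);
        return byTape;
    }

    /** One stage request per tape; its jobs are released at high priority. */
    static void run(List<HeldJob> held, Map<String, String> fileToTape) {
        for (Map.Entry<String, List<HeldJob>> entry : groupByTape(held, fileToTape).entrySet()) {
            System.out.println("stage request for tape " + entry.getKey());   // placeholder
            for (HeldJob job : entry.getValue())
                // Placeholder: in the real system a staging batch job releases
                // these held jobs once their files are actually on disk.
                System.out.println("release " + job.jobId + " at high priority");
        }
    }
}
```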

  23. Conclusions • PBS is a sophisticated and viable alternative to LSF • The interface layer permits • use of the same jobs on different systems – easing user migration • adding features to the batch system CHEP 2000
