
Batch Software at JLAB


Presentation Transcript


  1. Batch Software at JLAB Ian Bird Jefferson Lab CHEP2000 7-11 February, 2000

  2. Introduction • Environment • Farms • Data flows • Software • Batch systems • JLAB software • LSF vs. PBS • Scheduler • Tape software • File pre-staging/caching CHEP 2000

  3. Environment • Computing facilities were designed to: • Handle a data rate of close to 1 TB/day • 1st level reconstruction only (2 passes) • Match the average data rate • Some local analysis, but mainly export of vastly reduced summary DSTs • Originally estimated requirements: • ~ 1000 SI95 • 3 TB online disk • 300 TB tape storage – 8 Redwood drives CHEP 2000

  4. Environment - real • After 1 year of production running of CLAS (the largest experiment) • The detector is far cleaner than anticipated, which means: • Data volume is lower, ~ 500 GB/day • Data rate is 2.5x the anticipated rate (2.5 kHz) • Fraction of good events is larger • DST sizes are the same as raw data (!) • Per-event processing time is much longer than the original estimates • Most analysis is done locally – no one is really interested in huge data exports • Other experiments also have large data rates (for short periods) CHEP 2000

  5. Computing implications • CPU requirement is far greater • Current farm is 2650 SI95 and will double this year • Farm has a big mixture of work • Not all production – “small” analysis jobs too • We make heavy use of LSF hierarchical scheduling • Data access demands are enormous • DSTs are huge, many people, frequent accesses • Analysis jobs want many files • Tape access became a bottleneck • The farm's demand can no longer be satisfied CHEP 2000

  6. JLab Farm Layout (Plan - FY 2000): diagram of the farm systems (dual PII/PIII Linux nodes, 300-650 MHz, on Cisco 2900 Fast Ethernet switches), the cache file servers (dual Sun Ultra2s with 400 GB UWD disk each), the work file servers (MetaStor SH7400 servers with 3 TB UWD each), and the mass storage servers (quad Sun E4000 and E3000 with STK Redwood and 9840 tape drives plus stage disk), interconnected by Gigabit Ethernet through a Cisco Catalyst 5500.

  7. Other farms • Batch farm • 180 nodes -> 250 • Lattice QCD • 20-node Alpha (Linux) cluster • Parallel application development • Plans (proposal) for a large 256-node cluster • Part of a larger collaboration • Group wants a “meta-facility” • Jobs run on the least loaded cluster (wide-area scheduling) CHEP 2000
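
The “meta-facility” idea is to send each job to whichever cluster in the collaboration is currently least loaded. A minimal sketch of that selection step, assuming a simple numeric load metric per cluster; the cluster names, the metric, and the class name are invented for illustration.

```java
// Hypothetical sketch of wide-area scheduling: pick the least loaded cluster.
import java.util.Map;

class MetaFacility {
    /** Return the name of the cluster reporting the lowest load. */
    static String leastLoaded(Map<String, Double> clusterLoad) {
        String best = null;
        double bestLoad = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : clusterLoad.entrySet()) {
            if (e.getValue() < bestLoad) {   // lower reported load wins
                bestLoad = e.getValue();
                best = e.getKey();
            }
        }
        return best;                         // the job would then be forwarded here
    }

    public static void main(String[] args) {
        // Invented cluster names and load values, for illustration only.
        System.out.println(leastLoaded(Map.of("local-lqcd", 0.7, "remote-lqcd", 0.3)));
    }
}
```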

  8. Additional requirements • Ability to handle and schedule parallel jobs (MPI) • Allow collaborators to “clone” the batch systems and software • Allow inter-site job submission • LQCD is particularly interested in this • Remote data access CHEP 2000

  9. Components • Batch software • Interface to underlying batch system • Tape software • Interface to OSM, overcome limitations • Data caching strategies • Tape staging • Data caching • File servers CHEP 2000

  10. Batch software • A layer over the batch management system • Allows replacement of the underlying batch system: LSF, PBS (DQS) • Constant user interface no matter what the underlying system is • Batch farm can still be managed by the management system (e.g. LSF) • Build in a security infrastructure (e.g. GSI) • Particularly to allow secure remote access CHEP 2000
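
To make the idea of a constant user interface over interchangeable batch systems concrete, here is a minimal sketch in Java (the language the JLAB client software is written in). The interface and class names (BatchSystem, JobRequest, LsfBatchSystem, PbsBatchSystem) are hypothetical illustrations, not the actual JLAB API; only standard LSF/PBS commands (bsub, bjobs, qsub, qstat) are assumed.

```java
// Hypothetical sketch of a batch-system abstraction layer: the user-facing
// interface stays the same while the back end (LSF, PBS, ...) is swapped.
import java.io.IOException;

interface BatchSystem {
    /** Submit a job and return a back-end job identifier. */
    String submit(JobRequest request) throws IOException;
    /** Query the current state of a job. */
    String status(String jobId) throws IOException;
}

/** A user's job description, independent of the underlying batch system. */
class JobRequest {
    final String queue;
    final String command;
    JobRequest(String queue, String command) {
        this.queue = queue;
        this.command = command;
    }
}

/** LSF back end: translates the generic request into bsub/bjobs invocations. */
class LsfBatchSystem implements BatchSystem {
    public String submit(JobRequest r) throws IOException {
        // A real implementation would capture and parse the bsub output
        // for the job id; here we only show where the translation happens.
        new ProcessBuilder("bsub", "-q", r.queue, r.command).inheritIO().start();
        return "lsf-job";
    }
    public String status(String jobId) throws IOException {
        new ProcessBuilder("bjobs", jobId).inheritIO().start();
        return "see bjobs output";
    }
}

/** PBS back end: same interface, qsub/qstat instead of bsub/bjobs. */
class PbsBatchSystem implements BatchSystem {
    public String submit(JobRequest r) throws IOException {
        new ProcessBuilder("qsub", "-q", r.queue, r.command).inheritIO().start();
        return "pbs-job";
    }
    public String status(String jobId) throws IOException {
        new ProcessBuilder("qstat", jobId).inheritIO().start();
        return "see qstat output";
    }
}
```

Swapping PBS in for LSF then only means constructing a different BatchSystem implementation; user jobs, queries, and the bookkeeping database are untouched.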

  11. Batch system - schematic: user processes submit jobs and issue query and statistics requests through the submission and query interfaces; the job submission system records everything in a database and hands jobs to the batch control system (LSF, PBS, DQS, etc.), which runs them on the batch processors.

  12. Existing batch software • Has been running for 2 years • Uses LSF • Multiple jobs – parameterized jobs (LSF now has job arrays, PBS does not have this) • Client is trivial to install on any machine with a JRE – no need to install LSF, PBS etc. • Eases licensing issues • Simple software distribution • Remote access • Standardized statistics and bookkeeping outside of LSF • MySQL based CHEP 2000
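
The parameterized-job feature (one submission expanded into many numbered jobs) maps naturally onto an LSF job array (bsub -J "name[first-last]"), but has to be expanded into individual submissions for a back end without job arrays, as PBS was at the time. A hypothetical sketch of that expansion, reusing the BatchSystem and JobRequest types from the earlier sketch; the class and method names are invented.

```java
// Hypothetical sketch: a parameterized job (e.g. command template "recon run %d")
// is expanded into individual submissions for a back end without job arrays.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ParameterizedSubmitter {
    /** Submit indices first..last of a parameterized job and return the ids. */
    static List<String> submitRange(BatchSystem backend, String queue,
                                    String commandTemplate, int first, int last)
            throws IOException {
        List<String> jobIds = new ArrayList<>();
        for (int i = first; i <= last; i++) {
            // Each index becomes an ordinary job; the ids are collected so a
            // bookkeeping database can track the whole parameterized set.
            jobIds.add(backend.submit(new JobRequest(queue,
                    String.format(commandTemplate, i))));
        }
        return jobIds;
    }
}
```

The statistics and bookkeeping layer could then record one row per index, so accounting stays uniform whichever back end actually ran the jobs.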

  13. Existing software cont. • Farm can be managed by LSF • Queues, hosts, scheduler etc. • Rewrite in progress to: • Add PBS interface (and DQS?) • Security infrastructure to permit authenticated remote access • Clean up CHEP 2000

  14. PBS as an alternative to LSF • PBS (Portable Batch System – NASA) • Actively developed • Open, freely available • Handles MPI (PVM) • User interface very familiar to NQS/DQS users • The problem (for us) was the lack of a good scheduler • PBS provides only a trivial scheduler, but • Provides a mechanism to plug in another • We were using hierarchical scheduling in LSF CHEP 2000

  15. PBS scheduler • Multiple stages (6) that can each be enabled or not as required, in arbitrary order • Match making – matches job requirements to system resources • System priority (e.g. data available) • Queue selection (which queue runs next) • User priority • User share: which user runs next, based on user and group allocations and usage • Job age • The scheduler has been provided to the PBS developers for comment – and is under test CHEP 2000
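
The sketch below illustrates the stage structure described on this slide: a pipeline of pluggable stages that filter or reorder the candidate jobs, enabled and ordered as required. It is an illustration only, written in Java and using invented names (Job, SchedulerStage, StagedScheduler); it is not the actual scheduler module that was provided to the PBS developers.

```java
// Hypothetical illustration of a multi-stage scheduler: each stage filters or
// reorders the candidate jobs; stages can be enabled and ordered as required.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class Job {
    String user;
    String requiredResource;   // e.g. a host type or a staged input file
    double userShareUsed;      // fraction of the user's allocation already used
    long queuedSeconds;        // how long the job has been waiting
}

interface SchedulerStage {
    /** Return the (possibly filtered and reordered) candidate list. */
    List<Job> apply(List<Job> candidates);
}

/** Match making: drop jobs whose required resource is not available. */
class MatchMaking implements SchedulerStage {
    private final List<String> availableResources;
    MatchMaking(List<String> availableResources) { this.availableResources = availableResources; }
    public List<Job> apply(List<Job> candidates) {
        List<Job> matched = new ArrayList<>();
        for (Job j : candidates)
            if (availableResources.contains(j.requiredResource)) matched.add(j);
        return matched;
    }
}

/** User share: reorder so users who have used less of their allocation come first. */
class UserShare implements SchedulerStage {
    public List<Job> apply(List<Job> candidates) {
        candidates.sort(Comparator.comparingDouble((Job j) -> j.userShareUsed));
        return candidates;
    }
}

/** Job age: reorder so the longest-queued jobs come first. */
class JobAge implements SchedulerStage {
    public List<Job> apply(List<Job> candidates) {
        candidates.sort(Comparator.comparingLong((Job j) -> -j.queuedSeconds));
        return candidates;
    }
}

class StagedScheduler {
    private final List<SchedulerStage> stages;
    StagedScheduler(List<SchedulerStage> stages) { this.stages = stages; }

    /** Run the enabled stages in the configured order and pick the next job.
     *  Because List.sort is stable, a later reordering stage becomes the primary
     *  ordering, with earlier orderings acting as tie-breakers. */
    Job selectNext(List<Job> candidates) {
        List<Job> remaining = new ArrayList<>(candidates);
        for (SchedulerStage stage : stages) remaining = stage.apply(remaining);
        return remaining.isEmpty() ? null : remaining.get(0);
    }
}
```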

  16. Mass storage • Silo – 300 TB Redwood capacity • 8 Redwood drives • 5 (+5) 9840 drives • Managed by OSM • Bottleneck: • Limited to a single data mover • That node has no capacity for more drives • 1 TB tape staging RAID disk • 5 TB of NFS work areas/caching space CHEP 2000

  17. Solving tape access problems • Add new drives – 9840s • Requires a 2nd OSM instance • Transparent to the user • Eventual replacement of OSM • Transparent to the user • File pre-staging to the farm • Distributed data caching (not NFS) • Tools to allow user optimization • Charge for (prioritize) mounts CHEP 2000

  18. OSM • OSM has several limitations (and is no longer supported) • The single mover node is the most serious • No replacement is possible yet • Local tape-server software solves many of these problems for us • Simple remote clients (Java based) – OSM is not needed except on the server CHEP 2000

  19. Tape access software • Simple put/get interface • Handles multiple files, directories, etc. • Can have several OSM instances behind a single file catalog, transparent to the user • The system fails over between servers • The only way to bring the 9840s online • Data transfer is a network (socket) copy in Java • Allows a scheduling/user-allocation algorithm to be added to tape access • Will permit “transparent” replacement of OSM CHEP 2000
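
A minimal sketch of the "get" side of such a client: try each tape server in turn (failover) and stream the file over a plain socket to local disk, as the slide describes. The server names, port number, and wire format (file name on one line, then raw bytes) are invented for illustration and are not the actual JLAB protocol.

```java
// Hypothetical sketch of a "get" client: try each tape server in turn
// (failover), send the requested file name, and stream the bytes to disk.
import java.io.*;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

class TapeGetClient {
    // Invented server names and port, for illustration only.
    private static final String[] SERVERS = { "tapesrv1.example.org", "tapesrv2.example.org" };
    private static final int PORT = 5501;

    /** Fetch one file from the first server that answers. */
    static void get(String remoteName, Path localFile) throws IOException {
        IOException lastError = null;
        for (String server : SERVERS) {
            try (Socket socket = new Socket(server, PORT);
                 OutputStream out = socket.getOutputStream();
                 InputStream in = socket.getInputStream()) {
                // Invented wire format: the file name on one line, then raw bytes.
                out.write((remoteName + "\n").getBytes(StandardCharsets.US_ASCII));
                out.flush();
                Files.copy(in, localFile, StandardCopyOption.REPLACE_EXISTING);
                return;                       // success: stop trying servers
            } catch (IOException e) {
                lastError = e;                // fail over to the next server
            }
        }
        throw new IOException("all tape servers failed", lastError);
    }

    public static void main(String[] args) throws IOException {
        get(args[0], Paths.get(args[1]));     // e.g. get <remote name> <local file>
    }
}
```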

  20. Data pre-fetching & caching • Currently • Tape – stage disk – network copy to farm node local disk • Tape – stage disk – NFS cache – farm • But this can cause NFS server problems • Plan: • Dual Solaris nodes with • ~ 350 GB disk (RAID 0) • Gigabit Ethernet • Provides a large cache for farm input • Stage out entire tapes to the cache • Cheaper than staging space, better performance than NFS • Scalable as the farm grows CHEP 2000
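
The plan is to stage out entire tapes onto the cache nodes so that farm jobs find their inputs there instead of hitting NFS or the stage disk. A hypothetical sketch of the cache-first lookup that a job (or the job server) might perform: serve the file from cache if present, otherwise stage the whole tape that holds it. All names and the cache path layout are invented.

```java
// Hypothetical sketch of the cache-first access policy: a requested file is
// taken from a cache node if present; otherwise the whole tape holding it is
// staged to the cache so that related files arrive together.
import java.util.*;

class CacheManager {
    private final Set<String> cachedFiles = new HashSet<>();   // files already on cache disks
    private final Map<String, String> fileToTape;              // file catalog: file -> tape volume

    CacheManager(Map<String, String> fileToTape) { this.fileToTape = fileToTape; }

    /** Return the cache location of the file, staging its tape first if needed. */
    String locate(String file) {
        if (!cachedFiles.contains(file)) {
            stageWholeTape(fileToTape.get(file));   // one mount brings over every file on the tape
        }
        return "/cache/" + file;                    // invented cache path layout
    }

    private void stageWholeTape(String tape) {
        // Placeholder for the real tape-to-cache copy: mark every file on the
        // tape as cached once the copy completes.
        for (Map.Entry<String, String> e : fileToTape.entrySet())
            if (e.getValue().equals(tape))
                cachedFiles.add(e.getKey());
    }
}
```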

  21. JLab Farm Layout (Plan - FY 2000): the same farm layout diagram as slide 6, shown again for reference.

  22. File pre-staging • Scheduling for pre-staging is done by the job server software • Splits/groups jobs by tape (could be done by user) • Makes a single tape request • Holds jobs while files are staged • Implemented by batch jobs that release held jobs • Released jobs with data available get high priority • Reduces job slots blocked by jobs waiting for data CHEP 2000
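
A hypothetical sketch of the grouping and release step described on this slide: held jobs are grouped by the tape their input lives on, one stage request is issued per tape, and the jobs are released (at high priority) once their files are on disk. The class and method names are invented; the release and stage calls are placeholders for the real batch-system operations.

```java
// Hypothetical sketch of pre-staging: held jobs are grouped by input tape,
// one stage request is made per tape, and jobs are released once staged.
import java.util.*;

class PreStager {
    /** A held batch job and the input file it is waiting for. */
    static class HeldJob {
        final String jobId;
        final String inputFile;
        HeldJob(String jobId, String inputFile) { this.jobId = jobId; this.inputFile = inputFile; }
    }

    /** Group held jobs by the tape volume that holds their input file. */
    static Map<String, List<HeldJob>> groupByTape(List<HeldJob> held,
                                                  Map<String, String> fileToTape) {
        Map<String, List<HeldJob>> byTape = new HashMap<>();
        for (HeldJob job : held)
            byTape.computeIfAbsent(fileToTape.get(job.inputFile), t -> new ArrayList<>())
                  .add(job);
        return byTape;
    }

    /** One stage request per tape; its jobs are released at high priority. */
    static void run(List<HeldJob> held, Map<String, String> fileToTape) {
        for (Map.Entry<String, List<HeldJob>> entry : groupByTape(held, fileToTape).entrySet()) {
            System.out.println("stage request for tape " + entry.getKey());   // placeholder
            for (HeldJob job : entry.getValue())
                // Placeholder: in the real system a staging batch job releases
                // these held jobs once their files are actually on disk.
                System.out.println("release " + job.jobId + " at high priority");
        }
    }
}
```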

  23. Conclusions • PBS is a sophisticated and viable alternative to LSF • The interface layer permits • use of the same jobs on different systems – easing user migration • adding features to the batch system CHEP 2000
