BaBar Status for the RD45 Workshop

BaBar Statusfor the RD45 Workshop Jacek Becla David Quarrie Stanford Linear Accelerator Center

Performance Requirements (1) • Online Prompt Reconstruction • Baseline of 200 processing nodes • 100 Hz total (physics plus backgrounds) • 30 Hz of Hadronic Physics • Fully reconstructed • 70 Hz of backgrounds, calibration physics • Not necessarily fully reconstructed • most recent goal: 100 Hz on average • by mid-March

Performance Requirements (2) • Physics Analysis • DST Creation • 2 users at 109 events in 106 secs (1 month) • DST Analysis • 20 users at 108 events in 106 secs • Interactive Analysis • 100 users at 100 events/secs

BaBar Event Production (1)

BaBar Event Production (2)

Production Federations (1) • Developer Test • Dedicated server with 500GB disk & two lockserver machines • Saturation of transaction table with a single lock server • Test federations typically correspond to BABAR software releases • 5 federation ids assigned per developer • Space at a premium – separate journal file area • Shared Test • Developer communities (e.g. reconstruction) • Share hardware with developer test federations • Space becoming a problem – dedicated servers being setup • Online (IR2) • Used for online calibrations, slow controls information, configurations • Servers physically located in experiment hall

Production Federations (2) • Online Prompt Reconstruction (OPR) • Pseudo real-time reconstruction of raw data • Designed to share IR2 federation as 2nd autonomous partition • Intermittent DAQ run startup interference caused split • Still planned to recombine • Input from files on spool disk • Decoupling to prevent possible deadtime • These files also written to tape • 100-200 processing nodes • Design is 200 with 100Hz input event rate • Output to several database servers with 3TB of disk • Automatic migration to hierarchical mass store tape

Production Federations (3) • Reprocessing • Clone of OPR for bulk reprocessing with improved algorithms etc. • Reprocessing from raw data tapes • Being configured now – first reprocessing scheduled for March 2000 • Physics Analysis • Main physics analysis activities • Decoupled from OPR to prevent interference • Simulation Production • Bulk production of simulated data • Small farm of ~30 machines • Augmented by farm at LLNL writing to same federation • Other production site databases imported to SLAC

Production Federations (4) • Simulation Analysis • Shares same servers as physics analysis federation • Separate federations to allow more database ids • Not possible to access physics and simulation data simultaneously • Production Releases • Used during the software release build process. • One per release and platform architecture • Share hardware with developer test federations • Testbed federations (up to 6) • Dedicated servers with up to 240 clients • Performance scaling as function of number of servers, filesystems per server, cpus per server, and other configuration parameters

Integration With Mass Storage • Different regions on disk • Staged: Databases managed by staging/migration/purging service • Resident: Databases are never staged or purged • Dynamic: Neither staged, migrated or purged • Metadata such as federation catalog, management databases • Frequently modified so would be written to tape frequently • Only single slot in namespace but multiple space on tape • Explicit backups taken during scheduled outages • Test: Not managed. • Test new applications and database configurations • Analysis federation staging split into two servers • User: Explicit staging requests based on input event collections • Kept: Centrally managed access to particular physics runs

Movement of data between FDs • Several federations form coupled sets • Physics(Online, OPR, Analysis) • Simulation (Production, Analysis) • Data Distribution strategy to move dbs between fds • Allocation of id ranges avoids clashes between source & destination • Use of HPSS namespace to avoid physical copying of databases • Once a database has been migrated from source, the catalog of the destination is updated and the staging procedures will read the database on demand • Transfer causes some interference • Still working to understand and minimize • Two scheduled outages per week (10% downtime) • Other administrative activities • Backups, schema updates, configuration updates

Physicist Access to Data • Access via event collections • Mapping from event collection to databases • Data from any event spread across 8 databases • Improved performance to frequently accessed information • Reduction in disk space • Scanning of collections became bottleneck • Needed for explicit staging of databases from tape • Mapping known by OPR but not saved • Decided to use Oracle database for staging requests • Single scan of collection to databases mapping • Also used for production bookkeeping • Data distribution bookkeeping • On-demand staging feasible using Objy 5.2 • Has been demonstrated • Prefer explicit staging for production until access better understood

Conditions DB - splits • Conditions/Ambient Split • big win (conditions frequently imported to all federations) • data sample to import/export: 26 GB -> 2 GB • time to import full snapshot: >1 hour --> <25 min • ambient imported less frequently to ambient db • Another conditions split pending • to preserve rolling calibrations generated by OPR

CondDB - moving data @ slac • Copying frequency • 1/day to OPR, shortly will do 4/day • 2/week to analysis • 6 hour latency demonstrated • OPR -> IR2: ~ 1/week • during machine physics outages • driven by demand - expected to increase in future • Planned improvement • copy files (under different names) - no outage necessary • rename files - ~ 1 min “outage”

Data Distribution Tools • Current state • tools based on perl / tcl / shell / C++ • author left, impossible to maintain • will be rewritten in Java • Improvements in transfer rate • secure fast copy developed • ~ twice faster than ftp • next version will use asynchronous transfer • stay tuned!

CondDB - new features • “link database(s)” containing symbolic link(s) to index db(s) • supports potentially multiple index databases for the same component • “high water mark” • set a limit in the validity time to prevent propagating the newly stored condition from covering the rest (up to +oo) of conditions histories. • need to achieve reproducibility of conditions/calibrations • used for rolling calibrations • both in production

Platforms & Software Builds • Currently supported platforms • Sun Solaris, OSF • just started: Linux • Production builds every month • instead of twice a week • serious bug fixes builds allowed in between • Nightly builds (of libraries)

Production Schema Management • Crucial that new software releases compatible with schema in production federations • New software release is a true superset • Reference schema used to preload release build federation • If release build is successful, the output schema forms new reference • Following some QA tests • Offline and online builds can overlap in principle • Token passing scheme ensures sequential schema updates to reference • Production federations updated to reference during scheduled outages • Explicit schema evolution scheme described elsewhere

BaBar Software Review (Aug99) “BaBar’s choice of Objectivity was undoubtedly a very progressive decision and represents a real pioneering effort, which cannot realistically be expected to be implemented without problems. […] Recommendations: • Studies to understand and fix the current limitations on storing events from OPR and reading them during analysis should be allocated sufficient hardware and manpower resources • BaBar must organize an effort to provide a limited-function, short-to-medium term solution for micro-DST analysis using the standard BaBar analysis framework, but without Objectivity” • Next review: • 28 Feb 2000

Hardware in OPR • Servers (Sun E4500, 4 CPU) 1) two, each with one A3500 (500 GB each), LS, JNL 2) two, each with two A3500 (500 GB each), LS, JNL, catalog server • 4 AMSes per server (still true for multi-treaded AMSes) 3) three, each with two A3500 (800 GB each) • Clients 1) Sun Ultra5, 330 MHz, 256 MB • 50, then 100 2) 15 Feb 2000: replaced with Sun T1 (440 MHz, 256 MB, generally faster) • 30% improvements seen on simple benchmarks • performance point of view: reduce lock traffic :-)

Hardware in Test-bed • Initially • 2 data servers (Sun 450), LS, JNL, 100 clients (as in OPR) • Later • 230 nodes (scheduled access with OPR prod) • up to 6 data servers, catalog server

Current Limitation • LS saturation seen with 200 nodes • LS: not multi-threaded • #CPUs, clock speed of little help (tried 330 Hz -> 440 Hz) • Autonomous partitions - problems in objy5.2 • lock usage: already reduced to minimum • still could do better with a lot of effort (pre-creation) • Objy response • next release (early summer 2000) • reduced traffic to lock server • next major release: LS rewritten

Clustering in OPR • clustering data • Per component • redirect data per domain/authLevel/authName/component • never even: 4 physics streams, 6 FSs • changingratiobetweenphysicsstreams • Per client • each 1/6 of nodes has dedicated FS • Independent “database clusters” • do not let 200 nodes write to a single db

# databases in OPR • Total accumulated ~ 14 K • serialization on db • db = file -> vnode • page table updates • “active” dbs seen by one client • 1 event -> 8 active dbs • raw, rec, aod, esd, tag, rawhdr, col, evshdr • 4 physics streams -> 8 * 4 = 32 • 4 database clusters -> 32 * 4 = 128 • or 6 clusters -> 32 * 6 = 192

Achieving 100 Hz in OPR • Requirement: 100 Hz on average by mid-March • almost not feasible • current peak rate 128 Hz (testbed), but • efficiency on average ~ 20 Hz • Major issues/inefficiencies • Startup time • operational delays • “finalize” • rolling calibrations, writing histograms, etc

Processing rate 200 nodes

OPR Performance Tests

Objy 5.2 issues • AMS multi-threaded, but • much higher CPU usage: understood (thanks Andy) • serialization observed: possibly connected to cpu usage • staging works • can stage databases while other clients are using the AMS • migration @ slac • in phases (per fd) • no problem with 5.1 / 5.2 interoperability

Future Improvements • Bronco farm -> T1 • more file systems • improved Objy code • next release, LS rewritten • autonomous partitions • reducing payload per event • event size still too big • raw ~80 kB instead of 32 kB • rec 150 kB instead of 120 kB

Hardware for Analysis • Analysis fd • LS, JNL server, catalog server • 3 data servers • 6 file systems • SP2 analysis • LS, JNL server, soon catalog server • 3 data servers • 6 file systems

Optimizing Analysis • Placement • optimized: data rebalanced (several iterations) • added more data servers, catalog server • optimized access to tag & micro data • 35 Hz -> 2 kHz (iterating over tag) • expected x25 bandwidth relative to August ‘99 • more fileservers & filesystems • Dirk’s visit of significant help, next visit just after CHEP • Found performance problems in transient code • proxyDict, RogueWave, …

Miscellaneous (1) • Conditions spread across all file systems (OPR) • automatic load balance (AMS + fs cache) • automatic purging from disc cache • Working with Objy on LS monitoring, profiling... • new db using Unix cp + ooattachdb • only 1 catalog operation, much faster • dual lock servers for developers’ federations • Objy transaction table limit too low

Miscellaneous (2) • autonomous partitions still not used in production • Large file handling issues • Problems handling 10GB db files for external distribution • Running out of dbids quickly • working on reliable cleanup mechanism • extending containers is expensive operation • presizing containers

Miscellaneous (3) • no “urgent” problems/bugs since last RD45 workshop • DRO tried recently • unusable in production • Supporting external sites • tuning their systems, solving problems, advising • starts to be a significant burden on db group • Weekly meetings with Objy developer

Event Store Browser • Added multi-treading • thread-per-connection, or thread pool • in conjunction with objectivity contexts • improve scalability • Collections may be 'scanned' to obtain lists of associated databases. Establishes a link between the event collection browse tree and the database browse tree, very useful for distribution. • New EventID now includes run number and timestamp info. Cloned events can now be identified and traced back to parent collections. • Several GUI facelifts

Objy - Known Problems (1) • conversion functions • reading tags written on Sun from OSF or Linux • 2kHz (Sun), 0.5kHz (OSF or Linux) • CPU utilization: 30% (Sun), 90% (Linux) • problem for external sites • e.g. background events written on Sun @ slac • fixing not trivial, one time conversion tool could be provided shortly

Objy - Known Problems (2) • Objy 5.2 AMS - high CPU usage • understood at SLAC, pending fix • waitlock ignored • when APs used, under heavy load • introduced in 5.2 • bug in Active Schema • practically cannot use with BaBar schema • fixed, coming soon with next release

Future Improvements • next release (summer 2000) • performance improvements • threadSpec, ls traffic, cache management, reduce fsynch • discussed • large page size • painful upgrade, probably feasible <1 PB • long refs • read-only databases • pre-creating containers

Support Personnel • Two database administrators • Daily database operations • Develop management scripts and procedures • Data distribution between SLAC production federations • First level of user support • Two data distribution support people • Data distribution to/from regional centers and external sites • Two HPSS (mass store) support people • Also responsible for AMS back-end software • Staging/migration/purging • Extensions (security, timeout deferral, etc.) • Five database developers provide second tier support • Augmented by physicists, visitors

Database Statistics (Jan 2000) • Servers • 12 primary (database servers) • 15 secondary (lock servers, catalog & journal servers) • 513 persistent classes • Data • ~33TB accumulated data • ~14000 databases • ~28000 collections • Disk Space • ~10TB • Total for data, split into staged, resident, etc.

Database Statistics (2) • Sites • >30 sites using Objectivity (USA, UK, France, Italy, Germany, Russia) • Users • ~655 licensees (signed the license agreement) • ~430 users (created a test federation) • ~90 simultaneous users at SLAC • Monitoring distributed oolockmon statistics • ~60 developers • Have created or modified a persistent class • A wide range of expertise: 10-15 experts

Summary • Database performance • no longer seen as a major problem • OPR: satisfactory now, still improving • analysis: getting better (lack of manpower) • Still learning how to manage a large system • multiple federations, multiple servers, multiple sites • large user community, large developer community • Still more to automate - too many manual procedures • We seem to have the largest database in the world • and it is growing fast • does anybody else entered into multi-TB region yet?

BaBar Status for the RD45 Workshop