
  1. Data Distribution and Management
  Tim Adye, Rutherford Appleton Laboratory
  BaBar Computing Review, 9th June 2003

  2. Overview
  • Kanga distribution system
  • New Computing Model requirements
  • Data Management
  • XRootd
  • Using the Grid
  People: Tim Adye (Kanga, Grid), Dominique Boutigny (Grid), Alasdair Earl (Grid, Objy), Alessandra Forti (Bookkeeping), Andy Hanushevsky (XRootd, Grid), Adil Hasan (Grid, Objy), Wilko Kroeger (Grid), Liliana Martin (Grid, Objy), Jean-Yves Nief (Grid, Objy), Fabrizio Salvatore (Kanga)

  3. Current Dataflow Scheme
  [Diagram: data flows between the SP site Padova, SLAC, the Tier A sites RAL, ccIN2P3, and Karlsruhe, and Tier C sites; the labelled flows are reprocessed data (Objy), MC (Objy), analysis data + MC (Objy), and analysis data + MC (Kanga).]

  4. Future Dataflow Possibilities
  [Diagram: SP sites such as Padova, SLAC, the Tier A sites RAL, ccIN2P3, and Karlsruhe, and multiple Tier C sites; the labelled flows are data (new + reprocessed), MC, and analysis data + MC.]

  5. Current Kanga Data Distribution
  [Diagram: a source site (e.g. SLAC) and a destination site (e.g. RAL), each with a skimData database (Oracle or MySQL), local information, and Kanga files. skimSqlMirror transfers new database entries; the transfer is controlled by local selections; files are copied with scp, bbftp, or bbcp; new files are marked as being on disk.]

  6. Key Features
  • Each site has its own copy of the meta-data (skimData database)
  • Once metadata/data is imported, all analysis is independent of SLAC
  • Each site knows what is available locally
  • Three-step update process (can run in a cron job; a wrapper sketch follows after this slide):
    • New entries from SLAC added to the DB (skimSqlMirror)
    • Local selections applied to new entries and flagged in the database (skimSqlSelect)
      • Uses the same selection command as used for user analysis
      • Selection can include an import priority
    • Selected files imported and flagged "on disk" (skimImport)
      • Database records what is still to do, so nothing is forgotten if an import is restarted
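The three steps above map naturally onto a small wrapper that a site can run from cron. The sketch below is illustrative only: the tool names (skimSqlMirror, skimSqlSelect, skimImport) are taken from the slide, but the command-line options shown are invented placeholders, not the real interfaces.

```python
#!/usr/bin/env python
"""Hypothetical sketch of a nightly (cron) wrapper around the three-step
Kanga import. Tool names come from the slides; the options are placeholders."""
import subprocess
import sys

STEPS = [
    ["skimSqlMirror"],                       # 1. pull new skimData entries from SLAC
    ["skimSqlSelect", "--site-selection"],   # 2. flag entries matching local selections
    ["skimImport", "--mark-on-disk"],        # 3. copy flagged files, mark them "on disk"
]

for cmd in STEPS:
    # Each step records its progress in the skimData database, so if a step
    # fails the next cron run simply picks up where this one left off.
    result = subprocess.run(cmd)
    if result.returncode != 0:
        sys.exit(f"{cmd[0]} failed with status {result.returncode}")
```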

  7. Data around the world
  • Currently RAL holds the full Kanga dataset
    • 29 TB on disk
    • 1.5 M files, 10¹⁰ events including stream/reprocessing duplicates
    • A further 4 TB of old data archived to RAL tape
  • 10-20 Tier C sites import data from SLAC or RAL
    • Most copy a few streams
    • Karlsruhe copies the full dataset (AllEvents), but not streams
  • Since August, Kanga data removed from SLAC
    • First exported to RAL
    • File checksums recorded and compared at RAL (a verification sketch follows below)
    • Archived to SLAC tape (HPSS): 29 TB on tape now
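A minimal sketch of the kind of checksum comparison mentioned above, assuming an MD5-style checksum list; the actual tool and checksum format used for the RAL verification are not specified here.

```python
import hashlib

def md5sum(path, blocksize=1 << 20):
    """Checksum a (potentially multi-GB) file in fixed-size blocks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()

def verify(checksum_list):
    """checksum_list: text file with lines '<hex checksum>  <file path>'.
    Returns the paths whose recomputed checksum does not match."""
    bad = []
    with open(checksum_list) as f:
        for line in f:
            expected, path = line.split(None, 1)
            if md5sum(path.strip()) != expected:
                bad.append(path.strip())
    return bad
```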

  8. Handling Pointer Skims
  • 105 pointer skims for all series-10 real data now available
  • Each pointer skim file contains pointers to selected events in AllEvents files
    • Big saving in disk space, but need to evaluate performance implications
  • Tier C sites do not have AllEvents files
    • A "deep copy" program creates a self-contained skim file from specified pointer files (illustrated in the sketch below)
    • Needs to be integrated into the import procedure to allow automatic conversion
    • Plan is to run deep copy like another "ftp" protocol, with output written directly to Tier C disk via the ROOT daemon
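The sketch below illustrates the deep-copy idea in PyROOT, under the simplifying assumption that a pointer skim reduces to a list of entry numbers in a single AllEvents file and that the event tree is called "Events"; the real Kanga deep-copy program and file layout differ.

```python
import ROOT

def deep_copy(allevents_path, entries, output_path):
    """Copy only the listed entry numbers out of an AllEvents file into a
    self-contained output file, i.e. a 'deep copy' of a pointer skim."""
    fin = ROOT.TFile.Open(allevents_path)
    src = fin.Get("Events")              # hypothetical tree name
    fout = ROOT.TFile.Open(output_path, "RECREATE")
    out = src.CloneTree(0)               # same branch structure, no entries yet
    for i in entries:
        src.GetEntry(i)                  # load the pointed-to event...
        out.Fill()                       # ...and append it to the new tree
    out.Write()
    fout.Close()
    fin.Close()
```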

  9. New Analysis Model
  Few changes are required for the new analysis model:
  • April test output already exported to RAL
    • Allow for multiple files per collection
  • Simple extension to the old bookkeeping system should allow an easy transition to the new system
    • Available for the July test skim production
    • Will not handle user data
    • Full collection → file mapping will come with the new bookkeeping tables
  • Migrate to new bookkeeping tables and APIs
    • Should simplify existing tools

  10. Data Management
  • All Kanga files are stored in a single directory tree, accessed via NFS
  • Mapping to different file-systems/servers is maintained with symbolic links
    • Automatically created by the import tools (see the sketch below)
    • Cumbersome and error-prone to maintain
    • Tools for automatic reorganisation of disks exist at SLAC only
  • Kanga files can be archived to and restored from tape
    • Modular interface to local tape systems
    • Implemented for SLAC, RAL (and Rome)
    • Small files packed into ~1 GB archives for more efficient tape storage
    • Archive and restore controlled with skimData selections
  • No automatic "stage on demand" or "purge unused data"
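A hypothetical sketch of the symbolic-link bookkeeping: new files land on whichever physical data disk has the most free space, while users only ever see a single logical tree. The paths and placement policy are invented for illustration, not the real SLAC/RAL layout.

```python
import os

DATA_DISKS = ["/data1/kanga", "/data2/kanga", "/data3/kanga"]  # physical file servers/disks
LOGICAL_ROOT = "/babar/kanga"                                  # single tree seen by analyses

def free_bytes(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def register_file(relative_path):
    """Choose a physical location for a new Kanga file and link it into the
    logical tree. Returns the physical path the caller should copy the file to."""
    disk = max(DATA_DISKS, key=free_bytes)            # put new data on the emptiest disk
    physical = os.path.join(disk, relative_path)
    logical = os.path.join(LOGICAL_ROOT, relative_path)
    os.makedirs(os.path.dirname(physical), exist_ok=True)
    os.makedirs(os.path.dirname(logical), exist_ok=True)
    os.symlink(physical, logical)                     # logical tree points at the real copy
    return physical
```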

  11. XRootd
  XRootd, like Rootd (the ROOT daemon), provides remote access to ROOT data, but greatly enhanced:
  • High performance and scalable
    • Multi-threaded, multi-process, multi-server architecture
  • Dynamic load-balancing
    • Files located using multi-cast lookup
    • No central point of failure
    • Allows files to be moved or staged dynamically
    • Allows servers to be dynamically added and removed

  12. XRootd Dynamic Load Balancing
  [Diagram: a client asks "who has the file?"; any number of dlbd load-balancing daemons locate it among any number of xrootd servers ("I do"), which subscribe to the load balancers and sit in front of a mass storage system (tape, HPSS, etc.).]

  13. XRootd (cont)
  • Flexible security
    • Allows use of almost any protocol
  • Includes interface to the mass-storage system
    • Automatic "stage on demand" or "purge unused data"
  • Reuses the file-system, mass-store, and load-balancing systems from SLAC Objectivity (oofs, ooss, oolb)
  • Compatibility with Rootd allows access from existing clients (and vice versa); a client-side example follows below
  • Moving from NFS to XRootd should solve many of our data management problems
    • Still need tools for automatic redistribution of data between servers
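From the client side this compatibility means an XRootd-served file opens like any other ROOT file; only the URL changes. A minimal PyROOT example, with a placeholder server, path, and tree name:

```python
import ROOT

# The host, path, and tree name below are placeholders, not real BaBar locations.
f = ROOT.TFile.Open("root://xrootd.example.org//store/kanga/AllEvents-0001.root")
if f and not f.IsZombie():
    tree = f.Get("Events")                  # hypothetical tree name
    print("Events in file:", tree.GetEntries())
    f.Close()
```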

  14. Using the Grid
  • Extend existing tools to use Grid authentication
    • Use GridFTP or GSI-extended bbftp/bbcp for file transfer (see the transfer sketch below)
    • Add GSI authentication to deep copy
  • BdbServer++ uses remote job submission to run deep copy and data transfer
    • Prototype currently being tested with Objectivity exports
  • SRB is a distributed file catalogue and data replication system
    • Have demonstrated bulk Objectivity data transfers SLAC → ccIN2P3
  • Both BdbServer++ and SRB could handle Kanga distribution just as easily (if not more easily)
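As a hedged sketch of the first point, an import tool could swap its scp/bbftp call for a GSI-authenticated GridFTP transfer along these lines; the host and paths are placeholders, and a valid grid proxy (e.g. from grid-proxy-init) is assumed to exist already.

```python
import subprocess

def grid_copy(src_host, src_path, dest_path):
    """Fetch one file over GridFTP with GSI authentication."""
    subprocess.run(
        ["globus-url-copy",
         "gsiftp://%s%s" % (src_host, src_path),   # remote source via GridFTP
         "file://%s" % dest_path],                 # local destination (absolute path)
        check=True)

# Example (placeholder host and paths):
# grid_copy("datamover.example.org", "/store/kanga/skim-0001.root",
#           "/babar/kanga/skim-0001.root")
```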

  15. Conclusion
  • Kanga distribution works now
  • Adapting to the new computing model should be easy
  • Data management is still cumbersome
    • Many of the problems should be addressed by XRootd
  • We are looking to take advantage of the Grid

  16. Backup Slides (Adil Hasan)

  17. SRB in BaBar (Tier A to Tier A)
  • SLAC, ccIN2P3, and Paris VI and VII are looking at using the SDSC Storage Resource Broker (SRB) for data distribution (originally Objy).
  • SRB comes with a metadata catalog plus a set of applications to move data and update its location in the metadata catalog (a minimal client sketch follows below).
  • Clients can talk to the metadata catalog remotely and move data through the SRB server, using either GSI (i.e. X.509 certificates) or a password for authentication.
  • Two prototypes were carried out successfully at Supercomputing 2001 and 2002:
    • The first prototype replicated Objy databases between two servers at SLAC
      • Had one metadata catalog for the replicated data.
    • The second prototype copied Objy databases between SLAC and ccIN2P3
      • Used two SRB metadata catalogs, one at ccIN2P3 and one at SLAC.
  • See: http://www.slac.stanford.edu/BFROOT/www/Computing/Offline/DataDist/DataGrids/index.html
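A minimal sketch of driving SRB from a script with its command-line clients (the "S-commands"); the collection path is a placeholder and authentication options (GSI vs. password) are omitted, so treat this as an outline rather than the real BaBar procedure.

```python
import subprocess

def srb_put(local_file, collection="/home/babar.srb/kanga"):
    """Register one local file in an SRB collection and list the result."""
    subprocess.run(["Sinit"], check=True)                 # start an SRB session
    try:
        subprocess.run(["Sput", local_file, collection], check=True)
        subprocess.run(["Sls", collection], check=True)   # confirm the file is registered
    finally:
        subprocess.run(["Sexit"], check=True)             # end the session
```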

  18. SRB in BaBar (Tier A to Tier A)
  • The metadata catalog design for the prototypes separates experiment-specific from experiment-common metadata (e.g. file sizes, mtimes, etc. are experiment-common; BaBar collections are experiment-specific).
    • This decoupling allows greater flexibility: SRB and BaBar can develop their metadata structures independently.
  • The design is also very flexible and should easily accommodate the new Computing Model, provided the bookkeeping uses the same principle in its design.
  • Plan to demonstrate the co-existence of Objy and ROOT files in the SRB catalog.
  • This run's production Objy metadata is being put into SRB for data distribution to ccIN2P3 NOW (running this week or next). This should give us useful information on running in production.
  • Looking at technologies to federate local SRB metadata catalogs.

  19. SRB in BaBar (Tier A to Tier A)
  • Bi-weekly meetings with the SRB developers are helping to get further requirements and experience fed back to SRB (e.g. bulk loading, a parallel file transfer application, etc.).
    • Bug-fix release cycle is ~weekly.
    • New feature releases every ~3-4 weeks.
  • The meetings also have participation from SSRL people.

  20. BdbServer++ (Tier A to Tier C)
  • Based on BdbServer, developed by D. Boutigny et al. and used to extract Objy collections ready for shipping to a Tier A or C site.
  • Extended to allow extraction of events of interest from Objy (effort from Edinburgh, SLAC, ccIN2P3):
    • Uses BdbCopyJob to make a deep copy of the collection of events of interest.
    • BdbServer tools to extract the collection.
    • Tools to ship collections from Tier A to Tier C.
  • Useful for "hot" collections of interest (i.e. deep copies that are useful for multiple Tier C sites).
    • "Cache" the hot collections at Tier A to allow quicker distribution to Tier C.

  21. BdbServer++ (Tier A to Tier C)
  • Grid-ified BdbServer++:
    • Takes advantage of the distributed batch work for the BdbCopyJob and extraction (a remote-submission sketch follows below).
    • Takes advantage of grid-based data distribution to move files from Tier A to C.
    • The design is highly modular, which should make it easy to implement changes based on production running experience.
  • Tested BdbCopyJob through EDG and Globus.
  • Tested extraction of a collection through EDG.
  • Tested shipping collections through SRB.
  • Have not tested all steps together!
  • Production-style running may force changes to how collections are extracted and shipped (issues with disk space etc.).
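A hedged sketch of the remote-submission idea: run the deep-copy step at the Tier A site via a Globus gatekeeper. The gatekeeper name, executable path, and arguments are placeholders, not the real BdbServer++ interface.

```python
import subprocess

def remote_deep_copy(gatekeeper="tiera.example.org",
                     collection="/store/skims/hot-collection"):
    """Submit the deep-copy job to a remote Globus gatekeeper (GRAM)."""
    subprocess.run(
        ["globus-job-run", gatekeeper,
         "/opt/babar/bin/BdbCopyJob",      # hypothetical install location
         collection],                      # hypothetical argument: collection to copy
        check=True)
```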

  22. BdbServer++ (Tier A to Tier C)
  • The BdbServer++ work is also producing useful by-products:
    • A client-based Globus installation tool (complete with CA info).
  • The modular structure of BdbServer++ should make it easy to plug in the new tools for the new CM (e.g. BdbCopyJob -> ROOT deep-copy app, Objy collection app -> ROOT collection app).
  • All Grid data distribution tools being developed must keep legacy apps in mind (implying a phased approach). However, new development should not exclude the Grid by design.
