
DQ2 status, releases and plans



  1. DQ2 status, releases and plans. US ATLAS DDM Workshop, 28-29 September 2006. On behalf of the DDM Group

  2. Outline • Introduction to DQ2 • Brief look at the internals of DQ2 • Bottlenecks • Overview of the project, including future plans • Items to discuss and Conclusions

  3. DQ2 • DQ2, our Distributed Data Management system built on top of Grid data transfer tools, is based on: • Hierarchical definition of files and datasets • Through dataset catalogs • Datasets as the unit of file storage and replication • Supporting dataset versions • Distributed file catalogues at each site • Automatic data transfer mechanisms using distributed site services • Dataset subscription system • DQ2 implements the basic needs of the ATLAS Computing Model, as described in the Computing TDR (June 2005): • Distribution of raw and reconstructed data from CERN to the Tier-1s • Distribution of AODs (Analysis Object Data) to Tier-2 centres for analysis • Storage of simulated data (produced by Tier-2s) at Tier-1 centres for further distribution and/or processing

  4. DQ2 Concepts • ‘Dataset’: • an aggregation of data (spanning more than one physical file!), which are processed together and serve collectively as input or output of a computation or data acquisition process. • Flexible definition: • … can be used for grouping related data (e.g. RAW from a run with a given luminosity) • … can be used for data movement purposes (e.g. Panda dispatch blocks, with data needed as input for jobs) • ‘File’: • constituent of a dataset • Identified by LFN and GUID

  5. DQ2 Concepts • ‘Site’ • A computing site providing storage facilities for ATLAS • … which may be a federated site • ‘Subscription’ • Mechanism to request updates of a dataset to be delivered to a site
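
To make these concepts concrete, here is a minimal Python sketch of how datasets, files, sites and subscriptions relate; the class and attribute names are illustrative assumptions, not DQ2's actual schema or API.

    class DQFile:
        """A dataset constituent, identified by its LFN and GUID."""
        def __init__(self, lfn, guid):
            self.lfn = lfn    # logical file name
            self.guid = guid  # globally unique identifier

    class Dataset:
        """An aggregation of files, possibly spanning many physical files."""
        def __init__(self, name, files=None):
            self.name = name
            self.files = list(files or [])   # constituents (DQFile instances)

    class Subscription:
        """A request that a dataset be delivered to (and kept up to date at) a site."""
        def __init__(self, dataset_name, site):
            self.dataset_name = dataset_name
            self.site = site                 # destination site storage (possibly federated)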

  6. DQ2 components [slide diagram: Site ‘X’ holds File1 and File2 of Dataset ‘A’; subscription entry: Dataset ‘A’ | Site ‘X’] • Dataset Catalogs • Responsible for all bookkeeping • Developed by ATLAS and hosted by ATLAS @ CERN (jointly by the DDM ‘core’ team and the DDM ‘Operations’ team) • Site Services • Responsible for fulfilling dataset subscriptions • Subscription is the concept driving data movement: a site storage is subscribed to a dataset

  7. DQ2 components [slide diagram: the DQ2 dataset catalogs, the DQ2 “Queued Transfers” database and the DQ2 Subscription Agents are part of DQ2; the File Transfer Service and the Local Replica Catalog are not part of DQ2]

  8. DQ2 dependencies • Python >= 2.3.4, curl, py-libcurl, Globus GSI • Dataset Catalogs: Apache front-end, mod_python, mod_gridsite, MySQL server (currently) • Site Services: MySQL client and server, gLite FTS client, srmcp client, LFC client • Site Services run on the ATLAS “VO BOX” (a dedicated machine for ATLAS), with one instance per site storage [ Note: LFC: LCG product / gLite: EGEE product ]

  9. More than just s/w development • DQ2 forced the introduction of many concepts, defined in the Computing Model, onto the middleware: • ATLAS Association between Tier-1/Tier-2s • Distinction between temporary (e.g. disk) and archival (e.g. tape) areas • Datasets as the unit of data handling • Often, existing Grid middleware was not originally designed to have these concepts: • DQ2 has a Tiers of ATLAS file, containing the ‘association’ between Tier-1s and Tier-2 sites • DQ2 has been the product of a joint collaboration between CERN-based ATLAS Computing Group and US ATLAS

  10. A (brief?) look at the internals of DQ2

  11. Catalogs • Currently, single instance of dataset catalogs • Architecture foresees multiple regional, independent catalog instances • Dataset repository catalog: • What datasets exist in the system? • Dataset content catalog • What are the constituents (files) of a dataset (version)? • Dataset hierarchy catalog • Not implemented (yet): hierarchical organization between datasets • Dataset selection catalog • What are the datasets that match this user query? (not within DQ2) • Dataset location catalog • Where is this data located? • Dataset subscription catalog: • Keeps track of all requests for datasets to be resident at a site

  12. Client • DQ2 client API: • Interfaces to all dataset catalogs • in a secure way for any ‘write’ operations • Guarantees consistency between all the (“loosely coupled”) dataset catalogs, e.g. a dataset in the location catalog refers to the same dataset in the content catalog • Consistency now being improved (for 0.3) with initial support for transactions, etc.

  13. Interacting with the Catalogs • Create a new dataset, adding content and subscribing it to a site: • User interacts with DQ2 client API • Internally, DQ2 client API will: • Create a dataset unique ID (DUID) • Create a dataset version unique ID (VUID) • Whenever a dataset is created, the first version is also automatically created • Add DUID and VUID along with dataset name (and native dataset metadata) to the repository catalog • Add constituents (files) to the content catalog • Add either the DUID or VUID to the subscription catalog • Depending on whether the user subscribed to a particular dataset version (VUID) or to the latest dataset version (DUID)
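
As an illustration of the flow above, a small self-contained Python sketch; the catalog "stores" are plain dicts and all names are hypothetical, not the real DQ2 client API.

    import uuid

    repository_catalog = {}    # dataset name -> (DUID, VUID of first version)
    content_catalog = {}       # VUID -> list of (LFN, GUID)
    subscription_catalog = []  # list of (site, DUID-or-VUID)

    def register_new_dataset(name, files):
        duid = str(uuid.uuid4())   # dataset unique ID
        vuid = str(uuid.uuid4())   # the first version is created automatically
        repository_catalog[name] = (duid, vuid)
        content_catalog[vuid] = list(files)          # constituents (LFN, GUID pairs)
        return duid, vuid

    def subscribe(site, duid=None, vuid=None):
        # Subscribe either to the latest version (DUID) or to a fixed version (VUID).
        subscription_catalog.append((site, duid or vuid))

    # Usage: create a dataset with two files and subscribe a site to its latest version.
    duid, vuid = register_new_dataset('csc11.mydataset', [('file1.root', 'guid-1'),
                                                          ('file2.root', 'guid-2')])
    subscribe('SITEX', duid=duid)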

  14. Dataset States • A dataset can be: • Open • The latest version (latest VUID) is open so new files may be added to it • Closed • The latest version (latest VUID) is closed so no new files may be added to it • But… a new version may be added to the dataset (creating a new VUID) • Frozen • The latest version (latest VUID) is closed so no new files may be added to it • Additionally, no new version may be added to the dataset (the DUID is now immutable)
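
A tiny sketch of the state rules just listed (hypothetical helper functions, not part of DQ2):

    OPEN, CLOSED, FROZEN = 'open', 'closed', 'frozen'

    def may_add_files(state):
        # Only the latest, still-open version accepts new files.
        return state == OPEN

    def may_add_version(state):
        # A closed (but not frozen) dataset may receive a new version;
        # freezing makes the DUID immutable.
        return state == CLOSED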

  15. Overview of the subscription workflow • A subscription is inserted in the central catalogs • Site services pull all subscriptions to the storage they are serving • Subscription is queued (with a ‘fair’ share) in the site services • Site services then: • Find all missing files, resolve their source replicas by contacting remote replica catalogs, according to a subscription policy • Partition request according to ‘network’ channels and submit them (typically to gLite FTS) • Allowing storage usage to be managed • Managing FTS queue of transfers • Retrying failed transfers • Register files onto site’s local replica catalog

  16. Subscription policies • Users may configure the exact behavior of a subscription: • The sites to use as sources… • Known sources (user specifies list of sites to search for replicas) • Close sites (‘geographically close sites’) • Complete sources (DQ2 uses the location catalog list of complete dataset replicas as possible sources) • Incomplete sources (DQ2 uses the location catalog list of incomplete dataset replicas as possible sources) • Whether to wait for replicas to appear on any of the source sites or not… • keep retrying to find replicas if these are not immediately found • Whether the subscription should be archival (custodial) at the destination site • i.e. whether the data may, at some future point, be deleted from the destination or not
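
A hedged sketch of what such a subscription policy could look like as configuration; the option names are illustrative, not the real DQ2 subscription schema.

    subscription_policy = {
        'sources': 'complete',     # one of: 'known', 'close', 'complete', 'incomplete'
        'known_sources': [],       # only used with 'known': explicit list of candidate sites
        'wait_for_sources': True,  # keep retrying if no replica is found immediately
        'archival': False,         # True = custodial at the destination (not deleted later)
    }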

  17. Deeper look into the site services 1/6 • Fetcher • Gets all subscriptions to the site • Starting from 0.3, it will not scan all subscriptions but only subscriptions of datasets modified since the Fetcher’s last run (important optimization) • For each new subscription, queue a new request for that subscription in the local database • If a subscription is queued at the site, but is no longer in the central catalogs, mark the site subscription as cancelled • Then each agent at the site will cancel any files associated with it • When all agents cancel their files, the subscription is cancelled and all traces removed (more later on this…)
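
A sketch of one Fetcher pass as described above; the catalog and database objects and their methods are hypothetical, not the real site-services code.

    def fetcher_pass(central_catalog, local_db, site):
        subs = central_catalog.subscriptions_for(site)   # all subscriptions to this storage
        central_ids = set(s.id for s in subs)

        # Queue a local request for every subscription not seen before.
        for sub in subs:
            if not local_db.has_request(sub.id):
                local_db.queue_request(sub)

        # A queued subscription that no longer exists centrally is marked cancelled;
        # each agent then cancels its files and the request is eventually cleaned up.
        for sub_id in local_db.queued_subscription_ids():
            if sub_id not in central_ids:
                local_db.mark_cancelled(sub_id)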

  18. Deeper look into the site services 2/6 • The state machine then acts on requests in the local database • Requests are chosen based on: • Fair share allocation • Modification date • The Fetcher is the one that assigns a request (subscription) to a share: • Share assignments are flexible and can be based on subscription owner, dataset name, etc. • We should work on an ‘official’ share assignment policy… • The state machine picks up requests by: • ‘Rolling the dice’ to choose a share, based on each share’s fraction • Scanning for all requests with the last modification date • Choosing one of the requests at random from the group of requests with that last modification date • If the share has no requests, rolling the dice again and discounting this share’s fraction
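
The 'rolling the dice' selection can be sketched as weighted random choice over shares, discounting shares that turn out to be empty; this is an illustrative reading of the slide (assuming the oldest modification-date group is served first), not the real scheduler.

    import random

    def pick_request(requests_by_share, share_fractions):
        """requests_by_share: share name -> list of requests (each with a .modified date).
        share_fractions: share name -> configured fraction of the site's capacity."""
        shares = dict(share_fractions)
        while shares:
            # Roll the dice: pick a share with probability proportional to its fraction.
            total = sum(shares.values())
            roll, cumulative, chosen = random.uniform(0, total), 0.0, None
            for share, fraction in shares.items():
                cumulative += fraction
                if roll <= cumulative:
                    chosen = share
                    break
            if chosen is None:                 # guard against floating-point edge cases
                chosen = share
            pending = requests_by_share.get(chosen, [])
            if pending:
                # Assumption: take the group with the oldest modification date,
                # then choose one of those requests at random.
                oldest = min(r.modified for r in pending)
                return random.choice([r for r in pending if r.modified == oldest])
            del shares[chosen]                 # empty share: discount it and roll again
        return None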

  19. Deeper look into the site services 3/6 • The various agents that act on the request are collectively called the ‘state machine’ • Agents are: • VUID Resolver: • Finds content for the dataset (queries the content catalog), marks the dataset as incomplete or complete • Replica Resolver: • Finds source replicas for a dataset, applying the subscription policy to choose source sites, and queries those sites • Partitioner: • Chooses the transfer tool and partitions the request into blocks (if the dataset is big) • Submitter: • Submits each transfer request • Pending Handler: • If the transfer tool is asynchronous, checks request status • Verifier: • Checks that transferred files are ok on the destination • Replica Register: • Registers newly transferred files in the local LRC
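
A sketch of the common pattern shared by these agents, including the cancellation check described on slide 22; the base class and method names are hypothetical, not the real site-services code.

    class AgentSketch:
        name = 'agent'

        def run_once(self, local_db):
            # Bulk operation: each request covers one dataset (or a subset of one).
            for request in local_db.requests_waiting_for(self.name):
                if request.cancelled:
                    local_db.cleanup(request)   # cancelled subscription: clean up, do nothing else
                    continue
                self.process(request)

        def process(self, request):
            raise NotImplementedError           # VUID Resolver, Replica Resolver, Partitioner, ...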

  20. Deeper look into the site services 4/6 • All agents act with bulk operations: • The size of the ‘bulk’ is the size of a single dataset, or a subset of a dataset • Reason: • It is foreseen that agents will act on behalf of the subscription requester (with the requester’s certificate) • Therefore, each agent has to have a security context that matches the requester… • Sometimes, if the bulk would be too big (e.g. a dataset with 10000 files should not be submitted in one go to FTS as it would block all other transfers), the request is partitioned into pieces, all corresponding to the same subscription • Also, if some file transfers fail for a dataset, only those transfers are retried (a subset of the original ‘bulk’)
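
A minimal sketch of the partitioning idea; the block size is an arbitrary illustration, not a DQ2 default.

    def partition(files, block_size=250):
        """Split one dataset's file list into blocks that are submitted separately,
        all still belonging to the same subscription."""
        return [files[i:i + block_size] for i in range(0, len(files), block_size)]

    # Usage: a 10000-file dataset becomes 40 blocks, so one huge dataset
    # does not monopolise the transfer queue.
    blocks = partition(['lfn%05d' % n for n in range(10000)])
    assert len(blocks) == 40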

  21. Deeper look into the site services 5/6 • gLite FTS: • Is the preferred transfer tool • DQ2 (the Submitter and Pending Handler) manages FTS in such a way that: • Requests are always submitted up to the point where, • for the same destination host, • and the same FTS server, • and the same source host, • … requests start staying ‘Pending’ in FTS • At that point, DQ2 does not submit more requests for the same source/destination and FTS server: • Leaving the internal FTS queue full, but not more than full! • Otherwise, fair share would not work • DQ2 would apply the share but then internally in FTS we’d have a big queue of requests pending!
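
The 'full but not more than full' rule can be sketched as a per-channel gate: stop submitting for a (source host, destination host, FTS server) triple once its requests start staying Pending. The counters and host names below are illustrative, not the real Submitter.

    def may_submit(channel, pending_counts):
        """channel: (source_host, dest_host, fts_server).
        pending_counts: channel -> number of its requests currently 'Pending' in FTS."""
        return pending_counts.get(channel, 0) == 0   # queue already full: hold back new work

    # Usage with placeholder hosts: new submissions are held while a request is Pending.
    channel = ('srm.source.example', 'srm.dest.example', 'fts.example')
    assert not may_submit(channel, {channel: 1})
    assert may_submit(channel, {})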

  22. Deeper look into the site services 6/6 • Cancellation of requests: • Whenever a subscription is removed (or changed) in the dataset catalogs, the Fetcher picks it up and marks the ongoing subscription in the local “queued transfers” database as cancelled • Each agent, before starting, checks whether the subscription is cancelled: • If it is, the agent does nothing and cleans up the request • In the case of the Pending Handler (FTS tool), it also triggers a glite-transfer-cancel, effectively canceling the transfer on FTS

  23. Cache turnover • DQ2 requires two attributes in the Local Replica Catalog • Replica last access time • Custodial/Archival flag (T/F) • When the file is registered, the date is updated along with the custodial flag • The custodial flag may also be updated if the file is already in the catalog and is now custodial (but was previously temporary) • Any other tools (e.g. dq2_get or WN utils) should also update the replica access time • Simple cache turnover: • Scan the local replica catalog and remove non-custodial files which have not been recently accessed • David Cameron’s PhD thesis proved this concept :-) (I think!) • Suggestions welcomed for additional strategies!
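
A sketch of the simple turnover policy above: scan the LRC and pick the non-custodial replicas that have not been accessed recently. The record layout and the idle window are assumptions for illustration only.

    import time

    def expired_replicas(lrc_records, max_idle_days=30):
        """lrc_records: iterable of dicts with 'guid', 'last_access' (epoch seconds) and
        'custodial' (True/False), mirroring the two required LRC attributes above.
        The 30-day window is an arbitrary illustration, not a DQ2 default."""
        cutoff = time.time() - max_idle_days * 86400
        return [r['guid'] for r in lrc_records
                if not r['custodial'] and r['last_access'] < cutoff]

    # Usage: the returned candidates would then be removed from storage and the LRC.
    records = [{'guid': 'guid-1', 'last_access': time.time(), 'custodial': False},
               {'guid': 'guid-2', 'last_access': 0.0, 'custodial': False},
               {'guid': 'guid-3', 'last_access': 0.0, 'custodial': True}]
    assert expired_replicas(records) == ['guid-2']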

  24. Bottlenecks

  25. Bottlenecks • Central dataset catalogs: • Missing bulk methods, some queries very slow • 0.2 has a ‘slow’ content catalog • With a known leak in the POOL FC interface, leading to annoying restarts of the service every once in a while • 0.3 seems considerably better and a good platform for future developments • Will need to scale h/w soon: • Connections are already failing occasionally when major exercises start • More Apache front-ends, better back-end DB h/w • Developer bottleneck: the API must change sometimes • From 0.3 it can (potentially) be backward compatible

  26. Bottlenecks • Site services: • The perceived reliability of the system overall comes mostly from the behavior of the site services • Site services depend greatly on Grid m/w • Most of which is largely immature • A lot of work/refactoring/trials to get the site services stable • e.g. each process now measures its own memory and socket consumption and dies gracefully after a certain threshold :S

  27. Bottlenecks • LCG: • SRMs and storages are unstable (see outcome of SC4 T0->T1 tests) • Nonetheless, functionality provided by SRM remains fundamental to us • LFC catalogs are not a reliable service (s/w and/or service issue) • Investigating with service providers and developers • OSG: • Good support model and isolation from the developers • Not many problems seen from our side • But apparently some problems happening on that side! • OSG/US-ATLAS requires additional customization of DQ2 to their local needs • High on our task list as OSG is actively supporting the project

  28. Missing • Tools to guarantee consistency between: • Dataset locations and site contents • LRC contents and site contents (more of a site issue) • Generic interfaces and tools so that DQ2 can perform these operations uniformly across all Grids

  29. DQ2 project

  30. Manpower • Myself • Going around data management (problems) for a few years now • (very little) coordination, (lots of) meetings, working on site services • David Cameron • Former EDG (European Data Grid project) • (lots of) support for operations (LCG), monitoring, helping with site services • Thesis on ‘Replica Management and Optimisation for Data Grids’ • Pedro Salgado • Joined later, working on central catalogs • 2 people during most of the project, also supporting LCG operations • Now with help of DDM Operations!

  31. Major milestones • Looking back at the product history: • Started with two developers (with many other parallel tasks) • 0.1.x branch: • Trivial site services, no load balancing, first go at central catalogs • Integrated successfully with Panda production • 0.2.x branch: • Integrated with ATLAS production (mostly!) • Two attempts: • 0.2.1 -> 0.2.9: problematic site services • 0.2.10 -> … : brand new site services • Overall, s/w development cycles have been largely irregular due to lack of manpower • For considerable periods development stopped in order to support operations or some exercise

  32. SC4 Tier-0 Scaling Tests • Run a full-scale exercise, from EF, reconstruction farm, T1 export, T2 export • Realistic data sizes, complete flow • ATLAS leading the Tier-0 processing and export exercises!

  33. Project • Mailing lists: • atlas-dq2-dev@cern.ch • Discuss new developments/architecture • atlas-dq2-support@cern.ch • Discuss support issues • Savannah Tasks • https://savannah.cern.ch/task/?group=atlas-ddm • Every new development inserted and discussed as a Task • Savannah Bug Reports • https://savannah.cern.ch/bugs/?group=atlas-ddm • Wiki-based documentation • https://uimon.cern.ch/twiki/bin/view/Atlas/DDM • Test cases • 475 unittests • Development Guides • https://uimon.cern.ch/twiki/bin/view/Atlas/DDMDevelopmentNotes • Packages, releases and tags • Based on python distutils, CVS tags

  34. Status of releases • Today, 0.2.12 deployed • After requests for improvements to the site services • SC4, ATLAS S/W Workshop • One instance of the central catalogs • atlddmpro.cern.ch • One instance of the site services per site storage • including separate instances (e.g. T1 hosts T1 and T2s) for the disk and archival areas • Developments introduced during 0.2.x: • Fair-share, monitoring, better FTS handling, support for multiple data flows (T0<->T1<->T2)

  35. 0.2.12 • New: • Handling of agents far more reliable • Also handles stopped/killed agents (hung) • Fair share algorithm • Better balancing of external dependencies • Also each agent measuring ‘its’ own memory • Loosely coupled services: • … one to spawn agents • … another to assign requests according to shares • A temporary glitch in one part should not affect the rest! • Local customizations possible: • Pre-submit script / clean attempt script • Already serving BNLDISK and BNLTAPE • Any first impressions?

  36. Future plans • Clearly, our developments focus mostly on the dataset catalogs • The ATLAS-specific part • Nonetheless, until the site services (and all their external dependencies) are stable we cannot afford to move forward with new developments • We must not disrupt an already very disrupted infrastructure • c.f. T1<->T2 tests from DDM Operations • Different from T0->T1: more complex data flows, more sites involved, and no support from developers • Involves less-debugged services at the sites

  37. Short-term plans (development team) • SC4 Tier-0 exercise re-run • All attention on this exercise, which started ~25th September • Decide now on the conditions to deploy the 0.3 catalogs • Propose: • Separate instance • Start regular migration (e.g. once per week) of 0.2 to 0.3 • Ask users to migrate existing applications to the 0.3 API • Ask for initial feedback on methods and immediate improvements • Only then release the 0.3 catalogs • Monitoring alarms (automated) and integration with ARDA monitoring • First prototype ready ~ next week • Will offload the developers a lot • Need a more stable system, less DQ2 ‘load’ on the site services, better handling of external dependencies • Initial set of improvements done; more pending • Prepare updated Design and Architecture guides • Requested by the Collaboration

  38. Future (longer term) planning • 0.2 ‘branch’: • Better isolation between site service activities (LFCs, FTS, etc) • Alarms, monitoring • Site isolation • ‘Bad’ source site cannot stop system or slow down subscriptions • 0.3 ‘branch’: • New central dataset catalogs • 0.4 ‘branch’: • Hierarchical dataset catalogs • 0.5 ‘branch’: • Regional dataset catalogs • Cross-publishing between dataset catalog instances • Manpower situation seems to be improving: • Significant manpower increase expected ~ November • (will need to work on training and documentation)

  39. Wish list for m/w developers • gLite FTS: • Notification service (e.g. Jabber based) as an alternative to web service polling • Unauthenticated read-only requests usually a good model • Delegation • Support SRM v2.2 • Staging • Improved reporting • Extend status reports for ongoing transfers and queue status along with the definition of a notification-based interface? • More complex channel management and load-balancing (later) • LFC • Faced a ‘crisis’ a few weeks ago: hit a scalability limit • Requesting additional features to speed up lookups • Access Control Lists, Consistency and Accounting • LFC: no support for replica ownership. How to do accounting then? Querying SRMs? Difficult to enforce catalog policies (preventing end-users from ‘polluting’ the LFC) • SRM: eagerly waiting for SRM v2.2 to solve many of the issues we see today (ACLs, consistency, stability) • VOMS deployment and integration (across all Grids)

  40. Issues to discuss and conclusions

  41. OSG-specific issues 1/2 • Local Replica Catalog • I do not like the lrcs_conf.py approach • The LRC is currently not part of DQ2 • Lots of confusion during DQ2 installations • Since OSG packaging did install the LRC as if it were part of DQ2 • Now OSG appear to have the motivation to re-develop the LRC (we received complaints about security model + LFC experience so far) • Assume problems will be sorted out soon with additional development of this component… • Packaging: DQ2 uses standard python-distutils • For LCG we adopted simple shell scripts which build and install the s/w, wrapping python distutils tarballs • OSG builds on top of the LCG shell scripts (??) • Why not use distutils directly??
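
Since the slide asks why not use distutils directly, here is the kind of minimal setup.py that plain python-distutils packaging relies on; the package name, version and install prefix below are placeholders, not the actual DQ2 layout.

    from distutils.core import setup

    setup(
        name='dq2-example-component',          # placeholder, not a real DQ2 package name
        version='0.0.1',
        description='Illustrative DQ2-style component packaged with plain distutils',
        packages=['dq2example'],               # hypothetical python package directory
    )

    # Building a source tarball, or installing it, is then just:
    #   python setup.py sdist
    #   python setup.py install --prefix=/opt/dq2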

  42. OSG-specific issues 2/2 • Rotation of storage areas • Tiers of ATLAS • Motivation for publicly visible storage area: avoid contacting another site to find replicas by knowing its base path directly • Control of storage areas useful • Renaming storages, etc • But not necessarily centralized control… • So, consider reviewing public ToA SRM entry • FTS • Most developments at this point focus on SRM • Later on service discovery, channel management • Wildcard channel approach ‘survivable’ for now? • SRM? Do not use it if you don’t have to (yet?) • But probably OSG will require it at some point • Trade-off between more mature s/w (in the future) or operational experience today

  43. ATLAS-wide issues 1/3 • Migration to 0.3 catalogs: when? • Proposal: Alexei decides when he is more or less confident with the site services behavior • Will need an “order of magnitude” more reliability on the site services first • But in the meantime set up a separate 0.3 catalogs instance, with occasional migration of the production catalogs • So that Torre (Dataset browser) and the Panda developers can start coding against the new HTTP API • Famous last words: to be backward compatible from now on… • Please make suggestions now (or as soon as you start using it) for 0.3, until we get it exactly right (e.g. bulk methods, new calls, etc.) before deployment

  44. ATLAS-wide issues 2/3 • Convergence of dq2_get, dq2_register, dq2_put, etc • US ATLAS contribution (Tadashi), except for dq2_register • Would need a similar contribution for LCG-specific issues… • End-user support: • Model so far: assumed DQ2 would distribute physics datasets (most notably AODs) to all sites and then a local POSIX-like interface would be used to access data • e.g. xrootd

  45. ATLAS-wide issues 3/3 • The question is: • What about storing user data? • Continue with dq2_put approach? • New set of dataset catalogs for end-users? • Review location catalog • Centralized replica location? • I prefer centralized monitoring service which has knowledge of replicas • Accounting and quota management • Depend on scanning LRC? • All to be discussed very soon, as part of DDM discussion and architecture guides • Where you will be asked to participate…

  46. Conclusion • Important improvements were made recently in the usage of external dependencies • (even more since the s/w week: now included in 0.2.12) • The manpower situation will improve • Bringing the project to a new level of development • Central dataset catalogs - the core ATLAS development - in very good shape • Confident in the success of the SC4 T0->T1 transfers; the situation on BNLDISK/TAPE should be considerably better now • SC4 tests are important as they stress the system, showing many bottlenecks that Panda and others report as well • So far, the commitments of the ATLAS DQ2 team have been fulfilled successfully, even if sometimes painfully: • “ATLAS copies its first PetaByte out of CERN” • http://cdsweb.cern.ch/search.py?recid=978808&ln=en

  47. (backup) Requests during ATLAS s/w week

  48. Requests during s/w week (so far) • Investigate slowness • “””Slow dataset subscription (20 and more hours between the moment of the dataset subscription request and the first action to process the request)””” • Better monitoring and alarm system • “””Obscure subscription history, lack of monitoring and control””” • “””Inconsistencies in DDM monitoring“”” • Log file improvements and log changes to running DQ2 site services • “””Very rare 100% data replication, no way for automatic check, control and retransfer””” • “””Capability to provide detailed information about when and who has issued a subscription request””” • “””Better insight into the state of subscription transfers (FTS queues)””” • “””Sending info about long waiting unsuccessful/failing subscriptions by mail“”” • Allow for more local customization of DQ2 • “””DDM/DQ2 SW and Computing Facilities interference (DQ2/Storage and DQ2/FTS interference: data replication requests from PIC and ASGC blocked the BNL storage system; also LYON was blocked by T2 data transfer)””” • “””0 length files on CASTOR, files overwrite on dCache”””

  49. Requests during s/w week (so far) • More stable and decoupled site services: • “””keeping FTS/LFC/etc busy and blocking data transfer to the particular site“”” • More information on central catalogs: • “””Dataset location, though there are no data on site or data transfer failed, site is shown as dataset replicas holder””” • Security • “””Proposed is to separate production from general user areas””” • On our task list, but VOMS/SRM support is insufficient (and likely to remain so for a long time): the alternative is to define a new group mapped to a set of DNs (at the SE level) - to be followed up with all Tiers • More data flows • “””Capability of handling Tier2 subscriptions of data not in the home cloud””” • Mostly a decision to take with the FTS server provider and the concerned sites
