DQ2 discussion on future features

DQ2 discussion on future features, BNL workshop, October 4, 2007

DQ2 0.4.x
• Continue to optimize DB schema to cope with higher load
• Channel allocation to follow ‘Dataset Subscription policy’
• Hiro/Patrick also asking for a locally configurable, ordered list of preferred sources within a cloud
• implications on channel allocation
• How much to ‘prefer’ a T1 before going to a T2 for a replica? Right now, shortest queue wins…
• Distinguishing files unlikely to have replicas in the future (bad subscriptions)
• particularly in the local monitoring
• Removing ‘holes’ in system (growing backlogs)
• Reduce load (better GSI session reuse)
• Goal: O(100K) file transfers/day/site
• or SRM/storage limitations
• Need better understanding outside DQ2
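To make the "how much to prefer a T1" question concrete, here is a minimal sketch of source selection. It is not DQ2 code: `Source`, `pick_source`, `preferred_order` and `t1_bias` are hypothetical names, and the site names are placeholders. With `t1_bias=0` and no preferred list it reduces to today's "shortest queue wins".

```python
from dataclasses import dataclass


@dataclass
class Source:
    site: str          # hypothetical site name
    tier: int          # 1 for a T1, 2 for a T2
    queue_length: int  # files currently queued from this source


def pick_source(sources, preferred_order=(), t1_bias=0):
    """Pick a replica source (illustrative policy, not the DQ2 algorithm).

    t1_bias is how many queued files of "head start" a T1 gets over a T2.
    preferred_order is a locally configured list used only to break ties.
    """
    rank = {site: i for i, site in enumerate(preferred_order)}

    def cost(s):
        effective_queue = s.queue_length - (t1_bias if s.tier == 1 else 0)
        # Sites not in the preferred list sort after all listed sites.
        return (effective_queue, rank.get(s.site, len(rank)), s.site)

    return min(sources, key=cost)


sources = [Source("T1_SITE", 1, 120), Source("T2_SITE", 2, 100)]
print(pick_source(sources).site)              # shortest queue wins
print(pick_source(sources, t1_bias=50).site)  # T1 preferred up to 50 extra files
```

The bias expresses the preference as a single number per cloud, which keeps the effect on channel allocation easy to reason about.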

Local monitoring of site services

Staging…
• Did not recognize this was a problem for OSG
• … It is very hard to do with remote storage without SRM
• FTS 2 + SRMv2 move in the right direction but are not there yet
• Could do a local mechanism for T1->T2 transfers in the same cloud
• provided site services for T2 run “close” to the T1 storage
• … but not for cross-T1 transfers

Hierarchies: current thoughts, for discussion
• Hierarchical datasets would be a special kind of dataset
• These would have only 2 states: open and frozen
• These would not have versions
• The constituents of a hierarchical dataset could only be closed dataset versions or frozen datasets
• Not sure if the following commands should be provided explicitly:
• list files in hierarchical dataset directly?
• or only list datasets in hierarchical dataset, forcing the user to loop over results?
• subscribe open hierarchical dataset?
• or only allow listing datasets in open hierarchical dataset, forcing the user to manually subscribe sub-units
• point is: having to loop over OPEN hierarchies (likely manageable)
• locations of hierarchical dataset?
• or only allow listing locations of the individual datasets in the hierarchical dataset?
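The rules above can be sketched as a small data structure. This is only an illustration for discussion: `Constituent`, `HierarchicalDataset` and their methods are hypothetical names, not a proposed DQ2 API. `list_files` shows what a direct file listing would amount to, namely the loop over constituents that the user would otherwise write.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Constituent:
    name: str   # dataset name (or dataset version name)
    state: str  # "open", "closed" or "frozen"
    files: List[str] = field(default_factory=list)


class HierarchicalDataset:
    """Container with two states, open and frozen, and no versions."""

    def __init__(self, name):
        self.name = name
        self.state = "open"
        self._members = []

    def add(self, member):
        if self.state != "open":
            raise ValueError("hierarchical dataset is frozen")
        # Only closed dataset versions or frozen datasets are accepted.
        if member.state not in ("closed", "frozen"):
            raise ValueError("constituent must be closed or frozen")
        self._members.append(member)

    def freeze(self):
        self.state = "frozen"

    def list_datasets(self):
        return [m.name for m in self._members]

    def list_files(self):
        # Equivalent to the user looping over list_datasets() results.
        return [f for m in self._members for f in m.files]
```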

Merging
• Not much to do from DQ2 side here but provide an attribute for each dataset
• “merged” Y/N (or protocol: zip, tar?)
• DQ2 does 3rd party transfers only
• does not actually ‘see’ the data

Checksums
• Not much from DQ2 here but enforcing checksums in the central catalogues and its protocol
• ‘md5:’ for MD5
• adler32 is frequently discussed as a better checksum candidate
• but not relevant to DQ2, rather to the sites and production people
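A minimal sketch of prefixed checksum strings as a site or production job (not DQ2, which never sees the data) might compute them. Only the ‘md5:’ prefix comes from the slide; the ‘ad:’ prefix for adler32 and the function names are assumptions made for illustration.

```python
import hashlib
import zlib


def md5_checksum(data: bytes) -> str:
    return "md5:" + hashlib.md5(data).hexdigest()


def adler32_checksum(data: bytes) -> str:
    # "ad:" is an assumed prefix; mask keeps the value an unsigned 32-bit int.
    return "ad:%08x" % (zlib.adler32(data) & 0xFFFFFFFF)


def verify(data: bytes, recorded: str) -> bool:
    """Check data against a recorded '<algorithm>:<hex>' checksum string."""
    algorithm = recorded.split(":", 1)[0]
    if algorithm == "md5":
        return md5_checksum(data) == recorded
    if algorithm == "ad":
        return adler32_checksum(data) == recorded
    raise ValueError("unknown checksum prefix: %r" % algorithm)


print(md5_checksum(b""))      # md5:d41d8cd98f00b204e9800998ecf8427e
print(adler32_checksum(b""))  # ad:00000001
```

Carrying the algorithm in the prefix lets the catalogue enforce that a checksum is present without fixing which algorithm the sites use.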

Subscription lifetime
• Increasingly important…
• Would clean up what no one is cleaning up now… (some sites with O(100K) files in impossible situations)
• Discussion from yesterday:
• allow waitForSources to be set only by users with production role?
• avoid creating looping subscriptions in the system
• Forbid subscriptions for datasets with more than X files, unless a production user is requesting?
• Forbid more than Y subscriptions per user, unless production user?
• Ignore subscription, regardless of its state, after more than 3 months?
• Subscription is marked as broken
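The proposals above could be combined into one admission check plus an age check, roughly as follows. This is a sketch only: the values chosen for X and Y are placeholders (the slide leaves them open), "3 months" is approximated as 90 days, and all names are hypothetical.

```python
from datetime import datetime, timedelta

MAX_FILES = 10000                # "X": placeholder value, not decided
MAX_SUBSCRIPTIONS_PER_USER = 50  # "Y": placeholder value, not decided
MAX_AGE = timedelta(days=90)     # "3 months", approximated


def may_subscribe(is_production, dataset_files, user_subscriptions,
                  wait_for_sources=False):
    """Return True if a new subscription request should be accepted."""
    if is_production:
        return True
    if wait_for_sources:
        # waitForSources reserved for the production role
        return False
    if dataset_files > MAX_FILES:
        return False
    if user_subscriptions >= MAX_SUBSCRIPTIONS_PER_USER:
        return False
    return True


def is_broken(created, now):
    """A subscription older than MAX_AGE is marked broken, whatever its state."""
    return now - created > MAX_AGE
```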

Central catalogues
• [ as mentioned yesterday ]
• Main changes are:
• for scalability only…
• dropping VUIDs (becomes DUID + version number)
• DUID becomes timestamp-oriented UUID so that backend is partitioned in time
• and highly optimized UUID storage on ORACLE
• meaning shorter index
• ORACLE partitioning, redirect service…
• … but fully backward compatible with 0.3 clients
• Many queries become much faster
• list files in dataset is a query by DUID as opposed to a query by N VUIDs
• ORACLE IOTs guarantee that listing files from a dataset [version] reads close to sequential blocks on disk
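To illustrate why a timestamp-oriented UUID allows partitioning in time, here is a sketch using standard version 1 (time-based) UUIDs. Whether the new DUIDs use exactly this UUID variant is an assumption, and `new_duid`, `duid_timestamp` and `partition_key` are hypothetical names; the monthly partition granularity is also just an example.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUID timestamps count 100 ns intervals since 1582-10-15 (UTC).
UUID_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def new_duid():
    return uuid.uuid1()


def duid_timestamp(duid):
    """Recover the creation time embedded in a version 1 UUID."""
    return UUID_EPOCH + timedelta(microseconds=duid.time // 10)


def partition_key(duid):
    """Example partition key: the year and month the DUID was created."""
    return duid_timestamp(duid).strftime("%Y-%m")


duid = new_duid()
print(duid, partition_key(duid))
```

Because the creation time can be recovered from the identifier itself, a redirect service could route a lookup to the right partition without an extra index.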

Location catalogue
• [ as mentioned yesterday ]
• Location catalogue will be populated asynchronously with:
• information on missing files
• (re)marking complete/incomplete locations for existing datasets (consistency)
• Missing files are extra information made available on a ‘best-effort’ basis to the users
• derived from a request by Ganga
• This is populated by the ‘tracker’ service
• which was being reworked for the site services
• The tracker service is a ‘stronger’ Fetcher (as existing on the site services), used to find content on site vs. content missing on site, which is one of the site services' performance bottlenecks
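At its core, the information the tracker feeds into the location catalogue is a set comparison between dataset content and site content. A minimal sketch, with the hypothetical function name `track_location` and no claim about how the real tracker obtains either list:

```python
def track_location(dataset_files, files_on_site):
    """Compare dataset content with site content (illustrative only).

    Returns (complete, missing): whether the site holds the whole dataset,
    and the sorted list of files it lacks.
    """
    missing = sorted(set(dataset_files) - set(files_on_site))
    return (not missing, missing)


complete, missing = track_location(["f1", "f2", "f3"], ["f1", "f3", "other"])
print(complete, missing)  # False ['f2']
```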

Dashboard
• Relatively big update coming soon
• distinguish errors source/destination
• display messages on the dashboard for all sites
• alarms supported
• more overview of site services state from a central place
• e.g. states of files (based also on new site services monitoring)

ToA
• More and more info there…
• Blacklist/whitelist
• Preferred site connections
• This is a cache file, same style as ToA
• but independent file from ToA cache since it is more dynamic
• ToA renewal much stronger
• I’d claim it is the most reliable info system so far on the Grid :-)

Communication…
• … still not working:
• e.g. did not recognize staging as a problem
• e.g. 0.3.2 apparently not deployed on OSG T2s
• quite bad as 0.3.1 had a simple bug where agents could simply die whenever a glitch happened in the central catalogue connection
• glitches “common” with the central catalogue request rate, but harmless and ok to retry
• … what to do here?
• Jabber chatroom :-)
• ddmdev@conference.jabber.org
• ask me - msbranco@gmail.com or atlas-dq2-dev@cern.ch - to be authorized
