
Experience with large-scale distributed data management and metadata

This talk explores the intersection of distributed data management and metadata. It focuses on design considerations, infrastructure, and lessons learned from the ATLAS experience. The goal is to provide information and expertise that can be applied to the LBNE collaboration.



Presentation Transcript


1. Experience with large-scale distributed data management and metadata
David Malon*
LBNE Software and Computing Meeting
14 November 2013
* malon@anl.gov

2. Disclaimer
• When I first saw my name on a draft agenda for these meetings, the topic proposed by Maxim was something about ATLAS experience with metadata
• A couple of days ago it had been transformed into something about large-scale distributed data management and metadata
• There exist metadata that have little to do with distributed data management (and vice versa, mutatis mutandis), but the new title clearly reflects one of the near-term priorities as presented to DOE
• I will say a little about both, and about where these topics overlap, with perhaps a bit more on metadata
• Difficult to decide what to omit
• Will focus on design and infrastructure considerations, on what ATLAS is doing and a bit about why, and on lessons learned from the ATLAS experience
• Relax; it is not my intention to try to convince anyone to use an ATLAS product or even to do what ATLAS is doing, but simply to provide information and some personal perspective
• And what ATLAS is doing is in flux, as developments for LHC Run 2 are in progress
• But I hope experience and expertise gained in ATLAS will be of use to the LBNE collaboration
David Malon, LBNE Software and Computing Meeting

3. Not an outline
• What this talk is not:
• Not an inventory of metadata
• Not even a taxonomy
• Certainly not about all information that could possibly be viewed as metadata
• Not a description of specific metadata infrastructure components
• Instead, a brief and very uneven sampling of principally physics-related metadata and metadata flow, as seen from the point of view of what happens just beneath the surface, even in something as rudimentary as execution of a single task reading one dataset as input and writing another as output
• More unbalanced than I had planned, due to an eldercare medical emergency last night and this morning, but please bear with me …
• Along the way, something about the tension between logical models and physical deployment, and how the difference is addressed
• And why it might matter
• And about physics and production and data management metadata, where they connect, and where they can be factorized
• And also along the way, something about the principles that have guided ATLAS decisions about metadata handling
David Malon, LBNE Software and Computing Meeting

4. Metadata are pervasive in workflows
• Metadata are used to discover, identify, and select data to be processed
• Metadata are used in data-quality-based decision-making about what to process
• Metadata are used to understand how data have been processed, and other aspects of data provenance
• Metadata are used to configure tasks appropriately to process data
• Metadata are used to understand what auxiliary data are needed for processing
• Metadata are used in job-level configuration and initialization
• Metadata may be accessed during job execution (on input file boundaries, or because temporal conditions change, or …)
• Metadata may be propagated from input to output
• Metadata may be generated by an executing job
• Metadata at job and task and file levels may be returned to a metadata service
And that's "just" the physics metadata!
• At a technical level, one could give a detailed talk on each of these
• Don't worry; I won't.
• Metadata are also used to decide where to run jobs and sometimes when, and how to partition them (e.g., N input files per job or 1/Nth of an input file per job), and more (a small partitioning sketch follows below)
David Malon, LBNE Software and Computing Meeting
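To make that last point concrete, here is a minimal sketch of metadata-driven job partitioning, assuming each input file carries a hypothetical nevents attribute; the file names and target size are invented for illustration, and this is not ATLAS production code:

def partition_by_events(files, target_events_per_job):
    # Greedily group input files so that each job sees roughly the target
    # number of events, using only file-level event-count metadata.
    jobs, current, count = [], [], 0
    for f in files:
        current.append(f["lfn"])
        count += f["nevents"]
        if count >= target_events_per_job:
            jobs.append(current)
            current, count = [], 0
    if current:
        jobs.append(current)
    return jobs

# Example: three jobs of roughly 10000 events each from six input files.
catalog = [{"lfn": f"file{i}", "nevents": 5000} for i in range(6)]
print(partition_by_events(catalog, target_events_per_job=10000))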

5. And more …
• Metadata are integral to robustness and data integrity
• Even "simple" event counting is fundamental:
• At the end of all my clever and globally distributed processing, did I end up with as many output events as input events?
• And it's harder when one is filtering (a small accounting sketch follows below)
• And understanding provenance is fundamental as well
David Malon, LBNE Software and Computing Meeting
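A minimal sketch of the accounting idea, assuming each job reports hypothetical events_in, events_selected, and events_out counters back to some metadata service; this is not the ATLAS bookkeeping schema:

def check_event_accounting(job_reports):
    # Verify that every input event is accounted for: it was either written
    # to an output, or explicitly rejected by a filter. Anything else means
    # events were silently lost somewhere in the distributed processing.
    total_in = sum(r["events_in"] for r in job_reports)
    total_out = sum(r["events_out"] for r in job_reports)
    total_rejected = sum(r["events_in"] - r["events_selected"] for r in job_reports)
    if total_out != total_in - total_rejected:
        raise RuntimeError(f"event accounting failure: {total_in} in, "
                           f"{total_out} out, {total_rejected} rejected")
    return total_in, total_out, total_rejected

# Example: a filtering task that read 2000 events and kept 150 of them.
reports = [{"events_in": 1000, "events_selected": 80, "events_out": 80},
           {"events_in": 1000, "events_selected": 70, "events_out": 70}]
print(check_event_accounting(reports))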

6. Backing up a bit: Logical versus physical constructs
• Examples of logical, semantic units of data organization:
• Data-taking runs
• Contiguous temporal segments within a run, known as luminosity blocks
• The set of events that pass a particular trigger or suite of triggers
• The Monte Carlo sample corresponding to certain generator settings and a given detector geometry
• Files, on the other hand, are artifacts of physical storage organization
• Physicists will usually process a "dataset" in the semantic sense above, and whether it comprises 4 files or 400 files is a storage detail
• And files are too often too small or too large (for consumption by a downstream job, or for storage systems, e.g., because of slow simulation times compared to event generation or reconstruction or analysis or …)
• And they are very often merged and sometimes split
• That doesn't mean they are not important.
David Malon, LBNE Software and Computing Meeting

7. Logical versus physical constructs
• If files are artifacts of storage organization, should there be any physics metadata about files?
• (Obviously), no physics results (or physics metadata) should depend upon storage organization
• In most cases, most physics metadata about a file are really metadata about the collection of events within the file, or about a larger collection of events to which the events in the file belong
• This distinction might seem pedantic, but it turns out to be significant in a framework in which the I/O model is to process collections of events whether they are stored contiguously within a file or scattered across many files
• ATLAS I/O, metadata, and event store navigation infrastructure support quite general notions of event collections (the set of events that pass a selection query to an event-level metadata database is valid input and can produce correct output metadata); a small illustration follows below
• The ATLAS data management and production systems are currently much more file-set-oriented
• For now, anyway: with the event server prototyping work underway, this may change
David Malon, LBNE Software and Computing Meeting
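As an illustration of the distinction, a toy model of an event collection (hypothetical types, not the ATLAS event collection or navigation infrastructure): the collection is a purely logical list of event references, and physical file boundaries reappear only when one asks which containers must be opened:

from dataclasses import dataclass

@dataclass(frozen=True)
class EventRef:
    file_guid: str   # identifier of the physical container holding the event
    entry: int       # entry number of the event within that container

# A logical collection of events, e.g. the result of an event-level metadata
# query; whether its members live in one file or many is invisible here.
collection = [EventRef("guid-A", 17), EventRef("guid-A", 42), EventRef("guid-B", 3)]

def files_needed(events):
    # The only point at which physical storage re-enters the picture.
    return {e.file_guid for e in events}

print(files_needed(collection))   # {'guid-A', 'guid-B'}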

8. Are datasets logical or physical constructs? (Does it matter?)
• Datasets as commonly defined in HEP experiments provide something closer to a conceptual organization, but …
• Even here, the physical mapping implemented by ATLAS and other experiments corresponds to exactly one logical view
• In practice, in most experiments' data management systems, a dataset is, either explicitly or effectively, a collection of files
• Example: the file set containing Analysis Object Data (AOD) for a single run, trigger stream, and processing is a dataset in this data management sense
• One could imagine, though, that the subset of events within this dataset that pass a specific set of triggers is conceptually every bit as legitimate a candidate for "dataset" status
• And the physics metadata and supporting infrastructure requirements would not be different just because such a dataset is not instantiated as a file set
• And the fact that these events are mixed with events that passed other triggers written to the same stream is an artifact of storage
David Malon, LBNE Software and Computing Meeting

9. Physics, production, and data management metadata
• It is possible to factorize data management and production metadata from physics metadata
• A data management system might need to know the file size, creation date, checksum, possibly a GUID, which files comprise a dataset, and where a logical file's replicas are
• But it should not need to know about physics, or perhaps even that a file contains events as opposed to, say, genome data
• This allows one to factorize physics metadata infrastructure from data management infrastructure
• And, in principle, to use a generic data management system
• And do you really want your data management system to manage cross-sections and K-factors?
• There may be cases in which some limited semantic information is useful to production systems
• Event count, for example, can be useful for job partitioning
• One possibility is to use a "generic" data management system that also supports extensible {key, value} storage of domain-specific attributes (a sketch of the idea follows below)
• The next generation of the ATLAS distributed data management system does this
David Malon, LBNE Software and Computing Meeting
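A minimal sketch of that factorization, with invented record layouts (this is not the ATLAS or Rucio schema): the data management record carries only generic attributes, while physics attributes live in an extensible {key, value} side store that the data management system never needs to interpret:

# Generic data-management view of a dataset: nothing physics-specific.
dm_record = {
    "name": "some.dataset.name",
    "files": [
        {"lfn": "file1", "size_bytes": 2_147_483_648, "checksum": "ad:92ba1c3f"},
        {"lfn": "file2", "size_bytes": 1_073_741_824, "checksum": "ad:0f4e77aa"},
    ],
}

# Domain-specific attributes kept in an extensible {key: value} store
# (keys and values below are invented for illustration).
physics_attributes = {
    "some.dataset.name": {
        "nevents": 250000,
        "cross_section_pb": 1.23,
        "k_factor": 1.1,
        "geometry_tag": "GEO-21",
    },
}

def get_attribute(dataset_name, key, default=None):
    # Production and analysis tools look attributes up by key; unknown keys
    # fall back to a default rather than requiring a schema change.
    return physics_attributes.get(dataset_name, {}).get(key, default)

print(get_attribute("some.dataset.name", "nevents"))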

10. Aspects of provenance, and some of its uses
• Provenance is not just information about what was produced from what, and how, that must be preserved for data understanding; it is also useful and used downstream
• Provenance is in principle almost fractal in its complexity, but basic (transform-level, job-level) provenance is already useful
• In ATLAS production, basic information about how data were produced is encoded in a configuration tag that is recorded both in an ATLAS metadata database and in the output event data products themselves
• Example of a configuration tag: r2713_p705, which encodes information about
• which ATLAS releases (17.0.3.3)
• which database releases (16.9.1.1)
• which transforms (reco_trf.py)
• which job configurations
• …
• … were employed in the (two) processing steps used to produce the data from the original raw data input
• Details corresponding to these configuration tags are recorded in a metadata service
• A service is provided to decode them (which is what I used to fill in the release versions above); a toy decoding sketch follows below
David Malon, LBNE Software and Computing Meeting
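To illustrate the mechanism, a toy decoder; the real tags are resolved by querying an ATLAS metadata service, the table below is invented apart from the release, database release, and transform values quoted on the slide, and the assignment of details to steps is assumed for illustration:

# Hypothetical decode table: configuration tag -> recorded processing details.
CONFIG_TAG_DETAILS = {
    "r2713": {"step": "reconstruction", "release": "17.0.3.3",
              "db_release": "16.9.1.1", "transform": "reco_trf.py"},
    "p705":  {"step": "derivation"},   # further details omitted here
}

def decode(config_tag):
    # Split a compound tag such as 'r2713_p705' into its per-step records.
    return [CONFIG_TAG_DETAILS.get(part, {"step": "unknown", "tag": part})
            for part in config_tag.split("_")]

for step in decode("r2713_p705"):
    print(step)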

11. Aspects of provenance, and some of its uses • Importantly, ATLAS physicists can configure their own subsequent jobs equivalently to what was run in production simply by providing the relevant configuration tag as an input argument to their job scripts David Malon, LBNE Software and Computing Meeting

12. In-file metadata infrastructure
• Experiments often put metadata in event files, for several purposes:
• As a cache, to avoid needless remote database connections
• For bookkeeping and accounting (e.g., cut flows)
• …
• ATLAS employs a sophisticated in-file metadata infrastructure that goes well beyond machinery to allow storage of non-event data in event data files
• Currently incident-driven, as input file boundaries are encountered asynchronously with respect to state transitions of the event-processing framework, and are artificial from the point of view of physics metadata
• Note too that propagation of metadata from input to output may not simply be done blindly
• For example, when only selected events from an input file are read (a bookkeeping sketch follows below)
David Malon, LBNE Software and Computing Meeting
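A minimal sketch of why propagation cannot be blind, using a hypothetical in-file bookkeeping record: when a job reads only selected events from its input, the record written to the output must combine what the inputs carried with the selection applied by this job, not merely copy the input records forward:

def propagate_bookkeeping(input_records, events_read, events_written, filter_name):
    # Combine the cut-flow records carried by the input files ...
    combined_flow = [cut for rec in input_records for cut in rec["cut_flow"]]
    # ... then append the selection applied by the current job.
    combined_flow.append({"filter": filter_name,
                          "seen": events_read,
                          "passed": events_written})
    return {"events": events_written, "cut_flow": combined_flow}

# Example: the input already records a trigger skim; this job applies a further selection.
input_meta = [{"events": 1000,
               "cut_flow": [{"filter": "trigger_skim", "seen": 5000, "passed": 1000}]}]
print(propagate_bookkeeping(input_meta, events_read=1000,
                            events_written=120, filter_name="my_selection"))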

13. A principle and its converse
• If an event data file is unreachable or unreadable, it must be possible to ascertain what one is missing
• And, for example, to adjust cross-sections and other statistics accordingly
• Conversely, for routine use it should not be necessary to consult an external metadata source to understand a file's content
David Malon, LBNE Software and Computing Meeting

14. A metadata and data management aside: What's in a name?
• Nomenclature is one of those topics that few have the patience to worry about, but it is one of the front lines of metadata
• It is convenient if the name allows one to distinguish data from Monte Carlo, and raw data from analysis object data, and so on, but not all metadata can go into the name
• Conversely, omitting all semantic information and using a UUID as a name may be seen by a user community as less than helpful
• It is helpful to make choices early about what goes into the name, and where, what the name field size limits might be, and what information must instead be found via a metadata service
• It is helpful if tools enforce naming conventions and, for production data, if the name of an output data product can be generated/deduced from the input data and information about the task that will be run (a sketch follows below)
David Malon, LBNE Software and Computing Meeting
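A minimal sketch of name construction and enforcement under a hypothetical convention; the field order, separators, and length limit are invented for illustration and are not an ATLAS or LBNE standard:

import re

# Hypothetical convention: project.runNumber.stream.dataType.processingTag
NAME_RE = re.compile(r"^(?P<project>\w+)\.(?P<run>\d{8})\.(?P<stream>\w+)\."
                     r"(?P<dtype>\w+)\.(?P<ptag>\w+)$")
MAX_NAME_LENGTH = 120  # assumed catalog field size limit

def build_name(project, run, stream, dtype, ptag):
    # Deduce the output name from its metadata fields and enforce the convention.
    name = f"{project}.{run:08d}.{stream}.{dtype}.{ptag}"
    if len(name) > MAX_NAME_LENGTH or not NAME_RE.match(name):
        raise ValueError(f"name violates convention: {name}")
    return name

def parse_name(name):
    # Recover the semantic fields that the convention allows into the name.
    m = NAME_RE.match(name)
    return m.groupdict() if m else None

print(build_name("lbne", 123, "beam", "raw", "p001"))
print(parse_name("lbne.00000123.beam.raw.p001"))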

15. Data distribution
• The rudiments of data distribution historically have been:
• Subscription- or rule-based placement of standard data products, with
• Support for ad hoc user-level data transfers (fetches of single files or datasets)
• The former usually corresponds to pledged storage resources
• Carefully managed and monitored and accounted for
• The latter is often used for taking data home to locally owned facilities, or for code development or debugging, or …
• When pledged storage is a limited resource relative to data volumes, there are many policy decisions, some built into the computing model, some more dynamic
• How to partition data among sites, how many replicas of each data product variety, how long to keep certain data (e.g., for detector performance studies) on disk, …
• And underlying infrastructure to implement and support policy (a small rule sketch follows below)
• And committees. Let's not forget committees, to deal with storage resource requests for special purposes and exceptional requirements before conferences and …
• Original LHC data distribution models tended to be conservatively hierarchical, and placement a bit static (for good reasons at the time)
• The landscape is changing.
David Malon, LBNE Software and Computing Meeting
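A minimal sketch of how placement policy might be expressed as data; the rule structure and site names are hypothetical, not the ATLAS subscription/rule schema:

import fnmatch

# Each rule: which dataset names it covers, how many replicas, candidate
# sites, and how long replicas should be kept (None = custodial, keep forever).
placement_rules = [
    {"pattern": "raw.*",  "copies": 2, "sites": ["SITE_TAPE_1", "SITE_TAPE_2"], "lifetime_days": None},
    {"pattern": "reco.*", "copies": 2, "sites": ["SITE_DISK_1", "SITE_DISK_2"], "lifetime_days": 365},
    {"pattern": "user.*", "copies": 1, "sites": ["SCRATCH_DISK"],               "lifetime_days": 30},
]

def rules_for(dataset_name):
    # A policy engine would act on the matching rules; here we just select them.
    return [r for r in placement_rules if fnmatch.fnmatch(dataset_name, r["pattern"])]

print(rules_for("reco.00000123.beam"))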

16. Changing landscape of data distribution
• Appreciably less reliance on static data placement, except for custodial copies
• With monitoring, expect to have fewer statically placed replicas and instead to replicate "hot" data on demand, e.g., when a site that hosts a hot dataset is busy and another site is less so
• Wide-area data access has been possible in principle for many years but little used until recently
• Performance issues, protocol issues, firewall issues, catalog issues, concerns about degradation of performance of data-serving sites, …
• Now wide-area access is becoming something that experiments will count on (e.g., through federated xrootd)
• ATLAS production may, as a modest example, schedule jobs on sites that host the data, but if a given file cannot be read after a couple of retries, the options are no longer limited to retrying the job somewhere else or recopying the file: it can be read remotely if the job site and the hosting site are configured to allow this (a fallback sketch follows below)
• Some uses are of course more ambitious
David Malon, LBNE Software and Computing Meeting
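A minimal sketch of the retry-then-read-remotely behaviour, with hypothetical open_local and open_remote callables standing in for whatever I/O layer (for example, a federated xrootd client) actually performs the reads:

def open_with_fallback(local_path, remote_url, open_local, open_remote, retries=2):
    # Try the locally hosted replica a couple of times; if it remains
    # unreadable, read the file over the wide area network instead of
    # failing the job or waiting for the file to be recopied.
    for _ in range(retries):
        try:
            return open_local(local_path)
        except OSError:
            continue
    return open_remote(remote_url)

# Example with stand-in functions:
def open_local(path):
    raise OSError(f"cannot read {path}")   # simulate an unreadable local replica

def open_remote(url):
    return f"handle for remote read of {url}"

print(open_with_fallback("/data/file1", "root://remote.site//data/file1",
                         open_local, open_remote))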

17. Distributed data futures?
• ATLAS is prototyping an event server production architecture, in which work chunks (single events or lists of events) can be farmed out to opportunistically available compute resources (a toy sketch follows below)
• Including back-filling HPC resources
• Losses should be kept small if/when nodes fail or resources become rather suddenly unavailable
• Needs sub-file data access granularity; greatly aided by the availability of remote read capabilities ("process this event from that file over there; here's another")
• Architecturally similar to a wide-area version of a framework for multicore event processing
• Getting the metadata right in such scenarios is tricky.
• When it works, it also opens the door to making logical event collections like "the set of all events in this good run list that satisfy one of these triggers" usable as first-class citizens, even though there is no preexisting "file set" corresponding to exactly these events and no others
• These are some of the things my group is working on.
David Malon, LBNE Software and Computing Meeting
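A toy sketch of the bookkeeping an event server needs (hypothetical classes, not the ATLAS prototype): work is handed out in small chunks of event references, and a chunk owned by a worker that disappears is simply re-queued, so the cost of losing a node is bounded by the chunk size:

from collections import deque

class EventServer:
    def __init__(self, event_refs, chunk_size=5):
        # Pending work is kept as small chunks of event references.
        self.pending = deque(event_refs[i:i + chunk_size]
                             for i in range(0, len(event_refs), chunk_size))
        self.in_flight = {}   # worker_id -> chunk currently being processed

    def next_chunk(self, worker_id):
        if not self.pending:
            return None
        chunk = self.pending.popleft()
        self.in_flight[worker_id] = chunk
        return chunk

    def acknowledge(self, worker_id):
        # The worker finished and reported its chunk; nothing to redo.
        self.in_flight.pop(worker_id, None)

    def worker_lost(self, worker_id):
        # Re-queue the lost worker's chunk; only this small unit is repeated.
        chunk = self.in_flight.pop(worker_id, None)
        if chunk is not None:
            self.pending.append(chunk)

# Example: events 0..19; a worker takes a chunk and then disappears.
server = EventServer(list(range(20)), chunk_size=5)
print(server.next_chunk("worker-1"))   # [0, 1, 2, 3, 4]
server.worker_lost("worker-1")         # chunk goes back into the queue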

18. Conclusions
• Lots of metadata topics and flavors were not addressed here
• Event-level metadata infrastructure and event indexing, for example
• Metadata and metadata infrastructure amount to far more (and are far more interesting!) than simply deciding which experiment-specific metadata fields should be added to a file or dataset catalog
• Their uses and implications are broad and deep
• Distributed data management is becoming both easier and harder (and more interesting!) as on-demand access and replication and global views of everywhere-accessible data at finer granularities emerge
• Forward-looking metadata and data management infrastructure may benefit from maintaining a clear logical (versus physical) view of data: it may begin by supporting current (and natural) implementations of datasets-as-file-sets, but it should anticipate the potential for greater generality
• And the underlying technologies (not just for wide-area data access, but also for non-relational storage engines) are evolving in interesting ways as well
• A very fertile area for near- and long-term work
David Malon, LBNE Software and Computing Meeting
