PhEDEx: a novel approach to robust Grid data management

Presentation Transcript


  1. PhEDEx: a novel approach to robust Grid data management
  Tim Barrass, Dave Newbold and Lassi Tuura
  All Hands Meeting, Nottingham, UK, 22 September 2005

  2. What is PhEDEx?
  • A data distribution management system
    • Used by the Compact Muon Solenoid (CMS) High Energy Physics (HEP) experiment at CERN, Geneva
  • Blends traditional HEP data distribution practice with more recent technologies
    • Grid and peer-to-peer file sharing
  • Scalable infrastructure for managing dataset replication
    • Automates low-level activity
    • Lets managers work with high-level dataset concepts rather than low-level file operations
  • Technology agnostic
    • Overlies Grid components
    • Currently couples LCG, OSG, NorduGrid and standalone sites
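To make the dataset-level view concrete, here is a minimal sketch in Python (invented names, not the real PhEDEx schema or database): a manager subscribes a whole dataset to a site, and the system expands that subscription into the per-file transfer tasks it then drives automatically.

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """A named collection of files: the unit a manager reasons about."""
    name: str
    files: list[str] = field(default_factory=list)

@dataclass
class Subscription:
    """A request that a whole dataset be replicated to a site."""
    dataset: Dataset
    destination: str

def expand(subscription: Subscription) -> list[tuple[str, str]]:
    """Turn one dataset-level subscription into low-level (file, destination)
    transfer tasks; the real system records these in a central database."""
    return [(f, subscription.destination) for f in subscription.dataset.files]

# Hypothetical dataset and site names, for illustration only.
raw = Dataset("CMS/Raw/2005/RunX", files=["run_x_001.root", "run_x_002.root"])
print(expand(Subscription(raw, destination="T1_RAL")))
```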

  3. The HEP environment
  • HEP collaborations are quite large
    • Of the order of 1000 collaborators, globally distributed
    • CMS is only one of four Large Hadron Collider (LHC) experiments being built at CERN
  • Resources are typically globally distributed
    • Organised in tiers of decreasing capacity
    • Tier 0: the detector facility
    • Tier 1: large regional centres
    • Tier 2+: smaller sites: universities, groups, individuals…
    • Raw data is partitioned between sites; highly processed, ready-for-analysis data is available everywhere
  • LHC computing demands are large
    • Of the order of 10 petabytes per year created for CMS alone
    • A similar order simulated
    • Also analysis and user data

  4. CMS distribution use cases
  • Two principal use cases: push and pull of data
    • Raw data is pushed onto the regional centres
    • Simulated and analysis data is pulled to a subscribing site
  • Actual transfers are third party: what matters is the handshake between active components, not push or pull
    • End-to-end, multi-hop transfer state is maintained
    • Online buffers at the detector can only be cleaned when the data is safe at a Tier 1
  • Policy must be used to resolve these two use cases
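The end-to-end, multi-hop bookkeeping can be pictured with the sketch below (hypothetical states and site names; PhEDEx keeps this state in its central database rather than in objects like these). Each hop of a file's route carries its own state, and the detector's online buffer may only be cleaned once the file is recorded as safe at a Tier 1.

```python
# Per-hop states for a multi-hop transfer (illustrative only).
PENDING, IN_TRANSIT, DONE = "pending", "in_transit", "done"

class FileRoute:
    """Tracks one file across a chain of third-party transfers, e.g. T0 -> T1 -> T2."""
    def __init__(self, filename, hops):
        self.filename = filename
        # Each hop is [source, destination, state].
        self.hops = [[src, dst, PENDING] for src, dst in hops]

    def mark_done(self, src, dst):
        """Record that one hop's third-party transfer completed."""
        for hop in self.hops:
            if hop[0] == src and hop[1] == dst:
                hop[2] = DONE

    def safe_at_tier1(self):
        """Only when the detector -> Tier 1 hop is done may the online buffer be cleaned."""
        return self.hops[0][2] == DONE

route = FileRoute("run_x_001.root", [("T0_CERN", "T1_RAL"), ("T1_RAL", "T2_Bristol")])
route.mark_done("T0_CERN", "T1_RAL")
print(route.safe_at_tier1())   # True: the online buffer copy can now be cleared
```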

  5. PhEDEx design
  • Assume every operation is going to fail!
  • Keep complex functionality in discrete agents
    • Handover between agents is minimal
    • Agents are persistent, autonomous, stateless and distributed
  • System state is maintained using a modified blackboard architecture
  • Layered abstractions make the system robust
  • Keep local information local where possible
    • Enables site administrators to maintain local infrastructure
    • Robust in the face of most local changes
    • Deletion and accidental loss require attention
  • Draws inspiration from agent systems, "autonomic" computing and peer-to-peer computing
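A minimal sketch of the agent-plus-blackboard idea, assuming a toy in-memory blackboard (in the real system the blackboard is a central database shared by long-lived agent daemons, and the state names are illustrative): the agent holds no state of its own, claims work in its input state, performs one small step, and advances the state on the blackboard, so a crash or restart loses nothing that cannot simply be retried.

```python
import time

# The "blackboard": shared state that all agents read and write.
# A dict stands in for the central database in this sketch.
blackboard = {
    "file_a": "allocated",     # ready to be transferred
    "file_b": "transferred",   # waiting for some other agent to validate it
}

def transfer(filename):
    """Stand-in for the real work (a third-party copy that may well fail)."""
    return True

def transfer_agent():
    """Stateless agent: claim entries in the input state, do one step, advance
    the state on the blackboard.  Assume every operation can fail; on failure
    the entry simply stays where it is and is retried on the next pass."""
    for filename, state in list(blackboard.items()):
        if state == "allocated":
            if transfer(filename):
                blackboard[filename] = "transferred"

for _ in range(3):             # in reality the agent runs as a persistent daemon
    transfer_agent()
    time.sleep(0.1)
print(blackboard)
```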

  6. Transfer workflow overview

  7. Production performance

  8. Service challenge performance

  9. Future directions
  • Contractual file routing
    • Cost-based offers for a given transfer
  • Peer-to-peer data location
    • Using Kademlia to partition replica location information
  • Semi-autonomy
    • Agents are governed by many small tuning parameters
    • Should they self-modify, or use more intelligent protocols?
  • Advanced policies for priority conflict resolution
    • Need to ensure that raw data is always flowing
    • A difficult real-time scheduling problem
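Contractual routing might look roughly like the sketch below (purely illustrative; the cost model and names are invented, not a PhEDEx interface): candidate sources make cost-based offers for a requested transfer, and the router accepts the cheapest offer, which the source then contracts to honour.

```python
from dataclasses import dataclass

@dataclass
class Offer:
    """A source's bid to deliver a file: an estimated cost it contracts to meet
    (e.g. seconds, or a blend of queue depth and link quality; assumed here)."""
    source: str
    estimated_cost: float

def choose_route(offers: list[Offer]) -> Offer:
    """Accept the cheapest offer; the chosen source is then held to its contract."""
    return min(offers, key=lambda o: o.estimated_cost)

offers = [Offer("T1_RAL", 120.0), Offer("T1_FNAL", 300.0), Offer("T2_Bristol", 90.0)]
print(choose_route(offers))    # Offer(source='T2_Bristol', estimated_cost=90.0)
```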

  10. Summary
  • PhEDEx enables dataset-level replication for the CMS HEP experiment
    • Currently manages 200+ TB of data, globally distributed
    • Real-life performance of 1 TB per day sustained per site
    • Challenge performance of over 10 TB per day
  • Not CMS-specific, or indeed HEP-specific
  • Well placed to meet future challenges
    • Ramping up to O(10) PB per year
    • 10-100 TB per day
    • Data starts flowing for real in the next two years
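As a rough consistency check of those rates (assuming 1 PB = 1000 TB), 10 PB per year averages out to a few tens of TB per day, in line with the quoted 10-100 TB per day band:

```python
# Back-of-envelope check of the quoted rates (assumption: 1 PB = 1000 TB).
annual_volume_tb = 10 * 1000              # O(10) PB per year
average_daily_tb = annual_volume_tb / 365
print(f"{average_daily_tb:.0f} TB/day")   # ~27 TB/day average, before peaks
```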

  11. Extra information
  • PhEDEx and CMS
    • http://cms-project-phedex.web.cern.ch/cms-project-phedex/
    • cms-phedex-developers@cern.ch : feel free to subscribe!
  • CMS computing model: http://www.gridpp.ac.uk/eb/ComputingModels/cms_computing_model.pdf
  • Agent frameworks
    • JADE: http://jade.tilab.com/
    • DiaMONDs: http://diamonds.cacr.caltech.edu/
    • FIPA: http://www.fipa.org
  • Peer-to-peer
    • Kademlia: http://citeseer.ist.psu.edu/529075.html
    • Kenosis: http://sourceforge.net/projects/kenosis
  • Autonomic computing
    • http://www.research.ibm.com/autonomic/
  • General agents and blackboards
    • Where should complexity go? http://www.cs.bath.ac.uk/~jjb/ftp/wrac01.pdf
    • Agents and blackboards: http://dancorkill.home.comcast.net/pubs/

  12. Issues
  • Most issues are fabric-related
    • Most low-level components are experimental or not production-hardened
    • Tools are typically unreliable under load
  • Mass storage system (MSS) access is a serious handicap
    • PhEDEx plays very fair, keeping within request limits and ordering requests by tape when possible
  • The main problem is keeping in touch with the O(3) people at each site involved in deploying fabric, administration etc.
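One way of "playing fair" with mass storage is sketched below (illustrative only; the grouping key and request limit are assumptions, not the actual PhEDEx tape logic): pending files are grouped by the tape they sit on and requested tape by tape, staying within the site's request limit so each tape is mounted as few times as possible.

```python
from collections import defaultdict
from itertools import islice

def order_stage_requests(pending, request_limit):
    """Group pending files by the tape they live on and emit stage requests
    tape by tape, never exceeding the site's request limit.
    'pending' maps filename -> tape label (both assumed for this sketch)."""
    by_tape = defaultdict(list)
    for filename, tape in pending.items():
        by_tape[tape].append(filename)
    # Flatten tape by tape so the MSS mounts each tape only once.
    ordered = [f for tape in sorted(by_tape) for f in by_tape[tape]]
    return list(islice(ordered, request_limit))

pending = {"a.root": "VOL001", "b.root": "VOL002", "c.root": "VOL001"}
print(order_stage_requests(pending, request_limit=2))   # ['a.root', 'c.root']
```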

  13. Deployment
  • 8 regional centres, 16 smaller sites
  • 110 TB, replicated roughly twice
  • 1 TB per day sustained
  • Over the standard Internet

  14. Testing and scalability

  15. PhEDEx architecture
