‘Real World’ issues from DC04

  1. ‘Real World’ issues from DC04
     • DC04:
       • Trying to operate the CMS computing system at 25 Hz for one month
       • We are three days in!
       • We are using components that are ready NOW
         • Even if it’s not politically correct
       • Often using several different approaches for comparison
     • This talk: concentrates on data management issues
       • ‘Real World’ issues that have come up during DC04 preparation
       • Stuff that is not (yet) well covered by the available tools
     • I know that…
       • Some issues may be application problems, not middleware ones
       • Some issues may be covered by components under development
       • Some issues may be self-inflicted injuries
     Dave Newbold, University of Bristol
     GridPP Middleware Meeting

  2. Directed data transfer
     • Data management ‘type I’: replica management
       • The (automatic?) movement of data products to where they are needed; managing relevant system and application metadata
       • Best-effort optimisation of data location in response to dynamic workload needs
       • Well covered by current and future middleware
     • Data transfer ‘type II’: bulk data management
       • The predictable straight(ish) ‘production line’ of data flow
         • Detector -> DAQ -> Buffer -> Reco farm -> T1 -> MSS -> calib -> …
       • Requirements are different from those of replica management
         • Robustness and reliability are paramount (raw data is the ‘crown jewels’)
         • Throughput is very important: ‘best effort’ is not good enough
       • Not explicitly addressed by current middleware products
       • Data distribution is explicitly ‘directed’ by policy
         • ‘Seeds’ the replica management system from the Tier-1s

  3. Directed data transfer
     • Our current solution
       • Cooperating system of simple ‘agents’ at Tier-0 and Tier-1
       • They communicate only through a shared (Oracle) DB
       • They have little or no state: it’s all held in the central DB
       • Could this be useful as generic middleware?
     • Other related issues:
       • Lack of a single consistent interface to MSS (in Europe and US) makes life difficult (being addressed?)
       • There are very many failure modes in the data management system that we must think of…
       • Would be good to factorise out the problems of failing storage components by having the MSS ‘remap’ our data when required
         • Predict at least one disk failure per day somewhere in DC04
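
The agent pattern on this slide (stateless agents that coordinate only through a central DB) might look roughly like the sketch below. It is an illustration under stated assumptions, not the DC04 implementation: `sqlite3` stands in for the shared Oracle DB, and the `transfers` table, its columns, the state names and the `copy_file` callback are all hypothetical.

```python
import sqlite3


def make_db():
    """Create the shared state table (sqlite3 stands in for the central Oracle DB)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE transfers ("
        " guid TEXT PRIMARY KEY,"
        " destination TEXT NOT NULL,"
        " state TEXT NOT NULL)"
    )
    return db


def tier0_publish(db, guid, destination):
    """Tier-0 agent step: announce a file that policy directs to one Tier-1."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO transfers VALUES (?, ?, 'available')",
            (guid, destination),
        )


def tier1_step(db, site, copy_file):
    """Tier-1 agent step: find pending work for this site using the DB alone.

    The agent keeps nothing in memory between calls, so it can be killed
    and restarted at any time and resumes from what the DB records.
    copy_file is a hypothetical callback returning True on success.
    """
    rows = db.execute(
        "SELECT guid FROM transfers"
        " WHERE destination = ? AND state = 'available' ORDER BY guid",
        (site,),
    ).fetchall()
    done = []
    for (guid,) in rows:
        ok = copy_file(guid)
        with db:
            # A failed copy stays 'available' and is retried on the next step.
            db.execute(
                "UPDATE transfers SET state = ? WHERE guid = ?",
                ("transferred" if ok else "available", guid),
            )
        if ok:
            done.append(guid)
    return done
```

Because every decision is read from and written back to the table, a failed transfer needs no special recovery path: the row is simply still pending the next time any agent for that site runs.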

  4. Data transfer tools
     • Need low-level transfer tools that:
       • Log what is going on! (We have ad-hoc solutions here for DC04)
       • Adjust policy automatically for optimum throughput according to network conditions
       • Fail gracefully when something is wrong at an end-point
       • Play nice with firewalls, etc.
       • NB: performance is not currently the problem, but the tools are…
     • Checksumming
       • We would like a system that performs fast file-level checksum of data ON THE DISK
       • No, TCP checksum does not catch all errors
         • Silent disk problems, filesystem errors, NFS problems, etc.
       • Checksumming data from MSS after-the-fact is very difficult
     • Would also like:
       • Some SIMPLE means of distributed, authenticated, atomic, reliable message-passing between agents over the Grid
       • With a command-line level API for scripting
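
As an illustration of the file-level checksum asked for above, a minimal sketch follows. The function names and the choice of SHA-256 are assumptions made for the example, not part of the DC04 tooling; a faster, weaker algorithm could be swapped in where throughput matters more.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Checksum a file by reading it back in fixed-size chunks.

    Reading in chunks keeps memory use constant however large the file is.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source_checksum, dest_path, algorithm="sha256"):
    """Return True if the file at the destination matches the source checksum."""
    return file_checksum(dest_path, algorithm) == source_checksum
```

One caveat for the ON THE DISK requirement: a file read back immediately after it was written may be served from the operating system's cache rather than from the disk, so a check like this is more convincing when run later or from a different host.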

  5. Other issues…
     • Small files!
       • They seem to be inevitable, but play havoc with efficiency:
         • Huge lists of files in catalogues
         • Not dealt with efficiently by MSS, transfer tools, etc.
       • Basic unit of information management: data produced by one MC, reco, filter job during its run (with unique GUID)
       • Do not want to make jobs too long… (too much state in the system)
       • Can aggregation help? Perhaps, but we need the tools
     • Metadata
       • Currently a ‘hot topic’?
       • How to handle efficient distribution of system- and user-level metadata?
       • Which metadata are immutable after creation? Which need to be distributed widely? How to handle schema extension on per-user basis?
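
One way the aggregation question could be approached is to group small files into larger bundles before cataloguing and transfer. The sketch below is only a planning step under assumed inputs: `plan_bundles`, the `(guid, size)` tuples and the target size are hypothetical, and it leaves open how bundles would be written, catalogued or unpacked.

```python
def plan_bundles(files, target_bytes):
    """Greedily group (guid, size) pairs, in order, into bundles.

    A bundle is closed as soon as adding the next file would push it past
    target_bytes. A single file larger than the target gets its own bundle.
    """
    bundles, current, current_size = [], [], 0
    for guid, size in files:
        if current and current_size + size > target_bytes:
            bundles.append(current)
            current, current_size = [], 0
        current.append(guid)
        current_size += size
    if current:
        bundles.append(current)
    return bundles
```

Keeping the per-job GUIDs inside each bundle preserves the basic unit of information management described on the slide, while the catalogue and MSS would only need to track the bundles.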
