
The Data Carousel


simone

Presentation Transcript


  1. The Data Carousel
     • what problem it’s trying to solve
     • the data carousel and the grand challenge
     • the bits and pieces: how it all works
     • what’s ready now; what’s left to do

  2. the problem we’re facing
     • PHENIX program heavy in “ensemble” physics
     • typical day (or week) at the office: get lots of events, make foreground and background distributions, compare, improve code, repeat until published
     • needs to move lots of data very efficiently
     • needs to be comprehensible to PHENIX physicists
        • people are accustomed to “staging files”
     • needs to work with the CAS analysis architecture
        • lots of Linux boxes with 30 GB disk on each
        • main NFS server with 3 TB disk
     • solution: optimized batch file mover
        • similar to Fermilab data “freight train”
        • works with existing tools: HPSS, ssh, pftp, perl

  3. the carousel and the grand challenge
     • complementary tools for accessing event data
     • works at lower level of abstraction than GC
        • files, not objects
        • can work with non-event data files
     • important since it doesn’t take much to clog access to tapes
        • 11 MB/sec/drive in principle; 6 MB/sec/drive in practice
        • best case: Eagles take ~20 seconds to load and seek, so read ~100 MB files at random and you’ll see no better than 50% bandwidth
        • in MDC1 and MDC2, naive ftp only saw ~1 MB/sec effective bandwidth for reads
     • already works with disjoint staging areas
     • can, in principle, work over the WAN
     • doesn’t reorganize data, doesn’t provide event iterator, isn’t coupled to analysis code
        • good or bad, depends on what you’re expecting
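The 50% figure above can be checked with a little arithmetic. The sketch below is only an illustration, assuming a simple model in which every file read pays one fixed load-and-seek cost before streaming at the drive rate; the function name `effective_bandwidth` is made up for this example, and the numbers are the ones quoted on the slide.

```python
def effective_bandwidth(file_mb, stream_mb_per_s, overhead_s):
    """Effective MB/s when each file read pays a fixed load-and-seek cost.

    Assumed model (not from the slides): one mount + seek per file,
    followed by streaming the whole file at the drive rate.
    """
    transfer_s = file_mb / stream_mb_per_s
    return file_mb / (overhead_s + transfer_s)


FILE_MB = 100.0     # slide: ~100 MB files read at random
OVERHEAD_S = 20.0   # slide: ~20 seconds to load and seek (best case)

# 6 MB/s is the "in practice" rate, 11 MB/s the "in principle" rate.
for rate in (6.0, 11.0):
    eff = effective_bandwidth(FILE_MB, rate, OVERHEAD_S)
    print(f"{rate:4.1f} MB/s streaming -> {eff:.1f} MB/s effective "
          f"({eff / rate:.0%} of streaming rate)")
```

Under this model the drive delivers roughly 45% of the practical 6 MB/sec rate and about 31% of the nominal 11 MB/sec rate, consistent with "no better than 50%". The fraction only improves if several files are read per mount, which is the motivation for batching requests.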

  4. the bits and pieces
     • split-brain server
        • part which knows HPSS, part which knows PHENIX
     • HPSS batch queue (Jae Kerr, IBM)
        • optimizes tape mounts for a given set of file requests
        • once file is staged to cache, used NFS write to non-cache disk
        • modified to use pftp call-back (Tom Throwe, J.K.)
     • carousel server (J. Lauret, SUNYSB)
        • feeds sets of files to batch queue at measured pace
        • knows about groups, does group-level accounting
        • implements file retrieval policy
        • maintains all state info in external database
     • client side scripts
        • implements file deletion policy (defaults to LRU cache)
        • client side requirements are kept ALARA
        • ssh + .shosts, perl + few modules, pftp
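The idea of optimizing tape mounts for a set of file requests can be sketched as follows. This is not the actual HPSS batch queue code, whose algorithm the slides do not describe; it is a minimal illustration assuming each request carries a tape label and a position on that tape. The `Request` type, the tape labels, and the file paths are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Request(NamedTuple):
    path: str       # file path in HPSS (hypothetical examples below)
    tape: str       # hypothetical label of the tape holding the file
    position: int   # hypothetical position of the file on that tape


def order_by_tape(requests):
    """Group requests by tape, then sort each group by position,
    so each tape is mounted once and read in one forward pass."""
    by_tape = defaultdict(list)
    for req in requests:
        by_tape[req.tape].append(req)
    ordered = []
    for tape in sorted(by_tape):
        ordered.extend(sorted(by_tape[tape], key=lambda r: r.position))
    return ordered


def count_mounts(requests):
    """Count tape mounts if requests are served strictly in the given order."""
    mounts, current = 0, None
    for req in requests:
        if req.tape != current:
            mounts += 1
            current = req.tape
    return mounts


requests = [
    Request("/hpss/run1/a.dat", "T001", 40),
    Request("/hpss/run2/b.dat", "T002", 10),
    Request("/hpss/run1/c.dat", "T001", 5),
    Request("/hpss/run2/d.dat", "T002", 90),
    Request("/hpss/run1/e.dat", "T001", 70),
]
print(count_mounts(requests), "mounts in arrival order")
print(count_mounts(order_by_tape(requests)), "mounts after grouping by tape")
```

With the sample requests, serving in arrival order alternates between the two tapes, while the grouped order touches each tape once. A real queue would also have to balance this against fairness between groups and the pace at which the carousel server feeds it.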

  5. carousel architecture
     [diagram; labels: HPSS tape, data mover, “ORNL” software, carousel server, mySQL database, HPSS cache, filelist, client, rmine0x, pftp, CAS NFS disk, CAS local disk]

  6. showing carousel info via the web

  7. accounting tables • group-level accounting information provides possibility of tailoring access to HPSS resources

  8. current state and future directions
     • works (has basically worked since MDC2)
     • two main sources for code
        • http://nucwww.chem.sunysb.edu/pad/offline/carousel/
        • PHENIX CVS repository
     • there remains one PHENIX-ism to be exorcised
        • HPSS batch queue is currently hardwired to suid to “phnxreco”
        • instead could select uid, gid based on COS
     • lots of future improvements are possible
        • have worked to make system “good enough” to use
        • could use more sophisticated server/client communication
        • check for available space before staging file to HPSS cache
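The last item, checking for available space before staging, might look something like the sketch below. It assumes the cache is visible as an ordinary mounted filesystem, which may not hold for the HPSS cache itself; the helper name `has_room` and the reserve parameter are illustrative choices, not part of the carousel code.

```python
import shutil


def has_room(cache_dir, file_bytes, reserve_bytes=0):
    """Return True if cache_dir's filesystem can hold file_bytes
    while still leaving reserve_bytes free.

    Assumption: the cache is a normally mounted filesystem, so the
    standard-library disk usage query applies to it.
    """
    free = shutil.disk_usage(cache_dir).free
    return free - reserve_bytes >= file_bytes


# Usage: decide whether to stage a 100 MB file, keeping 1 GB in reserve.
# "." stands in for a (hypothetical) cache directory.
if has_room(".", 100 * 1024**2, reserve_bytes=1024**3):
    print("enough space: stage the file")
else:
    print("cache too full: defer the request")
```

A check like this is only advisory, since another transfer can consume the space between the check and the write; a server-side reservation would be needed to make it reliable.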
