
RHIC, STAR computing towards distributed computing on the Open Science Grid


Presentation Transcript


  1. RHIC, STAR computing towards distributed computing on the Open Science Grid Jérôme LAURET, RHIC/STAR

  2. Outline • The RHIC program, complex and experiments • An overview of the RHIC Computing facility • Expansion model • Local Resources, remote usage • Disk storage, a “distributed” paradigm • Phenix and STAR • STAR Grid program & tools • SRM / DataMover • SUMS • GridCollector • Brief overview of the Open Science Grid • STAR on OSG

  3. The RHIC program, complex and experiments • An overview of the RHIC Computing facility • Expansion model • Local Resources, remote usage • Disk storage, a “distributed” paradigm • Phenix and STAR • STAR Grid program & tools • SRM / DataMover • SUMS • GridCollector • Brief overview of the Open Science Grid • STAR on OSG

  4. The Relativistic Heavy Ion Collider (RHIC) complex & experiments • A world-leading scientific program in Heavy-Ion and Spin physics • The largest running NP experiment • Located on Long Island, New York, USA • Flexibility is key to understanding complicated systems • Polarized protons, sqrt(s) = 50-500 GeV • Nuclei from d to Au, sqrt(sNN) = 20-200 GeV • Physics runs to date • Au+Au @ 20, 62, 130, 200 GeV • Polarized p+p @ 62, 200 GeV • d+Au @ 200 GeV • RHIC is becoming the world leader in the scientific quest toward understanding how mass and spin combine into a coherent picture of the fundamental building blocks nature uses for atomic nuclei. It is also providing unique insight into how quarks and gluons behaved collectively at the very first moments after our universe was born.

  5. The experiments • [Diagram of the 1.2 km RHIC ring and its experiments: BRAHMS & PP2PP, PHOBOS, PHENIX, and STAR]

  6. The RHIC program, complex and experiments • An overview of the RHIC Computing facility • Expansion model • Local Resources, remote usage • Disk storage, a “distributed” paradigm • Phenix and STAR • STAR Grid program & tools • SRM / DataMover • SUMS • GridCollector • Brief overview of the Open Science Grid • STAR on OSG

  7. The RHIC Computing Facility (RCF) • RHIC Computing Facility (RCF) at BNL • Tier0 for the RHIC program • Online recording of Raw data • Production reconstruction of all (most) Raw data • Facility for data selection (mining) and analysis • Long term archiving and serving of all data • … but not sized for Monte Carlo generation • Equipment refresh funding (~25% annual replacement) • Addresses obsolescence • Results in substantial collateral capacity growth

  8. Tier1, Tier2, … remote facilities • Remote Facilities • Primary source of Monte Carlo data • Significant analysis activity (equal in the case of STAR) • Such sites are operational – the top 3 • STAR • NERSC/PDSF, LBNL • Wayne State University • São Paulo • PHENIX • RIKEN, Japan • Center for High Performance Computing, University of New Mexico • VAMPIRE cluster, Vanderbilt University • Grid Computing • Promising new direction in remote (distributed) computing • STAR and, to a lesser extent, PHENIX are now active in Grid computing

  9. Key sub-systems • Mass Storage System • Hierarchical Storage Management by HPSS • 4 StorageTek robotic tape silos, ~4.5 PBytes • 40 StorageTek 9940b tape drives, ~1.2 GB/sec • Change of technology to LTO drives this year • CPU • Racked dual-processor Intel/Linux systems • ~2300 CPUs for ~1800 kSPECint2000 • Mix of Condor- and LSF-based LRMS • Central Disk • 170 TBytes of RAID 5 storage • Other storage solutions: PANASAS, … • 32 Sun/Solaris SMP NFS servers, ~1.3 GByte/sec • Distributed disk ~400 TBytes • x2.3 more than centralized storage!

  10. What does it look like … • Not like these … although …

  11. MSS, CPUs, Central Store • … but like these or similar (the chairs do not seem more comfortable)

  12. Data recording rates • Run 4 set a first record: 120 MBytes/sec for PHENIX and 120 MBytes/sec for STAR

  13. DAQ rates comparative • See the very good CHEP04 talk by Martin Purschke, “Concepts and technologies used in contemporary DAQ systems” • Heavy-ion experiments are in the > 100 MB/sec range • [Chart of approximate DAQ rates, all in MB/sec: ~1250, ~300, ~150, ~100, ~40, ~25] • STAR is moving to x10 capabilities in the outer years (2008+)

  14. Mid to long term computing needs • Computing projection model • Goal is to estimate CPU, disk, mass storage and network capacities • Model based on raw data scaling • Moore’s law used to project the decline of hardware cost • Feedback from the experimental groups • Annual meetings, model refined if necessary (it has been stable for a while) • Estimate based on beam use plans • May be offset by experiment, by year • The integral remains consistent • Maturity factor for codes • Number of reconstruction passes • “Richness” factor for the data (density of interesting events)
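
The projection model above can be illustrated with a minimal sketch, assuming a simple scaling law: capacity needs grow with raw data volume, reconstruction passes and a code-maturity factor, while Moore's law deflates the cost of a unit of capacity over time. All parameter names and numbers below are hypothetical placeholders, not the actual RCF planning inputs.

```python
# Minimal sketch of a raw-data-driven projection model (illustrative only;
# all parameters are hypothetical, not the actual RCF planning numbers).

def projected_cpu_ksi2k(raw_data_tb, passes, maturity, ksi2k_per_tb):
    """CPU need scales with raw data volume, number of reconstruction
    passes and a code-maturity factor."""
    return raw_data_tb * passes * maturity * ksi2k_per_tb

def cost_after_moore(capacity_ksi2k, year, base_year=2006,
                     base_cost_per_ksi2k=1.0, doubling_years=1.5):
    """Moore's law: cost per unit of capacity halves every ~doubling_years."""
    deflation = 0.5 ** ((year - base_year) / doubling_years)
    return capacity_ksi2k * base_cost_per_ksi2k * deflation

# Example: a hypothetical 500 TB raw-data year, 1.5 reconstruction passes,
# maturity factor 1.2, 2 kSI2k needed per TB of raw data.
need = projected_cpu_ksi2k(500, 1.5, 1.2, 2.0)
for year in (2006, 2008, 2010):
    print(year, round(cost_after_moore(need, year), 1))
```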

  15. Projected needs • [Charts of the projected STAR and Phenix computing needs]

  16. Discussion of model • Data volume is accurate to within 20% • i.e. the model was adjusted to the 20% lower end • The upper end has a larger impact in the outer years • DAQ1000 for STAR, enabling billion-event capabilities, is a major (cost) factor driven by Physics demand • Cost will be beyond current provision • Tough years start as soon as 2008 • Gets better in the outer years (Moore’s law catches up) • Uncertainties grow with time, however … • Cost versus Moore’s law • Implies “aggressive” technology upgrades (HPSS for example) • Strategy heavily based on low-cost distributed disk (cheap, CE-attached)

  17. The RHIC program, complex and experiments • An overview of the RHIC Computing facility • Expansion model • Local Resources, remote usage • Disk storage, a “distributed” paradigm • Phenix and STAR • STAR Grid program & tools • SRM / DataMover • SUMS • GridCollector • Brief overview of the Open Science Grid • STAR on OSG

  18. Disk storage – distributed paradigm • The ratio is striking • x2.3 ratio now, moving to x6 in the outer years • Requires an SE strategy • CPU shortfall • Tier1 use (Phenix, STAR) • Tier2 user analysis and data on demand (STAR)

  19. Phenix – dCache model • [Diagram: central stores, MSS = HPSS, Tier1 / CC-J / RIKEN] • Tier 0 – Tier 1 model • Provides scalability for centralized storage • Smooth(er) distributed disk model

  20. Phenix – Data transfer to RIKEN • Network transfer rates of 700-750 Mbits/s could be achieved (i.e. ~90 MB/sec, since 750 Mbits/s divided by 8 is about 94 MB/sec)

  21. STAR – SRM, GridCollector, Xrootd: a different approach • [Diagram of the Xrootd hierarchy: a manager (head node), supervisors (intermediate nodes) and data servers (leaf nodes), each running the xrootd and olbd daemons] • Large (early) pool of distributed disks, early adoption of the dd model • dd model too home-grown • Did not scale well when mixing dd and central disks • Tier 0 – Tier X (X = 1 or 2) model • Need something easy to deploy, easy to maintain • Leveraging the SRM experience • Data on demand • Embryonic event level (GridCollector) • Xrootd could benefit from an SRM back-end

  22. STAR dd Evolution – From this … • Where does this data go?? • VERY HOMEMADE, VERY “STATIC” • [Diagram of the original setup: client scripts add records, pftp onto local disk, the DataCarousel restores from MSS, control nodes update FileLocations, mark files {un-}available, and spider/update the FileCatalog (FileCatalog Management)]

  23. STAR dd Evolution – … to that … • The entire cataloguing layer is gone • The layer restoring from MSS to dd is gone • DATA ON DEMAND (pftp onto local disk) • XROOTD provides load balancing, possibly scalability, and a way to avoid LFN/PFN translation ... • But it does NOT fit within our SRM-invested directions ... • AND IS IT REALLY SUFFICIENT!?

  24. Coordination of requests needed • Un-coordinated requests to the MSS are a disaster • This applies to ANY SE-related tool • Gets worse if the environment combines technologies (shared infrastructure) • The effect on performance is drastic (see the sketch below)
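
To make the point concrete, here is a minimal sketch of request coordination: batching staging requests by tape so that each tape is mounted once instead of once per file. The tape_of() lookup and the request format are hypothetical; the real DataCarousel layer also handles fair-share between users and throttling.

```python
# Illustrative sketch of request coordination before hitting the MSS.
# The tape_of() lookup and request format are hypothetical; the real
# DataCarousel also handles fair-share between users and throttling.
from collections import defaultdict

def tape_of(hpss_path):
    # Hypothetical lookup: in reality this comes from HPSS metadata.
    return hash(hpss_path.rsplit("/", 1)[0]) % 40

def coordinate(requests):
    """Group requested files by tape so each tape is mounted once."""
    by_tape = defaultdict(list)
    for path in requests:
        by_tape[tape_of(path)].append(path)
    # Issue one batched staging request per tape instead of one per file.
    for tape, files in by_tape.items():
        yield tape, sorted(files)

for tape, files in coordinate(["/hpss/star/daq/run4/f1.daq",
                               "/hpss/star/daq/run4/f2.daq",
                               "/hpss/star/mudst/p04/f3.root"]):
    print(f"stage {len(files)} file(s) from tape {tape}")
```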

  25. The RHIC program, complex and experiments • An overview of the RHIC Computing facility • Expansion model • Local Resources, remote usage • Disk storage, a “distributed” paradigm • Phenix and STAR • STAR Grid program & tools • SRM / DataMover • SUMS • GridCollector • Brief overview of the Open Science Grid • STAR on OSG

  26. STAR Grid program - Motivation • Tier0 production • ALL EVENT files get copied to HPSS at the end of a production job • Data reduction: DAQ to Event to Micro-DST • All MuDST are on “disks” • One copy temporarily on centralized storage (NFS), one permanently in HPSS • A script checks consistency (job status, presence of files in both places) • If the “sanity” checks pass (integrity / checksum), the files are registered in the Catalog (sketched below) • Re-distribution • If registered, MuDST may be “distributed” • Distributed disk on Tier0 sites • Tier1 (LBNL) -- Tier2 sites (“private” resources for now) • Use of SRM since 2003 ... • Strategy implies IMMEDIATE dataset replication • Allows balancing of analysis from Tier0 to Tier1 • Data on demand enables Tier2 sites with analysis capabilities
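
The consistency and sanity-check step referenced above could look roughly like the following sketch. The checksum comparison and the catalog.register() call are illustrative stand-ins for the actual STAR production scripts and FileCatalog interface.

```python
# Sketch of the post-production sanity check and catalog registration.
# register() and the expected checksum source are illustrative stand-ins
# for the actual STAR production scripts and FileCatalog interface.
import hashlib
import os

def md5sum(path, chunk=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def register_if_sane(nfs_copy, expected_md5, expected_size, catalog):
    """Register a MuDST file only if it passes the integrity checks."""
    if not os.path.exists(nfs_copy):
        return False
    if os.path.getsize(nfs_copy) != expected_size:
        return False
    if md5sum(nfs_copy) != expected_md5:
        return False
    catalog.register(nfs_copy)          # hypothetical FileCatalog call
    return True
```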

  27. Needed for immediate exploitation of resources • Short / medium term strategy • To distribute data (SRM / DataMover) • Take advantage of the static data (schedulers, workflow, …: the STAR Unified Meta-Scheduler) • Advanced strategies • Data-on-demand (planner, dataset balancing, data placement …: Xrootd, …) • Selection of sub-sets of data (datasets of datasets, …: GridCollector; SRM back-ends would enable Xrootd with objects on demand) • Consistent strategy (interoperability? publishing?) • Less naïve considerations, to be addressed by leveraging existing middleware or one by one: • Job Tracking • Packaging • Automatic Error recovery, Help desk • Networking • Advanced workflow, …

  28. SRM / DataMover • [Diagram: client / user applications access, through Grid middleware, SRMs fronting heterogeneous storage: Enstore, dCache, JASMine, Unix-based disks, Castor, and the CCLRC RAL SE] • http://osg-docdb.opensciencegrid.org/0002/000299/001/GSM-WG-GGF15-SRM.ppt • SRMs are middleware components whose function is to provide dynamic space allocation and file management of shared storage components on the Grid

  29. SRM / DataMover • Layer on top of SRM • In use for BNL/LBNL data transfers for years • All MuDST moved to Tier1 this way • Extremely reliable • “Set it, and forget it!” • Several tens of thousands of files transferred, multiple TB over days, no losses • The project was (IS) extremely useful, in production usage in STAR • Data availability at the remote site as it is produced • We need this NOW • Faster analysis is better science, and sooner • Data safety • Caveat/addition in STAR: RRS (Replica Registration Service) • 250k files, 25 TB transferred AND catalogued • 100% reliability • Project deliverables on time
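
A hedged sketch of the kind of transfer loop that DataMover automates on top of SRM: keep retrying until every file has been moved, which is what makes "set it, and forget it" possible. The srm-copy command-line form below (source URL, then target URL) is an assumption, and the RRS registration step is not shown.

```python
# Schematic of a DataMover-style transfer loop on top of SRM.
# The srm-copy command-line form (source URL, target URL) is an
# assumption; real deployments use the site's SRM client and options.
import subprocess
import time

def transfer(files, target_srm):
    pending = list(files)
    while pending:
        src = pending.pop(0)
        dst = target_srm + "/" + src.rsplit("/", 1)[-1]
        ok = subprocess.run(["srm-copy", src, dst]).returncode == 0
        if not ok:
            pending.append(src)   # "set it and forget it": retry until done
            time.sleep(60)
```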

  30. SRM / DataMover – Flow diagram • NEW: being deployed at • Wayne State University • São Paulo • DRM used in the data analysis scenario as a lightweight SE service (deployable on the fly) • All the benefits of SRM (advanced reservation, …) • If we know there IS storage space, we can take it • No heavy-duty SE deployment

  31. CE/SE decoupling • Srm-copy from the execution site DRM back to the submission site • The submission site DRM is called from the execution site WN • Requires an outgoing, but not incoming, connection on the WN • Srm-copy callback disabled (asynchronous transfer) • Batch slot released immediately after the srm-copy call (see the sketch below) • Final destination of the files is HPSS or disk, owned by the user • [Diagram: clients with local /scratch at the job execution site srm-copy into the DRM cache of the submission site DRM]
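
A minimal sketch of the worker-node end of this decoupling, under the stated assumptions: the job writes its output to local /scratch, hands it to the submission-site DRM without waiting for a callback, and returns so the batch slot is released. The DRM endpoint URL and the srm-copy invocation are placeholders.

```python
# Sketch of the worker-node end of CE/SE decoupling: write output locally,
# hand it to the submission-site DRM asynchronously, release the batch slot.
# The srm-copy form and the DRM endpoint URL are placeholders.
import subprocess

SUBMISSION_DRM = "srm://submit.site.example:8443/drmcache/user"  # hypothetical

def finish_job(output_file):
    # Only an outgoing connection from the WN is needed.  In the real setup
    # the SRM client itself returns immediately (callback disabled); here the
    # asynchronous hand-off is approximated with a non-blocking Popen.
    subprocess.Popen(["srm-copy", "file://" + output_file,
                      SUBMISSION_DRM + "/" + output_file.rsplit("/", 1)[-1]])
    # Returning here lets the batch system reclaim the slot immediately.

finish_job("/scratch/job1234/histos.root")
```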

  32. SUMS – The STAR Unified Meta-Scheduler • STAR Unified Meta-Scheduler • Gateway to user batch-mode analysis • The user writes an abstract job description • The scheduler submits where the files are, where the CPU is, ... • Collects usage statistics • Users DO NOT need to know about the RMS layer • Dispatcher and Policy engines • Dataset driven - full catalog implementation & Grid-aware • Throttles IO resources, avoids contention, optimizes on CPU • Avoids specifying data location … • Example job description (test.xml):
<?xml version="1.0" encoding="utf-8" ?>
<job maxFilesPerProcess="500">
  <command>root4star -q -b rootMacros/numberOfEventsList.C\(\"$FILELIST\"\)</command>
  <stdout URL="file:/star/u/xxx/scheduler/out/$JOBID.out" />
  <input URL="catalog:star.bnl.gov?production=P02gd,filetype=daq_reco_mudst" preferStorage="local" nFiles="all"/>
  <output fromScratch="*.root" toURL="file:/star/u/xxx/scheduler/out/" />
</job>
Query/wildcard resolution turns the catalog query into per-process file lists and scripts (sched1043250413862_0.list / .csh, sched1043250413862_1.list / .csh, sched1043250413862_2.list / .csh, ...), each listing input files such as /star/data09/reco/productionCentral/FullFie...
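
To illustrate what the query/wildcard resolution produces, here is a small sketch (not SUMS code) that chunks a resolved catalog query into per-process .list files, mimicking the maxFilesPerProcess attribute of the job description above.

```python
# Illustrative sketch of SUMS-style file-list splitting (not SUMS code):
# a resolved catalog query is chunked into per-process .list files,
# honouring something like the maxFilesPerProcess="500" attribute.
def write_filelists(resolved_files, max_per_process=500,
                    prefix="sched1043250413862"):
    lists = []
    for i in range(0, len(resolved_files), max_per_process):
        name = f"{prefix}_{i // max_per_process}.list"
        with open(name, "w") as f:
            f.write("\n".join(resolved_files[i:i + max_per_process]) + "\n")
        lists.append(name)
    return lists

# Each generated .list file is then passed to the user command as $FILELIST.
```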

  33. SUMS – The STAR Unified Meta-Scheduler • STAR Unified Meta-Scheduler • Gateway to user batch-mode analysis • The user writes an abstract job description • The scheduler submits where the files are, where the CPU is, ... • Collects usage statistics • Users DO NOT need to know about the RMS layer • Dispatcher and Policy engines • Dataset driven - full catalog implementation & Grid-aware • Throttles IO resources, avoids contention, optimizes on CPU • [Plots: BEFORE, very choppy, as NFS would impact computational performance; AFTER, modulo remaining farm instability, smoother]

  34. SUMS – The STAR Unified Meta-Scheduler, the next generation … • NEW FEATURES • RDL in addition to U-JDL • Testing of grid submission is OVER; SUMS is production- and user-analysis ready • Light SRM helping tremendously • Needs a scalability test, an issue since we have multiple 10k jobs/day NOW, with spikes at 100k (valid) jobs from nervous users … • Made aware of multiple packaging methods (from ZIP archive to PACMAN) • Tested for simple analysis; mixed archiving technology still needs finalizing (a detail) • Versatile configuration • A site can “plug-and-play” • Possibility of multi-VO support within ONE install

  35. GridCollector – Using an Event Catalog to Speed up User Analysis in a Distributed Environment • root4star -b -q doEvents.C'(25,"select MuDst where Production=P04ie \ and trgSetupName=production62GeV and magScale=ReversedFullField \ and chargedMultiplicity>3300 and NV0>200", "gc,dbon")' • STAR – event catalog … • Based on TAGS produced at reconstruction time • Rests on the now well-tested and robust SRM (DRM+HRM) deployed in STAR anyhow • Immediate access and a managed SE • Files moved transparently, by delegation to the SRM service, BEHIND THE SCENES • Easier to maintain, and the prospects are enormous • “Smart” IO-related improvements and home-made formats are no faster than using GridCollector (a priori) • GAIN ALWAYS > 1, regardless of selectivity • Physicists could get back to physics • And STAR technical personnel are better off supporting GC • It is a WORKING prototype of • a Grid interactive analysis framework • VERY POWERFUL, event-“server” based (no longer file based)
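
A conceptual sketch of the event-catalog idea behind GridCollector: cuts are applied at the tag level first, so only the files (and events) that match need to be staged and read. The tag records and the selection function below are illustrative placeholders; GridCollector itself uses bitmap indexes built from the STAR TAGS, with file movement delegated to SRM.

```python
# Conceptual sketch of event-catalog selection (not the GridCollector
# implementation, which uses bitmap indexes over the STAR TAGS).
# Tag records and the staging step are illustrative placeholders.

tags = [  # one record per event: (file, event_id, chargedMultiplicity, NV0)
    ("MuDst_A.root", 1, 3400, 250),
    ("MuDst_A.root", 2, 1200,  80),
    ("MuDst_B.root", 7, 3600, 310),
]

def select(tags, min_mult, min_nv0):
    """Apply the cut at the tag level; return matching events grouped by file."""
    wanted = {}
    for fname, evt, mult, nv0 in tags:
        if mult > min_mult and nv0 > min_nv0:
            wanted.setdefault(fname, []).append(evt)
    return wanted

# Only files containing selected events need to be staged (e.g. via SRM),
# and within each file only the selected events are read.
print(select(tags, 3300, 200))   # -> {'MuDst_A.root': [1], 'MuDst_B.root': [7]}
```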

  36. GridCollector – The next step • Can push functionalities “down” • Bitmap index technology in the ROOT framework • Make “a” coordinator “aware” of events (i.e. objects) • Xrootd a good candidate • ROOT framework preferred • Both would serve as a demonstrator (immediate benefit to a few experiments …) • Object-On-Demand: from files to Object Management - a Science Application Partnership (SAP) under SciDAC-II • In the OSG program of work as one of the technologies leveraged to achieve its goals

  37. The RHIC program, complex and experiments • An overview of the RHIC Computing facility • Expansion model • Local Resources, remote usage • Disk storage, a “distributed” paradigm • Phenix and STAR • STAR Grid program & tools • SRM / DataMover • SUMS • GridCollector • Brief overview of the Open Science Grid • STAR on OSG

  38. The Open Science Grid • In the US, Grid computing is moving to the Open Science Grid • An interesting adventure comparable to similar European efforts • EGEE interoperability at its heart • Character of the OSG • Distributed ownership of resources; local facility policies, priorities, and capabilities need to be supported • A mix of agreed-upon performance expectations and opportunistic resource use • Infrastructure deployment based on the Virtual Data Toolkit • Will incrementally scale the infrastructure, with milestones, to support stable running of a mix of increasingly complex jobs and data management • Peer collaboration of computer and application scientists, facility, technology and resource providers: an “end to end approach” • Support for many VOs, from the large (thousands) to the very small and dynamic (down to the single researcher & high school class) • Loosely coupled, consistent infrastructure - a “Grid of Grids”

  39. STAR and the OSG • STAR could not run on Grid3 • It was running only at PDSF, a Grid3 site set up in collaboration with our own resources • STAR on OSG = a big improvement • OSG is for Open Science, not as strongly a sole LHC focus • Expanding to other sciences: needs and requirements revisited • More resources • Greater stability • Currently • Run MC on a regular basis (nightly tests, standard MC) • Recently focused on user analysis (lightweight SRM) • Helped other sites deploy the OSG stack • And it shows … the FIRST functional site in Brazil, Universidade de São Paulo, a STAR institution … http://www.interactions.org/sgtw/2005/0727/star_saopaulo_more.html

  40. Summary • The RHIC computing facility provides adequate resources in the short term • The model is imperfect for long-term projections • Problematic years starting 2008, driven by high data throughput & physics demands • A mid-term issue • This will impact Tier1 sites as well, assuming a refresh and planning along the same model • Out-sourcing? • Under data “stress” and increasing complexity, the RHIC experiments have integrated distributed computing principles at one level or another • Data distribution and management • Job scheduling, selectivity, … • STAR intends to • Take full advantage of the OSG & help bring more institutions into the OSG • Address the issue of batch-oriented user analysis (opportunistic, …)
