
Experience and Plans for CDF Grid Computing


Presentation Transcript


  1. Experience and Plans for CDF Grid Computing
  Alan Sill, Texas Tech University
  International Symposium on Grid Computing, Taipei, Taiwan, 25-29 April 2005

  2. Introduction and Background
  • CDF is an operating experiment at the Tevatron
  • Considerable experience in C++ software development
  • Extensive software distribution framework and tools
  • The software development plan was oriented towards physics analysis at remote institutions as well as at Fermilab, with most data handling occurring at FNAL
  • Most developments in Grid computing happened after the experiment was mature!
    • => These have to be retrofitted to the collaboration workflow
  • This talk describes CDF's large-scale global computing and our further plans to use the Grid.

  3. Talk Outline: Central → Global → Grid
  • We developed a large central computing resource based on Linux farms with a simple job management scheme.
  • We extended that model, including its command-line interface and GUI, to manage and work with remote resources.
  • We are now in the process of adapting and converting our workflow to the Grid.

  4. Elements of Global Computing
  • It must work for everyone
    • The entire collaboration built the detector
    • The entire collaboration analyzes the data
    • => Keep the user environment simple!
  • It should be easy for users to adopt
    • The physics workflow should look familiar to the physicist.
  • It should add or create additional resources
    • There should be some distinct advantage obtained from the computing being globally distributed.

  5. Original CDF Central Computing
  • Developed in 2001-2002 in response to the experiment's greatly increased need for computational and data-handling resources for Run II.
  • Replaced the Symmetric Multi-Processing (SMP) approach with an inexpensive collection of commodity Linux-based computing and file-server systems.
  • One of the first large-scale cluster approaches to user computing for general analysis (as opposed to farms for production of standardized jobs).
  • Greatly increased the CPU power and data available to physicists (presently 300 TB of data + 1500 CPUs).

  6. CDF Data Analysis Flow: 2002-03
  Based on the CDF Central Analysis Facility (CAF) for user physics analysis. The CAF is part of the overall analysis flow: users submit jobs from their desktops through the CAF GUI or command line to central CPU, disk, and tape resources. Queues are arranged to give priority to users on resources provided by their own institution. Most user analysis occurred here.
  [Data-flow diagram: CDF detector (7 MHz beam crossing, 0.75 million channels); Level 1/Level 2 triggers (300 Hz); Level 3 trigger (~250 duals); Production farm (~150 duals) for reconstruction; 75 Hz to robotic tape storage, read/write at 20 MB/s; Central Analysis Farm (CAF) and user desktops for simulation and data analysis.]

  7. CAF GUI Interface
  (Command-line submission is also possible.)
  "Send my job to the data": the user develops on a desktop, then submits an analysis job to central resources. The job is tarred up and sent to the CAF cluster; results are packed up and sent back to, or picked up by, the user. Most active CDF physicists became familiar with the operation of this system. (A sketch of this flow follows below.)
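  To make the submission flow above concrete, here is a minimal client-side sketch in Python: it packs the user's working directory into a tarball and hands it to a submission command. The caf_submit command name, its flags, and the farm/dataset names are hypothetical placeholders, not the actual CDF CAF tooling.

    import subprocess
    import tarfile
    from pathlib import Path

    def package_job(workdir: str, tarball: str = "job_sandbox.tgz") -> str:
        """Pack the user's analysis directory into a self-contained tarball."""
        with tarfile.open(tarball, "w:gz") as tar:
            tar.add(workdir, arcname=Path(workdir).name)
        return tarball

    def submit_job(tarball: str, farm: str, command: str, dataset: str) -> None:
        """Hand the tarball to a (hypothetical) CAF submission command.

        'caf_submit' stands in for the real CAF client; the flags shown
        here are illustrative only.
        """
        subprocess.run(
            ["caf_submit", "--farm", farm, "--tarball", tarball,
             "--dataset", dataset, "--command", command],
            check=True,
        )

    # Example (hypothetical directory, farm, and dataset names):
    # tarball = package_job("my_analysis")
    # submit_job(tarball, farm="taiwan", command="run_analysis.sh", dataset="example_dataset")

  In the actual system the submission is authenticated with the user's Kerberos principal and the results are returned automatically, as described on the next slide.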

  8. Environment on a CDF CAF
  • All basic CDF software is pre-installed on the CAF.
  • Authentication via Kerberos.
  • Jobs are run via mapped accounts, with authentication of the actual user through a special principal.
  • The remote user ID for database and data-handling access is passed on through lookup of the actual user via the special principal (important for monitoring).
  • The user's analysis environment comes over in the tarball - no need to pre-register or to submit only certain kinds of jobs.
  • The job returns results to the user via secure ftp/rcp controlled by a user script and the user's principal (see the sketch below).
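  The last point can be pictured as a small job-side wrapper. The sketch below is illustrative only: the script names, the results directory, and the return destination are hypothetical, and scp stands in for whichever Kerberized ftp/rcp client the site actually provides under the user's principal.

    import subprocess
    import tarfile

    def run_and_return(user_exe: str, results_dir: str, return_to: str) -> None:
        """Run the user's analysis executable, then ship its output back.

        'return_to' is a hypothetical user@host:path destination; the real CAF
        returns results via Kerberized ftp/rcp under the submitting user's principal.
        """
        # Run the user's job inside the unpacked sandbox.
        subprocess.run([user_exe], check=True)

        # Pack whatever the job produced.
        with tarfile.open("results.tgz", "w:gz") as tar:
            tar.add(results_dir)

        # Copy the results back under the user's credentials.
        subprocess.run(["scp", "results.tgz", return_to], check=True)

    # Example (hypothetical paths; the destination would come from the submission):
    # run_and_return("./run_analysis.sh", results_dir="output",
    #                return_to="user@desktop.example.edu:results/")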

  9. Elements of Global Computing (revisited)
  The user's job context is transported to the execution site as a sandboxed tarball => no surprises should occur when the job runs.
  • It must work for everyone
    • The entire collaboration built the detector
    • The entire collaboration analyzes the data
    • => Keep the user environment simple! ✔
  • It should be easy to adopt
    • The physics workflow should look familiar to the physicist.
  • It should add or create additional resources
    • There should be some distinct advantage obtained from the computing being globally distributed.

  10. Sociological Factors to Build a CAF
  • All users in CDF get an account if they request one. (Currently there are 768 registered users.)
  • Institutions get priority on use of the portions of the CAF that they contribute:
    • Exclusive queues
    • Priority on disks, etc.
  • Common hardware choices are made by central admins.
  • All leftover time is available to everyone.
  • Result: 100% utilization! This is a very efficient way to share resources!

  11. Elements of Global Computing (extended)
  To go further, we extended the CAF interface used in this system to remote clusters (dCAFs) in locations throughout the world.
  • It must work for everyone
    • The entire collaboration built the detector
    • The entire collaboration analyzes the data
    • => Keep the user environment simple!
  • It should be easy for users to adopt
    • The physics workflow should look familiar to the physicist.
  • It should add or create additional resources
    • There should be some distinct advantage obtained from the computing being globally distributed.

  12. CDF Data Analysis Flow: 2004-05
  Based on distributed CDF Analysis Facilities (dCAFs). New dCAFs: distributed clusters in Taiwan, Korea, Japan, Italy, Germany, Spain, the UK, the USA, and Canada. Usage split (central vs. offsite): ~65%/35% in 2004, ~55%/45% so far in 2005.
  [Data-flow diagram as on slide 6, with the new dCAFs added alongside the Central Analysis Farm (CAF) for simulation and data analysis.]

  13. Elements of the Current CDF Approach
  • Cluster technology (CAF = "CDF Analysis Farm") extended to remote sites (dCAFs = decentralized CDF Analysis Farms).
  • Multiple batch systems supported: converting from the FBSNG system to Condor on all dCAFs.
  • The SAM data-handling system is required for offsite dCAFs:
    • Either use it to pin pre-defined datasets at remote locations, or use SAM's automatic cache-management features (see the sketch below).
    • SAM is also used on CDF systems with non-CAF architectures.
  • Fully functional for users => heavily in use.
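  The two SAM usage modes mentioned above, pinning a pre-defined dataset versus letting the cache manage itself, can be illustrated with a toy cache model. This is not SAM's real API; the class name, file names, and the simple LRU policy are purely illustrative of the concept.

    from collections import OrderedDict

    class ToyStationCache:
        """Toy model of a SAM-style station disk cache (illustrative only)."""

        def __init__(self, capacity_files: int):
            self.capacity = capacity_files
            self.files = OrderedDict()   # file name -> pinned flag

        def pin_dataset(self, dataset_files):
            """Mode 1: pin a pre-defined dataset so it is never evicted."""
            for f in dataset_files:
                self.files[f] = True     # pinned
                self.files.move_to_end(f)

        def fetch(self, filename: str):
            """Mode 2: automatic cache management, evicting unpinned files LRU-first."""
            if filename in self.files:
                self.files.move_to_end(filename)   # cache hit
                return
            # Cache miss: make room by evicting least-recently-used unpinned files.
            while len(self.files) >= self.capacity:
                victim = next((f for f, pinned in self.files.items() if not pinned), None)
                if victim is None:
                    raise RuntimeError("cache full of pinned files")
                del self.files[victim]
            self.files[filename] = False  # fetched on demand, unpinned

    cache = ToyStationCache(capacity_files=3)
    cache.pin_dataset(["run1_a.root", "run1_b.root"])   # hypothetical file names
    cache.fetch("run2_c.root")
    cache.fetch("run2_d.root")   # evicts run2_c.root, never the pinned dataset files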

  14. Other Current Components
  • Ganglia hardware monitoring.
  • Bandwidth monitoring by the Fermilab network team.
  • Batch-system user monitoring.
  • Bookkeeping and transfer of pinned datasets.
  • Calibration-database query caching system.
  • Remote direct access to other central databases. (Plan to extend the caching system to other DBs soon.)
  • Single uniform user interface (CAF GUI / commands).

  15. Modifications to the CAF GUI
  • Allows selection of the CAF (not automatic yet).
  • User specifies the dataset.
  • Use of SAM features is built in.
  • The same interface can be used for the Grid.
  (A hypothetical command-line equivalent of these fields is sketched below.)
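  For concreteness, a minimal sketch of how the GUI fields above might map onto command-line options, using Python's argparse. The option names and values are hypothetical and do not reflect the real CAF client.

    import argparse

    # Hypothetical command-line mirror of the CAF GUI fields described above.
    parser = argparse.ArgumentParser(
        description="Toy CAF submission front end (illustrative only)")
    parser.add_argument("--farm", required=True,
                        help="which CAF/dCAF to use (selection is not automatic yet)")
    parser.add_argument("--dataset", required=True,
                        help="SAM dataset the job should run over")
    parser.add_argument("--tarball", default="job_sandbox.tgz",
                        help="user sandbox produced as in the earlier sketch")
    parser.add_argument("command", help="script to run inside the sandbox")

    args = parser.parse_args(
        ["--farm", "taiwan", "--dataset", "example_dataset", "run_analysis.sh"])
    print(args.farm, args.dataset, args.command)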

  16. Current CDF Dedicated Resources (as of Apr. 15, 2004*)
  *Counts only resources openly available to all CDF users.
  [Table of dedicated resources by site.]

  17. How Did We Achieve This?
  • A series of workshops oriented to installation of the cluster (dCAF) software and data-handling (SAM) pieces, as well as other components (monitoring, DB handling, etc.), beginning Jan. 2004. (Four held so far.)
  • Rigorous application of an "it's got to work" mentality and focus on the ability to function in our environment.
  • Postponement of features not yet truly needed.
  • Clear, "live", frequently updated installation manuals.
  • Focus on operations while development is going on. (Daily 15-minute operations meeting, weekly 1-hour summary.)

  18. Functionality for Users (4/2005)

      Feature                             Status
      ----------------------------------  --------
      Self-contained user sandbox         Yes
      Runs arbitrary user code            Yes
      Automatic identity management       Yes
      Network delivery of results         Yes
      Input and output data handling      Yes
      Batch system priority management    Yes
      Automatic choice of farm            Not yet!
      Negotiation of resources            Not yet!
      Runs on arbitrary grid resources    Not yet!

  19. Utilization Jumped Up Quickly
  • Experiment-wide global aggregate monitoring is not done yet (in progress).
  • Off-site usage of each new dCAF jumped up quickly after deployment.
  [Bar chart of usage by site: central CAF and CondorCAF (of order 1K-1.6K CPUs) and off-site dCAFs in Taiwan, Japan, Spain, Italy, and Korea, plus SDSC, Toronto, and Rutgers (of order 30-400 CPUs each).]

  20. Current CDF Dedicated Resources (as of Apr. 15, 2004*)
  2.3 of 5.6 THz is now offsite in dCAFs (not counting special-purpose and "opportunistic" computing); other clusters and Grid access are in progress.
  *Counts only resources openly available to all CDF users.
  [Table of dedicated resources by site, as on slide 16.]

  21. Did CDF's Route to Global Computing Succeed?
  Yes, but we now have to build a path to the future: provide tools to add Grid functionality without losing what we have gained.
  Elements of Global Computing:
  • It must work for everyone ✔
    • The entire collaboration built the detector
    • The entire collaboration analyzes the data
    • => Keep the user environment simple! ✔
  • It should be easy to adopt ✔
    • The physics workflow should look familiar to the physicist. ✔
  • It should add or create additional resources
    • There should be some distinct advantage obtained from the computing being globally distributed.

  22. Next Steps
  • Follow through on near-term technology deployments:
    • Complete the "FroNtier" distributed database cache and extend it to all experiment databases (see the sketch below).
    • Aggregate global monitoring as the dCAFs switch to Condor.
    • SRM-dCache + SAM deployment for data access.
  • Deploy SAM data handling everywhere:
    • We have used SAM for two years on off-site systems and on central systems in development mode; now switch to using it for all data handling, including central.
    • Adapt the MC production storage and reconstruction systems to SAM.
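  The FroNtier idea referenced above is to turn read-only database queries into HTTP requests so that ordinary web caches (e.g. Squid proxies) near each site can serve repeated calibration lookups instead of the central database. The sketch below shows only the concept; the server URL, query encoding, and proxy address are hypothetical placeholders, not the real FroNtier protocol.

    import os
    import urllib.parse
    import urllib.request

    # Hypothetical FroNtier-style endpoint and site-local caching proxy.
    FRONTIER_URL = "http://frontier.example.fnal.gov/query"   # placeholder, not a real service
    SITE_PROXY = os.environ.get("SITE_SQUID", "http://squid.example.site:3128")

    def cached_db_query(sql: str) -> bytes:
        """Send a read-only DB query as an HTTP GET through a caching proxy.

        Identical queries from many jobs at a site hit the proxy cache instead
        of the central database server.
        """
        url = FRONTIER_URL + "?" + urllib.parse.urlencode({"q": sql})
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": SITE_PROXY})
        )
        with opener.open(url, timeout=30) as resp:
            return resp.read()

    # Example: a calibration lookup that would be repeated by every job at a site.
    # payload = cached_db_query("SELECT * FROM calib_runs WHERE run = 123456")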

  23. Reasons for Moving to the Grid
  We cannot continue to expand dedicated resources ==> USE THE GRID!!
  • It's the direction of the lab and of HEP.
  • We need to take advantage of global innovations and resources.
  • The "dedicated pool" model is becoming obsolete.
  • CDF still has a lot of data to analyze!

  24. Luminosity and Data Volume
  Expectations are for continued high-volume growth as luminosity and the data-logging rate continue to improve (a rough data-volume estimate follows below):
  • Luminosity is on target to reach the goal of 2.5x the present rate.
  • The data-logging rate will increase to 35-40 MB/sec in 2005.
  • The rate will further increase to 60 MB/sec in FY2006.
  [Luminosity projection plot with a "you are here" marker at the current point.]
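  To make the growth concrete, a back-of-the-envelope estimate of yearly raw-data volume at the quoted logging rates. The 30% effective duty factor is an assumed illustrative number, not a CDF figure; the rates are those stated on the slide (present rate of roughly 20 MB/s from the earlier data-flow slides).

    # Rough yearly raw-data volume at the logging rates quoted above.
    SECONDS_PER_YEAR = 365 * 24 * 3600
    DUTY_FACTOR = 0.30   # assumed fraction of the year spent logging data (illustrative)

    for period, rate_mb_s in [("present", 20), ("2005", 40), ("FY2006", 60)]:
        volume_tb = rate_mb_s * SECONDS_PER_YEAR * DUTY_FACTOR / 1e6   # MB -> TB
        print(f"{period}: {rate_mb_s} MB/s -> roughly {volume_tb:.0f} TB/year")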

  25. Total Computing Requirements
  • Analysis CPU, disk, and tape needs scale with the number of events.
  • Changes in the logging rate in FY2005 and FY2006 drive large increases in both storage and CPU requirements.
  • The FNAL portion of analysis CPU is assumed to be roughly 50% beyond 2005.
  [Charts of input conditions and resulting computing requirements, actual and estimated.]

  26. Towards Full Grid Methods
  We are testing various approaches to using Grid resources:
  • Adapt the CAF infrastructure to run on top of the Grid using Condor glide-ins (sketched below).
    • Advantages: Grid-neutral; works on both LCG and OSG; needs just a Globus gatekeeper; can opportunistically use idle resources.
    • Disadvantage: needs a CDF head node near each CE.
  • Use direct submission via the CAF interface to OSG and LCG.
    • Exploring differences between job-submission models and resource-broker environments; work in progress now.
  • Use SAMGrid/JIM sandboxing as an alternate way to deliver experiment + user software.
  • Combine dCAFs with Grid resources (can be done at any time).
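  The glide-in approach above amounts to submitting, through a plain Globus gatekeeper, pilot jobs whose payload starts a Condor worker that reports back to the CDF head node's pool. The sketch below writes a grid-universe submit description and hands it to condor_submit; the gatekeeper address, collector name, and glidein_startup.sh payload (with its --collector argument) are hypothetical placeholders, and the real GlideCAF machinery is more involved.

    import subprocess
    import textwrap

    # Hypothetical site gatekeeper and CDF head-node collector (placeholders).
    GATEKEEPER = "ce.example-site.org/jobmanager-condor"
    HEAD_NODE_COLLECTOR = "glidecaf-head.example.fnal.gov"

    def submit_glideins(n_pilots: int) -> None:
        """Submit pilot jobs that start a Condor worker reporting to the CAF head node."""
        submit_description = textwrap.dedent(f"""\
            universe      = grid
            grid_resource = gt2 {GATEKEEPER}
            executable    = glidein_startup.sh
            arguments     = --collector {HEAD_NODE_COLLECTOR}
            output        = glidein_$(Cluster).$(Process).out
            error         = glidein_$(Cluster).$(Process).err
            log           = glidein.log
            queue {n_pilots}
            """)
        with open("glidein.submit", "w") as f:
            f.write(submit_description)
        # Hand the description to Condor; requires a working Condor + Globus setup.
        subprocess.run(["condor_submit", "glidein.submit"], check=True)

    # submit_glideins(10)   # example: request 10 glide-in slots at the site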

  27. Grid Integration Methods
  • Direct submission to the Grids:
    • Organized MC production is a good candidate.
    • Submit either via the CAF interface (preferred) or via Globus submission.
    • Desire to use non-dedicated, shared resources.
  • Allow Grid submission to our dedicated pools:
    • The CAF is compatible with Grid integration.
    • A dCAF can coexist with another submission source.
    • Can be done at any time.

  28. Activities in Progress: More Detail
  • "GlideCAF" and "GridCAF" efforts are underway:
    • Produce the ability to submit CDF MC jobs to LCG, and possibly OSG, through the CDF CAF user interface.
    • GlideCAF is in beta test; GridCAF is under development.
    • Requires solving the Kerberos/KX509 update-interval problem.
  • Full Open Science Grid and LCG implementations:
    • Already participating in the OSG Integration Test Bed and working closely with LCG developers.
    • LCG includes a Resource Broker; OSG does not.
    • Beginning work on related submission and accounting tools.

  29. CDF Job Sandboxing
  • Up to now, CDF jobs rely on having CDF software available and installed at the execution site.
    • This will not work for the Grid.
  • Studying various sandboxing methods.
  • Moving there by steps (a sandbox-building sketch follows below):
    • Production executables are already sandboxed.
    • Organized physics-group activities will follow soon.
    • Average user analysis will follow sometime, but this is not an easy problem to fix.
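  One generic way to make an executable self-contained, in the spirit of the production-executable sandboxing mentioned above, is to bundle it with the shared libraries it links against. The sketch below uses ldd to discover those libraries and packs everything into a tarball; it is an illustration, not the actual CDF sandboxing tool, and it ignores complications such as data files, dlopen'd plugins, and absolute paths.

    import re
    import subprocess
    import tarfile
    from pathlib import Path

    def build_sandbox(executable: str, tarball: str = "sandbox.tgz") -> str:
        """Bundle an executable plus the shared libraries reported by ldd."""
        ldd_out = subprocess.run(["ldd", executable],
                                 capture_output=True, text=True, check=True)
        # Lines look like: "libfoo.so.1 => /usr/lib/libfoo.so.1 (0x...)"
        libs = re.findall(r"=>\s*(\S+)\s*\(", ldd_out.stdout)

        with tarfile.open(tarball, "w:gz") as tar:
            tar.add(executable, arcname=Path(executable).name)
            for lib in libs:
                if Path(lib).exists():
                    tar.add(lib, arcname="libs/" + Path(lib).name)
        return tarball

    # Example (hypothetical executable name):
    # build_sandbox("./ProductionExe")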

  30. Conclusions
  • CDF has successfully deployed a global computing environment for user analysis based on a simple clustering and submission protocol, with a large number of registered and active users.
  • A large portion (of order 50%) of the experiment's total CPU resources is now provided offsite through a combination of dCAFs and other clusters.
  • Aggressive trimming of the software suite has allowed us to bring up and use large computational resources in short order, with heavy adoption and utilization.
  • Active work is in progress to build bridges to true Grid methods and protocols, to provide a path to the future.
