Distributed Data Analysis and Tools

Presentation Transcript


  1. Distributed Data Analysis and Tools CHEP, 21-27 March 2009, Prague P. Mato /CERN

  2. Foreword • Distributed Data Analysis is a very wide subject and I don’t like catalogue-like talks • Narrowing the scope of the presentation to the perspective of the ‘physicists’, discussing issues that affect them directly • My presentation will be LHC-centric, which is very relevant for the phase we are in now. -- Sorry • Thanks to all the people who have helped me to prepare this presentation

  3. Data Analysis • The full data processing chain from reconstructed event data up to producing the final plots for publication • Data analysis is an iterative process • Reduce data samples to more interesting subsets (selection) • Compute higher-level information, redo some reconstruction, etc. • Calculate statistical entities • Algorithm development is essential in analysis • The ingenuity is materialized in code
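
  As a rough illustration of the iterative reduce/compute/plot cycle described above, here is a minimal Python sketch (event fields, cut values and binning are invented for the example) that selects a subset of events, computes a higher-level quantity and fills a simple histogram:

      # Toy illustration only: event content and cuts are invented.
      events = [{"pt": 24.0, "eta": 0.7}, {"pt": 8.5, "eta": 2.1}, {"pt": 41.2, "eta": -1.3}]

      # 1) Selection: reduce the sample to a more interesting subset
      selected = [e for e in events if e["pt"] > 20.0 and abs(e["eta"]) < 2.5]

      # 2) Compute higher-level information for the surviving events
      for e in selected:
          e["pt_corr"] = e["pt"] * 1.02        # e.g. a calibration correction

      # 3) Calculate statistical entities (here: a crude pt histogram)
      bins = [0, 10, 20, 30, 40, 50]
      counts = [0] * (len(bins) - 1)
      for e in selected:
          for i in range(len(bins) - 1):
              if bins[i] <= e["pt_corr"] < bins[i + 1]:
                  counts[i] += 1
      print(counts)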

  4. Some Obvious Facts • The large amount of data to be analyzed and the computing requirements rule out the idea of non-distributed data analysis • The scale of ‘distribution’ goes from a local cluster to a computing centre or to the whole grid(s) • Distributed analysis complicates the life of the physicists • In addition to the analysis code he/she has to worry about many other technical issues

  5. LHC Analysis Data Flow • Data is generated at the experiment, processed and distributed worldwide (T1, T2, T3) • The analysis will process, reduce, transform and select parts of the data iteratively until it can fit in a single computer • How is this realized?

  6. HEPCAL-II† Dreams • All elements there and still valid • Less organized activity (chaotic) • Input data defined by asking questions • Data scattered all over the world • Own algorithms • Data provenance • Software version management • Resource estimation • Interactivity • Advocating for a sophisticated WMS • Common to all VOs • Plugins to VO-specific tools/services [Diagram: user algorithms and a dataset query feeding a Workload Management System and other services, which produce the user output] † Common use cases for a HEP Common Application Layer for Analysis, LCG-2003

  7. Need for a Common Layer “If there is no special middleware support [for analysis], the job may not benefit from being run in the grid environment, and analysis may even take a step backward from pre-grid days”

  8. HEPCAL-II Reality • The implementation has evolved into a number of VO-specific “middleware” systems using a small set of basic services • E.g. DIRAC, PanDA, AliEn, Glide-In • Development of “user-friendly” and “intelligent” interfaces to hide the complexity • E.g. CRAB, Ganga • Not optimal for small VOs that cannot afford to develop specific services/interfaces • Or individuals with special needs [Diagram: layered stack: a VO-specific front-end interface on top of the VO-specific WMS and DSC, on top of the grid middleware and basic services, on top of the computing & storage resources]

  9. Analysis Software • Specialization of the VOs’ frameworks and data models for data analysis to process ESD/AOD • CMS Physics Analysis Toolkit (PAT), ATLAS Analysis Framework, LHCb DaVinci/LoKi/Bender, ALICE Analysis Framework • In some cases selecting a subset of the framework libraries • Collaboration-approved analysis algorithms and tools • Other [scripting] languages have a role here • Python is getting very popular in addition to CINT macros • Ideal for prototyping new ideas • The user typically develops his/her own algorithm(s) based on these frameworks but is also willing to replace parts of the official release
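
  To make the prototyping point concrete, here is a minimal sketch of a user algorithm written in Python. The initialize/execute/finalize pattern and the event-store access are illustrative stand-ins, not the actual API of any of the frameworks listed above:

      # Hypothetical sketch: the interface and event access are invented
      # stand-ins for whatever the experiment framework actually provides.
      class MySelectionAlg:
          def initialize(self):
              self.n_selected = 0                  # book counters/histograms here

          def execute(self, event):
              muons = event.get("Muons", [])       # hypothetical event-store access
              good = [m for m in muons if m["pt"] > 20.0]
              if len(good) >= 2:
                  self.n_selected += 1
              return len(good) >= 2                # accept/reject decision

          def finalize(self):
              print("selected events:", self.n_selected)

      # Toy driver loop standing in for the framework's event loop
      alg = MySelectionAlg()
      alg.initialize()
      for event in [{"Muons": [{"pt": 25.0}, {"pt": 31.0}]}, {"Muons": []}]:
          alg.execute(event)
      alg.finalize()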

  10. Front-End Tools [Screenshots of the Ganga, ALICE and CRAB front-ends]

  11. Major Differences • Both Ganga and ALICE provide an interactive shell to configure and automate analysis jobs (Python, CINT) • In addition, Ganga provides a GUI • CRAB has a thin client; most of the work (automation, recovery, monitoring, etc.) is done in a server • In the other cases this functionality is delegated to the VO-specific WMS • Ganga offers a convenient overview of all user jobs (job repository), enabling automation • Both CRAB and Ganga are able to pack local user libraries and the environment automatically, making use of the configuration tool’s knowledge • For ALICE the user provides .par files with the sources

  12. Analysis Activity • 1. Algorithm development and testing starts locally and small • Single computer → small cluster • 2. Grows into a large data and computation task • Large cluster → the Grid • 3. Final analysis is again more local and small • Small cluster → single computer • Ideally the analysis activity should be a continuum in terms of tools, software frameworks, models, etc. • LHC experiments are starting to offer this to their physicists • Ganga is a good example: from inside the same session you can run a large data job and do the final analysis with the results

  13. Input Data • The user specifies what data to run the analysis on using VO-specific dataset catalogs • The specification is based on a query • The front-end interfaces provide functionality to facilitate the catalog queries • Each experiment has developed event tag mechanisms for sparse input data selection • Data is scattered over the world • The computing model and policies of the experiment dictate the placement of data • Read-only data with several replicas • Portions of the data are copied to local clusters (CAF, T3, etc.) for local access
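
  As an illustration of query-based input specification, the sketch below uses an entirely hypothetical catalog client and query syntax (no particular VO catalog is implied); the point is simply that the user asks a question instead of listing files by hand:

      # Hypothetical: 'DatasetCatalog', the query string and the returned paths
      # are invented for illustration; each VO has its own catalog and interface.
      class DatasetCatalog:
          def query(self, expression):
              # A real client would contact the VO catalog service here.
              print("querying catalog with:", expression)
              return ["/store/data/run1234/file_%03d.root" % i for i in range(3)]

      catalog = DatasetCatalog()
      files = catalog.query("period=2009A and stream=Muon and tag=GoodRunList")
      print(len(files), "files matched the query; first one:", files[0])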

  14. Output Data • Small output data files such as histogram files are returned to the client session (using the sandbox) • Usually limited to a few MB • Large output files are typically put in Storage Elements (e.g. Castor), registered in the grid file catalogue (e.g. LFC) and can be used as input for other Grid jobs (iterative process) • Tools such as CRAB and Ganga (ATLAS) provide strong links with the VOs’ Distributed Data Management/Transfer systems (e.g. DQ2, PhEDEx) to place output where the user wants it

  15. Submission Transparency • The goal is to make it easy for physicists • Distributed analysis should be as simple as doing it locally • Which is already complicated enough!! • Hiding the technical details is a must • In Ganga, changing the back-end from LSF to DIRAC requires changing one parameter • In ALICE, changing from PROOF to AliEn requires changing one name and providing an AliEn plugin configuration • In CRAB, changing from local batch to gLite requires a single parameter change in the configuration file
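
  To make the ‘one parameter’ point concrete, here is a minimal sketch of such a switch in a Ganga-style Python session. The exact application and backend options depend on the Ganga version and the VO configuration, so treat the names as indicative rather than exact:

      # Ganga-style sketch (to be typed inside a Ganga session, not plain Python).
      j = Job()
      j.application = Executable(exe="my_analysis.sh")   # illustrative application
      j.backend = LSF()        # run on the local LSF batch system
      # j.backend = Dirac()    # ...or send the same job to the Grid via DIRAC
      j.submit()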

  16. Transparency Example: PROOF (Andrei Gheata) [Diagram: on the client, an analysis manager (AM) with its ordered list of tasks (task1 … taskN) and their input/output containers is configured in MyAnalysis.C and started with AM->StartAnalysis(“proof”); the PROOF master ships the manager, wrapped in a TSelector (AliAnalysisSelector), to the workers, which process the input chain via SlaveBegin()/Process()/SlaveTerminate(); the per-worker output lists (O1, O2, … On) are merged back into the client’s output list in Terminate()]

  17. ATLAS Physicist Choices • A large variety of frontends and backends • It is great, but it may add confusion and complicate user support

  18. Managing the Software • Distributed analysis relies on the software installed on the remote nodes (e.g. local cluster, Grid) • The experiment’s officially released software is taken care of by the VOs • Installation procedures for the big VOs are well oiled • A problem for small VOs / individuals • Physicists’ add-ons and private analysis algorithms need to be sent along with the job • Every user tool provides some level of support for this • Exact matching of the OS version/compiler (platform) is required when sending binaries • The latter imposes strong constraints on the platform uniformity of the different facilities • Local interactive service → local facility → Grid
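
  The sketch below illustrates, in plain Python and independently of any particular front-end, the packaging step these tools automate: collecting the user’s locally built libraries and options into an archive that is shipped with the job. The paths and file patterns are invented for the example:

      import glob
      import tarfile

      # Hypothetical locations of the user's privately built add-ons.
      user_files = glob.glob("InstallArea/lib/*.so") + glob.glob("options/*.py")

      # Pack them into a sandbox archive that travels with the job; the remote
      # node must still be binary compatible (OS/compiler) for the .so files.
      with tarfile.open("user_sandbox.tar.gz", "w:gz") as tar:
          for path in user_files:
              tar.add(path)
      print("packed", len(user_files), "files into user_sandbox.tar.gz")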

  19. On-demand Install with CernVM • CernVM is a Virtual Appliance that provides a complete, portable and easy-to-configure user environment for developing and running analysis locally and on the Grid, independently of the physical software and hardware platform • It comes with a read-only file system (CVMFS) optimized for software distribution • Only a small fraction of the software is actually used (~10%) • Very aggressive local caching, web proxy caches (squids) • Operational in off-line mode [Diagram: CernVM instances mounting CVMFS over HTTPS through LAN/WAN proxy caches]

  20. Virtualization Role • The CernVM platform is starting to be used by physicists to develop/test/debug data analysis • With a laptop you carry the complete development environment and the Grid UI with you • Managing all phases of the analysis from the same ‘window’ • Ideally the same environment should be used to execute their jobs on the Grid • Validation with large datasets • Decoupling application software from system software and hardware • Can the existing ‘Grid’ be adapted to CernVM?

  21. Job Splitting • Job splitting (parallelization) is essential to be able to analyze large data samples in a limited time • Very long-running jobs are more unreliable • Tools such as PROOF split the analysis job dynamically at the sub-file level (packets), offering [quasi-]interactivity to the user • All the other Grid submission tools provide parallelization by splitting the list of input files • Sub-jobs are constrained by the input data location • The more difficult part is the merging of the results • Standard automation for the most common cases • User intervention for the more complicated ones
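
  A minimal sketch of the file-list splitting these tools perform (the file names, chunk size and the trivial ‘merge’ are invented for illustration; real splitters also take the input data location into account):

      def split_by_files(input_files, files_per_subjob):
          """Split a list of input files into per-sub-job file lists."""
          return [input_files[i:i + files_per_subjob]
                  for i in range(0, len(input_files), files_per_subjob)]

      files = ["file_%03d.root" % i for i in range(10)]     # invented file names
      subjobs = split_by_files(files, files_per_subjob=3)
      print(len(subjobs), "sub-jobs with sizes", [len(s) for s in subjobs])

      # Merging: here each sub-job 'result' is just a count; in reality it means
      # combining histograms/ntuples, which is where user intervention may be needed.
      results = [len(s) for s in subjobs]
      print("merged result:", sum(results))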

  22. Using Multi-Core Architectures • The majority of today’s computing resources are based on multi-core architectures • Exploiting these multi-core architectures (MT, MP) can optimize the use of resources (memory, I/O) • See V. Innocente’s presentation • Submitting a single job per node that utilizes all available cores can be advantageous • Efficient in resources, mainly increasing the fraction of shared memory • Scales down the number of jobs that the WMS needs to handle
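
  As an illustration of the ‘one job per node using all cores’ idea, the plain-Python sketch below forks one worker per core with multiprocessing. The per-file ‘analysis’ is a stand-in; real frameworks obtain the shared-memory benefit by forking after loading common data such as conditions and geometry:

      import multiprocessing as mp
      import os

      def analyze(filename):
          # Stand-in for running the real analysis on one input file.
          return (filename, len(filename))

      if __name__ == "__main__":
          files = ["file_%03d.root" % i for i in range(8)]   # invented inputs
          ncores = os.cpu_count() or 1
          with mp.Pool(processes=ncores) as pool:            # one worker per core
              results = pool.map(analyze, files)
          print("processed", len(results), "files using", ncores, "cores")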

  23. Analysis Trains • Grouping data analyses is a way to optimize when going over a large part of, or the full, dataset • Requires the support of the framework (a model) • …and some discipline • Examples: • ALICE is using the AliAnalysisManager framework to optimize the CPU/IO ratio (85% savings reported) • LHCb is grouping pre-selections in their stripping jobs
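
  The sketch below illustrates the train pattern in generic Python (it is not the AliAnalysisManager API): several independent analysis tasks are attached to one manager so that each event is read once and handed to all of them, amortising the I/O cost:

      class AnalysisTrain:
          def __init__(self):
              self.tasks = []                 # independent user analyses ("wagons")

          def add_task(self, task):
              self.tasks.append(task)

          def run(self, events):
              for event in events:            # the expensive event I/O happens once...
                  for task in self.tasks:     # ...and every attached task sees the event
                      task.process(event)

      class CountAbovePt:
          def __init__(self, threshold):
              self.threshold, self.count = threshold, 0

          def process(self, event):
              if event["pt"] > self.threshold:
                  self.count += 1

      t20, t40 = CountAbovePt(20.0), CountAbovePt(40.0)
      train = AnalysisTrain()
      train.add_task(t20)
      train.add_task(t40)
      train.run([{"pt": 15.0}, {"pt": 25.0}, {"pt": 50.0}])   # invented events
      print(t20.count, t40.count)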

  24. Resource Estimation • At the time of HEPCAL-II resource estimation was an important issue • How much CPU time will this analysis take, what will be the output data size, etc. • In practice physicists can estimate resources pretty well, since test analyses are performed with small data samples before submitting large jobs • Proper reporting of the ‘cost’ of each job in standardized units could facilitate this estimation • In the old times of CERNVM a job summary with the CPU time in ‘CERN units’ was printed for each job
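
  The estimation physicists do in practice is essentially a linear extrapolation from a small test run; the numbers below are invented purely to show the arithmetic:

      # Invented measurements from a test run over 1000 events.
      test_events = 1000
      test_cpu_seconds = 120.0
      test_output_mb = 4.0

      # Extrapolate to the full sample (invented size).
      full_events = 50_000_000
      scale = full_events / test_events
      print("estimated CPU time: %.0f hours" % (test_cpu_seconds * scale / 3600.0))
      print("estimated output size: %.0f GB" % (test_output_mb * scale / 1024.0))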

  25. Handling Job Failures • Job failures are very common (e.g. ~45% of the CMS analysis jobs do not terminate successfully) • The reasons are very diverse (data access, stalled jobs, data upload, application failures, …) • Proper reporting of job failures is essential for diagnosing and handling them efficiently • Detailed monitoring, log files, etc. • Handling failures may imply providing corrections in configurations or code, re-submission, managing site blacklists, etc. • Automated corrective actions can be handled by servers (e.g. CRAB) • Scripting support is available to users (e.g. Ganga):
      [1]: jobs.select(status='failed').resubmit()
      [2]: jobs.select(name='testjob').kill()
      [3]: newjobs = jobs.select(status='new')
      [4]: newjobs.select(name='urgent').submit()
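
  Building on the Ganga-style registry calls above, a user-level automation of the ‘resubmit failed jobs, but not forever’ policy might look like the following sketch (assuming a Ganga-like jobs registry; the retry bookkeeping is purely illustrative):

      # To be typed inside a Ganga-like session; 'jobs' is the session's job registry.
      MAX_RETRIES = 3
      retries = {}                             # our own bookkeeping, keyed by job id

      for j in jobs.select(status='failed'):
          n = retries.get(j.id, 0)
          if n < MAX_RETRIES:
              retries[j.id] = n + 1
              j.resubmit()                     # try again
          else:
              print("giving up on job", j.id)  # needs manual inspection of the logs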

  26. Monitoring • Monitoring is essential for the users and also for the administrators • Physicists may use the web-based interfaces to find out information about their jobs • Each WMS has developed very complete monitoring tools • The details available are really impressive (e.g. Panda Monitor) • Often the connection with the submission tools is poor • Not well integrated

  27. Application Awareness • If the front-end submission tool understands the analysis application [framework] it can become extremely helpful to the users • E.g. the Ganga application component can: • Set up the correct environment, collect user shareable libraries, analyze configuration files and follow dependencies, determine inputs and outputs and register them automatically, etc. • The technical solution to achieve this is to implement ‘plugins’ for each type of application
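
  A hypothetical sketch of what such an application plugin interface can look like (the class and method names are invented and are not Ganga’s actual plugin API); the front-end calls the plugin to prepare the job before submission:

      # Hypothetical plugin interface; all names are invented for illustration.
      class ApplicationPlugin:
          def prepare_environment(self, job):
              raise NotImplementedError

          def collect_user_libraries(self, job):
              raise NotImplementedError

          def determine_io(self, job):
              raise NotImplementedError

      class MyFrameworkPlugin(ApplicationPlugin):
          def prepare_environment(self, job):
              job["env"] = {"FRAMEWORK_VERSION": "v1r0"}            # invented setup

          def collect_user_libraries(self, job):
              job["sandbox"] = ["InstallArea/lib/libUserAlg.so"]    # invented path

          def determine_io(self, job):
              job["inputs"], job["outputs"] = ["input.root"], ["hist.root"]

      job = {}
      plugin = MyFrameworkPlugin()
      for step in (plugin.prepare_environment,
                   plugin.collect_user_libraries,
                   plugin.determine_io):
          step(job)
      print(job)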

  28. Summary • Fundamentally the way analysis is being done has not changed very much • The initial dream that the Grid would dramatically change the paradigm has not happened • Parts of the analysis, with large data jobs, will be done in batch and parts will be done more locally and interactively • Each collaboration has developed tools to cope with the large data and computational requirements and to simplify the life of physicists • It turned out that the model/architecture of these tools is very similar, but they are not held in common • The number of users of these tools is increasing rapidly
