
Distributed Analysis


Presentation Transcript


  1. Distributed Analysis Craig E. Tull HCG/NERSC/LBNL (US) ATLAS Grid Software Workshop BNL - May 7, 2002

  2. Distributed Analysis • How to participate in new PPDG activity in distributed analysis • Grid portals • Grappa: brief status, and plans for the remainder of 2002 • Ganga: Karl Harrison will give a 15-minute overview (PowerPoint, PDF)

  3. Distributed Processing Models • Batch-like Processing (ala WP1) • Distributed Single Event (MPP) • Client-Server (interactive) • WAN Data Access (AMS, Clipper) • File Transfer and Local Processing (GDMP) • Agent-based Processing (distributed control) • Check-Point & Migrate (save & restore) • Scatter & Gather (parallel events) • Move the data or move the executable? • No experiment is planning to write PetaBytes of Code!

  4. ATLAS Distributed Processing Model • At this point, it is still not clear what the final ATLAS distributed computing model will be. Although newer ideas like Agent-based Processing have a great deal of appeal, they are as yet unproven in a large-scale production environment. • A conservative approach would be some combination of Batch-like Processing and File Transfer and Local Processing for batch jobs, with perhaps a Client-Server-like approach for interactive jobs (w/ some Scatter/Gather?).

  5. Data Access Patterns • Data access patterns of physics jobs also heavily influence our thinking about interacting with the Grid. It is likely that all possible data access patterns will be extant in ATLAS data processing at various stages in that processing. We may find that some data access patterns lend themselves to efficient use of the Grid much better than others. • Data access patterns include: • Sequential Access (reconstruction) • Random Access (interactive analysis) • File/Data Set Driven (LFN-friendly) • Navigational Driven (OODB-like) • Query Driven (SQL/OQL/JDO/etc.)
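As a toy illustration of two of these patterns, the sketch below contrasts a sequential, reconstruction-style event loop with a query-driven selection. The in-memory event list and the ptMiss cut are invented for this example; they are not ATLAS data structures or interfaces.

```python
# Illustrative sketch only: two of the access patterns listed above.
# The toy "dataset" and the ptMiss selection are hypothetical.

events = [{"id": i, "ptMiss": 10.0 * i} for i in range(5)]   # toy dataset

# Sequential access (reconstruction-style): touch every event, in order.
for evt in events:
    evt["reconstructed"] = True          # stand-in for real reconstruction

# Query-driven access: only events satisfying a selection are fetched.
selected = [evt for evt in events if evt["ptMiss"] > 30.0]
print(len(selected), "events pass the selection")
```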

  6. Athena/Grid Interface • For the programmatic interface to Grid services, we are thinking in terms of Gaudi services to capture and present the functionality of the grid services (not necessarily a one-to-one mapping, BTW). • I think it is important at this stage (maybe forever) to ensure that the framework is "grid-capable" without being "grid-dependent". I.e., we should always be able to run without grid services available. • Gaudi's component architecture makes this approach to using the grid quite natural. • How do we switch between Grid/non-Grid?
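As a rough illustration of how a component architecture makes the Grid/non-Grid switch natural, the sketch below codes the framework against an abstract catalogue interface and selects a grid-backed or purely local implementation at configuration time. The service and method names (IFileCatalogSvc, LocalCatalogSvc, GridCatalogSvc, resolve) are hypothetical illustrations, not actual Gaudi/Athena components.

```python
# Minimal sketch of "grid-capable, not grid-dependent" service selection.
# All names here are invented for illustration.

from abc import ABC, abstractmethod


class IFileCatalogSvc(ABC):
    """Abstract interface that framework code is written against."""

    @abstractmethod
    def resolve(self, lfn: str) -> str:
        """Map a logical file name to a physical file name."""


class LocalCatalogSvc(IFileCatalogSvc):
    """Fallback implementation: no grid services required."""

    def __init__(self, catalog):
        self.catalog = catalog

    def resolve(self, lfn: str) -> str:
        return self.catalog[lfn]


class GridCatalogSvc(IFileCatalogSvc):
    """Grid-backed implementation (replica catalogue lookup stubbed out)."""

    def resolve(self, lfn: str) -> str:
        # A real service would query a replica catalogue here.
        return f"gsiftp://some.grid.site/data/{lfn}"


def make_catalog_svc(use_grid: bool) -> IFileCatalogSvc:
    """Job-options-style switch: the algorithm code never changes."""
    if use_grid:
        return GridCatalogSvc()
    return LocalCatalogSvc({"evts.0001": "/local/data/evts.0001.root"})


if __name__ == "__main__":
    svc = make_catalog_svc(use_grid=False)
    print(svc.resolve("evts.0001"))
```

The point of the sketch is only the switch: algorithms depend on the interface, and the grid-aware implementation is chosen (or not) at configuration time.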

  7. Athena/Gaudi - ATLAS/LHCb Collaboration • Some collaboration is already occurring. UK GridPP funding exists for: • Installation Kit • Many common tools & problems • CMT, Gaudi, AFS • Controlling Interface • How to interact with WP1's JCL • "GANGA"-like Concept • Grid Services API • Grid Services should be presented as Gaudi Services

  8. Interfacing to the GRID • Making the framework work in the GRID environment requires: • Collecting use-cases and architectural design • Identifying the [Gaudi/Athena] components that need to be adapted/re-implemented to make use of the Grid services • Started to identify areas of work: • Data access (persistency) • Event Selection • GANGA (job configuration & monitoring, resource estimation & booking, job scheduling, etc.) [Diagram: GANGA GUI mediating between Grid services and the GAUDI/Athena program, exchanging job options, algorithms, monitoring information, histograms, and results]

  9. Ganga Scenarios • Original Proposal - October 2001 • Scenario 1 • User makes a "high-level" selection of data to process and defines processing job. • "High-level" means based on event characteristics and not on file or event identity. • High-level event selection uses ATLAS Bookkeeping DataBase (similar to current LArC Bookkeeping database or BNL's Magda) to select event & logical file identities. • Construct JDL for WP1 using LFNs • Construct jobOptions.py using PFNs (w/ WP2) • Submit job(s) using JDL & jobOptions.py in sandbox (a toy sketch of these steps follows after this slide). • Scenario 2 - The same except jobOptions.py now contains LFNs. This requires the Replica Service API-enabled EvtSelector or ConversionSvc.
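A minimal sketch of Scenario 1's mechanics, assuming a toy bookkeeping lookup that returns (LFN, PFN) pairs: the LFNs go into a WP1-style JDL and the PFNs into a generated jobOptions.py. The JDL attributes, file layout, and helper names are illustrative assumptions, not the actual Ganga or WP1 interfaces.

```python
# Toy version of Ganga Scenario 1: bookkeeping query -> JDL + jobOptions.py.
# File names, paths, and the jobOptions content are illustrative only.

def select_events(bookkeeping_query):
    """Stand-in for the ATLAS bookkeeping lookup: returns (LFN, PFN) pairs."""
    return [
        ("lfn:atlas.dc1.00042.evts", "/stage/atlas/dc1/00042.evts.root"),
        ("lfn:atlas.dc1.00043.evts", "/stage/atlas/dc1/00043.evts.root"),
    ]


def write_jdl(lfns, executable="athena", path="job.jdl"):
    """Emit a minimal WP1-style JDL that references logical file names."""
    lines = [
        f'Executable = "{executable}";',
        'Arguments = "jobOptions.py";',
        'InputSandbox = {"jobOptions.py"};',
        "InputData = {" + ", ".join(f'"{lfn}"' for lfn in lfns) + "};",
    ]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")


def write_joboptions(pfns, path="jobOptions.py"):
    """Emit a jobOptions.py whose event selector points at physical files."""
    with open(path, "w") as f:
        f.write("EventSelector.InputCollections = [\n")
        for pfn in pfns:
            f.write(f'    "{pfn}",\n')
        f.write("]\n")


if __name__ == "__main__":
    pairs = select_events("ptMiss > 100 GeV")   # hypothetical selection
    write_jdl([lfn for lfn, _ in pairs])
    write_joboptions([pfn for _, pfn in pairs])
```

Scenario 2 would differ only in that write_joboptions would be given the LFNs directly, leaving resolution to the Replica-Service-aware event selector.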

  10. CS-11 Analysis Tools “interface and integrate interactive data analysis tools with the grid and to identify common components and services.” First: • identify appropriate individuals to participate in this area, within and from outside of PPDG – several identified from each experiment • assemble a list of references to white papers, publications, tools and related activities – available on http://www.ppdg.net/pa/ppdg-pa/idat/related-info.html • produce a white paper style requirements document as an initial view of a coherent approach to this topic – draft circulated by June • develop a roadmap for the future of this activity – at/post face-to-face meeting

  11. Analysis of large datasets over the Grid • Dataset does not fit on disk: Need access s/w to couple w/ processing; Distributed management implementing global experiment and local site policies • Demand significantly exceeding available resources: Queues always full. When/how to move job and/or data; Global optimization of (or at least not totally random) total system throughput without too many local constraints (e.g. single points of failure) • Data and Job Definition – in physicist terminology. For D0-SAM, a web and command-line interface to specify Dataset + Dataset Snapshots. Saved in RDBMS for tracking and reuse. Many “dimensions” or attributes can be combined to define a dataset; Definitions can be iterative, extended; New versions defined at a specific date; Transforms dataset definition into SQL query to the database. Saves the transform definition (a toy sketch of this transform follows after this slide). • Distributed processing and control: Schedule, control and monitor access to shared resources – CPU, disk, network. E.g. all D0-SAM job executions pass through a SAM wrapper and are tracked in the database for monitoring and analysis. • Faults of all kinds occur: Preemption, exceptions, resource unavailability, crashes; Checkpointing and Restart; Workflow management to complete failed tasks; Error reporting and diagnosis • Chaotic and large spikes in load: e.g. analysis needs vary widely and are difficult to predict – especially at a sniff of a new discovery. • Estimation, Prediction, Planning, Partial Results - GriPhyN research areas.
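A hedged sketch of the D0-SAM idea described above: attribute constraints expressed in physicist terms are transformed into an SQL query, and the definition is saved so it can be reused or extended later. The table name, column names, and in-memory store are invented for illustration; they are not SAM's schema.

```python
# Toy dataset-definition -> SQL transform, loosely modelled on the D0-SAM
# description above. Schema and attribute names are hypothetical.

def dataset_to_sql(constraints):
    """Turn {'trigger': 'EM_HI', 'run_min': 151000, ...} into an SQL query."""
    clauses = []
    if "trigger" in constraints:
        clauses.append(f"trigger_name = '{constraints['trigger']}'")
    if "run_min" in constraints:
        clauses.append(f"run_number >= {constraints['run_min']}")
    if "run_max" in constraints:
        clauses.append(f"run_number <= {constraints['run_max']}")
    where = " AND ".join(clauses) if clauses else "1=1"
    return f"SELECT file_name FROM data_files WHERE {where}"


# Tiny stand-in for the RDBMS that tracks definitions so they can be
# versioned, reused, and extended.
definitions = {}

def save_definition(name, constraints):
    definitions[name] = {"constraints": constraints,
                         "sql": dataset_to_sql(constraints)}

save_definition("em_hi_2002a", {"trigger": "EM_HI", "run_min": 151000})
print(definitions["em_hi_2002a"]["sql"])
```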

  12. References supplied by PPDG participants to date • Proposal to NSF for CMS Analysis: an Interactive Grid-Enabled Environment (CAIGEE) - Julian Bunn, Caltech • Grid Analysis Environment work at Caltech, April 2002 - Julian Bunn, Caltech • Views of CMS Event Data - Koen Holtman, Caltech • ATLAS Athena & Grid - Craig Tull, LBNL • CMS Distributed analysis workshop, April 2001 - Koen Holtman, Caltech • PPDG-8, Comparison of datagrid tools capabilities - Reagan Moore, SDSC • Interactivity in a Batched Grid Environment - David Liu, UCB • Deliverables document from CrossGrid WP4 • Portals, UI examples, etc. (links) • GENIUS: Grid Enabled web eNvironment for site Independent User job Submission - Roberto Barbera, INFN • SciDAC CoG Kit (Commodity Grid Kit) • ATLAS Grid Access Portal for Physics Applications • XCAT, a Common Component Architecture implementation

  13. Tools etc • Java Analysis Studio (JAS) - Tony Johnson, SLAC • Distributed computing with JAS (prototype) (link) - Tony Johnson, SLAC • Abstract Interfaces for Data Analysis (AIDA) (home) - Tony Johnson, SLAC • BlueOx: Distributed Analysis with Java (home) - Jeremiah Mans, Princeton • Parallel ROOT Facility, PROOF (intro, slides, update) - Fons Rademakers, CERN • Integration of ROOT and SAM (info, example) - Gabriele Garzoglio, FNAL • Clarens Remote Analysis (info) - Conrad Steenberg, Caltech • IMW: Interactive Master-Worker Style Parallel Data Analysis Tool on the Grid (link) - Miron Livny, Wisconsin • SC2001 demo of Bandwidth Greedy Grid-enabled Object Collection Analysis for Particle Physics (link) - Koen Holtman, Caltech

  14. CS-11 - Short term Status • The requirements document is now in the process of being outlined – Joseph Perl, Doug Olson – based on posted contributions. • A workshop is being planned to bring people together at LBL in mid June (18? 19?). We won't know more specifics until after the meeting. Clearly, experiments are starting to think about Remote Analysis (D0), Analysis for Grid simulation production (CMS), and ATLAS/ALICE. • Many experiments (will) use ROOT (& Carrot? PROOF?). In conjunction with the Run2 visit to Fermilab, Rene will have discussions with PPDG and CS groups in the last week of May. • Need to identify the narrow band in which PPDG can be a contributor rather than just adding to the meeting load: Keep to our mission of using/extending existing tools "for real" over the short/medium term (but encourage and do not derail needed longer term development work!)

  15. Generic data flow in HENP • [Diagram: generic HENP data-flow picture with scale annotations on its stages – “$100M, 10 yr, 100 people”; “1 yr, 50 people, 5x/yr”; “Skims”/“microDST production”, “10 yr, 20 people”, “filtering chosen to make this a convenient size”; “1 mo, 1 person, 100x/yr” – and the question “What's going on in this box?” pointing at the analysis stage] • Is this picture anywhere close to reality? Many groups are grappling with requirements now.

  16. PPDG Plan - Distributed analysis services • user transparent data analysis; automatic and transparent optimization of processing locale, and transparent return of result • principal initial application: analysis executing at major center under transparent local control from home institute

  17. PPDG Plan - Distributed analysis services • Components: • Job specification tools: • tools to specify job conditions and requests, record them (cf. data signature catalog), and transpose them into actual job scripts, config files, etc. • Distributed job management: • automatic and transparent optimization of processing locale based on data and resource availability • request management • resource optimization • interaction and integration with job control services

  18. PPDG Plan - Distributed analysis services • Components: • transparent local availability of (selected compact) results, further transparent distributed processing of results • results browsing, catenation, return • management of additional processing and/or reprocessing • selective public/private cataloguing of results • Cost-aware optimized data retrieval: • tools and services providing efficient access to distributed data optimized across all retrieval requests, with flexible support for access policies and priorities established by the experiments.
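The cost-aware retrieval component could look roughly like the sketch below: given several replicas of a logical file and a per-site cost estimate, pick the cheapest replica that an experiment-defined policy allows. The site names, cost numbers, and policy hook are invented for illustration.

```python
# Toy "cost-aware optimized data retrieval": choose the cheapest allowed
# replica. Costs, sites, and the policy function are hypothetical.

replicas = {
    "lfn:esd.00042": [
        {"site": "CERN", "cost": 8.0},   # e.g. WAN transfer estimate
        {"site": "BNL",  "cost": 2.5},   # e.g. local mass-storage stage
    ],
}

def allowed(site, priority):
    """Stand-in for an experiment-defined access policy."""
    return not (site == "CERN" and priority == "low")

def choose_replica(lfn, priority="normal"):
    candidates = [r for r in replicas[lfn] if allowed(r["site"], priority)]
    return min(candidates, key=lambda r: r["cost"])

print(choose_replica("lfn:esd.00042"))   # -> the BNL replica in this toy case
```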

  19. PPDG Plan - Distributed analysis services • Components: • Services for user-owned data management • Grid enabling of analysis, statistics, graphics tools • Integrated and deployed distributed analysis services: • testbed and production services deployed in the experiments. • In ATLAS: CERN <--> BNL Tier 1 <--> Tier N • Timescale: Basic remote submission services: during 2001. Initial comprehensive implementation: 2003 (ATLAS MDC2)


  21. PPDG Plan • ATLAS short-term workplan for distributed grid architecture performing increasingly complex operations (L. Perini talk 12/00): • submit event generation and simulation locally and remotely; • store events locally and remotely; • access remote data (e.g., background events, stored centrally); • (partially) duplicate event databases; • schedule job submission; • allocate resources; • monitor job execution; • optimize performance.
