
Data Analysis – Present & Future
Nick Brook, University of Bristol



  1. Data Analysis – Present & Future, Nick Brook, University of Bristol
  • Generic requirements & introduction
  • Experiment-specific approaches
  Nick Brook – 4th LHC Symposium

  2. Complexity of the Problem
  • Detectors: ~2 orders of magnitude more channels than today
  • Triggers must choose correctly only 1 event in every 400,000
  • High-level triggers are software-based
  • Computer resources will not be available in a single location
  Nick Brook – 4th LHC Symposium
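
  To put the 1-in-400,000 rejection in context, a short back-of-the-envelope estimate shows the event rate the trigger chain must deliver to storage. The 40 MHz bunch-crossing rate is an assumption added here for illustration; it is not quoted on the slide.

    # Rough trigger-rate estimate; the 40 MHz bunch-crossing rate is an
    # assumption added for illustration, not a figure from the slide.
    crossing_rate_hz = 40.0e6        # nominal LHC bunch-crossing rate (assumed)
    rejection = 400000               # only 1 event in every 400,000 is kept
    output_rate_hz = crossing_rate_hz / rejection
    print("rate to storage: ~%.0f Hz" % output_rate_hz)   # ~100 Hz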

  3. Complexity of the Problem
  • Physicists are all over the world
  • Major challenges associated with:
  • communication and collaboration at a distance
  • distributed computing resources
  • remote software development and physics analysis
  Nick Brook – 4th LHC Symposium

  4. Analysis Software System (2 GHz PC ≈ 700 SI2000)
  • Reconstruction – experiment-wide activity (10^9 events): 30 kSI2000 sec/event, 1 job per year; re-processing (new detector calibrations or understanding): 30 kSI2000 sec/event, 3 jobs per year
  • Monte Carlo production: 50 kSI2000 sec/event
  • Selection – ~20 groups' activity (10^9 → 10^7 events): 0.25 kSI2000 sec/event, ~20 jobs per month; iterative selection once per month, trigger-based and physics-based refinements
  • Analysis – ~25 individuals per group (10^6–10^8 events): 0.1 kSI2000 sec/event, ~500 jobs per day, ~1 time per day; different physics cuts & MC comparison; algorithms applied to data to get results
  Nick Brook – 4th LHC Symposium
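
  As a rough illustration of why these figures force a distributed computing model, the slide's own numbers can be converted into CPU requirements. The only conversion used is the 2 GHz ≈ 700 SI2000 figure from the slide; the sketch below is an order-of-magnitude estimate, nothing more.

    # Order-of-magnitude CPU estimate from the figures on this slide.
    SI2000_PER_CPU = 700.0            # a 2 GHz PC ~ 700 SI2000 (from the slide)
    SECONDS_PER_YEAR = 3.15e7

    def cpu_years(n_events, ksi2000_sec_per_event):
        """CPU-years needed to process n_events at the quoted cost per event."""
        total_si2000_sec = n_events * ksi2000_sec_per_event * 1e3
        return total_si2000_sec / SI2000_PER_CPU / SECONDS_PER_YEAR

    # One reconstruction pass over 10^9 events at 30 kSI2000 sec/event:
    print("reconstruction: ~%.0f CPU-years per pass" % cpu_years(1e9, 30.0))
    # One selection pass over 10^9 events at 0.25 kSI2000 sec/event:
    print("selection:      ~%.0f CPU-years per pass" % cpu_years(1e9, 0.25))

  A single reconstruction pass comes out at roughly a thousand CPU-years, which is why the resources cannot sit in a single location.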

  5. Analysis Software System
  • Coherent set of basic tools and mechanisms behind a stable user interface
  • Components: data browser, generic analysis tools, analysis job wizards, detector/event display, data-management tools, software development and installation
  • Experiment tools and LCG tools built on a common framework (reconstruction, simulation)
  • GRID: distributed data store & computing infrastructure
  Nick Brook – 4th LHC Symposium

  6. Philosophy
  • We want to perform analysis from day 1 (now)!
  • Building on Grid tools/concepts to simplify the distributed environment
  [Figure: Gartner hype cycle – hype vs. time through the Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment and Plateau of Productivity]
  Nick Brook – 4th LHC Symposium

  7. Data Challenges & Production Tools
  • All experiments have well-developed production tools for co-ordinated data challenges, e.g. DIRAC – Distributed Infrastructure with Remote Agent Control (see the CHEP talks)
  • Tools provide management of workflows, job submission, monitoring, book-keeping, …
  Nick Brook – 4th LHC Symposium

  8. AliEn (ALIce ENvironment) is an attempt to gradually approach and tackle computing problems at LHC scale and implement the ALICE Computing Model. Main features:
  • Distributed file catalogue built on top of an RDBMS
  • File replica and cache manager with interface to MSS (CASTOR, HPSS, HIS, …)
  • AliEnFS – Linux file system that uses the AliEn file catalogue and replica manager
  • SASL-based authentication which supports various authentication mechanisms (including Globus/GSSAPI)
  • Resource Broker with interface to batch systems (LSF, PBS, Condor, BQS, …)
  • Various user interfaces: command line, GUI, Web portal
  • Package manager (dependencies, distribution, …)
  • Metadata catalogue
  • C/C++/perl/java API
  • ROOT interface (TAliEn)
  • SOAP/Web Services
  • EDG-compatible user interface: common authentication, compatible JDL (Job Description Language) based on Condor ClassAds (an illustrative JDL is sketched below)
  Nick Brook – 4th LHC Symposium
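
  For readers unfamiliar with ClassAd-style JDL, the snippet below assembles an illustrative EDG-style job description. The executable, arguments and file names are invented for the example and are not taken from AliEn; the Python wrapper is only there to keep the examples in a single language.

    # Illustrative ClassAd-style JDL; the job details are hypothetical and do
    # not reproduce an actual AliEn or EDG job description.
    jdl = """
    Executable    = "aliroot.sh";
    Arguments     = "--run 1234";
    StdOutput     = "stdout.log";
    StdError      = "stderr.log";
    InputSandbox  = {"aliroot.sh", "Config.C"};
    OutputSandbox = {"stdout.log", "stderr.log", "galice.root"};
    """
    open("job.jdl", "w").write(jdl)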

  9. AliEn Architecture
  • External software: RDBMS (MySQL) with DBD/DBI, LDAP, external libraries, Perl core and Perl modules, SOAP/XML, V.O. packages & commands
  • AliEn core components & services: database proxy (ADBI), authentication, file & metadata catalogue, Resource Broker (RB), Computing Element (CE), Storage Element (SE), Config Mgr, Package Mgr, Logger, (…)
  • Interfaces (low level to high level): API (C/C++/perl), FS, user interface (CLI, GUI, Web portal), user application
  Nick Brook – 4th LHC Symposium

  10. ALICE have deployed a distributed computing environment which meets their experimental needs
  • Simulation & reconstruction
  • Event mixing
  • Analysis
  • Using Open Source components (representing 99% of the code), internet standards (SOAP, XML, PKI, …) and a scripting language (perl) has been a key element – quick prototyping and very fast development cycles
  • Close to finalizing the AliEn architecture and API – OpenAliEn?
  Nick Brook – 4th LHC Symposium

  11. PROOF – The Parallel ROOT Facility
  • Collaboration between the core ROOT group at CERN and the MIT Heavy Ion Group
  • Part of, and based on, the ROOT framework; makes heavy use of ROOT networking and other infrastructure classes
  • Currently no external technologies
  • Motivation: interactive analysis of very large sets of ROOT data files on a cluster of computers; speed up query processing by employing parallelism; extend from a local cluster to a wide-area "virtual cluster" – the GRID; analyse a globally distributed data set and get back a "single" result from a "single" query
  Nick Brook – 4th LHC Symposium

  12. PROOF – Parallel Script Execution
  [Figure: a local PC running ROOT connects to a remote PROOF cluster – a master server plus slave servers on node1–node4, each holding *.root files and configured via #proof.conf; the script ana.C is shipped out and stdout/objects flow back through TFile/TNetFile]
  Local session:
  $ root
  root [0] .x ana.C
  PROOF session:
  $ root
  root [0] tree->Process("ana.C")
  root [1] gROOT->Proof("remote")
  root [2] chain->Process("ana.C")
  Nick Brook – 4th LHC Symposium
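
  The same session can also be driven from Python through the PyROOT bindings. This is only a sketch mirroring the C++ commands above: it assumes a ROOT version of the same era in which gROOT->Proof("remote") is available, and the tree name, file paths and cluster name are placeholders.

    # Minimal PyROOT sketch of the session above; tree name, file paths and the
    # "remote" cluster name are placeholders. gROOT.Proof simply mirrors the
    # C++ command shown on this slide.
    import ROOT

    chain = ROOT.TChain("T")                     # hypothetical tree name
    chain.Add("root://node1/data/run1.root")     # hypothetical remote files
    chain.Add("root://node2/data/run2.root")

    ROOT.gROOT.Proof("remote")                   # attach the session to the PROOF master
    chain.Process("ana.C")                       # slaves execute ana.C in parallel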

  13. PROOF & the Grid Nick Brook – 4th LHC Symposium

  14. Gaudi – ATLAS/LHCb software framework
  • Application Manager drives a set of Algorithms
  • Transient stores (event, detector and histogram data) are decoupled from the persistent data files via Converters and Persistency Services
  • Services: Event Data Service, Detector Data Service, Histogram Service, JobOptions Service, Particle Property Service, Message Service, and other services
  (a toy sketch of this pattern follows)
  Nick Brook – 4th LHC Symposium
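
  To make the diagram concrete, here is an illustrative sketch in plain Python of the pattern it describes – an application manager drives algorithms, which read and write a transient event store, while a persistency service uses converters to move data in from files. It is not the actual Gaudi API; all names are invented for the illustration.

    # Illustrative only – mimics the Gaudi pattern on this slide, not Gaudi itself.
    class Algorithm:
        """User code: reads/writes the transient event store once per event."""
        def initialize(self): pass
        def execute(self, event_store): raise NotImplementedError
        def finalize(self): pass

    class PersistencyService:
        """Uses converters to populate the transient store from persistent data."""
        def __init__(self, converters): self.converters = converters
        def load_event(self, raw):
            return {name: conv(raw) for name, conv in self.converters.items()}

    class ApplicationManager:
        def __init__(self, algorithms, persistency):
            self.algorithms, self.persistency = algorithms, persistency
        def run(self, raw_events):
            for alg in self.algorithms: alg.initialize()
            for raw in raw_events:
                event_store = self.persistency.load_event(raw)   # transient event store
                for alg in self.algorithms: alg.execute(event_store)
            for alg in self.algorithms: alg.finalize()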

  15. GANGA: Gaudi ANd Grid Alliance – a joint ATLAS and LHCb project
  • GUI layer between the GAUDI program and the collective & resource Grid services: job options and algorithms go in; monitoring, histograms and results come back
  • Based on the concept of a Python bus:
  • use whichever modules are required to provide the full functionality of the interface
  • use Python to glue these modules together, i.e. allow interaction and communication between them (a toy sketch of the idea follows)
  Nick Brook – 4th LHC Symposium
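
  A toy sketch of the "Python bus" idea, with entirely hypothetical module and method names, just to show how Python can glue otherwise independent components together:

    # Toy illustration of a Python software bus; all names are hypothetical.
    class SoftwareBus:
        def __init__(self): self.modules = {}
        def plug(self, name, module): self.modules[name] = module
        def call(self, name, method, *args):
            return getattr(self.modules[name], method)(*args)

    class JobOptionsModule:
        def build(self, application): return {"application": application, "events": 100}

    class SubmitterModule:
        def submit(self, options):
            print("submitting %s" % options)
            return 42   # job id

    bus = SoftwareBus()
    bus.plug("options", JobOptionsModule())
    bus.plug("submitter", SubmitterModule())
    opts = bus.call("options", "build", "DaVinci")
    job_id = bus.call("submitter", "submit", opts)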

  16. GANGA design: Python Software Bus
  • Client-side Python software bus glues together the GUI, the GANGA core module, GaudiPython, PythonROOT, an OS module, LRMS, Athena\GAUDI, the EDG UI and an XML-RPC module
  • Databases: local job DB, job configuration DB, bookkeeping DB, production DB
  • A remote user (client) connects over LAN/WAN through an XML-RPC server to a second Python software bus on the server side, which submits to the GRID
  Nick Brook – 4th LHC Symposium

  17. Current Status
  • Most of the base classes are developed. Serialization of objects (user jobs) is implemented with the Python pickle module.
  • GaudiApplicationHandler can access the Configuration DB for some Gaudi applications (Brunel). It is implemented with the xmlrpclib module. Ganga can create user-customised job options files using this DB. (Both mechanisms are illustrated below.)
  • DaVinci and AtlFast application handlers are implemented.
  • Various LRMS are implemented – jobs can be submitted to, and simple monitoring information retrieved from, several batch systems.
  • Much of the GRID-related functionality is already implemented in GridJobHandler using EDG testbed 1.4 software. Ganga can submit, monitor, and get output from GRID jobs.
  • The JobsRegistry class provides job monitoring via a multithreaded environment based on the Python threading module.
  • GUI available, using the wxPython extension module.
  • ALPHA release available.
  Nick Brook – 4th LHC Symposium
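
  The two standard-library mechanisms named on this slide are easy to illustrate. The Job class and the server URL below are placeholders, not GANGA's actual classes or endpoints; xmlrpclib is the Python 2 name used on the slide (xmlrpc.client in Python 3).

    # Illustrative use of pickle (job serialization) and xmlrpclib (remote
    # configuration-DB access); the Job class and server URL are hypothetical.
    import pickle
    try:
        import xmlrpclib                      # Python 2 name, as on the slide
    except ImportError:
        import xmlrpc.client as xmlrpclib     # Python 3 equivalent

    class Job:
        def __init__(self, application, options):
            self.application, self.options, self.status = application, options, "new"

    job = Job("Brunel", {"EvtMax": 100})
    pickle.dump(job, open("job.pkl", "wb"))          # persist the user job
    restored = pickle.load(open("job.pkl", "rb"))    # ...and read it back

    server = xmlrpclib.ServerProxy("http://configdb.example.org:8080")   # hypothetical URL
    # options = server.get_job_options("Brunel", "v17r5")                # hypothetical call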

  18. CMS analysis/production chain (for the μ, b/τ, e/γ and JetMET groups)
  • Event generation (PYTHIA) → MC ntuples
  • Detector simulation (OSCAR) → detector hits, with minimum-bias (MB) events mixed in
  • Digitization (ORCA) → digis: raw data per bunch crossing (bx), again with MB
  • Reconstruction, L1 and HLT (ORCA) → DST, with calibration feeding in
  • DST stripping (ORCA) → ntuples: MC info, tracks, etc.
  • Analysis (Iguana/ROOT/PAW)
  Nick Brook – 4th LHC Symposium

  19. CMS components and data flows
  • Tier 0/1/2: production system and data repositories; ORCA analysis farm(s), or a distributed `farm' using grid queues – production data flow
  • Tier 1/2: TAG and AOD extraction/conversion/transport services; RDBMS-based data warehouse(s); data-extraction and query web service(s); PIAF/PROOF-type analysis farm(s) – TAGs/AODs data flow
  • Tier 3/4/5 (user): local analysis tool (Iguana/ROOT/…) with tool plugin module, web browser, local disk – physics query flow
  Nick Brook – 4th LHC Symposium

  20. CLARENS – a CMS Grid Portal
  • Grid-enabling the working environment for physicists' data analysis
  • Clarens consists of a web server communicating with various RPC clients via the commodity XML-RPC protocol over http/https; this ensures implementation independence (a minimal client sketch follows)
  • The server will provide a remote API to Grid tools:
  • the Virtual Data Toolkit: object collection access
  • data movement between Tier centres using GSI-FTP
  • CMS analysis software (ORCA/COBRA)
  • security services provided by the Grid (GSI)
  • No Globus needed on the client side, only a certificate
  • Current prototype is running on the Caltech proto-Tier2
  Nick Brook – 4th LHC Symposium
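
  Because the protocol is plain XML-RPC, a client fits in a few lines of Python. The server URL and the commented-out method names below are hypothetical placeholders, not the actual Clarens API; only the standard XML-RPC introspection call is a generic feature of such servers.

    # Minimal XML-RPC client sketch in the spirit of Clarens; the URL and the
    # method names in the comments are invented for illustration.
    try:
        import xmlrpclib                      # Python 2 name
    except ImportError:
        import xmlrpc.client as xmlrpclib     # Python 3 equivalent

    server = xmlrpclib.ServerProxy("https://tier2.example.edu:8443/clarens")  # hypothetical URL
    # print(server.system.listMethods())      # standard XML-RPC introspection, if supported
    # files = server.file.ls("/store/jetmet") # hypothetical remote-API call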

  21. CLARENS
  • Several web-services applications have been built on the Clarens web-service architecture:
  • proxy escrow
  • access to JetMET data via SQL2ROOT
  • ROOT access to remote data files
  • access to files managed by the San Diego SC storage resource broker (SRB)
  • Client access available from a wide variety of languages: Python, C/C++, Java application, Java/Javascript browser-based client
  Nick Brook – 4th LHC Symposium

  22. Summary
  • All 4 experiments have successfully "managed" distributed production
  • Many lessons learnt – not only by the experiments, but also useful feedback to the middleware providers
  • A large degree of automation achieved
  • Experiments are moving onto the next challenge – analysis: chaotic, unmanaged access to data & resources
  • Tools are already (being) developed to aid Joe Bloggs
  • Success will be measured in terms of: simplicity, stability & effectiveness; access to resources; management of and access to data; ease of development of user applications
  Nick Brook – 4th LHC Symposium
