
ALICE Computing Model



  1. ALICE Computing Model (F. Carminati, BNL Seminar, March 21, 2005)

  2. Offline framework • AliRoot in development since 1998 • Entirely based on ROOT • Used since the detector TDRs for all ALICE studies • Two packages to install (ROOT and AliRoot) • Plus the MCs • Ported to most common architectures • Linux IA32, IA64 and AMD, Mac OS X, Digital Tru64, SunOS… • Distributed development • Over 50 developers and a single CVS repository • 2/3 of the code developed outside CERN • Tight integration with DAQ (data recorder) and HLT (same code base) • Wide use of abstract interfaces for modularity • “Restricted” subset of C++ used for maximum portability

  3. AliRoot layout (block diagram: ROOT as the foundation and the STEER core; the Virtual MC interface to G3, G4 and FLUKA; event generators ISAJET, HIJING, MEVSIM, PYTHIA6 and PDF via EVGEN; detector modules ITS, TPC, TRD, TOF, PHOS, EMCAL, PMD, MUON, RICH, ZDC, FMD, START, CRT, STRUCT; analysis packages HBTAN, HBTP, RALICE; AliSimulation, AliReconstruction and AliAnalysis producing the ESD; AliEn/gLite as the Grid layer)

  4. Software management • Regular release schedule • Major release every six months, minor release (tag) every month • Emphasis on delivering production code • Corrections, protections, code cleaning, geometry • Nightly-produced UML diagrams, code listings, coding-rule violations, builds and tests; single repository with all the code • No version management software (we have only two packages!) • Advanced code tools under development (collaboration with IRST/Trento) • Smell detection (already under testing) • Aspect-oriented programming tools • Automated genetic testing

  5. ALICE Detector Construction Database (DCDB) • Specifically designed to aid detector construction in a distributed environment: • Sub-detector groups around the world work independently • All data are collected in a central repository and used to move components from one sub-detector group to another, and during the integration and operation phases at CERN • Multitude of user interfaces: • Web-based for humans • LabVIEW and XML for laboratory equipment and other sources • ROOT for visualisation • In production since 2002 • A very ambitious project with important spin-offs • Cable Database • Calibration Database

  6. The Virtual MC (diagram: user code talks to the VMC abstract interface, which drives G3, G4 or FLUKA transport; the same description is shared with reconstruction, the geometrical modeller, visualisation and the event generators)
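A minimal C++ sketch of the idea behind the Virtual MC: detector code is written against an abstract transport interface and the concrete engine (G3, G4 or FLUKA) is selected at run time. The class and method names below are illustrative placeholders, not the actual TVirtualMC API.

```cpp
#include <iostream>
#include <memory>
#include <string>

// Illustrative abstract transport interface (not the real TVirtualMC API).
class VirtualTransport {
public:
  virtual ~VirtualTransport() = default;
  virtual void ProcessEvent() = 0;              // transport one event
  virtual std::string EngineName() const = 0;
};

// Concrete engines would wrap GEANT3, GEANT4 or FLUKA.
class Geant3Transport : public VirtualTransport {
public:
  void ProcessEvent() override { std::cout << "G3 transport\n"; }
  std::string EngineName() const override { return "GEANT3"; }
};

class FlukaTransport : public VirtualTransport {
public:
  void ProcessEvent() override { std::cout << "FLUKA transport\n"; }
  std::string EngineName() const override { return "FLUKA"; }
};

// User code (geometry, hits, digitisation) sees only the interface,
// so switching engines does not require changing detector code.
std::unique_ptr<VirtualTransport> MakeEngine(const std::string& name) {
  if (name == "FLUKA") return std::make_unique<FlukaTransport>();
  return std::make_unique<Geant3Transport>();
}

int main() {
  auto mc = MakeEngine("FLUKA");
  std::cout << "Using " << mc->EngineName() << '\n';
  mc->ProcessEvent();
}
```

With this pattern, switching from Geant3 to FLUKA amounts to changing the engine selection, which is what allows the transport-code comparisons on the following slides without touching the detector code.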

  7. TGeo modeller
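A small ROOT macro sketch of the kind of description the TGeo modeller handles; the materials, shapes and dimensions below are invented for illustration and do not correspond to the ALICE geometry.

```cpp
// ROOT macro sketch: build a toy barrel geometry with the TGeo modeller.
// Material parameters and dimensions are invented for illustration only.
void toygeom() {
  TGeoManager *geom = new TGeoManager("toy", "Toy barrel geometry");

  TGeoMaterial *matVac = new TGeoMaterial("Vacuum", 0, 0, 0);
  TGeoMedium   *vac    = new TGeoMedium("Vacuum", 1, matVac);

  // World volume (half-lengths in cm)
  TGeoVolume *top = geom->MakeBox("TOP", vac, 400., 400., 400.);
  geom->SetTopVolume(top);

  // A cylindrical "TPC-like" volume placed at the origin
  TGeoVolume *barrel = geom->MakeTube("BARREL", vac, 85., 250., 250.);
  top->AddNode(barrel, 1);

  geom->CloseGeometry();
  top->Draw();   // requires a graphics-enabled ROOT session
}
```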

  8. Results (comparison plots, Geant3 vs FLUKA: 5000 1 GeV/c protons in 60 T field; HMPID with 5 GeV pions)

  9. ITS – SPD: cluster size (PRELIMINARY; figure)

  10. Reconstruction strategy • Main challenge: reconstruction in the high-flux environment (occupancy in the TPC up to 40%) requires a new approach to tracking • Basic principle: maximum-information approach • Use everything you can and you will get the best result • Algorithms and data structures optimised for fast access and usage of all relevant information • Localise relevant information • Keep this information until it is needed

  11. Tracking strategy – Primary tracks • Incremental process (see the tracking-flow sketch below) • Forward propagation towards the vertex: TPC → ITS • Back propagation: ITS → TPC → TRD → TOF • Refit inward: TOF → TRD → TPC → ITS • Continuous seeding • Track segment finding in all detectors • Combinatorial tracking in the ITS • Weighted two-track χ² calculated • Effective probability of cluster sharing • Probability for secondary particles not to cross a given layer
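A schematic C++ sketch of the three-pass flow listed above (forward propagation TPC → ITS, back propagation out to TOF, inward refit). The track type and the propagation step are stand-ins for the real Kalman-filter machinery in the AliRoot trackers.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Placeholder track type; the real trackers carry full Kalman state.
struct Track {
  int id;
  double chi2;
};

// One propagation pass through a detector: in reality this would update the
// track parameters and covariance with the clusters found in 'detector'.
void Propagate(std::vector<Track>& tracks, const std::string& detector) {
  for (auto& t : tracks) {
    t.chi2 += 1.0;  // stand-in for the real chi2 accumulation
  }
  std::cout << "propagated " << tracks.size() << " tracks through "
            << detector << '\n';
}

int main() {
  // Seeds would come from the outer TPC pad rows.
  std::vector<Track> tracks = {{0, 0.0}, {1, 0.0}, {2, 0.0}};

  // Pass 1: forward propagation towards the vertex (TPC -> ITS).
  for (const char* det : {"TPC", "ITS"}) Propagate(tracks, det);

  // Pass 2: back propagation outwards (ITS -> TPC -> TRD -> TOF).
  for (const char* det : {"ITS", "TPC", "TRD", "TOF"}) Propagate(tracks, det);

  // Pass 3: refit inward with the full information (TOF -> TRD -> TPC -> ITS).
  for (const char* det : {"TOF", "TRD", "TPC", "ITS"}) Propagate(tracks, det);
}
```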

  12. ITS & TPC & TOF Efficiency Contamination Tracking & PID TPC ITS+TPC+TOF+TRD • PIV 3GHz – (dN/dy – 6000) • TPC tracking - ~ 40s • TPC kink finder ~ 10 s • ITS tracking ~ 40 s • TRD tracking ~ 200 s ALICE Computing Model

  13. Condition and alignment • Heterogeneous information sources are periodically polled • ROOT files with condition information are created (see the sketch below) • These files are published on the Grid and distributed as needed by the Grid DMS • Files contain validity information and are identified via DMS metadata • No need for a distributed DBMS • Reuse of the existing Grid services
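A minimal ROOT macro sketch of the principle: the calibration payload and its validity range are written into an ordinary ROOT file, which the Grid DMS can then replicate like any other file. The file name, object names and values are illustrative, not the AliRoot CDB conventions.

```cpp
// ROOT macro sketch: store a calibration payload plus its validity range
// in an ordinary ROOT file. Names and values are illustrative only.
void writecalib() {
  TFile f("TPC_pedestals_run1000_2000.root", "RECREATE");

  // Payload: here just a few pedestal values per channel (dummy numbers).
  TVectorD pedestals(3);
  pedestals[0] = 50.1; pedestals[1] = 49.8; pedestals[2] = 50.4;
  pedestals.Write("pedestals");

  // Validity range kept inside the file; on the Grid the same run range
  // would also be attached as file-catalogue metadata for the DMS.
  TVectorD validity(2);
  validity[0] = 1000;   // first valid run
  validity[1] = 2000;   // last valid run
  validity.Write("validity");

  f.Close();
}
```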

  14. External relations and DB connectivity (diagram; relations between DBs not final, not all shown) • Sources connected through APIs: ECS, DAQ, Trigger, DCS, HLT, DCDB • AliEn/gLite: metadata file store and physics data files • Calibration files and calibration procedures feed the calibration classes in AliRoot • A call for User Requirements (source, volume, granularity, update frequency, access pattern, runtime environment and dependencies) was sent to the subdetectors • API = Application Program Interface

  15. Metadata • Metadata are essential for the selection of events • We hope to be able to use the Grid file catalogue for part of the metadata • During the Data Challenge we used the AliEn file catalogue for storing part of the metadata • However, these are file-level metadata • We will need additional event-level metadata • This can simply be the TAG catalogue with externalisable references (see the sketch below) • We are discussing this subject with STAR • We will take a decision soon • We would prefer that the Grid scenario were clearer
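A toy sketch of what an event-level TAG catalogue with externalisable references could look like: a small ROOT TTree holding a few selection variables plus a reference (logical file name and event number) back to the ESD. The variable set and the LFN are invented for illustration.

```cpp
// ROOT macro sketch: a toy event-level tag tree with an "externalisable"
// reference back to the ESD (file identifier + event number).
// The tag variables and the LFN are purely illustrative.
void writetags() {
  TFile f("tags.root", "RECREATE");
  TTree tags("T", "event tags");

  char    esdFile[256] = "lfn:/alice/sim/run1000/AliESDs.root"; // hypothetical LFN
  Int_t   event   = 0;
  Int_t   nTracks = 0;
  Float_t maxPt   = 0;

  tags.Branch("esdFile", esdFile,  "esdFile/C");
  tags.Branch("event",   &event,   "event/I");
  tags.Branch("nTracks", &nTracks, "nTracks/I");
  tags.Branch("maxPt",   &maxPt,   "maxPt/F");

  for (event = 0; event < 100; ++event) {   // dummy tag values
    nTracks = 500 + event;
    maxPt   = 2.5f + 0.01f * event;
    tags.Fill();
  }

  tags.Write();
  f.Close();
  // A selection such as  T->Draw(">>elist", "nTracks>550 && maxPt>3")
  // would then return only the events worth fetching from the Grid.
}
```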

  16. ALICE CDCs

  17. Use of HLT for monitoring in the CDCs (data-flow diagram: AliRoot simulation → digits → LDCs → event builder → GDC → raw data; HLT algorithms produce ESDs and monitoring histograms; alimdc writes ROOT files to CASTOR, registered in AliEn)

  18. ALICE Physics Data Challenges

  19. PDC04 schema (diagram: AliEn job control and data transfer between CERN, Tier1 and Tier2 sites; production of RAW, shipment of RAW to CERN, reconstruction of RAW in all T1s, analysis)

  20. Phase 2 principle (figure: mixing of signal into a signal-free underlying event)

  21. Simplified view of the ALICE Grid with AliEn • ALICE VO – central services: user authentication, file catalogue, workload management, job submission, configuration, job monitoring, central task queue, accounting, storage volume manager • AliEn site services: computing element, data transfer, storage element, cluster monitor • Existing site components: local scheduler, disk and MSS • ALICE VO – site services integration

  22. Site services • Unobtrusive – run entirely in user space: • Single user account • All authentication already assured by the central services • Tuned to the existing site configuration – supports various schedulers and storage solutions • Running on many Linux flavours and platforms (IA32, IA64, Opteron) • Automatic software installation and updates (both service and application) • Scalable and modular – different services can be run on different nodes (in front of or behind firewalls) to preserve site security and integrity (diagram: CERN firewall solution for large-volume file transfers only – high ports (50K–55K) for parallel file transport through the firewall to load-balanced file transfer nodes (on HTAR), with AliEn data transfer separated from the other AliEn services on the CERN intranet)

  23. Central AliEn services hardware • HP ProLiant DL380 – AliEn proxy server: up to 2500 concurrent client connections • HP server rx2600 – AliEn job services: 500K archived jobs • HP ProLiant DL580 – AliEn file catalogue: 9 million entries, 400K directories, 10 GB MySQL DB • HP ProLiant DL380 – AliEn storage element volume manager: 4 million entries, 3 GB MySQL DB • HP ProLiant DL360 – AliEn to CASTOR (MSS) interface: log files and application software storage, 1 TB SATA disk server

  24. Phase 2 job structure • Task: simulate the event reconstruction and remote event storage (diagram: the central servers perform master job submission, job optimisation into N sub-jobs, RB, file catalogue, process monitoring and control, SE; sub-jobs run on AliEn CEs and, through the AliEn–LCG interface completed in Sep. 2004, on LCG CEs; underlying-event input files are read from CERN CASTOR; each job writes its output files and a zip archive of them to a local SE as the primary copy, registered in the file catalogue – on LCG SEs the LCG LFN equals the AliEn PFN, stored via edg(lcg) copy&register – with a backup copy kept in CERN CASTOR)

  25. Production history • Statistics • 400 000 jobs, 6 hours/job, 750 MSi2K hours • 9M entries in the AliEn file catalogue • 4M physical files at 20 AliEn SEs in centres world-wide • 30 TB stored at CERN CASTOR • 10 TB stored at remote AliEn SEs + 10 TB backup at CERN • 200 TB network transfer CERN → remote computing centres • AliEn observed efficiency >90% • LCG observed efficiency 60% (see GAG document) • ALICE repository – history of the entire DC • ~1 000 monitored parameters: • Running and completed processes • Job status and error conditions • Network traffic • Site status, central services monitoring • … • 7 GB of data • 24 million records with 1-minute granularity – analysed to improve Grid performance

  26. Job repartition • Jobs (AliEn/LCG): Phase 1 – 75%/25%, Phase 2 – 89%/11% • More operational sites were added to the ALICE Grid as the PDC progressed • 17 permanent sites (33 in total) under direct AliEn control, plus additional resources through Grid federation (LCG) (pie charts for Phase 1 and Phase 2)

  27. Summary of PDC’04 • Computing resources • It took some effort to ‘tune’ the resources at the remote computing centres • The centres’ response was very positive – more CPU and storage capacity was made available during the PDC • Middleware • AliEn proved to be fully capable of executing high-complexity jobs and controlling large amounts of resources • Functionality for Phase 3 has been demonstrated, but cannot be used • LCG MW proved adequate for Phase 1, but not for Phase 2 or in a competitive environment • It cannot provide the additional functionality needed for Phase 3 • ALICE computing model validation: • AliRoot – all parts of the code successfully tested • Computing element configuration • Need for a high-functionality MSS shown • The Phase 2 distributed data storage schema proved robust and fast • Data analysis could not be tested

  28. Development of analysis • Analysis Object Data designed for efficiency • Contain only the data needed for a particular analysis • Analysis à la PAW • ROOT plus, at most, a small library • Work on the distributed infrastructure has been done by the ARDA project • Batch analysis infrastructure • Prototype published at the end of 2004 with AliEn • Interactive analysis infrastructure • Demonstration performed at the end of 2004 with AliEn/gLite • Physics working groups are just starting now, so the timing is right to receive requirements and feedback

  29. PROOF on the Grid (architecture diagram: a PROOF client steers a PROOF master, which controls PROOF slave servers (proofd/rootd) at LCG sites A, B, …, X; new elements are an optional site gateway with outgoing-only connectivity, forward proxies, slave ports mirrored on the master host, slave startup and a registration/booking DB; Grid service interfaces cover the TGrid UI/queue UI, master setup, a Grid access control service, Grid/ROOT authentication and the Grid file/metadata catalogue; the master issues booking requests with logical file names and the client retrieves the list of logical files (LFN + MSN); the PROOF setup itself is Grid-middleware independent, otherwise a “standard” PROOF session)
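A minimal sketch of the user-side workflow this PROOF setup targets: attach to a cluster and process a chain of ESD files with a selector. The master host, file URLs and selector name are placeholders; in the Grid scenario the file list would come from the Grid file/metadata catalogue rather than being typed by hand.

```cpp
// ROOT macro sketch: user-side PROOF session (all names are placeholders).
void runproof() {
  // Attach to a PROOF master; in the Grid scenario the dataset would come
  // from the file catalogue instead of a hand-made chain.
  TProof::Open("proof-master.example.org");

  TChain chain("esdTree");
  chain.Add("root://se.example.org//alice/run1000/AliESDs.root");
  chain.Add("root://se.example.org//alice/run1001/AliESDs.root");

  chain.SetProof();                 // route Process() through the PROOF cluster
  chain.Process("MySelector.C+");   // hypothetical TSelector doing the analysis
}
```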

  30. gLite Grid situation • History • Jan ’04: AliEn developers are hired by EGEE and start working on new MW • May ’04: a prototype derived from AliEn is offered to pilot users (ARDA, Biomed…) under the gLite name • Dec ’04: the four experiments ask for this prototype to be deployed on a larger preproduction service and to be part of the EGEE release • Jan ’05: this is vetoed at management level – AliEn will not be common software • Current situation • EGEE has vaguely promised to provide the same functionality as the AliEn-derived MW • But with a delay of at least 2–4 months on top of the one already accumulated • And even this will be just the beginning of the story: the different components will have to be field-tested in a real environment; it took four years for AliEn • All experiments have their own middleware • Ours is not maintained because our developers have been hired by EGEE • EGEE has formally vetoed any further work on AliEn or AliEn-derived software • LCG has allowed some support for ALICE but the situation is far from clear

  31. ALICE computing model • For pp, similar to the other experiments • Quasi-online data distribution and first reconstruction at T0 • Further reconstruction passes at T1s • For AA, a different model • Calibration, alignment and pilot reconstructions during data taking • Data distribution and first reconstruction at T0 during the four months after the AA run (shutdown) • Second and third passes distributed at T1s • For safety, one copy of RAW at T0 and a second one distributed among all T1s • T0: first-pass reconstruction, storage of one copy of RAW, calibration data and first-pass ESDs • T1: subsequent reconstructions and scheduled analysis, storage of the second collective copy of RAW and of one copy of all data to be safely kept (including simulation), disk replicas of ESDs and AODs • T2: simulation and end-user analysis, disk replicas of ESDs and AODs • Very difficult to estimate the network load (see the back-of-envelope sketch below)
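The network-load uncertainty can at least be bounded with a back-of-envelope calculation such as the one below; every number in it is an assumed placeholder, not an ALICE computing-model figure.

```cpp
#include <cstdio>

// Back-of-envelope sketch of the T0 -> T1 export rate for raw data.
// ALL numbers are invented placeholders, not ALICE computing-model figures.
int main() {
  const double rawVolumeTB   = 1000.0;  // assumed AA raw data volume to export (TB)
  const double exportDays    = 120.0;   // the four-month shutdown window
  const double overheadShare = 0.2;     // assumed protocol/retry overhead

  const double bytes   = rawVolumeTB * 1e12;
  const double seconds = exportDays * 24 * 3600;
  const double gbps    = bytes * 8 * (1 + overheadShare) / seconds / 1e9;

  std::printf("Average export rate: %.2f Gb/s sustained over %g days\n",
              gbps, exportDays);
  return 0;
}
```

Changing any of the assumed inputs (volume, window, overhead, number of reconstruction passes shipped) moves the required sustained rate, which is exactly why the estimate is so difficult.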

  32. ALICE requirements on middleware • One of the main uncertainties of the ALICE computing model comes from the Grid component • ALICE has been developing its computing model assuming that MW with the same quality and functionality that AliEn would have had two years from now will be deployable on the LCG computing infrastructure • If not, we will still analyse the data (!), but • Less efficiency → more computers → more time and money • More people for production → more money • To elaborate an alternative model we should know what will be: • The functionality of the MW developed by EGEE • The support we can count on from LCG • Our “political” “margin of manoeuvre”

  33. Possible strategy • If • a) basic services from the LCG/EGEE MW can be trusted at some level, and • b) we can get some support to port the “higher functionality” MW onto these services • We have a solution • If a) above is not true, but • We have support for deploying the ARDA-tested, AliEn-derived gLite • And we do not have a political “veto” • We still have a solution • Otherwise we are in trouble

  34. ALICE Offline timeline (timeline chart, 2004–2006: PDC04 and its analysis, CDC 04(?), design and development of new components, Computing TDR, PDC05, CDC 05, final development of AliRoot, AliRoot ready, PDC06 preparation, PDC06, first data taking preparation; “we are here” marker in early 2005)

  35. Main parameters

  36. Processing pattern

  37. Conclusions • ALICE has made a number of technical choices for the computing framework since 1998 that have been validated by experience • The Offline development is on schedule, although contingency is scarce • Collaboration between physicists and computer scientists is excellent • Tight integration with ROOT allows a fast prototyping and development cycle • AliEn goes a long way towards providing a Grid solution adapted to HEP needs • However, its evolution into a common project has been “stopped” • This is probably the largest single “risk factor” for ALICE computing • Some ALICE-developed solutions have a high potential to be adopted by other experiments and are indeed becoming “common solutions”
