
Data Analysis: Algorithms & Methods

Data Analysis: Algorithms & Methods. Highlights. Vincenzo Innocente (CERN-CMS) Ed Frank (Univ. of Pennsylvania - BaBar). Contributions. General Architecture 12 Foundation Libraries 3 Detector reconstruction (all but one: tracking!) Focus on Program Structure 7 Strictly Algorithms 3


Presentation Transcript


  1. Data Analysis: Algorithms & Methods Highlights Vincenzo Innocente (CERN-CMS) Ed Frank (Univ. of Pennsylvania - BaBar)

  2. Contributions • General Architecture 12 • Foundation Libraries 3 • Detector reconstruction (all but one: tracking!) • Focus on Program Structure 7 • Strictly Algorithms 3 • Simulation 8 • Detector description 4

  3. Architecture

  4. ORCA Software & Architecture • When the project started, most people were worried about ways to bring on the physicists, develop the sub-detector software etc. • Important, major emphasis of the last year, but actually less critical in the long term • Engineering of the architecture, and crucially the data-handling issues, are really the critical items • Tracking algorithms can, and will, be rewritten many times. But having an architecture that allows and keeps track of plug-and-play is vital. • Even now we face very large datasets (multi-TB). Production, automation, mirroring, evolution are (some of) the hard issues. • Reconstruction is much more than the reconstruction code

  5. Offline Architecture: New Requirements • Bigger experiment, higher rate, more data • Larger and dispersed user community performing non-trivial queries against a large event store • Make best use of new IT technologies • Increased demand for both flexibility and coherence • ability to plug in new algorithms • ability to run the same algorithms in multiple environments • guarantees of quality and reproducibility • high-performance user-friendliness

  6. CMS (offline) Software [Diagram: Event Filter, Quasi-online Reconstruction, Online Monitoring, Slow Control (environmental data), Data Quality, Calibrations, Group Analysis, on-demand User Analysis, and Simulation (G3 and/or G4) each store reconstructed objects and calibrations in, or request parts of events from, a Persistent Object Store Manager backed by an Object Database Management System (Objectivity Formatter).]

  7. March 2000 HLT Production Plans • 2M events ORCA-reconstructed with high-luminosity pile-up • 2-4 terabytes in Objectivity/DB • 400 CPU-weeks • ~6 Production Units • ~1-2 Production Units off the CERN site • Copy of all data at CERN in HPSS, use of IT/ASD AMS-backend to stage data to ~1 TB of disk pools • Mirroring of data to a few off-site centers, including trans-Atlantic • Users want (need!) now what they were promised for 2005.

  8. Offline Architecture: Solution • One coherent architecture from online event filtering to final physics analysis • Clear definition of Clients’ and Services’ interfaces and roles • Framework which orchestrates instances of all these modules • Set of common foundation libraries

  9. Software Structure • Applications (triggers, reconstruction, simulation, analysis): implement the physics algorithms • Frameworks: one main framework (GAUDI), plus various specialised frameworks for visualisation, persistency, interactivity, simulation (Geant4), etc. • Toolkits • Foundation Libraries: basic libraries such as STL, CLHEP, etc. (vocabulary)

  10. DØ C++ Framework • Set of well-established interfaces from which reconstruction and analysis algorithms are built. • Propagates events through a set of algorithms in a well-defined and established manner. • The set of algorithms and their configuration are determined at program execution time. • The framework hides many system-related complexities from the user and the algorithm developer and allows code for common or related tasks to be shared.
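The idea of propagating each event through a chain of algorithms chosen at execution time can be sketched as follows. This is a minimal illustration, not the DØ framework's API: `Event`, `Module`, `Calibrate`, `Threshold`, `moduleRegistry` and `ModuleChain` are all hypothetical names, and the "physics" is a placeholder.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical event: a bag of named numbers, for illustration only.
struct Event {
    std::map<std::string, double> values;
};

// The interface every reconstruction or analysis step implements.
class Module {
public:
    virtual ~Module() = default;
    virtual void process(Event& event) = 0;
};

class Calibrate : public Module {
public:
    void process(Event& event) override {
        event.values["energy"] = event.values["raw"] * 2.0;  // placeholder calibration
    }
};

class Threshold : public Module {
public:
    void process(Event& event) override {
        event.values["accepted"] = event.values["energy"] > 5.0 ? 1.0 : 0.0;
    }
};

using Factory = std::function<std::unique_ptr<Module>()>;

// Maps configuration names to factories, so the chain is chosen at run time.
inline const std::map<std::string, Factory>& moduleRegistry() {
    static const std::map<std::string, Factory> registry = {
        {"calibrate", [] { return std::unique_ptr<Module>(new Calibrate); }},
        {"threshold", [] { return std::unique_ptr<Module>(new Threshold); }},
    };
    return registry;
}

// Builds the chain from a list of names and runs each event through it in order.
class ModuleChain {
public:
    explicit ModuleChain(const std::vector<std::string>& names) {
        for (const auto& name : names) {
            auto it = moduleRegistry().find(name);
            if (it == moduleRegistry().end()) {
                throw std::invalid_argument("unknown module: " + name);
            }
            modules_.push_back(it->second());
            assert(modules_.back() && "factory returned no module");
        }
    }
    void process(Event& event) {
        for (auto& module : modules_) module->process(event);
    }

private:
    std::vector<std::unique_ptr<Module>> modules_;
};
```

In a real framework the list of names would come from a job configuration file or script rather than from source code.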

  11. Offline Architecture: Enabling Technologies • C++ & OO • Run Time Dynamic Loading • Event Driven Notification • State Machines • Persistent Object Store • Database Technologies • Networked Client-Server Architectures • Layered Architecture to shield the user from the above!

  12. CLEO III Dynamic Loading vs. Static Linking • Both equally well supported, can mix. • Static linking required for reconstruction jobs • need stable environment for long periods of time • Dynamic linking/loading for rapid code development • Fast turn-around time needed • Cutting link times from hours/minutes to minutes/seconds • Limit the number of libraries to link to: • Proper layering of code • Separation of data types from the algorithms that supply them (why would I have to link to a tracker to access tracks?) • No direct links between objects reduces the number of libraries to link to • instead we use index-list objects (“Lattice”) • Run-time cost of resolving symbols is low!

  13. CMS Conclusions • An “implicit invocation” architecture is a flexible software solution which can scale with the complexity of the CMS project. • An ODBMS integrated into the framework • provides coherent management of persistent objects • coupled with run-time dynamic loading, allows an application to be configured automatically • The framework can effectively shield physics modules from the underlying technology without penalizing performance

  14. Component-based Architecture NOVA

  15. Offline Architecture: Commonalities and Differences • Event Data Reduction • Externally: Pipes & Filters • Internally: Blackboard • CMS: Action on Demand • External Services (geometry, run conditions etc.) • Mainly procedural • CMS and DØ: “Event” Notification (implicit invocation) [Figure labels: Lots of EmcDigis, Emc Clustering, Lots of EmcClusters, Lots of RecoTracks, Track Associator, Lots of Associations]
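"Implicit invocation" means that the code announcing a new event never calls its consumers by name; consumers register interest and are notified. A minimal observer-style sketch, assuming hypothetical `Event`, `EventDispatcher` and `EventCounter` types (this is not the CMS or DØ implementation):

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

struct Event {
    int id;
};

// Observers register with the dispatcher; whoever publishes an event
// does not know which observers exist or how many there are.
class EventDispatcher {
public:
    using Observer = std::function<void(const Event&)>;

    void subscribe(Observer observer) {
        assert(observer && "observer must be callable");
        observers_.push_back(std::move(observer));
    }

    // Notifies every registered observer, in registration order.
    void publish(const Event& event) const {
        for (const auto& observer : observers_) observer(event);
    }

private:
    std::vector<Observer> observers_;
};

// Example observer: it is invoked implicitly whenever an event is published.
// It must outlive the dispatcher's use of it, since the callback captures `this`.
class EventCounter {
public:
    explicit EventCounter(EventDispatcher& dispatcher) {
        dispatcher.subscribe([this](const Event&) { ++count_; });
    }
    int count() const { return count_; }

private:
    int count_ = 0;
};
```

Adding a new consumer then requires no change to the publishing code, which is the flexibility the slides attribute to this style.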

  16. Offline Architecture: Commonalities and Differences • Distinction among data, detector and algorithms • Only BaBar makes no clear distinction • Access to object collections by name • everybody uses named registries (flat or tree) • central component of Gaudi (LHCb) Services • Persistency insulation layer: • Transient copy (managed by the framework) • direct smart pointer

  17. Principal design choices • Separation between “data” and “algorithms” • Data objects primarily carry data, have only basic methods • e.g. Tracking hits • Algorithm objects primarily manipulate data • e.g. Track fitter • Three basic categories of data: • “event data” (obtained from particle collisions, real or simulated) • “detector data” (structure, geometry, calibration, alignment, ...) • “statistical data” (histograms, ...) • Separation between “transient” and “persistent” data. • Isolate user code from persistency technology. • Different optimisation criteria. • Transient as a bridge between independent representations.
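The two separations listed above (data versus algorithms, transient versus persistent) can be illustrated with a small sketch. Everything here is assumed for illustration: `Hit`, `PersistentHit`, the micrometre storage format and the `radius` "algorithm" are hypothetical and not taken from any of the experiments' code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Transient form: what algorithms work with (hypothetical hit, positions in cm).
struct Hit {
    double x;
    double y;
};

// Persistent form: a compact storage layout (hypothetical, integer micrometres).
struct PersistentHit {
    std::int32_t xMicrons;
    std::int32_t yMicrons;
};

const double kMicronsPerCm = 1.0e4;

// The converters are the only code that knows both representations.
PersistentHit toPersistent(const Hit& hit) {
    assert(std::fabs(hit.x) < 1.0e5 && std::fabs(hit.y) < 1.0e5 && "outside storable range");
    return PersistentHit{static_cast<std::int32_t>(std::lround(hit.x * kMicronsPerCm)),
                         static_cast<std::int32_t>(std::lround(hit.y * kMicronsPerCm))};
}

Hit toTransient(const PersistentHit& stored) {
    return Hit{stored.xMicrons / kMicronsPerCm, stored.yMicrons / kMicronsPerCm};
}

// An "algorithm": it manipulates data and depends only on the transient type.
double radius(const Hit& hit) {
    return std::hypot(hit.x, hit.y);
}
```

Because `radius` never sees `PersistentHit`, the storage format can be optimised or replaced without touching algorithm code.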

  18. Module, event and environment structure • Modules provide the algorithms • Use existing information to create new objects • Styles range from procedural monoliths to OO castles • Framework/AC++ provides control & config • Uses TCL scripting, command line • Production executables run 300 modules • Objects have behaviors, not just values • “Networks of objects collaborate to provide semantics” • Internal form of our track objects is irrelevant • Objects kept in event and environment • Named access in a flat space • event -> Ifd<EmcCluster>::get("MergedClusters") • Implemented via ProxyDict • Proxies provide complex access when needed • Ensures physical decoupling [Figure labels: Lots of EmcDigis, Emc Clustering, Lots of EmcClusters, Lots of RecoTracks, Track Associator, Lots of Associations]
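Named, typed access through proxies can be sketched as a dictionary keyed by (type, name) whose entries are built on first request. This is only an illustration of the idea: `EventDict` and its `put`/`get` methods are hypothetical and are not the BaBar `Ifd` or ProxyDict interfaces.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <utility>
#include <vector>

struct EmcCluster {
    double energy;
};

// Hypothetical event dictionary: objects are found by type and name,
// and a proxy creates the object only when it is first asked for.
class EventDict {
    using Key = std::pair<std::type_index, std::string>;

    template <class T>
    static Key key(const std::string& name) {
        return Key(std::type_index(typeid(T)), name);
    }

public:
    // Registers a proxy for (T, name); nothing is built yet.
    template <class T>
    void put(const std::string& name, std::function<std::shared_ptr<T>()> proxy) {
        assert(proxy && "proxy must be callable");
        proxies_[key<T>(name)] = [proxy]() -> std::shared_ptr<void> { return proxy(); };
    }

    // Returns the object for (T, name), or nullptr if none was registered.
    template <class T>
    std::shared_ptr<T> get(const std::string& name) {
        const Key k = key<T>(name);
        auto cached = cache_.find(k);
        if (cached != cache_.end()) return std::static_pointer_cast<T>(cached->second);
        auto it = proxies_.find(k);
        if (it == proxies_.end()) return nullptr;
        std::shared_ptr<void> object = it->second();  // run the proxy once
        cache_[k] = object;
        return std::static_pointer_cast<T>(object);
    }

private:
    std::map<Key, std::function<std::shared_ptr<void>()>> proxies_;
    std::map<Key, std::shared_ptr<void>> cache_;
};
```

The caller only needs the header of the data type and the dictionary, not the library that produces the clusters, which is one way to read "ensures physical decoupling".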

  19. Algorithms • An Algorithm knows only which data (type and name) it uses as input and produces as output. • The only coupling between algorithms is via the data. • The execution order of the sub-algorithms is the responsibility of the parent algorithm. [Figure: logical and physical views of a parent algorithm running sub-algorithms A, B and C against the transient data store: A reads Data T1 and produces Data T2 and T3; B reads Data T2 and produces Data T4; C reads Data T3 and T4 and produces Data T5.]
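The three statements on this slide can be sketched with a toy transient store. The names `DataStore`, `Algorithm`, `Sequence` and the arithmetic inside A, B and C are hypothetical; only the data flow (T1 to T2 and T3, T2 to T4, T3 and T4 to T5) follows the slide.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy transient data store: named values shared by all algorithms.
class DataStore {
public:
    void put(const std::string& name, double value) { items_[name] = value; }
    // Throws std::out_of_range if the named item has not been produced yet.
    double get(const std::string& name) const { return items_.at(name); }

private:
    std::map<std::string, double> items_;
};

class Algorithm {
public:
    virtual ~Algorithm() = default;
    virtual void execute(DataStore& store) = 0;
};

// Each algorithm names only its inputs and outputs; it never calls another algorithm.
class AlgorithmA : public Algorithm {
public:
    void execute(DataStore& store) override {
        store.put("T2", store.get("T1") + 1.0);
        store.put("T3", store.get("T1") * 10.0);
    }
};

class AlgorithmB : public Algorithm {
public:
    void execute(DataStore& store) override { store.put("T4", store.get("T2") * 2.0); }
};

class AlgorithmC : public Algorithm {
public:
    void execute(DataStore& store) override {
        store.put("T5", store.get("T3") + store.get("T4"));
    }
};

// The parent algorithm owns the sub-algorithms and decides their execution order.
class Sequence : public Algorithm {
public:
    void add(std::unique_ptr<Algorithm> child) {
        assert(child && "null sub-algorithm");
        children_.push_back(std::move(child));
    }
    void execute(DataStore& store) override {
        for (auto& child : children_) child->execute(store);
    }

private:
    std::vector<std::unique_ptr<Algorithm>> children_;
};
```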

  20. Action on Demand • Compare the results of two different track reconstruction algorithms [Figure: an Event holding detector-element Hits and Rec Hits; reconstructors Rec T1, Rec T2 and Rec CaloCl produce T1, T2 and CaloCl, which an Analysis requests.]
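One way to read "action on demand" is that a reconstruction runs only when some client asks for its result, and the result is then cached for later requests. A minimal sketch under that assumption, with hypothetical names (`LazyEvent`, `Tracks`, `registerReconstructor`) that do not come from the CMS software:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Tracks = std::vector<double>;  // hypothetical: one number per track

class LazyEvent {
public:
    using Reconstructor = std::function<Tracks()>;

    // Registering a reconstructor does not run it.
    void registerReconstructor(const std::string& name, Reconstructor reconstructor) {
        assert(reconstructor && "reconstructor must be callable");
        reconstructors_[name] = std::move(reconstructor);
    }

    // Runs the named reconstructor on first request and caches the result.
    // Throws std::out_of_range if no reconstructor has that name.
    const Tracks& tracks(const std::string& name) {
        auto cached = cache_.find(name);
        if (cached == cache_.end()) {
            cached = cache_.emplace(name, reconstructors_.at(name)()).first;
        }
        return cached->second;
    }

private:
    std::map<std::string, Reconstructor> reconstructors_;
    std::map<std::string, Tracks> cache_;
};
```

An analysis comparing two track reconstructions would simply request both names; an analysis that needs neither pays for neither.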

  21. “Regular” makers communication [Figure: StMaker objects, each with .maker, .data and .const branches; step 1 is Init(), step 2 is Make(); makers exchange data through GetDataSet() and AddData().]

  22. ALICE's choice • Migrate immediately to C++ • Immediately abandon PAW • But accept GEANT3.21 (initially) • Adopt the ROOT framework • Not worried about depending on ROOT • Much more worried about depending on G4, Objy, ... • Allow use of FORTRAN and C++ • Allow starting with wrapping and bad design • Impose a single framework • Provide central support, documentation and distribution • Train users in the framework

  23. Detector Description

  24. The transient detector store contains a “snapshot” of the detector data valid for the currently processed event. [Figure: the Detector Persistency Service uses Converters to load DetElements from the Persistent Detector Store into the Transient Detector Store; the Detector Data Service serves them to Algorithms; the Geant4 Service uses G4Converters to build the Geant4 Representation from the same DetElements.]

  25. Input: Why Use XML? (LCD, J. Bogart) • For the first pass, LCD used an ad hoc file format and one-of-a-kind code for serial-only parsing of the detector geometry. • XML is a standard meta-language for defining markup languages. • Good free parsers exist, more tools coming. • XML languages are plain-text, self-documenting. • Application interface to data (XML document) may be serial or random-access. • Avoid growing private file formats or, worse, hard-coding parameters. • Make it easy (well, easier) for several programs to use the same input.

  26. Detector Description in XML (LCD, J. Bogart)

      <lcdparm>
        <global file="largeParms2.xml" />
        <physical_detector topology="large" id="L2">
          <volume id="EM_BARREL">
            <tube>
              <barrel_dimensions inner_r="196.0" outer_z="322.0" />
              <layering n="40">
                <slice material="Pb" width="0.4" />
                <slice material="Tyvek" width="0.05" />
                <slice material="Polystyrene" width="0.1" sensitive="yes" />
              </layering>
              <segmentation cos_theta="300" phi="300" />
            </tube>
            <calorimeter type="em" />
          </volume>
          ...

      [Slide callouts: Start subdetector description (at <volume>); Geometry, materials function; End subdetector description (at </volume>)]

  27. Detector Reconstruction

  28. Track Reconstruction Framework: Motivation • We cannot implement the optimal track reconstruction algorithm right away • There’s probably no one optimal algorithm but several, each optimized for a specific task • We need a flexible framework for developing and evaluating algorithms • The mathematical complexity of track finding/fitting often limits the number of developers • The algebra involved is often localized in a few places • If we could encapsulate the algebra in a few classes and separate it from the logic of the algorithm, it would make track finding easier for developers

  29. Reconstruction Object Model (BaBar IFR) • Objects encapsulate the behavior of: • reconstruction information (strip, hit, cluster, …) • the detector model (sector, layer, …) • algorithm strategies (clusterizer, …) • etc. [Figure labels: strip; “hit”: 1D-cluster; pcluster; mcluster]

  30. The BaBar Track Fit • Written in OO C++ • Integrated with the BaBar software framework • Exploits a novel formulation of the Kalman equations • Symmetric processing for both track directions • Processing in Parameter and Weight space • reduces the number of matrix inversions required • Fit result is expressed as a Piecewise Helix • Joined helix segments describing the ‘most likely’ path through space • Integrates support for other tracking operations • Pattern recognition • Alignment • Used to fit >10^8 tracks in the commissioning run
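Why working in weight space saves inversions can be seen in one dimension. The sketch below is the standard textbook information-filter form for combining independent measurements, reduced to scalars; it is not the BaBar formulation, and `WeightEstimate` is a hypothetical name.

```cpp
#include <cassert>

// A 1-D estimate stored in weight (information) space:
// weight = 1 / variance, info = value / variance.
struct WeightEstimate {
    double weight = 0.0;  // zero weight means "no information yet"
    double info = 0.0;

    // Adding an independent measurement is a plain sum; the accumulated
    // estimate itself is never inverted here.
    void add(double value, double variance) {
        assert(variance > 0.0 && "variance must be positive");
        weight += 1.0 / variance;
        info += value / variance;
    }

    // Converting back to parameter space is the only place the accumulated
    // weight is inverted (the scalar analogue of a matrix inversion).
    double value() const {
        assert(weight > 0.0 && "no measurements added");
        return info / weight;
    }
    double variance() const {
        assert(weight > 0.0 && "no measurements added");
        return 1.0 / weight;
    }
};
```

Combining measurements 1.0 and 3.0, each with unit variance, gives a value of 2.0 with variance 0.5. In the multi-parameter case the sums become matrix sums, and the inversion is needed only when the parameters are read out.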

  31. Effect Processing

  32. Code Organization

  33. KalStub: A Pattern Recognition Tool

  34. Experience with software development (BaBar IFR) • Inflexible design was spotted when problems repeatedly occurred in the same code areas whenever changes were introduced • Applying a more flexible design has usually improved the software management • more effective development • problem isolation • A concrete example: computation of the number of interaction lengths: • Abstract base class for cluster curve approximation • Computation of the path length in the detector model was tested using a straight-line implementation of the curve approximation • Polynomial approximation from a fit in each view was implemented separately • The integration of the two pieces was immediately successful
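The concrete example above can be sketched as follows: the path-length code is written against an abstract curve, tested with a straight line, and then works unchanged with a polynomial. All names (`CurveApproximation`, `StraightLine`, `Polynomial`, `pathLength`) and the chord-summing method are assumptions made for illustration, not the BaBar IFR code.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Abstract curve y(x) in one detector view.
class CurveApproximation {
public:
    virtual ~CurveApproximation() = default;
    virtual double y(double x) const = 0;
};

class StraightLine : public CurveApproximation {
public:
    StraightLine(double intercept, double slope) : intercept_(intercept), slope_(slope) {}
    double y(double x) const override { return intercept_ + slope_ * x; }

private:
    double intercept_;
    double slope_;
};

class Polynomial : public CurveApproximation {
public:
    // Coefficients in increasing order: c0 + c1*x + c2*x^2 + ...
    explicit Polynomial(std::vector<double> coefficients)
        : coefficients_(std::move(coefficients)) {}
    double y(double x) const override {
        double result = 0.0;
        for (auto it = coefficients_.rbegin(); it != coefficients_.rend(); ++it) {
            result = result * x + *it;  // Horner's scheme
        }
        return result;
    }

private:
    std::vector<double> coefficients_;
};

// Approximates the path length between x0 and x1 by summing short chords.
// It depends only on the abstract interface, so any curve can be plugged in.
double pathLength(const CurveApproximation& curve, double x0, double x1, int steps) {
    assert(steps > 0 && "need at least one step");
    double length = 0.0;
    double prevX = x0;
    double prevY = curve.y(x0);
    for (int i = 1; i <= steps; ++i) {
        const double x = x0 + (x1 - x0) * i / steps;
        const double yValue = curve.y(x);
        length += std::hypot(x - prevX, yValue - prevY);
        prevX = x;
        prevY = yValue;
    }
    return length;
}
```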

  35. Simulation

  36. Geant4 Capabilities • Very powerful Geant4 kernel • tracking, stacks, geometry, hits, .. • Extensive & transparent physics models • electromagnetic, hadronic, … • extended energy range, new models • Persistency, Visualization, ... • Surpasses Geant-3 • in nearly every respect

  37. X-Ray Surveys of Asteroids and Moons • Induced X-ray line emission: indicator of target composition (~100 mm surface layer) • C, N, O line emissions included [Figure labels: cosmic rays, jovian electrons; solar X-rays, e, p; Geant3.21; ITS3.0, EGS4; Geant4. Image courtesy SOHO EIT. Source: ESA Space Environment & Effects Analysis Section.]

  38. Hadronic shower models in Geant4: a typical example of OO design • Highly structured and layered object model (inheritance tree): • at each level a given set of functionalities is made concrete, which will be common to a given branch • 1st level: calculation of cross-sections and final states for particles in flight and at rest in a medium. • 5th level: implement the fragmentation function for string decay • Results in a flexible framework for implementing new hadronic interaction models

  39. Vincenzo Innocente

  40. Changing cuts • Results very stable with variation of cuts • even track length • Also see shower profiles for different cuts (next slide) • between 10 mm and 50 microns

  41. CMS Geometry Model using GEANT4 • Categories based on responsibilities • Geometry categories: CMS specific, OSCAR (Geant4) & Persistent • Hits categories: CMS & OSCAR • User Interaction categories: User Actions, GUI • Utilities: Materials, Rotation Matrices

  42. ATLAS Accordion Calorimeter • G3: 0.5 Megabytes, 10 seconds*SPECint95/GeV • STATIC GEOMETRY • 110 Megabytes of memory • CPU time is 9.5 seconds*SPECint95/GeV • PARAMETERIZED GEOMETRY • 1500 seconds*SPECint95/GeV (1D voxelization) • TAILORED GEOMETRY (G4Accordeon) • 8 Megabytes of memory • CPU time is 11.5 seconds*SPECint95/GeV.

  43. ATLAS Calorimeter • The first results on EM shower simulations are close to test beam and GEANT3 results, but more work is needed to understand the differences. • GEANT4 performance comparable to that of GEANT3 can be achieved. • The design of GEANT4 allows a user to extend GEANT4 functionality. This helps to implement the new idea of “tailored” geometry description that can be used for high performance simulation of any calorimeter or other regular structure.

  44. The Virtual MC [Figure: detector code and AliRun talk to the abstract AliMC interface, which is implemented by TGeant3, TGeant4 and TFluka; G3 geometry is converted to G4 geometry by G3toG4.]
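The structure in this figure, detector code written once against an abstract transport interface with interchangeable engines behind it, can be sketched as below. The interface and both engines are toy stand-ins: `VirtualMC`, `ToyEngineA`, `ToyEngineB`, `makeEngine` and `depositedEnergy` are hypothetical and do not reproduce the AliMC, TGeant3, TGeant4 or TFluka interfaces.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical abstract transport interface, playing the role AliMC has in the figure.
class VirtualMC {
public:
    virtual ~VirtualMC() = default;
    virtual std::string name() const = 0;
    // Toy "transport": energy deposited in a slab for a given incident energy.
    virtual double depositedEnergy(double incidentEnergy) const = 0;
};

// Two toy engines with different (made-up) responses.
class ToyEngineA : public VirtualMC {
public:
    std::string name() const override { return "A"; }
    double depositedEnergy(double incidentEnergy) const override { return 0.5 * incidentEnergy; }
};

class ToyEngineB : public VirtualMC {
public:
    std::string name() const override { return "B"; }
    double depositedEnergy(double incidentEnergy) const override { return 0.25 * incidentEnergy; }
};

// Detector code is written once, against the interface only.
double simulateHit(const VirtualMC& engine, double incidentEnergy) {
    assert(incidentEnergy >= 0.0 && "energy must be non-negative");
    return engine.depositedEnergy(incidentEnergy);
}

// The engine is selected by name at run time; returns nullptr for unknown names.
std::unique_ptr<VirtualMC> makeEngine(const std::string& which) {
    if (which == "A") return std::unique_ptr<VirtualMC>(new ToyEngineA);
    if (which == "B") return std::unique_ptr<VirtualMC>(new ToyEngineB);
    return nullptr;
}
```

Switching engines changes one configuration string, while `simulateHit` and any other detector code stay untouched.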

  45. Tracking schema [Figure labels: FLUKA Step, GUSTEP, Geant4StepManager, AliRun::StepManager, Module Version StepManager, Add the hit, Disk I/O, Root, Inverse Framework plug-in]

  46. StdHepC++ • There is a strong need for a standard C++ Monte Carlo generator interface. • StdHepC++ is a natural object-oriented implementation of such an interface. • At present we have working examples which integrate StdHepC++ with the Fortran versions of Herwig, Pythia, Isajet. • On the other side, StdHepC++ provides event blocks readable by MCFast and Geant3, and will have an interface to Geant4.

  47. LHC++: what it is (I) • Modular replacement of current CERNLIB for use in HEP experiments • memory management (C++) • persistency (“I/O”) • mathematical library • foundation classes • random number generators • histogramming • fitting • simulation

  48. LHC++ Present configuration • Object persistency • from RD45 collaboration (Objectivity/DB) • Foundation classes • HEP specific foundation classes (CLHEP) • Random number generators (CLHEP) • Mathematical library from NAG (NAG_C) • covers broad range of functionality • extensions required by CERN will be added in next release (Mark 6) • quality assurance

  49. LHC++ Present configuration (cont.) • Simulation: GEANT-4 • worldwide collaboration • complete OO design • Histogramming: HTL • Fitting: Gemini, HepFitting packages • interface to any minimizer (at present: NAG, Minuit) • Event generators • Lund people started Pythia-7 (C++) • StdHep++ is in the process of becoming part of CLHEP

  50. LHC++ packages and dependencies
