
Anaphe OO libraries for data analysis


Presentation Transcript


  1. Anaphe OO libraries for data analysis
  Jakub T. Mościcki, CERN IT/API, jakub.moscicki@cern.ch

  2. Outline
  • Overview of Anaphe and LHC Computing
  • Anaphe components
  • Lizard - Interactive Data Analysis Tool
  • Summary

  3. LHC Computing challenge

  4. LHC & The Alps Interaction Points ~100m deep 27km circumference CERN IT/API, Jakub.Moscicki@cern.ch

  5. LHC Computing Challenge
  • 4 experiments will create a huge amount of data
  • >1 PetaByte/year for each experiment!
    • 10^15 Bytes
    • 1,000 TeraBytes
    • 20,000 Redwood tapes
    • 100,000 dual-sided DVD-RAM disks
    • 1,500,000 sets of the Encyclopaedia Britannica (w/o photos)
  • Need lots of CPU power to reconstruct/analyse
    • about 1000 PC boxes per experiment (2004 ones!)
  • complex data models
  • Data mining and analysis by thousands of geographically dispersed scientists around the globe

  6. Lifetime of LHC software = 25 yrs (timeline figure; label: WWW)

  7. Technology (R)Evolution
  • 10 yrs major cycle length (HW, SW, OS)
    • ~12 evolutionary changes in the market
    • 1 revolutionary change
    • towards greater diversity
  • don't forget changes of requirements
  • Consequences
    • SW written today will most probably be rewritten tomorrow
    • We must anticipate changes

  8. Anaphe: what it is
  • Modular (OO/C++) replacement of CERNLIB functionality for use in HEP experiments (previously LHC++)
    • memory management and I/O
    • foundation classes
    • histogramming, minimizing/fitting
    • visualization
    • interactive data analysis
  • Trying to use standards wherever possible
  • Trying to re-use existing class libraries
  • This talk will not cover detector simulation (GEANT-4)

  9. Anaphe Components

  10. 'Layered' Approach
  • Components are individual C++ class libraries.
  • Easy to replace one part without throwing away everything
    • Alternative implementations are interchangeable
      • HepODBMS versus HBOOK Ntuples
      • NAG C minimizers versus MINUIT
    • Easy customization to match experiment-specific needs
    • Runtime flexibility
  • Components may be used individually (limited interdependencies)
  • Insulate components through Abstract Interfaces (see the sketch below)
    • "wrapper" layer to implement the Interfaces in terms of existing libs
  • Identify and use patterns - avoid anti-patterns
    • learn from other people's experiences/failures
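
  To make the abstract-interface idea concrete, here is a minimal C++ sketch. The names (IMinimizer, DummyMinimizer) are illustrative only and do not reproduce Anaphe's actual headers; a real wrapper would delegate to NAG C or MINUIT instead of the dummy implementation shown.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Abstract Interface: clients code only against this class.
    class IMinimizer {
    public:
      virtual ~IMinimizer() {}
      // Minimize an objective starting from x; return the minimum value found.
      virtual double minimize(std::vector<double>& x) = 0;
    };

    // "Wrapper" layer: one concrete implementation in terms of an existing library
    // (here a stand-in that pretends the minimum is at the origin).
    class DummyMinimizer : public IMinimizer {
    public:
      double minimize(std::vector<double>& x) {
        for (std::size_t i = 0; i < x.size(); ++i) x[i] = 0.0;
        return 0.0;
      }
    };

    int main() {
      DummyMinimizer impl;
      IMinimizer& minimizer = impl;        // the client sees only the interface
      std::vector<double> point(2, 1.0);
      std::cout << "min = " << minimizer.minimize(point) << std::endl;
      return 0;
    }

  Because the client holds only an IMinimizer reference, swapping the concrete minimizer is a link-time or run-time choice, which is exactly the interchangeability the layered approach aims for.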

  11. Anaphe Components: Overview

  12. Users and Collaborations
  • AIDA spoken here!
  • IGUANA (CMS visualization)
  • GAUDI (LHCb) framework
  • ATHENA (Atlas) framework
  • Analyzer modules in Geant 4
  • JAS
  • Open Scientist
  • …you?

  13. Anaphe components

  14. CLHEP
  • HEP foundation class library
    • Random number generators
    • Physics vectors (3- and 4-vectors)
    • Geometry
    • Linear algebra
    • System of units
  • more packages recently added
  • will continue to evolve
  • wwwinfo.cern.ch/asd/lhc++/clhep/
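
  For illustration, a small CLHEP usage sketch in the pre-namespace CLHEP 1.x style of that period (header paths and constructor defaults may differ in other releases):

    #include "CLHEP/Vector/LorentzVector.h"
    #include "CLHEP/Random/RanluxEngine.h"
    #include "CLHEP/Random/RandGauss.h"
    #include <iostream>

    int main() {
      // Physics 4-vectors: build two particles and form their invariant mass.
      HepLorentzVector p1(1.0, 0.0, 2.0, 3.0);   // (px, py, pz, E)
      HepLorentzVector p2(0.5, 0.5, 1.0, 2.0);
      std::cout << "invariant mass = " << (p1 + p2).m() << std::endl;

      // Random number generators: Gaussian numbers from a RANLUX engine.
      RanluxEngine engine(12345);
      RandGauss gauss(engine, 0.0, 1.0);          // mean 0, sigma 1
      std::cout << "gaussian = " << gauss.fire() << std::endl;
      return 0;
    }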

  15. 2D Graphics libraries
  • Qt
    • multi-platform C++ GUI toolkit
    • C++ class library, not a wrapper around C libs
    • superset of Motif and MFC
    • available on Unix and MS Windows - no change for the developer
    • commercial, but with a public-domain version
    • www.troll.no
  • Qplotter
    • "add-on" functionality for HEP ("HIGZ/HPLOT")
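
  A minimal Qt program of the kind referred to here, written against the Qt 2.x/3.x API of that era (setMainWidget was later removed in Qt 4; the example is a sketch, not Qplotter code):

    #include <qapplication.h>
    #include <qpushbutton.h>

    int main(int argc, char** argv) {
      QApplication app(argc, argv);              // one application object per program
      QPushButton hello("Hello Anaphe", 0);      // a top-level button widget
      app.setMainWidget(&hello);                 // closing it quits the event loop
      hello.show();
      return app.exec();                         // enter the GUI event loop
    }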

  16. Basic 3D Graphics Libraries
  • OpenGL (basic graphics)
    • De-facto industry standard for basic 3D graphics
    • Used in CAD/CAE, games, VR, medical imaging
  • OpenInventor (scene management)
    • OO 3D toolkit for graphics
    • Cubes, polygons, text, materials
    • Cameras, lights, picking
    • 3D viewers/editors, animation
    • Based on OpenGL/MesaGL
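
  A minimal Open Inventor scene-graph sketch; the viewer/rendering side (e.g. the SoXt or SoQt examiner viewers) is platform dependent and omitted here:

    #include <Inventor/SoDB.h>
    #include <Inventor/nodes/SoSeparator.h>
    #include <Inventor/nodes/SoMaterial.h>
    #include <Inventor/nodes/SoCube.h>

    int main() {
      SoDB::init();                              // initialize the Inventor database

      SoSeparator* root = new SoSeparator;       // root of the scene graph
      root->ref();

      SoMaterial* material = new SoMaterial;     // a red material node
      material->diffuseColor.setValue(1.0f, 0.0f, 0.0f);
      root->addChild(material);

      root->addChild(new SoCube);                // a unit cube drawn with that material

      // ... hand "root" to a viewer to display and interact with the scene ...

      root->unref();
      return 0;
    }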

  17. Mathematical Libraries
  • NAG (Numerical Algorithms Group) C Library
  • Covers a broad range of functionality
    • Linear algebra
    • differential equations
    • quadrature, etc.
  • Special functions of CERNLIB added to the Mark-6 release
    • mostly for theory and accelerator work
  • Quality assurance
    • extensive testing done by NAG
  • www.nag.com

  18. Histograms: the HTL package
  • Histograms are the basic tool for physics analysis
    • Statistical information on density distributions
  • Histogram Template Library (HTL)
    • design based on C++ templates
    • Modular: separation between sampling and display
    • Extensible: open to user-defined binning systems
    • Flexible: supports transient and persistent histograms at the same time
    • Open: large use of abstract interfaces
    • recent addition: 3D histograms
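
  The real HTL headers are not reproduced here; the following toy sketch only illustrates the idea of a histogram templated on a user-defined binning system, with entirely hypothetical class names:

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // A binner maps a value to a bin index; users can supply their own policy.
    struct LinearBinner {
      double low, high; std::size_t nbins;
      LinearBinner(double l, double h, std::size_t n) : low(l), high(h), nbins(n) {}
      std::size_t index(double x) const {
        if (x < low)   return 0;                 // underflow folded into the first bin
        if (x >= high) return nbins - 1;         // overflow folded into the last bin
        return static_cast<std::size_t>((x - low) / (high - low) * nbins);
      }
    };

    // Histogram templated on the binning policy: sampling is independent of display.
    template <class Binner>
    class Histogram1D {
    public:
      explicit Histogram1D(const Binner& b) : binner_(b), bins_(b.nbins, 0.0) {}
      void fill(double x, double weight = 1.0) { bins_[binner_.index(x)] += weight; }
      double binContent(std::size_t i) const { return bins_[i]; }
    private:
      Binner binner_;
      std::vector<double> bins_;
    };

    int main() {
      Histogram1D<LinearBinner> h(LinearBinner(0.0, 5000.0, 100));
      h.fill(1234.5);
      h.fill(1250.0, 2.0);
      std::cout << "bin 24 content = " << h.binContent(24) << std::endl;
      return 0;
    }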

  19. Fitting and Minimization
  • Fitting and Minimization Library (FML)
    • common OO interface to NAG C and MINUIT
    • based on Abstract Interfaces (IVector, IModelFunction, …)
    • fitting as a special case of minimization: minimize the "distance" between data and model
    • replacement for HepFitting (and Gemini)
  • Gemini
    • common minimization interface
    • very thin layer
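
  To make "fitting as a special case of minimization" concrete, here is a toy chi-square objective coded against a hand-rolled model-function class. FML's actual IModelFunction/IVector interfaces are richer (parameter management, gradients, etc.), so all names below are simplified assumptions; a minimizer such as MINUIT or NAG C would vary the parameters to minimize this objective.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Simplified stand-in for an abstract model function: y = f(x; params).
    class ModelFunction {
    public:
      virtual ~ModelFunction() {}
      virtual double value(double x, const std::vector<double>& params) const = 0;
    };

    class LineModel : public ModelFunction {
    public:
      // params[0] = intercept, params[1] = slope
      double value(double x, const std::vector<double>& p) const { return p[0] + p[1] * x; }
    };

    // The "distance" between data and model that the minimizer would minimize.
    double chiSquare(const ModelFunction& model,
                     const std::vector<double>& xs,
                     const std::vector<double>& ys,
                     const std::vector<double>& errors,
                     const std::vector<double>& params) {
      double sum = 0.0;
      for (std::size_t i = 0; i < xs.size(); ++i) {
        double r = (ys[i] - model.value(xs[i], params)) / errors[i];
        sum += r * r;
      }
      return sum;
    }

    int main() {
      double xa[] = {0.0, 1.0, 2.0}, ya[] = {1.1, 2.9, 5.2}, ea[] = {0.1, 0.1, 0.1};
      std::vector<double> xs(xa, xa + 3), ys(ya, ya + 3), errors(ea, ea + 3);
      std::vector<double> params(2);
      params[0] = 1.0; params[1] = 2.0;          // trial parameter values
      LineModel line;
      std::cout << "chi2 = " << chiSquare(line, xs, ys, errors, params) << std::endl;
      return 0;
    }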

  20. Object Association: Tags, Ntuples and Events
  • NtupleTag Library
    • Ntuple navigation and analysis
    • common OO interface for different storage back-ends: ODBMS, HBOOK (CERNLIB)
  • Exploits the Tag concept
    • enhanced Ntuples, associated with an underlying persistent store
    • an optional association to the Event may be used to navigate to any other part of the Event, even from an interactive visualization program
  • main use: speed up data selection for analysis
    • Tag data is typically better clustered than the original data

  21. Interactive Data Analysis

  22. Interactive Data Analysis
  • Aim: "OO replacement for PAW"
    • analysis of "ntuple-like data" ("Tags", "Ntuples", …)
    • visualisation of data (Histograms, scatter-plots, "Vectors")
    • fitting of histograms (and other data)
    • access to experiment-specific data/code
  • Maximize flexibility and re-use
    • plug-in structure
    • careful design with limited source and binary dependencies
  • Foresee customization/integration
    • allow use from within the experiment's s/w framework!

  23. Lizard Internals: Interfaces

  24. Anaphe components

  25. Architectural issue: Scripting
  • Typical use of scripting is quite different from programming (reconstruction, analysis, ...)
    • history: "go back to where I was before"
    • repetition/looping - with "modifiable parameters"
  • SWIG to (semi-)automatically create the connection to the chosen scripting language
    • allows the flexibility to choose amongst several scripting languages: Python, Perl, Tcl, Guile, Ruby, (Java), …
  • Python - OO scripting, no "strange $!%-variables"
    • other scripting languages possible (through SWIG)
  • Can be enhanced and/or replaced by a GUI
    • scripting window within a GUI application

  26. Example script (ntuple)

  # get list of names of all tuples from the tuple manager
  ntm.listTuples()
  nt1 = ntm.findNtuple("Charm1")   # retrieve tuple by name
  # create 1D histos to project into
  h1 = hm.create1D(10, "mass", 100, 0., 5000.)
  h2 = hm.create1D(20, "mass for pt1>10", 100, 0., 5000.)
  # project the attribute "MASS" into histo h1 without cut ("")
  nt1.project1D(h1, "", "MASS")
  # project the attribute "MASS" into histo h2 with cut ("PT1>10")
  nt1.project1D(h2, "PT1>10", "MASS")

  27. (figure slide; no text in transcript)

  28. Lizard: History and Present Status
  • Started after CHEP-2000
  • Full version out since June 2001
  • "PAW-like" analysis functionality, plus
    • on-demand loading of compiled code using shared libraries
    • gives full access to the experiment's analysis code and data
  • based on Abstract Interfaces
    • flexible and extensible

  29. Possible Future Enhancements
  • Access to other implementations of components
    • HBOOK histograms and ntuples (RWN) /coming soon/
    • OpenScientist, ROOT histograms?
  • Adding other "scripting" languages
    • Perl, Tcl, cint?
  • Communication with Java tools/packages (via AIDA)
    • JAS
    • WIRED

  30. Architectural issue: Distributed Computing
  • Motivation
    • move code to data
    • parallel analysis
  • Techniques
    • services via Abstract Interfaces
    • late binding
    • plug-in architecture
  • End-user (Lizard)
    • look-and-feel of local analysis
  • R&D started and first prototype available soon
  • CORBA

  31. Summary
  • The architecture of Anaphe shows some important points for flexible and modular data analysis:
    • weak coupling between components through the use of Abstract Interfaces
    • basic functionality covered by C++ class libraries
  • Major criteria are flexibility, extensibility and interoperability
    • recent example: GEANT-4 space examples using the G4Analysis component (based on AIDA)
  • Lizard is based on Anaphe components and the Python scripting language (through SWIG)
  • Lizard is young but has a very solid base in the mature Anaphe libraries
    • real plug-in structure

  32. More information
  • cern.ch/Anaphe
  • cern.ch/Anaphe/Lizard
  • aida.freehep.org/
  • cern.ch/DB
  • wwwinfo.cern.ch/asd/lhc++/clhep/

  33. (figure slide; no text in transcript)

  34. Opening bracket: Persistency

  35. Ntuple versus TagDB Model (diagram; labels: Event Data Files, ad hoc extraction prg., Ntuple File, Federated DB of Event & Tag, Object Association)

  36. Object persistency - two concepts: serial and page I/O
  • "Sequential access to objects" (streaming)
    • good in a networking context or for serial writes to files
    • much like "good old Fortran"
    • often perceived to be "simpler" to implement ("<<", ">>")
  • "Navigational access to objects" (buffered)
    • I/O on demand for complex data models
    • optimized for (random) disk access (disks deliver pages)
    • sequential write to file still ok
  • Both concepts need to take care of changes in the internal structure of the objects (schema evolution)
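
  A sketch of the "sequential/streaming" flavour using plain iostream insertion and extraction operators; this is a generic illustration, not Anaphe's persistency API:

    #include <iostream>
    #include <sstream>

    struct Hit {
      int    channel;
      double energy;
    };

    // "Sequential access to objects": stream an object out ...
    std::ostream& operator<<(std::ostream& os, const Hit& h) {
      return os << h.channel << ' ' << h.energy << ' ';
    }

    // ... and read it back in exactly the order it was written.
    std::istream& operator>>(std::istream& is, Hit& h) {
      return is >> h.channel >> h.energy;
    }

    int main() {
      Hit out = {42, 1.25};
      std::stringstream buffer;          // stands in for a file or a network connection
      buffer << out;                     // serial write

      Hit in = {0, 0.0};
      buffer >> in;                      // serial read, no random access
      std::cout << "channel " << in.channel << ", energy " << in.energy << std::endl;
      return 0;
    }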

  37. Architectural Issue: Persistency ("Object-I/O")
  • Brings a completely new quality into the design
  • Objects now have a lifetime
    • don't "delete" until you are really sure you want to
    • persistency is a kind of "intended memory leak"
  • Objects may change during their (extended) life
    • "schema evolution"
    • additions/deletions of attributes
    • changes of inheritance relations
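
  A toy illustration of the schema-evolution problem: an attribute added between two versions of a class, with old data given a default on read. This is hand-rolled code for illustration only; an ODBMS handles this through its own schema-evolution machinery.

    #include <iostream>
    #include <sstream>

    // Version 1 of the persistent class: only the energy was stored.
    struct TrackV1 {
      double energy;
    };

    // Version 2 adds a time attribute; data written with V1 has no value for it.
    struct TrackV2 {
      double energy;
      double time;
    };

    // "Schema evolution" by hand: read a record and fill new attributes with defaults.
    TrackV2 readTrack(std::istream& is, int storedVersion) {
      TrackV2 t;
      is >> t.energy;
      if (storedVersion >= 2) is >> t.time;
      else t.time = 0.0;                 // default for data written before the attribute existed
      return t;
    }

    int main() {
      std::istringstream oldRecord("12.5");      // written by the V1 schema
      TrackV2 t = readTrack(oldRecord, 1);
      std::cout << "energy " << t.energy << ", time " << t.time << std::endl;
      return 0;
    }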

  38. Architectural Issue: Persistency ("Object-I/O") (II)
  • Objects can be placed ("clustering")
    • de-coupling of the logical and physical views of the data
  • Special care needed to ensure consistency of the data set
    • avoid reading a group of objects (tracks, events, ...) for which writing/updating is not (yet) complete
    • clean up if only part of the objects are written
    • typically taken care of by using transactions
  • Complications possible in distributed computing
    • need to protect disk access now like memory access in the past ("Segmentation violation")

  39. Physical Model and Logical Model
  • Physical model may be changed to optimise performance
  • Existing applications continue to work

  40. Concurrent Access
  • Data changes are part of a Transaction
    • ACID: Atomicity, Consistency, Isolation, Durability
    • Guarantees consistency of the data
  • Support for multiple concurrent writers
    • e.g. multiple parallel data streams
    • e.g. filter or reconstruction farms
    • e.g. distributed simulation
  • Access is co-ordinated by a lock server
    • MROW: Multiple Readers, One Writer per container (Objectivity/DB)
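
  A generic sketch of transactional update using RAII; the Transaction class below is hypothetical and only illustrates the commit-or-roll-back behaviour described above (Objectivity/DB has its own transaction and lock-server API):

    #include <iostream>
    #include <stdexcept>

    // Hypothetical transaction wrapper: commit on success, roll back otherwise.
    class Transaction {
    public:
      Transaction() : active_(true) { std::cout << "begin transaction" << std::endl; }
      ~Transaction() { if (active_) abort(); }   // anything not committed is rolled back
      void commit() { std::cout << "commit" << std::endl; active_ = false; }
      void abort()  { std::cout << "abort (roll back partial writes)" << std::endl; active_ = false; }
    private:
      bool active_;
    };

    void writeEvent(bool fail) {
      Transaction t;
      // ... write event objects, tags and associations here ...
      if (fail) throw std::runtime_error("disk full");
      t.commit();                                // all writes become visible atomically
    }

    int main() {
      writeEvent(false);                         // committed
      try { writeEvent(true); }                  // aborted: readers never see the partial event
      catch (const std::exception& e) { std::cout << "error: " << e.what() << std::endl; }
      return 0;
    }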

  41. Plain files vs. Databases
  • HEP has been using DBs since LEP
    • e.g. FATMEN, HepDB
    • mainly for "meta data"
      • event -> file -> tape mappings
      • calibration data (conditions)
  • Why not use a single system for "meta data" and "data"?
    • the overhead of DB administration is there anyway
    • accessing "meta data" from "data" is significantly easier
      • simple navigation from event -> calibrationData
    • transaction safety also for event/reconstructed data

  42. Persistency: Objectivity/DB
  • ODMG-compliant database
    • the Object Data Management Group defined a standard
    • language bindings for C++, Java, Smalltalk
  • An ODBMS allows persistent objects to be used directly as variables of the OO language
    • the storage entity is a complete object: the state of all data members & the object's class
    • guarantees a consistent view of the data (DB feature)
  • C++ Language Support
    • Abstraction, Inheritance, Polymorphism
    • Parameterised Types (Templates)
  • Location-transparent access to objects

  43. Closing bracket: Persistency
