
Anaphe OO libraries for data analysis


Presentation Transcript


  1. Anaphe OO libraries for data analysis
  Jakub T. Mościcki, CERN IT/API, jakub.moscicki@cern.ch

  2. Outline
  • Overview of Anaphe and LHC Computing
  • Anaphe components
  • Lizard - Interactive Data Analysis Tool
  • Summary

  3. LHC Computing challenge

  4. LHC & The Alps Interaction Points ~100m deep 27km circumference CERN IT/API, Jakub.Moscicki@cern.ch

  5. LHC Computing Challenge
  • 4 experiments will create a huge amount of data
  • >1 PetaByte/year for each experiment!
    • 10^15 Bytes
    • 1,000 TeraBytes
    • 20,000 Redwood tapes
    • 100,000 dual-sided DVD-RAM disks
    • 1,500,000 sets of the Encyclopaedia Britannica (w/o photos)
  • Need lots of CPU power to reconstruct/analyse
    • about 1000 PC boxes per experiment (2004 ones!)
  • complex data models
  • Data mining and analysis by thousands of geographically dispersed scientists around the globe

  6. Lifetime of LHC software = 25 yrs (timeline figure; label: WWW)

  7. Technology (R)Evolution
  • 10 yrs major cycle length (HW, SW, OS)
    • ~12 evolutionary changes in the market
    • 1 revolutionary change
    • towards greater diversity
  • don't forget changes of requirements
  • Consequences
    • SW written today will most probably be rewritten tomorrow
    • We must anticipate changes

  8. Anaphe: what it is
  • Modular (OO/C++) replacement of CERNLIB functionality for use in HEP experiments (previously LHC++)
    • memory management and I/O
    • foundation classes
    • histogramming, minimizing/fitting
    • visualization
    • interactive data analysis
  • Trying to use standards wherever possible
  • Trying to re-use existing class libraries
  • This talk will not cover detector simulation (GEANT-4)

  9. Anaphe Components

  10. 'Layered' Approach
  • Components are individual C++ class libraries.
  • Easy to replace one part without throwing away everything
    • Alternative implementations are interchangeable
      • HepODBMS versus HBOOK Ntuples
      • NAG C minimizers versus MINUIT
    • Easy customization to match experiment-specific needs
    • Runtime flexibility
  • Components may be used individually (limited interdependencies)
  • Insulate components through Abstract Interfaces (see the sketch below)
    • "wrapper" layer to implement the Interfaces in terms of existing libs
  • Identify and use patterns - avoid anti-patterns
    • learn from other people's experiences/failures
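
  To make the abstract-interface idea concrete, here is a minimal C++ sketch. The names (IMinimizer, DummyMinimizer) are illustrative only and do not reproduce Anaphe's actual headers; a real wrapper would delegate to NAG C or MINUIT instead of the dummy implementation shown.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Abstract Interface: clients code only against this class.
    class IMinimizer {
    public:
      virtual ~IMinimizer() {}
      // Minimize an objective starting from x; return the minimum value found.
      virtual double minimize(std::vector<double>& x) = 0;
    };

    // "Wrapper" layer: one concrete implementation in terms of an existing library
    // (here a stand-in that pretends the minimum is at the origin).
    class DummyMinimizer : public IMinimizer {
    public:
      double minimize(std::vector<double>& x) {
        for (std::size_t i = 0; i < x.size(); ++i) x[i] = 0.0;
        return 0.0;
      }
    };

    int main() {
      DummyMinimizer impl;
      IMinimizer& minimizer = impl;        // the client sees only the interface
      std::vector<double> point(2, 1.0);
      std::cout << "min = " << minimizer.minimize(point) << std::endl;
      return 0;
    }

  Because the client holds only an IMinimizer reference, swapping the concrete minimizer is a link-time or run-time choice, which is exactly the interchangeability the layered approach aims for.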

  11. Anaphe Components: Overview

  12. Users and Collaborations
  • AIDA spoken here!
  • IGUANA (CMS visualization)
  • GAUDI (LHCb) framework
  • ATHENA (Atlas) framework
  • Analyzer modules in Geant 4
  • JAS
  • Open Scientist
  • …you?

  13. Anaphe components

  14. CLHEP
  • HEP foundation class library
    • Random number generators
    • Physics vectors (3- and 4-vectors)
    • Geometry
    • Linear algebra
    • System of units
  • more packages recently added
  • will continue to evolve
  • wwwinfo.cern.ch/asd/lhc++/clhep/
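
  For illustration, a small CLHEP usage sketch in the pre-namespace CLHEP 1.x style of that period (header paths and constructor defaults may differ in other releases):

    #include "CLHEP/Vector/LorentzVector.h"
    #include "CLHEP/Random/RanluxEngine.h"
    #include "CLHEP/Random/RandGauss.h"
    #include <iostream>

    int main() {
      // Physics 4-vectors: build two particles and form their invariant mass.
      HepLorentzVector p1(1.0, 0.0, 2.0, 3.0);   // (px, py, pz, E)
      HepLorentzVector p2(0.5, 0.5, 1.0, 2.0);
      std::cout << "invariant mass = " << (p1 + p2).m() << std::endl;

      // Random number generators: Gaussian numbers from a RANLUX engine.
      RanluxEngine engine(12345);
      RandGauss gauss(engine, 0.0, 1.0);          // mean 0, sigma 1
      std::cout << "gaussian = " << gauss.fire() << std::endl;
      return 0;
    }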

  15. 2D Graphics libraries
  • Qt
    • multi-platform C++ GUI toolkit
    • C++ class library, not a wrapper around C libs
    • superset of Motif and MFC
    • available on Unix and MS Windows - no change for the developer
    • commercial, but with a public-domain version
    • www.troll.no
  • Qplotter
    • "add-on" functionality for HEP ("HIGZ/HPLOT")
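
  A minimal Qt program of the kind referred to here, written against the Qt 2.x/3.x API of that era (setMainWidget was later removed in Qt 4; the example is a sketch, not Qplotter code):

    #include <qapplication.h>
    #include <qpushbutton.h>

    int main(int argc, char** argv) {
      QApplication app(argc, argv);              // one application object per program
      QPushButton hello("Hello Anaphe", 0);      // a top-level button widget
      app.setMainWidget(&hello);                 // closing it quits the event loop
      hello.show();
      return app.exec();                         // enter the GUI event loop
    }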

  16. Basic 3D Graphics Libraries
  • OpenGL (basic graphics)
    • De-facto industry standard for basic 3D graphics
    • Used in CAD/CAE, games, VR, medical imaging
  • OpenInventor (scene management)
    • OO 3D toolkit for graphics
    • Cubes, polygons, text, materials
    • Cameras, lights, picking
    • 3D viewers/editors, animation
    • Based on OpenGL/MesaGL
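
  A minimal Open Inventor scene-graph sketch; the viewer/rendering side (e.g. the SoXt or SoQt examiner viewers) is platform dependent and omitted here:

    #include <Inventor/SoDB.h>
    #include <Inventor/nodes/SoSeparator.h>
    #include <Inventor/nodes/SoMaterial.h>
    #include <Inventor/nodes/SoCube.h>

    int main() {
      SoDB::init();                              // initialize the Inventor database

      SoSeparator* root = new SoSeparator;       // root of the scene graph
      root->ref();

      SoMaterial* material = new SoMaterial;     // a red material node
      material->diffuseColor.setValue(1.0f, 0.0f, 0.0f);
      root->addChild(material);

      root->addChild(new SoCube);                // a unit cube drawn with that material

      // ... hand "root" to a viewer to display and interact with the scene ...

      root->unref();
      return 0;
    }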

  17. Mathematical Libraries
  • NAG (Numerical Algorithms Group) C Library
  • Covers a broad range of functionality
    • Linear algebra
    • differential equations
    • quadrature, etc.
  • Special functions of CERNLIB added to the Mark-6 release
    • mostly for theory and accelerator work
  • Quality assurance
    • extensive testing done by NAG
  • www.nag.com

  18. Histograms: the HTL package
  • Histograms are the basic tool for physics analysis
    • Statistical information on density distributions
  • Histogram Template Library (HTL)
    • design based on C++ templates
    • Modular: separation between sampling and display
    • Extensible: open to user-defined binning systems
    • Flexible: supports transient and persistent histograms at the same time
    • Open: large use of abstract interfaces
    • recent addition: 3D histograms
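
  The real HTL headers are not reproduced here; the following toy sketch only illustrates the idea of a histogram templated on a user-defined binning system, with entirely hypothetical class names:

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // A binner maps a value to a bin index; users can supply their own policy.
    struct LinearBinner {
      double low, high; std::size_t nbins;
      LinearBinner(double l, double h, std::size_t n) : low(l), high(h), nbins(n) {}
      std::size_t index(double x) const {
        if (x < low)   return 0;                 // underflow folded into the first bin
        if (x >= high) return nbins - 1;         // overflow folded into the last bin
        return static_cast<std::size_t>((x - low) / (high - low) * nbins);
      }
    };

    // Histogram templated on the binning policy: sampling is independent of display.
    template <class Binner>
    class Histogram1D {
    public:
      explicit Histogram1D(const Binner& b) : binner_(b), bins_(b.nbins, 0.0) {}
      void fill(double x, double weight = 1.0) { bins_[binner_.index(x)] += weight; }
      double binContent(std::size_t i) const { return bins_[i]; }
    private:
      Binner binner_;
      std::vector<double> bins_;
    };

    int main() {
      Histogram1D<LinearBinner> h(LinearBinner(0.0, 5000.0, 100));
      h.fill(1234.5);
      h.fill(1250.0, 2.0);
      std::cout << "bin 24 content = " << h.binContent(24) << std::endl;
      return 0;
    }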

  19. Fitting and Minimization
  • Fitting and Minimization Library (FML)
    • common OO interface to NAG C and MINUIT
    • based on Abstract Interfaces (IVector, IModelFunction, …)
    • fitting as a special case of minimization: minimize the "distance" between data and model
    • replacement for HepFitting (and Gemini)
  • Gemini
    • common minimization interface
    • very thin layer
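
  To make "fitting as a special case of minimization" concrete, here is a toy chi-square objective coded against a hand-rolled model-function class. FML's actual IModelFunction/IVector interfaces are richer (parameter management, gradients, etc.), so all names below are simplified assumptions; a minimizer such as MINUIT or NAG C would vary the parameters to minimize this objective.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Simplified stand-in for an abstract model function: y = f(x; params).
    class ModelFunction {
    public:
      virtual ~ModelFunction() {}
      virtual double value(double x, const std::vector<double>& params) const = 0;
    };

    class LineModel : public ModelFunction {
    public:
      // params[0] = intercept, params[1] = slope
      double value(double x, const std::vector<double>& p) const { return p[0] + p[1] * x; }
    };

    // The "distance" between data and model that the minimizer would minimize.
    double chiSquare(const ModelFunction& model,
                     const std::vector<double>& xs,
                     const std::vector<double>& ys,
                     const std::vector<double>& errors,
                     const std::vector<double>& params) {
      double sum = 0.0;
      for (std::size_t i = 0; i < xs.size(); ++i) {
        double r = (ys[i] - model.value(xs[i], params)) / errors[i];
        sum += r * r;
      }
      return sum;
    }

    int main() {
      double xa[] = {0.0, 1.0, 2.0}, ya[] = {1.1, 2.9, 5.2}, ea[] = {0.1, 0.1, 0.1};
      std::vector<double> xs(xa, xa + 3), ys(ya, ya + 3), errors(ea, ea + 3);
      std::vector<double> params(2);
      params[0] = 1.0; params[1] = 2.0;          // trial parameter values
      LineModel line;
      std::cout << "chi2 = " << chiSquare(line, xs, ys, errors, params) << std::endl;
      return 0;
    }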

  20. Object Association: Tags, Ntuples and Events
  • NtupleTag Library
    • Ntuple navigation and analysis
    • common OO interface for different storage back-ends: ODBMS, HBOOK (CERNLIB)
  • Exploits the Tag concept
    • enhanced Ntuples, associated with an underlying persistent store
    • an optional association to the Event may be used to navigate to any other part of the Event, even from an interactive visualization program
  • main use: speed up data selection for analysis
    • Tag data is typically better clustered than the original data

  21. Interactive Data Analysis

  22. Interactive Data Analysis
  • Aim: "OO replacement for PAW"
    • analysis of "ntuple-like data" ("Tags", "Ntuples", …)
    • visualisation of data (Histograms, scatter-plots, "Vectors")
    • fitting of histograms (and other data)
    • access to experiment-specific data/code
  • Maximize flexibility and re-use
    • plug-in structure
    • careful design with limited source and binary dependencies
  • Foresee customization/integration
    • allow use from within the experiment's s/w framework!

  23. Lizard Internals: Interfaces

  24. Anaphe components

  25. Architectural issue: Scripting
  • Typical use of scripting is quite different from programming (reconstruction, analysis, ...)
    • history: "go back to where I was before"
    • repetition/looping - with "modifiable parameters"
  • SWIG to (semi-)automatically create the connection to the chosen scripting language
    • allows the flexibility to choose amongst several scripting languages: Python, Perl, Tcl, Guile, Ruby, (Java), …
  • Python - OO scripting, no "strange $!%-variables"
    • other scripting languages possible (through SWIG)
  • Can be enhanced and/or replaced by a GUI
    • scripting window within a GUI application

  26. Example script (ntuple)

  # get list of names of all tuples from the tuple manager
  ntm.listTuples()
  nt1 = ntm.findNtuple("Charm1")   # retrieve tuple by name
  # create 1D histos to project into
  h1 = hm.create1D(10, "mass", 100, 0., 5000.)
  h2 = hm.create1D(20, "mass for pt1>10", 100, 0., 5000.)
  # project the attribute "MASS" into histo h1 without cut ("")
  nt1.project1D(h1, "", "MASS")
  # project the attribute "MASS" into histo h2 with cut ("PT1>10")
  nt1.project1D(h2, "PT1>10", "MASS")

  27. (figure slide; no text in transcript)

  28. Lizard: History and Present Status
  • Started after CHEP-2000
  • Full version out since June 2001
  • "PAW-like" analysis functionality, plus
    • on-demand loading of compiled code using shared libraries
    • gives full access to the experiment's analysis code and data
  • based on Abstract Interfaces
    • flexible and extensible

  29. Possible Future Enhancements
  • Access to other implementations of components
    • HBOOK histograms and ntuples (RWN) /coming soon/
    • OpenScientist, ROOT histograms?
  • Adding other "scripting" languages
    • Perl, Tcl, cint?
  • Communication with Java tools/packages (via AIDA)
    • JAS
    • WIRED

  30. Architectural issue: Distributed Computing
  • Motivation
    • move code to data
    • parallel analysis
  • Techniques
    • services via Abstract Interfaces
    • late binding
    • plug-in architecture
  • End-user (Lizard)
    • look-and-feel of local analysis
  • R&D started and first prototype available soon
  • CORBA

  31. Summary
  • The architecture of Anaphe shows some important points for flexible and modular data analysis:
    • weak coupling between components through the use of Abstract Interfaces
    • basic functionality covered by C++ class libraries
  • Major criteria are flexibility, extensibility and interoperability
    • recent example: GEANT-4 space examples using the G4Analysis component (based on AIDA)
  • Lizard is based on Anaphe components and the Python scripting language (through SWIG)
  • Lizard is young but has a very solid base in the mature Anaphe libraries
    • real plug-in structure

  32. More information
  • cern.ch/Anaphe
  • cern.ch/Anaphe/Lizard
  • aida.freehep.org/
  • cern.ch/DB
  • wwwinfo.cern.ch/asd/lhc++/clhep/

  33. (figure slide; no text in transcript)

  34. Opening bracket: Persistency

  35. Ntuple versus TagDB Model (diagram; labels: Event Data Files, ad hoc extraction prg., Ntuple File, Federated DB of Event & Tag, Object Association)

  36. Object persistency - two concepts: serial and page I/O
  • "Sequential access to objects" (streaming)
    • good in a networking context or for serial writes to files
    • much like "good old Fortran"
    • often perceived to be "simpler" to implement ("<<", ">>")
  • "Navigational access to objects" (buffered)
    • I/O on demand for complex data models
    • optimized for (random) disk access (disks deliver pages)
    • sequential write to file still ok
  • Both concepts need to take care of changes in the internal structure of the objects (schema evolution)
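
  A sketch of the "sequential/streaming" flavour using plain iostream insertion and extraction operators; this is a generic illustration, not Anaphe's persistency API:

    #include <iostream>
    #include <sstream>

    struct Hit {
      int    channel;
      double energy;
    };

    // "Sequential access to objects": stream an object out ...
    std::ostream& operator<<(std::ostream& os, const Hit& h) {
      return os << h.channel << ' ' << h.energy << ' ';
    }

    // ... and read it back in exactly the order it was written.
    std::istream& operator>>(std::istream& is, Hit& h) {
      return is >> h.channel >> h.energy;
    }

    int main() {
      Hit out = {42, 1.25};
      std::stringstream buffer;          // stands in for a file or a network connection
      buffer << out;                     // serial write

      Hit in = {0, 0.0};
      buffer >> in;                      // serial read, no random access
      std::cout << "channel " << in.channel << ", energy " << in.energy << std::endl;
      return 0;
    }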

  37. Architectural Issue: Persistency ("Object-I/O")
  • Brings a completely new quality into the design
  • Objects now have a lifetime
    • don't "delete" until you are really sure you want to
    • persistency is a kind of "intended memory leak"
  • Objects may change during their (extended) life
    • "schema evolution"
    • additions/deletions of attributes
    • changes of inheritance relations
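
  A toy illustration of the schema-evolution problem: an attribute added between two versions of a class, with old data given a default on read. This is hand-rolled code for illustration only; an ODBMS handles this through its own schema-evolution machinery.

    #include <iostream>
    #include <sstream>

    // Version 1 of the persistent class: only the energy was stored.
    struct TrackV1 {
      double energy;
    };

    // Version 2 adds a time attribute; data written with V1 has no value for it.
    struct TrackV2 {
      double energy;
      double time;
    };

    // "Schema evolution" by hand: read a record and fill new attributes with defaults.
    TrackV2 readTrack(std::istream& is, int storedVersion) {
      TrackV2 t;
      is >> t.energy;
      if (storedVersion >= 2) is >> t.time;
      else t.time = 0.0;                 // default for data written before the attribute existed
      return t;
    }

    int main() {
      std::istringstream oldRecord("12.5");      // written by the V1 schema
      TrackV2 t = readTrack(oldRecord, 1);
      std::cout << "energy " << t.energy << ", time " << t.time << std::endl;
      return 0;
    }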

  38. Architectural Issue: Persistency ("Object-I/O") (II)
  • Objects can be placed ("clustering")
    • de-coupling of the logical and physical views of the data
  • Special care needed to ensure consistency of the data set
    • avoid reading a group of objects (tracks, events, ...) for which writing/updating is not (yet) complete
    • clean up if only part of the objects are written
    • typically taken care of by using transactions
  • Complications possible in distributed computing
    • need to protect disk access now like memory access in the past ("Segmentation violation")

  39. Physical Model and Logical Model
  • Physical model may be changed to optimise performance
  • Existing applications continue to work

  40. Concurrent Access
  • Data changes are part of a Transaction
    • ACID: Atomicity, Consistency, Isolation, Durability
    • Guarantees consistency of the data
  • Support for multiple concurrent writers
    • e.g. multiple parallel data streams
    • e.g. filter or reconstruction farms
    • e.g. distributed simulation
  • Access is co-ordinated by a lock server
    • MROW: Multiple Readers, One Writer per container (Objectivity/DB)
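
  A generic sketch of transactional update using RAII; the Transaction class below is hypothetical and only illustrates the commit-or-roll-back behaviour described above (Objectivity/DB has its own transaction and lock-server API):

    #include <iostream>
    #include <stdexcept>

    // Hypothetical transaction wrapper: commit on success, roll back otherwise.
    class Transaction {
    public:
      Transaction() : active_(true) { std::cout << "begin transaction" << std::endl; }
      ~Transaction() { if (active_) abort(); }   // anything not committed is rolled back
      void commit() { std::cout << "commit" << std::endl; active_ = false; }
      void abort()  { std::cout << "abort (roll back partial writes)" << std::endl; active_ = false; }
    private:
      bool active_;
    };

    void writeEvent(bool fail) {
      Transaction t;
      // ... write event objects, tags and associations here ...
      if (fail) throw std::runtime_error("disk full");
      t.commit();                                // all writes become visible atomically
    }

    int main() {
      writeEvent(false);                         // committed
      try { writeEvent(true); }                  // aborted: readers never see the partial event
      catch (const std::exception& e) { std::cout << "error: " << e.what() << std::endl; }
      return 0;
    }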

  41. Plain files vs. Databases
  • HEP has been using DBs since LEP
    • e.g. FATMEN, HepDB
    • mainly for "meta data"
      • event -> file -> tape mappings
      • calibration data (conditions)
  • Why not use a single system for "meta data" and "data"?
    • the overhead of DB administration is there anyway
    • accessing "meta data" from "data" is significantly easier
      • simple navigation from event -> calibrationData
    • transaction safety also for event/reconstructed data

  42. Persistency: Objectivity/DB
  • ODMG-compliant database
    • the Object Data Management Group defined a standard
    • language bindings for C++, Java, Smalltalk
  • An ODBMS allows persistent objects to be used directly as variables of the OO language
    • the storage entity is a complete object: the state of all data members & the object's class
    • guarantees a consistent view of the data (DB feature)
  • C++ Language Support
    • Abstraction, Inheritance, Polymorphism
    • Parameterised Types (Templates)
  • Location-transparent access to objects

  43. Closing bracket: Persistency
