Anaphe oo libraries for data analysis
Download
1 / 43

Anaphe OO libraries for data analysis - PowerPoint PPT Presentation


  • 97 Views
  • Uploaded on

Anaphe OO libraries for data analysis. Jakub T. Mościcki CERN IT/API [email protected] Outline. Overview of Anaphe and LHC Computing Anaphe components Lizard - Interactive Data Analysis Tool Summary. LHC Computing challenge. LHC & The Alps. Interaction Points. ~100m deep.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Anaphe OO libraries for data analysis' - ciaran-downs


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Anaphe oo libraries for data analysis

Anaphe OO libraries for data analysis

Jakub T. Mościcki

CERN IT/API

[email protected]

CERN IT/API, [email protected]


Outline
Outline

  • Overview of Anaphe and LHC Computing

  • Anaphe components

  • Lizard - Interactive Data Analysis Tool

  • Summary

CERN IT/API, [email protected]



Lhc the alps
LHC & The Alps

Interaction

Points

~100m deep

27km circumference

CERN IT/API, [email protected]


Lhc computing challenge1
LHC Computing Challenge

  • 4 experiments will create huge amount of data

    • >1 PetaByte/year for each experiment !

      • 1015 Bytes

      • 1,000 TeraBytes

      • 20,000 Redwood tapes

      • 100,000 dual-sided DVD-RAM disks

      • 1,500,000 sets of the Encyclopaedia Britannica(w/o photos)

  • Need lots of CPU power to reconstruct/analyse

    • about 1000 PC boxes per experiment (2004 ones !)

      • complex data models

  • Data mining and analysis by thousands of geographically dispersed scientists around the globe

CERN IT/API, [email protected]



Technology r evolution
Technology (R)Evolution

  • 10 yrs major cycle length (HW,SW,OS)

    • ~12 evolutionary changes in the market

    • 1 revolutionary change

    • towards greater diversity

    • don’t forget changes of requirements

  • Consequences

    • SW written today most probably will be rewritten tomorrow

    • We must anticipate changes

CERN IT/API, [email protected]


Anaphe what it is
Anaphe: what it is

  • Modular (OO/C++) replacement of CERNLIB functionality for use in HEP experiments (previously LHC++)

    • memory management and I/O

    • foundation classes

    • histogramming, minimizing/fitting

    • visualization

    • interactive data analysis

  • Trying to use standards wherever possible

  • Trying to re-use existing class libraries

  • This talk will not cover detector simulation (GEANT-4)

CERN IT/API, [email protected]



Layered approach
‘Layered’ Approach

  • Components are individual C++ class libraries.

    • Easy to replace one part without throwing away everything

    • Alternative implementations interchangeable

      • HepODBMS versus HBOOK Ntuples

      • Nag C minimizers versus MINUIT

    • Easy customization to match experiment specific needs

    • Runtime flexibility

    • Components may be used individually (limited interdependencies)

  • Insulate components through Abstract Interfaces

    • “wrapper” layer to implement Interfaces in terms of existing libs

  • Identify and use patterns - avoid anti-patterns

    • learn from other people’s experiences/failures

CERN IT/API, [email protected]



Users and collaborations
Users and Collaborations

  • AIDA spoken here!

    • IGUANA (CMS visualization)

    • GAUDI (LHCb) framework

    • ATHENA (Atlas) framework

    • Analyzer modules in Geant 4

    • JAS

    • Open Scientist

    • …you?

CERN IT/API, [email protected]



Clhep
CLHEP

  • HEP foundation class library

    • Random number generators

    • Physics vectors

      • 3- and 4- vectors

    • Geometry

    • Linear algebra

    • System of units

    • more packages recently added

      • will continue to evolve

  • wwwinfo.cern.ch/asd/lhc++/clhep/

CERN IT/API, [email protected]


2d graphics libraries
2D Graphics libraries

  • Qt

    • multi-platform C++ GUI toolkit

      • C++ class library, not wrapper around C libs

      • superset of Motif and MFC

      • available on Unix and MS Windows

      • no change for developer

    • commercial but with public domain version

    • www.troll.no

  • Qplotter

    • “add-on” functionality for HEP

      • “HIGZ/HPLOT”

CERN IT/API, [email protected]


Basic 3d graphic libraries
Basic 3D Graphic Libraries

  • OpenGL(basic graphics)

    • De-facto industry standard for basic 3D graphics

    • Used in CAD/CAE, games, VR, medical imaging

  • OpenInventor(scene mgmt.)

    • OO 3D toolkit for graphics

    • Cubes, polygons, text, materials

    • Cameras, lights, picking

    • 3D viewers/editors,animation

    • Based on OpenGL/MesaGL

CERN IT/API, [email protected]


Mathematical libraries
Mathematical Libraries

  • NAG (Numerical Algorithms Group) C Library

    • Covers a broad range of functionality

      • Linear algebra

      • differential equations

      • quadrature, etc.

    • Special functions of CERNLIB added to Mark-6 release

      • mostly for theory and accelerator

      • Quality assurance

      • extensive testing done by NAG

    • www.nag.com

CERN IT/API, [email protected]


Histograms the htl package
Histograms: the HTL package

  • Histograms are the basic tool for physics analysis

    • Statistical information of density distributions

  • Histogram Template Library (HTL)

    • design based on C++ templates

    • Modular : separation between sampling and display

    • Extensible : open for user defined binning systems

    • Flexible: support transient/persistent at the same time

    • Open: large use of abstract interfaces

    • recent addition: 3D histograms

CERN IT/API, [email protected]


Fitting and minimization
Fitting and Minimization

  • Fittingand Minimization Library(FML)

    • common OO interface

      • NAG-C, MINUIT

    • based on Abstract Interfaces

      • IVector, IModelFunction, …

    • fitting as a special case of minimization

      • minimize “distance” between data and model

    • replacement for HepFitting (and Gemini)

  • Gemini

    • common minimization interface

    • very thin layer

CERN IT/API, [email protected]


Tags ntuples and events

Object Association

Tags, Ntuples and Events

  • NtupleTag Library

    • Ntuple navigation and analysis

    • common OO interface for different storage

      • ODBMS

      • HBook (CERNLIB)

  • Exploiting Tag concept

    • enhanced Ntuples

    • associated with an underlying persistent store

    • optional association to the Event may be used to navigate to any other part of the Event

      • even from an interactive visualization program

    • main use: speedup data selection for analysis…

      • Tag data is typically better clustered than the original data

CERN IT/API, [email protected]



Interactive data analysis1
Interactive Data Analysis

  • Aim: “OO replacement for PAW”

    • analysis of “ntuple-like data” (“Tags”, “Ntuples”, …)

    • visualisation of data (Histograms, scatter-plot, “Vectors”)

    • fitting of histograms (and other data)

    • access to experiment specific data/code

  • Maximize flexibility and re-use

    • plug-in structure

    • careful design with limited source and binary dependencies

  • Foresee customization/integration

    • allow use from within experiment’s s/w

    • framework!

CERN IT/API, [email protected]




Architectural issue scripting
Architectural issue: Scripting

  • Typical use of scripting is quite different from programming (reconstruction, analysis, ...)

    • history “go back to where I was before”

    • repetition/looping - with “modifiable parameters”

  • SWIG to (semi-) automatically create connection to chosen scripting language

    • allows flexibility to choose amongst several scripting languages

    • Python, Perl, Tcl, Guile, Ruby, (Java) …

  • Python - OO scripting, no “strange $!%-variables”

    • other scripting languages possible (through SWIG)

  • Can be enhanced and/or replaced by a GUI

    • scripting window within GUI application

CERN IT/API, [email protected]


Example script ntuple
Example script (ntuple)

# get list of names of all tuples from tuplemanager

ntm.listTuples()

nt1=ntm.findNtuple(“Charm1”) # retrieve tuple by name

# create 1D histos to project into

h1=hm.create1D(10, “mass” ,100, 0., 5000.)

h2=hm.create1D(20, “mass for pt1>10” ,100, 0., 5000.)

# project the attribute ”MASS" into histo h1 without cut ("")

nt1.project1D( h1, “” , “MASS”)

# project the attribute ”MASS" into histo h2 with cut (”PT1>10")

nt1.project1D( h2, “PT1>10” , “MASS”)

CERN IT/API, [email protected]



Lizard history and present status
Lizard: History and Present Status

  • Started after CHEP-2000

  • Full version out since June 2001

    • “PAW like” analysis functionality plus

    • on-demand loading of compiled code using shared libraries

      • gives full access to experiment’s analysis code and data

    • based on Abstract Interfaces

      • flexible and extensible

CERN IT/API, [email protected]


Possible future enhancements
Possible Future Enhancements

  • Access to otherimplementations of components

    • HBOOK histograms and ntuples (RWN) /coming soon/

    • OpenScientist, ROOT histograms?

  • Adding other “scripting” languages

    • Perl , Tcl, cint ?

  • Communication with Java tools/packages

    • via AIDA

      • JAS

      • WIRED

CERN IT/API, [email protected]


Architectural issue distributed computing
Architectural issue: Distributed Computing

  • Motivation

    • move code to data

    • parallel analysis

  • Techniques

    • services via AI

    • late binding

    • plug-in architecture

  • End-user (Lizard)

    • look-and-feel of local analysis

  • R&D started and first prototype available soon

    • CORBA

CERN IT/API, [email protected]


Summary
Summary

  • The architecture of Anaphe shows some important items for flexible and modular data analysis:

    • weak coupling between components through use of Abstract Interface

    • basic functionality is covered by C++ class libraries

  • Major criteria are flexibility, extensibility and interoperability

    • recent example: GEANT-4 space examples using G4Analysis component (based on AIDA)

  • Lizard is based on Anaphe components and the Python scripting language (through SWIG)

    • Lizard is young but has very solid base in mature Anaphe libraries

    • real plug-in structure

CERN IT/API, [email protected]


More information
More information

  • cern.ch/Anaphe

  • cern.ch/Anaphe/Lizard

  • aida.freehep.org/

  • cern.ch/DB

  • wwwinfo.cern.ch/asd/lhc++/clhep/

CERN IT/API, [email protected]



Opening bracket persistency

Opening bracket:

Persistency

CERN IT/API, [email protected]


Ntuple versus tagdb model

Event Data Files

Ntuple File

Ad hoc

extraction prg.

Federated DB of Event & Tag

Object Association

Ntuple versus TagDB Model

CERN IT/API, [email protected]


Object persistency two concepts serial and page i o
Object persistencyTwo concepts: serial and page I/O

  • “Sequential access to objects” (streaming)

    • good in networking context or serial writes to files

    • much like “good old Fortran”

    • often perceived to be “simpler” to implement (“<<“, “>>”)

  • “Navigational access to objects” (buffered)

    • I/O on demand for complex data models

    • optimized for (random) disk access (disks deliver pages)

    • sequential write to file still ok

  • Both concepts need to take care about changes of the internal structure of the objects (schema evolution)

CERN IT/API, [email protected]


Architectural issue persistency object i o
Architectural Issue:Persistency (“Object-I/O”)

  • Brings a completely new quality into the design

  • Objects have now lifetime

    • don’t “delete” until you really are sure you want to

    • persistency is kind of “intended memory leak”

  • Objects may change during their (extended) life

    • “schema evolution”

    • additions/deletions of attributes

    • changes of inheritance relations

CERN IT/API, [email protected]


Architectural issue persistency object i o ii
Architectural Issue:Persistency (“Object-I/O”) (II)

  • Objects can be placed (“clustering”)

    • de-coupling of logical and physical view of data

  • Special care needed to ensure consistency in data set

    • avoid reading group of objects (tracks, events,...) for which writing/updating is not (yet) complete

    • clean up if only part of the objects are written

    • typically taken care of by using transactions

  • Complications possible in distributed computing

    • need to protect disk access now like memory access in past (“Segmentation violation”)

CERN IT/API, [email protected]


Physical model and logical model
Physical Model and Logical Model

  • Physical model may be changed to optimise performance

  • Existing applications continue to work

CERN IT/API, [email protected]


Concurrent access
Concurrent Access

  • Data changes are part of a Transaction

    • ACID: Atomicity, Consistency, Isolation, Durability

    • Guarantees consistency of data

  • Support for multiple concurrent writers

    • e.g. Multiple parallel data streams

    • e.g. Filter or reconstruction farms

    • e.g. Distributed simulation

  • Access is co-ordinated by a lock server

    • MROW: Multiple Reader, One Writer per container (Objectivity/DB)

CERN IT/API, [email protected]


Plain files vs databases
Plain files vs. Databases

  • HEP is using DBs since LEP

    • e.g., FATMEN, HepDB

  • Mainly for “Meta data”

    • event -> file -> tape mappings

    • calibration data (conditions)

  • Why not use a single system for “Meta data” and “data” ?

    • Overhead for DB administration is there anyway

    • accessing “Meta data” from “data” is significantly easier

      • simple navigation from event -> calibrationData

    • transaction safety also for event/reconstructed data

CERN IT/API, [email protected]


Persistency objectivity db
Persistency: Objectivity/DB

  • ODMG compliant database

    • Object Data Management Group defined a standard

  • Language binding for C++, Java, Smalltalk

    • ODBMS allow to use persistent objects directly as variables of the OO language

  • Storage entity is a complete object

    • State of all data members & Object class

    • Guarantees consistent view of data (DB feature)

  • C++ Language Support

    • Abstraction, Inheritance, Polymorphism

    • Parameterised Types (Templates)

  • Location transparent access to objects

CERN IT/API, [email protected]


Closing bracket persistency

Closing bracket:

Persistency

CERN IT/API, [email protected]


ad