
Perspective on Future Data Analysis in HENP

This talk discusses the evolution of data analysis in high energy physics computing, exploring new techniques for interacting with data and the challenges of experiment-independent analysis.



Presentation Transcript


  1. Perspective on Future Data Analysis in HENP. Computing in High Energy Physics 2003, La Jolla, 24 March. René Brun, CERN.

  2. Data Analysis ?? • Data Analysis has traditionally been associated with the final stages of data processing, i.e. Physics Analysis. • In this talk, I will cover a more general aspect of Data Analysis (in the true sense). • How do we interact with data at all stages of data processing (batch or interactive modes)? • Can we imagine an experiment-independent way to achieve this?

  3. Evolution • To understand the possible directions, we must understand some messages from the past, the solid recipes! • One important message is “Make it simple”. • Heavy experiment frameworks are often perceived as a serious obstacle and push users toward more basic but universal frameworks.

  4. Once upon a time (seventies) • With the first electronic (as opposed to bubble chamber) experiments, data analysis was experiment-specific, an activity after the data taking. • The only common software was the histogramming package (e.g. Hbook), the fitting package (e.g. Minuit), some plotting packages and independent routines in cernlib (linear algebra and small utilities). • Data structures = Fortran common blocks

  5. Early Eighties • With the growing complexity of the experiments and of the corresponding software, we see the development of data-structure management systems (hydra, zbook-->zebra, bos). • These systems are able to write/read complex bank collections. Zebra had a self-describing bank format with built-in support for bank evolution. • Most data is processed in batch, but many prototypes of interactive systems start to appear (htv, gep, then paw...)

  6. PAW • Designed in 1985. Stable since 1993. • Row-Wise Ntuples: OK for small data sets, interactive histogramming with cuts. • Column-Wise Ntuples: a major step illustrating the advantage of structured data sets. • PAW: a success • not so much because of its technical merits • but because it is perceived as a widely available tool • stable for many years: an important element

  7. 1993-->2000 (1) • Move from Fortran to OO • Took far more time than expected • new language(s) • new programming techniques • basic infrastructure not available to compete with existing libraries and tools • conflicts between projects • ad-hoc software in experiments

  8. 1993-->2000 (2) • False hopes with OODBMS (or too early?) • OODBMS --> Objectivity • OO models designed for Objy • batch oriented • Interactive use via conversion to PAW ntuples • a central database does not fit well with GRID concepts • Licensing problems and more

  9. Data Analysis Models

  10. From the desktop to the GRID • New data analysis tools must be able to use remote CPUs, storage elements and networks in parallel, in a way that is transparent for a user at a desktop. [Diagram: Desktop <--> Online/Offline Farms, Local/remote Storage, GRID]

  11. My laptop in 200X • Using a naïve extrapolation of Moore’s law for a state-of-the-art laptop:

  Year   CPU/GHz   RAM/GB   Disk/GB
  2003       2.4      0.5        60
  2005         5        1       150
  2007        10        2       300
  2009        20        4       600
  2011        40        8      1000

  Nice! But less than 1/1000 of what I need.

  12. Batch-mode Local Analysis • Conventional model: the user has full control of the event loop. • The program produces histograms, ntuples or trees. • Selection is done via private user code. • Histograms are then added (with a tool or in the interactive session). • Ntuples/trees are combined into a chain and analyzed interactively.
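The conventional batch model can be sketched as a simple event loop: the user drives the loop, applies a private selection, and fills a histogram. All names here (the event source, the cut, the fill function) are hypothetical stand-ins for illustration, not any experiment's actual framework.

```python
# Minimal sketch of the conventional batch analysis model:
# the user controls the event loop, selects events with private
# code, and fills a histogram. All names are illustrative.

def read_events():
    """Stand-in for the experiment's event source."""
    return [{"pt": 1.2}, {"pt": 7.5}, {"pt": 3.1}, {"pt": 9.8}]

def select(event):
    """Private user selection code (here: a simple pt cut)."""
    return event["pt"] > 3.0

def fill(histogram, value, lo=0.0, hi=10.0, nbins=5):
    """Fill a fixed-binning histogram represented as a list of counts."""
    if lo <= value < hi:
        histogram[int((value - lo) / (hi - lo) * nbins)] += 1

hist = [0] * 5

for event in read_events():      # the user controls the event loop
    if select(event):            # selection via private user code
        fill(hist, event["pt"])  # output: histograms (or ntuples/trees)

print(hist)
```

The histograms produced by many such jobs are then added together, which is what makes this model extrapolate naturally to batch farms and the GRID.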

  13. Batch Analysis on the GRID • From a user viewpoint, a simple extrapolation of the local batch analysis. • In practice, it must involve all the GRID machinery: authentication, resource brokers, sandboxes. • Viewing the current status (histograms) must be possible. • Advantage: stateless, can process large data volumes. • Advanced systems already exist (see talk by Andreas Wagner).

  14. AliEnFS & Distributed Analysis [Diagram: a ROOT session (version 3.03/09, CINT interpreter) submits an analysis with newanalysis->Submit(); in user space, AliEnFS mounts the AliEn file catalogue (/alien/alice, /alien/atlas; prod/, data/, mc/ directories) into the Linux VFS via LUFS; the AliEn API queries for input data; computing elements and mass storage systems are reached through soap://, root://, castor:// and https://; the results come back as merged trees and histograms.]

  15. Interactive Local Analysis • On a public cluster, or the user’s laptop. • Tools like PAW or a successor are used for visualization and ntuples/trees analysis.

  16. GRID: Interactive Analysis, Case 1 • Data transfer to the user’s laptop • Optional Run/File catalog • Optional GRID software • Analysis scripts are interpreted or compiled on the local machine; trees are read from a remote file server, e.g. rootd.

  17. GRID: Interactive Analysis, Case 2 • Remote data processing • Optional Run/File catalog • Optional GRID software • Analysis scripts are interpreted or compiled on the remote machine; commands and scripts are sent to a remote data analyzer, e.g. proofd, which returns trees and histograms.

  18. GRID: Interactive Analysis, Case 3 • Remote data processing • Run/File catalog • Full GRID software • Analysis scripts are interpreted or compiled on the remote master(s); commands and scripts are distributed to slaves (a remote data analyzer, e.g. proofd), each slave processing its own trees; histograms and trees are merged and returned to the user.

  19. Data Analysis Projects

  20. Tools for data analysis • PAW: started in 1985, no major developments since 1994. • HippoDraw: started in 1991. • ROOT: started in 1995, continuous developments. • JAS: started in 1995, continuous developments. • Open Scientist: ? • LHC++/Anaphe: 1996-->2002. • PI: a new project in the LHC Computing Grid, just starting now.

  21. PAW • The reference for 18 years (since 1985), used by most collaborations • ported to many platforms, small (3 to 15 MB) • many criticisms during the development phase • applauded now that it is stable • maintained by Olivier Couet (ROOT team) • Usage still growing. 0.1 FTE

  22. HippoDraw • Author: Paul Kunz • showed the way in 1991/1992 • Usage: Paul + “a 50-year-old CERN physicist” • Seems to be in constant prototyping phases • Good to have this type of prototype to illustrate possible new interactive techniques. 1 FTE?

  23. ROOT • In constant development since 1995 • Used by many collaborations, and outside HEP • More than 10000 distributions of binary tar files in February. 6+2+... FTE

  24. JAS • Started in 1995 (Tony Johnson). • Current version 2; JAS3 presented at this CHEP. • For the Java world. • How to cooperate with C++ frameworks? 3 FTE?

  25. In AIDA you believe? • The Abstract Interfaces for Data Analysis project was started by the defunct LHC++ and continued by Anaphe (now stopped). • Supported by JAS and Open Scientist. • Goal: define abstract interfaces to facilitate cooperation between developers and ease the migration of users to new products. • Versions 1, 2 and 3 (version 4 for PI?)

  26. In AIDA I don’t believe • Abstract Interfaces are fundamental in modern systems to make a system more modular and adaptable. • But common abstract interfaces are not a good idea: • They force a lowest common denominator • They require international agreements • Users will be confused (what is common and what is not) • You become the slave of a deal: this works against creativity • It is more important to agree on object interchange formats and database access: • You can easily change a few hundred lines of code. You cannot copy Terabytes of data.

  27. The LCG PI project • Fresh from the oven • One of the projects recently launched by the Applications Area of the LCG project. • Ideas: • promote the use of AIDA (version 4) • Python for scripting • interface to ROOT & CINT • in gestation • see Vincenzo

  28. User & Developer views • Users’ requests: • very rarely requests for grandiose new features • zillions of tiny new features • zillions of tiny improvements • want consolidation & stability • Developers’ view: • want to implement the sexy features • target modularity (more complex installation?) • maintenance & helpdesk: a problem or a chance?

  29. Lessons from the past • It takes time to develop a general tool • more than 7 years for PAW, ROOT and JAS • User feedback is essential in the development phase • People like stable systems • Efficient access to data sets is a prerequisite • 24h x 7 days x 12 months x N years of online support is vital

  30. Develop/Debug/Maintain • In an interactive system with N basic functions, the number of combinations may seem unlimited (not NxN, but N!). • 10% of the time to develop the first 90% of the code; 90% of the time to develop the remaining 10%.
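The combinatorial claim above is easy to quantify: even for a small system, the number of orderings in which N functions can be combined (N!) dwarfs the number of pairwise combinations (N x N). A quick check for N = 10:

```python
import math

N = 10  # a small interactive system with 10 basic functions

pairs = N * N                   # pairwise combinations to test
orderings = math.factorial(N)   # possible usage sequences

print(pairs, orderings)  # 100 vs 3628800
```

This is why testing an interactive system exhaustively is hopeless, and why long-term user feedback and online support (next slide) matter so much.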

  31. Time to develop LCG

  32. Technical aspects

  33. Desktop • Plug-in Manager and Dictionary • GUI • Graphics 2-D, 3-D • Event Displays • Histogramming & Fitting • Statistics tools • Scripting • Data/Program organization

  34. Plug-in Manager [Diagram: experiment and user shared libraries plug, via the plug-in manager, into a general utility shared library providing basic services (GUI, Math, ...), I/O managers, the interpreter and the object dictionary.]

  35. The Object Dictionary [Diagram: the object dictionary (data dictionary + functions dictionary) is the hub serving inspectors, browsers, I/O, interpreted scripts, the GUI, the command line and compiled code.]
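The role of such a dictionary can be illustrated with a toy class registry: classes register themselves by name, and generic services (browsers, I/O, a command line) can then create and inspect objects without compile-time knowledge of them. This is an illustrative sketch, not the CINT/ROOT dictionary; the `Track` class is a hypothetical example.

```python
# Toy object dictionary: a registry mapping class names to classes,
# so generic services (I/O, browsers, interpreters) can create and
# inspect objects they were never compiled against.

dictionary = {}

def register(cls):
    """Enter a class into the dictionary under its name."""
    dictionary[cls.__name__] = cls
    return cls

@register
class Track:
    """A hypothetical experiment class."""
    def __init__(self, pt=0.0, eta=0.0):
        self.pt = pt
        self.eta = eta

def inspect(class_name):
    """Generic browser: instantiate by name and list data members."""
    obj = dictionary[class_name]()
    return sorted(vars(obj).keys())

print(inspect("Track"))  # the browser needs only the string "Track"
```

The same lookup-by-name mechanism is what lets I/O reconstruct objects from a file and lets an interpreter call into compiled code.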

  36. Scripting for data analysis • After the KUIP and Tk/Tcl era • A command line interface is required • Scripts • interpreted and/or byte-code interpreted • automatic compilation and linking • call compiled or interpreted code • compiled code must be able to call interpreted code (GUI and configuration scripts) • Big bonus if the compiled and interpreted languages are the same • Scripting and object dictionary symbiosis • Remote execution of scripts (in parallel)

  37. Languages & scripting [Diagram: the batch user works with compiled C++ code; the interactive user works with interpreted C++ scripts, Python/Perl scripts and a GUI with signals/slots, all interoperating with the compiled code.]

  38. Comparing scripts • A very interesting project from Subir Sarkar: cooperation between Java and a C++ framework, based on the Object Dictionary. http://sarkar.home.cern.ch/sarkar/jroot/main.html

  39. GUI(s) • Constant evolution • + Microsoft MFC, Win32 API • Signals/Slots principle: very nice. It helps in designing large and modular GUI systems • Interpreters help GUI builders/editors • Timeline: 1983 SMS on Vax/VMS (VT100); 1985 GKS (Tektronix); 1989 MOTIF (Unix workstations); 1997 Java/Swing (the Web); 2001 Qt (Linux/laptops)

  40. 2-D graphics • An area where constant improvements are required. • Better plotters, better fonts, ... • Better drivers: PostScript, SVG, XML, etc. • Publication quality is a must. This requirement alone explains why many proposed data analysis systems do not penetrate experiments.

  41. 3-D graphics • Data structures: objects <--> scene • Scene renderers: OpenGL, Open Inventor • The most difficult part is detector geometry graphics • z-buffer algorithms are OK for fast, fancy real-time graphics, but not OK for good debugging (shape outlines are important on top of z-buffer views). • Vector PostScript (or PDF/SVG) must be available (not PostScript from OpenGL triangles) • see talks about GraXML and Persint

  42. Example with PERSINT/ATLAS

  43. Event Displays • The most successful event displays so far were 2-D projections (see Aleph, Atlas/Atlantis) • A lot of work with 3-D graphics in many experiments (see talks about Iguana) • Client-server model • Access to framework objects, browsers • One could have expected a bigger role for Java! • A mismatch with experiment C++ frameworks? • Possible directions: • standardize object exchange (SOAP/XML/Root I/O) • standardize low-level graphics exchange (HEPREP)

  44. Histogramming • This should be a stable area • Thread safety • Binning on parallel systems • Merging on batch/parallel systems
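The merging requirement above reduces to adding bin contents of identically binned histograms, one partial histogram per batch job or parallel worker. A minimal sketch, not the merge API of any actual package:

```python
# Minimal sketch of histogram merging for batch/parallel jobs:
# each worker produces a partial histogram with identical binning,
# and the merge is a bin-by-bin sum.

def merge(histograms):
    """Bin-by-bin sum of identically binned histograms."""
    nbins = len(histograms[0])
    assert all(len(h) == nbins for h in histograms), "binning must match"
    return [sum(h[i] for h in histograms) for i in range(nbins)]

# Partial results from three hypothetical workers:
worker_results = [
    [4, 0, 2, 1],
    [1, 3, 0, 2],
    [0, 1, 1, 1],
]
merged = merge(worker_results)
print(merged)
```

Because the sum is associative, partial merges can happen in any order, which is what makes intermediate status viewing and tree-structured merging on a farm straightforward.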

  45. Fitting • Minuit: the standard • Fumili: was nice and fast • An upgrade of Minuit with new algorithms, including Fumili, is in the pipeline • several GUIs on top • a very powerful package developed by BaBar • see the talk on RooFit by D. Kirkby

  46. Statistics & Math • Many tools and algorithms exist • GSL? • Gnu R-Math project • TerraFerma Initiative • Subject of discussions at many workshops • confidence limits workshops • ACAT FermiLab and Moscow • Durham • These need to be federated in a coherent framework

  47. Lost with Complexity? • In large collaborations, users are often lost when confronted with the complexity of big simulation and reconstruction programs: • What is the data organization? • How are algorithms organized? What is the hierarchy? • The problem is amplified by the use of dynamically configurable systems, dynamic linking and polymorphism • Browsing data and algorithms is a must

  48. Folders / white boards • Folders help in understanding complex hierarchical structures • Language independent • Could be GRID-aware

  49. Why Folders? • Consider a system without folders. The objects hold pointers to each other to access each other’s data. Pointers are an efficient way to share data between classes, but a direct pointer creates a direct coupling between classes. In a system with a large number of classes, this design can become a very tangled web of dependencies.

  50. Why Folders? • With folders, a reference to the data is kept in the folder, and the consumers refer to the folder rather than to each other to access the data. A naming and search service provides an alternative: it loosely couples the classes and greatly enhances I/O operations. In this way, folders separate the data from the algorithms and greatly improve the modularity of an application by minimizing class dependencies.
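The folder idea above can be sketched as a naming service: producers post objects under a hierarchical path, consumers look them up by path, and neither side holds a pointer to the other. The paths and the producer/consumer roles here are hypothetical examples, not the API of any real framework.

```python
# Toy folder / white board: producers and consumers are coupled only
# through hierarchical names, never through direct pointers.

folders = {}

def post(path, obj):
    """Producer side: publish an object under a folder path."""
    folders[path] = obj

def find(path):
    """Consumer side: look an object up by name, not by pointer."""
    return folders.get(path)

# A producer (e.g. the tracking code) posts its output:
post("/Event/Tracks", [{"pt": 7.5}, {"pt": 3.1}])

# A consumer (e.g. a histogramming service) fetches it by name only,
# with no compile-time dependency on the producer:
tracks = find("/Event/Tracks")
print(len(tracks))
```

Because the coupling is just a string path, the same lookup could be served locally, from a file, or across the GRID, which is why the slide notes that folders could be GRID-aware.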
