
TMVA: A Toolkit for (Parallel) MultiVariate Data Analysis



  1. TMVA: A Toolkit for (Parallel) MultiVariate Data Analysis
  Andreas Höcker (ATLAS), Helge Voss (LHCb), Kai Voss (ex. ATLAS), Joerg Stelzer (ATLAS)
  • apply a (large) variety of sophisticated data selection algorithms to your data sample
  • have them all trained and tested
  • choose the best one for your selection problem and “go”
  http://tmva.sourceforge.net/
  Particle and Astrophysics Seminar: TMVA Toolkit for MultiVariate Analysis

  2. Outline
  • Introduction: MVAs, what / where / why
  • the MVA methods available in TMVA
  • demonstration with toy examples
  • (a “real life” example)
  • Summary / Outlook

  3. Introduction to MVA
  • At the beginning of each physics analysis:
    • select your event sample, discriminating against background
    • event tagging
    • …
  • Or even earlier:
    • e.g. particle identification, pattern recognition (ring finding) in a RICH
    • trigger applications
  → one always uses several variables in some sort of combination
  • MVA, MultiVariate Analysis: a nice name that means nothing more than this: use several observables from your events to form ONE combined variable, and use it to discriminate between “signal” and “background”

  4. Introduction to MVAs
  (plot: “expected” signal and background distributions for a variable Et; Et(event 2) is signal-like, Et(event 1) is background-like)
  • sequence of cuts vs. multivariate methods
  • a sequence of cuts is easy to understand and offers easy interpretation
  • sequences of cuts are often inefficient: events may be rejected because of just ONE variable, while the others look very “signal-like”
  • MVA: several observables → one selection criterion
  • e.g. likelihood selection:
    • for each observable in an event, calculate the probability that its value belongs to a “signal” or a “background” event, using reference distributions (PDFs) for signal and background
    • then cut on the combined likelihood of all these probabilities

  5. MVA Experience
  • MVAs have certainly made their way through HEP, e.g. the LEP Higgs analyses, but simple “cuts” are probably still the most accepted and used approach
  • widely understood and accepted: likelihood (probability density estimators, PDE)
  • often disliked, but more and more used (LEP, BABAR): artificial neural networks
  • much used in BABAR and Belle: Fisher discriminants
  • used by MiniBooNE and recently by BABAR: boosted decision trees
  • often people use custom tools; MVAs are tedious to implement → few true comparisons between methods!
  • all interesting methods, but often met with skepticism:
    • black boxes: how to interpret the selection?
    • what if the training samples incorrectly describe the data?
    • how can one evaluate systematics?
  • would like to try, but which MVA should I use, and where to find the time to code it? → just look at TMVA!

  6. What is TMVA
  • Toolkit for Multivariate Analysis (TMVA): C++, ROOT
  • “parallel” processing of various MVA techniques to discriminate signal from background samples → easy comparison of different MVA techniques; choose the best one
  • TMVA presently includes:
    • rectangular cut optimisation
    • likelihood estimator (PDE approach)
    • multi-dimensional likelihood estimator (PDE range-search approach)
    • Fisher discriminant
    • H-Matrix approach (χ² estimator)
    • artificial neural networks (3 different implementations)
    • boosted decision trees
    • upcoming: RuleFit, support vector machines, “committee” methods
  • common pre-processing of input data: de-correlation, principal component analysis
  • the TMVA package provides training, testing and evaluation of the MVAs
  • each MVA method provides a ranking of the input variables
  • MVAs produce weight files that are read by a Reader class for MVA application

  7. TMVA Methods

  8. Cut Optimisation
  • simplest method: cut in a rectangular volume using the Nvar input variables
  • usually the training files in TMVA do not contain realistic signal and background abundances → one cannot optimise for best significance S/√(S+B)
  • instead: scan the signal efficiency in [0, 1] and maximise the background rejection at each point
  • technical problem: how to perform the optimisation
    • random sampling: robust (if not too many observables are used) but suboptimal
    • new techniques: genetic algorithms and simulated annealing
  • huge speed improvement by sorting the training events in a binary search tree (for 4 variables we gained a factor 41)
  • do this in normal variable space or in de-correlated variable space
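The random-sampling approach above can be sketched in a few lines. This is a toy illustration in Python, not the TMVA C++ code: the Gaussian samples, the cut range and the 0.1-wide efficiency bins are all invented for the example.

```python
import random

# Toy sketch: optimise a rectangular cut x > c on one observable by
# random sampling, recording the best background rejection found in
# each scanned signal-efficiency bin.
random.seed(1)
signal = [random.gauss(1.0, 1.0) for _ in range(2000)]
background = [random.gauss(-1.0, 1.0) for _ in range(2000)]

def efficiency(sample, cut):
    """Fraction of events passing the cut x > cut."""
    return sum(1 for x in sample if x > cut) / len(sample)

best = {}  # signal-efficiency bin -> best background rejection found
for _ in range(5000):
    cut = random.uniform(-4, 4)
    eff_s = efficiency(signal, cut)
    rej_b = 1.0 - efficiency(background, cut)
    bin_ = round(eff_s, 1)  # scan signal efficiency in steps of 0.1
    if rej_b > best.get(bin_, 0.0):
        best[bin_] = rej_b

for eff in sorted(best):
    print(f"eff_S = {eff:.1f}  ->  best rej_B = {best[eff]:.3f}")
```

A genetic algorithm or simulated annealing replaces the blind uniform sampling with a guided search, which matters once more than a few observables are cut on.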

  9. Preprocessing the Input Variables: Decorrelation
  • commonly realised for all methods in TMVA (centrally in the DataSet class)
  • removal of linear correlations by rotating the variable space of the PDEs
  • determine the square root C′ of the covariance matrix C, i.e. C = C′C′
  • compute C′ by diagonalising C: C′ = S √D Sᵀ, with D the diagonal matrix of eigenvalues and S the matrix of eigenvectors
  • transformation from the original variables x into the de-correlated variables x′ by: x′ = C′⁻¹ x
  • various ways to choose the diagonalisation (also implemented: principal component analysis)
  • note that this “de-correlation” is only complete if:
    • the input variables are Gaussian
    • the correlations are linear only
  • in practice the gain from de-correlation is often rather modest, or even harmful
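The square-root decorrelation can be illustrated for two variables, where the symmetric covariance matrix can be diagonalised analytically. This is a Python sketch under that assumption; the toy sample and its correlation are made up, and TMVA does this centrally for any number of variables.

```python
import math, random

# Sketch of the de-correlation preprocessing for 2 variables:
# transform x' = C^(-1/2) x, with C the covariance matrix, obtained
# by diagonalising the symmetric 2x2 matrix analytically.
random.seed(2)
n = 20000
xs, ys = [], []
for _ in range(n):
    u = random.gauss(0, 1)
    v = random.gauss(0, 1)
    xs.append(u)
    ys.append(0.8 * u + 0.6 * v)   # linearly correlated with x

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

cxx, cyy, cxy = cov(xs, xs), cov(ys, ys), cov(xs, ys)

# Eigen-decomposition of the symmetric 2x2 covariance matrix.
theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
c, s = math.cos(theta), math.sin(theta)
l1 = cxx * c * c + 2 * cxy * c * s + cyy * s * s
l2 = cxx * s * s - 2 * cxy * c * s + cyy * c * c

# C^(-1/2) = S diag(1/sqrt(lambda)) S^T
d1, d2 = 1 / math.sqrt(l1), 1 / math.sqrt(l2)
m00 = c * c * d1 + s * s * d2
m01 = c * s * (d1 - d2)
m11 = s * s * d1 + c * c * d2

xd = [m00 * x + m01 * y for x, y in zip(xs, ys)]
yd = [m01 * x + m11 * y for x, y in zip(xs, ys)]

rho = cov(xd, yd) / math.sqrt(cov(xd, xd) * cov(yd, yd))
print(f"correlation after de-correlation: {rho:.4f}")
```

Since C is estimated from the same sample that is transformed, the residual linear correlation vanishes up to floating-point precision; for non-Gaussian inputs the non-linear dependence would of course survive.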

  10. Projected Likelihood Estimator (PDE Approach)
  • combine the probability density distributions (PDFs) of the discriminating variables into a likelihood ratio for event i:
    y(i) = L_S(i) / (L_S(i) + L_B(i)), where L_S and L_B are the products of the signal and background PDF values over all variables (species: signal, background)
  • assumes uncorrelated input variables
    • in that case it is the optimal MVA approach, since it contains all the information
    • usually this is not true → development of different methods
  • technical problem: how to implement the reference PDFs; 3 ways:
    • counting: automatic and unbiased, but suboptimal
    • function fitting: difficult to automate
    • parametric fitting (splines, kernel estimators): easy to automate, but can create artefacts; TMVA uses splines of order 0–5
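A minimal sketch of the likelihood ratio above, assuming Gaussian reference PDFs instead of the spline fits TMVA performs; the means and widths below are invented toy values.

```python
import math

# Sketch of a projected likelihood (PDE) classifier: per-variable
# reference PDFs are combined into y = L_S / (L_S + L_B), assuming
# uncorrelated input variables.

def gauss_pdf(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Per-variable reference PDFs: ((signal mean, sigma), (background mean, sigma))
pdfs = [((1.0, 1.0), (-1.0, 1.0)),   # variable 1
        ((0.5, 0.8), (-0.5, 0.8))]   # variable 2

def likelihood_ratio(event):
    """y = L_S / (L_S + L_B) for one event."""
    l_s = l_b = 1.0
    for x, ((ms, ss), (mb, sb)) in zip(event, pdfs):
        l_s *= gauss_pdf(x, ms, ss)
        l_b *= gauss_pdf(x, mb, sb)
    return l_s / (l_s + l_b)

y_sig = likelihood_ratio([1.2, 0.6])    # signal-like event
y_bkg = likelihood_ratio([-1.1, -0.4])  # background-like event
print(f"signal-like: {y_sig:.3f}, background-like: {y_bkg:.3f}")
```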

  11. Multidimensional Likelihood Estimator
  • generalisation of the 1D PDE approach to Nvar dimensions
  • optimal method, in theory, since the full information is used
  • practical challenges:
    • parameterisation of the multi-dimensional phase space needs huge training samples
    • implementation of the Nvar-dim. reference PDF with kernel estimates or counting
  • TMVA implementation follows the range-search method (Carli-Koblitz, NIM A501, 576 (2003)):
    • count the number of signal and background events in the “vicinity” of a data event
    • the “vicinity” is defined by a fixed or adaptive Nvar-dim. volume size; adaptive: rescale the volume size to achieve a constant number of reference events
    • volumes can be rectangular or spherical
    • use multi-D kernels (Gaussian, triangular, …) to weight events within a volume
    • speed up the range search by sorting the training events in binary trees
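The range-search counting can be sketched as follows, here with a fixed spherical volume, no kernel weighting, and without the binary-tree speed-up; the sample shapes and the radius are illustrative assumptions.

```python
import random

# Toy sketch of the range-search PDE: classify an event by counting
# signal and background training events inside a fixed spherical
# volume around it.
random.seed(3)
sig_train = [(random.gauss(1, 1), random.gauss(1, 1)) for _ in range(1000)]
bkg_train = [(random.gauss(-1, 1), random.gauss(-1, 1)) for _ in range(1000)]

def pde_rs(event, radius=0.8):
    """Signal probability from event counts in the vicinity of 'event'."""
    def count(sample):
        return sum(1 for p in sample
                   if sum((a - b) ** 2 for a, b in zip(p, event)) < radius ** 2)
    n_s, n_b = count(sig_train), count(bkg_train)
    return n_s / (n_s + n_b) if n_s + n_b else 0.5

print(f"P(signal | (+1,+1)) = {pde_rs((1.0, 1.0)):.2f}")
print(f"P(signal | (-1,-1)) = {pde_rs((-1.0, -1.0)):.2f}")
```

The adaptive variant would grow or shrink `radius` per event until a fixed number of reference events is enclosed, which keeps the statistical precision of the estimate roughly constant across phase space.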

  12. Fisher Discriminant (and H-Matrix)
  • well-known, simple and elegant MVA method: event selection is performed in a transformed variable space with zero linear correlations, by distinguishing the mean values of the signal and background distributions
  • instead of equations, words: an axis is determined in the (correlated) hyperspace of the input variables such that, when the output classes (signal and background) are projected onto this axis, they are pushed as far apart as possible, while events of the same class are confined to a close vicinity. The linearity of the method is reflected in the metric with which “far apart” and “close vicinity” are measured: the covariance matrix of the discriminant variable space.
  • optimal for linearly correlated Gaussians with equal RMS and different means
  • no separation if the means are equal and the RMS (shapes) differ
  • computation of the Fisher MVA couldn’t be simpler: the “Fisher coefficients” follow directly from the class means and the covariance matrix
  • H-Matrix estimator (correlated χ²): a poor man’s variation of the Fisher discriminant
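The Fisher coefficients can be computed by hand for two variables. This is a sketch assuming the standard textbook construction w = W⁻¹(μ_S − μ_B), with W the pooled within-class covariance matrix; the toy sample is invented, and TMVA's own normalisation conventions are not reproduced.

```python
import random

# Sketch of a 2-variable Fisher discriminant: project events onto the
# axis w = W^(-1) (mu_S - mu_B), with a hand-rolled 2x2 inversion.
random.seed(4)

def make_sample(mx, my):
    pts = []
    for _ in range(2000):
        u, v = random.gauss(0, 1), random.gauss(0, 1)
        pts.append((u + mx, 0.5 * u + v + my))  # linearly correlated variables
    return pts

sig, bkg = make_sample(1.0, 1.0), make_sample(-1.0, -1.0)

def mean(pts, i):
    return sum(p[i] for p in pts) / len(pts)

def within_cov(samples):
    """Pooled within-class 2x2 covariance matrix."""
    w = [[0.0, 0.0], [0.0, 0.0]]
    n = 0
    for pts in samples:
        m = (mean(pts, 0), mean(pts, 1))
        for p in pts:
            for i in range(2):
                for j in range(2):
                    w[i][j] += (p[i] - m[i]) * (p[j] - m[j])
        n += len(pts)
    return [[w[i][j] / n for j in range(2)] for i in range(2)]

W = within_cov([sig, bkg])
dm = (mean(sig, 0) - mean(bkg, 0), mean(sig, 1) - mean(bkg, 1))
det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
# Fisher coefficients: w = W^(-1) (mu_S - mu_B)
w0 = (W[1][1] * dm[0] - W[0][1] * dm[1]) / det
w1 = (-W[1][0] * dm[0] + W[0][0] * dm[1]) / det

def fisher(event):
    return w0 * event[0] + w1 * event[1]

mean_f_sig = sum(fisher(p) for p in sig) / len(sig)
mean_f_bkg = sum(fisher(p) for p in bkg) / len(bkg)
print(f"<F>_S = {mean_f_sig:.2f}, <F>_B = {mean_f_bkg:.2f}")
```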

  13. Artificial Neural Network (ANN)
  (diagram: feed-forward multilayer perceptron with 1 input layer holding the Nvar discriminating input variables, k hidden layers, and 1 output layer with 2 output classes, signal and background; each neuron applies an “activation” function to the weighted sum of its inputs)
  • ANNs are non-linear discriminants: Fisher = ANN without a hidden layer
  • they seem to be better adapted to realistic use cases than Fisher and likelihood
  • TMVA ANN implementations (all feed-forward multilayer perceptrons):
    • Clermont-Ferrand ANN: used for the ALEPH Higgs analysis; translated from FORTRAN
    • TMultiLayerPerceptron interface: the ANN implemented in ROOT
    • MLP: “our” own ANN; similar to the ROOT ANN, faster and more flexible, however with no automatic validation
  → a recent comparison with the Stuttgart NN in ATLAS tau-ID confirmed its excellent performance!
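The forward pass of such a feed-forward multilayer perceptron is simple to sketch. The weights below are arbitrary illustrative numbers, not a trained network, and the sigmoid stands in for whatever activation function a given implementation uses.

```python
import math

# Minimal sketch of a feed-forward MLP forward pass:
# 2 inputs -> 1 hidden layer of 2 neurons -> 1 output neuron.

def sigmoid(x):
    """'Activation' function."""
    return 1.0 / (1.0 + math.exp(-x))

# Each neuron: (bias, [weights from previous layer]) -- illustrative values.
hidden = [(0.1, [0.8, -0.4]), (-0.2, [0.3, 0.9])]
output = (0.0, [1.2, -1.1])

def forward(inputs):
    h = [sigmoid(b + sum(w * x for w, x in zip(ws, inputs)))
         for b, ws in hidden]
    b, ws = output
    return sigmoid(b + sum(w * x for w, x in zip(ws, h)))

y = forward([0.5, -0.3])
print(f"network output: {y:.3f}")
```

With the hidden layer removed, the output is a sigmoid of a single linear combination of the inputs, which is why a no-hidden-layer ANN is equivalent to a Fisher discriminant up to the monotone activation.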

  14. Decision Trees
  • decision tree: a sequential application of “cuts” which splits the data into nodes; the final nodes (leaves) classify an event as signal or background
  (diagram: root node with Nsignal, Nbkg split into nodes of increasing depth; a leaf with Nsignal > Nbkg is labelled signal, a leaf with Nsignal < Nbkg is labelled background)
  • training:
    • start with the root node
    • split the training sample at each node into two, using a cut in the variable that gives the best separation
    • continue splitting until the minimal number of events or the maximum purity is reached
    • leaf nodes classify events as signal or background according to the majority of training events, or give a S/B probability
  • bottom-up pruning: remove statistically dominated nodes
  • testing/application: a test event is “filled in” at the root node and classified according to the leaf node where it ends up after the cut sequence
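The node-splitting step of the training can be sketched as a scan for the cut with the best (lowest) weighted Gini index; a real tree then recurses on the two daughter nodes, which this toy omits. The data, the Gini criterion and the cut grid are assumptions for illustration.

```python
import random

# Sketch of one decision-tree node split: scan cut values on each
# variable, keep the one that best separates signal (label 1) from
# background (label 0) according to the Gini index.
random.seed(5)
events = ([((random.gauss(1, 1), random.gauss(0, 1)), 1) for _ in range(500)] +
          [((random.gauss(-1, 1), random.gauss(0, 1)), 0) for _ in range(500)])

def gini(group):
    """Gini index 2p(1-p) of a node; 0 means a pure node."""
    n = len(group)
    if n == 0:
        return 0.0
    p = sum(label for _, label in group) / n
    return 2 * p * (1 - p)

best = None
for var in range(2):
    for cut in [c / 10.0 for c in range(-30, 31)]:
        left = [e for e in events if e[0][var] <= cut]
        right = [e for e in events if e[0][var] > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(events)
        if best is None or score < best[0]:
            best = (score, var, cut)

score, var, cut = best
print(f"best split: var{var + 1} at {cut:.1f} (weighted Gini {score:.3f})")
```

Only the first toy variable carries separation here, so the scan should pick it and place the cut near the overlap of the two Gaussians.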

  15. Boosted Decision Trees
  • decision trees have been used for a long time in general “data-mining” applications, but are less known in HEP (although very similar to “simple cuts”)
  • advantages:
    • easy to interpret: independently of Nvar, a tree can always be visualised as a 2D tree diagram
    • independent of monotone variable transformations: rather immune against outliers
    • immune against the addition of weak variables
  • disadvantage:
    • instability: small changes in the training sample can give large changes in the tree structure
  • boosted decision trees (1996): combine several decision trees (a forest), derived from one training sample via the application of event weights, into ONE multivariate event classifier by performing a “majority vote”
    • e.g. AdaBoost: wrongly classified training events are given a larger weight
    • alternatives: bagging, random weights, re-sampling with replacement

  16. Boosted Decision Trees
  • boosting allows two different interpretations:
    • give events that are “difficult to categorise” more weight, to obtain better classifiers
    • see each tree as a “basis function” of a possible classifier: boosting or bagging is just a means to generate a set of basis functions; a linear combination of the basis functions gives the final classifier, or: the final classifier is an expansion in the basis functions
  • AdaBoost:
    • normally, each tree in the forest is weighted by (1 − f_err)/f_err, with f_err the error fraction of that tree
    • AdaBoost can then be understood as the expansion F(α, x) = Σᵢ αᵢ Tᵢ(x) over the trees Tᵢ, with the coefficients αᵢ minimised according to an “exponential loss criterion”: the expectation value of exp(−y F(α, x)), where y = −1 (background) and y = +1 (signal)
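The AdaBoost re-weighting loop can be sketched with decision stumps (single cuts) in place of full trees. The vote weight of each stump is taken here as α = ln((1 − f_err)/f_err), a common convention; the toy data and the number of boosting rounds are assumptions for illustration.

```python
import math, random

# Sketch of AdaBoost on decision stumps: misclassified events get a
# larger weight, and the final classifier is a weighted vote.
random.seed(6)
data = ([(random.gauss(1, 1), +1) for _ in range(300)] +
        [(random.gauss(-1, 1), -1) for _ in range(300)])
weights = [1.0 / len(data)] * len(data)

def train_stump(data, weights):
    """Best cut 'x > c -> signal' minimising the weighted error."""
    best = (1.0, 0.0)
    for c in [i / 10.0 for i in range(-30, 31)]:
        err = sum(w for (x, y), w in zip(data, weights)
                  if (1 if x > c else -1) != y)
        if err < best[0]:
            best = (err, c)
    return best

forest = []  # (alpha, cut) pairs
for _ in range(5):
    err, cut = train_stump(data, weights)
    alpha = math.log((1 - err) / err)
    forest.append((alpha, cut))
    # boost: give misclassified events a larger weight, then renormalise
    weights = [w * math.exp(alpha) if (1 if x > cut else -1) != y else w
               for (x, y), w in zip(data, weights)]
    total = sum(weights)
    weights = [w / total for w in weights]

def classify(x):
    """Weighted majority vote of the forest; sign gives the class."""
    return sum(a * (1 if x > c else -1) for a, c in forest)

print(f"F(+1.5) = {classify(1.5):+.2f}, F(-1.5) = {classify(-1.5):+.2f}")
```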

  17. RuleFit
  • ref: Friedman-Popescu, http://www-stat.stanford.edu/~jhf/R-RuleFit.html
  • the model score is a linear combination of rules; a rule is a sequence of cuts applied to the data, and each rule carries a fitted coefficient
  • the problem to solve is:
    • create the rule ensemble, built from a set of decision trees
    • fit the coefficients: “gradient directed regularisation” (Friedman et al.)
  • fast, robust and good performance
  • test on 4k simulated SUSY + top events with 4 discriminating variables; methods tested: multilayer perceptron (MLP), Fisher, RuleFit
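The rule-ensemble scoring can be sketched directly from the definition: each rule is an indicator function built from a sequence of cuts, and the score is a linear combination of the rules. The rules and coefficients below are invented for illustration; in RuleFit they come from decision trees and the regularised fit.

```python
# Sketch of the RuleFit model structure: score(x) = a0 + sum_i a_i r_i(x),
# where each rule r_i(x) in {0,1} is a product of cuts.

def rule(cuts):
    """Build a rule r(x) from a list of (variable index, lo, hi) cuts."""
    def r(x):
        return 1.0 if all(lo <= x[i] < hi for i, lo, hi in cuts) else 0.0
    return r

rules = [rule([(0, 0.0, 99.0)]),                  # x0 > 0
         rule([(0, 1.0, 99.0), (1, 0.5, 99.0)]),  # x0 > 1 and x1 > 0.5
         rule([(1, -99.0, -1.0)])]                # x1 < -1
coefficients = [0.7, 1.1, -0.9]                   # invented fit results
offset = -0.4

def score(x):
    return offset + sum(a * r(x) for a, r in zip(coefficients, rules))

print(f"score([1.5, 0.8])   = {score([1.5, 0.8]):+.2f}")    # fires rules 1 and 2 -> +1.40
print(f"score([-0.5, -1.5]) = {score([-0.5, -1.5]):+.2f}")  # fires rule 3 only   -> -1.30
```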

  18. A Purely Academic Example
  • simple toy to illustrate the strength of the de-correlation technique
  • 4 linearly correlated Gaussians, with equal RMS and shifted means between S and B
  TMVA output:
  --- TMVA_Factory: correlation matrix (signal):
  ----------------------------------------------------------------
  ---           var1    var2    var3    var4
  --- var1:   +1.000  +0.336  +0.393  +0.447
  --- var2:   +0.336  +1.000  +0.613  +0.668
  --- var3:   +0.393  +0.613  +1.000  +0.907
  --- var4:   +0.447  +0.668  +0.907  +1.000
  --- TMVA_MethodFisher: ranked output (top variable is best ranked)
  ----------------------------------------------------------------
  --- Variable :  Coefficient:  Discr. power:
  ----------------------------------------------------------------
  --- var4     :      +8.077        0.3888
  --- var3     :      -3.417        0.2629
  --- var2     :      -0.982        0.1394
  --- var1     :      -0.812        0.0391

  19. Likelihood Distributions
  (plot: likelihood reference PDFs)

  20. ANN Convergence Test
  (plot: MLP convergence)

  21. ANN Architecture: Quality Assurance
  (plot: MLP architecture)

  22. Decision Tree Structure: Quality Assurance
  (plot: decision tree structure BEFORE pruning)

  23. Decision Tree Structure: Quality Assurance
  (plot: decision tree structure AFTER pruning)

  24. The Output
  • TMVA output distributions for Fisher, Likelihood, BDT and MLP

  25. The Output
  • TMVA output distributions for Fisher, Likelihood, BDT and MLP
  • for this case the Fisher discriminant provides the theoretically “best” possible method; same as the de-correlated likelihood
  • cuts, decision trees and likelihood without de-correlation are inferior
  • note: almost all realistic use cases are much more difficult than this one

  26. Academic Example (II)
  • simple toy to illustrate the difficulties some techniques have with correlations
  • 2×2 variables with circular correlations: within each set, equal means and different RMS for signal and background
  (plot: signal and background distributions in the x-y plane)
  TMVA output:
  --- TMVA_Factory: correlation matrix (signal):
  -------------------------------------------------------------------
  ---           var1    var2    var3    var4
  --- var1:   +1.000  +0.001  -0.004  -0.012
  --- var2:   +0.001  +1.000  -0.020  +0.001
  --- var3:   -0.004  -0.020  +1.000  +0.012
  --- var4:   -0.012  +0.001  +0.012  +1.000

  27. Academic Example (II): The Output
  • … resulting in the following “performance” for highly non-linear correlations:
  • decision trees outperform the other methods by far (to be fair, the ANN was just used with default parameters)

  28. Some “Real Life” Example
  • I got a sample (LHCb Monte Carlo) with simply a list of 21 possible variables
  • “played” for just 1 hour:
    • looked at all the variables (they get plotted “automatically”)
    • filled them in as “log(variable)” where the distributions were too “steep”
    • removed some variables that seemed to have no obvious separation

  29. Some “Real Life” Example (cont.)
  • same steps as on the previous slide, plus:
    • then removed variables with extremely large correlations

  30. Tuned the pruning parameters a little bit

  31. Had a result:
  • unfortunately, I don’t know how it compares to “his” current selection performance

  32. TMVA Technicalities
  • TMVA releases:
    • sourceforge (SF) package to accommodate world-wide access
    • code can be downloaded as a tar file, or via anonymous cvs access
    • home page: http://tmva.sourceforge.net/
    • SF project page: http://sourceforge.net/projects/tmva
    • view CVS: http://cvs.sourceforge.net/viewcvs.py/tmva/TMVA/
    • mailing lists: http://sourceforge.net/mail/?group_id=152074
    • ROOT::TMVA class index: http://root.cern.ch/root/htmldoc/TMVA_Index.html
    • part of the ROOT package (since development release 5.11/06)
  • TMVA features:
    • automatic iteration over all selected methods for training/testing etc.
    • each method has specific options that can be set by the user for optimisation
    • ROOT scripts are provided for all relevant performance analyses
  • We enthusiastically welcome new users, testers and developers: we now have 20 people on the developers list! (http://sourceforge.net/project/memberlist.php?group_id=152074)

  33. Concluding Remarks
  • TMVA is still a young project!
    • first release on sourceforge on March 8, 2006
    • now also part of the ROOT package; new development release: Nov. 22
    • it is an open source project: new contributors are welcome
  • TMVA provides the training and evaluation tools, but which method is best certainly depends on the use case → train several methods in parallel and see what is best for YOUR analysis
  • most methods can be improved over their defaults by optimising the training options
  • aimed to be an easy-to-use tool giving access to many different “complicated” selection algorithms
  • we already have a number of users, but still need more real-life experience
  • we need to write a manual (use the “How To…” on the web page and the examples in the distribution)

  34. Outlook
  • we will continue to improve the selection methods already implemented:
    • implement event weights for all methods (missing: Cuts, Fisher, H-Matrix, RuleFit)
    • improve the kernel estimators
    • improve decision tree pruning (automatic, cross validation)
    • improve the de-correlation techniques
  • new methods are under development:
    • RuleFit
    • support vector machines
    • “committee method”: combination of different MVA techniques

  35. Using TMVA

  36. Curious? Want to have a look at TMVA?
  Nothing easier than that:
  • get ROOT version 5.12.00 (better: wait for Nov. 22 and get the new development version)
  • go into your working directory (you need write access)
  • then start: root $ROOTSYS/tmva/test/TMVAnalysis.C
  • toy samples are provided; look at the results/plots and then at how you can implement your own stuff in the code

  37. Two ways to download the TMVA source code:
  • get the tgz file by clicking on the green button, or directly from here: http://sourceforge.net/project/showfiles.php?group_id=152074&package_id=168420&release_id=397782
  • via anonymous cvs: cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/tmva co -P TMVA
  Web presentation on sourceforge.net: http://tmva.sourceforge.net/
