
TMVA: A Toolkit for (Parallel) MultiVariate Data Analysis



  1. TMVA: A Toolkit for (Parallel) MultiVariate Data Analysis
  Andreas Höcker (ATLAS), Helge Voss (LHCb), Kai Voss (ex. ATLAS), Joerg Stelzer (ATLAS)
  • apply a (large) variety of sophisticated data selection algorithms to your data sample
  • have them all trained and tested
  • choose the best one for your selection problem and “go”
  http://tmva.sourceforge.net/
  Particle and Astrophysics Seminar: TMVA Toolkit for MultiVariate Analysis

  2. Outline
  • Introduction: MVAs, what / where / why
  • the MVA methods available in TMVA
  • demonstration with toy examples
  • (a “real life” example)
  • Summary / Outlook

  3. Introduction to MVA
  • At the beginning of each physics analysis:
    • select your event sample, discriminating against background
    • event tagging
    • …
  • Or even earlier:
    • e.g. particle identification, pattern recognition (ring finding) in a RICH
    • trigger applications
  → one always uses several variables in some sort of combination
  • MVA, MultiVariate Analysis: a nice name that means nothing more than this: use several observables from your events to form ONE combined variable, and use it to discriminate between “signal” and “background”

  4. Introduction to MVAs
  (plot: “expected” signal and background distributions for a variable Et; Et(event 2) is signal-like, Et(event 1) is background-like)
  • sequence of cuts vs. multivariate methods
  • a sequence of cuts is easy to understand and offers easy interpretation
  • sequences of cuts are often inefficient: events may be rejected because of just ONE variable, while the others look very “signal-like”
  • MVA: several observables → one selection criterion
  • e.g. likelihood selection:
    • for each observable in an event, calculate the probability that its value belongs to a “signal” or a “background” event, using reference distributions (PDFs) for signal and background
    • then cut on the combined likelihood of all these probabilities

  5. MVA Experience
  • MVAs have certainly made their way through HEP, e.g. the LEP Higgs analyses, but simple “cuts” are probably still the most accepted and used approach
  • widely understood and accepted: likelihood (probability density estimators, PDE)
  • often disliked, but more and more used (LEP, BABAR): artificial neural networks
  • much used in BABAR and Belle: Fisher discriminants
  • used by MiniBooNE and recently by BABAR: boosted decision trees
  • often people use custom tools; MVAs are tedious to implement → few true comparisons between methods!
  • all interesting methods, but often met with skepticism:
    • black boxes: how to interpret the selection?
    • what if the training samples incorrectly describe the data?
    • how can one evaluate systematics?
  • would like to try, but which MVA should I use, and where to find the time to code it? → just look at TMVA!

  6. What is TMVA
  • Toolkit for Multivariate Analysis (TMVA): C++, ROOT
  • “parallel” processing of various MVA techniques to discriminate signal from background samples → easy comparison of different MVA techniques; choose the best one
  • TMVA presently includes:
    • rectangular cut optimisation
    • likelihood estimator (PDE approach)
    • multi-dimensional likelihood estimator (PDE range-search approach)
    • Fisher discriminant
    • H-Matrix approach (χ² estimator)
    • artificial neural networks (3 different implementations)
    • boosted decision trees
    • upcoming: RuleFit, support vector machines, “committee” methods
  • common pre-processing of input data: de-correlation, principal component analysis
  • the TMVA package provides training, testing and evaluation of the MVAs
  • each MVA method provides a ranking of the input variables
  • MVAs produce weight files that are read by a Reader class for MVA application

  7. TMVA Methods

  8. Cut Optimisation
  • simplest method: cut in a rectangular volume using the Nvar input variables
  • usually the training files in TMVA do not contain realistic signal and background abundances → one cannot optimise for best significance S/√(S+B)
  • instead: scan the signal efficiency in [0, 1] and maximise the background rejection at each point
  • technical problem: how to perform the optimisation
    • random sampling: robust (if not too many observables are used) but suboptimal
    • new techniques: genetic algorithms and simulated annealing
  • huge speed improvement by sorting the training events in a binary search tree (for 4 variables we gained a factor 41)
  • do this in normal variable space or in de-correlated variable space
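The random-sampling approach above can be sketched in a few lines. This is a toy illustration in Python, not the TMVA C++ code: the Gaussian samples, the cut range and the 0.1-wide efficiency bins are all invented for the example.

```python
import random

# Toy sketch: optimise a rectangular cut x > c on one observable by
# random sampling, recording the best background rejection found in
# each scanned signal-efficiency bin.
random.seed(1)
signal = [random.gauss(1.0, 1.0) for _ in range(2000)]
background = [random.gauss(-1.0, 1.0) for _ in range(2000)]

def efficiency(sample, cut):
    """Fraction of events passing the cut x > cut."""
    return sum(1 for x in sample if x > cut) / len(sample)

best = {}  # signal-efficiency bin -> best background rejection found
for _ in range(5000):
    cut = random.uniform(-4, 4)
    eff_s = efficiency(signal, cut)
    rej_b = 1.0 - efficiency(background, cut)
    bin_ = round(eff_s, 1)  # scan signal efficiency in steps of 0.1
    if rej_b > best.get(bin_, 0.0):
        best[bin_] = rej_b

for eff in sorted(best):
    print(f"eff_S = {eff:.1f}  ->  best rej_B = {best[eff]:.3f}")
```

A genetic algorithm or simulated annealing replaces the blind uniform sampling with a guided search, which matters once more than a few observables are cut on.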

  9. Preprocessing the Input Variables: Decorrelation
  • commonly realised for all methods in TMVA (centrally in the DataSet class)
  • removal of linear correlations by rotating the variable space of the PDEs
  • determine the square root C′ of the covariance matrix C, i.e. C = C′C′
  • compute C′ by diagonalising C: C′ = S √D Sᵀ, with D the diagonal matrix of eigenvalues and S the matrix of eigenvectors
  • transformation from the original variables x into the de-correlated variables x′ by: x′ = C′⁻¹ x
  • various ways to choose the diagonalisation (also implemented: principal component analysis)
  • note that this “de-correlation” is only complete if:
    • the input variables are Gaussian
    • the correlations are linear only
  • in practice the gain from de-correlation is often rather modest, or even harmful
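The square-root decorrelation can be illustrated for two variables, where the symmetric covariance matrix can be diagonalised analytically. This is a Python sketch under that assumption; the toy sample and its correlation are made up, and TMVA does this centrally for any number of variables.

```python
import math, random

# Sketch of the de-correlation preprocessing for 2 variables:
# transform x' = C^(-1/2) x, with C the covariance matrix, obtained
# by diagonalising the symmetric 2x2 matrix analytically.
random.seed(2)
n = 20000
xs, ys = [], []
for _ in range(n):
    u = random.gauss(0, 1)
    v = random.gauss(0, 1)
    xs.append(u)
    ys.append(0.8 * u + 0.6 * v)   # linearly correlated with x

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

cxx, cyy, cxy = cov(xs, xs), cov(ys, ys), cov(xs, ys)

# Eigen-decomposition of the symmetric 2x2 covariance matrix.
theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
c, s = math.cos(theta), math.sin(theta)
l1 = cxx * c * c + 2 * cxy * c * s + cyy * s * s
l2 = cxx * s * s - 2 * cxy * c * s + cyy * c * c

# C^(-1/2) = S diag(1/sqrt(lambda)) S^T
d1, d2 = 1 / math.sqrt(l1), 1 / math.sqrt(l2)
m00 = c * c * d1 + s * s * d2
m01 = c * s * (d1 - d2)
m11 = s * s * d1 + c * c * d2

xd = [m00 * x + m01 * y for x, y in zip(xs, ys)]
yd = [m01 * x + m11 * y for x, y in zip(xs, ys)]

rho = cov(xd, yd) / math.sqrt(cov(xd, xd) * cov(yd, yd))
print(f"correlation after de-correlation: {rho:.4f}")
```

Since C is estimated from the same sample that is transformed, the residual linear correlation vanishes up to floating-point precision; for non-Gaussian inputs the non-linear dependence would of course survive.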

  10. Projected Likelihood Estimator (PDE Approach)
  • combine the probability density distributions (PDFs) of the discriminating variables into a likelihood ratio for event i:
    y(i) = L_S(i) / (L_S(i) + L_B(i)), where L_S and L_B are the products of the signal and background PDF values over all variables (species: signal, background)
  • assumes uncorrelated input variables
    • in that case it is the optimal MVA approach, since it contains all the information
    • usually this is not true → development of different methods
  • technical problem: how to implement the reference PDFs; 3 ways:
    • counting: automatic and unbiased, but suboptimal
    • function fitting: difficult to automate
    • parametric fitting (splines, kernel estimators): easy to automate, but can create artefacts; TMVA uses splines of order 0–5
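A minimal sketch of the likelihood ratio above, assuming Gaussian reference PDFs instead of the spline fits TMVA performs; the means and widths below are invented toy values.

```python
import math

# Sketch of a projected likelihood (PDE) classifier: per-variable
# reference PDFs are combined into y = L_S / (L_S + L_B), assuming
# uncorrelated input variables.

def gauss_pdf(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Per-variable reference PDFs: ((signal mean, sigma), (background mean, sigma))
pdfs = [((1.0, 1.0), (-1.0, 1.0)),   # variable 1
        ((0.5, 0.8), (-0.5, 0.8))]   # variable 2

def likelihood_ratio(event):
    """y = L_S / (L_S + L_B) for one event."""
    l_s = l_b = 1.0
    for x, ((ms, ss), (mb, sb)) in zip(event, pdfs):
        l_s *= gauss_pdf(x, ms, ss)
        l_b *= gauss_pdf(x, mb, sb)
    return l_s / (l_s + l_b)

y_sig = likelihood_ratio([1.2, 0.6])    # signal-like event
y_bkg = likelihood_ratio([-1.1, -0.4])  # background-like event
print(f"signal-like: {y_sig:.3f}, background-like: {y_bkg:.3f}")
```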

  11. Multidimensional Likelihood Estimator
  • generalisation of the 1D PDE approach to Nvar dimensions
  • optimal method, in theory, since the full information is used
  • practical challenges:
    • parameterisation of the multi-dimensional phase space needs huge training samples
    • implementation of the Nvar-dim. reference PDF with kernel estimates or counting
  • TMVA implementation follows the range-search method (Carli-Koblitz, NIM A501, 576 (2003)):
    • count the number of signal and background events in the “vicinity” of a data event
    • the “vicinity” is defined by a fixed or adaptive Nvar-dim. volume size; adaptive: rescale the volume size to achieve a constant number of reference events
    • volumes can be rectangular or spherical
    • use multi-D kernels (Gaussian, triangular, …) to weight events within a volume
    • speed up the range search by sorting the training events in binary trees
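The range-search counting can be sketched as follows, here with a fixed spherical volume, no kernel weighting, and without the binary-tree speed-up; the sample shapes and the radius are illustrative assumptions.

```python
import random

# Toy sketch of the range-search PDE: classify an event by counting
# signal and background training events inside a fixed spherical
# volume around it.
random.seed(3)
sig_train = [(random.gauss(1, 1), random.gauss(1, 1)) for _ in range(1000)]
bkg_train = [(random.gauss(-1, 1), random.gauss(-1, 1)) for _ in range(1000)]

def pde_rs(event, radius=0.8):
    """Signal probability from event counts in the vicinity of 'event'."""
    def count(sample):
        return sum(1 for p in sample
                   if sum((a - b) ** 2 for a, b in zip(p, event)) < radius ** 2)
    n_s, n_b = count(sig_train), count(bkg_train)
    return n_s / (n_s + n_b) if n_s + n_b else 0.5

print(f"P(signal | (+1,+1)) = {pde_rs((1.0, 1.0)):.2f}")
print(f"P(signal | (-1,-1)) = {pde_rs((-1.0, -1.0)):.2f}")
```

The adaptive variant would grow or shrink `radius` per event until a fixed number of reference events is enclosed, which keeps the statistical precision of the estimate roughly constant across phase space.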

  12. Fisher Discriminant (and H-Matrix)
  • well-known, simple and elegant MVA method: event selection is performed in a transformed variable space with zero linear correlations, by distinguishing the mean values of the signal and background distributions
  • instead of equations, words: an axis is determined in the (correlated) hyperspace of the input variables such that, when the output classes (signal and background) are projected onto this axis, they are pushed as far apart as possible, while events of the same class are confined to a close vicinity. The linearity of the method is reflected in the metric with which “far apart” and “close vicinity” are measured: the covariance matrix of the discriminant variable space.
  • optimal for linearly correlated Gaussians with equal RMS and different means
  • no separation if the means are equal and the RMS (shapes) differ
  • computation of the Fisher MVA couldn’t be simpler: the “Fisher coefficients” follow directly from the class means and the covariance matrix
  • H-Matrix estimator (correlated χ²): a poor man’s variation of the Fisher discriminant
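The Fisher coefficients can be computed by hand for two variables. This is a sketch assuming the standard textbook construction w = W⁻¹(μ_S − μ_B), with W the pooled within-class covariance matrix; the toy sample is invented, and TMVA's own normalisation conventions are not reproduced.

```python
import random

# Sketch of a 2-variable Fisher discriminant: project events onto the
# axis w = W^(-1) (mu_S - mu_B), with a hand-rolled 2x2 inversion.
random.seed(4)

def make_sample(mx, my):
    pts = []
    for _ in range(2000):
        u, v = random.gauss(0, 1), random.gauss(0, 1)
        pts.append((u + mx, 0.5 * u + v + my))  # linearly correlated variables
    return pts

sig, bkg = make_sample(1.0, 1.0), make_sample(-1.0, -1.0)

def mean(pts, i):
    return sum(p[i] for p in pts) / len(pts)

def within_cov(samples):
    """Pooled within-class 2x2 covariance matrix."""
    w = [[0.0, 0.0], [0.0, 0.0]]
    n = 0
    for pts in samples:
        m = (mean(pts, 0), mean(pts, 1))
        for p in pts:
            for i in range(2):
                for j in range(2):
                    w[i][j] += (p[i] - m[i]) * (p[j] - m[j])
        n += len(pts)
    return [[w[i][j] / n for j in range(2)] for i in range(2)]

W = within_cov([sig, bkg])
dm = (mean(sig, 0) - mean(bkg, 0), mean(sig, 1) - mean(bkg, 1))
det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
# Fisher coefficients: w = W^(-1) (mu_S - mu_B)
w0 = (W[1][1] * dm[0] - W[0][1] * dm[1]) / det
w1 = (-W[1][0] * dm[0] + W[0][0] * dm[1]) / det

def fisher(event):
    return w0 * event[0] + w1 * event[1]

mean_f_sig = sum(fisher(p) for p in sig) / len(sig)
mean_f_bkg = sum(fisher(p) for p in bkg) / len(bkg)
print(f"<F>_S = {mean_f_sig:.2f}, <F>_B = {mean_f_bkg:.2f}")
```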

  13. Artificial Neural Network (ANN)
  (diagram: feed-forward multilayer perceptron with 1 input layer holding the Nvar discriminating input variables, k hidden layers, and 1 output layer with 2 output classes, signal and background; each neuron applies an “activation” function to the weighted sum of its inputs)
  • ANNs are non-linear discriminants: Fisher = ANN without a hidden layer
  • they seem to be better adapted to realistic use cases than Fisher and likelihood
  • TMVA ANN implementations (all feed-forward multilayer perceptrons):
    • Clermont-Ferrand ANN: used for the ALEPH Higgs analysis; translated from FORTRAN
    • TMultiLayerPerceptron interface: the ANN implemented in ROOT
    • MLP: “our” own ANN; similar to the ROOT ANN, faster and more flexible, however with no automatic validation
  → a recent comparison with the Stuttgart NN in ATLAS tau-ID confirmed its excellent performance!
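The forward pass of such a feed-forward multilayer perceptron is simple to sketch. The weights below are arbitrary illustrative numbers, not a trained network, and the sigmoid stands in for whatever activation function a given implementation uses.

```python
import math

# Minimal sketch of a feed-forward MLP forward pass:
# 2 inputs -> 1 hidden layer of 2 neurons -> 1 output neuron.

def sigmoid(x):
    """'Activation' function."""
    return 1.0 / (1.0 + math.exp(-x))

# Each neuron: (bias, [weights from previous layer]) -- illustrative values.
hidden = [(0.1, [0.8, -0.4]), (-0.2, [0.3, 0.9])]
output = (0.0, [1.2, -1.1])

def forward(inputs):
    h = [sigmoid(b + sum(w * x for w, x in zip(ws, inputs)))
         for b, ws in hidden]
    b, ws = output
    return sigmoid(b + sum(w * x for w, x in zip(ws, h)))

y = forward([0.5, -0.3])
print(f"network output: {y:.3f}")
```

With the hidden layer removed, the output is a sigmoid of a single linear combination of the inputs, which is why a no-hidden-layer ANN is equivalent to a Fisher discriminant up to the monotone activation.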

  14. Decision Trees
  • decision tree: a sequential application of “cuts” which splits the data into nodes; the final nodes (leaves) classify an event as signal or background
  (diagram: root node with Nsignal, Nbkg split into nodes of increasing depth; a leaf with Nsignal > Nbkg is labelled signal, a leaf with Nsignal < Nbkg is labelled background)
  • training:
    • start with the root node
    • split the training sample at each node into two, using a cut in the variable that gives the best separation
    • continue splitting until the minimal number of events or the maximum purity is reached
    • leaf nodes classify events as signal or background according to the majority of training events, or give a S/B probability
  • bottom-up pruning: remove statistically dominated nodes
  • testing/application: a test event is “filled in” at the root node and classified according to the leaf node where it ends up after the cut sequence
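The node-splitting step of the training can be sketched as a scan for the cut with the best (lowest) weighted Gini index; a real tree then recurses on the two daughter nodes, which this toy omits. The data, the Gini criterion and the cut grid are assumptions for illustration.

```python
import random

# Sketch of one decision-tree node split: scan cut values on each
# variable, keep the one that best separates signal (label 1) from
# background (label 0) according to the Gini index.
random.seed(5)
events = ([((random.gauss(1, 1), random.gauss(0, 1)), 1) for _ in range(500)] +
          [((random.gauss(-1, 1), random.gauss(0, 1)), 0) for _ in range(500)])

def gini(group):
    """Gini index 2p(1-p) of a node; 0 means a pure node."""
    n = len(group)
    if n == 0:
        return 0.0
    p = sum(label for _, label in group) / n
    return 2 * p * (1 - p)

best = None
for var in range(2):
    for cut in [c / 10.0 for c in range(-30, 31)]:
        left = [e for e in events if e[0][var] <= cut]
        right = [e for e in events if e[0][var] > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(events)
        if best is None or score < best[0]:
            best = (score, var, cut)

score, var, cut = best
print(f"best split: var{var + 1} at {cut:.1f} (weighted Gini {score:.3f})")
```

Only the first toy variable carries separation here, so the scan should pick it and place the cut near the overlap of the two Gaussians.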

  15. Boosted Decision Trees
  • decision trees have been used for a long time in general “data-mining” applications, but are less known in HEP (although very similar to “simple cuts”)
  • advantages:
    • easy to interpret: independently of Nvar, a tree can always be visualised as a 2D tree diagram
    • independent of monotone variable transformations: rather immune against outliers
    • immune against the addition of weak variables
  • disadvantage:
    • instability: small changes in the training sample can give large changes in the tree structure
  • boosted decision trees (1996): combine several decision trees (a forest), derived from one training sample via the application of event weights, into ONE multivariate event classifier by performing a “majority vote”
    • e.g. AdaBoost: wrongly classified training events are given a larger weight
    • alternatives: bagging, random weights, re-sampling with replacement

  16. Boosted Decision Trees
  • boosting allows two different interpretations:
    • give events that are “difficult to categorise” more weight, to obtain better classifiers
    • see each tree as a “basis function” of a possible classifier: boosting or bagging is just a means to generate a set of basis functions; a linear combination of the basis functions gives the final classifier, or: the final classifier is an expansion in the basis functions
  • AdaBoost:
    • normally, each tree in the forest is weighted by (1 − f_err)/f_err, with f_err the error fraction of that tree
    • AdaBoost can then be understood as the expansion F(α, x) = Σᵢ αᵢ Tᵢ(x) over the trees Tᵢ, with the coefficients αᵢ minimised according to an “exponential loss criterion”: the expectation value of exp(−y F(α, x)), where y = −1 (background) and y = +1 (signal)
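The AdaBoost re-weighting loop can be sketched with decision stumps (single cuts) in place of full trees. The vote weight of each stump is taken here as α = ln((1 − f_err)/f_err), a common convention; the toy data and the number of boosting rounds are assumptions for illustration.

```python
import math, random

# Sketch of AdaBoost on decision stumps: misclassified events get a
# larger weight, and the final classifier is a weighted vote.
random.seed(6)
data = ([(random.gauss(1, 1), +1) for _ in range(300)] +
        [(random.gauss(-1, 1), -1) for _ in range(300)])
weights = [1.0 / len(data)] * len(data)

def train_stump(data, weights):
    """Best cut 'x > c -> signal' minimising the weighted error."""
    best = (1.0, 0.0)
    for c in [i / 10.0 for i in range(-30, 31)]:
        err = sum(w for (x, y), w in zip(data, weights)
                  if (1 if x > c else -1) != y)
        if err < best[0]:
            best = (err, c)
    return best

forest = []  # (alpha, cut) pairs
for _ in range(5):
    err, cut = train_stump(data, weights)
    alpha = math.log((1 - err) / err)
    forest.append((alpha, cut))
    # boost: give misclassified events a larger weight, then renormalise
    weights = [w * math.exp(alpha) if (1 if x > cut else -1) != y else w
               for (x, y), w in zip(data, weights)]
    total = sum(weights)
    weights = [w / total for w in weights]

def classify(x):
    """Weighted majority vote of the forest; sign gives the class."""
    return sum(a * (1 if x > c else -1) for a, c in forest)

print(f"F(+1.5) = {classify(1.5):+.2f}, F(-1.5) = {classify(-1.5):+.2f}")
```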

  17. RuleFit
  • ref: Friedman-Popescu, http://www-stat.stanford.edu/~jhf/R-RuleFit.html
  • the model score is a linear combination of rules; a rule is a sequence of cuts applied to the data, and each rule carries a fitted coefficient
  • the problem to solve is:
    • create the rule ensemble, built from a set of decision trees
    • fit the coefficients: “gradient directed regularisation” (Friedman et al.)
  • fast, robust and good performance
  • test on 4k simulated SUSY + top events with 4 discriminating variables; methods tested: multilayer perceptron (MLP), Fisher, RuleFit
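The rule-ensemble scoring can be sketched directly from the definition: each rule is an indicator function built from a sequence of cuts, and the score is a linear combination of the rules. The rules and coefficients below are invented for illustration; in RuleFit they come from decision trees and the regularised fit.

```python
# Sketch of the RuleFit model structure: score(x) = a0 + sum_i a_i r_i(x),
# where each rule r_i(x) in {0,1} is a product of cuts.

def rule(cuts):
    """Build a rule r(x) from a list of (variable index, lo, hi) cuts."""
    def r(x):
        return 1.0 if all(lo <= x[i] < hi for i, lo, hi in cuts) else 0.0
    return r

rules = [rule([(0, 0.0, 99.0)]),                  # x0 > 0
         rule([(0, 1.0, 99.0), (1, 0.5, 99.0)]),  # x0 > 1 and x1 > 0.5
         rule([(1, -99.0, -1.0)])]                # x1 < -1
coefficients = [0.7, 1.1, -0.9]                   # invented fit results
offset = -0.4

def score(x):
    return offset + sum(a * r(x) for a, r in zip(coefficients, rules))

print(f"score([1.5, 0.8])   = {score([1.5, 0.8]):+.2f}")    # fires rules 1 and 2 -> +1.40
print(f"score([-0.5, -1.5]) = {score([-0.5, -1.5]):+.2f}")  # fires rule 3 only   -> -1.30
```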

  18. A Purely Academic Example
  • simple toy to illustrate the strength of the de-correlation technique
  • 4 linearly correlated Gaussians, with equal RMS and shifted means between S and B
  TMVA output:
  --- TMVA_Factory: correlation matrix (signal):
  ----------------------------------------------------------------
  ---           var1    var2    var3    var4
  --- var1:   +1.000  +0.336  +0.393  +0.447
  --- var2:   +0.336  +1.000  +0.613  +0.668
  --- var3:   +0.393  +0.613  +1.000  +0.907
  --- var4:   +0.447  +0.668  +0.907  +1.000
  --- TMVA_MethodFisher: ranked output (top variable is best ranked)
  ----------------------------------------------------------------
  --- Variable :  Coefficient:  Discr. power:
  ----------------------------------------------------------------
  --- var4     :      +8.077        0.3888
  --- var3     :      -3.417        0.2629
  --- var2     :      -0.982        0.1394
  --- var1     :      -0.812        0.0391

  19. Likelihood Distributions
  (plot: likelihood reference PDFs)

  20. ANN Convergence Test
  (plot: MLP convergence)

  21. ANN Architecture: Quality Assurance
  (plot: MLP architecture)

  22. Decision Tree Structure: Quality Assurance
  (plot: decision tree structure BEFORE pruning)

  23. Decision Tree Structure: Quality Assurance
  (plot: decision tree structure AFTER pruning)

  24. The Output
  • TMVA output distributions for Fisher, Likelihood, BDT and MLP

  25. The Output
  • TMVA output distributions for Fisher, Likelihood, BDT and MLP
  • for this case the Fisher discriminant provides the theoretically “best” possible method; same as the de-correlated likelihood
  • cuts, decision trees and likelihood without de-correlation are inferior
  • note: almost all realistic use cases are much more difficult than this one

  26. Academic Example (II)
  • simple toy to illustrate the difficulties some techniques have with correlations
  • 2×2 variables with circular correlations: within each set, equal means and different RMS for signal and background
  (plot: signal and background distributions in the x-y plane)
  TMVA output:
  --- TMVA_Factory: correlation matrix (signal):
  -------------------------------------------------------------------
  ---           var1    var2    var3    var4
  --- var1:   +1.000  +0.001  -0.004  -0.012
  --- var2:   +0.001  +1.000  -0.020  +0.001
  --- var3:   -0.004  -0.020  +1.000  +0.012
  --- var4:   -0.012  +0.001  +0.012  +1.000

  27. Academic Example (II): The Output
  • … resulting in the following “performance” for highly non-linear correlations:
  • decision trees outperform the other methods by far (to be fair, the ANN was just used with default parameters)

  28. Some “Real Life” Example
  • I got a sample (LHCb Monte Carlo) with simply a list of 21 possible variables
  • “played” for just 1 hour:
    • looked at all the variables (they get plotted “automatically”)
    • filled them in as “log(variable)” where the distributions were too “steep”
    • removed some variables that seemed to have no obvious separation

  29. Some “Real Life” Example (cont.)
  • same steps as on the previous slide, plus:
    • then removed variables with extremely large correlations

  30. Tuned the pruning parameters a little bit

  31. Had a result:
  • unfortunately, I don’t know how it compares to “his” current selection performance

  32. TMVA Technicalities
  • TMVA releases:
    • sourceforge (SF) package to accommodate world-wide access
    • code can be downloaded as a tar file, or via anonymous cvs access
    • home page: http://tmva.sourceforge.net/
    • SF project page: http://sourceforge.net/projects/tmva
    • view CVS: http://cvs.sourceforge.net/viewcvs.py/tmva/TMVA/
    • mailing lists: http://sourceforge.net/mail/?group_id=152074
    • ROOT::TMVA class index: http://root.cern.ch/root/htmldoc/TMVA_Index.html
    • part of the ROOT package (since development release 5.11/06)
  • TMVA features:
    • automatic iteration over all selected methods for training/testing etc.
    • each method has specific options that can be set by the user for optimisation
    • ROOT scripts are provided for all relevant performance analyses
  • We enthusiastically welcome new users, testers and developers: we now have 20 people on the developers list! (http://sourceforge.net/project/memberlist.php?group_id=152074)

  33. Concluding Remarks
  • TMVA is still a young project!
    • first release on sourceforge on March 8, 2006
    • now also part of the ROOT package; new development release: Nov. 22
    • it is an open source project: new contributors are welcome
  • TMVA provides the training and evaluation tools, but which method is best certainly depends on the use case → train several methods in parallel and see what is best for YOUR analysis
  • most methods can be improved over their defaults by optimising the training options
  • aimed to be an easy-to-use tool giving access to many different “complicated” selection algorithms
  • we already have a number of users, but still need more real-life experience
  • we need to write a manual (use the “How To…” on the web page and the examples in the distribution)

  34. Outlook
  • we will continue to improve the selection methods already implemented:
    • implement event weights for all methods (missing: Cuts, Fisher, H-Matrix, RuleFit)
    • improve the kernel estimators
    • improve decision tree pruning (automatic, cross validation)
    • improve the de-correlation techniques
  • new methods are under development:
    • RuleFit
    • support vector machines
    • “committee method”: combination of different MVA techniques

  35. Using TMVA

  36. Curious? Want to have a look at TMVA?
  Nothing easier than that:
  • get ROOT version 5.12.00 (better: wait for Nov. 22 and get the new development version)
  • go into your working directory (you need write access)
  • then start: root $ROOTSYS/tmva/test/TMVAnalysis.C
  • toy samples are provided; look at the results/plots and then at how you can implement your own stuff in the code

  37. Two ways to download the TMVA source code:
  • get the tgz file by clicking on the green button, or directly from here: http://sourceforge.net/project/showfiles.php?group_id=152074&package_id=168420&release_id=397782
  • via anonymous cvs: cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/tmva co -P TMVA
  Web presentation on sourceforge.net: http://tmva.sourceforge.net/
