
Status of TMVA, the Toolkit for MultiVariate Analysis

Eckhard von Toerne (University of Bonn), for the TMVA core developer team: A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E.v.T., H. Voss

Presentation Transcript


  1. Status of TMVA, the Toolkit for MultiVariate Analysis. Eckhard von Toerne (University of Bonn), for the TMVA core developer team: A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E.v.T., H. Voss

  2. Outline • Overview • New developments • Recent physics results that use TMVA • Web site: http://tmva.sourceforge.net/ • See also: "TMVA - Toolkit for Multivariate Data Analysis", A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E. von Toerne, H. Voss et al., arXiv:physics/0703039v5 [physics.data-an]

  3. What is TMVA • Supervised learning • Classification and regression tasks • Easy to train, evaluate and compare various MVA methods • Various preprocessing methods (decorrelation, PCA, Gaussianisation, …) • Integrated in ROOT (Figure: MVA output distribution)

  4. TMVA workflow • Training: • Classification: learn the features of the different event classes from a sample with known signal/background composition • Regression: learn the functional dependence between input variables and targets • Testing: • Evaluate the performance of the trained classifier/regressor on an independent test sample • Compare different methods • Application: • Apply the classifier/regressor to real data (see the sketch below)
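In code, the application step uses the TMVA::Reader class. Below is a minimal sketch, assuming the weights file written by a training job named "MVAnalysis" as in the complete session on slide 23; the data file, tree and branch names are illustrative.

#include "TFile.h"
#include "TTree.h"
#include "TMVA/Reader.h"

void TMVApplication()
{
   // Variables must be registered with the same expressions and in the
   // same order as during training
   Float_t varSum = 0, varDiff = 0;
   TMVA::Reader* reader = new TMVA::Reader( "!Color:!Silent" );
   reader->AddVariable( "var1+var2", &varSum );
   reader->AddVariable( "var1-var2", &varDiff );

   // Weights file produced by the Factory during training
   // (default location: weights/<JobName>_<MethodName>.weights.xml)
   reader->BookMVA( "MLP", "weights/MVAnalysis_MLP.weights.xml" );

   // Loop over the data tree and evaluate the classifier event by event
   TFile* input = TFile::Open( "data.root" );       // illustrative file name
   TTree* tree  = (TTree*)input->Get( "DataTree" ); // illustrative tree name
   Float_t var1 = 0, var2 = 0;
   tree->SetBranchAddress( "var1", &var1 );
   tree->SetBranchAddress( "var2", &var2 );
   for (Long64_t i = 0; i < tree->GetEntries(); ++i) {
      tree->GetEntry(i);
      varSum  = var1 + var2;
      varDiff = var1 - var2;
      Double_t mvaValue = reader->EvaluateMVA( "MLP" );
      // ... cut on mvaValue or fill histograms here
   }
   delete reader;
}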

  5. Classification/Regression (Figure: three panels in the x1-x2 plane showing possible decision boundaries between hypotheses H0 and H1) Classification of signal/background – how to find the best decision boundary? Regression – how to determine the correct model?

  6. How to choose a method? • You have a training sample with only a few events? The number of "parameters" must be limited → use a linear classifier or FDA, a small BDT, or a small MLP • Variables are uncorrelated (or have only linear correlations) → Likelihood • I just want something simple → use Cuts, LD, or Fisher • Methods for complex problems → use BDT, MLP, or SVM (see the booking sketch below). List of acronyms: BDT = boosted decision tree, see manual page 103; ANN = artificial neural network; MLP = multi-layer perceptron, a specific form of ANN, also the name of our flagship ANN, manual p. 92; FDA = function discriminant analysis, see manual p. 87; LD = linear discriminant, manual p. 85; SVM = support vector machine, manual p. 98 (currently available only for classification); Cuts = as in "cut selection", manual p. 56; Fisher = Ronald A. Fisher, classifier similar to LD, manual p. 83
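To compare several of these methods on equal footing, they can simply be booked side by side in the same Factory; the evaluation stage then prints a common performance (ROC) comparison. A minimal sketch, assuming a configured factory as in the complete session on slide 23; the option strings are illustrative defaults, not tuned settings:

// Book candidate methods side by side; the Factory trains and
// evaluates all of them on the same training/test split
factory->BookMethod( TMVA::Types::kCuts,       "Cuts",       "!H:!V" );
factory->BookMethod( TMVA::Types::kLD,         "LD",         "!H:!V" );
factory->BookMethod( TMVA::Types::kFisher,     "Fisher",     "!H:!V" );
factory->BookMethod( TMVA::Types::kLikelihood, "Likelihood", "!H:!V" );
factory->BookMethod( TMVA::Types::kBDT,        "BDT",        "!H:!V:NTrees=400" );
factory->BookMethod( TMVA::Types::kMLP,        "MLP",        "!H:!V:HiddenLayers=N+1" );
factory->BookMethod( TMVA::Types::kSVM,        "SVM",        "!H:!V" ); // classification only

factory->TrainAllMethods();
factory->TestAllMethods();
factory->EvaluateAllMethods(); // prints the ROC-integral comparison table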

  7. Artificial Neural Networks: Feed-forward Multilayer Perceptron (Figure: network with Nvar discriminating input variables in 1 input layer, k hidden layers with M1 … Mk neurons, and 1 output layer with 2 output classes, signal and background) Each neuron j applies an "activation" function A to the weighted sum of its inputs, y_j = A(Σ_i w_ij x_i), typically the sigmoid A(t) = 1/(1 + e^−t). • Modelling of arbitrary nonlinear functions as a nonlinear combination of simple "neuron activation functions" • Advantages: very flexible, no assumption about the function necessary • Disadvantages: "black box", needs tuning, seed-dependent (see the booking sketch below)
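In TMVA, the network architecture and the training algorithm are steered through the option string when booking the MLP. A minimal sketch, assuming a configured factory; the option values are illustrative (two hidden layers, sigmoid neurons, standard back-propagation):

// Two hidden layers with N and N-1 neurons (N = number of input
// variables), sigmoid activation, back-propagation training;
// adding UseRegulator=True would enable the Bayesian treatment
// of the network weights mentioned on slide 12
factory->BookMethod( TMVA::Types::kMLP, "MLP",
   "!H:!V:NeuronType=sigmoid:HiddenLayers=N,N-1:NCycles=500:TrainingMethod=BP:TestRate=5" );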

  8. Boosted Decision Trees • Grow a forest of decision trees and determine the event class/target by majority vote • Weights of misclassified events are increased in the next iteration • Advantages: ignores weak variables, works out of the box • Disadvantages: vulnerable to overtraining (see the booking sketch below)
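The boosting behaviour and the safeguards against overtraining (tree depth, number of trees, number of cut points) are again controlled by the booking options. A minimal sketch, assuming a configured factory; the values are illustrative starting points, not a tuned configuration:

// AdaBoost forest of 850 shallow trees; limiting MaxDepth and nCuts
// helps guard against overtraining
factory->BookMethod( TMVA::Types::kBDT, "BDT",
   "!H:!V:NTrees=850:BoostType=AdaBoost:AdaBoostBeta=0.5:SeparationType=GiniIndex:nCuts=20:MaxDepth=3" );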

  9. No Single Best Classifier… The properties of the Function discriminant (FDA) depend on the chosen function

  10. Neyman-Pearson Lemma (Figure: ROC curve, 1 − ε_backgr. versus ε_signal, both from 0 to 1; the region towards the upper right means better classification, the diagonal corresponds to random guessing; one end of the curve has a large type-1 and small type-2 error, the other end the reverse; the "limit" in the ROC curve is given by the likelihood ratio) Neyman-Pearson: the likelihood ratio used as "selection criterion" y(x) gives for each selection efficiency the best possible background rejection, i.e. it maximizes the area under the "Receiver Operating Characteristic" (ROC) curve. • Varying the cut in y(x) > cut moves the working point (efficiency and purity) along the ROC curve • How to choose the cut? One needs to know the prior probabilities (S, B abundances): • Measurement of a signal cross section: maximize S/√(S+B), or equivalently √(ε·p) • Discovery of a signal: maximize S/√B • Precision measurement: high purity (p) • Trigger selection: high efficiency (ε) • A cut-scan sketch follows below
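As a concrete illustration of choosing the working point for a cross-section measurement, the sketch below scans a cut on the classifier output and returns the value maximizing S/√(S+B). The histograms hS and hB are hypothetical inputs: classifier-output distributions with identical binning, scaled to the expected signal and background yields.

#include <cmath>
#include "TH1F.h"

// Scan a cut on the MVA output and return the cut value that
// maximizes the expected significance S/sqrt(S+B)
Double_t FindBestCut( const TH1F* hS, const TH1F* hB )
{
   Double_t bestCut = 0., bestSig = -1.;
   const Int_t nBins = hS->GetNbinsX();
   for (Int_t i = 1; i <= nBins; ++i) {
      // Signal and background yields passing a cut at the lower edge of bin i
      Double_t s = hS->Integral( i, nBins );
      Double_t b = hB->Integral( i, nBins );
      if (s + b <= 0.) continue;
      Double_t sig = s / std::sqrt(s + b);
      if (sig > bestSig) {
         bestSig = sig;
         bestCut = hS->GetXaxis()->GetBinLowEdge(i);
      }
   }
   return bestCut;
}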

  11. Performance with toy data • 3-dimensional distribution • Signal: sum of Gaussians • Background: flat • Theoretical limit calculated using the Neyman-Pearson lemma • Neural net (MLP) with two hidden layers and back-propagation training; the Bayesian option has little influence on high-statistics training • The TMVA ANN converges towards the theoretical limit for sufficient Ntrain (~100k)

  12. Recent developments • Current version: TMVA 4.1.2, in ROOT release 5.30 • Unit-test framework for daily software and method performance validation (C. Rosemann, E.v.T.) • Multiclass classification for MLP, BDTG, FDA • BDT automatic parameter optimization for building the tree architecture • New method to treat data with distinct sub-populations (method Category) • Optional Bayesian treatment of ANN weights in MLP with back-propagation (Jiahang Zhong) • Extended PDEFoam functionality (A. Voigt) • Variable transformations on a user-defined subset of variables

  13. Unit test (C. Rosemann, E.v.T.) • Automated framework to verify functionality and performance (ours is based on B. Eckel's description) • A slimmed version runs every night on various operating systems

***************************************************************
* TMVA - U N I T test : Summary *
***************************************************************
Test 0 : Event [107/107]....................................OK
Test 1 : VariableInfo [31/31]...............................OK
Test 2 : DataSetInfo [20/20]................................OK
Test 3 : DataSet [15/15]....................................OK
Test 4 : Factory [16/16]....................................OK
Test 7 : LDN_selVar_Gauss [4/4].............................OK
....
Test 107 : BoostedPDEFoam [4/4]..............................OK
Test 108 : BoostedDTPDEFoam [4/4]............................OK
Total number of failures: 0
***************************************************************

  14. And now: switching from statistics to physics … acknowledging the hard work of our users

  15. Review of recent results • BDT trained on individual mH samples with 10 variables • Expect 1.47 signal events at mH = 160 GeV, compared to 1.27 events with the cut-based analysis (on 4 variables) and the same background • Phys.Lett.B699:25-47,2011 (CMS Coll., H→WW search)

  16. Review of recent results • Super-Kamiokande Coll., "Kinematic reconstruction of atmospheric neutrino events in a large water Cherenkov detector with proton identification", Phys.Rev.D79:112010,2009 • 7 input variables • MLP with one hidden layer (Figure: MLP output distributions for signal and background)

  17. Review of recent results • CDF Coll., "First Observation of Electroweak Single Top Quark Production", Phys.Rev.Lett.103:092002,2009 • BDT analysis with ~20 input variables • lepton + missing ET + jets • Results for the s+t channel

  18. Review of recent results using TMVA • CDF+D0 combined Higgs working group, hep-ex/1107.5518. (SVM) • CMS Coll., H→WW search, Phys.Lett.B699:25-47,2011. (BDT) • IceCube Coll., astro-ph/1101.1692. (MLP) • D0 Coll., top pairs, Phys.Rev.D84:012008,2011. (BDT) • IceCube Coll., Phys.Rev.D83:012001,2011. (BDT) • IceCube Coll., Phys.Rev.D82:112003,2010. (BDT) • D0 Coll., Higgs search, Phys.Rev.Lett.105:251801,2010. (BDT) • CDF Coll., single top, Phys.Rev.D82:112005,2010. (BDT) • D0 Coll., single top, Phys.Lett.B690:5-14,2010. (BDT) • D0 Coll., top pairs, Phys.Rev.D82:032002,2010. (Likelihood) • CDF Coll., single top observation, Phys.Rev.Lett.103:092002,2009. (BDT) • Super-Kamiokande Coll., Phys.Rev.D79:112010,2009. (MLP) • BABAR Coll., Phys.Rev.D79:051101,2009. (BDT) • + other papers • + several ATLAS papers with TMVA about to come out… • + many ATLAS results about to be published…

  19. Review of recent results using TMVA (same list as the previous slide) Thank you for using TMVA!

  20. Summary • TMVA is a versatile package for classification and regression tasks • Integrated into ROOT • Easy to train classifiers/regression methods • A multitude of physics results based on TMVA is coming out • Thank you for your attention!

  21. Credits • TMVA is open source software • Use and redistribution of the source are permitted according to the terms of the BSD license • Several similar data-mining efforts exist, with rising importance in most fields of science and industry. Contributors to TMVA: Andreas Hoecker (CERN, Switzerland), Jörg Stelzer (CERN, Switzerland), Peter Speckmayer (CERN, Switzerland), Jan Therhaag (Universität Bonn, Germany), Eckhard von Toerne (Universität Bonn, Germany), Helge Voss (MPI für Kernphysik Heidelberg, Germany), Moritz Backes (Geneva University, Switzerland), Tancredi Carli (CERN, Switzerland), Asen Christov (Universität Freiburg, Germany), Or Cohen (CERN, Switzerland and Weizmann, Israel), Krzysztof Danielowski (IFJ and AGH/UJ, Krakow, Poland), Dominik Dannheim (CERN, Switzerland), Sophie Henrot-Versille (LAL Orsay, France), Matthew Jachowski (Stanford University, USA), Kamil Kraszewski (IFJ and AGH/UJ, Krakow, Poland), Attila Krasznahorkay Jr. (CERN, Switzerland, and Manchester U., UK), Maciej Kruk (IFJ and AGH/UJ, Krakow, Poland), Yair Mahalalel (Tel Aviv University, Israel), Rustem Ospanov (University of Texas, USA), Xavier Prudent (LAPP Annecy, France), Arnaud Robert (LPNHE Paris, France), Christoph Rosemann (DESY), Doug Schouten (S. Fraser U., Canada), Fredrik Tegenfeldt (Iowa University, USA, until Aug 2007), Alexander Voigt (CERN, Switzerland), Kai Voss (University of Victoria, Canada), Marcin Wolter (IFJ PAN Krakow, Poland), Andrzej Zemla (IFJ PAN Krakow, Poland), Jiahang Zhong (Academia Sinica, Taipei).

  22. Spare Slides

  23. A complete TMVA training/testing session

void TMVAnalysis()
{
   TFile* outputFile = TFile::Open( "TMVA.root", "RECREATE" );

   // Create Factory
   TMVA::Factory *factory = new TMVA::Factory( "MVAnalysis", outputFile, "!V" );

   TFile *input = TFile::Open( "tmva_example.root" );

   // Add variables / targets
   factory->AddVariable( "var1+var2", 'F' );
   factory->AddVariable( "var1-var2", 'F' );
   //factory->AddTarget( "tarval", 'F' );

   // Initialize trees
   factory->AddSignalTree    ( (TTree*)input->Get("TreeS"), 1.0 );
   factory->AddBackgroundTree( (TTree*)input->Get("TreeB"), 1.0 );
   //factory->AddRegressionTree( (TTree*)input->Get("regTree"), 1.0 );
   factory->PrepareTrainingAndTestTree( "", "",
      "nTrain_Signal=200:nTrain_Background=200:nTest_Signal=200:nTest_Background=200:NormMode=None" );

   // Book MVA methods
   factory->BookMethod( TMVA::Types::kLikelihood, "Likelihood",
      "!V:!TransformOutput:Spline=2:NSmooth=5:NAvEvtPerBin=50" );
   factory->BookMethod( TMVA::Types::kMLP, "MLP",
      "!V:NCycles=200:HiddenLayers=N+1,N:TestRate=5" );

   // Train, test and evaluate
   factory->TrainAllMethods();   // factory->TrainAllMethodsForRegression(); for regression
   factory->TestAllMethods();
   factory->EvaluateAllMethods();

   outputFile->Close();
   delete factory;
}
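To run the session and inspect the results, the macro can be executed directly in ROOT, and the output file browsed with the TMVAGui macro shipped with TMVA. A sketch, assuming a ROOT 5.30-era installation (the location of TMVAGui.C may differ between versions):

root -l TMVAnalysis.C
root -l $ROOTSYS/tmva/test/TMVAGui.C

The GUI reads TMVA.root (the default file name) and displays ROC curves, classifier output distributions, input-variable correlations and more.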

  24. What is a multi-variate analysis? • "Combine" all input variables into one output variable • Supervised learning means learning by example: the program extracts patterns from the training data (Figure: input variables → classifier → output)

  25. Metaclassifiers: Category Classifier and Boosting • The Category classifier is custom-made for HEP • Use different classifiers for different phase-space regions and combine them into a single output (see the sketch below) • TMVA supports boosting for all classifiers • Use a collection of "weak learners" to improve their performance (boosted Fisher, boosted neural nets with few neurons each, …)
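A minimal sketch of booking the Category classifier, assuming a configured factory; the splitting variable eta, the cut boundary and the sub-method choices are hypothetical (requires #include "TMVA/MethodCategory.h"):

// Book the Category metaclassifier and assign one sub-classifier
// per phase-space region (here: a hypothetical barrel/endcap split in |eta|)
TMVA::MethodCategory* category = dynamic_cast<TMVA::MethodCategory*>(
   factory->BookMethod( TMVA::Types::kCategory, "Category", "" ) );

// Fisher discriminant in the barrel region
category->AddMethod( "abs(eta)<=1.3", "var1:var2:var3:var4",
                     TMVA::Types::kFisher, "Fisher_Barrel", "!H:!V:Fisher" );
// Neural net in the endcap region
category->AddMethod( "abs(eta)>1.3", "var1:var2:var3:var4",
                     TMVA::Types::kMLP, "MLP_Endcap", "!H:!V:HiddenLayers=N+1" );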
