
Status of TMVA

the Toolkit for MultiVariate Analysis

Eckhard von Toerne (University of Bonn)

For the TMVA core developer team:

A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E.v.T., H. Voss



Outline

  • Overview

  • New developments

  • Recent physics results that use TMVA

  • Website: http://tmva.sourceforge.net/

  • See also: "TMVA - Toolkit for Multivariate Data Analysis", A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E. v. Toerne, H. Voss et al., arXiv:physics/0703039v5 [physics.data-an]



What is TMVA

  • Supervised learning

  • Classification and regression tasks

  • Easy to train, evaluate and compare various MVA methods

  • Various preprocessing methods (decorrelation, PCA, Gaussianisation, ...); a booking sketch follows below

  • Integrated in ROOT

[Figure: example MVA output distribution]
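As an illustration of the preprocessing options listed above, the following is a minimal sketch of booking classifiers with different input-variable transformations via the VarTransform option. It assumes a Factory set up as in the complete training example in the spare slides; the method labels are placeholders.

// Sketch only: book classifiers with different input transformations.
// Assumes 'factory' is a TMVA::Factory configured as in the spare-slide example.
factory->BookMethod( TMVA::Types::kLikelihood, "LikelihoodD",
                     "!V:VarTransform=Decorrelate" );   // decorrelate inputs
factory->BookMethod( TMVA::Types::kLikelihood, "LikelihoodPCA",
                     "!V:VarTransform=PCA" );           // principal component analysis
factory->BookMethod( TMVA::Types::kFisher, "FisherG",
                     "!V:VarTransform=Gauss" );         // Gaussianise inputs first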



TMVA workflow

  • Training:

    • Classification:

      Learn the features of the different event classes from a sample with known signal/background composition

    • Regression:

      Learn the functional dependence between input variables and targets

  • Testing:

    • Evaluate the performance of the trained classifier/regressor on an independent test sample

    • Compare different methods

  • Application:

    • Apply the classifier/regressor to real data (see the application sketch below)
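To make the application step concrete, here is a minimal sketch of the standard TMVA::Reader usage; the variable expressions and the weight-file name are assumptions chosen to match the training example in the spare slides and would need to match the actual training.

#include "TMVA/Reader.h"

// Sketch only: apply a trained classifier to new data with TMVA::Reader.
// Variable expressions and weight-file name are placeholders.
void ApplyMVA()
{
   Float_t sum = 0, diff = 0;                          // buffers filled per event
   TMVA::Reader* reader = new TMVA::Reader( "!Color:!Silent" );
   reader->AddVariable( "var1+var2", &sum );           // same expressions as in training
   reader->AddVariable( "var1-var2", &diff );
   reader->BookMVA( "MLP", "weights/MVAnalysis_MLP.weights.xml" );

   // ... in the event loop: fill 'sum' and 'diff' from the data, then:
   Double_t mvaValue = reader->EvaluateMVA( "MLP" );   // cut on mvaValue to select events

   delete reader;
}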



[Figure: three toy examples of decision boundaries separating the hypotheses H0 and H1 in the (x1, x2) plane]

Classification/Regression

Classification of signal/background – how to find the best decision boundary?

Regression – how to determine the correct model?



How to choose a method?

Training sample with only a few events? → The number of "parameters" must be limited → use a linear classifier or FDA, a small BDT, or a small MLP

Variables are uncorrelated (or have only linear correlations) → Likelihood

I just want something simple → use Cuts, LD, Fisher

Methods for complex problems → use BDT, MLP, SVM

(A sketch of booking several of these methods for comparison follows the acronym list below.)

List of acronyms:

BDT = boosted decision tree, see manual page 103

ANN = artificial neural network

MLP = multi-layer perceptron, a specific form of ANN, also the name of our flagship ANN, manual p. 92

FDA = function discriminant analysis, see manual p. 87

LD = linear discriminant, manual p. 85

SVM = support vector machine, manual p. 98; currently available only for classification

Cuts = as in "cut selection", manual p. 56

Fisher = Fisher discriminant (after Ronald A. Fisher), similar to LD, manual p. 83
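As mentioned above, here is a minimal sketch of booking several of these methods side by side for comparison. The option strings are illustrative defaults rather than tuned settings, and 'factory' is assumed to be set up as in the full example in the spare slides.

// Sketch only: book several methods on one factory to compare them directly.
factory->BookMethod( TMVA::Types::kCuts, "Cuts", "!V:FitMethod=MC:EffSel" );
factory->BookMethod( TMVA::Types::kLD,   "LD",   "!V" );
factory->BookMethod( TMVA::Types::kBDT,  "BDT",  "!V:NTrees=400:MaxDepth=3" );
factory->BookMethod( TMVA::Types::kMLP,  "MLP",  "!V:NCycles=500:HiddenLayers=N+5" );
// TrainAllMethods/TestAllMethods/EvaluateAllMethods then produce the comparison plots.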



[Figure: schematic of a feed-forward MLP – Nvar discriminating input variables, one input layer, k hidden layers with M1 … Mk neurons, one output layer with 2 output classes (signal and background); the neuron response is given by an "activation" function]

Artificial Neural Networks

  • Modelling of arbitrary nonlinear functions as a nonlinear combination of simple "neuron activation functions" (see the response formula below)

  • Advantages:

    • very flexible, no assumption about the function necessary

  • Disadvantages:

    • "black box"

    • needs tuning

    • seed dependent

Feed-forward Multilayer Perceptron
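For reference, the response of a feed-forward MLP with a single hidden layer of M1 neurons can be written schematically as below; A denotes the neuron activation function (e.g. a sigmoid or tanh), and the weight notation is generic rather than TMVA-specific.

\[
  y_{\mathrm{ANN}}(\mathbf{x}) \;=\; A\Big( w^{(2)}_{0} + \sum_{j=1}^{M_1} w^{(2)}_{j}\, A\big( w^{(1)}_{0j} + \sum_{i=1}^{N_{\mathrm{var}}} w^{(1)}_{ij}\, x_i \big) \Big)
\]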



Boosted Decision Trees

  • Grow a forest of decision trees and determine the event class/target by majority vote

  • Weights of misclassified events are increased in the next iteration (see the sketch below)

  • Advantages:

    • ignores weak variables

    • works out of the box

  • Disadvantages:

    • vulnerable to overtraining
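To make the reweighting step concrete, the AdaBoost update can be sketched as follows; this is a generic statement of the algorithm (TMVA's BDT exposes the exponent β through the AdaBoostBeta option), not the exact TMVA implementation. A tree with weighted misclassification rate err leads to

\[
  \alpha = \Big( \frac{1-\mathrm{err}}{\mathrm{err}} \Big)^{\beta}, \qquad
  w_i \;\to\; w_i \cdot \alpha \ \ \text{for misclassified events } i,
\]

after which all event weights are renormalised so that they sum to one.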



No Single Best Classifier…

The properties of the Function discriminant (FDA) depend on the chosen function



Neyman-Pearson Lemma


Neyman-Pearson:

The Likelihood ratio used as “selection criterion” y(x) gives for each selection efficiency the best possible background rejection.

i.e. it maximizes the area under the "Receiver Operating Characteristic" (ROC) curve.

[Figure: ROC curve, background rejection (1 − εbackgr.) versus signal efficiency (εsignal); the diagonal corresponds to random guessing, curves reaching towards the upper right corner indicate better classification, and the achievable limit is set by the likelihood ratio; moving along the curve trades type-1 against type-2 errors]
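Written out, the likelihood ratio used as the test statistic is, with P(x|S) and P(x|B) the signal and background densities of the input variables x,

\[
  y(\mathbf{x}) \;=\; \frac{P(\mathbf{x}\,|\,S)}{P(\mathbf{x}\,|\,B)} .
\]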

  • Varying the requirement y(x) > "cut" moves the working point (efficiency and purity) along the ROC curve

  • How to choose the "cut"? → need to know the prior probabilities (S, B abundances)

    • Measurement of a signal cross section: maximum of S/√(S+B) or, equivalently, √(ε·p)

    • Discovery of a signal: maximum of S/√B

    • Precision measurement: high purity (p)

    • Trigger selection: high efficiency (ε)



Performance with toy data

  • 3-dimensional distribution

    • Signal: sum of Gaussians

    • Background: flat

  • Theoretical limit calculated using the Neyman-Pearson lemma

  • Neural net (MLP) with two hidden layers and backpropagation training. The Bayesian option has little influence on high-statistics training

  • The TMVA ANN converges towards the theoretical limit for sufficient Ntrain (~100k)



Recent developments

  • Current version: TMVA 4.1.2, included in ROOT release 5.30

  • Unit test framework for daily software and method performance validation (C. Rosemann, E.v.T.)

  • Multiclass classification for MLP, BDTG, FDA (see the sketch after this list)

  • BDT automatic parameter optimization for building the tree architecture

  • New method to treat data with distinct sub-populations (method Category)

  • Optional Bayesian treatment of ANN weights in MLP with back-propagation (Jiahang Zhong)

  • Extended PDEFoam functionality (A. Voigt)

  • Variable transformations on a user-defined subset of variables
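As referenced in the list above, here is a minimal sketch of how a multiclass training could be set up with the TMVA 4.1 interface. The file name, tree names, class labels and variables are illustrative placeholders, and the option strings are untuned defaults.

// Sketch only: multiclass classification setup (TMVA >= 4.1).
// File, tree and class names are illustrative placeholders.
void TMVAMulticlassSketch()
{
   TFile* outputFile = TFile::Open( "TMVAMulticlass.root", "RECREATE" );
   TMVA::Factory* factory = new TMVA::Factory( "TMVAMulticlass", outputFile,
                                               "!V:AnalysisType=Multiclass" );

   TFile* input = TFile::Open( "multiclass_example.root" );    // placeholder input file
   factory->AddTree( (TTree*)input->Get("TreeA"), "ClassA" );  // one tree per event class
   factory->AddTree( (TTree*)input->Get("TreeB"), "ClassB" );
   factory->AddTree( (TTree*)input->Get("TreeC"), "ClassC" );

   factory->AddVariable( "var1", 'F' );
   factory->AddVariable( "var2", 'F' );
   factory->PrepareTrainingAndTestTree( "", "SplitMode=Random:NormMode=NumEvents" );

   factory->BookMethod( TMVA::Types::kMLP, "MLP", "!V:NCycles=300:HiddenLayers=N+5" );

   factory->TrainAllMethods();
   factory->TestAllMethods();
   factory->EvaluateAllMethods();

   outputFile->Close();
   delete factory;
}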



C. Rosemann, E.v.T.

Unit test

  • Automated framework to verify functionality and performance (ours is based on B. Eckel's description)

  • A slimmed version runs every night on various operating systems

***************************************************************

* TMVA - U N I T test : Summary *

***************************************************************

Test 0 : Event [107/107]....................................OK

Test 1 : VariableInfo [31/31]...............................OK

Test 2 : DataSetInfo [20/20]................................OK

Test 3 : DataSet [15/15]....................................OK

Test 4 : Factory [16/16]....................................OK

Test 7 : LDN_selVar_Gauss [4/4].............................OK

....

Test 107 : BoostedPDEFoam [4/4]..............................OK

Test 108 : BoostedDTPDEFoam [4/4]............................OK

Total number of failures: 0

***************************************************************



And now: switching from statistics to physics

… acknowledging the hard work of our users



Review of recent results

BDT trained on individual mH samples with 10 variables. Expect 1.47 signal events at mH = 160 GeV,

compared to 1.27 events with a cut-based analysis (on 4 variables) and the same background.


CMS Coll., H→WW search, Phys.Lett.B699:25-47,2011.



Review of recent results

Super-Kamiokande Coll., "Kinematic reconstruction of atmospheric neutrino events in a large water Cherenkov detector with proton identification", Phys.Rev.D79:112010,2009.

7 input variables; MLP with one hidden layer

[Figure: MLP output distributions for signal and background]



Review of recent results

  • CDF Coll., "First Observation of Electroweak Single Top Quark Production", Phys.Rev.Lett.103:092002,2009.

  • BDT analysis with ~20 input variables

  • lepton + missing ET + jets

  • Results for s+t-channel



Review of recent results using TMVA

  • CDF+D0 combined Higgs working group, hep-ex/1107.5518. (SVM)

  • CMS Coll., H→WW search, Phys.Lett.B699:25-47,2011. (BDT)

  • IceCube Coll., astro-ph/1101.1692. (MLP)

  • D0 Coll., top pairs, Phys.Rev.D84:012008,2011. (BDT)

  • IceCube Coll., Phys.Rev.D83:012001,2011. (BDT)

  • IceCube Coll., Phys.Rev.D82:112003,2010. (BDT)

  • D0 Coll., Higgs search, Phys.Rev.Lett.105:251801,2010. (BDT)

  • CDF Coll., single top, Phys.Rev.D82:112005,2010. (BDT)

  • D0 Coll., single top, Phys.Lett.B690:5-14,2010. (BDT)

  • D0 Coll., top pairs, Phys.Rev.D82:032002,2010. (Likelihood)

  • CDF Coll., single top obs., Phys.Rev.Lett.103:092002,2009. (BDT)

  • Super-Kamiokande Coll., Phys.Rev.D79:112010,2009. (MLP)

  • BABAR Coll., Phys.Rev.D79:051101,2009. (BDT)

  • + other papers

  • + several ATLAS papers with TMVA about to come out…

  • + many ATLAS results about to be published…



Thank you for using TMVA!



Summary

  • TMVA is a versatile package for classification and regression tasks

  • Integrated into ROOT

  • Easy to train classifiers/regression methods

  • A multitude of physics results based on TMVA are coming out

  • Thank you for your attention!



Credits

  • TMVA is open source software

  • Use and redistribution of the source is permitted according to the terms of the BSD license

  • Several similar data mining efforts with rising importance in most fields of science and industry

The following have contributed to TMVA: Andreas Hoecker (CERN, Switzerland), Jörg Stelzer (CERN, Switzerland), Peter Speckmayer (CERN, Switzerland), Jan Therhaag (Universität Bonn, Germany), Eckhard von Toerne (Universität Bonn, Germany), Helge Voss (MPI für Kernphysik Heidelberg, Germany), Moritz Backes (Geneva University, Switzerland), Tancredi Carli (CERN, Switzerland), Asen Christov (Universität Freiburg, Germany), Or Cohen (CERN, Switzerland and Weizmann, Israel), Krzysztof Danielowski (IFJ and AGH/UJ, Krakow, Poland), Dominik Dannheim (CERN, Switzerland), Sophie Henrot-Versille (LAL Orsay, France), Matthew Jachowski (Stanford University, USA), Kamil Kraszewski (IFJ and AGH/UJ, Krakow, Poland), Attila Krasznahorkay Jr. (CERN, Switzerland, and Manchester U., UK), Maciej Kruk (IFJ and AGH/UJ, Krakow, Poland), Yair Mahalalel (Tel Aviv University, Israel), Rustem Ospanov (University of Texas, USA), Xavier Prudent (LAPP Annecy, France), Arnaud Robert (LPNHE Paris, France), Christoph Rosemann (DESY), Doug Schouten (S. Fraser U., Canada), Fredrik Tegenfeldt (Iowa University, USA, until Aug 2007), Alexander Voigt (CERN, Switzerland), Kai Voss (University of Victoria, Canada), Marcin Wolter (IFJ PAN Krakow, Poland), Andrzej Zemla (IFJ PAN Krakow, Poland), Jiahang Zhong (Academia Sinica, Taipei).



Spare Slides



A complete TMVA training/testing session

void TMVAnalysis( )
{
   // Create Factory
   TFile* outputFile = TFile::Open( "TMVA.root", "RECREATE" );
   TMVA::Factory* factory = new TMVA::Factory( "MVAnalysis", outputFile, "!V" );

   // Add variables / targets
   TFile* input = TFile::Open( "tmva_example.root" );
   factory->AddVariable( "var1+var2", 'F' );
   factory->AddVariable( "var1-var2", 'F' );   // factory->AddTarget( "tarval", 'F' );

   // Initialize trees
   factory->AddSignalTree    ( (TTree*)input->Get("TreeS"), 1.0 );
   factory->AddBackgroundTree( (TTree*)input->Get("TreeB"), 1.0 );
   // factory->AddRegressionTree( (TTree*)input->Get("regTree"), 1.0 );
   factory->PrepareTrainingAndTestTree( "", "",
      "nTrain_Signal=200:nTrain_Background=200:nTest_Signal=200:nTest_Background=200:NormMode=None" );

   // Book MVA methods
   factory->BookMethod( TMVA::Types::kLikelihood, "Likelihood",
      "!V:!TransformOutput:Spline=2:NSmooth=5:NAvEvtPerBin=50" );
   factory->BookMethod( TMVA::Types::kMLP, "MLP",
      "!V:NCycles=200:HiddenLayers=N+1,N:TestRate=5" );

   // Train, test and evaluate
   factory->TrainAllMethods();   // factory->TrainAllMethodsForRegression();
   factory->TestAllMethods();
   factory->EvaluateAllMethods();

   outputFile->Close();
   delete factory;
}
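Such a macro is typically run from the ROOT prompt (assuming it is saved as TMVAnalysis.C):

root -l TMVAnalysis.C

The resulting TMVA.root output file can then be inspected with the TMVAGui macro shipped with TMVA.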



What is a multi-variate analysis?

"Combine" all input variables into one output variable

Supervised learning means learning by example: the program extracts patterns from training data

[Diagram: input variables → classifier → output]
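Schematically, with x1 … xNvar the input variables, any such classifier is a single mapping (a generic statement, independent of the specific method):

\[
  y \;=\; f(x_1, \dots, x_{N_{\mathrm{var}}}) : \ \mathbb{R}^{N_{\mathrm{var}}} \to \mathbb{R}
\]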



Metaclassifiers – Category Classifier and Boosting

  • The category classifier is custom-made for HEP (see the sketch below)

    • Use different classifiers for different phase-space regions and combine them into a single output

  • TMVA supports boosting for all classifiers

    • Use a collection of "weak learners" to improve their performance

      (boosted Fisher, boosted neural nets with few neurons each, …)
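A minimal sketch of how the category classifier can be booked, modeled on the TMVA category example; the cut variable abs(eta), the variable lists and the sub-classifier choices are placeholders, and 'factory' is assumed to be set up as in the full example above.

#include "TMVA/MethodCategory.h"

// Sketch only: one sub-classifier per phase-space region, combined into a
// single output. 'eta' must be known to the factory (e.g. via AddSpectator);
// the cut, variables and sub-methods are placeholders.
TMVA::IMethod* cat = factory->BookMethod( TMVA::Types::kCategory, "Category", "" );
TMVA::MethodCategory* mcat = dynamic_cast<TMVA::MethodCategory*>( cat );

mcat->AddMethod( "abs(eta)<=1.3", "var1:var2:var3",
                 TMVA::Types::kFisher, "Fisher_Barrel", "!V:Fisher" );
mcat->AddMethod( "abs(eta)>1.3", "var1:var2:var3",
                 TMVA::Types::kMLP, "MLP_Endcap", "!V:NCycles=300:HiddenLayers=N+3" );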

