
  1. RooStatsCms: a tool for analyses modelling, combination and statistical studies
  D. Piparo, G. Schott, G. Quast
  Institut für Experimentelle Kernphysik, Universität Karlsruhe

  2. Outline • The need for a tool • RooStatsCms (RSC) • A RooFit interlude • The three parts • Modelling • The datacard • Inspect your model • Statistical studies and limits • Profile Likelihood • Hypothesis separation and “modified frequentist approach” • Exclusion • Plotting classes 19.11.08

  3. The need for a tool • No pre-existing structured statistics software framework in CMS: G. Quast, G. Schott and D. Piparo developed RooStatsCms. NEEDS: • Reliable implementation of multiple statistical methods • Combination of analyses: • Stronger limits on quantities like the Higgs production cross section, mass, ... • Do not replace existing analyses but complement their results • Easy user interface • Satisfactory documentation (no black boxes) • Examples and tutorials

  4. RooStatsCms • Originally conceived for the CMS Higgs Working Group and as a CMS (EKP) exclusive product • Based on RooFit (part of the ROOT distribution) • Three parts: • Modelling and combination • Statistical methods • Advanced graphics routines • It comes with CINT dictionaries (macros, interactive ROOT) • Available to CMS and EKP at: www-ekp.physik.uni-karlsruhe.de/~RooStatsCms • Visit our wiki for username and password • Statistical methods and graphics routines are public: www-ekp.physik.uni-karlsruhe.de/~RooStatsKarlsruhe • Big effort on documentation: • RSC website and Doxygen for every class, method and member • Wiki pages with links to RSC presentations (~15) and workshop • https://twiki.cern.ch/twiki/bin/view/CMS/HiggsWGRooStatsCms • http://www-ekp.physik.uni-karlsruhe.de/~twiki/bin/view/EkpCms/RooStatsCms • An internal CMS note in preparation

  5. RooStatsCms - structure 1/2 • Class-design-wise structure • Already 33 classes! • All of them inherit from TObject: persistency and reflection • Moreover: • Programs to compile • Macros for the interpreter • Various utilities in the Rsc namespace (TH1F median, ...)

  6. RooStatsCms - structure 2/2 • Directory-wise structure • Structure “à la CMSSW”: ready to compile in the CMS framework with a newer RooFit

  7. RooFit interlude: overture • Toolkit for data modelling • Model the distribution of an observable x in terms of • the parameter of interest p • other parameters q describing detector effects (resolution, efficiency) • Probability density function (pdf) F(x; p, q) • normalized over the range of the observable x w.r.t. the parameters p and q • RooFit provides the functionality for • building these probability density functions • scalable to complex models • maximum likelihood fitting (binned and unbinned) • visualization of the pdf • toy MC generation

  8. RooFit interlude: functionality • Package originally developed for BaBar analyses (by W. Verkerke and D. Kirkby) • actively maintained by W. Verkerke in view of the LHC analyses • Web site: http://roofit.sourceforge.net • Much of the material shown is taken from Wouter’s presentations • see the 200 slides presented at the French statistics school (http://sos.in2p3.fr) • Users Manual on the ROOT site: ftp://root.cern.ch/root/doc/RooFit_Users_Manual_2.91-33.pdf

  9. RooFit interlude: design • Mathematical entities are represented as C++ objects

  10. RooFit interlude: an example • Gaussian pdf • MC data generation • Maximum likelihood fit on the data
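The three steps on the slide (build a Gaussian model, generate toy data, fit by maximum likelihood) can be sketched without ROOT installed. Below is a library-free C++ analogue, a minimal sketch rather than the RooFit API: for a Gaussian, the unbinned MLE has the closed form of the sample mean and sample RMS. In an actual RooFit macro the same steps would use RooRealVar, RooGaussian, generate() and fitTo().

```cpp
#include <cmath>
#include <random>
#include <vector>

// Toy "model": a Gaussian pdf with given true parameters.
// Steps 1+2: generate a toy MC dataset from the model.
std::vector<double> generate_toys(double mean, double sigma, int n, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(mean, sigma);
    std::vector<double> data(n);
    for (double& x : data) x = gauss(rng);
    return data;
}

// Step 3: unbinned maximum likelihood fit. For a Gaussian the MLE
// is analytic: mean-hat = sample mean, sigma-hat = sample RMS.
void ml_fit(const std::vector<double>& data, double& mean_hat, double& sigma_hat) {
    double sum = 0.0, sum2 = 0.0;
    for (double x : data) { sum += x; sum2 += x * x; }
    const double n = static_cast<double>(data.size());
    mean_hat = sum / n;
    sigma_hat = std::sqrt(sum2 / n - mean_hat * mean_hat);
}
```

With, say, 10000 generated events, the fitted values land within a few percent of the true parameters, which is exactly what the RooFit example on the slide demonstrates graphically.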

  11. RSC: A solid tool • RSC is in its “production phase”: • Around since the beginning of 2008 • Workshop at CERN in June • Approved results: http://cms-physics.web.cern.ch/cms-physics/public/HIG-08-008-pas.pdf • Results coming soon: HIG-008-06 HWW • The CMS statistics committee blessed the tool (internal note in preparation) • Grégory in permanent contact with them • Interest from other working groups • Negotiations for integration in the CMS software framework (CMSSW) • Base of a common tool with ATLAS • Work in progress: the first commits to ROOT are taking place • New manpower: Mario Pelliccioni (former BaBar) from Università di Torino • Made at EKP (Quast, Schott, Piparo): • Personal assistance on the 8th floor!

  12. RSC: Is it hard to try? Straightforward to get started on ekpcms3:

  wget -O RooStatsCms.tar.gz http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/CMSSW/HiggsAnalysis/RooStatsCms.tar.gz?view=tar\&pathrev=V00-04-00
  tar -zxf RooStatsCms.tar.gz
  cd RooStatsCms
  source /home/piparo/set_root_RSC_environment.sh
  source scripts/RSCenv.sh
  make
  make exe
  cd macros/examples/
  root profilelikelihood_htt.cxx
  root qqhtt_-2lnQ_distributions.cxx

  See also www-ekp.physik.uni-karlsruhe.de/~RooStatsCms for detailed instructions

  13. RSC in one slide • (Cartoon: “A priori, I frequently believe I am in between ...” on one side, the statisticians on the other.) • RooStatsCms tries to put you somehow “in between”...

  14. The Three Parts • Analyses modeling and combination • Statistical Methods and limits • Graphics routines

  15. Analyses modeling and combination • Modelling based on the datacard concept • Build a complete combined analysis model from ASCII datacards • Background and signal components of each analysis • Shapes from parametrisations or histograms • Constraints and their correlations • Basic syntax: include, if, ... • Two lines of C++ to produce the RooFit pdf:

  RscCombinedModel mymodel("hzz4l");
  RooAbsPdf* sb_pdf = mymodel.getPdf();

  • Datacard advantages: • Automatic bookkeeping of what is done • Factorises the model out of the C++ code • Easy to share

  16. RSC – Modelling 2/2 • Yields can be expressed as products of different terms, for example

  Yield = BR · ε · σ_H · Lumi

  • Branching ratios • Efficiencies • Cross section • Luminosity • Systematics can be included on each term • Terms can be related from one analysis to the other with correlations

  17. An example datacard: counting

  #################################
  # The combined model
  #################################
  // Here we specify the names of the models
  // built down in the card that we want
  // to be combined
  include HZZ_4mu.rsc
  include HZZ_4e.rsc
  include HZZ_2mu2e.rsc

  [hzz4l]
  model = combined
  components = hzz_4mu, hzz_4e, hzz_2mu2e

  #################################
  # H -> ZZ -> 4mu
  #################################
  [hzz_4mu]
  variables = x
  x = 0 L(0 - 1)

  [hzz_4mu_sig]
  hzz_4mu_sig_yield = 62.78 L(0 - 200)

  [hzz_4mu_sig_x]
  model = yieldonly

  [hzz_4mu_bkg]
  yield_factors_number = 2
  yield_factor_1 = scale
  scale = 1 L(0 - 3)
  scale_constraint = Gaussian,1,0.041
  yield_factor_2 = bkg_4mu
  bkg_4mu = 19.93 C

  [hzz_4mu_bkg_x]
  model = yieldonly

  Callouts on the slide: the combined model; the basic syntax; comments; the variable; the signal component description (yield and model); the background component description (a yield made of different terms); the constraint syntax: <type>,par1,par2. See the RscBaseModel and RscCombinedModel documentation for a complete description.

  18. An example datacard: shapes

  [hgg_cat0]
  variables = mh
  mh = 115 L(90 - 180) // [GeV/c^{2}]

  [hgg_cat0_sig]
  yield_factors_number = 3
  yield_factor_1 = lumi
  lumi = 1 C
  yield_factor_2 = n_events_hgg_115_cat0_sig
  n_events_hgg_cat0_sig = 3.9577
  yield_factor_3 = scale_sig
  scale_sig = 1 L(0 - 5)

  [hgg_cat0_sig_mh]
  model = fourGaussians
  hgg_115_cat0_sig_mh_mean1 = 114.654 +/- 0.107106 C
  hgg_115_cat0_sig_mh_mean2 = 115.146 +/- 2.37687 C
  hgg_115_cat0_sig_mh_mean3 = 114.12 +/- 0.581539 C
  hgg_115_cat0_sig_mh_mean4 = 109.979 +/- 11.036 C
  hgg_115_cat0_sig_mh_sigma1 = 0.6075 +/- 0.0888951 C
  hgg_115_cat0_sig_mh_sigma2 = 0.601995 +/- 129.141 C
  hgg_115_cat0_sig_mh_sigma3 = 2.1119 +/- 0.526549 C
  hgg_115_cat0_sig_mh_sigma4 = 8.16619 +/- 7.75118 C
  hgg_115_cat0_sig_mh_frac1 = 0.999893 +/- 0.500053 C
  hgg_115_cat0_sig_mh_frac2 = 0.762761 +/- 0.0870296 C
  hgg_115_cat0_sig_mh_frac3 = 0.98815 +/- 0.0207781 C

  [hgg_cat0_bkg]
  number_components = 2
  yield_factors_number = 3
  yield_factor_1 = lumi
  lumi = 1 C
  yield_factor_2 = n_events_hgg_115_cat0_bkg
  n_events_hgg_cat0_bkg = 988.389
  yield_factor_3 = scale_bkg
  scale_bkg = 1 L(0 - 5)

  [hgg_cat0_bkg1]
  qqhtt_bkg1_yield = 1 C

  [hgg_cat0_bkg2]
  qqhtt_bkg2_yield = 1.35 C

  [hgg_cat0_bkg1_mh]
  model = doubleGaussian
  hgg_cat0_bkg_mh_mean1 = 52.3484 +/- 14.1593 C
  hgg_cat0_bkg_mh_mean2 = 158.962 +/- 3.21153 C
  hgg_cat0_bkg_mh_sigma1 = 27.1791 +/- 2.37455 C
  hgg_cat0_bkg_mh_sigma2 = 74.9328 +/- 70.6298 C
  hgg_cat0_bkg_mh_frac = 0.924937 +/- 0.0347411 C

  [hgg_cat0_bkg2_mh]
  model = histo
  hgg_cat0_bkg2_mh_fileName = htt_inputs.root
  hgg_cat0_bkg2_mh_name = background

  Callouts on the slide: multiple components; histogram and parametric models mixed; comments. Combination of combined models (counting combined with shape analyses):

  // The combined model of HZZ and Hgg
  include hzz_combined.rsc
  include hgg_12_categories.rsc

  [hgg_hzz_combined]
  model = combined
  components = hzz, hgg_cat0, hgg_cat1, ..., hgg_cat11

  19. A combination • Combination of CMS H→γγ and H→ZZ (3 modes) at 30 fb-1 • Perform a simultaneous analysis of Higgs channels: • for each analysis, each data sample is fitted simultaneously with its own signal and background model • combination of number-counting and distribution-based analyses • Significance: sqrt(2lnQ) • Various analyses • Comparison between the PTDR and RSC

  20. More on constraints • The “same name, same pointer” principle (100% correlation) • The same name in the card → the same object in the model • Common luminosity, cross sections • Partial correlation among Gaussian constraints: constraint blocks

  [combined_120_constraints_block_1]
  correlation_variable1 = hww_mm_120_bkg_yield
  correlation_variable2 = hww_ee_120_bkg_yield
  correlation_variable3 = hww_em_120_bkg_yield
  correlation_value1 = 0.80 C
  correlation_value2 = 0.72 C
  correlation_value3 = 0.15 C

  [combined_120_constraints_block_2]
  ............

  • Correlated variables and correlation coefficients: as many blocks as needed!
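Under the hood, drawing correlated Gaussian nuisance parameters amounts to applying a Cholesky factor of the correlation matrix to independent normal draws. A minimal two-variable sketch (not RSC code; only the 0.80 coefficient is taken from the block above, the rest is illustrative):

```cpp
#include <cmath>
#include <random>

// Draw a pair of unit Gaussians with correlation rho using the
// 2x2 Cholesky factor: y1 = z1, y2 = rho*z1 + sqrt(1-rho^2)*z2.
void correlated_pair(double rho, std::mt19937& rng, double& y1, double& y2) {
    std::normal_distribution<double> n01(0.0, 1.0);
    const double z1 = n01(rng), z2 = n01(rng);
    y1 = z1;
    y2 = rho * z1 + std::sqrt(1.0 - rho * rho) * z2;
}

// Empirical correlation of n draws, to check the construction.
double sample_correlation(double rho, int n, unsigned seed) {
    std::mt19937 rng(seed);
    double s1 = 0, s2 = 0, s11 = 0, s22 = 0, s12 = 0;
    for (int i = 0; i < n; ++i) {
        double y1, y2;
        correlated_pair(rho, rng, y1, y2);
        s1 += y1; s2 += y2; s11 += y1 * y1; s22 += y2 * y2; s12 += y1 * y2;
    }
    const double m1 = s1 / n, m2 = s2 / n;
    return (s12 / n - m1 * m2) /
           std::sqrt((s11 / n - m1 * m1) * (s22 / n - m2 * m2));
}
```

The same construction generalises to the three-variable block on the slide with a 3x3 Cholesky decomposition.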

  21. Analyses model structure • The classes mirror the structure of the analysis model: • RscCombinedModel: the analysis combination • RscTotModel: the full analysis (one per analysis in the combination) • RscCompModel: the different components for signal(s) and background(s) (Signal, Bkg1, Bkg2, Bkg3) • RscMultiModel: the model for each discriminating variable (Variable 1, Variable 2) • RscBaseModel: the basic distributions (Histo, Gauss, Poly, “my model”)

  22. Inspect your model • Two programs to use: • Model Diagram: creates a simple graph of the combined model • model_diagram.exe <cardname> <modelname> • Model Html: creates a website to browse your combined model • model_html.exe <cardname> <modelname>

  23. The Three Parts • Analyses modeling and combination • Statistical Methods and limits • Graphics routines

  24. Profile Likelihood - 1/2

  25. Profile Likelihood – 2/2 • The intersection of the likelihood scan with horizontal lines gives upper limits / two-sided intervals • W.J. Metzger, “Statistical Methods in Data Analysis”, Katholieke Universiteit Nijmegen, 2002 • Systematics are taken into account with penalty terms in the likelihoods (profiling) • Likelihood scan: the likelihood is maximised for each scan point; horizontal cuts on the interpolated scan give the interval; θ0 at the minimum: 7.16 +8.1 −5.37 • Minuit uses the same technique to obtain the errors of the fitted parameters • Significance estimator: S = sqrt(2 ln(Lsb/Lb)) → if θ0 is the number of signal events, the scan value at 0 is directly related to S! • See the PLCalculator, PLResults and PLPlot documentation
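In formulae (a standard write-up of the method sketched above, with $\theta$ the parameter of interest and $q$ the nuisance parameters):

```latex
\lambda(\theta) \;=\; \frac{L\bigl(\theta,\hat{\hat{q}}(\theta)\bigr)}{L\bigl(\hat{\theta},\hat{q}\bigr)}
```

where $\hat{\theta},\hat{q}$ maximise $L$ globally and $\hat{\hat{q}}(\theta)$ maximises it at fixed $\theta$ (the penalty terms for the systematics enter $L$ directly). Asymptotically $-2\ln\lambda(\theta)$ follows a $\chi^2_1$ distribution, so the horizontal cuts sit at $-2\ln\lambda = 1$ for a 68.3% interval and at $-2\ln\lambda = 2.71$ for a 95% one-sided upper limit, and when $\theta$ is the signal yield the scan value at $\theta = 0$ gives $S = \sqrt{2\ln(L_{sb}/L_b)}$.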

  26. Systematics - 1/2

  27. Systematics - 2/2

  28. A PL prototype study • A prototype study: the distribution of upper limits using the PL, and a coverage study • Many pseudo-experiments performed for each mass hypothesis • Distribution of the obtained upper limits • Coverage: the fraction of experiments in which the upper limit is indeed greater than the nominal value of the parameter • Easy to do: store PLResults objects in a TTree and loop over it • Overcoverage for low yields: • A well-known feature of the method (Cramér-Fréchet bound) • “Calibrate” the likelihood
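The logic of such a coverage study can be sketched in a few lines for a counting experiment with known background (a toy illustration, not the RSC classes; the yields s=2, b=3 are made up): for each pseudo-experiment, derive the 95% profile-likelihood upper limit on the signal yield and check whether it lies above the true value.

```cpp
#include <algorithm>
#include <cmath>
#include <random>

// Log-likelihood (up to constants) for observing n events with
// expected signal s and known background b.
double log_lik(int n, double s, double b) {
    return n * std::log(s + b) - (s + b);
}

// 95% one-sided PL upper limit: smallest s with
// 2*[lnL(s_hat) - lnL(s)] > 2.71 (chi2_1, one-sided).
double pl_upper_limit(int n, double b) {
    const double s_hat = std::max(0.0, n - b);  // MLE, bounded at 0
    double s = s_hat;
    while (2.0 * (log_lik(n, s_hat, b) - log_lik(n, s, b)) < 2.71)
        s += 0.01;
    return s;
}

// Coverage: the fraction of pseudo-experiments whose upper limit
// lies above the true signal yield.
double coverage(double s_true, double b, int n_toys, unsigned seed) {
    std::mt19937 rng(seed);
    std::poisson_distribution<int> pois(s_true + b);
    int covered = 0;
    for (int i = 0; i < n_toys; ++i)
        if (pl_upper_limit(pois(rng), b) >= s_true) ++covered;
    return static_cast<double>(covered) / n_toys;
}
```

For these low yields the result comes out slightly above the nominal 95%, the overcoverage mentioned on the slide.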

  29. Separation of hypotheses • The analysis of search results can be formulated as a separation of hypotheses: • Identify the observable which comprises the result • Specify a test statistic • Define rules for discovery and exclusion • Use the likelihood ratio Q = Lsb/Lb, assuming the signal+background (“s+b”) and the background-only (“b”) hypotheses, as the test statistic • Consider the “p-values” (also called CLs+b and 1−CLb) of the −2lnQ distributions obtained from s+b and b samples • See: • progs/m2lnq_creator.cpp • qqhtt_-2lnQ_distributions.cxx in macros/examples/ • Bayesian pseudo-integration of the systematics: for every toy MC experiment, before the generation of the toy dataset, the parameters affected by systematics are properly fluctuated once • The distributions are built with toy MC experiments (LimitCalculator / HybridCalculator class)

  30. Modified frequentist method – Significance • CLb: the background confidence level, a measure of the compatibility of the experiment with the b-only hypothesis • 1 − CLb: the probability for a b-only experiment to give a more s+b-like likelihood ratio than the observed one • Correspondence between CLb and the resulting significance (Gaussian approximation): the number of standard deviations of an (assumed) Gaussian distribution of the background • Take CLb assuming the expected s+b yield (i.e. the median −2lnQ of the s+b distribution) • CLs+b: a measure of the compatibility of the experiment with the s+b hypothesis • if it is small (< 5%) the s+b hypothesis can be excluded at more than 95% CL, but this does not mean that the signal hypothesis is excluded at that level • Modified frequentist approach: take the signal confidence level CLs to be CLs ≡ CLs+b / CLb (heavily used by the LEP, HERA and Tevatron experiments)
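For a single-bin counting experiment these quantities reduce to tail probabilities of Poisson counts, since −2lnQ = 2s − 2n ln(1 + s/b) is monotonic in the observed count n. A toy sketch of the hybrid construction (not the LimitCalculator implementation; the numbers s=10, b=5, n_obs=5 below are made up):

```cpp
#include <random>

// Hybrid CLs for a counting experiment. Q = L_{s+b}(n)/L_b(n) is
// monotonically increasing in n, so "at most as s+b-like as the
// observation" simply means n <= n_obs.
struct CLResult { double clsb, clb, cls; };

CLResult hybrid_cls(double s, double b, int n_obs, int n_toys, unsigned seed) {
    std::mt19937 rng(seed);
    std::poisson_distribution<int> toys_sb(s + b), toys_b(b);
    int le_sb = 0, le_b = 0;
    for (int i = 0; i < n_toys; ++i) {
        if (toys_sb(rng) <= n_obs) ++le_sb;  // CLsb = P(n <= n_obs | s+b)
        if (toys_b(rng) <= n_obs) ++le_b;    // CLb  = P(n <= n_obs | b)
    }
    CLResult r;
    r.clsb = static_cast<double>(le_sb) / n_toys;
    r.clb  = static_cast<double>(le_b) / n_toys;
    r.cls  = r.clsb / r.clb;                 // modified frequentist CLs
    return r;
}
```

With s=10, b=5 and n_obs=5 (i.e. observing the expected background), CLs comes out well below 0.05, so that signal yield would be excluded at 95% CL even though CLsb alone would overstate the exclusion.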

  31. The benchmark analysis: H→ττ • Used as a benchmark for the tool • Results approved by the CMS collaboration • Vector boson fusion H→ττ at 1 fb-1 • A small signal on a significant background • No discovery expected with this luminosity • Four mass hypotheses: 115, 125, 135, 145 GeV

  32. H→: Significance • Significance calculated for the H→ analysis using CLb • In this case significance does not tell us much. • The question becomes: • “Which production cross section can I exclude with the data I have?” 19.11.08 CMS Week 32

  33. Modified frequentist method – Exclusion • Assume we observe the expected background (i.e. the median of the background distribution) and no signal • Amplify the SM production cross section by the factor necessary to obtain CLs = 0.05 → a “95% exclusion” • ~80 h on one CPU • ExclusionBandPlot class • Plot annotations: regions with less / more exclusion power than expected; the observed curve is obtained with real data • Bands: • Assume we observe Nb + n·sqrt(Nb), with n = 2, 1, −1, −2 for the −2, −1, 1, 2 sigma band borders respectively • Systematics are taken into account in the distributions of −2lnQ (marginalisation)

  34. How do I find the right ratio? • RSC provides help: • RatioFinder • RatioFinderResults • RatioFinderPlot • Just compile and launch the job(s)! • The plot shows the scan crossing CLs = 0.05
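What a RatioFinder-style scan does can be sketched for the counting case (an illustration with made-up numbers s=3, b=5, n_obs=5, not the RSC implementation): bisect the cross-section scale factor r until CLs(r) = 0.05, here using exact Poisson tail sums instead of toy experiments.

```cpp
#include <cmath>

// P(n <= n_obs) for a Poisson with mean mu, summed directly.
double poisson_cdf(int n_obs, double mu) {
    double term = std::exp(-mu), sum = term;
    for (int n = 1; n <= n_obs; ++n) {
        term *= mu / n;
        sum += term;
    }
    return sum;
}

// CLs when the signal is scaled by r: CLsb/CLb for a counting
// experiment, with CLx = P(n <= n_obs | expectation x).
double cls(double r, double s, double b, int n_obs) {
    return poisson_cdf(n_obs, r * s + b) / poisson_cdf(n_obs, b);
}

// Bisection for the scale factor r with CLs(r) = target.
// CLs decreases monotonically in r, so plain bisection works.
double find_ratio(double s, double b, int n_obs, double target) {
    double lo = 0.0, hi = 50.0;
    for (int i = 0; i < 60; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (cls(mid, s, b, n_obs) > target) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

A toy-based RatioFinder scan follows the same bisection logic, only with the −2lnQ distributions built by pseudo-experiments at each trial value of r.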

  35. Another representation of the information • Use the distributions of the test statistic • See at a glance how the hypotheses are separated • For each mH, the projection of the −2lnQ distribution in the b-only hypothesis

  36. Statistical Methods: class structures • Organisation of the classes of the statistical methods: • Constraints (mother: NLLPenalty): Constraint, ConstrBlock2, ConstrBlock3, ConstrBlockArray • Statistical methods (mother: StatisticalMethod): LimitCalculator (aka HybridCalculator), PLScan, FCCalculator • Statistical method results (mother: StatisticalResult): LimitResults (aka HybridResults), PLScanResults, FCResults • Results can be “summed”: batch/GRID job submission made easier • Statistical plots (mother: StatisticalPlot): LimitPlot (aka HybridPlot), PLScanPlot (also adds FC curves), LEPBandPlot, ExclusionBandPlot

  37. The Three Parts • Analyses modeling and combination • Statistical Methods and limits • Graphics routines

  38. Plots collection

  39. Troubleshooting
  Q: I want to start now. Where do I find the examples?
  A: In the macros directory you find the macros for the interpreter, while in the progs directory you find the programs to compile with the make exe command.
  Q: I don't think I know how to write a datacard. What can I do?
  A: In the macros directory you find some datacards for inspiration. Moreover, check the scripts in the scripts directory: create_card_skeleton.py queries for templated card components, and TDR_HZZ_card_maker.py creates the CMS PTDR H→ZZ→4l cards.
  Q: I compiled RSC but ROOT does not see the dynamic library libRooStatsCms.so. What do I do?
  A: Add the /RooStatsCms/lib dir to your LD_LIBRARY_PATH environment variable. In the scripts directory you have the RSCenv.sh script to set up your environment. Then, in the interpreter, use the command gSystem->Load("libRooStatsCms.so").
  "Q": Still... I cannot get it to work!
  A: Come down to the 8th floor for support!

  40. Conclusions • An intuitive “model factory” • Build the analysis model from an ASCII configuration file, the datacard • The datacard also describes the nuisance parameters (and correlations) • Building of a combined model for a combined analysis • Implementation of nuisance parameters and correlations • Can be marginalised or profiled • Statistical methods: • LimitCalculator (CLb, CLsb, CLs): complete* • PLScan (profile likelihood): complete* • FCCalculator (fully frequentist approach): validation to be completed • Bayesian approach and Markov chains: being investigated • (*) Strong implementation, tested and used by CMS analyses • Batch friendly: decomposition into sub-jobs; results stored in ROOT files • Results can be merged and exploited by the results classes • Plots in a “presentation ready” form easily obtainable
