DC2 C++ decision trees

  • Quick review of classification (or decision) trees

  • Training and testing

  • How Bill does it with Insightful Miner

  • Applications to the “good-energy” trees: how does it compare?

Toby Burnett

Frank Golf

DC2 at GSFC - 28 Jun 05 - T. Burnett



Quick Review of Decision Trees

  • Introduced to GLAST by Bill Atwood, using Insightful Miner

  • Each branch node is a predicate, or cut on a variable, like CalCsIRLn > 4.222

  • If the predicate is true, the event follows the right branch; otherwise the left branch.

  • If there is no branch, the node is a leaf; a leaf contains the purity of the sample that reaches that point

  • Thus the tree defines a function of the event variables used, returning the purity measured on the training sample (see the sketch below)
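
As a minimal illustration of the structure described above, the C++ sketch below represents a tree whose branch nodes hold a predicate (a cut on one named event variable) and whose leaves hold the training-sample purity. The node layout, the event-as-map representation, and the function names are assumptions made for this sketch; only the variable name CalCsIRLn is taken from the slide, and this is not the classifier package's actual interface.

    #include <map>
    #include <memory>
    #include <string>

    // Sketch of a binary decision tree: a branch node applies the predicate
    // "variable > cut"; a leaf stores the purity of the training events that
    // reached it.  Illustrative only.
    struct Node {
        bool isLeaf = false;
        double purity = 0.0;            // meaningful only for leaves
        std::string variable;           // e.g. "CalCsIRLn"
        double cut = 0.0;               // predicate: variable > cut
        std::unique_ptr<Node> left;     // taken when the predicate is false
        std::unique_ptr<Node> right;    // taken when the predicate is true
    };

    // Walk the tree for one event (a map of named variables) and return the
    // leaf purity: the "function of the event variables" that the tree defines.
    double evaluate(const Node& node, const std::map<std::string, double>& event)
    {
        if (node.isLeaf) return node.purity;
        const bool goRight = event.at(node.variable) > node.cut;
        return evaluate(goRight ? *node.right : *node.left, event);
    }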


Training and testing procedure

  • Analyze a training sample containing a mixture of “good” and “bad” events: I use the even-numbered events, so that the odd-numbered events form an independent set for testing

  • Choose a set of variables and, for each, find the optimal cut such that the left and right subsets are purer than the original. Two standard criteria exist for this: “Gini” and entropy; I currently use the former (see the sketch after this list).

    • WS : sum of signal weights

    • WB : sum of background weights

      Gini = 2 WS WB / (WS + WB)

      Thus Gini wants to be small.

      What is actually maximized is the improvement: Gini(parent) - Gini(left child) - Gini(right child)

  • Apply this recursively until a node has too few events (100 for now)

  • Finally test with the odd events: measure purity for each node
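
To make the Gini bookkeeping above concrete, here is a hedged C++ sketch that scans cut values on one candidate variable and returns the cut maximizing Gini(parent) - Gini(left) - Gini(right). The Sample structure and function names are assumptions for illustration, not the classifier package's code; only the Gini formula and the ~100-event stopping rule come from the slide.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // One training event projected onto a single candidate variable.
    struct Sample { double value; double weight; bool isSignal; };

    // Gini = 2 WS WB / (WS + WB); smaller means purer.
    double gini(double ws, double wb)
    {
        const double total = ws + wb;
        return total > 0 ? 2.0 * ws * wb / total : 0.0;
    }

    // Scan cuts on one variable and return the value that maximizes the
    // improvement Gini(parent) - Gini(left) - Gini(right).  Illustrative only.
    double bestCut(std::vector<Sample> samples, double& bestGain)
    {
        std::sort(samples.begin(), samples.end(),
                  [](const Sample& a, const Sample& b) { return a.value < b.value; });

        double wsTotal = 0, wbTotal = 0;
        for (const auto& s : samples) (s.isSignal ? wsTotal : wbTotal) += s.weight;
        const double parent = gini(wsTotal, wbTotal);

        double wsLeft = 0, wbLeft = 0, cut = 0;
        bestGain = 0;
        for (std::size_t i = 0; i + 1 < samples.size(); ++i) {
            (samples[i].isSignal ? wsLeft : wbLeft) += samples[i].weight;
            const double gain = parent
                - gini(wsLeft, wbLeft)
                - gini(wsTotal - wsLeft, wbTotal - wbLeft);
            if (gain > bestGain) {
                bestGain = gain;
                cut = 0.5 * (samples[i].value + samples[i + 1].value);
            }
        }
        return cut;  // recurse on each side until a node has too few (~100) events
    }

In the training procedure above, such a scan would be repeated for every candidate variable at every node, keeping the variable and cut with the largest improvement.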


Evaluate by Comparing with IM results

From Bill’s Rome ’03 talk:

The “good cal” analysis

[Plots: CAL-Low and CAL-High CT probability distributions for All, Good, and Bad event samples]


Compare with Bill, cont

  • Since Rome:

    • Three energy ranges, three trees.

      • CalEnergySum: 5-350; 350-3500; >3500

    • Resulting decision trees implemented in Gleam by a “classification” package: results saved to the IMgoodCal tuple variable.

  • Development until now, for training and comparison, has used the all_gamma sample from v4r2

    • 760 K events with E from 16 MeV to 160 GeV (uniform in log(E)) and 0 < θ < 90° (uniform in cos θ)

    • Contains IMgoodCal for comparison

  • Now:


Bill-type plots for all_gamma


Another way of looking at performance

Define the efficiency as the fraction of good events surviving a given cut on node purity, and determine the bad-event fraction for that same cut (see the sketch below)
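
A small sketch of that bookkeeping, assuming the tree output (leaf purity) and a truth flag are available for each event; all names here are hypothetical and for illustration only.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // For one cut on node purity, return {efficiency, bad fraction}:
    // the fraction of good events kept and the fraction of bad events kept.
    std::pair<double, double> efficiencyAndBadFraction(
        const std::vector<double>& purity,   // tree output per event
        const std::vector<bool>& isGood,     // truth label per event
        double purityCut)
    {
        double goodPass = 0, goodTotal = 0, badPass = 0, badTotal = 0;
        for (std::size_t i = 0; i < purity.size(); ++i) {
            const bool pass = purity[i] > purityCut;
            if (isGood[i]) { goodTotal += 1; if (pass) goodPass += 1; }
            else           { badTotal  += 1; if (pass) badPass  += 1; }
        }
        return { goodTotal > 0 ? goodPass / goodTotal : 0.0,
                 badTotal  > 0 ? badPass  / badTotal  : 0.0 };
    }

Scanning the purity cut from 0 to 1 traces out the efficiency versus bad-fraction curve.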


The classifier package

  • Properties

    • Implements Gini separation (entropy is now implemented as well)

    • Reads ROOT files

    • Flexible specification of variables to use

    • Simple and compact ascii representation of decision trees

    • Recently implemented: multiple trees, boosting (see the sketch after this list)

    • Not yet: pruning

  • Advantages vs IM:

    • Advanced techniques involving multiple trees, e.g. boosting, are available

    • Anyone can train trees, play with variables, etc.
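
The “multiple trees, boosting” item above can be illustrated with a generic AdaBoost-style combination, in which each tree's response is weighted by a factor derived from its training error and the weighted sum gives the final score. This is a sketch of the general technique under that assumption; it is not necessarily how the classifier package combines its trees.

    #include <cmath>
    #include <functional>
    #include <vector>

    // Generic boosted-tree combination: each tree contributes its per-event
    // response weighted by alpha = 0.5 * ln((1 - err) / err), with err its
    // weighted training error.  Illustrative only.
    struct WeightedTree {
        double alpha;                                               // boosting weight
        std::function<double(const std::vector<double>&)> response; // tree output
    };

    double boostWeight(double err)          // err in (0, 0.5) for a useful tree
    {
        return 0.5 * std::log((1.0 - err) / err);
    }

    double boostedScore(const std::vector<WeightedTree>& trees,
                        const std::vector<double>& event)
    {
        double score = 0.0;
        for (const auto& t : trees) score += t.alpha * t.response(event);
        return score;
    }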


Growing trees

  • For simplicity, run a single tree: initially try all of Bill’s classification-tree variables (listed at right on the slide).

  • The run over the full 750 K events, trying each of the 70 variables, takes only a few minutes!


Performance of the truncated list

The tree has 1691 leaf nodes, 3381 nodes in total.

Note: this is cheating: the evaluation uses the same events as the training.


Separate tree comparison

[Plots: side-by-side comparison of the IM and classifier trees]


Current status: boosting works!

Example using D0 data


Current status, next steps

  • The current classifier “goodcal” single-tree algorithm applied to all energies is slightly better than the three individual IM trees

    • Boosting will certainly improve the result

  • Done:

    • One-track vs. vertex: which estimate is better?

  • In progress in Seattle (as we speak)

    • PSF tail suppression: 4 trees to start.

  • In progress in Padova (see F. Longo’s summary)

    • Good-gamma prediction, or background rejection

  • Switch to new all_gamma run, rename variables.
