DC2 C++ decision trees

  • Quick review of classification (or decision) trees
  • Training and testing
  • How Bill does it with Insightful Miner
  • Application to the “good-energy” trees: how does it compare?

Toby Burnett

Frank Golf

DC2 at GSFC - 28 Jun 05 - T. Burnett

Quick Review of Decision Trees
  • Introduced to GLAST by Bill Atwood, using InsightfulMiner
  • Each branch node is a predicate, or cut on a variable, like CalCsIRLn > 4.222
  • If true, this defines the right branch, otherwise the left branch.
  • If there is no branch, the node is a leaf; a leaf contains the purity of the sample that reaches that point
  • Thus the tree defines a function of the event variables used, returning the purity, measured on the training sample, of the leaf that the event reaches (sketched in code below)

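In code, this structure maps onto a small recursive type; the following is a minimal sketch (the Node layout and evaluate function are assumptions for illustration, not the actual classifier-package implementation):

```cpp
#include <vector>

// Illustrative node layout -- an assumption for this sketch, not the
// classifier package's actual representation.
struct Node {
    int varIndex = -1;            // which event variable the predicate cuts on
    double cut = 0;               // e.g. CalCsIRLn > 4.222
    const Node* left = nullptr;   // taken when the predicate is false
    const Node* right = nullptr;  // taken when the predicate is true
    double purity = 0;            // meaningful only at a leaf
    bool isLeaf() const { return left == nullptr && right == nullptr; }
};

// Walk from the root: true goes right, false goes left; the leaf that
// the event reaches returns the purity measured on the training sample.
double evaluate(const Node* node, const std::vector<double>& event) {
    while (!node->isLeaf())
        node = event[node->varIndex] > node->cut ? node->right : node->left;
    return node->purity;
}
```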

Training and testing procedure
  • Analyze a training sample containing a mixture of “good” and “bad” events: I use the even-numbered events, keeping the odd-numbered events as an independent set for testing
  • Choose a set of variables and find the optimal cut such that the left and right subsets are purer than the original. Two standard criteria for this: “Gini” and entropy. I currently use the former.
    • W_S: sum of signal weights
    • W_B: sum of background weights

Gini = 2 W_S W_B / (W_S + W_B)

Thus the goal is to make Gini small.

Actually, maximize the improvement: Gini(parent) - Gini(left child) - Gini(right child) (sketched in code after this list)

  • Apply this recursively until too few events remain (100 for now)
  • Finally, test with the odd-numbered events: measure the purity for each node
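As a concrete illustration of the criterion, here is a minimal sketch of the Gini bookkeeping for one candidate cut; the Event layout and function names are assumptions for this sketch, not the classifier package's actual code.

```cpp
#include <vector>

// Illustrative event record: one variable's value, a weight, and truth.
struct Event { double value; double weight; bool isSignal; };

// Gini = 2 W_S W_B / (W_S + W_B): small when the subset is pure.
double gini(double wS, double wB) {
    double w = wS + wB;
    return w > 0 ? 2.0 * wS * wB / w : 0.0;
}

// Improvement of a candidate cut: Gini(parent) - Gini(left) - Gini(right).
// The training loop would maximize this over all variables and cut values.
double giniImprovement(const std::vector<Event>& events, double cut) {
    double wSL = 0, wBL = 0, wSR = 0, wBR = 0;  // signal/background, left/right
    for (const Event& e : events) {
        if (e.value > cut) (e.isSignal ? wSR : wBR) += e.weight;
        else               (e.isSignal ? wSL : wBL) += e.weight;
    }
    return gini(wSL + wSR, wBL + wBR) - gini(wSL, wBL) - gini(wSR, wBR);
}
```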

Evaluate by Comparing with IM results

From Bill’s Rome ’03 talk:

The “good cal” analysis

[Plots: CAL-Low and CAL-High CT probability distributions, each showing the All, Good, and Bad event populations]

Compare with Bill, cont
  • Since Rome:
    • Three energy ranges, three trees (a dispatch sketch follows this list).
      • CalEnergySum: 5-350; 350-3500; >3500
    • Resulting decision trees implemented in Gleam by a “classification” package: results saved to the IMgoodCal tuple variable.
  • Development until now for training and comparison: the all_gamma sample from v4r2
    • 760 K events with E from 16 MeV to 160 GeV (uniform in log(E)) and 0 < θ < 90° (uniform in cos(θ))
    • Contains IMgoodCal for comparison
  • Now:
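A toy picture of the three-range arrangement (reusing the hypothetical Node type from the evaluation sketch above; the function name and the MeV units are assumptions, not the classification package's real interface):

```cpp
// Dispatch an event to the tree covering its CalEnergySum range.
// Boundaries follow the slide (5-350; 350-3500; >3500); units assumed MeV.
const Node* selectTree(double calEnergySum,
                       const Node* low, const Node* mid, const Node* high) {
    if (calEnergySum < 350.0)  return low;   // 5-350
    if (calEnergySum < 3500.0) return mid;   // 350-3500
    return high;                             // >3500
}
```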

Bill-type plots for all_gamma

Another way of looking at performance

Define efficiency as the fraction of good events surviving a given cut on node purity; determine the fraction of bad events surviving the same cut
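A sketch of how such a working point could be computed, assuming each test event carries its tree output (node purity) and a good/bad truth flag; the names here are illustrative:

```cpp
#include <vector>

// Illustrative record: the tree output (node purity) for one test event
// plus its good/bad truth flag.
struct Scored { double purity; bool isGood; };

// For a given cut on node purity: efficiency = fraction of good events
// kept; badFraction = fraction of bad events that survive the same cut.
void efficiencyAndBadFraction(const std::vector<Scored>& events, double cut,
                              double& efficiency, double& badFraction) {
    int good = 0, goodKept = 0, bad = 0, badKept = 0;
    for (const Scored& e : events) {
        if (e.isGood) { ++good; if (e.purity > cut) ++goodKept; }
        else          { ++bad;  if (e.purity > cut) ++badKept;  }
    }
    efficiency  = good > 0 ? double(goodKept) / good : 0.0;
    badFraction = bad  > 0 ? double(badKept)  / bad  : 0.0;
}
```

Sweeping the cut from 0 to 1 traces out the efficiency-vs-bad-fraction curve shown on the slide.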

The classifier package
  • Properties
    • Implements Gini separation (entropy now implemented)
    • Reads ROOT files (see the sketch after this list)
    • Flexible specification of variables to use
    • Simple and compact ASCII representation of decision trees
    • Recently implemented: multiple trees, boosting
    • Not yet: pruning
  • Advantages vs IM:
    • Advanced techniques involving multiple trees, e.g. boosting, are available
    • Anyone can train trees, play with variables, etc.
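For the ROOT input, a minimal sketch of the usual TFile/TTree reading pattern; the file, tree, and branch names here are placeholders, not the actual DC2 tuples or the package's real reader:

```cpp
#include "TFile.h"
#include "TTree.h"

// Sketch of reading one training variable from a ROOT tuple.
// "all_gamma.root" and "MeritTuple" are placeholder names.
void readTuple() {
    TFile* file = TFile::Open("all_gamma.root");
    TTree* tree = static_cast<TTree*>(file->Get("MeritTuple"));
    float calCsIRLn = 0;
    tree->SetBranchAddress("CalCsIRLn", &calCsIRLn);
    for (Long64_t i = 0; i < tree->GetEntries(); ++i) {
        tree->GetEntry(i);
        // ... hand calCsIRLn (and the other chosen variables) to the trainer
    }
    file->Close();
}
```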

Growing trees
  • For simplicity, run a single tree: initially try all of Bill’s classification-tree variables (70 in all; the list is not reproduced here).
  • The run over the full 750 K events, trying each of the 70 variables, takes only a few minutes!

Performance of the truncated list

1691 leaf nodes, 3381 nodes total

Note: this is cheating: the evaluation uses the same events as the training

Separate tree comparison

[Plots: IM vs. classifier output distributions, side by side]

Current status: boosting works!

Example using D0 data
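For reference, boosting here means growing a sequence of trees on reweighted samples. Below is a generic AdaBoost-style round, a sketch under the assumption of the standard weight-update scheme, not necessarily the exact algorithm in the classifier package:

```cpp
#include <cmath>
#include <vector>

// Illustrative training record; names are assumptions for this sketch.
struct TrainEvent { std::vector<double> vars; bool isSignal; double weight; };

// One AdaBoost-style round: after a tree is grown, boost the weights of
// the events it misclassified, renormalize, and return the tree's weight
// in the final vote.  'classify' stands in for evaluating the new tree
// (e.g. purity > 0.5 means "called signal").
double boostRound(std::vector<TrainEvent>& sample,
                  bool (*classify)(const TrainEvent&)) {
    double wrong = 0, total = 0;
    for (const TrainEvent& e : sample) {
        total += e.weight;
        if (classify(e) != e.isSignal) wrong += e.weight;
    }
    double err   = wrong / total;               // weighted error rate
    double alpha = std::log((1.0 - err) / err); // (guards for err == 0 or
                                                //  err >= 0.5 omitted here)
    double norm = 0;
    for (TrainEvent& e : sample) {
        if (classify(e) != e.isSignal) e.weight *= std::exp(alpha);
        norm += e.weight;
    }
    for (TrainEvent& e : sample) e.weight /= norm;  // unit total weight
    return alpha;
}
```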

Current status, next steps
  • The current classifier “goodcal” single-tree algorithm applied to all energies is slightly better than the three individual IM trees
    • Boosting will certainly improve the result
  • Done:
    • One-track vs. vertex: which estimate is better?
  • In progress in Seattle (as we speak)
    • PSF tail suppression: 4 trees to start.
  • In progress in Padova (see F. Longo’s summary)
    • Good-gamma prediction, or background rejection
  • Switch to new all_gamma run, rename variables.
