Comp 540 Chapter 9: Additive Models, Trees, and Related Methods

Ryan King

Overview

9.1 Generalized Additive Models

9.2 Tree-based Methods (CART)

9.3 PRIM: Bump Hunting

9.4 MARS

9.5 Hierarchical Mixture of Experts

9.6 Missing Data

9.7 Computational Considerations

Generalized Additive Models

A generalized additive model has the form:

$$E(Y \mid X_1, X_2, \ldots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$$

where the $f_j$ are unspecified smooth functions.

Example: logistic regression,

$$\log\!\left(\frac{\mu(X)}{1 - \mu(X)}\right) = \alpha + \beta_1 X_1 + \cdots + \beta_p X_p,$$

becomes additive logistic regression:

$$\log\!\left(\frac{\mu(X)}{1 - \mu(X)}\right) = \alpha + f_1(X_1) + \cdots + f_p(X_p)$$

Link Functions
  • The conditional mean $\mu(X)$ of the response is related to an additive function of the predictors via a link function $g$: $g[\mu(X)] = \alpha + f_1(X_1) + \cdots + f_p(X_p)$
  • Identity: $g(\mu) = \mu$ (Gaussian response)
  • Logit: $g(\mu) = \log[\mu / (1 - \mu)]$ (binomial response)
  • Log: $g(\mu) = \log(\mu)$ (Poisson response)
9.1.1 Fitting Additive Models
  • Example: additive cubic splines
  • Penalized residual sum of squares (PRSS) criterion:

$$\mathrm{PRSS}(\alpha, f_1, \ldots, f_p) = \sum_{i=1}^{N}\Big(y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij})\Big)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2\, dt_j$$

Each minimizer $\hat{f}_j$ is a cubic smoothing spline in $X_j$, with knots at the unique values of $x_{ij}$.
9.1 Backfitting Algorithm
  • Initialize: $\hat{\alpha} = \frac{1}{N}\sum_{i=1}^{N} y_i$, and $\hat{f}_j \equiv 0$ for all $j$
  • Cycle over $j = 1, 2, \ldots, p, 1, 2, \ldots$:

$$\hat{f}_j \leftarrow \mathcal{S}_j\Big[\big\{y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f}_k(x_{ik})\big\}_{1}^{N}\Big], \qquad \hat{f}_j \leftarrow \hat{f}_j - \frac{1}{N}\sum_{i=1}^{N} \hat{f}_j(x_{ij})$$

Repeat until the functions $\hat{f}_j$ change less than a prespecified threshold. ($\mathcal{S}_j$ denotes the smoother applied to the $j$th predictor.)
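A minimal Python sketch of this backfitting loop, assuming a user-supplied one-dimensional smoother; the function names (`backfit`, `knn_smoother`) and the crude running-mean smoother are illustrative, not part of the chapter:

```python
import numpy as np

def backfit(X, y, smoother, n_iters=20, tol=1e-4):
    """Backfitting for an additive model: X is (N, p), y is (N,).
    smoother(x, r) must return the fitted values of a 1-D smoother of r on x."""
    N, p = X.shape
    alpha = y.mean()                 # initialize the intercept to the mean of y
    f = np.zeros((N, p))             # column j holds the fitted values of f_j

    for _ in range(n_iters):
        max_change = 0.0
        for j in range(p):
            # partial residual: remove the intercept and all other components
            r = y - alpha - f.sum(axis=1) + f[:, j]
            new_fj = smoother(X[:, j], r)
            new_fj = new_fj - new_fj.mean()   # center f_j so the intercept stays identifiable
            max_change = max(max_change, np.abs(new_fj - f[:, j]).max())
            f[:, j] = new_fj
        if max_change < tol:         # stop once the functions stop changing
            break
    return alpha, f

def knn_smoother(x, r, k=11):
    """A crude k-nearest-neighbor running-mean smoother, for illustration only."""
    order = np.argsort(x)
    fitted = np.empty(len(r))
    for rank, i in enumerate(order):
        lo, hi = max(0, rank - k // 2), min(len(x), rank + k // 2 + 1)
        fitted[i] = r[order[lo:hi]].mean()
    return fitted
```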

9.1.3 Summary
  • Additive models extend linear models
  • Flexible, but still interpretable
  • Simple, modular, backfitting procedure
  • Limitations for large data-mining applications
9.2 Tree-Based Methods
  • Partition the feature space
  • Fit a simple model (constant) in each partition
  • Simple, but powerful
  • CART: Classification and Regression Trees (Breiman et al., 1984)
9.2 Regression Trees
  • CART is a top down (divisive) greedy procedure
  • Partitioning is a local decision for each node
  • A split on variable $j$ at value $s$ creates the regions:

$$R_1(j, s) = \{X \mid X_j \le s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j > s\}$$

9.2 Regression Trees
  • Each node chooses $j, s$ to solve: $\displaystyle\min_{j,\, s}\Big[\min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2\Big]$
  • For any choice of $j, s$, the inner minimization is solved by the region means: $\hat{c}_1 = \mathrm{ave}(y_i \mid x_i \in R_1(j,s))$, $\hat{c}_2 = \mathrm{ave}(y_i \mid x_i \in R_2(j,s))$
  • It is easy to scan through all choices of $j, s$ to find the optimal split (see the sketch below)
  • After the split, recur on $R_1$ and $R_2$
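A minimal sketch of this exhaustive split search in Python (`best_split` is an illustrative name, not from the chapter):

```python
import numpy as np

def best_split(X, y):
    """Scan all (j, s) pairs and return the split minimizing the summed
    squared error of the two resulting regions; X is (N, p), y is (N,)."""
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # candidate split points: midpoints between adjacent distinct values of X_j
        for s in (values[:-1] + values[1:]) / 2.0:
            left = X[:, j] <= s
            right = ~left
            # the inner minimization is solved by the region means
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[right] - y[right].mean()) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s, best_sse
```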
9.2 Cost-Complexity Pruning
  • How large do we grow the tree? Which nodes should we keep?
  • Grow tree out to fixed depth, prune back based on cost-complexity criterion.
9.2 Terminology
  • A subtree $T \subset T_0$ is a pruned version of $T_0$
  • Tree $T$ has $M$ leaf nodes, each indexed by $m$
  • Leaf node $m$ maps to region $R_m$
  • $|T|$ denotes the number of leaf nodes of $T$
  • $N_m$ is the number of data points in region $R_m$
9.2 Cost-Complexity Pruning
  • We define the cost-complexity criterion: $C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$, where $Q_m(T) = \frac{1}{N_m}\sum_{x_i \in R_m}(y_i - \hat{c}_m)^2$ for regression (see the small example below)
  • For each $\alpha$, find the subtree $T_\alpha \subseteq T_0$ that minimizes $C_\alpha(T)$
  • Choose $\hat{\alpha}$ by cross-validation
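As a small illustration, the criterion itself is trivial to evaluate once each leaf's contribution $N_m Q_m(T)$ is known (a sketch, not library code):

```python
def cost_complexity(leaf_sse, alpha):
    """C_alpha(T) for a subtree T whose leaves have within-leaf sums of
    squared errors leaf_sse (one entry per leaf, i.e. N_m * Q_m(T))."""
    return sum(leaf_sse) + alpha * len(leaf_sse)   # fit term + alpha * |T|
```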
9.2 Classification Trees
  • We define the same cost-complexity criterion $C_\alpha(T)$
  • But choose a different measure $Q_m(T)$ of node impurity
9.2 Impurity Measures
  • Misclassification error: $\frac{1}{N_m}\sum_{i \in R_m} I\big(y_i \neq k(m)\big) = 1 - \hat{p}_{m,k(m)}$
  • Gini index: $\sum_{k \neq k'} \hat{p}_{mk}\,\hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$
  • Cross-entropy (deviance): $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$

Here $\hat{p}_{mk}$ is the proportion of class-$k$ observations in node $m$, and $k(m) = \arg\max_k \hat{p}_{mk}$ is the majority class.
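A small Python sketch of the three impurity measures, computed from the class proportions $\hat{p}_{mk}$ of a single node:

```python
import numpy as np

def node_impurities(p):
    """p: array of class proportions for one node (non-negative, summing to 1)."""
    p = np.asarray(p, dtype=float)
    misclassification = 1.0 - p.max()            # 1 - p_{m, k(m)}
    gini = (p * (1.0 - p)).sum()                 # sum_k p_mk (1 - p_mk)
    q = p[p > 0]
    cross_entropy = -(q * np.log(q)).sum()       # -sum_k p_mk log p_mk
    return misclassification, gini, cross_entropy

# node_impurities([0.5, 0.5]) -> (0.5, 0.5, 0.693...)
```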
9.2 Categorical Predictors
  • How do we handle categorical (unordered) predictors?
  • In general, there are $2^{q-1} - 1$ possible partitions of $q$ predictor values into two groups, so exhaustive search quickly becomes infeasible
  • Trick for the 0-1 outcome case: sort the predictor classes by the proportion falling in outcome class 1, then split that ordered list as if the predictor were ordered (a sketch follows); this gives the optimal split in terms of squared error or the Gini index
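A sketch of the sorting trick for the 0-1 outcome case (`order_categories_by_class1_rate` is an illustrative helper name):

```python
import numpy as np

def order_categories_by_class1_rate(categories, y):
    """categories: array of category labels, one per observation; y: 0/1 outcomes.
    Returns the category levels sorted by their proportion of class-1 outcomes;
    the split search can then treat the predictor as if it were ordered."""
    levels = np.unique(categories)
    rates = [y[categories == lvl].mean() for lvl in levels]
    return levels[np.argsort(rates)]
```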

9.2 CART Example
  • Examples…
9.3 PRIM: Bump Hunting
  • Partition based, but not tree-based
  • Seeks boxes where the response average is high
  • Top-down algorithm
Patient Rule Induction Method
  1. Start with all of the data and a maximal box containing it
  2. Shrink the box by compressing one face, so as to peel off a proportion $\alpha$ of the observations; choose the peeling that produces the highest box mean (see the sketch after this list)
  3. Repeat step 2 until some minimal number of observations remain in the box
  4. Expand the box along any face, as long as the resulting box mean increases
  5. Steps 1-4 give a sequence of boxes; use cross-validation to choose a member of the sequence, and call that box B1
  6. Remove the data in B1 from the dataset and repeat the process to find further boxes, as desired
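A rough Python sketch of a single top-down peeling step, under the simplifying assumption that all predictors are numeric and the box is axis-aligned (`peel_once` is an illustrative name):

```python
import numpy as np

def peel_once(X, y, box, alpha=0.1):
    """box is a list of (lo, hi) bounds, one per column of X.  Try compressing
    each face by a fraction alpha of the in-box observations and keep the peel
    that leaves the highest box mean."""
    in_box = np.all([(X[:, j] >= lo) & (X[:, j] <= hi)
                     for j, (lo, hi) in enumerate(box)], axis=0)
    best_mean, best_box = -np.inf, None
    for j, (lo, hi) in enumerate(box):
        xj = X[in_box, j]
        for new_lo, new_hi in [(np.quantile(xj, alpha), hi),         # peel from below
                               (lo, np.quantile(xj, 1.0 - alpha))]:  # peel from above
            keep = in_box & (X[:, j] >= new_lo) & (X[:, j] <= new_hi)
            if keep.any() and y[keep].mean() > best_mean:
                best_mean = y[keep].mean()
                best_box = [b if k != j else (new_lo, new_hi)
                            for k, b in enumerate(box)]
    return best_box, best_mean
```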
9.3 PRIM Summary
  • Can handle categorical predictors, as CART does
  • Designed for regression; a two-class classification problem can be handled by coding the outcome as 0-1
  • Nontrivial to deal with $k > 2$ classes
  • More patient than CART: each peel removes only a small fraction of the data, rather than committing to a hard binary split
9.4 Multivariate Adaptive Regression Splines (MARS)
  • Generalization of stepwise linear regression
  • Modification of CART to improve regression performance
  • Able to capture additive structure
  • Not tree-based
9.4 MARS Continued
  • Additive model built on an adaptively chosen set of basis functions
  • The basis is built up from simple piecewise linear functions
  • The set $C$ is the candidate collection of linear splines, each with a knot ("knee") at an observed value $x_{ij}$ of some predictor; models are built from elements of $C$ and their products (see the sketch below)
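A small sketch of the reflected-pair basis functions that make up $C$:

```python
import numpy as np

def reflected_pair(x, t):
    """The MARS reflected pair with knot t: (x - t)_+ and (t - x)_+."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

# C contains one such pair for every predictor X_j and every observed value t = x_ij.
```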
9.4 MARS Procedure

The model has the form:

$$f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X)$$

where each $h_m(X)$ is a function in $C$, or a product of such functions.

  • Given a choice of the $h_m$, the coefficients $\beta_m$ are chosen by standard least-squares linear regression
  • Start with $h_0(X) = 1$; all functions in $C$ are candidate functions
  • At each stage, consider as a new basis-function pair all products of a function $h_\ell$ already in the model set $\mathcal{M}$ with one of the reflected pairs in $C$
  • We add to the model the term of the form $\hat{\beta}_{M+1}\, h_\ell(X)\,(X_j - t)_+ + \hat{\beta}_{M+2}\, h_\ell(X)\,(t - X_j)_+$ that gives the largest decrease in training error
9.4 Choosing Number of Terms
  • Large models can overfit.
  • Backward deletion procedure: delete terms which cause the smallest increase in residual squared error, to give sequence of models.
  • Pick the model using generalized cross-validation (GCV):

$$\mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N}\big(y_i - \hat{f}_\lambda(x_i)\big)^2}{\big(1 - M(\lambda)/N\big)^2}$$

  • $M(\lambda) = r + cK$ is the effective number of parameters in the model, where $r$ is the number of linearly independent basis functions, $K$ is the number of knots selected, and $c = 3$ (with $c = 2$ if the model is restricted to be additive)
  • Choose the model that minimizes $\mathrm{GCV}(\lambda)$
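A minimal sketch of the GCV computation, assuming the fitted values come from an already-fit model:

```python
import numpy as np

def gcv(y, y_hat, n_basis, n_knots, c=3.0):
    """GCV(lambda) = RSS / (1 - M/N)^2, with effective parameters M = r + c*K."""
    N = len(y)
    M = n_basis + c * n_knots
    rss = np.sum((y - y_hat) ** 2)
    return rss / (1.0 - M / N) ** 2
```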
9.4 MARS Summary
  • Basis functions operate locally
  • Forward modeling is hierarchical: multiway products are built up only from terms already in the model
  • Each input appears at most once in each product
  • A useful option is to set a limit on the order of interaction: a limit of two allows only pairwise products, and a limit of one results in an additive model
9.5 Hierarchical Mixture of Experts (HME)
  • Variant of tree based methods
  • Soft splits, not hard decisions
  • At each node, an observation goes left or right with probability depending on its input values
  • Smooth parameter optimization, instead of discrete split point search
9.5 HMEs continued
  • Linear (or logistic) regression model fit at each leaf node (Expert)
  • Splits can be multi-way, instead of binary
  • Splits are probabilistic functions of linear combinations of inputs (gating network), rather than functions of single inputs
  • Formally a mixture model
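For concreteness, a two-level HME with $K$ branches at each level and softmax gating networks takes the mixture form:

$$
\Pr(y \mid x) \;=\; \sum_{j=1}^{K} g_j(x) \sum_{l=1}^{K} g_{l \mid j}(x)\, \Pr(y \mid x, \theta_{jl}),
\qquad
g_j(x) \;=\; \frac{e^{\gamma_j^{\top} x}}{\sum_{k=1}^{K} e^{\gamma_k^{\top} x}},
$$

where the $g_j$ and $g_{l \mid j}$ are the top-level and second-level gating probabilities and $\Pr(y \mid x, \theta_{jl})$ is the expert model at leaf $(j, l)$; the parameters are fit by maximum likelihood, typically with the EM algorithm.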
9.6 Missing Data
  • Quite common to have data with missing values for one or more input features
  • Missing values may or may not distort data
  • For response vector $y$, let $X_{obs}$ denote the observed entries of the predictor matrix $X$, and let $Z = (X, y)$ and $Z_{obs} = (X_{obs}, y)$; $R$ is an indicator matrix marking which entries of $X$ are missing
9.6 Missing Data
  • Missing at Random (MAR): $\Pr(R \mid Z, \theta) = \Pr(R \mid Z_{obs}, \theta)$
  • Missing Completely at Random (MCAR): $\Pr(R \mid Z, \theta) = \Pr(R \mid \theta)$
  • MCAR is a stronger assumption: the missingness mechanism depends on neither the observed nor the unobserved data
9.6 Dealing with Missing Data

Three approaches for handling MCAR data:

  • Discard observations with missing features
  • Rely on the learning algorithm to deal with missing values in its training phase
  • Impute all the missing values before training
9.6 Dealing…MCAR
  • If few values are missing, (1) may work
  • For (2), CART can work well with missing values via surrogate splits. Additive models can assume average values.
  • (3) is necessary for most algorithms. The simplest tactic is to impute the mean or median of the non-missing values of that feature (a minimal sketch follows).
  • If the features are correlated, one can instead build a predictive model for each missing feature in terms of the features that are known
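A minimal sketch of the simplest imputation tactic, filling each missing entry with its column mean (missing values assumed to be encoded as NaN):

```python
import numpy as np

def mean_impute(X):
    """Replace NaN entries of each column of X with that column's observed mean."""
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        missing = np.isnan(col)
        if missing.any():
            col[missing] = col[~missing].mean()
    return X
```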
9.7 Computational Considerations
  • For $N$ observations and $p$ predictors:
  • Additive models: $O(pN\log N + mpN)$ for $m$ backfitting cycles of one-dimensional cubic smoothing-spline fits
  • Trees: $O(pN\log N)$ for the initial sorts, and typically another $O(pN\log N)$ for the split computations
  • MARS: roughly $O(Nm^2 + pmN)$ to add a basis function to a model with $m$ terms, hence about $O(NM^3 + pM^2N)$ to build an $M$-term model
  • HME: each EM step is dominated by refitting the expert regressions (roughly $O(Np^2)$ each) and the multiclass logistic regressions of the gating networks, so fitting is typically slow