Comp 540 Chapter 9: Additive Models, Trees, and Related Methods

Ryan King

Overview

9.1 Generalized Additive Models

9.2 Tree-based Methods (CART)

9.4 MARS

9.6 Missing Data

9.7 Computational Considerations

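Generalized Additive Models

Generally have the form:

$$E(Y \mid X_1, \ldots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$$

Example: logistic regression

$$\log\frac{\mu(X)}{1 - \mu(X)} = \alpha + \beta_1 X_1 + \cdots + \beta_p X_p$$

becomes additive logistic regression:

$$\log\frac{\mu(X)}{1 - \mu(X)} = \alpha + f_1(X_1) + \cdots + f_p(X_p)$$

where $\mu(X) = \Pr(Y = 1 \mid X)$ and each $f_j$ is an unspecified smooth function.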

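Link Functions

  • The conditional mean $\mu(X)$ is related to an additive function of the predictors via a link function $g$: $g[\mu(X)] = \alpha + f_1(X_1) + \cdots + f_p(X_p)$

  • Identity: $g(\mu) = \mu$ (Gaussian)

  • Logit: $g(\mu) = \log[\mu / (1 - \mu)]$ (binomial)

  • Log: $g(\mu) = \log(\mu)$ (Poisson)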
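As a small illustration of the three link functions listed on this slide, the sketch below pairs each link with its inverse, which maps an additive predictor back to the mean scale. The table `LINKS` and the helper `conditional_mean` are hypothetical names chosen for this example, not part of any library.

```python
import math

# Hypothetical lookup table (illustrative only): name -> (link g, inverse link).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
}


def conditional_mean(link_name, eta):
    """Map an additive predictor eta = alpha + sum_j f_j(x_j) to the mean scale."""
    _, inverse = LINKS[link_name]
    return inverse(eta)
```

For instance, an additive predictor of 0 under the logit link corresponds to a probability of one half.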

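9.1.1 Fitting Additive Models

  • Ex: Additive Cubic Splines, for the model $Y = \alpha + \sum_{j=1}^{p} f_j(X_j) + \varepsilon$

  • Penalized Residual Sum of Squares Criterion (PRSS): $\mathrm{PRSS}(\alpha, f_1, \ldots, f_p) = \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij}) \Big)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2 \, dt_j$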

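9.1 Backfitting Algorithm

  • Initialize: $\hat\alpha = \frac{1}{N} \sum_{i=1}^{N} y_i$, $\hat f_j \equiv 0$ for all $j$

  • Cycle over $j = 1, 2, \ldots, p, 1, 2, \ldots, p, \ldots$:

    $\hat f_j \leftarrow S_j\Big[ \big\{ y_i - \hat\alpha - \sum_{k \ne j} \hat f_k(x_{ik}) \big\}_1^N \Big]$

    $\hat f_j \leftarrow \hat f_j - \frac{1}{N} \sum_{i=1}^{N} \hat f_j(x_{ij})$

    Until the functions $\hat f_j$ change less than a threshold

where $S_j$ is a smoother (for example, a cubic smoothing spline) applied to the partial residuals as a function of $x_{ij}$.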
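The initialize/cycle/stop loop of backfitting can be sketched in a few lines. This is a minimal illustration, not the course's implementation: it assumes a simple running-mean smoother in place of a cubic smoothing spline, and the names `running_mean_smoother` and `backfit` are hypothetical.

```python
def running_mean_smoother(x, r, k=5):
    """Smooth values r against x by averaging over a window of about k neighbours.

    Assumption: this stands in for a cubic smoothing spline.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    fitted = [0.0] * n
    half = k // 2
    for pos, i in enumerate(order):
        lo, hi = max(0, pos - half), min(n, pos + half + 1)
        window = [r[order[q]] for q in range(lo, hi)]
        fitted[i] = sum(window) / len(window)
    return fitted


def backfit(X, y, k=5, tol=1e-6, max_cycles=100):
    """X is a list of p feature columns, each of length N. Returns (alpha, f)."""
    n, p = len(y), len(X)
    alpha = sum(y) / n                      # initialize: alpha = mean of y
    f = [[0.0] * n for _ in range(p)]       # initialize: every f_j is zero
    for _ in range(max_cycles):
        change = 0.0
        for j in range(p):
            # Partial residuals: remove the intercept and all other fitted functions.
            partial = [y[i] - alpha - sum(f[m][i] for m in range(p) if m != j)
                       for i in range(n)]
            new_fj = running_mean_smoother(X[j], partial, k)
            mean_fj = sum(new_fj) / n
            new_fj = [v - mean_fj for v in new_fj]   # re-center so f_j averages zero
            change = max(change, max(abs(a - b) for a, b in zip(new_fj, f[j])))
            f[j] = new_fj
        if change < tol:                    # stop once the functions barely move
            break
    return alpha, f
```

The fitted value for observation i is `alpha + sum(f[j][i] for j in range(p))`. Swapping in a different smoother only requires replacing `running_mean_smoother`, which is the modularity the procedure is known for.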

9.1.3 Summary

  • Additive models extend linear models

  • Flexible, but still interpretable

  • Simple, modular, backfitting procedure

  • Limitations for large data-mining applications

9.2 Tree-Based Methods

  • Partition the feature space

  • Fit a simple model (constant) in each partition

  • Simple, but powerful

  • CART: Classification and Regression Trees (Breiman et al., 1984)

9.2 Binary Recursive Partitions

9.2 Regression Trees

  • CART is a top-down (divisive) greedy procedure

  • Partitioning is a local decision for each node

  • A partition on variable $j$ at value $s$ creates regions: $R_1(j, s) = \{X \mid X_j \le s\}$ and $R_2(j, s) = \{X \mid X_j > s\}$


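9.2 Regression Trees

  • Each node chooses $j, s$ to solve: $\min_{j, s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]$

  • For any choice of $j, s$, the inner minimization is solved by: $\hat c_1 = \mathrm{ave}(y_i \mid x_i \in R_1(j,s))$ and $\hat c_2 = \mathrm{ave}(y_i \mid x_i \in R_2(j,s))$

  • Easy to scan through all choices of $j, s$ to find the optimal split

  • After the split, recurse on $R_1$ and $R_2$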
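The scan over all split variables and split points can be sketched as below. This is an illustrative version under stated assumptions: rows are plain Python lists, the split point is placed midway between adjacent distinct values, and the function name `best_split` is hypothetical. Running sums keep the cost of each candidate split constant once a column is sorted.

```python
def best_split(X, y):
    """Return (j, s, sse) minimizing the summed squared error of the split X[i][j] <= s."""
    n, p = len(X), len(X[0])
    total = sum(y)
    total_sq = sum(v * v for v in y)
    best = (None, None, float("inf"))
    for j in range(p):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = left_sq = 0.0
        for pos in range(n - 1):
            i, nxt = order[pos], order[pos + 1]
            left_sum += y[i]
            left_sq += y[i] ** 2
            if X[i][j] == X[nxt][j]:
                continue                      # cannot split between tied values
            n_left, n_right = pos + 1, n - pos - 1
            right_sum = total - left_sum
            # Squared error around each side's mean (the optimal constant).
            sse_left = left_sq - left_sum ** 2 / n_left
            sse_right = (total_sq - left_sq) - right_sum ** 2 / n_right
            sse = sse_left + sse_right
            if sse < best[2]:
                best = (j, (X[i][j] + X[nxt][j]) / 2.0, sse)
    return best
```

A full tree would call this function on the rows falling on each side of the chosen split until a stopping rule is met.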

9.2 Cost-Complexity Pruning

  • How large do we grow the tree? Which nodes should we keep?

  • Grow tree out to fixed depth, prune back based on cost-complexity criterion.

9.2 Terminology

  • $T \subset T_0$ means that $T$ is a subtree of $T_0$, that is, a pruned version of $T_0$

  • Tree $T$ has $M$ leaf nodes, each indexed by $m$

  • Leaf node $m$ maps to region $R_m$

  • $|T|$ denotes the number of leaf nodes of $T$

  • $N_m$ is the number of data points in region $R_m$

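9.2 Cost-Complexity Pruning

  • We define the cost-complexity criterion: $C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$, where $Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat c_m)^2$ and $\hat c_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i$

  • For each $\alpha \ge 0$, find the subtree $T_\alpha \subseteq T_0$ that minimizes $C_\alpha(T)$

  • Choose $\alpha$ by cross-validation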
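To make the trade-off concrete, the sketch below evaluates the cost-complexity criterion for a regression tree described only by the responses in its leaves: the within-leaf squared error plus a penalty of alpha per leaf. The representation (a list of lists of responses), the toy numbers, and the names `leaf_sse` and `cost_complexity` are assumptions made for illustration.

```python
def leaf_sse(ys):
    """Squared error of a leaf around its mean response."""
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys)


def cost_complexity(leaves, alpha):
    """Total within-leaf squared error plus alpha times the number of leaves."""
    return sum(leaf_sse(ys) for ys in leaves) + alpha * len(leaves)


# Toy data: a three-leaf tree and the subtree obtained by collapsing two leaves.
full = [[1.0, 1.2], [3.0, 3.4], [9.0, 9.2]]
pruned = [[1.0, 1.2, 3.0, 3.4], [9.0, 9.2]]
```

With a small penalty the larger tree has the lower cost; as alpha grows, the pruned tree eventually wins, which is why alpha acts as the tuning parameter.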

9.2 Classification Trees

  • We define the same cost-complexity criterion: $C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$

  • But choose a different measure of node impurity $Q_m(T)$

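9.2 Impurity Measures

With $\hat p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$ the proportion of class $k$ observations in node $m$, and $k(m) = \arg\max_k \hat p_{mk}$ the majority class:

  • Misclassification Error: $1 - \hat p_{m k(m)}$

  • Gini index: $\sum_{k=1}^{K} \hat p_{mk} (1 - \hat p_{mk})$

  • Cross-entropy: $-\sum_{k=1}^{K} \hat p_{mk} \log \hat p_{mk}$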
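The three impurity measures are simple functions of the class proportions in a node. The sketch below computes all of them from a list of proportions. The function name `impurities` is hypothetical, and zero proportions are skipped in the entropy sum under the usual convention that 0 log 0 = 0.

```python
import math


def impurities(props):
    """Return (misclassification error, Gini index, cross-entropy) for class proportions."""
    misclass = 1.0 - max(props)
    gini = sum(p * (1.0 - p) for p in props)
    entropy = -sum(p * math.log(p) for p in props if p > 0)
    return misclass, gini, entropy
```

All three are zero for a pure node and largest when the classes are evenly mixed. Gini and cross-entropy are differentiable in the proportions, unlike misclassification error.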

9.2 Categorical Predictors

  • How do we handle categorical variables?

  • In general, there are $2^{q-1} - 1$ possible partitions of $q$ values into two groups

  • Trick for the 0-1 outcome case: sort the predictor classes by the proportion falling in outcome class 1, then split as if the predictor were ordered
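The sorting trick for a 0-1 outcome can be sketched as follows: order the categories by their share of class-1 observations, after which only the splits of that ordered sequence need to be examined. The names `order_categories` and `candidate_splits` are hypothetical, and the outcome is assumed to be coded as integers 0 and 1.

```python
def order_categories(cats, y):
    """Order category labels by the proportion of their observations with y == 1."""
    counts = {}
    for c, v in zip(cats, y):
        n, ones = counts.get(c, (0, 0))
        counts[c] = (n + 1, ones + v)
    return sorted(counts, key=lambda c: counts[c][1] / counts[c][0])


def candidate_splits(ordered):
    """Splits of the ordered sequence: q - 1 candidates for q categories."""
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]
```

For three categories this leaves two candidate splits to evaluate, and the saving grows quickly with the number of categories.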

9.2 CART Example

  • Examples…

9.3 PRIM: Bump Hunting

  • Partition based, but not tree-based

  • Seeks boxes where the response average is high

  • Top-down algorithm

Patient Rule Induction Method

  1. Start with all of the training data and a maximal box containing all of it

  2. Shrink the box by compressing one face, peeling off a fraction alpha of the observations. Choose the peeling that produces the highest response mean in the remaining box.

  3. Repeat step 2 until some minimal number of observations remain in the box

  4. Expand the box along any face, as long as the resulting box mean increases

  5. Steps 1-4 give a sequence of boxes; use cross-validation to choose a member of the sequence, and call that box B1

  6. Remove the data in B1 from the dataset and repeat the process to find another box, as desired
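The peeling phase (shrinking the box one face at a time) can be sketched as below. This is a simplified illustration: it peels a fixed count of points rather than cutting at a quantile threshold, it omits the expansion and cross-validation steps, and the names `peel_once` and `peel` are hypothetical.

```python
def peel_once(X, y, alpha=0.1):
    """One peeling step: drop about a fraction alpha of the points from one face."""
    n, p = len(X), len(X[0])
    k = max(1, int(alpha * n))                   # number of points to peel off
    best = None
    for j in range(p):
        order = sorted(range(n), key=lambda i: X[i][j])
        # Peel the k smallest ("low" face) or the k largest ("high" face) values.
        for side, keep in (("low", order[k:]), ("high", order[:n - k])):
            mean = sum(y[i] for i in keep) / len(keep)
            if best is None or mean > best[0]:
                best = (mean, j, side, keep)
    mean, j, side, keep = best
    return [X[i] for i in keep], [y[i] for i in keep], (j, side, mean)


def peel(X, y, alpha=0.1, min_points=4):
    """Repeat peeling until at most min_points observations remain in the box."""
    path = []
    while len(y) > min_points:
        X, y, step = peel_once(X, y, alpha)
        path.append(step)
    return X, y, path
```

Each entry of `path` records which variable and face were peeled and the resulting box mean, giving the sequence of boxes from which one would later be chosen.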

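9.3 PRIM Summary

  • Can handle categorical predictors, as CART does

  • Designed for regression; two-class classification can be handled by coding the outcome as 0-1

  • Non-trivial to deal with k>2 classes

  • More patient than CART: each peel removes only a fraction alpha of the observations, whereas binary splits fragment the data quickly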

9.4 Multivariate Adaptive Regression Splines (MARS)

  • Generalization of stepwise linear regression

  • Modification of CART to improve regression performance

  • Able to capture additive structure

  • Not tree-based

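9.4 MARS Continued

  • Additive model with an adaptive set of basis functions

  • Basis built up from simple piecewise linear functions $(x - t)_+$ and $(t - x)_+$

  • The set $C$ is the candidate set of linear splines (reflected pairs), with knots at each observed data value $x_{ij}$: $C = \{ (X_j - t)_+, (t - X_j)_+ \}$ for $t \in \{x_{1j}, \ldots, x_{Nj}\}$ and $j = 1, \ldots, p$. Models are built from elements of $C$ or their products.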

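9.4 MARS Procedure

Model has form: $f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X)$

  • Given a choice for the $h_m$, the coefficients $\beta_m$ are chosen by standard linear regression.

  • Start with $h_0(X) = 1$. All functions in $C$ are candidate functions.

  • At each stage, consider as a new basis function pair all products of a function $h_\ell$ in the model set $\mathcal{M}$ with one of the reflected pairs in $C$.

  • We add to the model the term of the form $\hat\beta_{M+1} h_\ell(X) \cdot (X_j - t)_+ + \hat\beta_{M+2} h_\ell(X) \cdot (t - X_j)_+$, with $h_\ell \in \mathcal{M}$, that produces the largest decrease in training error.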
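The building blocks of the forward pass are the reflected pairs of piecewise linear functions and their products with functions already in the model. The sketch below constructs them as Python closures. The names `hinge_pair` and `product_basis`, the knot values, and the representation of an observation as a list are assumptions for illustration; the search over knots and the regression fit are not shown.

```python
def hinge_pair(t):
    """Reflected pair of piecewise linear functions with a knot at t."""
    return (lambda x: max(0.0, x - t), lambda x: max(0.0, t - x))


def product_basis(h, j, t):
    """Multiply an existing basis function h(row) by a reflected pair in input j."""
    pos, neg = hinge_pair(t)
    return (lambda row: h(row) * pos(row[j]), lambda row: h(row) * neg(row[j]))


h0 = lambda row: 1.0                    # constant function that starts the model
h1, h2 = product_basis(h0, 0, 2.0)      # hinge pair in input 0 with knot 2
h3, h4 = product_basis(h1, 1, 5.0)      # interaction: h1 times a pair in input 1
```

Because each new function is a product with one already in the model, higher-order interactions appear only when their lower-order parts are present.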

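9.4 Choosing Number of Terms

  • Large models can overfit.

  • Backward deletion procedure: at each step, delete the term whose removal causes the smallest increase in residual squared error, giving a sequence of models $\hat f_\lambda$, one for each size $\lambda$.

  • Pick the model using Generalized Cross-Validation: $\mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N} (y_i - \hat f_\lambda(x_i))^2}{(1 - M(\lambda)/N)^2}$

  • $M(\lambda) = r + cK$ is the effective number of parameters in the model, with $c = 3$, $r$ the number of linearly independent basis functions, and $K$ the number of knots selected

  • Choose the model which minimizes $\mathrm{GCV}(\lambda)$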
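Generalized cross-validation divides the residual sum of squares by a factor that shrinks as the effective number of parameters grows, so larger models are penalized. A minimal sketch, assuming fitted values are already available and using the hypothetical function name `gcv` with a per-knot charge `c` defaulting to 3:

```python
def gcv(y, fitted, r, K, c=3.0):
    """Residual sum of squares inflated by the effective number of parameters r + c*K."""
    n = len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, fitted))
    m_eff = r + c * K
    return rss / (1.0 - m_eff / n) ** 2
```

Two models with the same residual error are therefore ranked by size, with the smaller model preferred.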

9.4 MARS Summary

  • Basis functions operate locally

  • Forward modeling is hierarchical: multiway products are built up only from terms already in the model

  • Each input appears only once in each product

  • A useful option is to set a limit on the order of interaction. A limit of two allows only pairwise products. A limit of one results in an additive model.

9.5 Hierarchical Mixture of Experts (HME)

  • Variant of tree-based methods

  • Soft splits, not hard decisions

  • At each node, an observation goes left or right with probability depending on its input values

  • Smooth parameter optimization, instead of discrete split point search

9.5 HMEs Continued

  • Linear (or logistic) regression model fit at each leaf node (Expert)

  • Splits can be multi-way, instead of binary

  • Splits are probabilistic functions of linear combinations of inputs (gating network), rather than functions of single inputs

  • Formally a mixture model
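A soft split can be illustrated with the smallest possible case: one gating node and two linear experts on a scalar input. The prediction is the gate-weighted average of the expert predictions. The function names, the (intercept, slope) parameterization, and the restriction to one level and one input are simplifying assumptions. Fitting the parameters (typically by EM) is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hme_predict(x, gate, experts):
    """One-level, two-expert mixture for scalar x.

    gate and each expert are (intercept, slope) pairs (an illustrative choice).
    """
    g = sigmoid(gate[0] + gate[1] * x)            # probability of routing to expert 0
    preds = [b0 + b1 * x for b0, b1 in experts]   # each expert is a linear model
    return g * preds[0] + (1.0 - g) * preds[1]
```

With a flat gate the two experts are simply averaged. As the gate slope grows, the soft split approaches the hard split of an ordinary tree.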

9.6 Missing Data

  • Quite common to have data with missing values for one or more input features

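  • Missingness may or may not distort the observed data, depending on the mechanism that produced it

  • For response vector $y$ and input matrix $X$, let $X_{obs}$ denote the observed entries of $X$, $Z = (X, y)$, and $Z_{obs} = (X_{obs}, y)$; $R$ is an indicator matrix marking the missing entries of $X$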

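9.6 Missing Data

  • Missing at Random (MAR): $\Pr(R \mid Z, \theta) = \Pr(R \mid Z_{obs}, \theta)$, so the missingness mechanism depends only on the observed data ($\theta$ denotes any parameters in the distribution of $R$)

  • Missing Completely at Random (MCAR): $\Pr(R \mid Z, \theta) = \Pr(R \mid \theta)$, so the missingness mechanism does not depend on the data at all

  • MCAR is a stronger assumption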

9.6 Dealing with Missing Data

Three approaches for handling MCAR data:

  1. Discard observations with missing features

  2. Rely on the learning algorithm to deal with missing values in its training phase

  3. Impute all the missing values before training

9.6 Dealing…MCAR

  • If few values are missing, (1) may work

  • For (2), CART can work well with missing values via surrogate splits. Additive models can assign the average fitted value to observations missing a given feature.

  • (3) is necessary for most algorithms. The simplest tactic is to impute the mean or median of the observed values for that feature.

  • If features are correlated, can build predictive models for missing features in terms of known features
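The simplest imputation tactic, replacing each missing entry with the mean of the observed values in its column, can be sketched as follows. Missing entries are represented by `None`, and the names `missing_indicator` and `impute_mean` are hypothetical. The sketch assumes every column has at least one observed value.

```python
def missing_indicator(X):
    """Indicator matrix: 1 where an entry is missing (None), 0 otherwise."""
    return [[1 if v is None else 0 for v in row] for row in X]


def impute_mean(X):
    """Replace None entries in each column with that column's observed mean."""
    p = len(X[0])
    means = []
    for j in range(p):
        observed = [row[j] for row in X if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if row[j] is None else row[j] for j in range(p)]
            for row in X]
```

Using the median instead only changes the summary computed from `observed`. A model-based imputation would replace that summary with a prediction from the other features.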

9.7 Computational Considerations

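  • For $N$ observations, $p$ predictors

  • Additive Models: $pN \log N + mpN$ operations, for $m$ backfitting cycles with cubic smoothing splines

  • Trees: $pN \log N$ operations for the initial sort of each predictor, and typically another $pN \log N$ for the split computations

  • MARS: $Nm^2 + pmN$ operations to add a basis function to a model with $m$ terms; $NM^3 + pM^2 N$ to build an $M$-term model

  • HME, at each step: $Np^2$ for the regressions, and $Np^2 K^2$ for a $K$-class logistic regression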