
Comp 540 Chapter 9: Additive Models, Trees, and Related Methods - PowerPoint PPT Presentation






Comp 540, Chapter 9: Additive Models, Trees, and Related Methods

Ryan King

Overview

9.1 Generalized Additive Models

9.2 Tree-based Methods (CART)

9.4 MARS

9.6 Missing Data

9.7 Computational Considerations

9.1 Generalized Additive Models

Generally have the form:

E(Y | X1, ..., Xp) = α + f1(X1) + f2(X2) + ... + fp(Xp)

Example: logistic regression, with μ(X) = Pr(Y = 1 | X):

log[ μ(X) / (1 - μ(X)) ] = α + f1(X1) + ... + fp(Xp)

• The conditional mean μ(X) = E(Y | X) is related to an additive function of the predictors via a link function g:

g[μ(X)] = α + f1(X1) + ... + fp(Xp)

• Identity: g(μ) = μ (Gaussian)

• Logit: g(μ) = log[ μ / (1 - μ) ] (binomial)

• Log: g(μ) = log(μ) (Poisson)

• Penalized residual sum of squares criterion (PRSS):

PRSS(α, f1, ..., fp) = Σi [ yi - α - Σj fj(xij) ]² + Σj λj ∫ fj''(tj)² dtj

where the λj ≥ 0 are tuning parameters; the minimizer is an additive model of cubic smoothing splines

• Initialize: α = (1/N) Σi yi, and fj ≡ 0 for all j

• Cycle over j = 1, 2, ..., p, 1, 2, ..., p, ...:

fj ← Sj[ { yi - α - Σ(k≠j) fk(xik) } ]   (smooth the partial residuals against xij)

fj ← fj - (1/N) Σi fj(xij)   (center so each function averages zero)

Until the functions fj change less than a prespecified threshold
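The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the book's algorithm verbatim; the `linear_smoother` helper is an invented stand-in for the cubic smoothing spline Sj:

```python
import numpy as np

def backfit(X, y, smoother, n_iter=20, tol=1e-6):
    """Backfitting for an additive model y ~ alpha + sum_j f_j(X[:, j]).

    `smoother` is any callable (x, r) -> fitted values at x; a real GAM
    would use a smoothing spline here."""
    N, p = X.shape
    alpha = y.mean()                      # initialize: alpha = mean(y), f_j = 0
    f = np.zeros((N, p))
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(p):
            # partial residual: remove alpha and every other fitted function
            r = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = smoother(X[:, j], r)
            f[:, j] -= f[:, j].mean()     # center so alpha stays identified
        if np.abs(f - f_old).max() < tol: # stop once the functions stop changing
            break
    return alpha, f

def linear_smoother(x, r):
    """Crude stand-in smoother: fit a straight line to the partial residual."""
    b, a = np.polyfit(x, r, 1)
    return a + b * x
```

With a spline or kernel smoother plugged in for `smoother`, the same loop fits genuinely nonlinear fj; the centering step keeps α identified as the mean of y.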

• Additive models extend linear models

• Flexible, but still interpretable

• Simple, modular, backfitting procedure

• Limitations for large data-mining applications

• Partition the feature space

• Fit a simple model (constant) in each partition

• Simple, but powerful

• CART: Classification and Regression Trees (Breiman et al., 1984)

9.2 Binary Recursive Partitions

[Figure: a recursive binary partition of the (X1, X2) feature space into regions a-f, and the corresponding binary tree whose leaves a-f are those regions]

• CART is a top down (divisive) greedy procedure

• Partitioning is a local decision for each node

• A partition on variable j at value s creates regions:

R1(j, s) = { X | Xj ≤ s }   and   R2(j, s) = { X | Xj > s }

• Each node chooses j, s to solve:

min(j, s) [ min(c1) Σ(xi ∈ R1(j,s)) (yi - c1)² + min(c2) Σ(xi ∈ R2(j,s)) (yi - c2)² ]

• For any choice j, s the inner minimization is solved by the region means:

c1 = ave( yi | xi ∈ R1(j, s) ),   c2 = ave( yi | xi ∈ R2(j, s) )

• Easy to scan through all choices of j,s to find optimal split

• After the split, recur on R1(j, s) and R2(j, s)
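The exhaustive scan over (j, s) can be sketched as follows; `best_split` and `best_jst` are invented names, and each predictor is sorted once so every candidate split point is visited in order:

```python
import numpy as np

def best_split(x, y):
    """Scan all split points on one predictor; return (s, sse) minimizing the
    sum of squared errors around the two region means.  Illustrative sketch."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                      # no split point between tied values
        s = 0.5 * (xs[i] + xs[i - 1])
        left, right = ys[:i], ys[i:]
        # inner minimization: c1, c2 are just the region means
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[1]:
            best = (s, sse)
    return best

def best_jst(X, y):
    """Choose the best (j, s) over all predictors."""
    results = [(j, *best_split(X[:, j], y)) for j in range(X.shape[1])]
    return min(results, key=lambda t: t[2])
```

Sorting makes the scan O(N log N) per variable; a production version would also maintain running sums so each candidate split costs O(1) to evaluate instead of recomputing the means.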

• How large do we grow the tree? Which nodes should we keep?

• Grow tree out to fixed depth, prune back based on cost-complexity criterion.

• A subtree T ⊆ T0 implies T is a pruned version of T0 (obtained by collapsing any number of its internal nodes)

• Tree T has M leaf nodes, each indexed by m

• Leaf node m maps to region Rm

• |T| denotes the number of leaf nodes of T

• Nm is the number of data points in region Rm

• We define the cost complexity criterion:

Cα(T) = Σ(m=1..|T|) Nm Qm(T) + α |T|

where, for regression, Qm(T) = (1/Nm) Σ(xi ∈ Rm) (yi - cm)², with cm the mean response in Rm

• For each α, find the subtree Tα ⊆ T0 that minimizes Cα(T)

• Choose α by cross-validation
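Evaluating the criterion for a fixed tree is straightforward once the leaves are known; a sketch for the regression case (`cost_complexity` is an invented helper, taking the responses that fall in each leaf):

```python
import numpy as np

def cost_complexity(leaf_y, alpha):
    """C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T| for a regression tree.

    `leaf_y` is a list of arrays: the responses falling in each leaf R_m."""
    total = 0.0
    for ym in leaf_y:
        ym = np.asarray(ym, dtype=float)
        c_m = ym.mean()                        # leaf prediction: region mean
        total += ((ym - c_m) ** 2).sum()       # N_m * Q_m = within-leaf SSE
    return total + alpha * len(leaf_y)         # penalize the leaf count |T|
```

Larger α prefers smaller subtrees; weakest-link pruning would use this score to decide which internal node to collapse next.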

• We define the same cost complexity criterion Cα(T)

• But choose a different measure of node impurity Qm(T). Let pmk be the proportion of class-k observations in node m, and k(m) the majority class in node m:

• Misclassification error: 1 - pm,k(m)

• Gini index: Σk pmk (1 - pmk)

• Cross-entropy: -Σk pmk log pmk
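All three impurity measures come directly from the class proportions in a node; a small sketch (the base-2 logarithm for the entropy is my choice and only rescales it):

```python
import numpy as np

def node_impurities(labels):
    """Misclassification error, Gini index, and cross-entropy for one node.

    p[k] is the proportion of class-k observations in the node."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    misclass = 1.0 - p.max()                 # 1 - proportion of majority class
    gini = (p * (1.0 - p)).sum()             # sum_k p_k (1 - p_k)
    entropy = -(p * np.log2(p)).sum()        # -sum_k p_k log p_k (base 2 here)
    return misclass, gini, entropy
```

Gini and cross-entropy are differentiable and more sensitive to changes in the node probabilities, which is why they are preferred for growing the tree; misclassification error is typically reserved for pruning.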

• How do we handle categorical variables?

• In general, there are 2^(q-1) - 1 possible partitions of q values into two groups, which is prohibitive for large q

• Trick for the 0-1 case: sort the predictor classes by the proportion falling in outcome class 1, then split as if the predictor were ordered; this gives the optimal split
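The sorting trick might look like this in code (`order_categories` is a hypothetical helper name); after ordering, only q - 1 splits of the ordered levels need checking instead of 2^(q-1) - 1 subsets:

```python
import numpy as np

def order_categories(cat, y):
    """0-1 outcome trick: order the q category levels by the proportion of
    class-1 outcomes, so the levels can then be split as an ordered variable."""
    cat, y = np.asarray(cat), np.asarray(y)
    levels = np.unique(cat)
    props = [y[cat == lv].mean() for lv in levels]       # class-1 proportion per level
    return [lv for _, lv in sorted(zip(props, levels), key=lambda t: t[0])]
```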


9.3 PRIM: Patient Rule Induction Method (Bump Hunting)

• Partition based, but not tree-based

• Seeks boxes where the response average is high

• Top-down algorithm

1. Start with a box containing all of the data

2. Shrink the box by compressing one face, to peel off a fraction α of the observations. Choose the peeling that produces the highest response mean in the remaining box

3. Repeat step 2 until some minimal number of observations remain

4. Expand (paste) the box along any face, as long as the resulting box mean increases

5. Steps 1-4 give a sequence of boxes; use cross-validation to choose a member of the sequence, call that box B1

6. Remove the data in B1 from the dataset, and repeat the process to find further boxes, as desired
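A rough sketch of the peeling phase only (steps 1-3); pasting and cross-validation are omitted, and the function name and stopping details are my own choices:

```python
import numpy as np

def prim_peel(X, y, alpha=0.1, min_obs=10):
    """Peeling phase of PRIM: start from a box holding all the data, then
    repeatedly compress one face, peeling off roughly a fraction `alpha` of
    the observations, keeping the peel with the highest remaining mean."""
    lo = X.min(axis=0).astype(float)          # step 1: box spans the whole data
    hi = X.max(axis=0).astype(float)
    inside = np.ones(len(y), dtype=bool)
    while inside.sum() > min_obs:
        best_mean, best = -np.inf, None
        for j in range(X.shape[1]):           # try compressing each face
            xj = X[inside, j]
            for side, q in (("lo", np.quantile(xj, alpha)),
                            ("hi", np.quantile(xj, 1.0 - alpha))):
                keep = inside & ((X[:, j] >= q) if side == "lo" else (X[:, j] <= q))
                if min_obs <= keep.sum() < inside.sum() and y[keep].mean() > best_mean:
                    best_mean, best = y[keep].mean(), (j, side, q, keep)
        if best is None:                      # no admissible peel remains
            break
        j, side, q, keep = best
        if side == "lo":
            lo[j] = q                         # compress the chosen face
        else:
            hi[j] = q
        inside = keep
    return lo, hi, inside
```

Because each peel removes only a fraction α of the points, the box sequence shrinks slowly, which is the "patient" behavior contrasted with CART's halving splits.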

• Can handle categorical predictors, as CART does

• Designed for regression; two-class classification can be handled by coding the outcome as 0-1

• Non-trivial to deal with k>2 classes

• More "patient" than CART: each peel removes only a fraction α of the observations, rather than fragmenting the data with binary splits

9.4 MARS: Multivariate Adaptive Regression Splines

• Generalization of stepwise linear regression

• Modification of CART to improve regression performance

• Able to capture additive structure

• Not tree-based

• Basis built up from simple piecewise linear functions

• The set C is the candidate set of linear splines, with "knees" (knots) at each observed data value xij:

C = { (Xj - t)+ , (t - Xj)+ : t ∈ {x1j, x2j, ..., xNj}, j = 1, ..., p }

Models are built with elements from C or their products.

Model has the form:

f(X) = β0 + Σ(m=1..M) βm hm(X)

where each hm(X) is a function in C, or a product of two or more such functions

• Given a choice for the hm, the coefficients βm are chosen by standard linear regression (minimizing the residual sum of squares)

• At each stage, consider as a new basis function pair all products of a function already in the model set M with one of the reflected pairs in C; add the product pair that most reduces the residual error
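The building blocks are easy to write down; `reflected_pair` and `candidate_set` below are invented names for the hinge pair and the set C:

```python
import numpy as np

def reflected_pair(x, t):
    """The MARS reflected pair of piecewise-linear splines with a knee at t:
    (x - t)_+ and (t - x)_+."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

def candidate_set(X):
    """C: one reflected pair per (predictor j, observed value of X[:, j]),
    represented here as the list of (j, knot) locations."""
    return [(j, t) for j in range(X.shape[1]) for t in np.unique(X[:, j])]
```

The forward pass would multiply these hinges against terms already in the model, refit by least squares, and keep the product pair giving the largest drop in residual error.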

• Large models can overfit.

• Backward deletion procedure: at each step, delete the term whose removal causes the smallest increase in residual squared error, giving a sequence of models of decreasing size

• Pick the model using Generalized Cross-Validation:

GCV(λ) = Σi ( yi - fλ(xi) )² / ( 1 - M(λ)/N )²

• M(λ) is the effective number of parameters in the model: M(λ) = r + cK, where r is the number of linearly independent basis functions, K is the number of knots selected, and c = 3

• Choose the model along the backward sequence which minimizes GCV(λ)
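Given the fitted values, the GCV score is a one-liner; a sketch using the effective parameter count M(λ) = r + cK from the slide (the helper name `gcv` is mine):

```python
import numpy as np

def gcv(y, y_hat, r, K, c=3.0):
    """GCV(lambda) = RSS / (1 - M(lambda)/N)^2 with effective parameter
    count M(lambda) = r + c*K (r basis vectors, K knots, c = 3)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    N = len(y)
    M = r + c * K                         # effective number of parameters
    rss = ((y - y_hat) ** 2).sum()        # residual sum of squares
    return rss / (1.0 - M / N) ** 2
```

The denominator inflates the residual error of larger models, so minimizing GCV along the backward-deletion sequence trades fit against effective model size without a separate validation set.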

• Basis functions operate locally

• Forward modeling is hierarchical, multiway products are built up only from existing terms

• Each input appears only once in each product

• A useful option is to set a limit on the order of interaction. A limit of two allows only pairwise products; a limit of one results in an additive model

9.5 Hierarchical Mixtures of Experts (HME)

• Variant of tree-based methods

• Soft splits, not hard decisions

• At each node, an observation goes left or right with probability depending on its input values

• Smooth parameter optimization, instead of discrete split point search

• Linear (or logistic) regression model fit at each leaf node (Expert)

• Splits can be multi-way, instead of binary

• Splits are probabilistic functions of linear combinations of inputs (gating network), rather than functions of single inputs

• Formally a mixture model
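A one-level prediction can be sketched as a softmax-gated mixture of linear experts; the names `hme_predict`, `V`, and `experts` are mine, and training by EM (plus the deeper gating hierarchy) is not shown:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hme_predict(x, V, experts):
    """One-level mixture-of-experts prediction (illustrative sketch).

    Soft split: gate k receives probability g_k(x) = softmax(V @ x)_k, a
    probabilistic function of a linear combination of the inputs; each
    expert is a linear model, and the output is the mixture mean."""
    g = softmax(V @ x)                       # gating network probabilities
    mu = np.array([w @ x for w in experts])  # each expert's linear prediction
    return (g * mu).sum()                    # mixture mean
```

Because the gates are smooth in their parameters, the whole model can be fit by gradient methods or EM, in contrast to CART's discrete search over split points.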

9.6 Missing Data

• Quite common to have data with missing values for one or more input features

• Missing values may or may not distort the data

• For response vector y, let Xobs denote the observed entries of X, let Z = (X, y) and Zobs = (Xobs, y), and let R be an indicator matrix for the missing values

• Missing at Random (MAR): Pr(R | Z, θ) = Pr(R | Zobs, θ)

• Missing Completely at Random (MCAR): Pr(R | Z, θ) = Pr(R | θ)

• MCAR is a stronger assumption

Three approaches for handling MCAR data:

• Discard observations with missing features

• Rely on the learning algorithm to deal with missing values in its training phase

• Impute all the missing values before training

• If few values are missing, (1) may work

• For (2), CART can work well with missing values via surrogate splits. Additive models can assume average values.

• (3) is necessary for most algorithms. Simplest tactic is to use the mean or median.

• If features are correlated, can build predictive models for missing features in terms of known features
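A sketch of tactic (3): fall back to the column mean when nothing better is available, and otherwise regress each incomplete feature on the fully observed ones (the `impute` helper is illustrative, not from the text):

```python
import numpy as np

def impute(X):
    """Fill missing values (NaN) before training.  Columns with no fully
    observed companions get the column mean; otherwise each incomplete
    column is predicted by a linear model on the complete columns."""
    X = np.array(X, dtype=float)
    filled = X.copy()
    complete = ~np.isnan(X).any(axis=0)             # fully observed columns
    for j in np.where(~complete)[0]:
        miss = np.isnan(X[:, j])
        if complete.sum() == 0 or miss.all():
            filled[miss, j] = np.nanmean(X[:, j])   # fall back to the mean
            continue
        A = np.c_[np.ones(len(X)), X[:, complete]]  # design: intercept + observed cols
        coef, *_ = np.linalg.lstsq(A[~miss], X[~miss, j], rcond=None)
        filled[miss, j] = A[miss] @ coef            # model-based fill-in
    return filled
```

When features are correlated this exploits that structure, as the slide suggests; mean imputation is the degenerate case with no predictors.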

9.7 Computational Considerations

• For N observations and p predictors:

• Additive models: mp applications of a one-dimensional smoother over m backfitting cycles; for cubic smoothing splines, N log N to sort and N to fit each

• Trees: pN log N for the initial sorts, and typically another pN log N for the split computations

• MARS: Nm² + pmN operations to add a basis function to a model with m terms, so roughly NM³ + pM²N to build an M-term model