Additive Models and Trees


Slide 1

Additive Models and Trees

Lecture Notes for CMPUT 466/551

Nilanjan Ray

Principal Source: Department of Statistics, CMU

Slide 2

Topics to cover

  • GAM: Generalized Additive Models

  • CART: Classification and Regression Trees

  • MARS: Multivariate Adaptive Regression Splines

Slide 3

Generalized Additive Models

What is GAM?

$E(Y \mid X_1, X_2, \ldots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$

The functions $f_j$ are smoothing functions in general, such as splines, kernel functions, linear functions, and so on.

Each function could be different, e.g., $f_1$ can be linear, $f_2$ can be a natural spline, etc.

Compare GAM with Linear Basis Expansions (Ch. 5 of [HTF])

Similarities? Dissimilarities?

Any similarity (in principle) with Naïve Bayes model?

Slide 4

Smoothing Functions in GAM

  • Non-parametric functions (linear smoothers)

    • Smoothing splines (Basis expansion)

    • Simple k-nearest neighbor (raw moving average)

    • Locally weighted average by using kernel weighting

    • Local linear regression, local polynomial regression

  • Linear functions

  • Functions of more than one variable (interaction terms)

  • Example:

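Slide 5

Learning GAM: Backfitting

Backfitting algorithm

  • Initialize: $\hat{\alpha} = \frac{1}{N}\sum_{i=1}^{N} y_i$ and $\hat{f}_j \equiv 0$ for all j

  • Cycle: j = 1,2,…, p,…,1,2,…, p,…, (m cycles)

    $\hat{f}_j \leftarrow S_j\left[\{y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f}_k(x_{ik})\}_{i=1}^{N}\right]$, then $\hat{f}_j \leftarrow \hat{f}_j - \frac{1}{N}\sum_{i=1}^{N} \hat{f}_j(x_{ij})$, where $S_j$ is the smoother chosen for input j

    Until the functions change less than a prespecified threshold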
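
The backfitting cycle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the smoother is a k-nearest-neighbor running mean (one of the linear smoothers listed on Slide 4), and the names `running_mean_smoother` and `backfit`, the toy data, and the defaults `k=5`, `tol`, and `max_cycles` are hypothetical choices, not part of the lecture.

```python
import math


def running_mean_smoother(x, r, k=5):
    """Linear smoother: the fit at x[i] is the mean of r over the k nearest x values."""
    n = len(x)
    fitted = []
    for i in range(n):
        nearest = sorted(range(n), key=lambda m: abs(x[m] - x[i]))[:k]
        fitted.append(sum(r[m] for m in nearest) / len(nearest))
    return fitted


def backfit(columns, y, k=5, tol=1e-8, max_cycles=100):
    """columns[j][i] is input j of observation i; returns (alpha, f) with f[j][i] = f_j(x_ij)."""
    n, p = len(y), len(columns)
    alpha = sum(y) / n                    # initialize: alpha = mean of y
    f = [[0.0] * n for _ in range(p)]     # initialize: every f_j is zero
    for _ in range(max_cycles):
        largest_change = 0.0
        for j in range(p):
            # Partial residuals: remove alpha and all other fitted functions.
            partial = [y[i] - alpha - sum(f[m][i] for m in range(p) if m != j)
                       for i in range(n)]
            new_fj = running_mean_smoother(columns[j], partial, k)
            centre = sum(new_fj) / n
            new_fj = [v - centre for v in new_fj]   # keep each f_j centred at zero
            largest_change = max(largest_change,
                                 max(abs(a - b) for a, b in zip(new_fj, f[j])))
            f[j] = new_fj
        if largest_change < tol:          # functions changed less than the threshold
            break
    return alpha, f


# Toy additive data (assumed for illustration): y = sin(3*x1) + 2*x2
n = 30
x1 = [i / (n - 1) for i in range(n)]
x2 = [((7 * i) % n) / (n - 1) for i in range(n)]
y = [math.sin(3 * a) + 2 * b for a, b in zip(x1, x2)]
alpha, f = backfit([x1, x2], y)
fitted = [alpha + f[0][i] + f[1][i] for i in range(n)]
```

Swapping `running_mean_smoother` for a different smoother per input is what makes the algorithm modular.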

Slide 6

Backfitting: Points to Ponder

Computational Advantage?


How to choose fitting functions?

Slide 7

Example: Generalized Logistic Regression


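Slide 8

Additive Logistic Regression: Backfitting

Fitting logistic regression (p. 99) and fitting additive logistic regression (p. 262) follow the same steps and differ only in step 2c:

1. Compute starting values, e.g. $\hat{\alpha} = \log[\bar{y}/(1-\bar{y})]$ with all other terms set to zero, where $\bar{y}$ is the sample proportion of ones

2. Iterate:

a. Construct the working response $z_i = \hat{\eta}_i + (y_i - \hat{p}_i)/[\hat{p}_i(1-\hat{p}_i)]$, where $\hat{\eta}_i$ is the current fitted logit and $\hat{p}_i = 1/(1+e^{-\hat{\eta}_i})$

b. Construct the weights $w_i = \hat{p}_i(1-\hat{p}_i)$

c. Logistic regression: use weighted least squares to fit a linear model to $z_i$ with weights $w_i$, giving new estimates of the coefficients. Additive logistic regression: use the weighted backfitting algorithm to fit an additive model to $z_i$ with weights $w_i$, giving new estimates of $\hat{\alpha}$ and the $\hat{f}_j$

3. Repeat step 2 until convergence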
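
As a rough illustration of the iteration for the linear case, the sketch below fits a one-predictor logistic regression by iteratively reweighted least squares. The function name `irls_logistic`, the toy data, and the stopping rule are assumptions made for this example. The additive version would replace the weighted least squares step with weighted backfitting.

```python
import math


def irls_logistic(x, y, n_iter=25, tol=1e-10):
    """Fit logit P(y = 1 | x) = a + b * x by iteratively reweighted least squares."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        eta = [a + b * xi for xi in x]
        p = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [pi * (1.0 - pi) for pi in p]                       # weights w_i
        z = [e + (yi - pi) / wi                                 # working response z_i
             for e, yi, pi, wi in zip(eta, y, p, w)]
        # Weighted least squares of z on (1, x) in closed form.
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        zbar = sum(wi * zi for wi, zi in zip(w, z)) / sw
        sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
        sxz = sum(wi * (xi - xbar) * (zi - zbar) for wi, xi, zi in zip(w, x, z))
        b_new = sxz / sxx
        a_new = zbar - b_new * xbar
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b


# Toy data (assumed): the classes overlap, so the maximum likelihood estimate exists.
x_demo = [0, 1, 2, 3, 4, 5, 6, 7]
y_demo = [0, 0, 1, 0, 1, 0, 1, 1]
a_hat, b_hat = irls_logistic(x_demo, y_demo)
```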

Slide 9

SPAM Detection via Additive Logistic Regression

  • Input variables (predictors):

    • 48 quantitative variables: percentage of words in the email that match a given word. Examples include business, address, internet, etc.

    • 6 quantitative variables: percentage of characters in the email that match a given character, such as ch;, ch(, etc.

    • The average length of uninterrupted sequences of capital letters

    • The length of the longest uninterrupted sequence of capital letters

    • The sum of the lengths of uninterrupted sequences of capital letters

  • Output variable: SPAM (1) or Email (0)

  • The $f_j$'s are taken as cubic smoothing splines

Slide 10

SPAM Detection: Results

Sensitivity: probability of predicting spam given that the true state is spam

Specificity: probability of predicting email given that the true state is email
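
Both quantities are simple ratios of confusion-matrix counts. The sketch below shows the arithmetic. The function name and the counts are hypothetical and are not the results reported in the lecture.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """tp/fn: spam predicted as spam/email; tn/fp: email predicted as email/spam."""
    sensitivity = tp / (tp + fn)   # P(predict spam | true spam)
    specificity = tn / (tn + fp)   # P(predict email | true email)
    return sensitivity, specificity


# Hypothetical counts, for illustration only.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=180, fp=20)
```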

Slide 11

GAM: Summary

  • Useful flexible extensions of linear models

  • Backfitting algorithm is simple and modular

  • Interpretability of the predictors (input variables) is not obscured

  • Not suitable for very large data mining applications (why?)

Slide 12


  • Overview

    • Principle behind: Divide and conquer

    • Partition the feature space into a set of rectangles

      • For simplicity, use recursive binary partition

    • Fit a simple model (e.g. constant) for each rectangle

    • Classification and Regression Trees (CART)

      • Regression Trees

      • Classification Trees

    • Popular in medical applications

Slide 13


  • An example (in regression case):

Slide 14

Basic Issues in Tree-based Methods

  • How to grow a tree?

  • How large should we grow the tree?

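Slide 15

Regression Trees

  • Partition the space into M regions: $R_1, R_2, \ldots, R_M$, and model the response as a constant $c_m$ in each region:

    $f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$, with $\hat{c}_m = \mathrm{ave}(y_i \mid x_i \in R_m)$

Note that this is still an additive model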

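Slide 16

Regression Trees: Growing the Tree

  • The best partition minimizes the sum of squared errors: $\sum_{i=1}^{N} (y_i - f(x_i))^2$

  • Finding the global minimum is computationally infeasible

  • Greedy algorithm: at each level choose the splitting variable j and split value s as:

    $\min_{j,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)} (y_i - c_2)^2\right]$, where $R_1(j,s) = \{X \mid X_j \le s\}$ and $R_2(j,s) = \{X \mid X_j > s\}$

  • The greedy algorithm makes the tree unstable

    • An error made at an upper level is propagated to the lower levels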
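
One greedy step can be sketched as an exhaustive search over variables and candidate split points. The helper names `sse` and `best_split`, the use of midpoints between adjacent observed values as candidates, and the toy data are assumptions for illustration.

```python
def sse(values):
    """Sum of squared errors around the mean (the best constant fit)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """X is a list of rows. Returns (j, s, cost) minimizing SSE(left) + SSE(right)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate split points: midpoints between adjacent observed values.
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            cost = sse(left) + sse(right)
            if best is None or cost < best[2]:
                best = (j, s, cost)
    return best


# Toy data (assumed): the response jumps when the first input exceeds 3.5.
X_demo = [[1, 5], [2, 1], [3, 4], [4, 2], [5, 3], [6, 0]]
y_demo = [0, 0, 0, 10, 10, 10]
j_best, s_best, cost_best = best_split(X_demo, y_demo)
```

Growing a full tree would apply `best_split` recursively to the two resulting regions.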

Slide 17

Regression Trees: How Large Should We Grow the Tree?

  • Trade-off between bias and variance

    • Very large tree: overfit (low bias, high variance)

    • Small tree (low variance, high bias): might not capture the structure

  • Strategies:

    • 1: split only when we can decrease the error (usually short-sighted)

    • 2: Cost-complexity pruning (preferred)

Slide 18

Regression Tree - Pruning

  • Cost-complexity pruning:

    • Pruning: collapsing some internal nodes

    • Cost complexity: $C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$, where $|T|$ is the number of leaves, $N_m$ the number of observations in leaf m, and $Q_m(T)$ the mean squared error in leaf m

      • First term: cost (sum of squared errors)

      • Second term: penalty on the complexity/size of the tree

    • Choose best alpha: weakest link pruning (p.270, [HTF])

      • Each time, collapse the internal node that adds the smallest error

      • Choose from this tree sequence the best one by cross-validation
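
To see how alpha trades fit against size, the sketch below scores a made-up sequence of pruned subtrees. The (leaves, SSE) pairs and the function names are hypothetical. In practice the sequence comes from weakest link pruning and alpha is chosen by cross-validation, which is not shown here.

```python
def cost_complexity(sse, n_leaves, alpha):
    """Cost: sum of squared errors. Penalty: alpha times the number of leaves."""
    return sse + alpha * n_leaves


def best_subtree(subtrees, alpha):
    """subtrees is a list of (n_leaves, sse) pairs; return the pair with the lowest criterion."""
    return min(subtrees, key=lambda t: cost_complexity(t[1], t[0], alpha))


# Hypothetical pruning sequence: larger trees fit better but pay a larger penalty.
sequence = [(1, 100.0), (2, 60.0), (4, 40.0), (8, 35.0)]
choice_small_alpha = best_subtree(sequence, alpha=0.0)    # no penalty on size
choice_mid_alpha = best_subtree(sequence, alpha=5.0)
choice_large_alpha = best_subtree(sequence, alpha=50.0)   # heavy penalty on size
```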

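Slide 19

Classification Trees

  • Classify the observations in node m to the majority class in the node: $k(m) = \arg\max_k \hat{p}_{mk}$

    • $\hat{p}_{mk} = \frac{1}{N_m}\sum_{x_i \in R_m} I(y_i = k)$ is the proportion of observations of class k in node m

  • Define impurity for a node:

    • Misclassification error: $1 - \hat{p}_{m k(m)}$

    • Entropy: $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$

    • Gini index: $\sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$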
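
The three impurity measures are easy to compare numerically. A minimal sketch, assuming the class proportions of a node are given as a list that sums to one (the function names are illustrative):

```python
import math


def misclassification_error(proportions):
    return 1.0 - max(proportions)


def entropy(proportions):
    # Terms with zero proportion contribute nothing (0 * log 0 is taken as 0).
    return -sum(p * math.log(p) for p in proportions if p > 0)


def gini_index(proportions):
    return sum(p * (1.0 - p) for p in proportions)


# A maximally mixed two-class node and a pure node.
mixed = [0.5, 0.5]
pure = [1.0, 0.0]
mixed_scores = (misclassification_error(mixed), entropy(mixed), gini_index(mixed))
pure_scores = (misclassification_error(pure), entropy(pure), gini_index(pure))
```

All three measures are zero for a pure node and largest when the classes are evenly mixed.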

Slide 20

Classification Trees

  • Entropy and Gini are more sensitive to changes in the node probabilities than the misclassification rate

  • To grow the tree: use Entropy or Gini

  • To prune the tree: use Misclassification rate (or any other method)

Figure: node impurity measures versus class proportion for a 2-class problem

Slide 21

Tree-based Methods: Discussions

  • Categorical Predictors

    • Problem: consider splitting subtree t into $t_L$ and $t_R$ based on a categorical predictor x with q possible values: there are $2^{q-1} - 1$ possible splits!

    • Treat the categorical predictor as ordered by, say, the proportion of class 1 in each category
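
The gap between the two search sizes can be checked directly. The sketch below enumerates all binary partitions of a small category set and contrasts them with the q - 1 splits left after ordering. The function names, category labels, and class-1 proportions are made up for illustration.

```python
from itertools import combinations


def all_binary_partitions(categories):
    """Every way to split the categories into two non-empty groups (mirror images counted once)."""
    cats = list(categories)
    first, rest = cats[0], cats[1:]
    partitions = []
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            left = {first, *extra}          # fixing `first` on the left avoids double counting
            right = set(cats) - left
            if right:
                partitions.append((left, right))
    return partitions


def ordered_splits(class1_proportion):
    """Order categories by proportion of class 1, then split only between neighbours."""
    order = sorted(class1_proportion, key=class1_proportion.get)
    return [(set(order[:i]), set(order[i:])) for i in range(1, len(order))]


# Hypothetical class-1 proportions for q = 4 categories.
proportions = {"a": 0.10, "b": 0.80, "c": 0.35, "d": 0.60}
exhaustive = all_binary_partitions(proportions)   # 2**(q-1) - 1 candidates
ordered = ordered_splits(proportions)             # q - 1 candidates
```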

Slide 22

Tree-based Methods: Discussions

  • Linear Combination Splits

    • Split the node based on $\sum_j a_j X_j \le s$

    • Improve the predictive power

    • Hurt interpretability

  • Instability of Trees

    • Inherited from the hierarchical nature

    • Bagging (section 8.7 of [HTF]) can reduce the variance

Slide 23

Bootstrap Trees

Construct B trees from B bootstrap samples (bootstrap trees)

Slide 24

Bootstrap Trees

Slide 25

Bagging The Bootstrap Trees

$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)$

where $\hat{f}^{*b}(x)$ is computed from the bth bootstrap sample (in this case a tree)

Bagging reduces the variance of the original tree by aggregation
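
A minimal bagging sketch, assuming the base learner is a one-split regression tree (a stump) on a single input. The names `fit_stump` and `bagged_stumps`, the default B = 25, and the toy step data are assumptions for this example. Full trees would be bagged in the same way.

```python
import random


def fit_stump(x, y):
    """Fit a one-split regression tree on a single input; returns a predict function."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        s = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= s]
        right = [yi for xi, yi in zip(x, y) if xi > s]
        c_left, c_right = sum(left) / len(left), sum(right) / len(right)
        cost = (sum((v - c_left) ** 2 for v in left)
                + sum((v - c_right) ** 2 for v in right))
        if best is None or cost < best[0]:
            best = (cost, s, c_left, c_right)
    if best is None:                       # all x identical: fall back to a constant fit
        constant = sum(y) / len(y)
        return lambda x0: constant
    _, s, c_left, c_right = best
    return lambda x0: c_left if x0 <= s else c_right


def bagged_stumps(x, y, B=25, seed=0):
    """Average the predictions of B stumps, each fitted to one bootstrap sample."""
    rng = random.Random(seed)
    n = len(x)
    models = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # sample n indices with replacement
        models.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return lambda x0: sum(m(x0) for m in models) / B


# Toy step function (assumed): y jumps from 0 to 1 at x = 10.
x_demo = list(range(20))
y_demo = [0.0] * 10 + [1.0] * 10
f_bag = bagged_stumps(x_demo, y_demo)
```

Each bootstrap stump places its split at a slightly different point, so the bagged prediction is a smoothed version of the single-tree step.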

Slide 26

Bagged Tree Performance

Majority vote


Slide 27


  • In multi-dimensional splines the number of basis functions grows exponentially with the dimension (curse of dimensionality)

  • A partial remedy is a greedy forward search algorithm

    • Create a simple basis-construction dictionary

    • Construct basis functions on-the-fly

    • Choose the best-fit basis function at each step

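Slide 28

Basis functions

  • 1-dim linear spline (t represents the knot): $(x - t)_+ = \max(x - t, 0)$ and $(t - x)_+ = \max(t - x, 0)$

  • Basis collection C: $C = \{(X_j - t)_+, (t - X_j)_+\}$ for $t \in \{x_{1j}, x_{2j}, \ldots, x_{Nj}\}$ and $j = 1, 2, \ldots, p$

    |C| = 2 * N * p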
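
The reflected pair and the size of the collection can be illustrated as follows. This is a sketch in which `hinge_pair` and `build_collection` are illustrative names and the data matrix is made up.

```python
def hinge_pair(t):
    """Return the reflected pair (x - t)_+ and (t - x)_+ for knot t."""
    positive = lambda x: max(x - t, 0.0)
    negative = lambda x: max(t - x, 0.0)
    return positive, negative


def build_collection(X):
    """One reflected pair per input j and per observed value x_ij used as a knot."""
    collection = []
    for j in range(len(X[0])):
        for row in X:
            t = row[j]
            collection.append((j, t, +1))   # stands for (X_j - t)_+
            collection.append((j, t, -1))   # stands for (t - X_j)_+
    return collection


pos, neg = hinge_pair(0.5)
# Hypothetical data: N = 4 observations, p = 2 inputs.
X_demo = [[0.1, 2.0], [0.4, 1.5], [0.7, 3.0], [0.9, 0.5]]
C = build_collection(X_demo)
```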

Slide 29

The MARS procedure (1st stage)

  1. Initialize the basis set M with a constant function

  2. Form candidates (cross-product of M with the set C)

  3. Add the best-fit basis pair (the one that decreases the residual error the most) into M

  4. Repeat from step 2 (until e.g. |M| >= threshold)

M (new) = M (old) plus the selected basis pair

Slide 30

The MARS procedure (2nd stage)

The final model M typically overfits the data

=> Need to reduce the model size (# of terms)

Backward deletion procedure

  1. Remove the term whose removal causes the smallest increase in residual error

  2. Compute the GCV of the reduced model

  3. Repeat from step 1

    Choose the model size with minimum GCV.

Slide 31

Generalized Cross Validation (GCV)

$GCV(\lambda) = \frac{\sum_{i=1}^{N} (y_i - \hat{f}_\lambda(x_i))^2}{(1 - M(\lambda)/N)^2}$

  • $M(\lambda)$ measures the effective # of parameters: $M(\lambda) = r + cK$

    • r: # of linearly independent basis functions

    • K: # of knots selected

    • c = 3
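
As a small numeric illustration, the sketch below evaluates the criterion for one candidate model, taking the effective number of parameters as r + c*K with c = 3. The function name `gcv` and the numbers are made up for this example.

```python
def gcv(y, fitted, r, K, c=3.0):
    """Residual sum of squares inflated by the effective number of parameters r + c*K."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    effective_params = r + c * K
    return rss / (1.0 - effective_params / n) ** 2


# Hypothetical model: 20 observations, every residual equal to 0.1,
# r = 3 linearly independent basis functions and K = 1 knot.
y_demo = [float(i) for i in range(20)]
fitted_demo = [yi + 0.1 for yi in y_demo]
score = gcv(y_demo, fitted_demo, r=3, K=1)
```

In the backward deletion stage this score would be computed for each model size and the smallest value selected.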

Slide 32

Discussion

  • Piecewise linear reflected basis

    • Allow operation on local region

    • Fitting N reflected basis pairs takes O(N) instead of O(N^2)

      • Left-part is zero, right-part differs by a constant





Slide 33

Discussion (continued)

  • Hierarchical model (reduce search computation)

    • High-order term exists => some lower-order “footprints” exist

  • Restriction: Each input appears at most once in a product:

    e.g. (Xj - t1) * (Xj - t1) is not considered

  • Set upper limit on order of interaction

    • Upper limit of 1 => additive model

  • MARS for classification

    • Use multi-response Y (N*K indicator matrix)

    • Masking problem may occur

    • Better solution: “optimal scoring” (Chapter 12.5 of [HTF])

Slide 34

MARS & CART relationship

If we:

  • replace the piecewise linear basis by step functions

  • keep only the newly formed product terms in M (leaf nodes of a binary tree)

then:

    MARS forward procedure

    = CART tree growing procedure
