
Implementation In Tree


Presentation Transcript


  1. Implementation In Tree Stat 6601 November 24, 2004 Bin Hu, Philip Wong, Yu Ye

  2. Data Background
  • From the SPSS Answer Tree program, we use its credit scoring example
  • There are 323 data points
  • The target variable is credit ranking (good [48%], bad [52%])
  • The four predictor variables are:
    – age, categorical (young [58%], middle [24%], old [18%])
    – has AMEX card (yes [48%], no [52%])
    – paid weekly/monthly (weekly pay [51%], monthly salary [49%])
    – social class (management [12%], professional [49%], clerical [15%], skilled [13%], unskilled [12%])

  3. Data Background
  • It is useful to see how the target variable is distributed by each of the predictor variables; a quick cross-tabulation, as sketched below, makes this visible
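A minimal sketch of such a cross-tabulation in R, assuming the credit_data frame and the column names used in the R code on slide 9:

    # Distribution of the target within each level of one predictor, e.g. AGE
    # (assumes credit_data has columns CREDIT_R and AGE; names are from slide 9)
    tab <- table(credit_data$AGE, credit_data$CREDIT_R)
    prop.table(tab, margin = 1)   # share of good/bad credit within each age group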

  4. Data Background
  • Correlation Matrix (Pearson correlation coefficients, N = 323; Prob > |r| under H0: Rho = 0):

                 CREDIT_R   PAY_WEEK   AGE        AMEX
      CREDIT_R   1.00000    0.70885    0.66273    0.02653
                            <.0001     <.0001     0.6348
      PAY_WEEK   0.70885    1.00000    0.51930    0.08292
                 <.0001                <.0001     0.1370
      AGE        0.66273    0.51930    1.00000    -0.00172
                 <.0001     <.0001                0.9755
      AMEX       0.02653    0.08292    -0.00172   1.00000
                 0.6348     0.1370     0.9755
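The slide's matrix appears to come from SAS; a rough R equivalent, assuming the four variables are numerically coded in credit_data:

    # Pearson correlations among the numerically coded variables
    vars <- credit_data[, c("CREDIT_R", "PAY_WEEK", "AGE", "AMEX")]
    round(cor(vars, method = "pearson"), 5)

    # p-value ("Prob > |r|") for a single pair, e.g. CREDIT_R vs. PAY_WEEK
    cor.test(vars$CREDIT_R, vars$PAY_WEEK)$p.value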

  5. Objective
  • To create a predictive model of good credit risks
  • To assess the performance of the model, we randomly split the data into two parts: a training set to develop the model (60%) and the rest (40%) to validate (a sketch of such a split appears below)
  • This is done to avoid possible "overfitting", since the validation set was not involved in deriving the model
  • Using the same data, we compare the results from R's Tree, Answer Tree's CART, and SAS's Proc Logistic
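A minimal sketch of a 60/40 random split in R; the seed is arbitrary and this is not necessarily the exact split the authors used:

    set.seed(6601)                           # arbitrary seed, for reproducibility
    n <- nrow(credit_data)                   # 323 rows in the full data set
    train_idx  <- sample(n, size = round(0.6 * n))
    training   <- credit_data[train_idx, ]   # 60% to develop the model
    validation <- credit_data[-train_idx, ]  # 40% held out to validate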

  6. Logistic Regression
  • Let x be a vector of explanatory variables
  • Let y be a binary target variable (0 or 1)
  • p = Pr(Y = 1 | x) is the target probability
  • The linear logistic model has the form logit(p) = log(p / (1 − p)) = α + β′x
  • The predicted probability is phat = 1 / (1 + exp(−(α + β′x)))
  • Note that the range of p is (0, 1), but logit(p) ranges over the whole real line
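The predicted-probability formula is just the inverse logit; a small sketch in R, using the coefficients reported on slide 7:

    # Inverse logit: map a linear predictor eta to a probability in (0, 1)
    inv_logit <- function(eta) 1 / (1 + exp(-eta))

    # With alpha = 1.5856 and beta = -3.6066 (slide 7), PAY_WEEK coded 1 = weekly:
    inv_logit(1.5856 - 3.6066 * 1)   # weekly pay   -> about 0.117
    inv_logit(1.5856 - 3.6066 * 0)   # monthly pay  -> about 0.830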

  7. Logistic Results
  • Using the training set, the maximum likelihood estimation failed to converge when the social class and age variables were included
  • Only the paid weekly/monthly and has AMEX card variables could be estimated
  • The AMEX variable was highly insignificant and so was dropped
  • Apparently, the tree algorithm does a better job of handling all the variables
  • SAS output of the model:

                                  Standard   Wald
      Parameter   DF   Estimate   Error      Chi-Square   Pr > ChiSq
      Intercept    1    1.5856    0.2662      35.4756     <.0001
      PAY_WEEK 1   1   -3.6066    0.4169      74.8285     <.0001

  • So the odds of a weekly-pay person being a good risk, relative to a monthly-salary person, are exp(−3.6066) ≈ 0.027 to 1, or roughly 37 to 1 against
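For comparison, the same single-variable model can be fit in R with glm; this is a sketch assuming the training frame from the split above, with CREDIT_R coded 1 = good and PAY_WEEK coded 1 = weekly pay:

    # Logistic fit in R, analogous to the SAS output above
    logit_fit <- glm(CREDIT_R ~ PAY_WEEK, data = training, family = binomial)
    summary(logit_fit)

    # Odds ratio for weekly pay vs. monthly salary
    exp(coef(logit_fit)["PAY_WEEK"])   # about 0.027, i.e. roughly 37 to 1 against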

  8. Validation Results
  • With only one variable in our predictive model, there are only two possible predicted probabilities: 0.117 and 0.830
  • Taking the higher probability as predicting a "good" account, our results are below:

                             Validation Set              Training Set
                          Actual Bad  Actual Good   Actual Bad  Actual Good
      Predicted Bad           60          11            83          11
      Predicted Good           8          50            17          83
      Percent Agreement          85.3%                     85.6%

  • The better measure is the validation-set result. Note that the two results are very similar, so overfitting does not appear to be a problem
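A sketch of how the percent agreement could be computed in R, continuing from the glm fit above (column names assumed as before):

    # Confusion matrix and percent agreement on the validation set
    pred_prob <- predict(logit_fit, newdata = validation, type = "response")
    pred_good <- as.numeric(pred_prob > 0.5)   # 0.830 -> "good", 0.117 -> "bad"
    conf <- table(Predicted = pred_good, Actual = validation$CREDIT_R)
    sum(diag(conf)) / sum(conf)                # about 0.853 on the slide's counts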

  9. Growing a Tree in R (based on training data)

    > credit_data <- read.csv(file = "training.csv")
    > library(tree)
    > # CREDIT_R should be a factor so that tree() grows a classification tree
    > credit_tree <- tree(CREDIT_R ~ CLASS + PAY_METHOD + AGE + AMEX,
    +                     data = credit_data, split = "gini")
    > tree.pr <- prune.tree(credit_tree)    # cost-complexity pruning sequence
    > plot(tree.pr)                         # figure 1
    > plot(credit_tree, type = "u"); text(credit_tree, pretty = 0)   # figure 2
    > tree.1 <- prune.tree(credit_tree, best = 5)   # prune to 5 terminal nodes
    > plot(tree.1, type = "u"); text(tree.1, pretty = 0)   # figures 3, 4, 5
    > summary(tree.1)
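To score the held-out data with the pruned tree (cf. the crosstabulation on slide 20), a hedged sketch; the file name validation.csv and the factor coding of CREDIT_R are assumptions:

    > valid_data <- read.csv(file = "validation.csv")   # assumed file name
    > tree_pred  <- predict(tree.1, newdata = valid_data, type = "class")
    > table(Predicted = tree_pred, Actual = valid_data$CREDIT_R)   # cf. slide 20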

  10. Figure 1

  11. Figure 2

  12. Figure 3

  13. Figure 4

  14. Figure 5

  15. Tree Based on Validation Data

  16. Implementing using SPSS ANSWER TREE: Training sample – C&RT (min impurity change = .01)

  17. Implementing using SPSS ANSWER TREE: Training sample – CHAID (Pearson chi-square, p = .05)

  18. Summary of classification for training data

  19. Summary of validation data grouped by training data classification

  20. Crosstabulation of predicted and actual classification

  21. Summary of Results
  • Similar trees were generated by R and SPSS ANSWER TREE
  • Similar results were derived using different tree-generation methods – C&RT and CHAID
  • The classification tree has a higher percentage of agreement between predicted and actual values than logistic regression on the training data
  • Using the grouping criteria derived from the training data, logistic regression has a higher percentage of agreement than the classification tree on the validation data

  22. Conclusion
  • A classification tree is a non-parametric method that selects predictive variables sequentially and groups cases into homogeneous clusters to derive the highest predictive probability
  • Classification trees can be implemented in different software packages and with different tree-growing methodologies
  • Classification trees normally perform better than parametric models on training data, with a higher percentage of agreement between predicted and actual values
  • Classification trees have special advantages in industries such as credit cards and marketing research by 1) grouping individuals into homogeneous clusters and 2) assigning not only the predicted values, but also the probability of prediction error

  23. Conclusion – cont'd
  • As a non-parametric method, no functional form is specified and no parameters are estimated or tested
  • As shown in this small study, the lower percentage of agreement on the validation data suggests that "overfitting" might be a potential problem for classification trees
