
More Bayes, Interpreting Regression, Decision trees, and Cross-validation


Presentation Transcript


  1. More Bayes, Interpreting Regression, Decision trees, and Cross-validation Peter Fox Data Analytics – ITWS-4600/ITWS-6600/MATP-4450 Group 2 Module 7, February 20, 2018

  2. Contents

  3. Numeric v. non-numeric

  4. In R – data frame and types • On input, R almost always sees categorical data as "strings" (type "character") • You can test for membership (as a type) using is.<x> (x = numeric, character, factor, etc.) • You can "coerce" it (i.e. change the type) using as.<x> (same x) • To tell R you have categorical types (also called enumerated types), declare them as what R calls a "factor"…. Thus – as.factor()
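A minimal sketch of testing and coercing a categorical column (the vector here is hypothetical):

sex <- c("F", "I", "M", "F")   # hypothetical column, read in as character
is.character(sex)              # TRUE
is.factor(sex)                 # FALSE
sex <- as.factor(sex)          # coerce to a categorical (factor) type
is.factor(sex)                 # TRUE
levels(sex)                    # "F" "I" "M"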

  5. In R factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA), ordered(x, ...), is.factor(x), is.ordered(x), as.factor(x), as.ordered(x), addNA(x, ifany = FALSE) • levels - the values that x might have taken • labels - either an optional vector of labels for the levels (in the same order as levels after removing those in exclude), or a character string of length 1 • exclude - values to be excluded when forming the set of levels • ordered - logical flag: should the levels be regarded as ordered (in the order given)? • nmax - an upper bound on the number of levels • ifany - (for addNA) only add an NA level if one actually occurs in x
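A short illustration of these arguments (the values are made up):

x <- c("low", "high", "medium", "high", "low")
f <- factor(x, levels = c("low", "medium", "high"))                  # fix the level order
o <- factor(x, levels = c("low", "medium", "high"), ordered = TRUE)  # ordered: low < medium < high
g <- factor(c("a", "b", NA, "a"))
addNA(g, ifany = TRUE)   # adds NA as an explicit level only because an NA is present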

  6. Relate to the datasets… • Abalone - Sex = {F, I, M} • Eye color? • EPI - GeoRegions? • Sample v. population – levels and names IN the dataset versus all possible levels/names
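For example, for the abalone data (a sketch; assumes the data frame is named abalone and Sex was read in as character):

abalone$Sex <- as.factor(abalone$Sex)
levels(abalone$Sex)    # levels seen in this sample, e.g. "F" "I" "M"
table(abalone$Sex)     # counts per level
# a sample may be missing levels that exist in the population; declare the full
# set explicitly with factor(abalone$Sex, levels = c("F", "I", "M"))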

  7. Digging into iris
require(e1071)   # provides naiveBayes()
classifier <- naiveBayes(iris[,1:4], iris[,5])
table(predict(classifier, iris[,-5]), iris[,5], dnn=list('predicted','actual'))
classifier$apriori
classifier$tables$Petal.Length
plot(function(x) dnorm(x, 1.462, 0.1736640), 0, 8, col="red", main="Petal length distribution for the 3 different species")
curve(dnorm(x, 4.260, 0.4699110), add=TRUE, col="blue")
curve(dnorm(x, 5.552, 0.5518947), add=TRUE, col="green")

  8. http://www.ugrad.stat.ubc.ca/R/library/mlbench/html/HouseVotes84.html
require(mlbench)
data(HouseVotes84)
model <- naiveBayes(Class ~ ., data = HouseVotes84)
predict(model, HouseVotes84[1:10,-1])
predict(model, HouseVotes84[1:10,-1], type = "raw")
pred <- predict(model, HouseVotes84[,-1])
table(pred, HouseVotes84$Class)

  9. nbayes1 > table(pred, HouseVotes84$Class)
pred         democrat republican
  democrat        238         13
  republican       29        155

  10. > predict(model, HouseVotes84[1:10,-1], type = "raw")
          democrat   republican
 [1,] 1.029209e-07 9.999999e-01
 [2,] 5.820415e-08 9.999999e-01
 [3,] 5.684937e-03 9.943151e-01
 [4,] 9.985798e-01 1.420152e-03
 [5,] 9.666720e-01 3.332802e-02
 [6,] 8.121430e-01 1.878570e-01
 [7,] 1.751512e-04 9.998248e-01
 [8,] 8.300100e-06 9.999917e-01
 [9,] 8.277705e-08 9.999999e-01
[10,] 1.000000e+00 5.029425e-11

  11. NYC Housing data • http://aquarius.tw.rpi.edu/html/DA/NYhousing/rollingsales_bronx.xls
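Before the regressions on the next slides, the spreadsheet has to be read in and its text columns cleaned; a hedged sketch of one way to do that (the gdata/read.xls route and the gsub cleanup are assumptions, not necessarily the exact lab code):

library(gdata)                                   # read.xls()
bronx <- read.xls("rollingsales_bronx.xls",
                  pattern = "BOROUGH", stringsAsFactors = FALSE)
# prices and square footages typically arrive as text with commas; strip non-digits
bronx$SALE.PRICE        <- as.numeric(gsub("[^0-9]", "", bronx$SALE.PRICE))
bronx$GROSS.SQUARE.FEET <- as.numeric(gsub("[^0-9]", "", bronx$GROSS.SQUARE.FEET))
bronx$LAND.SQUARE.FEET  <- as.numeric(gsub("[^0-9]", "", bronx$LAND.SQUARE.FEET))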

  12. Bronx 1 = Regression > plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE) ) > m1<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET),data=bronx) You were reminded that log(0) is … not fun  THINK through what you are doing… Filtering is somewhat inevitable: > bronx<-bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),] > m1<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET),data=bronx) • Lab5b_bronx1.R

  13. Interpreting this! Call: lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx) Residuals: Min 1Q Median 3Q Max -14.4529 0.0377 0.4160 0.6572 3.8159 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.0271 0.3088 22.75 <2e-16 *** log(GROSS.SQUARE.FEET) 0.7013 0.0379 18.50 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.95 on 2435 degrees of freedom Multiple R-squared: 0.1233, Adjusted R-squared: 0.1229 F-statistic: 342.4 on 1 and 2435 DF, p-value: < 2.2e-16
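One way to read these numbers (a worked interpretation, not part of the original slide): because both sides are logged, the slope is an elasticity, so a 1% increase in gross square feet is associated with roughly a 0.70% increase in sale price, and gross square footage alone explains only about 12% of the variance in log price (R-squared 0.12). A back-of-envelope fitted value:

exp(7.0271 + 0.7013 * log(2000))   # ~ $230,000 for a 2,000 sq ft property (ignores retransformation bias)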

  14. Plots – tell me what they tell you!

  15. Solution model 2 > m2<-lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx) > summary(m2) > plot(resid(m2)) # > m2a<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD),data=bronx) > summary(m2a) > plot(resid(m2a))

  16. How do you interpret this residual plot?

  17. Solution model 3 and 4 > m3<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)+factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx) > summary(m3) > plot(resid(m3)) # > m4<-lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)*factor(bronx$BUILDING.CLASS.CATEGORY),data=bronx) > summary(m4) > plot(resid(m4))

  18. And this one?

  19. Bronx 2 = complex example • See lab1_bronx2.R • Manipulation • Mapping • knn • kmeans • Needed for A4!! Due March 2.
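A hedged sketch of what the knn and kmeans steps in lab1_bronx2.R might look like (the feature choice, the 70/30 split, k = 5, and the use of NEIGHBORHOOD as the class label are all assumptions):

library(class)                                        # knn()
feats <- scale(bronx[, c("GROSS.SQUARE.FEET", "LAND.SQUARE.FEET")])
idx   <- sample(nrow(bronx), floor(0.7 * nrow(bronx)))   # assumed 70/30 split
pred  <- knn(train = feats[idx, ], test = feats[-idx, ],
             cl = factor(bronx$NEIGHBORHOOD[idx]), k = 5)
table(pred, bronx$NEIGHBORHOOD[-idx])                 # how well are neighborhoods recovered?
km <- kmeans(feats, centers = 5)                      # unsupervised grouping of the same features
table(km$cluster)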

  20. Now, returning to tree-methods

  21. General idea behind trees • Although the basic philosophy of all classifiers based on decision trees is identical, there are many possible ways to construct them. • Among the key choices made when selecting an algorithm to build decision trees, several stand out for their importance: • the criterion for choosing the feature to use at each node • how to compute the partition of the set of examples • when to decide that a node is a leaf • the criterion for selecting the class to assign to each leaf

  22. Decision trees have some important advantages, including: • they can be applied to any type of data • the final structure of the classifier is quite simple and can be stored and handled gracefully • they handle conditional information very efficiently, subdividing the space into sub-spaces that are handled individually • they are normally robust and insensitive to misclassification (label noise) in the training set • the resulting trees are usually quite understandable and can easily be used to obtain a better understanding of the phenomenon in question; this is perhaps the most important advantage of all

  23. Stopping – leaves on the tree • A number of stopping conditions can be used to stop the recursive process. The algorithm stops when any one of the conditions is true: • All the samples belong to the same class, i.e. have the same label since the sample is already "pure" • Stop if most of the points are already of the same class. This is a generalization of the first approach, with some error threshold • There are no remaining attributes on which the samples may be further partitioned • There are no samples for the branch test attribute
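In rpart these stopping rules are exposed through rpart.control(); a minimal sketch (the parameter values shown are just the defaults):

library(rpart)
ctrl <- rpart.control(minsplit  = 20,    # don't try to split a node with fewer than 20 samples
                      minbucket = 7,     # every leaf must contain at least this many samples
                      cp        = 0.01,  # a split must improve the relative error by at least cp
                      maxdepth  = 30)    # hard limit on tree depth
fit <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)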

  24. Recursive partitioning • Recursive partitioning is a fundamental tool in data mining. It helps us explore the structure of a set of data, while developing easy to visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome. • The rpart programs build classification or regression models of a very general structure using a two stage procedure; the resulting models can be represented as binary trees.

  25. Recursive partitioning • The tree is built by the following process: • First, the single variable is found that best splits the data into two groups ('best' will be defined later). The data are separated, and this process is then applied separately to each sub-group, and so on recursively, until the subgroups either reach a minimum size or no further improvement can be made. • The second stage of the procedure consists of using cross-validation to trim back ('prune') the full tree.
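The two stages in rpart terms (a sketch, using the iris data as a stand-in):

library(rpart)
full <- rpart(Species ~ ., data = iris, method = "class")       # stage 1: grow by recursive splits
printcp(full)                                                   # cross-validated error per subtree size
best <- full$cptable[which.min(full$cptable[, "xerror"]), "CP"]
pruned <- prune(full, cp = best)                                # stage 2: trim back the full tree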

  26. Why are we careful doing this? • Because we will USE these trees, i.e. apply them to make decisions about what things are and what to do with them!

  27. Glass data… > printcp(rpart.model)
Classification tree: rpart(formula = Type ~ ., data = trainset)
Variables actually used in tree construction: [1] Al Ba Mg RI
Root node error: 92/143 = 0.64336
n= 143
        CP nsplit rel error  xerror     xstd
1 0.206522      0   1.00000 1.00000 0.062262
2 0.195652      1   0.79348 0.92391 0.063822
3 0.050725      2   0.59783 0.63043 0.063822
4 0.043478      5   0.44565 0.64130 0.063990
5 0.032609      6   0.40217 0.57609 0.062777
6 0.010000      7   0.36957 0.51087 0.061056
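The printcp() output above presupposes an mlbench Glass model fitted on a training subset; a hedged sketch of the setup it implies (the seed and the exact split are assumptions, chosen so that n = 143 matches the output):

library(rpart)
library(mlbench)
data(Glass)
set.seed(123)                              # assumed; the original seed is not shown
idx      <- sample(nrow(Glass), 143)       # 143 of the 214 rows for training
trainset <- Glass[idx, ]
testset  <- Glass[-idx, ]
rpart.model <- rpart(Type ~ ., data = trainset)
printcp(rpart.model)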

  28. plotcp(rpart.model)

  29. > rsq.rpart(rpart.model)
Classification tree: rpart(formula = Type ~ ., data = trainset)
Variables actually used in tree construction: [1] Al Ba Mg RI
Root node error: 92/143 = 0.64336
n= 143
        CP nsplit rel error  xerror     xstd
1 0.206522      0   1.00000 1.00000 0.062262
2 0.195652      1   0.79348 0.92391 0.063822
3 0.050725      2   0.59783 0.63043 0.063822
4 0.043478      5   0.44565 0.64130 0.063990
5 0.032609      6   0.40217 0.57609 0.062777
6 0.010000      7   0.36957 0.51087 0.061056
Warning message: In rsq.rpart(rpart.model) : may not be applicable for this method

  30. rsq.rpart

  31. > print(rpart.model)
n= 143
node), split, n, loss, yval, (yprob)  * denotes terminal node
 1) root 143 92 2 (0.3 0.36 0.091 0.056 0.049 0.15)
   2) Ba< 0.335 120 70 2 (0.35 0.42 0.11 0.058 0.058 0.0083)
     4) Al< 1.42 71 33 1 (0.54 0.28 0.15 0.014 0.014 0)
       8) RI>=1.517075 58 22 1 (0.62 0.28 0.086 0.017 0 0)
        16) RI< 1.518015 21 1 1 (0.95 0 0.048 0 0 0) *
        17) RI>=1.518015 37 21 1 (0.43 0.43 0.11 0.027 0 0)
          34) RI>=1.51895 25 10 1 (0.6 0.2 0.16 0.04 0 0)
            68) Mg>=3.415 18 4 1 (0.78 0 0.22 0 0 0) *
            69) Mg< 3.415 7 2 2 (0.14 0.71 0 0.14 0 0) *
          35) RI< 1.51895 12 1 2 (0.083 0.92 0 0 0 0) *
       9) RI< 1.517075 13 7 3 (0.15 0.31 0.46 0 0.077 0) *
     5) Al>=1.42 49 19 2 (0.082 0.61 0.041 0.12 0.12 0.02)
      10) Mg>=2.62 33 6 2 (0.12 0.82 0.061 0 0 0) *
      11) Mg< 2.62 16 10 5 (0 0.19 0 0.37 0.37 0.062) *
   3) Ba>=0.335 23 3 7 (0.043 0.043 0 0.043 0 0.87) *

  32. Tree plot > plot(rpart.model, compress=TRUE) > text(rpart.model, use.n=TRUE) plot(object, uniform=FALSE, branch=1, compress=FALSE, nspace, margin=0, minbranch=.3, ...)

  33. And if you are brave summary(rpart.model) … pages….

  34. Remember to LOOK at the data > names(Glass) [1] "RI" "Na" "Mg" "Al" "Si" "K" "Ca" "Ba" "Fe" "Type" > head(Glass) RI Na Mg Al Si K Ca Ba Fe Type 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
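The rpart.pred object shown on the next slide presumably comes from predicting on held-out rows; a sketch (using the testset assumed earlier):

rpart.pred <- predict(rpart.model, newdata = testset, type = "class")
table(rpart.pred, testset$Type)    # held-out confusion matrix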

  35. rpart.pred > rpart.pred 91 163 138 135 172 20 1 148 169 206 126 157 107 39 150 203 151 110 73 104 85 93 144 160 145 89 204 7 92 51 1 1 2 1 5 2 1 1 5 7 2 1 7 1 1 7 2 2 2 1 2 2 2 2 1 2 7 1 2 1 186 14 190 56 184 82 125 12 168 175 159 36 117 114 154 62 139 5 18 98 27 183 42 66 155 180 71 83 123 11 7 1 7 2 2 2 1 1 5 5 2 1 1 1 1 7 2 1 1 1 1 5 1 1 1 5 2 1 2 2 195 101 136 45 130 6 72 87 173 121 3 7 2 1 1 5 2 1 2 5 1 2 Levels: 1 2 3 5 6 7

  36. plot(rpart.pred)

  37. New dataset to work with trees
library(rpart)   # provides rpart() and the kyphosis data
fitK <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis)
printcp(fitK)    # display the results
plotcp(fitK)     # visualize cross-validation results
summary(fitK)    # detailed summary of splits
# plot tree
plot(fitK, uniform=TRUE, main="Classification Tree for Kyphosis")
text(fitK, use.n=TRUE, all=TRUE, cex=.8)
# create attractive postscript plot of tree
post(fitK, file = "kyphosistree.ps", title = "Classification Tree for Kyphosis")
# might need to convert to PDF (distill)

  38. > pfitK <- prune(fitK, cp= fitK$cptable[which.min(fitK$cptable[,"xerror"]),"CP"])
> plot(pfitK, uniform=TRUE, main="Pruned Classification Tree for Kyphosis")
> text(pfitK, use.n=TRUE, all=TRUE, cex=.8)
> post(pfitK, file = "ptree.ps", title = "Pruned Classification Tree for Kyphosis")

  39. > library(party)   # provides ctree()
> fitK <- ctree(Kyphosis ~ Age + Number + Start, data=kyphosis)
> plot(fitK, main="Conditional Inference Tree for Kyphosis")

  40. > plot(fitK, main="Conditional Inference Tree for Kyphosis",type="simple")

  41. randomForest
> require(randomForest)
> fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
> print(fitKF)   # view results
Call: randomForest(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 20.99%
Confusion matrix:
        absent present class.error
absent      59       5   0.0781250
present     12       5   0.7058824
> importance(fitKF)   # importance of each predictor
       MeanDecreaseGini
Age            8.654112
Number         5.584019
Start         10.168591
Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new "forest", and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification).

  42. Trees for the Titanic
data(Titanic)   # have you found this dataset?
# try rpart, ctree, hclust, etc. for Survived ~ .
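Titanic is a 4-way contingency table rather than case-level data, so it usually needs expanding to one row per passenger first; a hedged sketch:

library(rpart)
data(Titanic)
tdf   <- as.data.frame(Titanic)                        # columns: Class, Sex, Age, Survived, Freq
cases <- tdf[rep(seq_len(nrow(tdf)), tdf$Freq), 1:4]   # replicate each row by its frequency
fitT  <- rpart(Survived ~ Class + Sex + Age, data = cases, method = "class")
printcp(fitT)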

  43. More on another dataset.
# Regression Tree Example
library(rpart)
# build the tree
fitM <- rpart(Mileage ~ Price + Country + Reliability + Type, method="anova", data=cu.summary)
printcp(fitM)   # display the results
….
Root node error: 1354.6/60 = 22.576
n=60 (57 observations deleted due to missingness)
        CP nsplit rel error  xerror     xstd
1 0.622885      0   1.00000 1.03165 0.176920
2 0.132061      1   0.37711 0.51693 0.102454
3 0.025441      2   0.24505 0.36063 0.079819
4 0.011604      3   0.21961 0.34878 0.080273
5 0.010000      4   0.20801 0.36392 0.075650

  44. Mileage…
plotcp(fitM)    # visualize cross-validation results
summary(fitM)   # detailed summary of splits
<we will look more at this in a future lab>

  45. par(mfrow=c(1,2))
rsq.rpart(fitM)   # visualize cross-validation results

  46. # plot tree
plot(fitM, uniform=TRUE, main="Regression Tree for Mileage")
text(fitM, use.n=TRUE, all=TRUE, cex=.8)
# prune the tree
pfitM <- prune(fitM, cp=0.01160389)   # from cptable
# plot the pruned tree
plot(pfitM, uniform=TRUE, main="Pruned Regression Tree for Mileage")
text(pfitM, use.n=TRUE, all=TRUE, cex=.8)
post(pfitM, file = "ptree2.ps", title = "Pruned Regression Tree for Mileage")

  47. # Conditional Inference Tree for Mileage
fit2M <- ctree(Mileage ~ Price + Country + Reliability + Type, data=na.omit(cu.summary))
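As with the kyphosis example, ctree() here comes from the party package; plotting the fitted tree might look like this (the title string is an assumption):

library(party)   # ctree()
plot(fit2M, main = "Conditional Inference Tree for Mileage")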
