
Decision Trees in R

Connecticut R Users Group. Illya Mowerman. March 26th, 2013.


Presentation Transcript


  1. Decision Trees in R. Connecticut R Users Group. Illya Mowerman. March 26th, 2013.

  2. Summary. Decision trees have many uses: exploratory data analysis, variable selection, modeling and more. In today’s discussion we will cover: • What decision trees are. Decision trees are extremely versatile, easy to interpret, and require little data preparation. • Decision tree packages in R. rpart (the package used today), C50, and Cubist. • Enhancing tree outputs. One attractive feature of trees is that they are easy to interpret; however, the default output of the rpart package could use some enhancing.

  3. What are Trees • Some packages in R • Enhancing Tree Outputs • References

  4. A decision tree is an algorithm that can handle continuous or categorical dependent variables (DV) and independent variables (IV).
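
As a minimal sketch of the two cases, the code below fits one tree with a categorical DV and one with a continuous DV using rpart. The data sets (iris, mtcars) and the choice of predictors are assumptions made for illustration; they are not taken from the talk.

```r
library(rpart)  # recommended package shipped with standard R installations

# Categorical DV (Species) with continuous IVs: a classification tree.
class_fit <- rpart(Species ~ ., data = iris, method = "class")

# Continuous DV (mpg) with continuous and categorical IVs: a regression tree.
# cyl is converted to a factor purely for illustration.
cars <- transform(mtcars, cyl = factor(cyl))
reg_fit <- rpart(mpg ~ cyl + hp + wt, data = cars, method = "anova")

print(class_fit)
print(reg_fit)
```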

  5. There are many advantages to using trees [1]. • Simple to understand and interpret. People are able to understand decision tree models after a brief explanation. • Requires little data preparation. Other techniques often require data normalisation, the creation of dummy variables, and the removal of blank values. • Able to handle both numerical and categorical data. • Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily expressed in Boolean logic. • Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model. • Performs well on large data sets in a short time.
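
The "little data preparation" point can be illustrated with a short sketch, assuming the built-in airquality data set (not used in the talk). It has missing values and is fitted here with a factor predictor, without dummy coding or imputation.

```r
library(rpart)

# airquality has missing values in Ozone (the DV here) and Solar.R (an IV).
# Month becomes a factor: a categorical IV used without dummy coding.
aq <- transform(airquality, Month = factor(Month))

fit <- rpart(Ozone ~ Solar.R + Wind + Temp + Month, data = aq, method = "anova")

# By default rpart drops rows with a missing DV, and keeps rows with
# missing IVs, which it handles through surrogate splits.
print(fit)
```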

  6. Some things to consider when coding the model… • Splits. Gini or information. • Type of DV (method). Classification (class), regression (anova), count (poisson), survival (exp). • Minimum number of observations in a node for a split to be attempted (minsplit). • Minimum number of observations in a terminal node (minbucket). • Cross-validation (xval). The number of cross-validation folds; used more in model building than in exploration. • Complexity parameter (cp). This value is used for pruning. A smaller tree is less detailed but may generalize with less error.
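
The settings above map onto arguments of rpart and rpart.control. The sketch below shows one way they might be combined. The kyphosis data set ships with rpart; every parameter value is an illustrative assumption, not a recommendation from the talk.

```r
library(rpart)

# All values below are illustrative, not recommendations.
ctrl <- rpart.control(
  minsplit  = 20,    # minimum observations in a node before a split is attempted
  minbucket = 7,     # minimum observations in any terminal node
  xval      = 10,    # number of cross-validation folds
  cp        = 0.001  # complexity parameter; a small value grows a larger tree
)

set.seed(1)  # cross-validation folds are assigned at random
fit <- rpart(
  Kyphosis ~ Age + Number + Start,
  data    = kyphosis,                     # example data set shipped with rpart
  method  = "class",                      # "anova", "poisson" or "exp" for other DV types
  parms   = list(split = "information"),  # or "gini", the default for classification
  control = ctrl
)

printcp(fit)  # complexity table, including the cross-validated error (xerror)

# Prune back to the cp value with the lowest cross-validated error.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

Pruning at the cp with the lowest xerror is one common convention; other rules, such as choosing the smallest tree within one standard error of the minimum, are also used.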

  7. What are Trees • Some packages in R • Enhancing Tree Outputs • References

  8. R has many packages for the same or similar tasks. • rpart. Comes with R. • C50. • Cubist. • rpart.plot. Makes rpart plots much nicer.
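
A small sketch for checking which of these packages are available and trying the two that do not come with R. The data sets and the guarded calls are assumptions for illustration. C5.0 fits classification models and cubist fits rule-based regression models.

```r
pkgs <- c("rpart", "C50", "Cubist", "rpart.plot")

# TRUE for each package that is installed and can be loaded.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)
# install.packages(pkgs[!installed])  # uncomment to install missing ones from CRAN

if (installed[["C50"]]) {
  # Classification with C5.0 on an illustrative data set.
  c50_fit <- C50::C5.0(Species ~ ., data = iris)
  print(c50_fit)
}

if (installed[["Cubist"]]) {
  # Rule-based regression with Cubist: predictors in x, numeric outcome in y.
  cubist_fit <- Cubist::cubist(x = mtcars[, -1], y = mtcars$mpg)
  print(cubist_fit)
}
```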

  9. What are Trees • Some packages in R • Enhancing Tree Outputs • References

  10. An alternative to the rpart plots is the prp function in the rpart.plot package. • extra. Values 1 to 9 display extra information in the nodes. • box.col. Sets the colors of the node boxes. • xflip. Flips the tree horizontally. • nn. Adds node numbers for easier interpretation.
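
A minimal sketch of a prp call using the four options listed, assuming the kyphosis data set that ships with rpart. The colour and the value of extra are arbitrary choices. If rpart.plot is not installed, the code falls back to the base rpart plot.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::prp(
    fit,
    extra   = 1,            # show the number of observations per class in each node
    box.col = "lightblue",  # fill colour of the node boxes
    xflip   = TRUE,         # mirror the tree horizontally
    nn      = TRUE          # display node numbers
  )
} else {
  # Fallback: the default rpart plot that prp is meant to improve on.
  plot(fit)
  text(fit)
}
```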

  11. What are Trees • Some packages in R • Enhancing Tree Outputs • References

  12. References • [1] http://en.wikipedia.org/wiki/Decision_tree_learning • [2] http://www.stanford.edu/class/stats315b/minitech.pdf • [3] http://www.milbo.org/rpart-plot/prp.pdf
