
Chap 4: Classification Chapter Sections


Presentation Transcript


  1. Chap 4: Classification Chapter Sections • Decision Trees • Naïve Bayes • Diagnostics of Classifiers • Additional Classification Models • Summary

  2. Classification • Classification is widely used for prediction • Most classification methods are supervised • This chapter focuses on two fundamental classification methods • Decision trees • Naïve Bayes

  3. Decision Trees • Tree structure specifies a sequence of decisions • Given input X={x1, x2,…, xn}, predict output Y • Input attributes/features can be categorical or continuous • Root node and internal nodes test a particular input variable; leaf nodes return class labels • Depth of a node = minimum steps to reach the node • Branch (connects two nodes) = specifies a decision • Two varieties of decision trees • Classification trees: categorical output, often binary • Regression trees: numeric output

  4. Decision Trees: Overview of a Decision Tree • Example of a decision tree • Predicts whether customers will buy a product

  5. Decision Trees: Overview of a Decision Tree • Example: will a bank client subscribe to a term deposit?

  6. Decision Trees: The General Algorithm • Construct a tree T from training set S • Requires a measure of attribute information • Simplistic method (data from previous Fig.) • Purity = probability of the corresponding class • E.g., P(no)=1789/2000=89.45%, P(yes)=10.55% • Entropy methods • Entropy measures the impurity of an attribute • Information gain measures the reduction in impurity achieved by splitting on an attribute

  7. Decision Trees: The General Algorithm • Entropy methods of attribute information • Entropy of a variable X: H_X = -Σ_x P(x) log2 P(x) • Conditional entropy of the output Y given attribute X: H_{Y|X} = Σ_x P(x) H_{Y|X=x} • Information gain of an attribute = base entropy – conditional entropy
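A minimal R sketch of these two formulas; the helper functions entropy() and info_gain() below are illustrative, not part of the book's code, and banktrain refers to the bank data loaded on the R slide later in this deck:

entropy <- function(y) {                       # entropy of a vector of class labels, in bits
  p <- table(y) / length(y)
  -sum(ifelse(p > 0, p * log2(p), 0))
}
info_gain <- function(x, y) {                  # base entropy minus conditional entropy of the splits
  w <- table(x) / length(x)
  cond <- sum(sapply(names(w), function(v) w[v] * entropy(y[x == v])))
  entropy(y) - cond
}
# e.g., information gain of each categorical attribute with respect to 'subscribed':
# sapply(banktrain[, names(banktrain) != "subscribed"], info_gain, y = banktrain$subscribed)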

  8. Decision Trees: The General Algorithm • Construct a tree T from training set S • Choose root node = most informative attribute A • Partition S according to A's values • Construct subtrees T1, T2, … for the subsets of S recursively until one of the following occurs • All leaf nodes satisfy the minimum purity threshold • The tree cannot be split further while meeting the minimum purity threshold • Another stopping criterion is satisfied, e.g., maximum depth

  9. Decision Trees: Decision Tree Algorithms • ID3 Algorithm (T = training set, P = output variable, A = attribute)

  10. Decision Trees: Decision Tree Algorithms • C4.5 Algorithm • Handles missing data • Handles both categorical and continuous variables • Uses bottom-up pruning to address overfitting • CART (Classification And Regression Trees) • Also handles continuous variables • Uses the Gini diversity index as the information measure
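For reference, the Gini diversity index of a node with class proportions p1, …, pk is 1 − Σ pi²; a quick R check (the helper below is illustrative, not from the book's code):

gini <- function(y) { p <- table(y) / length(y); 1 - sum(p^2) }
gini(c("no", "no", "no", "yes"))   # 0.375 for a node that is 75% "no" and 25% "yes"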

  11. Decision Trees: Evaluating a Decision Tree • Decision trees are greedy algorithms • Best option at each step, maybe not best overall • Addressed by ensemble methods: random forest • Model might overfit the data (figure: blue = training set, red = test set) • To overcome overfitting: stop growing the tree early, or grow the full tree and then prune it
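A brief sketch of the grow-then-prune option using rpart's complexity-parameter table; fit is assumed to be a tree built as on the R slide later in this deck, and the cp selection rule is a common choice, not mandated by the book:

library(rpart)
printcp(fit)                                               # cross-validated error for each cp value
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned_fit <- prune(fit, cp = best_cp)                     # prune back to the best-performing subtree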

  12. Decision Trees: Evaluating a Decision Tree • Decision trees produce rectangular decision regions

  13. Decision Trees: Evaluating a Decision Tree • Advantages of decision trees • Computationally inexpensive • Outputs are easy to interpret – a sequence of tests • Show the importance of each input variable • Decision trees handle • Both numerical and categorical attributes • Categorical attributes with many distinct values • Variables with nonlinear effect on the outcome • Variable interactions

  14. Decision Trees: Evaluating a Decision Tree • Disadvantages of decision trees • Sensitive to small variations in the training data • Overfitting can occur because each split reduces the training data for subsequent splits • Poor if the dataset contains many irrelevant variables

  15. Decision Trees: Decision Trees in R
# install packages rpart, rpart.plot
# put this code into an RStudio source file and execute lines via Ctrl/Enter
library("rpart")
library("rpart.plot")
setwd("c:/data/rstudiofiles/")
banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")
## drop a few columns to simplify the tree
drops <- c("age", "balance", "day", "campaign", "pdays", "previous", "month")
banktrain <- banktrain[, !(names(banktrain) %in% drops)]
summary(banktrain)
# make a simple decision tree by only keeping the categorical variables
fit <- rpart(subscribed ~ job + marital + education + default + housing + loan + contact + poutcome,
             method="class", data=banktrain,
             control=rpart.control(minsplit=1),
             parms=list(split='information'))
summary(fit)
# plot the tree
rpart.plot(fit, type=4, extra=2, clip.right.labs=FALSE, varlen=0, faclen=3)
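As a follow-up, the fitted tree can score new records; a short illustrative example, assuming bank-sample-test.csv (the test file used later in this deck) is in the working directory:

banktest <- read.table("bank-sample-test.csv", header=TRUE, sep=",")
banktest <- banktest[, !(names(banktest) %in% drops)]
predict(fit, newdata=banktest, type="class")   # predicted class labels
predict(fit, newdata=banktest, type="prob")    # class probabilities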

  16. Naïve Bayes • The naïve Bayes classifier • Based on Bayes’ theorem (or Bayes’ Law) • Assumes the features contribute independently • Features (variables) are generally categorical • Continuous variables are typically discretized, i.e., converted into categorical ones • Output is usually a class label plus a probability score • Log probability often used instead of probability
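For example, a continuous attribute such as age can be binned with R's cut(); the bin boundaries and labels below are illustrative:

age <- c(23, 35, 47, 61, 58, 30)
age_group <- cut(age, breaks=c(0, 30, 45, 60, Inf),
                 labels=c("young", "mid", "senior", "retired"))
table(age_group)   # categorical version of the continuous variable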

  17. Naïve Bayes: Bayes’ Theorem • Bayes’ Theorem: P(C|A) = P(A|C) P(C) / P(A), where C = class, A = observed attributes • Typical medical example • Used because doctors frequently get this wrong
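A worked illustration with made-up numbers (1% prevalence, 99% sensitivity, 5% false-positive rate), computed in R:

p_disease <- 0.01             # prior probability of disease (illustrative)
p_pos_given_disease <- 0.99   # sensitivity (illustrative)
p_pos_given_healthy <- 0.05   # false-positive rate (illustrative)
p_pos <- p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_pos_given_disease * p_disease / p_pos   # P(disease | positive) is about 0.17, far below 0.99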

  18. Naïve Bayes Classifier • Conditional independence assumption: P(a1, a2, …, am | cj) = P(a1|cj) P(a2|cj) … P(am|cj) • Dropping the common denominator P(A), we get P(cj|A) proportional to P(cj) P(a1|cj) P(a2|cj) … P(am|cj) • Find the cj that maximizes P(cj|A)

  19. Naïve Bayes Classifier • Example: client subscribes to term deposit? • The following record is from a bank client. Is this client likely to subscribe to the term deposit?

  20. Naïve Bayes Classifier • Compute probabilities for this record

  21. Naïve Bayes Classifier • Compute Naïve Bayes classifier outputs: yes/no • The client is assigned the label subscribed = yes • The scores are small, but the ratio is what counts • Using logarithms helps avoid numerical underflow
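A minimal sketch of comparing the two class scores on the log scale; the probabilities below are placeholders, not the values computed in the book's example:

p_yes <- 0.11;  cond_yes <- c(0.25, 0.65, 0.70)    # illustrative P(yes) and P(a_i | yes)
p_no  <- 0.89;  cond_no  <- c(0.05, 0.30, 0.20)    # illustrative P(no) and P(a_i | no)
log_score_yes <- log(p_yes) + sum(log(cond_yes))   # log prior + sum of log conditionals
log_score_no  <- log(p_no)  + sum(log(cond_no))    # avoids underflow from multiplying tiny numbers
ifelse(log_score_yes > log_score_no, "yes", "no")  # assign the class with the larger score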

  22. Smoothing • A smoothing technique assigns a small nonzero probability to rare events that are missing in the training data • E.g., Laplace smoothing assumes every output occurs once more than occurs in the dataset • Smoothing is essential – without it, a zero conditional probability results in P(cj|A)=0
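A sketch of the add-one idea for a single conditional probability (the counts are illustrative):

# Laplace (add-one) smoothing: P(a | c) = (count(a, c) + 1) / (count(c) + m),
# where m is the number of distinct values the attribute can take
count_a_and_c <- 0    # attribute value never observed with this class in training
count_c <- 200        # training records with this class
m <- 4                # distinct values of the attribute
(count_a_and_c + 1) / (count_c + m)   # small but nonzero, so P(cj|A) is not forced to 0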

  23. Diagnostics • Naïve Bayes advantages • Handles missing values • Robust to irrelevant variables • Simple to implement • Computationally efficient • Handles high-dimensional data efficiently • Often competitive with other learning algorithms • Reasonably resistant to overfitting • Naïve Bayes disadvantages • Assumes variables are conditionally independent • Therefore, sensitive to double counting correlated variables • In its simplest form, used only for categorical variables

  24. Naïve Bayes in R • This section explores two methods of using the naïve Bayes Classifier • Manually compute probabilities from scratch • Tedious with many R calculations • Use naïve Bayes function from e1071 package • Much easier – starts on page 222 • Example: subscribing to term deposit

  25. Naïve Bayes in R • Get data and the e1071 package
> setwd("c:/data/rstudio/chapter07")
> sample <- read.table("sample1.csv", header=TRUE, sep=",")
> traindata <- as.data.frame(sample[1:14,])
> testdata <- as.data.frame(sample[15,])
> traindata              # lists the training data
> testdata               # lists the test data; no Enrolls variable
> install.packages("e1071", dep = TRUE)
> library(e1071)         # contains the naiveBayes function

  26. Naïve Bayes in R • Perform modeling
> model <- naiveBayes(Enrolls ~ Age + Income + JobSatisfaction + Desire, traindata)
> model                  # generates the model output
> results <- predict(model, testdata)
> results                # provides the test prediction
• Using a Laplace parameter gives the same result
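For reference, the same model with add-one smoothing via the laplace argument of e1071's naiveBayes (as the slide notes, the predicted class is unchanged):

> model_laplace <- naiveBayes(Enrolls ~ Age + Income + JobSatisfaction + Desire, traindata, laplace = 1)
> predict(model_laplace, testdata)   # same predicted class for this record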

  27. Diagnostics of Classifiers • The book covered three classifiers • Logistic regression, decision trees, naïve Bayes • Tools to evaluate classifier performance • Confusion matrix

  28. Diagnostics of Classifiers • Bank marketing example • Training set of 2000 records • Test set of 100 records, evaluated below

  29. Diagnostics of Classifiers • Evaluation metrics
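For reference, with TP, TN, FP, and FN denoting the confusion-matrix counts, the standard definitions of these metrics are:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
True positive rate (recall) = TP / (TP + FN)
False positive rate = FP / (FP + TN)
False negative rate = FN / (TP + FN)
Precision = TP / (TP + FP)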

  30. Diagnostics of Classifiers • Evaluation metrics on the bank marketing 100-record test set (metric table in figure; two of the values are flagged as poor)

  31. Diagnostics of Classifiers • ROC curve: good for evaluating binary detection • Bank marketing: 2000-record training set + 100-record test set
> library(e1071)    # naiveBayes()
> library(ROCR)     # prediction() and performance(), needed for the ROC curve
> banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")
> drops <- c("balance", "day", "campaign", "pdays", "previous", "month")
> banktrain <- banktrain[, !(names(banktrain) %in% drops)]
> banktest <- read.table("bank-sample-test.csv", header=TRUE, sep=",")
> banktest <- banktest[, !(names(banktest) %in% drops)]
> nb_model <- naiveBayes(subscribed~., data=banktrain)
> nb_prediction <- predict(nb_model, banktest[,-ncol(banktest)], type='raw')
> score <- nb_prediction[, c("yes")]
> actual_class <- banktest$subscribed == 'yes'
> pred <- prediction(score, actual_class)    # ROCR prediction object
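A hedged continuation that draws the curve and reports the area under it, using ROCR's performance():

> perf <- performance(pred, "tpr", "fpr")    # true positive rate vs. false positive rate
> plot(perf, lwd=2)                          # plot the ROC curve
> abline(a=0, b=1, lty=2)                    # diagonal = random guessing
> performance(pred, "auc")@y.values[[1]]     # area under the ROC curve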

  32. Diagnostics of Classifiers • ROC curve: good for evaluating binary detection • Bank marketing: 2000 training set + 100 test set

  33. Additional Classification Methods • Ensemble methods that use multiple models • Bagging: bootstrap method that uses repeated sampling with replacement • Boosting: similar to bagging, but an iterative procedure that reweights records the current model misclassifies • Random forest: uses an ensemble of decision trees • These models usually perform better than a single decision tree • Support Vector Machine (SVM) • Linear model defined by a small number of support vectors
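An illustrative sketch of fitting two of these models to the bank data used earlier in the deck; the randomForest package and e1071's svm() are common choices, not prescribed by the book:

> library(randomForest)                                     # ensemble of decision trees
> library(e1071)                                            # svm()
> banktrain$subscribed <- as.factor(banktrain$subscribed)   # classification needs a factor response
> rf_model <- randomForest(subscribed ~ ., data=banktrain, ntree=500)
> svm_model <- svm(subscribed ~ ., data=banktrain, kernel="linear")
> table(predict(rf_model, banktest), banktest$subscribed)   # quick confusion matrix on the test set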

  34. Summary • How to choose a suitable classifier among • Decision trees, naïve Bayes, & logistic regression
