
Classification II



Presentation Transcript


  1. Classification II

  2. Training Dataset Example

  3. Output: A Decision Tree for “buys_computer” [Decision tree: the root tests age?; the <=30 branch leads to a student? test (no → no, yes → yes); the 30..40 branch is a leaf labeled yes; the >40 branch leads to a credit rating? test (excellent → no, fair → yes)]

  4. Choice of attribute [Figure: a set of + / – labeled examples and the partitions produced by candidate attribute splits] • We prefer splits that lead to “pure” partitions • purity: class labels within a partition are homogeneous

  5. Selecting the best split • The best split is selected based on the degree of impurity of the child nodes • Class distribution (0,1) has the highest purity (lowest impurity) • Class distribution (0.5,0.5) has the smallest purity (highest impurity) • Intuition: high purity → small value of the impurity measure → better split

  6. Algorithm for Decision Tree Induction (pseudocode) Algorithm GenDecTree(Sample S, Attlist A) • Create a node N • If all samples in S are of the same class C, then label N with C; terminate • If A is empty, then label N with the most common class C in S (majority voting); terminate • Select a ∈ A with the highest impurity reduction; label N with a • For each value v of a: • Grow a branch from N with condition a = v • Let Sv be the subset of samples in S with a = v • If Sv is empty, then attach a leaf labeled with the most common class in S • Else attach the node generated by GenDecTree(Sv, A − {a})
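The pseudocode above maps almost line-for-line onto code. Below is a minimal Python sketch under the same assumptions (categorical attributes, one branch per attribute value); the function and field names are illustrative, and the impurity-reduction measure is passed in rather than fixed, since the following slides discuss several choices.

from collections import Counter

def gen_dec_tree(samples, labels, attributes, impurity_reduction):
    """samples: list of dicts mapping attribute -> value; labels: parallel class list.
    impurity_reduction(samples, labels, attr) scores a candidate split (higher is better)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:                        # all samples in the same class C
        return {"label": labels[0]}
    if not attributes:                               # A is empty: majority voting
        return {"label": majority}
    best = max(attributes, key=lambda a: impurity_reduction(samples, labels, a))
    node = {"attribute": best, "branches": {}, "default": majority}
    for v in {s[best] for s in samples}:             # one branch per observed value of best
        subset = [(s, y) for s, y in zip(samples, labels) if s[best] == v]
        node["branches"][v] = gen_dec_tree(
            [s for s, _ in subset], [y for _, y in subset],
            [a for a in attributes if a != best],    # recurse on A - {a}
            impurity_reduction)
    # Values of best never seen in S get no branch; at prediction time such cases
    # fall back to the stored majority label ("default"), which plays the role of
    # the empty-Sv leaf in the pseudocode.
    return node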

  7. Attribute Selection Measure: Information Gain • Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D| • where Ci,D denotes the set of tuples of D that belong to class Ci • Expected information (entropy) needed to classify a tuple in D: Info(D) = −Σi=1..m pi log2(pi) • where m is the number of classes

  8. Attribute Selection Measure: Information Gain • Information needed (after using A to split D into v partitions) to classify D: InfoA(D) = Σj=1..v (|Dj|/|D|) · Info(Dj) • Information gained by branching on attribute A: Gain(A) = Info(D) − InfoA(D)
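As a concrete illustration of the two formulas above, here is a small Python sketch that computes Info(D), InfoA(D) and Gain(A) for categorical data; the toy labels and attribute values are made up, not the slide's dataset.

import math
from collections import Counter

def info(labels):
    """Entropy Info(D) = -sum_i p_i log2 p_i over the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_after_split(values, labels):
    """Info_A(D): weighted entropy of the partitions induced by attribute values."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    return sum(len(part) / n * info(part) for part in partitions.values())

def gain(values, labels):
    return info(labels) - info_after_split(values, labels)

# Tiny illustration: an attribute that separates the classes perfectly has gain = Info(D)
labels = ["yes", "yes", "no", "no"]
print(gain(["a", "a", "b", "b"], labels))  # 1.0
print(gain(["a", "b", "a", "b"], labels))  # 0.0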

  9. Attribute Selection: Information Gain [Figure: the training samples, labeled yes / no] • Class P: buys_computer = “yes” • Class N: buys_computer = “no”

  10. Splitting the samples using age [Figure: the training samples partitioned by age? into the branches <=30, 30...40, and >40; the 30...40 partition is labeled yes]

  11. Gini index • If a data set D contains examples from n classes, the gini index gini(D) is defined as gini(D) = 1 − Σj=1..n pj^2 • where pj is the relative frequency of class j in D • If a data set D is split on attribute A into two subsets D1 and D2, the gini index giniA(D) is defined as giniA(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)

  12. Gini index • Reduction in impurity: Δgini(A) = gini(D) − giniA(D) • The attribute that provides the smallest giniA(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node
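A matching sketch for the Gini index and a binary split; the 9-to-5 class mix echoes the running example, but the split shown is purely illustrative.

from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(labels_d1, labels_d2):
    """gini_A(D) for a binary split of D into D1 and D2."""
    n = len(labels_d1) + len(labels_d2)
    return len(labels_d1) / n * gini(labels_d1) + len(labels_d2) / n * gini(labels_d2)

labels = ["yes"] * 9 + ["no"] * 5
print(round(gini(labels), 3))                                        # 0.459 for a 9/5 class mix
print(round(gini(labels) - gini_split(["yes"] * 9, ["no"] * 5), 3))  # reduction 0.459 for a pure split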

  13. Comparing Attribute Selection Measures • The two measures generally return good results, but • Both are biased towards multi-valued attributes • Gini may have difficulty when the number of classes is large • Gini tends to favor tests that result in equal-sized partitions with purity in both partitions

  14. Is minimizing impurity / maximizing Δ enough? • The gain function Δ favors attributes with a large number of values • A test condition with a large number of outcomes may not be desirable • The number of records in each partition may become too small for reliable predictions

  15. Gain ratio • A modification of the gain that reduces its bias towards high-branching attributes • The gain ratio should be • large when data is evenly spread across many branches • small when all data belongs to one branch • Takes the number and size of branches into account when choosing an attribute • It corrects Δ by taking the intrinsic information of a split into account (i.e., how much information is needed to tell which branch an instance belongs to)

  16. Gain ratio • Gain ratio = Δ / SplitInfo • SplitInfo = −Σi=1…k P(vi) log(P(vi)) • k: total number of branches (splits) • P(vi): probability that an instance belongs to branch i • If each branch of the split has the same number of records: P(vi) = 1/k and SplitInfo = log k • Large number of splits → large SplitInfo → small gain ratio
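A short sketch of SplitInfo and the gain ratio (log base 2, to match the entropy sketch earlier); the gain value plugged in at the end is just an illustrative number.

import math

def split_info(branch_sizes):
    """SplitInfo = -sum_i P(v_i) log2 P(v_i), with P(v_i) = |branch i| / |D|."""
    n = sum(branch_sizes)
    return -sum(s / n * math.log2(s / n) for s in branch_sizes if s > 0)

def gain_ratio(gain, branch_sizes):
    return gain / split_info(branch_sizes)

# A many-way split inflates SplitInfo and therefore deflates the ratio:
print(split_info([7, 7]))          # 1.0   (two equal branches -> log2 2)
print(split_info([1] * 14))        # ~3.81 (fourteen singleton branches -> log2 14)
print(gain_ratio(0.94, [1] * 14))  # even a gain of 0.94 yields a small ratio (~0.25)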

  17. Decision boundary for decision trees • The border line between two neighboring regions of different classes is known as the decision boundary • The decision boundary of a decision tree is parallel to the axes because each test condition involves a single attribute at a time

  18. Oblique Decision Trees • Not all datasets can be partitioned optimally using test conditions on single attributes!

  19. Oblique Decision Trees • Test on multiple attributes: if x + y < 1 then red class • Not all datasets can be partitioned optimally using test conditions on single attributes!

  20. Oblique Decision Trees • 500 circular and 500 triangular data points • Circular points: 0.5 ≤ sqrt(x^2 + y^2) ≤ 1 • Triangular points: sqrt(x^2 + y^2) > 1 or sqrt(x^2 + y^2) < 0.5

  21. Overfitting due to noise Decision boundary is distorted by noise point

  22. Overfitting due to insufficient samples [Figure: training points of two classes (x for class 1, a second marker for class 2) and test samples (o)] Why?

  23. Overfitting due to insufficient samples [Same figure as the previous slide] • The lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region • The insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task

  24. Overfitting and Tree Pruning • Overfitting: An induced tree may overfit the training data • Too many branches, some may reflect anomalies due to noise or outliers • Poor accuracy for unseen samples • Two approaches to avoid overfitting

  25. Overfitting and Tree Pruning • Two approaches to avoid overfitting • Prepruning: halt tree construction early • Do not split a node if this would result in the goodness measure falling below a threshold • Difficult to choose an appropriate threshold • Postpruning: remove branches from a “fully grown” tree • Prune a sub-tree if the classification error is smaller after pruning • Produces a sequence of progressively pruned trees
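As one concrete post-pruning recipe, here is a hedged sketch using scikit-learn's cost-complexity pruning; the slide does not name a specific method or library, so the dataset and the pruning criterion are my choices. The pruning path yields the sequence of progressively pruned trees, and a held-out validation set selects among them.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow a full tree, then obtain the candidate pruning levels (cost-complexity path)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)        # validation accuracy of the pruned tree
    if score >= best_score:                 # prefer the most pruned tree among the best
        best_alpha, best_score = alpha, score

print(f"selected ccp_alpha={best_alpha:.5f}, validation accuracy={best_score:.3f}")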

  26. Pros and Cons of decision trees • Cons • Cannot handle complicated relationships between features • Simple decision boundaries • Problems with lots of missing data • Constructing the optimal decision tree: NP-complete • Pros • Reasonable training time • Fast application • Easy to interpret • Easy to implement • Can handle large number of features

  27. Some well-known decision-tree learning implementations • CART: Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth • ID3: Quinlan JR (1986) Induction of decision trees. Machine Learning 1:81–106 • C4.5: Quinlan JR (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann • J48: implementation of C4.5 in WEKA
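For a quick hands-on counterpart to the implementations listed above, a minimal scikit-learn example (a CART-style learner, not itself one of the listed systems); the criterion parameter switches between the Gini index and entropy-based splitting discussed earlier, and the dataset is simply scikit-learn's built-in iris data.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
for criterion in ("gini", "entropy"):          # split-quality measure
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    clf.fit(X, y)
    print(criterion, clf.score(X, y))          # training accuracy
print(export_text(clf))                        # textual view of the last learned tree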

  28. Handling missing values: Imputation • An estimate of the missing value or of its distribution is used to generate predictions from a given model: • a missing value is replaced with an estimate of the value, or • the distribution of possible missing values is estimated and the corresponding model predictions are combined probabilistically

  29. Handling missing values • Remove attributes with missing values • Remove examples with missing values • Assume the most frequent value • Assume the most frequent value given a class • Learn the distribution of a given attribute • Induce relationships between the available attribute values and the missing feature

  30. Imputing Missing Values • Expectation Maximization (EM): • Build a model of the data values (ignoring missing values) • Use the model to estimate the missing values • Build a new model of the data values (including the estimated values from the previous step) • Use the new model to re-estimate the missing values • Re-estimate the model • Repeat until convergence
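A minimal sketch of this model-based imputation loop, using scikit-learn's IterativeImputer as the modeling step; the library choice and the synthetic data are mine, not the slide's.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the estimator)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] += 2.0 * X[:, 0]               # make one column predictable from another
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the entries

# Each round models every feature that has missing values from the other features,
# imputes, and repeats until the imputed values stop changing (or max_iter rounds).
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).any())        # False: all missing entries were filled in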

  31. Potential Problems • Imputed values may be inappropriate: • in medical databases, if missing values not imputed separately for male and female patients, may end up with male patients with 1.3 prior pregnancies, and female patients suffering from prostate infection • many of these situations will not be so obvious • If some attributes are difficult to predict, filled-in values may be random (or worse)

  32. What is Bayesian Classification? • Bayesian classifiers are statistical classifiers • For each new sample they provide a probability that the sample belongs to a class (for all classes)

  33. Bayes’ Theorem: Basics • Let X be a data sample (“evidence”): class label is unknown • Let H be a hypothesis that X belongs to class C • Classification is to determine P(H|X) • the probability that the hypothesis holds given the observed data sample X

  34. Bayes’ Theorem: Basics • P(H) (prior probability): • the initial probability • e.g., any X will buy a computer, regardless of age, income, … • P(X): the probability that sample data X is observed • P(X|H) (likelihood): • the probability of observing the sample X given that the hypothesis holds • e.g., given that X will buy a computer, the probability that X is 31..40 with medium income

  35. Bayes’ Theorem • Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes’ theorem: P(H|X) = P(X|H) P(H) / P(X) • Informally, this can be written as posteriori = likelihood × prior / evidence • Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes • Practical difficulty: requires initial knowledge of many probabilities and significant computational cost

  36. Towards Naïve Bayesian Classifiers • D: training set of tuples and their associated class labels • X = (x1, x2, …, xn): each tuple is represented by an n-dimensional attribute vector • Suppose there are m classes C1, C2, …, Cm • Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)

  37. Towards Naïve Bayesian Classifiers • Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X) • This can be derived from Bayes’ theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X) • Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized

  38. Derivation of Naïve Bayesian Classifier • A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes): P(X|Ci) = Πk=1..n P(xk|Ci) • Each xk is a potential value of attribute Ak • This greatly reduces the computation cost: only class and per-attribute value counts are needed

  39. Derivation of Naïve Bayesian Classifier • If Ak is categorical: • P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D) • If Ak is continuous-valued: • P(xk|Ci) is computed from a Gaussian distribution with mean μ and standard deviation σ: P(xk|Ci) = g(xk, μCi, σCi) • where g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)^2 / (2σ^2))
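A from-scratch sketch of the categorical case above: priors and conditional probabilities are estimated by counting, and classification picks the class maximizing P(Ci) · Πk P(xk|Ci). The data layout and function names are illustrative, and no smoothing is applied here (that issue is addressed a few slides below).

from collections import Counter

def train_nb(rows, labels):
    """rows: list of attribute-value tuples; labels: parallel list of class labels.
    Returns the priors P(Ci) and a function computing P(xk|Ci) by counting."""
    class_counts = Counter(labels)
    cond_counts = Counter()                 # (attribute index, value, class) -> count
    for row, c in zip(rows, labels):
        for k, value in enumerate(row):
            cond_counts[(k, value, c)] += 1
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    def cond_prob(k, value, c):
        return cond_counts[(k, value, c)] / class_counts[c]
    return priors, cond_prob

def classify_nb(x, priors, cond_prob):
    """Return the class maximizing P(Ci) * prod_k P(xk|Ci)."""
    scores = dict(priors)
    for c in scores:
        for k, value in enumerate(x):
            scores[c] *= cond_prob(k, value, c)
    return max(scores, key=scores.get)

# Toy usage: one binary attribute that perfectly predicts the class
rows, labels = [("a",), ("a",), ("b",), ("b",)], ["yes", "yes", "no", "no"]
priors, cond_prob = train_nb(rows, labels)
print(classify_nb(("a",), priors, cond_prob))   # yes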

  40. Naive Bayesian Classifier Example [Table: the “play tennis?” training set (outlook, temperature, humidity, windy, class P/N) shown on the slide is not reproduced in the transcript]

  41. Naive Bayesian Classifier Example [Figure: the training set split into the 9 tuples of class P and the 5 tuples of class N]

  42. Naive Bayesian Classifier Example • Given the training set, we compute the conditional probabilities P(xk | Ci) for each attribute value and class • We also have the prior class probabilities • P(Ci = P) = 9/14 • P(Ci = N) = 5/14

  43. Example • To classify a new sample X = < outlook = sunny, temperature = cool, humidity = high, windy = false > • Prob(Ci = P|X) ∝ Prob(P) · Prob(sunny|P) · Prob(cool|P) · Prob(high|P) · Prob(false|P) = 9/14 · 2/9 · 3/9 · 3/9 · 6/9 ≈ 0.010 • Prob(Ci = N|X) ∝ Prob(N) · Prob(sunny|N) · Prob(cool|N) · Prob(high|N) · Prob(false|N) = 5/14 · 3/5 · 1/5 · 4/5 · 2/5 ≈ 0.013 • Therefore X takes class label N
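A quick numeric check of the two products on the slide, using the conditional probabilities quoted there:

# Scores for the new sample X under the two classes (the 1/P(X) factor is omitted,
# as on the slide, since it does not change which class wins)
p_P = 9/14 * 2/9 * 3/9 * 3/9 * 6/9   # P(P) * P(sunny|P) P(cool|P) P(high|P) P(false|P)
p_N = 5/14 * 3/5 * 1/5 * 4/5 * 2/5   # P(N) * P(sunny|N) P(cool|N) P(high|N) P(false|N)
print(round(p_P, 4), round(p_N, 4))  # ~0.0106 vs ~0.0137, so the sample is labeled N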

  44. Avoiding the 0-Probability Problem • Naïve Bayesian prediction requires each conditional probability to be non-zero • Otherwise, the predicted probability for the class is zero: since P(X|Ci) = Πk=1..n P(xk|Ci), a single zero factor zeroes out the whole product

  45. Avoiding the 0-Probability Problem • Example: suppose a dataset with 1000 tuples where • income = low: 0 tuples • income = medium: 990 tuples • income = high: 10 tuples • Use the Laplacian correction (Laplacian estimator): add 1 to each case • Prob(income = low) = 1/1003 • Prob(income = medium) = 991/1003 • Prob(income = high) = 11/1003 • The “corrected” probability estimates are close to their “uncorrected” counterparts
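A tiny sketch of the Laplacian correction as described above: add one to every count and add the number of distinct values to the denominator. The numbers reproduce the slide's income example.

def laplace_probs(counts):
    """counts: dict value -> raw count; returns the smoothed estimates P(value)."""
    total = sum(counts.values()) + len(counts)   # one pseudo-count per distinct value
    return {v: (c + 1) / total for v, c in counts.items()}

print(laplace_probs({"low": 0, "medium": 990, "high": 10}))
# low: 1/1003 ~ 0.001, medium: 991/1003 ~ 0.988, high: 11/1003 ~ 0.011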

  46. NBC: Comments • Advantages • Easy to implement • Good results obtained in most of the cases • Disadvantages • Assumption: class conditional independence, therefore loss of accuracy • Practically, dependencies exist among variables • How to deal with these dependencies? • Bayesian Belief Networks

  47. The perceptron • Input: each example xi has a set of attributes xi = (xi1, xi2, …, xim) and is of class yi • Estimated classification output: ui • Task: express the output for each sample xi as a weighted (linear) combination of its attributes • <w, x>: the inner (dot) product of w and x • How to learn the weights?

  48. The perceptron (online) [Figure: positively and negatively labeled points with a linear separator] • f(x) can also be written as a linear combination of all training examples

  49. The perceptron [Figure: a set of linearly separable + and – points with a separating hyperplane drawn between them] • The perceptron learning algorithm is guaranteed to find a separating hyperplane, if there is one
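To make the learning rule concrete, a from-scratch sketch of the classic online perceptron update (add y·x to the weights on every mistake); the toy data, margin filter, and epoch limit are my choices. As the slide states, convergence is guaranteed only when a separating hyperplane exists.

import numpy as np

def perceptron_train(X, y, epochs=1000):
    """X: (n, d) array of samples; y: labels in {-1, +1}. Returns weights and bias."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi                    # online update: w <- w + y * x
                b += yi
                mistakes += 1
        if mistakes == 0:                       # a separating hyperplane has been found
            break
    return w, b

# Linearly separable toy data: the class is the sign of x1 + x2 - 1
rng = np.random.default_rng(0)
X = rng.uniform(-1, 2, size=(300, 2))
X = X[np.abs(X.sum(axis=1) - 1) > 0.2][:100]    # keep points away from the boundary (margin)
y = np.where(X.sum(axis=1) > 1, 1, -1)
w, b = perceptron_train(X, y)
print((np.sign(X @ w + b) == y).mean())         # 1.0: the data is separable with a margin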
