Data Mining Classification: Alternative Techniques



    1. Data Mining Classification: Alternative Techniques

    2. Rule-Based Classifier Classify records by using a collection of “if…then…” rules Rule: (Condition) → y where Condition is a conjunction of attribute tests and y is the class label LHS: rule antecedent or condition RHS: rule consequent Examples of classification rules: (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

    3. Rule-based Classifier (Example) R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles R5: (Live in Water = sometimes) → Amphibians

    4. Application of Rule-Based Classifier A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule

    5. Rule Coverage and Accuracy Coverage of a rule: Fraction of records that satisfy the antecedent of a rule Accuracy/Confidence of a rule: Fraction of records for which the rule “fires” (antecedent is satisfied) that yields correct classification
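
A minimal sketch of how coverage and accuracy could be computed for a single rule. The tiny dataset, attribute names, and the antecedent function below are invented for illustration and are not the slide's data.

```python
# Sketch: coverage and accuracy of one classification rule.
# The records, attributes, and rule below are invented for illustration.

records = [
    {"Give Birth": "no",  "Can Fly": "yes", "Class": "Birds"},
    {"Give Birth": "no",  "Can Fly": "no",  "Class": "Reptiles"},
    {"Give Birth": "yes", "Can Fly": "no",  "Class": "Mammals"},
    {"Give Birth": "no",  "Can Fly": "yes", "Class": "Birds"},
]

def antecedent(r):
    # R1: (Give Birth = no) and (Can Fly = yes)
    return r["Give Birth"] == "no" and r["Can Fly"] == "yes"

consequent = "Birds"

covered = [r for r in records if antecedent(r)]    # records the rule "fires" on
coverage = len(covered) / len(records)             # fraction of all records covered
accuracy = (sum(r["Class"] == consequent for r in covered) / len(covered)) if covered else 0.0

print(f"coverage = {coverage:.2f}, accuracy = {accuracy:.2f}")
```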

    6. How does Rule-based Classifier Work?

    7. Characteristics of Rule-Based Classifier Rule Learners may or may not have these properties: Mutually exclusive rules Every record is covered by at most one rule Exhaustive rules Classifier has exhaustive coverage if it accounts for every possible combination of attribute values Each record is covered by at least one rule If both mutually exclusive and exhaustive, then every record covered by exactly one rule

    8. From Decision Trees To Rules

    9. Rules Can Be Simplified

    10. Effect of Rule Simplification Rules are no longer mutually exclusive A record may trigger more than one rule Solution? Ordered rule set Unordered rule set – use voting schemes Rules are no longer exhaustive A record may not trigger any rules Solution? Use a default class

    11. Ordered Rule Set Rules are rank ordered according to their priority An ordered rule set is known as a decision list When a test record is presented to the classifier It is assigned to the class label of the highest ranked rule it has triggered If none of the rules fired, it is assigned to the default class

    12. Rule Ordering Schemes Rule-based ordering Individual rules are ranked based on their quality Class-based ordering Rules that belong to the same class appear together

    13. Building Classification Rules Direct Method: Extract rules directly from data e.g.: RIPPER, CN2, Holte’s 1R Indirect Method: Extract rules from other classification models (e.g., decision trees, neural networks, etc.) e.g.: C4.5rules

    14. Direct Method: Sequential Covering Start from an empty rule Grow a rule using the Learn-One-Rule function Remove training records covered by the rule Repeat Step (2) and (3) until stopping criterion is met
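
A schematic sketch of the sequential-covering loop just described. The `learn_one_rule` callback, the `matches` method, and the stopping test are placeholders standing in for whatever rule learner and criteria are actually used.

```python
# Sketch of sequential covering; learn_one_rule and the stopping criterion are placeholders.
def sequential_covering(records, target_class, learn_one_rule, min_coverage=1):
    rules = []
    remaining = list(records)
    while True:
        rule = learn_one_rule(remaining, target_class)   # grow one rule (e.g., greedily)
        if rule is None:
            break
        covered = [r for r in remaining if rule.matches(r)]
        if len(covered) < min_coverage:                  # stopping criterion (placeholder)
            break
        rules.append(rule)
        remaining = [r for r in remaining if not rule.matches(r)]  # remove covered records
    return rules
```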

    15. Example of Sequential Covering

    16. Example of Sequential Covering…

    17. Aspects of Sequential Covering Rule Growing Instance Elimination Rule Evaluation Stopping Criterion Rule Pruning

    18. Rule Growing Two common strategies

    19. Instance Elimination Why do we need to eliminate instances? Otherwise, the next rule is identical to previous rule Why do we remove positive instances? Ensure that the next rule is different Why do we remove negative instances? Prevent underestimating accuracy of rule Compare rules R2 and R3 in the diagram

    20. Rule Evaluation Metrics: Accuracy Laplace The Laplace estimate moves the probability estimate towards the probability you would use with no information, assuming all classes are equally likely We will work through an example when we get to Naïve Bayes Classifiers
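
A small sketch of a Laplace-corrected rule accuracy, assuming the common form (correct + 1)/(covered + k) with k classes; the counts used are illustrative.

```python
# Laplace-corrected rule accuracy: (correct + 1) / (covered + k), where k = number of classes.
def laplace_accuracy(n_covered, n_correct, n_classes):
    return (n_correct + 1) / (n_covered + n_classes)

# A rule covering 3 records, all correct, in a 2-class problem:
print(laplace_accuracy(3, 3, 2))   # 4/5 = 0.8, pulled toward 0.5 from the raw 3/3 = 1.0
```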

    21. Rule Pruning Rule Pruning Similar to post-pruning of decision trees Reduced Error Pruning: Remove one of the conjuncts in the rule Compare error rate on validation set before and after pruning If error improves, prune the conjunct

    22. Summary of Direct Method Grow a single rule Remove Instances from rule Prune the rule (if necessary) Add rule to Current Rule Set Repeat

    23. Direct Method: RIPPER For 2-class problem, have minority class be the positive class and the majority class the negative class Learn rules for positive class Negative class will be default class Good for data sets with unbalanced class distributions For multi-class problem Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class) Learn the rule set for smallest class first, treat the rest as negative class Repeat with next smallest class as positive class

    24. Direct Method: RIPPER Growing a rule: Start from empty rule Add conjuncts as long as they improve FOIL’s information gain Stop when rule no longer covers negative examples Prune the rule immediately using incremental reduced error pruning Measure for pruning: v = (p-n)/(p+n) p: number of positive examples covered by the rule in the validation set n: number of negative examples covered by the rule in the validation set Pruning method: delete any final sequence of conditions that maximizes v
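
A sketch of the incremental reduced-error pruning step using the metric v = (p − n)/(p + n). Representing a rule as a list of (attribute, value) conjuncts and matching by equality are simplifying assumptions, not RIPPER's actual data structures.

```python
# Sketch: prune a rule by deleting the final sequence of conjuncts that maximizes
# v = (p - n) / (p + n) on a validation set. Rule = list of (attribute, value) conjuncts.

def matches(conjuncts, record):
    return all(record.get(a) == v for a, v in conjuncts)

def metric_v(conjuncts, validation, positive_class):
    covered = [r for r in validation if matches(conjuncts, r)]
    p = sum(r["Class"] == positive_class for r in covered)
    n = len(covered) - p
    return (p - n) / (p + n) if (p + n) > 0 else float("-inf")

def prune_rule(conjuncts, validation, positive_class):
    best, best_v = conjuncts, metric_v(conjuncts, validation, positive_class)
    for cut in range(len(conjuncts) - 1, 0, -1):          # keep only the first `cut` conjuncts,
        candidate = conjuncts[:cut]                       # i.e. delete a final sequence
        v = metric_v(candidate, validation, positive_class)
        if v >= best_v:
            best, best_v = candidate, v
    return best
```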

    25. Direct Method: RIPPER Building a Rule Set: Use sequential covering algorithm Finds the best rule that covers the current set of positive examples Eliminate both positive and negative examples covered by the rule Each time a rule is added to the rule set, compute the new description length stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far

    26. Indirect Methods

    27. Indirect Method: C4.5rules Extract rules from an unpruned decision tree For each rule, r: A → y, consider an alternative rule r’: A’ → y where A’ is obtained by removing one of the conjuncts in A Compare the pessimistic error rate for r against all r’s Prune if one of the r’s has lower pessimistic error rate Repeat until we can no longer improve generalization error

    28. Indirect Method: C4.5rules Instead of ordering the rules, order subsets of rules (class ordering) Each subset is a collection of rules with the same rule consequent (class) Compute description length of each subset Order classes with lowest Minimum Description Length first

    29. Advantages of Rule-Based Classifiers As highly expressive as decision trees Are they more expressive? Can every DT be turned into a Rule Set? Can every Rule Set be turned into a DT? Easy to interpret Easy to generate Can classify new instances rapidly Performance comparable to decision trees

    30. Instance-Based Classifiers

    31. Instance Based Classifiers Examples: Rote-learner Memorizes entire training data and performs classification only if attributes of record match one of the training examples exactly No generalization Nearest neighbor Uses k “closest” points (nearest neighbors) for performing classification

    32. Nearest Neighbor Classifiers Basic idea: If it walks like a duck, quacks like a duck, then it’s probably a duck

    33. Nearest-Neighbor Classifiers

    34. Definition of Nearest Neighbor

    35. 1 nearest-neighbor Voronoi Diagram

    36. Nearest Neighbor Classification Compute distance between two points: Euclidean distance Determine the class from nearest neighbor list take the majority vote of class labels among the k-nearest neighbors Weigh the vote according to distance weight factor, w = 1/d²
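
A compact sketch of distance-weighted k-nearest-neighbor classification with Euclidean distance; the training points and labels are invented for illustration.

```python
import math
from collections import defaultdict

# Sketch: k-NN with Euclidean distance and distance-weighted voting (w = 1/d^2).
def knn_predict(train, query, k=3):
    # train: list of (feature_vector, class_label)
    nearest = sorted((math.dist(x, query), y) for x, y in train)[:k]
    votes = defaultdict(float)
    for d, y in nearest:
        votes[y] += 1.0 / (d * d) if d > 0 else float("inf")   # an exact match dominates
    return max(votes, key=votes.get)

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((3.0, 3.2), "B"), ((2.9, 3.1), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))   # -> "A"
```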

    37. Nearest Neighbor Classification… Choosing the value of k: If k is too small, sensitive to noise points If k is too large, neighborhood may include points from other classes

    38. Nearest Neighbor Classification… Scaling issues Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes Example: height of a person may vary from 1.5m to 1.8m weight of a person may vary from 90lb to 300lb income of a person may vary from $10K to $1M
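
One common way to handle the scaling issue is min-max normalization of each attribute to [0, 1]; the ranges below are the ones quoted on the slide, and the specific values being scaled are made up.

```python
# Min-max scaling so no single attribute dominates the Euclidean distance.
def min_max_scale(value, lo, hi):
    return (value - lo) / (hi - lo)

# Using the ranges from the slide:
height = min_max_scale(1.7, 1.5, 1.8)               # height in metres
weight = min_max_scale(150, 90, 300)                # weight in pounds
income = min_max_scale(50_000, 10_000, 1_000_000)   # income in dollars
print(height, weight, income)                       # all now on a comparable 0-1 scale
```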

    39. Nearest Neighbor Classification… Problem with Euclidean measure: High dimensional data curse of dimensionality Can produce counter-intuitive results if space is very sparse

    40. Nearest neighbor Classification… k-NN classifiers are lazy learners They do not build models explicitly Unlike eager learners such as decision tree induction and rule-based systems Classifying unknown records is relatively expensive

    41. Remarks on Lazy vs. Eager Learning Instance-based learning: lazy evaluation Decision-tree and Rule-Based classification: eager evaluation Key differences Lazy methods may consider the query instance xq when deciding how to generalize beyond the training data D Eager methods cannot, since they have already chosen a global approximation before seeing the query Efficiency: Lazy - less time training but more time predicting Accuracy Lazy methods effectively use a richer hypothesis space since they use many local linear functions to form an implicit global approximation to the target function Eager: must commit to a single hypothesis that covers the entire instance space

    42. Bayes Classifier A probabilistic framework for classification problems Often appropriate because the world is noisy and some relationships are probabilistic in nature Is predicting who will win a baseball game probabilistic in nature? Before getting to the heart of the matter, we will go over some basic probability. I will be much more “gentle” than the text and go over all of the details and reasoning You may find a few of these basic probability questions on your exam and you should also understand how Bayesian classifiers work Stop me if you have questions!!!!

    43. Conditional Probability The conditional probability of an event C given that event A has occurred is denoted P(C|A) Here are some practice questions that you should be able to answer based on your knowledge from CS1100/1400 Given a standard six-sided die: What is P(roll a 1|roll an odd number)? What is P(roll a 1|roll an even number)? What is P(roll a number >2)? What is P(roll a number >2|roll a number > 1)? Given a standard deck of cards: What is P(pick an Ace|pick a red card)? What is P(pick ace of clubs|pick an Ace)? What is P(pick an ace on second card|pick an ace on first card)? Do you get it? Any questions before we move on?

    44. Conditional Probability Continued P(A,C) is the probability that A and C both occur What is P(pick an Ace, pick a red card) = ??? Note that this is the same as the probability of picking a red ace Two events are independent if one occurring does not impact the occurrence or non-occurrence of the other one Please give me some examples of independent events You can think of some examples from flipping coins if that helps In the example from above, are A and C independent events? If two events A and B are independent, then: P(A, B) = P(A) x P(B) Note that this helps us answer the P(A,C) above, although most of you probably solved it directly.

    45. Conditional Probability Continued

    46. An Example Let’s use the example from before, where: A = “pick an Ace” and C = “pick a red card” Using the previous equation: P(C|A) = P(A,C)/P(A) We now get: P(red card|Ace) = P(red ace)/P(Ace) P(red card|Ace) = (2/52)/(4/52) = 2/4 = .5 Hopefully this makes sense, just like the Venn Diagram should

    47. Bayes Theorem Bayes Theorem states that: P(C|A) =[ P(A|C) P(C)] / P(A) Prove this equation using the two equations we informally showed to be true using the Venn Diagrams: P(C|A) = P(A,C)/P(A) and P(A|C) = P(A,C)/P(C) Start with P(C|A) = P(A,C)/P(A) Then notice that we are part way there and only need to replace P(A,C) By rearranging the second equation (after the “and”) to isolate P(A,C), we can substitute in P(A|C)P(C) for P(A,C) and the proof is done!

    48. Some more terminology The Prior Probability is the probability assuming no specific information. Thus we would refer to P(A) as the prior probability of event A occurring We would not say that P(A|C) is the prior probability of A occurring The Posterior probability is the probability given that we know something We would say that P(A|C) is the posterior probability of A (given that C occurs)

    49. Example of Bayes Theorem Given: A doctor knows that meningitis causes stiff neck 50% of the time Prior probability of any patient having meningitis is 1/50,000 Prior probability of any patient having stiff neck is 1/20 If a patient has stiff neck, what’s the probability he/she has meningitis?
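
Working through the numbers on the slide with Bayes' theorem, P(M|S) = P(S|M) P(M) / P(S):

```python
# P(meningitis | stiff neck) = P(stiff neck | meningitis) * P(meningitis) / P(stiff neck)
p_s_given_m = 0.5          # meningitis causes stiff neck 50% of the time
p_m = 1 / 50_000           # prior probability of meningitis
p_s = 1 / 20               # prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)         # 0.0002 -- a stiff neck alone is weak evidence for meningitis
```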

    50. Bayesian Classifiers Given a record with attributes (A1, A2,…,An) The goal is to predict class C Actually, we want to find the value of C that maximizes P(C| A1, A2,…,An ) Can we estimate P(C| A1, A2,…,An ) directly (w/o Bayes)? Yes, we simply need to count up the number of times we see A1, A2,…,An and then see what fraction belongs to each class For example, if n=3 and the feature vector “4,3,2” occurs 10 times and 4 of these belong to C1 and 6 to C2, then: What is P(C1|”4,3,2”)? What is P(C2|”4,3,2”)? Unfortunately, this is generally not feasible since not every feature vector will be found in the training set

    51. Bayesian Classifiers Indirect Approach: Use Bayes Theorem compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem Choose value of C that maximizes P(C | A1, A2, …, An) Equivalent to choosing value of C that maximizes P(A1, A2, …, An|C) P(C) Since the denominator is the same for all values of C

    52. Naïve Bayes Classifier How can we estimate P(A1, A2, …, An |C)? We can measure it directly, but only if the training set samples every feature vector. Not practical! So, we must assume independence among attributes Ai when class is given: P(A1, A2, …, An |C) = P(A1| Cj) P(A2| Cj)… P(An| Cj) Then we can estimate P(Ai| Cj) for all Ai and Cj from training data This is reasonable because now we are looking only at one feature at a time. We can expect to see each feature value represented in the training data. New point is classified to Cj if P(Cj) Π P(Ai| Cj) is maximal.

    53. How to Estimate Probabilities from Data? Class: P(C) = Nc/N e.g., P(No) = 7/10, P(Yes) = 3/10 For discrete attributes: P(Ai | Ck) = |Aik|/ Nc where |Aik| is the number of instances having attribute value Aik and belonging to class Ck Examples: P(Status=Married|No) = 4/7 P(Refund=Yes|Yes)=0

    54. How to Estimate Probabilities from Data? For continuous attributes: Discretize the range into bins Two-way split: (A < v) or (A > v) choose only one of the two splits as new attribute Creates a binary feature Probability density estimation: Assume attribute follows a normal distribution and use the data to fit this distribution Once probability distribution is known, can use it to estimate the conditional probability P(Ai|c) We will not deal with continuous values on HW or exam Just understand the general ideas above For the tax cheating example, we will assume that “Taxable Income” is discrete Each of the 10 values will therefore have a prior probability of 1/10

    55. Example of Naïve Bayes We start with a test example and want to know its class. Does this individual evade their taxes: Yes or No? Here is the feature vector: Refund = No, Married, Income = 120K Now what do we do? First try writing out the thing we want to measure

    56. Example of Naïve Bayes We start with a test example and want to know its class. Does this individual evade their taxes: Yes or No? Here is the feature vector: Refund = No, Married, Income = 120K Now what do we do? First try writing out the thing we want to measure P(Evade|[No, Married, Income=120K]) Next, what do we need to maximize?

    57. Example of Naïve Bayes We start with a test example and want to know its class. Does this individual evade their taxes: Yes or No? Here is the feature vector: Refund = No, Married, Income = 120K Now what do we do? First try writing out the thing we want to measure P(Evade|[No, Married, Income=120K]) Next, what do we need to maximize? P(Cj) Π P(Ai| Cj)

    58. Example of Naïve Bayes Since we want to maximize P(Cj) Π P(Ai| Cj) What quantities do we need to calculate in order to use this equation? Someone come up to the board and write them out, without calculating them Recall that we have three attributes: Refund: Yes, No Marital Status: Single, Married, Divorced Taxable Income: 10 different “discrete” values While we could compute every P(Ai| Cj) for all Ai, we only need to do it for the attribute values in the test example

    59. Values to Compute Given we need to compute P(Cj) Π P(Ai| Cj) We need to compute the class probabilities P(Evade=No) P(Evade=Yes) We need to compute the conditional probabilities P(Refund=No|Evade=No) P(Refund=No|Evade=Yes) P(Marital Status=Married|Evade=No) P(Marital Status=Married|Evade=Yes) P(Income=120K|Evade=No) P(Income=120K|Evade=Yes)

    60. Computed Values Given we need to compute P(Cj) Π P(Ai| Cj) We need to compute the class probabilities P(Evade=No) = 7/10 = .7 P(Evade=Yes) = 3/10 = .3 We need to compute the conditional probabilities P(Refund=No|Evade=No) = 4/7 P(Refund=No|Evade=Yes) = 3/3 = 1.0 P(Marital Status=Married|Evade=No) = 4/7 P(Marital Status=Married|Evade=Yes) = 0/3 = 0 P(Income=120K|Evade=No) = 1/7 P(Income=120K|Evade=Yes) = 0/3 = 0

    61. Finding the Class Now compute P(Cj) Π P(Ai| Cj) for both classes for the test example [No, Married, Income = 120K] For Class Evade=No we get: .7 x 4/7 x 4/7 x 1/7 = 0.032 For Class Evade=Yes we get: .3 x 1 x 0 x 0 = 0 Which one is best? Clearly we would select “No” for the class value Note that these are not the actual probabilities of each class, since we did not divide by P([No, Married, Income = 120K])
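
A sketch reproducing the calculation above in code; the probability values are taken directly from the preceding slides.

```python
# Naive Bayes score P(Cj) * prod of P(Ai|Cj) for the test record
# [Refund=No, Married, Income=120K], using the probabilities from the slides.
from math import prod

score_no  = prod([0.7, 4/7, 4/7, 1/7])   # Evade = No
score_yes = prod([0.3, 1.0, 0.0, 0.0])   # Evade = Yes

print(score_no, score_yes)               # roughly 0.03 vs 0.0
print("predicted class:", "No" if score_no > score_yes else "Yes")
```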

    62. Naïve Bayes Classifier If one of the conditional probabilities is zero, then the entire expression becomes zero This is not ideal, especially since probability estimates may not be very precise for rarely occurring values We use the Laplace estimate to improve things. Without a lot of observations, the Laplace estimate moves the probability towards the value it would have assuming all outcomes are equally likely
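
A sketch of one common form of the Laplace estimate for a conditional probability, which adds 1 to the count in the numerator and the number of distinct attribute values to the denominator; the exact form is an assumption here since the slide does not give the formula.

```python
# One common Laplace-smoothed estimate of P(Ai = v | C):
# add 1 to the count and the number of distinct attribute values to the denominator.
def laplace_conditional(count_value_and_class, count_class, n_attribute_values):
    return (count_value_and_class + 1) / (count_class + n_attribute_values)

# The zero-probability case from the previous slides:
# P(Marital Status = Married | Evade = Yes) was 0/3; with 3 marital-status values it becomes
print(laplace_conditional(0, 3, 3))   # 1/6 instead of 0, so the product no longer collapses to 0
```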

    63. Naïve Bayes (Summary) Robust to isolated noise points Robust to irrelevant attributes Independence assumption may not hold for some attributes But works surprisingly well in practice for many problems

    64. More Examples There are two more examples coming up Go over them before trying the HW, unless you are clear on Bayesian Classifiers You are not responsible for Bayesian Belief Networks

    66. Play-tennis example: classifying X An unseen sample X = <rain, hot, high, false> P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582 P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286 Sample X is classified in class n (don’t play)

    67. Example of Naïve Bayes Classifier

    68. Artificial Neural Networks (ANN) The study of ANNs was inspired by the human brain ANNs do not try to exactly mimic the brain Relevant terms: Neurons: nerve cells in the brain Synapse: gap between neurons over which an electrical signal travels The brain learns by changing the strength of synaptic connections upon repeated stimulation by the same impulse How do you think the brain and an ANN differ? Neurons are very slow but the brain is massively parallel Can you think of tasks that show this difference? Adding numbers: sequential, so the brain is slower Seeing an image: the brain can exploit parallelism

    69. Artificial Neural Networks (ANN)

    70. Artificial Neural Networks (ANN)

    71. Artificial Neural Networks (ANN) Model is an assembly of inter-connected nodes and weighted links Output node sums up its input values according to the weights of its links Compare output node against some threshold t Perceptron model has no hidden units and a simple activation function What is Y for X=1,0,1 and 0,1,1?

    72. Algorithm for learning ANN Initialize the weights (w0, w1, …, wk) Adjust the weights in such a way that the output of ANN is consistent with class labels of training examples Objective function: Find the weights wi’s that minimize the above objective function e.g., backpropagation algorithm Make small corrections to the weights for each example Iterate over the data many times. Each cycle is called an epoch. Will make many passes over the training data.

    73. Weight Update formula Weight update formula for iteration k: wj(k+1) = wj(k) + λ (yi − y’i(k)) xij xij is the value for the jth attribute of training example xi λ is the learning rate and controls how big the correction is What are the advantages of making λ big or small?
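
A minimal perceptron training sketch using the update rule above, where `lr` plays the role of λ; the learning rate, epoch count, and toy data are assumptions for illustration.

```python
# Sketch: perceptron learning with w_j <- w_j + lr * (y - y_hat) * x_j (bias folded in as x_0 = 1).
def sign(v):
    return 1 if v >= 0 else -1

def train_perceptron(data, lr=0.1, epochs=20):
    # data: list of (features, label) with label in {-1, +1}
    n = len(data[0][0])
    w = [0.0] * (n + 1)                      # w[0] is the bias weight
    for _ in range(epochs):                  # each pass over the data is one epoch
        for x, y in data:
            xs = [1.0] + list(x)
            y_hat = sign(sum(wi * xi for wi, xi in zip(w, xs)))
            w = [wi + lr * (y - y_hat) * xi for wi, xi in zip(w, xs)]
    return w

# A linearly separable toy problem (logical OR with -1/+1 labels):
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(train_perceptron(data))
```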

    74. General Structure of ANN

    75. Learning Power of an ANN Perceptron is guaranteed to converge if data is linearly separable It will learn a hyperplane that separates the classes The XOR function on page 250 is not linearly separable A multilayer ANN has no such guarantee of convergence but can learn functions that are not linearly separable An ANN with a hidden layer can learn the XOR function by constructing two hyperplanes (see page 253)

    76. History of ANNs In the 1940s, McCulloch and Pitts introduced the first neural network computing model In the 1950s, Rosenblatt introduced the perceptron Because perceptrons have limited expressive power, the initial interest in ANNs waned Later people realized that the lack of convergence isn't necessarily such a big deal and that multilayer perceptrons are powerful, referred to as universal approximators Great interest in the 1980s in ANNs

    77. Characteristics of an ANN Very expressive and can approximate any function Much more expressive than decision trees Relatively time consuming to train Much slower than decision trees But classification of examples is very quick Do you think that neural networks are interpretable? Try looking at someone’s brain to see what they are thinking! Handles continuous inputs quite well and naturally Seem popular with sensory applications (e.g., recognizing objects)

    78. Final Notes on ANNs More complex ANNs exist Recurrent ANNs allow links from one layer back to previous layers or to the same layer If this is not allowed, then we have a feedforward ANN. Perceptrons are feedforward only. Most ANNs use an activation function that is more complex than the sign function. Sigmoid functions are popular (see page 252) These functions compute values that are non-linear in terms of the input values

    79. Genetic Algorithms (in 1 Slide) GA: based on an analogy to biological evolution Each rule is represented by a string of bits An initial population is created consisting of randomly generated rules e.g., If A1 and Not A2 then C2 encoded as 100 Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring The fitness of a rule is represented by its classification accuracy on a set of training examples Offspring are generated by crossover and mutation GAs are a general search/optimization method, not just a classification method. This can be contrasted with other methods

    80. Ensemble Methods Construct a set of classifiers from the training data Predict class label of previously unseen records by aggregating predictions made by multiple classifiers

    81. General Idea

    82. Why does it work? Suppose there are 25 base classifiers Each classifier has error rate ε = 0.35 Assume classifiers are independent Probability that the ensemble classifier makes a wrong prediction (a majority of the 25 err): Σ_{i=13..25} C(25, i) ε^i (1 − ε)^(25−i) ≈ 0.06 Practice has shown that even when independence does not hold results are good
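
Evaluating that binomial sum with the numbers on the slide (the ensemble errs when at least 13 of the 25 independent base classifiers err):

```python
# P(ensemble wrong) = sum over i >= 13 of C(25, i) * eps^i * (1 - eps)^(25 - i)
from math import comb

eps, n = 0.35, 25
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(n // 2 + 1, n + 1))
print(round(p_wrong, 3))   # ~0.06, far below the 0.35 error rate of a single classifier
```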

    83. Methods for generating Multiple Classifiers Manipulate the training data Sample the data differently each time Examples: Bagging and Boosting Manipulate the input features Sample the features differently each time Makes especially good sense if there is redundancy Example: Random Forest Manipulate class labels Convert multi-class problem to ensemble of binary classifiers Manipulate the learning algorithm Vary some parameter of the learning algorithm For example: amount of pruning, ANN network topology, etc.

    84. Background Classifier performance can be impacted by: Bias: assumptions made to help with generalization "Simpler is better" is a bias Variance: a learning method will give different results based on small changes (e.g., in training data). When I run experiments and use random sampling with repeated runs, I get different results each time. Noise: measurements may have errors or the class may be inherently probabilistic

    85. How Ensembles Help Ensemble methods can assist with the bias and variance Averaging the results over multiple runs will reduce the variance I observe this when I use 10 runs with random sampling and see that my learning curves are much smoother Ensemble methods especially helpful for unstable classifier algorithms Decision trees are unstable since small changes in the training data can greatly impact the structure of the learned decision tree If you combine different classifier methods into an ensemble, then you are using methods with different biases You are more likely to use a classifier with a bias that is a good match for the problem You may even be able to identify the best methods and weight them more

    86. Examples of Ensemble Methods How to generate an ensemble of classifiers? Bagging Boosting These methods have been shown to be quite effective A technique ignored by the textbook is to combine classifiers built separately By simple voting By voting and factoring in the reliability of each classifier Note that Enterprise Miner allows you to do this by combining diverse classifiers

    87. Bagging Sampling with replacement Build classifier on each bootstrap sample Each record has probability 1 − (1 − 1/n)^n of appearing in a given bootstrap sample (about 63% for large n) Some values will be picked more than once Combine the resulting classifiers, such as by majority voting Greatly reduces the variance when compared to a single base classifier
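
A schematic bagging sketch: draw bootstrap samples, train one base classifier per sample, and predict by majority vote. The `base_learner` factory and its fit/predict interface are placeholders for whatever classifier is actually used.

```python
import random
from collections import Counter

# Sketch of bagging: train base classifiers on bootstrap samples, predict by majority vote.
# `base_learner` is a placeholder factory returning an object with fit(X, y) and predict(x).
def bagging_fit(X, y, base_learner, n_estimators=25):
    models = []
    n = len(X)
    for _ in range(n_estimators):
        idx = [random.randrange(n) for _ in range(n)]     # sample n records with replacement
        model = base_learner()
        model.fit([X[i] for i in idx], [y[i] for i in idx])
        models.append(model)
    return models

def bagging_predict(models, x):
    votes = Counter(m.predict(x) for m in models)          # majority vote over base classifiers
    return votes.most_common(1)[0][0]
```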

    88. Boosting An iterative procedure to adaptively change distribution of training data by focusing more on previously misclassified records Initially, all N records are assigned equal weights Unlike bagging, weights may change at the end of boosting round

    89. Boosting Records that are wrongly classified will have their weights increased Records that are classified correctly will have their weights decreased
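
A sketch of the reweighting idea, written in the AdaBoost style: misclassified records gain weight, correctly classified records lose weight, and the weights are renormalized. The exact update factor is the standard AdaBoost one and is an assumption here, since the slide only states the direction of the change.

```python
import math

# AdaBoost-style reweighting sketch (assumes 0 < error < 0.5 for the round).
def update_weights(weights, correct, error):
    # weights: current record weights; correct: one boolean per record; error: weighted error
    alpha = 0.5 * math.log((1 - error) / error)            # importance of this round's classifier
    new_w = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(new_w)
    return [w / total for w in new_w]                      # renormalize so weights sum to 1
```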

    90. Ensembles and Enterprise Miner Enterprise Miner has support for: Bagging Boosting Combining different models For example, an ANN and a DT model C5 also has support for boosting C4.5 has no support for boosting

    91. Class Imbalance Class imbalance occurs when the classes are very unevenly distributed Examples would include fraud prediction and identification of rare diseases If there is class imbalance, accuracy may be high even if the rare class is never predicted This could be okay, but only if both classes are equally important This is usually not the case. Typically the rare class is more important The cost of a false negative is usually much higher than the cost of a false positive The rare class is designated the positive class

    92. Confusion Matrix The following abbreviations are standard: FP: false positive TP: True Positive FN: False Negative TN: True Negative

    93. Classifier Evaluation Metrics Accuracy = (TP + TN)/(TP + TN + FN + FP) Precision and Recall Recall = TP/(TP + FN) Recall measures the fraction of positive class examples that are correctly identified Precision = TP/(TP + FP) Precision is essentially the accuracy of the examples that are classified as positive We would like to maximize precision and recall They are usually competing goals These measures are appropriate for class imbalance since recall explicitly addresses the rare class

    94. Classifier Evaluation Metrics Cont. One issue with precision and recall is that they are two numbers, not one That makes simple comparisons more difficult and it is not clear how to determine the best classifier Solution: combine the two The F-measure combines precision and recall The F1 measure is defined as: (2 x recall x precision)/(recall + precision)
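
A small sketch computing the metrics above from confusion-matrix counts; the counts are made up for illustration (note how high accuracy coexists with modest recall under class imbalance).

```python
# Accuracy, precision, recall and F1 from confusion-matrix counts (illustrative numbers).
TP, FP, FN, TN = 30, 10, 20, 940

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```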

    95. ROC Curves Plots TP Rate on y-axis vs. FP Rate on x-axis TP Rate = TP/(TP + FN) FP Rate = FP/(TN + FP) Interpretation of critical points (TPR=0, FPR=0): Model always predicts negative (TPR=1, FPR=1): Model always predicts positive (TPR=1, FPR=0): Ideal model: all positives predicted correctly and no negatives mistakenly classified as positive ROC Curves can be used to compare classifiers Area under the ROC curve (AUC) gives a single statistic of the overall robustness of a model AUC is becoming almost as common as accuracy The ROC curve does not depend on the class distribution of the population

    96. What you need to know For evaluation metrics, you need to know What the confusion matrix is and how to generate it How to define accuracy and compute it from a confusion matrix You should be able to figure out the formula for precision and recall and compute the values from a confusion matrix Understand the F-measure (I would provide the formula) Understand the basics of the ROC curve but not the details

    97. Cost-Sensitive Learning Cost-sensitive learning will factor in cost information For example, you may be given the relative cost of a FN vs. FP You may be given the costs or utilities associated with each quadrant in the confusion matrix We did this with Enterprise Miner to figure out the optimal classifier Cost-sensitive learning can be implemented by sampling the data to reflect the costs If the cost of a FN is twice that of a FP, then you can increase the ratio of positive examples by a factor of 2 when constructing the training set This is a good general area for research and relates to some of my research interests
