
Chapter 8 – Naïve Bayes



  1. Chapter 8 – Naïve Bayes DM for Business Intelligence

  2. Naïve Bayes Characteristics Theorem named after Reverend Thomas Bayes (1702 – 1761) Data-driven, not model-driven (i.e., lets the data “create” the training model) Is Supervised Learning (i.e., has a defined Output Variable) Is naïve because it assumes that each of its inputs is independent of the others (an assumption that rarely holds true in real life) Example: an illness may be diagnosed as Flu if a patient has a virus, a fever, and congestion (even though the virus may actually be causing the other two symptoms). Also naïve because it ignores the existence of other symptoms.

  3. Naïve Bayes: The Basic Idea For a given new record to be classified, find other records like it (i.e., same values for input variables) What is the prevalent class (i.e., Output Variable value) among those records? Assign that class to your new record

  4. Usage • Requires Categorical input variables • Numerical input variables must be binned and converted to Categorical variables (see the binning sketch below) • Can be used with very large data sets • “Exact” Bayes requires records in your dataset that exactly match the record you are trying to classify • Bayes can be “modified” to estimate class probabilities even when no exactly matching record exists in your dataset
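
A minimal sketch of the binning step (pandas and the cutoff of 4 here are my own assumptions; the slides don't prescribe a tool):

```python
import pandas as pd

# Hypothetical firm revenues in $M; the cutoff of 4 is an arbitrary choice.
revenue = pd.Series([1.2, 5.8, 3.3, 9.1])

# Convert the numerical variable into the categorical "Size" predictor.
size = pd.cut(revenue, bins=[0, 4, float("inf")], labels=["small", "large"])
print(size.tolist())  # ['small', 'large', 'small', 'large']
```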

  5. Exact Bayes Classifier Relies on finding other records that share same input variable values as record-to-be-classified. Want to find “probability of belonging to class C, given specified values of predictors/input variables” Even with large data sets, may be hard to find other records that exactly match your record, in terms of predictor values.

  6. Solution – Modified Naïve Bayes • Assume independence of predictor variables (within each class) • Use multiplication rule • Find same probability that record belongs to class C, given predictor values, without limiting calculation to records that share all those same values

  7. Modified Bayes Calculations • Take a record, and note its predictor values • Find the probabilities of those predictor values occurring across all records in C1 • Multiply them together, then by the proportion of records belonging to C1 • Do the same for C2, C3, etc. • Prob. of belonging to C1 is the value from step (3) divided by the sum of all such values across C1 … Cn • Establish & adjust a “cutoff” prob. for the class of interest (a sketch of these steps follows below)
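
A minimal sketch of those five steps (my own illustration, not code from the book; records are (predictor-dict, class-label) pairs):

```python
from collections import Counter

def naive_bayes_probs(records, new_record):
    """records: list of (predictors, label) pairs; returns P(class | new_record)."""
    class_counts = Counter(label for _, label in records)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(records)  # proportion of records belonging to class c
        for var, value in new_record.items():
            # probability that this predictor value occurs among class-c records
            matches = sum(1 for preds, label in records
                          if label == c and preds[var] == value)
            score *= matches / n_c
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}  # step 5: normalize
```

Applied to the fraud example on the following slides, this reproduces the 0.53 computed on slide 13.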

  8. Example: Financial Fraud Target variable: Audit finds “Fraud,” or “No fraud” Predictors: Prior pending legal charges (yes/no) Size of firm (small/large) Goal: Classify a firm as being Fraudulent or Truthful, if Size = “Small” and Prior Charges = “Yes”

  9. Prior [The original slide shows the training table of ten firms, with columns for Prior Legal Charges (y/n), Size (small/large), and audit outcome (Fraudulent/Truthful); the table itself is not preserved in this transcript.]
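
The counts quoted on the surrounding slides pin down one consistent dataset; a hypothetical reconstruction (my assumption, not necessarily the book's exact table):

```python
# 4 frauds (3 with prior charges, 1 small) and 6 truthfuls (1 with prior
# charges, 4 small); exactly two small firms with prior charges, one per class.
firms = [
    ({"charges": "y", "size": "small"}, "fraud"),
    ({"charges": "y", "size": "large"}, "fraud"),
    ({"charges": "y", "size": "large"}, "fraud"),
    ({"charges": "n", "size": "large"}, "fraud"),
    ({"charges": "y", "size": "small"}, "truthful"),
    ({"charges": "n", "size": "small"}, "truthful"),
    ({"charges": "n", "size": "small"}, "truthful"),
    ({"charges": "n", "size": "small"}, "truthful"),
    ({"charges": "n", "size": "large"}, "truthful"),
    ({"charges": "n", "size": "large"}, "truthful"),
]
```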

  10. Exact Bayes Calculations Goal: classify (as “fraudulent” or as “truthful”) a small firm with prior charges filed. There are 2 firms like that, one fraudulent and the other truthful. There is no “majority” class, so . . . P(fraud | prior charges=y, size=small) = ½ = 0.50 Note: the calculation is limited to the two firms matching those exact characteristics
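
Under that reconstruction, the exact-Bayes calculation is just a filter and a class proportion:

```python
# Exact Bayes: restrict to records matching the new firm, take the class proportion.
matching = [label for preds, label in firms
            if preds == {"charges": "y", "size": "small"}]
print(len(matching))                            # 2 matching firms
print(matching.count("fraud") / len(matching))  # 0.5
```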

  11. Modified Naïve Bayes Calculations Page 172 formula:

$$P(C_i \mid x_1, \ldots, x_p) = \frac{P(x_1, \ldots, x_p \mid C_i)\,P(C_i)}{P(x_1, \ldots, x_p \mid C_1)\,P(C_1) + \cdots + P(x_1, \ldots, x_p \mid C_m)\,P(C_m)}$$

The numerator multiplies the proportion of each Input Variable within Frauds by the proportion of Frauds (Output) within Total Records; the denominator adds the corresponding product for Truthfuls. In words: P(Fraud | Prior Charges & Small Firm) = proportion of “Prior Charges,” “Small,” & “Fraud” to Total Records, divided by {same} + proportion of “Prior Charges,” “Small,” & “Truthful” to Total Records.
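
The “naïve” simplification that slide 13 applies (standard naïve Bayes, stated explicitly here) replaces the joint conditional probability in that formula with a product of single-variable proportions within each class:

```latex
P(x_1, \ldots, x_p \mid C_i) \approx P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_p \mid C_i)
```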

  12. Prior [The training table again, now annotated with the class totals:] 4 Fraudulent Firms and 6 Truthful Firms (out of 10 total records)

  13. Modified Naïve Bayes Calculations Same goal as before. Compute 2 quantities: Proportion of “prior charges = y” among frauds, times proportion of “small” among frauds, times proportion of frauds = 3/4 * 1/4 * 4/10 = 0.075 Proportion of “prior charges = y” among truthfuls, times proportion of “small” among truthfuls, times proportion of truthfuls = 1/6 * 4/6 * 6/10 = 0.067 P(fraud | prior charges, small) = 0.075 / (0.075 + 0.067) = 0.53
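
The same arithmetic, checked with exact fractions (and, under the reconstructed dataset, by the slide-7 sketch):

```python
from fractions import Fraction as F

p_fraud    = F(3, 4) * F(1, 4) * F(4, 10)  # = 3/40  = 0.075
p_truthful = F(1, 6) * F(4, 6) * F(6, 10)  # = 4/60 ~= 0.067
print(float(p_fraud / (p_fraud + p_truthful)))  # 0.5294... -> 0.53

# Equivalently, with the sketch from slide 7 and the reconstructed firms list:
# naive_bayes_probs(firms, {"charges": "y", "size": "small"})
# -> {'fraud': 0.529..., 'truthful': 0.470...}
```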

  14. Modified Naïve Bayes, cont. • Note that the probability estimate (0.53) does not differ greatly from the exact result of 0.50 • But ALL records are used in the calculation, not just those matching the predictor values • This makes the calculation practical in most circumstances • Relies on the assumption of independence between predictor variables within each class

  15. Independence Assumption • Not strictly justified (variables often correlated with one another) • Often “good enough”

  16. Advantages • Handles purely categorical data well • Works well with very large data sets • Simple & computationally efficient

  17. Shortcomings • Requires a large number of records • Problematic when a predictor category is not present in the training data: assigns 0 probability to the response, ignoring the information in the other variables (illustrated below)
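
A hypothetical illustration of that failure mode: if no training fraud happened to be a small firm, the product zeroes out no matter how strong the other predictors are (the numbers below are invented for the illustration):

```python
p_charges_given_fraud = 3 / 4   # strong signal for fraud
p_small_given_fraud   = 0 / 4   # "small" never appears among the training frauds
p_fraud_prior         = 4 / 10

score = p_fraud_prior * p_charges_given_fraud * p_small_given_fraud
print(score)  # 0.0 -- fraud is ruled out, ignoring the "prior charges" signal
```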

  18. On the other hand… • Probability rankings are more accurate than the exact probability estimates: good for applications using lift (e.g., response to a mailing), less so for applications requiring exact probabilities (e.g., credit scoring)

  19. Summary • No statistical models involved • Naïve Bayes (like k-NN) pays attention to complex interactions and local structure • Computational challenges remain
