
Naïve Bayes and Logistic Regression & Classification


Presentation Transcript


  1. Naïve Bayes and Logistic Regression & Classification
  B. Ramamurthy

  2. Outline
  • Review last class's methods: we will answer Jim's question about K-means
  • More details on K-means; issues with K-means
  • Partitioning Around Medoids (PAM): we will work through the data science process using data from the World Bank
  • Supervised machine learning approaches: Naïve Bayes and logistic regression, discussed with several applications
  • We will review at a high level the concept of "classification" and its relevance to business intelligence
  • We will also introduce the Shiny package of R for building web applications
  • I will also provide some data strategy recommendations as we move along
  • This is the most code-intensive session of all; we are done with R by the end of this session

  3. K-means: Issues
  • A popular clustering method that groups data using a distance measure.
  • Clusters form around centers that are the means of the clusters; a center need not be a data point!
  • As you observed, the clusters may not be identical between runs of K-means: the analysis starts with k randomly chosen centroids, so a different solution can be obtained each time the function is invoked. Use the set.seed() function to guarantee that the results are reproducible.
  • K-means can also be sensitive to the initial selection of centroids. The kmeans() function has an nstart option that attempts multiple initial configurations and reports the best one. For example, nstart=25 will generate 25 initial configurations. This approach is often recommended.

  4. K-means Solution
set.seed(100)   # fix the random starting centroids so results are reproducible
library(fpc)    # provides plotcluster()
age <- c(23, 25, 24, 23, 21, 31, 32, 30, 31, 30,
         37, 35, 38, 37, 39, 42, 43, 45, 43, 45)
clust <- kmeans(age, centers = 3, nstart = 25)  # 25 random starts; the best is kept
plotcluster(age, clust$cluster)
clust
# With set.seed() and nstart you should see the same cluster centers and clusters on every run

  5. Categorical data
  • K-means clustering does not work with categorical data.
  • Example: cluster the countries of the world into categories decided by many attributes. This World Bank data contains numerical information as well as categorical data such as income levels, regions, etc.
  • We will work through a complete example [1]. The outcome of this exercise is countries clustered into 12 clusters, decided by the combination of various economic indicators.
  • Observe and study the clusters for different years: 2013, 2011.
  • On to the PAM algorithm details.

  6. PAM [4]
  1. Initialize: randomly select (without replacement) k of the n data points as the medoids.
  2. Associate each data point with the closest medoid. ("Closest" is defined using any valid distance metric, most commonly Euclidean, Manhattan or Minkowski distance.)
  3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration.
  4. Select the configuration with the lowest cost.
  5. Repeat steps 2 to 4 until there is no change in the medoids.
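A minimal sketch of PAM in R, using the cluster package's pam() function on the toy age data from the K-means slide (the call is standard; the data is just illustrative):

library(cluster)  # provides pam()
age <- c(23, 25, 24, 23, 21, 31, 32, 30, 31, 30,
         37, 35, 38, 37, 39, 42, 43, 45, 43, 45)
fit <- pam(age, k = 3)   # k = 3 medoids
fit$medoids              # unlike k-means centers, each medoid is an actual data point
fit$clustering           # cluster assignment for each observation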

  7. Exercise 1: World Bank Data
  • For this exercise we will work with World Bank data already available in the WDI package.
  • This data has been downloaded and is available in the data folder of today's zip file. (A sketch of fetching such data with WDI follows below.)
  • Our goal is to make 12 clusters out of the countries based on factors such as income level, lending level, region, etc.
  • We will spend a lot of time cleaning up and filtering the data before we do a one-liner PAM clustering.
  • Data strategy: designate a team member as data wrangler who will "tame" the data into a form that can be processed easily.
  • Clustering results in a model; a simple plot of the clusters is too complex to read or to use for visual communication/discussion.
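The slides ship the data pre-downloaded; for reference, a hedged sketch of how the WDI package can fetch World Bank indicators directly (the indicator code shown is just one example):

library(WDI)   # World Development Indicators interface to the World Bank API
# Example pull: GDP per capita (current US$) for all countries, 2011-2013
wb <- WDI(country = "all", indicator = "NY.GDP.PCAP.CD", start = 2011, end = 2013)
head(wb)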

  8. World Bank Data
  • Since we are dealing with country information, we can map the clusters on a world map.
  • We get the world map from another World Bank data source. The files are included in your zip folder.
  • We will build a data frame from the clusters and plot it with ggplot on a world map for a quick, engaging display of the clusters (see the sketch below).
  • ggplot2 is a highly useful and popular plotting/graphing package.
  • Once the R script is developed for one year (say 2011), you can reuse it for any other year just by changing the year parameter.
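A minimal sketch of the map step, assuming a hypothetical data frame cluster_df with a country-name column region and a cluster column (the slides use map files from the zip folder; map_data("world") is used here instead so the sketch is self-contained):

library(ggplot2)
library(maps)    # supplies the "world" polygons used by map_data()

world  <- map_data("world")
merged <- merge(world, cluster_df, by = "region", all.x = TRUE)
merged <- merged[order(merged$order), ]   # restore polygon drawing order after merge

ggplot(merged, aes(x = long, y = lat, group = group, fill = factor(cluster))) +
  geom_polygon(colour = "grey70") +
  labs(fill = "Cluster", title = "Countries clustered by World Bank indicators")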

  9. Clustering Countries by World Bank Data
  [Maps: countries colored by cluster for 2011 and 2013]

  10. More on Classification
  Classification is placing things where they belong, in order to discover patterns: like-minded people, customers with similar tastes, and so on. Classification relies on a priori reference structures that divide the space of all possible data points into a set of non-overlapping classes.

  11. Classification examples in daily life
  • Restaurant menu: appetizers, salads, soups, entrées, desserts, drinks, ...
  • The Library of Congress system classifies books according to a standard scheme.
  • Classification of injuries and diseases by physicians and healthcare workers.
  • Classification of all living things: e.g., Homo sapiens (genus, species).
  • Classification of products by UPC code or some such attribute.

  12. Categories of classification algorithms
  With respect to the underlying technique, there are two broad categories:
  • Statistical algorithms
    • Regression, for forecasting
    • Bayes classifiers, which capture the dependency of the various attributes of the classification problem
  • Structural algorithms
    • Rule-based algorithms: if-else rules, decision trees
    • Distance-based algorithms: similarity, nearest neighbor
    • Neural networks

  13. Classifiers

  14. Life Cycle of a Classifier: Training, Testing and Production

  15. Training Stage
  Provide the classifier with data points for which we have already assigned an appropriate class. The purpose of this stage is to determine the parameters of the model.

  16. Validation Stage
  In the testing or validation stage we validate the classifier to ensure the credibility of its results. The primary goal of this stage is to determine the classification errors. The quality of the results should be evaluated using various metrics. The training and testing stages may be repeated several times before a classifier transitions to the production stage.

  17. Production Stage
  The classifier(s) is used here in a live production system. It is possible to enhance the production results by allowing human-in-the-loop feedback. The three stages are repeated as we get more data from the production system. Data strategy: configuring these three stages should be the responsibility of a team member who is a domain expert, knowledgeable about the data being classified, the classes needed, etc.

  18. Advantages and Disadvantages
  Distance-based methods work well in low-dimensionality spaces: one or two features. How about classifying a data set with a large number of features? Chapter 4 discusses two methods for that: Naïve Bayes and logistic regression.

  19. Naïve Bayes
  The Naïve Bayes classifier is one of the most celebrated and well-known classification algorithms of all time. It is a probabilistic algorithm. It is typically applied under the assumption of independent attributes, and it works well there, but it has also been found to work well even with some dependencies. It handles multiple features (think of these as the columns of a relational table).

  20. Overview [2]
  • Two classes: binary classification.
  • Our goal is to learn how to correctly classify into two types of classes: yes or no, 0 or 1, will click or not click, will buy this product or not, recommend or not, good or bad.
  • First step: devise a model of the function f. We use Naïve Bayes or logistic regression.
  • Given this model f, classify data into the two classes {0, 1}. A typical application of the method: if f(x) = p and p > 0.5, assign class 1, else class 0.

  21. Bayesian Inference
  Intuition: H = hypothesis, E = evidence.
  P(H|E) = P(E|H) · P(H) / P(E)
  That is, the posterior probability P(H|E) is proportional to the likelihood P(E|H) times the prior P(H). This can be extended to multiple features.

  22. Example 1 for Naïve Bayes
  • A rare disease with 1% probability (the prior).
  • We have a highly sensitive and specific test that is
    • 99% positive for sick patients
    • 99% negative for non-sick patients
  • If a patient tests positive, what is the probability that he/she is sick?
  • Approach: "sick" = patient is sick, "+" = tests positive.
  • P(sick|+) = P(+|sick) P(sick) / P(+) = (0.99 × 0.01) / (0.99 × 0.01 + 0.01 × 0.99) = 1/2 = 0.5
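The same arithmetic as a few lines of R; this sketch just restates the numbers above:

# Bayes' rule for the rare-disease example
p_sick     <- 0.01                 # prior: P(sick)
p_pos_sick <- 0.99                 # sensitivity: P(+ | sick)
p_pos_well <- 1 - 0.99             # false-positive rate: P(+ | not sick)
p_pos <- p_pos_sick * p_sick + p_pos_well * (1 - p_sick)   # total probability of testing +
p_pos_sick * p_sick / p_pos        # posterior P(sick | +) = 0.5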

  23. Example 2 for NB: a very popular and common use
  • Enron emails: 1500 spam emails, 3672 good emails.
  • For a given word, "meeting", it is known that spam emails contain the word 16 times and good emails contain it 153 times.
  • "Learn" from this by applying Bayes' rule. Now you get an email with the word "meeting". What is the probability that this email is spam? When do you classify it as spam?

  24. Classification
  • Review:
    • Training set → design a model
    • Test set → validate the model
    • Classify the data set using the model
  • Goal of classification: label the items in the set with one of the given/known classes.
  • For spam filtering it is a binary class: spam or not spam (good).

  25. Why not use the methods in Chapter 3?
  • Linear regression is about continuous variables, not binary classification.
  • K-means/PAM are for clustering, where there is no prior information about classes.
  • K-NN cannot accommodate many features: the curse of dimensionality (K-NN performs well for few dimensions). For spam classification: 1 distinct word → 1 feature, so 10,000 words → 10,000 features!
  • What are we going to use? Naïve Bayes.

  26. Spam Filter for Individual Words
  Classifying mail into spam and not spam is binary classification. Let's say we get a mail saying "you have won a lottery": right away you know it is spam. We will assume that if a word qualifies as spam, then the email is spam.

  27. Further Discussion
  Let's call good emails "good".
  P(good) = 1 − P(spam)
  P(word) = P(word|spam) P(spam) + P(word|good) P(good)
  P(spam|word) = P(word|spam) P(spam) / P(word)

  28. Sample Data
  • Enron data: https://www.cs.cmu.edu/~enron
  • Enron employee emails; a small subset chosen for EDA: 1500 spam, 3672 ham.
  • The test word is "meeting": your goal is to label an email containing the word "meeting" as spam or good (not spam).
  • What is your intuition? Now prove it using Bayes.

  29. Calculations
  P(spam) = 1500/(1500 + 3672) = 0.29
  P(ham) = 0.71
  P(meeting|spam) = 16/1500 = 0.0106
  P(meeting|ham) = 153/3672 = 0.0416
  P(meeting) = P(meeting|spam) P(spam) + P(meeting|ham) P(ham) = 0.0106 × 0.29 + 0.0416 × 0.71 = 0.03261
  P(spam|meeting) = P(meeting|spam) P(spam) / P(meeting) = 0.0106 × 0.29 / 0.03261 = 0.094 → 9.4%
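The same calculation in R; this sketch mirrors the numbers above:

# Naïve Bayes for the Enron "meeting" example
n_spam <- 1500; n_ham <- 3672
p_spam <- n_spam / (n_spam + n_ham)            # 0.29
p_ham  <- 1 - p_spam                           # 0.71
p_word_spam <- 16 / n_spam                     # P(meeting | spam) = 0.0106
p_word_ham  <- 153 / n_ham                     # P(meeting | ham)  = 0.0416
p_word <- p_word_spam * p_spam + p_word_ham * p_ham
p_word_spam * p_spam / p_word                  # P(spam | meeting) ≈ 0.094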

  30. Discussion
  • On to the demo in R: see the lab handout, Exercise 2.
  • Bayesian analysis determined that an email with the word "meeting" is spam with 9.4% probability.
  • Your data strategy? What is the threshold to qualify as spam?
    • At UB it is 50%.
    • Your strategy could be 60% (a little relaxed), or a somewhat stringent 40%: that is, when an email classifies as spam with 40% probability, you intercept it and throw it into the spam folder.
  • The single-word approach can be easily transformed into a multi-word, phrase-based Bayesian classification.

  31. UB Strategy
  • Quarantined spam:
    • Incoming email messages that are 50% - 98% likely to be spam are held in your Blocked Messages folder for 28 days before the spam quarantine service automatically deletes them.
    • Quarantined spam is not delivered to your mailbox, so it does not count toward your quota.
  • Discarded spam:
    • Incoming email messages that are 99% - 100% likely to be spam are automatically deleted.
    • Outgoing email messages (either generated at UB or forwarded through UB) that are 80% - 99% likely to be spam are automatically deleted before reaching their destinations.

  32. Exercise 3: Predicting the Behavior of Our Congressional Representatives
  • We studied:
    • The Naïve Bayes rule
    • Its application to spam filtering in emails
  • Work through and understand the examples discussed in class: the disease one, the spam filter.
  • Now let's look at an example using data on congressional votes on several issues (data from 1984, but nothing has changed!).
  • The model we develop could be applied to any data that conforms to this template.
  • Once again, this data is readily available in a package called mlbench (machine learning benchmarks).

  33. Predicting Behavior Using Naïve Bayes
  • We use an existing record of congressional votes to build a model. The goal is to label a voting record as belonging to a Democrat or a Republican. Of course, we need to clean up/reframe the data first.
  • Then we compare the predicted classification (two classes: Democrat or Republican) to the actual class.
  • Next we take sample synthetic data containing a secret ballot and guess the voter's class. Is it still a secret ballot, when machines can learn who you are?
  • We also plot (histogram) the voting record between two arbitrary issues, V10 and V11 (missiles, immigration), with actual classes and predicted classes (n:n, y:n, n:y, y:y).
  • We can also do other complex data analytics: understand how they voted.
  • On to the demo/exercise (a minimal sketch follows below).
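A minimal sketch of the core of this exercise, assuming the e1071 package's naiveBayes() is the classifier used (the lab handout has the full version):

library(e1071)    # provides naiveBayes()
library(mlbench)  # provides the HouseVotes84 data set

data(HouseVotes84)
# Train on the votes; Class is the party label (democrat/republican)
model <- naiveBayes(Class ~ ., data = HouseVotes84)

# Compare predicted party with the actual party
pred <- predict(model, HouseVotes84)
table(predicted = pred, actual = HouseVotes84$Class)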

  34. Logistic Regression
  • What is it?
    • An approach for calculating the odds of an event happening vs. other possibilities. The odds ratio is an important concept; we will discuss it with examples.
  • Why are we studying it?
    • To use it for classification.
    • It is a discriminative classification scheme vs. Naïve Bayes' generative classification scheme (what is this?).
    • Linear regression handles continuous outcomes; logistic regression handles categorical ones: the logit function bridges this gap.
    • According to experts [3], logistic regression classification has better error rates than Naïve Bayes in certain situations (e.g., large data sets; in the context of big data?).

  35. Logistic Regression
  Predict:
  • whether a patient has a given disease (we did this using Bayes): binary classification using a variety of data such as age, gender, BMI, blood tests, etc.
  • whether a person will vote Democratic or Republican
  • the odds of a failure (or success) of a process, system or product
  • a customer's propensity to purchase a product: they bought products {A, X, Y}, did not buy {B, C}; will they buy D, yes or no?
  • the odds of a person staying in the workforce
  • the odds of a homeowner defaulting on a loan

  36. Basics
  • The basic function is the logit → logistic regression.
  • Definition: logit(p) = log(p / (1 − p)) = log(p) − log(1 − p)
  • The logit function takes x values in the range [0, 1] and transforms them to y values along the entire real line.
  • The inverse logit does the reverse: it takes an x value along the real line and transforms it into the range [0, 1].
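A quick sketch of the two functions in R, just to make the definitions concrete:

logit     <- function(p) log(p / (1 - p))   # (0,1) -> whole real line
inv_logit <- function(x) 1 / (1 + exp(-x))  # real line -> (0,1)

logit(0.5)             # 0
inv_logit(0)           # 0.5
inv_logit(logit(0.9))  # 0.9: the two functions are inverses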

  37. Demo in R
  • Do an exploration (EDA) of the data; observe whether the outcome follows a sigmoid (S-shaped) curve.
  • Fit the logistic regression model; use the fit/plot to classify.
  • Exercise 4: We have collected data about brand recognition. Our sample of subjects is in the age group 19-30, and they answer "yes" or "no" to a question (somewhat like a soda taste test).
  • The data gives <age, # of yes answers, # of subjects of that age>: one set of data, R1, from before a marketing campaign, and another set, R2, from after the campaign (pre and post).
  • You will see repeated entries for an age, since this data was collected from several places.
  • We get two regression curves. Which one is better? What is your interpretation?
  • This is for a small data set of 25; how about big data? The model can be replicated for big data too.

  38. Plot: Pre and Post
  [Figure: logistic regression curves for the pre- and post-campaign data]

  39. R Code
data1 <- read.csv(file.choose(), header = TRUE)   # pick the brand-recognition CSV
summary(data1)
head(data1)
# Successes/failures form: R2 "yes" answers out of Total subjects at each Age
glm.out <- glm(cbind(R2, Total - R2) ~ Age, family = binomial(logit), data = data1)
plot(R2 / Total ~ Age, data = data1)
lines(data1$Age, fitted(glm.out), col = "red")    # overlay the fitted curve
title(main = "Brand Recognition Data: Logistic Regression Line")
grid(nx = NULL, ny = NULL)
summary(glm.out)

  40. Understanding Probability vs. Odds
  [Figure: probability p compared with odds, odds = p / (1 − p)]

  41. Odds Ratio: Example from a 4/16/2014 News Article
  • "Woods is still favored to win the U.S. Open. He and Rory McIlroy are each 10/1 favorites on online betting site Bovada. Adam Scott has the next best odds at 12/1..."
  • How to interpret this? Odds of a/b against mean P(win) = b / (a + b):
    • 10/1 → P(win) = 1/11 ≈ 0.091
    • 12/1 → P(win) = 1/13 ≈ 0.077
  • "Woods is also the favorite to win the Open Championship at Hoylake in July. He's 7/1 there." → P(win) = 1/8 = 0.125
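The conversion as a one-line R helper (a sketch; "a/b" is read as odds against):

odds_to_prob <- function(a, b) b / (a + b)   # "a/b" odds against -> win probability

odds_to_prob(10, 1)  # 10/1 -> 1/11 ~ 0.091
odds_to_prob(12, 1)  # 12/1 -> 1/13 ~ 0.077
odds_to_prob(7, 1)   # 7/1  -> 1/8  = 0.125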

  42. Multiple Features
  Here we are interested in finding out whether a user will click or not click: predict based on training data. Fit the model with
fit1 <- glm(click ~ url1 + url2 + url3 + url4 + url5,
            data = train, family = binomial(logit))
  (Note that data takes the data frame train itself, not the quoted string "train".) The model gives you a probability that can then be used to predict/classify, as sketched below.
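A hedged sketch of the classification step that follows, assuming hypothetical data frames train (used to fit fit1 above) and test with the same url1..url5 columns:

# `test` is a hypothetical held-out data frame with the url1..url5 columns
p <- predict(fit1, newdata = test, type = "response")  # predicted click probabilities
predicted_click <- ifelse(p > 0.5, 1, 0)               # threshold at 0.5 to classify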

  43. How to Select Your Classifier
  • If you have a continuous outcome variable, use linear regression.
  • If you have discrete values for the outcome (yes/no data), you may use logistic regression.
  • If you don't have much information about the classes in your data, use clustering: K-means for numeric data, PAM for categorical data.
  • If you have information about the classes (training and test data), use K-NN for one or two features and Bayesian classification for many features.

  44. The Shiny Package of R
  • The Shiny package allows you to develop web applications using R script.
  • A Shiny-based web application has two major components: ui.R and server.R.
  • The UI specifies the user input layout, the components, and the variables that transport data between the UI and the server.
  • The server takes the variable values from the UI, uses R's capabilities (packages, commands) to compute the results, and displays the results on the UI.
  • See http://shiny.rstudio.com/ for some amazing examples. (A minimal sketch follows below.)
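A minimal single-file sketch of a Shiny app that re-runs the K-means example from earlier; the same logic can be split into ui.R and server.R as the slide describes:

library(shiny)

# UI: one numeric input (k) and one text output
ui <- fluidPage(
  numericInput("k", "Number of clusters:", value = 3, min = 1, max = 10),
  verbatimTextOutput("centers")
)

# Server: recompute k-means whenever the input changes
server <- function(input, output) {
  age <- c(23, 25, 24, 23, 21, 31, 32, 30, 31, 30,
           37, 35, 38, 37, 39, 42, 43, 45, 43, 45)
  output$centers <- renderPrint({
    kmeans(age, centers = input$k, nstart = 25)$centers
  })
}

shinyApp(ui = ui, server = server)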

  45. Summary
  • We studied clustering categorical data with PAM.
    • Customer segmentation for targeted marketing
  • We discussed the acclaimed Naïve Bayes classifier and its application to classification.
    • A robust classification approach
    • Predicting diseases, text classification (good/bad chatter), sentiment analysis
  • We also discussed logistic regression.
    • Recommendation systems
    • What factors influence the sale of a product
  • Data strategy: identify the major functions associated with data analytics and match each to a team member.

  46. References
  [1] J. P. Lander. R for Everyone: Advanced Analytics and Graphics. Addison-Wesley, 2014.
  [2] M. Hauskrecht. Supervised Learning. CS2710, University of Pittsburgh, 2014.
  [3] A. Ng and M. Jordan. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naïve Bayes. NIPS, 2001.
  [4] S. Theodoridis and K. Koutroumbas. Pattern Recognition, 3rd ed., p. 635, 2006.
