
Naïve Bayes and Logistic Regression & Classification


Presentation Transcript


  1. Naïve Bayes and Logistic Regression & Classification
  B. Ramamurthy

  2. Outline
  • Review last class's methods: we will answer Jim's question about K-means
  • More details on K-means; issues with K-means
  • Partitioning Around Medoids (PAM): we will work through the data science process using data from the World Bank
  • Supervised machine learning approaches: Naïve Bayes and logistic regression, discussed with several applications
  • We will review at a high level the concept of "classification" and its relevance to business intelligence
  • We will also introduce the Shiny package of R for building web applications
  • I will also provide some data strategy recommendations as we move along
  • This is the most code-intensive session of all; we are done with R by the end of this session

  3. K-means: Issues
  • A popular clustering method that groups data using a distance measure.
  • Clusters form around centers that are the means of the clusters; a center need not be a data point!
  • As you observed, the clusters may not be identical between runs of K-means: the analysis starts with k randomly chosen centroids, so a different solution can be obtained each time the function is invoked. Use the set.seed() function to guarantee that the results are reproducible.
  • K-means can also be sensitive to the initial selection of centroids. The kmeans() function has an nstart option that attempts multiple initial configurations and reports the best one. For example, nstart=25 will generate 25 initial configurations. This approach is often recommended.

  4. K-means Solution
set.seed(100)   # fix the random starting centroids so results are reproducible
library(fpc)    # provides plotcluster()
age <- c(23, 25, 24, 23, 21, 31, 32, 30, 31, 30,
         37, 35, 38, 37, 39, 42, 43, 45, 43, 45)
clust <- kmeans(age, centers = 3, nstart = 25)  # 25 random starts; the best is kept
plotcluster(age, clust$cluster)
clust
# With set.seed() and nstart you should see the same cluster centers and clusters on every run

  5. Categorical data
  • K-means clustering does not work with categorical data.
  • Example: cluster the countries of the world into categories decided by many attributes. This World Bank data contains numerical information as well as categorical data such as income levels, regions, etc.
  • We will work through a complete example [1]. The outcome of this exercise is countries clustered into 12 clusters, decided by the combination of various economic indicators.
  • Observe and study the clusters for different years: 2013, 2011.
  • On to the PAM algorithm details.

  6. PAM [4]
  1. Initialize: randomly select (without replacement) k of the n data points as the medoids.
  2. Associate each data point with the closest medoid. ("Closest" is defined using any valid distance metric, most commonly Euclidean, Manhattan or Minkowski distance.)
  3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration.
  4. Select the configuration with the lowest cost.
  5. Repeat steps 2 to 4 until there is no change in the medoids.
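A minimal sketch of PAM in R, using the cluster package's pam() function on the toy age data from the K-means slide (the call is standard; the data is just illustrative):

library(cluster)  # provides pam()
age <- c(23, 25, 24, 23, 21, 31, 32, 30, 31, 30,
         37, 35, 38, 37, 39, 42, 43, 45, 43, 45)
fit <- pam(age, k = 3)   # k = 3 medoids
fit$medoids              # unlike k-means centers, each medoid is an actual data point
fit$clustering           # cluster assignment for each observation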

  7. Exercise 1: World Bank Data
  • For this exercise we will work with World Bank data already available in the WDI package.
  • This data has been downloaded and is available in the data folder of today's zip file. (A sketch of fetching such data with WDI follows below.)
  • Our goal is to make 12 clusters out of the countries based on factors such as income level, lending level, region, etc.
  • We will spend a lot of time cleaning up and filtering the data before we do a one-liner PAM clustering.
  • Data strategy: designate a team member as data wrangler who will "tame" the data into a form that can be processed easily.
  • Clustering results in a model; a simple plot of the clusters is too complex to read or to use for visual communication/discussion.
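The slides ship the data pre-downloaded; for reference, a hedged sketch of how the WDI package can fetch World Bank indicators directly (the indicator code shown is just one example):

library(WDI)   # World Development Indicators interface to the World Bank API
# Example pull: GDP per capita (current US$) for all countries, 2011-2013
wb <- WDI(country = "all", indicator = "NY.GDP.PCAP.CD", start = 2011, end = 2013)
head(wb)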

  8. World Bank Data
  • Since we are dealing with country information, we can map the clusters on a world map.
  • We get the world map from another World Bank data source. The files are included in your zip folder.
  • We will build a data frame from the clusters and plot it with ggplot on a world map for a quick, engaging display of the clusters (see the sketch below).
  • ggplot2 is a highly useful and popular plotting/graphing package.
  • Once the R script is developed for one year (say 2011), you can reuse it for any other year just by changing the year parameter.
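A minimal sketch of the map step, assuming a hypothetical data frame cluster_df with a country-name column region and a cluster column (the slides use map files from the zip folder; map_data("world") is used here instead so the sketch is self-contained):

library(ggplot2)
library(maps)    # supplies the "world" polygons used by map_data()

world  <- map_data("world")
merged <- merge(world, cluster_df, by = "region", all.x = TRUE)
merged <- merged[order(merged$order), ]   # restore polygon drawing order after merge

ggplot(merged, aes(x = long, y = lat, group = group, fill = factor(cluster))) +
  geom_polygon(colour = "grey70") +
  labs(fill = "Cluster", title = "Countries clustered by World Bank indicators")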

  9. Clustering Countries by World Bank Data
  [Maps: countries colored by cluster for 2011 and 2013]

  10. More on Classification
  Classification is placing things where they belong, in order to discover patterns: like-minded people, customers with similar tastes, and so on. Classification relies on a priori reference structures that divide the space of all possible data points into a set of non-overlapping classes.

  11. Classification examples in daily life
  • Restaurant menu: appetizers, salads, soups, entrées, desserts, drinks, ...
  • The Library of Congress system classifies books according to a standard scheme.
  • Classification of injuries and diseases by physicians and healthcare workers.
  • Classification of all living things: e.g., Homo sapiens (genus, species).
  • Classification of products by UPC code or some such attribute.

  12. Categories of classification algorithms
  With respect to the underlying technique, there are two broad categories:
  • Statistical algorithms
    • Regression, for forecasting
    • Bayes classifiers, which capture the dependency of the various attributes of the classification problem
  • Structural algorithms
    • Rule-based algorithms: if-else rules, decision trees
    • Distance-based algorithms: similarity, nearest neighbor
    • Neural networks

  13. Classifiers

  14. Life Cycle of a Classifier: Training, Testing and Production

  15. Training Stage
  Provide the classifier with data points for which we have already assigned an appropriate class. The purpose of this stage is to determine the parameters of the model.

  16. Validation Stage
  In the testing or validation stage we validate the classifier to ensure the credibility of its results. The primary goal of this stage is to determine the classification errors. The quality of the results should be evaluated using various metrics. The training and testing stages may be repeated several times before a classifier transitions to the production stage.

  17. Production Stage
  The classifier(s) is used here in a live production system. It is possible to enhance the production results by allowing human-in-the-loop feedback. The three stages are repeated as we get more data from the production system. Data strategy: configuring these three stages should be the responsibility of a team member who is a domain expert, knowledgeable about the data being classified, the classes needed, etc.

  18. Advantages and Disadvantages
  Distance-based methods work well in low-dimensionality spaces: one or two features. How about classifying a data set with a large number of features? Chapter 4 discusses two methods for that: Naïve Bayes and logistic regression.

  19. Naïve Bayes
  The Naïve Bayes classifier is one of the most celebrated and well-known classification algorithms of all time. It is a probabilistic algorithm. It is typically applied under the assumption of independent attributes, and it works well there, but it has also been found to work well even with some dependencies. It handles multiple features (think of these as the columns of a relational table).

  20. Overview [2]
  • Two classes: binary classification.
  • Our goal is to learn how to correctly classify into two types of classes: yes or no, 0 or 1, will click or not click, will buy this product or not, recommend or not, good or bad.
  • First step: devise a model of the function f. We use Naïve Bayes or logistic regression.
  • Given this model f, classify data into the two classes {0, 1}. A typical application of the method: if f(x) = p and p > 0.5, assign class 1, else class 0.

  21. Bayesian Inference
  Intuition: H = hypothesis, E = evidence.
  P(H|E) = P(E|H) · P(H) / P(E)
  That is, the posterior probability P(H|E) is proportional to the likelihood P(E|H) times the prior P(H). This can be extended to multiple features.

  22. Example 1 for Naïve Bayes
  • A rare disease with 1% probability (the prior).
  • We have a highly sensitive and specific test that is
    • 99% positive for sick patients
    • 99% negative for non-sick patients
  • If a patient tests positive, what is the probability that he/she is sick?
  • Approach: "sick" = patient is sick, "+" = tests positive.
  • P(sick|+) = P(+|sick) P(sick) / P(+) = (0.99 × 0.01) / (0.99 × 0.01 + 0.01 × 0.99) = 1/2 = 0.5
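The same arithmetic as a few lines of R; this sketch just restates the numbers above:

# Bayes' rule for the rare-disease example
p_sick     <- 0.01                 # prior: P(sick)
p_pos_sick <- 0.99                 # sensitivity: P(+ | sick)
p_pos_well <- 1 - 0.99             # false-positive rate: P(+ | not sick)
p_pos <- p_pos_sick * p_sick + p_pos_well * (1 - p_sick)   # total probability of testing +
p_pos_sick * p_sick / p_pos        # posterior P(sick | +) = 0.5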

  23. Example 2 for NB: a very popular and common use
  • Enron emails: 1500 spam emails, 3672 good emails.
  • For a given word, "meeting", it is known that spam emails contain the word 16 times and good emails contain it 153 times.
  • "Learn" from this by applying Bayes' rule. Now you get an email with the word "meeting". What is the probability that this email is spam? When do you classify it as spam?

  24. Classification
  • Review:
    • Training set → design a model
    • Test set → validate the model
    • Classify the data set using the model
  • Goal of classification: label the items in the set with one of the given/known classes.
  • For spam filtering it is a binary class: spam or not spam (good).

  25. Why not use the methods in Chapter 3?
  • Linear regression is about continuous variables, not binary classification.
  • K-means/PAM are for clustering, where there is no prior information about classes.
  • K-NN cannot accommodate many features: the curse of dimensionality (K-NN performs well for few dimensions). For spam classification: 1 distinct word → 1 feature, so 10,000 words → 10,000 features!
  • What are we going to use? Naïve Bayes.

  26. Spam Filter for Individual Words
  Classifying mail into spam and not spam is binary classification. Let's say we get a mail saying "you have won a lottery": right away you know it is spam. We will assume that if a word qualifies as spam, then the email is spam.

  27. Further Discussion
  Let's call good emails "good".
  P(good) = 1 − P(spam)
  P(word) = P(word|spam) P(spam) + P(word|good) P(good)
  P(spam|word) = P(word|spam) P(spam) / P(word)

  28. Sample Data
  • Enron data: https://www.cs.cmu.edu/~enron
  • Enron employee emails; a small subset chosen for EDA: 1500 spam, 3672 ham.
  • The test word is "meeting": your goal is to label an email containing the word "meeting" as spam or good (not spam).
  • What is your intuition? Now prove it using Bayes.

  29. Calculations
  P(spam) = 1500/(1500 + 3672) = 0.29
  P(ham) = 0.71
  P(meeting|spam) = 16/1500 = 0.0106
  P(meeting|ham) = 153/3672 = 0.0416
  P(meeting) = P(meeting|spam) P(spam) + P(meeting|ham) P(ham) = 0.0106 × 0.29 + 0.0416 × 0.71 = 0.03261
  P(spam|meeting) = P(meeting|spam) P(spam) / P(meeting) = 0.0106 × 0.29 / 0.03261 = 0.094 → 9.4%
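The same calculation in R; this sketch mirrors the numbers above:

# Naïve Bayes for the Enron "meeting" example
n_spam <- 1500; n_ham <- 3672
p_spam <- n_spam / (n_spam + n_ham)            # 0.29
p_ham  <- 1 - p_spam                           # 0.71
p_word_spam <- 16 / n_spam                     # P(meeting | spam) = 0.0106
p_word_ham  <- 153 / n_ham                     # P(meeting | ham)  = 0.0416
p_word <- p_word_spam * p_spam + p_word_ham * p_ham
p_word_spam * p_spam / p_word                  # P(spam | meeting) ≈ 0.094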

  30. Discussion
  • On to the demo in R: see the lab handout, Exercise 2.
  • Bayesian analysis determined that an email with the word "meeting" is spam with 9.4% probability.
  • Your data strategy? What is the threshold to qualify as spam?
    • At UB it is 50%.
    • Your strategy could be 60% (a little relaxed), or a somewhat stringent 40%: that is, when an email classifies as spam with 40% probability, you intercept it and throw it into the spam folder.
  • The single-word approach can be easily transformed into a multi-word, phrase-based Bayesian classification.

  31. UB Strategy
  • Quarantined spam:
    • Incoming email messages that are 50% - 98% likely to be spam are held in your Blocked Messages folder for 28 days before the spam quarantine service automatically deletes them.
    • Quarantined spam is not delivered to your mailbox, so it does not count toward your quota.
  • Discarded spam:
    • Incoming email messages that are 99% - 100% likely to be spam are automatically deleted.
    • Outgoing email messages (either generated at UB or forwarded through UB) that are 80% - 99% likely to be spam are automatically deleted before reaching their destinations.

  32. Exercise 3: Predicting the Behavior of Our Congressional Representatives
  • We studied:
    • The Naïve Bayes rule
    • Its application to spam filtering in emails
  • Work through and understand the examples discussed in class: the disease one, the spam filter.
  • Now let's look at an example using data on congressional votes on several issues (data from 1984, but nothing has changed!).
  • The model we develop could be applied to any data that conforms to this template.
  • Once again, this data is readily available in a package called mlbench (machine learning benchmarks).

  33. Predicting Behavior Using Naïve Bayes
  • We use an existing record of congressional votes to build a model. The goal is to label a voting record as belonging to a Democrat or a Republican. Of course, we need to clean up/reframe the data first.
  • Then we compare the predicted classification (two classes: Democrat or Republican) to the actual class.
  • Next we take sample synthetic data containing a secret ballot and guess the voter's class. Is it still a secret ballot, when machines can learn who you are?
  • We also plot (histogram) the voting record between two arbitrary issues, V10 and V11 (missiles, immigration), with actual classes and predicted classes (n:n, y:n, n:y, y:y).
  • We can also do other complex data analytics: understand how they voted.
  • On to the demo/exercise (a minimal sketch follows below).
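A minimal sketch of the core of this exercise, assuming the e1071 package's naiveBayes() is the classifier used (the lab handout has the full version):

library(e1071)    # provides naiveBayes()
library(mlbench)  # provides the HouseVotes84 data set

data(HouseVotes84)
# Train on the votes; Class is the party label (democrat/republican)
model <- naiveBayes(Class ~ ., data = HouseVotes84)

# Compare predicted party with the actual party
pred <- predict(model, HouseVotes84)
table(predicted = pred, actual = HouseVotes84$Class)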

  34. Logistic Regression
  • What is it?
    • An approach for calculating the odds of an event happening vs. other possibilities. The odds ratio is an important concept; we will discuss it with examples.
  • Why are we studying it?
    • To use it for classification.
    • It is a discriminative classification scheme vs. Naïve Bayes' generative classification scheme (what is this?).
    • Linear regression handles continuous outcomes; logistic regression handles categorical ones: the logit function bridges this gap.
    • According to experts [3], logistic regression classification has better error rates than Naïve Bayes in certain situations (e.g., large data sets; in the context of big data?).

  35. Logistic Regression
  Predict:
  • whether a patient has a given disease (we did this using Bayes): binary classification using a variety of data such as age, gender, BMI, blood tests, etc.
  • whether a person will vote Democratic or Republican
  • the odds of a failure (or success) of a process, system or product
  • a customer's propensity to purchase a product: they bought products {A, X, Y}, did not buy {B, C}; will they buy D, yes or no?
  • the odds of a person staying in the workforce
  • the odds of a homeowner defaulting on a loan

  36. Basics
  • The basic function is the logit → logistic regression.
  • Definition: logit(p) = log(p / (1 − p)) = log(p) − log(1 − p)
  • The logit function takes x values in the range [0, 1] and transforms them to y values along the entire real line.
  • The inverse logit does the reverse: it takes an x value along the real line and transforms it into the range [0, 1].
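A quick sketch of the two functions in R, just to make the definitions concrete:

logit     <- function(p) log(p / (1 - p))   # (0,1) -> whole real line
inv_logit <- function(x) 1 / (1 + exp(-x))  # real line -> (0,1)

logit(0.5)             # 0
inv_logit(0)           # 0.5
inv_logit(logit(0.9))  # 0.9: the two functions are inverses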

  37. Demo in R
  • Do an exploration (EDA) of the data; observe whether the outcome follows a sigmoid (S-shaped) curve.
  • Fit the logistic regression model; use the fit/plot to classify.
  • Exercise 4: We have collected data about brand recognition. Our sample of subjects is in the age group 19-30, and they answer "yes" or "no" to a question (somewhat like a soda taste test).
  • The data gives <age, # of yes answers, # of subjects of that age>: one set of data, R1, from before a marketing campaign, and another set, R2, from after the campaign (pre and post).
  • You will see repeated entries for an age, since this data was collected from several places.
  • We get two regression curves. Which one is better? What is your interpretation?
  • This is for a small data set of 25; how about big data? The model can be replicated for big data too.

  38. Plot: Pre and Post
  [Figure: logistic regression curves for the pre- and post-campaign data]

  39. R Code
data1 <- read.csv(file.choose(), header = TRUE)   # pick the brand-recognition CSV
summary(data1)
head(data1)
# Successes/failures form: R2 "yes" answers out of Total subjects at each Age
glm.out <- glm(cbind(R2, Total - R2) ~ Age, family = binomial(logit), data = data1)
plot(R2 / Total ~ Age, data = data1)
lines(data1$Age, fitted(glm.out), col = "red")    # overlay the fitted curve
title(main = "Brand Recognition Data: Logistic Regression Line")
grid(nx = NULL, ny = NULL)
summary(glm.out)

  40. Understanding Probability vs. Odds
  [Figure: probability p compared with odds, odds = p / (1 − p)]

  41. Odds Ratio: Example from a 4/16/2014 News Article
  • "Woods is still favored to win the U.S. Open. He and Rory McIlroy are each 10/1 favorites on online betting site Bovada. Adam Scott has the next best odds at 12/1..."
  • How to interpret this? Odds of a/b against mean P(win) = b / (a + b):
    • 10/1 → P(win) = 1/11 ≈ 0.091
    • 12/1 → P(win) = 1/13 ≈ 0.077
  • "Woods is also the favorite to win the Open Championship at Hoylake in July. He's 7/1 there." → P(win) = 1/8 = 0.125
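The conversion as a one-line R helper (a sketch; "a/b" is read as odds against):

odds_to_prob <- function(a, b) b / (a + b)   # "a/b" odds against -> win probability

odds_to_prob(10, 1)  # 10/1 -> 1/11 ~ 0.091
odds_to_prob(12, 1)  # 12/1 -> 1/13 ~ 0.077
odds_to_prob(7, 1)   # 7/1  -> 1/8  = 0.125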

  42. Multiple Features
  Here we are interested in finding out whether a user will click or not click: predict based on training data. Fit the model with
fit1 <- glm(click ~ url1 + url2 + url3 + url4 + url5,
            data = train, family = binomial(logit))
  (Note that data takes the data frame train itself, not the quoted string "train".) The model gives you a probability that can then be used to predict/classify, as sketched below.
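A hedged sketch of the classification step that follows, assuming hypothetical data frames train (used to fit fit1 above) and test with the same url1..url5 columns:

# `test` is a hypothetical held-out data frame with the url1..url5 columns
p <- predict(fit1, newdata = test, type = "response")  # predicted click probabilities
predicted_click <- ifelse(p > 0.5, 1, 0)               # threshold at 0.5 to classify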

  43. How to Select Your Classifier
  • If you have a continuous outcome variable, use linear regression.
  • If you have discrete values for the outcome (yes/no data), you may use logistic regression.
  • If you don't have much information about the classes in your data, use clustering: K-means for numeric data, PAM for categorical data.
  • If you have information about the classes (training and test data), use K-NN for one or two features and Bayesian classification for many features.

  44. The Shiny Package of R
  • The Shiny package allows you to develop web applications using R script.
  • A Shiny-based web application has two major components: ui.R and server.R.
  • The UI specifies the user input layout, the components, and the variables that transport data between the UI and the server.
  • The server takes the variable values from the UI, uses R's capabilities (packages, commands) to compute the results, and displays the results on the UI.
  • See http://shiny.rstudio.com/ for some amazing examples. (A minimal sketch follows below.)
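A minimal single-file sketch of a Shiny app that re-runs the K-means example from earlier; the same logic can be split into ui.R and server.R as the slide describes:

library(shiny)

# UI: one numeric input (k) and one text output
ui <- fluidPage(
  numericInput("k", "Number of clusters:", value = 3, min = 1, max = 10),
  verbatimTextOutput("centers")
)

# Server: recompute k-means whenever the input changes
server <- function(input, output) {
  age <- c(23, 25, 24, 23, 21, 31, 32, 30, 31, 30,
           37, 35, 38, 37, 39, 42, 43, 45, 43, 45)
  output$centers <- renderPrint({
    kmeans(age, centers = input$k, nstart = 25)$centers
  })
}

shinyApp(ui = ui, server = server)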

  45. Summary
  • We studied clustering categorical data with PAM.
    • Customer segmentation for targeted marketing
  • We discussed the acclaimed Naïve Bayes classifier and its application to classification.
    • A robust classification approach
    • Predicting diseases, text classification (good/bad chatter), sentiment analysis
  • We also discussed logistic regression.
    • Recommendation systems
    • What factors influence the sale of a product
  • Data strategy: identify the major functions associated with data analytics and match each to a team member.

  46. References
  [1] J. P. Lander. R for Everyone: Advanced Analytics and Graphics. Addison-Wesley, 2014.
  [2] M. Hauskrecht. Supervised Learning. CS2710, University of Pittsburgh, 2014.
  [3] A. Ng and M. Jordan. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naïve Bayes. NIPS, 2001.
  [4] S. Theodoridis and K. Koutroumbas. Pattern Recognition, 3rd ed., p. 635, 2006.
