

  1. Machine Learning (BE Computer 2015 PAT) A.Y. 2018-19 SEM-II Prepared by Mr. Dhomse G.P.

  2. Unit-4 Naïve Bayes & SVM Syllabus • Bayes' Theorem, Naïve Bayes' Classifiers, Naïve Bayes in Scikit-learn: Bernoulli Naïve Bayes, Multinomial Naïve Bayes, and Gaussian Naïve Bayes. Support Vector Machine (SVM): Linear Support Vector Machines, Scikit-learn implementation, Linear Classification, Kernel-based classification, Non-linear Examples, Controlled Support Vector Machines, Support Vector Regression.

  3. Naive Bayes • It is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. • In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. • For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. • Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple, and that is why it is known as 'Naive'.

  4. The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods. • Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c): P(c|x) = P(x|c) · P(c) / P(x).

  5. Above, • P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes). • P(c) is the prior probability of the class. • P(x|c) is the likelihood, which is the probability of the predictor given the class. • P(x) is the prior probability of the predictor. How does the Naive Bayes algorithm work? • Let's understand it using an example. Below is a training data set of weather and the corresponding target variable 'Play' (suggesting possibilities of playing). We need to classify whether players will play or not based on the weather condition. • Let's follow the steps below to perform it.

  6. Step 1: Convert the data set into a frequency table. • Step 2: Create a likelihood table by finding the probabilities, e.g., Overcast probability = 4/14 = 0.29 and probability of playing = 9/14 = 0.64. The frequency/likelihood table (consistent with the counts used in the next step) is:

  Weather   | No | Yes | P(Weather)
  Sunny     |  2 |  3  | 5/14 = 0.36
  Overcast  |  0 |  4  | 4/14 = 0.29
  Rainy     |  3 |  2  | 5/14 = 0.36
  Total     |  5 |  9  |
  P(Play)   | 5/14 = 0.36 | 9/14 = 0.64 |

  7. Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction. Problem: Players will play if the weather is sunny. Is this statement correct? • We can solve it using the method of posterior probability discussed above. • P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny) • Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64. • Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability. • Naive Bayes uses a similar method to predict the probabilities of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.
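
  As a quick check, here is a minimal Python sketch (added here, not part of the original slides) that reproduces the P(Yes | Sunny) calculation directly from the frequency counts:

  # Counts taken from the frequency table above
  n_total = 14           # total observations
  n_yes = 9              # days on which Play = Yes
  n_sunny = 5            # days on which Weather = Sunny
  n_sunny_given_yes = 3  # sunny days among the Play = Yes days

  p_yes = n_yes / n_total                        # P(Yes) = 0.64
  p_sunny = n_sunny / n_total                    # P(Sunny) = 0.36
  p_sunny_given_yes = n_sunny_given_yes / n_yes  # P(Sunny | Yes) = 0.33

  # Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
  p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
  print(round(p_yes_given_sunny, 2))  # 0.6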

  8. Pros: • It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction. • When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data. • It performs well with categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption). Cons: • If a categorical variable has a category in the test data set which was not observed in the training data set, the model will assign it a 0 (zero) probability and will be unable to make a prediction. This is often known as the "Zero Frequency" problem. To solve it, we can use a smoothing technique; one of the simplest is Laplace estimation (see the sketch below). • On the other side, naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously. • Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors which are completely independent.
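
  A minimal sketch of Laplace (add-one) smoothing, using made-up counts purely for illustration: a category never seen in training receives a small non-zero probability instead of zero:

  # Hypothetical counts of a categorical feature within one class
  counts = {'sunny': 3, 'overcast': 4, 'rainy': 2, 'snowy': 0}  # 'snowy' never seen in training
  alpha = 1.0                  # Laplace smoothing factor (add-one)
  k = len(counts)              # number of distinct categories
  total = sum(counts.values())

  # Smoothed estimate: (count + alpha) / (total + alpha * k)
  for category, count in counts.items():
      p = (count + alpha) / (total + alpha * k)
      print(category, round(p, 3))
  # 'snowy' now receives 1/13 ≈ 0.077 instead of 0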

  9. Applications of Naive Bayes Algorithms • Real-time Prediction: Naive Bayes is an eager learning classifier and it is certainly fast. Thus, it can be used for making predictions in real time. • Multi-class Prediction: This algorithm is also well known for its multi-class prediction capability. Here we can predict the probabilities of multiple classes of the target variable. • Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers, mostly used in text classification (due to better results in multi-class problems and the independence assumption), have a higher success rate as compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments). • Recommendation Systems: A Naive Bayes classifier together with collaborative filtering builds a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

  10. Naive Bayes Classifier • The Naive Bayes classifier technique is based on Bayes' theorem and is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods.

  11. To demonstrate the concept of Naïve Bayes classification, consider an example in which objects are classified as either GREEN or RED. • Our task is to classify new cases as they arrive, i.e., decide to which class label they belong, based on the currently existing objects. • Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. • In Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually happen.

  12. Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are: P(GREEN) = 40/60 = 2/3 and P(RED) = 20/60 = 1/3.

  13. Having formulated our prior probabilities, we are now ready to classify a new object X (a WHITE circle). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely it is that the new case belongs to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we count the number of points in the circle belonging to each class label. From this we calculate the likelihoods: P(X | GREEN) = (number of GREEN in the vicinity of X) / (total number of GREEN) and P(X | RED) = (number of RED in the vicinity of X) / (total number of RED).

  14. Since the circle around X encompasses 1 GREEN object and 3 RED ones, the likelihood of X given GREEN is smaller than the likelihood of X given RED. Thus: P(X | GREEN) = 1/40 and P(X | RED) = 3/20.

  15. Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN objects as RED), the likelihood indicates otherwise: that the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). • In Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form a posterior probability using Bayes' rule (named after Rev. Thomas Bayes, 1702-1761).

  16. Posterior of GREEN ∝ P(GREEN) × P(X | GREEN) = 2/3 × 1/40 = 1/60; posterior of RED ∝ P(RED) × P(X | RED) = 1/3 × 3/20 = 1/20. Finally, we classify X as RED since its class membership achieves the larger posterior probability. Note: the above probabilities are not normalized; however, this does not affect the classification outcome since their normalizing constants are the same.
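
  A one-off Python sketch (added for verification) reproducing the GREEN/RED posterior comparison:

  # Priors from the 60 objects (40 GREEN, 20 RED)
  p_green, p_red = 40 / 60, 20 / 60
  # Likelihoods from the circle around X (1 GREEN and 3 RED neighbours)
  l_green, l_red = 1 / 40, 3 / 20

  post_green = p_green * l_green  # unnormalized posterior = 1/60 ≈ 0.0167
  post_red = p_red * l_red        # unnormalized posterior = 1/20 = 0.05
  print('RED' if post_red > post_green else 'GREEN')  # RED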

  17. Technical Notes • Naive Bayes classifiers can handle an arbitrary number of independent variables, whether continuous or categorical. Given a set of variables X = {x1, x2, ..., xd}, we want to construct the posterior probability for the event Cj among a set of possible outcomes C = {c1, c2, ..., cm}. • In more familiar language, X is the set of predictors and C is the set of categorical levels present in the dependent variable. Using Bayes' rule: p(Cj | x1, x2, ..., xd) = p(x1, x2, ..., xd | Cj) p(Cj) / p(x1, x2, ..., xd).

  18. Here p(Cj | x1, x2, ..., xd) is the posterior probability of class membership, i.e., the probability that X belongs to Cj. Since Naive Bayes assumes that the conditional probabilities of the independent variables are statistically independent, we can decompose the likelihood into a product of terms: p(x1, x2, ..., xd | Cj) = ∏k p(xk | Cj), and rewrite the posterior as: p(Cj | x1, x2, ..., xd) ∝ p(Cj) ∏k p(xk | Cj). Using Bayes' rule above, we label a new case X with the class level Cj that achieves the highest posterior probability.
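
  To make the decision rule concrete, here is a compact from-scratch sketch (an illustration, not the scikit-learn implementation) that labels a case with the class maximizing p(Cj) ∏k p(xk | Cj) for binary features, using add-one smoothing to avoid zero probabilities:

  import numpy as np

  def naive_bayes_predict(X_train, y_train, x_new, alpha=1.0):
      """Return the class maximizing p(C) * prod_k p(x_k | C) for binary features."""
      classes = np.unique(y_train)
      best_class, best_score = None, -np.inf
      for c in classes:
          Xc = X_train[y_train == c]
          prior = len(Xc) / len(X_train)
          # Smoothed per-feature probability of the value 1 within class c
          p1 = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
          likelihood = np.prod(np.where(x_new == 1, p1, 1 - p1))
          score = prior * likelihood
          if score > best_score:
              best_class, best_score = c, score
      return best_class

  # Tiny example with two binary features
  X_train = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
  y_train = np.array([1, 1, 0, 0])
  print(naive_bayes_predict(X_train, y_train, np.array([1, 1])))  # 1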

  19. Naive Bayes in scikit-learn • scikit-learn implements three naive Bayes variants, based on three different probability distributions: Bernoulli, multinomial, and Gaussian. • The first is a binary distribution, useful when a feature can be present or absent. • The second is a discrete distribution, used whenever a feature must be represented by a whole number (for example, in natural language processing, it can be the frequency of a term). • The third is a continuous distribution characterized by its mean and variance.

  20. Bernoulli naive Bayes • If X is a random variable that is Bernoulli-distributed, it can assume only two values (for simplicity, let's call them 0 and 1), and their probability is: P(X = 1) = p and P(X = 0) = 1 − p, i.e., P(X = x) = p^x (1 − p)^(1 − x) for x ∈ {0, 1}.

  21. We're going to generate a dummy dataset. Bernoulli naive Bayes expects binary feature vectors; however, the class BernoulliNB has a binarize parameter, which allows us to specify a threshold that will be used internally to transform the features:

  from sklearn.datasets import make_classification

  >>> nb_samples = 300
  >>> X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0)

  22. We have decided to use 0.0 as a binary threshold, so each point can be characterized by the quadrant where it's located.

  from sklearn.naive_bayes import BernoulliNB
  from sklearn.model_selection import train_test_split

  >>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
  >>> bnb = BernoulliNB(binarize=0.0)
  >>> bnb.fit(X_train, Y_train)
  >>> bnb.score(X_test, Y_test)
  0.85333333333333339

  23. The score is rather good, but if we want to understand how the binary classifier worked, it's useful to see how the data has been internally binarized (see the sketch below).
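
  The slide's original illustration is not reproduced in this transcript; as a hedged sketch, the same binarization step can be inspected manually with scikit-learn's binarize utility (assuming the X generated above):

  from sklearn.preprocessing import binarize

  # Replicate BernoulliNB's internal step: values > 0.0 become 1, others 0
  X_binarized = binarize(X, threshold=0.0)
  print(X_binarized[:5])  # each point reduced to the sign pattern of its quadrant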

  24. Now, checking the naive Bayes predictions on the four quadrant patterns, we obtain:

  >>> import numpy as np
  >>> data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
  >>> bnb.predict(data)
  array([0, 0, 1, 1])

  Multinomial naive Bayes: Feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution. This is the event model typically used for document classification.

  25. A multinomial distribution is useful for modeling feature vectors where each value represents, for example, the number of occurrences of a term or its relative frequency. • If the feature vectors count n events, each of which can assume one of k different values with probabilities p1, ..., pk, then: P(x1, ..., xk | n, p1, ..., pk) = n! / (x1! ··· xk!) · p1^x1 ··· pk^xk, where xi is the number of times value i occurs and x1 + ... + xk = n.

  26. The conditional probabilities P(xi|y) are computed with a frequency count (which corresponds to applying a maximum likelihood approach), but in this case it's important to consider the alpha parameter (called the Laplace smoothing factor). Its default value is 1.0, and it prevents the model from setting null probabilities when a frequency is zero. • Any non-negative value can be assigned; however, larger values assign higher probabilities to missing features, and this choice could alter the stability of the model. In our example, we're going to use the default value of 1.0.

  27.

  import numpy as np
  from sklearn.feature_extraction import DictVectorizer

  >>> data = [
  {'house': 100, 'street': 50, 'shop': 25, 'car': 100, 'tree': 20},
  {'house': 5, 'street': 5, 'shop': 0, 'car': 10, 'tree': 500, 'river': 1}
  ]
  >>> dv = DictVectorizer(sparse=False)
  >>> X = dv.fit_transform(data)
  >>> Y = np.array([1, 0])
  >>> X
  array([[ 100., 100., 0., 25., 50., 20.],
         [ 10., 5., 1., 0., 5., 500.]])

  28. Note that the term 'river' is missing from the first sample, so it's useful to keep alpha equal to 1.0 to give it a small probability. The output classes are 1 for city and 0 for countryside. • Now we can train a MultinomialNB instance:

  from sklearn.naive_bayes import MultinomialNB

  >>> mnb = MultinomialNB()
  >>> mnb.fit(X, Y)
  MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

  • To test the model, we create a dummy city with a river and a dummy countryside place without any river:

  29.

  >>> test_data = [
  {'house': 80, 'street': 20, 'shop': 15, 'car': 70, 'tree': 10, 'river': 1},
  {'house': 10, 'street': 5, 'shop': 1, 'car': 8, 'tree': 300, 'river': 0}
  ]
  >>> mnb.predict(dv.transform(test_data))
  array([1, 0])

  • As expected, the prediction is correct. Note that transform (not fit_transform) is used here, so the test samples are encoded with the feature ordering learned during training.

  30. Gaussian Naive Bayes classifier • In Gaussian naive Bayes, the continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called a normal distribution. • When plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values, with density f(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)).

  31. Gaussian naive Bayes is useful when working with continuous values whose probabilities can be modeled using a Gaussian distribution. • The conditional probabilities P(xi|y) are also Gaussian distributed; therefore, it's necessary to estimate the mean and variance of each of them using the maximum likelihood approach. This is quite easy; in fact, considering the properties of a Gaussian, we get: μy = (1/Ny) Σk xk and σy² = (1/Ny) Σk (xk − μy)², where the sums run over the Ny training samples belonging to class y.
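
  As an illustrative sketch (using made-up data, not the slide's example), these per-class maximum-likelihood estimates are simply class-wise means and variances:

  import numpy as np

  # Toy continuous feature with two classes
  x = np.array([1.0, 1.2, 0.8, 3.1, 2.9, 3.0])
  y = np.array([0, 0, 0, 1, 1, 1])

  for c in (0, 1):
      xc = x[y == c]
      mu = xc.mean()   # maximum-likelihood mean for class c
      var = xc.var()   # maximum-likelihood (biased) variance for class c
      print(c, round(mu, 3), round(var, 3))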

  32. Here, the k index refers to the samples in our dataset, and P(xi|y) is a Gaussian itself.

  # load the iris dataset
  from sklearn.datasets import load_iris
  iris = load_iris()

  # store the feature matrix (X) and response vector (y)
  X = iris.data
  y = iris.target

  # split X and y into training and testing sets
  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

  33.

  # train the model on the training set
  from sklearn.naive_bayes import GaussianNB
  gnb = GaussianNB()
  gnb.fit(X_train, y_train)

  # make predictions on the testing set
  y_pred = gnb.predict(X_test)

  # compare actual response values (y_test) with predicted response values (y_pred)
  from sklearn import metrics
  print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*100)

  Output:
  Gaussian Naive Bayes model accuracy(in %): 95.0

  34. Support Vector Machines • In machine learning, support vector machines (SVMs, also called support vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. • A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. • In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.

  35. What is Support Vector Machine? • An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. • In addition to performing linear classification, SVMs can efficiently perform a non-linear classification, implicitly mapping their inputs into high-dimensional feature spaces.

  36. "Support Vector Machine" (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges. • However, it is mostly used in classification problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. • Then, we perform classification by finding the hyper-plane that differentiates the two classes very well.

  37. Support vectors are simply the coordinates of individual observations. The SVM classifier is a frontier which best segregates the two classes (hyper-plane / line).

  38. "How can we identify the right hyper-plane?" (Scenario 1): Here, we have three hyper-planes (A, B and C). Now, identify the right hyper-plane to classify stars and circles. Rule of thumb: "Select the hyper-plane which segregates the two classes better." In this scenario, hyper-plane "B" has performed this job excellently.

  39. (Scenario 2): Here, we have three hyper-planes (A, B and C) and all are segregating the classes well. Now, how can we identify the right hyper-plane?

  40. Here, maximizing the distance between the nearest data points (of either class) and the hyper-plane will help us decide the right hyper-plane. This distance is called the Margin. Above, you can see that the margin for hyper-plane C is high as compared to both A and B. Hence, we name the right hyper-plane C. Another compelling reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane having a low margin, there is a high chance of misclassification.

  41. (Scenario 3): Some of you may have selected hyper-plane B as it has a higher margin compared to A. But here is the catch: SVM selects the hyper-plane which classifies the classes accurately prior to maximizing the margin. Here, hyper-plane B has a classification error and A has classified all points correctly. Therefore, the right hyper-plane is A. (Scenario 4): Below, we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier; SVM has a feature to ignore outliers and still find the hyper-plane with the maximum margin.

  42. (Scenario 5): In the scenario below, we can't have a linear hyper-plane between the two classes, so how does SVM classify these two classes? Till now, we have only looked at linear hyper-planes; this case requires kernel-based classification, discussed next (see the sketch below).
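
  As a hedged sketch of kernel-based classification (not code from the original slides), scikit-learn's SVC with an RBF kernel separates a non-linear, ring-shaped dataset that no straight line can split:

  import numpy as np
  from sklearn.datasets import make_circles
  from sklearn.svm import SVC
  from sklearn.model_selection import train_test_split

  # Ring-shaped data: one class inside, one class outside
  X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

  # The RBF kernel implicitly maps points into a higher-dimensional space
  # where the two rings become linearly separable
  svc = SVC(kernel='rbf', gamma='scale')
  svc.fit(X_train, y_train)
  print(svc.score(X_test, y_test))  # typically close to 1.0 on this dataset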

  43. What does SVM do? Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. Below is an example of SVM classification of a UCI cancer dataset using machine learning tools, i.e., scikit-learn, compatible with Python.

  44.

  # importing scikit-learn's make_blobs and plotting utilities
  import matplotlib.pyplot as plt
  from sklearn.datasets import make_blobs

  # creating a dataset X containing n_samples, with Y containing two classes
  X, Y = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=0.40)

  # plotting the scatter
  plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
  plt.show()

  45. What Support vector machines do, is to not only draw a line between two classes here, but consider a region about the line of some given width.

  46. Here's an example of what it can look like:

  import numpy as np
  import matplotlib.pyplot as plt

  # creating line space between -1 and 3.5
  xfit = np.linspace(-1, 3.5)

  # plotting the scatter
  plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')

  # plot candidate separating lines, each with a margin region of half-width d
  for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
      yfit = m * xfit + b
      plt.plot(xfit, yfit, '-k')
      plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none', color='#AAAAAA', alpha=0.4)

  plt.xlim(-1, 3.5)
  plt.show()

  47. Importing datasets. This is the intuition behind support vector machines, which optimize a linear discriminant model representing the perpendicular distance between the datasets. Now let's train the classifier using our training data. Before training, we need to import the cancer dataset as a csv file, from which we will use two of the features.

  # importing required libraries
  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt

  # reading the csv file and extracting the class column to y
  x = pd.read_csv("C:\...\cancer.csv")
  a = np.array(x)
  y = a[:, 30]  # classes having 0 and 1

  # extracting two features
  x = np.column_stack((x.malignant, x.benign))
  x.shape  # 569 samples and 2 features
  print(x, y)

  48.

  [[ 122.8 1001. ]
   [ 132.9 1326. ]
   [ 130. 1203. ]
   ...,
   [ 108.3 858.1 ]
   [ 140.1 1265. ]
   [ 47.92 181. ]]
  array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 1., ...., 1.])
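
  The transcript ends before the classifier is actually fitted; as a hedged continuation (a sketch under the assumption that x and y are the arrays printed above), the training step would look like this with scikit-learn's SVC:

  from sklearn.svm import SVC
  from sklearn.model_selection import train_test_split

  # split the two-feature data into training and testing sets
  x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

  # train a linear-kernel support vector classifier
  clf = SVC(kernel='linear')
  clf.fit(x_train, y_train)

  # evaluate on the held-out test set
  print(clf.score(x_test, y_test))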
