
Supervised Learning: Regression, Classification. Linear regression, k-NN classification



Supervised Learning Regression, Classification Linear regression, k- NN classification. Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata August 11, 2014. An Example: Size of Engine vs Power.

Supervised Learning: Regression, Classification. Linear regression, k-NN classification

Debapriyo Majumdar

Data Mining – Fall 2014

Indian Statistical Institute Kolkata

August 11, 2014


An Example: Size of Engine vs Power

  • An unknown car has an engine of size 1800cc. What is likely to be the power of the engine?

Figure: power (bhp) against engine displacement (cc)


An Example: Size of Engine vs Power

  • Intuitively, the two variables have a relation

  • Learn the relation from the given data

  • Predict the target variable after learning

Figure: power (bhp), the target variable, against engine displacement (cc)


Exercise: on a simpler set of data points

  • Predict y for x = 2.5

Figure: data points, y against x


Linear Regression

  • Assume: the relation is linear

  • Then for a given x (=1800), predict the value of y

Figure: training set, power (bhp) against engine displacement (cc)


Linear Regression

  • Linear regression

  • Assume y = a . x + b

  • Try to find suitable a and b

Figure: power (bhp) against engine displacement (cc)

Optional exercise


Exercise: using Linear Regression

  • Define a regression line of your choice

  • Predict y for x = 2.5

Figure: data points, y against x


Choosing the parameters right

  • The data points: $(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)$

  • The regression line: $f(x) = y = a \cdot x + b$

  • Least-squares cost function: $J = \sum_{i=1}^{m} (f(x_i) - y_i)^2$

  • Goal: minimize J over choices of a and b

Figure: y against x. Goal: minimizing the deviation from the actual data points
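
The cost function on this slide can be evaluated directly for any candidate line. The following is a minimal sketch, assuming a small made-up data set (the points are illustrative and do not come from the slides). The helper names `cost` and `fit_line` are hypothetical, and `fit_line` uses the standard closed-form least-squares solution for a straight line, which is one possible way to find a and b.

```python
def cost(a, b, xs, ys):
    """Least-squares cost J = sum_i (f(x_i) - y_i)^2 for the line f(x) = a*x + b."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Closed-form least-squares estimates of the slope a and intercept b."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # Covariance-like and variance-like sums (not divided by m; the factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data points, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.0]

a, b = fit_line(xs, ys)
print("a =", a, "b =", b, "J =", cost(a, b, xs, ys))
print("prediction for x = 2.5:", a * 2.5 + b)
```

Any other choice of a and b gives a cost at least as large as the fitted one, which can be checked by calling `cost` with slightly perturbed parameters.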


How to Minimize the Cost Function?

  • Goal: minimize J over all values of a and b

  • Start from some $a = a_0$ and $b = b_0$

  • Compute: $J(a_0, b_0)$

  • Simultaneously change a and b in the direction of the negative gradient, hoping to eventually arrive at an optimum

  • Question: Can there be more than one optimum?

Figure: axes a and b, with a step marked Δ
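
The procedure above is gradient descent. A minimal sketch in Python, assuming the least-squares cost $J(a, b) = \sum_i (a x_i + b - y_i)^2$ and a small made-up data set. The function name `gradient_descent`, the learning rate and the number of steps are illustrative choices, not values from the slides.

```python
def gradient_descent(xs, ys, a0=0.0, b0=0.0, learning_rate=0.01, steps=5000):
    """Minimize J(a, b) = sum_i (a*x_i + b - y_i)^2 by gradient descent."""
    a, b = a0, b0
    for _ in range(steps):
        # Partial derivatives of J with respect to a and b.
        grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = sum(2 * (a * x + b - y) for x, y in zip(xs, ys))
        # Simultaneous update: both gradients were computed from the old (a, b).
        a, b = a - learning_rate * grad_a, b - learning_rate * grad_b
    return a, b


# Hypothetical data points, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.0]

a, b = gradient_descent(xs, ys)
print("a =", a, "b =", b)
```

The learning rate has to be small enough for the updates to converge; too large a value makes a and b diverge on this data. For this particular cost, J is a convex quadratic in a and b, so a sufficiently small step size leads to the global minimum rather than a merely local one.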


Another example:

  • Given that a person’s age is 24, predict if (s)he has high blood sugar

  • Discrete values of the target variable (Y / N)

  • Many ways of approaching this problem

Figure: training set, high blood sugar (Y / N) against age


Classification problem

  • One approach: what other data points are nearest to the new point?

  • Other approaches?

Figure: high blood sugar (Y / N) against age, with the new point at age 24 marked "?"


Classification Algorithms

  • The k-nearest neighbor classification

  • Naïve Bayes classification

  • Decision Tree

  • Linear Discriminant Analysis

  • Logistic Regression

  • Support Vector Machine


Classification or Regression?

Given data about some cars: engine size, number of seats, petrol / diesel, has airbag or not, price

  • Problem 1: Given the engine size of a new car, what is likely to be the price?

  • Problem 2: Given the engine size of a new car, is it likely that the car runs on petrol?

  • Problem 3: Given the engine size, is it likely that the car has airbags?


Classification


Example: Age, Income and Owning a flat

Figure: training set, monthly income (thousand rupees) against age; points marked as "owns a flat" or "does not own a flat"

  • Given a new person’s age and income, predict – does (s)he own a flat?


Example: Age, Income and Owning a flat

Figure: training set, monthly income (thousand rupees) against age; points marked as "owns a flat" or "does not own a flat"

  • Nearest neighbor approach

  • Find nearest neighbors among the known data points and check their labels


Example: Age, Income and Owning a flat

Figure: training set, monthly income (thousand rupees) against age; points marked as "owns a flat" or "does not own a flat"

  • The 1-Nearest Neighbor (1-NN) Algorithm:

    • Find the closest point in the training set

    • Output the label of the nearest neighbor


The k-Nearest Neighbor Algorithm

Figure: training set, monthly income (thousand rupees) against age; points marked as "owns a flat" or "does not own a flat"

  • The k-Nearest Neighbor (k-NN) Algorithm:

    • Find the k closest points in the training set

    • Output the majority vote among the labels of the k points
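
The two steps of k-NN can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance and a made-up training set of (age, monthly income) pairs; the function name `knn_predict` and the data are hypothetical. With k = 1 it reduces to the 1-NN algorithm.

```python
import math
from collections import Counter


def knn_predict(training_set, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    training_set: list of (features, label) pairs, where features is a tuple of numbers.
    """
    # Sort the training points by Euclidean distance to the query point.
    by_distance = sorted(training_set, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Most common label among the k nearest; ties go to the label encountered first.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical data: (age, monthly income in thousand rupees) -> owns a flat?
training_set = [
    ((25, 20), "no"),
    ((30, 35), "no"),
    ((35, 40), "no"),
    ((45, 90), "yes"),
    ((50, 120), "yes"),
    ((55, 100), "yes"),
]

print(knn_predict(training_set, (48, 95), k=3))
print(knn_predict(training_set, (28, 30), k=1))
```

Because age and income are measured on different scales, the feature with the larger numeric range tends to dominate the Euclidean distance; rescaling the features first is a common remedy, though it is not shown here.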


Distance measures

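  • How to measure distance to find closest points?

  • Euclidean distance between vectors $x = (x_1, \dots, x_k)$ and $y = (y_1, \dots, y_k)$: $d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$

  • Manhattan distance: $d(x, y) = \sum_{i=1}^{k} |x_i - y_i|$

  • Generalized squared interpoint distance, the Mahalanobis distance (1936): $d^2(x, y) = (x - y)^T S^{-1} (x - y)$, where S is the covariance matrix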
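
The three distance measures can be written out as short Python functions. This is a sketch using the standard definitions; the function names are hypothetical, and `mahalanobis_squared` assumes the caller supplies the inverse of the covariance matrix S as nested lists instead of computing it from data.

```python
import math


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def mahalanobis_squared(x, y, s_inv):
    """(x - y)^T S^{-1} (x - y), with s_inv the inverse covariance matrix."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    return sum(d[i] * s_inv[i][j] * d[j] for i in range(n) for j in range(n))


# Hypothetical vectors, for illustration only.
x, y = (1.0, 2.0), (4.0, 6.0)
# Assumed inverse covariance: the identity, which gives the squared Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]

print(euclidean(x, y), manhattan(x, y), mahalanobis_squared(x, y, identity))
```

With a non-identity covariance matrix, the Mahalanobis distance downweights directions in which the data vary a lot, which is one way to deal with features on different scales.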


Classification setup

  • Training data / set: set of input data points and given answers for the data points

  • Labels: the list of possible answers

  • Test data / set: inputs to the classification algorithm for finding labels

    • Used for evaluating the algorithm in case the answers are known (but not known to the algorithm)

  • Classification task: Determining labels of the data points for which the label is not known or not passed to the algorithm

  • Features: attributes that represent the data


Evaluation

  • Test set accuracy: the correct performance measure

  • Accuracy = # of correct answers / # of all answers

    • Need to know the true test labels

    • Option: use the training set itself

    • Parameter selection (for k-NN) by accuracy on the training set

    • Overfitting: a classifier performs much better on the training set than on new (unlabeled) test data
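
The accuracy measure is simple to compute once the true labels are available. A minimal sketch, assuming predictions and true labels are given as two equally long lists; the function name `accuracy` and the labels below are made up for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, t in zip(predicted, actual) if p == t)
    return correct / len(actual)


# Hypothetical labels, for illustration only.
predicted = ["Y", "N", "N", "Y", "N"]
actual = ["Y", "N", "Y", "Y", "N"]

print(accuracy(predicted, actual))
```

Note that a 1-NN classifier evaluated on its own training set finds each point as its own nearest neighbor, so training-set accuracy can look perfect while saying little about new data.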


Better validation methods

  • Leave one out:

    • For each training data point x of training set D

    • Construct training set D – x, test set {x}

    • Train on D – x, test on x

    • Overall accuracy = average over all such cases

    • Expensive to compute

  • Hold out set:

    • Randomly choose x% (say 25-30%) of the training data and set it aside as the test set

    • Train on the rest of training data, test on the test set

    • Easy to compute, but tends to have higher variance
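
Both validation schemes can be sketched with a very small classifier. The example below assumes a 1-NN classifier on a single numeric feature and a made-up data set; the function names, the 25% test fraction and the fixed random seed are illustrative assumptions.

```python
import random


def nearest_neighbor_label(train, x):
    """1-NN on one numeric feature: label of the training point closest to x."""
    return min(train, key=lambda item: abs(item[0] - x))[1]


def leave_one_out_accuracy(data):
    """Train on D minus one point, test on that point, and average over all points."""
    correct = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if nearest_neighbor_label(train, x) == label:
            correct += 1
    return correct / len(data)


def hold_out_accuracy(data, test_fraction=0.25, seed=0):
    """Set aside a random fraction of the data as the test set and train on the rest."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    correct = sum(1 for x, label in test if nearest_neighbor_label(train, x) == label)
    return correct / len(test)


# Hypothetical (age, high blood sugar?) pairs, for illustration only.
data = [(20, "N"), (25, "N"), (30, "N"), (35, "N"),
        (50, "Y"), (55, "Y"), (60, "Y"), (65, "Y")]

print(leave_one_out_accuracy(data))
print(hold_out_accuracy(data))
```

Leave-one-out trains once per data point, which is why it becomes expensive for large sets, while the hold-out estimate depends on which points happen to land in the test set.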


The k-fold Cross Validation Method

  • Randomly divide the training data D into k partitions $D_1, \dots, D_k$, of equal size where possible

  • For each fold $D_i$

    • Train a classifier with training data = $D - D_i$

    • Test and validate with Di

  • Overall accuracy: average accuracy over all cases
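
The k-fold procedure can be sketched as follows. This is a minimal illustration, assuming a 1-NN classifier on a single numeric feature and a made-up data set; the function names, the choice k = 4 and the fixed random seed are assumptions, and k must not exceed the number of data points.

```python
import random


def nearest_neighbor_label(train, x):
    """1-NN on one numeric feature: label of the training point closest to x."""
    return min(train, key=lambda item: abs(item[0] - x))[1]


def k_fold_accuracy(data, k=4, seed=0):
    """Average test accuracy over k folds, each fold serving once as the test set."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th point starting at i, so fold sizes differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]
    accuracies = []
    for i, test in enumerate(folds):
        # Training data is everything outside the current fold.
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        correct = sum(1 for x, label in test if nearest_neighbor_label(train, x) == label)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


# Hypothetical (age, high blood sugar?) pairs, for illustration only.
data = [(20, "N"), (25, "N"), (30, "N"), (35, "N"),
        (50, "Y"), (55, "Y"), (60, "Y"), (65, "Y")]

print(k_fold_accuracy(data, k=4))
```

Setting k equal to the number of data points makes every fold a single point, which is the leave-one-out scheme.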


References

  • Lecture videos by Prof. Andrew Ng, Stanford University

    Available on Coursera (Course: Machine Learning)

  • Data Mining Map: http://www.saedsayad.com/

