
Chapter 11Supervised Learning:STATISTICAL METHODS

Cios / Pedrycz / Swiniarski / Kurgan


Outline
Outline

  • Bayesian Methods

    • Basics of Bayesian Methods

    • Bayesian Classification – General Case

    • Classification that Minimizes Risk

    • Decision Regions and Probability of Errors

    • Discriminant Functions

    • Estimation of Probability Densities

    • Probabilistic Neural Network

    • Constraints in Classifier Design

Cios / Pedrycz / Swiniarski / Kurgan


Outline1
Outline

  • Regression

    • Data Models

    • Simple Linear Regression

    • Multiple Regression

    • General Least Squares and Multiple Regression

    • Assessing Quality of the Multiple Regression Model

Cios / Pedrycz / Swiniarski / Kurgan


Bayesian methods
Bayesian Methods

Statistical processing based on the Bayes decision theory is a fundamental technique for pattern recognition and classification.

The Bayes decision theory provides a framework for statistical methods for classifying patterns into classes based on probabilities of patterns and their features.

Cios / Pedrycz / Swiniarski / Kurgan


Basics of bayesian methods
Basics of Bayesian Methods

Let us assume an experiment involving recognition of two kinds of birds: an eagle and a hawk.

States of nature: C = { “an eagle”, “a hawk” }

Values of C = { c1, c2 } = { “an eagle”, “a hawk” }

We may assume that among the large number N of prior observations, neagle of them belonged to class c1 (“an eagle”)

and nhawk belonged to class c2 (“a hawk”) (with neagle + nhawk = N)

Cios / Pedrycz / Swiniarski / Kurgan


Basics of bayesian methods1
Basics of Bayesian Methods

  • A priori (prior) probability P(ci):

  • Estimation of a prior P(ci):

  • P(ci) denotes the (unconditional) probability that an object belongs to class ci, without any further information about the object.

Cios / Pedrycz / Swiniarski / Kurgan


Basics of bayesian methods2
Basics of Bayesian Methods

The a priori probabilities P(c1) and P(c2) represent our initial knowledge (in statistical terms) about how likely it is that an eagle or a hawk may emerge even before a bird physically appears.

  • Natural and best decision:

    “Assign a bird to a class c1 if P(c1) > P(c2); otherwise, assign a bird to a class c2 ”

  • The probability of classification error:

    P(classification error) = P(c2) if we decide C = c1; P(c1) if we decide C = c2

Cios / Pedrycz / Swiniarski / Kurgan


Involving object features in classification
Involving Object Features in Classification

  • Feature variable / feature x

    • It characterizes an object and allows for better discrimination of one class from another

    • We assume it to be a continuous random variable taking continuous values from a given range

    • The variability of a random variable x can be expressed in probabilistic terms

    • We represent a distribution of a random variable x by the class conditional probability density function (the state conditional probability density function):

Cios / Pedrycz / Swiniarski / Kurgan


Involving object features in classification1
Involving Object Features in Classification

Examples of probability densities

Cios / Pedrycz / Swiniarski / Kurgan


Involving object features in classification2
Involving Object Features in Classification

  • Probability density function p(x|ci)

    • also called the likelihood of a class ci with respect to the value x of a feature variable

    • the larger p(x|ci) is, the more likely it is that an object with feature value x belongs to class ci

    • joint probability density function p(ci, x)

    • A probability density that an object is in a class ci and has a feature variable value x.

    • A posteriori (posterior) probability P(ci|x)

    • The conditional probability function P(ci|x) (i = 1, 2), which specifies the probability that the object class is ci given that the measured value of a feature variable is x.

Cios / Pedrycz / Swiniarski / Kurgan


Involving object features in classification3
Involving Object Features in Classification

  • Bayes’ rule / Bayes’ theorem

    • From probability theory (see Appendix B)

    • An unconditional probability density function

Cios / Pedrycz / Swiniarski / Kurgan


Involving object features in classification4
Involving Object Features in Classification

  • Bayes’ rule

    • “The conditional probability P(ci|x) can be expressed in terms of the a priori probability function P(ci), together with the class conditional probability density function p(x|ci).”

Cios / Pedrycz / Swiniarski / Kurgan


Involving object features in classification5
Involving Object Features in Classification

  • Bayes’ decision rule

    • P(classification error | x) = P(c2|x) if we decide C = c1

    • P(classification error | x) = P(c1|x) if we decide C = c2

    • “This statistical classification rule is best in the sense of minimizing the probability of misclassification (the probability of classification error)”

    • Bayes’ classification rule guarantees minimization of the average probability of classification error

Cios / Pedrycz / Swiniarski / Kurgan


Involving object features in classification6
Involving Object Features in Classification

  • Example

    • Let us consider a bird classification problem with P(c1) = P(“an eagle”) = 0.8 and P(c2) = P(“a hawk”) = 0.2 and known probability density functions p(x|c1) and p(x|c2).

    • Assume that, for a new bird, we have measured its size x = 45 cm and for this value we computed p(45|c1) = 2.2828 ∙ 10^-2 and p(45|c2) = 1.1053 ∙ 10^-2.

    • Thus, the classification rule predicts class c1 (“an eagle”) because p(x|c1)P(c1) > p(x|c2)P(c2) (2.2828 ∙ 10^-2 ∙ 0.8 > 1.1053 ∙ 10^-2 ∙ 0.2). Let us assume that the value of the unconditional density p(x) is known to be p(45) = 0.3. The probability of classification error is
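A minimal Python sketch of this example, using the slide’s values (p(45) = 0.3 is taken as given, so the two posteriors need not sum to one):

```python
# Values from the example above (taken as given on the slide).
priors = {"eagle": 0.8, "hawk": 0.2}                    # P(c1), P(c2)
likelihoods = {"eagle": 2.2828e-2, "hawk": 1.1053e-2}   # p(45 | c_i)
p_x = 0.3                                               # unconditional density p(45)

# Bayes' rule: P(c_i | x) = p(x | c_i) * P(c_i) / p(x)
posteriors = {c: likelihoods[c] * priors[c] / p_x for c in priors}

decision = max(posteriors, key=posteriors.get)          # class with the largest posterior
# In the two-class case, P(error | x) is the posterior of the rejected class.
p_error = min(posteriors.values())
print(decision, p_error)                                # eagle, about 0.0074
```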

Cios / Pedrycz / Swiniarski / Kurgan


Bayesian classification general case
Bayesian Classification – General Case

  • Bayes’ Classification Rule for Multiclass Multifeature Objects

    • Real-valued features of an object as n-dimensional column vector x ∈ Rn:

    • The object may belong to l distinct classes (l distinct states of nature):

Cios / Pedrycz / Swiniarski / Kurgan


Bayesian classification general case1
Bayesian Classification – General Case

  • Bayes’ Classification Rule for Multiclass Multifeature Objects

    • Bayes’ theorem

      • A priori probability: P(ci) (i = 1, 2…,l)

      • Class conditional probability density function : p(x|ci)

      • A posteriori (posterior) probability: P(ci|x)

      • Unconditional probability density function:

Cios / Pedrycz / Swiniarski / Kurgan


Bayesian classification general case2
Bayesian Classification – General Case

  • Bayes’ Classification Rule for Multiclass Multifeature Objects

    • Bayes classification rule:

    • A given object with a given value x of a feature vector can be classified as belonging to class cj when:

    • Assign an object with a given value x of a feature vector to class cj when:
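The rule amounts to a one-line argmax over classes. A minimal sketch, assuming each entry of `likelihood_fns` is a callable returning p(x|ci):

```python
import numpy as np

def bayes_classify(x, priors, likelihood_fns):
    """Assign x to the class c_j that maximizes p(x | c_j) * P(c_j)
    (equivalently, the largest posterior P(c_j | x), since p(x) is common to all classes)."""
    scores = [prior * f(x) for prior, f in zip(priors, likelihood_fns)]
    return int(np.argmax(scores))
```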

Cios / Pedrycz / Swiniarski / Kurgan


Classification that minimizes risk
Classification that Minimizes Risk

  • Basic Idea

    • To incorporate the fact that misclassifications of some classes are more costly than others, we define a classification that is based on a minimization criterion that involves a loss associated with a given classification decision for a given true state of nature

  • A loss function

    • Cost (penalty, weight) due to the fact of assigning an object to class cj when in fact the true class is ci

Cios / Pedrycz / Swiniarski / Kurgan


Classification that minimizes risk1
Classification that Minimizes Risk

  • A loss matrix

    • For an l-class classification problem, the values of the loss function Lij are arranged into an l × l loss matrix

  • Expected (average) conditional loss

    • In short,

Cios / Pedrycz / Swiniarski / Kurgan


Classification that minimizes risk2
Classification that Minimizes Risk

  • Overall Risk

    • The overall risk R can be considered as a classification criterion for minimizing risk related to a classification decision.

  • Bayes risk

    • The minimal overall risk R (the Bayes risk) leads to a generalization of Bayes’ rule for minimization of the probability of classification error.

Cios / Pedrycz / Swiniarski / Kurgan


Classification that minimizes risk3
Classification that Minimizes Risk

  • Bayes’ classification rule with Bayes risk

    • Choose a decision (a class) ci for which:
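A sketch of this rule, assuming the loss convention from the earlier slide (Lij is the cost of deciding cj when the true class is ci) and that the posteriors P(ci|x) are already available: choose the class cj minimizing the conditional risk R(cj|x) = Σi Lij P(ci|x).

```python
import numpy as np

def min_risk_class(posteriors, loss):
    """Minimum-risk decision: choose c_j minimizing
    R(c_j | x) = sum_i L[i, j] * P(c_i | x).

    posteriors -- array of P(c_i | x), shape (l,)
    loss       -- L[i, j], cost of deciding c_j when the true class is c_i
    """
    risks = loss.T @ posteriors        # risks[j] = sum_i L[i, j] * posteriors[i]
    return int(np.argmin(risks))
```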

Cios / Pedrycz / Swiniarski / Kurgan


Classification that minimizes risk4
Classification that Minimizes Risk

  • Bayesian Classification Minimizing the Probability of Error

    • Symmetrical zero-one conditional loss function

    • The conditional risk R(cj| x) criterion is the same as the average probability of classification error:

    • An average probability of classification error is thus used as a criterion of minimization for selecting the best classification decision

Cios / Pedrycz / Swiniarski / Kurgan


Classification that minimizes risk5
Classification that Minimizes Risk

  • Generalization of the Maximum Likelihood Classification

    • Generalized likelihood ratio for classes ci and cj

    • Generalized threshold value

    • The maximum likelihood classification rule

    • “Decide a class cj if

Cios / Pedrycz / Swiniarski / Kurgan


Decision regions and probability of errors
Decision Regions and Probability of Errors

  • Decision regions

    • A classifier divides the feature space into l disjoint decision subspaces R1,R2, … Rl

    • The region Ri is a subspace such that each realization x of a feature vector of an object falling into this region will be assigned to a class ci

Cios / Pedrycz / Swiniarski / Kurgan


Decision regions and probability of errors1
Decision Regions and Probability of Errors

  • Decision boundaries (decision surfaces)

    • The decision regions do not intersect; the boundaries between adjacent regions are called decision boundaries (decision surfaces)

    • “The task of a classifier design is to find classification rules that will guarantee division of a feature space into optimal decision regions R1,R2, … Rl (with optimal decision boundaries) that will minimize a selected classification performance criterion”

Cios / Pedrycz / Swiniarski / Kurgan


Decision regions and probability of errors2
Decision Regions and Probability of Errors

  • Decision boundaries

Cios / Pedrycz / Swiniarski / Kurgan


Decision regions and probability of errors3
Decision Regions and Probability of Errors

  • Optimal classification with decision regions

    • Average probability of correct classification

  • “Classification problems can be stated as choosing the decision regions Ri (thus defining a classification rule) that maximize the probability P(classification_correct) of correct classification, which serves as the optimization criterion”

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions
Discriminant Functions

  • Discriminant functions:

  • Discriminant type classifier

    • It assigns an object with a given value x of a feature vector to a class cj if

  • Classification rule for a discriminant function-based classifier

    • Compute numerical values of all discriminant functions for x

    • Choose a class cj as a prediction of true class for which a value of the associated discriminant function dj(x) is the largest:

      • Select a class cj for which dj(x) = max(di(x)); i = 1, 2, …, l
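A minimal sketch of this decision rule, assuming each entry of `discriminants` is a callable di(x):

```python
def discriminant_classify(x, discriminants):
    """Evaluate every discriminant function d_i(x) and return the index of the
    class whose discriminant value is largest."""
    values = [d(x) for d in discriminants]
    return max(range(len(values)), key=values.__getitem__)
```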

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions1
Discriminant Functions

  • Discriminant classifier

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions2
Discriminant Functions

  • Discriminant type classifier for Bayesian classification

    • The natural choice for the discriminant function is the a posteriori conditional probability P(ci|x):

    • Practical versions using Bayes’ theorem

    • Bayesian discriminant in a natural logarithmic form

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions3
Discriminant Functions

  • Characteristics of discriminant function

    • Discriminant functions define the decision boundaries that separate the decision regions

    • Generally, the decision boundaries between neighboring decision regions are defined where the corresponding discriminant function values are equal

    • The decision boundaries are unaffected by monotonically increasing transformations of the discriminant functions

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions4
Discriminant Functions

  • Bayesian Discriminant Functions for Two Classes

    • General case

      • Two discriminant functions: d1(x) and d2(x).

      • Two decision regions: R1 and R2.

      • The decision boundary: d1(x) = d2(x).

    • Using a dichotomizer

      • Single discriminant function: d(x) = d1(x) - d2(x).

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions5
Discriminant Functions

  • Quadratic and Linear Discriminants Derived from the Bayes Rule

    • Quadratic Discriminant

      • Assumption:

      • A multivariate normal Gaussian distribution of the feature vector x within each class

      • The Bayesian discriminant (from the previous section):

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions6
Discriminant Functions

  • Quadratic and Linear Discriminants Derived from the Bayes Rule

    • Quadratic Discriminant

      • Gaussian distribution of the probability density function

      • Quadratic Discriminant function

      • Decision boundaries:

      • hyperquadratic functions in n-dimensional feature space (hyperspheres, hyperellipsoids, hyperparaboloids, etc.)

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions7
Discriminant Functions

  • Quadratic and Linear Discriminants Derived from the Bayes Rule

  • Given: A pattern x. Values of state conditional probability densities p(xj|ci) and the a priori probabilities P(ci)

    • Compute values of the mean vectors μi and the covariance matrices Σi for all classes i = 1, 2, …, l based on the training set

    • Compute values of the discriminant function for all classes

    • Choose a class cj as a prediction of the true class, the one for which the value of the associated discriminant function dj(x) is largest:

      • Select a class cj for which dj(x) = max(di(x)) i = 1, 2, …, l
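A sketch of these steps under the multivariate normal assumption, using the log form of the Bayesian discriminant with the constant term dropped (an assumed helper, not the book’s exact notation):

```python
import numpy as np

def quadratic_discriminant(x, mean, cov, prior):
    """Log form of the Bayesian discriminant for a multivariate normal class
    model, with the constant -(n/2) ln(2*pi) dropped:
    d_i(x) = -1/2 (x - mu_i)^T Sigma_i^{-1} (x - mu_i) - 1/2 ln|Sigma_i| + ln P(c_i)."""
    diff = x - mean
    return (-0.5 * diff @ np.linalg.solve(cov, diff)
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

def qda_classify(x, means, covs, priors):
    # Evaluate the discriminant for every class and pick the largest value.
    scores = [quadratic_discriminant(x, m, S, P) for m, S, P in zip(means, covs, priors)]
    return int(np.argmax(scores))
```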

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions8
Discriminant Functions

  • Quadratic and Linear Discriminants Derived from the Bayes Rule

    • Linear Discriminant:

      • Assumption: equal covariances for all classes, Σi = Σ

      • The Quadratic Discriminant:

      • A linear form of discriminant functions:

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions9
Discriminant Functions

  • Quadratic and Linear Discriminants Derived from the Bayes Rule

    • Linear Discriminant:

Decision boundaries between classes i and j, for which di(x) = dj(x), are pieces of hyperplanes in n-dimensional feature space

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions10
Discriminant Functions

  • Quadratic and Linear Discriminants Derived from the Bayes Rule

    • The classification process using linear discriminants

      • Compute, for a given x, numerical values of discriminant functions for all classes:

      • Choose a class cj for which the value of the discriminant function dj(x) is largest:

        • Select a class cj for which dj(x) = max(di(x)) i = 1, 2, …, l

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions11
Discriminant Functions

  • Quadratic and Linear Discriminants

    • Example

    • Let us assume that the following two-feature patterns x ∈ R2 from two classes c1 = 0 and c2 = 1 have been drawn according to the Gaussian (normal) density distribution:

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions12
Discriminant Functions

  • Quadratic and Linear Discriminants

    • Example

      • The estimates of the symmetric covariance matrices for both classes

      • The linear discriminant functions for both classes

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions13
Discriminant Functions

  • Quadratic and Linear Discriminants

    • Example

      • Two-class two-feature pattern dichotomizer.

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions14
Discriminant Functions

  • Quadratic and Linear Discriminants

    • Minimum Mahalanobis Distance Classifier

      • Assumption

        • Equal covariances for all classes: Σi = Σ ( i = 1, 2, …, l )

        • Equal a priori probabilities for all classes P(ci) = P

      • Discriminant function

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions15
Discriminant Functions

  • Quadratic and Linear Discriminants

    • Minimum Mahalanobis Distance Classifier

      • A classifier selects the class cj for which a value x is nearest, in the sense of Mahalanobis distance, to the corresponding mean vector μj. This classifier is called a minimum Mahalanobis distance classifier.

      • Linear version of the minimum Mahalanobis distance classifier

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions16
Discriminant Functions

  • Quadratic and Linear Discriminants

    • Minimum Mahalanobis Distance Classifier

    • Given: The mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of a feature vector

      • Compute numerical values of the Mahalanobis distances between x and the means μi for all classes.

      • Choose a class cj as a prediction of the true class, for which the value of the associated Mahalanobis distance attains the minimum:
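A minimal sketch of this classifier, assuming the common covariance matrix Σ is known or has been estimated beforehand:

```python
import numpy as np

def mahalanobis_classify(x, means, cov):
    """Minimum Mahalanobis distance classifier: with equal covariances and
    equal priors, assign x to the class whose mean is closest in the sense of
    (x - mu_i)^T Sigma^{-1} (x - mu_i)."""
    cov_inv = np.linalg.inv(cov)
    dists = [(x - m) @ cov_inv @ (x - m) for m in means]
    return int(np.argmin(dists))
```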

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions17
Discriminant Functions

  • Quadratic and Linear Discriminants

    • Linear Discriminant for Statistically Independent Features

      • Assumption

        • Equal covariances for all classes: Σi = Σ ( i = 1, 2, …, l )

        • Features are statistically independent

      • Discriminant function

      • where

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions18
Discriminant Functions

  • Quadratic and Linear Discriminants

    • Linear Discriminant for Statistically Independent Features

      • Discriminants

      • Quadratic discriminant formula

      • Linear discriminant formula

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions19
Discriminant Functions

  • Quadratic and Linear Discriminants

    • Linear Discriminant for Statistically Independent Features

      • “Neural network” style as a linear threshold machine

      • where

      • The decision surfaces for the linear discriminants are pieces of hyperplanes defined by the equations di(x) = dj(x).

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions20
Discriminant Functions

  • Quadratic and Linear Discriminants

    • Minimum Euclidean Distance Classifier

      • Assumption

        • Equal covariances for all classes: Σi = Σ ( i = 1, 2, …, l )

        • Features are statistically independent

        • Equal a priori probabilities for all classes P(ci) = P

      • Discriminants

      • or

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions21
Discriminant Functions

  • Quadratic and Linear Discriminants

    • Minimum Euclidean Distance Classifier

      • The minimum distance classifier, or minimum Euclidean distance classifier, selects the class cj for which the value x is nearest to the corresponding mean vector μj.

      • Linear version of the minimum distance classifier

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions22
Discriminant Functions

  • Quadratic and Linear Discriminants

  • Given: The mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of a feature vector

    • Compute numerical values of Euclidean distances between x and the means μi for all classes:

    • Choose a class cj as a prediction of true class for which a value of the associated Euclidean distance is smallest:
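A minimal sketch with all the listed assumptions folded in, so only the class means are needed:

```python
import numpy as np

def euclidean_classify(x, means):
    """Minimum Euclidean distance classifier: assign x to the class whose
    mean vector is nearest in the ordinary Euclidean sense."""
    dists = [np.linalg.norm(x - m) for m in means]
    return int(np.argmin(dists))
```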

Cios / Pedrycz / Swiniarski / Kurgan


Discriminant functions23
Discriminant Functions

  • Quadratic and Linear Discriminants

    • Characteristics of Bayesian Normal Discriminant

      • Assumptions

        • multivariate normality within classes

        • equal covariance matrices between classes

      • The linear discriminant is equivalent to the optimal classifier

      • These assumptions are satisfied only approximately

      • Due to its simple structure, the linear discriminant tends not to overfit the training data set, which may lead to stronger generalization ability for unseen cases

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities
Estimation of Probability Densities

  • Basic Idea

    • In Bayesian classifier design, the a priori probabilities and the class conditional probability densities must be estimated from the limited number of previously observed objects. This estimation should be optimal according to a well-defined estimation criterion.

  • Estimates of a priori probabilities

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities1
Estimation of Probability Densities

  • Estimation of the class conditional probability densities p(x|ci)

    • Parametric methods

    • with the assumption of a specific functional form of a probability density function

    • Nonparametric methods

    • without the assumption of a specific functional form of a probability density function

    • Semiparametric method

    • a combination of parametric and nonparametric methods

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities2
Estimation of Probability Densities

  • Parametric Methods

    • A priori observations of objects and corresponding patterns:

    • Split set of all patterns X according to a class into l disjoint sets:

    • Assume that the parametric form of the class conditional probability density is given as a function:

    • where

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities3
Estimation of Probability Densities

  • Parametric Methods

    • If the probability density has a normal (Gaussian) form:

    • where

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities4
Estimation of Probability Densities

  • The Maximum Likelihood Estimation of Parameters

    • Assumption

      • we are given a limited-size set of N patterns xi:

      • we know a parametric form p(x|) of a conditional probability density function

    • Goal

      • The task of estimation is to find the optimal (the best according to the used criterion) value of the parameter vector θ of a given dimension m.

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities5
Estimation of Probability Densities

  • The Maximum Likelihood Estimation of Parameters

    • Likelihood

      • The joint probability density L( ) is a function of a parameter vector  for a given set of patterns X.

      • It is called the likelihood of  for a given set of patterns X.

    • Maximum Likelihood Estimation

    • The function L( ) can be chosen as a criterion for finding the optimal estimate of . It is called the maximum likelihood estimation of parameters 

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities6
Estimation of Probability Densities

  • The Maximum Likelihood Estimation of Parameters

    • Minimizing the negative natural logarithm of the likelihood L(θ):

    • For the differentiable function p(xi|θ):

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities7
Estimation of Probability Densities

  • The Maximum Likelihood Estimation of Parameters

    • For the normal form of a probability density function N(µ, Σ) with unknown parameters µ and Σ constituting the vector θ:

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities8
Estimation of Probability Densities

  • The Maximum Likelihood Estimation of Parameters

    • Example of Maximum Likelihood Estimation

      • for

      • The maximum likelihood estimation criterion

      • The maximum likelihood estimates for the parameters:
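A sketch of the resulting ML estimates for the multivariate normal case: the sample mean and the divide-by-N (biased) covariance estimate.

```python
import numpy as np

def ml_gaussian_estimates(X):
    """Maximum likelihood estimates of the N(mu, Sigma) parameters from the
    N patterns stored as rows of X."""
    mu_hat = X.mean(axis=0)
    diffs = X - mu_hat
    sigma_hat = diffs.T @ diffs / X.shape[0]   # divide by N, not N - 1
    return mu_hat, sigma_hat
```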

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities9
Estimation of Probability Densities

  • Nonparametric Methods

    • “Nonparametric methods are more general methods of probability density estimation that are based on existing data, but without an assumption about a functional form of the probability density function.”

    • Nonparametric techniques:

      • Histogram

      • Kernel-based method

      • k-nearest neighbors

      • Nearest neighbors

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities10
Estimation of Probability Densities

  • Nonparametric Methods

    • General Idea

    • Determine an estimate of a true probability density p(x) based on the available limited-size samples

      • The probability that a new pattern x will fall inside a region R

      • Approximation of the probability for a small region and for continuous p(x), with almost the same values within a region R

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities11
Estimation of Probability Densities

  • Nonparametric Methods

    • General Idea

      • The probability that for N sample patterns set k of them will fall in a region R

      • Estimate of the probability P

      • Approximation for a probability density function for a given pattern x

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities12
Estimation of Probability Densities

  • Nonparametric Methods

    • Kernel-based Method and Parzen Window

      • The kernel-based method fixes a region R (and thus a region volume V) around a pattern vector x and counts the number k of given training patterns falling into this region, using a special kernel function associated with the region.

      • Such a kernel function is also called a Parzen window

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities13
Estimation of Probability Densities

  • Nonparametric Methods

    • Hypercube-type Parzen window

      • Volume of the hypercube:

      • Kernel (window) function:

      • Total number of patterns falling within the hypercube

      • The estimate of the probability density function

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities14
Estimation of Probability Densities

  • Nonparametric Methods

    • Smooth estimate of the probability density function

      • A kernel function must satisfy two conditions:

      • and

      • For example, the radial symmetric multivariate Gaussian (normal) kernel:

      • The estimate of the probability density function:

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities15
Estimation of Probability Densities

  • Nonparametric Methods

    • Smooth estimate of the probability density function

      • The estimate of the class-dependent p(x|ck) probability density:

      • The estimate of the class-dependent p(x|ck) probability density for the Gaussian kernel:
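A minimal sketch of the Gaussian-kernel (Parzen window) estimate; restricting the training patterns X to those of a single class ck turns it into the class-conditional estimate p(x|ck):

```python
import numpy as np

def parzen_gaussian_density(x, X, h):
    """Kernel (Parzen window) density estimate with a radially symmetric
    Gaussian kernel: p_hat(x) = (1/N) * sum_i N(x; x_i, h^2 I).

    x -- query point, shape (n,); X -- training patterns, shape (N, n); h -- smoothing width
    """
    N, n = X.shape
    sq_dists = np.sum((X - x) ** 2, axis=1)
    kernels = np.exp(-sq_dists / (2.0 * h ** 2)) / ((2.0 * np.pi) ** (n / 2) * h ** n)
    return kernels.mean()
```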

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities16
Estimation of Probability Densities

  • Nonparametric Methods

    • Design issues

      • The selection of a kernel function:

        • Parzen window, Gaussian kernel, etc.

      • The selection of a smoothing parameter

    • The generalization ability of the kernel-based density estimation depends on the training set and on smoothing parameters

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities17
Estimation of Probability Densities

  • Nonparametric Methods

    • K-nearest Neighbors

      • “A method of probability density estimation with variable size regions”

      • First, a small n-dimensional sphere is located in the pattern space centered at the point x.

      • Second, a radius of this sphere is extended until the sphere contains exactly the fixed number k of patterns from a given training set.

      • Then an estimate of the probability density for x is computed as

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities18
Estimation of Probability Densities

  • Nonparametric Methods

    • K-nearest Neighbors Classification Rule

      • First, for a given x, the first k-nearest neighbors from a training set should be found (regardless of a class label) based on a defined pattern distance measure.

      • Second, among the selected k nearest neighbors, numbers ni of patterns belonging to each class ci are computed.

      • Then, the predicted class cj assigned to x corresponds to a class for which nj is the largest.
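A compact sketch of this rule, using Euclidean distance as the assumed pattern distance measure:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X, labels, k):
    """k-nearest neighbors rule: find the k training patterns closest to x
    and predict the most frequent class among them."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```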

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities19
Estimation of Probability Densities

  • Nonparametric Methods

    • Nearest Neighbors Classification Rule

      • “The simple version of the k-nearest neighbors classification is for a number of neighbors k equal to one”

    • Algorithm

    • Given: A training set Ttra of N patterns x1, x2, …, xN labeled by l classes. A new pattern x.

      • Compute for a given x the nearest neighbor xj from the whole training set, based on the defined pattern distance measure distance(x, xi).

      • Assign to x the class cj of its nearest neighbor xj.

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities20
Estimation of Probability Densities

  • Semiparametric Methods

    • “Combination of parametric and nonparametric methods”

    • Two semiparametric methods

      • Functional approximation

      • Mixture models (mixtures of probability densities)

    • Major advantage

    • It can precisely fit component functions locally to specific regions of a feature space, based on what the existing data reveal about the probability distributions and their modalities

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities21
Estimation of Probability Densities

  • Semiparametric Methods

    • Functional Approximation

      • Approximation of density by the linear combination of m basis functions Φi(x):

      • Using a symmetric radial basis function

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities22
Estimation of Probability Densities

  • Semiparametric Methods

    • Functional Approximation

      • Gaussian radial function: “The most commonly used basis function”

      • Optimization criterion for the functional approximation of density

      • Optimal estimates for parameters:

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities23
Estimation of Probability Densities

  • Semiparametric Methods

    • The algorithm for functional approximation

    • Given: A training set Ttra of N patterns x1, x2, …, xN. The m orthonormal radial basis functions Φi(x) (i = 1, 2, …, m), along with their parameters.

      • Compute the estimates of unknown parameters

      • Form the model of the probability density as a functional approximation

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities24
Estimation of Probability Densities

  • Semiparametric Methods

    • Mixture Models (Mixtures of Probability Densities)

    • “These models are based on linear parametric combination of known probability density functions (for example, normal densities) localized in certain regions of data”

      • The linear mixture distribution

      • Simplified version:

Cios / Pedrycz / Swiniarski / Kurgan


Estimation of probability densities25
Estimation of Probability Densities

  • Distance Between Probability Densities and the Kullback-Leibler Distance

    • Distance

    • “We can define a distance between two densities, the true density p(x) and its approximate estimate p̂(x)”

    • Kullback-Leibler distance
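A small sketch of the Kullback-Leibler distance, evaluated here on discretized probability vectors (assumed strictly positive and normalized) rather than on continuous densities:

```python
import numpy as np

def kl_distance(p, q):
    """Kullback-Leibler distance between a true density p and an estimate q:
    D(p || q) = sum_x p(x) * ln(p(x) / q(x))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))
```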

Cios / Pedrycz / Swiniarski / Kurgan


Probabilistic neural network
Probabilistic Neural Network

  • Probabilistic Neural Network

  • “The PNN is a hardware implementation of the kernel-based method of density estimation and Bayesian optimal classification (providing minimization of the average probability of the classification error)”

    • Optimal Bayes’ classification rule

    • Kernel-based estimation of a probability density function

Cios / Pedrycz / Swiniarski / Kurgan


Probabilistic neural network1
Probabilistic Neural Network

  • Topology

Cios / Pedrycz / Swiniarski / Kurgan


Probabilistic neural network2
Probabilistic Neural Network

  • Details

    • An input layer (weightless) consists of n neurons (units), each receiving one element xi (i = 1,2,…, n) of the n-dimensional input pattern vector x.

    • A pattern layer consists of N neurons (units, nodes), each representing one reference pattern from the training set Ttra .

    • The transfer function of the pattern layer neuron implements a kernel function (a Parzen window)

Cios / Pedrycz / Swiniarski / Kurgan


Probabilistic neural network3
Probabilistic Neural Network

  • Details

    • The weightless second hidden layer is the summation layer. The number of neurons in the summation layer is equal to the number of classes l.

    • The output activation function of the summation layer neuron is generally equal to but may be modified for different kernel functions.

    • The output layer is the classification decision layer that implements Bayes’ classification rule by selecting the largest value and thus decides a class cj for the pattern x

Cios / Pedrycz / Swiniarski / Kurgan


Probabilistic neural network4
Probabilistic Neural Network

  • Pattern Processing

  • “Processing of patterns by the already-designed PNN network is performed in the feedforward manner. The input pattern is presented to the network and processed forward by each layer. The resulting output is the predicted class”

  • PNN with the Radial Gaussian Kernel

    • Kernel function:

    • Transfer function:

    • Output activation function:
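A sketch of one feedforward pass of such a PNN, assuming equal class priors and a shared smoothing width sigma (names and helper are hypothetical):

```python
import numpy as np

def pnn_classify(x, X_train, labels, sigma):
    """PNN with a radial Gaussian kernel: the pattern layer computes
    exp(-||x - x_i||^2 / (2 sigma^2)) for every training pattern, the summation
    layer averages these per class (a kernel estimate of p(x | c_k) up to a
    constant factor), and the output layer picks the largest class activation."""
    labels = np.asarray(labels)
    activations = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    classes = sorted(set(labels.tolist()))
    class_sums = {c: activations[labels == c].mean() for c in classes}
    return max(classes, key=class_sums.get)
```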

Cios / Pedrycz / Swiniarski / Kurgan


Probabilistic neural network5
Probabilistic Neural Network

  • PNN with the Radial Gaussian Normal Kernel and Normalized Patterns

    • Transfer function:

    • Normalization of patterns allows for a simpler architecture of the pattern-layer neurons, which here also contain input weights and an exponential output activation function.

    • The transfer function of a pattern neuron can be divided into a neuron’s transfer function and an output activation function

    • The pattern-neuron output activation function:

Cios / Pedrycz / Swiniarski / Kurgan


Constraints in classifier design
Constraints in Classifier Design

  • Problems

    • Will a classifier guarantee minimization of the average probability of the classification error?

    • Does a training set well represent patterns generated by a physical phenomenon?

    • Are patterns drawn according to the characteristic of underlying phenomenon probability density?

    • Is the average probability of a classification error difficult to calculate?

Cios / Pedrycz / Swiniarski / Kurgan


Constraints in classifier design1
Constraints in Classifier Design

  • Suboptimal solutions of Bayesian classifier design

    • The estimation of class conditional probabilities is based on a limited sample

    • The samples are frequently collected randomly, and not by use of a well-planned experimental procedure

Cios / Pedrycz / Swiniarski / Kurgan


Regression
REGRESSION

  • Data Models

  • Simple Linear Regression Analysis

  • Multiple Regression

  • General Least Squares and Multiple Regression

  • Assessing the Quality of the Multiple Regression Model

Cios / Pedrycz / Swiniarski / Kurgan


Data models
Data Models

  • Mathematical models

  • “They are useful approximate representations of phenomena that generate data and may be used for prediction, classification, compression, or control design.”

  • Black-box models

    • Mathematical models obtained by processing existing data without using laws of physics governing data-generating phenomena

  • Regression analysis

    • Data analysis and model design are based on a sample from a given population

Cios / Pedrycz / Swiniarski / Kurgan


Data models1
Data Models

  • Categories of regression models

    • Simple linear regression

    • Multiple linear regression

    • Neural network-based linear regression

    • Polynomial regression

    • Logistic regression

    • Log-linear regression

    • Local piecewise linear regression

    • Nonlinear regression (with a nonlinear model)

    • Neural network-based nonlinear regression

Cios / Pedrycz / Swiniarski / Kurgan


Data models2
Data Models

  • Static and dynamic models

    • A static model produces outcomes based only on the current input (no internal memory).

    • A dynamic model produces outcomes based on the current input and the past history of the model behavior (internal memory)

Cios / Pedrycz / Swiniarski / Kurgan


Data models3
Data Models

  • Data gathering

    • Random sample from a certain population

    • N pairs of the experimental data set named Torig

Cios / Pedrycz / Swiniarski / Kurgan


Data models4
Data Models

  • Regression analysis

  • “A statistical method used to discover the relationship between variables and to design a data model that can be used to predict variable values based on other variables”

Cios / Pedrycz / Swiniarski / Kurgan


Data models5
Data Models

  • Regression analysis

    • A simple linear regression

      • To find the linear relationship between two variables, x and y, by discovering a linear model, i.e., a line equation y = b + ax, that best fits the given data and can be used to predict values of the data

      • This modeling line is called the regression line of y on x

      • The equation of that line is called a regression equation (regression model)

    • Typical linear regression analysis provides a prediction of a dependent variable y based on an independent variable x

Cios / Pedrycz / Swiniarski / Kurgan


Data models6
Data Models

  • Visualization of Regression

    • Scatter plot for height versus weight data

Cios / Pedrycz / Swiniarski / Kurgan


Data models7
Data Models

  • Visualization of Regression

    • Scatter plot for height versus weight data

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis
Simple Linear Regression Analysis

  • Sample data and Regression model

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis1
Simple Linear Regression Analysis

  • Assumptions

    • The observations yi (i = 1, …, N) are random samples and are mutually independent.

    • The regression error terms (the difference between the predicted value and the true value) are also mutually independent, with the same distribution (normal distribution with zero mean) and constant variances

    • The distribution of the error term is independent of the joint distribution of explanatory variables. It is also assumed that unknown parameters of regression models are constants

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis2
Simple Linear Regression Analysis

  • Simple Linear Regression Analysis

    • Evaluation of basic statistical characteristics of data

    • An estimation of the optimal parameters of a linear model

    • Assessment of model quality and of the generalization ability to predict the outcome for new data

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis3
Simple Linear Regression Analysis

  • Model Structure

    • Nonlinear data:

    • Generally, a function f(x) could be nonlinear in x:

    • Linear form :

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis4
Simple Linear Regression Analysis

  • Regression Error (residual error)

    • Difference between real-value yi and predicted-value yi,est

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis5
Simple Linear Regression Analysis

  • Performance Criterion – Sum of Squared Errors.

    • The sum of squared errors performance criterion for multiple regression

    • The minimization technique in regression uses the sum of squared errors as a criterion; this is the method of least squares of errors (LSE) or, in short, the method of least squares

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis6
Simple Linear Regression Analysis

  • Basic Statistical Characteristics of Data

    • The mean of N samples

    • The variance

    • The covariance

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis7
Simple Linear Regression Analysis

  • Sum of Squared Variations in y Caused by the Regression Model

    • The total sum of squared variations in y

    • These formulas are used to define important regression measures (for example, the correlation coefficient)

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis8
Simple Linear Regression Analysis

  • Computing Optimal Values of the Regression Model Parameters

    • The optimal model parameters values have to be computed based on the given data set and the defined performance criterion

    • Methods for estimation of optimal model parameter values

      • The analytical offline method

      • The analytical recursive offline method

      • Iterative search for optimal model parameters

      • Neural network-based regression

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis9
Simple Linear Regression Analysis

  • Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model

    • The general linear model structure

    • The performance criterion

    • and performance curve

y = ax (a model with b=0)

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis10
Simple Linear Regression Analysis

  • Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model

    • The optimal parameters

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis11
Simple Linear Regression Analysis

  • Procedure for simple linear regression

    • Given: The number N of experimental observations, and the set of the N experimental data points { (xi, yi), i = 1, 2, …, N }

    • Compute the statistical characteristics of the data

    • Compute the estimates of the model optimal parameters using Equations

    • Assess the regression model quality, indicating how well the model fits the data. Compute

      • Standard error of estimate

      • Correlation coefficient r

      • Coefficient of determination r2

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis12
Simple Linear Regression Analysis

Example

  • Sample of four data points

  • Resulting regression line

  • y = 0.9 + 0.56x
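A minimal sketch of the least-squares computation behind this example, using the standard formulas; applied to the slide’s four data points it should reproduce the line y = 0.9 + 0.56x:

```python
import numpy as np

def simple_linear_regression(x, y):
    """Least-squares fit of the line y = b + a*x using
    a = S_xy / S_xx and b = y_bar - a * x_bar."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    a = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b = y_bar - a * x_bar
    return a, b
```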

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis13
Simple Linear Regression Analysis

  • Optimal Parameter Values in the Minimum Least Squares Sense

    • Required conditions for a valid linear regression

      • The error term e = y - (b + ax) is normally distributed

      • The error variance is the same for all values of x

      • Errors are independent of each other.

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis14
Simple Linear Regression Analysis

  • Quality of the Linear Regression Model and Linear Correlation Analysis

    • Assessment of model quality

      • The resulting correlation coefficient can be used as a measure of how well the trends predicted by the model follow the trends in the training data

      • The coefficient of determination can be used to measure how well the regression line fits the data points

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis15
Simple Linear Regression Analysis

  • Correlation coefficient

  • Coefficient of determination

    • The percent of variation in the dependent variable y that can be explained by the regression equation,

    • the explained variation in y divided by the total variation, or

    • the square of r (correlation coefficient)

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis16
Simple Linear Regression Analysis

  • Coefficient of determination

    • Explained and unexplained variation in y

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis17
Simple Linear Regression Analysis

  • Coefficient of determination

    • Example

      • If the coefficient of correlation has the value r = 0.9327, then the value of the coefficient of determination is r2 = 0.8700. This means that 87% of the total variation in y can be explained by the linear relationship between x and y, as described by the optimal regression model of the data. The remaining 13% of the total variation in y remains unexplained.

    • The calculation of coefficient of determination
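A small sketch of that calculation (a hypothetical helper; y holds the observed values and y_est the model predictions):

```python
import numpy as np

def r_squared(y, y_est):
    """Coefficient of determination: explained variation divided by total
    variation, r^2 = 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_est = np.asarray(y_est, dtype=float)
    ss_res = np.sum((y - y_est) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```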

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis18
Simple Linear Regression Analysis

  • Matrix Version of Simple Linear Regression Based on Least Squares Method

    • The matrix form of the model description (the estimation of ) for all N experimental data points

    • The regression error

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis19
Simple Linear Regression Analysis

  • Matrix Version of Simple Linear Regression Based on Least Squares Method

    • The performance criterion:

    • Optimal parameters:

    • The value of the criterion for the optimal parameter vector:

    • The regression error for the model with the optimal parameter vector:
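A minimal numeric sketch of these matrix formulas, assuming the model y = b + ax so that the design matrix has a column of ones and a column of x values:

```python
import numpy as np

def matrix_simple_regression(x, y):
    """Matrix (least-squares) version of simple linear regression: build the
    design matrix X = [1, x] and solve the normal equations
    theta = (X^T X)^{-1} X^T y, giving theta = [b, a]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    residuals = y - X @ theta            # regression error for the optimal parameters
    return theta, residuals
```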

Cios / Pedrycz / Swiniarski / Kurgan


Simple linear regression analysis20
Simple Linear Regression Analysis

  • Matrix Version of Simple Linear Regression Based on Least Squares Method

    • Example: let us consider again the dataset shown in the following table

    • y = 0.56x + 0.9

Cios / Pedrycz / Swiniarski / Kurgan


Multiple regression
Multiple Regression

  • Definition

  • Multiple regression analysis is the statistical technique of exploring the relation (association) between a set of n independent variables and one (or, generally, more) dependent variable y whose variability they are used to explain

    • Linear multiple regression model

    • Linear multiple regression model using vector notation

    • This regression model is represented by a hyperplane in (n + 1)-dimensional space.

Cios / Pedrycz / Swiniarski / Kurgan


Multiple regression1
Multiple Regression

  • Geometrical Interpretation: Regression Errors

  • The goal of multiple regression is to find a hyperplane in the (n + 1)-dimensional space that will best fit the data

    • The performance criterion

    • The error variance and standard error of the estimate

Cios / Pedrycz / Swiniarski / Kurgan


Multiple regression2
Multiple Regression

  • Degree of Freedom

    • The denominator N – n – 1 in the previous equation tells us that in multiple regression with n independent variables, the standard error has N – n – 1 degrees of freedom

    • The degrees of freedom have been reduced from N by n + 1 because the n + 1 numerical parameters a0, a1, a2, …, an of the regression model have been estimated from the data

Cios / Pedrycz / Swiniarski / Kurgan


General least squares and multiple regression
General Least Squares and Multiple Regression

  • General model description in function form

  • Data model

  • Performance criterion

  • Regression error

Cios / Pedrycz / Swiniarski / Kurgan


General least squares and multiple regression1
General Least Squares and Multiple Regression

  • General model description in matrix form

  • Data model

  • Performance criterion

  • Optimal parameters

Cios / Pedrycz / Swiniarski / Kurgan


General least squares and multiple regression2
General Least Squares and Multiple Regression

  • Practical, Numerically Stable Computation of the Optimal Model Parameters

  • Problem

  • “The solution for the optimal least-squares parameters is almost never computed directly from this equation, due to its poor numerical performance when the matrix to be inverted (the covariance matrix) is ill conditioned”

  • Solution: various matrix decomposition methods
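A sketch of this advice using a decomposition-based solver rather than an explicit inverse of the normal-equations matrix:

```python
import numpy as np

def fit_least_squares(X, y):
    """General least-squares parameter estimation. Rather than forming
    a = (X^T X)^{-1} X^T y explicitly, which behaves poorly when X^T X is
    ill conditioned, use NumPy's decomposition-based solver."""
    a, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
    return a
```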

Cios / Pedrycz / Swiniarski / Kurgan


Assessing the quality of the multiple regression model
Assessing the Quality of the Multiple Regression Model

  • The Coefficient of Multiple Determination, R2

  • “The percent of the variance in the dependent variable that can be explained by all of the independent variables taken together.”

    • Adjusted R2

    • Adjusted R2 uses the number of design parameters (plus a constant) used in the model and the number of data points N in order to correct this coefficient in situations when unnecessary parameters are used in the model structure

Cios / Pedrycz / Swiniarski / Kurgan


Assessing the quality of the multiple regression model1
Assessing the Quality of the Multiple Regression Model

  • Cp Statistic

    • The Cp statistic is used to compare multiple regression models

    • When comparing alternative regression models, the designer aims to choose models whose values of Cp are close to or below (n + 1)

Cios / Pedrycz / Swiniarski / Kurgan


Assessing the quality of the multiple regression model2
Assessing the Quality of the Multiple Regression Model

  • Multiple Correlation

    • A value of R can be found as the positive square root of R2 (coefficient of multiple determination)

    • It is a measure of the strength of the linear relationship between the dependent variable y and the set of independent variables x1, x2, …, xn.

    • A value of R close to 1 indicates that the fit is very good

    • A value near zero indicates that the model is not a good approximation of the data and cannot be efficiently used for prediction

Cios / Pedrycz / Swiniarski / Kurgan


Assessing the quality of the multiple regression model3
Assessing the Quality of the Multiple Regression Model

Example

  • “Let us consider a multiple linear regression analysis for the data set containing N = 4 cases, composed with one dependent variable y and two independent variables x1 and x2”

    • Three-dimensional data

Cios / Pedrycz / Swiniarski / Kurgan


Assessing the quality of the multiple regression model4
Assessing the Quality of the Multiple Regression Model

Example

  • The scatter plot of data points in three-dimensional space (x1, x2, y)

Cios / Pedrycz / Swiniarski / Kurgan


Assessing the quality of the multiple regression model5
Assessing the Quality of the Multiple Regression Model

Example

  • The data matrix

  • The optimal model parameters

Cios / Pedrycz / Swiniarski / Kurgan


Assessing the quality of the multiple regression model6
Assessing the Quality of the Multiple Regression Model

Example

  • The optimal model:

  • y = 3.1+0.9x1+0.56x2

  • The optimal regression model in (x1, x2, y) space :

Cios / Pedrycz / Swiniarski / Kurgan


Assessing the quality of the multiple regression model7
Assessing the Quality of the Multiple Regression Model

Example

  • Multiple regression: regression plane model and scatter plot

Cios / Pedrycz / Swiniarski / Kurgan


Assessing the quality of the multiple regression model8
Assessing the Quality of the Multiple Regression Model

Example

  • The residuals (errors)

  • The criterion value for the optimal parameters: 0.016

Cios / Pedrycz / Swiniarski / Kurgan


References
References

  • Bishop, C.M. 1995. Neural Networks for Pattern Recognition. Oxford University Press

  • Cios, K.J., Pedrycz, W., and Swiniarski, R. 1998. Data Mining Methods for Knowledge Discovery. Kluwer

  • Draper, N.R., and Smith, H. Applied Regression Analysis. Wiley Series in Probability and Statistics

  • Duda, R.O., Hart, P.E., and Stork, D.G. 2001. Pattern Classification. Wiley

  • Myers, R.H. 1986. Classical and Modern Regression with Applications. Boston, MA: Duxbury Press

Cios / Pedrycz / Swiniarski / Kurgan