
# Chapter 11 Supervised Learning: STATISTICAL METHODS



### Chapter 11 Supervised Learning: STATISTICAL METHODS

Cios / Pedrycz / Swiniarski / Kurgan

• Bayesian Methods

• Basics of Bayesian Methods

• Bayesian Classification – General Case

• Classification that Minimizes Risk

• Decision Regions and Probability of Errors

• Discriminant Functions

• Estimation of Probability Densities

• Probabilistic Neural Network

• Constraints in Classifier Design


• Regression

• Data Models

• Simple Linear Regression

• Multiple Regression

• General Least Squares and Multiple Regression

• Assessing Quality of the Multiple Regression Model


Statistical processing based on the Bayes decision theory is a fundamental technique for pattern recognition and classification.

The Bayes decision theory provides a framework for statistical methods for classifying patterns into classes based on probabilities of patterns and their features.


Let us assume an experiment involving recognition of two kinds of birds: an eagle and a hawk.

States of nature C = { “an eagle”, “a hawk” }

Values of C = { c1, c2 } = { “an eagle”, “a hawk” }

We may assume that, among the large number N of prior observations, a number neagle of them belonged to class c1 (“an eagle”)

and a number nhawk belonged to class c2 (“a hawk”) (with neagle + nhawk = N)


• A priori (prior) probability P(ci):

• P(ci) denotes the (unconditional) probability that an object belongs to class ci, without any further information about the object.

• Estimation of a prior P(ci) from the N prior observations: $P(c_1) \approx n_{eagle}/N$ and $P(c_2) \approx n_{hawk}/N$


The a priori probabilities P(c1) and P(c2) represent our initial knowledge (in statistical terms) about how likely it is that an eagle or a hawk may emerge even before a bird physically appears.

• Natural and best decision:

“Assign a bird to a class c1 if P(c1) > P(c2); otherwise, assign a bird to a class c2 ”

• The probability of classification error:

$$P(\text{classification error}) = \begin{cases} P(c_2) & \text{if we decide } C = c_1 \\ P(c_1) & \text{if we decide } C = c_2 \end{cases}$$


• Feature variable / feature x

• It characterizes an object and allows for better discrimination of one class from another

• We assume it to be a continuous random variable taking continuous values from a given range

• The variability of a random variable x can be expressed in probabilistic terms

• We represent the distribution of a random variable x by the class conditional probability density function (the state conditional probability density function) p(x|ci), i = 1, 2


Examples of probability densities


• Probability density function p(x|ci)

• also called the likelihood of a class ci with respect to the value x of a feature variable

• the likelihood that an object belongs to class ci is bigger if p(x|ci) is larger

• joint probability density function p(ci , x)

• A probability density that an object is in a class ci and has a feature variable value x.

• A posteriori (posterior) probability P(ci|x)

• The conditional probability function P(ci|x) (i = 1, 2), which specifies the probability that the object class is ci given that the measured value of a feature variable is x.


• Bayes’ rule / Bayes’ theorem

• From probability theory (see Appendix B): $p(c_i, x) = P(c_i|x)\,p(x) = p(x|c_i)\,P(c_i)$

• An unconditional probability density function: $p(x) = p(x|c_1)\,P(c_1) + p(x|c_2)\,P(c_2)$


• Bayes’ rule

• “The conditional probability P(ci|x) can be expressed in terms of the a priori probability function P(ci), together with the class conditional probability density function p(x|ci).”

$$P(c_i|x) = \frac{p(x|c_i)\,P(c_i)}{p(x)}, \quad i = 1, 2$$


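• Bayes’ decision rule

• Decide class c1 if P(c1|x) > P(c2|x); otherwise decide class c2

• The probability of classification error for a given x:

$$P(\text{classification error} \mid x) = \begin{cases} P(c_2|x) & \text{if we decide } C = c_1 \\ P(c_1|x) & \text{if we decide } C = c_2 \end{cases}$$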

• “This statistical classification rule is best in the sense of minimizing the probability of misclassification (the probability of classification error)”

• Bayes’ classification rule guarantees minimization of the average probability of classification error


• Example

• Let us consider a bird classification problem with P(c1) = P(“an eagle”) = 0.8 and P(c2) = P(“a hawk”) = 0.2 and known probability density functions p(x|c1) and p(x|c2).

• Assume that, for a new bird, we have measured its size x = 45 cm and for this value we computed $p(45|c_1) = 2.2828 \cdot 10^{-2}$ and $p(45|c_2) = 1.1053 \cdot 10^{-2}$.

• Thus, the classification rule predicts class c1 (“an eagle”) because p(x|c1)P(c1) > p(x|c2)P(c2) ($2.2828 \cdot 10^{-2} \cdot 0.8 > 1.1053 \cdot 10^{-2} \cdot 0.2$). Let us assume that the value of the unconditional density p(x) is known to be p(45) = 0.3. The probability of classification error is then $P(\text{classification error} \mid x) = P(c_2|x) = \dfrac{p(45|c_2)\,P(c_2)}{p(45)} = \dfrac{1.1053 \cdot 10^{-2} \cdot 0.2}{0.3} \approx 0.0074$
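The arithmetic of this example can be checked with a short Python sketch. The numbers are those given in the example; the variable names are illustrative. The last lines compute the evidence from the two-class total probability formula purely for comparison, since the example itself takes p(45) = 0.3 as given.

```python
# Worked check of the eagle/hawk example (values taken from the example).
priors = {"eagle": 0.8, "hawk": 0.2}
likelihoods = {"eagle": 2.2828e-2, "hawk": 1.1053e-2}  # p(45 | class)

# Unnormalized posteriors p(x|c) * P(c); these are enough to pick the class.
scores = {c: likelihoods[c] * priors[c] for c in priors}
decision = max(scores, key=scores.get)

# Error probability using the evidence value assumed in the example.
p_x_assumed = 0.3
error_assumed = scores["hawk"] / p_x_assumed

# For comparison only (not from the example): evidence computed as
# p(x) = sum_i p(x|ci) P(ci), and the posteriors it implies.
p_x_total = sum(scores.values())
posteriors = {c: scores[c] / p_x_total for c in scores}

print(decision, round(error_assumed, 4), round(posteriors["hawk"], 3))
```

With the assumed p(45) = 0.3 the error probability is about 0.0074; normalizing by the two-class sum instead gives a hawk posterior of roughly 0.108. The decision is the same in both cases, but the error estimate depends on the evidence value used.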


• Bayes’ Classification Rule for Multiclass Multifeature Objects

• Real-valued features of an object as an n-dimensional column vector x ∈ R^n: $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$

• The object may belong to l distinct classes (l distinct states of nature): $C = \{c_1, c_2, \ldots, c_l\}$


• Bayes’ Classification Rule for Multiclass Multifeature Objects

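• Bayes’ theorem: $P(c_i|\mathbf{x}) = \dfrac{p(\mathbf{x}|c_i)\,P(c_i)}{p(\mathbf{x})}$

• A priori probability: P(ci) (i = 1, 2, …, l)

• Class conditional probability density function: p(x|ci)

• A posteriori (posterior) probability: P(ci|x)

• Unconditional probability density function: $p(\mathbf{x}) = \sum_{i=1}^{l} p(\mathbf{x}|c_i)\,P(c_i)$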


• Bayes’ Classification Rule for Multiclass Multifeature Objects

• Bayes classification rule:

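• A given object with a given value x of a feature vector can be classified as belonging to class cj when: $P(c_j|\mathbf{x}) > P(c_i|\mathbf{x})$ for all $i = 1, 2, \ldots, l$, $i \neq j$

• Equivalently, assign an object with a given value x of a feature vector to class cj when: $p(\mathbf{x}|c_j)\,P(c_j) > p(\mathbf{x}|c_i)\,P(c_i)$ for all $i \neq j$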


• Basic Idea

• To incorporate the fact that some misclassifications are more costly than others, we define a classification based on a minimization criterion that involves a loss for a given classification decision under a given true state of nature

• A loss function

• Cost (penalty, weight) of assigning an object to class cj when in fact the true class is ci


• A loss matrix

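• We denote a loss function by the matrix Lij for l-class classification problems

• Expected (average) conditional loss of deciding class cj for a given x: $R(c_j|\mathbf{x}) = \sum_{i=1}^{l} L(\text{decision } c_j \mid \text{true } c_i)\,P(c_i|\mathbf{x})$

• In short, $R_j = \sum_{i=1}^{l} L_{ij}\,P(c_i|\mathbf{x})$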


• Overall Risk

• The overall risk R can be considered as a classification criterion for minimizing risk related to a classification decision.

• Bayes risk

• Minimal overall risk R leads to the generalization of Bayes’ rule for minimization of probability of the classification error.


• Bayes’ classification rule with Bayes risk

• Choose a decision (a class) ci for which the conditional risk is minimal: $R(c_i|\mathbf{x}) = \min_{k = 1, \ldots, l} R(c_k|\mathbf{x})$
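The minimum-risk rule can be illustrated with a small Python sketch. It assumes the posteriors P(ci|x) are already available and uses the convention that `loss[i][j]` is the cost of deciding class j when the true class is i; the function names and the numeric values are hypothetical.

```python
def conditional_risks(posteriors, loss):
    """Return R(cj|x) = sum_i loss[i][j] * P(ci|x) for every decision j."""
    n = len(posteriors)
    return [sum(loss[i][j] * posteriors[i] for i in range(n)) for j in range(n)]


def min_risk_decision(posteriors, loss):
    """Pick the decision with the smallest conditional risk."""
    risks = conditional_risks(posteriors, loss)
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks


posteriors = [0.7, 0.3]              # hypothetical P(c1|x), P(c2|x)
zero_one = [[0, 1], [1, 0]]          # symmetric zero-one loss
asymmetric = [[0, 1], [5, 0]]        # missing the second class costs 5x (hypothetical)

decision_01, risks_01 = min_risk_decision(posteriors, zero_one)
decision_asym, risks_asym = min_risk_decision(posteriors, asymmetric)
print(decision_01, risks_01)
print(decision_asym, risks_asym)
```

Under the zero-one loss the decision coincides with the largest posterior; with the asymmetric loss the less probable class is chosen because overlooking it is more expensive.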


• Bayesian Classification Minimizing the Probability of Error

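• Symmetrical zero-one conditional loss function: $L_{ij} = 0$ for $i = j$ and $L_{ij} = 1$ for $i \neq j$

• The conditional risk R(cj|x) criterion is the same as the average probability of classification error: $R(c_j|\mathbf{x}) = \sum_{i \neq j} P(c_i|\mathbf{x}) = 1 - P(c_j|\mathbf{x})$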

• An average probability of classification error is thus used as a criterion of minimization for selecting the best classification decision


• Generalization of the Maximum Likelihood Classification

• Generalized likelihood ratio for classes ci and cj

• Generalized threshold value

• The maximum likelihood classification rule

• “Decide a class cj if


• Decision regions

• A classifier divides the feature space into l disjoint decision subspaces R1,R2, … Rl

• The region Ri is a subspace such that each realization x of a feature vector of an object falling into this region will be assigned to a class ci


• Decision boundaries (decision surfaces)

• The regions do not intersect; the boundaries between adjacent regions are the decision boundaries (decision surfaces)

• “The task of a classifier design is to find classification rules that will guarantee division of a feature space into optimal decision regions R1, R2, …, Rl (with optimal decision boundaries) that will minimize a selected classification performance criterion”


• Decision boundaries


• Optimal classification with decision regions

• Average probability of correct classification: $P(\text{classification\_correct}) = \sum_{i=1}^{l} \int_{R_i} p(\mathbf{x}|c_i)\,P(c_i)\,d\mathbf{x}$

• “Classification problems can be stated as choosing decision regions Ri (thus defining a classification rule) that maximize the probability P(classification_correct) of correct classification, which serves as the optimization criterion”


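• Discriminant functions: $d_i(\mathbf{x})$, $i = 1, 2, \ldots, l$

• Discriminant type classifier

• It assigns an object with a given value x of a feature vector to a class cj if $d_j(\mathbf{x}) > d_i(\mathbf{x})$ for all $i = 1, 2, \ldots, l$, $i \neq j$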

• Classification rule for a discriminant function-based classifier

• Compute numerical values of all discriminant functions for x

• Choose a class cj as a prediction of true class for which a value of the associated discriminant function dj(x) is the largest:

• Select a class cj for which dj(x) = max(di(x)); i = 1, 2, …, l


• Discriminant classifier


• Discriminant type classifier for Bayesian classification

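• The natural choice for the discriminant function is the a posteriori conditional probability P(ci|x): $d_i(\mathbf{x}) = P(c_i|\mathbf{x})$

• Practical versions using Bayes’ theorem: $d_i(\mathbf{x}) = \dfrac{p(\mathbf{x}|c_i)\,P(c_i)}{p(\mathbf{x})}$ or, dropping the class-independent denominator, $d_i(\mathbf{x}) = p(\mathbf{x}|c_i)\,P(c_i)$

• Bayesian discriminant in a natural logarithmic form: $d_i(\mathbf{x}) = \ln p(\mathbf{x}|c_i) + \ln P(c_i)$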


• Characteristics of discriminant function

• Discriminant functions define the decision boundaries that separate the decision regions

• Generally, the decision boundary between two neighboring decision regions is the set of points where the corresponding discriminant function values are equal

• The decision boundaries are unaffected by a monotonically increasing transformation of the discriminant functions


• Bayesian Discriminant Functions for Two Classes

• General case

• Two discriminant functions: d1(x) and d2(x).

• Two decision regions: R1 and R2.

• The decision boundary: d1(x) = d2(x).

• Using dichotomizer

• Single discriminant function: d(x) = d1(x) − d2(x).


• Quadratic and Linear Discriminants Derived from the Bayes Rule

• Assumption:

• A multivariate normal (Gaussian) distribution of the feature vector x within each class

• The Bayesian discriminant (from the previous section): $d_i(\mathbf{x}) = \ln p(\mathbf{x}|c_i) + \ln P(c_i)$


• Quadratic and Linear Discriminants Derived from the Bayes Rule

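• Gaussian form of the class conditional probability density function: $p(\mathbf{x}|c_i) = \dfrac{1}{(2\pi)^{n/2}\,|\Sigma_i|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)\right)$

• The resulting quadratic discriminant (dropping the class-independent constant): $d_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(c_i)$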

• Decision boundaries:

• hyperquadratic functions in n-dimensional feature space (hyperspheres, hyperellipsoids, hyperparaboloids, etc.)


• Quadratic and Linear Discriminants Derived from the Bayes Rule

• Given: A pattern x. Values of state conditional probability densities p(x|ci) and the a priori probabilities P(ci)

• Compute values of the mean vectors μi and the covariance matrices Σi for all classes i = 1, 2, …, l based on the training set

• Compute values of the discriminant function for all classes

• Choose a class cj as a prediction of true class for which a value of the associated discriminant function dj(x) is largest:

• Select a class cj for which dj(x) = max(di(x)) i = 1, 2, …, l
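The steps above can be sketched in plain Python for two features. This is a minimal illustration, assuming the class means, covariance matrices, and priors have already been estimated; the discriminant used is the standard Gaussian form with the class-independent constant dropped, and all numeric values are hypothetical.

```python
import math


def quadratic_discriminant(x, mean, cov, prior):
    """d(x) = -0.5 (x-m)^T S^-1 (x-m) - 0.5 ln|S| + ln P for a 2x2 covariance S."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mean[0], x[1] - mean[1]]
    maha = sum(dx[r] * inv[r][s] * dx[s] for r in range(2) for s in range(2))
    return -0.5 * maha - 0.5 * math.log(det) + math.log(prior)


# Hypothetical class parameters (normally estimated from the training set).
classes = [
    {"mean": [0.0, 0.0], "cov": [[1.0, 0.0], [0.0, 1.0]], "prior": 0.5},
    {"mean": [3.0, 3.0], "cov": [[2.0, 0.5], [0.5, 1.0]], "prior": 0.5},
]


def classify(x):
    """Return the index of the class with the largest discriminant value."""
    values = [quadratic_discriminant(x, c["mean"], c["cov"], c["prior"]) for c in classes]
    return max(range(len(values)), key=values.__getitem__)


print(classify([0.2, -0.1]), classify([2.8, 3.1]))
```

Because each class keeps its own covariance matrix, the boundary between the two classes in this sketch is a quadratic curve rather than a straight line.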


• Quadratic and Linear Discriminants Derived from the Bayes Rule

• Linear Discriminant:

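• Assumption: equal covariances for all classes Σi = Σ

• A linear form of discriminant functions: $d_i(\mathbf{x}) = \boldsymbol{\mu}_i^T \Sigma^{-1} \mathbf{x} - \tfrac{1}{2}\boldsymbol{\mu}_i^T \Sigma^{-1} \boldsymbol{\mu}_i + \ln P(c_i)$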


• Quadratic and Linear Discriminants Derived from the Bayes Rule

• Linear Discriminant:

Decision boundaries between classes i and j, for which di(x) = dj(x), are pieces of hyperplanes in n-dimensional feature space


• Quadratic and Linear Discriminants Derived from the Bayes Rule

• The classification process using linear discriminants

• Compute, for a given x, numerical values of discriminant functions for all classes:

• Choose a class cj for which a value of the discriminant function dj(x) is largest:

• Select a class cj for which dj(x) = max(di(x)) i = 1, 2, …, l


• Example

• Let us assume that the following two-feature patterns x ∈ R^2 from two classes c1 = 0 and c2 = 1 have been drawn according to the Gaussian (normal) density distribution:


• Example

• The estimates of the symmetric covariance matrices for both classes

• The linear discriminant functions for both classes


• Example

• Two-class two-feature pattern dichotomizer.


• Minimum Mahalanobis Distance Classifier

• Assumption

• Equal covariances for all classes Σi = Σ (i = 1, 2, …, l)

• Equal a priori probabilities for all classes P(ci) = P

• Discriminant function


• Minimum Mahalanobis Distance Classifier

• A classifier selects the class cj for which a value x is nearest, in the sense of Mahalanobis distance, to the corresponding mean vector μj. This classifier is called a minimum Mahalanobis distance classifier.

• Linear version of the minimum Mahalanobis distance classifier


• Minimum Mahalanobis Distance Classifier

• Given: The mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of a feature vector

• Compute numerical values of the Mahalanobis distances between x and the means μi for all classes.

• Choose a class cj as a prediction of true class, for which the value of the associated Mahalanobis distance attains the minimum:
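The procedure can be sketched as follows, assuming the inverse of the shared covariance matrix is already known. The sketch uses the squared Mahalanobis distance $(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)$, which has the same minimizer as the distance itself; function names and numeric values are hypothetical.

```python
def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance (x - mean)^T cov_inv (x - mean)."""
    dx = [xi - mi for xi, mi in zip(x, mean)]
    n = len(dx)
    return sum(dx[r] * cov_inv[r][s] * dx[s] for r in range(n) for s in range(n))


def min_mahalanobis_class(x, means, cov_inv):
    """Return the index of the nearest class mean and all squared distances."""
    dists = [mahalanobis_sq(x, m, cov_inv) for m in means]
    return min(range(len(dists)), key=dists.__getitem__), dists


# Hypothetical shared covariance [[4, 0], [0, 1]]; its inverse is written out directly.
cov_inv = [[0.25, 0.0], [0.0, 1.0]]
means = [[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]]

label, dists = min_mahalanobis_class([2.5, 1.0], means, cov_inv)
print(label, dists)
```

Passing the identity matrix as `cov_inv` reduces the same function to the squared Euclidean distance.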


• Linear Discriminant for Statistically Independent Features

• Assumption

• Equal covariances for all classes Σi = Σ (i = 1, 2, …, l)

• Features are statistically independent

• Discriminant function

• where


• Linear Discriminant for Statistically Independent Features

• Discriminants

• Linear discriminant formula


• Linear Discriminant for Statistically Independent Features

• “Neural network” style as a linear threshold machine

• where

• The decision surfaces for the linear discriminants are pieces of hyperplanes defined by the equations di(x) − dj(x) = 0.


• Minimum Euclidean Distance Classifier

• Assumption

• Equal covariances for all classes Σi = Σ (i = 1, 2, …, l)

• Features are statistically independent

• Equal a priori probabilities for all classes P(ci) = P

• Discriminants

• or


• Minimum Euclidean Distance Classifier

• The minimum distance classifier, or minimum Euclidean distance classifier, selects the class cj for which a value x is nearest to the corresponding mean vector μj.

• Linear version of the minimum distance classifier


• Given: The mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of a feature vector

• Compute numerical values of Euclidean distances between x and the means μi for all classes:

• Choose a class cj as a prediction of true class for which a value of the associated Euclidean distance is smallest:


• Characteristics of Bayesian Normal Discriminant

• Assumptions

• multivariate normality within classes

• equal covariance matrices between classes

• Under these assumptions, the linear discriminant is equivalent to the optimal classifier

• In practice, these assumptions are satisfied only approximately

• Due to its simple structure, the linear discriminant tends not to overfit the training data set, which may lead to stronger generalization ability for unseen cases


• Basic Idea

• In Bayesian classifier design, the a priori probabilities and the class conditional probability densities must be estimated from a limited number of a priori observed objects. This estimation should be optimal according to a well-defined estimation criterion.

• Estimates of a priori probabilities: $\hat{P}(c_k) = n_k / N$, where nk is the number of observed objects from class ck among all N observed objects


• Estimation of the class conditional probability densities p(x|ci)

• Parametric methods

• with the assumption of a specific functional form of a probability density function

• Nonparametric methods

• without the assumption of a specific functional form of a probability density function

• Semiparametric method

• a combination of parametric and nonparametric methods


• Parametric Methods

• A priori observations of objects and corresponding patterns:

• Split set of all patterns X according to a class into l disjoint sets:

• Assume that the parametric form of the class conditional probability density is given as a function:

• where


• Parametric Methods

• If the probability density has a normal (Gaussian) form:

• where


• The Maximum Likelihood Estimation of Parameters

• Assumption

• we are given a limited-size set of N patterns xi: $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$

• we know a parametric form p(x|θ) of a conditional probability density function

• Goal

• The task of estimation is to find the optimal (the best according to the used criterion) value of the parameter vector θ of a given dimension m.


• The Maximum Likelihood Estimation of Parameters

• Likelihood

• The joint probability density L(θ) is a function of a parameter vector θ for a given set of patterns X. For independently drawn patterns, $L(\theta) = p(X|\theta) = \prod_{i=1}^{N} p(\mathbf{x}_i|\theta)$

• It is called the likelihood of θ for a given set of patterns X.

• Maximum Likelihood Estimation

• The function L(θ) can be chosen as a criterion for finding the optimal estimate of θ. Choosing the θ that maximizes L(θ) is called the maximum likelihood estimation of the parameters θ


• The Maximum Likelihood Estimation of Parameters

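• Minimizing the negative natural logarithm of the likelihood L(θ): $J(\theta) = -\ln L(\theta) = -\sum_{i=1}^{N} \ln p(\mathbf{x}_i|\theta)$

• For the differentiable function p(xi|θ): $\dfrac{\partial J(\theta)}{\partial \theta} = 0$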


• The Maximum Likelihood Estimation of Parameters

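• For the normal form of a probability density function N(µ, Σ) with unknown parameters µ and Σ constituting the vector θ: $\hat{\boldsymbol{\mu}} = \dfrac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i$, $\hat{\Sigma} = \dfrac{1}{N}\sum_{i=1}^{N} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T$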


• The Maximum Likelihood Estimation of Parameters

• Example of Maximum Likelihood Estimation

• for

• The maximum likelihood estimation criterion

• The maximum likelihood estimates for the parameters:
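As a small illustration of maximum likelihood estimation, the sketch below fits a one-dimensional normal density. It assumes independent samples and uses the standard closed-form estimates (sample mean and the 1/N variance); the sample values and function names are hypothetical and are not taken from the example on the slide.

```python
import math


def gaussian_mle(samples):
    """Closed-form ML estimates of the mean and variance of a 1-D normal density."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n  # note the 1/N (not 1/(N-1)) factor
    return mu, var


def neg_log_likelihood(samples, mu, var):
    """J = -ln L for a 1-D normal density with the given parameters."""
    n = len(samples)
    return 0.5 * n * math.log(2.0 * math.pi * var) + sum((x - mu) ** 2 for x in samples) / (2.0 * var)


samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
mu_hat, var_hat = gaussian_mle(samples)
print(mu_hat, var_hat, neg_log_likelihood(samples, mu_hat, var_hat))
```

Evaluating `neg_log_likelihood` at nearby parameter values gives larger results than at `(mu_hat, var_hat)`, which is one quick way to sanity-check the estimates.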


• Nonparametric Methods

• “Nonparametric methods are more general methods of probability density estimation that are based on existing data, but without an assumption about a functional form of the probability density function.”

• Nonparametric techniques:

• Histogram

• Kernel-based method

• k-nearest neighbors

• Nearest neighbors


• Nonparametric Methods

• General Idea

• Determine an estimate of a true probability density p(x) based on the available limited-size samples

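• The probability that a new pattern x will fall inside a region R: $P = \int_{R} p(\mathbf{x}')\,d\mathbf{x}'$

• Approximation of the probability for a small region and for continuous p(x), with almost the same values within a region R: $P \approx p(\mathbf{x})\,V$, where V is the volume of R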


• Nonparametric Methods

• General Idea

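• The probability that, out of a set of N sample patterns, k of them will fall in a region R: $P(k) = \binom{N}{k} P^k (1 - P)^{N-k}$

• Estimate of the probability P: $\hat{P} = k / N$

• Approximation for a probability density function for a given pattern x: $\hat{p}(\mathbf{x}) = \dfrac{k}{N V}$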


• Nonparametric Methods

• Kernel-based Method and Parzen Window

• Kernel-based method is based on fixing around a pattern vector x a region R (and thus a region volume V ) and counting a number k of given training patterns falling in this region by using a special kernel function associated with the region.

• Such a kernel function is also called a Parzen window


• Nonparametric Methods

• Hypercube-type Parzen window

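• Volume of the hypercube with edge length h: $V = h^n$

• Kernel (window) function: $\varphi(\mathbf{y}) = 1$ if $|y_j| \le 1/2$ for all $j = 1, \ldots, n$, and $\varphi(\mathbf{y}) = 0$ otherwise

• Total number of patterns falling within the hypercube: $k = \sum_{i=1}^{N} \varphi\left(\dfrac{\mathbf{x} - \mathbf{x}_i}{h}\right)$

• The estimate of the probability density function: $\hat{p}(\mathbf{x}) = \dfrac{1}{N h^n} \sum_{i=1}^{N} \varphi\left(\dfrac{\mathbf{x} - \mathbf{x}_i}{h}\right)$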
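A hypercube Parzen-window estimate can be sketched in a few lines of Python. The sketch assumes the usual unit-hypercube kernel (1 inside the cube of edge 1 centered at the origin, 0 outside) and an edge length `h`; the training patterns and function names are hypothetical.

```python
def hypercube_kernel(u):
    """1.0 if u lies inside the unit hypercube centered at the origin, else 0.0."""
    return 1.0 if all(abs(c) <= 0.5 for c in u) else 0.0


def parzen_density(x, patterns, h):
    """Estimate p(x) as k / (N * h^n), where k counts patterns inside the window."""
    n = len(x)
    k = sum(hypercube_kernel([(a - b) / h for a, b in zip(x, p)]) for p in patterns)
    return k / (len(patterns) * h ** n)


patterns = [[0.0], [0.2], [0.4], [1.5], [3.0]]  # hypothetical 1-D training patterns
estimate = parzen_density([0.2], patterns, h=1.0)
print(estimate)
```

Here three of the five patterns fall inside the window of width 1 around x = 0.2, so the estimate is 3 / (5 · 1) = 0.6. A smaller `h` gives a spikier estimate and a larger `h` a smoother one.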


• Nonparametric Methods

• Smooth estimate of the probability density function

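• A kernel function must satisfy two conditions: $\varphi(\mathbf{y}) \ge 0$

• and $\int \varphi(\mathbf{y})\,d\mathbf{y} = 1$

• For example, the radial symmetric multivariate Gaussian (normal) kernel: $\varphi(\mathbf{y}) = \dfrac{1}{(2\pi)^{n/2}} \exp\left(-\dfrac{\|\mathbf{y}\|^2}{2}\right)$

• The estimate of the probability density function: $\hat{p}(\mathbf{x}) = \dfrac{1}{N (2\pi)^{n/2} \sigma^n} \sum_{i=1}^{N} \exp\left(-\dfrac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2\sigma^2}\right)$, where σ is the smoothing parameter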


• Nonparametric Methods

• Smooth estimate of the probability density function

• The estimate of the class-dependent p(x|ck) probability density:

• The estimate of the class-dependent p(x|ck) probability density for the Gaussian kernel:


• Nonparametric Methods

• Design issues

• The selection of a kernel function:

• Parzen window, Gaussian kernel, etc.

• The selection of a smoothing parameter

• The generalization ability of the kernel-based density estimation depends on the training set and on smoothing parameters


• Nonparametric Methods

• K-nearest Neighbors

• “A method of probability density estimation with variable size regions”

• First, a small n-dimensional sphere is located in the pattern space centered at the point x.

• Second, a radius of this sphere is extended until the sphere contains exactly the fixed number k of patterns from a given training set.

• Then an estimate of the probability density for x is computed as $\hat{p}(\mathbf{x}) = \dfrac{k}{N V}$, where V is the volume of the final sphere


• Nonparametric Methods

• K-nearest Neighbors Classification Rule

• First, for a given x, the first k-nearest neighbors from a training set should be found (regardless of a class label) based on a defined pattern distance measure.

• Second, among the selected k nearest neighbors, numbers ni of patterns belonging to each class ci are computed.

• Then, the predicted class cj assigned to x corresponds to a class for which nj is the largest.
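The three steps of the k-nearest neighbors rule can be sketched as below. Euclidean distance is assumed as the pattern distance measure, ties between classes are not handled specially, and the training data and function name are hypothetical.

```python
import math
from collections import Counter


def knn_classify(x, patterns, labels, k):
    """Classify x by majority vote among its k nearest training patterns."""
    # Step 1: find the k nearest neighbors regardless of class label.
    order = sorted(range(len(patterns)), key=lambda i: math.dist(x, patterns[i]))
    # Step 2: count how many of them belong to each class.
    votes = Counter(labels[i] for i in order[:k])
    # Step 3: predict the class with the largest count.
    return votes.most_common(1)[0][0]


patterns = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]  # hypothetical training set
labels = [0, 0, 0, 1, 1, 1]

print(knn_classify([0.5, 0.5], patterns, labels, k=3))
print(knn_classify([5.2, 5.1], patterns, labels, k=3))
```

Calling the same function with `k=1` gives the nearest neighbor classification rule.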


• Nonparametric Methods

• Nearest Neighbors Classification Rule

• “The simple version of the k-nearest neighbors classification is for a number of neighbors k equal to one”

• Algorithm

• Given: A training set Ttra of N patterns x1, x2, …, xN labeled by l classes. A new pattern x.

• Compute for a given x the nearest neighbor xj from the whole training set based on the defined pattern distance measure distance(x, xi).

• Assign to x the class cj of its nearest neighbor xj.


• Semiparametric Methods

• “Combination of parametric and nonparametric methods”

• Two semiparametric methods

• Functional approximation

• Mixture models (mixtures of probability densities)

• These methods are able to precisely fit component functions locally to specific regions of a feature space, based on discoveries about probability distributions and their modalities from the existing data


• Semiparametric Methods

• Functional Approximation

• Approximation of density by the linear combination of m basis functions φi(x):

• Using a symmetric radial basis function


• Semiparametric Methods

• Functional Approximation

• Gaussian radial function: “The most commonly used basis function”

• Optimization criterion for the functional approximation of density

• Optimal estimates for parameters:


• Semiparametric Methods

• The algorithm for functional approximation

• Given: A training set Ttra of N patterns x1, x2, …, xN. The m orthonormal radial basis functions φi(x) (i = 1, 2, …, m), along with their parameters.

• Compute the estimates of unknown parameters

• Form the model of the probability density as a functional approximation


• Semiparametric Methods

• Mixture Models (Mixtures of Probability Densities)

• “These models are based on linear parametric combination of known probability density functions (for example, normal densities) localized in certain regions of data”

• The linear mixture distribution

• Simplified version:


• Distance Between Probability Densities and the Kullback-Leibler Distance

• Distance

• “We can define distance between two densities, with true density p(x) and its approximate estimate p̂(x)”

• Kullback-Leibler distance: $L = -\int p(\mathbf{x}) \ln \dfrac{\hat{p}(\mathbf{x})}{p(\mathbf{x})}\,d\mathbf{x}$
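The Kullback-Leibler distance can be approximated numerically for one-dimensional densities. The sketch below assumes the form $-\int p(x) \ln(\hat{p}(x)/p(x))\,dx$ and a simple midpoint rule on a finite interval; the two normal densities and the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a 1-D normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def kl_distance(p, q, lo, hi, steps=20000):
    """Midpoint-rule approximation of -integral of p(x) ln(q(x)/p(x)) over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        px, qx = p(x), q(x)
        if px > 0.0 and qx > 0.0:  # skip points where the logarithm is undefined
            total -= px * math.log(qx / px) * dx
    return total


def true_density(x):
    return normal_pdf(x, 0.0, 1.0)


def shifted_estimate(x):
    return normal_pdf(x, 1.0, 1.0)


d_same = kl_distance(true_density, true_density, -10.0, 10.0)
d_shifted = kl_distance(true_density, shifted_estimate, -10.0, 10.0)
print(d_same, d_shifted)
```

The distance is zero when the estimate equals the true density and grows as the estimate moves away from it; for two unit-variance normal densities whose means differ by 1, the exact value is 0.5.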


• Probabilistic Neural Network

• “The PNN is a hardware implementation of the kernel-based method of density estimation and Bayesian optimal classification (providing minimization of the average probability of the classification error)”

• Optimal Bayes’ classification rule

• Kernel-based estimation of a probability density function


• Topology


• Details

• An input layer (weightless) consists of n neurons (units), each receiving one element xi (i = 1,2,…, n) of the n-dimensional input pattern vector x.

• A pattern layer consists of N neurons (units, nodes), each representing one reference pattern from the training set Ttra .

• The transfer function of the pattern layer neuron implements a kernel function (a Parzen window)


• Details

• The weightless second hidden layer is the summation layer. The number of neurons in the summation layer is equal to the number of classes l.

• The output activation function of the summation layer neuron is generally equal to but may be modified for different kernel functions.

• The output layer is the classification decision layer that implements Bayes’ classification rule by selecting the largest value and thus decides a class cj for the pattern x


• Pattern Processing

• “Processing of patterns by the already-designed PNN network is performed in the feedforward manner. The input pattern is presented to the network and processed forward by each layer. The resulting output is the predicted class”

• PNN with the Radial Gaussian Kernel

• Kernel function:

• Transfer function:

• Output activation function:
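The feedforward processing of a PNN with a radial Gaussian kernel can be sketched in software as follows. This is an illustrative simplification, not the network's exact transfer and activation functions: it assumes one pattern unit per training pattern, a single smoothing parameter `sigma` shared by all units, and class priors estimated by class frequencies. The data and function name are hypothetical.

```python
import math


def pnn_classify(x, train_patterns, train_labels, sigma):
    """Return the predicted class and the per-class scores for pattern x."""
    sums = {}
    # Pattern layer: one Gaussian kernel unit per training pattern.
    for p, c in zip(train_patterns, train_labels):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
        kernel = math.exp(-sq_dist / (2.0 * sigma ** 2))
        # Summation layer: accumulate kernel outputs separately for each class.
        sums[c] = sums.get(c, 0.0) + kernel
    # Dividing by the total pattern count combines the frequency-based prior
    # with the class density estimate, up to a constant common to all classes.
    n_total = len(train_patterns)
    scores = {c: s / n_total for c, s in sums.items()}
    # Output layer: select the class with the largest score.
    return max(scores, key=scores.get), scores


patterns = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]  # hypothetical training set
labels = ["a", "a", "a", "b", "b", "b"]

print(pnn_classify([0.4, 0.3], patterns, labels, sigma=1.0)[0])
print(pnn_classify([5.5, 5.2], patterns, labels, sigma=1.0)[0])
```

No iterative training is involved: the training patterns themselves define the pattern layer, and only `sigma` has to be chosen.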


• PNN with the Radial Gaussian Normal Kernel and Normalized Patterns

• Transfer function:

• Normalization of patterns allows for a simpler architecture of the pattern-layer neurons, containing here also input weights and an exponential output activation function.

• The transfer function of a pattern neuron can be divided into a neuron’s transfer function and an output activation function

• The pattern-neuron output activation function:


• Problems

• Will a classifier guarantee minimization of the average probability of the classification error?

• Does a training set well represent patterns generated by a physical phenomenon?

• Are patterns drawn according to the characteristic of underlying phenomenon probability density?

• Is the average probability of a classification error difficult to calculate?


• Suboptimal solutions of Bayesian classifier design

• The estimation of class conditional probabilities is based on a limited sample

• The samples are frequently collected randomly, and not by use of a well-planned experimental procedure


• Data Models

• Simple Linear Regression Analysis

• Multiple Regression

• General Least Squares and Multiple Regression

• Assessing the Quality of the Multiple Regression Model


• Mathematical models

• “They are useful approximate representations of phenomena that generate data and may be used for prediction, classification, compression, or control design.”

• Black-box models

• Mathematical models obtained by processing existing data without using laws of physics governing data-generating phenomena

• Regression analysis

• Data analysis and model design are based on a sample from a given population


• Categories of regression models

• Simple linear regression

• Multiple linear regression

• Neural network-based linear regression

• Polynomial regression

• Logistic regression

• Log-linear regression

• Local piecewise linear regression

• Nonlinear regression (with a nonlinear model)

• Neural network-based nonlinear regression


• Static and dynamic models

• A static model produces outcomes based only on the current input (no internal memory).

• A dynamic model produces outcomes based on the current input and the past history of the model behavior (internal memory)


• Data gathering

• Random sample from a certain population

• N pairs of the experimental data set named Torig


• Regression analysis

• “A statistical method used to discover the relationship between variables and to design a data model that can be used to predict variable values based on other variables”


• Regression analysis

• A simple linear regression

• To find the linear relationship between two variables, x and y, and to discover a linear model, i.e., a line equation y = b+ax, which is the best fit to given data in order to predict values of data

• This modeling line is called the regression line of y on x

• The equation of that line is called a regression equation (regression model)

• Typical linear regression analysis provides a prediction of a dependent variable y based on an independent variable x


• Visualization of Regression

• Scatter plot for height versus weight data




• Sample data and Regression model


• Assumptions

• The observations yi (i = 1, …, N) are random samples and are mutually independent.

• The regression error terms (the difference between the predicted value and the true value) are also mutually independent, with the same distribution (normal distribution with zero mean) and constant variances

• The distribution of the error term is independent of the joint distribution of explanatory variables. It is also assumed that unknown parameters of regression models are constants


• Simple Linear Regression Analysis

• Evaluation of basic statistical characteristics of data

• An estimation of the optimal parameters of a linear model

• Assessment of model quality and of the generalization ability to predict the outcome for new data


• Model Structure

• Nonlinear data:

• Generally, a function f(x) could be nonlinear in x:

• Linear form: y = b + ax

• Regression Error (residual error)

• The difference between the actual value y_i and the predicted value y_i,est: e_i = y_i - y_i,est

• Performance Criterion – Sum of Squared Errors.

• The sum of squared errors performance criterion for multiple regression

• Regression minimizes the sum of squared errors; this technique is called the method of least squared errors (LSE) or, in short, the method of least squares
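The criterion can be sketched in a few lines of Python for the simple model y = b + ax. The function name and the sample points are hypothetical, chosen only to illustrate the computation.

```python
def sum_squared_errors(xs, ys, a, b):
    """Sum of squared residuals e_i = y_i - (b + a * x_i) over all data points."""
    return sum((y - (b + a * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical points that lie exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]

perfect_fit = sum_squared_errors(xs, ys, a=2.0, b=1.0)  # every residual is zero
poor_fit = sum_squared_errors(xs, ys, a=0.0, b=0.0)     # residuals equal the y values
```

Least squares picks the pair (a, b) for which this quantity is smallest.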

• Basic Statistical Characteristics of Data

• The mean of N samples

• The variance

• The covariance
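A minimal sketch of these three characteristics follows. The slide formulas did not survive extraction, so the divisor is an assumption: `ddof=1` gives the sample estimates (division by N - 1), `ddof=0` gives division by N. The function and parameter names are illustrative.

```python
def mean(values):
    """Arithmetic mean of N samples."""
    return sum(values) / len(values)


def variance(values, ddof=1):
    """Variance; ddof=1 divides by N - 1 (assumed), ddof=0 divides by N."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - ddof)


def covariance(xs, ys, ddof=1):
    """Covariance of paired samples, with the same divisor convention."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - ddof)


# Hypothetical data in which y is exactly 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
slope = covariance(xs, ys) / variance(xs)  # the divisor cancels in this ratio
```

Whichever divisor is used, the ratio of covariance to variance is unchanged, which is why the choice does not affect the fitted slope.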

• Sum of Squared Variations in y Caused by the Regression Model

• The total sum of squared variations in y

• These formulas are used to define important regression measures (for example, the correlation coefficient)

• Computing Optimal Values of the Regression Model Parameters

• The optimal model parameter values have to be computed from the given data set and the defined performance criterion

• Methods for estimation of optimal model parameter values

• The analytical offline method

• The analytical recursive offline method

• Iterative search for the optimal model parameters

• Neural network-based regression

• Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model

• The general linear model structure

• The performance criterion

• The performance curve for the model y = ax (a model with b = 0)

• Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model

• The optimal parameters
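The slide's equations were lost in extraction. As a hedged reconstruction, the standard least-squares result for the model y = b + ax is obtained by setting both partial derivatives of the sum-of-squared-errors criterion to zero. The symbols J, \bar{x}, and \bar{y} are assumed notation, not necessarily that of the original slides.

```latex
\begin{align*}
J(a, b) &= \sum_{i=1}^{N} \bigl( y_i - (b + a x_i) \bigr)^2 \\
\frac{\partial J}{\partial b} = 0 \;&\Rightarrow\; b = \bar{y} - a \bar{x} \\
\frac{\partial J}{\partial a} = 0 \;&\Rightarrow\;
  a = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
           {\sum_{i=1}^{N} (x_i - \bar{x})^2}
\end{align*}
```

Here \bar{x} and \bar{y} are the sample means of x and y, so the slope a is the covariance of x and y divided by the variance of x.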

• Procedure for simple linear regression

• Given: The number N of experimental observations, and the set of the N experimental data points { (xi, yi), i = 1, 2, …, N }

• Compute the statistical characteristics of the data

• Compute the estimates of the optimal model parameters a and b

• Assess the quality of the regression model, i.e., how well the model fits the data. Compute:

• Standard error of estimate

• Correlation coefficient r

• Coefficient of determination r²
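The procedure above can be sketched as a single Python function. This is an illustrative sketch, not the authors' implementation: the function name, the returned dictionary keys, and the sample data are hypothetical, and the standard error is assumed to use N - 2 degrees of freedom (two estimated parameters).

```python
import math


def simple_linear_regression(xs, ys):
    """Fit y = b + a*x by least squares and report basic quality measures."""
    n = len(xs)
    # Step 1: statistical characteristics of the data.
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Step 2: optimal parameters in the least-squares sense.
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    # Step 3: quality measures.
    sse = sum((y - (b + a * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(sse / (n - 2))   # assumed N - 2 degrees of freedom
    r = s_xy / math.sqrt(s_xx * s_yy)    # correlation coefficient
    return {"a": a, "b": b, "std_err": std_err, "r": r, "r2": r * r}


# Hypothetical data, not the four points used in the slides.
result = simple_linear_regression([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 3.0])
```

The function requires at least three points and x values that are not all equal; otherwise a division by zero occurs.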

Example

• Sample of four data points

• Resulting regression line

• y = 0.9 + 0.56x

• Optimal Parameter Values in the Minimum Least Squares Sense

• Required conditions for a valid linear regression

• The error term e = y - (b + ax) is normally distributed

• The error variance is the same for all values of x

• The errors are independent of each other

• Quality of the Linear Regression Model and Linear Correlation Analysis

• Assessment of model quality

• The resulting correlation coefficient can be used as a measure of how well the trends in the predicted values follow the trends in the training data

• The coefficient of determination can be used to measure how well the regression line fits the data points

• Correlation coefficient

• Coefficient of determination

• The percent of variation in the dependent variable y that can be explained by the regression equation,

• the explained variation in y divided by the total variation, or

• the square of r (correlation coefficient)

• Coefficient of determination

• Explained and unexplained variation in y

• Coefficient of determination

• Example

• If the correlation coefficient has the value r = 0.9327, then the coefficient of determination is r² ≈ 0.87. This means that about 87% of the total variation in y can be explained by the linear relationship between x and y described by the optimal regression model of the data. The remaining 13% of the total variation in y is unexplained.

• The calculation of coefficient of determination
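As a hedged illustration, the coefficient of determination can be computed as one minus the ratio of unexplained to total variation, which equals the explained share for a least-squares line with an intercept. The function name and the sample values are hypothetical; the last lines only check the arithmetic of the r = 0.9327 example.

```python
def coefficient_of_determination(ys, ys_est):
    """r^2 = 1 - (unexplained variation) / (total variation)."""
    y_bar = sum(ys) / len(ys)
    total = sum((y - y_bar) ** 2 for y in ys)
    unexplained = sum((y - y_est) ** 2 for y, y_est in zip(ys, ys_est))
    return 1.0 - unexplained / total


# Hypothetical observations and the predictions of the line y = 1 + 0.5x at x = 1..4.
ys = [1.0, 3.0, 2.0, 3.0]
ys_est = [1.5, 2.0, 2.5, 3.0]
r2 = coefficient_of_determination(ys, ys_est)

# Arithmetic check of the example: squaring r = 0.9327 gives about 0.87.
percent_explained = round(100 * 0.9327 ** 2)
percent_unexplained = 100 - percent_explained
```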

• Matrix Version of Simple Linear Regression Based on Least Squares Method

• The matrix form of the model description (the estimation of y) for all N experimental data points

• The regression error

• Matrix Version of Simple Linear Regression Based on Least Squares Method

• The performance criterion:

• Optimal parameters:

• The value of the criterion for the optimal parameter vector:

• The regression error for the model with the optimal parameter vector:
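A minimal sketch of the matrix formulation, assuming the data matrix X has rows [1, x_i] and the parameter vector is [b, a], so that the optimal parameters solve the normal equations (XᵀX)θ = Xᵀy. Because XᵀX is only 2 × 2 here, its inverse is written out by hand. The names and the data are hypothetical.

```python
def fit_line_normal_equations(xs, ys):
    """Solve (X^T X) theta = X^T y for X with rows [1, x_i]; returns (b, a)."""
    n = len(xs)
    # Entries of X^T X = [[n, s_x], [s_x, s_xx]] and X^T y = [s_y, s_xy].
    s_x = sum(xs)
    s_xx = sum(x * x for x in xs)
    s_y = sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    det = n * s_xx - s_x * s_x
    if det == 0:
        raise ValueError("X^T X is singular: all x values are equal")
    # Explicit inverse of the 2x2 matrix applied to X^T y.
    b = (s_xx * s_y - s_x * s_xy) / det
    a = (n * s_xy - s_x * s_y) / det
    return b, a


# Hypothetical data, not the table from the slides.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 3.0]
b, a = fit_line_normal_equations(xs, ys)
residuals = [y - (b + a * x) for x, y in zip(xs, ys)]  # regression error vector e
criterion = sum(e * e for e in residuals)              # J = e^T e at the optimum
```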

• Matrix Version of Simple Linear Regression Based on Least Squares Method

• Example: let us consider again the dataset shown in the following table

• y = 0.56x + 0.9

• Definition

• Multiple regression analysis is the statistical technique of exploring the relation (association) between a set of n independent variables and the dependent variable y (one, or in general several) whose variability they are used to explain

• Linear multiple regression model

• Linear multiple regression model using vector notation

• This regression model is represented by a hyperplane in (n + 1)-dimensional space.

• Geometrical Interpretation: Regression Errors

• The goal of multiple regression is to find a hyperplane in the (n + 1)-dimensional space that will best fit the data

• The performance criterion

• The error variance and standard error of the estimate

• Degrees of Freedom

• The denominator N – n – 1 in the previous equation tells us that in multiple regression with n independent variables, the standard error has N – n – 1 degrees of freedom

• The degrees of freedom have been reduced from N by n + 1 because the n + 1 numerical parameters a0, a1, a2, …, an of the regression model have been estimated from the data
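The role of the N - n - 1 denominator can be sketched as follows. The function name and the residual values are hypothetical; the formula assumed is the square root of the sum of squared residuals divided by the degrees of freedom.

```python
import math


def standard_error_of_estimate(residuals, n_vars):
    """sqrt(SSE / (N - n - 1)) for a model with n_vars independent variables."""
    n_samples = len(residuals)
    dof = n_samples - n_vars - 1  # N minus the n + 1 estimated parameters
    if dof <= 0:
        raise ValueError("need more data points than model parameters")
    sse = sum(e * e for e in residuals)
    return math.sqrt(sse / dof)


# Hypothetical residuals from a one-variable model fitted to four points.
s_e = standard_error_of_estimate([-0.5, 1.0, -0.5, 0.0], n_vars=1)
```

With four points and one independent variable, two degrees of freedom remain, so the sum of squared residuals 1.5 is divided by 2 rather than by 4.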

• General model description in function form

• Data model

• Performance criterion

• Regression error

• General model description in matrix form

• Data model

• Performance criterion

• Optimal parameters

• Practical, Numerically Stable Computation of the Optimal Model Parameters

• Problem

• “The solution for the optimal least-squares parameters is almost never computed from the equation due to its poor numerical performance in cases when the matrix (the covariance matrix) is ill conditioned”

• Solution: various matrix decomposition methods
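The slides do not say which decomposition is meant. As one possible illustration, the sketch below solves the least-squares problem through a thin QR factorization computed with modified Gram-Schmidt, which avoids forming the product XᵀX explicitly. The function name and the data are hypothetical, and production code would normally call a library routine (for example `numpy.linalg.lstsq`) instead.

```python
import math


def qr_least_squares(X, y):
    """Minimize ||X theta - y||^2 via X = QR (modified Gram-Schmidt), then R theta = Q^T y."""
    n_rows, n_cols = len(X), len(X[0])
    # Q is stored as a list of columns, initialised with the columns of X.
    Q = [[X[i][j] for i in range(n_rows)] for j in range(n_cols)]
    R = [[0.0] * n_cols for _ in range(n_cols)]
    for j in range(n_cols):
        # Remove the components along the already orthonormal columns.
        for k in range(j):
            R[k][j] = sum(Q[k][i] * Q[j][i] for i in range(n_rows))
            Q[j] = [Q[j][i] - R[k][j] * Q[k][i] for i in range(n_rows)]
        R[j][j] = math.sqrt(sum(v * v for v in Q[j]))
        if R[j][j] == 0.0:
            raise ValueError("columns of X are linearly dependent")
        Q[j] = [v / R[j][j] for v in Q[j]]
    # Back substitution on the upper-triangular system R theta = Q^T y.
    qty = [sum(Q[j][i] * y[i] for i in range(n_rows)) for j in range(n_cols)]
    theta = [0.0] * n_cols
    for j in reversed(range(n_cols)):
        s = qty[j] - sum(R[j][k] * theta[k] for k in range(j + 1, n_cols))
        theta[j] = s / R[j][j]
    return theta


# Hypothetical data: rows [1, x_i], so theta = [b, a].
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [1.0, 3.0, 2.0, 3.0]
theta = qr_least_squares(X, y)
```

Householder-based QR and the singular value decomposition are generally more robust than Gram-Schmidt for badly conditioned data; this version is kept short for readability.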

• The Coefficient of Multiple Determination, R²

• “The percent of the variance in the dependent variable that can be explained by all of the independent variables taken together.”

• Adjusted R² corrects this coefficient using the number of model parameters (the design parameters plus a constant) and the number of data points N, so that the statistic is not inflated when unnecessary parameters are included in the model structure
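The slide's formula is not shown, so the sketch below assumes the common textbook form of the adjustment, 1 - (1 - R²)(N - 1)/(N - n - 1). The function name and the input values are hypothetical.

```python
def adjusted_r_squared(r2, n_samples, n_vars):
    """Adjusted R^2 for a model with n_vars independent variables plus a constant."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_vars - 1)


# Hypothetical case: the same R^2 obtained with 2 and with 5 independent variables.
adj_small_model = adjusted_r_squared(0.87, n_samples=10, n_vars=2)
adj_large_model = adjusted_r_squared(0.87, n_samples=10, n_vars=5)
```

For the same R² and N, the model with more variables receives the lower adjusted value, which is the penalty for unnecessary parameters.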

• Cp Statistic

• The Cp statistic is used to compare multiple regression models

• When comparing alternative regression models, the designer aims to choose models whose value of Cp is close to or below (n + 1)
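The defining formula was lost in extraction. The sketch below assumes the usual definition of Mallows' Cp: the sum of squared errors of the candidate model divided by the error variance estimated from the full model, minus N, plus twice the number of parameters p = n + 1. All names and numbers are hypothetical.

```python
def mallows_cp(sse_candidate, mse_full, n_samples, n_params):
    """Cp = SSE_p / s^2 - N + 2p, where p = n + 1 counts the constant term."""
    return sse_candidate / mse_full - n_samples + 2 * n_params


# Hypothetical candidate model with n = 2 variables (p = 3) fitted to N = 10 points.
cp = mallows_cp(sse_candidate=12.0, mse_full=1.5, n_samples=10, n_params=3)
is_acceptable = cp <= 3 + 1  # illustrative tolerance around the target p = n + 1
```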

• Multiple Correlation

• A value of R can be found as the positive square root of R² (the coefficient of multiple determination)

• It is a measure of the strength of the linear relationship between the dependent variable y and the set of independent variables x1, x2, …, xn.

• A value of R close to 1 indicates that the fit is very good

• A value near zero indicates that the model is not a good approximation of the data and cannot be efficiently used for prediction

Example

• “Let us consider a multiple linear regression analysis for the data set containing N = 4 cases, composed with one dependent variable y and two independent variables x1 and x2”

• Three-dimensional data

Example

• The scatter plot of data points in three-dimensional space (x1, x2, y)

Example

• The data matrix

• The optimal model parameters

Example

• The optimal model:

• y = 3.1 + 0.9x1 + 0.56x2

• The optimal regression model in (x1, x2, y) space:
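Using the fitted model can be sketched as a dot product between the coefficient vector and the inputs. The coefficients are those of the optimal model stated above; the function name and the input point are hypothetical.

```python
def predict(coefficients, xs):
    """Evaluate y = a0 + a1*x1 + ... + an*xn for one input vector xs."""
    a0, rest = coefficients[0], coefficients[1:]
    return a0 + sum(a * x for a, x in zip(rest, xs))


# Coefficients [a0, a1, a2] of the optimal model y = 3.1 + 0.9*x1 + 0.56*x2.
model = [3.1, 0.9, 0.56]
y_est = predict(model, [1.0, 2.0])  # hypothetical input point (x1, x2) = (1, 2)
```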

Example

• Multiple regression: regression plane model and scatter plot

Example

• The residuals (errors)

• The criterion value for the optimal parameters: 0.016
