Chapter 11 Supervised Learning: STATISTICAL METHODS

# Chapter 11 Supervised Learning: STATISTICAL METHODS

### Chapter 11 Supervised Learning: STATISTICAL METHODS

Cios / Pedrycz / Swiniarski / Kurgan

Outline
• Bayesian Methods
• Basics of Bayesian Methods
• Bayesian Classification – General Case
• Classification that Minimizes Risk
• Decision Regions and Probability of Errors
• Discriminant Functions
• Estimation of Probability Densities
• Probabilistic Neural Network
• Constraints in Classifier Design

Outline
• Regression
• Data Models
• Simple Linear Regression
• Multiple Regression
• General Least Squares and Multiple Regression
• Assessing Quality of the Multiple Regression Model

Bayesian Methods

Statistical processing based on the Bayes decision theory is a fundamental technique for pattern recognition and classification.

The Bayes decision theory provides a framework for statistical methods for classifying patterns into classes based on probabilities of patterns and their features.

Basics of Bayesian Methods

Let us assume an experiment involving recognition of two kinds of birds: an eagle and a hawk.

States of nature: C = { “an eagle”, “a hawk” }

Values of C = { c1, c2 } = { “an eagle”, “a hawk” }

We may assume that, among a large number N of prior observations, n_eagle of them belonged to class c1 (“an eagle”)

and n_hawk belonged to class c2 (“a hawk”), with n_eagle + n_hawk = N.

Basics of Bayesian Methods
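• A priori (prior) probability P(ci):
• P(ci) denotes the (unconditional) probability that an object belongs to class ci, without any further information about the object.
• Estimation of a prior P(ci) from the N prior observations: $\hat{P}(c_i) = n_i / N$, where $n_i$ is the number of observations that belonged to class ci (here n_eagle or n_hawk).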

Basics of Bayesian Methods

The a priori probabilities P(c1) and P(c2) represent our initial knowledge (in statistical terms) about how likely it is that an eagle or a hawk may emerge even before a bird physically appears.

• Natural and best decision:

“Assign a bird to a class c1 if P(c1) > P(c2); otherwise, assign a bird to a class c2 ”

• The probability of classification error:

P(classification error) = P(c2) if we decide C = c1, and P(c1) if we decide C = c2

Involving Object Features in Classification
• Feature variable / feature x
• It characterizes an object and allows for better discrimination of one class from another
• We assume it to be a continuous random variable taking values from a given range
• The variability of a random variable x can be expressed in probabilistic terms
• We represent the distribution of a random variable x by the class conditional probability density function (the state conditional probability density function) p(x|ci)

Involving Object Features in Classification

Examples of probability densities

Involving Object Features in Classification
• Probability density function p(x|ci)
• also called the likelihood of a class ci with respect to the value x of a feature variable
• the likelihood that an object belongs to class ci is larger if p(x|ci) is larger
• Joint probability density function p(ci, x)
• A probability density that an object is in a class ci and has a feature variable value x.
• A posteriori (posterior) probability P(ci|x)
• The conditional probability function P(ci|x) (i = 1, 2), which specifies the probability that the object class is ci given that the measured value of a feature variable is x.

Involving Object Features in Classification
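• Bayes’ rule / Bayes’ theorem
• From probability theory (see Appendix B): $p(c_i, x) = P(c_i \mid x)\, p(x) = p(x \mid c_i)\, P(c_i)$
• An unconditional probability density function: $p(x) = \sum_{i=1}^{2} p(x \mid c_i)\, P(c_i)$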

Involving Object Features in Classification
• Bayes’ rule: $P(c_i \mid x) = \dfrac{p(x \mid c_i)\, P(c_i)}{p(x)}$
• “The conditional probability P(ci|x) can be expressed in terms of the a priori probability function P(ci), together with the class conditional probability density function p(x|ci).”

Involving Object Features in Classification
• Bayes’ decision rule: assign an object with feature value x to class c1 if P(c1|x) > P(c2|x); otherwise, assign it to class c2
• P(classification error | x) = P(c2|x) if we decide C = c1, and P(c1|x) if we decide C = c2
• “This statistical classification rule is best in the sense of minimizing the probability of misclassification (the probability of classification error)”
• Bayes’ classification rule guarantees minimization of the average probability of classification error

Involving Object Features in Classification
• Example
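• Let us consider a bird classification problem with P(c1) = P(“an eagle”) = 0.8 and P(c2) = P(“a hawk”) = 0.2 and known probability density functions p(x|c1) and p(x|c2).
• Assume that, for a new bird, we have measured its size x = 45 cm and for this value we computed $p(45 \mid c_1) = 2.2828 \cdot 10^{-2}$ and $p(45 \mid c_2) = 1.1053 \cdot 10^{-2}$.
• Thus, the classification rule predicts class c1 (“an eagle”) because $p(x \mid c_1) P(c_1) > p(x \mid c_2) P(c_2)$ ($2.2828 \cdot 10^{-2} \cdot 0.8 > 1.1053 \cdot 10^{-2} \cdot 0.2$). Let us assume that the unconditional density p(x) is known to be p(45) = 0.3. The probability of classification error is then $P(\text{classification error} \mid x) = P(c_2 \mid 45) = p(45 \mid c_2) P(c_2) / p(45) = 1.1053 \cdot 10^{-2} \cdot 0.2 / 0.3 \approx 0.0074$.
• Note that p(45) = 0.3 is taken as given here. The value implied by the stated likelihoods and priors is $p(45) = 2.2828 \cdot 10^{-2} \cdot 0.8 + 1.1053 \cdot 10^{-2} \cdot 0.2 \approx 0.0205$, which gives $P(c_2 \mid 45) \approx 0.108$.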
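The arithmetic of this example can be checked with a short script. This is a minimal sketch: the priors and likelihood values are the ones quoted in the example, the variable names are illustrative, and p(45) is computed from the total probability formula instead of using the quoted value 0.3 (an assumption made here so that the two posteriors sum to one).

```python
# Priors and class conditional density values from the bird example.
priors = {"eagle": 0.8, "hawk": 0.2}
likelihoods = {"eagle": 2.2828e-2, "hawk": 1.1053e-2}  # p(45 | c_i)

# Joint densities p(x | c_i) * P(c_i).
joint = {c: likelihoods[c] * priors[c] for c in priors}

# Unconditional density by total probability (assumption: not the quoted 0.3).
evidence = sum(joint.values())

# Posteriors P(c_i | x) from Bayes' rule.
posteriors = {c: joint[c] / evidence for c in joint}

# Bayes' decision rule and the resulting conditional error probability.
decision = max(posteriors, key=posteriors.get)
error_given_x = 1.0 - posteriors[decision]

print(decision, round(evidence, 5), round(error_given_x, 4))
```

With these inputs the decision is "eagle", and the conditional error probability is the posterior of the rejected class, roughly 0.108.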

Bayesian Classification – General Case
• Bayes’ Classification Rule for Multiclass Multifeature Objects
• Real-valued features of an object as an n-dimensional column vector $x \in \mathbb{R}^n$: $x = [x_1, x_2, \ldots, x_n]^T$
• The object may belong to l distinct classes (l distinct states of nature): $C = \{c_1, c_2, \ldots, c_l\}$

Bayesian Classification – General Case
• Bayes’ Classification Rule for Multiclass Multifeature Objects
• Bayes’ theorem: $P(c_i \mid x) = \dfrac{p(x \mid c_i)\, P(c_i)}{p(x)}$
• A priori probability: P(ci) (i = 1, 2, …, l)
• Class conditional probability density function: p(x|ci)
• A posteriori (posterior) probability: P(ci|x)
• Unconditional probability density function: $p(x) = \sum_{i=1}^{l} p(x \mid c_i)\, P(c_i)$

Bayesian Classification – General Case
• Bayes’ Classification Rule for Multiclass Multifeature Objects
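• Bayes classification rule:
• A given object with a given value x of a feature vector can be classified as belonging to class cj when: $P(c_j \mid x) > P(c_i \mid x)$ for all i = 1, 2, …, l; i ≠ j
• Equivalently, assign an object with a given value x of a feature vector to class cj when: $p(x \mid c_j)\, P(c_j) > p(x \mid c_i)\, P(c_i)$ for all i = 1, 2, …, l; i ≠ j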
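The multiclass rule can be sketched in a few lines. The example is illustrative only: the three classes, their priors, and the univariate Gaussian class conditional densities are made-up assumptions, and `gaussian_pdf` and `bayes_classify` are hypothetical helper names.

```python
import math

def gaussian_pdf(x, mean, std):
    """Univariate normal density, used here as an assumed p(x | c_i)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def bayes_classify(x, priors, densities):
    """Return the class with the largest posterior, plus all posteriors."""
    joint = {c: densities[c](x) * priors[c] for c in priors}
    evidence = sum(joint.values())  # p(x) by total probability
    posteriors = {c: j / evidence for c, j in joint.items()}
    return max(posteriors, key=posteriors.get), posteriors

# Hypothetical three-class problem (all numbers are invented for illustration).
priors = {"c1": 0.5, "c2": 0.3, "c3": 0.2}
densities = {
    "c1": lambda x: gaussian_pdf(x, 0.0, 1.0),
    "c2": lambda x: gaussian_pdf(x, 3.0, 1.0),
    "c3": lambda x: gaussian_pdf(x, 6.0, 1.0),
}

label, post = bayes_classify(2.5, priors, densities)
print(label, {c: round(p, 3) for c, p in post.items()})
```

Because p(x) is the same for every class, comparing the products p(x|ci)P(ci) directly would give the same decision; the normalization is kept only so that the posteriors can be inspected.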

Classification that Minimizes Risk
• Basic Idea
• To incorporate the fact that misclassifications of some classes are more costly than others, we define a classification based on a minimization criterion that involves a loss associated with a given classification decision for a given true state of nature
• A loss function
• Cost (penalty, weight) of assigning an object to class cj when in fact the true class is ci

Classification that Minimizes Risk
• A loss matrix
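• We denote a loss function by the l × l matrix with entries Lij for l-class classification problems
• Expected (average) conditional loss: $R(c_j \mid x) = \sum_{i=1}^{l} L(\text{decision} = c_j \mid \text{true class} = c_i)\, P(c_i \mid x)$
• In short, $R_j = \sum_{i=1}^{l} L_{ij}\, P(c_i \mid x)$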

Classification that Minimizes Risk
• Overall Risk
• The overall risk R can be considered as a classification criterion for minimizing risk related to a classification decision.
• Bayes risk
• Minimal overall risk R leads to the generalization of Bayes’ rule for minimization of probability of the classification error.

Classification that Minimizes Risk
• Bayes’ classification rule with Bayes risk
• Choose a decision (a class) cj for which the conditional risk is minimal: $R(c_j \mid x) < R(c_k \mid x)$ for all k = 1, 2, …, l; k ≠ j

Classification that Minimizes Risk
• Bayesian Classification Minimizing the Probability of Error
• Symmetrical zero-one conditional loss function: $L_{ij} = 0$ for i = j and $L_{ij} = 1$ for i ≠ j
• The conditional risk R(cj|x) criterion is then the same as the average probability of classification error: $R(c_j \mid x) = \sum_{i \ne j} P(c_i \mid x) = 1 - P(c_j \mid x)$
• An average probability of classification error is thus used as a criterion of minimization for selecting the best classification decision
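A small sketch may help show how a loss matrix changes the decision. It assumes the convention stated earlier (entry `loss[i][j]` is the cost of deciding class j when the true class is i); the posteriors and the asymmetric loss values are invented for illustration, and the function names are hypothetical.

```python
def conditional_risks(posteriors, loss):
    """R(c_j | x) = sum_i loss[i][j] * P(c_i | x) for every decision j."""
    l = len(posteriors)
    return [sum(loss[i][j] * posteriors[i] for i in range(l)) for j in range(l)]

def min_risk_decision(posteriors, loss):
    """Index of the decision with the smallest conditional risk, plus all risks."""
    risks = conditional_risks(posteriors, loss)
    return min(range(len(risks)), key=risks.__getitem__), risks

posteriors = [0.7, 0.3]              # P(c1 | x), P(c2 | x), made-up values
zero_one = [[0, 1], [1, 0]]          # symmetrical zero-one loss
asymmetric = [[0, 1], [10, 0]]       # missing class c2 costs ten times more

print(min_risk_decision(posteriors, zero_one))
print(min_risk_decision(posteriors, asymmetric))
```

Under the zero-one loss the risks are 1 − P(cj|x), so the rule picks the class with the largest posterior. Under the asymmetric loss the less probable class is chosen because the cost of missing it dominates.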

Classification that Minimizes Risk
• Generalization of the Maximum Likelihood Classification
• Generalized likelihood ratio for classes ci and cj
• Generalized threshold value
• The maximum likelihood classification rule
• “Decide a class cj if

Decision Regions and Probability of Errors
• Decision regions
• A classifier divides the feature space into l disjoint decision subspaces R1,R2, … Rl
• The region Ri is a subspace such that each realization x of a feature vector of an object falling into this region will be assigned to a class ci

Decision Regions and Probability of Errors
• Decision boundaries (decision surfaces)
• Decision regions are disjoint; the surfaces that separate adjacent regions are called decision boundaries (decision surfaces)
• “The task of a classifier design is to find classification rules that will guarantee division of a feature space into optimal decision regions R1,R2, … Rl (with optimal decision boundaries) that will minimize a selected classification performance criterion”

Decision Regions and Probability of Errors
• Decision boundaries

Decision Regions and Probability of Errors
• Optimal classification with decision regions
• Average probability of correct classification: $P(\text{classification\_correct}) = \sum_{i=1}^{l} \int_{R_i} p(x \mid c_i)\, P(c_i)\, dx$
• “Classification problems can be stated as choosing decision regions Ri (thus defining a classification rule) that maximize the probability P(classification_correct) of correct classification, which serves as the optimization criterion”

Discriminant Functions
• Discriminant functions: $d_i(x)$, i = 1, 2, …, l
• Discriminant type classifier
• It assigns an object with a given value x of a feature vector to a class cj if $d_j(x) > d_i(x)$ for all i = 1, 2, …, l; i ≠ j
• Classification rule for a discriminant function-based classifier
• Compute numerical values of all discriminant functions for x
• Choose a class cj as a prediction of true class for which a value of the associated discriminant function dj(x) is the largest:
• Select a class cj for which dj(x) = max(di(x)); i = 1, 2, …, l

Discriminant Functions
• Discriminant classifier

Discriminant Functions
• Discriminant type classifier for Bayesian classification
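• The natural choice for the discriminant function is the a posteriori conditional probability P(ci|x): $d_i(x) = P(c_i \mid x)$
• Practical versions using Bayes’ theorem: $d_i(x) = p(x \mid c_i)\, P(c_i)$ (the common factor 1/p(x) does not affect the decision)
• Bayesian discriminant in a natural logarithmic form: $d_i(x) = \ln p(x \mid c_i) + \ln P(c_i)$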

Discriminant Functions
• Characteristics of discriminant function
• Discriminant functions define the decision boundaries that separate the decision regions
• Generally, the decision boundary between neighboring decision regions is the set of points where the corresponding discriminant function values are equal
• The decision boundaries are unaffected by a monotonically increasing transformation of the discriminant functions

Discriminant Functions
• Bayesian Discriminant Functions for Two Classes
• General case
• Two discriminant functions: d1(x) and d2(x).
• Two decision regions: R1 and R2.
• The decision boundary: d1(x) = d2(x).
• Using dichotomizer
• Single discriminant function: d(x) = d1(x) − d2(x); decide c1 if d(x) > 0, otherwise decide c2.

Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
• Assumption:
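• A multivariate normal (Gaussian) distribution of the feature vector x within each class: $p(x \mid c_i) = \dfrac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)$, with class mean vector $\mu_i$ and covariance matrix $\Sigma_i$
• The Bayesian discriminant (from the previous section): $d_i(x) = \ln p(x \mid c_i) + \ln P(c_i)$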

Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
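• Gaussian distribution of the probability density function: $d_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) - \tfrac{n}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(c_i)$, where $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix of class ci
• Dropping the class-independent term gives the quadratic discriminant: $d_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(c_i)$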
• Decision boundaries:
• hyperquadratic functions in n-dimensional feature space (hyperspheres, hyperellipsoids, hyperparaboloids, etc.)

Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
• Given: A pattern x. Values of state conditional probability densities p(x|ci) and the a priori probabilities P(ci)
• Compute values of the mean vectors μi and the covariance matrices Σi for all classes i = 1, 2, …, l based on the training set
• Compute values of the discriminant function for all classes
• Choose a class cj as a prediction of true class for which a value of the associated discriminant function dj(x) is largest:
• Select a class cj for which dj(x) = max(di(x)); i = 1, 2, …, l
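The classification steps for the quadratic discriminant can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it is restricted to two features so that the 2×2 inverse and determinant can be written out by hand, the class means, covariances, and priors are invented, and the helper names are hypothetical. The discriminant used is the standard Gaussian form with the class-independent constant dropped.

```python
import math

def inv_det_2x2(m):
    """Inverse and determinant of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det

def quadratic_discriminant(x, mean, cov, prior):
    """d_i(x) = -0.5 (x-mu)^T Sigma^-1 (x-mu) - 0.5 ln|Sigma| + ln P(c_i)."""
    inv, det = inv_det_2x2(cov)
    dx = [x[0] - mean[0], x[1] - mean[1]]
    maha_sq = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
               + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return -0.5 * maha_sq - 0.5 * math.log(det) + math.log(prior)

def classify(x, classes):
    """Evaluate every discriminant and select the class with the largest value."""
    scores = {name: quadratic_discriminant(x, *params)
              for name, params in classes.items()}
    return max(scores, key=scores.get), scores

# Hypothetical class parameters: (mean vector, covariance matrix, prior).
classes = {
    "c1": ((0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]], 0.5),
    "c2": ((3.0, 3.0), [[2.0, 0.5], [0.5, 1.0]], 0.5),
}

print(classify((0.5, 0.2), classes)[0])
print(classify((3.0, 3.0), classes)[0])
```

In practice the means and covariances would be estimated from the training set, as the steps above describe.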

Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
• Linear Discriminant:
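• Assumption: equal covariances for all classes, Σi = Σ
• A linear form of discriminant functions: $d_i(x) = w_i^T x + w_{i0}$, with $w_i = \Sigma^{-1}\mu_i$ and $w_{i0} = -\tfrac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \ln P(c_i)$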

Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
• Linear Discriminant:

Decision boundaries between classes i and j, for which di(x) = dj(x), are pieces of hyperplanes in n-dimensional feature space

Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
• The classification process using linear discriminants
• Compute, for a given x, numerical values of discriminant functions for all classes:
• Choose a class cj for which a value of the discriminant function dj(x) is largest:
• Select a class cj for which dj(x) = max(di(x)) i = 1, 2, …, l

Discriminant Functions
• Example
• Let us assume that the following two-feature patterns x ∈ R² from two classes c1 = 0 and c2 = 1 have been drawn according to the Gaussian (normal) density distribution:

Discriminant Functions
• Example
• The estimates of the symmetric covariance matrices for both classes
• The linear discriminant functions for both classes

Discriminant Functions
• Example
• Two-class two-feature pattern dichotomizer.

Discriminant Functions
• Minimum Mahalanobis Distance Classifier
• Assumption
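• Equal covariances for all classes: Σi = Σ (i = 1, 2, …, l)
• Equal a priori probabilities for all classes: P(ci) = P
• Discriminant function: $d_i(x) = -(x - \mu_i)^T \Sigma^{-1} (x - \mu_i)$, the negative squared Mahalanobis distance between x and the class mean μi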

Discriminant Functions
• Minimum Mahalanobis Distance Classifier
• A classifier selects the class cj for which a value x is nearest, in the sense of Mahalanobis distance, to the corresponding mean vector μj. This classifier is called a minimum Mahalanobis distance classifier.
• Linear version of the minimum Mahalanobis distance classifier

Discriminant Functions
• Minimum Mahalanobis Distance Classifier
• Given: The mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of a feature vector
• Compute numerical values of the Mahalanobis distances between x and the means μi for all classes.
• Choose a class cj as a prediction of true class, for which the value of the associated Mahalanobis distance attains the minimum: $(x - \mu_j)^T \Sigma^{-1} (x - \mu_j) = \min_i\, (x - \mu_i)^T \Sigma^{-1} (x - \mu_i)$
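The following sketch illustrates the minimum Mahalanobis distance rule and how it can disagree with a plain Euclidean comparison. It is limited to two features, the shared covariance matrix and class means are invented for the example, and the function names are hypothetical.

```python
def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mu)^T Sigma^-1 (x - mu) for 2-D inputs."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx0, dx1 = x[0] - mean[0], x[1] - mean[1]
    # The 2x2 inverse is [[d, -b], [-c, a]] / det, expanded into the quadratic form.
    return (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det

def euclidean_sq(x, mean):
    """Squared Euclidean distance, shown only for comparison."""
    return (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2

def min_mahalanobis_class(x, means, cov):
    """Select the class whose mean is nearest to x in the Mahalanobis sense."""
    return min(means, key=lambda name: mahalanobis_sq(x, means[name], cov))

# Hypothetical data: feature 1 varies much more than feature 2.
means = {"c1": (0.0, 0.0), "c2": (3.0, 3.0)}
shared_cov = [[4.0, 0.0], [0.0, 0.25]]
x = (3.0, 1.0)

by_mahalanobis = min_mahalanobis_class(x, means, shared_cov)
by_euclidean = min(means, key=lambda name: euclidean_sq(x, means[name]))
print(by_mahalanobis, by_euclidean)
```

Here the pattern is closer to the mean of c2 in Euclidean terms, but a deviation of 2 along the low-variance second feature is heavily penalized, so the Mahalanobis rule selects c1.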

Discriminant Functions
• Linear Discriminant for Statistically Independent Features
• Assumption
• Equal covariances for all classes: Σi = Σ (i = 1, 2, …, l)
• Features are statistically independent
• Discriminant function
• where

Discriminant Functions
• Linear Discriminant for Statistically Independent Features
• Discriminants
• Linear discriminant formula

Discriminant Functions
• Linear Discriminant for Statistically Independent Features
• “Neural network” style as a linear threshold machine
• where
• The decision surfaces for the linear discriminants are pieces of hyperplanes defined by the equations di(x) − dj(x) = 0.

Discriminant Functions
• Minimum Euclidean Distance Classifier
• Assumption
• Equal covariances for all classes: Σi = Σ (i = 1, 2, …, l)
• Features are statistically independent, with a common variance σ² (so that Σ = σ²I)
• Equal a priori probabilities for all classes: P(ci) = P
• Discriminants
• or

Discriminant Functions
• Minimum Euclidean Distance Classifier
• The minimum distance classifier, or minimum Euclidean distance classifier, selects the class cj for which a value x is nearest to the corresponding mean vector μj.
• Linear version of the minimum distance classifier

Discriminant Functions
• Given: The mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of a feature vector
• Compute numerical values of Euclidean distances between x and the means μi for all classes: $\|x - \mu_i\| = \sqrt{(x - \mu_i)^T (x - \mu_i)}$
• Choose a class cj as a prediction of true class for which a value of the associated Euclidean distance is smallest: $\|x - \mu_j\| = \min_i \|x - \mu_i\|$
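A minimal sketch of the minimum Euclidean distance classifier follows, together with the equivalent linear form obtained by expanding the squared distance and dropping the term that is common to all classes. The class means are invented, and the function names are hypothetical.

```python
def euclidean_sq(x, mean):
    """Squared Euclidean distance between pattern x and a class mean."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mean))

def nearest_mean(x, means):
    """Minimum Euclidean distance rule: pick the class with the closest mean."""
    return min(means, key=lambda name: euclidean_sq(x, means[name]))

def linear_discriminant(x, mean):
    """d_i(x) = mu_i^T x - 0.5 * mu_i^T mu_i (the x^T x term is class independent)."""
    return (sum(mi * xi for mi, xi in zip(mean, x))
            - 0.5 * sum(mi * mi for mi in mean))

def largest_linear(x, means):
    """Linear version: pick the class with the largest discriminant value."""
    return max(means, key=lambda name: linear_discriminant(x, means[name]))

means = {"c1": (0.0, 0.0), "c2": (3.0, 3.0)}  # hypothetical class means
x = (3.0, 1.0)
print(nearest_mean(x, means), largest_linear(x, means))
```

Both functions return the same class for any x, since minimizing ‖x − μi‖² is the same as maximizing the linear expression.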

Discriminant Functions
• Characteristics of Bayesian Normal Discriminant
• Assumptions
• multivariate normality within classes
• equal covariance matrices across classes
• Under these assumptions, the linear discriminant is equivalent to the optimal (Bayes) classifier
• In practice, these assumptions are satisfied only approximately
• Due to its simple structure, the linear discriminant tends not to overfit the training data set, which may lead to stronger generalization ability for unseen cases

Estimation of Probability Densities
• Basic Idea
• In Bayesian classifier design, the a priori probabilities and the class conditional probability densities must be estimated from a limited number of a priori observed objects. This estimation should be optimal according to a well-defined estimation criterion.
• Estimates of a priori probabilities: $\hat{P}(c_i) = N_i / N$, where Ni of the N observed objects belong to class ci

Estimation of Probability Densities
• Estimation of the class conditional probability densities p(x|ci)
• Parametric methods
• with the assumption of a specific functional form of a probability density function
• Nonparametric methods
• without the assumption of a specific functional form of a probability density function
• Semiparametric method
• a combination of parametric and nonparametric methods

Estimation of Probability Densities
• Parametric Methods
• A priori observations of objects and corresponding patterns:
• Split set of all patterns X according to a class into l disjoint sets:
• Assume that the parametric form of the class conditional probability density is given as a function:
• where

Estimation of Probability Densities
• Parametric Methods
• If the probability density has a normal (Gaussian) form:
• where

Estimation of Probability Densities
• The Maximum Likelihood Estimation of Parameters
• Assumption
• we are given a limited-size set of N patterns xi: $X = \{x_1, x_2, \ldots, x_N\}$
• we know a parametric form p(x|θ) of a conditional probability density function
• Goal
• The task of estimation is to find the optimal (the best according to the used criterion) value of the parameter vector θ of a given dimension m.

Estimation of Probability Densities
• The Maximum Likelihood Estimation of Parameters
• Likelihood
• The joint probability density $L(\theta) = p(X \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$ (for independently drawn patterns) is a function of a parameter vector θ for a given set of patterns X.
• It is called the likelihood of θ for a given set of patterns X.
• Maximum Likelihood Estimation
• The function L(θ) can be chosen as a criterion for finding the optimal estimate of θ: the estimate $\hat{\theta}$ is the value that maximizes L(θ). This is called the maximum likelihood estimation of the parameters θ.

Estimation of Probability Densities
• The Maximum Likelihood Estimation of Parameters
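• Minimizing the negative natural logarithm of the likelihood L(θ): $J(\theta) = -\ln L(\theta) = -\sum_{i=1}^{N} \ln p(x_i \mid \theta)$
• For the differentiable function p(xi|θ), the minimum satisfies: $\dfrac{\partial J(\theta)}{\partial \theta_k} = 0$, k = 1, 2, …, m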

Estimation of Probability Densities
• The Maximum Likelihood Estimation of Parameters
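• For the normal form of a probability density function N(μ, Σ) with unknown parameters μ and Σ constituting the vector θ: $\hat{\mu} = \dfrac{1}{N}\sum_{i=1}^{N} x_i$ and $\hat{\Sigma} = \dfrac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})(x_i - \hat{\mu})^T$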

Estimation of Probability Densities
• The Maximum Likelihood Estimation of Parameters
• Example of Maximum Likelihood Estimation
• for
• The maximum likelihood estimation criterion
• The maximum likelihood estimates for the parameters:
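As a concrete illustration of maximum likelihood estimation, the sketch below treats the univariate normal case, where the estimates are the sample mean and the sample variance with divisor N. The data values are invented, and the helper names are hypothetical. The negative log-likelihood is evaluated at a few parameter values to show that the closed-form estimates give the smallest value among them.

```python
import math

def normal_mle(samples):
    """Maximum likelihood estimates of the mean and variance of a normal density."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n  # divisor N, not N - 1
    return mu, var

def neg_log_likelihood(samples, mu, var):
    """J(theta) = -ln L(theta) for independent samples from N(mu, var)."""
    n = len(samples)
    return (0.5 * n * math.log(2.0 * math.pi * var)
            + sum((x - mu) ** 2 for x in samples) / (2.0 * var))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
mu_hat, var_hat = normal_mle(data)

print(mu_hat, var_hat)
print(neg_log_likelihood(data, mu_hat, var_hat))
print(neg_log_likelihood(data, mu_hat + 0.5, var_hat))
print(neg_log_likelihood(data, mu_hat, var_hat + 1.0))
```

Note that the maximum likelihood variance estimate is biased for small samples; the unbiased estimate would divide by N − 1 instead.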

Estimation of Probability Densities
• Nonparametric Methods
• “Nonparametric methods are more general methods of probability density estimation that are based on existing data, but without an assumption about a functional form of the probability density function.”
• Nonparametric techniques:
• Histogram
• Kernel-based method
• k-nearest neighbors
• Nearest neighbors

Estimation of Probability Densities
• Nonparametric Methods
• General Idea
• Determine an estimate of a true probability density p(x) based on the available limited-size samples
• The probability that a new pattern x will fall inside a region R: $P = \int_{R} p(x')\, dx'$
• Approximation of the probability for a small region and for continuous p(x), with almost the same values within a region R: $P \approx p(x)\, V$, where V is the volume of R

Estimation of Probability Densities
• Nonparametric Methods
• General Idea
• The probability that, out of N sample patterns, k of them will fall in a region R: $P_k = \binom{N}{k} P^k (1 - P)^{N-k}$
• Estimate of the probability P: $\hat{P} = k / N$
• Approximation for a probability density function for a given pattern x: $\hat{p}(x) = \dfrac{k}{N V}$

Estimation of Probability Densities
• Nonparametric Methods
• Kernel-based Method and Parzen Window
• Kernel-based method is based on fixing around a pattern vector x a region R (and thus a region volume V ) and counting a number k of given training patterns falling in this region by using a special kernel function associated with the region.
• Such a kernel function is also called a Parzen window

Estimation of Probability Densities
• Nonparametric Methods
• Hypercube-type Parzen window
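• Volume of the hypercube with edge length h: $V = h^n$
• Kernel (window) function: $\varphi(y) = 1$ if $|y_j| \le 1/2$ for all j = 1, 2, …, n, and $\varphi(y) = 0$ otherwise
• Total number of patterns falling within the hypercube centered at x: $k = \sum_{i=1}^{N} \varphi\!\left(\dfrac{x - x_i}{h}\right)$
• The estimate of the probability density function: $\hat{p}(x) = \dfrac{1}{N h^n} \sum_{i=1}^{N} \varphi\!\left(\dfrac{x - x_i}{h}\right)$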

Estimation of Probability Densities
• Nonparametric Methods
• Smooth estimate of the probability density function
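• A kernel function must satisfy two conditions: $\varphi(y) \ge 0$
• and $\int \varphi(y)\, dy = 1$
• For example, the radial symmetric multivariate Gaussian (normal) kernel: $\varphi(y) = \dfrac{1}{(2\pi)^{n/2}} \exp\!\left(-\dfrac{y^T y}{2}\right)$
• The estimate of the probability density function: $\hat{p}(x) = \dfrac{1}{N} \sum_{i=1}^{N} \dfrac{1}{(2\pi)^{n/2} h^n} \exp\!\left(-\dfrac{\|x - x_i\|^2}{2h^2}\right)$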
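The kernel-based estimate can be sketched with one generic function and two interchangeable kernels. This is an illustration under assumptions: the window width `h`, the one-dimensional sample, and all function names are invented, and the estimate follows the standard Parzen form, a sum of kernel values divided by N times the window volume.

```python
import math

def hypercube_kernel(u):
    """1 inside the unit hypercube centered at the origin, 0 outside."""
    return 1.0 if all(abs(ui) <= 0.5 for ui in u) else 0.0

def gaussian_kernel(u):
    """Radially symmetric standard normal kernel in len(u) dimensions."""
    n = len(u)
    return math.exp(-0.5 * sum(ui * ui for ui in u)) / (2.0 * math.pi) ** (n / 2.0)

def parzen_density(x, patterns, h, kernel):
    """p_hat(x) = (1 / (N h^n)) * sum_i kernel((x - x_i) / h)."""
    n = len(x)
    volume = h ** n
    total = sum(kernel([(xi - pi) / h for xi, pi in zip(x, p)]) for p in patterns)
    return total / (len(patterns) * volume)

patterns = [(0.0,), (0.2,), (0.4,), (2.0,)]  # made-up 1-D training patterns

print(parzen_density((0.2,), patterns, 0.5, hypercube_kernel))
print(parzen_density((0.2,), patterns, 0.5, gaussian_kernel))
```

The hypercube kernel simply counts patterns inside the window, which gives a piecewise constant estimate; the Gaussian kernel yields a smooth estimate from the same data. A larger `h` smooths more and a smaller `h` follows the training patterns more closely.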

Estimation of Probability Densities
• Nonparametric Methods
• Smooth estimate of the probability density function
• The estimate of the class-dependent p(x|ck) probability density:
• The estimate of the class-dependent p(x|ck) probability density for the Gaussian kernel:

Estimation of Probability Densities
• Nonparametric Methods
• Design issues
• The selection of a kernel function:
• Parzen window, Gaussian kernel, etc.
• The selection of a smoothing parameter
• The generalization ability of the kernel-based density estimation depends on the training set and on smoothing parameters

Estimation of Probability Densities
• Nonparametric Methods
• K-nearest Neighbors
• “A method of probability density estimation with variable size regions”
• First, a small n-dimensional sphere is located in the pattern space centered at the point x.
• Second, a radius of this sphere is extended until the sphere contains exactly the fixed number k of patterns from a given training set.
• Then an estimate of the probability density for x is computed as $\hat{p}(x) = \dfrac{k}{N V}$, where V is the volume of the resulting sphere
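In one dimension the "sphere" is an interval, which makes the k-nearest neighbors density estimate easy to sketch. The sample values and the function name are invented for illustration.

```python
def knn_density_1d(x, samples, k):
    """k-NN density estimate k / (N * V) for scalar data.

    The region is the interval [x - r, x + r], where r is the distance
    from x to its k-th nearest sample, so its volume (length) is 2r.
    """
    distances = sorted(abs(x - s) for s in samples)
    radius = distances[k - 1]
    volume = 2.0 * radius  # zero if k samples coincide with x; not handled here
    return k / (len(samples) * volume)

samples = [0.0, 0.5, 1.0, 1.5, 4.0]  # made-up training patterns

print(knn_density_1d(1.0, samples, 3))  # dense area: small region, high estimate
print(knn_density_1d(4.0, samples, 3))  # sparse area: large region, low estimate
```

The region grows in sparse areas and shrinks in dense ones, which is what distinguishes this method from the fixed-size window of the kernel-based approach.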

Estimation of Probability Densities
• Nonparametric Methods
• K-nearest Neighbors Classification Rule
• First, for a given x, the first k-nearest neighbors from a training set should be found (regardless of a class label) based on a defined pattern distance measure.
• Second, among the selected k nearest neighbors, numbers ni of patterns belonging to each class ci are computed.
• Then, the predicted class cj assigned to x corresponds to a class for which nj is the largest.

Estimation of Probability Densities
• Nonparametric Methods
• Nearest Neighbors Classification Rule
• “The simple version of the k-nearest neighbors classification is for a number of neighbors k equal to one”
• Algorithm
• Given: A training set Ttra of N patterns x1, x2, …, xN labeled by l classes. A new pattern x.
• Compute for a given x the nearest neighbor xj from the whole training set based on the defined pattern distance measure distance(x, xi).
• Assign to x the class cj of its nearest neighbor xj.
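Both the k-nearest neighbors rule and its k = 1 special case fit in one short function. The sketch assumes squared Euclidean distance as the pattern distance measure and uses an invented two-class training set; `knn_classify` is a hypothetical name.

```python
from collections import Counter

def knn_classify(x, training, k):
    """Majority vote among the k training patterns closest to x.

    training is a list of (pattern, label) pairs. With k = 1 this is
    the nearest neighbor rule.
    """
    def dist_sq(item):
        pattern, _ = item
        return sum((a - b) ** 2 for a, b in zip(x, pattern))

    neighbors = sorted(training, key=dist_sq)[:k]
    votes = Counter(label for _, label in neighbors)
    # On a tied vote, the label encountered first among the neighbors wins.
    return votes.most_common(1)[0][0]

training = [
    ((0.0, 0.0), "c1"), ((0.0, 1.0), "c1"), ((1.0, 0.0), "c1"),
    ((5.0, 5.0), "c2"), ((5.0, 6.0), "c2"), ((6.0, 5.0), "c2"),
]

print(knn_classify((1.0, 1.0), training, 3))
print(knn_classify((4.0, 4.0), training, 1))
```

Choosing an odd k in two-class problems avoids tied votes.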

Estimation of Probability Densities
• Semiparametric Methods
• “Combination of parametric and nonparametric methods”
• Two semiparametric methods
• Functional approximation
• Mixture models (mixtures of probability densities)
• It is able to precisely fit component functions locally to specific regions of a feature space, based on discoveries about probability distributions and their modalities from the existing data

Estimation of Probability Densities
• Semiparametric Methods
• Functional Approximation
• Approximation of density by the linear combination of m basis functions φi(x) with coefficients ai: $\hat{p}(x) = \sum_{i=1}^{m} a_i\, \varphi_i(x)$
• Using a symmetric radial basis function

Estimation of Probability Densities
• Semiparametric Methods
• Functional Approximation
• Gaussian radial function: “The most commonly used basis function”
• Optimization criterion for the functional approximation of density
• Optimal estimates for parameters:

Estimation of Probability Densities
• Semiparametric Methods
• The algorithm for functional approximation
• Given: A training set Ttra of N patterns x1, x2, …, xN. The m orthonormal radial basis functions φi(x) (i = 1, 2, …, m), along with their parameters.
• Compute the estimates of unknown parameters
• Form the model of the probability density as a functional approximation

Estimation of Probability Densities
• Semiparametric Methods
• Mixture Models (Mixtures of Probability Densities)
• “These models are based on linear parametric combination of known probability density functions (for example, normal densities) localized in certain regions of data”
• The linear mixture distribution
• Simplified version:

Estimation of Probability Densities
• Distance Between Probability Densities and the Kullback-Leibler Distance
• Distance
• “We can define distance between two densities, with true density p(x) and its approximate estimate $\hat{p}(x)$”
• Kullback-Leibler distance: $d(p, \hat{p}) = -\int p(x) \ln \dfrac{\hat{p}(x)}{p(x)}\, dx$
Probabilistic Neural Network
• Probabilistic Neural Network
• “The PNN is a hardware implementation of the kernel-based method of density estimation and Bayesian optimal classification (providing minimization of the average probability of the classification error)”
• Optimal Bayes’ classification rule
• Kernel-based estimation of a probability density function

Probabilistic Neural Network
• Topology

Probabilistic Neural Network
• Details
• An input layer (weightless) consists of n neurons (units), each receiving one element xi (i = 1,2,…, n) of the n-dimensional input pattern vector x.
• A pattern layer consists of N neurons (units, nodes), each representing one reference pattern from the training set Ttra .
• The transfer function of the pattern layer neuron implements a kernel function (a Parzen window)

Probabilistic Neural Network
• Details
• The weightless second hidden layer is the summation layer. The number of neurons in the summation layer is equal to the number of classes l.
• The output activation function of the summation layer neuron is generally equal to but may be modified for different kernel functions.
• The output layer is the classification decision layer that implements Bayes’ classification rule by selecting the largest value and thus decides a class cj for the pattern x

Probabilistic Neural Network
• Pattern Processing
• “Processing of patterns by the already-designed PNN network is performed in the feedforward manner. The input pattern is presented to the network and processed forward by each layer. The resulting output is the predicted class”
• PNN with the Radial Gaussian Kernel
• Kernel function:
• Transfer function:
• Output activation function:
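The feedforward processing of a PNN with a radial Gaussian kernel can be imitated in software. The sketch below is not the network formulation from the slides: it assumes a single smoothing parameter `sigma`, estimates the priors as class frequencies in the training set, and uses an invented training set and a hypothetical function name.

```python
import math

def pnn_classify(x, training, sigma):
    """Classify x with a Gaussian-kernel PNN; returns (label, per-class scores)."""
    n = len(x)
    norm = (2.0 * math.pi) ** (n / 2.0) * sigma ** n
    sums, counts = {}, {}
    # Pattern layer: one unit per training pattern evaluates the kernel.
    for pattern, label in training:
        dist_sq = sum((a - b) ** 2 for a, b in zip(x, pattern))
        activation = math.exp(-dist_sq / (2.0 * sigma ** 2))
        # Summation layer: one unit per class accumulates its pattern units.
        sums[label] = sums.get(label, 0.0) + activation
        counts[label] = counts.get(label, 0) + 1
    total = len(training)
    # Score = estimated prior (N_k / N) times estimated density p_hat(x | c_k).
    scores = {c: (counts[c] / total) * sums[c] / (counts[c] * norm) for c in sums}
    # Decision layer: select the class with the largest score.
    return max(scores, key=scores.get), scores

training = [
    ((0.0, 0.0), "c1"), ((0.0, 1.0), "c1"), ((1.0, 0.0), "c1"),
    ((5.0, 5.0), "c2"), ((5.0, 6.0), "c2"), ((6.0, 5.0), "c2"),
]

print(pnn_classify((1.0, 1.0), training, 1.0)[0])
print(pnn_classify((5.5, 5.5), training, 1.0)[0])
```

There is no iterative training: "designing" the network amounts to storing the training patterns and choosing the smoothing parameter.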

Probabilistic Neural Network
• PNN with the Radial Gaussian Normal Kernel and Normalized Patterns
• Transfer function:
• Normalization of patterns allows for a simpler architecture of the pattern-layer neurons, containing here also input weights and an exponential output activation function.
• The transfer function of a pattern neuron can be divided into a neuron’s transfer function and an output activation function
• The pattern-neuron output activation function:

Constraints in Classifier Design
• Problems
• Will a classifier guarantee minimization of the average probability of the classification error?
• Does a training set well represent patterns generated by a physical phenomenon?
• Are patterns drawn according to the characteristic of underlying phenomenon probability density?
• Is the average probability of a classification error difficult to calculate?

Constraints in Classifier Design
• Suboptimal solutions of Bayesian classifier design
• The estimation of class conditional probabilities is based on a limited sample
• The samples are frequently collected randomly, and not by use of a well-planned experimental procedure

REGRESSION
• Data Models
• Simple Linear Regression Analysis
• Multiple Regression
• General Least Squares and Multiple Regression
• Assessing the Quality of the Multiple Regression Model

Data Models
• Mathematical models
• “They are useful approximate representations of phenomena that generate data and may be used for prediction, classification, compression, or control design.”
• Black-box models
• Mathematical models obtained by processing existing data without using laws of physics governing data-generating phenomena
• Regression analysis
• Data analysis and model design are based on a sample from a given population

Data Models
• Categories of regression models
• Simple linear regression
• Multiple linear regression
• Neural network-based linear regression
• Polynomial regression
• Logistic regression
• Log-linear regression
• Local piecewise linear regression
• Nonlinear regression (with a nonlinear model)
• Neural network-based nonlinear regression

Data Models
• Static and dynamic models
• A static model produces outcomes based only on the current input (no internal memory).
• A dynamic model produces outcomes based on the current input and the past history of the model behavior (internal memory)
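The distinction can be illustrated with a minimal sketch. The linear static model and the first-order recursion used for the dynamic model are assumptions chosen for illustration, as are all names and parameter values; they do not come from the slides.

```python
def static_model(x, a=2.0, b=1.0):
    """Static model: the output depends only on the current input."""
    return b + a * x


class DynamicModel:
    """Dynamic model (hypothetical): y[k] = alpha * y[k-1] + (1 - alpha) * x[k]."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.y_prev = 0.0  # internal memory: the previous output

    def step(self, x):
        y = self.alpha * self.y_prev + (1.0 - self.alpha) * x
        self.y_prev = y  # remember the output for the next call
        return y


inputs = (1.0, 1.0, 1.0)
model = DynamicModel(alpha=0.5)
static_outputs = [static_model(x) for x in inputs]
dynamic_outputs = [model.step(x) for x in inputs]
```

For the same repeated input, the static model returns the same output every time, while the dynamic model's output changes from step to step because it depends on its own history.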

Data Models
• Data gathering
• Random sample from a certain population
• The experimental data set of N pairs (xi, yi) is named Torig

Data Models
• Regression analysis
• “A statistical method used to discover the relationship between variables and to design a data model that can be used to predict variable values based on other variables”

Data Models
• Regression analysis
• A simple linear regression
• The goal is to find the linear relationship between two variables, x and y, i.e., a line equation y = b + ax that best fits the given data and can be used to predict values of y
• This modeling line is called the regression line of y on x
• The equation of that line is called a regression equation (regression model)
• Typical linear regression analysis provides a prediction of a dependent variable y based on an independent variable x

Data Models
• Visualization of Regression
• Scatter plot for height versus weight data

Simple Linear Regression Analysis
• Sample data and Regression model

Simple Linear Regression Analysis
• Assumptions
• The observations yi (i = 1, …, N) are random samples and are mutually independent.
• The regression error terms (the differences between the true and the predicted values) are also mutually independent and identically distributed (normal with zero mean and constant variance)
• The distribution of the error term is independent of the joint distribution of explanatory variables. It is also assumed that unknown parameters of regression models are constants

Simple Linear Regression Analysis
• Simple Linear Regression Analysis
• Evaluation of the basic statistical characteristics of the data
• Estimation of the optimal parameters of a linear model
• Assessment of model quality and of the model's ability to generalize, i.e., to predict the outcome for new data

Simple Linear Regression Analysis
• Model Structure
• Nonlinear data:
• Generally, a function f(x) could be nonlinear in x:
• Linear form:

Simple Linear Regression Analysis
• Regression Error (residual error)
• The difference between the actual value yi and the predicted value yi,est: ei = yi - yi,est

Simple Linear Regression Analysis
• Performance Criterion – Sum of Squared Errors.
• The sum of squared errors performance criterion for multiple regression
• The minimization technique in regression uses the sum of squared errors as its criterion; it is known as the method of least squared errors (LSE) or, in short, the method of least squares
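As a minimal sketch of this criterion for the line y = b + ax, the function below sums the squared residuals over a data set. The function name and the three data points are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def sum_of_squared_errors(xs, ys, a, b):
    """J(a, b) = sum over i of (y_i - (b + a * x_i)) ** 2."""
    return sum((y - (b + a * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

j_perfect = sum_of_squared_errors(xs, ys, a=2.0, b=0.0)  # residuals 0, 0, 0
j_worse = sum_of_squared_errors(xs, ys, a=1.0, b=0.0)    # residuals 1, 2, 3
```

The least-squares method chooses the parameters that make this sum as small as possible; here the slope a = 2 gives a criterion of zero, while a = 1 gives 1 + 4 + 9 = 14.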

Simple Linear Regression Analysis
• Basic Statistical Characteristics of Data
• The mean of N samples
• The variance
• The covariance
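These three characteristics can be computed as in the sketch below. The slide's own formulas are not reproduced in this transcript, so the unbiased (N - 1) normalization used here is an assumption; some texts divide by N instead. Function names and data are illustrative.

```python
def mean(values):
    """Arithmetic mean of N samples."""
    return sum(values) / len(values)


def variance(values):
    """Sample variance with the (N - 1) normalization (an assumption)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def covariance(xs, ys):
    """Sample covariance of x and y with the (N - 1) normalization."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
x_mean = mean(xs)
x_var = variance(xs)
xy_cov = covariance(xs, ys)
```

With either normalization, the slope of the least-squares line equals the covariance of x and y divided by the variance of x, because the normalizing constants cancel.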

Simple Linear Regression Analysis
• Sum of Squared Variations in y Caused by the Regression Model
• The total sum of squared variations in y
• These formulas are used to define important regression measures (for example, the correlation coefficient)
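The slide's formulas are missing from this transcript. The standard definitions, written with symbol names that are assumptions of this sketch (the mean of y as \bar{y}, the predicted value as y_{i,est}), are:

```latex
\begin{aligned}
SS_{\text{total}} &= \sum_{i=1}^{N} (y_i - \bar{y})^2
  && \text{total variation in } y \\
SS_{\text{reg}} &= \sum_{i=1}^{N} (y_{i,\text{est}} - \bar{y})^2
  && \text{variation explained by the regression model} \\
SS_{\text{err}} &= \sum_{i=1}^{N} (y_i - y_{i,\text{est}})^2
  && \text{unexplained (residual) variation} \\
SS_{\text{total}} &= SS_{\text{reg}} + SS_{\text{err}}
\end{aligned}
```

The last identity holds for a least-squares fit of a linear model that includes the constant term b.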

Simple Linear Regression Analysis
• Computing Optimal Values of the Regression Model Parameters
• The optimal model parameter values have to be computed based on the given data set and the defined performance criterion
• Methods for estimation of optimal model parameter values
• The analytical offline method
• The analytical recursive offline method
• Iterative search for the optimal model parameters
• Neural network-based regression
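To make the iterative option concrete, here is a minimal sketch that searches for a and b by gradient descent on the mean squared error. Gradient descent is one possible iterative search, not necessarily the one the authors have in mind; the learning rate, step count, and data are assumptions.

```python
def fit_by_gradient_descent(xs, ys, learning_rate=0.05, n_steps=5000):
    """Iteratively minimize the mean squared error of y = b + a * x."""
    a, b = 0.0, 0.0  # arbitrary starting point
    n = len(xs)
    for _ in range(n_steps):
        errors = [y - (b + a * x) for x, y in zip(xs, ys)]
        # Gradient of (1/n) * sum(errors ** 2) with respect to a and b.
        grad_a = -2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = -2.0 * sum(errors) / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a_est, b_est = fit_by_gradient_descent(xs, ys)
```

For a problem this small the analytical solution is cheaper and exact; iterative search becomes relevant when the model is nonlinear in its parameters or the data arrive incrementally.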

Simple Linear Regression Analysis
• Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model
• The general linear model structure
• The performance criterion
• The performance curve for the model y = ax (i.e., with b = 0)

Simple Linear Regression Analysis
• Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model
• The optimal parameters
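The formulas for the optimal parameters are not preserved in this transcript. A standard derivation, assuming the sum-of-squared-errors criterion J(a, b) and writing the sample means as \bar{x} and \bar{y}, sets both partial derivatives to zero:

```latex
\begin{aligned}
J(a, b) &= \sum_{i=1}^{N} \bigl(y_i - (b + a x_i)\bigr)^2 \\
\frac{\partial J}{\partial a} &= -2 \sum_{i=1}^{N} x_i \bigl(y_i - b - a x_i\bigr) = 0,
\qquad
\frac{\partial J}{\partial b} = -2 \sum_{i=1}^{N} \bigl(y_i - b - a x_i\bigr) = 0 \\
a &= \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - a\,\bar{x}
\end{aligned}
```

For the one-parameter model y = ax (b = 0), the same argument gives a = \sum x_i y_i / \sum x_i^2, the minimum of the parabola J(a).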

Simple Linear Regression Analysis
• Procedure for simple linear regression
• Given: The number N of experimental observations, and the set of the N experimental data points { (xi, yi), i = 1, 2, …, N }
• Compute the statistical characteristics of the data
• Compute the estimates of the optimal model parameters using the least-squares equations
• Assess the quality of the regression model, i.e., how well the model fits the data. Compute:
• Standard error of estimate
• Correlation coefficient r
• Coefficient of determination r²
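The procedure can be sketched end to end as follows. The data points are hypothetical (they are not the slide's example data), the function name is illustrative, and the standard error uses N - 2 degrees of freedom because two parameters are estimated.

```python
import math


def simple_linear_regression(xs, ys):
    """Fit y = b + a * x by least squares and report quality measures."""
    n = len(xs)
    # Step 1: statistical characteristics of the data.
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Step 2: optimal parameters in the least-squares sense.
    a = s_xy / s_xx
    b = y_mean - a * x_mean
    # Step 3: model quality.
    sse = sum((y - (b + a * x)) ** 2 for x, y in zip(xs, ys))
    std_error = math.sqrt(sse / (n - 2))
    r = s_xy / math.sqrt(s_xx * s_yy)
    return {"a": a, "b": b, "std_error": std_error, "r": r, "r2": r * r}


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 4.0]
result = simple_linear_regression(xs, ys)
```

For these four points the fitted line is y = 0.5 + 0.8x with r = 0.8, so r² = 0.64: the line explains 64% of the variation in y.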

Simple Linear Regression Analysis

Example

• Sample of four data points
• Resulting regression line
• y = 0.9 + 0.56x

Simple Linear Regression Analysis
• Optimal Parameter Values in the Minimum Least Squares Sense
• Required conditions for a valid linear regression
• The error term e = y - (b + ax) is normally distributed
• The error variance is the same for all values of x
• The errors are independent of each other

Simple Linear Regression Analysis
• Quality of the Linear Regression Model and Linear Correlation Analysis
• Assessment of model quality
• The resulting correlation coefficient can be used as a measure of how well the trends in the predicted values follow the trends in the training data
• The coefficient of determination can be used to measure how well the regression line fits the data points

Simple Linear Regression Analysis
• Correlation coefficient
• Coefficient of determination
• The percent of variation in the dependent variable y that can be explained by the regression equation,
• the explained variation in y divided by the total variation, or
• the square of r (correlation coefficient)
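The formulas behind these two measures are missing from the transcript. In their standard form, with \bar{x} and \bar{y} denoting sample means and y_{i,est} the predicted value (notation assumed for this sketch), they read:

```latex
\begin{aligned}
r &= \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \; \sum_{i=1}^{N} (y_i - \bar{y})^2}},
\qquad -1 \le r \le 1 \\
r^2 &= \frac{\sum_{i=1}^{N} (y_{i,\text{est}} - \bar{y})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
     = 1 - \frac{\sum_{i=1}^{N} (y_i - y_{i,\text{est}})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
\end{aligned}
```

The first expression for r² is the explained variation divided by the total variation; the second shows it equivalently as one minus the unexplained fraction.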

Simple Linear Regression Analysis
• Coefficient of determination
• Explained and unexplained variation in y

Simple Linear Regression Analysis
• Coefficient of determination
• Example
• If the correlation coefficient has the value r = 0.9327, then the coefficient of determination is r² = 0.8700. This means that 87% of the total variation in y can be explained by the linear relationship between x and y, as described by the optimal regression model of the data. The remaining 13% of the total variation in y is unexplained.
• The calculation of the coefficient of determination

Simple Linear Regression Analysis
• Matrix Version of Simple Linear Regression Based on Least Squares Method
• The matrix form of the model description (the estimate of y) for all N experimental data points
• The regression error

Simple Linear Regression Analysis
• Matrix Version of Simple Linear Regression Based on Least Squares Method
• The performance criterion:
• Optimal parameters:
• The value of the criterion for the optimal parameter vector:
• The regression error for the model with the optimal parameter vector:
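The four expressions on this slide are not preserved in the transcript. A standard reconstruction is given below; the notation is an assumption: X is the N x 2 data matrix with rows [1, x_i], \mathbf{y} is the vector of observations, and \boldsymbol{\theta} = [b, a]^T is the parameter vector.

```latex
\begin{aligned}
J(\boldsymbol{\theta}) &= (\mathbf{y} - X\boldsymbol{\theta})^{T} (\mathbf{y} - X\boldsymbol{\theta}) \\
\hat{\boldsymbol{\theta}} &= (X^{T} X)^{-1} X^{T} \mathbf{y} \\
J(\hat{\boldsymbol{\theta}}) &= \mathbf{y}^{T} \mathbf{y} - \mathbf{y}^{T} X \hat{\boldsymbol{\theta}} \\
\hat{\mathbf{e}} &= \mathbf{y} - X \hat{\boldsymbol{\theta}}
\end{aligned}
```

The third line follows from the second because the optimal residual is orthogonal to the columns of X, that is, X^{T} \hat{\mathbf{e}} = 0.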

Simple Linear Regression Analysis
• Matrix Version of Simple Linear Regression Based on Least Squares Method
• Example: let us consider again the dataset shown in the following table
• y = 0.56x + 0.9

Multiple Regression
• Definition
• Multiple regression analysis is the statistical technique of exploring the relation (association) between a set of n independent variables and the dependent variable y (or, more generally, several dependent variables) whose variability they are used to explain
• Linear multiple regression model
• Linear multiple regression model using vector notation
• This regression model is represented by a hyperplane in (n + 1)-dimensional space.
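The two model equations are missing from the transcript. Using the parameter names a_0, a_1, ..., a_n that appear later in the deck, and assuming an augmented input vector with a leading 1 for the constant term, the standard forms are:

```latex
\begin{aligned}
y &= a_0 + a_1 x_1 + a_2 x_2 + \dots + a_n x_n \\
y &= \mathbf{a}^{T} \mathbf{x},
\qquad
\mathbf{a} = [a_0, a_1, \dots, a_n]^{T},
\quad
\mathbf{x} = [1, x_1, \dots, x_n]^{T}
\end{aligned}
```

With n = 1 this reduces to the regression line y = b + ax, where a_0 plays the role of b and a_1 the role of a.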

Multiple Regression
• Geometrical Interpretation: Regression Errors
• The goal of multiple regression is to find a hyperplane in the (n + 1)-dimensional space that will best fit the data
• The performance criterion
• The error variance and standard error of the estimate

Multiple Regression
• Degrees of Freedom
• The denominator N – n – 1 in the previous equation tells us that in multiple regression with n independent variables, the standard error has N – n – 1 degrees of freedom
• The number of degrees of freedom has been reduced from N by n + 1 because the n + 1 numerical parameters a0, a1, a2, …, an of the regression model have been estimated from the data
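A minimal sketch of the standard error of the estimate with N – n – 1 degrees of freedom follows. The function name and the residual values are hypothetical; the residuals would normally come from a fitted model.

```python
import math


def standard_error_of_estimate(residuals, n_independent):
    """sqrt(SSE / (N - n - 1)) for a model with n independent variables."""
    dof = len(residuals) - n_independent - 1  # degrees of freedom
    if dof <= 0:
        raise ValueError("need more data points than estimated parameters")
    sse = sum(e * e for e in residuals)
    return math.sqrt(sse / dof)


# Hypothetical residuals from a model with n = 2 independent variables.
residuals = [0.1, -0.2, 0.1, 0.0, 0.0]
std_error = standard_error_of_estimate(residuals, n_independent=2)
```

Here N = 5 and n = 2, so the sum of squared residuals (0.06) is divided by 2 rather than by 5.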

General Least Squares and Multiple Regression
• General model description in function form
• Data model
• Performance criterion
• Regression error

General Least Squares and Multiple Regression
• General model description in matrix form
• Data model
• Performance criterion
• Optimal parameters

General Least Squares and Multiple Regression
• Practical, Numerically Stable Computation of the Optimal Model Parameters
• Problem
• “The solution for the optimal least-squares parameters is almost never computed from the equation due to its poor numerical performance in cases when the matrix [XᵀX] (the covariance matrix) is ill conditioned”
• Solution: various matrix decomposition methods
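As one illustration of a decomposition-based approach, the sketch below solves the least-squares problem through a QR factorization X = QR (built with modified Gram-Schmidt) and back substitution on R a = Qᵀy, so XᵀX is never formed or inverted. The slides do not say which decomposition the authors prefer; QR is an assumption here, and the function name and data are hypothetical. In practice a library routine such as `numpy.linalg.lstsq` would normally be used instead of hand-written code.

```python
def qr_least_squares(X, y):
    """Minimize ||y - X a||^2 using X = QR instead of the normal equations."""
    n_rows, n_cols = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n_rows)] for j in range(n_cols)]
    Q = []  # orthonormal columns, built one at a time
    R = [[0.0] * n_cols for _ in range(n_cols)]
    for j in range(n_cols):
        v = cols[j][:]
        for k in range(j):
            # Modified Gram-Schmidt: project the partially reduced vector.
            R[k][j] = sum(q * vi for q, vi in zip(Q[k], v))
            v = [vi - R[k][j] * q for vi, q in zip(v, Q[k])]
        R[j][j] = sum(vi * vi for vi in v) ** 0.5
        if R[j][j] == 0.0:
            raise ValueError("columns of X are linearly dependent")
        Q.append([vi / R[j][j] for vi in v])
    # Solve the triangular system R a = Q^T y by back substitution.
    qty = [sum(q * yi for q, yi in zip(Q[k], y)) for k in range(n_cols)]
    a = [0.0] * n_cols
    for j in reversed(range(n_cols)):
        s = sum(R[j][k] * a[k] for k in range(j + 1, n_cols))
        a[j] = (qty[j] - s) / R[j][j]
    return a


# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
# Each row of X is [1, x1, x2]; the leading 1 carries the constant term.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0], [1.0, 2.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
params = qr_least_squares(X, y)
```

Working with R instead of XᵀX matters because forming XᵀX squares the condition number of the problem, which is the source of the poor numerical behavior the slide describes.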

Assessing the Quality of the Multiple Regression Model
• The Coefficient of Multiple Determination, R²
• “The percent of the variance in the dependent variable that can be explained by all of the independent variables taken together.”
• Adjusted R² uses the number of model parameters (the design parameters plus the constant term) and the number of data points N to correct this statistic in situations when unnecessary parameters are used in the model structure
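The slide does not show the adjustment formula. The sketch below uses the common definition, adjusted R² = 1 - (1 - R²)(N - 1)/(N - n - 1), which is an assumption about what the authors intend; the function name and input values are hypothetical.

```python
def adjusted_r2(r2, n_points, n_independent):
    """Adjusted R^2 for a model with n independent variables plus a constant."""
    dof = n_points - n_independent - 1
    if dof <= 0:
        raise ValueError("need more data points than estimated parameters")
    return 1.0 - (1.0 - r2) * (n_points - 1) / dof


# Hypothetical values: R^2 = 0.87 from N = 20 points and n = 3 variables.
r2_adj = adjusted_r2(0.87, n_points=20, n_independent=3)
```

Adding a variable never lowers R², but it lowers N - n - 1, so the adjusted value rises only when the new variable reduces the residual error enough to pay for the lost degree of freedom.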

Assessing the Quality of the Multiple Regression Model
• Cp Statistic
• It is used to compare multiple regression models
• When comparing alternative regression models, the designer aims to choose models whose value of Cp is close to or below (n + 1)
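The formula for the statistic is not reproduced in the transcript. A minimal sketch under the usual definition of Mallows' Cp, Cp = SSE_p / s² - N + 2p with p = n + 1 parameters in the candidate model and s² the error variance estimated from the full model, is shown below. The definition, the function name, and the numbers are all assumptions for illustration.

```python
def mallows_cp(sse_candidate, full_model_error_variance, n_points, n_independent):
    """Mallows' Cp for a candidate model with n independent variables."""
    p = n_independent + 1  # number of parameters, including the constant
    return sse_candidate / full_model_error_variance - n_points + 2 * p


# Hypothetical candidate: SSE = 18.0, full-model variance 1.0, N = 20, n = 2.
cp = mallows_cp(18.0, 1.0, n_points=20, n_independent=2)
```

For this candidate Cp = 4.0 while n + 1 = 3, so the value is close to the target; a candidate with Cp far above n + 1 would suggest that it omits a useful variable.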

Assessing the Quality of the Multiple Regression Model
• Multiple Correlation
• A value of R can be found as the positive square root of R² (the coefficient of multiple determination)
• It is a measure of the strength of the linear relationship between the dependent variable y and the set of independent variables x1, x2, …, xn.
• A value of R close to 1 indicates that the fit is very good
• A value near zero indicates that the model is not a good approximation of the data and cannot be efficiently used for prediction

Assessing the Quality of the Multiple Regression Model

Example

• “Let us consider a multiple linear regression analysis for the data set containing N = 4 cases, composed with one dependent variable y and two independent variables x1 and x2”
• Three-dimensional data

Assessing the Quality of the Multiple Regression Model

Example

• The scatter plot of data points in three-dimensional space (x1, x2, y)

Assessing the Quality of the Multiple Regression Model

Example

• The data matrix
• The optimal model parameters

Assessing the Quality of the Multiple Regression Model

Example

• The optimal model:
• y = 3.1 + 0.9x1 + 0.56x2
• The optimal regression model in (x1, x2, y) space:

Assessing the Quality of the Multiple Regression Model

Example

• Multiple regression: regression plane model and scatter plot

Assessing the Quality of the Multiple Regression Model

Example

• The residuals (errors)
• The criterion value for the optimal parameters: 0.016

References
• Bishop, C.M. 1995. Neural Networks for Pattern Recognition. Oxford University Press.
• Cios, K.J., Pedrycz, W., and Swiniarski, R. 1998. Data Mining Methods for Knowledge Discovery. Kluwer.
• Draper, N.R., and Smith, H. Applied Regression Analysis. Wiley Series in Probability and Statistics.
• Duda, R.O., Hart, P.E., and Stork, D.G. 2001. Pattern Classification. Wiley.
• Myers, R.H. 1986. Classical and Modern Regression with Applications. Boston, MA: Duxbury Press.
