### Chapter 11 Supervised Learning: STATISTICAL METHODS

Cios / Pedrycz / Swiniarski / Kurgan

Outline

- Bayesian Methods
- Basics of Bayesian Methods
- Bayesian Classification – General Case
- Classification that Minimizes Risk
- Decision Regions and Probability of Errors
- Discriminant Functions
- Estimation of Probability Densities
- Probabilistic Neural Network
- Constraints in Classifier Design

Outline

- Regression
- Data Models
- Simple Linear Regression
- Multiple Regression
- General Least Squares and Multiple Regression
- Assessing Quality of the Multiple Regression Model

Bayesian Methods

Statistical processing based on the Bayes decision theory is a fundamental technique for pattern recognition and classification.

The Bayes decision theory provides a framework for statistical methods for classifying patterns into classes based on probabilities of patterns and their features.

Basics of Bayesian Methods

Let us assume an experiment involving recognition of two kinds of birds: an eagle and a hawk.

States of nature: C = {“an eagle”, “a hawk”}

Values of C = {c1, c2} = {“an eagle”, “a hawk”}

We may assume that, among a large number N of prior observations, $n_{\text{eagle}}$ of them belonged to class c1 (“an eagle”) and $n_{\text{hawk}}$ belonged to class c2 (“a hawk”), with $n_{\text{eagle}} + n_{\text{hawk}} = N$.

Basics of Bayesian Methods

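- A priori (prior) probability P(ci):
- P(ci) denotes the (unconditional) probability that an object belongs to class ci, without any further information about the object.
- Estimation of a prior P(ci) from the prior observations: $\hat{P}(c_i) = n_{c_i} / N$, where $n_{c_i}$ is the number of the N observations that belonged to class ci.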

Basics of Bayesian Methods

The a priori probabilities P(c1) and P(c2) represent our initial knowledge (in statistical terms) about how likely it is that an eagle or a hawk may emerge even before a bird physically appears.

- Natural and best decision:

“Assign a bird to a class c1 if P(c1) > P(c2); otherwise, assign a bird to a class c2 ”

- The probability of classification error:

$$P(\text{classification error}) = \begin{cases} P(c_2) & \text{if we decide } C = c_1 \\ P(c_1) & \text{if we decide } C = c_2 \end{cases}$$
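
The prior-only decision can be sketched in a few lines. This is a minimal illustration, not the authors' code: the observation counts (800 eagles, 200 hawks) and the function names are hypothetical.

```python
def estimate_priors(counts):
    """Estimate P(c_i) as the fraction of prior observations in each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def decide_by_prior(priors):
    """Pick the class with the largest prior; the error is the remaining mass."""
    best = max(priors, key=priors.get)
    return best, 1.0 - priors[best]


# Hypothetical counts of prior observations (assumption, not from the slides).
counts = {"eagle": 800, "hawk": 200}
priors = estimate_priors(counts)
label, p_error = decide_by_prior(priors)
print(label, p_error)
```

Without any feature measurement, this rule always answers with the same class, so its error probability equals the total prior of the other class.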

Involving Object Features in Classification

- Feature variable / feature x
- It characterizes an object and allows for better discrimination of one class from another
- We assume it to be a continuous random variable taking values from a given range
- The variability of a random variable x can be expressed in probabilistic terms
- We represent the distribution of a random variable x by the class conditional probability density function (the state conditional probability density function) p(x|ci)

Involving Object Features in Classification

Examples of probability densities

Involving Object Features in Classification

- Probability density function p(x|ci)
- also called the likelihood of a class ci with respect to the value x of a feature variable
- the larger p(x|ci) is, the more likely it is that an object with feature value x belongs to class ci
- Joint probability density function p(ci, x)
- A probability density that an object is in a class ci and has a feature variable value x.
- A posteriori (posterior) probability P(ci|x)
- The conditional probability function P(ci|x) (i = 1, 2), which specifies the probability that the object class is ci given that the measured value of a feature variable is x.

Involving Object Features in Classification

- Bayes’ rule / Bayes’ theorem
- From probability theory (see Appendix B): $p(c_i, x) = P(c_i|x)\,p(x) = p(x|c_i)\,P(c_i)$
- An unconditional probability density function: $p(x) = \sum_{i=1}^{2} p(x|c_i)\,P(c_i)$

Involving Object Features in Classification

- Bayes’ rule: $P(c_i|x) = \dfrac{p(x|c_i)\,P(c_i)}{p(x)}$
- “The conditional probability P(ci|x) can be expressed in terms of the a priori probability function P(ci), together with the class conditional probability density function p(x|ci).”

Involving Object Features in Classification

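- Bayes’ decision rule: assign an object to class c1 if P(c1|x) > P(c2|x); otherwise, assign it to class c2
- $P(\text{classification error} \mid x) = \begin{cases} P(c_2|x) & \text{if we decide } C = c_1 \\ P(c_1|x) & \text{if we decide } C = c_2 \end{cases}$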
- “This statistical classification rule is best in the sense of minimizing the probability of misclassification (the probability of classification error)”
- Bayes’ classification rule guarantees minimization of the average probability of classification error

Involving Object Features in Classification

- Example
- Let us consider a bird classification problem with P(c1) = P(“an eagle”) = 0.8 and P(c2) = P(“a hawk”) = 0.2 and known probability density functions p(x|c1) and p(x|c2).
- Assume that, for a new bird, we have measured its size x = 45 cm and for this value we computed $p(45|c_1) = 2.2828 \cdot 10^{-2}$ and $p(45|c_2) = 1.1053 \cdot 10^{-2}$.
- Thus, the classification rule predicts class c1 (“an eagle”) because $p(x|c_1)P(c_1) > p(x|c_2)P(c_2)$ ($2.2828 \cdot 10^{-2} \cdot 0.8 > 1.1053 \cdot 10^{-2} \cdot 0.2$). Let us assume that the unconditional density p(x) is known to be p(45) = 0.3. The probability of classification error is then $P(\text{classification error} \mid x) = P(c_2|x) = \dfrac{p(x|c_2)\,P(c_2)}{p(x)} = \dfrac{1.1053 \cdot 10^{-2} \cdot 0.2}{0.3} \approx 0.0074$.
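
The two-class Bayes computation can be sketched as follows, using the likelihoods and priors of the example. Note one assumption that differs from the example: here p(x) is obtained from the law of total probability, p(x) = Σ p(x|ci)P(ci), rather than taken as the value p(45) = 0.3 assumed above, so the posteriors sum to one and the resulting error figure is not the same as the one obtained with the assumed p(45). The function name is hypothetical.

```python
def posteriors(likelihoods, priors):
    """Bayes' rule: P(c_i|x) = p(x|c_i) P(c_i) / p(x)."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x) by the law of total probability
    return [j / evidence for j in joint], evidence


# Values from the bird example: p(45|c1), p(45|c2) and P(c1), P(c2).
post, p_x = posteriors([2.2828e-2, 1.1053e-2], [0.8, 0.2])

# Bayes' decision rule: choose the class with the largest posterior.
predicted = max(range(len(post)), key=post.__getitem__)
p_error_given_x = 1.0 - post[predicted]
print(predicted, post, p_error_given_x)
```

Index 0 corresponds to c1 (“an eagle”) and index 1 to c2 (“a hawk”).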

Bayesian Classification – General Case

- Bayes’ Classification Rule for Multiclass Multifeature Objects
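- Real-valued features of an object as an n-dimensional column vector $\mathbf{x} \in \mathbb{R}^n$: $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$
- The object may belong to l distinct classes (l distinct states of nature): $C = \{c_1, c_2, \ldots, c_l\}$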

Bayesian Classification – General Case

- Bayes’ Classification Rule for Multiclass Multifeature Objects
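- Bayes’ theorem: $P(c_i|\mathbf{x}) = \dfrac{p(\mathbf{x}|c_i)\,P(c_i)}{p(\mathbf{x})}$
- A priori probability: P(ci) (i = 1, 2, …, l)
- Class conditional probability density function: p(x|ci)
- A posteriori (posterior) probability: P(ci|x)
- Unconditional probability density function: $p(\mathbf{x}) = \sum_{i=1}^{l} p(\mathbf{x}|c_i)\,P(c_i)$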

Bayesian Classification – General Case

- Bayes’ Classification Rule for Multiclass Multifeature Objects
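- Bayes classification rule:
- A given object with a given value x of a feature vector can be classified as belonging to class cj when: $P(c_j|\mathbf{x}) > P(c_i|\mathbf{x})$ for all $i = 1, 2, \ldots, l$, $i \neq j$
- Equivalently, assign an object with a given value x of a feature vector to class cj when: $p(\mathbf{x}|c_j)\,P(c_j) > p(\mathbf{x}|c_i)\,P(c_i)$ for all $i \neq j$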
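
A minimal sketch of the multiclass rule, assuming (for illustration only) a single feature and univariate Gaussian class conditional densities. The class names, means, standard deviations, and priors below are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Univariate normal density, used here as an assumed p(x|c_i)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def bayes_classify(x, classes):
    """Return the class maximizing p(x|c_i) P(c_i), plus all the scores."""
    scores = {
        name: gaussian_pdf(x, mean, std) * prior
        for name, (mean, std, prior) in classes.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical three-class problem: (mean, std, prior) per class.
classes = {
    "c1": (40.0, 5.0, 0.5),
    "c2": (55.0, 5.0, 0.3),
    "c3": (70.0, 5.0, 0.2),
}
label, scores = bayes_classify(45.0, classes)
print(label, scores)
```

Because p(x) is the same for every class, comparing the products p(x|ci)P(ci) gives the same decision as comparing the posteriors.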

Classification that Minimizes Risk

- Basic Idea
- To incorporate the fact that some misclassifications are more costly than others, we define a classification based on a minimization criterion that involves the loss incurred by a given classification decision for a given true state of nature
- A loss function $L(c_j|c_i)$
- Cost (penalty, weight) incurred by assigning an object to class cj when in fact the true class is ci

Classification that Minimizes Risk

- A loss matrix
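- For l-class classification problems, we denote the loss function by the l × l matrix with elements $L_{ij} = L(c_j|c_i)$
- Expected (average) conditional loss: $R(c_j|\mathbf{x}) = \sum_{i=1}^{l} L(c_j|c_i)\,P(c_i|\mathbf{x})$
- In short, $R_j = \sum_{i=1}^{l} L_{ij}\,P(c_i|\mathbf{x})$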

Classification that Minimizes Risk

- Overall Risk
- The overall risk R can be considered as a classification criterion for minimizing risk related to a classification decision.
- Bayes risk
- Minimal overall risk R leads to the generalization of Bayes’ rule for minimization of probability of the classification error.

Classification that Minimizes Risk

- Bayes’ classification rule with Bayes risk
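- Choose a decision (a class) ci for which the conditional risk is minimal: $R(c_i|\mathbf{x}) < R(c_j|\mathbf{x})$ for all $j = 1, 2, \ldots, l$, $j \neq i$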
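
A minimal sketch of the minimum-risk rule. The posteriors and the two loss matrices are hypothetical; `loss[i][j]` is taken to be the cost of deciding class j when the true class is i.

```python
def conditional_risks(loss, post):
    """R(c_j|x) = sum_i loss[i][j] * P(c_i|x) for every decision j."""
    n = len(post)
    return [sum(loss[i][j] * post[i] for i in range(n)) for j in range(n)]


def min_risk_decision(loss, post):
    """Choose the decision with the smallest conditional risk."""
    risks = conditional_risks(loss, post)
    return min(range(len(risks)), key=risks.__getitem__), risks


post = [0.7, 0.3]  # hypothetical posteriors P(c_1|x), P(c_2|x)

zero_one = [[0, 1],
            [1, 0]]
# Asymmetric loss: deciding class 0 when the truth is class 1 costs 5.
asymmetric = [[0, 1],
              [5, 0]]

print(min_risk_decision(zero_one, post))
print(min_risk_decision(asymmetric, post))
```

With the zero-one loss the decision coincides with the maximum-posterior class, while the asymmetric loss moves the decision to the less probable class because missing it is more expensive.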

Classification that Minimizes Risk

- Bayesian Classification Minimizing the Probability of Error
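- Symmetrical zero-one conditional loss function: $L_{ij} = 0$ for $i = j$ and $L_{ij} = 1$ for $i \neq j$
- The conditional risk R(cj|x) criterion is the same as the average probability of classification error: $R(c_j|\mathbf{x}) = \sum_{i \neq j} P(c_i|\mathbf{x}) = 1 - P(c_j|\mathbf{x})$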
- An average probability of classification error is thus used as a criterion of minimization for selecting the best classification decision

Classification that Minimizes Risk

- Generalization of the Maximum Likelihood Classification
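- Generalized likelihood ratio for classes ci and cj: $r_{ji}(\mathbf{x}) = \dfrac{p(\mathbf{x}|c_j)}{p(\mathbf{x}|c_i)}$
- Generalized threshold value: $\theta_{ji} = \dfrac{P(c_i)}{P(c_j)}$
- The maximum likelihood classification rule
- “Decide a class cj if $r_{ji}(\mathbf{x}) > \theta_{ji}$ for all $i \neq j$”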

Decision Regions and Probability of Errors

- Decision regions
- A classifier divides the feature space into l disjoint decision subspaces R1, R2, …, Rl
- The region Ri is a subspace such that each realization x of a feature vector of an object falling into this region will be assigned to a class ci

Decision Regions and Probability of Errors

- Decision boundaries (decision surfaces)
- The decision regions do not overlap; the boundaries between adjacent regions are called decision boundaries (decision surfaces)
- “The task of a classifier design is to find classification rules that will guarantee division of a feature space into optimal decision regions R1, R2, …, Rl (with optimal decision boundaries) that will minimize a selected classification performance criterion”

Decision Regions and Probability of Errors

- Optimal classification with decision regions
- Average probability of correct classification: $P(\text{classification\_correct}) = \sum_{i=1}^{l} \int_{R_i} p(\mathbf{x}|c_i)\,P(c_i)\,d\mathbf{x}$
- “Classification problems can be stated as choosing decision regions Ri (thus defining a classification rule) that maximize the probability P(classification_correct) of correct classification, which serves as the optimization criterion”

Discriminant Functions

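- Discriminant functions: $d_i(\mathbf{x})$, $i = 1, 2, \ldots, l$
- Discriminant type classifier
- It assigns an object with a given value x of a feature vector to a class cj if $d_j(\mathbf{x}) > d_i(\mathbf{x})$ for all $i = 1, 2, \ldots, l$, $i \neq j$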
- Classification rule for a discriminant function-based classifier
- Compute numerical values of all discriminant functions for x
- Choose a class cj as a prediction of true class for which a value of the associated discriminant function dj(x) is the largest:
- Select a class cj for which dj(x) = max(di(x)); i = 1, 2, …, l

Discriminant Functions

- Discriminant type classifier for Bayesian classification
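- The natural choice for the discriminant function is the a posteriori conditional probability P(ci|x): $d_i(\mathbf{x}) = P(c_i|\mathbf{x})$
- Practical versions using Bayes’ theorem: $d_i(\mathbf{x}) = \dfrac{p(\mathbf{x}|c_i)\,P(c_i)}{p(\mathbf{x})}$ or, dropping the class-independent p(x), $d_i(\mathbf{x}) = p(\mathbf{x}|c_i)\,P(c_i)$
- Bayesian discriminant in a natural logarithmic form: $d_i(\mathbf{x}) = \ln p(\mathbf{x}|c_i) + \ln P(c_i)$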

Discriminant Functions

- Characteristics of discriminant function
- Discriminant functions define the decision boundaries that separate the decision regions
- Generally, the decision boundary between neighboring decision regions is the set of points where the corresponding discriminant function values are equal
- The decision boundaries are unaffected by a monotonically increasing transformation of the discriminant functions

Discriminant Functions

- Bayesian Discriminant Functions for Two Classes
- General case
- Two discriminant functions: d1(x) and d2(x).
- Two decision regions: R1 and R2.
- The decision boundary: d1(x) = d2(x).
- Using a dichotomizer
- Single discriminant function: d(x) = d1(x) - d2(x); decide class c1 if d(x) > 0, otherwise class c2.

Discriminant Functions

- Quadratic and Linear Discriminants Derived from the Bayes Rule
- Quadratic Discriminant
- Assumption:
- A multivariate normal (Gaussian) distribution of the feature vector x within each class
- The Bayesian discriminant (from the previous section): $d_i(\mathbf{x}) = \ln p(\mathbf{x}|c_i) + \ln P(c_i)$

Discriminant Functions

- Quadratic and Linear Discriminants Derived from the Bayes Rule
- Quadratic Discriminant
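- Gaussian form of the class conditional probability density function: $p(\mathbf{x}|c_i) = \dfrac{1}{(2\pi)^{n/2}\,|\boldsymbol{\Sigma}_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right)$
- Quadratic discriminant function (after dropping the class-independent constant $-\tfrac{n}{2}\ln 2\pi$): $d_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \tfrac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(c_i)$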
- Decision boundaries:
- hyperquadratic functions in n-dimensional feature space (hyperspheres, hyperellipsoids, hyperparaboloids, etc.)

Discriminant Functions

- Quadratic and Linear Discriminants Derived from the Bayes Rule
- Given: A pattern x. Values of state conditional probability densities p(x|ci) and the a priori probabilities P(ci)
- Compute values of the mean vectors $\boldsymbol{\mu}_i$ and the covariance matrices $\boldsymbol{\Sigma}_i$ for all classes i = 1, 2, …, l based on the training set
- Compute values of the discriminant function for all classes
- Choose a class cj as a prediction of true class for which a value of the associated discriminant function dj(x) is largest:
- Select a class cj for which dj(x) = max(di(x)) i = 1, 2, …, l
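
The quadratic discriminant procedure can be sketched for two features without external libraries. This is an illustration under stated assumptions: the class means, covariance matrices, and priors are hypothetical and are given directly rather than estimated from a training set, and the 2×2 inverse is written out by hand.

```python
import math


def inverse_2x2(m):
    """Inverse and determinant of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det


def quadratic_discriminant(x, mean, cov, prior):
    """d_i(x) = -1/2 (x-mu)^T S^-1 (x-mu) - 1/2 ln|S| + ln P(c_i)."""
    inv, det = inverse_2x2(cov)
    dx = [x[0] - mean[0], x[1] - mean[1]]
    maha = sum(dx[r] * inv[r][c] * dx[c] for r in range(2) for c in range(2))
    return -0.5 * maha - 0.5 * math.log(det) + math.log(prior)


def classify(x, params):
    """Select the class index with the largest discriminant value."""
    scores = [quadratic_discriminant(x, *p) for p in params]
    return max(range(len(scores)), key=scores.__getitem__), scores


# Hypothetical classes: (mean vector, covariance matrix, prior).
params = [
    ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5),
    ([3.0, 3.0], [[4.0, 0.0], [0.0, 4.0]], 0.5),
]
for point in ([0.5, 0.5], [3.0, 3.0], [-4.0, -4.0]):
    print(point, classify(point, params)[0])
```

The last point lies on the far side of the first class mean yet is assigned to the second class, because that class has the larger covariance; this is the kind of curved (hyperquadratic) boundary that unequal covariances produce.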

Discriminant Functions

- Quadratic and Linear Discriminants Derived from the Bayes Rule
- Linear Discriminant:
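- Assumption: equal covariances for all classes, $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$ (i = 1, 2, …, l)
- The quadratic discriminant then reduces, after dropping the class-independent term $-\tfrac{1}{2}\ln|\boldsymbol{\Sigma}|$, to: $d_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(c_i)$
- A linear form of discriminant functions, obtained by also dropping the class-independent term $-\tfrac{1}{2}\mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x}$: $d_i(\mathbf{x}) = \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \mathbf{x} - \tfrac{1}{2}\boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i + \ln P(c_i)$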

Discriminant Functions

- Quadratic and Linear Discriminants Derived from the Bayes Rule
- Linear Discriminant:

Decision boundaries between classes i and j, for which di(x) = dj(x), are pieces of hyperplanes in n-dimensional feature space

Discriminant Functions

- Quadratic and Linear Discriminants Derived from the Bayes Rule
- The classification process using linear discriminants
- Compute, for a given x, numerical values of discriminant functions for all classes:
- Choose a class cj for which a value of the discriminant function dj(x) is largest:
- Select a class cj for which dj(x) = max(di(x)) i = 1, 2, …, l

Discriminant Functions

- Quadratic and Linear Discriminants
- Example
- Let us assume that the following two-feature patterns $\mathbf{x} \in \mathbb{R}^2$ from two classes c1 = 0 and c2 = 1 have been drawn according to the Gaussian (normal) density distribution:

Discriminant Functions

- Quadratic and Linear Discriminants
- Example
- The estimates of the symmetric covariance matrices for both classes
- The linear discriminant functions for both classes

Discriminant Functions

- Quadratic and Linear Discriminants
- Example
- Two-class two-feature pattern dichotomizer.

Discriminant Functions

- Quadratic and Linear Discriminants
- Minimum Mahalanobis Distance Classifier
- Assumption
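- Equal covariances for all classes: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$ (i = 1, 2, …, l)
- Equal a priori probabilities for all classes: P(ci) = P
- Discriminant function: $d_i(\mathbf{x}) = -(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)$, the negative of the squared Mahalanobis distance between x and the class mean $\boldsymbol{\mu}_i$; maximizing it is the same as minimizing that distance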

Discriminant Functions

- Quadratic and Linear Discriminants
- Minimum Mahalanobis Distance Classifier
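- A classifier selects the class cj for which a value x is nearest, in the sense of Mahalanobis distance, to the corresponding mean vector $\boldsymbol{\mu}_j$. This classifier is called a minimum Mahalanobis distance classifier.
- Linear version of the minimum Mahalanobis distance classifier: $d_i(\mathbf{x}) = \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \mathbf{x} - \tfrac{1}{2}\boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i$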

Discriminant Functions

- Quadratic and Linear Discriminants
- Minimum Mahalanobis Distance Classifier
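- Given: The mean vectors $\boldsymbol{\mu}_i$ for all classes (i = 1, 2, …, l), the common covariance matrix $\boldsymbol{\Sigma}$, and a given value x of a feature vector
- Compute numerical values of the Mahalanobis distances between x and the means $\boldsymbol{\mu}_i$ for all classes.
- Choose a class cj as a prediction of true class, for which the value of the associated Mahalanobis distance attains the minimum: $(\mathbf{x} - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_j) = \min_i\, (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)$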
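
A minimal sketch of the minimum Mahalanobis distance classifier for two features. The class means, the shared covariance matrix, and the test point are hypothetical, and the 2×2 inverse is written out by hand to keep the example self-contained.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance (x - mu)^T S^-1 (x - mu)."""
    dx = [x[0] - mean[0], x[1] - mean[1]]
    return sum(dx[r] * cov_inv[r][c] * dx[c] for r in range(2) for c in range(2))


def classify(x, means, cov):
    """Assign x to the class whose mean is nearest in Mahalanobis distance."""
    cov_inv = inverse_2x2(cov)
    dists = {name: mahalanobis_sq(x, mu, cov_inv) for name, mu in means.items()}
    return min(dists, key=dists.get), dists


# Hypothetical data: large variance along feature 1, small along feature 2.
means = {"a": [0.0, 0.0], "b": [4.0, 1.0]}
cov = [[4.0, 0.0], [0.0, 0.25]]
label, dists = classify([1.0, 0.9], means, cov)
print(label, dists)
```

The point is closer to the mean of class "a" in plain Euclidean terms, but the small variance of the second feature makes its offset from "a" costly, so the Mahalanobis rule selects "b".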

Discriminant Functions

- Quadratic and Linear Discriminants
- Linear Discriminant for Statistically Independent Features
- Assumption
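- Equal covariances for all classes: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$ (i = 1, 2, …, l)
- Features are statistically independent, with the same variance $\sigma^2$ for each feature, so that $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$
- Discriminant function: $d_i(\mathbf{x}) = -\dfrac{\|\mathbf{x} - \boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(c_i)$
- where $\|\mathbf{x} - \boldsymbol{\mu}_i\|^2 = (\mathbf{x} - \boldsymbol{\mu}_i)^T (\mathbf{x} - \boldsymbol{\mu}_i)$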

Discriminant Functions

- Quadratic and Linear Discriminants
- Linear Discriminant for Statistically Independent Features
- Discriminants
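- Quadratic discriminant formula (with $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$): $d_i(\mathbf{x}) = -\dfrac{1}{2\sigma^2}\left(\mathbf{x}^T\mathbf{x} - 2\boldsymbol{\mu}_i^T\mathbf{x} + \boldsymbol{\mu}_i^T\boldsymbol{\mu}_i\right) + \ln P(c_i)$
- Linear discriminant formula (dropping the class-independent term $\mathbf{x}^T\mathbf{x}$): $d_i(\mathbf{x}) = \dfrac{1}{\sigma^2}\boldsymbol{\mu}_i^T\mathbf{x} - \dfrac{1}{2\sigma^2}\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + \ln P(c_i)$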

Discriminant Functions

- Quadratic and Linear Discriminants
- Linear Discriminant for Statistically Independent Features
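- “Neural network” style as a linear threshold machine: $d_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}$
- where, for independent features with common variance $\sigma^2$, $\mathbf{w}_i = \dfrac{1}{\sigma^2}\boldsymbol{\mu}_i$ and $w_{i0} = -\dfrac{1}{2\sigma^2}\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + \ln P(c_i)$
- The decision surfaces for the linear discriminants are pieces of hyperplanes defined by the equations di(x) - dj(x) = 0.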

Discriminant Functions

- Quadratic and Linear Discriminants
- Minimum Euclidean Distance Classifier
- Assumption
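- Equal covariances for all classes: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$ (i = 1, 2, …, l)
- Features are statistically independent
- Equal a priori probabilities for all classes: P(ci) = P
- Discriminants: $d_i(\mathbf{x}) = -\|\mathbf{x} - \boldsymbol{\mu}_i\|^2$
- or, written out, $d_i(\mathbf{x}) = -(\mathbf{x} - \boldsymbol{\mu}_i)^T(\mathbf{x} - \boldsymbol{\mu}_i)$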

Discriminant Functions

- Quadratic and Linear Discriminants
- Minimum Euclidean Distance Classifier
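- The minimum distance classifier, or minimum Euclidean distance classifier, selects the class cj for which a value x is nearest to the corresponding mean vector $\boldsymbol{\mu}_j$.
- Linear version of the minimum distance classifier: $d_i(\mathbf{x}) = \boldsymbol{\mu}_i^T\mathbf{x} - \tfrac{1}{2}\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i$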

Discriminant Functions

- Quadratic and Linear Discriminants
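- Given: The mean vectors $\boldsymbol{\mu}_i$ for all classes (i = 1, 2, …, l) and a given value x of a feature vector
- Compute numerical values of Euclidean distances between x and the means $\boldsymbol{\mu}_i$ for all classes: $\|\mathbf{x} - \boldsymbol{\mu}_i\|$
- Choose a class cj as a prediction of true class for which a value of the associated Euclidean distance is smallest: $\|\mathbf{x} - \boldsymbol{\mu}_j\| = \min_i \|\mathbf{x} - \boldsymbol{\mu}_i\|$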
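
A minimal sketch of the minimum Euclidean distance classifier. The class means and the test point are hypothetical.

```python
import math


def classify_min_distance(x, means):
    """Assign x to the class whose mean vector is nearest in Euclidean distance."""
    dists = {name: math.dist(x, mu) for name, mu in means.items()}
    return min(dists, key=dists.get), dists


# Hypothetical class means in a two-feature space.
means = {"a": [0.0, 0.0], "b": [4.0, 1.0]}
label, dists = classify_min_distance([1.0, 0.9], means)
print(label, dists)
```

Only the class means are needed; no covariance information enters the decision, which is why this rule is the special case with independent, equal-variance features and equal priors.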

Discriminant Functions

- Quadratic and Linear Discriminants
- Characteristics of Bayesian Normal Discriminant
- Assumptions
- multivariate normality within classes
- equal covariance matrices between classes
- Under these assumptions, the linear discriminant is equivalent to the optimal (Bayes) classifier
- In practice, these assumptions are satisfied only approximately
- Due to its simple structure, the linear discriminant tends not to overfit the training data set, which may lead to stronger generalization ability for unseen cases

Estimation of Probability Densities

- Basic Idea
- In Bayesian classifier design, the a priori probabilities and class conditional probability densities are not known exactly and must be estimated from a limited number of a priori observed objects. This estimation should be optimal according to a well-defined estimation criterion.
- Estimates of a priori probabilities: $\hat{P}(c_i) = n_i / N$, where $n_i$ is the number of observed objects from class ci among all N observations

Estimation of Probability Densities

- Estimation of the class conditional probability densities p(x|ci)
- Parametric methods
- with the assumption of a specific functional form of a probability density function
- Nonparametric methods
- without the assumption of a specific functional form of a probability density function
- Semiparametric method
- a combination of parametric and nonparametric methods

Estimation of Probability Densities

- Parametric Methods
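- A priori observations of objects and corresponding patterns: $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$
- Split the set of all patterns X according to class into l disjoint sets: $X_{c_1}, X_{c_2}, \ldots, X_{c_l}$
- Assume that the parametric form of the class conditional probability density is given as a function: $p(\mathbf{x}|c_i, \boldsymbol{\theta}_i)$
- where $\boldsymbol{\theta}_i$ is the vector of parameters for class ci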

Estimation of Probability Densities

- Parametric Methods
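- If the probability density has a normal (Gaussian) form: $p(\mathbf{x}|c_i, \boldsymbol{\theta}_i) = \dfrac{1}{(2\pi)^{n/2}\,|\boldsymbol{\Sigma}_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right)$
- where the parameter vector $\boldsymbol{\theta}_i$ consists of the mean vector $\boldsymbol{\mu}_i$ and the covariance matrix $\boldsymbol{\Sigma}_i$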

Estimation of Probability Densities

- The Maximum Likelihood Estimation of Parameters
- Assumption
- we are given a limited-size set of N patterns xi: $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$
- we know a parametric form $p(\mathbf{x}|\boldsymbol{\theta})$ of a conditional probability density function
- Goal
- The task of estimation is to find the optimal (the best according to the used criterion) value of the parameter vector $\boldsymbol{\theta}$ of a given dimension m.

Estimation of Probability Densities

- The Maximum Likelihood Estimation of Parameters
- Likelihood
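- The joint probability density $L(\boldsymbol{\theta}) = p(X|\boldsymbol{\theta}) = \prod_{i=1}^{N} p(\mathbf{x}_i|\boldsymbol{\theta})$ (for independently drawn patterns) is a function of the parameter vector $\boldsymbol{\theta}$ for a given set of patterns X.
- It is called the likelihood of $\boldsymbol{\theta}$ for a given set of patterns X.
- Maximum Likelihood Estimation
- The function $L(\boldsymbol{\theta})$ can be chosen as a criterion for finding the optimal estimate $\hat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}$, namely the value that maximizes $L(\boldsymbol{\theta})$. This is called the maximum likelihood estimation of parameters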

Estimation of Probability Densities

- The Maximum Likelihood Estimation of Parameters
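- Minimizing the negative natural logarithm of the likelihood $L(\boldsymbol{\theta})$: $J(\boldsymbol{\theta}) = -\ln L(\boldsymbol{\theta}) = -\sum_{i=1}^{N} \ln p(\mathbf{x}_i|\boldsymbol{\theta})$
- For the differentiable function $p(\mathbf{x}_i|\boldsymbol{\theta})$, the estimate satisfies: $\dfrac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = -\sum_{i=1}^{N} \dfrac{\partial}{\partial \boldsymbol{\theta}} \ln p(\mathbf{x}_i|\boldsymbol{\theta}) = 0$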

Estimation of Probability Densities

- The Maximum Likelihood Estimation of Parameters
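- For the normal form of a probability density function $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with unknown parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ constituting the vector $\boldsymbol{\theta}$: $\hat{\boldsymbol{\mu}} = \dfrac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i$ and $\hat{\boldsymbol{\Sigma}} = \dfrac{1}{N}\sum_{i=1}^{N} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T$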

Estimation of Probability Densities

- The Maximum Likelihood Estimation of Parameters
- Example of Maximum Likelihood Estimation
- for
- The maximum likelihood estimation criterion
- The maximum likelihood estimates for the parameters:
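
A minimal sketch of maximum likelihood estimation for a univariate normal density. The sample values are hypothetical, and the negative log-likelihood is included only to check numerically that the closed-form estimates minimize it.

```python
import math


def gaussian_mle(samples):
    """Closed-form ML estimates: sample mean and (biased, 1/N) variance."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var


def negative_log_likelihood(samples, mu, var):
    """J(theta) = -sum_i ln p(x_i | mu, var) for a univariate normal density."""
    return sum(
        0.5 * math.log(2.0 * math.pi * var) + (x - mu) ** 2 / (2.0 * var)
        for x in samples
    )


samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
mu_hat, var_hat = gaussian_mle(samples)
print(mu_hat, var_hat, negative_log_likelihood(samples, mu_hat, var_hat))
```

Note that the ML variance estimate divides by N, not N - 1, so it is biased for small samples.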

Estimation of Probability Densities

- Nonparametric Methods
- “Nonparametric methods are more general methods of probability density estimation that are based on existing data, but without an assumption about a functional form of the probability density function.”
- Nonparametric techniques:
- Histogram
- Kernel-based method
- k-nearest neighbors
- Nearest neighbors

Estimation of Probability Densities

- Nonparametric Methods
- General Idea
- Determine an estimate of a true probability density p(x) based on the available limited-size samples
- The probability that a new pattern x will fall inside a region R: $P = \int_{R} p(\mathbf{x}')\,d\mathbf{x}'$
- Approximation of the probability for a small region R of volume V and for continuous p(x), with almost the same values within the region: $P \approx p(\mathbf{x})\,V$

Estimation of Probability Densities

- Nonparametric Methods
- General Idea
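- The probability that, out of N sample patterns, k of them will fall in a region R: $P_k = \binom{N}{k} P^k (1 - P)^{N-k}$
- Estimate of the probability P: $\hat{P} = k / N$
- Approximation for a probability density function for a given pattern x: $\hat{p}(\mathbf{x}) = \dfrac{k}{N V}$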

Estimation of Probability Densities

- Nonparametric Methods
- Kernel-based Method and Parzen Window
- Kernel-based method is based on fixing around a pattern vector x a region R (and thus a region volume V ) and counting a number k of given training patterns falling in this region by using a special kernel function associated with the region.
- Such a kernel function is also called a Parzen window

Estimation of Probability Densities

- Nonparametric Methods
- Hypercube-type Parzen window
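- Volume of the hypercube with edge length h in n dimensions: $V = h^n$
- Kernel (window) function: $\varphi(\mathbf{u}) = 1$ if $|u_j| \le 1/2$ for all $j = 1, 2, \ldots, n$, and $\varphi(\mathbf{u}) = 0$ otherwise
- Total number of patterns falling within the hypercube centered at x: $k = \sum_{i=1}^{N} \varphi\!\left(\dfrac{\mathbf{x} - \mathbf{x}_i}{h}\right)$
- The estimate of the probability density function: $\hat{p}(\mathbf{x}) = \dfrac{1}{N h^n} \sum_{i=1}^{N} \varphi\!\left(\dfrac{\mathbf{x} - \mathbf{x}_i}{h}\right)$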
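
A minimal sketch of the hypercube Parzen window estimate. The training patterns and the edge length h are hypothetical, and one feature is used so the numbers are easy to follow.

```python
def hypercube_kernel(u):
    """1 if u lies inside the unit hypercube centered at the origin, else 0."""
    return 1.0 if all(abs(c) <= 0.5 for c in u) else 0.0


def parzen_density(x, patterns, h):
    """p_hat(x) = k / (N * h^n), with k counted by the hypercube kernel."""
    n = len(x)
    k = sum(
        hypercube_kernel([(xj - pj) / h for xj, pj in zip(x, p)])
        for p in patterns
    )
    return k / (len(patterns) * h ** n)


# Hypothetical one-feature training patterns.
patterns = [[0.0], [0.2], [0.4], [1.5], [3.0]]
print(parzen_density([0.2], patterns, h=1.0))
print(parzen_density([2.0], patterns, h=1.0))
```

Three of the five patterns fall inside the window around 0.2, but only one falls inside the window around 2.0, so the estimate is higher where the data is denser.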

Estimation of Probability Densities

- Nonparametric Methods
- Smooth estimate of the probability density function
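- A kernel function must satisfy two conditions: $\varphi(\mathbf{u}) \ge 0$
- and $\int \varphi(\mathbf{u})\,d\mathbf{u} = 1$
- For example, the radial symmetric multivariate Gaussian (normal) kernel: $\varphi(\mathbf{u}) = \dfrac{1}{(2\pi)^{n/2}} \exp\!\left(-\dfrac{\|\mathbf{u}\|^2}{2}\right)$
- The estimate of the probability density function: $\hat{p}(\mathbf{x}) = \dfrac{1}{N} \sum_{i=1}^{N} \dfrac{1}{(2\pi)^{n/2} h^n} \exp\!\left(-\dfrac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2h^2}\right)$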

Estimation of Probability Densities

- Nonparametric Methods
- Smooth estimate of the probability density function
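- The estimate of the class-dependent p(x|ck) probability density, using only the $N_k$ training patterns from class ck: $\hat{p}(\mathbf{x}|c_k) = \dfrac{1}{N_k h^n} \sum_{\mathbf{x}_i \in X_{c_k}} \varphi\!\left(\dfrac{\mathbf{x} - \mathbf{x}_i}{h}\right)$
- The estimate of the class-dependent p(x|ck) probability density for the Gaussian kernel: $\hat{p}(\mathbf{x}|c_k) = \dfrac{1}{N_k (2\pi)^{n/2} h^n} \sum_{\mathbf{x}_i \in X_{c_k}} \exp\!\left(-\dfrac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2h^2}\right)$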

Estimation of Probability Densities

- Nonparametric Methods
- Design issues
- The selection of a kernel function:
- Parzen window, Gaussian kernel, etc.
- The selection of a smoothing parameter
- The generalization ability of the kernel-based density estimation depends on the training set and on smoothing parameters
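
The effect of the smoothing parameter can be seen in a minimal one-feature sketch of the Gaussian kernel estimate. The training patterns and the two values of h are hypothetical.

```python
import math


def gaussian_kde(x, patterns, h):
    """One-feature Gaussian kernel estimate with smoothing parameter h."""
    norm = 1.0 / (len(patterns) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-((x - p) ** 2) / (2.0 * h * h)) for p in patterns)


patterns = [0.0, 1.0, 2.0]  # hypothetical training patterns

# A small h gives sharp peaks at the training patterns and almost zero
# density between them; a large h gives a smooth, flatter estimate.
for h in (0.1, 1.0):
    print(h, gaussian_kde(1.0, patterns, h), gaussian_kde(0.5, patterns, h))

# Rough numerical check that the estimate integrates to about one.
step = 0.01
area = sum(gaussian_kde(-10.0 + i * step, patterns, 1.0) * step for i in range(2200))
print(area)
```

Too small an h reproduces the training set rather than the underlying density, while too large an h blurs real structure, so h is usually tuned on data not used to build the estimate.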

Estimation of Probability Densities

- Nonparametric Methods
- K-nearest Neighbors
- “A method of probability density estimation with variable size regions”
- First, a small n-dimensional sphere is located in the pattern space centered at the point x.
- Second, a radius of this sphere is extended until the sphere contains exactly the fixed number k of patterns from a given training set.
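- Then an estimate of the probability density for x is computed as $\hat{p}(\mathbf{x}) = \dfrac{k}{N V_k}$, where $V_k$ is the volume of the sphere that contains the k patterns and N is the size of the training set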
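
A minimal one-feature sketch of the k-nearest neighbors density estimate. The training patterns are hypothetical; in one dimension the "sphere" of radius r is the interval of length 2r.

```python
def knn_density(x, patterns, k):
    """p_hat(x) = k / (N * V), V being the volume that just holds k patterns."""
    dists = sorted(abs(x - p) for p in patterns)
    radius = dists[k - 1]          # distance to the k-th nearest pattern
    volume = 2.0 * radius          # length of the interval [x - r, x + r]
    return k / (len(patterns) * volume)


# Hypothetical one-feature training patterns: dense near 0, sparse elsewhere.
patterns = [0.0, 0.1, 0.2, 0.3, 2.0, 4.0]
dense = knn_density(0.15, patterns, k=2)
sparse = knn_density(3.0, patterns, k=2)
print(dense, sparse)
```

The region grows until it holds k patterns, so it is small where data is dense and large where data is sparse. The sketch does not handle the degenerate case where x coincides with k training patterns and the radius is zero.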

Estimation of Probability Densities

- Nonparametric Methods
- K-nearest Neighbors Classification Rule
- First, for a given x, the first k-nearest neighbors from a training set should be found (regardless of a class label) based on a defined pattern distance measure.
- Second, among the selected k nearest neighbors, numbers ni of patterns belonging to each class ci are computed.
- Then, the predicted class cj assigned to x corresponds to a class for which nj is the largest.

Estimation of Probability Densities

- Nonparametric Methods
- Nearest Neighbors Classification Rule
- “The simple version of the k-nearest neighbors classification is for a number of neighbors k equal to one”
- Algorithm
- Given: A training set Ttra of N patterns x1, x2, …, xN labeled by l classes. A new pattern x.
- Compute for a given x the nearest neighbor xj from the whole training set based on the defined pattern distance measure distance(x, xi).
- Assign to x the class cj of its nearest neighbor xj.
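
Both the k-nearest neighbors rule and its k = 1 special case can be sketched with one function. The training set is hypothetical, and Euclidean distance is assumed as the pattern distance measure.

```python
import math
from collections import Counter


def knn_classify(x, training, k=1):
    """Majority vote among the k training patterns nearest to x."""
    neighbors = sorted(training, key=lambda item: math.dist(x, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled training set: (pattern, class label).
training = [
    ([0.0, 0.0], "c1"), ([0.0, 1.0], "c1"), ([1.0, 0.0], "c1"),
    ([3.0, 3.0], "c2"), ([3.0, 4.0], "c2"), ([1.2, 1.2], "c2"),
]
print(knn_classify([1.0, 1.0], training, k=1))
print(knn_classify([1.0, 1.0], training, k=3))
```

The single nearest neighbor of the query is an isolated c2 pattern, so k = 1 answers c2, while k = 3 outvotes it with two c1 neighbors. Ties between classes are not treated specially in this sketch.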

Estimation of Probability Densities

- Semiparametric Methods
- “Combination of parametric and nonparametric methods”
- Two semiparametric methods
- Functional approximation
- Mixture models (mixtures of probability densities)
- Major advantage
- It is able to precisely fit component functions locally to specific regions of a feature space, based on discoveries about probability distributions and their modalities from the existing data

Estimation of Probability Densities

- Semiparametric Methods
- Functional Approximation
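- Approximation of density by the linear combination of m basis functions $\varphi_i(\mathbf{x})$: $\hat{p}(\mathbf{x}) = \sum_{i=1}^{m} a_i\,\varphi_i(\mathbf{x})$
- Using a symmetric radial basis function, whose value depends only on the distance from a center $\boldsymbol{\mu}_i$: $\varphi_i(\mathbf{x}) = \varphi(\|\mathbf{x} - \boldsymbol{\mu}_i\|)$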

Estimation of Probability Densities

- Semiparametric Methods
- Functional Approximation
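- Gaussian radial function, “the most commonly used basis function”: $\varphi_i(\mathbf{x}) = \exp\!\left(-\dfrac{\|\mathbf{x} - \boldsymbol{\mu}_i\|^2}{2\sigma_i^2}\right)$
- Optimization criterion for the functional approximation of density: $J(\mathbf{a}) = \int \left(p(\mathbf{x}) - \sum_{i=1}^{m} a_i\,\varphi_i(\mathbf{x})\right)^2 d\mathbf{x}$
- Optimal estimates for parameters (for orthonormal basis functions): $\hat{a}_j = \dfrac{1}{N}\sum_{k=1}^{N} \varphi_j(\mathbf{x}_k)$, $j = 1, 2, \ldots, m$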

Estimation of Probability Densities

- Semiparametric Methods
- The algorithm for functional approximation
- Given: A training set Ttra of N patterns x1, x2, …, xN. The m orthonormal radial basis functions $\varphi_i(\mathbf{x})$ (i = 1, 2, …, m), along with their parameters.
- Compute the estimates of unknown parameters
- Form the model of the probability density as a functional approximation

Estimation of Probability Densities

- Semiparametric Methods
- Mixture Models (Mixtures of Probability Densities)
- “These models are based on linear parametric combination of known probability density functions (for example, normal densities) localized in certain regions of data”
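- The linear mixture distribution: $p(\mathbf{x}) = \sum_{i=1}^{m} P_i\,p_i(\mathbf{x}|\boldsymbol{\theta}_i)$, with mixing coefficients $P_i \ge 0$ and $\sum_{i=1}^{m} P_i = 1$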
- Simplified version:

Estimation of Probability Densities

- Distance Between Probability Densities and the Kullback-Leibler Distance
- Distance
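- “We can define distance between two densities, with true density p(x) and its approximate estimate $\hat{p}(\mathbf{x})$”
- Kullback-Leibler distance: $d_{KL}(p, \hat{p}) = \int p(\mathbf{x}) \ln \dfrac{p(\mathbf{x})}{\hat{p}(\mathbf{x})}\,d\mathbf{x}$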
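
A minimal sketch of the Kullback-Leibler distance. As a simplifying assumption it works on discrete distributions (lists of probabilities) instead of continuous densities, so the integral becomes a sum; the two distributions are hypothetical.

```python
import math


def kl_distance(p, q):
    """Discrete Kullback-Leibler distance: sum_i p_i * ln(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


true_dist = [0.5, 0.5]      # hypothetical "true" distribution
estimate = [0.9, 0.1]       # hypothetical approximation

print(kl_distance(true_dist, true_dist))
print(kl_distance(true_dist, estimate))
print(kl_distance(estimate, true_dist))
```

The value is zero only when the two distributions agree and is not symmetric in its arguments, so it is not a distance in the strict metric sense.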

Probabilistic Neural Network

- Probabilistic Neural Network
- “The PNN is a hardware implementation of the kernel-based method of density estimation and Bayesian optimal classification (providing minimization of the average probability of the classification error)”
- Optimal Bayes’ classification rule
- Kernel-based estimation of a probability density function

Probabilistic Neural Network

- Details
- An input layer (weightless) consists of n neurons (units), each receiving one element xi (i = 1,2,…, n) of the n-dimensional input pattern vector x.
- A pattern layer consists of N neurons (units, nodes), each representing one reference pattern from the training set Ttra .
- The transfer function of the pattern layer neuron implements a kernel function (a Parzen window)

Probabilistic Neural Network

- Details
- The weightless second hidden layer is the summation layer. The number of neurons in the summation layer is equal to the number of classes l.
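- The output activation function of the summation layer neuron is generally equal to the normalizing factor of the kernel-based density estimate for its class (for example $\dfrac{1}{N_k h^n}$ applied to the summed pattern-layer outputs of class ck), but may be modified for different kernel functions.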
- The output layer is the classification decision layer that implements Bayes’ classification rule by selecting the largest value and thus decides a class cj for the pattern x

Probabilistic Neural Network

- Pattern Processing
- “Processing of patterns by the already-designed PNN network is performed in the feedforward manner. The input pattern is presented to the network and processed forward by each layer. The resulting output is the predicted class”
- PNN with the Radial Gaussian Kernel
- Kernel function:
- Transfer function:
- Output activation function:
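
The feedforward processing of a PNN with a radial Gaussian kernel can be sketched as follows. This is a software illustration under stated assumptions, not the network described in the source: the training set and the smoothing parameter `sigma` are hypothetical, and the priors are estimated as class frequencies in the training set.

```python
import math


def pnn_classify(x, training, sigma):
    """Pattern layer -> summation layer -> decision layer for one input x."""
    sums, counts = {}, {}
    for pattern, label in training:
        # Pattern layer: one neuron per training pattern, Gaussian kernel.
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, pattern))
        activation = math.exp(-sq_dist / (2.0 * sigma ** 2))
        # Summation layer: one neuron per class adds up its pattern neurons.
        sums[label] = sums.get(label, 0.0) + activation
        counts[label] = counts.get(label, 0) + 1

    n, total = len(x), len(training)
    norm = (2.0 * math.pi) ** (n / 2.0) * sigma ** n
    scores = {}
    for label in sums:
        density = sums[label] / (counts[label] * norm)   # estimate of p(x|c_k)
        prior = counts[label] / total                    # estimate of P(c_k)
        scores[label] = prior * density
    # Decision layer: Bayes' rule picks the largest p(x|c_k) P(c_k).
    return max(scores, key=scores.get), scores


training = [
    ([0.0, 0.0], "c1"), ([0.5, 0.2], "c1"), ([0.1, 0.6], "c1"),
    ([3.0, 3.0], "c2"), ([3.4, 2.8], "c2"),
]
print(pnn_classify([0.2, 0.1], training, sigma=0.5)[0])
print(pnn_classify([2.9, 3.2], training, sigma=0.5)[0])
```

No iterative training is involved: the "weights" of the pattern layer are the stored training patterns themselves, and only the smoothing parameter has to be chosen.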

Probabilistic Neural Network

- PNN with the Radial Gaussian Normal Kernel and Normalized Patterns
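- Transfer function: for patterns normalized to unit length, $\|\mathbf{x} - \mathbf{w}_i\|^2 = 2 - 2\,\mathbf{w}_i^T\mathbf{x}$, so the Gaussian kernel becomes $\exp\!\left(-\dfrac{\|\mathbf{x} - \mathbf{w}_i\|^2}{2\sigma^2}\right) = \exp\!\left(\dfrac{\mathbf{w}_i^T\mathbf{x} - 1}{\sigma^2}\right)$, where the weight vector $\mathbf{w}_i$ is the stored training pattern
- Normalization of patterns allows for a simpler architecture of the pattern-layer neurons, which here also contain input weights and an exponential output activation function.
- The transfer function of a pattern neuron can be divided into a neuron’s transfer function (the weighted sum $z_i = \mathbf{w}_i^T\mathbf{x}$) and an output activation function
- The pattern-neuron output activation function: $\exp\!\left(\dfrac{z_i - 1}{\sigma^2}\right)$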

Constraints in Classifier Design

- Problems
- Will a classifier guarantee minimization of the average probability of the classification error?
- Does a training set well represent patterns generated by a physical phenomenon?
- Are patterns drawn according to the probability density that characterizes the underlying phenomenon?
- Is the average probability of a classification error difficult to calculate?

Constraints in Classifier Design

- Suboptimal solutions of Bayesian classifier design
- The estimation of class conditional probabilities is based on a limited sample
- The samples are frequently collected randomly, and not by use of a well-planned experimental procedure

REGRESSION

- Data Models
- Simple Linear Regression Analysis
- Multiple Regression
- General Least Squares and Multiple Regression
- Assessing the Quality of the Multiple Regression Model

Data Models

- Mathematical models
- “They are useful approximate representations of phenomena that generate data and may be used for prediction, classification, compression, or control design.”
- Black-box models
- Mathematical models obtained by processing existing data without using laws of physics governing data-generating phenomena
- Regression analysis
- Data analysis and model design are based on a sample from a given population

Data Models

- Categories of regression models
- Simple linear regression
- Multiple linear regression
- Neural network-based linear regression
- Polynomial regression
- Logistic regression
- Log-linear regression
- Local piecewise linear regression
- Nonlinear regression (with a nonlinear model)
- Neural network-based nonlinear regression

Data Models

- Static and dynamic models
- A static model produces outcomes based only on the current input (no internal memory).
- A dynamic model produces outcomes based on the current input and the past history of the model behavior (internal memory)
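
A minimal sketch of the distinction, with made-up coefficients: the static model maps each input to an output independently, while the dynamic model keeps its previous output as internal memory. The class and parameter names are hypothetical.

```python
def static_model(x, a=2.0, b=1.0):
    """Static model: the output depends only on the current input."""
    return b + a * x

class DynamicModel:
    """Dynamic model: the output depends on the current input and the previous output."""

    def __init__(self, a=0.5, c=0.5):
        self.a = a
        self.c = c
        self.prev_y = 0.0  # internal memory

    def step(self, x):
        y = self.a * x + self.c * self.prev_y
        self.prev_y = y    # remember the output for the next step
        return y

inputs = [1.0, 1.0, 1.0]
static_out = [static_model(x) for x in inputs]   # [3.0, 3.0, 3.0]
model = DynamicModel()
dynamic_out = [model.step(x) for x in inputs]    # [0.5, 0.75, 0.875]
```

The same input repeated three times gives a constant static output but a changing dynamic output, because the dynamic model carries its history forward.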

Data Models

- Data gathering
- Random sample from a certain population
- N pairs of the experimental data set named Torig

Data Models

- Regression analysis
- “A statistical method used to discover the relationship between variables and to design a data model that can be used to predict variable values based on other variables”

Data Models

- Regression analysis
- A simple linear regression
- The goal is to find the linear relationship between two variables, x and y, i.e., a line equation y = b + ax that best fits the given data and can be used to predict values of y
- This modeling line is called the regression line of y on x
- The equation of that line is called a regression equation (regression model)
- Typical linear regression analysis provides a prediction of a dependent variable y based on an independent variable x

Data Models

- Visualization of Regression
- Scatter plot for height versus weight data

Simple Linear Regression Analysis

- Sample data and Regression model

Simple Linear Regression Analysis

- Assumptions
- The observations yi (i = 1, …, N) are random samples and are mutually independent.
- The regression error terms (the differences between the true values and the predicted values) are also mutually independent and identically distributed (normal distribution with zero mean and constant variance)
- The distribution of the error term is independent of the joint distribution of explanatory variables. It is also assumed that unknown parameters of regression models are constants

Simple Linear Regression Analysis

- Steps of simple linear regression analysis
- Evaluation of basic statistical characteristics of the data
- Estimation of the optimal parameters of a linear model
- Assessment of model quality and of its ability to generalize, i.e., to predict the outcome for new data

Simple Linear Regression Analysis

- Model Structure
- Nonlinear data:
- Generally, a function f(x) could be nonlinear in x:
- Linear form :

Simple Linear Regression Analysis

- Regression Error (residual error)
- Difference between real-value yi and predicted-value yi,est

Simple Linear Regression Analysis

- Performance Criterion – Sum of Squared Errors.
- The sum of squared errors performance criterion for multiple regression
- The minimization technique in regression uses the sum of squared errors as its criterion; it is called the method of least squared errors (LSE) or, in short, the method of least squares
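
The residual and the sum-of-squared-errors criterion can be sketched as follows for the line y = b + ax. The function names and the three data points are made up for illustration.

```python
def predict(x, a, b):
    """Predicted value y_est = b + a*x for a simple linear model."""
    return b + a * x

def residuals(xs, ys, a, b):
    """Regression errors e_i = y_i - y_i,est for every data point."""
    return [y - predict(x, a, b) for x, y in zip(xs, ys)]

def sum_squared_errors(xs, ys, a, b):
    """Performance criterion J(a, b): the sum of squared residuals."""
    return sum(e * e for e in residuals(xs, ys, a, b))

# Made-up points lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]
perfect_fit = sum_squared_errors(xs, ys, a=2.0, b=1.0)   # 0.0
shifted_fit = sum_squared_errors(xs, ys, a=2.0, b=0.0)   # 3.0 (each residual is 1)
```

Least squares chooses the pair (a, b) for which this criterion is smallest.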

Simple Linear Regression Analysis

- Basic Statistical Characteristics of Data
- The mean of N samples
- The variance
- The covariance
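
The formulas for these characteristics are not reproduced in the transcript. A minimal sketch follows, assuming the unbiased sample estimates with divisor N - 1; the slides may have used the divisor N instead, which does not change the covariance-to-variance ratio used for the slope.

```python
def mean(values):
    """Arithmetic mean of N samples."""
    return sum(values) / len(values)

def variance(values):
    """Sample variance with divisor N - 1 (an assumption, see the lead-in)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def covariance(xs, ys):
    """Sample covariance of paired observations, divisor N - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Made-up data with y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
slope = covariance(xs, ys) / variance(xs)   # the ratio gives the regression slope
```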

Simple Linear Regression Analysis

- Sum of Squared Variations in y Caused by the Regression Model
- The total sum of squared variations in y
- These formulas are used to define important regression measures (for example, the correlation coefficient)
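
The sums themselves are missing from the transcript. In standard notation (the symbols SS_T, SS_R, and SS_E are assumptions, not necessarily the ones used on the slide), with ȳ the mean of y and y_{i,est} the model prediction:

```latex
\begin{aligned}
SS_T &= \sum_{i=1}^{N} (y_i - \bar{y})^2 && \text{total variation in } y \\
SS_R &= \sum_{i=1}^{N} (y_{i,\mathrm{est}} - \bar{y})^2 && \text{variation explained by the regression model} \\
SS_E &= \sum_{i=1}^{N} (y_i - y_{i,\mathrm{est}})^2 && \text{unexplained (residual) variation} \\
SS_T &= SS_R + SS_E
\end{aligned}
```

The last identity holds for a least-squares fit of a linear model that includes the constant term b.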

Simple Linear Regression Analysis

- Computing Optimal Values of the Regression Model Parameters
- The optimal model parameters values have to be computed based on the given data set and the defined performance criterion
- Methods for estimation of optimal model parameter values
- The analytical offline method
- The analytical recursive offline method
- Iterative search for the optimal model parameters
- Neural network-based regression
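
As one illustration of the iterative search option, the sketch below minimizes the mean squared error of y = b + ax by plain gradient descent. The learning rate, iteration count, and data are made-up assumptions; the slides do not specify a particular search algorithm.

```python
def fit_by_gradient_descent(xs, ys, learning_rate=0.05, iterations=2000):
    """Iteratively search for (a, b) that minimize the mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        errors = [y - (b + a * x) for x, y in zip(xs, ys)]
        # Gradient of (1/n) * sum(e_i^2) with respect to a and b.
        grad_a = -2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = -2.0 * sum(errors) / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b

# Made-up points lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a_hat, b_hat = fit_by_gradient_descent(xs, ys)
```

For a problem this small the analytical solution is preferable; iterative search becomes useful when the model is nonlinear in its parameters or the data arrive sequentially.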

Simple Linear Regression Analysis

- Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model
- The general linear model structure
- The performance criterion
- and performance curve

y = ax (a model with b=0)
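
For this one-parameter model the criterion and its minimizer can be written out directly. The derivation below is a standard reconstruction and not a copy of the missing slide equations:

```latex
\begin{aligned}
J(a) &= \sum_{i=1}^{N} (y_i - a x_i)^2 \\
\frac{dJ}{da} &= -2 \sum_{i=1}^{N} x_i (y_i - a x_i) = 0 \\
a &= \frac{\sum_{i=1}^{N} x_i y_i}{\sum_{i=1}^{N} x_i^2}
\end{aligned}
```

J(a) is a quadratic function of a, so the performance curve is a parabola with a single minimum at this value.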

Simple Linear Regression Analysis

- Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model
- The optimal parameters
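
The equations for the optimal parameters are missing from the transcript. Setting the partial derivatives of J(a, b) = Σ (y_i - (b + a x_i))² with respect to a and b to zero gives the standard least-squares solution, shown here as a reconstruction:

```latex
\begin{aligned}
a &= \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \\
b &= \bar{y} - a\,\bar{x}
\end{aligned}
```

The slope a is the covariance of x and y divided by the variance of x, and the regression line always passes through the point of means (x̄, ȳ).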

Simple Linear Regression Analysis

- Procedure for simple linear regression
- Given: The number N of experimental observations, and the set of the N experimental data points { (xi, yi), i = 1, 2, …, N }
- Compute the statistical characteristics of the data
- Compute the estimates of the model optimal parameters using Equations
- Assess the regression model quality, indicating how well the model fits the data. Compute:
- Standard error of estimate
- Correlation coefficient r
- Coefficient of determination r²
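
The procedure above can be sketched end to end as follows. This is an illustration under assumptions: the function name and the data are made up, and the standard error of estimate uses N - 2 degrees of freedom (two estimated parameters, a and b).

```python
import math

def simple_linear_regression(xs, ys):
    """Fit y = b + a*x by least squares and report basic quality measures."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Step 1: statistical characteristics (sums of squares and cross-products).
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Step 2: optimal parameters.
    a = sxy / sxx
    b = y_mean - a * x_mean
    # Step 3: quality measures.
    sse = sum((y - (b + a * x)) ** 2 for x, y in zip(xs, ys))
    std_error = math.sqrt(sse / (n - 2))
    r = sxy / math.sqrt(sxx * syy)
    return {"a": a, "b": b, "std_error": std_error, "r": r, "r2": r * r}

# Made-up data, not the data set used in the slides.
result = simple_linear_regression([1.0, 2.0, 3.0, 4.0, 5.0],
                                  [2.0, 4.0, 5.0, 4.0, 5.0])
```

For this made-up sample the fitted line is y = 2.2 + 0.6x with r² = 0.6.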

Simple Linear Regression Analysis

Example

- Sample of four data points
- Resulting regression line
- y = 0.9 + 0.56x

Simple Linear Regression Analysis

- Optimal Parameter Values in the Minimum Least Squares Sense
- Required conditions for a valid linear regression
- The error term e = y - (b + ax) is normally distributed
- The error variance is the same for all values of x
- Errors are independent of each other.

Simple Linear Regression Analysis

- Quality of the Linear Regression Model and Linear Correlation Analysis
- Assessment of model quality
- The resulting correlation coefficient can be used as a measure of how well the trends in the predicted values follow the trends in the training data
- The coefficient of determination can be used to measure how well the regression line fits the data points

Simple Linear Regression Analysis

- Correlation coefficient
- Coefficient of determination
- The percent of variation in the dependent variable y that can be explained by the regression equation,
- the explained variation in y divided by the total variation, or
- the square of r (correlation coefficient)

Simple Linear Regression Analysis

- Coefficient of determination
- Explained and unexplained variation in y

Simple Linear Regression Analysis

- Coefficient of determination
- Example
- If the correlation coefficient has the value r = 0.9327, then the coefficient of determination is r² ≈ 0.87. This means that about 87% of the total variation in y can be explained by the linear relationship between x and y described by the optimal regression model of the data. The remaining 13% of the total variation in y remains unexplained.
- The calculation of coefficient of determination
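
The calculation can be sketched as one minus the ratio of unexplained to total variation, which for a least-squares line equals the square of r. The function name and the sample values are made up for illustration.

```python
def coefficient_of_determination(ys, ys_est):
    """r^2 = 1 - (unexplained variation) / (total variation)."""
    y_mean = sum(ys) / len(ys)
    total = sum((y - y_mean) ** 2 for y in ys)
    unexplained = sum((y - ye) ** 2 for y, ye in zip(ys, ys_est))
    return 1.0 - unexplained / total

# Made-up observations and the predictions of the line y = 2.2 + 0.6x at x = 1..5.
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
ys_est = [2.2 + 0.6 * x for x in [1.0, 2.0, 3.0, 4.0, 5.0]]
r2_from_data = coefficient_of_determination(ys, ys_est)

# Arithmetic from the example: squaring r = 0.9327 gives roughly 0.87.
r2_from_r = 0.9327 ** 2
```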

Simple Linear Regression Analysis

- Matrix Version of Simple Linear Regression Based on Least Squares Method
- The matrix form of the model description (the estimate of y) for all N experimental data points
- The regression error

Simple Linear Regression Analysis

- Matrix Version of Simple Linear Regression Based on Least Squares Method
- The performance criterion:
- Optimal parameters:
- The value of the criterion for the optimal parameter vector:
- The regression error for the model with the optimal parameter vector:
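
The matrix expressions are missing from the transcript. In the standard formulation, X has rows [1, x_i], the parameter vector is θ = [b, a], the criterion is J(θ) = (y - Xθ)ᵀ(y - Xθ), and the optimum satisfies (XᵀX)θ = Xᵀy. The sketch below solves that 2×2 system explicitly; the names and the data are illustrative assumptions.

```python
def fit_normal_equations(xs, ys):
    """Solve (X^T X) theta = X^T y for theta = [b, a], where X has rows [1, x_i]."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy]; invert the 2x2 matrix.
    det = n * sxx - sx * sx
    b = (sxx * sy - sx * sxy) / det
    a = (n * sxy - sx * sy) / det
    return b, a

def criterion(xs, ys, b, a):
    """J(theta) = e^T e, with regression error e = y - X theta."""
    errors = [y - (b + a * x) for x, y in zip(xs, ys)]
    return sum(e * e for e in errors)

# Made-up data, not the data set from the slides.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
b_hat, a_hat = fit_normal_equations(xs, ys)
j_min = criterion(xs, ys, b_hat, a_hat)
```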

Simple Linear Regression Analysis

- Matrix Version of Simple Linear Regression Based on Least Squares Method
- Example: let us consider again the dataset shown in the following table
- y = 0.56x + 0.9

Multiple Regression

- Definition
- Multiple regression analysis is the statistical technique of exploring the relation (association) between a set of n independent variables and one (in general, several) dependent variable y whose variability they are used to explain
- Linear multiple regression model
- Linear multiple regression model using vector notation
- This regression model is represented by a hyperplane in (n + 1)-dimensional space.
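
The model equations are missing from the transcript. With the parameters a0, a1, …, an named in this chapter, the standard forms are reconstructed below; the error term e and the augmented vector x are notational assumptions.

```latex
\begin{aligned}
y &= a_0 + a_1 x_1 + a_2 x_2 + \dots + a_n x_n + e \\
y &= \mathbf{a}^{T}\mathbf{x} + e, \qquad
\mathbf{a} = [a_0, a_1, \dots, a_n]^{T}, \quad
\mathbf{x} = [1, x_1, \dots, x_n]^{T}
\end{aligned}
```

With n = 1 this reduces to the simple regression line y = b + ax, where b = a_0 and a = a_1.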

Multiple Regression

- Geometrical Interpretation: Regression Errors
- The goal of multiple regression is to find a hyperplane in the (n + 1)-dimensional space that will best fit the data
- The performance criterion
- The error variance and standard error of the estimate

Multiple Regression

- Degrees of Freedom
- The denominator N – n – 1 in the previous equation tells us that in multiple regression with n independent variables, the standard error has N – n – 1 degrees of freedom
- The number of degrees of freedom has been reduced from N by n + 1 because the n + 1 numerical parameters a0, a1, a2, …, an of the regression model have been estimated from the data
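
A minimal sketch of the standard error of the estimate with the N - n - 1 denominator described above. The function name and the sample values are made up for illustration.

```python
import math

def standard_error_of_estimate(ys, ys_est, n_independent):
    """sqrt(SSE / (N - n - 1)) for a model with n independent variables."""
    dof = len(ys) - n_independent - 1          # degrees of freedom
    if dof <= 0:
        raise ValueError("need more data points than estimated parameters")
    sse = sum((y - ye) ** 2 for y, ye in zip(ys, ys_est))
    return math.sqrt(sse / dof)

# Made-up observations and predictions from a model with one independent variable.
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
ys_est = [2.8, 3.4, 4.0, 4.6, 5.2]
s_e = standard_error_of_estimate(ys, ys_est, n_independent=1)
```

With N = 4 cases and n = 2 independent variables, only one degree of freedom would remain, so the error estimate would rest on very little information.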

General Least Squares and Multiple Regression

- General model description in function form
- Data model
- Performance criterion
- Regression error

General Least Squares and Multiple Regression

- General model description in matrix form
- Data model
- Performance criterion
- Optimal parameters
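
The equations for the general model did not survive in the transcript. One common way to write general linear least squares, given here as an assumption about what the slides showed, uses basis functions φ_j and a design matrix Φ with entries Φ_ij = φ_j(x_i):

```latex
\begin{aligned}
y &= \sum_{j=0}^{m} a_j\,\varphi_j(\mathbf{x}) + e
  && \text{function form} \\
\mathbf{y} &= \Phi\,\mathbf{a} + \mathbf{e}
  && \text{matrix form} \\
J(\mathbf{a}) &= (\mathbf{y} - \Phi\mathbf{a})^{T}(\mathbf{y} - \Phi\mathbf{a})
  && \text{performance criterion} \\
\hat{\mathbf{a}} &= (\Phi^{T}\Phi)^{-1}\Phi^{T}\mathbf{y}
  && \text{optimal parameters}
\end{aligned}
```

Multiple linear regression is the special case φ_0(x) = 1 and φ_j(x) = x_j; polynomial regression uses φ_j(x) = x^j.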

General Least Squares and Multiple Regression

- Practical, Numerically Stable Computation of the Optimal Model Parameters
- Problem
- “The solution for the optimal least-squares parameters is almost never computed from the equation due to its poor numerical performance in cases when the matrix (the covariance matrix) is ill conditioned”
- Solution: various matrix decomposition methods
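
As one example of a decomposition-based solution, the sketch below solves the least-squares problem through a QR factorization built with modified Gram-Schmidt, which avoids forming and inverting the cross-product matrix. It is an illustration that assumes a full-column-rank data matrix; the slides do not say which decomposition they use, and numerical libraries typically rely on Householder QR or the SVD.

```python
import math

def qr_least_squares(X, y):
    """Minimize ||y - X a||^2 using X = Q R (modified Gram-Schmidt)."""
    m, n = len(X), len(X[0])
    # Q[j] holds column j of X and is orthonormalized in place.
    Q = [[X[i][j] for i in range(m)] for j in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = math.sqrt(sum(v * v for v in Q[j]))
        Q[j] = [v / R[j][j] for v in Q[j]]
        for k in range(j + 1, n):
            # Remove the component along q_j from each remaining column.
            R[j][k] = sum(qj * qk for qj, qk in zip(Q[j], Q[k]))
            Q[k] = [qk - R[j][k] * qj for qj, qk in zip(Q[j], Q[k])]
    # Solve the triangular system R a = Q^T y by back-substitution.
    qty = [sum(q * yi for q, yi in zip(Q[j], y)) for j in range(n)]
    a = [0.0] * n
    for j in reversed(range(n)):
        s = qty[j] - sum(R[j][k] * a[k] for k in range(j + 1, n))
        a[j] = s / R[j][j]
    return a

# Made-up data: a column of ones (constant term) and one independent variable.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
params = qr_least_squares(X, y)   # [constant term, slope]
```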

Assessing the Quality of the Multiple Regression Model

- The Coefficient of Multiple Determination, R²
- “The percent of the variance in the dependent variable that can be explained by all of the independent variables taken together.”
- Adjusted R²
- Adjusted R² takes into account the number of model parameters (the n coefficients plus the constant term) and the number of data points N, correcting the coefficient for situations in which unnecessary parameters are included in the model structure
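
The adjustment formula is not shown in the transcript. A minimal sketch follows, assuming the standard definition 1 - (1 - R²)(N - 1)/(N - n - 1); the function name and the numbers are made up.

```python
def adjusted_r_squared(r2, n_cases, n_independent):
    """Adjusted R^2 for N cases and n independent variables plus a constant."""
    dof = n_cases - n_independent - 1
    if dof <= 0:
        raise ValueError("need more data points than estimated parameters")
    return 1.0 - (1.0 - r2) * (n_cases - 1) / dof

# The same R^2 is penalized more heavily when more variables are used.
adj_one_variable = adjusted_r_squared(0.6, n_cases=5, n_independent=1)
adj_two_variables = adjusted_r_squared(0.6, n_cases=5, n_independent=2)
```

Unlike R², the adjusted value can decrease when a variable that adds little explanatory power is included in the model.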

Assessing the Quality of the Multiple Regression Model

- Cp Statistic
- The Cp statistic is used to compare multiple regression models
- When comparing alternative regression models, the designer aims to choose models whose value of Cp is close to or below (n + 1)
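
The formula for the statistic is missing from the transcript. The sketch below assumes the usual Mallows form, Cp = SSE_p / s² - N + 2p, where p = n + 1 is the number of parameters in the candidate model and s² is the error variance estimated from the full model. All names and numbers are illustrative assumptions.

```python
def mallows_cp(sse_candidate, mse_full, n_cases, n_independent):
    """Mallows' Cp for a candidate model with n independent variables plus a constant."""
    p = n_independent + 1
    return sse_candidate / mse_full - n_cases + 2 * p

# Made-up numbers: N = 20 cases, candidate model with n = 3 independent variables.
# If the candidate's SSE equals mse_full * (N - p), Cp equals p = n + 1 exactly.
cp_good = mallows_cp(sse_candidate=32.0, mse_full=2.0, n_cases=20, n_independent=3)
cp_poor = mallows_cp(sse_candidate=60.0, mse_full=2.0, n_cases=20, n_independent=3)
```

A value well above n + 1, as in the second call, suggests that the candidate model leaves out variables that matter.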

Assessing the Quality of the Multiple Regression Model

- Multiple Correlation
- A value of R can be found as the positive square root of R² (the coefficient of multiple determination)
- It is a measure of the strength of the linear relationship between the dependent variable y and the set of independent variables x1, x2, …, xn.
- A value of R close to 1 indicates that the fit is very good
- A value near zero indicates that the model is not a good approximation of the data and cannot be efficiently used for prediction

Assessing the Quality of the Multiple Regression Model

Example

- “Let us consider a multiple linear regression analysis for the data set containing N = 4 cases, composed with one dependent variable y and two independent variables x1 and x2”
- Three-dimensional data

Assessing the Quality of the Multiple Regression Model

Example

- The scatter plot of data points in three-dimensional space (x1, x2, y)

Assessing the Quality of the Multiple Regression Model

Example

- The data matrix
- The optimal model parameters

Assessing the Quality of the Multiple Regression Model

Example

- The optimal model:
- y = 3.1+0.9x1+0.56x2
- The optimal regression model in (x1, x2, y) space :

Assessing the Quality of the Multiple Regression Model

Example

- Multiple regression: regression plane model and scatter plot

Assessing the Quality of the Multiple Regression Model

Example

- The residuals (errors)
- The criterion value for the optimal parameters: 0.016
