Chapter 11 Supervised Learning: STATISTICAL METHODS

Cios / Pedrycz / Swiniarski / Kurgan

outline
Outline
  • Bayesian Methods
    • Basics of Bayesian Methods
    • Bayesian Classification – General Case
    • Classification that Minimizes Risk
    • Decision Regions and Probability of Errors
    • Discriminant Functions
    • Estimation of Probability Densities
    • Probabilistic Neural Network
    • Constraints in Classifier Design

Cios / Pedrycz / Swiniarski / Kurgan

outline1
Outline
  • Regression
    • Data Models
    • Simple Linear Regression
    • Multiple Regression
    • General Least Squares and Multiple Regression
    • Assessing Quality of the Multiple Regression Model

Cios / Pedrycz / Swiniarski / Kurgan

bayesian methods
Bayesian Methods

Statistical processing based on the Bayes decision theory is a fundamental technique for pattern recognition and classification.

The Bayes decision theory provides a framework for statistical methods for classifying patterns into classes based on probabilities of patterns and their features.

Cios / Pedrycz / Swiniarski / Kurgan

basics of bayesian methods
Basics of Bayesian Methods

Let us assume an experiment involving recognition of two kinds of birds: an eagle and a hawk.

States of nature C = { “ an eagle ”, “ a hawk ” }

Values of C = { c1, c2 } = { “ an eagle ”, “ a hawk ” }

We may assume that among a large number N of prior observations, neagle of them belonged to class c1 (“an eagle”)

and nhawk belonged to class c2 (“a hawk”) (with neagle + nhawk = N)

Cios / Pedrycz / Swiniarski / Kurgan

basics of bayesian methods1
Basics of Bayesian Methods
  • A priori (prior) probability P(ci):
  • Estimation of a prior P(ci):
  • P(ci) denotes the (unconditional) probability that an object belongs to class ci, without any further information about the object.

Cios / Pedrycz / Swiniarski / Kurgan

basics of bayesian methods2
Basics of Bayesian Methods

The a priori probabilities P(c1) and P(c2) represent our initial knowledge (in statistical terms) about how likely it is that an eagle or a hawk may emerge even before a bird physically appears.

  • Natural and best decision:

“Assign a bird to a class c1 if P(c1) > P(c2); otherwise, assign a bird to a class c2 ”

  • The probability of classification error:

P(classification error) = P(c2) if we decide C = c1, and P(c1) if we decide C = c2

Cios / Pedrycz / Swiniarski / Kurgan

involving object features in classification
Involving Object Features in Classification
  • Feature variable / feature x
    • It characterizes an object and allows for better discrimination of one class from another
    • We assume it to be a continuous random variable taking continuous values from a given range
    • The variability of a random variable x can be expressed in probabilistic terms
    • We represent a distribution of a random variable x by the class conditional probability density function (the state conditional probability density function):

Cios / Pedrycz / Swiniarski / Kurgan

involving object features in classification1
Involving Object Features in Classification

Examples of probability densities

Cios / Pedrycz / Swiniarski / Kurgan

involving object features in classification2
Involving Object Features in Classification
  • Probability density function p(x|ci)
    • also called the likelihood of a class ci with respect to the value x of a feature variable
    • the larger p(x|ci) is, the more likely it is that an object with feature value x belongs to class ci
    • Joint probability density function p(ci, x)
    • A probability density that an object is in class ci and has a feature variable value x.
    • A posteriori (posterior) probability P(ci|x)
    • The conditional probability P(ci|x) (i = 1, 2) specifies the probability that the object class is ci, given that the measured value of a feature variable is x.

Cios / Pedrycz / Swiniarski / Kurgan

involving object features in classification3
Involving Object Features in Classification
  • Bayes’ rule / Bayes’ theorem
    • From probability theory (see Appendix B)
    • An unconditional probability density function

Cios / Pedrycz / Swiniarski / Kurgan

involving object features in classification4
Involving Object Features in Classification
  • Bayes’ rule
    • “The conditional probability P(ci|x) can be expressed in terms of the a priori probability P(ci), together with the class conditional probability density function p(x|ci).”

Cios / Pedrycz / Swiniarski / Kurgan

involving object features in classification5
Involving Object Features in Classification
  • Bayes’ decision rule
    • P(classification error | x) = P(c2|x) if we decide C = c1, and P(c1|x) if we decide C = c2
    • “This statistical classification rule is best in the sense of minimizing the probability of misclassification (the probability of classification error)”
    • Bayes’ classification rule guarantees minimization of the average probability of classification error

Cios / Pedrycz / Swiniarski / Kurgan

involving object features in classification6
Involving Object Features in Classification
  • Example
    • Let us consider a bird classification problem with P(c1) = P(“an eagle”) = 0.8 and P(c2) = P(“a hawk”) = 0.2 and known probability density functions p(x|c1) and p(x|c2).
    • Assume that, for a new bird, we have measured its size x = 45 cm, and for this value we computed p(45|c1) = 2.2828 ∙ 10-2 and p(45|c2) = 1.1053 ∙ 10-2.
    • Thus, the classification rule predicts class c1 (“an eagle”) because p(x|c1)P(c1) > p(x|c2)P(c2) (2.2828 ∙ 10-2 ∙ 0.8 > 1.1053 ∙ 10-2 ∙ 0.2). Assume further that the unconditional density value is known to be p(45) = 0.3. The probability of classification error is then P(c2|45) = p(45|c2)P(c2) / p(45), as computed in the sketch below.
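A minimal Python sketch of this example, using the values given above (the unconditional density p(45) = 0.3 is taken as given on the slide):

```python
# Values from the example above.
priors = {"eagle": 0.8, "hawk": 0.2}                    # P(c1), P(c2)
likelihoods = {"eagle": 2.2828e-2, "hawk": 1.1053e-2}   # p(45|c1), p(45|c2)
p_x = 0.3                                               # p(45), assumed known

# Bayes' rule: posterior P(ci|x) = p(x|ci) P(ci) / p(x)
posteriors = {c: likelihoods[c] * priors[c] / p_x for c in priors}

decision = max(posteriors, key=posteriors.get)          # class with the largest posterior
p_error = min(posteriors.values())                      # P(classification error | x = 45)
print(decision, posteriors, p_error)                    # -> eagle, error about 0.0074
```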

Cios / Pedrycz / Swiniarski / Kurgan

bayesian classification general case
Bayesian Classification – General Case
  • Bayes’ Classification Rule for Multiclass Multifeature Objects
    • Real-valued features of an object as an n-dimensional column vector x ∈ Rn:
    • The object may belong to l distinct classes (l distinct states of nature):

Cios / Pedrycz / Swiniarski / Kurgan

bayesian classification general case1
Bayesian Classification – General Case
  • Bayes’ Classification Rule for Multiclass Multifeature Objects
    • Bayes’ theorem
      • A priori probability: P(ci) (i = 1, 2…,l)
      • Class conditional probability density function : p(x|ci)
      • A posteriori (posterior) probability: P(ci|x)
      • Unconditional probability density function:

Cios / Pedrycz / Swiniarski / Kurgan

bayesian classification general case2
Bayesian Classification – General Case
  • Bayes’ Classification Rule for Multiclass Multifeature Objects
    • Bayes classification rule:
    • A given object with a given value x of a feature vector can be classified as belonging to class cj when:
    • Assign an object with a given value x of a feature vector to class cj when:

Cios / Pedrycz / Swiniarski / Kurgan

classification that minimizes risk
Classification that Minimizes Risk
  • Basic Idea
    • To incorporate the fact that misclassifications of some classes are more costly than others, we define a classification based on a minimization criterion that involves a loss assigned to a given classification decision for a given true state of nature
  • A loss function
    • Cost (penalty, weight) of assigning an object to class cj when in fact the true class is ci

Cios / Pedrycz / Swiniarski / Kurgan

classification that minimizes risk1
Classification that Minimizes Risk
  • A loss matrix
    • For an l-class classification problem, the loss function values Lij are arranged into an l × l loss matrix
  • Expected (average) conditional loss
      • In short, the conditional risk of deciding class cj for a pattern x is R(cj | x) = Σi Lij P(ci | x), as sketched below
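A minimal sketch of risk-minimizing classification under these definitions, assuming the posteriors P(ci|x) and the loss matrix are already available (the numbers are hypothetical):

```python
import numpy as np

# Hypothetical posteriors P(ci|x) for l = 3 classes, and a loss matrix
# where L[i, j] is the loss of deciding class cj when the true class is ci.
posteriors = np.array([0.6, 0.3, 0.1])
L = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [5.0, 1.0, 0.0]])

# Conditional risk of each decision j: R(cj|x) = sum_i L[i, j] * P(ci|x)
risks = L.T @ posteriors
decision = int(np.argmin(risks))     # Bayes rule with risk: pick the minimum-risk class
print(risks, decision)               # the costly misclassification shifts the decision to c2
```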

Cios / Pedrycz / Swiniarski / Kurgan

classification that minimizes risk2
Classification that Minimizes Risk
  • Overall Risk
    • The overall risk R can be considered as a classification criterion for minimizing risk related to a classification decision.
  • Bayes risk
    • Minimal overall risk R leads to the generalization of Bayes’ rule for minimization of the probability of classification error.

Cios / Pedrycz / Swiniarski / Kurgan

classification that minimizes risk3
Classification that Minimizes Risk
  • Bayes’ classification rule with Bayes risk
    • Choose a decision (a class) ci for which:

Cios / Pedrycz / Swiniarski / Kurgan

classification that minimizes risk4
Classification that Minimizes Risk
  • Bayesian Classification Minimizing the Probability of Error
    • Symmetrical zero-one conditional loss function
    • With this loss, the conditional risk R(cj | x) criterion is the same as the conditional probability of classification error:
    • The average probability of classification error is thus used as the minimization criterion for selecting the best classification decision

Cios / Pedrycz / Swiniarski / Kurgan

classification that minimizes risk5
Classification that Minimizes Risk
  • Generalization of the Maximum Likelihood Classification
    • Generalized likelihood ratio for classes ci and cj
    • Generalized threshold value
    • The maximum likelihood classification rule
    • “Decide a class cj if

Cios / Pedrycz / Swiniarski / Kurgan

decision regions and probability of errors
Decision Regions and Probability of Errors
  • Decision regions
    • A classifier divides the feature space into l disjoint decision subspaces R1,R2, … Rl
    • The region Ri is a subspace such that each realization x of a feature vector of an object falling into this region will be assigned to a class ci

Cios / Pedrycz / Swiniarski / Kurgan

decision regions and probability of errors1
Decision Regions and Probability of Errors
  • Decision boundaries (decision surfaces)
    • Boundaries between adjacent decision regions are called decision boundaries (decision surfaces)
    • “The task of classifier design is to find classification rules that divide the feature space into optimal decision regions R1, R2, …, Rl (with optimal decision boundaries) so as to minimize a selected classification performance criterion”

Cios / Pedrycz / Swiniarski / Kurgan

decision regions and probability of errors2
Decision Regions and Probability of Errors
  • Decision boundaries

Cios / Pedrycz / Swiniarski / Kurgan

decision regions and probability of errors3
Decision Regions and Probability of Errors
  • Optimal classification with decision regions
    • Average probability of correct classification
  • “Classification problems can be stated as choosing decision regions Ri (thus defining a classification rule) that maximize the probability P(classification_correct) of correct classification, which serves as the optimization criterion”

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions
Discriminant Functions
  • Discriminant functions:
  • Discriminant type classifier
    • It assigns an object with a given value x of a feature vector to a class cj if
  • Classification rule for a discriminant function-based classifier
    • Compute numerical values of all discriminant functions for x
    • Choose, as the prediction of the true class, the class cj for which the value of the associated discriminant function dj(x) is the largest:
      • Select a class cj for which dj(x) = max(di(x)); i = 1, 2, …, l
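A minimal sketch of this classification rule; the two discriminant functions below are hypothetical placeholders (any set of functions di(x), such as the Bayesian discriminants discussed next, can be plugged in):

```python
import numpy as np

def discriminant_classify(x, discriminants):
    """Evaluate all di(x) and choose the class with the largest value."""
    values = [d(x) for d in discriminants]
    return int(np.argmax(values))          # index j of the predicted class cj

# Hypothetical example with two linear discriminants over x in R^2.
d1 = lambda x: 1.0 * x[0] + 0.5 * x[1] - 1.0
d2 = lambda x: -0.5 * x[0] + 1.0 * x[1] + 0.2
print(discriminant_classify(np.array([1.0, 2.0]), [d1, d2]))   # -> 1 (class c2)
```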

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions1
Discriminant Functions
  • Discriminant classifier

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions2
Discriminant Functions
  • Discriminant type classifier for Bayesian classification
    • The natural choice for the discriminant function is the a posteriori conditional probability P(ci|x):
    • Practical versions using Bayes’ theorem
    • Bayesian discriminant in a natural logarithmic form

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions3
Discriminant Functions
  • Characteristics of discriminant function
    • Discriminant functions define the decision boundaries that separate the decision regions
    • Generally, the decision boundaries are defined by neighboring decision regions when the corresponding discriminant function values are equal
    • The decision boundaries are unaffected by monotonically increasing transformations of the discriminant functions

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions4
Discriminant Functions
  • Bayesian Discriminant Functions for Two Classes
    • General case
      • Two discriminant functions: d1(x) and d2(x).
      • Two decision regions: R1 and R2.
      • The decision boundary: d1(x) = d2(x).
    • Using dichotomizer
      • Single discriminant function: d(x) = d1(x) - d2(x).

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions5
Discriminant Functions
  • Quadratic and Linear Discriminants Derived from the Bayes Rule
    • Quadratic Discriminant
      • Assumption:
      • A multivariate normal (Gaussian) distribution of the feature vector x within each class
      • The Bayesian discriminant (from the previous section):

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions6
Discriminant Functions
  • Quadratic and Linear Discriminants Derived from the Bayes Rule
    • Quadratic Discriminant
      • Gaussian distribution of the probability density function
      • Quadratic Discriminant function
      • Decision boundaries:
      • hyperquadric surfaces in n-dimensional feature space (hyperspheres, hyperellipsoids, hyperparaboloids, etc.)

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions7
Discriminant Functions
  • Quadratic and Linear Discriminants Derived from the Bayes Rule
  • Given: A pattern x, values of the state conditional probability densities p(x|ci), and the a priori probabilities P(ci)
    • Compute the values of the mean vectors μi and the covariance matrices Σi for all classes i = 1, 2, …, l based on the training set
    • Compute the values of the discriminant functions for all classes
    • Choose, as the prediction of the true class, the class cj for which the value of the associated discriminant function dj(x) is the largest:
      • Select a class cj for which dj(x) = max(di(x)), i = 1, 2, …, l
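A minimal sketch of this procedure under the Gaussian assumption. The quadratic discriminant form used here, di(x) = −0.5·ln|Σi| − 0.5·(x − μi)^T Σi^(−1) (x − μi) + ln P(ci), is the standard one for normal class models; the slide's own formula images are not reproduced, so this form (and the small data set) is an assumption:

```python
import numpy as np

def quadratic_discriminant(x, mu, sigma, prior):
    """di(x) for a Gaussian class model N(mu, sigma) with prior P(ci)."""
    diff = x - mu
    return (-0.5 * np.log(np.linalg.det(sigma))
            - 0.5 * diff @ np.linalg.inv(sigma) @ diff
            + np.log(prior))

def classify(x, means, covs, priors):
    scores = [quadratic_discriminant(x, m, s, p)
              for m, s, p in zip(means, covs, priors)]
    return int(np.argmax(scores))          # class index with the largest discriminant value

# Hypothetical two-class training data in R^2: estimate the mean vectors and
# covariance matrices per class, then classify a new pattern.
X1 = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 0.5]])    # class c1 patterns
X2 = np.array([[4.0, 4.0], [5.0, 3.5], [4.5, 4.5]])    # class c2 patterns
means = [X1.mean(axis=0), X2.mean(axis=0)]
covs = [np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)]
priors = [0.5, 0.5]
print(classify(np.array([2.0, 1.0]), means, covs, priors))   # -> 0 (class c1)
```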

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions8
Discriminant Functions
  • Quadratic and Linear Discriminants Derived from the Bayes Rule
    • Linear Discriminant:
      • Assumption: equal covariances for all classes, Σi = Σ
      • The Quadratic Discriminant:
      • A linear form of discriminant functions:

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions9
Discriminant Functions
  • Quadratic and Linear Discriminants Derived from the Bayes Rule
    • Linear Discriminant:

Decision boundaries between classes i and j, for which di(x) = dj(x), are pieces of hyperplanes in n-dimensional feature space

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions10
Discriminant Functions
  • Quadratic and Linear Discriminants Derived from the Bayes Rule
    • The classification process using linear discriminants
      • Compute, for a given x, numerical values of discriminant functions for all classes:
      • Choose the class cj for which the value of the discriminant function dj(x) is the largest:
        • Select a class cj for which dj(x) = max(di(x)), i = 1, 2, …, l

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions11
Discriminant Functions
  • Quadratic and Linear Discriminants
    • Example
    • Let us assume that the following two-feature patterns x ∈ R2 from two classes c1 = 0 and c2 = 1 have been drawn according to the Gaussian (normal) density distribution:

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions12
Discriminant Functions
  • Quadratic and Linear Discriminants
    • Example
      • The estimates of the symmetric covariance matrices for both classes
      • The linear discriminant functions for both classes

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions13
Discriminant Functions
  • Quadratic and Linear Discriminants
    • Example
      • Two-class two-feature pattern dichotomizer.

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions14
Discriminant Functions
  • Quadratic and Linear Discriminants
    • Minimum Mahalanobis Distance Classifier
      • Assumption
        • Equal covariances for all classes, Σi = Σ (i = 1, 2, …, l)
        • Equal a priori probabilities for all classes P(ci) = P
      • Discriminant function

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions15
Discriminant Functions
  • Quadratic and Linear Discriminants
    • Minimum Mahalanobis Distance Classifier
      • A classifier selects the class cj for which a value x is nearest, in the sense of Mahalanobis distance, to the corresponding mean vector μj. This classifier is called a minimum Mahalanobis distance classifier.
      • Linear version of the minimum Mahalanobis distance classifier

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions16
Discriminant Functions
  • Quadratic and Linear Discriminants
    • Minimum Mahalanobis Distance Classifier
    • Given: The mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of a feature vector
      • Compute numerical values of the Mahalanobis distances between x and the means μi for all classes.
      • Choose, as the prediction of the true class, the class cj for which the value of the associated Mahalanobis distance attains the minimum:
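A minimal sketch of the minimum Mahalanobis distance classifier; the squared-distance form (x − μi)^T Σ^(−1) (x − μi) with a covariance matrix Σ shared by all classes follows the assumptions above, and the numbers are hypothetical:

```python
import numpy as np

def mahalanobis_classify(x, means, sigma):
    """Choose the class whose mean vector is nearest to x in the Mahalanobis sense."""
    inv = np.linalg.inv(sigma)                        # common covariance for all classes
    d2 = [(x - m) @ inv @ (x - m) for m in means]     # squared Mahalanobis distances
    return int(np.argmin(d2))                         # index j of the nearest mean

# Hypothetical example: two class means in R^2 and one shared covariance matrix.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
print(mahalanobis_classify(np.array([1.0, 0.5]), means, sigma))   # -> 0
```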

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions17
Discriminant Functions
  • Quadratic and Linear Discriminants
    • Linear Discriminant for Statistically Independent Features
      • Assumption
        • Equal covariances for all classes, Σi = Σ (i = 1, 2, …, l)
        • Features are statistically independent
      • Discriminant function
      • where

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions18
Discriminant Functions
  • Quadratic and Linear Discriminants
    • Linear Discriminant for Statistically Independent Features
      • Discriminants
      • Quadratic discriminant formula
      • Linear discriminant formula

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions19
Discriminant Functions
  • Quadratic and Linear Discriminants
    • Linear Discriminant for Statistically Independent Features
      • “Neural network” style as a linear threshold machine
      • where
      • The decision surfaces for the linear discriminants are pieces of hyperplanes defined by the equations di(x) = dj(x).

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions20
Discriminant Functions
  • Quadratic and Linear Discriminants
    • Minimum Euclidean Distance Classifier
      • Assumption
        • Equal covariances for all classes, Σi = Σ (i = 1, 2, …, l)
        • Features are statistically independent
        • Equal a priori probabilities for all classes P(ci) = P
      • Discriminants
      • or

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions21
Discriminant Functions
  • Quadratic and Linear Discriminants
    • Minimum Euclidean Distance Classifier
      • The minimum distance classifier, or minimum Euclidean distance classifier, selects the class cj for which a value x is nearest to the corresponding mean vector μj.
      • Linear version of the minimum distance classifier

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions22
Discriminant Functions
  • Quadratic and Linear Discriminants
  • Given: The mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of a feature vector
    • Compute numerical values of the Euclidean distances between x and the means μi for all classes:
    • Choose, as the prediction of the true class, the class cj for which the value of the associated Euclidean distance is the smallest:

Cios / Pedrycz / Swiniarski / Kurgan

discriminant functions23
Discriminant Functions
  • Quadratic and Linear Discriminants
    • Characteristics of Bayesian Normal Discriminant
      • Assumptions
        • multivariate normality within classes
        • equal covariance matrices between classes
      • The linear discriminant is equivalent to the optimal classifier
      • These assumptions are satisfied only approximately
      • Due to its simple structure, the linear discriminant tends not to overfit the training data set, which may lead to stronger generalization ability for unseen cases

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities
Estimation of Probability Densities
  • Basic Idea
    • In Bayesian classifier design, the a priori probabilities and the class conditional probability densities must be estimated from a limited number of previously observed objects. This estimation should be optimal according to a well-defined estimation criterion.
  • Estimates of a priori probabilities

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities1
Estimation of Probability Densities
  • Estimation of the class conditional probability densities p(x|ci)
    • Parametric methods
    • with the assumption of a specific functional form of a probability density function
    • Nonparametric methods
    • without the assumption of a specific functional form of a probability density function
    • Semiparametric method
    • a combination of parametric and nonparametric methods

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities2
Estimation of Probability Densities
  • Parametric Methods
    • A priori observations of objects and corresponding patterns:
    • Split set of all patterns X according to a class into l disjoint sets:
    • Assume that the parametric form of the class conditional probability density is given as a function:
    • where

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities3
Estimation of Probability Densities
  • Parametric Methods
    • If the probability density has a normal (Gaussian) form:
    • where

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities4
Estimation of Probability Densities
  • The Maximum Likelihood Estimation of Parameters
    • Assumption
      • we are given a limited-size set of N patterns xi:
      • we know a parametric form p(x|θ) of a conditional probability density function
    • Goal
      • The task of estimation is to find the optimal (the best according to the used criterion) value of the parameter vector θ of a given dimension m.

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities5
Estimation of Probability Densities
  • The Maximum Likelihood Estimation of Parameters
    • Likelihood
      • The joint probability density L(θ) is a function of the parameter vector θ for a given set of patterns X.
      • It is called the likelihood of θ for a given set of patterns X.
    • Maximum Likelihood Estimation
    • The function L(θ) can be chosen as a criterion for finding the optimal estimate of θ. It is called the maximum likelihood estimation of parameters θ

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities6
Estimation of Probability Densities
  • The Maximum Likelihood Estimation of Parameters
    • Minimizing the negative natural logarithm of the likelihood L(θ):
    • For the differentiable function p(xi|θ):

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities7
Estimation of Probability Densities
  • The Maximum Likelihood Estimation of Parameters
    • For the normal form of a probability density function N(μ, Σ), with unknown parameters μ and Σ constituting the vector θ:

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities8
Estimation of Probability Densities
  • The Maximum Likelihood Estimation of Parameters
    • Example of Maximum Likelihood Estimation
      • for
      • The maximum likelihood estimation criterion
      • The maximum likelihood estimates for the parameters:
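A minimal sketch of maximum likelihood estimation for a normal density N(μ, Σ); the slide's formula images are not reproduced here, so the code simply uses the standard closed-form results, i.e. the sample mean for μ and the 1/N-normalized scatter matrix for Σ:

```python
import numpy as np

def gaussian_mle(X):
    """ML estimates of mu and Sigma from an (N, n) array of patterns."""
    N = X.shape[0]
    mu_hat = X.mean(axis=0)                  # mu_hat = (1/N) * sum_i x_i
    centered = X - mu_hat
    sigma_hat = centered.T @ centered / N    # Sigma_hat = (1/N) * sum_i (x_i - mu_hat)(x_i - mu_hat)^T
    return mu_hat, sigma_hat

# Hypothetical data: 500 patterns drawn from a known two-dimensional Gaussian.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.4], [0.4, 2.0]], size=500)
mu_hat, sigma_hat = gaussian_mle(X)
print(mu_hat, sigma_hat, sep="\n")           # close to the generating parameters
```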

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities9
Estimation of Probability Densities
  • Nonparametric Methods
    • “Nonparametric methods are more general methods of probability density estimation that are based on existing data, without an assumption about a functional form of the probability density function.”
    • Nonparametric techniques:
      • Histogram
      • Kernel-based method
      • k-nearest neighbors
      • Nearest neighbors

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities10
Estimation of Probability Densities
  • Nonparametric Methods
    • General Idea
    • Determine an estimate of a true probability density p(x) based on the available limited-size samples
      • The probability that a new pattern x will fall inside a region R
      • Approximation of the probability for a small region and for continuous p(x), with almost the same values within a region R

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities11
Estimation of Probability Densities
  • Nonparametric Methods
    • General Idea
      • The probability that for N sample patterns set k of them will fall in a region R
      • Estimate of the probability P
      • Approximation for a probability density function for a given pattern x

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities12
Estimation of Probability Densities
  • Nonparametric Methods
    • Kernel-based Method and Parzen Window
      • The kernel-based method fixes around a pattern vector x a region R (and thus a region volume V) and counts the number k of given training patterns falling into this region, using a special kernel function associated with the region.
      • Such a kernel function is also called a Parzen window

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities13
Estimation of Probability Densities
  • Nonparametric Methods
    • Hypercube-type Parzen window
      • Volume of the hypercube:
      • Kernel (window) function:
      • Total number of patterns falling within the hypercube
      • The estimate of the probability density function

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities14
Estimation of Probability Densities
  • Nonparametric Methods
    • Smooth estimate of the probability density function
      • A kernel function must satisfy two conditions:
      • and
      • For example, the radially symmetric multivariate Gaussian (normal) kernel:
      • The estimate of the probability density function:
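A minimal sketch of this smooth estimate with a radially symmetric Gaussian kernel; the bandwidth (smoothing parameter) h and the normalizing factor (2πh^2)^(n/2) follow the standard form and are assumptions here, since the slide's formulas are not reproduced:

```python
import numpy as np

def gaussian_kde(x, X, h):
    """Kernel estimate p_hat(x) = (1/N) * sum_i K_h(x - x_i) with a Gaussian kernel."""
    N, n = X.shape
    sq_dists = np.sum((X - x) ** 2, axis=1)         # ||x - x_i||^2 for all training patterns
    norm = (2.0 * np.pi * h ** 2) ** (n / 2.0)      # Gaussian kernel normalization
    return np.mean(np.exp(-sq_dists / (2.0 * h ** 2)) / norm)

# Hypothetical training patterns in R^2 and a query point.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
print(gaussian_kde(np.array([0.0, 0.0]), X, h=0.5))   # estimated density at the origin
```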

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities15
Estimation of Probability Densities
  • Nonparametric Methods
    • Smooth estimate of the probability density function
      • The estimate of the class-dependent p(x|ck) probability density:
      • The estimate of the class-dependent p(x|ck) probability density for the Gaussian kernel:

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities16
Estimation of Probability Densities
  • Nonparametric Methods
    • Design issues
      • The selection of a kernel function:
        • Parzen window, Gaussian kernel, etc.
      • The selection of a smoothing parameter
    • The generalization ability of the kernel-based density estimation depends on the training set and on smoothing parameters

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities17
Estimation of Probability Densities
  • Nonparametric Methods
    • K-nearest Neighbors
      • “A method of probability density estimation with variable size regions”
      • First, a small n-dimensional sphere is located in the pattern space centered at the point x.
      • Second, a radius of this sphere is extended until the sphere contains exactly the fixed number k of patterns from a given training set.
      • Then an estimate of the probability density for x is computed as p̂(x) = k / (N·V), where V is the volume of the resulting sphere

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities18
Estimation of Probability Densities
  • Nonparametric Methods
    • K-nearest Neighbors Classification Rule
      • First, for a given x, the first k-nearest neighbors from a training set should be found (regardless of a class label) based on a defined pattern distance measure.
      • Second, among the selected k nearest neighbors, numbers ni of patterns belonging to each class ci are computed.
      • Then, the predicted class cj assigned to x corresponds to a class for which nj is the largest.
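A minimal sketch of the k-nearest neighbors classification rule just described, using the Euclidean distance as the pattern distance measure (that choice, and the small data set, are assumptions for illustration):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Assign to x the class most frequent among its k nearest training patterns."""
    dists = np.linalg.norm(X_train - x, axis=1)    # distances to all training patterns
    nearest = np.argsort(dists)[:k]                # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)   # class counts n_i among those neighbors
    return votes.most_common(1)[0][0]              # class c_j with the largest n_j

# Hypothetical two-class training set in R^2.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_classify(np.array([0.8, 0.8]), X_train, y_train, k=3))   # -> 1
```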

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities19
Estimation of Probability Densities
  • Nonparametric Methods
    • Nearest Neighbors Classification Rule
      • “The simple version of the k-nearest neighbors classification is for a number of neighbors k equal to one”
    • Algorithm
    • Given: A training set Ttra of N patterns x1, x2, …, xN labeled by l classes. A new pattern x.
      • Compute for a given x the nearest neighbor xj from the whole training set, based on the defined pattern distance measure distance(x, xi).
      • Assign to x the class cj of its nearest neighbor xj.

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities20
Estimation of Probability Densities
  • Semiparametric Methods
    • “Combination of parametric and nonparametric methods”
    • Two semiparametric methods
      • Functional approximation
      • Mixture models (mixtures of probability densities)
    • Major advantage
    • The approach can fit component functions precisely and locally to specific regions of a feature space, based on what the existing data reveal about the probability distribution and its modalities

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities21
Estimation of Probability Densities
  • Semiparametric Methods
    • Functional Approximation
      • Approximation of the density by a linear combination of m basis functions φi(x):
      • Using a symmetric radial basis function

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities22
Estimation of Probability Densities
  • Semiparametric Methods
    • Functional Approximation
      • Gaussian radial function: “The most commonly used basis function”
      • Optimization criterion for the functional approximation of density
      • Optimal estimates for parameters:

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities23
Estimation of Probability Densities
  • Semiparametric Methods
    • The algorithm for functional approximation
    • Given: A training set Ttra of N patterns x1, x2, …, xN. The m orthonormal radial basis functions φi(x) (i = 1, 2, …, m), along with their parameters.
      • Compute the estimates of unknown parameters
      • Form the model of the probability density as a functional approximation

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities24
Estimation of Probability Densities
  • Semiparametric Methods
    • Mixture Models (Mixtures of Probability Densities)
    • “These models are based on linear parametric combination of known probability density functions (for example, normal densities) localized in certain regions of data”
      • The linear mixture distribution
      • Simplified version:
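A minimal sketch of a linear mixture of densities, written out as p(x) = Σj P(j) p(x|j); the univariate Gaussian components, their parameters, and the mixing proportions below are hypothetical, since the slide's own formulas are not reproduced:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def mixture_density(x, weights, mus, variances):
    """p(x) = sum_j P(j) * p(x|j): a linear combination of component densities."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, mus, variances))

# Hypothetical two-component Gaussian mixture; the mixing proportions P(j) sum to one.
weights = [0.7, 0.3]
mus, variances = [0.0, 4.0], [1.0, 0.5]
print(mixture_density(1.0, weights, mus, variances))
```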

Cios / Pedrycz / Swiniarski / Kurgan

estimation of probability densities25
Estimation of Probability Densities
  • Distance Between Probability Densities and the Kullback-Leibler Distance
    • Distance
    • “We can define the distance between two densities, with true density p(x) and its approximate estimate p̂(x)”
    • Kullback-Leibler distance
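A minimal sketch of the Kullback-Leibler distance D(p, p̂) = ∫ p(x) ln(p(x)/p̂(x)) dx, evaluated by simple numerical integration on a grid for one-dimensional densities; this is the standard definition, written out here because the slide's formula is not reproduced:

```python
import numpy as np

def kl_distance(p, p_hat, xs):
    """Approximate D(p || p_hat) = integral of p(x) * ln(p(x)/p_hat(x)) dx on a uniform grid."""
    dx = xs[1] - xs[0]
    px, qx = p(xs), p_hat(xs)
    return np.sum(px * np.log(px / qx)) * dx

# Hypothetical example: true density N(0, 1), approximate estimate N(0.5, 1.2).
def gaussian(mu, var):
    return lambda x: np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

xs = np.linspace(-10.0, 10.0, 2001)
print(kl_distance(gaussian(0.0, 1.0), gaussian(0.5, 1.2), xs))   # small positive number
```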

Cios / Pedrycz / Swiniarski / Kurgan

probabilistic neural network
Probabilistic Neural Network
  • Probabilistic Neural Network
  • “The PNN is a hardware implementation of the kernel-based method of density estimation and Bayesian optimal classification (providing minimization of the average probability of the classification error)”
    • Optimal Bayes’ classification rule
    • Kernel-based estimation of a probability density function

Cios / Pedrycz / Swiniarski / Kurgan

probabilistic neural network1
Probabilistic Neural Network
  • Topology

Cios / Pedrycz / Swiniarski / Kurgan

probabilistic neural network2
Probabilistic Neural Network
  • Details
    • An input layer (weightless) consists of n neurons (units), each receiving one element xi (i = 1,2,…, n) of the n-dimensional input pattern vector x.
    • A pattern layer consists of N neurons (units, nodes), each representing one reference pattern from the training set Ttra .
    • The transfer function of the pattern layer neuron implements a kernel function (a Parzen window)

Cios / Pedrycz / Swiniarski / Kurgan

probabilistic neural network3
Probabilistic Neural Network
  • Details
    • The weightless second hidden layer is the summation layer. The number of neurons in the summation layer is equal to the number of classes l.
    • The output activation function of the summation layer neuron is generally equal to but may be modified for different kernel functions.
    • The output layer is the classification decision layer that implements Bayes’ classification rule by selecting the largest value and thus decides a class cj for the pattern x

Cios / Pedrycz / Swiniarski / Kurgan

probabilistic neural network4
Probabilistic Neural Network
  • Pattern Processing
  • “Processing of patterns by the already-designed PNN network is performed in the feedforward manner. The input pattern is presented to the network and processed forward by each layer. The resulting output is the predicted class”
  • PNN with the Radial Gaussian Kernel
    • Kernel function:
    • Transfer function:
    • Output activation function:
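A minimal sketch of the feedforward pass described above: a pattern layer with one Gaussian kernel unit per training pattern, a summation layer with one unit per class, and an output layer that picks the largest value. The smoothing parameter sigma and the dropped kernel normalization (common to all classes) are assumptions; the slide's transfer-function formulas are not reproduced:

```python
import numpy as np

def pnn_classify(x, X_train, y_train, sigma=0.5):
    """Probabilistic Neural Network forward pass with a radial Gaussian kernel."""
    # Pattern layer: one kernel unit per training pattern.
    sq_dists = np.sum((X_train - x) ** 2, axis=1)
    activations = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Summation layer: one unit per class; the per-class sum is proportional to
    # p(x|ci) * P(ci) when the priors are estimated as Ni / N.
    classes = np.unique(y_train)
    class_scores = np.array([activations[y_train == c].sum() for c in classes])
    # Output (decision) layer: Bayes' classification rule, take the largest value.
    return classes[int(np.argmax(class_scores))]

# Hypothetical two-class training set in R^2.
X_train = np.array([[0.0, 0.0], [0.3, 0.2], [2.0, 2.0], [2.2, 1.8]])
y_train = np.array([0, 0, 1, 1])
print(pnn_classify(np.array([0.2, 0.1]), X_train, y_train))   # -> 0
```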

Cios / Pedrycz / Swiniarski / Kurgan

probabilistic neural network5
Probabilistic Neural Network
  • PNN with the Radial Gaussian Normal Kernel and Normalized Patterns
    • Transfer function:
    • Normalization of patterns allows for a simpler architecture of the pattern-layer neurons, containing here also input weights and an exponential output activation function.
    • The transfer function of a pattern neuron can be divided into a neuron’s transfer function and an output activation function
    • The pattern-neuron output activation function:

Cios / Pedrycz / Swiniarski / Kurgan

constraints in classifier design
Constraints in Classifier Design
  • Problems
    • Will a classifier guarantee minimization of the average probability of the classification error?
    • Does a training set well represent patterns generated by a physical phenomenon?
    • Are patterns drawn according to the characteristic of underlying phenomenon probability density?
    • Is the average probability of a classification error difficult to calculate?

Cios / Pedrycz / Swiniarski / Kurgan

constraints in classifier design1
Constraints in Classifier Design
  • Suboptimal solutions of Bayesian classifier design
    • The estimation of class conditional probabilities is based on a limited sample
    • The samples are frequently collected randomly, and not by use of a well-planned experimental procedure

Cios / Pedrycz / Swiniarski / Kurgan

regression
REGRESSION
  • Data Models
  • Simple Linear Regression Analysis
  • Multiple Regression
  • General Least Squares and Multiple Regression
  • Assessing the Quality of the Multiple Regression Model

Cios / Pedrycz / Swiniarski / Kurgan

data models
Data Models
  • Mathematical models
  • “They are useful approximate representations of phenomena that generate data and may be used for prediction, classification, compression, or control design.”
  • Black-box models
    • Mathematical models obtained by processing existing data without using laws of physics governing data-generating phenomena
  • Regression analysis
    • Data analysis and model design are based on a sample from a given population

Cios / Pedrycz / Swiniarski / Kurgan

data models1
Data Models
  • Categories of regression models
    • Simple linear regression
    • Multiple linear regression
    • Neural network-based linear regression
    • Polynomial regression
    • Logistic regression
    • Log-linear regression
    • Local piecewise linear regression
    • Nonlinear regression (with a nonlinear model)
    • Neural network-based nonlinear regression

Cios / Pedrycz / Swiniarski / Kurgan

data models2
Data Models
  • Static and dynamic models
    • A static model produces outcomes based only on the current input (no internal memory).
    • A dynamic model produces outcomes based on the current input and the past history of the model behavior (internal memory)

Cios / Pedrycz / Swiniarski / Kurgan

data models3
Data Models
  • Data gathering
    • Random sample from a certain population
    • N pairs of the experimental data set named Torig

Cios / Pedrycz / Swiniarski / Kurgan

data models4
Data Models
  • Regression analysis
  • “A statistical method used to discover the relationship between variables and to design a data model that can be used to predict variable values based on other variables”

Cios / Pedrycz / Swiniarski / Kurgan

data models5
Data Models
  • Regression analysis
    • A simple linear regression
      • To find the linear relationship between two variables, x and y, and to discover a linear model, i.e., a line equation y = b + ax, which best fits the given data and can be used to predict values of y
      • This modeling line is called the regression line of y on x
      • The equation of that line is called a regression equation (regression model)
    • Typical linear regression analysis provides a prediction of a dependent variable y based on an independent variable x

Cios / Pedrycz / Swiniarski / Kurgan

data models6
Data Models
  • Visualization of Regression
    • Scatter plot for height versus weight data

Cios / Pedrycz / Swiniarski / Kurgan

data models7
Data Models
  • Visualization of Regression
    • Scatter plot for height versus weight data

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis
Simple Linear Regression Analysis
  • Sample data and Regression model

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis1
Simple Linear Regression Analysis
  • Assumptions
    • The observations yi (i = 1, …, N) are random samples and are mutually independent.
    • The regression error terms (the difference between the predicted value and the true value) are also mutually independent, with the same distribution (normal distribution with zero mean) and constant variances
    • The distribution of the error term is independent of the joint distribution of explanatory variables. It is also assumed that unknown parameters of regression models are constants

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis2
Simple Linear Regression Analysis
  • Simple Linear Regression Analysis
    • Evaluation of basic statistical characteristics of data
    • An estimation of the optimal parameters of a linear model
    • Assessment of model quality and of the generalization ability to predict the outcome for new data

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis3
Simple Linear Regression Analysis
  • Model Structure
    • Nonlinear data:
    • Generally, a function f(x) could be nonlinear in x:
    • Linear form :

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis4
Simple Linear Regression Analysis
  • Regression Error (residual error)
    • Difference between real-value yi and predicted-value yi,est

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis5
Simple Linear Regression Analysis
  • Performance Criterion – Sum of Squared Errors.
    • The sum of squared errors performance criterion for multiple regression
    • The minimization technique in regression uses the sum of squared errors as the criterion: the method of least squares of errors (LSE) or, in short, the method of least squares

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis6
Simple Linear Regression Analysis
  • Basic Statistical Characteristics of Data
    • The mean of N samples
    • The variance
    • The covariance

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis7
Simple Linear Regression Analysis
  • Sum of Squared Variations in y Caused by the Regression Model
    • The total sum of squared variations in y
    • These formulas are used to define important regression measures (for example, the correlation coefficient)

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis8
Simple Linear Regression Analysis
  • Computing Optimal Values of the Regression Model Parameters
    • The optimal model parameters values have to be computed based on the given data set and the defined performance criterion
    • Methods for estimation of optimal model parameter values
      • The analytical offline method
      • The analytical recursive offline method
      • Iterative search for the optimal model parameters
      • Neural network-based regression

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis9
Simple Linear Regression Analysis
  • Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model
    • The general linear model structure
    • The performance criterion
    • and performance curve

y = ax (a model with b=0)

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis10
Simple Linear Regression Analysis
  • Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model
    • The optimal parameters

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis11
Simple Linear Regression Analysis
  • Procedure for simple linear regression
    • Given: The number N of experimental observations, and the set of the N experimental data points { (xi, yi), i = 1, 2, …, N }
    • Compute the statistical characteristics of the data
    • Compute the estimates of the model optimal parameters using Equations
    • Assess the regression model quality, indicating how well the model fits the data. Compute
      • Standard error of estimate
      • Correlation coefficient r
      • Coefficient of determination r2
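A minimal sketch of this procedure; the closed-form least-squares estimates a = Sxy / Sxx and b = mean(y) − a·mean(x), and the quality measures, follow the standard simple-regression formulas (the slide's equation images are not reproduced, and the four data points below are hypothetical, not those of the book's example):

```python
import numpy as np

def simple_linear_regression(x, y):
    """Fit y = b + a*x by least squares and report the fit quality."""
    x_bar, y_bar = x.mean(), y.mean()
    s_xy = np.sum((x - x_bar) * (y - y_bar))
    s_xx = np.sum((x - x_bar) ** 2)
    a = s_xy / s_xx                        # slope
    b = y_bar - a * x_bar                  # intercept
    y_est = b + a * x
    ss_res = np.sum((y - y_est) ** 2)      # unexplained variation
    ss_tot = np.sum((y - y_bar) ** 2)      # total variation in y
    r2 = 1.0 - ss_res / ss_tot             # coefficient of determination
    r = np.sign(a) * np.sqrt(r2)           # correlation coefficient (|r| = sqrt(r2) here)
    return a, b, r, r2

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 2.1, 2.5, 3.2])
print(simple_linear_regression(x, y))
```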

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis12
Simple Linear Regression Analysis

Example

  • Sample of four data points
  • Resulting regression line
  • y = 0.9 + 0.56x

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis13
Simple Linear Regression Analysis
  • Optimal Parameter Values in the Minimum Least Squares Sense
    • Required conditions for a valid linear regression
      • The error term e = y - (b + ax) is normally distributed
      • The error variance is the same for all values of x
      • Errors are independent of each other.

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis14
Simple Linear Regression Analysis
  • Quality of the Linear Regression Model and Linear Correlation Analysis
    • Assessment of model quality
      • The resulting correlation coefficient can be used as a measure of how well the trends predicted by the model follow the trends in the training data
      • The coefficient of determination can be used to measure how well the regression line fits the data points

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis15
Simple Linear Regression Analysis
  • Correlation coefficient
  • Coefficient of determination
    • The percent of variation in the dependent variable y that can be explained by the regression equation,
    • the explained variation in y divided by the total variation, or
    • the square of r (correlation coefficient)

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis16
Simple Linear Regression Analysis
  • Coefficient of determination
    • Explained and unexplained variation in y

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis17
Simple Linear Regression Analysis
  • Coefficient of determination
    • Example
      • If the coefficient of correlation has the value r = 0.9327, then the value of the coefficient of determination is r2 = 0.8700. This means that 87% of the total variation in y can be explained by the linear relationship between x and y, as described by the optimal regression model of the data. The remaining 13% of the total variation in y remains unexplained.
    • The calculation of coefficient of determination

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis18
Simple Linear Regression Analysis
  • Matrix Version of Simple Linear Regression Based on Least Squares Method
    • The matrix form of the model description for all N experimental data points
    • The regression error

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis19
Simple Linear Regression Analysis
  • Matrix Version of Simple Linear Regression Based on Least Squares Method
    • The performance criterion:
    • Optimal parameters:
    • The value of the criterion for the optimal parameter vector:
    • The regression error for the model with the optimal parameter vector:

Cios / Pedrycz / Swiniarski / Kurgan

simple linear regression analysis20
Simple Linear Regression Analysis
  • Matrix Version of Simple Linear Regression Based on Least Squares Method
    • Example: let us consider again the dataset shown in the following table
    • y = 0.56x + 0.9

Cios / Pedrycz / Swiniarski / Kurgan

multiple regression
Multiple Regression
  • Definition
  • Multiple regression analysis is the statistical technique of exploring the relation (association) between a set of n independent variables and one (or, generally, several) dependent variable y, whose variability they are used to explain
    • Linear multiple regression model
    • Linear multiple regression model using vector notation
    • This regression model is represented by a hyperplane in (n + 1)-dimensional space.

Cios / Pedrycz / Swiniarski / Kurgan

multiple regression1
Multiple Regression
  • Geometrical Interpretation: Regression Errors
  • The goal of multiple regression is to find a hyperplane in the (n + 1)-dimensional space that will best fit the data
    • The performance criterion
    • The error variance and standard error of the estimate

Cios / Pedrycz / Swiniarski / Kurgan

multiple regression2
Multiple Regression
  • Degree of Freedom
    • The denominator N – n – 1 in the previous equation tells us that in multiple regression with n independent variables, the standard error has N – n – 1 degrees of freedom
    • The degrees of freedom have been reduced from N by n + 1 because the n + 1 numerical parameters a0, a1, a2, …, an of the regression model have been estimated from the data

Cios / Pedrycz / Swiniarski / Kurgan

general least squares and multiple regression
General Least Squares and Multiple Regression
  • General model description in function form
  • Data model
  • Performance criterion
  • Regression error

Cios / Pedrycz / Swiniarski / Kurgan

general least squares and multiple regression1
General Least Squares and Multiple Regression
  • General model description in matrix form
  • Data model
  • Performance criterion
  • Optimal parameters

Cios / Pedrycz / Swiniarski / Kurgan

general least squares and multiple regression2
General Least Squares and Multiple Regression
  • Practical, Numerically Stable Computation of the Optimal Model Parameters
  • Problem
  • “The solution for the optimal least-squares parameters is almost never computed directly from this equation, due to its poor numerical performance in cases when the matrix X^T X (the covariance matrix) is ill conditioned”
  • Solution: various matrix decomposition methods
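A minimal sketch contrasting the normal-equation solution a = (X^T X)^(−1) X^T y with a decomposition-based solver; numpy's SVD-based lstsq stands in here for the "various matrix decomposition methods" mentioned above, and the generated data are hypothetical:

```python
import numpy as np

# Hypothetical data: N = 100 cases, n = 2 independent variables plus an intercept column.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + rng.normal(scale=0.1, size=100)
X = np.column_stack([np.ones(100), X_raw])      # design matrix with a column of ones

# Normal equations: numerically poor when X^T X is ill conditioned.
a_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Decomposition-based (SVD) solution: the numerically preferred route.
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(a_normal)
print(a_lstsq)                                  # both close to (3.0, 1.5, -2.0) here
```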

Cios / Pedrycz / Swiniarski / Kurgan

assessing the quality of the multiple regression model
Assessing the Quality of the Multiple Regression Model
  • The Coefficient of Multiple Determination,R2
  • “The percent of the variance in the dependent variable that can be explained by all of the independent variables taken together.”
    • Adjusted R2
    • Adjusted R2 uses the number of model parameters (the n coefficients plus a constant) and the number of data points N in order to correct this coefficient in situations when unnecessary parameters are used in the model structure
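A minimal sketch of these two measures for a fitted multiple regression model; the adjustment formula R2adj = 1 − (1 − R2)(N − 1)/(N − n − 1) is the common one and is stated here as an assumption, since the slide's own formula is not reproduced:

```python
import numpy as np

def r2_and_adjusted(y, y_est, n):
    """Coefficient of multiple determination R^2 and adjusted R^2 (n = number of predictors)."""
    N = len(y)
    ss_res = np.sum((y - y_est) ** 2)          # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)       # total variation in y
    r2 = 1.0 - ss_res / ss_tot
    r2_adj = 1.0 - (1.0 - r2) * (N - 1) / (N - n - 1)
    return r2, r2_adj

# Hypothetical example: N = 6 observations, n = 2 independent variables.
y = np.array([3.0, 4.1, 5.2, 6.1, 6.9, 8.2])
y_est = np.array([3.1, 4.0, 5.0, 6.2, 7.1, 8.0])
print(r2_and_adjusted(y, y_est, n=2))
```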

Cios / Pedrycz / Swiniarski / Kurgan

assessing the quality of the multiple regression model1
Assessing the Quality of the Multiple Regression Model
  • Cp Statistic
    • The Cp statistic is used to compare alternative multiple regression models
    • When comparing alternative regression models, the designer aims to choose models whose values of Cp are close to or below (n + 1)

Cios / Pedrycz / Swiniarski / Kurgan

assessing the quality of the multiple regression model2
Assessing the Quality of the Multiple Regression Model
  • Multiple Correlation
    • A value of R can be found as the positive square root of R2 (coefficient of multiple determination)
    • It is a measure of the strength of the linear relationship between the dependent variable y and the set of independent variables x1, x2, …, xn.
    • A value of R close to 1 indicates that the fit is very good
    • A value near zero indicates that the model is not a good approximation of the data and cannot be efficiently used for prediction

Cios / Pedrycz / Swiniarski / Kurgan

assessing the quality of the multiple regression model3
Assessing the Quality of the Multiple Regression Model

Example

  • “Let us consider a multiple linear regression analysis for the data set containing N = 4 cases, composed with one dependent variable y and two independent variables x1 and x2”
    • Three-dimensional data

Cios / Pedrycz / Swiniarski / Kurgan

assessing the quality of the multiple regression model4
Assessing the Quality of the Multiple Regression Model

Example

  • The scatter plot of data points in three-dimensional space (x1, x2, y)

Cios / Pedrycz / Swiniarski / Kurgan

assessing the quality of the multiple regression model5
Assessing the Quality of the Multiple Regression Model

Example

  • The data matrix
  • The optimal model parameters

Cios / Pedrycz / Swiniarski / Kurgan

assessing the quality of the multiple regression model6
Assessing the Quality of the Multiple Regression Model

Example

  • The optimal model:
  • y = 3.1 + 0.9x1 + 0.56x2
  • The optimal regression model in (x1, x2, y) space :

Cios / Pedrycz / Swiniarski / Kurgan

assessing the quality of the multiple regression model7
Assessing the Quality of the Multiple Regression Model

Example

  • Multiple regression: regression plane model and scatter plot

Cios / Pedrycz / Swiniarski / Kurgan

assessing the quality of the multiple regression model8
Assessing the Quality of the Multiple Regression Model

Example

  • The residuals (errors)
  • The criterion value for the optimal parameters: 0.016

Cios / Pedrycz / Swiniarski / Kurgan

references
References
  • Bishop, C.M. 1995. Neural Networks for Pattern Recognition. Oxford Press
  • Cios, K.J., Pedrycz, W., and Swiniarski, R. 1998. Data Mining Methods for Knowledge Discovery. Kluwer
  • Draper, N.R., and Smith, H. Applied Regression Analysis Wiley Series in Probability and Statistics
  • Duda, R.O. Hart, P.E., and D.G. Stork. 2001. Pattern Classification. Wiley
  • Myers, R.H. 1986. Classical and Modern Regression with Applications, Boston, MA: Duxbury Press.

Cios / Pedrycz / Swiniarski / Kurgan
