
# Radial Basis Function Network and Support Vector Machine






### Radial Basis Function Network and Support Vector Machine

Team 1: J-X Huang, J-H Kim, K-S Cho

2003. 10. 29

Outline

• Radial Basis Function Network

  • Introduction

  • Architecture

  • Learning Strategies

  • MLP vs RBFN

• Support Vector Machine

  • Introduction

  • VC Dimension, Structural Risk Minimization

  • Linear Support Vector Machine

  • Nonlinear Support Vector Machine

• Conclusion

• Characteristic Feature

• Response decreases (or increases) monotonically with distance from a central point.

• A kind of supervised neural network: a feedforward network with three layers

• Approximates a function with a linear combination of radial basis functions:

F(x) = Σ w_i G(||x − x_i||),  i = 1, 2, …, M

• G(||x − x_i||) is a radial basis function

• Most often a Gaussian function

• When M = the number of samples, this is a regularization network

• When M < the number of samples, we call it a radial basis function network

[Figure: RBFN architecture with an input layer (x1, …, xp), a hidden layer of radial basis functions, and an output layer that combines them with weights w0, w1, …, wm]

Architecture

• Input layer

• Source nodes that connect the network to its environment

• Hidden layer

• Each hidden unit (neuron) represents a single radial basis function

• Has own center position and width (spread)

• Output layer

• Linear combination of hidden functions

f(x) = Σ_{j=1}^{m} w_j h_j(x)

h_j(x) = exp( −||x − c_j||² / r_j² )

where c_j is the center of a region and r_j is the width of the receptive field (a sketch of this computation follows below)
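As an illustration, here is a minimal sketch of this forward pass in Python, assuming Gaussian hidden units; the arrays `centers`, `widths`, and `weights` and the function name are chosen for illustration and do not come from the slides:

```python
import numpy as np

def rbf_forward(x, centers, widths, weights, bias=0.0):
    """Evaluate f(x) = bias + sum_j w_j * exp(-||x - c_j||^2 / r_j^2)."""
    # squared Euclidean distance from the input to every center
    d2 = np.sum((centers - x) ** 2, axis=1)
    # Gaussian hidden-unit activations h_j(x)
    h = np.exp(-d2 / widths ** 2)
    # linear combination in the output layer
    return bias + weights @ h

# tiny usage example with 3 hidden units in a 2-D input space
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
widths  = np.array([0.5, 0.5, 0.5])
weights = np.array([1.0, -0.5, 0.3])
print(rbf_forward(np.array([0.2, 0.1]), centers, widths, weights))
```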

• A Feedforward Network

• A linear model with a radial basis function

• Three layers:

• Input layer, hidden layer, output layer

• Each hidden unit

• Represents a single radial basis function

• Has own center position and width (spread)

• Parameter

• Centers, breadths (widths), weights

• Require

• Number of radial basis neurons

• Selection of the center of each neuron

• Selection of each breadth (width) parameter

• Decided by the designer

• Maximum number of neurons = number of input samples

• Minimum number of neurons is determined experimentally

• More neurons

• More complex, but smaller tolerance

• Spread: the selectivity of the neuron

• Two Levels of Learning

• Center and spread learning (or determination)

• Output layer weights learning

• Fixed Center Selection

• Self-organizing Center Selection

• Supervised Selection of Centers with Weights

• Make the number of parameters as small as possible

• Principles of dimensionality

• Fixed RBFs of the hidden units

• The locations of the centers may be chosen randomly from the training data set.

• We can use different values of centers and widths for each radial basis function -> experimentation with training data is needed.

• Only the output layer weights need to be learned

• Obtain the output layer weights by the pseudo-inverse method (a sketch follows after this list)

• Main problem: requires a large training set for a satisfactory level of performance
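A minimal sketch of the fixed-center strategy above, assuming randomly chosen centers, a common width r, and NumPy's pseudo-inverse for the output weights; the names and toy data are illustrative:

```python
import numpy as np

def fit_output_weights(X, y, centers, r):
    """Fixed-center RBFN: solve for the output weights with the pseudo-inverse."""
    # design matrix G[i, j] = exp(-||x_i - c_j||^2 / r^2)
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    G = np.exp(-d2 / r ** 2)
    # least-squares solution w = G^+ y
    return np.linalg.pinv(G) @ y

# usage: centers picked at random from the training inputs
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
centers = X[rng.choice(len(X), size=10, replace=False)]
w = fit_output_weights(X, y, centers, r=1.0)
```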

• Self-organized learning of centers by means of clustering

• Clustering on the Hidden Layer

• K-means clustering

• Initialization

• Sampling

• Similarity matching

• Updating

• Continuation

• Width (spread): select the average distance between the center and the c closest points in the cluster (e.g. c = 5)

• Supervised learning on the output Layer

• Estimate the connection weights w by the iterative gradient descent method based on least squares (the whole hybrid procedure is sketched below)
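A rough sketch of this hybrid strategy, assuming scikit-learn's KMeans for the clustering step, the c = 5 width heuristic from the slide, and simple LMS updates for the output weights; function and parameter names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def hybrid_rbfn_train(X, y, m=10, c=5, lr=0.05, epochs=100, seed=0):
    """Hybrid learning: k-means centers, heuristic widths, LMS output weights."""
    km = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X)
    centers = km.cluster_centers_
    # width of each unit: average distance from its center to the c closest
    # points assigned to that cluster (c = 5 as on the slide)
    widths = np.empty(m)
    for j in range(m):
        pts = X[km.labels_ == j]
        d = np.sort(np.linalg.norm(pts - centers[j], axis=1))
        widths[j] = d[:c].mean() if len(d) else 1.0
    widths = np.maximum(widths, 1e-3)

    def hidden(x):
        return np.exp(-np.sum((centers - x) ** 2, axis=1) / widths ** 2)

    # LMS (error-correction) learning of the output weights
    w = np.zeros(m)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            h = hidden(xi)
            w += lr * (yi - w @ h) * h
    return centers, widths, w
```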

• All free parameters are changed by supervised learning process

• The centers are selected together with the weight learning

• Error-correction learning using least mean square (LMS) algorithm

• Training for centers and spreads is very slow

• Linear weights (output layer)

• Positions of centers (hidden layer)

• Spreads of centers (hidden layer)

• RBF: Local network

• Only inputs near a receptive field produce an activation

• Can give “don’t know” output

• MLP: Global network

• All inputs cause an output

• Introduction

• Model

• Training

• Support Vector Machine

• Introduction

• VC Dimension, Structural Risk Minimization

• Linear Support Vector Machine

• Nonlinear Support Vector Machine

• Conclusion

• Objective

• Find an optimal hyperplane to:

• Classify as many data points correctly as possible

• Separate the points of the two classes as far apart as possible

• Approach

• Formulate a constrained optimization problem

• Solve it using constrained quadratic programming (constrained QP)

• Theorem

• Structural Risk Minimization

Key Idea: Transform to Higher Dimensional Space

[Figure: several separating hyperplanes and the optimal hyperplane with the maximal margin]

Find the Optimal Hyperplane

• Given

• A set of data points, each belonging to one of two classes

• SVM: Finds the Optimal Hyperplane

• Minimizes the risk of misclassifying the training samples and unseen test samples

• Maximizes the distance of either class from the hyperplane

• Introduction

• VC Dimension, Structural Risk Minimization

• Linear Support Vector Machine

• Nonlinear Support Vector Machine

• Conclusion

• Minimize the Expected Risk

• Minimize h: the VC dimension

• Minimize the empirical risk

[Figure: classification error versus VC dimension h; the empirical risk decreases as h grows while the confidence interval increases, giving underfitting at small h and overfitting at large h]

VC Dimension and Empirical Risk

• Empirical risk is a decreasing function of the VC dimension

• Need a principled method for the minimization

• Why Structural Risk Minimization (SRM)

• It is not enough to minimize the empirical risk

• Need to overcome the problem of choosing an appropriate VC dimension

• SRM Principle

• To minimize the expected risk, both terms on the right-hand side of the VC bound should be small (see the bound below)

• Minimize the empirical risk and the VC confidence simultaneously

• SRM picks a trade-off between VC dimension and empirical risk
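For reference, the VC bound referred to above, in its common form (h is the VC dimension, n the number of samples, and η the confidence parameter; the bound holds with probability at least 1 − η):

```latex
R(g) \;\le\; R_n(g) \;+\; \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}
```

SRM then chooses, among nested subsets of functions of increasing VC dimension, the subset for which this upper bound is smallest.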

• Introduction

• VC Dimension, Structural Risk Minimization

• Linear Support Vector Machine

• Nonlinear Support Vector Machine

• Performance and Application

• Conclusion

• If the set S is linearly separable, then

• The same as

w is normal to the hyperplane; the perpendicular distance from the hyperplane to the origin is |b| / ||w||, i.e. inversely proportional to ||w||

• Idea

• Use a transformation (x) from input space to higher dimensional space

• Find the separating hyperplane, make the inverse transformation

• Kernel: dot product in a Hilbert space (common examples are sketched below)

• Mercer’s Condition

• Polynomial Kernels

• Neural Network Like Kernel
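A minimal sketch of the kernels mentioned above (polynomial, Gaussian/RBF, and the neural-network-like sigmoid kernel); the parameter values are illustrative, and the sigmoid kernel satisfies Mercer's condition only for some parameter choices:

```python
import numpy as np

def poly_kernel(x, z, d=3, c=1.0):
    """Polynomial kernel K(x, z) = (x.z + c)^d."""
    return (np.dot(x, z) + c) ** d

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, kappa=1.0, theta=-1.0):
    """Neural-network-like kernel K(x, z) = tanh(kappa * x.z + theta)."""
    return np.tanh(kappa * np.dot(x, z) + theta)
```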

• Efficient training algorithm (vs. multi-layer NN)

• Represent complex and nonlinear functions (vs. single-layer NN)

• Always find a global minimum

• Training cost is usually cubic in the number of training examples

• Large training set is a problem

• A class of single hidden layer feedforward networks

• Activation functions for hidden units are defined as radially symmetric basis functions such as the Gaussian function.

• Faster convergence

• Smaller extrapolation errors

• Higher reliability

• Multiquadric RBF and Gaussian RBF

• Neural Network

• Linear Models

• The Perceptron

• Multi-Layer Perceptrons (Feedforward Neural Networks)

• One hidden layer of basis functions, or neurons

• At the input of each neuron, the distance between the neuron center and the input vector is calculated.

• The output of the neuron is then formed by applying the basis function to this distance

• The RBF network output is formed by a weighted sum of the neuron outputs plus a unity bias

• Input space X

• output space Y

• For classification Y={+1, -1}

• Assume there is an (unknown) probability distribution P on X × Y.

• Data D = {(X_i, Y_i) | i = 1, …, n} is observed independently and identically according to P.

• Goal: construct g : X → Y that predicts Y from X.

• Expected Risk

R(g) = P(g(X) ≠ Y) = E[1_{g(X) ≠ Y}]

• P is unknown, so we cannot compute this.

• Empirical Risk

R_n(g) = (1/n) Σ_i 1_{g(X_i) ≠ Y_i}

• Dependent upon the data set.

• We want to minimize the expected risk, which is unknown; in practice we work with the empirical risk (a sketch of computing it follows below)
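A small sketch of computing the empirical risk above for a given classifier g on observed data; names are illustrative:

```python
import numpy as np

def empirical_risk(g, X, Y):
    """R_n(g) = (1/n) * sum_i 1[g(X_i) != Y_i], i.e. the average 0/1 loss."""
    predictions = np.array([g(x) for x in X])
    return np.mean(predictions != np.array(Y))
```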

• Suppose we have n data points to be labeled into two classes; the number of labelings G can realize satisfies S_G(n) ≤ 2^n

• When S_G(n) = 2^n, G can generate any classification on (some set of) n points. In other words, G shatters n points.

• The VC dimension (after Vapnik-Chervonenkis) is defined as the largest n such that S_G(n) = 2^n.

• It is the simplest measure of classifier complexity/capacity.

• VC dim=n doesn’t mean that G can shatter every data set of size n.

• VC dim=n does mean that G can shatter some data set of size n.

• Is VC dimension == number of parameters?

• In R^d, the VC dimension of {all hyperplanes} is d+1.

• For any d+1 points in general position we can find hyperplanes shattering them.

• For d+2 points, hyperplanes cannot shatter them.

• Hyperplanes are given by a_1 x_1 + … + a_d x_d + a_0 = 0, i.e. d+1 parameters, matching the VC dimension here.

• Is VC dimension == number of parameters?

• The Answer is No: An Example

• Let G = {sgn(sin(tx)) | t ∈ R}, X = R.

• We can prove VC-dim(G) = ∞, even though G is a one-parameter family: for any n, the points x_i = 10^{-i}, i = 1, …, n, can be shattered by a suitable choice of t.

• The VC bound can be predictive even when loose

• Introduce “structure”

• Dividing the entire class of functions into nested subsets

• For each subset, compute h or a bound of h

• Finding that subset of functions which minimizes the bound on the actual risk

• Picking a trade-off between VC dimension and empirical risk

• Choose y=1 for positive labels and y=-1 for negative labels

• Problem: minimize

• Dual formulation: maximize L as a function of i with the constrains:

• Transformed problem: maximize

• Karush-Kuhn-Tucker conditions as extremum:

• Separating surface
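For completeness, here is the standard hard-margin formulation that these bullets refer to, written in the usual notation (the slides' own formulas were lost in transcription):

```latex
% Primal problem
\min_{\mathbf{w},\, b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2
\quad \text{s.t.} \quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,\ \ i = 1, \dots, n

% Dual problem (maximize over the Lagrange multipliers \alpha_i)
\max_{\boldsymbol{\alpha}}\ L_D =
\sum_i \alpha_i \;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j
\quad \text{s.t.} \quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0

% Resulting separating surface
f(\mathbf{x}) = \operatorname{sgn}\!\Big( \sum_i \alpha_i y_i\, \mathbf{x}_i \cdot \mathbf{x} + b \Big)
```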

• Minimize

• Dual formulation: maximize L as a function of i

• Karush-Kuhn-Tuker conditions as extremum:

• Final optimization problem: miximize L as function of i

• Separating surface
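As a closing sketch, a nonlinear SVM of the kind derived above can be trained with scikit-learn's SVC; the library's solver handles the quadratic program, and the kernel choice and parameter values here are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# toy two-class data that is not linearly separable in the input space
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

# Gaussian (RBF) kernel SVM: the kernel plays the role of the dot product
# in the higher-dimensional feature space
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print("support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))
```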