Regularization

Instructor: Dr. Saeed Shiry

- The hypothesis space H is the space of functions that we allow our algorithm to provide,
- that is, the space the algorithm is allowed to search.
- It is often important to choose the hypothesis space as a function of the amount of data available.

- The basic goal of supervised learning:
- to use the training set S to “learn” a function $f_S$
- that, for a new x value $x_{new}$, predicts the associated value of y: $y_{pred} = f_S(x_{new})$

- Regression: if y is a real-valued random variable.
- Pattern classification: if y takes values from an unordered finite set.
- In two-class pattern classification problems, we assign one class a y value of 1, and the other class a y value of −1.

- In order to measure the goodness of our function, we need a loss function V.
- In general, we let V(f, z) = V(f(x), y) denote
- the price we pay when we see x and guess that the associated y value is f(x) when it is actually y.

- The most common loss function is square loss or L2 loss:
- V(f (x), y) = (f (x) − y)^2

- L1 loss:
- V(f (x), y) = |f (x) − y|

- Vapnik’s more general ε-insensitive loss:
- $V(f(x), y) = (|f(x) - y| - \epsilon)_+ = \max(|f(x) - y| - \epsilon,\ 0)$
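
The three losses can be compared side by side with a minimal Python sketch. The ε-insensitive loss is taken here in its usual form max(|f(x) − y| − ε, 0); the function names and the default `eps` value are assumptions made for illustration.

```python
def square_loss(fx, y):
    """L2 loss: (f(x) - y)^2."""
    return (fx - y) ** 2


def absolute_loss(fx, y):
    """L1 loss: |f(x) - y|."""
    return abs(fx - y)


def eps_insensitive_loss(fx, y, eps=0.1):
    """Vapnik's loss: errors smaller than eps cost nothing (eps is an assumed default)."""
    return max(abs(fx - y) - eps, 0.0)


# A small, a medium and a large prediction error for the same prediction f(x) = 1.0.
for fx, y in [(1.0, 1.05), (1.0, 1.5), (1.0, 3.0)]:
    print(fx, y, square_loss(fx, y), absolute_loss(fx, y), eps_insensitive_loss(fx, y))
```

The square loss punishes the large error much more heavily than the L1 loss does, while the ε-insensitive loss ignores the smallest error entirely.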

- In order to choose the best available approximation to the supervisor's response, one measures the loss or discrepancy $L(y, f(x, \alpha))$ between the response y of the supervisor to a given input x and the response $f(x, \alpha)$ provided by the learning machine. Consider the expected value of the loss, given by the risk functional
- $R(\alpha) = \int L(y, f(x, \alpha)) \, dP(x, y)$ (1.2)
- The goal is to find the function $f(x, \alpha_0)$ which minimizes the risk functional $R(\alpha)$ over the class of functions $f(x, \alpha)$, $\alpha \in \Lambda$, in the situation where the joint probability distribution P(x, y) is unknown and the only available information is contained in the training set.

- Pattern Recognition
- Let the supervisor's output y take only two values, $y \in \{0, 1\}$, and let $f(x, \alpha)$, $\alpha \in \Lambda$, be a set of indicator functions (functions which take only two values: zero and one).
- Consider the following loss function:
- $L(y, f(x, \alpha)) = 0$ if $y = f(x, \alpha)$, and $L(y, f(x, \alpha)) = 1$ if $y \neq f(x, \alpha)$
- For this loss function, the functional (1.2) determines the probability of different answers given by the supervisor and by the indicator function $f(x, \alpha)$. We call the case of different answers a classification error.
- The problem, therefore, is to find a function that minimizes the probability of classification error when the probability measure P(x, y) is unknown, but the data are given.

- Regression Estimation
- Let the supervisor's answer y be a real value, and let $f(x, \alpha)$, $\alpha \in \Lambda$, be a set of real functions that contains the regression function
- $f(x, \alpha_0) = \int y \, dP(y \mid x)$
- It is known that the regression function is the one that minimizes the functional (1.2) with the following loss function:
- $L(y, f(x, \alpha)) = (y - f(x, \alpha))^2$
- Thus the problem of regression estimation is the problem of minimizing the risk functional (1.2) with the above loss function in the situation where the probability measure P(x, y) is unknown but the data are given.

- Density Estimation (Fisher-Wald Setting)
- Finally, consider the problem of density estimation from the set of densities $p(x, \alpha)$, $\alpha \in \Lambda$. For this problem we consider the following loss function:
- $L(p(x, \alpha)) = -\log p(x, \alpha)$
- It is known that the desired density minimizes the risk functional (1.2) with the above loss function.
- Thus, again, to estimate the density from the data one has to minimize the risk functional under the condition that the corresponding probability measure P(x) is unknown, but i.i.d. data $x_1, \ldots, x_n$ are given.

- Given a function f, a loss function V, and a probability distribution μ over Z, the expected or true error of f is:
- $I[f] = \mathbb{E}_z V(f, z) = \int_Z V(f, z) \, d\mu(z)$
- This is the expected loss on a new example drawn at random from μ.
- We would like to make I[f] small, but in general we do not know μ.

- Given a function f, a loss function V, and a training set S consisting of n data points $z_1, \ldots, z_n$, the empirical error of f is:
- $I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f, z_i)$
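
As a minimal sketch, the empirical error is just the average loss of f over the training set, (1/n) Σ V(f(x_i), y_i). The toy training set `S`, the candidate function `f` and the helper name `empirical_error` below are assumptions made for illustration.

```python
def empirical_error(f, loss, S):
    """Average loss of f over the training set S = [(x_1, y_1), ..., (x_n, y_n)]."""
    return sum(loss(f(x), y) for x, y in S) / len(S)


def square_loss(fx, y):
    return (fx - y) ** 2


def f(x):
    """Candidate function: the identity map (chosen only for the example)."""
    return x


# Hypothetical training set of n = 3 points.
S = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2)]

print(empirical_error(f, square_loss, S))
```

Unlike the expected error, this quantity can be computed without knowing the distribution μ, which is why learning algorithms work with it.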

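- Let $\{X_n\}$ be a sequence of bounded random variables. We say that $\lim_{n \to \infty} X_n = X$ in probability if, for every $\epsilon > 0$, $\lim_{n \to \infty} P\{|X_n - X| \geq \epsilon\} = 0$.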

In addition to the key property of generalization, a “good” learning algorithm should also be stable:

- $f_S$ should depend continuously on the training set S.
- In particular, changing one of the training points should affect the solution less and less as n goes to infinity.

A problem is well-posed if its solution:

- exists
- is unique
- depends continuously on the data (i.e., it is stable)

A problem is ill-posed if it is not well-posed.

- Here, well-posedness is mainly used to mean stability of the solution.

- In the early 1900s Hadamard observed that under some (very general) circumstances the problem of solving (linear) operator equations
- $Af = F$
- (finding $f \in \mathcal{F}$ that satisfies the equality) is ill-posed: even if there exists a unique solution to this equation,
- a small deviation on the right-hand side of this equation ($F_\delta$ instead of F, where $\|F - F_\delta\| < \delta$ is arbitrarily small) can cause large deviations in the solutions (it can happen that $\|f_\delta - f\|$ is large).
- In this case, if the right-hand side F of the equation is not exact (e.g., it equals $F_\delta$, where $F_\delta$ differs from F by some level δ of noise), the functions $f_\delta$ that minimize the functional
- $R(f) = \|Af - F_\delta\|^2$
- do not guarantee a good approximation to the desired solution even if δ tends to zero.
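
The instability can be seen numerically on a small linear system Af = F. The sketch below is an illustration under stated assumptions: the Hilbert matrix stands in for an ill-conditioned operator, the true solution is a vector of ones, and the perturbation is deliberately placed along the direction that the inverse amplifies most. None of these choices come from the slides.

```python
import numpy as np

n = 8
# Hilbert matrix A[i, j] = 1 / (i + j + 1): a standard ill-conditioned example
# (an assumption for illustration, not an operator taken from the slides).
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])

f_true = np.ones(n)   # the "desired" solution
F = A @ f_true        # exact right-hand side

# Perturb F by delta along the left singular vector belonging to the smallest
# singular value, i.e. the direction that solving the system amplifies the most.
U, s, Vt = np.linalg.svd(A)
delta = 1e-6
F_delta = F + delta * U[:, -1]

f_delta = np.linalg.solve(A, F_delta)

rhs_error = np.linalg.norm(F_delta - F)
sol_error = np.linalg.norm(f_delta - f_true)
print(f"condition number: {np.linalg.cond(A):.2e}")
print(f"||F - F_delta||  : {rhs_error:.2e}")
print(f"||f_delta - f||  : {sol_error:.2e}")
```

A unique solution exists for every right-hand side here, yet a perturbation of size 1e-6 in the data moves the solution by many orders of magnitude more.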

- Hadamard thought that ill-posed problems are a pure mathematical phenomenon and that all real-life problems are "well-posed."
- However, in the second half of the twentieth century a number of very important real-life problems were found to be ill-posed.
- In particular, one of the main problems of statistics, estimating the density function from the data, is ill-posed.

- Regularization theory was one of the first signs of the existence of intelligent inference:
- In the middle of the 1960s it was discovered that if instead of the functional R(f) one minimizes another so-called regularized functional
- $R^*(f) = \|Af - F_\delta\|^2 + \gamma(\delta)\,\Omega(f)$
- where Ω(f) is some functional (that belongs to a special type of functionals) and γ(δ) is an appropriately chosen constant (depending on the level of noise), then one obtains a sequence of solutions that converges to the desired one as δ tends to zero.
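
A finite-dimensional sketch of this idea follows, under assumptions that are not in the slides: the operator is a Hilbert matrix, the stabilizer is Ω(f) = ||f||², and the regularization constant is simply set equal to the noise level δ. The helper name `tikhonov_solve` is hypothetical.

```python
import numpy as np

n = 8
# Ill-conditioned test operator (Hilbert matrix, assumed for illustration).
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
f_true = np.ones(n)
F = A @ f_true

# Noisy right-hand side: a perturbation of size delta in the least stable direction.
U, s, Vt = np.linalg.svd(A)
delta = 1e-6
F_delta = F + delta * U[:, -1]


def tikhonov_solve(A, rhs, gamma):
    """Minimize ||A f - rhs||^2 + gamma * ||f||^2 via the normal equations."""
    k = A.shape[1]
    return np.linalg.solve(A.T @ A + gamma * np.eye(k), A.T @ rhs)


gamma = delta  # illustrative choice of the constant; other schedules are possible

f_naive = np.linalg.solve(A, F_delta)       # minimizes ||A f - F_delta||^2 only
f_reg = tikhonov_solve(A, F_delta, gamma)   # minimizes the regularized functional

naive_error = np.linalg.norm(f_naive - f_true)
reg_error = np.linalg.norm(f_reg - f_true)
print(f"error without regularization: {naive_error:.2e}")
print(f"error with regularization   : {reg_error:.2e}")
```

The regularized solution is biased, since the penalty suppresses the components of f that the data barely determine, but it no longer blows up under a small perturbation of the right-hand side.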

- Given a training set S and a function space H, empirical risk minimization (Vapnik introduced the term) is the class of algorithms that look at S and select $f_S$ as
- $f_S = \arg\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} V(f, z_i)$
- For example, linear regression is ERM when V(z) = (f(x) − y)^2 and H is the space of linear functions f = ax.
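
For this linear regression example the ERM solution has a closed form: setting the derivative of (1/n) Σ (a x_i − y_i)² with respect to a to zero gives a = Σ x_i y_i / Σ x_i². A minimal sketch, with a made-up training set and hypothetical function names:

```python
def erm_linear(S):
    """ERM over H = {f(x) = a * x} with square loss; returns the minimizing slope a."""
    sxy = sum(x * y for x, y in S)
    sxx = sum(x * x for x, _ in S)
    return sxy / sxx


def empirical_error(a, S):
    """Empirical square-loss error of f(x) = a * x on S."""
    return sum((a * x - y) ** 2 for x, y in S) / len(S)


# Hypothetical training set, roughly following y = 2x.
S = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

a_hat = erm_linear(S)
print(a_hat, empirical_error(a_hat, S))
```

Any other slope in H gives a larger empirical error on S, which is exactly what "select f_S as the minimizer" means here.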

- In order to minimize the risk functional for an unknown probability measure P(z), the following induction principle is usually employed.
- The expected risk functional $R(\alpha)$ is replaced by the empirical risk functional
- $R_{emp}(\alpha) = \frac{1}{l} \sum_{i=1}^{l} Q(z_i, \alpha)$ (1.8)
- constructed on the basis of the training set $z_1, \ldots, z_l$.
- The principle is to approximate the function $Q(z, \alpha_0)$ which minimizes the risk by the function $Q(z, \alpha_l)$ which minimizes the empirical risk (1.8).
- This principle is called the Empirical Risk Minimization induction principle (ERM principle).

For ERM to represent a “good” class of learning algorithms, the solution should

- generalize
- exist, be unique and – especially – be stable (well-posedness).

Under which conditions does the ERM solution converge, with an increasing number of examples, to the true solution? In other words, what are the conditions for generalization of ERM?

- Since Tikhonov, it is well known that a generally ill-posed problem such as ERM can be guaranteed to be well-posed, and therefore stable, by an appropriate choice of H.
- For example, compactness of H guarantees stability.

- It seems intriguing that the classical conditions for consistency of ERM – thus quite a different property – consist of appropriately restricting H.

- We would like to have a hypothesis space that yields generalization. Loosely speaking, this would be an H for which the solution of ERM, say $f_S$, is such that $|I_S[f_S] - I[f_S]|$ converges to zero in probability as n increases.
- Note that the above requirement is NOT the law of large numbers; the law of large numbers is the requirement that, for a fixed f, $|I_S[f] - I[f]|$ converges to zero in probability as n increases.

- The theorem says that a proper choice of the hypothesis space H ensures generalization of ERM (and consistency, since for ERM generalization is necessary and sufficient for consistency and vice versa).
- A separate theorem also guarantees stability (defined in a specific way) of ERM.
- Thus, with the appropriate definition of stability, stability and generalization are equivalent for ERM.
- Other results characterize uGC (uniform Glivenko-Cantelli) classes in terms of measures of complexity or capacity of H (such as VC dimension).
- Thus the two desirable conditions for a learning algorithm, generalization and stability, are equivalent (and they correspond to the same constraints on H).

- Regularization is a method of improving the stability of solutions of ill-conditioned inverse problems.
- The basic idea in the treatment of ill-conditioned problems:
- use some a priori knowledge about solutions to disqualify meaningless ones.

- Such knowledge can be:
- some regularity condition on the solution, expressed as the existence of derivatives up to a certain order with bounds on the magnitudes of these derivatives;
- some localization condition, such as a bound on the support of the solution or on its behavior at infinity.

- Tikhonov’s regularization: penalizes undesired solutions by adding a term called a stabilizer.

- Generally speaking, any regularization method tries to analyze a related well-posed problem whose solution approximates the solution of the original ill-posed problem.
- The well-posedness is achieved by implementing one or more of the following basic ideas:
- restriction of the data;
- change of the space and/or topologies;
- modification of the operator itself;
- the concept of regularization operators; and
- well-posed stochastic extensions of ill-posed problems.

- Regularized cost function = empirical cost function + regularization parameter × regularizer function
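
One concrete instance of this recipe is ridge regression, sketched below under assumptions that are not in the slides: the empirical cost is the mean squared error of a linear model, the regularizer is the squared norm of the weights, and the data and the names `regularized_cost`, `ridge_fit` and `lam` are made up for illustration.

```python
import numpy as np


def regularized_cost(w, X, y, lam):
    """Empirical cost (mean squared error) + lam * regularizer (squared norm of w)."""
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * np.dot(w, w)


def ridge_fit(X, y, lam):
    """Minimizer of regularized_cost: solves (X^T X + n * lam * I) w = X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)


# Hypothetical data: three examples with two features each.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.1])

w_erm = ridge_fit(X, y, lam=0.0)   # plain empirical cost minimization
w_reg = ridge_fit(X, y, lam=0.5)   # regularized solution

print("lam = 0.0:", w_erm, regularized_cost(w_erm, X, y, 0.0))
print("lam = 0.5:", w_reg, regularized_cost(w_reg, X, y, 0.5))
```

Increasing the regularization parameter trades a larger empirical cost for a smaller weight norm; with `lam = 0.0` the function reduces to ordinary least squares.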

- Degradation model (image restoration)
- Here H denotes the degradation operator, not the hypothesis space. H is ill-conditioned, which makes the image restoration problem an ill-posed problem.
- The solution is not stable.

- Theory
- Proposed by Tikhonov in 1963
- Proposes the use of prior knowledge to regularize mappings

- Most common application: utilize the smoothness property:
- For an input-output mapping to be smooth, similar inputs must produce similar outputs.

As we will see in future classes:

- Tikhonov regularization ensures well-posedness, i.e., existence, uniqueness and especially stability (in a very strong form) of the solution
- Tikhonov regularization ensures generalization
- Tikhonov regularization is closely related to – but different from – Ivanov regularization, i.e., ERM on a hypothesis space H which is a ball in an RKHS.