## Efficient Training in high-dimensional weight space

Christoph Bunzmann, Robert Urbanczik, Michael Biehl

Theoretische Physik und Astrophysik / Computational Physics,
Julius-Maximilians-Universität Würzburg,
Am Hubland, D-97074 Würzburg, Germany
http://theorie.physik.uni-wuerzburg.de/~biehl

Wiskunde & Informatica / Intelligent Systems,
Rijksuniversiteit Groningen, Postbus 800, NL-9718 DD Groningen, The Netherlands
biehl@cs.rug.nl, www.cs.rug.nl/~biehl

Outline

- A model situation
  - layered neural networks
  - student-teacher scenario
- The dynamics of on-line learning
  - on-line gradient descent
  - delayed learning, plateau states
- Efficient training of multilayer networks
  - learning by Principal Component Analysis: idea, analysis, results
- Summary, outlook
  - selected further topics
  - prospective projects

Supervised learning: training based on example data, e.g. input/output pairs in

- classification tasks
- time series prediction
- regression problems

Learning from examples

The choice of adjustable parameters in adaptive information processing systems

- parameterizes a hypothesis, e.g. for an unknown classification or regression task
- is guided by the optimization of an appropriate objective or cost function, e.g. the performance with respect to the example data
- results in generalization ability, e.g. the successful classification of novel data

Theory of learning processes

- general performance bounds, independent of
  - the specific task
  - statistical properties of the data
  - details of the training procedure ...
- typical properties, e.g. of learning curves, in model scenarios defined by
  - network architecture
  - statistics of data, noise
  - learning algorithm

  aim: understanding/prediction of relevant phenomena, algorithm design; trade-off: general validity vs. applicability
- description of specific applications, e.g. hand-written digit recognition, with
  - a given real-world problem
  - a particular training scheme
  - a special set of example data ...

A two-layered network: the soft committee machine

- K hidden units with adaptive weight vectors w_j, fixed hidden-to-output weights
- input/output relation σ(ξ) = Σ_{j=1..K} g(w_j · ξ) with sigmoidal hidden unit activation, e.g. g(x) = erf(a x)
- SCM + adaptive thresholds: universal approximator
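The input/output relation of the soft committee machine can be sketched in a few lines; a minimal sketch assuming unit hidden-to-output weights and g(x) = erf(a x) (the function name `scm_output` and the gain parameter `a` are illustrative):

```python
import numpy as np
from math import erf

def scm_output(weights, xi, a=1.0):
    """Soft committee machine output: sum of K sigmoidal hidden units
    g(x) = erf(a*x) with fixed hidden-to-output weights (all +1).
    weights: (K, N) array of adaptive weight vectors; xi: input of dimension N."""
    fields = weights @ xi                    # local fields x_j = w_j . xi
    return sum(erf(a * x) for x in fields)   # fixed second layer: plain sum
```

Since g is odd, the output satisfies σ(−ξ) = −σ(ξ), and the hidden units enter fully symmetrically.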

Student-teacher scenario

An adaptive student network with K hidden units is trained on a rule defined by a teacher network with M hidden units. Relevant cases with interesting effects:

- K < M: unlearnable rule
- K > M: over-sophisticated student
- K = M: ideal situation, perfectly matching complexity

- examples for the unknown function or rule: input/output pairs ID = {ξ^μ, τ(ξ^μ)}, μ = 1, ..., P, with reliable teacher outputs
- training based on the performance w.r.t. the example data, e.g. a quadratic error E
- evaluation after training: the generalization error ε_g, the expected error for a novel input w.r.t. the density of inputs / a set of test inputs

Statistical Physics approach

- consider large systems in the thermodynamic limit N → ∞ (K, M « N)
  - N: dimension of the input data; number of adjustable parameters ~ K·N
- perform averages
  - over the stochastic training process
  - over randomized example data (quenched disorder)
- (technically) simplest case: reliable teacher outputs; isotropic input density with independent components of zero mean / unit variance
- description in terms of macroscopic quantities, e.g. overlap parameters as student/teacher similarity measures
- evaluate typical properties, e.g. the learning curve ε_g(α)
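The macroscopic overlap parameters can be read off directly from the weight matrices; a sketch assuming row-wise weight vectors and a 1/N normalization (the normalization convention is an assumption here):

```python
import numpy as np

def overlaps(student, teacher):
    """Macroscopic order parameters: student-student overlaps
    Q_ij = w_i . w_j / N and student-teacher overlaps R_jm = w_j . B_m / N
    (1/N normalization assumed). student: (K, N); teacher: (M, N)."""
    N = student.shape[1]
    Q = student @ student.T / N   # (K, K), symmetric
    R = student @ teacher.T / N   # (K, M), similarity to the teacher
    return Q, R
```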

The generalization error

- local fields: by the Central Limit Theorem, the student and teacher fields x_j = w_j · ξ and y_m = B_m · ξ become correlated Gaussians for large N
- their first and second moments are given by the order parameters Q_ij, R_jm, T_mn
- microscopic averages over the K·N weight components reduce to macroscopic integrals over the ½(K²+K) + K·M order parameters
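For finite N, the generalization error can also be estimated directly by averaging the squared student-teacher deviation over random isotropic inputs; a Monte Carlo sketch (the ½ prefactor in the error and the helper name `eps_g` are assumptions):

```python
import numpy as np
from math import erf

def eps_g(student, teacher, n_test=10_000, a=1.0, seed=0):
    """Monte Carlo estimate of the generalization error: expected squared
    deviation between student and teacher soft committee outputs over
    inputs with independent zero-mean, unit-variance components."""
    rng = np.random.default_rng(seed)
    N = student.shape[1]
    total = 0.0
    for _ in range(n_test):
        xi = rng.standard_normal(N)
        s = sum(erf(a * x) for x in student @ xi)   # student output
        t = sum(erf(a * y) for y in teacher @ xi)   # teacher output
        total += 0.5 * (s - t) ** 2
    return total / n_test
```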

Dynamics of on-line gradient descent

- on-line learning step: presentation of single examples; the weights after presentation of μ examples follow a recursion, and the number of examples serves as a discrete learning time
- practical advantages: no explicit storage of all examples ID required, little computational effort per example
- mathematical ease: the typical dynamics of learning can be evaluated on average over a randomized sequence of examples; for large N the average over the latest example is Gaussian, and the mean recursions become coupled ODEs for {R_jm, Q_ij} in the continuous time α = P/(KN), i.e. the number of examples per weight
- the learning curve ε_g(α) shows a fast initial decrease
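A single on-line step of gradient descent on the quadratic error e = ½(σ − τ)² can be sketched as follows, with the η/N learning-rate scaling used above and g'(x) = (2a/√π) exp(−a²x²) for g(x) = erf(a x); the function name and default η are illustrative:

```python
import numpy as np
from math import erf, exp, sqrt, pi

def online_gd_step(student, xi, tau, eta=1.0, a=1.0):
    """Update all K student weight vectors with the latest example (xi, tau)
    by gradient descent on e = 0.5 * (sigma - tau)^2. Modifies `student`."""
    N = student.shape[1]
    fields = student @ xi                      # local fields x_j = w_j . xi
    sigma = sum(erf(a * x) for x in fields)    # current student output
    delta = sigma - tau                        # output error on this example
    for j, x in enumerate(fields):
        gprime = (2.0 * a / sqrt(pi)) * exp(-(a * x) ** 2)   # g'(x_j)
        student[j] -= (eta / N) * delta * gprime * xi
    return student
```

Only the latest example enters, so no storage of the full data set ID is needed.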

[Figure: learning curve ε_g as a function of α = P/(KN), with ε_g between 0 and 0.05 and α between 0 and 300; after the fast initial decrease the system approaches perfect generalization]

Biehl, Riegler, Wöhler, J. Phys. A (1996) 4769

Quasi-stationary plateau states with all student weights unspecialized dominate the learning process.

[Figure: learning curve for K = M = 2, η = 1.5, R_ij(0) ≈ 0, for α between 0 and 300, showing an extended plateau before the final decay]

Permutation symmetry of the branches in the student network.

[Figure: evolution of the overlap parameters for K = M = 2, T_mn = δ_mn, η = 1, R_ij(0) ≈ 0; the diagonal overlaps R_11, R_22 and Q_11, Q_22 separate from the off-diagonal R_12, R_21 and Q_12 = Q_21 only after the plateau (values between 0.0 and 1.0)]

Assuming randomized initialization of the weight vectors, many examples are needed for successful learning: hidden unit specialization requires a priori knowledge in the form of initial macroscopic overlaps.

Is the plateau a property of the learning scenario, i.e. a necessary phase of training, or an artifact of the training prescription?

Plateau length: exact results are available if all order parameters self-average.

S.J. Hanson, in Y. Chauvin & D. Rumelhart (eds.),
Backpropagation: Theory, Architectures, and Applications

Training by Principal Component Analysis

Problem: delayed specialization in the (K·N)-dimensional weight space.

Idea: two-stage training
- A) identification (approximation) of the subspace spanned by the teacher weight vectors
- B) actual training within this low-dimensional space

Example: soft committee teacher (K = M), isotropic input density. Consider a modified correlation matrix C_P, estimated empirically from the limited data set, with eigenvalues and eigenvectors structured as:
- 1 eigenvector with the largest eigenvalue
- (K−1) eigenvectors with the smallest eigenvalues
- (N−K) bulk eigenvectors

A) determine the (K−1) eigenvectors with the smallest eigenvalues and the eigenvector with the largest eigenvalue

B) specialization in the K-dimensional space spanned by these eigenvectors:
- representation of the student weights by K² « K·N coefficients
- optimization of the coefficients w.r.t. E (# of examples P = α·N·K » K²)

Note: the required memory ~ N² does not increase with P.
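Stage A) can be sketched as below; the output-weighting function Φ(τ) = τ² is an illustrative stand-in, since the slide does not reproduce the exact definition of the modified correlation matrix:

```python
import numpy as np

def teacher_subspace(xis, taus, K):
    """Stage A) sketch: accumulate an output-weighted correlation matrix
    (memory ~ N^2, independent of the number of examples P) and keep the
    eigenvector of the largest eigenvalue plus the (K-1) eigenvectors of
    the smallest eigenvalues, which approximate the teacher space."""
    P, N = xis.shape
    C = np.zeros((N, N))
    for xi, tau in zip(xis, taus):
        C += (tau ** 2) * np.outer(xi, xi)   # illustrative weighting Phi(tau) = tau^2
    C /= P
    vals, vecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    basis = np.column_stack([vecs[:, :K - 1], vecs[:, -1:]])
    return basis                             # (N, K) orthonormal basis
```

Stage B) then optimizes only the K² coefficients of the student weights in this basis.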

Typical properties, given a random set of examples:

- A) the typical overlap of the identified eigenvectors with the teacher weights measures the success of the teacher space identification
- B) given this overlap, determine the optimal ε_g achievable by a linear combination of the identified eigenvectors

Analysis: formal partition sum, replica trick, saddle point integration in the thermodynamic limit, quenched free energy.

Specialization sets in at a critical number of examples per weight:
α_c(K=2) = 4.49, α_c(K=3) = 8.70;
large-K theory: α_c(K) ~ 2.94 K (N-independent!)

[Figure: K = 3, Statistical Physics theory and Monte Carlo simulations, N = 400 and N = 1600 (•); A) success of the teacher space identification, B) optimal ε_g. The transition from the unspecialized to the specialized phase occurs at α_c, independent of N: specialization without a priori knowledge.]

Bunzmann, Biehl, Urbanczik, Phys. Rev. Lett. 86, 2166 (2001)

Potential application: model selection

- spectrum of the matrix C_P for a teacher with M = 7 hidden units
- the algorithm requires no prior knowledge of M
- PCA hints at the required model complexity: K − 1 = 6 smallest eigenvalues split off from the bulk
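The model-selection idea can be phrased as a small heuristic: count how many eigenvalues of C_P split off below the bulk of the spectrum (the threshold `split` and the function name are illustrative assumptions, not from the slides):

```python
import numpy as np

def estimate_hidden_units(eigenvalues, split=0.5):
    """Heuristic sketch: with (K-1) eigenvalues split off well below the
    bulk of the spectrum of C_P, return the suggested number of hidden
    units K. `split` sets how far below the bulk median counts as split off."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))
    bulk = np.median(vals)                    # dominated by the (N-K) bulk values
    n_split = int(np.sum(vals < split * bulk))
    return n_split + 1                        # (K-1) split-off eigenvalues -> K
```

For the M = 7 teacher of the slide, six split-off eigenvalues would suggest K = 7 hidden units.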

Summary

- model situation, supervised learning
  - the soft committee machine
  - student-teacher scenario
  - randomized training data
- statistical physics inspired approach
  - large systems
  - thermal (training) and disorder (data) averages
  - typical, macroscopic properties
- dynamics of on-line gradient descent
  - delayed learning due to symmetry breaking
  - necessary specialization processes
- efficient training
  - PCA-based learning algorithm reduces the dimensionality of the problem
  - specialization without a priori knowledge

Selected further topics

- perceptron training (single layer): optimal stability classification, dynamics of learning
- unsupervised learning: principal component analysis; competitive learning, clustered data
- non-trivial statistics of data: learning from noisy data, time-dependent rules
- dynamics of on-line training: perceptron, unsupervised learning, two-layered feed-forward networks
- specialization processes: discontinuous learning curves, delayed learning, plateau states
- algorithm design: variational method, optimal algorithms, construction algorithm; variational optimization, e.g. an alternative correlation matrix

Selected Prospective Projects

- application relevant architectures and algorithms: Local Linear Model Trees, Learning Vector Quantization, Support Vector Machines
- unsupervised learning: density estimation, feature detection, clustering, (Learning) Vector Quantization, compression, self-organizing maps
- model selection: estimating the complexity of a rule or mixture density
