Presentation Transcript
slide1

Efficient Training in high-dimensional weight space

Christoph Bunzmann, Robert Urbanczik, Michael Biehl

Theoretische Physik und Astrophysik

Computational Physics

Julius-Maximilians-Universität Würzburg

Am Hubland, D-97074 Würzburg, Germany

http://theorie.physik.uni-wuerzburg.de/~biehl

Wiskunde & Informatica

Intelligent Systems

Rijksuniversiteit Groningen, Postbus 800,

NL-9718 DD Groningen, The Netherlands

biehl@cs.rug.nl, www.cs.rug.nl/~biehl

slide2

Learning from examples

A model situation

layered neural networks

student teacher scenario

The dynamics of on-line learning

on-line gradient descent

delayed learning, plateau states

Efficient training of multilayer networks

learning by Principal Component Analysis

idea, analysis, results

Summary, Outlook

selected further topics

prospective projects

Efficient training in high-dimensional weight space

slide3

based on example data, e.g. input/output pairs in

classification tasks

time series prediction

regression problems

supervised learning

Learning from examples

choice of adjustable parameters in

adaptive information processing systems

  • parameterizes a hypothesis
  • e.g. for an unknown classification or regression task
  • guided by the optimization of an appropriate
  • objective or cost function
  • e.g. performance with respect to the example data
  • results in generalization ability
  • e.g. the successful classification of novel data
slide4

· general results

e.g. performance bounds

independent of

- specific task

- statistical properties of data

- details of training procedure ...

· typical properties of model scenarios, e.g. learning curves

- network architecture

- statistics of data, noise

- learning algorithm

understanding/prediction of relevant phenomena, algorithm design

trade-off: general validity vs. applicability

· description of specific applications, e.g. handwritten digit recognition

- given real world problem

- particular training scheme

- special set of example data ...

Theory of learning processes

slide5

input data

adaptive weights

hidden units

( fixed hidden to output weights )

input/output relation: sigmoidal hidden activation, e.g. g(x) = erf(a x)

A two-layered network: the soft committee machine

SCM + adaptive thresholds: universal approximator
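One common way to write the SCM input/output relation (a sketch; the absence of a normalizing prefactor and of thresholds follows the simplest convention, the slide itself only names the relation and the erf activation):

\sigma(\xi) = \sum_{i=1}^{K} g\!\left( \mathbf{w}_i \cdot \xi \right), \qquad g(x) = \mathrm{erf}(a\,x)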

slide6

[ Diagram: teacher network, the (best) rule parameterization, with hidden units unknown to the student (? ? ? ? ?), alongside the adaptive student network ]

relevant cases, interesting effects:

unlearnable rule (teacher more complex than the student)

over-sophisticated student (student more complex than the teacher)

ideal situation: perfectly matching complexity

Student teacher scenario

slide7

examples for the unknown function or rule: input/output pairs (reliable teacher outputs)

training based on the performance w.r.t. the example data, e.g.

evaluation after training: generalization error

expected error for a novel input, w.r.t. the density of inputs / a set of test inputs
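For definiteness, a standard quadratic-error convention for these two quantities (the squared error and the factor 1/2 are assumptions; the slide only names them):

E = \frac{1}{2} \sum_{\mu=1}^{P} \left( \sigma(\xi^{\mu}) - \tau^{\mu} \right)^{2}, \qquad \epsilon_g = \left\langle \tfrac{1}{2} \left( \sigma(\xi) - \tau(\xi) \right)^{2} \right\rangle_{\xi}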

slide8

consider large systems in the thermodynamic limit N → ∞ (K, M ≪ N)

  • dimension of the input data N → ∞
  • number of adjustable parameters grows with N

  • perform averages over the stochastic training process and over the randomized example data (quenched disorder)

(technically) simplest case: reliable teacher outputs,

isotropic input density: independent components

with zero mean / unit variance

  • description in terms of macroscopic quantities
  • e.g. overlap parameters
  • student/teacher similarity measure
  • evaluate typical properties
  • e.g. the learning curve

Statistical Physics approach

next: the generalization error ε_g

slide9

(sums of many random numbers)

Central Limit Theorem: correlated Gaussians for large N

first and second

moments:

averages over K N microscopic weights → integrals over ½(K²+K) + K M macroscopic order parameters

The generalization error
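Concretely, the sums in question are the hidden-unit fields; with the input statistics of the previous slide (independent components, zero mean, unit variance) their first and second moments are fixed by the overlap parameters. The identifications below use the common convention Q_ij = w_i · w_j, R_im = w_i · B_m, T_mn = B_m · B_n:

x_i = \mathbf{w}_i \cdot \xi, \quad y_m = \mathbf{B}_m \cdot \xi, \qquad \langle x_i \rangle = \langle y_m \rangle = 0, \quad \langle x_i x_j \rangle = Q_{ij}, \quad \langle x_i y_m \rangle = R_{im}, \quad \langle y_m y_n \rangle = T_{mn}

The generalization error then depends on the weights only through these macroscopic overlaps.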

slide10

Dynamics of on-line gradient descent: presentation of single examples

at each step a novel, random example is presented; the on-line learning step updates the weights obtained after presentation of μ examples using the next example only

number of examples → discrete learning time

practical advantages:

· no explicit storage of all example data required

· little computational effort per example

mathematical ease: the typical dynamics of learning can be evaluated on average over a randomized sequence of examples

→ coupled ODEs for {R_jm, Q_ij} in continuous time α = P/(KN)
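A minimal simulation sketch of this procedure (not the authors' code; the erf activation, the quadratic per-example error, the learning rate η/N and K = M = 2 are assumptions chosen to match the conventions used elsewhere in the talk):

import numpy as np
from scipy.special import erf

# soft committee machine with K (student) / M (teacher) hidden units
def g(x): return erf(x / np.sqrt(2.0))
def g_prime(x): return np.sqrt(2.0 / np.pi) * np.exp(-x ** 2 / 2.0)

rng = np.random.default_rng(0)
N, K, M, eta, alpha_max = 100, 2, 2, 1.5, 300

B = rng.standard_normal((M, N))
B /= np.linalg.norm(B, axis=1, keepdims=True)        # unit-length teacher vectors, T_mn roughly delta_mn
w = rng.standard_normal((K, N)) / np.sqrt(N)          # random student, overlaps R_im(0) of order 1/sqrt(N)

learning_curve = []
for mu in range(alpha_max * K * N):
    xi = rng.standard_normal(N)                        # novel random example
    tau = g(B @ xi).sum()                              # (reliable) teacher output
    x = w @ xi                                         # student hidden fields
    delta = (g(x).sum() - tau) * g_prime(x)            # gradient factor per hidden unit
    w -= (eta / N) * delta[:, None] * xi[None, :]      # on-line gradient step

    if mu % (K * N) == 0:                              # crude Monte Carlo estimate of eps_g
        test = rng.standard_normal((500, N))
        err = 0.5 * np.mean((g(test @ w.T).sum(1) - g(test @ B.T).sum(1)) ** 2)
        learning_curve.append((mu / (K * N), err))     # (alpha, eps_g), cf. the learning curve on slide 12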

slide11

projections onto teacher and student weight vectors → recursions for the overlaps, e.g. for R_jm, Q_ij

large N:

• average over the latest example → Gaussian fields

• mean recursions → coupled ODEs in continuous training time α ~ examples per weight

→ learning curve ε_g(α)
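With the gradient step and the quadratic per-example error assumed in the sketch above, the projections give recursions of the standard form below; since ξ·ξ ≈ N, the last term contributes at the same order 1/N as the others, and averaging the right-hand sides over the Gaussian fields for N → ∞ yields the coupled ODEs:

\delta_i = \left( \tau - \sigma \right) g'(x_i), \qquad R_{im}^{\mu+1} = R_{im}^{\mu} + \frac{\eta}{N}\, \delta_i\, y_m, \qquad Q_{ij}^{\mu+1} = Q_{ij}^{\mu} + \frac{\eta}{N} \left( \delta_i x_j + \delta_j x_i \right) + \frac{\eta^{2}}{N^{2}}\, \delta_i \delta_j \; \xi \cdot \xi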

slide12

[ Plot: learning curve ε_g vs. α = P/(KN) from 0 to 300; fast initial decrease, long plateau, eventual approach to perfect generalization. Biehl, Riegler, Wöhler, J. Phys. A (1996) 4769 ]

learning curve, example: K = M = 2, η = 1.5, R_ij(0) ≈ 0

quasi-stationary plateau states with all student weights unspecialized dominate the learning process

slide13

permutation symmetry of branches in the student network

example: K = M = 2, T_mn = δ_mn, η = 1, R_ij(0) ≈ 0

evolution of overlap parameters

[ Plot: R_11, R_22, Q_11, Q_22, Q_12 = Q_21, R_12, R_21 vs. α from 0 to 300, values between 0.0 and 1.0 ]

slide14

Monte Carlo simulations: self-averaging for N → ∞

[ Plots: mean and standard deviation of a macroscopic quantity (e.g. Q_jm) vs. 1/N; fluctuations vanish in the thermodynamic limit ]

slide15

assume randomized initialization of the weight vectors

→ many examples needed for successful learning!

hidden unit specialization requires a priori knowledge (initial macroscopic overlaps)

is the plateau a property of the learning scenario (a necessary phase of training)

or

an artifact of the training prescription ???

Plateau length: diverges exactly if all initial student-teacher overlaps vanish (self-avg.)

slide16

S.J. Hanson, in Y. Chauvin & D. Rumelhart (eds.)

Backpropagation: Theory, Architectures, and Applications

slide17

idea:

  • A) identification (approximation) of the subspace spanned by the teacher weight vectors
  • B) actual training within this low-dimensional space

example: soft committee teacher (K = M), isotropic input density

modified correlation matrix, its eigenvalues and eigenvectors:

1 eigenvector (largest eigenvalue)

( K-1 ) e.v. (smallest eigenvalues)

( N-K ) e.v. (remaining bulk)

Training by Principal Component Analysis

problem: delayed specialization in the ( K N )-dimensional weight space
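A sketch of step A) under explicit assumptions: the output weighting f(τ) = τ² used to build the modified correlation matrix below is an illustrative choice, not taken from the slides, which only name the matrix (called C_P on a later slide):

import numpy as np

def teacher_subspace(xi, tau, K):
    """xi: (P, N) inputs, tau: (P,) teacher outputs, K: number of hidden units."""
    P, N = xi.shape
    f = tau ** 2                                    # assumed output weighting f(tau)
    C = (xi * f[:, None]).T @ xi / P                # N x N modified correlation matrix
    evals, evecs = np.linalg.eigh(C)                # eigenvalues in ascending order
    # keep the (K-1) eigenvectors with the smallest eigenvalues and the one
    # with the largest eigenvalue; together they approximate the teacher space
    return np.hstack([evecs[:, :K - 1], evecs[:, -1:]])   # (N, K) orthonormal basis

Note that C is only N x N, so the required memory does not grow with P, matching the memory remark on the next slide.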

slide18

A) determine the ( K-1 ) eigenvectors with the smallest eigenvalues and the 1 eigenvector with the largest eigenvalue

(empirical estimate of the correlation matrix from a limited data set)

B) specialization in the K-dimensional space spanned by these eigenvectors

· representation of the student weights as linear combinations of the eigenvectors ( K² ≪ K N coefficients )

· optimization of the coefficients w.r.t. E ( # of examples P = α N K ≫ K² )

note: required memory ∝ N² does not increase with P
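A sketch of step B), again with assumptions flagged: plain full-batch gradient descent on the quadratic training error and the erf activation from the earlier sketch are illustrative choices; the slides only state that the K² coefficients are optimized w.r.t. E:

import numpy as np
from scipy.special import erf

def g(x): return erf(x / np.sqrt(2.0))
def g_prime(x): return np.sqrt(2.0 / np.pi) * np.exp(-x ** 2 / 2.0)

def train_coefficients(xi, tau, basis, steps=2000, lr=0.05, seed=0):
    """xi: (P, N) inputs, tau: (P,) targets, basis: (N, K) from step A)."""
    z = xi @ basis                                   # project inputs once: (P, K)
    P, K = z.shape
    A = np.random.default_rng(seed).standard_normal((K, K)) / np.sqrt(K)
    for _ in range(steps):                           # optimize only the K*K coefficients
        x = z @ A.T                                  # student hidden fields, (P, K)
        delta = (g(x).sum(axis=1) - tau)[:, None] * g_prime(x)
        A -= lr * (delta.T @ z) / P                  # gradient of E = (1/2P) sum (sigma - tau)^2
    return A @ basis.T                               # student weight vectors, (K, N)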

slide19

typical properties: given a random set of P = α N K examples

A) typical overlap with the teacher weights measures the success of the teacher space identification

B) given this overlap, determine the optimal ε_G achievable by a linear combination of the identified eigenvectors

formal partition sum, replica trick, saddle point integration, limit β → ∞ → quenched free energy
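Written out, the standard form of this program (the Gibbs measure over the expansion coefficients, here denoted c_ik, and the normalization are the usual conventions, not spelled out on the slide):

Z = \int d\mu(\{c_{ik}\}) \; e^{-\beta E(\{c_{ik}\})}, \qquad \overline{\ln Z} = \lim_{n\to 0} \frac{\overline{Z^{\,n}} - 1}{n}, \qquad f \propto -\frac{1}{\beta}\,\overline{\ln Z}

The disorder average over the example set uses the replica identity, the remaining integrals are evaluated by saddle point for N → ∞, and the limit β → ∞ selects the optimal ε_G.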

slide20

P = α K N examples

α_c(K=2) = 4.49,  α_c(K=3) = 8.70

large K theory: α_c(K) ~ 2.94 K (N-independent!)

B) given the overlap from A), determine the optimal ε_G achievable by a linear combination of the identified eigenvectors

[ Plots A) and B): K = 3, Statistical Physics theory and simulations, N = 400 (open symbols), N = 1600 (•), α_c marked ]

slide21

P = α K N examples

α_c(K=2) = 4.49,  α_c(K=3) = 8.70

large K theory: α_c(K) ~ 2.94 K (N-independent!)

[ Plots A) and B): K = 3, theory and Monte Carlo simulations, N = 400 (open symbols), N = 1600 (•); branches labelled "specialized" and "unspecialized", α_c marked ]

specialization without a priori knowledge ( α_c independent of N )

Bunzmann, Biehl, Urbanczik, Phys. Rev. Lett. 86, 2166 (2001)

slide22

potential application: model selection

spectrum of matrix CP, teacher with M = 7 hidden units

algorithm requires no prior knowledge of M

PCA hints at the required model complexity: the K - 1 = 6 smallest eigenvalues stand out in the spectrum
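An illustrative heuristic only (the slide just displays the spectrum of C_P): count the eigenvalues that fall clearly below the bulk of the spectrum to read off the required number of hidden units; the threshold of half the bulk value is an arbitrary assumption:

import numpy as np

def estimate_hidden_units(C):
    evals = np.sort(np.linalg.eigvalsh(C))        # ascending spectrum of the N x N matrix C_P
    bulk = np.median(evals)                       # proxy for the bulk eigenvalue
    n_small = int(np.sum(evals < 0.5 * bulk))     # eigenvalues well separated below the bulk
    return n_small + 1                            # M - 1 small eigenvalues suggest M hidden units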

slide23

Summary

· model situation, supervised learning

- the soft committee machine

- student teacher scenario

- randomized training data

· statistical physics inspired approach

- large systems

- thermal (training) and disorder (data) average

- typical, macroscopic properties

· dynamics of on-line gradient descent

- delayed learning due to symmetry breaking

necessary specialization processes

· efficient training

- PCA based learning algorithm

reduces dimensionality of the problem

- specialization without a priori knowledge

slide24

Further topics

· perceptron training (single layer)

optimal stability classification

dynamics of learning

· unsupervised learning

principal component analysis

competitive learning, clustered data

· non-trivial statistics of data

learning from noisy data

time-dependent rules

· dynamics of on-line training

perceptron, unsupervised learning,

two-layered feed-forward networks

· specialization processes

discontinuous learning curves

delayed learning, plateau states

· algorithm design

variational method, optimal algorithms

construction algorithm

slide25

· algorithm design

variational optimization, e.g.

alternative correlation matrix

Selected Prospective Projects

· application relevant architectures and algorithms

Local Linear Model Trees

Learning Vector Quantization

Support Vector Machines

· unsupervised learning

density estimation, feature detection,

clustering, (Learning) Vector Quantization

compression, self-organizing maps

· model selection

estimate complexity of a rule

or mixture density