Glide Algorithm with Tunneling: A Fast, Reliably Convergent Algorithm for Neural Network Training

Vitit Kantabutra & Batsukh Tsendjav
Computer Science Program
College of Engineering
Idaho State University
Pocatello, ID 83209

Elena Zheleva
Dept. of CS/EE
The University of Vermont
Burlington, VT 05405


New Algorithm for Neural Network Training

  • Convergence of training algorithms is one of the most important issues in the neural-network field today

  • We solve the problem for several well-known, difficult-to-train networks:

    • Parity-4 – 100% fast convergence

    • Two-spiral – same

    • Character recognition – same


Our Glide Algorithms

  • Our first “Glide Algorithm” was a simple modification of gradient descent.

  • When the gradient is small, move a constant distance instead of a distance equal to a constant times the gradient (see the sketch after this list)

  • The idea was that flat regions are seemingly “safe,” enabling us to go a relatively long distance (“glide”) without missing the solution

  • Originally we even thought of going a longer distance when the gradient is smaller!

  • We simply didn’t believe the conventional wisdom of taking longer steps on steeper slopes.
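A minimal sketch of that basic glide rule, assuming a threshold on the gradient norm decides what counts as “flat”; the names gradEps and glideLen are illustrative parameters, not from the paper:

```cpp
#include <cmath>
#include <vector>

// Sketch of the original glide rule: an ordinary gradient-descent step on
// steep terrain, but a fixed-length step along -grad in flat regions.
// gradEps (flatness threshold) and glideLen (glide distance) are
// hypothetical tuning parameters.
void glideStep(std::vector<double>& w, const std::vector<double>& grad,
               double eta, double gradEps, double glideLen) {
    double norm = 0.0;
    for (double g : grad) norm += g * g;
    norm = std::sqrt(norm);

    if (norm >= gradEps) {
        // Steep region: conventional step, length proportional to |grad|.
        for (size_t i = 0; i < w.size(); ++i) w[i] -= eta * grad[i];
    } else if (norm > 0.0) {
        // Flat region: "glide" a constant distance, regardless of |grad|.
        for (size_t i = 0; i < w.size(); ++i) w[i] -= glideLen * grad[i] / norm;
    }
}
```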


Hairpin Observation – Problem with the Original Glide Algorithm

  • Our original Glide Algorithm did converge significantly faster than plain gradient descent

    • But ours didn’t even converge as reliably as plain gradient descent!

    • What seemed to be wrong?

  • We were wrong about flat regions always being safe!!

  • We experimented by running plain gradient descent and observing its flat-region behavior

    • Flat regions are indeed often safe

    • But sometimes gradient descent makes a sharp “hairpin” turn!!

    • This sometimes derailed our first Glide Algorithm


Second Glide Algorithm: “Glide Algorithm with Tunneling”

  • In flat regions, we still try to go far

  • But we check the error at the tentative destination (see the sketch after this list)

    • Don’t go as far if the error increases much

    • We can easily afford the time

    • But even if the error increases a little, go anyway to “stir things up”

  • Also has a mechanism for battling zigzagging

    • The direction of motion is the average of 2 or 4 gradient-descent moves

    • Seems better than momentum

  • Also has “tunneling”

    • Meaning a very local line search, but fancier
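A sketch of how the tentative-destination check might look, under assumptions: error is the network’s error function, maxRise is a hypothetical tolerance for the acceptable error increase, and the step is halved a few times if the increase is too large:

```cpp
#include <functional>
#include <vector>

// Sketch of the tentative-step rule: evaluate the error at the proposed
// destination first; back off if the error rises a lot, but accept small
// increases to "stir things up". maxRise is a hypothetical tolerance.
std::vector<double> tryGlide(const std::vector<double>& w,
                             const std::vector<double>& dir,
                             const std::function<double(const std::vector<double>&)>& error,
                             double maxRise) {
    double e0 = error(w);
    std::vector<double> cand(w.size());
    for (size_t i = 0; i < w.size(); ++i) cand[i] = w[i] + dir[i];

    // One extra error evaluation per step; we can afford the time easily.
    for (int k = 0; k < 20 && error(cand) > e0 + maxRise; ++k)
        for (size_t i = 0; i < w.size(); ++i)
            cand[i] = w[i] + 0.5 * (cand[i] - w[i]);   // halve the step

    return cand;   // may still carry a small error increase, by design
}
```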


Reducing the Zigzagging Problem

  • The direction of the next move is usually determined by averaging 2 or 4 (or 6, 8, etc.) gradient-descent moves (see the sketch below the figure)

(Figure: gradient descent zigzagging despite momentum!)
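A sketch of the averaging idea, assuming the gradient is re-evaluated after each small probe move (gradFn and eta are placeholder names for the user’s gradient routine and probe step size):

```cpp
#include <functional>
#include <vector>

// Sketch: take k small gradient-descent probe moves, average them, and use
// the mean as the actual direction of motion, so alternating zigzag
// components largely cancel. gradFn and eta are illustrative names.
std::vector<double> averagedDirection(
        std::vector<double> w,   // copied: probing must not disturb the caller's weights
        const std::function<std::vector<double>(const std::vector<double>&)>& gradFn,
        int k, double eta) {
    std::vector<double> avg(w.size(), 0.0);
    for (int step = 0; step < k; ++step) {
        std::vector<double> g = gradFn(w);
        for (size_t i = 0; i < w.size(); ++i) {
            double move = -eta * g[i];   // one gradient-descent probe move
            avg[i] += move / k;          // accumulate the mean direction
            w[i]   += move;              // advance the probe point
        }
    }
    return avg;
}
```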


Importance of Tunneling

  • Serves to set the weights at the “bottom of the gutter” (see the stand-in sketch after the figure)

(Figure: error plotted against distance along the search direction.)
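The slides describe tunneling only as a fancier, very local line search, so the following is just a plain sampled line search along the move direction as a stand-in; nSamples and span are assumed parameters:

```cpp
#include <functional>
#include <vector>

// Stand-in sketch for tunneling: sample the error at a few points along
// the move direction and keep the lowest, settling the weights at the
// "bottom of the gutter". The authors' tunneling is fancier than this;
// nSamples and span are hypothetical parameters.
std::vector<double> tunnel(const std::vector<double>& w,
                           const std::vector<double>& dir,
                           const std::function<double(const std::vector<double>&)>& error,
                           int nSamples, double span) {
    std::vector<double> best = w;
    double bestErr = error(w);
    for (int s = 1; s <= nSamples; ++s) {
        double t = span * s / nSamples;   // distance along dir
        std::vector<double> cand(w.size());
        for (size_t i = 0; i < w.size(); ++i) cand[i] = w[i] + t * dir[i];
        double e = error(cand);
        if (e < bestErr) { bestErr = e; best = cand; }
    }
    return best;
}
```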


A Few Experimental Results

(Figure: running time until convergence on Parity-4 with 4 hidden neurons. x-axis: run number; y-axis: running time in seconds until convergence. Odd runs use random starting weights; even runs start from the previous run’s weights. The gradient-descent odd runs with momentum m = 0.9 often didn’t converge.)


Some GAT Data

(Figures: GAT data for the Voting Records and Parity-4 problems.)


Testing Parity 4

  • Network Information

    • One hidden layer

    • 4 inputs

    • 6 hidden neurons

    • 1 output neuron

    • Fully connected between layers

  • Machine Used

    • Windows XP

    • AMD Athlon 2.0 GHz Processor

    • 1 GB Memory


Testing Parity 4

  • Parity 4 (Even)

  • Number of Instances: 16

    • True (=1)

    • False (=-1)

  • Number of Attributes: 4

    • True (=1)

    • False (=-1)


Testing Parity 4

  • Patterns Used

    • X1 – X4 are inputs

 #   X1   X2   X3   X4   Out
 0   -1   -1   -1   -1     1
 1   -1   -1   -1    1    -1
 2   -1   -1    1   -1    -1
 3   -1   -1    1    1     1
 4   -1    1   -1   -1    -1
 5   -1    1   -1    1     1
 6   -1    1    1   -1     1
 7   -1    1    1    1    -1
 8    1   -1   -1   -1    -1
 9    1   -1   -1    1     1
10    1   -1    1   -1     1
11    1   -1    1    1    -1
12    1    1   -1   -1     1
13    1    1   -1    1    -1
14    1    1    1   -1    -1
15    1    1    1    1     1
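For reference, a short standalone routine that regenerates exactly this bipolar even-parity table; the bit-to-bipolar mapping is my own representation, not taken from the authors’ code:

```cpp
#include <array>
#include <cstdio>

// Print the 16 bipolar parity-4 patterns: Out is +1 when the number of
// +1 inputs is even, -1 otherwise (matching the table above).
int main() {
    std::printf(" #  X1  X2  X3  X4  Out\n");
    for (int n = 0; n < 16; ++n) {
        std::array<int, 4> x;
        int ones = 0;
        for (int b = 0; b < 4; ++b) {
            x[b] = ((n >> (3 - b)) & 1) ? 1 : -1;   // bit -> bipolar input
            if (x[b] == 1) ++ones;
        }
        int out = (ones % 2 == 0) ? 1 : -1;          // even parity -> +1
        std::printf("%2d  %2d  %2d  %2d  %2d  %3d\n",
                    n, x[0], x[1], x[2], x[3], out);
    }
    return 0;
}
```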




Testing Parity 4

Statistics:

Total times        GAT      Grad
# of Tests         35       35
Seconds            788      1488
Minutes            13.13    24.81

Iterations         GAT      Grad
Mean               28,599   211,081
St Dev             26,388   96,424

Seconds per run    GAT      Grad
Mean               23       43
St Dev             28       19


Testing on Voting Records

  • Network Information

    • One hidden layer

    • 16 inputs

    • 16 hidden neurons

    • 1 output neuron

    • Fully connected between layers

  • Machine Used

    • Windows XP

    • AMD Athlon 2.0 GHz Processor

    • 1 GB Memory


Testing on Voting Records

  • 1984 United States Congressional Voting Records Database (taken from the UCI Machine Learning Repository – http://www.ics.uci.edu/~mlearn/)

  • Number of Instances: 435

    • 267 democrats (=1)

    • 168 republicans (=-1)

  • Number of Attributes: 16 + class name = 17

    • Yes Vote (1)

    • No Vote (-1)

    • Abstained (0)
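A minimal sketch of this input encoding (the helper name is hypothetical; the UCI file marks abstentions/unknowns with '?'):

```cpp
// Map a raw vote from the UCI house-votes-84 file to the input encoding
// used on this slide: yes -> 1, no -> -1, abstained -> 0.
double encodeVote(char v) {
    switch (v) {
        case 'y': return  1.0;   // Yes vote
        case 'n': return -1.0;   // No vote
        default:  return  0.0;   // Abstained / unknown ('?')
    }
}
```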


Testing on Voting Records

  • 1. Class Name: 2 (democrat, republican)

  • 2. handicapped-infants: 2 (y,n)

  • 3. water-project-cost-sharing: 2 (y,n)

  • 4. adoption-of-the-budget-resolution: 2 (y,n)

  • 5. physician-fee-freeze: 2 (y,n)

  • 6. el-salvador-aid: 2 (y,n)

  • 7. religious-groups-in-schools: 2 (y,n)

  • 8. anti-satellite-test-ban: 2 (y,n)

  • 9. aid-to-nicaraguan-contras: 2 (y,n)

  • 10. mx-missile: 2 (y,n)

  • 11. immigration: 2 (y,n)

  • 12. synfuels-corporation-cutback: 2 (y,n)

  • 13. education-spending: 2 (y,n)

  • 14. superfund-right-to-sue: 2 (y,n)

  • 15. crime: 2 (y,n)

  • 16. duty-free-exports: 2 (y,n)

  • 17. export-administration-act-south-africa: 2 (y,n)




Testing on Voting Records

Statistics:

Total times        GAT      Grad
# of Tests         20       20
Seconds            4338     32603
Minutes            72.31    543.39
Hours              1.21     9.06

Iterations         GAT      Grad
Mean               4,636    107,303
St Dev             4,386    31,949

Minutes per run    GAT      Grad
Mean               3.62     27.17
St Dev             3.52     8.11


Two-Spiral Problem

  • Very hard problem

  • Glide algorithm

    • combined with gradient descent for quicker initial error reduction

    • the number of epochs required for convergence varied widely

    • average: 30,453 epochs

  • Gradient descent

    • often did not converge


Tuning Insensitivity of Glide-Tunnel Algorithm!!

(Figures: runs with random parameters – odd runs and even runs.)


Glide Algorithm Tested on the Character Recognition Problem

  • The network was built to recognize digits 0 through 9

  • The algorithm was implemented in C++

  • The test runs showed the Glide Algorithm outperforming the regular gradient-descent method.


Small Neural Network

  • The network was 48-24-10

  • Bipolar inputs

  • Trained on 200 training patterns

    • 20 samples for each digit

    • Trained and tested on printed characters

  • After training, the recognition rate on test patterns averaged 70%.

    • Not enough training patterns


Network Structure

  • 6×8 pixel resolution

  • 48 bipolar inputs (1/-1)

  • Hidden Layer

    • 24 neurons

    • tanh(x) for activation

  • Output Layer

    • 10 neurons

    • tanh(x) activation function
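A sketch of one forward pass through this 48-24-10 network, under assumptions: weights are stored one row per neuron with a trailing bias weight, and the steepness parameter lambda reported in the experiments (lambda = 1) scales the tanh argument:

```cpp
#include <cmath>
#include <vector>

// Forward pass for the fully connected 48-24-10 network described above:
// bipolar inputs, tanh activations in both layers. The weight layout (one
// row per neuron, last entry = bias) is an assumption, not the authors' code.
std::vector<double> forward(const std::vector<double>& x,               // 48 inputs
                            const std::vector<std::vector<double>>& W1, // 24 rows of 49
                            const std::vector<std::vector<double>>& W2, // 10 rows of 25
                            double lambda = 1.0) {
    std::vector<double> h(W1.size());
    for (size_t j = 0; j < W1.size(); ++j) {
        double s = W1[j].back();                          // bias weight
        for (size_t i = 0; i < x.size(); ++i) s += W1[j][i] * x[i];
        h[j] = std::tanh(lambda * s);                     // hidden activation
    }
    std::vector<double> y(W2.size());
    for (size_t k = 0; k < W2.size(); ++k) {
        double s = W2[k].back();                          // bias weight
        for (size_t j = 0; j < h.size(); ++j) s += W2[k][j] * h[j];
        y[k] = std::tanh(lambda * s);                     // one output per digit
    }
    return y;
}
```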


Experimental Results

  • 60 official runs of Glide Algorithm

    • All but 4 runs converged in under 5,000 epochs.

    • The average run time was 47 seconds.

    • Parameters used

      • Eta = 0.005 (learning rate)

      • Lambda = 1 (steepness parameter)


Experimental Results

  • 20 runs of Regular Gradient Descent Algorithm

    • None of the runs had converged after 20,000 epochs.

    • The average run time was 3.7 minutes.

  • Higher-order methods exist

    • Not Stable

    • Not very efficient when the error surface is flat


Conclusion

  • The new Glide Algorithm has been shown to perform very well in flat regions

  • With tunneling, the algorithm is very stable, converging on all test runs across the different test problems

  • It converges more reliably than gradient descent and, presumably, than second-order methods

  • Some individual steps are computationally expensive, but they are worth the CPU time because the overall performance is far superior to regular gradient descent