Discovering Knowledge in Data
Daniel T. Larose, Ph.D.

Chapter 7: Neural Networks

Prepared by James Steck, Graduate Assistant

[Figure: a biological neuron, showing dendrites, cell body, and axon.]

Neural Networks
  • Neural Networks
    • Complex learning systems recognized in animal brains
    • Single neuron has simple structure
    • Interconnected sets of neurons perform complex learning tasks
    • Human brain has about 10^15 synaptic connections
    • Artificial Neural Networks attempt to replicate non-linear learning found in nature
Neural Networks (cont’d)
  • Dendrites gather inputs from other neurons and combine information
  • Then generate non-linear response when threshold reached
  • Signal sent to other neurons via axon
  • Artificial neuron model is similar
  • Data inputs (x_i) are collected from upstream neurons and input to the combination function (Σ)
Neural Networks (cont’d)
    • Activation function reads combined input and produces non-linear response (y)
    • Response channeled downstream to other neurons
  • What problems applicable to Neural Networks?
    • Quite robust with respect to noisy data
    • Can learn and work around erroneous data
    • Results opaque to human interpretation
    • Often require long training times
Activation Functions
  • An activation function is a nondecreasing function, which can be linear or non-linear
  • Ex. 1: Binary threshold function [2]

    f(net) = 1 if net ≥ θ
           = 0 if net < θ

    [Figure: binary threshold function, a step from 0 to 1 at net = θ.]

  • Ex. 2: Sigmoid function

    f(net) = 2 / (1 + e^(-net)) - 1

    [Figure: sigmoid curve of f(net) versus net.]

  • Ex. 3: Identity function

    f(net) = net for all net

  • Ex. 4: Hard limit function (sign function or bipolar function)

    f(net) = +1 if net ≥ 0
           = -1 if net < 0
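As a quick illustration (not part of the original slides), these four example activation functions might be coded as follows; the function names are my own:

    import math

    def binary_threshold(net, theta):
        # Ex. 1: returns 1 once net reaches the threshold theta, else 0
        return 1 if net >= theta else 0

    def bipolar_sigmoid(net):
        # Ex. 2: f(net) = 2 / (1 + e^(-net)) - 1, output in (-1, 1)
        return 2.0 / (1.0 + math.exp(-net)) - 1.0

    def identity(net):
        # Ex. 3: f(net) = net for all net
        return net

    def hard_limit(net):
        # Ex. 4: sign (bipolar) function, +1 for net >= 0, -1 otherwise
        return 1 if net >= 0 else -1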

[Figure 11.1: Application of a neural network to aircraft health monitoring [2]. Input units take data from strain gauges, feed a layer of hidden units, and a single output unit indicates the health of the aircraft.]

Input and Output Encoding
    • Neural Networks require attribute values encoded to [0, 1]
  • Numeric
    • Apply Min-max Normalization to continuous variables
    • Works well when Min and Max known
    • Also assumes new data values occur within Min-Max range
    • Values outside range may be rejected or mapped to Min or Max
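A minimal sketch of min-max normalization as described above (illustrative only; the variable names are mine):

    def min_max_normalize(x, x_min, x_max):
        # Map x into [0, 1]; values outside the known range are clipped
        # to the boundaries (one of the options mentioned above).
        x_star = (x - x_min) / (x_max - x_min)
        return min(1.0, max(0.0, x_star))

    # Example: age 30 with observed ages ranging from 18 to 65
    print(min_max_normalize(30, 18, 65))   # about 0.255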
Input and Output Encoding (cont’d)
  • Categorical
    • Indicator Variables used when number of category values small
    • Categorical variable with k classes translated to k – 1 indicator variables
    • For example, Gender attribute values are “Male”, “Female”, and “Unknown”
    • Classes k = 3
    • Create k – 1 = 2 indicator variables named Male_I and Female_I
    • Male records have values Male_I = 1, Female_I = 0
    • Female records have values Male_I = 0, Female_I = 1
    • Unknown records have values Male_I = 0, Female_I = 0
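A small sketch of the k - 1 indicator-variable encoding, using the Gender example above:

    def encode_gender(gender):
        # k = 3 categories -> k - 1 = 2 indicator variables (Male_I, Female_I);
        # "Unknown" is represented by both indicators being 0.
        male_i = 1 if gender == "Male" else 0
        female_i = 1 if gender == "Female" else 0
        return male_i, female_i

    print(encode_gender("Male"))     # (1, 0)
    print(encode_gender("Female"))   # (0, 1)
    print(encode_gender("Unknown"))  # (0, 0)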
Input and Output Encoding (cont’d)
  • Apply caution when reordering unordered categorical values to [0, 1] range
  • For example, attribute Marital_Status has values “Divorced”, “Married”, “Separated”, “Single”, “Widowed”, and “Unknown”
  • Values coded as 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0, respectively
  • Coding implies “Divorced” (0.0) is closer to “Married” (0.2) than to “Separated” (0.4), an ordering that has no real meaning
  • Neural Network only aware of numeric values
  • Naive to pre-encoded meaning of categorical values
  • Results of network model may be meaningless
Input and Output Encoding (cont’d)
  • Output
    • Neural Networks always return continuous values [0, 1]
    • Many classification problems have two outcomes
    • Solution uses threshold established a priori in single output node to separate classes
    • For example, target variable is “leave” or “stay”
    • Threshold value is “leave if output >= 0.67”
    • Single output node value = 0.72 classifies record as “leave”
Input and Output Encoding (cont’d)
  • Single output nodes applicable when target classes ordered
  • For example, classify elementary-level reading ability
  • Define thresholds and the corresponding classifications:
    • “if 0.00 <= output < 0.25” classify as “first-grade”
    • “if 0.25 <= output < 0.50” classify as “second-grade”
    • “if 0.50 <= output < 0.75” classify as “third-grade”
    • “if output >= 0.75” classify as “fourth-grade”
  • Fine-tuning of thresholds may be required
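A tiny sketch of this single-output, ordered-threshold classification (the cutoffs are the ones listed above):

    def classify_reading_level(output):
        # Map the network's continuous output in [0, 1] to an ordered class.
        if output < 0.25:
            return "first-grade"
        elif output < 0.50:
            return "second-grade"
        elif output < 0.75:
            return "third-grade"
        else:
            return "fourth-grade"

    print(classify_reading_level(0.72))  # "third-grade"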
Neural Networks for Estimation and Prediction
  • Continuous output useful for estimation and prediction problems
  • For example, predict a stock price 3 months from now
  • Input values normalized using Min-max Normalization
  • Output values [0, 1] denormalized to represent scale of stock prices

Prediction = output × (data range) + minimum

  • For example, suppose the stock price ranged from $20 to $30
  • Network output prediction value = 0.69

0.69(30.0 – 20.0) + 20.0 = $26.90
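A one-line sketch of this denormalization step:

    def denormalize(output, data_min, data_max):
        # Convert a network output in [0, 1] back to the original scale.
        return output * (data_max - data_min) + data_min

    print(denormalize(0.69, 20.0, 30.0))  # 26.9, i.e. $26.90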

[Figure: a simple feedforward network with an input layer (Nodes 1, 2, 3), a hidden layer (Nodes A, B), and an output layer (Node Z); connection weights W1A, W2A, W3A, W0A, W1B, W2B, W3B, W0B, WAZ, WBZ, and W0Z.]
Simple Example of a Neural Network
  • Neural Network consists of layered, feedforward, completely connected network of nodes
  • Feedforward restricts network flow to single direction
  • Flow does not loop or cycle
  • Network composed of two or more layers
Ex. 1: ANN for the AND logic operation
  • Inputs x1 and x2 each connect to the output unit with weight 1
  • The unit computes ∑ x_i w_i and outputs 1 when the sum reaches the threshold θ = 2
  • AND is a linearly separable problem

Ex. 2: ANN for the OR logic operation
  • Inputs x1 and x2 each connect to the output unit with weight 1
  • The unit computes ∑ x_i w_i and outputs 1 when the sum reaches the threshold θ = 1
  • OR is also a linearly separable problem (see the sketch below)
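A minimal sketch of these two threshold units, assuming the weights of 1 and thresholds θ = 2 (AND) and θ = 1 (OR) shown above:

    def threshold_unit(x1, x2, w1, w2, theta):
        # Binary threshold activation over the weighted sum of the inputs.
        net = x1 * w1 + x2 * w2
        return 1 if net >= theta else 0

    for x1 in (0, 1):
        for x2 in (0, 1):
            and_out = threshold_unit(x1, x2, 1, 1, theta=2)
            or_out = threshold_unit(x1, x2, 1, 1, theta=1)
            print(x1, x2, "AND:", and_out, "OR:", or_out)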

Linearly separable vs. nonlinearly separable problems

[Figure: two scatter plots in the (X1, X2) plane; in the linearly separable case a single straight line separates the two classes, while in the nonlinearly separable case no straight line can.]

Neural network architectures
  • Feed-forward, single layer: [Figure: inputs x1 and x2 connect directly to the output unit y1.]
  • Feed-forward, multilayer: [Figure: inputs x1 and x2 feed hidden units y1 and y2, which feed the output unit z1.]
  • Generally, a three-layer network is sufficient to approximate virtually any mathematical function.

[Figure: a network with 63 binary inputs X1 through X63 and output nodes A through K, each output taking a value of 0 or 1.]
Simple Example of a Neural Network (cont’d)
  • Most networks have Input, Hidden, Output layers
  • Network may contain more than one hidden layer
  • Network is completely connected
  • Each node in given layer, connected to every node in next layer
  • Every connection has weight (Wij) associated with it
  • Weight values randomly assigned 0 to 1 by algorithm
  • Number of input nodes dependent on number of predictors
  • Number of hidden and output nodes configurable
Simple Example of a Neural Network (cont’d)
  • How many nodes in hidden layer?
  • Large number of nodes increases complexity of model
  • Detailed patterns uncovered in data
  • Leads to overfitting, at expense of generalizability
  • Reduce number of hidden nodes when overfitting occurs
  • Increase number of hidden nodes when training accuracy unacceptably low
  • Input layer accepts values from input variables
  • Values passed to hidden layer nodes
  • Input layer nodes lack detailed structure compared to hidden and output layer nodes

Backpropagation algorithm

  • Step0: Initialize weights
  • Step1: While the stopping condition is FALSE (e.g., while SSE > 0.01)
      • Step2: For each training record
    • Feedforward
        • Step3: Each input node and the bias broadcast their signals to the hidden layer
        • Step4: Compute net_input at each hidden node and calculate the output of each hidden node
        • Step5: Compute net_input at each output node and calculate the output of each output node
        • Backpropagation of error
        • Step6: Compute error at each output node and weight corrections
        • Step7: Compute error at each hidden node and weight corrections
        • Step8: Update weights and biases in both layers
      • Step9: Test stopping condition
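A compact Python sketch of this training loop for a network with one hidden layer and one output node; it is an illustration of the steps above under assumed details (sigmoid activations in both layers, stochastic updates after each record), not the book's code:

    import math, random

    def sigmoid(net):
        return 1.0 / (1.0 + math.exp(-net))

    def train(records, targets, n_hidden, eta=0.01, sse_limit=0.01, max_epochs=1000):
        n_in = len(records[0])
        # Step 0: initialize weights randomly in [0, 1]; index 0 holds the bias weight
        w_hidden = [[random.random() for _ in range(n_in + 1)] for _ in range(n_hidden)]
        w_out = [random.random() for _ in range(n_hidden + 1)]
        for epoch in range(max_epochs):                      # Step 1: loop until stopping condition
            sse = 0.0
            for x, t in zip(records, targets):               # Step 2: for each training record
                # Steps 3-4: feedforward through the hidden layer (bias input = 1.0)
                h = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in w_hidden]
                # Step 5: net input and output of the single output node
                y = sigmoid(w_out[0] + sum(wi * hi for wi, hi in zip(w_out[1:], h)))
                # Step 6: error responsibility for the output node
                delta_out = y * (1 - y) * (t - y)
                # Step 7: error responsibilities for the hidden nodes
                delta_h = [h[j] * (1 - h[j]) * w_out[j + 1] * delta_out for j in range(n_hidden)]
                # Step 8: update weights and biases in both layers
                w_out[0] += eta * delta_out
                for j in range(n_hidden):
                    w_out[j + 1] += eta * delta_out * h[j]
                    w_hidden[j][0] += eta * delta_h[j]
                    for i in range(n_in):
                        w_hidden[j][i + 1] += eta * delta_h[j] * x[i]
                sse += (t - y) ** 2
            if sse <= sse_limit:                              # Step 9: test stopping condition
                break
        return w_hidden, w_out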
Simple Example of a Neural Network (cont’d)
  • Combination function produces a linear combination of the node inputs and connection weights as a single scalar value: net_j = ∑_i W_ij x_ij
  • For node j, x_ij is the ith input
  • W_ij is the weight associated with the ith input to node j
  • There are I + 1 inputs to node j
  • x_1, x_2, ..., x_I are the inputs from upstream nodes
  • x_0 is a constant input, with value = 1.0
  • Each node j thus has an extra input W_0j x_0j = W_0j, analogous to a constant term
Simple Example of a Neural Network (cont’d)
  • The scalar value computed for hidden layer Node A is net_A = ∑_i W_iA x_iA
  • For Node A, net_A = 1.32 is the input to the activation function
  • Neurons “fire” in biological organisms
  • Signals sent between neurons when combination of inputs cross threshold
Simple Example of a Neural Network (cont’d)
  • Firing response not necessarily linearly related to increase in input stimulation
  • Neural Networks model behavior using non-linear activation function
  • Sigmoid function most commonly used
  • In Node A, the sigmoid function takes net_A = 1.32 as input and produces output f(net_A) = 1 / (1 + e^(-1.32)) = 0.7892
Simple Example of a Neural Network (cont’d)
  • Node A outputs 0.7892 along its connection to Node Z, where it becomes a component of net_Z
  • Before net_Z is computed, the contribution from Node B is required
  • Node Z combines the outputs of Node A and Node B through net_Z = W_0Z + W_AZ · output_A + W_BZ · output_B
Simple Example of a Neural Network (cont’d)
  • Inputs to Node Z not data attribute values
  • Rather, the inputs to Node Z are the sigmoid outputs from the upstream nodes
  • Value 0.8750 output from Neural Network on first pass
  • Represents predicted value for target variable, given first observation
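To make the first pass concrete, here is a small sketch of the forward computation. The input values and weights below are assumptions chosen to be consistent with the quantities quoted on these slides (net_A = 1.32, output_A = 0.7892, network output ≈ 0.8750); the text's actual example values may differ:

    import math

    def sigmoid(net):
        return 1.0 / (1.0 + math.exp(-net))

    # Assumed illustrative inputs and weights (bias weight listed first)
    x = [0.4, 0.2, 0.7]
    w_A = [0.5, 0.6, 0.8, 0.6]   # W0A, W1A, W2A, W3A
    w_B = [0.7, 0.9, 0.8, 0.4]   # W0B, W1B, W2B, W3B
    w_Z = [0.5, 0.9, 0.9]        # W0Z, WAZ, WBZ

    net_A = w_A[0] + sum(wi * xi for wi, xi in zip(w_A[1:], x))   # 1.32
    out_A = sigmoid(net_A)                                        # about 0.7892
    net_B = w_B[0] + sum(wi * xi for wi, xi in zip(w_B[1:], x))
    out_B = sigmoid(net_B)
    net_Z = w_Z[0] + w_Z[1] * out_A + w_Z[2] * out_B
    out_Z = sigmoid(net_Z)                                        # about 0.875
    print(round(net_A, 2), round(out_A, 4), round(out_Z, 4))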
Sigmoid Activation Function
  • Sigmoid function combines nearly linear, curvilinear, and nearly constant behavior depending on input value
  • Function nearly linear for domain values -1 < x < 1
  • Becomes curvilinear as values move away from center
  • At extreme values, f(x) is nearly constant
  • Moderate increments in x produce variable increase in f(x), depending on location of x
  • Sometimes called “Squashing Function”
  • Takes real-valued input and returns values [0, 1]
Back-Propagation
  • Neural Networks are supervised learning method
  • Require target variable
  • Each observation passed through network results in output value
  • Output value compared to actual value of target variable
  • (Actual – Output) = Error
  • Prediction error analogous to residuals in regression models
  • Most networks use Sum of Squares (SSE) to measure how well predictions fit target values
Back-Propagation (cont’d)
  • Squared prediction errors summed over all output nodes, and all records in data set
  • Model weights constructed that minimize SSE
  • Actual values that minimize SSE are unknown
  • Weights estimated, given the data set
  • Unlike least-squares regression, no closed-form solution exists for minimizing SSE
Gradient Descent Method
  • Gradient Descent Method determines set of weights that minimize SSE
  • Given a set of m weights w = w1, w2, ..., wm in network model
  • Find values for weights that, together, minimize SSE
  • Gradient Descent determines direction to adjust weights, that decreases SSE
  • Gradient of SSE, with respect to the vector of weights w, is the vector of partial derivatives: ∇SSE(w) = [∂SSE/∂w_1, ∂SSE/∂w_2, ..., ∂SSE/∂w_m]
[Figure: SSE plotted against a single weight w1, with the optimal value w*1 at the minimum and the values w1L and w1R on either side of it.]

Gradient Descent Method (cont’d)
  • Gradient Descent is illustrated using single weight w1
  • Figure plots SSE error against range of values for w1
  • Preferred values for w1 minimize SSE
  • Optimal value for w1 is w*1
Gradient Descent Method (cont’d)
  • Develop rule defining movement from current w1 to optimal value w*1
  • If the current weight is near w1L, increasing w moves it toward w*1
  • If the current weight is near w1R, decreasing w moves it toward w*1
  • Gradient of SSE, with respect to weight wCURRENT, is slope of SSE curve at wCURRENT
  • Value wCURRENT close to w1L, slope is negative
  • Value wCURRENT close to w1R, slope is positive
Gradient Descent Method (cont’d)
  • Direction for adjusting wCURRENT is the negative of the sign of the derivative of SSE at wCURRENT
  • Magnitude of the adjustment uses the magnitude of the derivative of SSE at wCURRENT
  • When curve steep, adjustment large
  • When curve nearly flat, adjustment small
  • Learning Rate η (Greek “eta”) takes values in [0, 1], so the adjustment is ΔwCURRENT = -η ∂SSE/∂wCURRENT
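A tiny one-weight illustration of this idea, using an assumed quadratic SSE curve purely for demonstration:

    # Gradient descent on a single weight, with an assumed SSE(w) = (w - 2)**2
    # whose minimum (the "w*" of the figure) is at w = 2.
    def d_sse(w):
        return 2 * (w - 2)      # derivative of the assumed SSE curve

    eta = 0.1                   # learning rate
    w = 0.0                     # starting weight, analogous to w1L
    for step in range(50):
        w = w - eta * d_sse(w)  # move against the gradient
    print(round(w, 4))          # close to 2.0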
Back-Propagation Rules
  • Back-propagation percolates prediction error for record back through network
  • Partitioned responsibility for prediction error assigned to various connections
  • Weights of connections adjusted to decrease error, using Gradient Descent Method
  • Back-propagation rules defined (Mitchell)
Back-Propagation Rules (cont’d)
  • Error responsibility δ_j computed using the partial derivative of the sigmoid function with respect to net_j
  • Values take one of two forms, depending on whether node j is an output node or a hidden node (shown below)
  • Rules show why input values require normalization
  • Large input values xij would dominate weight adjustment
  • Error propagation would be overwhelmed, and learning stifled
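For reference, the two forms of the error responsibility δ_j, and the resulting weight adjustment, are usually written as follows (standard back-propagation notation, after Mitchell; the exact notation in the text may differ):

  δ_j = output_j (1 - output_j)(actual_j - output_j)        if j is an output node
  δ_j = output_j (1 - output_j) ∑_k W_jk δ_k                 if j is a hidden node, summing over the nodes k downstream of j
  Δw_ij = η δ_j x_ij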
Example of Back-Propagation
  • Recall that first pass through network yielded output = 0.8750
  • Assume actual target value = 0.8, and learning rate = 0.01
  • Prediction error = 0.8 - 0.8750 = -0.075
  • Neural Networks use stochastic back-propagation
  • Weights updated after each record processed by network
  • Adjusting the weights using back-propagation shown next
  • Error responsibility for Node Z, an output node, is found first: δ_Z = output_Z (1 - output_Z)(actual - output_Z) = 0.8750(1 - 0.8750)(-0.075) ≈ -0.0082
Example of Back-Propagation (cont’d)
  • Now adjust “constant” weight w0Z using rules
  • Move upstream to Node A, a hidden layer node
  • Only node downstream from Node A is Node Z
Example of Back-Propagation (cont’d)
  • Adjust weight wAZ using back-propagation rules
  • Connection weight between Node A and Node Z adjusted from 0.9 to 0.899353
  • Next, Node B is hidden layer node
  • Only node downstream from Node B is Node Z
Example of Back-Propagation (cont’d)
  • Adjust weight wBZ using back-propagation rules
  • Connection weight between Node B and Node Z adjusted from 0.9 to 0.89933
  • Similarly, application of back-propagation rules continues to input layer nodes
  • Weights {w1A, w2A, w3A , w0A} and {w1B, w2B, w3B , w0B} updated by process
Example of Back-Propagation (cont’d)
    • Now, all network weights in model are updated
    • Each iteration based on single record from data set
  • Summary
    • Network calculated predicted value for target variable
    • Prediction error derived
    • Prediction error percolated back through network
    • Weights adjusted to generate smaller prediction error
    • Process repeats record by record
Termination Criteria
  • Many passes through data set performed
  • Constantly adjusting weights to reduce prediction error
  • When to terminate?
  • Should the stopping criterion be a fixed amount of computational “clock” time?
  • Short training times likely result in poor model
  • Terminate when SSE reaches threshold level?
  • Neural Networks are prone to overfitting
  • Memorizing patterns rather than generalizing
Termination Criteria (cont’d)
  • Cross-Validation Termination Procedure
    • Retain portion of training set as “hold out” data set
    • Train network on remaining data
    • Apply weights learned from training set to validation set
    • Track two sets of weights
    • “Current” weights on the training set, and “best” weights, i.e. those with the minimum SSE on the validation set
    • Terminate algorithm when the current weights have significantly greater SSE than the best weights (see the sketch after this list)
    • However, Neural Networks not guaranteed to arrive at global minimum for SSE
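A schematic sketch of this cross-validation (early-stopping) termination procedure; train_one_epoch and compute_sse stand in for the training and evaluation steps and are assumptions, not functions from the text:

    def train_with_validation(network, train_set, validation_set,
                              train_one_epoch, compute_sse,
                              patience=10, max_epochs=1000):
        # Keep the "best" weights seen so far on the hold-out (validation) set,
        # and stop once the current weights do significantly worse than the best.
        best_sse = float("inf")
        best_weights = network.copy()
        epochs_since_best = 0
        for epoch in range(max_epochs):
            train_one_epoch(network, train_set)          # adjust "current" weights
            val_sse = compute_sse(network, validation_set)
            if val_sse < best_sse:
                best_sse, best_weights = val_sse, network.copy()
                epochs_since_best = 0
            else:
                epochs_since_best += 1
            if epochs_since_best >= patience:            # validation error no longer improving
                break
        return best_weights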
Termination Criteria (cont’d)
  • Algorithm may become stuck in local minimum
  • Results in good, but not optimal solution
  • Not necessarily an insuperable problem
  • Multiple networks trained using different starting weights
  • Best model from group chosen as “final”
  • Stochastic back-propagation method acts to prevent getting stuck in local minimum
  • Random element introduced to gradient descent
  • Momentum term may be added to back-propagation algorithm
Learning Rate
    • Recall Learning Rate (Greek “eta”) is a constant
    • Helps adjust weights toward global minimum for SSE
  • Small Learning Rate
    • With small learning rate, weight adjustments small
    • Network takes an unacceptably long time to converge to a solution
  • Large Learning Rate
    • Suppose algorithm close to optimal solution
    • With large learning rate, network likely to “overshoot” optimal solution
[Figure: SSE plotted against weight w, showing the optimal value w*, the current value wCURRENT, and a new value wNEW that has overshot w*.]

Learning Rate (cont’d)
  • w* optimum weight for w, which has current value wCURRENT
  • According to Gradient Descent Rule, wCURRENT adjusted in direction of w*
  • Learning rate acts as a multiplier in the formula for ΔwCURRENT
  • A large learning rate may cause wNEW to jump past w*
  • wNEW may be farther away from w* than wCURRENT
Learning Rate (cont’d)
    • Next adjusted weight value on opposite side of w*
    • Leads to oscillation between two “slopes”
    • Network never settles down to minimum between them
  • Adjust Learning Rate as Training Progresses
    • Learning rate initialized with large value
    • Network quickly approaches general vicinity of optimal solution
    • As network begins to converge, learning rate gradually reduced
    • Avoids overshooting minimum
Momentum Term
  • Momentum term (“alpha”) makes back-propagation more powerful, and represents inertia
  • Large momentum values influence Δ wCURRENT to move same direction as previous adjustments
  • Including momentum in back-propagation results in adjustments becoming exponential average of all previous adjustments (Reed and Marks)
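In the usual formulation, the momentum term α enters the weight adjustment as follows (standard notation, not quoted from the slides):

  ΔwCURRENT = -η ∂SSE/∂wCURRENT + α ΔwPREVIOUS

so each new adjustment carries along a fraction α of the previous adjustment, which is why the sequence of adjustments behaves as an exponential average of its history.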
Momentum Term (cont’d)
  • Large alpha values enable algorithm to remember more terms in adjustment history
  • Small alpha values reduce inertial effects, and influence of previous adjustments
  • With alpha = 0, all previous components disappear
  • Momentum term encourages adjustments in same direction
  • Increases the rate at which the algorithm approaches the neighborhood of optimality
  • Early adjustments in same direction
  • Exponential average of adjustments, also in same direction
[Figure: SSE plotted against weight w, with the starting point I, local minima at A and C, and the global minimum at B; shown once for small momentum and once for large momentum.]

Momentum Term (cont’d)
    • Weight initialized to I, with local minima at A, C
    • Global minimum at B
  • Small Ball
    • Small “ball”, symbolizing a small momentum term, rolls down the hill
    • Gets stuck in first trough A
    • Momentum helps find A (local minimum), but not global minimum B
Momentum Term (cont’d)
  • Large Ball
    • Now, large “ball” symbolizes momentum term
    • Ball rolls down curve and passes over first and second hills
    • Overshoots global minimum B because of too much momentum
    • Settles to local minimum at C
    • Values of learning rate and momentum require careful consideration
    • Experimentation with different values necessary to obtain best results
Sensitivity Analysis
    • Opacity is drawback of Neural Networks
    • Flexibility enables modeling of non-linear behavior
    • However, limits ability to interpret results
    • No procedure exists for translating weights to decision rules
    • Sensitivity Analysis measures relative influence attributes have on solution
  • Sensitivity Analysis Procedure
    • Generate new observation xMEAN
    • Each attribute value of xMEAN equals mean value for attributes in data set
    • Find network output, for input xMEAN, called outputMEAN
Sensitivity Analysis (cont’d)
  • Attribute by attribute, vary xMEAN to reflect attribute Min and Max
  • Find network output for each variation, compare to outputMEAN
  • Determines which attributes, when varied from their Min to Max, have a greater effect on the network results than other attributes
  • For example, predicting stock price using price-earnings-ratio, dividend-yield
  • Varying price-earnings-ratio from its Min to Max results in output increase = 0.20
  • Varying dividend-yield from its Min to Max results in output increase = 0.30
  • Conclude network more sensitive to changes in dividend-yield
  • Therefore, more important predictor of stock price
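A compact sketch of this sensitivity-analysis procedure; predict stands in for the trained network's output function and is an assumption for illustration:

    def sensitivity_analysis(predict, data, attributes):
        # Build the artificial observation x_MEAN from the attribute means.
        x_mean = {a: sum(rec[a] for rec in data) / len(data) for a in attributes}
        output_mean = predict(x_mean)
        sensitivity = {}
        for a in attributes:
            lo = dict(x_mean, **{a: min(rec[a] for rec in data)})
            hi = dict(x_mean, **{a: max(rec[a] for rec in data)})
            # Effect of varying this attribute alone from its Min to its Max.
            sensitivity[a] = abs(predict(hi) - predict(lo))
        return output_mean, sensitivity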
Application of Neural Network Modeling
  • Insightful Miner models adult data set using network
  • Model uses one hidden layer with 8 nodes
  • Algorithm iterates 47 times (epochs) before terminating
  • Resulting network model shown
Application of Neural Network Modeling (cont’d)
  • Input nodes appear on left, one per attribute
  • Eight hidden nodes in center
  • Includes single output node, whether income <= $50,000
  • Table below shows network weights from model
  • Input nodes numbered 1 – 9, hidden nodes 22 – 29
  • Weight from input node Age (1) to topmost hidden node (22) is -0.97
Application of Neural Network Modeling (cont’d)
  • Estimated prediction accuracy of model is 82%
  • Approximately 75% of records have income <= $50,000
  • Sensitivity Analysis performed with Clementine, shown below
  • capital-gain and education-num attributes are most important predictors of those with income <= $50,000
  • However, gender (sex) is not highly predictive
  • Further analysis required to develop model