- 138 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about ' Discovering Knowledge in Data Daniel T. Larose, Ph.D.' - etoile

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Discovering Knowledge in DataDaniel T. Larose, Ph.D.

Chapter 7Neural Networks

Prepared by James Steck, Graduate Assistant

Axon

Cell Body

Neural Networks- Neural Networks
- Complex learning systems recognized in animal brains
- Single neuron has simple structure
- Interconnected sets of neurons perform complex learning tasks
- Human brain has 1015 synaptic connections
- Artificial Neural Networks attempt to replicate non-linear learning found in nature

Neural Networks (cont’d)

- Dendrites gather inputs from other neurons and combine information
- Then generate non-linear response when threshold reached
- Signal sent to other neurons via axon
- Artificial neuron model is similar
- Data inputs (xi) are collected from upstream neurons input to combination function (sigma)

Neural Networks(cont’d)

- Activation function reads combined input and produces non-linear response (y)
- Response channeled downstream to other neurons
- What problems applicable to Neural Networks?
- Quite robust with respect to noisy data
- Can learn and work around erroneous data
- Results opaque to human interpretation
- Often require long training times

:a nondecreasing function, which can be linear or non linear

Ex1

f(net)

1 If net ≥ θ

1

f(net)=

0 If net<θ

net

θ

Binary threshold function [2]

Identity function

f(net) = net for all net

Ex4Hard limit function

(sign function or bipolar function)

f(net) = +1 if net >= 0

= -1 if net < 0

Input units taking data from strain gauges

Single output unit in this case indicates the health of the aircraft

Figure 11.1 Application of neural network to aircraft health monitoring [2].

Input and Output Encoding

- Neural Networks require attribute values encoded to [0, 1]
- Numeric
- Apply Min-max Normalization to continuous variables
- Works well when Min and Max known
- Also assumes new data values occur within Min-Max range
- Values outside range may be rejected or mapped to Min or Max

Input and Output Encoding (cont’d)

- Categorical
- Indicator Variables used when number of category values small
- Categorical variable with k classes translated to k – 1 indicator variables
- For example, Gender attribute values are “Male”, “Female”, and “Unknown”
- Classes k = 3
- Create k – 1 = 2 indicator variables named Male_I and Female_I
- Male records have values Male_I = 1, Female_I = 0
- Female records have values Male_I = 0, Female_I = 1
- Unknown records have values Male_I = 0, Female_I = 0

Input and Output Encoding (cont’d)

- Apply caution when reordering unordered categorical values to [0, 1] range
- For example, attribute Marital_Status has values “Divorced”, “Married”, “Separated”, “Single”, “Widowed”, and “Unknown”
- Values coded as 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0, respectively
- Coding implies “Divorced” is closer to “Married”, and farther from “Separated”
- Neural Network only aware of numeric values
- Naive to pre-encoded meaning of categorical values
- Results of network model may be meaningless

Input and Output Encoding (cont’d)

- Output
- Neural Networks always return continuous values [0, 1]
- Many classification problems have two outcomes
- Solution uses threshold established a priori in single output node to separate classes
- For example, target variable is “leave” or “stay”
- Threshold value is “leave if output >= 0.67”
- Single output node value = 0.72 classifies record as “leave”

Input and Output Encoding (cont’d)

- Single output nodes applicable when target classes ordered
- For example, classify elementary-level reading ability
- Define thresholds Classify
- “if 0.00 <= output < 0.25” “first-grade”
- “if 0.25 <= output < 0.50” “second-grade”
- “if 0.50 <= output < 0.75” “third-grade”
- “if output >= 0.75” “fourth-grade”
- Fine-tuning of thresholds may be required

Neural Networks for Estimation and Prediction

- Continuous output useful for estimation and prediction problems
- For example, predict a stock price 3 months from now
- Input values normalized using Min-max Normalization
- Output values [0, 1] denormalized to represent scale of stock prices

Prediction = output(data range) + minimum

- For example, suppose stock price ranged $20 to $30 dollars
- Network output prediction value = 0.69

0.69(30.0 – 20.0) + 20.0 = $26.90

Hidden Layer

Output Layer

W0A

W1A

W1B

WAZ

W2A

W2B

WBZ

W0Z

W3A

W0B

W3B

Node 1

Node A

Node 2

Node Z

Node B

Node 3

Simple Example of a Neural Network- Neural Network consists of layered, feedforward, completely connected network of nodes
- Feedforward restricts network flow to single direction
- Flow does not loop or cycle
- Network composed of two or more layers

Linearly separable VS Nonlinearly separable problems

X2 X2

X1 X1

“Linearly separable”

“Nonlinearly separable”

x1

y1

z1

x2

y2

Generally, 3 layers are sufficient to solve approximately any mathematical functions.

Linearly separable VS Nonlinearly separable problems

X2 X2

X1 X1

“Linearly separable”

“Nonlinearly separable”

Simple Example of a Neural Network (cont’d)

- Most networks have Input, Hidden, Output layers
- Network may contain more than one hidden layer
- Network is completely connected
- Each node in given layer, connected to every node in next layer
- Every connection has weight (Wij) associated with it
- Weight values randomly assigned 0 to 1 by algorithm
- Number of input nodes dependent on number of predictors
- Number of hidden and output nodes configurable

Simple Example of a Neural Network (cont’d)

- How many nodes in hidden layer?
- Large number of nodes increases complexity of model
- Detailed patterns uncovered in data
- Leads to overfitting, at expense of generalizability
- Reduce number of hidden nodes when overfitting occurs
- Increase number of hidden nodes when training accuracy unacceptably low
- Input layer accepts values from input variables
- Values passed to hidden layer nodes
- Input layer nodes lack detailed structure compared to hidden and output layer nodes

- Step0: Initialize weights
- Step1: While stopping condition is FALSE (i.e SSE > 0.01)
- Step2: For each training record
- Feedforward
- Step3: Each input nodes & a bias broadcast signals
- Step4: Compute net_input at each hidden node and calculate the output of each hidden node
- Step5: Compute net_input at each output node and calculate the output of each output node
- Backpropagation of error
- Step6: Compute error at each output node and weight corrections
- Step7: Compute error at each hidden node and weight corrections
- Step8: Update weights and biases in both layers
- Step9: Test stopping condition

Simple Example of a Neural Network (cont’d)

- Combination function produces linear combination of node inputs and connection weights to single scalar value
- For node j, xij is ith input
- Wij is weight associated with ith input node
- I+ 1 inputs to node j
- x1, x2, ..., xI are inputs from upstream nodes
- x0 is constant input value = 1.0
- Each input node has extra input W0jx0j = W0j

Simple Example of a Neural Network (cont’d)

- The scalar value computed for hidden layer Node A equals
- For Node A, netA = 1.32 is input to activation function
- Neurons “fire” in biological organisms
- Signals sent between neurons when combination of inputs cross threshold

Simple Example of a Neural Network (cont’d)

- Firing response not necessarily linearly related to increase in input stimulation
- Neural Networks model behavior using non-linear activation function
- Sigmoid function most commonly used
- In Node A, sigmoid function takes netA = 1.32 as input and produces output

Simple Example of a Neural Network (cont’d)

- Node A outputs 0.7892 along connection to Node Z, and becomes component of netZ
- Before netZ is computed, contribution from Node B required
- Node Z combines outputs from Node A and Node B, through netZ

Simple Example of a Neural Network (cont’d)

- Inputs to Node Z not data attribute values
- Rather, outputs are from sigmoid function in upstream nodes
- Value 0.8750 output from Neural Network on first pass
- Represents predicted value for target variable, given first observation

Sigmoid Activation Function

- Sigmoid function combines nearly linear, curvilinear, and nearly constant behavior depending on input value
- Function nearly linear for domain values -1 < x < 1
- Becomes curvilinear as values move away from center
- At extreme values, f(x) is nearly constant
- Moderate increments in x produce variable increase in f(x), depending on location of x
- Sometimes called “Squashing Function”
- Takes real-valued input and returns values [0, 1]

Back-Propagation

- Neural Networks are supervised learning method
- Require target variable
- Each observation passed through network results in output value
- Output value compared to actual value of target variable
- (Actual – Output) = Error
- Prediction error analogous to residuals in regression models
- Most networks use Sum of Squares (SSE) to measure how well predictions fit target values

Back-Propagation (cont’d)

- Squared prediction errors summed over all output nodes, and all records in data set
- Model weights constructed that minimize SSE
- Actual values that minimize SSE are unknown
- Weights estimated, given the data set
- Unlike least-squares regression, no closed-form solution exists for minimizing SSE

Gradient Descent Method

- Gradient Descent Method determines set of weights that minimize SSE
- Given a set of m weights w = w1, w2, ..., wm in network model
- Find values for weights that, together, minimize SSE
- Gradient Descent determines direction to adjust weights, that decreases SSE
- Gradient of SSE, with respect to vector of weights w, is vector derivative:

w1L

W*1

w1R

W1

Gradient Descent Method (cont’d)- Gradient Descent is illustrated using single weight w1
- Figure plots SSE error against range of values for w1
- Preferred values for w1 minimize SSE
- Optimal value for w1 is w*1

Gradient Descent Method (cont’d)

- Develop rule defining movement from current w1 to optimal value w*1
- If current weight near w1L, increasingw approaches w*1
- If current weight near w1R, decreasingw approaches w*1
- Gradient of SSE, with respect to weight wCURRENT, is slope of SSE curve at wCURRENT
- Value wCURRENT close to w1L, slope is negative
- Value wCURRENT close to w1R, slope is positive

Gradient Descent Method (cont’d)

- Direction for adjusting wCURRENT is negative sign of derivative at SSE at wCURRENT
- To adjust, use magnitude of the derivative of SSE at wCURRENT
- When curve steep, adjustment large
- When curve nearly flat, adjustment small
- Learning Rate (Greek “eta”) has values [0, 1]

Back-Propagation Rules

- Back-propagation percolates prediction error for record back through network
- Partitioned responsibility for prediction error assigned to various connections
- Weights of connections adjusted to decrease error, using Gradient Descent Method
- Back-propagation rules defined (Mitchell)

Back-Propagation Rules (cont’d)

- Error responsibility computed using partial derivative of the sigmoid function with respect to netj
- Values take one of two forms
- Rules show why input values require normalization
- Large input values xij would dominate weight adjustment
- Error propagation would be overwhelmed, and learning stifled

Example of Back-Propagation

- Recall that first pass through network yielded output = 0.8750
- Assume actual target value = 0.8, and learning rate = 0.01
- Prediction error = 0.8 - 0.8750 = -0.075
- Neural Networks use stochastic back-propagation
- Weights updated after each record processed by network
- Adjusting the weights using back-propagation shown next
- Error responsibility for Node Z, an output node, found first

Example of Back-Propagation (cont’d)

- Now adjust “constant” weight w0Z using rules
- Move upstream to Node A, a hidden layer node
- Only node downstream from Node A is Node Z

Example of Back-Propagation (cont’d)

- Adjust weight wAZ using back-propagation rules
- Connection weight between Node A and Node Z adjusted from 0.9 to 0.899353
- Next, Node B is hidden layer node
- Only node downstream from Node B is Node Z

Example of Back-Propagation (cont’d)

- Adjust weight wBZ using back-propagation rules
- Connection weight between Node B and Node Z adjusted from 0.9 to 0.89933
- Similarly, application of back-propagation rules continues to input layer nodes
- Weights {w1A, w2A, w3A , w0A} and {w1B, w2B, w3B , w0B} updated by process

Example of Back-Propagation (cont’d)

- Now, all network weights in model are updated
- Each iteration based on single record from data set
- Summary
- Network calculated predicted value for target variable
- Prediction error derived
- Prediction error percolated back through network
- Weights adjusted to generate smaller prediction error
- Process repeats record by record

Termination Criteria

- Many passes through data set performed
- Constantly adjusting weights to reduce prediction error
- When to terminate?
- Stopping criterion may be computational “clock” time?
- Short training times likely result in poor model
- Terminate when SSE reaches threshold level?
- Neural Networks are prone to overfitting
- Memorizing patterns rather than generalizing

Termination Criteria (cont’d)

- Cross-Validation Termination Procedure
- Retain portion of training set as “hold out” data set
- Train network on remaining data
- Apply weights learned from training set to validation set
- Measure two sets of weights
- “Current” weights for training set, “Best” weights with minimum SSE on validation set
- Terminate algorithm when current weights has significantly greater SSE than best weights
- However, Neural Networks not guaranteed to arrive at global minimum for SSE

Termination Criteria (cont’d)

- Algorithm may become stuck in local minimum
- Results in good, but not optimal solution
- Not necessarily an insuperable problem
- Multiple networks trained using different starting weights
- Best model from group chosen as “final”
- Stochastic back-propagation method acts to prevent getting stuck in local minimum
- Random element introduced to gradient descent
- Momentum term may be added to back-propagation algorithm

Learning Rate

- Recall Learning Rate (Greek “eta”) is a constant
- Helps adjust weights toward global minimum for SSE
- Small Learning Rate
- With small learning rate, weight adjustments small
- Network takes unacceptable time converging to solution
- Large Learning Rate
- Suppose algorithm close to optimal solution
- With large learning rate, network likely to “overshoot” optimal solution

wCURRENT

W*

wNEW

W

Learning Rate (cont’d)- w* optimum weight for w, which has current value wCURRENT
- According to Gradient Descent Rule, wCURRENT adjusted in direction of w*
- Learning rate acts as multiplier to formula Δ wCURRENT
- Large learning may cause wNEW to jump past w*
- wNEW may be farther away from w* than wCURRENT

Learning Rate (cont’d)

- Next adjusted weight value on opposite side of w*
- Leads to oscillation between two “slopes”
- Network never settles down to minimum between them
- Adjust Learning Rate as Training Progresses
- Learning rate initialized with large value
- Network quickly approaches general vicinity of optimal solution
- As network begins to converge, learning rate gradually reduced
- Avoids overshooting minimum

Momentum Term

- Momentum term (“alpha”) makes back-propagation more powerful, and represents inertia
- Large momentum values influence Δ wCURRENT to move same direction as previous adjustments
- Including momentum in back-propagation results in adjustments becoming exponential average of all previous adjustments (Reed and Marks)

Momentum Term (cont’d)

- Large alpha values enable algorithm to remember more terms in adjustment history
- Small alpha values reduce inertial effects, and influence of previous adjustments
- With alpha = 0, all previous components disappear
- Momentum term encourages adjustments in same direction
- Increases rate which algorithm approaches neighborhood of optimality
- Early adjustments in same direction
- Exponential average of adjustments, also in same direction

SSE

I

A

B

C

w

I

A

B

C

w

Small Momentum

Large Momentum

Momentum Term (cont’d)- Weight initialized to I, with local minima at A, C
- Global minimum at B
- Small Ball
- Small “ball”, symbolized as momentum, rolls down hill
- Gets stuck in first trough A
- Momentum helps find A (local minimum), but not global minimum B

Momentum Term (cont’d)

- Large Ball
- Now, large “ball” symbolizes momentum term
- Ball rolls down curve and passes over first and second hills
- Overshoots global minimum B because of too much momentum
- Settles to local minimum at C
- Values of learning rate and momentum require careful consideration
- Experimentation with different values necessary to obtain best results

Sensitivity Analysis

- Opacity is drawback of Neural Networks
- Flexibility enables modeling of non-linear behavior
- However, limits ability to interpret results
- No procedure exists for translating weights to decision rules
- Sensitivity Analysis measures relative influence attributes have on solution
- Sensitivity Analysis Procedure
- Generate new observation xMEAN
- Each attribute value of xMEAN equals mean value for attributes in data set
- Find network output, for input xMEAN, called outputMEAN

Sensitivity Analysis (cont’d)

- Attribute by attribute, vary xMEAN to reflect attribute Min and Max
- Find network output for each variation, compare to outputMEAN
- Determines attributes, varying from their Min to Max, having greater effect on network results, compared to other attributes
- For example, predicting stock price using price-earnings-ratio, dividend-yield
- Varying price-earnings-ratio from its Min to Max results in output increase = 0.20
- Varying dividend-yield from its Min to Max results in output increase = 0.30
- Conclude network more sensitive to changes in dividend-yield
- Therefore, more important predictor of stock price

Application of Neural Network Modeling

- Insightful Miner models adult data set using network
- Model uses one hidden layer with 8 nodes
- Algorithm iterates 47 times (epochs) before terminating
- Resulting network model shown

Application of Neural Network Modeling (cont’d)

- Input nodes appear on left, one per attribute
- Eight hidden nodes in center
- Includes single output node, whether income <= $50,000
- Table below shows network weights from model
- Input nodes numbered 1 – 9, hidden nodes 22 – 29
- Weight from input node Age (1) to topmost hidden node (22) is -0.97

Application of Neural Network Modeling (cont’d)

- Estimated prediction accuracy of model is 82%
- Approximately 75% of records have income <= $50,000
- Sensitivity Analysis performed with Clementine, shown below
- capital-gain and education-num attributes are most important predictors of those with income <= $50,000
- However, gender (sex) not highly-predictive
- Further analysis required to develop model

Download Presentation

Connecting to Server..