## Data Mining and Knowledge Acquisition — Chapter 5


Presentation Transcript

Classification

- Classification:
- predicts categorical class labels
- Typical Applications
- {credit history, salary} -> credit approval (Yes/No)
- {temperature, humidity} -> rain (Yes/No)

Linear Classification

- Binary Classification problem
- The data above the red line belongs to class ‘x’
- The data below the red line belongs to class ‘o’
- Examples: SVM, Perceptron, probabilistic classifiers

[Figure: scatter plot of the two classes; the ‘x’ points lie above the separating line and the ‘o’ points lie below it]

Neural Networks

- Analogy to Biological Systems (Indeed a great example of a good learning system)
- Massive Parallelism allowing for computational efficiency
- The first learning algorithm came in 1959 (Rosenblatt), who suggested that if a target output value is provided for a single neuron with fixed inputs, the weights can be incrementally adjusted to produce that output using the perceptron learning rule

Neural Networks

- Advantages
- prediction accuracy is generally high
- robust, works when training examples contain errors
- output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
- fast evaluation of the learned target function
- Criticism
- long training time
- difficult to understand the learned function (weights)
- not easy to incorporate domain knowledge

Network Topology

- number of input units
- number of hidden layers
- number of nodes in each hidden layer
- number of output nodes
- can handle discrete or continuous variables
- continuous variables are normalised to the 0..1 interval
- for discrete variables
- use k input units for an attribute with k levels
- use k output units for an attribute with k levels if k > 2
- Example: A has three distinct values a1, a2, a3
- three input variables I1, I2, I3; when A = a1, I1 = 1 and I2 = I3 = 0 (see the sketch below)
- feed-forward: no cycles back to input units
- fully connected: each unit is connected to every unit in the next layer
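A minimal sketch of the 1-of-k input coding described above, in Python; the attribute A and its levels a1, a2, a3 follow the slide, while the function name and the printed calls are illustrative.

```python
# Sketch of 1-of-k input encoding for a discrete attribute.
def one_of_k(value, levels):
    """Return k input units, one per level: 1.0 for the matching level, 0.0 elsewhere."""
    return [1.0 if value == level else 0.0 for level in levels]

# Attribute A with three distinct values a1, a2, a3 -> three input units I1, I2, I3
levels_A = ["a1", "a2", "a3"]
print(one_of_k("a1", levels_A))  # [1.0, 0.0, 0.0]  (I1=1, I2=I3=0)
print(one_of_k("a3", levels_A))  # [0.0, 0.0, 1.0]
```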

Example: Sample iterations

- A network suggested to solve the XOR problem (figure 4.7 of WK, pages 96-99)
- learning rate is 1 for simplicity
- I1 = I2 = 1
- T = 0 (true output)

| I1 | I2 | T | P |
| --- | --- | --- | --- |
| 1.0 | 1.0 | 0 | 0.63 |

Variable Encodings

- Continuous variables
- Examples:
- dollar amounts
- averages: average sales, volume
- ratios: income to debt, payment to loan
- physical measures: area, temperature...
- Transform to
- 0 to 1, or 0.1 to 0.9
- -1.0 to +1.0, or -0.9 to 0.9
- z-scores: z = (x - mean_x) / std_x (see the sketch below)
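A small sketch of the two transformations listed above (rescaling to a target interval and z-scoring); the numeric values in the calls are made up for illustration.

```python
# Sketch of continuous-variable transformations: min-max rescaling and z-scores.
def min_max(x, lo, hi, new_lo=0.0, new_hi=1.0):
    """Rescale x from [lo, hi] to [new_lo, new_hi] (e.g. 0..1 or 0.1..0.9)."""
    return new_lo + (x - lo) * (new_hi - new_lo) / (hi - lo)

def z_score(x, mean_x, std_x):
    """Standardize: z = (x - mean_x) / std_x."""
    return (x - mean_x) / std_x

print(min_max(50.0, 0.0, 200.0))            # 0.25
print(min_max(50.0, 0.0, 200.0, 0.1, 0.9))  # 0.3 (up to rounding)
print(z_score(50.0, 80.0, 20.0))            # -1.5
```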

Continuous variables

- When a new observation arrives
- it may be out of range
- What to do:
- plan for a larger range
- reject out-of-range values
- peg values lower than the minimum to the bottom of the range
- and values higher than the maximum to the top of the range

Ordinal variables

- Discrete integers
- Examples:
- age ranges: young, mid, old
- income: low, mid, high
- number of children
- Transform to the 0-1 interval
- Example: 5 categories of age
- 1 young, 2 mid-young, 3 mid, 4 mid-old, 5 old
- mapped onto values between 0 and 1

Thermometer coding

| value | code | output |
| --- | --- | --- |
| 0 | 0 0 0 0 | 0/16 = 0 |
| 1 | 1 0 0 0 | 8/16 = 0.5 |
| 2 | 1 1 0 0 | 12/16 = 0.75 |
| 3 | 1 1 1 0 | 14/16 = 0.875 |

- Useful for academic grades or bond ratings
- a difference on one side of the scale is more important than on the other side of the scale (see the sketch below)
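A short sketch of the thermometer coding in the table above, assuming a 4-bit code read as a fraction of 16; only the coding rule itself comes from the slide.

```python
# Sketch of thermometer coding: category i turns on the first i bits,
# and the bit pattern is read as a fraction of the full 4-bit code.
def thermometer(category, n_bits=4):
    bits = [1 if i < category else 0 for i in range(n_bits)]
    value = sum(b * 2 ** (n_bits - 1 - i) for i, b in enumerate(bits)) / (2 ** n_bits)
    return bits, value

for cat in range(4):
    print(cat, *thermometer(cat))
# 0 [0, 0, 0, 0] 0.0
# 1 [1, 0, 0, 0] 0.5
# 2 [1, 1, 0, 0] 0.75
# 3 [1, 1, 1, 0] 0.875
```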

Nominal Variables

- Examples:
- gender, marital status, occupation
- 1- treat like ordinal variables
- Example: marital status with 5 codes:
- single, divorced, married, widowed, unknown
- mapped to -1, -0.5, 0, 0.5, 1
- the network treats them as ordinal
- even though the order does not make sense

2- break into flags

- One variable for each category
- 1-of-N coding
- Gender has three values: male, female, unknown

| value | flag 1 | flag 2 | flag 3 |
| --- | --- | --- | --- |
| male | 1 | -1 | -1 |
| female | -1 | 1 | -1 |
| unknown | -1 | -1 | 1 |

1 of N-1 coding

| value | flag 1 | flag 2 |
| --- | --- | --- |
| male | 1 | -1 |
| female | -1 | 1 |
| unknown | -1 | -1 |

- 3- replace the variable with a numerical one (a coding sketch follows below)
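A brief sketch of the 1-of-N and 1-of-(N-1) flag codings above, using the -1/+1 convention from the gender example; the function names are illustrative.

```python
# Sketch of the two flag codings, with -1/+1 flags as in the gender example.
def one_of_n(value, levels):
    """1-of-N: one flag per level, +1 for the matching level, -1 elsewhere."""
    return [1 if value == lvl else -1 for lvl in levels]

def one_of_n_minus_1(value, levels):
    """1-of-(N-1): drop the last level; it is encoded as all -1."""
    return one_of_n(value, levels)[:-1]

levels = ["male", "female", "unknown"]
print(one_of_n("male", levels))             # [1, -1, -1]
print(one_of_n_minus_1("female", levels))   # [-1, 1]
print(one_of_n_minus_1("unknown", levels))  # [-1, -1]
```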

Time Series variables

- Stock market prediction
- Output: IMKB100 at time t
- Inputs:
- IMKB100 at t-1, t-2, t-3, ...
- dollar rate at t-1, t-2, t-3, ...
- interest rate at t-1, t-2, t-3, ...
- Day-of-week variables
- can be coded as flags: Monday 1 0 0 0 0, ..., Friday 0 0 0 0 1
- or Monday to Friday can be mapped onto a scale
- from -1 to 1 or 0 to 1

- The ultimate objective of training
- obtain a set of weights that classifies almost all the tuples in the training data correctly
- Steps
- initialize the weights with random values
- repeat until the classification error falls below a threshold (each pass over the data is an epoch)
- feed the input tuples into the network one by one
- for each unit
- compute the net input to the unit as a linear combination of all the inputs to the unit
- compute the output value using the activation function
- compute the error
- update the weights and the bias

Example: Stock market prediction

- input variables:
- individual stock prices at t-1, t-2,t-3,...
- stock index at t-1, t-2, t-3,
- inflation rate, interest rate, exchange rates $
- output variable:
- predicted stock price at the next time step
- train the network with known cases
- adjust weights
- experiment with different topologies
- test the network
- use the tested network for predicting unknown stock prices

Other business Applications (1)

- Marketing and sales
- Prediction
- Sales forecasting
- Price elasticity forecasting
- Customer response
- Classification
- Target marketing
- Customer satisfaction
- Loyalty and retention
- Clustering
- segmentation

Other business Applications (2)

- Risk Management
- Credit scoring
- Financial health
- Classification
- Bankruptcy classification
- Fraud detection
- Credit scoring
- Clustering
- Credit scoring
- Risk assessment

Other business Applications (3)

- Finance
- Prediction
- Hedging
- Future prediction
- Forex and stock prediction
- Classification
- Stock trend classification
- Bond rating
- Clustering
- Economic rating
- Mutual fund selection

Perceptrons

- WK page 91, section 4.2
- N inputs Ii, i = 1..N
- single output O
- two classes C0 and C1, denoted by 0 and 1
- one node
- output:
- O = 1 if w1I1 + w2I2 + ... + wNIN + w0 > 0
- O = 0 if w1I1 + w2I2 + ... + wNIN + w0 <= 0
- sometimes θ is used for the constant term w0
- called the bias or threshold in ANN terminology

Perceptron training procedure(rule) (1)

- Find weights w that separate each training sample correctly
- initial weights are chosen randomly
- weight updating:
- samples are presented in sequence
- after presenting each case the weights are updated:
- wi(t+1) = wi(t) + Δwi(t)
- θ(t+1) = θ(t) + Δθ(t)
- Δwi(t) = η(T - O)Ii
- Δθ(t) = η(T - O)
- O: output of the perceptron, T: true output for the case, η: learning rate, 0 < η < 1, usually around 0.1

Perceptron training procedure (rule) (2)

- each case is presented and
- the weights are updated
- after all cases have been presented, if
- the error is not zero
- then present all cases once more
- each such cycle is called an epoch
- continue until the error is zero, which happens for perfectly separable samples

Perceptron convergence theorem:

- if the sample is linearly separable the perceptron will eventually converge: it separates all the samples correctly
- error = 0
- the learning rate can even be one
- to increase stability it is gradually decreased
- this slows down convergence
- linearly separable: a line or hyperplane can separate all the samples correctly

If classes are not perfectly linearly separable

- if a plane or line can not separate classes completely
- The procedure will not converge and will keep on cycling through the data forever

Example calculations

- Two inputs, w1 = 0.25, w2 = 0.5, w0 or θ = -0.5
- suppose I1 = 1.5, I2 = 0.5
- learning rate η = 0.1
- and T = 0 (true output)
- the perceptron classifies this as:
- 0.25*1.5 + 0.5*0.5 - 0.5 = 0.125 > 0, so O = 1
- w1(t+1) = 0.25 + 0.1(0-1)1.5 = 0.1
- w2(t+1) = 0.5 + 0.1(0-1)0.5 = 0.45
- θ(t+1) = -0.5 + 0.1(0-1) = -0.6
- with the new weights:
- 0.1*1.5 + 0.45*0.5 - 0.6 = -0.225 < 0, so O = 0
- no error (see the sketch below)
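A small sketch of the perceptron rule applied to this worked example; the weights, inputs, learning rate and target are exactly those above, while the function names are illustrative.

```python
# Sketch of the perceptron update, reproducing the worked example:
# w1=0.25, w2=0.5, theta=-0.5, inputs I=(1.5, 0.5), learning rate 0.1, target T=0.
def predict(w, theta, I):
    net = sum(wi * xi for wi, xi in zip(w, I)) + theta
    return 1 if net > 0 else 0

def update(w, theta, I, T, eta=0.1):
    O = predict(w, theta, I)
    w = [wi + eta * (T - O) * xi for wi, xi in zip(w, I)]
    theta = theta + eta * (T - O)
    return w, theta, O

w, theta = [0.25, 0.5], -0.5
I, T = [1.5, 0.5], 0
w, theta, O = update(w, theta, I, T)  # O=1 (misclassified), so the weights change
print(w, theta)                       # approx. [0.1, 0.45] and -0.6 (up to rounding)
print(predict(w, theta, I))           # 0 -> now classified correctly, no error
```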

[Figure: before the update the boundary 0.25*I1 + 0.5*I2 - 0.5 = 0 puts the point on the class 1 side, so the true class 0 point is misclassified; after the update the boundary 0.1*I1 + 0.45*I2 - 0.6 = 0 puts it on the class 0 side, so it is classified correctly]

XOR: exclusive OR problem

- Two inputs I1, I2
- when both agree
- (I1 = 0 and I2 = 0) or (I1 = 1 and I2 = 1)
- class 0, O = 0
- when they disagree
- (I1 = 0 and I2 = 1) or (I1 = 1 and I2 = 0)
- class 1, O = 1
- one line cannot solve XOR
- but two lines can, as sketched below
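A tiny sketch showing that a pair of lines solves XOR; the two lines used here (I1 + I2 - 0.5 = 0 and I1 + I2 - 1.5 = 0) are one convenient choice for illustration, not the weights of WK figure 4.7.

```python
# XOR cannot be separated by one line, but two lines can:
# class 1 exactly when the point is above the first line and below the second.
def xor_by_two_lines(i1, i2):
    above_first = (i1 + i2 - 0.5) >= 0   # at least one input is on
    below_second = (i1 + i2 - 1.5) < 0   # not both inputs are on
    return 1 if (above_first and below_second) else 0

for i1, i2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(i1, i2, xor_by_two_lines(i1, i2))
# 0 0 0
# 0 1 1
# 1 0 1
# 1 1 0
```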

- Study section 4.3 in WK
- one-layer networks separate classes with a single hyperplane
- two-layer networks can separate any convex region
- and three-layer networks can separate any non-convex boundary
- examples: see notes

[Figure: triangle ABC in the (I1, I2) plane; points inside the triangle are class O, points outside are class +]

- inside the triangle ABC is class O, outside the triangle is class +
- class O if: I1 + I2 >= 10, I1 <= I2, and I2 <= 10
- output of hidden node a:
- 1 (class O side) if w11*I1 + w12*I2 + w10 >= 0
- 0 (class + side) if w11*I1 + w12*I2 + w10 < 0
- so the weights are w11 = 1, w12 = 1, w10 = -10
- output of hidden node b:
- 1 if w21*I1 + w22*I2 + w20 >= 0
- 0 if w21*I1 + w22*I2 + w20 < 0
- so the weights are w21 = -1, w22 = 1, w20 = 0

[Figure repeats the triangle ABC example]

- output of hidden node c:
- 1 if w31*I1 + w32*I2 + w30 >= 0
- 0 if w31*I1 + w32*I2 + w30 < 0
- so the weights are w31 = 0, w32 = -1, w30 = 10
- an object is class O if all hidden units predict it as class O
- output is 1 if w'a*Ha + w'b*Hb + w'c*Hc + wd >= 0
- output is 0 if w'a*Ha + w'b*Hb + w'c*Hc + wd < 0

[Figure repeats the triangle ABC example]

- weights of output node d:
- w'a = 1, w'b = 1, w'c = 1
- wd = -3 + x, where x is a small positive number
- ADBC is the union of two convex regions, in this case triangles
- each triangular region can be separated by a two-layer network
- two hidden layers can separate any non-convex region

[Figure: non-convex region ADBC in the (I1, I2) plane; hidden units a, b, c in the first hidden layer, units d and e in the second hidden layer, and output unit f]

- d separates ABC, e separates ADB
- ADBC is the union of ABC and ADB
- output is class O if w''f0 + w''f1*Hd + w''f2*He >= 0
- w''f0 = -0.99, w''f1 = 1, w''f2 = 1 (an OR of the two convex regions)

In practice the boundaries are not known, but by increasing the number of hidden nodes a two-layer perceptron can separate any convex region

- if the data are perfectly separable
- by adding a second hidden layer and OR-ing the convex regions, any non-convex boundary can be separated
- if the data are perfectly separable
- the weights are unknown but are found by training the network

Network Training

- The ultimate objective of training
- obtain a set of weights that classifies almost all the tuples in the training data correctly
- Steps
- initialize the weights with random values
- feed the input tuples into the network one by one
- for each unit
- compute the net input to the unit as a linear combination of all the inputs to the unit
- compute the output value using the activation function
- compute the error
- update the weights and the bias

Back propagation algorithm

- LMS uses a linear activation function
- not so useful
- the threshold activation function is very good at separating but is not differentiable
- back propagation uses the logistic function
- O = 1/(1 + exp(-N)) = (1 + exp(-N))^-1
- N = w1I1 + w2I2 + ... + wNIN + θ
- the derivative of the logistic function
- dO/dN = O*(1 - O), expressed as a function of the output, where O = 1/(1 + exp(-N)) and 0 <= O <= 1

Minimize total error again

- E = (1/2) Σ_{d=1..N} Σ_{k=1..M} (T_k,d - O_k,d)^2
- where N is the number of cases
- M is the number of output units
- T_k,d: true value of sample d at output unit k
- O_k,d: predicted value of sample d at output unit k
- the algorithm updates the weights by a method similar to the delta rule
- for each output unit:
- Δw_ij = η Σ_{d=1..N} O_d(1 - O_d)(T_d - O_d) I_i,d (epoch update), or
- Δw_ij(t) = η O(1 - O)(T - O) I_i | when objects are
- Δθ_j(t) = η O(1 - O)(T - O) | presented sequentially
- here O(1 - O)(T - O) = error_j is the error term

so Δw_ij(t) = η * error_j * I_i, or Δθ_j(t) = η * error_j

- for all training samples
- the new weights are
- w_i(t+1) = w_i(t) + Δw_i(t)
- θ_i(t+1) = θ_i(t) + Δθ_i(t)
- but for hidden-layer weights no target value is available
- Δw_ih(t) = η O_h(1 - O_h) (Σ_{k=1..M} error_k * w_kh) I_i
- Δθ_h(t) = η O_h(1 - O_h) (Σ_{k=1..M} error_k * w_kh)
- the error term of each output unit is weighted by the connecting weight and summed to find the hidden unit's error derivative
- the weight from hidden unit h to output unit k determines how responsible h is for the error at output unit k (see the sketch below)
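A compact sketch of one backpropagation case update for a 2-2-1 network with logistic units, following the output-layer and hidden-layer formulas above; the starting weights, biases and inputs are illustrative values, not the figure 4.7 network.

```python
import math

def logistic(n):
    return 1.0 / (1.0 + math.exp(-n))

eta = 1.0                       # learning rate
I = [1.0, 1.0]                  # inputs I1, I2
T = 0.0                         # true output
w_h = [[0.5, 0.4], [0.9, 1.0]]  # w_h[j][i]: weight from input i to hidden unit j
th_h = [0.8, -0.1]              # hidden biases
w_o = [-1.2, 1.1]               # weights from hidden units to the output unit
th_o = 0.3                      # output bias

# forward pass
H = [logistic(sum(w * x for w, x in zip(w_h[j], I)) + th_h[j]) for j in range(2)]
O = logistic(sum(w * h for w, h in zip(w_o, H)) + th_o)

# output-layer error term: error = O(1-O)(T-O)
err_o = O * (1 - O) * (T - O)

# hidden-layer error terms: err_j = H_j(1-H_j) * (err_o * w_o[j]), using the old output weights
err_h = [H[j] * (1 - H[j]) * err_o * w_o[j] for j in range(2)]

# case update of all weights and biases: delta = eta * error * input
w_o = [w_o[j] + eta * err_o * H[j] for j in range(2)]
th_o = th_o + eta * err_o
for j in range(2):
    w_h[j] = [w_h[j][i] + eta * err_h[j] * I[i] for i in range(2)]
    th_h[j] = th_h[j] + eta * err_h[j]

print(round(O, 3))  # prediction before the update
```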

Example: Sample iterations

- A network suggested to solve the XOR problem (figure 4.7 of WK, pages 96-99)
- learning rate is 1 for simplicity
- I1 = I2 = 1
- T = 0 (true output)

Exercise

- Carry out one more iteration for the XOR problem

Practical Applications of BP

- Revision by epoch or by case
- dE/dw_j = Σ_{i=1..N} O_i(1 - O_i)(T_i - O_i) I_ij
- where i = 1,..,N is the index over samples
- N is the sample size
- j: index over inputs; I_ij is input variable j for sample i
- this is the theoretical (exact) derivative
- the information in all samples is used in one update of weight j
- weights are revised after each epoch

If samples are presented one by one, weight j is updated after presenting each sample by

- dE/dw_j = O_i(1 - O_i)(T_i - O_i) I_ij
- this is just one term of the epoch-update (gradient) formula
- called case update
- updating by case is more common and gives better results
- it is less likely to get stuck in local minima
- Random or sequential presentation
- in each epoch the cases are presented in
- sequential order or
- in random order

sequential presentation

1 2 3 4 5 .. (epoch 1) | 1 2 3 4 5 .. (epoch 2) | 1 2 3 4 5 .. (epoch 3) | ...

random presentation

1 2 5 4 3 .. (epoch 1) | 3 2 1 4 5 .. (epoch 2) | 5 1 4 2 3 .. (epoch 3) | ...

Neural Networks

- Random initial state
- weights and biases are initialized to random values, usually between -0.5 and 0.5
- the final solution may depend on the initial values of the weights
- the algorithm may converge to different local minima
- Learning rate and local minima
- learning rate
- too small: slow convergence
- too large: faster, but oscillations
- with a small learning rate a local minimum is less likely

Momentum

- A momentum term is added to the update equations
- Δw_ij(t+1) = η * error_derivative_j * I_i + mom * Δw_ij(t)
- Δθ_j(t+1) = η * error_derivative_j + mom * Δθ_j(t)
- the momentum term
- slows down changes of direction
- helps avoid falling into a local minimum, or
- speeds up convergence by adding to the gradient
- when the search falls into flat regions (see the sketch below)
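A minimal sketch of the momentum update above; the gradient terms and the rates eta and mom are illustrative numbers.

```python
# Momentum update: the new step mixes the current gradient term
# with a fraction of the previous step.
def momentum_step(grad_term, prev_delta, eta=0.1, mom=0.9):
    """delta_w(t+1) = eta * grad_term + mom * delta_w(t)."""
    return eta * grad_term + mom * prev_delta

delta = 0.0
for grad in [0.5, 0.5, 0.5]:      # a flat region: the same gradient term each step
    delta = momentum_step(grad, delta)
    print(round(delta, 4))        # 0.05, 0.095, 0.1355 -> steps grow, speeding descent
```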

Stopping criteria

- limit the number of epochs
- stop when the improvement in error is very small
- sample the error after a fixed number of epochs
- measure the reduction in error
- stop when no change in the w values is above a threshold

Overfitting Sec 4.6.5 pp 108-112 Mitchell

- Training error E decreases monotonically as the number of iterations increases (Fig. 4.9 in Mitchell)
- validation or test-case error
- in general
- decreases first, then starts increasing
- Why?
- as training progresses some weight values become large
- they fit the noise in the training data
- not the representative features of the population

What to do

- Weight decay
- slowly decrease the weights
- or add a penalty to the error function for large weights (see the sketch below)
- Monitor the validation-set error as well as the training-set error as a function of iterations
- see figure 4.9 in Mitchell
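A one-line sketch of weight decay expressed as an L2 penalty added to the gradient; eta and lambda are illustrative values.

```python
# Weight decay: add a penalty term (lam * w) to the error gradient,
# so large weights are pushed back toward zero on every update.
def decayed_update(w, grad, eta=0.1, lam=0.01):
    return w - eta * (grad + lam * w)

print(decayed_update(2.0, 0.0))  # 1.998 -> even with zero gradient the weight slowly decays
```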

Error and Complexity

- Section 4.4 of WK, pp 102-107
- the error rate on the training set decreases as the number of hidden units is increased
- the error rate on the test set first decreases, flattens out, then starts increasing as the number of hidden units is increased
- start with zero hidden units
- gradually increase the number of units in the hidden layer
- at each network size
- 10-fold cross-validation or
- sampling over different initial weights
- may be used to estimate the error
- the error estimates may be averaged, as in the sketch below
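A sketch of this model-selection loop, assuming the caller supplies the training and error-evaluation routines; train_fn and error_fn are placeholders, not library functions.

```python
# Grow the hidden layer one unit at a time and pick the size with the lowest
# averaged 10-fold cross-validation error.
def choose_hidden_units(data, train_fn, error_fn, max_hidden=10, folds=10):
    avg_error = {}
    for n_hidden in range(max_hidden + 1):   # start from zero hidden units
        errors = []
        for k in range(folds):               # 10-fold cross-validation
            test = data[k::folds]
            train = [x for i, x in enumerate(data) if i % folds != k]
            net = train_fn(train, n_hidden)
            errors.append(error_fn(net, test))
        avg_error[n_hidden] = sum(errors) / folds
    return min(avg_error, key=avg_error.get)  # network size with the lowest averaged error
```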

A General Network Training Procedure

- Define the problem
- select input and output variables
- make the necessary transformations
- Decide on the algorithm
- gradient descent or stochastic approximation (delta rule)
- Choose the transfer function
- logistic, hyperbolic tangent
- Select a learning rate and a momentum
- after experimenting with possibly different rates

A General Network Training Procedure (cont.)

- Determine the stopping criteria
- stop after the error decreases to a given level, or
- after a number of epochs
- Start from zero hidden units
- increment the number of hidden units
- for each number of hidden units repeat
- train the network on the training data set
- perform cross-validation to estimate the test error rate by averaging over different test samples
- for a set of initial weights
- find the best initial weights

Neural Network Approach

- Neural network approaches
- Represent each cluster as an exemplar, acting as a “prototype” of the cluster
- New objects are distributed to the cluster whose exemplar is the most similar according to some distance measure
- Typical methods
- SOM (Self-Organizing feature Map)
- Competitive learning
- Involves a hierarchical architecture of several units (neurons)
- Neurons compete in a “winner-takes-all” fashion for the object currently being presented

Self-Organizing Feature Map (SOM)

- SOMs, also called topologically ordered maps or Kohonen Self-Organizing Feature Maps (KSOMs)
- map all the points in a high-dimensional source space onto a 2- to 3-d target space, such that distance and proximity relationships (i.e., topology) are preserved as much as possible
- similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the feature space
- clustering is performed by having several units compete for the current object
- the unit whose weight vector is closest to the current object wins
- the winner and its neighbors learn by having their weights adjusted, as sketched below
- SOMs are believed to resemble processing that can occur in the brain
- useful for visualizing high-dimensional data in 2- or 3-d space
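A minimal sketch of a single SOM training step (competition followed by a neighborhood update); the grid size, learning rate, radius and input vector are illustrative choices.

```python
import math
import random

# One SOM step: find the winning unit (closest weight vector) and pull it,
# and to a lesser degree its grid neighbours, toward the input.
def som_step(weights, grid, x, lr=0.1, radius=1.0):
    dists = [math.dist(w, x) for w in weights]
    winner = dists.index(min(dists))               # winner-takes-all competition
    for j, w in enumerate(weights):
        g = math.dist(grid[winner], grid[j])       # distance on the 2-D map grid
        if g <= radius:                            # the winner and its neighbours learn
            influence = math.exp(-(g * g) / (2 * radius * radius))
            weights[j] = [wi + lr * influence * (xi - wi) for wi, xi in zip(w, x)]
    return winner

random.seed(0)
grid = [(i, j) for i in range(3) for j in range(3)]             # 3x3 map of units
weights = [[random.random() for _ in range(4)] for _ in grid]   # 4-d inputs mapped to a 2-d grid
print(som_step(weights, grid, [0.2, 0.9, 0.1, 0.4]))            # index of the winning unit
```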

Web Document Clustering Using SOM

- The result of SOM clustering of 12088 Web articles
- The picture on the right: drilling down on the keyword “mining”
- Based on websom.hut.fi Web page
