Data Mining and Knowledge Acquisition — Chapter 5 —
BIS 541 2013/2014 Summer

Presentation Transcript

Mathematically

Classification
  • Classification:
    • predicts categorical class labels
  • Typical Applications
    • {credit history, salary} -> credit approval (Yes/No)
    • {Temp, Humidity} -> Rain (Yes/No)
Linear Classification
  • Binary classification problem
  • The data above the red line belongs to class 'x'
  • The data below the red line belongs to class 'o'
  • Examples: SVM, Perceptron, probabilistic classifiers

(Figure: scatter plot with class 'x' points above the separating line and class 'o' points below it)

Neural Networks
  • Analogy to Biological Systems (Indeed a great example of a good learning system)
  • Massive Parallelism allowing for computational efficiency
  • The first learning algorithm came in 1959 (Rosenblatt), who suggested that if a target output value is provided for a single neuron with fixed inputs, the weights can be changed incrementally to learn to produce that output using the perceptron learning rule
Neural Networks
  • Advantages
    • prediction accuracy is generally high
    • robust, works when training examples contain errors
    • output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
    • fast evaluation of the learned target function
  • Criticism
    • long training time
    • difficult to understand the learned function (weights)
    • not easy to incorporate domain knowledge
Network Topology
  • number of input variables (inputs)
  • number of hidden layers
    • # of nodes in each hidden layer
  • # of output nodes
  • can handle discrete or continuous variables
    • normalise continuous variables to the 0..1 interval
    • for discrete variables
      • use k inputs, one for each level
      • use k outputs, one for each level, if k > 2
      • example: A has three distinct values a1, a2, a3
      • three input variables I1, I2, I3; when A = a1, I1 = 1 and I2 = I3 = 0
  • feed-forward: no cycles back to input units
  • fully connected: each unit connects to every unit in the next layer
Multi-Layer Perceptron

(Figure: input vector xi feeds the input nodes, which connect through weights wij to the hidden nodes and then to the output nodes producing the output vector)

Example: Sample Iterations
  • A network suggested to solve the XOR problem (figure 4.7 of WK, pages 96-99)
  • learning rate is 1 for simplicity
  • I1 = I2 = 1
  • T = 0 is the true output
  • I1 = 1.0, I2 = 1.0, T = 0, predicted output P = 0.63
(Figure 4.7: a 2-2-1 network for the XOR example; inputs I1 = 1 and I2 = 1, connection weights 0.1, 0.5, 0.3, -0.2, -0.4, 0.4, hidden-node outputs O3 = 0.65 and O4 = 0.48, and output-node value O5 = 0.63)

Variable Encodings
  • Continuous variables
  • Examples:
    • Dollar amounts
    • Averages: average sales, volume
    • Ratios: income to debt, payment to loan
    • Physical measures: area, temperature, ...
    • Transform to
      • 0 - 1 or 0.1 - 0.9
      • -1.0 - +1.0 or -0.9 - 0.9
      • z-scores: z = (x - mean_x) / standard_dev_x
Continuous variables
  • When a new observation arrives
    • it may be out of range
  • What to do
    • Plan for a larger range
    • Reject out-of-range values
    • Peg values lower than the minimum to the range minimum
    • and values higher than the maximum to the range maximum
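A minimal sketch of these transformations, assuming the range limits are the ones planned at training time; function names and example values are illustrative only:

```python
def min_max_scale(x, lo, hi, out_lo=0.0, out_hi=1.0):
    """Scale x from the planned range [lo, hi] to [out_lo, out_hi],
    pegging out-of-range values to the range limits."""
    x = max(lo, min(hi, x))              # peg values outside the planned range
    return out_lo + (x - lo) * (out_hi - out_lo) / (hi - lo)

def z_score(x, mean_x, std_x):
    """z = (x - mean_x) / std_x"""
    return (x - mean_x) / std_x

# Example: a dollar amount planned for the range 0..10000
print(min_max_scale(2500, 0, 10000))     # 0.25
print(min_max_scale(12000, 0, 10000))    # out of range, pegged to 1.0
print(z_score(2500, 5000, 2000))         # -1.25
```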
Ordinal variables
  • Discrete integers
  • Examples:
    • Age ranges: young, mid, old
    • Income: low, mid, high
    • Number of children
  • Transform to the 0-1 interval
  • Example: 5 categories of age
    • 1 young, 2 mid-young, 3 mid, 4 mid-old, 5 old
    • transform to values between 0 and 1
Thermometer coding
  • 0 → 0 0 0 0 → 0/16 = 0
  • 1 → 1 0 0 0 → 8/16 = 0.5
  • 2 → 1 1 0 0 → 12/16 = 0.75
  • 3 → 1 1 1 0 → 14/16 = 0.875
  • Useful for academic grades or bond ratings
  • when a difference on one side of the scale is more important than on the other side of the scale
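A minimal sketch of thermometer coding as shown above: the category index fills 1-bits from the left and the code is read as a binary fraction (function name and bit width are illustrative):

```python
def thermometer_code(level, n_bits=4):
    """Thermometer-code an ordinal level 0..n_bits into bits and a single value."""
    bits = [1] * level + [0] * (n_bits - level)          # e.g. 2 -> [1, 1, 0, 0]
    value = sum(b * 2 ** (n_bits - 1 - i) for i, b in enumerate(bits)) / 2 ** n_bits
    return bits, value

for level in range(4):
    print(level, *thermometer_code(level))
# 0 [0, 0, 0, 0] 0.0
# 1 [1, 0, 0, 0] 0.5
# 2 [1, 1, 0, 0] 0.75
# 3 [1, 1, 1, 0] 0.875
```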
Nominal Variables
  • Examples:
  • Gender, marital status, occupation
  • 1. Treat like ordinal variables
    • Example: marital status with 5 codes:
    • single, divorced, married, widowed, unknown
    • mapped to -1, -0.5, 0, 0.5, 1
  • The network treats them as ordinal
  • even though the order does not make sense
2. Break into flags
    • one variable for each category
  • 1-of-N coding
    • Gender has three values:
    • male, female, unknown
    • Male:     1 -1 -1
    • Female:  -1  1 -1
    • Unknown: -1 -1  1
1-of-(N-1) coding
    • Male:     1 -1
    • Female:  -1  1
    • Unknown: -1 -1
  • 3. Replace the variable with a numerical one
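A minimal sketch of the 1-of-N and 1-of-(N-1) codings above, using the -1/1 flags from the slides; the category list and function names are illustrative:

```python
def one_of_n(value, categories):
    """1-of-N coding: one -1/1 flag per category."""
    return [1 if value == c else -1 for c in categories]

def one_of_n_minus_1(value, categories):
    """1-of-(N-1) coding: drop the flag of the last category,
    which is then encoded as all -1."""
    return one_of_n(value, categories)[:-1]

genders = ["male", "female", "unknown"]
print(one_of_n("female", genders))            # [-1, 1, -1]
print(one_of_n_minus_1("unknown", genders))   # [-1, -1]
```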
Time Series Variables
  • Stock market prediction
  • Output: IMKB100 at t
  • Inputs:
    • IMKB100 at t-1, t-2, t-3, ...
    • Dollar at t-1, t-2, t-3, ...
    • Interest rate at t-1, t-2, t-3, ...
  • Day-of-week variables
    • treated as nominal: 1-of-N flags, Monday 1 0 0 0 0, ..., Friday 0 0 0 0 1
    • treated as ordinal: map Monday to Friday
      • onto -1 to 1 or 0 to 1
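A minimal sketch of building such lagged inputs from a price series; the series values and the number of lags are illustrative only:

```python
def make_lagged_rows(series, n_lags=3):
    """Turn a time series into (inputs, target) rows:
    inputs are the n_lags previous values, target is the current value."""
    rows = []
    for t in range(n_lags, len(series)):
        inputs = [series[t - k] for k in range(1, n_lags + 1)]   # t-1, t-2, t-3
        rows.append((inputs, series[t]))
    return rows

index = [100, 102, 101, 105, 107, 106]
for inputs, target in make_lagged_rows(index):
    print(inputs, "->", target)
# first row: [101, 102, 100] -> 105   (values at t-1, t-2, t-3 predicting t)
```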
A Neuron

(Figure: inputs x0, x1, ..., xn with weights w0, w1, ..., wn feed a weighted sum Σ; a bias -μk is applied and an activation function f produces the output y)

  • The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping
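A minimal sketch of this computation with a logistic activation function; the input, weight and bias values below are made up for illustration:

```python
import math

def neuron_output(x, w, bias):
    """Weighted sum of the inputs minus the bias, passed through a logistic activation."""
    net = sum(xi * wi for xi, wi in zip(x, w)) - bias
    return 1.0 / (1.0 + math.exp(-net))

print(neuron_output([1.0, 0.5], [0.25, 0.5], 0.5))   # net = 0.0, output = 0.5
```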

Network Training

  • The ultimate objective of training
    • obtain a set of weights that makes almost all the tuples in the training data classified correctly
  • Steps
    • Initialize weights with random values
    • repeat until the classification error is lower than a threshold (each pass over all tuples is one epoch)
      • Feed the input tuples into the network one by one
      • For each unit
        • Compute the net input to the unit as a linear combination of all the inputs to the unit
        • Compute the output value using the activation function
      • Compute the error
      • Update the weights and the bias
Example: Stock market prediction
  • input variables:
    • individual stock prices at t-1, t-2,t-3,...
    • stock index at t-1, t-2, t-3,
    • inflation rate, interest rate, exchange rates $
  • output variable:
    • predicted stock price at the next time step
  • train the network with known cases
    • adjust weights
    • experiment with different topologies
    • test the network
  • use the tested network for predicting unknown stock prices
Other Business Applications (1)
  • Marketing and sales
    • Prediction
      • Sales forecasting
      • Price elasticity forecasting
      • Customer response
    • Classification
      • Target marketing
      • Customer satisfaction
      • Loyalty and retention
    • Clustering
      • Segmentation
Other Business Applications (2)
  • Risk management
    • Credit scoring
    • Financial health
  • Classification
    • Bankruptcy classification
    • Fraud detection
    • Credit scoring
  • Clustering
    • Credit scoring
    • Risk assessment
Other Business Applications (3)
  • Finance
    • Prediction
      • Hedging
      • Futures prediction
      • Forex and stock prediction
    • Classification
      • Stock trend classification
      • Bond rating
    • Clustering
      • Economic rating
      • Mutual fund selection
Perceptrons
  • WK p. 91, sec 4.2
  • N inputs Ii, i = 1..N
  • single output O
  • two classes C0 and C1, denoted by 0 and 1
  • one node
  • output:
  • O = 1 if w1I1 + w2I2 + ... + wNIN + w0 > 0
  • O = 0 if w1I1 + w2I2 + ... + wNIN + w0 <= 0
  • sometimes θ is used for the constant term w0
    • called the bias or threshold in ANNs
Perceptron Training Procedure (Rule) (1)
  • Find weights w that separate each training sample correctly
  • Initial weights are randomly chosen
  • weight updating:
  • samples are presented in sequence
  • after presenting each case the weights are updated:
    • wi(t+1) = wi(t) + Δwi(t)
    • θ(t+1) = θ(t) + Δθ(t)
  • Δwi(t) = η(T - O)Ii
  • Δθ(t) = η(T - O)
  • O: output of the perceptron, T: true output for each case, η: learning rate, 0 < η < 1, usually around 0.1
Perceptron Training Procedure (Rule) (2)
  • each case is presented and
  • the weights are updated
  • after presenting all cases, if
    • the error is not zero
    • then present all cases once more
    • each such cycle is called an epoch
  • until the error is zero, which happens for perfectly separable samples
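A minimal sketch of this training rule, assuming a small linearly separable toy sample (the AND function); the learning rate, initialization and data are illustrative only:

```python
import random

def perceptron_output(x, w, theta):
    """O = 1 if w.x + theta > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + theta > 0 else 0

def train_perceptron(samples, eta=0.1, max_epochs=100):
    """samples: list of (inputs, true_output); returns (weights, theta)."""
    n = len(samples[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]
    theta = random.uniform(-0.5, 0.5)
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:                      # present cases in sequence
            o = perceptron_output(x, w, theta)
            if o != t:
                errors += 1
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
                theta += eta * (t - o)            # bias / threshold update
        if errors == 0:                           # converged: an epoch with no errors
            break
    return w, theta

# a linearly separable toy sample (the AND function)
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, theta = train_perceptron(data)
print([perceptron_output(x, w, theta) for x, _ in data])   # [0, 0, 0, 1] after convergence
```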
Perceptron convergence theorem:
    • if the sample is linearly separable the perceptron will eventually converge: it separates all the samples correctly
      • error = 0
  • the learning rate can even be one
    • this slows down the convergence
  • to increase stability
    • it is gradually decreased
  • linearly separable: a line or hyperplane can separate all the samples correctly
If classes are not perfectly linearly separable
  • if a plane or line cannot separate the classes completely
  • the procedure will not converge and will keep on cycling through the data forever
(Figure: two scatter plots of 'x' and 'o' points, one linearly separable and one not linearly separable)

Example Calculations
  • Two inputs, w1 = 0.25, w2 = 0.5, w0 (or θ) = -0.5
  • Suppose I1 = 1.5, I2 = 0.5
  • learning rate η = 0.1
  • and T = 0 is the true output
  • the perceptron classifies this case as:
    • 0.25*1.5 + 0.5*0.5 - 0.5 = 0.125 > 0, so O = 1
  • w1(t+1) = 0.25 + 0.1(0-1)1.5 = 0.10
  • w2(t+1) = 0.5 + 0.1(0-1)0.5 = 0.45
  • θ(t+1) = -0.5 + 0.1(0-1) = -0.6
  • with the new weights:
  • 0.1*1.5 + 0.45*0.5 - 0.6 = -0.225 < 0, so O = 0
  • no error
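The same numbers can be reproduced in a few lines of Python using the update rule Δwi = η(T - O)Ii from the previous slides; variable names are illustrative:

```python
w1, w2, theta = 0.25, 0.5, -0.5
i1, i2, t, eta = 1.5, 0.5, 0, 0.1

o = 1 if w1 * i1 + w2 * i2 + theta > 0 else 0      # 0.125 > 0, so O = 1
w1 += eta * (t - o) * i1                           # 0.25 - 0.15 = 0.10
w2 += eta * (t - o) * i2                           # 0.50 - 0.05 = 0.45
theta += eta * (t - o)                             # -0.5 - 0.1 = -0.6

net = w1 * i1 + w2 * i2 + theta                    # -0.225, so O = 0: no error
print(round(w1, 2), round(w2, 2), round(theta, 2), round(net, 3))
# 0.1 0.45 -0.6 -0.225
```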
(Figure: before the update, the boundary 0.25*I1 + 0.5*I2 - 0.5 = 0 crosses the axes at I2 = 1 and I1 = 2 and puts the point (1.5, 0.5), whose true class is 0, on the class-1 side; after the update, the boundary 0.1*I1 + 0.45*I2 - 0.6 = 0 crosses the axes at I2 = 1.33 and I1 = 6 and classifies the point correctly as class 0)

XOR: exclusive OR problem
  • Two inputs I1, I2
  • when both agree
    • I1 = 0 and I2 = 0, or I1 = 1 and I2 = 1
    • class 0, O = 0
  • when both disagree
    • I1 = 0 and I2 = 1, or I1 = 1 and I2 = 0
    • class 1, O = 1
  • one line cannot solve XOR
  • but two lines can
(Figure: the four XOR points in the I1-I2 plane at (0,0), (0,1), (1,0) and (1,1); a single line cannot separate the two classes)

Multi-layer networks
  • Study section 4.3 in WK
  • one-layer networks can separate classes with a single hyperplane
  • two-layer networks can separate any convex region
  • and three-layer networks can separate any non-convex boundary
  • examples: see notes; a sketch of a two-layer solution to XOR follows
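As an illustration of the two-line idea, here is a minimal sketch of a hand-built two-layer threshold network that solves XOR. The weights are one possible choice, not the ones from WK figure 4.7: one hidden unit computes OR, the other AND, and the output unit fires when OR is true but AND is not.

```python
def step(net):
    """Threshold activation: 1 if net >= 0, else 0."""
    return 1 if net >= 0 else 0

def xor_net(i1, i2):
    h1 = step(i1 + i2 - 0.5)       # OR  of the inputs
    h2 = step(i1 + i2 - 1.5)       # AND of the inputs
    return step(h1 - h2 - 0.5)     # fires only when exactly one input is 1

for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, "->", xor_net(i1, i2))
# 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```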
ANN for classification

(Figure: inputs x0 = +1, x1, x2, ..., xd fully connected through weights such as wKd to output units o1, o2, ..., oK)

(Figure: points in the I1-I2 plane; inside the triangle ABC is class o, outside the triangle is class +; hidden nodes a, b, c, one per side of the triangle, feed an output node d)

class = o if
  • I1 + I2 >= 10
  • I1 <= I2
  • I2 <= 10

output of hidden node a:
  • 1 (class o side) if w11*I1 + w12*I2 + w10 >= 0
  • 0 (class + side) if w11*I1 + w12*I2 + w10 < 0
  • so the w1i are w11 = 1, w12 = 1 and w10 = -10

output of hidden node b:
  • 1 (class o side) if w21*I1 + w22*I2 + w20 >= 0
  • 0 (class + side) if w21*I1 + w22*I2 + w20 < 0
  • so the w2i are w21 = -1, w22 = 1 and w20 = 0

output of hidden node c:
  • 1 (class o side) if w31*I1 + w32*I2 + w30 >= 0
  • 0 (class + side) if w31*I1 + w32*I2 + w30 < 0
  • so the w3i are w31 = 0, w32 = -1 and w30 = 10

(Figure: the same triangle ABC with hidden nodes a, b, c and output node d)

an object is class o if all hidden units predict it as class o:
  • output of node d is 1 if w'a*Ha + w'b*Hb + w'c*Hc + w'd >= 0
  • output of node d is 0 if w'a*Ha + w'b*Hb + w'c*Hc + w'd < 0
  • weights of output node d: w'a = 1, w'b = 1, w'c = 1 and w'd = -3 + x, where x is a small positive number
  • (see the sketch below)

(Figure: the same triangle ABC; hidden nodes a, b, c feed the output node d)
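A minimal sketch of this hand-built network with threshold units, using the weights read off the slides above and x = 0.5 for the output bias; the test points are made up for illustration:

```python
def step(net):
    return 1 if net >= 0 else 0

def triangle_net(i1, i2):
    """Three hidden threshold units, one per side of triangle ABC,
    AND-ed together by the output unit d."""
    ha = step(1 * i1 + 1 * i2 - 10)      # I1 + I2 >= 10
    hb = step(-1 * i1 + 1 * i2 + 0)      # I1 <= I2
    hc = step(0 * i1 - 1 * i2 + 10)      # I2 <= 10
    return step(ha + hb + hc - 3 + 0.5)  # 1 (class o) only if all three fire

print(triangle_net(4, 8))    # inside the triangle  -> 1 (class o)
print(triangle_net(2, 3))    # outside the triangle -> 0 (class +)
```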

ADBC is the union of two convex regions, in this case triangles
  • each triangular region can be separated by a two-layer network
  • two hidden layers can separate any non-convex region
  • node d separates ABC and node e separates ADB; ADBC is the union of ABC and ADB
  • a second-hidden-layer node f ORs them: output is class o if w''f0 + w''f1*Hd + w''f2*He >= 0
  • with w''f0 = -0.99, w''f1 = 1, w''f2 = 1

(Figure: the non-convex region ADBC in the I1-I2 plane, split into triangles ABC and ADB; a first hidden layer of threshold units for the triangle sides, a second hidden layer with nodes d and e, and output node f)

In practice the boundaries are not known, but by increasing the number of hidden nodes a two-layer perceptron can separate any convex region
    • if it is perfectly separable
  • by adding a second hidden layer and OR-ing the convex regions, any non-convex boundary can be separated
    • if it is perfectly separable
  • the weights are unknown but are found by training the network
For prediction problems
  • Any function can be approximated with a one-hidden-layer network

(Figure: a curve of Y against X approximated by the network)

Network Training
  • The ultimate objective of training
    • obtain a set of weights that makes almost all the tuples in the training data classified correctly
  • Steps
    • Initialize weights with random values
    • Feed the input tuples into the network one by one
    • For each unit
      • Compute the net input to the unit as a linear combination of all the inputs to the unit
      • Compute the output value using the activation function
      • Compute the error
      • Update the weights and the bias
Multi-Layer Perceptron

(Figure: input vector xi feeds the input nodes, which connect through weights wij to the hidden nodes and then to the output nodes producing the output vector)

Back propagation algorithm
  • LMS uses a linear activation function
    • not so useful
  • a threshold activation function is very good at separating but is not differentiable
  • back propagation uses the logistic function
  • O = 1/(1 + exp(-N)) = (1 + exp(-N))^(-1)
  • N = w1I1 + w2I2 + ... + wnIn + θ is the net input
  • the derivative of the logistic function
  • dO/dN = O*(1 - O), expressed as a function of the output, where O = 1/(1 + exp(-N)), 0 <= O <= 1
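A minimal sketch of the logistic activation and its derivative as defined here (the test value is illustrative):

```python
import math

def logistic(net):
    """O = 1 / (1 + exp(-N))"""
    return 1.0 / (1.0 + math.exp(-net))

def logistic_derivative(o):
    """dO/dN = O * (1 - O), written in terms of the output O."""
    return o * (1.0 - o)

o = logistic(0.6)
print(round(o, 3), round(logistic_derivative(o), 3))   # 0.646 0.229
```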
Minimize the total error again
  • E = (1/2) Σd=1..N Σk=1..M (Tk,d - Ok,d)²
  • where N is the number of cases
  • M is the number of output units
  • Tk,d: true value of sample d at output unit k
  • Ok,d: predicted value of sample d at output unit k
  • the algorithm updates the weights by a method similar to the delta rule
  • for each output unit:
  • Δwij = η Σd=1..N Od(1 - Od)(Td - Od) Ii,d, or
  • Δwij(t) = η O(1 - O)(T - O) Ii  | when objects are
  • Δθj(t) = η O(1 - O)(T - O)      | presented sequentially
  • here O(1 - O)(T - O) = errorj is the error term
so Δwij(t) = η*errorj*Ii and Δθj(t) = η*errorj
  • for all training samples
  • the new weights are
    • wi(t+1) = wi(t) + Δwi(t)
    • θi(t+1) = θi(t) + Δθi(t)
  • but for hidden-layer weights no target value is available
  • Δwih(t) = η Oh(1 - Oh) (Σk=1..M errork*wkh) Ii
  • Δθh(t) = η Oh(1 - Oh) (Σk=1..M errork*wkh)
  • the error of each output unit is weighted by the corresponding weight and summed to form the error term of the hidden unit
  • the weight from hidden unit h to output unit k is responsible for part of the error at output unit k
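A minimal sketch of one case-by-case update for a 2-2-1 logistic network using the formulas above. The weights are taken from the values listed for the XOR example, but the bias values were not captured from the figure and are assumed to be zero here, so the prediction (about 0.52) differs from the slide's 0.63; all names are illustrative.

```python
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def backprop_step(x, target, w_hid, b_hid, w_out, b_out, eta=1.0):
    """One forward pass and one weight update for a 2-2-1 network."""
    # forward pass
    h = [logistic(sum(w * xi for w, xi in zip(ws, x)) + b)
         for ws, b in zip(w_hid, b_hid)]
    o = logistic(sum(w * hi for w, hi in zip(w_out, h)) + b_out)

    # error terms: output unit, then hidden units weighted by their outgoing weight
    err_o = o * (1 - o) * (target - o)
    err_h = [hi * (1 - hi) * err_o * w for hi, w in zip(h, w_out)]

    # weight and bias updates: delta = eta * error * input
    w_out = [w + eta * err_o * hi for w, hi in zip(w_out, h)]
    b_out += eta * err_o
    w_hid = [[w + eta * eh * xi for w, xi in zip(ws, x)]
             for ws, eh in zip(w_hid, err_h)]
    b_hid = [b + eta * eh for b, eh in zip(b_hid, err_h)]
    return o, w_hid, b_hid, w_out, b_out

# the XOR case I1 = I2 = 1, T = 0, with assumed zero biases
o, *_ = backprop_step([1, 1], 0,
                      w_hid=[[0.1, 0.5], [0.3, -0.4]], b_hid=[0.0, 0.0],
                      w_out=[-0.2, 0.4], b_out=0.0)
print(round(o, 2))   # about 0.52: the prediction before the update
```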
Example: Sample Iterations
  • A network suggested to solve the XOR problem (figure 4.7 of WK, pages 96-99)
  • learning rate is 1 for simplicity
  • I1 = I2 = 1
  • T = 0 is the true output
(Figure 4.7: the same 2-2-1 network; inputs I1 = 1 and I2 = 1, connection weights 0.1, 0.5, 0.3, -0.2, -0.4, 0.4, hidden-node outputs O3 = 0.65 and O4 = 0.48, and output-node value O5 = 0.63)

Exercise
  • Carry out one more iteration for the XOR problem
Practical Applications of BP
  • Weight revision by epoch or by case
  • dE/dwj = Σi=1..N Oi(1 - Oi)(Ti - Oi) Iij
  • where i = 1,..,N is the index over samples
    • N is the sample size
    • j is the index over inputs; Iij is input variable j for sample i
  • This is the theoretical (exact) derivative
    • information from all samples is used in one update of weight j
    • weights are revised after each epoch
If samples are presented one by one, weight j is updated after presenting each sample by
    • dE/dwj = Oi(1 - Oi)(Ti - Oi) Iij
    • this is just one term of the epoch (gradient) formula for the derivative
    • called case update
    • updating by case is more common and gives better results
    • it is less likely to get stuck in a local minimum
  • Random or sequential presentation
    • in each epoch the cases are presented in
      • sequential order or
      • random order
  • sequential presentation:
    • 1 2 3 4 5 .. | 1 2 3 4 5 .. | 1 2 3 4 5 .. | 1 2 3 4 5 ..   (epoch 1, epoch 2, epoch 3, ...)
  • random presentation:
    • 1 2 5 4 3 .. | 3 2 1 4 5 .. | 5 1 4 2 3 .. | 2 5 4 1 3 ..   (epoch 1, epoch 2, epoch 3, ...)

Neural Networks
  • Random initial state
    • weights and biases are initialized to random values, usually between -0.5 and 0.5
    • the final solution may depend on the initial values of the weights
    • the algorithm may converge to different local minima
  • Learning rate and local minima
    • learning rate η
      • too small: slow convergence
      • too large: faster, but oscillations
  • with a small learning rate a local minimum is less likely
Momentum
  • A momentum term is added to the update equations
  • Δwij(t+1) = η*errorj*Ii + mom*Δwij(t)
  • Δθj(t+1) = η*errorj + mom*Δθj(t)
  • the momentum term
    • slows down changes of direction
    • helps avoid falling into a local minimum, or
    • speeds up convergence by adding to the gradient step
      • when the search falls into flat regions
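A minimal sketch of the momentum update for a single weight; the learning rate, momentum coefficient and error values are illustrative only:

```python
eta, mom = 0.1, 0.9        # learning rate and momentum coefficient (illustrative)
w, prev_delta = 0.3, 0.0

# a few updates with the same error signal: the step size builds up
for error_j, i_i in [(0.2, 1.0), (0.2, 1.0), (0.2, 1.0)]:
    delta = eta * error_j * i_i + mom * prev_delta
    w += delta
    prev_delta = delta
    print(round(delta, 4), round(w, 4))
# 0.02   0.32
# 0.038  0.358
# 0.0542 0.4122
```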
Stopping Criteria
  • limit the number of epochs
  • the improvement in error is too small
    • sample the error after a fixed number of epochs
    • measure the reduction in error
  • no change in the w values above a threshold
Overfitting (Sec 4.6.5, pp 108-112, Mitchell)
  • the training error E decreases monotonically as the number of iterations increases (Fig 4.9 in Mitchell)
  • the validation or test error
    • in general
    • decreases first, then starts increasing
  • Why?
    • as training progresses some weight values become large
    • and fit the noise in the training data
      • which is not a representative feature of the population
What to do
  • Weight decay
    • slowly decrease the weights
    • put a penalty on large weights in the error function
  • Monitor the validation-set error as well as the training-set error as a function of the number of iterations
    • see figure 4.9 in Mitchell
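A minimal sketch of weight decay as a penalty pulling each weight toward zero at every update; the decay coefficient is an illustrative choice, not a value from the course text:

```python
eta, lam = 0.1, 0.01          # learning rate and weight-decay coefficient

def update_with_decay(w, error_j, i_i):
    """Usual delta-rule step plus a small pull of the weight toward zero."""
    return w + eta * error_j * i_i - eta * lam * w

w = 2.0
for _ in range(3):
    w = update_with_decay(w, 0.0, 1.0)   # with zero error the weight only decays
    print(round(w, 4))
# 1.998, 1.996, 1.994
```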
Error and Complexity
  • Sec 4.4 of WK, pp 102-107
  • the error rate on the training set decreases as the number of hidden units is increased
  • the error rate on the test set first decreases, flattens out, then starts increasing as the number of hidden units is increased
  • Start with zero hidden units
    • gradually increase the number of units in the hidden layer
    • at each network size
      • 10-fold cross-validation or
      • sampling different initial weights
      • may be used to estimate the error,
      • and the error may be averaged
A General Network Training Procedure
  • Define the problem
  • Select input and output variables
  • Make the necessary transformations
  • Decide on the algorithm
    • gradient descent or stochastic approximation (delta rule)
  • Choose the transfer function
    • logistic, hyperbolic tangent
  • Select a learning rate and a momentum
    • after experimenting with possibly different rates
A General Network Training Procedure (cont.)
  • Determine the stopping criteria
    • after the error decreases to a given level, or
    • a number of epochs
  • Start from zero hidden units
  • increment the number of hidden units
    • for each number of hidden units repeat
      • train the network on the training data set
      • perform cross-validation to estimate the test error rate by averaging over different test samples
        • for a set of initial weights
        • find the best initial weights
Neural Network Approach
  • Neural network approaches
    • represent each cluster as an exemplar, acting as a "prototype" of the cluster
    • new objects are assigned to the cluster whose exemplar is the most similar, according to some distance measure
  • Typical methods
    • SOM (Self-Organizing feature Map)
    • Competitive learning
      • involves a hierarchical architecture of several units (neurons)
      • neurons compete in a "winner-takes-all" fashion for the object currently being presented
Self-Organizing Feature Map (SOM)
  • SOMs, also called topologically ordered maps or Kohonen Self-Organizing Feature Maps (KSOMs)
  • map all the points in a high-dimensional source space into a 2- to 3-d target space, such that the distance and proximity relationships (i.e., topology) are preserved as much as possible
  • similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the feature space
  • clustering is performed by having several units compete for the current object
    • the unit whose weight vector is closest to the current object wins
    • the winner and its neighbors learn by having their weights adjusted
  • SOMs are believed to resemble processing that can occur in the brain
  • useful for visualizing high-dimensional data in 2- or 3-D space
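A minimal sketch of one SOM training step on a small one-dimensional grid of units; the grid size, learning rate and neighbourhood radius are illustrative choices:

```python
def som_step(units, x, eta=0.5, radius=1):
    """Move the winning unit and its grid neighbors toward the input x."""
    # winner: the unit whose weight vector is closest to x
    winner = min(range(len(units)),
                 key=lambda i: sum((u - xi) ** 2 for u, xi in zip(units[i], x)))
    for i, w in enumerate(units):
        if abs(i - winner) <= radius:                     # neighborhood on the grid
            units[i] = [u + eta * (xi - u) for u, xi in zip(w, x)]
    return winner

units = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]   # 3 units with 2-d weight vectors
print(som_step(units, [0.9, 0.8]))              # 2: the winning unit index
print(units)                                    # winner and its neighbor moved toward the input
```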
Web Document Clustering Using SOM
  • The result of SOM clustering of 12088 Web articles
  • The picture on the right: drilling down on the keyword “mining”
  • Based on websom.hut.fi Web page