
Data Mining and Knowledge Acquisition — Chapter 5 —


Presentation Transcript


  1. Data Mining and Knowledge Acquisition — Chapter 5 — BIS 541 2013/2014 Summer

  2. Classification • Classification predicts categorical class labels • Typical applications: • {credit history, salary} -> credit approval (Yes/No) • {Temp, Humidity} -> Rain (Yes/No)

  3. Linear Classification • Binary classification problem • The data above the red line belongs to class 'x' • The data below the red line belongs to class 'o' • Examples: SVM, Perceptron, probabilistic classifiers • [Figure: scatter of 'x' points above and 'o' points below a separating line]

  4. Neural Networks • Analogy to biological systems (indeed a great example of a good learning system) • Massive parallelism allowing for computational efficiency • The first learning algorithm came in 1959 (Rosenblatt), who suggested that if a target output value is provided for a single neuron with fixed inputs, the weights can be incrementally changed to learn to produce these outputs, using the perceptron learning rule

  5. Neural Networks • Advantages • prediction accuracy is generally high • robust, works when training examples contain errors • output may be discrete, real-valued, or a vector of several discrete or real-valued attributes • fast evaluation of the learned target function • Criticism • long training time • difficult to understand the learned function (weights) • not easy to incorporate domain knowledge

  6. Network Topology • number of input nodes • number of hidden layers • number of nodes in each hidden layer • number of output nodes • can handle discrete or continuous variables • continuous variables are normalised to the 0..1 interval • for a discrete variable with k levels, use k input nodes (one per level) • use k output nodes if there are k > 2 classes • Example: A has three distinct values a1, a2, a3 • three input variables I1, I2, I3; when A = a1, then I1 = 1 and I2 = I3 = 0 • feed-forward: no cycles back to the input units • fully connected: each unit is connected to every unit in the next (forward) layer
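As a minimal Python sketch of the one-input-per-level encoding described above: the level labels a1, a2, a3 follow the slide's example, while the function name is an assumption.

```python
# Sketch of 1-of-k input encoding for a discrete attribute A with levels a1, a2, a3.
def encode_discrete(value, levels=("a1", "a2", "a3")):
    """Return one input per level: 1.0 for the observed level, 0.0 otherwise."""
    return [1.0 if value == level else 0.0 for level in levels]

print(encode_discrete("a1"))  # [1.0, 0.0, 0.0]  -> I1=1, I2=0, I3=0
```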

  7. Multi-Layer Perceptron • [Figure: layered network diagram; the input vector xi feeds the input nodes, weights wij connect them to the hidden nodes, which connect to the output nodes producing the output vector]

  8. Example: Sample iterations • A network suggested to solve the XOR problem (figure 4.7 of WK, pages 96-99) • learning rate is 1 for simplicity • I1 = I2 = 1 • T = 0 (true output) • I1 I2 T P: 1.0 1.0 0 0.63 (P is the network's prediction)

  9. [Figure: the XOR network of WK figure 4.7 with inputs I1 = 1 and I2 = 1, connection weights 0.1, 0.5, 0.3, -0.2, -0.4, 0.4, hidden-unit outputs O3 = 0.65 and O4 = 0.48, and output O5 = 0.63]

  10. Variable Encodings • Continuous variables • Ex: • dollar amounts • averages: average sales, volume • ratios: income to debt, payment to loan • physical measures: area, temperature... • Transfer to a standard range: • 0 to 1 or 0.1 to 0.9 • -1.0 to +1.0 or -0.9 to 0.9 • or use z-scores: z = (x - mean_x) / standard_dev_x
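A minimal sketch of the two rescalings above; the helper names are assumptions, not from the deck.

```python
# Min-max scaling to the 0..1 interval and z-score standardisation.
def min_max_scale(x, lo, hi):
    """Map x from the observed range [lo, hi] onto [0, 1]."""
    return (x - lo) / (hi - lo)

def z_score(x, mean_x, std_x):
    """z = (x - mean_x) / standard_dev_x."""
    return (x - mean_x) / std_x

print(min_max_scale(75.0, lo=0.0, hi=100.0))   # 0.75
print(z_score(75.0, mean_x=50.0, std_x=20.0))  # 1.25
```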

  11. Continuous variables • When a new observation arrives • it may be out of range • What to do: • plan for a larger range • reject out-of-range values • or peg values lower than the minimum to the range minimum • and values higher than the maximum to the range maximum
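A minimal sketch of the pegging option above; the helper name is an assumption.

```python
def peg_to_range(x, lo, hi):
    """Clip an out-of-range value to the training range [lo, hi]."""
    return max(lo, min(hi, x))

print(peg_to_range(120.0, lo=0.0, hi=100.0))  # 100.0
```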

  12. Ordinal variables • Discrete integers • Ex: • age ranges: young, mid, old • income: low, mid, high • number of children • Transfer to the 0-1 interval • Ex: 5 categories of age • 1 young, 2 mid-young, 3 mid, 4 mid-old, 5 old • transferred onto 0 to 1

  13. Thermometer coding • 0 -> 0 0 0 0 -> 0/16 = 0 • 1 -> 1 0 0 0 -> 8/16 = 0.5 • 2 -> 1 1 0 0 -> 12/16 = 0.75 • 3 -> 1 1 1 0 -> 14/16 = 0.875 • Useful for academic grades or bond ratings • A difference on one side of the scale is more important than on the other side of the scale
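A sketch of the two ordinal codings above: the plain linear map onto 0..1 from the previous slide and the thermometer code reproduced from the table. The function names and the 4-bit width are assumptions.

```python
def ordinal_to_unit(level, n_levels):
    """Map level 1..n_levels linearly onto the 0..1 interval."""
    return (level - 1) / (n_levels - 1)

def thermometer(level, n_bits=4):
    """Set the first `level` bits to 1 and read them as a binary fraction."""
    bits = [1 if i < level else 0 for i in range(n_bits)]
    value = sum(b * 2 ** (n_bits - 1 - i) for i, b in enumerate(bits)) / 2 ** n_bits
    return bits, value

print(ordinal_to_unit(3, 5))   # 0.5  (3rd of 5 age categories)
print(thermometer(2))          # ([1, 1, 0, 0], 0.75)  i.e. 12/16
```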

  14. Nominal Variables • Ex: • gender, marital status, occupation • 1 - treat like ordinal variables • Ex: marital status, 5 codes: • single, divorced, married, widowed, unknown • mapped to -1, -0.5, 0, 0.5, 1 • the network treats them as ordinal • even though the order does not make sense

  15. 2 - break into flags • one variable for each category • 1-of-N coding • Gender has three values: • male, female, unknown • Male: 1 -1 -1 • Female: -1 1 -1 • Unknown: -1 -1 1

  16. 1-of-(N-1) coding • Male: 1 -1 • Female: -1 1 • Unknown: -1 -1 • 3 - replace the variable with a numerical one
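A sketch of the 1-of-N and 1-of-(N-1) flag codings above using +1/-1 flags as in the gender example; the function names are assumptions.

```python
CATEGORIES = ("male", "female", "unknown")

def one_of_n(value, categories=CATEGORIES):
    """One +1/-1 flag per category."""
    return [1 if value == c else -1 for c in categories]

def one_of_n_minus_1(value, categories=CATEGORIES):
    """Drop the last category's flag; that category is coded as all -1."""
    return one_of_n(value, categories)[:-1]

print(one_of_n("female"))          # [-1, 1, -1]
print(one_of_n_minus_1("unknown")) # [-1, -1]
```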

  17. Time Series variables • Stock market prediction • Output: IMKB100 at t • Inputs: • IMKB100 at t-1, t-2, t-3, ... • Dollar at t-1, t-2, t-3, ... • Interest rate at t-1, t-2, t-3 • Day-of-week variables can be coded • as flags: Monday 1 0 0 0 0, ..., Friday 0 0 0 0 1 • or as a single ordinal value: map Monday to Friday onto -1 to 1 or 0 to 1
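A sketch of building the lagged inputs described above; the toy series values and the helper name are assumptions.

```python
def make_lagged_rows(series, n_lags=3):
    """Return (inputs, target) pairs: [x[t-1], x[t-2], x[t-3]] -> x[t]."""
    rows = []
    for t in range(n_lags, len(series)):
        inputs = [series[t - k] for k in range(1, n_lags + 1)]
        rows.append((inputs, series[t]))
    return rows

index = [100.0, 102.5, 101.0, 103.2, 104.8]   # toy IMKB100 values
print(make_lagged_rows(index)[0])             # ([101.0, 102.5, 100.0], 103.2)
```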

  18. A Neuron • [Figure: inputs x0 ... xn with weights w0 ... wn feed a weighted sum Σ, minus the bias μk, which passes through the activation function f to give the output y] • The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping

  19. A Neuron • [Figure: the same neuron diagram as the previous slide: weighted sum of the inputs minus the bias μk, passed through the activation function f to give the output y]
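A sketch of this neuron's forward computation, y = f(Σ wi·xi - μk), with a logistic f; the names and the sample values are assumptions (the values echo the perceptron example later in the deck).

```python
import math

def logistic(n):
    """Logistic activation f(n) = 1 / (1 + exp(-n))."""
    return 1.0 / (1.0 + math.exp(-n))

def neuron(x, w, mu_k):
    """Weighted sum of the inputs minus the bias mu_k, passed through f."""
    net = sum(wi * xi for wi, xi in zip(w, x)) - mu_k
    return logistic(net)

print(round(neuron(x=[1.5, 0.5], w=[0.25, 0.5], mu_k=0.5), 2))  # 0.53
```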

  20. Network Training • The ultimate objective of training: • obtain a set of weights that makes almost all the tuples in the training data classified correctly • Steps: • initialize the weights with random values • repeat until the classification error is lower than a threshold (each pass is an epoch) • feed the input tuples into the network one by one • for each unit • compute the net input to the unit as a linear combination of all the inputs to the unit • compute the output value using the activation function • compute the error • update the weights and the bias
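A runnable sketch of these steps for a single logistic unit, using the delta-rule update described later in the deck; the function name, learning rate and thresholds are assumptions.

```python
import math, random

def train_single_unit(data, eta=1.0, max_epochs=10000, error_threshold=0.05):
    """Random init, then repeat epochs of net input -> activation -> error -> update."""
    n = len(data[0][0])
    weights = [random.uniform(-0.5, 0.5) for _ in range(n)]   # random initial weights
    bias = random.uniform(-0.5, 0.5)
    for epoch in range(max_epochs):
        total_error = 0.0
        for inputs, target in data:                           # feed tuples one by one
            net = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1.0 / (1.0 + math.exp(-net))             # activation function
            error = target - output                           # compute the error
            total_error += error ** 2
            delta = output * (1 - output) * error             # delta-rule error term
            weights = [w + eta * delta * x for w, x in zip(weights, inputs)]
            bias += eta * delta                               # update weights and bias
        if total_error < error_threshold:                     # stop below the threshold
            break
    return weights, bias

# the AND function can be learned by a single unit
print(train_single_unit([([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]))
```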

  21. Example: Stock market prediction • input variables: • individual stock prices at t-1, t-2, t-3, ... • stock index at t-1, t-2, t-3, ... • inflation rate, interest rate, exchange rates ($) • output variable: • predicted stock price at the next time step • train the network with known cases • adjust the weights • experiment with different topologies • test the network • use the tested network for predicting unknown stock prices

  22. Other Business Applications (1) • Marketing and sales • Prediction • sales forecasting • price elasticity forecasting • customer response • Classification • target marketing • customer satisfaction • loyalty and retention • Clustering • segmentation

  23. Other Business Applications (2) • Risk management • credit scoring • financial health • Classification • bankruptcy classification • fraud detection • credit scoring • Clustering • credit scoring • risk assessment

  24. Other Business Applications (3) • Finance • Prediction • hedging • future prediction • forex and stock prediction • Classification • stock trend classification • bond rating • Clustering • economic rating • mutual fund selection

  25. Perceptrons • WK 91, section 4.2 • N inputs Ii, i = 1..N • single output O • two classes C0 and C1, denoted by 0 and 1 • one node • output: • O = 1 if w1I1 + w2I2 + ... + wNIN + w0 > 0 • O = 0 if w1I1 + w2I2 + ... + wNIN + w0 < 0 • sometimes θ is used for the constant term w0 • called the bias or threshold in ANNs

  26. Artificial Neural Nets: Perceptron • [Figure: perceptron with bias input x0 = +1 and inputs x1 ... xd, weights w0 ... wd, combined and passed through g to give the output y]

  27. Perceptron training procedure (rule) (1) • Find weights w that separate each training sample correctly • initial weights are randomly chosen • weight updating: • samples are presented in sequence • after presenting each case the weights are updated: • wi(t+1) = wi(t) + Δwi(t) • θ(t+1) = θ(t) + Δθ(t) • Δwi(t) = η(T - O)Ii • Δθ(t) = η(T - O) • O: output of the perceptron, T: true output for each case, η: learning rate, 0 < η < 1, usually around 0.1

  28. Perceptron training procedure (rule) (2) • each case is presented and • the weights are updated • after presenting each case, if • the error is not zero • then all cases are presented once more • each such cycle is called an epoch • until the error is zero, which is reachable for perfectly separable samples
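A runnable sketch of this procedure; the function names are assumptions, and the AND data set is just an example of a linearly separable problem.

```python
def perceptron_output(weights, theta, inputs):
    """O = 1 if w·I + theta > 0, else 0."""
    net = sum(w * x for w, x in zip(weights, inputs)) + theta
    return 1 if net > 0 else 0

def train_perceptron(samples, n_inputs, eta=0.1, max_epochs=100):
    weights, theta = [0.0] * n_inputs, 0.0
    for _ in range(max_epochs):                       # one pass over the data = one epoch
        errors = 0
        for inputs, target in samples:
            output = perceptron_output(weights, theta, inputs)
            if output != target:
                errors += 1
                for i in range(n_inputs):             # dw_i = eta * (T - O) * I_i
                    weights[i] += eta * (target - output) * inputs[i]
                theta += eta * (target - output)      # dtheta = eta * (T - O)
        if errors == 0:                               # converged: all samples separated
            break
    return weights, theta

# AND is linearly separable, so the procedure converges to a separating line
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
print(train_perceptron(and_data, n_inputs=2))
```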

  29. Perceptron convergence theorem: • if the sample is linearly separable the perceptron will eventually converge, i.e. separate all the samples correctly • error = 0 • the learning rate can even be one • to increase stability it is gradually decreased, which slows down convergence • linearly separable: a line or hyperplane can separate all the samples correctly

  30. If classes are not perfectly linearly separable • if a plane or line can not separate classes completely • The procedure will not converge and will keep on cycling through the data forever

  31. [Figure: two scatter plots of 'x' and 'o' points; on the left the classes are linearly separable, on the right they are not]

  32. Example calculations • Two inputs, w1 = 0.25, w2 = 0.5, w0 or θ = -0.5 • Suppose I1 = 1.5, I2 = 0.5 • learning rate = 0.1 • and T = 0 (true output) • the perceptron separates this as: • 0.25*1.5 + 0.5*0.5 - 0.5 = 0.125 > 0, so O = 1 • w1(t+1) = 0.25 + 0.1(0-1)1.5 = 0.1 • w2(t+1) = 0.5 + 0.1(0-1)0.5 = 0.45 • θ(t+1) = -0.5 + 0.1(0-1) = -0.6 • with the new weights: • 0.1*1.5 + 0.45*0.5 - 0.6 = -0.225 < 0, so O = 0 • no error
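The same arithmetic, checked in a short script; the variable names are assumptions.

```python
# Worked update with w1=0.25, w2=0.5, theta=-0.5, I=(1.5, 0.5), eta=0.1, T=0.
w1, w2, theta = 0.25, 0.5, -0.5
i1, i2, eta, target = 1.5, 0.5, 0.1, 0

output = 1 if w1 * i1 + w2 * i2 + theta > 0 else 0   # 0.125 > 0 -> O = 1
w1 += eta * (target - output) * i1                   # 0.25 - 0.15 = 0.1
w2 += eta * (target - output) * i2                   # 0.50 - 0.05 = 0.45
theta += eta * (target - output)                     # -0.5 - 0.1 = -0.6

print(round(w1, 3), round(w2, 3), round(theta, 3))       # 0.1 0.45 -0.6
print(round(w1 * i1 + w2 * i2 + theta, 3))               # -0.225 -> O = 0, no error
```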

  33. [Figure: two plots in the (I1, I2) plane. Before the update, the boundary 0.25*I1 + 0.5*I2 - 0.5 = 0 places the point on the class-1 side although its true class is 0; after the update, the boundary 0.1*I1 + 0.45*I2 - 0.6 = 0 classifies it correctly as class 0]

  34. XOR: exclusive OR problem • Two inputs I1, I2 • when both agree • I1 = 0 and I2 = 0, or I1 = 1 and I2 = 1 • class 0, O = 0 • when they disagree • I1 = 0 and I2 = 1, or I1 = 1 and I2 = 0 • class 1, O = 1 • one line cannot solve XOR • but two lines can

  35. [Figure: the four XOR points in the (I1, I2) plane; a single line cannot separate class 0 from class 1]

  36. Multi-layer networks • Study section 4.3 in WK • one-layer networks can separate classes with a hyperplane • two-layer networks can separate any convex region • and three-layer networks can separate any non-convex boundary • examples: see notes

  37. ANN for classification • [Figure: network with bias input x0 = +1 and inputs x1, x2, ..., xd connected by weights (e.g. wKd) to output units o1, o2, ..., oK]

  38. • [Figure: points in the (I1, I2) plane; the region inside triangle ABC is class O, outside it is class +; a network with hidden nodes a, b, c feeding output node d] • inside the triangle ABC is class O, outside the triangle is class + • class = O if I1 + I2 >= 10, I1 <= I2 and I2 <= 10 • output of hidden node a: 1 (class O side) if w11*I1 + w12*I2 + w10 >= 0, 0 (class + side) if w11*I1 + w12*I2 + w10 < 0 • so the w1i's are w11 = 1, w12 = 1 and w10 = -10

  39. • output of hidden node b: 1 if w21*I1 + w22*I2 + w20 >= 0, 0 if w21*I1 + w22*I2 + w20 < 0 • so the w2i's are w21 = -1, w22 = 1 and w20 = 0 (this tests I1 <= I2) • output of hidden node c: 1 if w31*I1 + w32*I2 + w30 >= 0, 0 if w31*I1 + w32*I2 + w30 < 0 • so the w3i's are w31 = 0, w32 = -1 and w30 = 10 (this tests I2 <= 10) • [Figure: the same triangle ABC with the boundary lines of hidden nodes b and c]

  40. • an object is class O if all hidden units predict it as class O • output is 1 if w'a*Ha + w'b*Hb + w'c*Hc + wd >= 0 • output is 0 if w'a*Ha + w'b*Hb + w'c*Hc + wd < 0 • weights of output node d: w'a = 1, w'b = 1, w'c = 1 and wd = -3 + x, where x is a small positive number (so d fires only when all three hidden units fire) • [Figure: the triangle ABC with hidden nodes a, b, c feeding output node d]
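Putting slides 38-40 together, a sketch of the resulting two-layer classifier; x = 0.5 is an assumed choice for the "small number" in the output bias, and the function names are assumptions.

```python
def step(net):
    """Threshold unit: fire (1) when the net input is non-negative."""
    return 1 if net >= 0 else 0

def classify(i1, i2):
    ha = step(1 * i1 + 1 * i2 - 10)                  # node a: I1 + I2 >= 10
    hb = step(-1 * i1 + 1 * i2 + 0)                  # node b: I1 <= I2
    hc = step(0 * i1 - 1 * i2 + 10)                  # node c: I2 <= 10
    out = step(1 * ha + 1 * hb + 1 * hc - 3 + 0.5)   # node d: all three must fire
    return "O" if out == 1 else "+"

print(classify(4.0, 7.0))   # inside the triangle  -> O
print(classify(2.0, 3.0))   # outside the triangle -> +
```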

  41. • ADBC is the union of two convex regions, in this case triangles; each triangular region can be separated by a two-layer network • two hidden layers can separate any non-convex region • d separates ABC, e separates ADB; ADBC is the union of ABC and ADB • output is class O if w''f0 + w''f1*He + w''f2*Hf >= 0, with w''f0 = -0.99, w''f1 = 1, w''f2 = 1 (an OR: at least one of the two triangle detectors must fire) • [Figure: the non-convex region ADBC; the first hidden layer encodes the triangle edges, the second hidden layer the two triangles, and the output node combines them]

  42. In practice the boundaries are not known, but by increasing the number of hidden nodes a two-layer perceptron can separate any convex region • if the data is perfectly separable • by adding a second hidden layer and OR-ing the convex regions, any non-convex boundary can be separated • if the data is perfectly separable • the weights are unknown but are found by training the network

  43. For prediction problems • Any function can be approximated with a one-hidden-layer network • [Figure: a curve y = f(x) approximated by the network]

  44. Network Training • The ultimate objective of training • obtain a set of weights that makes almost all the tuples in the training data classified correctly • Steps • Initialize weights with random values • Feed the input tuples into the network one by one • For each unit • Compute the net input to the unit as a linear combination of all the inputs to the unit • Compute the output value using the activation function • Compute the error • Update the weights and the bias

  45. Multi-Layer Perceptron • [Figure: the same multi-layer perceptron diagram as slide 7: input vector xi, input nodes, weights wij, hidden nodes, output nodes, output vector]

  46. Back propagation algorithm • LMS uses a linear activation function • not so useful • the threshold activation function is very good at separating but is not differentiable • back propagation uses the logistic function • O = 1/(1 + exp(-N)) = (1 + exp(-N))^-1 • N = w1I1 + w2I2 + ... + wnIn + θ • the derivative of the logistic function • dO/dN = O*(1-O), expressed as a function of the output, where O = 1/(1 + exp(-N)), 0 <= O <= 1
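A sketch of this activation and its derivative; the function names are assumptions.

```python
import math

def logistic(n):
    """O = 1 / (1 + exp(-N))."""
    return 1.0 / (1.0 + math.exp(-n))

def logistic_derivative(o):
    """dO/dN = O*(1-O), expressed in terms of the output O itself."""
    return o * (1.0 - o)

o = logistic(0.6)
print(round(o, 3), round(logistic_derivative(o), 3))   # 0.646 0.229
```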

  47. Minimize total error again • E = (1/2) Σd=1..N Σk=1..M (Tk,d - Ok,d)^2 • where N is the number of cases • M is the number of output units • Tk,d: true value of sample d at output unit k • Ok,d: predicted value of sample d at output unit k • the algorithm updates the weights by a method similar to the delta rule • for each output unit: • Δwij = η Σd=1..N Od(1-Od)(Td - Od) Ii,d, or • Δwij(t) = η O(1-O)(T - O) Ii | when cases are • Δθj(t) = η O(1-O)(T - O) | presented sequentially • here O(1-O)(T - O) = δ is the error term

  48. so wij(t)= *errorj*Ii or ij(t)= *errorj • for all training samples • new weights are • wi(t+1) = wi(t)+ wi(t) • i(t+1) = i(t)+ i(t) • but for hidden layer weights no target value is available • wij(t) = Od(1-Od) (Mk=1errork*wkh)Ii • ij(t) = Od(1-Od)(Mk=1errork*wkh) • the error rate of each output is weighted by its weight and summed up to find the error derivative • The weights from hidden unit h to output unit k is responsible for the error in output unit k

  49. Example: Sample iterations • A network suggested to solve the XOR problem (figure 4.7 of WK, pages 96-99) • learning rate is 1 for simplicity • I1 = I2 = 1 • T = 0 (true output)

  50. [Figure: the XOR network of WK figure 4.7 with inputs I1 = 1 and I2 = 1, connection weights 0.1, 0.5, 0.3, -0.2, -0.4, 0.4, hidden-unit outputs O3 = 0.65 and O4 = 0.48, and output O5 = 0.63]
