
Machine Learning Based Models for Time Series Prediction


Presentation Transcript


  1. Machine Learning Based Models for Time Series Prediction 2014/3

  2. Outline • Support Vector Regression • Neural Network • Adaptive Neuro-Fuzzy Inference System • Comparison

  3. Support Vector Regression • Basic Idea • Given a dataset $\{(x_1,y_1),\dots,(x_\ell,y_\ell)\}\subset\mathcal{X}\times\mathbb{R}$ • Our goal is to find a function $f(x)$ which deviates by at most $\varepsilon$ from the actual target $y_i$ for all training data. • In the linear case, $f$ takes the form $f(x)=\langle w,x\rangle+b$, where $\langle\cdot,\cdot\rangle$ denotes the dot product in $\mathcal{X}$. • "Flatness" in this case means a small $w$ (less sensitive to perturbations in the features). • Therefore, we can write the problem as follows: minimize $\tfrac{1}{2}\|w\|^2$ • Subject to $y_i-\langle w,x_i\rangle-b\le\varepsilon$ and $\langle w,x_i\rangle+b-y_i\le\varepsilon$

  4. Note that +ε and -ε are offsets measured along the target axis; they are not an actual geometric (perpendicular) distance from the regression function.

  5. Soft Margin and Slack Variables • Ideally $f$ approximates all pairs $(x_i,y_i)$ with $\varepsilon$ precision; however, we may also want to allow some errors. • The soft-margin loss function and slack variables $\xi_i,\xi_i^*$ were introduced into the SVR: minimize $\tfrac{1}{2}\|w\|^2+C\sum_{i=1}^{\ell}(\xi_i+\xi_i^*)$ • Subject to $y_i-\langle w,x_i\rangle-b\le\varepsilon+\xi_i$, $\langle w,x_i\rangle+b-y_i\le\varepsilon+\xi_i^*$, $\xi_i,\xi_i^*\ge 0$ • $C>0$ is the regularization parameter which determines the trade-off between flatness and the tolerance of errors. • $\xi_i,\xi_i^*$ are slack variables which measure how far a sample lies outside the ε-insensitive tube. • The ε-insensitive loss function: $|\xi|_\varepsilon=\max\{0,\,|\xi|-\varepsilon\}$

  6. The ε-insensitive loss function (figure)
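
To make the loss concrete, a minimal MATLAB sketch that plots it over a range of residuals (the tube width eps_tube is an assumed value):

  % Plot the epsilon-insensitive loss over residuals r = y - f(x).
  eps_tube = 0.5;                          % assumed tube width
  r = linspace(-2, 2, 401);                % residuals
  loss = max(0, abs(r) - eps_tube);        % |r|_eps = max(0, |r| - eps)
  plot(r, loss); xlabel('residual y - f(x)'); ylabel('\epsilon-insensitive loss');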

  7. Dual Problem and Quadratic Programs • The key idea is to construct a Lagrange function from the objective (primal) and the corresponding constraints, by introducing a dual set of variables. • The Lagrange function has a saddle point with respect to the primal and dual variables at the solution. • Lagrange function: $L=\tfrac{1}{2}\|w\|^2+C\sum_{i=1}^{\ell}(\xi_i+\xi_i^*)-\sum_{i=1}^{\ell}\alpha_i(\varepsilon+\xi_i-y_i+\langle w,x_i\rangle+b)-\sum_{i=1}^{\ell}\alpha_i^*(\varepsilon+\xi_i^*+y_i-\langle w,x_i\rangle-b)-\sum_{i=1}^{\ell}(\eta_i\xi_i+\eta_i^*\xi_i^*)$ • Subject to $\alpha_i,\alpha_i^*\ge 0$ and $\eta_i,\eta_i^*\ge 0$

  8. Taking the partial derivatives (saddle-point condition), we get • $\partial_b L=\sum_{i=1}^{\ell}(\alpha_i^*-\alpha_i)=0$ • $\partial_w L=w-\sum_{i=1}^{\ell}(\alpha_i-\alpha_i^*)x_i=0$ • $\partial_{\xi_i^{(*)}}L=C-\alpha_i^{(*)}-\eta_i^{(*)}=0$ • The conditions for optimality yield the following dual problem: maximize $-\tfrac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)\langle x_i,x_j\rangle-\varepsilon\sum_{i=1}^{\ell}(\alpha_i+\alpha_i^*)+\sum_{i=1}^{\ell}y_i(\alpha_i-\alpha_i^*)$ • Subject to $\sum_{i=1}^{\ell}(\alpha_i-\alpha_i^*)=0$ and $\alpha_i,\alpha_i^*\in[0,C]$

  9. Finally, we eliminate the dual variables $\eta_i,\eta_i^*$ by substituting the partial derivatives, and we get • $w=\sum_{i=1}^{\ell}(\alpha_i-\alpha_i^*)x_i$, hence $f(x)=\sum_{i=1}^{\ell}(\alpha_i-\alpha_i^*)\langle x_i,x\rangle+b$ • This is called the "Support Vector expansion", in which $w$ can be completely described as a linear combination of the training patterns $x_i$. • The function $f$ is represented by the SVs; therefore it is independent of the dimensionality of the input space $\mathcal{X}$ and depends only on the number of SVs. • We will define the meaning of "Support Vector" later. • Computing $\alpha_i$ and $\alpha_i^*$ is a quadratic programming problem; popular methods are shown below: • Interior point algorithm • Simplex algorithm
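
As a rough illustration of the quadratic program behind this dual, a minimal sketch using the Optimization Toolbox's quadprog on a toy 1-D linear dataset; the data, C and eps_tube are assumed values:

  x = (0:0.5:10)'; y = 3*x + 2 + 0.1*randn(size(x));   % toy data (assumed)
  C = 10; eps_tube = 0.1; n = numel(y);
  K = x*x';                                  % linear kernel matrix
  H = [K -K; -K K] + 1e-8*eye(2*n);          % dual Hessian in z = [alpha; alpha*]
  f = [eps_tube - y; eps_tube + y];          % linear term of the (minimized) dual
  Aeq = [ones(1,n) -ones(1,n)]; beq = 0;     % sum_i (alpha_i - alpha_i*) = 0
  lb = zeros(2*n,1); ub = C*ones(2*n,1);     % 0 <= alpha_i, alpha_i* <= C
  z = quadprog(H, f, [], [], Aeq, beq, lb, ub);
  beta = z(1:n) - z(n+1:end);                % alpha - alpha*
  w = x'*beta;                               % SV expansion, linear kernel
  sv = find(z(1:n) > 1e-6 & z(1:n) < C - 1e-6);   % a "free" SV (assumed to exist)
  b = y(sv(1)) - w*x(sv(1)) - eps_tube;      % b from the corresponding active constraint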

  10. Computing $b$ • The parameter $b$ can be computed from the KKT conditions (complementary slackness), which state that at the optimal solution the product between the dual variables and the constraints has to vanish.

  11. KKT (Karush–Kuhn–Tucker) conditions: • KKT conditions extend the idea of Lagrange multipliers to handle inequality constraints. • Consider the following nonlinear optimization problem: • Minimize $f(x)$ • Subject to $g_i(x)\le 0$ and $h_j(x)=0$, where $i=1,\dots,m$ and $j=1,\dots,p$ • To solve the problem with inequalities, the active inequality constraints are treated as equalities at the critical points.

  12. The following necessary conditions hold: if $x^*$ is a local minimum, then there exist constants $\mu_i\ge 0$ ($i=1,\dots,m$) and $\lambda_j$ ($j=1,\dots,p$), called KKT multipliers, such that • Stationarity: $\nabla f(x^*)+\sum_{i}\mu_i\nabla g_i(x^*)+\sum_{j}\lambda_j\nabla h_j(x^*)=0$ (this is the saddle-point condition in the dual problem) • Primal feasibility: $g_i(x^*)\le 0$ and $h_j(x^*)=0$ • Dual feasibility: $\mu_i\ge 0$ • Complementary slackness: $\mu_i\,g_i(x^*)=0$ (this condition enforces either $\mu_i$ to be zero or $g_i(x^*)$ to be zero)
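
A small worked example of these conditions: minimize $f(x)=x^2$ subject to $g(x)=1-x\le 0$. Stationarity gives $2x-\mu=0$ and complementary slackness gives $\mu(1-x)=0$ with $\mu\ge 0$. Taking $\mu=0$ forces $x=0$, which violates $g(x)\le 0$, so the constraint must be active: $x^*=1$, $\mu=2\ge 0$, and all four conditions hold.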

  13. Original problem: minimize $\tfrac{1}{2}\|w\|^2+C\sum_{i=1}^{\ell}(\xi_i+\xi_i^*)$ • Subject to $y_i-\langle w,x_i\rangle-b\le\varepsilon+\xi_i$, $\langle w,x_i\rangle+b-y_i\le\varepsilon+\xi_i^*$, $\xi_i,\xi_i^*\ge 0$ • Standard form for KKT • Objective: $f(w,b,\xi,\xi^*)=\tfrac{1}{2}\|w\|^2+C\sum_{i=1}^{\ell}(\xi_i+\xi_i^*)$ • Constraints ($g\le 0$): $y_i-\langle w,x_i\rangle-b-\varepsilon-\xi_i\le 0$, $\langle w,x_i\rangle+b-y_i-\varepsilon-\xi_i^*\le 0$, $-\xi_i\le 0$, $-\xi_i^*\le 0$

  14. Complementary slackness condition: • There exist KKT multipliers $\alpha_i^{(*)}$ and $\eta_i^{(*)}$ (the Lagrange multipliers in $L$) that meet this condition: • (1) $\alpha_i(\varepsilon+\xi_i-y_i+\langle w,x_i\rangle+b)=0$ • (2) $\alpha_i^*(\varepsilon+\xi_i^*+y_i-\langle w,x_i\rangle-b)=0$ • (3) $(C-\alpha_i)\,\xi_i=0$ • (4) $(C-\alpha_i^*)\,\xi_i^*=0$ • From (1) and (2), we get $\alpha_i\alpha_i^*=0$: both constraints cannot be active at the same time (for $\varepsilon>0$). • From (3) and (4), we see that the slack variables can be nonzero only for samples with $\alpha_i^{(*)}=C$. • Conclusion: • Only samples $(x_i,y_i)$ with corresponding $\alpha_i^{(*)}=C$ lie outside the ε-insensitive tube. • $\alpha_i\alpha_i^*=0$, i.e. there can never be a set of dual variables $\alpha_i$ and $\alpha_i^*$ which are both simultaneously nonzero.

  15. From the previous page, we can conclude: • $b\ge y_i-\langle w,x_i\rangle-\varepsilon$ if $\alpha_i<C$, and $b\le y_i-\langle w,x_i\rangle-\varepsilon$ if $\alpha_i>0$ • $b\le y_i-\langle w,x_i\rangle+\varepsilon$ if $\alpha_i^*<C$, and $b\ge y_i-\langle w,x_i\rangle+\varepsilon$ if $\alpha_i^*>0$ • In conjunction, the two sets of inequalities above bound $b$ from both sides. • If some $\alpha_i^{(*)}\in(0,C)$, the inequalities become equalities and $b$ can be computed exactly.

  16. Sparseness of the Support Vector Expansion • The previous conclusion shows that only for $|f(x_i)-y_i|\ge\varepsilon$ can the Lagrange multipliers be nonzero. • In other words, for all samples inside the ε-insensitive tube the $\alpha_i$ and $\alpha_i^*$ vanish. • Therefore, we have a sparse expansion of $w$ in terms of $x_i$, and the samples that come with non-vanishing coefficients are called "Support Vectors".

  17. Kernel Trick • The next step is to make the SV algorithm nonlinear. This can be achieved by simply preprocessing the training patterns $x_i$ with a map $\Phi:\mathcal{X}\to\mathcal{F}$ into some feature space $\mathcal{F}$. • The dimensionality of this space is only implicitly defined. • Example: $\Phi(x_1,x_2)=(x_1^2,\sqrt{2}\,x_1x_2,x_2^2)$, so that $\langle\Phi(x),\Phi(x')\rangle=\langle x,x'\rangle^2$ (checked numerically in the sketch below) • An explicit mapping can easily become computationally infeasible • The number of different monomial features (polynomial mapping) of degree $p$ grows combinatorially with the input dimension • The computationally cheaper way: use a kernel $k(x,x')=\langle\Phi(x),\Phi(x')\rangle$ and replace the dot products in the dual by kernel evaluations • The kernel should satisfy Mercer's condition
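
A quick numerical check of the degree-2 example mapping (the specific points x and xp are arbitrary):

  phi = @(x) [x(1)^2, sqrt(2)*x(1)*x(2), x(2)^2];   % explicit degree-2 feature map
  x = [1 2]; xp = [3 -1];                           % arbitrary test points
  disp([phi(x)*phi(xp)'  (x*xp')^2])                % both equal <x,x'>^2 (here: 1)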

  18. In the end, the nonlinear regression function takes the form $f(x)=\sum_{i=1}^{\ell}(\alpha_i-\alpha_i^*)\,k(x_i,x)+b$ • Possible kernel functions • Linear kernel: $k(x,x')=\langle x,x'\rangle$ • Polynomial kernel: $k(x,x')=(\langle x,x'\rangle+c)^d$ • Multi-layer perceptron kernel: $k(x,x')=\tanh(\kappa\langle x,x'\rangle+\theta)$ • Gaussian radial basis function kernel: $k(x,x')=\exp\big(-\|x-x'\|^2/(2\sigma^2)\big)$
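
A minimal sketch of the kernelized expansion with the Gaussian RBF kernel for 1-D column-vector inputs; beta (= alpha - alpha*), b and the training inputs xtr are assumed to come from a solved dual (for example the quadprog sketch above), and sigma is an assumed bandwidth:

  rbf = @(xq, xtr, sigma) exp(-(xq - xtr').^2 / (2*sigma^2));    % k(x, x') for 1-D columns
  fsv = @(xq, xtr, beta, b, sigma) rbf(xq, xtr, sigma)*beta + b; % f(x) = sum_i beta_i k(x_i, x) + b
  % Example call (all names are placeholders): yq = fsv(xq, xtr, beta, b, 1.0);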

  19. Neural Network • The most common approach uses feed-forward networks which employ a sliding window over the input sequence (see the sketch below). • Each neuron consists of three parts: inputs, weights and an (activated) output. • Common activation functions: sigmoid function $\sigma(z)=1/(1+e^{-z})$ • Hyperbolic tangent $\tanh(z)$ …
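
A minimal sketch of the sliding-window construction for one-step-ahead prediction; the toy series and window length p are assumed values:

  s = sin(0.2*(1:200))';                 % toy time series (assumed)
  p = 5;                                 % window length (assumed)
  T = numel(s);
  X = zeros(T - p, p); y = zeros(T - p, 1);
  for t = 1:T - p
      X(t, :) = s(t:t+p-1)';             % inputs: p past values
      y(t)    = s(t+p);                  % target: the next value
  end
  % X and y can now be fed to SVR, a feed-forward network, or ANFIS.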

  20. Example: 2-layer feed-forward neural network • Training a neural network: • Gradient-descent related methods • Evolutionary methods • Their simple implementation and the mostly local dependencies exhibited in the structure allow for fast, parallel implementations in hardware. …

  21. Learning (Optimization) Algorithm • Error function: $E=\tfrac{1}{2}\sum_{n}(y_n-\hat{y}_n)^2$ • Chain rule: $\dfrac{\partial E}{\partial w_{ij}}=\dfrac{\partial E}{\partial a_j}\,\dfrac{\partial a_j}{\partial w_{ij}}$, where $w_{ij}$ is the weight from input $i$ to hidden neuron $j$ and $a_j$ is the activation of hidden neuron $j$ • Batch learning and online learning with NN • The universal approximation theorem states that a feed-forward network with a single hidden layer can approximate continuous functions under mild assumptions on the activation function.
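
A minimal sketch of batch gradient descent via the chain rule for a 2-layer network (sigmoid hidden layer, linear output); it assumes the X and y from the sliding-window sketch above, and the hidden size H and learning rate lr are assumed values:

  [N, p] = size(X); H = 8; lr = 0.01;
  W1 = 0.1*randn(H, p); b1 = zeros(H, 1);    % input -> hidden weights
  W2 = 0.1*randn(1, H); b2 = 0;              % hidden -> output weights
  sig = @(z) 1./(1 + exp(-z));
  for epoch = 1:2000
      A = sig(W1*X' + b1);                   % hidden activations, H x N
      yhat = (W2*A + b2)';                   % network output, N x 1
      e = yhat - y;                          % error, E = (1/2N) * sum(e.^2)
      dW2 = (e'*A')/N;  db2 = mean(e);       % chain rule: output layer gradients
      dA  = (W2'*e') .* A .* (1 - A);        % back-propagated through the sigmoid
      dW1 = (dA*X)/N;   db1 = mean(dA, 2);   % chain rule: hidden layer gradients
      W1 = W1 - lr*dW1; b1 = b1 - lr*db1;
      W2 = W2 - lr*dW2; b2 = b2 - lr*db2;
  end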

  22. Adaptive Neuro-Fuzzy Inference System • Combines the advantages of fuzzy logic and neural networks. • Fuzzy rules are generated by input-space partitioning or fuzzy C-means clustering • Gaussian membership function: $\mu_A(x)=\exp\big(-(x-c)^2/(2\sigma^2)\big)$ • TSK-type fuzzy IF-THEN rules: Rule $j$: IF $x_1$ IS $A_{1j}$ AND $x_2$ IS $A_{2j}$ AND … AND $x_n$ IS $A_{nj}$ THEN $y_j=a_{0j}+a_{1j}x_1+\dots+a_{nj}x_n$

  23. Input space partitioning • For 2-dimensional input data, a grid partition of the input space into 9 rule regions (figure)

  24. Fuzzy C-means clustering • For $c$ clusters, the degree of belonging $u_{ij}$ of sample $x_j$ to cluster $i$ satisfies: • $u_{ij}\in[0,1]$, $\sum_{i=1}^{c}u_{ij}=1$ … (5) • where $u_{ij}=1\big/\sum_{k=1}^{c}\big(\|x_j-m_i\|/\|x_j-m_k\|\big)^{2/(q-1)}$ … (6) • Objective function • $J=\sum_{i=1}^{c}\sum_{j=1}^{N}u_{ij}^{\,q}\,\|x_j-m_i\|^2$ … (7) • with fuzzifier $q>1$ … (8) • To minimize $J$, we take the derivatives of $J$ with respect to $m_i$ and we get the mean of cluster $i$: $m_i=\sum_{j}u_{ij}^{\,q}x_j\big/\sum_{j}u_{ij}^{\,q}$ … (9) • Fuzzy C-means algorithm • Randomly initialize $U=[u_{ij}]$ so that it satisfies (5) • Calculate the means $m_i$ of each cluster using (9) • Update $u_{ij}$ according to the new means using (6) • Stop when the change in $U$ or in $J$ is small enough
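
A minimal sketch of the algorithm above on toy 2-D data; the data, number of clusters c and fuzzifier q are assumed values:

  X = [randn(50,2); randn(50,2)+4];            % toy data (assumed)
  c = 2; q = 2; N = size(X,1);
  U = rand(c, N); U = U ./ sum(U, 1);          % random memberships, columns sum to 1
  for iter = 1:100
      Uq = U.^q;
      M = (Uq * X) ./ sum(Uq, 2);              % cluster means, eq. (9)
      D = zeros(c, N);
      for i = 1:c                              % distances ||x_j - m_i||
          D(i,:) = sqrt(sum((X - M(i,:)).^2, 2))' + eps;
      end
      Unew = 1 ./ (D.^(2/(q-1)) .* sum(D.^(-2/(q-1)), 1));   % membership update, eq. (6)
      if max(abs(Unew(:) - U(:))) < 1e-6
          U = Unew; break
      end
      U = Unew;
  end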

  25. 1st layer: fuzzification layer • 2nd layer: conjunction layer • 3rd layer: normalization layer • 4th layer: inference layer • 5th layer: output layer
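
A minimal sketch of a single forward pass through these five layers, for two inputs with two Gaussian membership functions each; all centers, widths and TSK consequent parameters below are assumed toy values:

  x1 = 0.3; x2 = 0.7;                           % crisp inputs (assumed)
  c  = [0 1; 0 1];  s = [0.5 0.5; 0.5 0.5];     % MF centers/widths per input (assumed)
  gauss = @(x, c, s) exp(-(x - c).^2 ./ (2*s.^2));
  mu1 = gauss(x1, c(1,:), s(1,:));              % layer 1: fuzzification of x1
  mu2 = gauss(x2, c(2,:), s(2,:));              % layer 1: fuzzification of x2
  w = reshape(mu1' * mu2, 1, []);               % layer 2: rule firing strengths (product)
  wn = w / sum(w);                              % layer 3: normalization
  a = [1 2 0.5; 0 1 -1; 2 0 1; -1 1 0.5];       % TSK consequents [a0 a1 a2] per rule (assumed)
  frule = a(:,1) + a(:,2)*x1 + a(:,3)*x2;       % layer 4: first-order TSK rule outputs
  yhat = wn * frule;                            % layer 5: weighted sum -> crisp output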

  26. Comparison • Neural Network vs. SVR • Local minimum vs. global minimum • Choice of kernel/activation function • Computational complexity • Parallel computation of neural network • Online learning vs. batch learning • ANFIS vs. Neural Network • Convergence speed • Number of fuzzy rules

  27. Example: Function Approximation (1) • ANFIS
  x = (0:0.5:10)'; w = 5*rand; b = 4*rand;
  y = w*x + b;
  trnData = [x y];
  tic;
  numMFs = 5; mfType = 'gbellmf'; epoch_n = 20;
  in_fis = genfis1(trnData, numMFs, mfType);
  out_fis = anfis(trnData, in_fis, epoch_n);
  time = toc;
  h = evalfis(x, out_fis);
  plot(x, y, x, h); legend('Training Data', 'ANFIS Output');
  RMSE = sqrt(sum((h - y).^2)/length(h));
  disp(['Time = ', num2str(time), ' RMSE = ', num2str(RMSE)])
  ANFIS: Time = 0.015707, RMSE = 5.8766e-06

  28. NN
  x = (0:0.5:10)'; w = 5*rand; b = 4*rand;
  y = w*x + b;
  trnData = [x y];
  tic;
  net = feedforwardnet(5, 'trainlm');
  model = train(net, trnData(:,1)', trnData(:,2)');
  time = toc;
  h = model(x')';
  plot(x, y, x, h); legend('Training Data', 'NN Output');
  RMSE = sqrt(sum((h - y).^2)/length(h));
  disp(['Time = ', num2str(time), ' RMSE = ', num2str(RMSE)])
  NN: Time = 4.3306, RMSE = 0.00010074

  29. SVR
  clear; clc;
  addpath('./LibSVM'); addpath('./LibSVM/matlab');
  x = (0:0.5:10)'; w = 5*rand; b = 4*rand;
  y = w*x + b;
  trnData = [x y];
  tic;
  model = svmtrain(y, x, '-s 3 -t 0 -c 2.2 -p 1e-7');
  time = toc;
  h = svmpredict(y, x, model);
  plot(x, y, x, h); legend('Training Data', 'SVR Output');
  RMSE = sqrt(sum((h - y).^2)/length(h));
  disp(['Time = ', num2str(time), ' RMSE = ', num2str(RMSE)])
  SVR: Time = 0.00083499, RMSE = 6.0553e-08

  30. Given function: w = 3.277389450887783 and b = 0.684746751246247
  % model struct
  %   model.SVs     : sparse matrix of SVs
  %   model.sv_coef : SV coefficients
  %   model.rho     : -b of f(x) = w*x + b
  % For the linear kernel:
  %   h_2 = full(model.SVs)'*model.sv_coef*x - model.rho;
  full(model.SVs)'*model.sv_coef = 3.277389430887783
  -model.rho = 0.684746851246246

  31. Example: Function Approximation (2) • ANFIS
  x = (0:0.1:10)';
  y = sin(2*x)./exp(x/5);
  trnData = [x y];
  tic;
  numMFs = 5; mfType = 'gbellmf'; epoch_n = 20;
  in_fis = genfis1(trnData, numMFs, mfType);
  out_fis = anfis(trnData, in_fis, epoch_n);
  time = toc;
  h = evalfis(x, out_fis);
  plot(x, y, x, h); legend('Training Data', 'ANFIS Output');
  RMSE = sqrt(sum((h - y).^2)/length(h));
  disp(['Time = ', num2str(time), ' RMSE = ', num2str(RMSE)])
  ANFIS: Time = 0.049087, RMSE = 0.042318

  32. NN
  x = (0:0.1:10)';
  y = sin(2*x)./exp(x/5);
  trnData = [x y];
  tic;
  net = feedforwardnet(5, 'trainlm');
  model = train(net, trnData(:,1)', trnData(:,2)');
  time = toc;
  h = model(x')';
  plot(x, y, x, h); legend('Training Data', 'NN Output');
  RMSE = sqrt(sum((h - y).^2)/length(h));
  disp(['Time = ', num2str(time), ' RMSE = ', num2str(RMSE)])
  NN: Time = 0.77625, RMSE = 0.012563

  33. SVR
  clear; clc;
  addpath('./LibSVM'); addpath('./LibSVM/matlab');
  x = (0:0.1:10)';
  y = sin(2*x)./exp(x/5);
  trnData = [x y];
  tic;
  model = svmtrain(y, x, '-s 3 -t 0 -c 2.2 -p 1e-7');
  time = toc;
  h = svmpredict(y, x, model);
  plot(x, y, x, h); legend('Training Data', 'SVR Output');
  RMSE = sqrt(sum((h - y).^2)/length(h));
  disp(['Time = ', num2str(time), ' RMSE = ', num2str(RMSE)])

  34. RBF Kernel Time = 0.0039602 RMSE = 0.0036972

  35. Polynomial Kernel Time = 20.9686 RMSE = 0.34124

  36. Linear Kernel Time = 0.0038785 RMSE = 0.33304
