
Machine Learning Based Models for Time Series Prediction


Presentation Transcript


  1. Machine Learning Based Models for Time Series Prediction 2014/3

  2. Outline • Support Vector Regression • Neural Network • Adaptive Neuro-Fuzzy Inference System • Comparison

  3. Support Vector Regression • Basic Idea • Given a dataset $\{(x_1,y_1),\dots,(x_\ell,y_\ell)\}\subset\mathcal{X}\times\mathbb{R}$ • Our goal is to find a function $f(x)$ which deviates by at most $\varepsilon$ from the actual target $y_i$ for all training data. • In the linear case, $f$ takes the form $f(x)=\langle w,x\rangle+b$, where $\langle\cdot,\cdot\rangle$ denotes the dot product in $\mathcal{X}$. • "Flatness" in this case means a small $w$ (less sensitive to perturbations in the features). • Therefore, we can write the problem as follows: minimize $\tfrac{1}{2}\|w\|^2$ • Subject to $y_i-\langle w,x_i\rangle-b\le\varepsilon$ and $\langle w,x_i\rangle+b-y_i\le\varepsilon$

  4. Note that +ε and -ε are offsets measured along the target axis; they are not an actual geometric (perpendicular) distance from the regression function.

  5. Soft Margin and Slack Variables • Ideally $f$ approximates all pairs $(x_i,y_i)$ with $\varepsilon$ precision; however, we may also want to allow some errors. • The soft-margin loss function and slack variables $\xi_i,\xi_i^*$ were introduced into the SVR: minimize $\tfrac{1}{2}\|w\|^2+C\sum_{i=1}^{\ell}(\xi_i+\xi_i^*)$ • Subject to $y_i-\langle w,x_i\rangle-b\le\varepsilon+\xi_i$, $\langle w,x_i\rangle+b-y_i\le\varepsilon+\xi_i^*$, $\xi_i,\xi_i^*\ge 0$ • $C>0$ is the regularization parameter which determines the trade-off between flatness and the tolerance of errors. • $\xi_i,\xi_i^*$ are slack variables which measure how far a sample lies outside the ε-insensitive tube. • The ε-insensitive loss function: $|\xi|_\varepsilon=\max\{0,\,|\xi|-\varepsilon\}$

  6. The ε-insensitive loss function (figure)
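
To make the loss concrete, a minimal MATLAB sketch that plots it over a range of residuals (the tube width eps_tube is an assumed value):

  % Plot the epsilon-insensitive loss over residuals r = y - f(x).
  eps_tube = 0.5;                          % assumed tube width
  r = linspace(-2, 2, 401);                % residuals
  loss = max(0, abs(r) - eps_tube);        % |r|_eps = max(0, |r| - eps)
  plot(r, loss); xlabel('residual y - f(x)'); ylabel('\epsilon-insensitive loss');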

  7. Dual Problem and Quadratic Programs • The key idea is to construct a Lagrange function from the objective (primal) and the corresponding constraints, by introducing a dual set of variables. • The Lagrange function has a saddle point with respect to the primal and dual variables at the solution. • Lagrange function: $L=\tfrac{1}{2}\|w\|^2+C\sum_{i=1}^{\ell}(\xi_i+\xi_i^*)-\sum_{i=1}^{\ell}\alpha_i(\varepsilon+\xi_i-y_i+\langle w,x_i\rangle+b)-\sum_{i=1}^{\ell}\alpha_i^*(\varepsilon+\xi_i^*+y_i-\langle w,x_i\rangle-b)-\sum_{i=1}^{\ell}(\eta_i\xi_i+\eta_i^*\xi_i^*)$ • Subject to $\alpha_i,\alpha_i^*\ge 0$ and $\eta_i,\eta_i^*\ge 0$

  8. Taking the partial derivatives (saddle-point condition), we get • $\partial_b L=\sum_{i=1}^{\ell}(\alpha_i^*-\alpha_i)=0$ • $\partial_w L=w-\sum_{i=1}^{\ell}(\alpha_i-\alpha_i^*)x_i=0$ • $\partial_{\xi_i^{(*)}}L=C-\alpha_i^{(*)}-\eta_i^{(*)}=0$ • The conditions for optimality yield the following dual problem: maximize $-\tfrac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)\langle x_i,x_j\rangle-\varepsilon\sum_{i=1}^{\ell}(\alpha_i+\alpha_i^*)+\sum_{i=1}^{\ell}y_i(\alpha_i-\alpha_i^*)$ • Subject to $\sum_{i=1}^{\ell}(\alpha_i-\alpha_i^*)=0$ and $\alpha_i,\alpha_i^*\in[0,C]$

  9. Finally, we eliminate the dual variables $\eta_i,\eta_i^*$ by substituting the partial derivatives, and we get • $w=\sum_{i=1}^{\ell}(\alpha_i-\alpha_i^*)x_i$, hence $f(x)=\sum_{i=1}^{\ell}(\alpha_i-\alpha_i^*)\langle x_i,x\rangle+b$ • This is called the "Support Vector expansion", in which $w$ can be completely described as a linear combination of the training patterns $x_i$. • The function $f$ is represented by the SVs; therefore it is independent of the dimensionality of the input space $\mathcal{X}$ and depends only on the number of SVs. • We will define the meaning of "Support Vector" later. • Computing $\alpha_i$ and $\alpha_i^*$ is a quadratic programming problem; popular methods are shown below: • Interior point algorithm • Simplex algorithm
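
As a rough illustration of the quadratic program behind this dual, a minimal sketch using the Optimization Toolbox's quadprog on a toy 1-D linear dataset; the data, C and eps_tube are assumed values:

  x = (0:0.5:10)'; y = 3*x + 2 + 0.1*randn(size(x));   % toy data (assumed)
  C = 10; eps_tube = 0.1; n = numel(y);
  K = x*x';                                  % linear kernel matrix
  H = [K -K; -K K] + 1e-8*eye(2*n);          % dual Hessian in z = [alpha; alpha*]
  f = [eps_tube - y; eps_tube + y];          % linear term of the (minimized) dual
  Aeq = [ones(1,n) -ones(1,n)]; beq = 0;     % sum_i (alpha_i - alpha_i*) = 0
  lb = zeros(2*n,1); ub = C*ones(2*n,1);     % 0 <= alpha_i, alpha_i* <= C
  z = quadprog(H, f, [], [], Aeq, beq, lb, ub);
  beta = z(1:n) - z(n+1:end);                % alpha - alpha*
  w = x'*beta;                               % SV expansion, linear kernel
  sv = find(z(1:n) > 1e-6 & z(1:n) < C - 1e-6);   % a "free" SV (assumed to exist)
  b = y(sv(1)) - w*x(sv(1)) - eps_tube;      % b from the corresponding active constraint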

  10. Computing $b$ • The parameter $b$ can be computed from the KKT conditions (complementary slackness), which state that at the optimal solution the product between the dual variables and the constraints has to vanish.

  11. KKT (Karush–Kuhn–Tucker) conditions: • KKT conditions extend the idea of Lagrange multipliers to handle inequality constraints. • Consider the following nonlinear optimization problem: • Minimize $f(x)$ • Subject to $g_i(x)\le 0$ and $h_j(x)=0$, where $i=1,\dots,m$ and $j=1,\dots,p$ • To solve the problem with inequalities, the active inequality constraints are treated as equalities at the critical points.

  12. The following necessary conditions hold: if $x^*$ is a local minimum, then there exist constants $\mu_i\ge 0$ ($i=1,\dots,m$) and $\lambda_j$ ($j=1,\dots,p$), called KKT multipliers, such that • Stationarity: $\nabla f(x^*)+\sum_{i}\mu_i\nabla g_i(x^*)+\sum_{j}\lambda_j\nabla h_j(x^*)=0$ (this is the saddle-point condition in the dual problem) • Primal feasibility: $g_i(x^*)\le 0$ and $h_j(x^*)=0$ • Dual feasibility: $\mu_i\ge 0$ • Complementary slackness: $\mu_i\,g_i(x^*)=0$ (this condition enforces either $\mu_i$ to be zero or $g_i(x^*)$ to be zero)
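
A small worked example of these conditions: minimize $f(x)=x^2$ subject to $g(x)=1-x\le 0$. Stationarity gives $2x-\mu=0$ and complementary slackness gives $\mu(1-x)=0$ with $\mu\ge 0$. Taking $\mu=0$ forces $x=0$, which violates $g(x)\le 0$, so the constraint must be active: $x^*=1$, $\mu=2\ge 0$, and all four conditions hold.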

  13. Original problem: minimize $\tfrac{1}{2}\|w\|^2+C\sum_{i=1}^{\ell}(\xi_i+\xi_i^*)$ • Subject to $y_i-\langle w,x_i\rangle-b\le\varepsilon+\xi_i$, $\langle w,x_i\rangle+b-y_i\le\varepsilon+\xi_i^*$, $\xi_i,\xi_i^*\ge 0$ • Standard form for KKT • Objective: $f(w,b,\xi,\xi^*)=\tfrac{1}{2}\|w\|^2+C\sum_{i=1}^{\ell}(\xi_i+\xi_i^*)$ • Constraints ($g\le 0$): $y_i-\langle w,x_i\rangle-b-\varepsilon-\xi_i\le 0$, $\langle w,x_i\rangle+b-y_i-\varepsilon-\xi_i^*\le 0$, $-\xi_i\le 0$, $-\xi_i^*\le 0$

  14. Complementary slackness condition: • There exist KKT multipliers $\alpha_i^{(*)}$ and $\eta_i^{(*)}$ (the Lagrange multipliers in $L$) that meet this condition: • (1) $\alpha_i(\varepsilon+\xi_i-y_i+\langle w,x_i\rangle+b)=0$ • (2) $\alpha_i^*(\varepsilon+\xi_i^*+y_i-\langle w,x_i\rangle-b)=0$ • (3) $(C-\alpha_i)\,\xi_i=0$ • (4) $(C-\alpha_i^*)\,\xi_i^*=0$ • From (1) and (2), we get $\alpha_i\alpha_i^*=0$: both constraints cannot be active at the same time (for $\varepsilon>0$). • From (3) and (4), we see that the slack variables can be nonzero only for samples with $\alpha_i^{(*)}=C$. • Conclusion: • Only samples $(x_i,y_i)$ with corresponding $\alpha_i^{(*)}=C$ lie outside the ε-insensitive tube. • $\alpha_i\alpha_i^*=0$, i.e. there can never be a set of dual variables $\alpha_i$ and $\alpha_i^*$ which are both simultaneously nonzero.

  15. From the previous page, we can conclude: • $b\ge y_i-\langle w,x_i\rangle-\varepsilon$ if $\alpha_i<C$, and $b\le y_i-\langle w,x_i\rangle-\varepsilon$ if $\alpha_i>0$ • $b\le y_i-\langle w,x_i\rangle+\varepsilon$ if $\alpha_i^*<C$, and $b\ge y_i-\langle w,x_i\rangle+\varepsilon$ if $\alpha_i^*>0$ • In conjunction, the two sets of inequalities above bound $b$ from both sides. • If some $\alpha_i^{(*)}\in(0,C)$, the inequalities become equalities and $b$ can be computed exactly.

  16. Sparseness of the Support Vector Expansion • The previous conclusion shows that only for $|f(x_i)-y_i|\ge\varepsilon$ can the Lagrange multipliers be nonzero. • In other words, for all samples inside the ε-insensitive tube the $\alpha_i$ and $\alpha_i^*$ vanish. • Therefore, we have a sparse expansion of $w$ in terms of $x_i$, and the samples that come with non-vanishing coefficients are called "Support Vectors".

  17. Kernel Trick • The next step is to make the SV algorithm nonlinear. This can be achieved by simply preprocessing the training patterns $x_i$ with a map $\Phi:\mathcal{X}\to\mathcal{F}$ into some feature space $\mathcal{F}$. • The dimensionality of this space is only implicitly defined. • Example: $\Phi(x_1,x_2)=(x_1^2,\sqrt{2}\,x_1x_2,x_2^2)$, so that $\langle\Phi(x),\Phi(x')\rangle=\langle x,x'\rangle^2$ (checked numerically in the sketch below) • An explicit mapping can easily become computationally infeasible • The number of different monomial features (polynomial mapping) of degree $p$ grows combinatorially with the input dimension • The computationally cheaper way: use a kernel $k(x,x')=\langle\Phi(x),\Phi(x')\rangle$ and replace the dot products in the dual by kernel evaluations • The kernel should satisfy Mercer's condition
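
A quick numerical check of the degree-2 example mapping (the specific points x and xp are arbitrary):

  phi = @(x) [x(1)^2, sqrt(2)*x(1)*x(2), x(2)^2];   % explicit degree-2 feature map
  x = [1 2]; xp = [3 -1];                           % arbitrary test points
  disp([phi(x)*phi(xp)'  (x*xp')^2])                % both equal <x,x'>^2 (here: 1)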

  18. In the end, the nonlinear regression function takes the form $f(x)=\sum_{i=1}^{\ell}(\alpha_i-\alpha_i^*)\,k(x_i,x)+b$ • Possible kernel functions • Linear kernel: $k(x,x')=\langle x,x'\rangle$ • Polynomial kernel: $k(x,x')=(\langle x,x'\rangle+c)^d$ • Multi-layer perceptron kernel: $k(x,x')=\tanh(\kappa\langle x,x'\rangle+\theta)$ • Gaussian radial basis function kernel: $k(x,x')=\exp\big(-\|x-x'\|^2/(2\sigma^2)\big)$
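
A minimal sketch of the kernelized expansion with the Gaussian RBF kernel for 1-D column-vector inputs; beta (= alpha - alpha*), b and the training inputs xtr are assumed to come from a solved dual (for example the quadprog sketch above), and sigma is an assumed bandwidth:

  rbf = @(xq, xtr, sigma) exp(-(xq - xtr').^2 / (2*sigma^2));    % k(x, x') for 1-D columns
  fsv = @(xq, xtr, beta, b, sigma) rbf(xq, xtr, sigma)*beta + b; % f(x) = sum_i beta_i k(x_i, x) + b
  % Example call (all names are placeholders): yq = fsv(xq, xtr, beta, b, 1.0);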

  19. Neural Network • The most common approach uses feed-forward networks which employ a sliding window over the input sequence (see the sketch below). • Each neuron consists of three parts: inputs, weights and an (activated) output. • Common activation functions: sigmoid function $\sigma(z)=1/(1+e^{-z})$ • Hyperbolic tangent $\tanh(z)$ …
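
A minimal sketch of the sliding-window construction for one-step-ahead prediction; the toy series and window length p are assumed values:

  s = sin(0.2*(1:200))';                 % toy time series (assumed)
  p = 5;                                 % window length (assumed)
  T = numel(s);
  X = zeros(T - p, p); y = zeros(T - p, 1);
  for t = 1:T - p
      X(t, :) = s(t:t+p-1)';             % inputs: p past values
      y(t)    = s(t+p);                  % target: the next value
  end
  % X and y can now be fed to SVR, a feed-forward network, or ANFIS.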

  20. Example: 2-layer feed-forward neural network • Training a neural network: • Gradient-descent related methods • Evolutionary methods • Their simple implementation and the mostly local dependencies exhibited in the structure allow for fast, parallel implementations in hardware. …

  21. Learning (Optimization) Algorithm • Error function: $E=\tfrac{1}{2}\sum_{n}(y_n-\hat{y}_n)^2$ • Chain rule: $\dfrac{\partial E}{\partial w_{ij}}=\dfrac{\partial E}{\partial a_j}\,\dfrac{\partial a_j}{\partial w_{ij}}$, where $w_{ij}$ is the weight from input $i$ to hidden neuron $j$ and $a_j$ is the activation of hidden neuron $j$ • Batch learning and online learning with NN • The universal approximation theorem states that a feed-forward network with a single hidden layer can approximate continuous functions under mild assumptions on the activation function.
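
A minimal sketch of batch gradient descent via the chain rule for a 2-layer network (sigmoid hidden layer, linear output); it assumes the X and y from the sliding-window sketch above, and the hidden size H and learning rate lr are assumed values:

  [N, p] = size(X); H = 8; lr = 0.01;
  W1 = 0.1*randn(H, p); b1 = zeros(H, 1);    % input -> hidden weights
  W2 = 0.1*randn(1, H); b2 = 0;              % hidden -> output weights
  sig = @(z) 1./(1 + exp(-z));
  for epoch = 1:2000
      A = sig(W1*X' + b1);                   % hidden activations, H x N
      yhat = (W2*A + b2)';                   % network output, N x 1
      e = yhat - y;                          % error, E = (1/2N) * sum(e.^2)
      dW2 = (e'*A')/N;  db2 = mean(e);       % chain rule: output layer gradients
      dA  = (W2'*e') .* A .* (1 - A);        % back-propagated through the sigmoid
      dW1 = (dA*X)/N;   db1 = mean(dA, 2);   % chain rule: hidden layer gradients
      W1 = W1 - lr*dW1; b1 = b1 - lr*db1;
      W2 = W2 - lr*dW2; b2 = b2 - lr*db2;
  end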

  22. Adaptive Neuro-Fuzzy Inference System • Combines the advantages of fuzzy logic and neural networks. • Fuzzy rules are generated by input-space partitioning or fuzzy C-means clustering • Gaussian membership function: $\mu_A(x)=\exp\big(-(x-c)^2/(2\sigma^2)\big)$ • TSK-type fuzzy IF-THEN rules: Rule $j$: IF $x_1$ IS $A_{1j}$ AND $x_2$ IS $A_{2j}$ AND … AND $x_n$ IS $A_{nj}$ THEN $y_j=a_{0j}+a_{1j}x_1+\dots+a_{nj}x_n$

  23. Input space partitioning • For 2-dimensional input data, a grid partition of the input space into 9 rule regions (figure)

  24. Fuzzy C-means clustering • For $c$ clusters, the degree of belonging $u_{ij}$ of sample $x_j$ to cluster $i$ satisfies: • $u_{ij}\in[0,1]$, $\sum_{i=1}^{c}u_{ij}=1$ … (5) • where $u_{ij}=1\big/\sum_{k=1}^{c}\big(\|x_j-m_i\|/\|x_j-m_k\|\big)^{2/(q-1)}$ … (6) • Objective function • $J=\sum_{i=1}^{c}\sum_{j=1}^{N}u_{ij}^{\,q}\,\|x_j-m_i\|^2$ … (7) • with fuzzifier $q>1$ … (8) • To minimize $J$, we take the derivatives of $J$ with respect to $m_i$ and we get the mean of cluster $i$: $m_i=\sum_{j}u_{ij}^{\,q}x_j\big/\sum_{j}u_{ij}^{\,q}$ … (9) • Fuzzy C-means algorithm • Randomly initialize $U=[u_{ij}]$ so that it satisfies (5) • Calculate the means $m_i$ of each cluster using (9) • Update $u_{ij}$ according to the new means using (6) • Stop when the change in $U$ or in $J$ is small enough
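
A minimal sketch of the algorithm above on toy 2-D data; the data, number of clusters c and fuzzifier q are assumed values:

  X = [randn(50,2); randn(50,2)+4];            % toy data (assumed)
  c = 2; q = 2; N = size(X,1);
  U = rand(c, N); U = U ./ sum(U, 1);          % random memberships, columns sum to 1
  for iter = 1:100
      Uq = U.^q;
      M = (Uq * X) ./ sum(Uq, 2);              % cluster means, eq. (9)
      D = zeros(c, N);
      for i = 1:c                              % distances ||x_j - m_i||
          D(i,:) = sqrt(sum((X - M(i,:)).^2, 2))' + eps;
      end
      Unew = 1 ./ (D.^(2/(q-1)) .* sum(D.^(-2/(q-1)), 1));   % membership update, eq. (6)
      if max(abs(Unew(:) - U(:))) < 1e-6
          U = Unew; break
      end
      U = Unew;
  end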

  25. 1st layer: fuzzification layer • 2nd layer: conjunction layer • 3rd layer: normalization layer • 4th layer: inference layer • 5th layer: output layer
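
A minimal sketch of a single forward pass through these five layers, for two inputs with two Gaussian membership functions each; all centers, widths and TSK consequent parameters below are assumed toy values:

  x1 = 0.3; x2 = 0.7;                           % crisp inputs (assumed)
  c  = [0 1; 0 1];  s = [0.5 0.5; 0.5 0.5];     % MF centers/widths per input (assumed)
  gauss = @(x, c, s) exp(-(x - c).^2 ./ (2*s.^2));
  mu1 = gauss(x1, c(1,:), s(1,:));              % layer 1: fuzzification of x1
  mu2 = gauss(x2, c(2,:), s(2,:));              % layer 1: fuzzification of x2
  w = reshape(mu1' * mu2, 1, []);               % layer 2: rule firing strengths (product)
  wn = w / sum(w);                              % layer 3: normalization
  a = [1 2 0.5; 0 1 -1; 2 0 1; -1 1 0.5];       % TSK consequents [a0 a1 a2] per rule (assumed)
  frule = a(:,1) + a(:,2)*x1 + a(:,3)*x2;       % layer 4: first-order TSK rule outputs
  yhat = wn * frule;                            % layer 5: weighted sum -> crisp output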

  26. Comparison • Neural Network vs. SVR • Local minimum vs. global minimum • Choice of kernel/activation function • Computational complexity • Parallel computation of neural network • Online learning vs. batch learning • ANFIS vs. Neural Network • Convergence speed • Number of fuzzy rules

  27. Example: Function Approximation (1) • ANFIS
  x = (0:0.5:10)'; w = 5*rand; b = 4*rand;
  y = w*x + b;
  trnData = [x y];
  tic;
  numMFs = 5; mfType = 'gbellmf'; epoch_n = 20;
  in_fis = genfis1(trnData, numMFs, mfType);
  out_fis = anfis(trnData, in_fis, epoch_n);
  time = toc;
  h = evalfis(x, out_fis);
  plot(x, y, x, h); legend('Training Data', 'ANFIS Output');
  RMSE = sqrt(sum((h - y).^2)/length(h));
  disp(['Time = ', num2str(time), ' RMSE = ', num2str(RMSE)])
  ANFIS: Time = 0.015707, RMSE = 5.8766e-06

  28. NN
  x = (0:0.5:10)'; w = 5*rand; b = 4*rand;
  y = w*x + b;
  trnData = [x y];
  tic;
  net = feedforwardnet(5, 'trainlm');
  model = train(net, trnData(:,1)', trnData(:,2)');
  time = toc;
  h = model(x')';
  plot(x, y, x, h); legend('Training Data', 'NN Output');
  RMSE = sqrt(sum((h - y).^2)/length(h));
  disp(['Time = ', num2str(time), ' RMSE = ', num2str(RMSE)])
  NN: Time = 4.3306, RMSE = 0.00010074

  29. SVR
  clear; clc;
  addpath('./LibSVM'); addpath('./LibSVM/matlab');
  x = (0:0.5:10)'; w = 5*rand; b = 4*rand;
  y = w*x + b;
  trnData = [x y];
  tic;
  model = svmtrain(y, x, '-s 3 -t 0 -c 2.2 -p 1e-7');
  time = toc;
  h = svmpredict(y, x, model);
  plot(x, y, x, h); legend('Training Data', 'SVR Output');
  RMSE = sqrt(sum((h - y).^2)/length(h));
  disp(['Time = ', num2str(time), ' RMSE = ', num2str(RMSE)])
  SVR: Time = 0.00083499, RMSE = 6.0553e-08

  30. Given function: w = 3.277389450887783 and b = 0.684746751246247
  % model struct
  %   model.SVs     : sparse matrix of SVs
  %   model.sv_coef : SV coefficients
  %   model.rho     : -b of f(x) = w*x + b
  % For the linear kernel:
  %   h_2 = full(model.SVs)'*model.sv_coef*x - model.rho;
  full(model.SVs)'*model.sv_coef = 3.277389430887783
  -model.rho = 0.684746851246246

  31. Example: Function Approximation (2) • ANFIS
  x = (0:0.1:10)';
  y = sin(2*x)./exp(x/5);
  trnData = [x y];
  tic;
  numMFs = 5; mfType = 'gbellmf'; epoch_n = 20;
  in_fis = genfis1(trnData, numMFs, mfType);
  out_fis = anfis(trnData, in_fis, epoch_n);
  time = toc;
  h = evalfis(x, out_fis);
  plot(x, y, x, h); legend('Training Data', 'ANFIS Output');
  RMSE = sqrt(sum((h - y).^2)/length(h));
  disp(['Time = ', num2str(time), ' RMSE = ', num2str(RMSE)])
  ANFIS: Time = 0.049087, RMSE = 0.042318

  32. NN
  x = (0:0.1:10)';
  y = sin(2*x)./exp(x/5);
  trnData = [x y];
  tic;
  net = feedforwardnet(5, 'trainlm');
  model = train(net, trnData(:,1)', trnData(:,2)');
  time = toc;
  h = model(x')';
  plot(x, y, x, h); legend('Training Data', 'NN Output');
  RMSE = sqrt(sum((h - y).^2)/length(h));
  disp(['Time = ', num2str(time), ' RMSE = ', num2str(RMSE)])
  NN: Time = 0.77625, RMSE = 0.012563

  33. SVR
  clear; clc;
  addpath('./LibSVM'); addpath('./LibSVM/matlab');
  x = (0:0.1:10)';
  y = sin(2*x)./exp(x/5);
  trnData = [x y];
  tic;
  model = svmtrain(y, x, '-s 3 -t 0 -c 2.2 -p 1e-7');
  time = toc;
  h = svmpredict(y, x, model);
  plot(x, y, x, h); legend('Training Data', 'SVR Output');
  RMSE = sqrt(sum((h - y).^2)/length(h));
  disp(['Time = ', num2str(time), ' RMSE = ', num2str(RMSE)])

  34. RBF Kernel Time = 0.0039602 RMSE = 0.0036972

  35. Polynomial Kernel Time = 20.9686 RMSE = 0.34124

  36. Linear Kernel Time = 0.0038785 RMSE = 0.33304
