
STAT 497 LECTURE NOTE 11



  1. STAT 497 LECTURE NOTE 11: NEURAL NETWORKS FOR TIME SERIES FORECASTING https://otexts.com/fpp2/nnetar.html

  2. INTRODUCTION • Artificial neural networks are forecasting methods that are based on simple mathematical models of the brain. They allow complex nonlinear relationships between the response variable and its predictors. Source: https://medium.com/technologymadeeasy/for-dummies-the-introduction-to-neural-networks-we-all-need-c50f6012d5eb

  3. Neural network architecture • A neural network can be thought of as a network of “neurons” which are organized in layers. The predictors (or inputs) form the bottom layer, and the forecasts (or outputs) form the top layer. There may also be intermediate layers containing “hidden neurons”.

  4. Networks of McCulloch-Pitts Neurons • Artificial neurons have the same basic components as biological neurons. The simplest ANNs consist of a set of McCulloch-Pitts neurons labelled by indices k, i, j, with activation flowing between them via synapses with strengths w_ki and w_ij:

  5. MOTIVATION • Neural networks loosely mimic the way our brains solve problems: by taking in inputs, processing them and generating an output. Like us, they learn to recognize patterns, but they do this by training on labelled datasets. Before we get to the learning part, let’s take a look at the most basic artificial neuron, the perceptron, and how it processes inputs and produces an output.

  6. THE PERCEPTRON • Perceptrons were developed back in the 1950s-60s by the scientist Frank Rosenblatt, inspired by earlier work from Warren McCulloch and Walter Pitts. While today we use other models of artificial neurons, they follow the general principles set by the perceptron. Model of an artificial neuron • As you can see, the network of nodes sends signals in one direction. This is called a feed-forward network. • The figure depicts a neuron connected with n other neurons, from which it receives n inputs (x1, x2, …, xn). This configuration is called a perceptron.

  7. THE PERCEPTRON • Let’s understand this better with an example. Say you bike to work. Two factors determine whether you go: the weather must not be bad, and it must be a weekday. The weather is not that big a deal, but working on weekends is a big no-no. The inputs have to be binary, so let’s phrase the conditions as yes-or-no questions. Is the weather fine? 1 for yes, 0 for no. Is it a weekday? 1 for yes, 0 for no. • We cannot tell the neural network these conditions; it has to learn them for itself. How will it know which piece of information is most important in making its decision? It does so with something called weights. A weight is just a number attached to an input: a higher weight means the network considers that input more important than the others. The network learns these weights from the training data.
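• As a toy illustration, here is a minimal R sketch of such a decision rule; the weights and the threshold are made up by hand, whereas a real perceptron would learn them:
weights <- c(weather = 1, weekday = 3)   # made-up weights: the weekday input matters more
threshold <- 3.5                         # made-up firing threshold
bike_to_work <- function(weather_ok, is_weekday) {
  s <- weights["weather"] * weather_ok + weights["weekday"] * is_weekday
  as.integer(s >= threshold)             # 1 = go to work, 0 = stay home
}
bike_to_work(weather_ok = 1, is_weekday = 1)   # 1: fine weather on a weekday
bike_to_work(weather_ok = 1, is_weekday = 0)   # 0: weekends are a no-go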

  8. TRAINING IN PERCEPTRONS • Input vectors from a training set are presented to the perceptron one after the other, and the weights are modified according to the following update rule: for all inputs i, W(i) = W(i) + a * g'(sum of all inputs) * (T - A) * P(i), where g' is the derivative of the activation function and a is the learning rate. • Here, W is the weight vector, P is the input vector, T is the correct output that the perceptron should have produced, and A is the output actually given by the perceptron.
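• A minimal R sketch of this rule for a single neuron with a sigmoid activation (the delta rule); the data and the learning rate a = 0.1 are made up purely for illustration:
g  <- function(s) 1 / (1 + exp(-s))            # logistic activation
gp <- function(s) g(s) * (1 - g(s))            # its derivative g'
set.seed(1)
P   <- cbind(1, matrix(rnorm(40), ncol = 2))   # input vectors; first column acts as a bias input
tgt <- as.numeric(P[, 2] + P[, 3] > 0)         # T: correct outputs (made up)
W   <- rep(0, ncol(P))                         # weight vector
a   <- 0.1                                     # learning rate
for (epoch in 1:100) {
  for (i in seq_len(nrow(P))) {
    s   <- sum(W * P[i, ])                     # sum of all (weighted) inputs
    out <- g(s)                                # A: output given by the neuron
    W   <- W + a * gp(s) * (tgt[i] - out) * P[i, ]   # the update rule above
  }
}
round(W, 3)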

  9. ACTIVATION FUNCTION • An activation function is the function that transforms a neuron’s weighted input into its output, i.e. it states the condition under which the output neuron fires. • What does an artificial neuron do? Simply, it calculates a “weighted sum” of its inputs, adds a bias and then decides whether it should “fire” or not. • So consider a neuron whose value is Y = Σ(weight × input) + bias.

  10. ACTIVATION FUNCTION • The value of Y can be anything from -inf to +inf; the neuron itself does not know the bounds of the value. So how do we decide whether the neuron should fire or not? • We add “activation functions” for this purpose: they check the Y value produced by a neuron and decide whether outside connections should consider this neuron “fired” (or rather, “activated”) or not.

  11. ACTIVATION FUNCTION • If we do not apply an activation function, the output signal is simply a linear function, i.e. a polynomial of degree one. • A linear equation is easy to solve, but it is limited in complexity and has little power to learn complex functional mappings from data. • A neural network without activation functions is simply a linear regression model, which has limited power and does not perform well most of the time. • We want our neural network to learn and compute something more complicated than a linear function. • Moreover, without activation functions a neural network could not learn and model complicated kinds of data such as images, video, audio and speech. This is why artificial neural network techniques such as deep learning, with many hidden layers and complicated architectures, are used to make sense of and extract knowledge from complicated, high-dimensional, non-linear big datasets.
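• A small R sketch comparing a purely linear response with some common nonlinear activation functions (sigmoid, tanh, ReLU); the range of Y values is made up for plotting:
sigmoid <- function(y) 1 / (1 + exp(-y))   # squashes y into (0, 1)
relu    <- function(y) pmax(0, y)          # zero below 0, linear above
y <- seq(-5, 5, by = 0.1)                  # a range of possible weighted sums
plot(y, y, type = "l", lty = 3, ylab = "activation")   # linear, i.e. no activation function
lines(y, sigmoid(y), col = "red")
lines(y, tanh(y), col = "blue")            # tanh squashes into (-1, 1)
lines(y, relu(y), col = "darkgreen")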

  12. The simplest networks contain no hidden layers and are equivalent to linear regressions. The coefficients attached to these predictors are called “weights”. The forecasts are obtained by a linear combination of the inputs. The weights are selected in the neural network framework using a “learning algorithm” that minimizes a “cost function” such as the MSE. Of course, in this simple example, we can use linear regression which is a much more efficient method of training the model.
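• For instance, a minimal sketch of this linear special case in R: regress a series on its own lags with lm(), whose coefficients play the role of the weights (the built-in lynx series is used purely as an example):
y  <- as.numeric(lynx)                         # any series will do; lynx is built in
df <- data.frame(y    = y[3:length(y)],
                 lag1 = y[2:(length(y) - 1)],  # predictors are lagged values
                 lag2 = y[1:(length(y) - 2)])
fit_lm <- lm(y ~ lag1 + lag2, data = df)       # minimizes the MSE directly
coef(fit_lm)                                   # these coefficients are the "weights"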

  13. Once we add an intermediate layer with hidden neurons, the neural network becomes non-linear. • This is known as a multilayer feed-forward network, where each layer of nodes receives inputs from the previous layers. The outputs of the nodes in one layer are inputs to the next layer. The inputs to each node are combined using a weighted linear combination. 
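• A minimal sketch of one forward pass through such a network in R, with four inputs, three hidden neurons and made-up weights, using a sigmoid in the hidden layer:
sigmoid <- function(z) 1 / (1 + exp(-z))
set.seed(42)
x  <- rnorm(4)                               # 4 inputs, e.g. lagged values of a series
W1 <- matrix(rnorm(4 * 3), nrow = 4)         # input-to-hidden weights (made up)
b1 <- rnorm(3)                               # hidden-layer biases
W2 <- rnorm(3)                               # hidden-to-output weights
b2 <- rnorm(1)                               # output bias
h    <- sigmoid(b1 + as.vector(t(W1) %*% x)) # weighted linear combination, then nonlinearity
yhat <- b2 + sum(W2 * h)                     # output: linear combination of hidden nodes
yhat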

  14. The weights take random values to begin with, and these are then updated using the observed data. Consequently, there is an element of randomness in the predictions produced by a neural network. Therefore, the network is usually trained several times using different random starting points, and the results are averaged. • The number of hidden layers, and the number of nodes in each hidden layer, must be specified in advance. 
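• In the forecast package this is handled by the repeats argument of nnetar() (20 networks by default, according to its documentation); a sketch of the effect, using the built-in lynx series as a stand-in:
library(forecast)
fit1 <- nnetar(lynx, repeats = 1)            # a single network: results vary from run to run
fit2 <- nnetar(lynx, repeats = 1)
c(forecast(fit1, h = 1)$mean, forecast(fit2, h = 1)$mean)   # slightly different forecasts
fit  <- nnetar(lynx, repeats = 50)           # averaging 50 networks stabilises the forecasts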

  15. Neural network autoregression • With time series data, lagged values of the time series can be used as inputs to a neural network, just as we used lagged values in a linear autoregression model. We call this a neural network autoregression or NNAR model. • We only consider feed-forward networks with one hidden layer, and we use the notation NNAR(p,k) to indicate there are p lagged inputs and k nodes in the hidden layer. For example, an NNAR(9,5) model is a neural network with the last nine observations (yt−1, yt−2, …, yt−9) used as inputs for forecasting the output yt, and with five neurons in the hidden layer. An NNAR(p,0) model is equivalent to an ARIMA(p,0,0) model, but without the restrictions on the parameters to ensure stationarity.
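• With the forecast package, p and k can also be set explicitly; for example, a sketch that should fit an NNAR(9,5), again using lynx merely as a convenient built-in series:
library(forecast)
fit <- nnetar(lynx, p = 9, size = 5)   # NNAR(9,5): 9 lagged inputs, 5 hidden nodes
fit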

  16. The nnetar() function fits an NNAR(p,P,k)[m] model. If the values of p and P are not specified, they are selected automatically. For non-seasonal time series, the default is the optimal number of lags (according to the AIC) for a linear AR(p) model. For seasonal time series, the default values are P=1 and p is chosen from the optimal linear model fitted to the seasonally adjusted data. If k is not specified, it is set to k=(p+P+1)/2 (rounded to the nearest integer). • When it comes to forecasting, the network is applied iteratively. For forecasting one step ahead, we simply use the available historical inputs. For forecasting two steps ahead, we use the one-step forecast as an input, along with the historical data. This process proceeds until we have computed all the required forecasts.
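• A short sketch of these defaults on a built-in monthly series, then forecasting two years ahead (which proceeds iteratively, one step at a time):
library(forecast)
fit <- nnetar(USAccDeaths)      # seasonal series: P = 1, p and k chosen automatically
fit                             # prints the selected NNAR(p,P,k)[12] model
fc  <- forecast(fit, h = 24)    # each step feeds earlier forecasts back in as inputs
autoplot(fc)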

  17. Example: sunspots • The surface of the sun contains magnetic regions that appear as dark spots. These affect the propagation of radio waves, and so telecommunication companies like to predict sunspot activity in order to plan for any future difficulties. Sunspots follow a cycle of length between 9 and 14 years. In the figure, forecasts from an NNAR(10,6) are shown for the next 30 years. We have used a Box-Cox transformation with lambda=0 to ensure the forecasts stay positive.
fit <- nnetar(sunspotarea, lambda=0)
autoplot(forecast(fit, h=30))
Here, the last 10 observations are used as predictors, and there are 6 neurons in the hidden layer. The cyclicity in the data has been modelled well. We can also see that the asymmetry of the cycles has been captured by the model: the increasing part of the cycle is steeper than the decreasing part. This is one difference between an NNAR model and a linear AR model; while linear AR models can model cyclicity, the modelled cycles are always symmetric.

  18. Prediction intervals • Unlike most other forecasting methods, neural networks are not based on a well-defined stochastic model, so it is not straightforward to derive prediction intervals for the resultant forecasts. However, we can still compute prediction intervals using simulation, where future sample paths are generated using bootstrapped residuals. • The neural network fitted to the sunspot data can be written as y_t = f(y_{t−1}) + ε_t, where y_{t−1} = (y_{t−1}, y_{t−2}, …, y_{t−10})′ is a vector containing lagged values of the series, and f is a neural network with 6 hidden nodes in a single layer. The error series {ε_t} is assumed to be homoscedastic (and possibly also normally distributed).

  19. Here is a simulation of 9 possible future sample paths for the sunspot data. Each sample path covers the next 30 years after the observed data.
sim <- ts(matrix(0, nrow=30L, ncol=9L), start=end(sunspotarea)[1L]+1L)
for(i in seq(9)) sim[,i] <- simulate(fit, nsim=30L)
autoplot(sunspotarea) + autolayer(sim)

  20. If we do this a few hundred or thousand times, we can get a good picture of the forecast distributions. This is how the forecast() function produces prediction intervals for NNAR models:
fcast <- forecast(fit, PI=TRUE, h=30)
autoplot(fcast)
Because it is a little slow, PI=FALSE is the default, so prediction intervals are not computed unless requested. The npaths argument in forecast() controls how many simulations are done (default 1000). By default, the errors are drawn from a normal distribution. The bootstrap argument allows the errors to be “bootstrapped” (i.e., randomly drawn from the historical errors).

  21. Recurrent neural networks and time series (http://software-tecnico-libre.es/en/article-by-topic/all_sections/all-topics/all-articles/recurrent-neural-network-and-time-series) • This is a type of network architecture that implements a kind of memory and, therefore, a sense of time. This is achieved by adding neurons that receive as input the output of one of the hidden layers and inject their own output back into that layer. Here, we will show a simple way to use two neural networks of this kind: the Elman and Jordan networks. • In Elman networks, the inputs of these neurons are taken from the outputs of the neurons in one of the hidden layers, and their outputs are connected back to the inputs of the same layer, providing a memory of the previous state of this layer.

  22. Recurrent neural networks and time series • The scheme is as in the figure below, where X is the input, S the output and the yellow node is the neuron in the context layer: • In the Jordan networks, the difference is that the input of the neurons in the context layer is taken from the output of the network:

  23. Example • As an example of a time series, I will first use a series generated by the logistic map in its chaotic regime. The chaotic dynamics give the series a complex structure, making the prediction of future values very difficult or impossible. The series is obtained using the following equation: X_{n+1} = μ·X_n·(1 − X_n) • When the parameter μ rises above approximately 3.57, the dynamics of the generated series become chaotic.
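• If the file logistic-x.csv used below is not at hand, an equivalent series can be generated directly; a minimal sketch with μ = 4, well inside the chaotic regime, and an arbitrary starting value:
mu <- 4                                   # chaotic regime of the logistic map
x  <- numeric(1000)
x[1] <- 0.2                               # arbitrary starting value in (0, 1)
for (n in 1:999) x[n + 1] <- mu * x[n] * (1 - x[n])
slog <- as.ts(x)                          # same object name as in the next slide
plot(slog)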

  24. You have to start by loading the packages and the series data:
require(RSNNS)
require(quantmod)
slog <- as.ts(read.csv("logistic-x.csv", header=FALSE))
• The values of the logistic equation all lie in the range (0,1), so it is unnecessary to preprocess them; otherwise, it would be convenient to scale them. As we have 1000 values, we will use 900 to train the neural network. To do this we define the train variable:
train <- 1:900

  25. Let's define as series training variables, the n previous values of it. The choice of n is arbitrary, here 10 values are selected, but, depending on the nature of the problem we are dealing with, can be convenient another value. For example, if we have monthly values of a variable, 12 might be a better value for n. What we'll do is create a data frame with n columns, each of which is constructed advancing a value of the series in the future, through a variable of type zoo: y<-as.zoo(slog) x1<-Lag(y,k=1) x2<-Lag(y,k=2) x3<-Lag(y,k=3) x4<-Lag(y,k=4) x5<-Lag(y,k=5) x6<-Lag(y,k=6) x7<-Lag(y,k=7) x8<-Lag(y,k=8) x9<-Lag(y,k=9) x10<-Lag(y,k=10) slog<-cbind(y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10)

  26. We eliminate the NA values produced by shifting the series:
slog <- slog[-(1:10),]
and we define, for convenience, the input and output values of the neural network:
inputs <- slog[,2:11]
outputs <- slog[,1]
• Now we can create an Elman network and train it:
fit <- elman(inputs[train], outputs[train], size=c(3,2), learnFuncParams=c(0.1), maxit=5000)
• The third parameter, size, indicates that we want two hidden layers, one with three neurons and another with two; learnFuncParams sets a learning rate of 0.1, and maxit a maximum of 5000 iterations.

  27. With the plotIterativeError function we can see how the network error has evolved over the training iterations:
plotIterativeError(fit)
As we can see, the error converges to zero very quickly.

  28. Now let's make a prediction with the remaining terms of the series, which has the following graphical appearance: y<-as.vector(outputs[-train]) plot(y,type="l")

  29. pred <- predict(fit, inputs[-train])
• If we superimpose the prediction over the original series, we can see that the approximation is very good:
lines(pred, col="red")

  30. library(RSNNS)
# simulate an ARIMA time series example of length n
set.seed(10001)
n <- 100
ts.sim <- arima.sim(list(order = c(1,1,0), ar = 0.7), n = n-1)
# create an input data set for ts.sim
# sw = sliding-window size
# the last point of the time series will not be used
# in the training phase, only in the prediction/validation phase
sw <- 1
X <- lapply(sw:(n-2), function(ind){ ts.sim[(ind-sw+1):ind] })
X <- do.call(rbind, X)
Y <- sapply(sw:(n-2), function(ind){ ts.sim[ind+1] })

  31. # used to validate prediction properties
# on the last point of the series
newX <- ts.sim[(n-sw):(n-1)]
newY <- ts.sim[n]
# build an Elman network based on the input
model <- elman(X, Y, size = c(10, 10), learnFuncParams = c(0.001), maxit = 500, linOut = TRUE)
# plot the results
limits <- range(c(Y, model$fitted.values))
plot(Y, type = "l", col="red", ylim=limits, xlim=c(0, length(Y)), ylab="", xlab="")
lines(model$fitted.values, col = "green", type="l")
points(length(Y)+1, newY, col="red", pch=16)
points(length(Y)+1, predict(model, newdata=newX), pch="X", col="green")

  32. First, we have sliced the time-series example into input/target pairs of the form (sw previous points, next point) for all pairs except the one whose target is the last point of the series; that final point is held back for validation. The parameter sw defines the size of the "sliding window". Because Elman networks have memory, a sliding window of size one is a perfectly reasonable approach. • After these preparations we build an Elman network. There are two important parameters: size and learnFuncParams. size gives the number of neurons in each hidden layer (here two hidden layers of 10 neurons each). A rule of thumb for learnFuncParams is to keep it small if that is feasible.
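• For reference, the same slicing can be wrapped in a small helper that works for any window size sw; a sketch (here the whole series is used, whereas the code above holds the last point back for validation):
# build (inputs, target) pairs with a sliding window of size sw
make_sliding_window <- function(series, sw) {
  n <- length(series)
  X <- sapply(1:sw, function(j) series[j:(n - sw + j - 1)])  # one column per lag
  Y <- series[(sw + 1):n]                                    # the value to predict
  list(X = as.matrix(X), Y = Y)
}
dat <- make_sliding_window(as.numeric(ts.sim), sw = 3)
dim(dat$X)   # (n - 3) rows of 3 consecutive values, one per target in dat$Y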

  33. For other R packages or other types of neural network applications to time series, you can visit:
• https://blogs.rstudio.com/tensorflow/posts/2017-12-20-time-series-forecasting-with-recurrent-neural-networks/
• https://kourentzes.com/forecasting/2017/02/10/forecasting-time-series-with-neural-networks-in-r/
• https://datascienceplus.com/neuralnet-train-and-test-neural-networks-using-r/
• https://rpubs.com/mr148/313595
