
A Non-Linear Regression Model for Time Series with
Heteroskedastic Conditional Variance
Isabella Morlini
Dipartimento di Economia – Sezione di Statistica, Via Kennedy 6, 43100 Parma, Italy,
e-mail: morlini@economia.econ.unipr.it
Abstract: This paper presents a methodology for applying non-linear models to the analysis of time series with heteroskedastic conditional variance. We consider the generic model $y_t = f(\mathbf{x}; \mathbf{w}) + \varepsilon_t$, where $y_t$ is the value of the series at time $t$, $\mathbf{w}$ is a vector of parameters, $\mathbf{x}$ is a vector of exogenous and/or lagged endogenous variables, $f$ is a non-linear function and $\varepsilon_t \sim N(0, \sigma^2)$. In the proposed methodology this model is fitted by minimising a cost function with a weight-decay penalty term, in order to describe the conditional expectation $\langle y_t | \mathbf{x} \rangle$. A second model is then fitted to describe the conditional variance $h_t^2 = \langle (y_t - \langle y_t | \mathbf{x} \rangle)^2 \,|\, \mathbf{x} \rangle$ at time $t$, using the squared residuals $(y_t - \langle y_t | \mathbf{x} \rangle)^2$ as target values. The methodology is applied to series simulated from AR(2) processes with ARCH(1) and GARCH(1,1) disturbances.
Keywords: Flexible modelling, Prediction intervals, Weight-decay.
1. Time Series Prediction Using Non-Linear Modelling
Consider a non-linear regression model defined by

$$y_t = f(\mathbf{x}; \mathbf{w}) + \varepsilon_t \qquad (t = 1, \dots, T) \qquad (1)$$

where $y_t$ is the value of a series at time $t$, $\mathbf{x}$ is a vector of exogenous or lagged endogenous variables, $\mathbf{w}$ is a vector of adaptive parameters, $\varepsilon_t \sim N(0, \sigma^2)$, $\varepsilon_t | \mathbf{x} \sim N(0, h_t^2)$, and $f$ is a smooth, non-linear mapping function. If the estimated parameters $\tilde{\mathbf{w}}$ are chosen to minimise the sum-of-squares error $E = \sum_{t=1}^{T} (y_t - \hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}}))^2$, then the predicted values $\hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}})$ approximate the conditional averages of the targets:

$$\hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}}) = \langle y_t | \mathbf{x} \rangle \qquad (t = 1, \dots, T) \qquad (2)$$

where $\hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}})$ is the network mapping with the weight vector at the minimum of the error function and $\langle y_t | \mathbf{x} \rangle = \int y_t \, p(y_t | \mathbf{x}) \, dy_t$. This result (Bishop, 1995, pp. 201-202) is
independent of the choice of the mapping function and only requires that the
representation for the non-linear mapping be sufficiently general and the data set and
the vector of adaptive parameters be sufficiently large. A non-parametric approach to determining the input-dependent variance of $y_t$ may be based on result (2). Once model (1) has been fitted, the predicted values $\hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}})$ are subtracted from the target values $y_t$; the results are then squared and used as targets for a second non-linear model, which is also fitted using a sum-of-squares error function. The output of this second model then represents the conditional average of $(y_t - \langle y_t | \mathbf{x} \rangle)^2$ and thus approximates the variance $h_t^2(\mathbf{x}) = \langle (y_t - \langle y_t | \mathbf{x} \rangle)^2 \,|\, \mathbf{x} \rangle$. The principal limitation of this technique is that it requires training until convergence, and this may cause overfitting in practical applications, where the training data are limited in size and affected by a high degree of noise.
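As an illustration, this two-stage procedure can be sketched as follows. This is a minimal sketch, not the author's code: scikit-learn's MLPRegressor stands in for the generic non-linear mapping, and the function name, architecture and iteration budget are illustrative assumptions.

```python
from sklearn.neural_network import MLPRegressor

def two_stage_variance(X, y):
    """X, y: numpy arrays of inputs and targets of the series."""
    # Stage 1: fit the conditional mean <y_t|x> with a sum-of-squares error
    # (alpha=0.0 disables the L2 penalty, i.e. plain least-squares training).
    mean_model = MLPRegressor(hidden_layer_sizes=(7,), activation="tanh",
                              alpha=0.0, max_iter=5000).fit(X, y)
    # Stage 2: the squared residuals are noisy targets whose conditional
    # average is the heteroskedastic variance h_t^2(x).
    sq_resid = (y - mean_model.predict(X)) ** 2
    var_model = MLPRegressor(hidden_layer_sizes=(7,), activation="tanh",
                             alpha=0.0, max_iter=5000).fit(X, sq_resid)
    return mean_model, var_model
```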
2. The Suggested Approach
The idea in this work is to present an alternative approach to estimating the input-dependent variance, using weight decay to fit the models. Instead of minimising the sum-of-squares error, we minimise the cost function

$$C = \sum_{t=1}^{T} (y_t - \hat{y}_t(\mathbf{x}; \mathbf{w}))^2 + \nu \sum_{k=1}^{K} w_k^2,$$

where $K$ is the total number of parameters and $\nu$ is a regularisation term. Although this method can be considered an ad-hoc method to prevent overfitting, we will justify this approach on Bayesian grounds, in order to demonstrate that result (2) is still valid when $\hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}})$ is the non-linear mapping with the parameter vector at the minimum of $C$. Given the set of target values $Y = (y_1, \dots, y_T)$, the approximated posterior distribution of $\mathbf{w}$ is (MacKay, 1992)

$$p(\mathbf{w} | Y) = \frac{1}{Z_w} \exp\left[ -C(\tilde{\mathbf{w}}) - \frac{1}{2} (\mathbf{w} - \tilde{\mathbf{w}})^T \mathbf{H} (\mathbf{w} - \tilde{\mathbf{w}}) \right],$$

where $Z_w$ is the normalisation constant, $C(\tilde{\mathbf{w}})$ is the cost function at its minimum value and $\mathbf{H}$ is the Hessian matrix, whose $kl$-th entry is $\partial^2 C / \partial w_k \partial w_l$. Remembering equation (1), which leads to $p(y_t | \mathbf{x}; \mathbf{w}) \propto \exp\left[ -(y_t - \hat{y}_t(\mathbf{x}; \mathbf{w}))^2 / 2h_t^2 \right]$, the posterior distribution $p(y_t | \mathbf{x}; Y) = \int p(y_t | \mathbf{x}; \mathbf{w}) \, p(\mathbf{w} | Y) \, d\mathbf{w}$ can be written as follows:

$$p(y_t | \mathbf{x}; Y) \propto \int \exp\left[ -\frac{(y_t - \hat{y}_t(\mathbf{x}; \mathbf{w}))^2}{2h_t^2} \right] \exp\left[ -\frac{1}{2} (\mathbf{w} - \tilde{\mathbf{w}})^T \mathbf{H} (\mathbf{w} - \tilde{\mathbf{w}}) \right] d\mathbf{w} \qquad (3)$$
where any constant factor independent of $y_t$ has been dropped. Assume the width of this distribution is sufficiently narrow that the function $\hat{y}_t(\mathbf{x}; \mathbf{w})$ can be approximated by its linear expansion around $\tilde{\mathbf{w}}$: $\hat{y}_t(\mathbf{x}; \mathbf{w}) = \hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}}) + \mathbf{g}^T (\mathbf{w} - \tilde{\mathbf{w}})$, where $\mathbf{g}$ is a vector whose $k$-th entry is $\partial \hat{y}_t / \partial w_k$. Expression (3) can then be written as follows:

$$p(y_t | \mathbf{x}; Y) \propto \int \exp\left[ -\frac{(y_t - \hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}}) - \mathbf{g}^T (\mathbf{w} - \tilde{\mathbf{w}}))^2}{2h_t^2} - \frac{(\mathbf{w} - \tilde{\mathbf{w}})^T \mathbf{H} (\mathbf{w} - \tilde{\mathbf{w}})}{2} \right] d\mathbf{w} \qquad (4)$$
The integral in (4) is evaluated to give the Gaussian distribution

$$p(y_t | \mathbf{x}; Y) = \frac{1}{Z_y} \exp\left[ -\frac{(y_t - \hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}}))^2}{2(h_t^2 + \mathbf{g}^T \mathbf{H}^{-1} \mathbf{g})} \right], \qquad (5)$$

where $Z_y$ is the normalisation factor, $\hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}})$ is the mean $\langle y_t | \mathbf{x}; Y \rangle$ and $(h_t^2 + \mathbf{g}^T \mathbf{H}^{-1} \mathbf{g})$ is the variance $\langle (y_t - \langle y_t | \mathbf{x}; Y \rangle)^2 \,|\, \mathbf{x}; Y \rangle$. This variance is given by the sum of two terms: the first is the variance $h_t^2$ of $y_t$ conditioned on the input vector; the second represents the width of the posterior distribution of the parameters. The estimates $\hat{h}_t^2(\mathbf{x}; \tilde{\mathbf{v}})$ of this variance are obtained at a subsequent stage, by considering a second non-linear model governed by a vector $\mathbf{v}$ of parameters, with the same input vector $\mathbf{x}$ and with target values given by the squared differences $(y_t - \hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}}))^2$. This second model is also fitted by weight decay. The procedure yields the approximated intervals $[\hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}}) \pm z_{\alpha/2} \hat{h}_t(\mathbf{x}; \tilde{\mathbf{v}})]$, which are based on the distribution of $y_t$ conditioned on $\mathbf{x}$ and on the data set.
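A minimal sketch of the resulting interval construction is given below, under the assumption that scikit-learn's alpha parameter plays the role of the weight-decay coefficient $\nu$; the helper name and hyper-parameter values are hypothetical, not the author's implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.neural_network import MLPRegressor

def weight_decay_intervals(X_train, y_train, X_new, nu=1e-3, level=0.95):
    # Both networks minimise a sum-of-squares error plus nu * sum(w_k^2).
    mean_net = MLPRegressor(hidden_layer_sizes=(7,), activation="tanh",
                            alpha=nu, max_iter=5000).fit(X_train, y_train)
    sq_resid = (y_train - mean_net.predict(X_train)) ** 2
    var_net = MLPRegressor(hidden_layer_sizes=(7,), activation="tanh",
                           alpha=nu, max_iter=5000).fit(X_train, sq_resid)
    y_hat = mean_net.predict(X_new)
    # A regression estimate of a variance can dip below zero, so clip it.
    h2_hat = np.clip(var_net.predict(X_new), 0.0, None)
    z = norm.ppf(0.5 + level / 2.0)  # z_{alpha/2}; 1.96 for a 95% interval
    half_width = z * np.sqrt(h2_hat)
    return y_hat - half_width, y_hat + half_width
```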
3. Simulation Study
Since a practical framework for approximating arbitrary non-linear mappings is
provided by neural networks, these models are used to validate the approach suggested
in section 2. Two multi-layer feedforward networks are developed, to obtain first the time series predictions and then the estimates of the heteroskedastic variances $h_t^2(\mathbf{x})$. The four most recent observations from a time series are used as inputs, and the networks have four nodes in the input layer, seven nodes in the hidden layer and one node in the output layer. The transfer functions are the hyperbolic tangent for the first layer and linear functions for the second layer. Five second-order autoregressive (AR(2)) time series with Gaussian errors modelled by an ARCH(1) process (Engle, 1982) and five AR(2) series with errors modelled by a GARCH(1,1) process are simulated, considering the equation $\varepsilon_t = h_t \eta_t$, with $\eta_t$ i.i.d. $N(0, 1)$ pseudo-random numbers generated from a Gaussian distribution, and with $h_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2$ for the ARCH process and $h_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \beta_1 h_{t-1}^2$ for the GARCH process ($\alpha_0$ and $\alpha_1$ being positive). All time series have starting values $y_1 = 0.264$, $y_2 = 0.532$, $h_0^2 = 0.031$ and $\alpha_0 = 0.3$. The signal-to-noise ratio, the ratio of the unconditional variance $\sigma_y^2$ of $y_t$ to the unconditional noise variance $\sigma_\varepsilon^2$, ranges from 1.85 to 116, while $\sigma_\varepsilon^2$ ranges from 0.02 to 0.9. Each time series consists of 9,000 observations: 5,000 are used for training and 4,000 for testing.
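The simulated series might be generated along the following lines. This is a sketch under stated assumptions: the AR coefficients phi1 and phi2 and the ARCH/GARCH coefficients a1 and b1 are not reported in the text and are chosen here purely for illustration; the starting values and a0 = 0.3 are those given above.

```python
import numpy as np

def simulate_ar2_garch(T, phi1=0.5, phi2=-0.3, a0=0.3, a1=0.2, b1=0.5, seed=0):
    """Simulate y_t = phi1*y_{t-1} + phi2*y_{t-2} + e_t, with e_t = h_t*eta_t
    and h_t^2 = a0 + a1*e_{t-1}^2 + b1*h_{t-1}^2."""
    rng = np.random.default_rng(seed)
    y, eps, h2 = np.zeros(T), np.zeros(T), np.zeros(T)
    y[0], y[1] = 0.264, 0.532   # starting values used in the paper
    h2[1] = 0.031               # plays the role of h_0^2
    for t in range(2, T):
        # GARCH(1,1) recursion; setting b1 = 0 reduces it to ARCH(1).
        h2[t] = a0 + a1 * eps[t - 1] ** 2 + b1 * h2[t - 1]
        eps[t] = np.sqrt(h2[t]) * rng.standard_normal()
        y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]
    return y, h2
```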
Table 1 shows the results of the experiment, obtained by training the networks with backpropagation in order to minimise the weight-decay cost function and then calculating $\hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}})$, $\hat{h}_t^2(\mathbf{x}; \tilde{\mathbf{v}})$, and the 95% approximated prediction intervals for the test points. Table 1 reports the percentage of observations falling within these intervals, within $[\hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}}) \pm 1.96 h_t]$, where the estimated values $\hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}})$ but the true values $h_t^2$ are used, and within $[\langle y_t | \mathbf{x} \rangle \pm 1.96 \hat{h}_t(\mathbf{x}; \tilde{\mathbf{v}})]$, where the true values $\langle y_t | \mathbf{x} \rangle$ but the estimated values $\hat{h}_t^2(\mathbf{x}; \tilde{\mathbf{v}})$ are used. The average sizes of the real intervals and of the approximated prediction intervals are also given. The accuracy of the fit between predicted and real values is measured by the root mean square error (RMSE) and by the mean absolute error (MAE). In order to evaluate the ability of the first network to predict the change in direction, rather than in magnitude, the confusion rate is also given. The last two rows report the sample means of $h_t^2$ and of $\hat{h}_t^2(\mathbf{x}; \tilde{\mathbf{v}})$, respectively.
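The fit measures could be computed as follows; note that the confusion rate is implemented here under the assumption that it counts disagreements in the predicted direction of change, which is one common reading of the term.

```python
import numpy as np

def fit_errors(target, estimate):
    """Return (RMSE, MAE) between a series of true values and its estimates."""
    err = np.asarray(target) - np.asarray(estimate)
    return float(np.sqrt(np.mean(err ** 2))), float(np.mean(np.abs(err)))

def confusion_rate(y_true, y_pred):
    """Share of time points where the predicted direction of change
    disagrees with the observed one (assumed definition)."""
    return float(np.mean(np.sign(np.diff(y_true)) != np.sign(np.diff(y_pred))))
```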
Table 1: Simulation Results

Model                        |  ----------- ARCH(1) -----------   |  --------- GARCH(1,1) ----------
Signal-to-Noise Ratio        |  1.85   116    116   1.85   3.86   |  1.85   116    116   1.85   3.86
Noise Variance σε²           |  0.10   0.10   0.50   0.90   0.90  |  0.02   0.02   0.50   0.90   0.90
% of points within:          |                                    |
  [ŷt ± 1.96 ĥt]             |  90.6   92.9   96.0   88.8   92.2  |  91.9   91.7   90.1   92.5   92.9
  [ŷt ± 1.96 ht]             |  94.5   92.5   93.1   94.9   94.6  |  94.4   93.7   93.3   94.9   83.8
  [⟨yt|x⟩ ± 1.96 ĥt]         |  90.7   94.5   98.6   94.5   92.8  |  92.2   93.2   91.1   92.7   96.6
Average size of:             |                                    |
  Real intervals             |  1.05   1.05   2.61   2.25   2.25  |  0.73   0.73   2.71   2.46   2.46
  Estimated intervals        |  1.10   1.20   2.45   2.26   2.35  |  0.84   0.84   2.75   2.42   3.22
RMSE for ⟨yt|x⟩              |  0.09   0.16   0.56   0.33   0.67  |  0.07   0.10   0.44   0.25   0.75
MAE for ⟨yt|x⟩               |  0.03   0.08   0.16   0.08   0.10  |  0.03   0.04   0.17   0.07   0.49
Confusion rate for ⟨yt|x⟩    |  0.03   0.01   0.00   0.03   0.02  |  0.05   0.00   0.01   0.03   0.20
RMSE for ht²                 |  0.15   0.24   0.87   7.55   6.71  |  0.11   0.12   1.10   6.31   5.94
MAE for ht²                  |  0.05   0.08   0.46   0.60   0.52  |  0.04   0.04   0.37   0.46   0.71
Sample mean of ht²           |  0.10   0.10   0.50   0.75   0.75  |  0.05   0.05   0.50   0.75   0.75
Sample mean of ĥt²(x; ṽ)     |  0.11   0.10   0.57   0.54   0.58  |  0.06   0.05   0.60   0.52   0.84
The prediction intervals, capturing between 88.8% and 96% of the points, are consistent with the nominal coverage, but the percentage of points tends to be smaller than 95%. When the coverage falls short of 95%, the average size of the prediction intervals is not too narrow with respect to the real prediction intervals. This means that the main source of error lies in the estimates $\hat{y}_t(\mathbf{x}; \tilde{\mathbf{w}})$. The better performance when the true values $\langle y_t | \mathbf{x} \rangle$ are used confirms this remark. A large value of $\sigma_\varepsilon^2$ avoids overestimation of $h_t^2$ but leads to less accurate fits both for $h_t^2$ and for $\langle y_t | \mathbf{x} \rangle$. The sample mean of the estimated variances is close to the mean of the real values, especially for small $\sigma_\varepsilon^2$. In conclusion, the approximated prediction intervals and the fitting errors reveal quite satisfactory performance. The model appears to work adequately, especially considering that it does not require the correct process generating the conditional variance to be specified. More research is needed to validate these preliminary results.
Main References
Bishop C.M. (1995) Neural Networks for Pattern Recognition, Clarendon Press,
Oxford.
Engle R.F. (1982) Autoregressive conditional heteroskedasticity with estimates of the
variance of United Kingdom inflation, Econometrica, 50, 987-1007.
MacKay D.J.C. (1992) A practical Bayesian framework for back-propagation networks, Neural Computation, 4(3), 448-472.