
Sharp statistical tools
Statistics for extremes
Georg Lindgren
Lund University
October 18, 2012
SARMA
Background
Motivation
We want to predict outside the range of observations
Sums, averages and proportions – Normal distribution
Central limit theorem
Successes in large experiments
Extremes
Normal distribution inappropriate
Bulk of data may be misleading
Extremes are rare
Sparsity of data may be compensated by proper
(extreme value) theory
Background
Overview of contents
Topics
- Extreme value distributions
- Block maxima
- POT – Peaks Over Threshold
- Estimation, return period, uncertainties
- Extremes with cyclic or linear trend
- Extremes with other covariates – CO2, NAO, AO, ...

Useful statistics packages
- The R environment – need packages extRemes and evd for the computer experiments
- The WAFO package in Matlab and Python (downloadable from code.google.com)
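For reference, a minimal sketch of installing and loading these packages (assuming installation from CRAN):

# Install and load the R packages used in the computer experiments
install.packages(c("extRemes", "evd"))
library(extRemes)
library(evd)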
Background
A typical example: annual maximum of daily precipitation
[Figure: 100 years of yearly maximum of daily precipitation (in) at Fort Collins, 1900–2000.]
Background
A second typical example
[Figure: monthly maximum precipitation (in) plotted against season (month 1–12). Monthly maximum values give 12 times as many data – but they vary strongly with the season.]
Background
Some basic facts about statistical extremes
Notation: Mn = max(X1, X2, ..., Xn) is the maximum of n observations of a random quantity.
Example: Xk = maximum precipitation in year k
Problem: Find the probability Prob(Mn > x) for large n and reasonable x – in particular when x is larger than the largest observed value!
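As a small illustration (not from the slides): for n independent observations with common cdf F, Prob(Mn ≤ x) = F(x)^n, which can be evaluated directly in R, here for standard normal observations:

# Prob(max of n independent N(0,1) values exceeds x)
n <- 365; x <- 3
1 - pnorm(x)^n     # approximately 0.39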
Background
Return period – return level
X1 , X2 , . . . a sequence of random quantities with common distribution,
e.g. one observation per year (a mean, a yearly maximum, ...,
anything)
The 100-year return value x100 is the value that is exceeded on average once in 100 years, i.e.

Prob(X1 > x100) = 1/100

Prob(M100 > x100) = 1 − (1 − 1/100)^100 ≈ 1 − 1/e = 0.63

The probability that the 100-year return level is exceeded at least once during a 100-year period is 63%. The 1000-year value is exceeded with the same probability at least once during 1000 years, etc.
IF YEARS ARE INDEPENDENT OF EACH OTHER. – USUALLY,
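A one-line check of this calculation in R (a sketch, not part of the original slides):

# Prob(100-year level exceeded at least once in 100 independent years)
1 - (1 - 1/100)^100     # 0.634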
Background
Wave data from the Pacific Ocean
582 measurements of Hs = ”significant wave height” at buoy 46005 in the Pacific; about 14 data per month, Dec–Feb, during 7 years.
[Figure: the 582 Hs values plotted against measurement number t.]
Background
Histogram of Pacific wave height
[Figure: two panels – a histogram of Hs, and the empirical cdf of Hs exceedances over 7 m together with the cdf of an exponential distribution shifted to +7 m.]
Conclusion: the tail of the distribution has a more ”standardized” distribution than the central part.
GEV and GPD
Goals for statistical extreme value analysis
One-dimensional: From a series of observations (often in time)
estimate the probability of high values, of the order of the largest
observation; OR HIGHER, even MUCH HIGHER
Make a statement about how uncertain the prediction is
Identify possible non-stationarity in extremes – can be different from
non-stationarity in averages or standard deviation
Find covariates that explain the occurrence of extreme values
Utilize data as much as possible – balance between number of data
available and ”bias”
Multi-dimensional: find probabilities of combinations of extreme (or
not so extreme) values in two or more series (research is going on)
Extremes in random fields (even more research is going on)
GEV and GPD
The GEV and the GPD
Block maxima: Take one extreme value per time unit (e.g. day, month,
year, ...). Ideal situation: stationary conditions, no trend.
Exceedance analysis: Take all values over a certain threshold – gives more data that are more representative of the global maximum than the bulk data.
Statistical extreme value analysis relies on two mathematical facts
Block maxima: The maximum of many observations has
a GEV distribution.
Exceedance analysis: The exceedances have a GPD
distribution.
GEV and GPD
The ”Extremal types theorem”
An ”almost true” theorem: When n is large, the distribution of the
maximum Mn is a GEV = Generalized Extreme Value
distribution:
Prob(Mn ≤ x) ≈ exp{ −[1 + ξ (x − µ)/ψ]_+^(−1/ξ) }

where ξ = shape (type), ψ = scale, µ = location.
ξ < 0: an upper limit exists; ξ > 0: a lower limit exists; ξ = 0: ”Gumbel distribution”, no limits exist.
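For the computer experiments, the GEV cdf is available in the evd package; a small sketch with purely illustrative parameter values:

library(evd)
# GEV cdf with location mu = 1, scale psi = 0.5, shape xi = 0.2
pgev(2.5, loc = 1, scale = 0.5, shape = 0.2)
# The Gumbel case corresponds to shape = 0
pgev(2.5, loc = 1, scale = 0.5, shape = 0)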
GEV and GPD
Light, Gumbel, or Heavy tails
[Figure: three Gumbel probability plots, −log(−log(F)) against X, illustrating a light tail (ξ < 0), a double-exponential/Gumbel tail (ξ = 0), and a heavy tail (ξ > 0).]
GEV and GPD
The ”Tail theorem”
Another ”almost true” theorem: Almost all statistical distributions have a ”tail” that is GPD – they have a Generalized Pareto Distribution. Take only the observations that are above a fixed threshold u and let Y = X − u be the exceedance.

Prob(Y ≤ y) ≈ 1 − [1 + ξ y/σ]_+^(−1/ξ)

where ξ = shape (type), σ = scale (µ = location = 0).
ξ = 0: exponential tail; ξ > 0: heavy tail; ξ < 0: limited tail.
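The corresponding GPD tail probability can be evaluated with the evd package; again a sketch with illustrative parameter values:

library(evd)
# Prob(Y > y) for an exceedance Y with scale sigma = 1 and shape xi = 0.1
y <- 2
1 - pgpd(y, loc = 0, scale = 1, shape = 0.1)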
GEV and GPD
GPD-tail in normal distribution
Normal tail is GPD with ξ = 0
[Figure: top – histogram of a normal sample; bottom – the tail above 2 of the normal distribution: red = empirical cdf of exceedances over 2, blue = estimated GPD.]
Block maxima
Block maxima with ”block” = one year
Many observations over a year (hourly, semi-daily, daily) – take the maximum. Assume a GEV distribution for the yearly maximum.
Estimate the GEV parameters by the ”maximum likelihood” method in the R package extRemes (or WAFO):
fit <- gev.fit(data)
Block maxima
100 years of precipitation in Fort Collins, CO
Fort Collins, CO, daily precipitation:
http://ccc.atmos.colostate.edu/~odie/rain.html
Time series of daily precipitation, 1900 – 1999
Strong annual cycle – wet in late spring, dry in winter
No long-term trend
Recent flood, July 28, 1997
Block maxima
[Figure: annual daily maximum precipitation (in), Fort Collins, 1900–2000.]
Block maxima
Seasons – can be handled by making separate analyses for each month.
[Figure: monthly maximum precipitation (in) plotted against month 1–12.]
Block maxima
Some R-code
data <- read.table(file="FtCprec.prn", header=TRUE)
plot(data$Mn, data$Prec/100, ylab="Precipitation (in)", xlab="month")

# Yearly maxima (converted to inches)
YearlyMax <- aggregate(Prec ~ Year, data=data, max)
YearlyMax[,2] <- YearlyMax[,2]/100
plot(YearlyMax, type='l', col='blue')

# Fit a GEV to the annual maxima and show diagnostic plots
library(extRemes)
ftcanmax <- read.table(file="Ftcanmax.prn", header=TRUE)
fit <- gev.fit(ftcanmax$Prec/100)
gev.diag(fit)
Block maxima
Estimated parameter values + standard errors

Parameter     Estimate   Standard error
Location µ    1.35       0.062
Scale ψ       0.53       0.049
Shape ξ       0.17       0.092
Block maxima
Is the GEV a Gumbel distribution?
Is the shape parameter ξ = 0.174 > 0 really significantly different from 0?
This can be tested by a likelihood ratio test:

Dev = −2 log [ (max likelihood under the restricted model (ξ = 0)) / (max likelihood under the full model) ]

If the restricted model is correct, Dev has a chi-squared distribution with
d.f. = (# parameters in full model) − (# parameters in restricted model).

Test by extRemes (ξ = 0 = Gumbel):

fit0 <- gum.fit(ftcanmax$Prec/100)
Dev <- 2*(fit0$nllh - fit$nllh)
pval <- pchisq(Dev, 1, lower.tail=FALSE)   # = 0.038
Block maxima
A 95% confidence interval for shape estimate is (0.009,
0.369)
[Figure: profile log-likelihood for the shape parameter ξ, produced by]
gev.profxi(fit, fit$mle[3] - 0.25, fit$mle[3] + 0.25)
Block maxima
Diagnostics and conclusions
We want return values for N = 10, 100, 1000 years in the GEV:

xN = µ − (ψ/ξ) { 1 − (− ln(1 − 1/N))^(−ξ) }

Estimated standard errors give confidence limits for the return levels.

gev.diag(fit)

An example for the exercises:

fit.rl <- return.level(fit, rperiods=c(10,100), conf=0.05)
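The return-level formula can also be applied directly; a small sketch using the Fort Collins point estimates (the standard errors and the profile likelihood give the uncertainty):

# Return level x_N from the GEV formula (xi != 0)
gev.rl <- function(N, mu, psi, xi) mu - (psi/xi) * (1 - (-log(1 - 1/N))^(-xi))
gev.rl(c(10, 100, 1000), mu = 1.35, psi = 0.53, xi = 0.17)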
Block maxima
GEV summary
[Figure: gev.diag(fit) output – probability plot, quantile plot, return level plot, and density plot for the fitted GEV.]
Block maxima
Trends in extremes – Water level in the Japan Sea
[Figure: top – water level (m) against time (h), 0–25 h; bottom – maximum 5-minute water level (m) against time (h).]
Block maxima
Alt I: Normalize (simplistic)
Take the average and standard deviation for each 5-minute period, subtract and divide. Take the maximum over each 5-minute period, and fit a GEV.
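A minimal sketch of this normalization step, assuming wl holds the water-level series and block labels the 5-minute period each observation belongs to (both names are illustrative):

# Normalize within each 5-minute period, then take the block maxima
z    <- (wl - ave(wl, block)) / ave(wl, block, FUN = sd)
zmax <- tapply(z, block, max)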
Diagnostic plot:
[Figure: diagnostic plots for the GEV fitted to the normalized 5-minute maxima – probability plot, density plot, residual probability plot, and residual quantile plot. Fit method: PWM, fit p-value: 1.00.]
Block maxima
Alt II: fit with trend
# Stationary GEV fit to the 5-minute maxima
yura5fit <- gev.fit(y5max[,1])
Est.: 13.362742  0.7083573 -0.1156161
S.e.:  0.0453072 0.0307655  0.0249203

# Linear trend in the location parameter (mul = 1)
yura5fit.mul <- gev.fit(y5max[,1],
    ydat=as.matrix(1:length(y5max[,1])), mul=1)
Est.: 13.90486754 -0.0036693  0.59705809 -0.0564354
S.e.:  0.06510006  0.00036877 0.02695204  0.03153350

# Linear trend in the scale parameter (sigl = 1)
yura5fit.sigl <- gev.fit(y5max[,1],
    ydat=as.matrix(1:length(y5max[,1])), sigl=1)
Est.: 13.337740339 0.406178243 0.001745658 0.098749706
S.e.:  0.04233340  0.02916609  0.00004856  0.05472873
Exceedance analysis
The dilemma of statistical extremes
We want to predict events that have never been observed!
From 20 years of data – can one say something about the 100-year return value?
Exceedance analysis
20 years of monthly data
[Figure: 20 years (240 values) of monthly data plotted against observation number.]
Exceedance analysis
Use more high values
Waste of data to use only yearly maximum
20 years of data = 240 monthly data
The smallest yearly maximum (year 7) is X7 = 1.67. There are 42
monthly values greater than 1.67
Can one use all 42?
Or all those 48 greater than 1.5?
Or the 84 values greater than 1?
Exceedance analysis
High values are rare and occur randomly
Take a reasonably high level u – try many!
Estimate the rate λ = λu by which this level is exceeded per time unit (e.g. per year):

λ* = (observed number of exceedances over u) / (total observation time)

u = 1.5 gives λ*1.5 = 48/20 = 2.4
It is reasonable to assume that N = the number of exceedances over 1.5 in any given year has a Poisson distribution with mean λ = 2.4:

Prob(N = k) = e^(−λ) λ^k / k!
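These Poisson probabilities are easy to tabulate in R (a sketch):

# Prob(N = 0), ..., Prob(N = 5) exceedances of u = 1.5 in one year
dpois(0:5, lambda = 2.4)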
Exceedance analysis
Generalized Pareto distribution – GPD
Exceedances over a high threshold are more representative of the global extremes than the bulk data.
“Almost all distributions” have a Generalized Pareto tail, GPD.
With Y = X − u = the exceedance over level u:

Prob(Y ≤ y) ≈ 1 − [1 + ξ y/σ]_+^(−1/ξ)

Exponential tail: ξ = 0; Heavy tail: ξ > 0; Limited tail: ξ < 0
Exceedance analysis
GPD-tail in normal distribution
Normal tail is GPD with ξ = 0
[Figure: top – histogram of a normal sample; bottom – the tail above 2 of the normal distribution: red = empirical cdf of exceedances over 2, blue = estimated GPD.]
Exceedance analysis
Poisson + GPD = GEV
N = number of exceedances Yj = Xj − u over u is (approximately)
Poisson distributed with expectation λ
The sizes of the exceedances Y1, ..., YN have (approximately) a GPD
If M = yearly maximum = u + max(Y1, ..., YN), then for x > u:

Prob(M ≤ x) = Prob(N = 0) + Σ_{n=1}^∞ Prob(N = n, Y1, ..., Yn ≤ x − u)
            = ... = exp{ −λ [1 + ξ (x − u)/σ]_+^(−1/ξ) }    (1)
Exceedance analysis
Poisson + GPD = GEV, contd
(1) is a GEV distribution:

Prob(M ≤ x) = exp{ −[1 + ξ (x − µ)/ψ]_+^(−1/ξ) }

Translation from Poisson + GPD to GEV:

ψ = σ λ^ξ,   µ = u + (ψ − σ)/ξ

To get the maximum over n years, replace λ with nλ in (1).
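The translation can be written as a small helper function; a sketch with illustrative names and example values (not taken from the slides):

# Convert Poisson rate + GPD parameters to the equivalent GEV parameters
pot2gev <- function(lambda, sigma, xi, u) {
  psi <- sigma * lambda^xi         # GEV scale
  mu  <- u + (psi - sigma)/xi      # GEV location (requires xi != 0)
  c(location = mu, scale = psi, shape = xi)
}
pot2gev(lambda = 2.4, sigma = 0.9, xi = 0.1, u = 1.5)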
Exceedance analysis
Choice of threshold
How to choose u? Note: we assume a GPD above the threshold u.
Diagnostic: a GPD has a linear mean excess function,

E(X − u | X > u) = (σ + ξ u) / (1 − ξ)

Plot the mean of all excesses over u as a function of u. Take u as the smallest threshold for which the curve to the right of it is ”linear”.
The slope is ξ/(1 − ξ) if ξ < 1.
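A minimal base-R sketch of the mean excess plot, assuming x holds the monthly observations (the name is illustrative):

# Mean excess E(X - u | X > u) over a grid of thresholds
u.grid <- seq(quantile(x, 0.5), max(x) - 0.1, length.out = 50)
me <- sapply(u.grid, function(u) mean(x[x > u] - u))
plot(u.grid, me, type = "l", xlab = "threshold u", ylab = "mean excess")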
Exceedance analysis
Mean excess plot
Plot of E(X − u | X > u) for 20 years of monthly data:
[Figure: mean exceedance over threshold, plotted for thresholds u between 1 and 4.]
Exceedance analysis
Diagnostics in GPD analysis
A mean excess plot can be hard to interpret.
Alternative: estimate a full GPD for different thresholds.
If the tail above u0 is GPD, then all exceedances over u > u0 are also GPD,
  with the same shape parameter ξ,
  but with a different scale parameter σu = σu0 + ξ (u − u0).
The “modified scale” σu − ξ u should therefore be constant, independent of u, when the GPD is appropriate.
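This threshold-stability check can be made by refitting the GPD over a range of thresholds; a sketch using gpd.fitrange from the ismev package (the threshold range is illustrative, and x is again the data vector):

library(ismev)
# Fit a GPD for thresholds between 1 and 3 and plot parameter stability
gpd.fitrange(x, umin = 1, umax = 3, nint = 20)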
Exceedance analysis
Estimated CDF for yearly maximum
From 20 yearly maxima: c = 0.14, µ = 2.81, a = 0.77
[Figure: empirical cdf and GEV cdf estimated by the PWM method from the 20 yearly maxima, together with the true cdf for the yearly maximum.]
Exceedance analysis
Estimated CDF by POT method
84 exceedances over u = 1 and GPD estimate gives ξ = 0.04,
b = 2.38, a = 0.93
[Figure: tail probability on a logarithmic scale – the tail estimated by the POT method, the tail from direct GEV estimation, and the true cdf.]
Exceedance analysis
Fatalities in English coal mines
Time and number of fatalities 1861 - 1962
[Figure: number of fatalities per accident, plotted against time 1860–1960.]
Exceedance analysis
GEV?
We try GEV on all data (not really motivated!):
[Figure: empirical cdf of all accident sizes and the fitted GEV cdf (PWM method).]
Exceedance analysis
Consider only big accidents – POT
There are 25 accidents with > 100 dead. Fit GPD to data > 100. 10% of
these exceed 350.
[Figure: empirical cdf of accidents with more than 100 deaths and the fitted GPD.]
Exceedance analysis
Vargas, Venezuela, catastrophe 1999
On December 15, 1999, the Vargas province was hit by 410 mm of rain.
Exceedance analysis
Rain in Venezuela - GEV or Gumbel?
The Gumbel distribution is GEV with shape parameter ξ = 0. Based on rain statistics 1951–1998 the distribution of the maximum daily rain was estimated with a GEV: scale = 19.9, location = 49.2, shape = 0.16.
The shape parameter ξ is not significantly different from 0, so Gumbel could ”perhaps” be used instead of the GEV. The estimated 10000-year return value for daily rain is then x10000 = 249 mm. In December 1999 the Vargas province was hit by 410 mm of rain during one day.
From the full GEV the estimate of the 10000-year value would have been x10000 = 468 mm, much closer to the observed value. A 95% one-sided confidence interval is x10000 < 1030 mm.
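A quick check of the 468 mm figure, plugging the quoted GEV parameters into the return-level formula from the block-maxima slides:

mu <- 49.2; psi <- 19.9; xi <- 0.16; N <- 10000
mu - (psi/xi) * (1 - (-log(1 - 1/N))^(-xi))   # approximately 468 mm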
Multivariate extremes
General recipe for bivariate extremes
Assume, for a sequence of days,
X1 , X2 , . . . is the wind speed and
Y1 , Y2 , . . . is the wave height measured at a North Sea platform.
The platform is designed for extreme
waves. It is also designed for extreme
winds. What about the combined
effect of wind and waves (and current)?
Make marginal EVA and transform
x and y to standard scale. Plot
pairs of transformed extremes to examine joint extreme behavior. Three
main types of dependence/nondependence:
[Figure: pairs of transformed extremes on the standard scale, illustrating three main types of behavior: extremes come together; extremes just add; extremes come alone.]
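A minimal sketch of the transformation to a common standard scale via the empirical cdfs (here to unit Frechet margins; x and y denote the two observed series, names illustrative):

# Transform each margin to (approximately) unit Frechet and plot the pairs
Fx <- rank(x) / (length(x) + 1)
Fy <- rank(y) / (length(y) + 1)
plot(-1/log(Fx), -1/log(Fy), log = "xy",
     xlab = "wind speed (standard scale)",
     ylab = "wave height (standard scale)")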
Some literature
Some literature
Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J. (2004), Statistics of
Extremes: Theory and Applications. Wiley.
Coles S. (2001), An Introduction to Statistical Modeling of Extreme
Values. Springer-Verlag.
Gilleland, E., Katz, R., Young, G. (Feb. 2012), Package ’extRemes’.
Nelsen, R.B. (2006), An Introduction to Copulas. Wiley.
The WAFO group (2011), WAFO – a MATLAB toolbox for analysis of
random waves and loads. Lund univ.
http://www.maths.lth.se/matstat/wafo/ and
http://code.google.com/p/wafo/