Sharp statistical tools
Statistics for extremes
Georg Lindgren, Lund University
October 18, 2012, SARMA

Background

Motivation
- We want to predict outside the range of the observations.
- Sums, averages and proportions: normal distribution, the central limit theorem, successes in large experiments.
- Extremes: the normal distribution is inappropriate, and the bulk of the data may be misleading; extremes are rare.
- Sparsity of data may be compensated by proper (extreme value) theory.

Overview of contents
Topics:
- Extreme value distributions
- Block maxima
- POT - Peaks Over Threshold
- Estimation, return periods, uncertainties
- Extremes with a cyclic or linear trend
- Extremes with other covariates - CO2, NAO, AO, ...
Useful statistics packages:
- The R environment - the packages extRemes and evd are needed for the computer experiments
- The WAFO package in Matlab and Python (downloadable from code.google.com)

A typical example: annual maximum of daily precipitation
[Figure: 100 years of yearly maxima of daily precipitation (in), Fort Collins, 1900-2000]

A second typical example
Monthly maximum precipitation values have 12 times as many data - but a seasonal pattern.
[Figure: monthly maximum precipitation (in) against month, 1-12]

Some basic facts about statistical extremes
Notation: Mn = max(X1, X2, ..., Xn) is the maximum of n observations of a random quantity.
Example: Xk = the maximum precipitation in year k.
Problem: find the probability Prob(Mn > x) for large n and reasonable x - in particular when x is larger than the largest observed data point!

Return period - return level
X1, X2, ... is a sequence of random quantities with a common distribution, e.g. one observation per year (a mean, a yearly maximum, ..., anything).
The 100-year return value x100 is the value that is exceeded on average once in 100 years; equivalently,
  Prob(X1 > x100) = 1/100.
It follows that
  Prob(M100 > x100) = 1 - (1 - 1/100)^100 ≈ 1 - 1/e ≈ 0.63.
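The 1 - (1 - 1/100)^100 ≈ 0.63 computation is easy to check numerically. A minimal sketch in Python (not part of the course code; the R or WAFO equivalent is a one-liner as well):

```python
import math

# Probability that the N-year return level is exceeded at least once in N years,
# assuming independent years: 1 - (1 - 1/N)^N, which tends to 1 - 1/e as N grows.
def prob_exceed_once(N: int) -> float:
    return 1.0 - (1.0 - 1.0 / N) ** N

print(prob_exceed_once(100))     # approximately 0.634
print(1.0 - math.exp(-1.0))      # the limit 1 - 1/e, approximately 0.632
```

For any return period N the limit 1 - 1/e is already a good approximation for moderate N, which is why the 63% figure is the same for the 1000-year value over 1000 years.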
The probability that the 100-year return level is exceeded at least once during a 100-year period is 63%. The 1000-year value is exceeded with the same probability at least once during 1000 years, etc. - IF YEARS ARE INDEPENDENT OF EACH OTHER (USUALLY, ...).

Wave data from the Pacific Ocean
582 measurements of Hs = "significant wave height" at buoy 46005 in the Pacific: about 14 data per month, Dec-Feb, during 7 years.
[Figure: time series of Hs (m) against measurement number]

Histogram of Pacific wave height
[Figure: histogram of Hs; empirical CDF of the Hs exceedances over 7 m together with the CDF of an exponential distribution shifted to +7 m]
Conclusion: the tail of the distribution has a more "standardized" distribution than the central part.

GEV and GPD

Goals for statistical extreme value analysis
One-dimensional:
- From a series of observations (often in time), estimate the probability of high values, of the order of the largest observation - OR HIGHER, even MUCH HIGHER.
- Make a statement about how uncertain the prediction is.
- Identify possible non-stationarity in the extremes - it can be different from non-stationarity in averages or standard deviations.
- Find covariates that explain the occurrence of extreme values.
- Utilize the data as much as possible - a balance between the number of data available and "bias".
Multi-dimensional:
- Find probabilities of combinations of extreme (or not so extreme) values in two or more series (research is going on).
- Extremes in random fields (even more research is going on).

The GEV and the GPD
Block maxima: take one extreme value per time unit (e.g. day, month, year, ...). Ideal situation: stationary conditions, no trend.
Exceedance analysis: take all values over a certain threshold - this gives more data, and data that are more representative for the global maximum than the bulk data.
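The two data-selection schemes just described - block maxima versus exceedances over a threshold - can be sketched in a few lines. A hypothetical Python/numpy illustration on simulated data (the names and numbers are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: 20 "years" with 12 "monthly" values each.
monthly = rng.exponential(scale=1.0, size=(20, 12))

# Block maxima: one extreme value per block (here, per year) -> 20 data points.
yearly_max = monthly.max(axis=1)

# Exceedance analysis (POT): keep ALL values above a threshold u instead.
u = 2.0
exceedances = monthly[monthly > u] - u   # the exceedances Y = X - u

print(yearly_max.shape)    # (20,)
print(exceedances.size)    # the number of threshold exceedances
```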
Statistical extreme value analysis relies on two mathematical facts:
- Block maxima: the maximum of many observations has a GEV distribution.
- Exceedance analysis: the exceedances have a GPD distribution.

The "Extremal types theorem"
An "almost true" theorem: when n is large, the distribution of the maximum Mn is a GEV = Generalized Extreme Value distribution,
  Prob(Mn ≤ x) ≈ exp{ -[1 + ξ (x - µ)/ψ]_+^(-1/ξ) },
where ξ = shape (type), ψ = scale, µ = location.
ξ < 0: an upper limit exists; ξ > 0: a lower limit exists; ξ = 0: the "Gumbel distribution", no limits exist.

Light, Gumbel, or heavy tails
[Figure: Gumbel probability plots for a light tail (ξ < 0), a double-exponential tail (ξ = 0), and a heavy tail (ξ > 0)]

The "Tail theorem"
Another "almost true" theorem: almost all statistical distributions have a "tail" that is GPD - they have a Generalized Pareto Distribution. Take only the observations that are above a fixed threshold u and let Y = X - u be the exceedance. Then
  Prob(Y ≤ y) ≈ 1 - [1 + ξ y/σ]_+^(-1/ξ),
where ξ = shape (type), σ = scale (µ = location = 0).
ξ = 0: exponential tail; ξ > 0: heavy tail; ξ < 0: limited tail.

GPD-tail in the normal distribution
The normal tail is GPD with ξ = 0.
[Figure: histogram of a normal sample; red = empirical CDF of the exceedances over 2, blue = estimated GPD]

Block maxima

Block maxima with "block" = one year
Many observations over a year (hourly, semi-daily, daily) - take the maximum. Assume a GEV distribution for the yearly maximum.
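Assuming a GEV for the yearly maxima, the parameters can be estimated by maximum likelihood. A hypothetical Python sketch using scipy's genextreme as a stand-in for gev.fit in the R package extRemes (the simulated data and parameter values are illustrative); note that scipy parametrizes the shape as c = -ξ relative to the slides' convention:

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(42)

# Simulate "yearly maxima" from a known GEV.
# NOTE: scipy's shape parameter is c = -xi compared to the slides' convention.
xi, mu, psi = 0.2, 1.35, 0.53
sample = genextreme.rvs(c=-xi, loc=mu, scale=psi, size=2000, random_state=rng)

# Maximum likelihood fit of the three GEV parameters.
c_hat, mu_hat, psi_hat = genextreme.fit(sample)
xi_hat = -c_hat

# The estimates should be close to the true (xi, mu, psi) = (0.2, 1.35, 0.53).
print(xi_hat, mu_hat, psi_hat)
```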
Estimate the parameters in the GEV by the maximum likelihood method in the R package extRemes (or WAFO):

fit <- gev.fit(data)

100 years of precipitation in Fort Collins, CO
Fort Collins, CO, daily precipitation: http://ccc.atmos.colostate.edu/ odie/rain.html
- Time series of daily precipitation, 1900-1999
- Strong annual cycle - wet in late spring, dry in winter
- No long-term trend
- Recent flood, July 28, 1997

Annual daily maximum precipitation, Fort Collins
[Figure: yearly maximum of daily precipitation (in), 1900-2000]

Seasons - can be handled by making separate analyses for each month
[Figure: monthly maximum precipitation (in) against month]

Some R code

data <- read.table(file="FtCprec.prn", header=TRUE)
plot(data$Mn, data$Prec/100, ylab="Precipitation (in)", xlab="month")
YearlyMax <- aggregate(Prec ~ Year, data=data, max)
YearlyMax[,2] <- YearlyMax[,2]/100
plot(YearlyMax, type='l', col='blue')
library(extRemes)
ftcanmax <- read.table(file="Ftcanmax.prn", header=TRUE)
fit <- gev.fit(ftcanmax$Prec/100)
gev.diag(fit)

Estimated parameter values + standard errors

Parameter     Estimate   Standard error
Location µ    1.35       0.062
Scale ψ       0.53       0.049
Shape ξ       0.17       0.092

Is the GEV a Gumbel distribution?
Is the shape parameter ξ = 0.174 > 0 really significantly different from 0? This can be tested by a likelihood ratio test:
  Dev = -2 log [ (max likelihood under the restricted model (ξ = 0)) / (max likelihood under the full model) ]
If the restricted model is correct, Dev has a chi-squared distribution with d.f. = (# parameters in the full model) - (# parameters in the restricted model).
Test by extRemes (ξ = 0 = Gumbel):

fit0 <- gum.fit(ftcanmax$Prec/100)
Dev <- 2*(fit0$nllh - fit$nllh)
pval <- pchisq(Dev, 1, lower.tail=FALSE)   # = 0.038

A 95% confidence interval for the shape estimate is (0.009, 0.369):

gev.profxi(fit, fit$mle[3] - 0.25, fit$mle[3] + 0.25)

[Figure: profile log-likelihood for the shape parameter ξ]

Diagnostics and conclusions
We want return values for N = 10, 100, 1000 years in the GEV:
  xN = µ - (ψ/ξ) [1 - (-ln(1 - 1/N))^(-ξ)]
The estimated standard errors give confidence limits for the return levels.

gev.diag(fit)

An example for the exercises:

fit.rl <- return.level(fit, rperiods=c(10,100), conf=0.05)

GEV summary
[Figure: probability plot, quantile plot, return level plot and density plot for the fitted GEV]

Trends in extremes - water level in the Japan Sea
[Figure: water level (m) and maximum 5-min water level (m) over 25 hours]

Alt I: normalize (simplistic)
Take the average and standard deviation for each 5-minute period, subtract and divide. Take the maximum over each 5-minute period, and fit a GEV.
[Figure: probability plot, density plot, residual probability plot and residual quantile plot; fit method PWM, fit p-value 1.00]

Alt II: fit with a trend

yura5fit <- gev.fit(y5max[,1])
  Est.: 13.3627  0.7084  -0.1156    (location, scale, shape)
  S.e.:  0.0453  0.0308   0.0249

yura5fit.mul <- gev.fit(y5max[,1], ydat=as.matrix(1:length(y5max[,1])), mul=1)
  Est.: 13.9049  -0.00367  0.5971  -0.0564    (location, trend in location, scale, shape)
  S.e.:  0.0651   0.00037  0.0270   0.0315

yura5fit.sigl <- gev.fit(y5max[,1], ydat=as.matrix(1:length(y5max[,1])), sigl=1)
  Est.: 13.3377  0.4062  0.00175  0.0987    (location, scale, trend in scale, shape)
  S.e.:  0.0423  0.0292  0.00005  0.0547

Exceedance analysis

The dilemma of statistical extremes
We want to predict events that have never been observed! From 20 years of data - can one say something about the 100-year return value?

20 years of monthly data
[Figure: 20 years = 240 monthly values plotted against time]

Use more high values
It is a waste of data to use only the yearly maxima; 20 years of data = 240 monthly data. The smallest yearly maximum (year 7) is X7 = 1.67, and there are 42 monthly values greater than 1.67. Can one use all 42? Or all 48 values greater than 1.5? Or the 84 values greater than 1?

High values are rare and occur randomly
Take a reasonably high level u - try many! Estimate the rate λ = λu by which this level is exceeded per time unit (e.g. per year):
  λ* = (observed number of exceedances over u) / (total observation time)
u = 1.5 gives λ*_1.5 = 48/20 = 2.4.
It is reasonable to assume that N = the number of exceedances over 1.5 in any year has a Poisson distribution with mean λ = 2.4:
  Prob(N = k) = e^(-λ) λ^k / k!
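The Poisson model for the exceedance counts can be written out directly; a sketch in Python with scipy (the counts - 48 exceedances in 20 years - are the ones from the slides):

```python
import math
from scipy.stats import poisson

# Estimated exceedance rate: 48 exceedances of u = 1.5 during 20 years.
lam = 48 / 20                      # lambda* = 2.4 per year

# Prob(N = k) = exp(-lambda) * lambda^k / k!
p0_manual = math.exp(-lam)         # probability of no exceedance in a year
p0_scipy = poisson.pmf(0, lam)     # the same, via scipy

p_at_least_one = 1.0 - p0_scipy    # probability of at least one exceedance

print(lam, p0_manual)
```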
The Generalized Pareto distribution - GPD
Exceedances over a high threshold are more representative for the global extremes than the bulk data. "Almost all distributions" have a Generalized Pareto tail, GPD. With Y = X - u = the exceedance over the level u,
  Prob(Y ≤ y) ≈ 1 - [1 + ξ y/σ]_+^(-1/ξ).
Exponential tail: ξ = 0; heavy tail: ξ > 0; limited tail: ξ < 0.

GPD-tail in the normal distribution (as before)
The normal tail is GPD with ξ = 0.
[Figure: histogram of a normal sample; red = empirical CDF of the exceedances over 2, blue = estimated GPD]

Poisson + GPD = GEV
N = the number of exceedances Yj = Xj - u over u is (approximately) Poisson distributed with expectation λ. The sizes of the exceedances, Y1, ..., YN, have (approximately) a GPD. If M = yearly maximum = u + max(Y1, ..., YN), then for x > u:
  Prob(M ≤ x) = Prob(N = 0) + Σ_{n=1}^∞ Prob(N = n, Y1 ≤ x - u, ..., Yn ≤ x - u)
              = ... = exp{ -λ [1 + ξ (x - u)/σ]_+^(-1/ξ) }    (1)

Poisson + GPD = GEV, contd
(1) is a GEV distribution,
  Prob(M ≤ x) = exp{ -[1 + ξ (x - µ)/ψ]_+^(-1/ξ) }.
Translation from Poisson + GPD to GEV:
  ψ = σ λ^ξ,    µ = u + (ψ - σ)/ξ.
To get the maximum over n years, replace λ by nλ in (1).

Choice of threshold
How to choose u? Note: we assume a GPD above the threshold u.
Diagnostic: a GPD has a linear mean excess function,
  E(X - u | X > u) = (σ + ξ u)/(1 - ξ).
Plot the mean of all excesses over u as a function of u, and take u as the smallest threshold for which the curve to the right of it is "linear". The slope is ξ/(1 - ξ) if ξ < 1.
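The Poisson + GPD to GEV translation above (ψ = σλ^ξ, µ = u + (ψ - σ)/ξ) can be verified numerically; a sketch in Python with illustrative parameter values (assumed, not taken from the slides):

```python
import math

# Illustrative Poisson + GPD parameters above a threshold u (assumed values).
u, lam, sigma, xi = 1.5, 2.4, 0.8, 0.2

# Poisson + GPD form: Prob(M <= x) = exp(-lam * (1 + xi*(x - u)/sigma)^(-1/xi)).
def cdf_pot(x):
    return math.exp(-lam * (1.0 + xi * (x - u) / sigma) ** (-1.0 / xi))

# The equivalent GEV parameters according to the translation formulas.
psi = sigma * lam ** xi
mu = u + (psi - sigma) / xi

def cdf_gev(x):
    return math.exp(-(1.0 + xi * (x - mu) / psi) ** (-1.0 / xi))

# The two forms agree for all x above the threshold.
for x in (2.0, 3.0, 5.0, 10.0):
    assert abs(cdf_pot(x) - cdf_gev(x)) < 1e-12
print("Poisson + GPD and GEV forms agree")
```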
Mean excess plot
[Figure: plot of the empirical mean excess E(X - u | X > u) for 20 years of monthly data, thresholds u = 1 to 4]

Diagnostics in GPD analysis
A plot of the mean excess can be hard to interpret. Alternative: estimate a full GPD for several different thresholds. If the tail above u0 is GPD, then all exceedances over u > u0 are also GPD, with the same shape parameter ξ but with a different scale parameter,
  σu = σ_u0 + ξ (u - u0).
The "modified scale" σu - ξ u should be constant, independent of u, when the GPD is appropriate.

Estimated CDF for the yearly maximum
From the 20 yearly maxima: c = 0.14, µ = 2.81, a = 0.77.
[Figure: empirical CDF and GEV-estimated CDF (PWM method) together with the true CDF for the yearly maximum]

Estimated CDF by the POT method
The 84 exceedances over u = 1 and the GPD estimate give ξ = 0.04, b = 2.38, a = 0.93.
[Figure: tail probability on log scale - the tail by the POT method, the true CDF, and the tail by direct GEV estimation]

Fatalities in English coal mines
[Figure: time and number of fatalities in accidents, 1861-1962]

GEV?
We try a GEV on all data (not really motivated!):
[Figure: empirical and GEV-estimated CDF (PWM method) for the number of deaths]

Consider only big accidents - POT
There are 25 accidents with > 100 dead. Fit a GPD to the data > 100; 10% of these exceed 350.
[Figure: CDF for deaths > 100 and the fitted GPD]

Vargas, Venezuela, catastrophe 1999
On December 15, 1999, the Vargas province was hit by 410 mm of rain.

Rain in Venezuela - GEV or Gumbel?
The Gumbel distribution is a GEV with shape parameter ξ = 0.
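The effect of the shape parameter on high return levels can be made concrete with the return-level formula xN = µ - (ψ/ξ)[1 - (-ln(1 - 1/N))^(-ξ)] from the block maxima part. A Python sketch reusing the Fort Collins estimates (µ, ψ, ξ) = (1.35, 0.53, 0.17) from the slides (note: a real Gumbel comparison would re-estimate µ and ψ under ξ = 0, as gum.fit does; here they are kept fixed purely for illustration):

```python
import math

def gev_return_level(N, mu, psi, xi):
    """N-year return level of a GEV: the x with Prob(yearly max <= x) = 1 - 1/N."""
    y = -math.log(1.0 - 1.0 / N)
    if xi == 0.0:                         # Gumbel limit of the formula
        return mu - psi * math.log(y)
    return mu - (psi / xi) * (1.0 - y ** (-xi))

# Fort Collins GEV estimates from the slides.
mu, psi, xi = 1.35, 0.53, 0.17
for N in (10, 100, 1000):
    print(N, round(gev_return_level(N, mu, psi, xi), 2))

# Forcing xi = 0 (Gumbel) with the same location and scale gives markedly
# smaller high return levels - the same phenomenon as in the Venezuela example.
print(round(gev_return_level(1000, mu, psi, 0.0), 2))
```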
Based on rain statistics 1951-1998, the distribution of the maximum daily rain was estimated with a GEV: scale = 19.9, location = 49.2, shape = 0.16. The shape parameter ξ is not significantly different from 0, so a Gumbel distribution could "perhaps" be used instead of the GEV. The estimated 10000-year return value for daily rain is then x10000 = 249 mm.
On one day in December 1999, the Vargas province was hit by 410 mm of rain. From the full GEV, the estimate of the 10000-year value would have been x10000 = 468 mm, much closer to the real value. A 95% one-sided confidence interval is x10000 < 1030 mm.

Multivariate extremes

General recipe for bivariate extremes
Assume that, for a sequence of days, X1, X2, ... is the wind speed and Y1, Y2, ... is the wave height measured at a North Sea platform. The platform is designed for extreme waves. It is also designed for extreme winds. What about the combined effect of wind and waves (and current)?
- Make a marginal EVA and transform x and y to a standard scale.
- Plot the pairs of transformed extremes to examine the joint extreme behavior.
Three main types of dependence/non-dependence:
- Extremes come together
- Extremes just add
- Extremes come alone
[Figure: point clouds on the standard scale illustrating the three types]

Some literature

Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J. (2004), Statistics of Extremes: Theory and Applications. Wiley.
Coles, S. (2001), An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag.
Gilleland, E., Katz, R., Young, G. (Feb. 2012), Package 'extRemes'.
Nelsen, R.B. (2006), An Introduction to Copulas. Wiley.
The WAFO group (2011), WAFO - a MATLAB toolbox for analysis of random waves and loads. Lund University. http://www.maths.lth.se/matstat/wafo/ and http://code.google.com/p/wafo/