Hi all,
I am trying to perform a hypothesis test of whether some observed data X come from a Generalized Pareto Distribution (GPD) or not. Naturally, tests from HypothesisTests.jl like the Kolmogorov-Smirnov or the Anderson-Darling test seemed like a good fit. To my surprise, these tests completely failed even when provided with data sampled directly from a GPD.
Initially, I thought this was a bug with HypothesisTests.jl, so I opened an issue: Incorrect p-values for nonparameteric statistical tests of Generalized Pareto Distribution · Issue #305 · JuliaStats/HypothesisTests.jl · GitHub
In this issue you will find a slim MWE that produces GPD data and then estimates p-values for them with standard hypothesis tests. The p-values vary wildly between 0 and 1 instead of being consistently very small.
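For reference, here is a minimal sketch along the lines of that MWE (the exact code in the issue may differ): sample from a GPD and ask the one-sample KS and AD tests from HypothesisTests.jl whether the sample is compatible with that same GPD.
using Distributions, HypothesisTests

gpd = GeneralizedPareto(0.0, 0.5, -0.1)
X = rand(gpd, 10_000)

# Both tests take the sample and the fully specified null distribution.
@show pvalue(ExactOneSampleKSTest(X, gpd))
@show pvalue(OneSampleADTest(X, gpd))
# Rerunning this yields p-values scattered over (0, 1), which is what prompted the question.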
Now, I am not so sure whether the problem is with HypothesisTests.jl; it may be with the GPD in general. I have written my own crude version of a Cramér-von Mises test, but it behaves just like the tests above: its p-values vary wildly instead of being close to 0.
What is wrong?
Cramér-von Mises code:
using Distributions
sigma = 1 / 2.0
xi = -0.1
gpd = GeneralizedPareto(0.0, sigma, xi)
X = rand(gpd, 10000)
# Cramér-von Mises test, direct implementation
n = length(X)
xs = sort(X)
T = 1/(12n) + sum(i -> (cdf(gpd, xs[i]) - (2i - 1)/(2n))^2, 1:n)
# Under the null hypothesis, T has asymptotic mean 1/6 and
# standard deviation of approximately sqrt(1/45).
# As a crude approximation, treat the standardized statistic as normal.
zstat = (T - 1/6) / sqrt(1/45)
# Large T indicates a poor fit, so the p-value is the upper tail probability.
newp = 1 - cdf(Normal(0, 1), zstat)
@show newp
You are probably aware of the extreme value theory literature, but sharing here in case it has references to books discussing the issue:
It has been a while since I last touched this literature, but contributions are welcome if you find a good hypothesis test for GPDs.
There is nothing wrong here: p-values are statistics too, i.e., functions of the data, and accordingly they vary across different data sets. Further, they are constructed to have a uniform distribution when the data have indeed been generated from the null hypothesis, i.e., as in your case.
What is used in hypothesis testing is that their distribution becomes skewed towards small values when the data are not from the null hypothesis. Thus, a small p-value can be interpreted as evidence against the null hypothesis, whereas a large p-value tells you basically nothing.
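To see this concretely, here is a quick simulation sketch (written for illustration, with made-up GPD parameters): draw many data sets either from the hypothesized GPD or from a clearly different distribution and look at the resulting AD-test p-values.
using Distributions, HypothesisTests, Statistics

gpd = GeneralizedPareto(0.0, 0.5, -0.1)

# Data really generated from the null: p-values should look roughly uniform on (0, 1).
pvals_null = [pvalue(OneSampleADTest(rand(gpd, 1000), gpd)) for _ in 1:500]
# Data from a clearly different distribution: p-values pile up near 0.
pvals_alt = [pvalue(OneSampleADTest(rand(Exponential(1.0), 1000), gpd)) for _ in 1:500]

@show mean(pvals_null .< 0.05)   # roughly 0.05 under the null
@show mean(pvals_alt .< 0.05)    # close to 1 when the null is false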
@bertschi I am completely confused now. HypothesisTests.OneSampleADTest says:
As far as I have understood the concept of a p-value in the context of this test, a large p-value means "reject the hypothesis". Here the hypothesis is "the data come from the given distribution". If the hypothesis is correct, the p-value should be very small.
But you are right: what I observed regarding p-values between 0 and 1 is true for any distribution. Replacing my code snippet's distribution with a Normal gives exactly the same result: p-values fluctuating between 0 and 1.
Therefore, can you please educate me: how can I make a statement with, e.g., 95% confidence that the data at hand come from a given distribution? I used to think the way was "get the p-value of the Anderson-Darling test and compare it with 0.05", but that's incorrect.
Oh, I think I get it now. It is the other way around due to the formulation of the hypothesis. Right? Instead of checking if p < 0.05, I should be checking if p > 0.05!
From Wikipedia:
In null-hypothesis significance testing, the p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.
Here I want large p-values, since I want to find scenarios where the null hypothesis is satisfied.
Well, according to the common reading of frequentist statistics, you do not accept the null hypothesis but merely fail to reject it if the p-value is above your threshold. To me, null-hypothesis testing is logically not very convincing.
If you want evidence for a model/hypothesis, it's imho better to compare it against several explicitly specified alternative models, using likelihood-ratio tests, cross-validation, or Bayesian methods. Obviously, you can never prove, in a precise mathematical sense, that a model is true, as there might always be an alternative explanation that you just did not consider.
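As a toy illustration of that idea (a sketch with made-up candidate models, not a recipe for this particular application): compare the log-likelihood of the hypothesized GPD against an explicitly specified alternative that Distributions.jl can fit by maximum likelihood in closed form.
using Distributions

X = rand(GeneralizedPareto(0.0, 0.5, -0.1), 10_000)

gpd_candidate = GeneralizedPareto(0.0, 0.5, -0.1)   # hypothesized model
alt_candidate = fit_mle(Exponential, X)             # explicit alternative, fitted by MLE

# A higher log-likelihood means the candidate explains the data better.
@show loglikelihood(gpd_candidate, X)
@show loglikelihood(alt_candidate, X)
# With fitted parameters one should also penalize model complexity (e.g. via AIC)
# or use cross-validation, as suggested above.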
Yes, we are completely on the same page with everything you said! I was very confused overall, but now I am back on track!
Unfortunately, using other routes is not very easy, because there aren't any alternative models I could use for this particular application. Finding such models would be a research project in itself. But thank you for the suggestions, I will keep them in mind for future applications!