Testing if data is Gamma Distributed


#1

I have a (large) array of data points. When I plot a histogram, it looks somewhat Gamma distributed. What exactly would be the procedure for testing whether this is the case, using Julia?

I think I could fit the data to a Gamma distribution using fit_mle and then do a hypothesis test of whether the data come from that distribution using ExactOneSampleKSTest. However, if I do this, do I somehow need to take into account that the distribution I am testing against has itself been fitted to the same data?

Now I tried something like:

using Distributions, HypothesisTests

gd = fit_mle(Gamma, data)
ExactOneSampleKSTest(data, gd)

which gives the output

WARNING: This test is inaccurate with ties
WARNING: cdf(d::UnivariateDistribution, X::AbstractArray) is deprecated, use cdf.(d, X) instead.
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:70
 [2] cdf(::Distributions.Gamma{Float64}, ::Array{Float64,1}) at ./deprecated.jl:57
 [3] ksstats(::Array{Float64,1}, ::Distributions.Gamma{Float64}) at /home/...
 [4] HypothesisTests.ExactOneSampleKSTest(::Array{Float64,1}, ::Distributions.Gamma{Float64}) at /home/...
 [5] include_string(::String, ::String) at ./loading.jl:522
 [6] execute_request(::ZMQ.Socket, ::IJulia.Msg) at /home/...
 [7] (::Compat.#inner#17{Array{Any,1},IJulia.#execute_request,Tuple{ZMQ.Socket,IJulia.Msg}})() at /home/...
 [8] eventloop(::ZMQ.Socket) at /home/...
 [9] (::IJulia.##14#17)() at ./task.jl:335
while loading In[17], in expression starting on line 2

Exact one sample Kolmogorov-Smirnov test
----------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.029463211572353598

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.7666836151907656

Details:
    number of observations:   500

(I am able to get a lot more observations than 500; I just wanted to make a quick example here.)
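As for the question of accounting for the fitted parameters: one standard way to do this is a parametric bootstrap of the KS statistic (a Lilliefors-type test), where the Gamma is refitted on each simulated sample. A rough sketch, assuming Distributions and HypothesisTests are installed (the function name and defaults are made up for illustration):

```julia
using Distributions, HypothesisTests

# Parametric bootstrap of the KS statistic: refit the Gamma on each simulated
# sample, so the null distribution of the statistic reflects the estimation step.
function bootstrap_ks_pvalue(data; nboot = 1000)
    gd = fit_mle(Gamma, data)
    ks_obs = ExactOneSampleKSTest(data, gd).δ        # observed KS statistic
    n = length(data)
    count_ge = 0
    for _ in 1:nboot
        sim = rand(gd, n)                            # simulate from the fitted Gamma
        gd_sim = fit_mle(Gamma, sim)                 # refit on the simulated sample
        count_ge += ExactOneSampleKSTest(sim, gd_sim).δ ≥ ks_obs
    end
    return (count_ge + 1) / (nboot + 1)              # bootstrap p-value
end
```

The p-value from this bootstrap is typically quite different from the one ExactOneSampleKSTest reports against a fully specified distribution, which is exactly the point of the question above.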


#2

I’d do a qq plot. https://en.wikipedia.org/wiki/Q–Q_plot
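A minimal sketch of the QQ-plot idea without a plotting dependency (only Distributions is assumed; data is a placeholder for your observations):

```julia
using Distributions, Statistics

# Quantile–quantile comparison: pair sorted observations with the quantiles
# of the fitted Gamma. Near the line y = x means a good fit.
data = rand(Gamma(2.0, 3.0), 500)            # placeholder data for illustration
gd = fit_mle(Gamma, data)

ps = (1:length(data)) ./ (length(data) + 1)  # plotting positions
empirical = sort(data)
theoretical = quantile.(gd, ps)
# scatter(theoretical, empirical) with any plotting package shows the QQ plot.
```

If StatsPlots is available, qqplot(gd, data) produces the same picture in one line.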


#3

Except for some special cases (mostly in physics), all models are “wrong”, so if this is real data, it is very unlikely to follow one of the frequently used distributions exactly.

You need to make a methodological choice here before coding the exercise in Julia, eg whether you want to use classical hypothesis testing (which will almost certainly “reject” given enough data, but will not tell you in what way the fit is bad), Bayesian p-values for various features of the distribution that you care about (eg tail probabilities), or something else.


#4

Construct a chi squared variable or do a qq plot.


#5

One relatively generic way to do this is to use a goodness-of-fit test (e.g. a Pearson’s chi-squared test) on “binned” or “quantized” data with say, m bins. This is what I do for the tests in

one of which is for the Gamma distribution.

Since your post suggests that you want to test whether the data are consistent with realizations of i.i.d. Gamma random variables, but without specifying the parameters, an additional issue is that you must choose which parameter values to use when conducting the test. One can use the ML estimate, but the limiting distribution of the chi-squared test statistic is then bounded between a χ²(m-1-2) = χ²(m-3) and a χ²(m-1) random variable, so you can only get lower and upper bounds on the (asymptotic) p-value.

The 2 in “m-1-2” comes from the fact that there are 2 parameters for a Gamma distribution.
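The binned chi-squared test described above can be sketched as follows, using equiprobable bins under the fitted Gamma (one of several reasonable binning choices; the function name is made up for illustration, and only Distributions is assumed):

```julia
using Distributions

# Pearson chi-squared goodness-of-fit against an ML-fitted Gamma, returning
# the statistic plus the Chernoff–Lehmann lower/upper bounds on the
# asymptotic p-value (χ²(m-3) vs χ²(m-1) limiting distributions).
function gamma_chisq_bounds(data; m = 10)
    gd = fit_mle(Gamma, data)
    n = length(data)
    edges = quantile.(gd, (1:m-1) ./ m)        # interior bin edges, equiprobable bins
    obs = zeros(Int, m)
    for x in data
        obs[searchsortedfirst(edges, x)] += 1  # bin index in 1:m
    end
    expected = n / m                           # same expected count in every bin
    stat = sum((obs .- expected) .^ 2 ./ expected)
    p_lower = ccdf(Chisq(m - 3), stat)         # p-value bound under χ²(m-3)
    p_upper = ccdf(Chisq(m - 1), stat)         # p-value bound under χ²(m-1)
    return stat, p_lower, p_upper
end
```

If even the upper bound is small, one can reject; if even the lower bound is large, one cannot; in between, the test with ML parameters is inconclusive.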

Chernoff, H., & Lehmann, E. L. (1954). The use of maximum likelihood estimates in χ2 tests for goodness of fit. The Annals of Mathematical Statistics.


#6

In my experience as a data scientist I find that this is the only way to do this sort of thing that has any hope of having a reasonably understandable interpretation (though I don’t consider myself an expert in statistics by any stretch). Though, in the data science context the assumptions that go into such a test usually seem so dubious that you’d have a hard time convincing me that the resulting p-value means anything. In most cases it’s probably better to try to break down your assumption into simpler features rather than testing against the entire distribution, as @Tamas_Papp hinted. Of course, I have no idea what you’re using this for, if physics you can pretty much ignore everything I just said, though the fact that you didn’t just do a maximum likelihood fit and report statistical uncertainties in the standard way led me to believe that this is not physics.

By the way, this and this are still my go-to references on probability and statistics respectively. They are geared toward physicists, but should be easy enough for everyone to understand and are basically self-contained descriptions of the entire topics (though I know statisticians would disagree :wink:).


#7

Indeed, goodness-of-fit tests are much easier to interpret than many other statistical tests. In this case, the test statistic measures discrepancies between the expected number of values falling into each bin and the actual numbers falling into each bin. The larger the test statistic, the more implausible it is that the data is a realization of i.i.d. Gamma random variables. The p-value is the probability of seeing as large a test statistic as one has seen, if one calculated the test statistic using i.i.d. Gamma random variables with the most favourable parameters. Whether or not the p-value is actually meaningful depends on the questions one wants to answer.

It is not completely implausible for real processes to generate Gamma distributed data, but indeed if there is a discrepancy then very large samples will typically result in small p-values [as they should!]. In this case, the individual components of the chi-squared test statistic will tell you which bins have more or less than their expected number, and some people like to visualize this using a “hanging chi-gram”. Finally, for moderate data sizes it is often part of the art of statistics / data science to identify appropriate parametric models for the data, and this can lead to meaningful inference / decisions.

A very good introductory book on statistics is Larry Wasserman’s generously, but not inappropriately titled All of Statistics.


#8

Seems my problem was deeper than I thought (which is so often the case). Thanks everyone, this is a lot of useful input. I will have to retire to my chamber and have a good think about this!