Kolmogorov-Smirnov test

I have a sample consisting of a set of integers and want to test if can be from a specific distribution. I am not that strong in statistics, but my understanding is that the Kolmogorov-Smirnov test is a good choice for assessing this, so I tried the code below:

using HypothesisTests, Distributions
n = 50
x = 0.5 .^ (1:n)
d = DiscreteNonParametric(1:n, x ./ sum(x))
sample = vcat(ones(Int64,16), 2*ones(Int64,8), 3*ones(Int64,4), 4,4,5)
display(ExactOneSampleKSTest(sample, d))

The problem is that this rejects the hypothesis, even though the sample seems to fit the distribution almost perfectly. Can anyone explain why this happens? The output is given below:

β”Œ Warning: This test is inaccurate with ties
β”” @ HypothesisTests ~/.julia/packages/HypothesisTests/BgrVj/src/kolmogorov_smirnov.jl:68
Exact one sample Kolmogorov-Smirnov test
----------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.5

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-06

Details:
    number of observations:   31
2 Likes

It seems to me like the default tail value for pvalue(t::ExactOneSampleKSTest, tail=:both) is wrong. I think it should use tail = :right, since we’re interested in the probability of the KS statistic being less than the value we have.

If we do pvalue(ExactOneSampleKSTest(sample, d), tail=:right), we get p = 0.92, which seems more reasonable. To be honest, I’m a little confused by the implementation of this in HypothesisTests.

1 Like

Thank you, you certainly seem to know much more about this than me, so I think I will file an issue with the developers and see if I can get some clarity on whether this is a bug or just a misunderstanding.

1 Like

When you file the issue can you post a link here? People might be interested in the discussion, but it is definitely to discuss it there.

1 Like

In case you want to double check the results, cell 11 or so in these lecture notes has a simple code for that: FinancialEconometrics/Ch19_TestingDistributions.ipynb at master Β· PaulSoderlind/FinancialEconometrics Β· GitHub

2 Likes

Looking at this more closely, I don’t think it’s a problem with the choice of tails, but rather in the computation of the KS statistic.

See in comparison to the implementation in scipy.stats

HypothesisTests.jl:

julia> print(x)
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5]
julia> print(y)
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 5, 8]
julia> t = ApproximateTwoSampleKSTest(x, y)
β”Œ Warning: This test is inaccurate with ties
β”” @ HypothesisTests ~/.julia/packages/HypothesisTests/BgrVj/src/kolmogorov_smirnov.jl:167
Approximate two sample Kolmogorov-Smirnov test
----------------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.516129

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0005

Details:
    number of observations:   [31,31]
    KS-statistic:              2.032002032003047

vs scipy stats

In [13]: stats.kstest(x, y, alternative="two-sided", method="asymp")
Out[13]: KstestResult(statistic=0.032258064516129115, pvalue=1.0)

In [14]: stats.kstest(x, y, alternative="greater", method="asymp")
Out[14]: KstestResult(statistic=0.032258064516129115, pvalue=0.9375209928337668)

In [15]: stats.kstest(x, y, alternative="less", method="asymp")
Out[15]: KstestResult(statistic=0.032258064516129004, pvalue=0.9375209928337671)
1 Like

I have now filed the issue with the developers ExactOneSampleKSTest rejects hypothesis incorrectly Β· Issue #280 Β· JuliaStats/HypothesisTests.jl Β· GitHub

Thanks to all that participated in this thread.

SPSS28 test for this data

image

NPTESTS 
  /INDEPENDENT TEST (V) GROUP (G) KOLMOGOROV_SMIRNOV 
  /MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE 
  /CRITERIA ALPHA=0.05  CILEVEL=95.

I think this is the case to pay attention to the warning. The calculation of KS statistic involves sorting and because of stability of sort (i.e. preserve order of identical elements) problems can get exacerbated to extreme.

For example:

julia> pvalue(ApproximateTwoSampleKSTest(sample, sample), tail=:both)
β”Œ Warning: This test is inaccurate with ties
β”” @ HypothesisTests ~/.julia/packages/HypothesisTests/BgrVj/src/kolmogorov_smirnov.jl:167
0.000518320212633612    # rejects strongly

julia> pvalue(ApproximateTwoSampleKSTest(
  sample.+rand(length(sample))/100, 
  sample.+rand(length(sample))/100), tail=:both)

0.8148352946917117     # "nothing to see here"

julia> histogram([pvalue(
  ApproximateTwoSampleKSTest(sample.+rand(length(sample))/100,
  sample.+rand(length(sample))/100), tail=:both) for i in 1:1000])  
  # uses UnicodePlots
              β”Œ                                        ┐ 
   [0.0, 0.1) β”€β–Ž 3                                       
   [0.1, 0.2) ─▋ 11                                      
   [0.2, 0.3) β”€β–ˆβ– 21                                     
   [0.3, 0.4) ─  0                                       
   [0.4, 0.5) β”€β–ˆβ–ˆβ–ˆβ–‹ 55                                   
   [0.5, 0.6) ─  0                                       
   [0.6, 0.7) β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– 134                            
   [0.7, 0.8) ─  0                                       
   [0.8, 0.9) β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– 257                    
   [0.9, 1.0) β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  519  
              β””                                        β”˜ 
                               Frequency                 

julia> count([pvalue(
  ApproximateTwoSampleKSTest(
  sample.+rand(length(sample))/100, 
  sample.+rand(length(sample))/100), tail=:both) for i in 1:1000] .< 0.05)
0

It is possible to mitigate this tie-breaking problem in the HypothesisTests package (optionally, and in a way that allows reproducability of tie-breaking). But the results should be okay with a little jitter on the caller side (size of jitter dependent on underlying sample and distrubution).

Note:

  1. sample is [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5], and jitter is small compared to samples.

  2. The distribution of p-values is a bit non-uniform, which merits a bit of investigating.

I vaguely recall that the KS test assumes no ties. Is that perhaps the problem here? Using a stable sort would essentially have the effect of breaking all ties by order of appearance, I’m not sure if that’s sufficient or necessary to fix this.

I tried to add jitter to my example and it removed the warning, but the result was the same.

I am beginning to suspect that the implementation in ExactOneSampleKSTest does not work for discrete distributions. According to The Kolmogorov-Smirnov Goodness-of-fit Test - YouTube the Kolmogorov-Smirnov test can be used on discrete distributions, but I would not expect that code made for continuous distributions would give reliable results for discrete distributions, regardless of whether jitter is added to the sample(s) or not.

Yes, the Kolmogorov Smirnov test uses the continuity of the distribution. For discrete distributions it is more conservative, so not clear why it rejects, can you plot the empirical CDF versus the theoretical CDF?

I would expect that it rejects because it bases its decision on the x value that has the largest difference between the two CDFs and since the theoretical CDF is discontinuous and e.g. jumps from 0 to 0.5 for x=1, then the best case scenario for the largest difference would be ~0.25 which it apparently feels is enough to reject.

They should work with discrete distributions but I think the implementations were written with the assumption of no ties, and so the tie case is not handled gracefully.

The jitter solution is good for the two-sample case (as shown in my previous response). The one-sample and two-sample cases are implemented separately and need different β€œfixes”.

Adding below an initial β€œfix” implementation, which you can test. Tried to pay attention to ties in generalizing the original implementation.

using HypothesisTests, Distributions, Random, StatsBase

import HypothesisTests: ksstats

#### UPDATE: This function isn't correct. Don't use.
# used in one-sided test
# function ksstats(x::AbstractVector{T}, d::UnivariateDistribution) where T<:Real
#     n = length(x)
#     cdfs = cdf.(Ref(d), sort(x))
#     Ξ΄p = maximum(competerank(x) ./ n .- cdfs)  # fixed ranks from 1:n
#     Ξ΄n = -minimum((n .- competerank(x, lt=(>))) ./ n .- cdfs)
#     Ξ΄ = max(Ξ΄n, Ξ΄p)
#     (n, Ξ΄, Ξ΄p, Ξ΄n)
# end

# used in two-sided test
function ksstats(x::AbstractVector{T}, y::AbstractVector{S}) where {T<:Real, S<:Real}
    n_x, n_y = length(x), length(y)
    n = n_x + n_y
    randperm = shuffle(1:n)   # add a random perm to random tie-break
    sort_idx = sortperm([x; y][randperm])
    pdf_diffs = [ones(n_x)/n_x; -ones(n_y)/n_y][randperm][sort_idx]
    cdf_diffs = cumsum(pdf_diffs)
    Ξ΄p = maximum(cdf_diffs)
    Ξ΄n = -minimum(cdf_diffs)
    Ξ΄ = max(Ξ΄p, Ξ΄n)
    (n_x, n_y, Ξ΄, Ξ΄p, Ξ΄n)
end

With these overwritten ksstats function, the original code should/might work.

Thank you, I ran this on the example in the original post and it β€œfailed to reject h_0”, but it now seems to do this even when I would expect it to reject:

julia> ExactOneSampleKSTest(fill(31,9999),d)
β”Œ Warning: This test is inaccurate with ties
β”” @ HypothesisTests ~/.julia/packages/HypothesisTests/BgrVj/src/kolmogorov_smirnov.jl:68
Exact one sample Kolmogorov-Smirnov test
----------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.00010001

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           1.0000

Details:
    number of observations:   9999


julia> 

using the same d as in the original post.

Oops. Need to have a look. In the meantime, the two-sample test is a little simpler and so probably correct now. It can be used as a stop gap:

julia> ApproximateTwoSampleKSTest(fill(1,9999), rand(d,9999))
β”Œ Warning: This test is inaccurate with ties
β”” @ HypothesisTests ~/.julia/packages/HypothesisTests/BgrVj/src/kolmogorov_smirnov.jl:167
Approximate two sample Kolmogorov-Smirnov test
----------------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.50155

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations:   [9999,9999]
    KS-statistic:              35.46317827975053

1 Like

It seems K-S test is not really suited for discrete distributions as-is, but has been extended to support them. The improved test is Cramer-von-Mises test. A good reference (I haven’t had time to read): http://www.stat.yale.edu/~jay/EmersonMaterials/DiscreteGOF.pdf.

Perhaps a Chi-square Goodness-of-fit test is worth exploring for this problem.

2 Likes

Continuing from last post, here is a Chi-squared test for the discussed problem:

onesampleGOF(sample, d::DiscreteNonParametric) = begin
    cm = [get(countmap(sample),k,0) for k in d.support]
    sum(values(cm)) < length(sample) && error("Sample has elements outside support")
    return ChisqTest(cm, d.p)
end

and then it is possible to:

julia> pvalue(onesampleGOF(2 .* sample, d))
7.485816224539387e-10
julia> pvalue(onesampleGOF(sample, d))
1.0
julia> pvalue(onesampleGOF(fill(31,10), d))
0.0
julia> pvalue(onesampleGOF(rand(d,100), d))
0.9999999116094588
julia> pvalue(onesampleGOF(rand(1:51,10), d))
0.0

Thank you, so this is how to do a Chi Squared test with HypothesisTests, I actually ended up making my own basic implementation.