Hypothesis testing in Julia

I am taking a statistical data science course that is taught using R, and I am trying to replicate all the practical things in Julia. As we are doing quite basic stuff, mostly everything has been quite straightforward and similar to R, but I have been having some problems with hypothesis testing.

The package I am using is HypothesisTests.jl.

  1. I don’t exactly understand the implementation of ChisqTest in that package. I get that if I want the goodness-of-fit test I have to provide a vector x and theta0 and that works as expected. But for the contingency table test, it seems to only accept x as a matrix and not x and y (for example ChisqTest([50,100,50], [50,100,50])) as it would seem from the docs (link). The error message shows that the closest candidates are
ChisqTest(::AbstractVector{T}, ::AbstractVector{T}, ::Tuple{UnitRange{T}, UnitRange{T}})
ChisqTest(::AbstractVector{T}, ::AbstractVector{T}, ::T)

I don’t really understand where these come from or what values I should use there. I tried ChisqTest([50,100,50], [50,100,50], (1:3,1:3)) and ChisqTest([50,100,50], [50,100,50], 3), but both give an error “ArgumentError: at least one entry must be positive”, which confuses me. So I guess the main question here is, what is the argument y for? I thought I could use it for a contingency table test with two vectors, but I might be wrong.

  2. There is prop.test (link) for testing probabilities/proportions in R. RDocumentation doesn’t really explain what test is behind it. Googling has led me to believe that it is a z-test for proportions, which, if I understand correctly, isn’t available in the HypothesisTests.jl package. For the simpler case of comparing two proportions, I tried making vectors of ones and zeros with the correct proportions and then applied a two-sample z-test, but that didn’t yield the same result, so I don’t think that is the correct workaround. The R version reports a chi-squared value for the test, so I also tried ChisqTest with a contingency table. That gave a p-value much closer to the one from the R function, but not identical, and as far as I understand it is not the correct approach anyway. Any suggestions on how to replicate R’s prop.test in Julia would be appreciated.
  1. Regarding the chi-squared test, can you describe what your vectors are? The three-argument ChisqTest method expects x and y to be vectors with one value per independent observation, and it computes the contingency table from them. The third argument gives the possible values for x and y, but it only supports ranges (so the values must be contiguous) and it is undocumented. This would give, for example, ChisqTest([50,100,50], [50,100,50], (50:100, 50:100)). We should probably fix this by not requiring the third argument; feel free to file an issue on GitHub.
    If you already have the counts, then you can put them in a matrix, like ChisqTest([x y]). AFAICT this is the same in R, isn’t it?

  2. prop.test in R is a Binomial test with the Wilson approximation. R also supports the Clopper–Pearson variant via binom.test. Both are supported via BinomialTest in HypothesisTests, see ?confint for details about supported variants.
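
To make both points concrete, here is a minimal sketch, assuming HypothesisTests.jl is installed. The count vectors are the ones from the question; the 45 successes out of 100 trials are made-up numbers used only for illustration.

```julia
using HypothesisTests

# Counts for two groups over the same three categories (values from the question).
x = [50, 100, 50]
y = [50, 100, 50]

# Place the count vectors side by side to get a 3×2 contingency table,
# then run the chi-squared test of independence on the matrix.
table = [x y]
chisq = ChisqTest(table)
println(pvalue(chisq))

# One-sample proportion test: 45 successes in 100 trials against p = 0.5.
# These numbers are invented for the example.
bt = BinomialTest(45, 100, 0.5)
println(pvalue(bt))

# Confidence intervals for the proportion under two of the supported methods.
wilson = confint(bt; method = :wilson)
clopper = confint(bt; method = :clopper_pearson)
println(wilson)
println(clopper)
```

Note that pvalue(bt) is the exact binomial p-value, so it will not necessarily match the p-value printed by prop.test even when the Wilson interval is close.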

Thank you for the reply!

  1. The example I gave was indeed meant to be the counts in vectors. And constructing a matrix from the vectors works. As I saw from the docs that it was possible to use two vectors, I thought I could do so with counts, but apparently misinterpreted the docs. Nevertheless, I now also tried the three-argument method but found that it expects the vectors to be of the same length. It should be possible to compute a contingency table from samples of different sizes, right? This seemed strange, but maybe I am missing something.

  2. Thanks for the lead on the Wilson approximation. I forgot to mention that I was thinking of using prop.test to compare two samples, that is, two proportions. I am not sure how to do that with BinomialTest, or whether it is possible at all. After some more googling I found that I could replicate the R result without the continuity correction by calculating the z-statistic described here: https://online.stat.psu.edu/stat800/lesson/5/5 and then computing the two-tailed p-value from it. Is there a way to run a hypothesis test for two sample proportions in HypothesisTests.jl?
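
For reference, the manual calculation mentioned in point 2 can be sketched roughly as follows. The function two_proportion_ztest is a hypothetical helper written for this post, not part of HypothesisTests.jl, and the counts (15 of 50 versus 25 of 50) are made up. It assumes Distributions.jl is installed alongside HypothesisTests.jl. The squared pooled z-statistic equals the Pearson chi-squared statistic of the 2×2 table, so the p-value should agree with ChisqTest on the matrix of successes and failures. As far as I can tell, the remaining gap to R comes from prop.test applying a continuity correction by default (correct = TRUE).

```julia
using Distributions
using HypothesisTests

# Hypothetical helper, not part of HypothesisTests.jl:
# pooled two-proportion z-test without continuity correction.
function two_proportion_ztest(x1, n1, x2, n2)
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                       # pooled success proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2)) # standard error under H0
    z = (p1 - p2) / se
    p = 2 * ccdf(Normal(), abs(z))                       # two-tailed p-value
    return (z = z, pvalue = p)
end

# Made-up counts: 15 successes out of 50 versus 25 successes out of 50.
res = two_proportion_ztest(15, 50, 25, 50)

# The same data as a 2×2 table: successes in row 1, failures in row 2.
chisq = ChisqTest([15 25; 35 25])

println(res)
println(pvalue(chisq))
```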

By definition a contingency table crosses values of variables from a single sample. Maybe you can reformulate your data as a single variable crossed with a variable indicating which sample each value comes from?
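
A rough sketch of that reformulation, with made-up data and assuming the categories are coded as the integers 1:3: stack the two samples into one long vector and add a second vector recording which sample each value came from. The two vectors then have the same length even though the samples do not. The three-argument call at the end relies on the undocumented method discussed above, so treat it as something that may change.

```julia
using HypothesisTests

# Two samples of different sizes, coded as integer categories 1:3 (made-up data).
sample_a = [1, 2, 2, 3, 2, 1, 2, 3, 3, 2]
sample_b = [2, 2, 1, 3, 2, 2]

# One long variable, plus an indicator of the sample each value belongs to.
category = vcat(sample_a, sample_b)
group = vcat(fill(1, length(sample_a)), fill(2, length(sample_b)))

# Build the 3×2 contingency table by hand: rows are categories, columns are samples.
table = [count((category .== c) .& (group .== g)) for c in 1:3, g in 1:2]
test_from_table = ChisqTest(table)

# Same test via the undocumented three-argument method; the third argument
# lists the contiguous ranges of values that x and y can take.
test_from_obs = ChisqTest(category, group, (1:3, 1:2))

println(table)
println(pvalue(test_from_table))
println(pvalue(test_from_obs))
```

With counts this small the chi-squared approximation is poor, so the numbers are only there to show the mechanics.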

Unfortunately we don’t support two sample binomial tests for now, though there’s a PR open for that. If you don’t want to use the code from that PR, you could use Fisher’s exact test instead.
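
As an illustration of the Fisher alternative, here is a minimal sketch with made-up counts (15 successes out of 50 in one sample, 25 out of 50 in the other). FisherExactTest takes the four cells of the 2×2 table as a, b, c, d, laid out as [a b; c d].

```julia
using HypothesisTests

# Made-up counts for two samples.
successes_1, n_1 = 15, 50
successes_2, n_2 = 25, 50

# Cells of the 2×2 table [a b; c d]:
# successes in the first row, failures in the second, one column per sample.
fisher = FisherExactTest(successes_1, successes_2,
                         n_1 - successes_1, n_2 - successes_2)

p_fisher = pvalue(fisher)
println(p_fisher)
```

Being an exact test, it will generally give a somewhat different p-value than prop.test in R.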

I am not that familiar with the chi-squared test besides some practical examples, so I was just toying around with it. Thank you for the answers!