Hypothesis testing in Julia

I am taking a statistical data science course that is taught using R, and I am trying to replicate all the practical exercises in Julia. As we are doing quite basic stuff, almost everything has been straightforward and similar to R, but I have been having some problems with hypothesis testing.

The package I am using is HypothesisTests.jl.

  1. I don’t exactly understand the implementation of ChisqTest in that package. I get that if I want the goodness-of-fit test I have to provide a vector x and theta0 and that works as expected. But for the contingency table test, it seems to only accept x as a matrix and not x and y (for example ChisqTest([50,100,50], [50,100,50])) as it would seem from the docs (link). The error message shows that the closest candidates are
ChisqTest(::AbstractVector{T}, ::AbstractVector{T}, ::Tuple{UnitRange{T}, UnitRange{T}})
ChisqTest(::AbstractVector{T}, ::AbstractVector{T}, ::T)

I don’t really understand where these come from or what values I should use there. I tried ChisqTest([50,100,50], [50,100,50], (1:3,1:3)) and ChisqTest([50,100,50], [50,100,50], 3), but both give an error “ArgumentError: at least one entry must be positive”, which confuses me. So I guess the main question here is, what is the argument y for? I thought I could use it for a contingency table test with two vectors, but I might be wrong.

  2. There is prop.test (link) for testing probabilities/proportions in R. RDocumentation doesn’t really explain which test is behind it. Googling has led me to believe that it is a z-test for proportions, which, if I understand correctly, isn’t available in the HypothesisTests.jl package. For the simpler case of comparing two proportions, I tried making vectors of ones and zeros with the correct proportions and then applying a two-sample z-test, but that didn’t yield the same result, so I don’t think that is the correct workaround. The R version reports a chi-squared value for the test, so I also tried ChisqTest with a contingency table. That gave a p-value much closer to the R function’s, but not identical, and as far as I understand it is not the correct approach either. Any suggestions on how to replicate R’s prop.test in Julia would be appreciated.
  1. Regarding the chi-squared test, can you describe what your vectors are? The three-argument ChisqTest method expects x and y to be vectors with one value per observation, and it computes the contingency table from them. The third argument gives the possible values for x and y, but it only supports ranges (so values must be contiguous) and it is undocumented. This would give for example ChisqTest([50,100,50], [50,100,50], (50:100, 50:100)). We should probably fix this by not requiring the third argument; feel free to file an issue on GitHub.
    If you already have the counts, then you can put them in a matrix, like ChisqTest([x y]). AFAICT this is the same in R, isn’t it?

  2. prop.test in R is a Binomial test with the Wilson approximation. R also supports the Clopper–Pearson variant via binom.test. Both are supported via BinomialTest in HypothesisTests, see ?confint for details about supported variants.
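
To make both points concrete, here is a minimal sketch, assuming HypothesisTests.jl is installed. All counts are made up for illustration. Note that only the interval uses the Wilson method here; pvalue on a BinomialTest is the exact binomial p-value. Also, R's prop.test applies a continuity correction by default, so its numbers may differ slightly.

```julia
using HypothesisTests

# Hypothetical counts per category for two groups (not from the thread).
x = [50, 100, 50]
y = [60, 90, 50]

# [x y] is a 3×2 matrix of counts, which ChisqTest treats as a contingency table.
chisq = ChisqTest([x y])
println("chi-squared p-value: ", pvalue(chisq))

# One-sample proportion: hypothetical 42 successes in 100 trials, H0: p = 0.5.
binom = BinomialTest(42, 100, 0.5)
println("exact binomial p-value: ", pvalue(binom))

# Wilson score interval for the proportion (the default method is Clopper-Pearson).
println("Wilson interval: ", confint(binom; method = :wilson))
```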

Thank you for the reply!

  1. The example I gave was indeed meant to be the counts in vectors. And constructing a matrix from the vectors works. As I saw from the docs that it was possible to use two vectors, I thought I could do so with counts, but apparently misinterpreted the docs. Nevertheless, I now also tried the three-argument method but found that it expects the vectors to be of the same length. It should be possible to compute a contingency table from samples of different sizes, right? This seemed strange, but maybe I am missing something.

  2. Thanks for the lead on the Wilson approximation. I forgot to mention that I was thinking of using prop.test to compare two samples, i.e. two proportions. I am not sure how to do that with BinomialTest, if it is possible at all. After some more googling I found that I could replicate the result from R without the continuity correction by calculating the z-statistic described here: https://online.stat.psu.edu/stat800/lesson/5/5, and then computing the two-tailed p-value from it. Is it possible to run a hypothesis test for two sample proportions with the package?
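
For reference, the manual calculation I mean looks roughly like this. It is a sketch of the usual pooled two-proportion z-test without continuity correction; the function name two_proportion_ztest and the data are my own, and it assumes Distributions.jl is available for the normal tail probability.

```julia
using Distributions

# Pooled two-proportion z-test without continuity correction.
# x1, x2 are success counts; n1, n2 are sample sizes.
# (Hypothetical helper, not part of HypothesisTests.jl.)
function two_proportion_ztest(x1, n1, x2, n2)
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # common proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p = 2 * ccdf(Normal(), abs(z))          # two-tailed p-value
    return (z = z, pvalue = p)
end

# Hypothetical data: 45 successes out of 100 vs 60 successes out of 110.
res = two_proportion_ztest(45, 100, 60, 110)
println(res)
```

The square of this z equals the Pearson chi-squared statistic of the 2×2 table without correction, which is why it lines up with prop.test(..., correct = FALSE) in R.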

By definition a contingency table crosses values of variables from a single sample. Maybe you can reformulate your data as a single variable crossed with a variable indicating which sample each value comes from?
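
A minimal sketch of that reformulation, using the undocumented three-argument method mentioned earlier. The sample values are hypothetical category codes, and the samples are far too small for the chi-squared approximation to be trustworthy; the point is only the data layout.

```julia
using HypothesisTests

# Hypothetical category codes (1:3) observed in two samples of different sizes.
sample_a = [1, 2, 2, 3, 2, 1, 3, 2]
sample_b = [2, 3, 3, 1, 2]

# Stack the observations and add an indicator for the sample each one came from.
category = vcat(sample_a, sample_b)
group = vcat(fill(1, length(sample_a)), fill(2, length(sample_b)))

# Both vectors now have one entry per observation, so their lengths match.
# The tuple lists the possible values of each variable as ranges.
test = ChisqTest(category, group, (1:3, 1:2))
println(pvalue(test))
```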

Unfortunately we don’t support two sample binomial tests for now, though there’s a PR open for that. If you don’t want to use the code from that PR, you could use Fisher’s exact test instead.
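
For the Fisher route, a minimal sketch with made-up counts could look like the following. FisherExactTest takes the four cells of the 2×2 table as a, b, c, d (first row, then second row).

```julia
using HypothesisTests

# 2×2 table from hypothetical data: 45/100 successes vs 60/110 successes.
#            sample 1   sample 2
# success      a = 45     b = 60
# failure      c = 55     d = 50
fisher = FisherExactTest(45, 60, 55, 50)
println(pvalue(fisher))
```

This is an exact test on the odds ratio, so its p-value will generally not match the chi-squared based one from prop.test.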

I am not that familiar with the chi-squared test besides some practical examples, so I was just toying around with it. Thank you for the answers!