[ANN] Pingouin.jl: a simple yet exhaustive statistical package

Clement_POIRET · October 14, 2020, 12:41pm

Hi everyone,

I have always wanted to learn Julia, and I never found a satisfying Julia library for statistical tests. I used Pingouin, a Python stats library by Raphael Vallat, and always wished for an equivalent package in Julia. So here is my version, written entirely in Julia (a pre-release with only a limited set of features).

As of now, Pingouin.jl 0.1.0 (GitHub - clementpoiret/Pingouin.jl: Reimplementation of Raphaelvallat's Pingouin in Julia) supports distribution-related functions such as:

Anderson-Darling test of distribution,
Geometric standard (Z) score,
Levene & Bartlett tests for homoscedasticity,
Shapiro-Wilk, Shapiro-Francia and Jarque Bera tests of normality,
Mauchly and JNS tests for sphericity,
Epsilon adjustment factor for repeated measures (i.e. Greenhouse-Geisser, Huynh-Feldt, lower bound).
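As an illustration of one item in the list above, here is a minimal sketch of a geometric z-score in plain Julia. It assumes the usual definition (standardize the logarithms of the data, using the sample standard deviation); `gzscore_sketch` is a hypothetical name and not necessarily how Pingouin.jl implements it.

```julia
using Statistics

# Geometric z-score: standardize the logarithms of strictly positive data.
# `gzscore_sketch` is a hypothetical name, not the Pingouin.jl function.
function gzscore_sketch(x::AbstractVector{<:Real})
    all(>(0), x) || throw(DomainError(x, "geometric z-scores need strictly positive values"))
    logx = log.(x)
    # mean(logx) is the log of the geometric mean;
    # std(logx) is the log of the geometric standard deviation (n - 1 denominator).
    return (logx .- mean(logx)) ./ std(logx)
end

gz = gzscore_sketch([1.0, 2.0, 4.0, 8.0, 16.0])
```

Because the input doubles at each step, the logs are evenly spaced and the resulting scores are symmetric around zero.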

It also supports effect sizes-related functions:

Effect sizes between two Arrays:

Unbiased Cohen d,
Hedges g,
Glass delta,
Correlation coefficient (Pearson),
Eta-square,
Odds ratio,
Area Under the Curve,
Common Language Effect Size.
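To make the first two entries concrete, a minimal sketch of Cohen's d and Hedges' g for two independent samples follows. It assumes the pooled-standard-deviation form of d and the common approximate small-sample correction for g; the function names are hypothetical and are not the Pingouin.jl API.

```julia
using Statistics

# Cohen's d for two independent samples, using the pooled standard deviation.
function cohen_d_sketch(x, y)
    nx, ny = length(x), length(y)
    pooled_var = ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)
end

# Hedges' g: Cohen's d times the approximate small-sample correction factor.
function hedges_g_sketch(x, y)
    n = length(x) + length(y)
    return cohen_d_sketch(x, y) * (1 - 3 / (4n - 9))
end

x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 5.0, 7.0]
d = cohen_d_sketch(x, y)
g = hedges_g_sketch(x, y)
```

With only eight observations the correction factor is 20/23, so g is noticeably smaller than d; the two converge as the samples grow.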

The conversion of Pearson’s r and Cohen’s d to:

Unbiased Cohen d,
Hedges g,
Eta-square,
Odds ratio,
Area Under the Curve.
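The conversions rest on closed-form relations between effect sizes. As a sketch, the textbook relation between a correlation r and Cohen's d, assuming two groups of equal size, can be written as below; the function names are hypothetical.

```julia
# Convert a correlation r to Cohen's d and back.
# Both formulas assume two groups of equal size.
r_to_d_sketch(r) = 2r / sqrt(1 - r^2)
d_to_r_sketch(d) = d / sqrt(d^2 + 4)

d = r_to_d_sketch(0.6)
r = d_to_r_sketch(d)
```

The two functions are inverses of each other, so a round trip returns the starting value up to floating-point error.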

But also the computation of effect sizes from T-values, parametric confidence intervals around a Cohen d or a correlation coefficient, and bootstrapped confidence intervals of univariate and bivariate functions.
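For readers unfamiliar with bootstrapped confidence intervals, here is a minimal sketch of the plain percentile method for a univariate statistic. This is an assumption made for illustration: the package may use a different (for example bias-corrected) interval, and `bootstrap_ci_sketch` with its keyword names is hypothetical.

```julia
using Random, Statistics

# Percentile bootstrap: resample x with replacement, recompute the statistic f,
# and take the empirical quantiles of the resampled statistics.
function bootstrap_ci_sketch(f, x; n_boot=2000, level=0.95, rng=Random.default_rng())
    n = length(x)
    stats = [f(x[rand(rng, 1:n, n)]) for _ in 1:n_boot]
    alpha = 1 - level
    return quantile(stats, [alpha / 2, 1 - alpha / 2])
end

rng = MersenneTwister(42)
x = randn(rng, 200) .+ 1.0
lo, hi = bootstrap_ci_sketch(mean, x; rng=rng)
```

Passing the statistic as a function keeps the helper generic: `median`, `std` or a user-defined function can be swapped in for `mean`.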

The main goal is to provide a really simple API for both simple and advanced statistics. Version 0.1.0 will soon be published to the default Julia package registry.

This is my first real project in Julia, so I really hope you’ll like it. I’m a newbie, so suggestions, contributions and remarks of any kind are welcome; I want to improve my Julia skills.

The next release will include paired and unpaired non-parametric tests such as Mann-Whitney U, Wilcoxon Signed Rank, or Friedman.

nilshg · October 14, 2020, 3:41pm

Have you seen HypothesisTests.jl? I haven’t looked at your package in detail so I might be speaking out of turn here, but it seems to me that your work is closely related and might be worth contributing to the established package to prevent fragmentation?

Albert_Zevelev · October 14, 2020, 4:23pm

@nilshg raises a great point.
You’ve already filled a few gaps in Goodness-of-fit tests for the Julia ecosystem.

It would be convenient if these tests were all in one place.

Suppose a Julia user wants to test a hypothesis H_0. There are often many different tests, each w/ different properties (some have better power, others better size etc). Having more tests together allows for easier discovery, maintainability, and comparisons across tests.

For example, if I wanna test whether my sample is normally distributed in Mathematica, it automatically returns all relevant goodness-of-fit tests (along w/ stats & pvalues):

Tomas_Pevny · October 14, 2020, 5:42pm

I agree that it would be better to file a PR to the HypothesisTests.jl package if possible. I believe the work will have a bigger impact there.

Clement_POIRET · October 14, 2020, 5:57pm

Thank you for your comments! I agree with all of you, especially regarding the Shapiro-Wilk test, which is fairly common but yet to be implemented in Julia (except here, ahah). I’ll open some PRs, but the end goal of Pingouin (as you can see in the readme or in the original Python package) is not only hypothesis tests. E.g., it’ll include some plotting methods like QQ-plots, or even estimation statistics (which are not really hypothesis testing): Estimation Stats. I don’t think that’s the goal of HypothesisTests.jl; maybe it’s out of scope?

For now I started with some hypothesis tests because they’re what I use the most, but Pingouin.jl could become a wrapper around HypothesisTests.jl (I already use it, for example, for Jarque-Bera).

I’ll work on the PR when I have some time

Luapulu · October 15, 2020, 2:37pm

For plotting, it might be worth adding to Plots.jl or StatsPlots.jl.

Paulms · October 15, 2020, 2:39pm

The overall package idea seems good to me. I agree that it would be better to add the hypothesis tests to the existing HypothesisTests.jl and use it internally in your package. Then you can add the additional features that you have planned and still contribute to the common ecosystem infrastructure.

piever · October 15, 2020, 3:09pm

QQ-Plots are implemented in both StatsPlots and AbstractPlotting, but if you have other statistical visualizations that are not covered by those packages, a PR to integrate them would definitely be welcome!

Clement_POIRET · October 19, 2020, 8:12am

Thank you all for your kind advice. I’ll be happy to submit PRs and then use them in my package.

dmolina · October 19, 2020, 9:47am

Thank you, @Clement_POIRET, it is a nice package. I see that all functions are nicely documented, even with examples. I suggest you use Documenter.jl or similar to document the package; it is very simple, can be done in minutes, and could be useful for users.

Clement_POIRET · October 20, 2020, 8:17am

I’ll take a look at Documenter.jl, thanks for the tip @dmolina

abulak · October 29, 2020, 2:12pm

I see you took Shapiro Wilk from ExploreASL
https://github.com/clementpoiret/Pingouin.jl/blob/master/src/_shapiro.jl

However, aren’t these codes under incompatible licences? I remember that the original code (in fortran) was by no means GPL…

As a matter of fact there is a Shapiro-Wilk test PR in HT:
https://github.com/JuliaStats/HypothesisTests.jl/pull/124/files
that I did for the purpose of computing the exact coefficients (I’m yet to come back to this:)

Palli · October 29, 2020, 2:47pm

“I remember that the original code (in fortran) was by no means GPL…”

And neither is the MATLAB code you took (and translated): https://github.com/ExploreASL/ExploreASL/blob/495ecc662cd0fd2c59ebcce3469d615eb6b0a89d/Functions/xASL_stat_ShapiroWilk.m

so I would recommend that you, @Clement_POIRET, ditch that code (and tag a new incompatible version according to semver, i.e. 0.2.0), as people and you could be liable (even for what others do) for non-commercial use. If there’s other similar code (@abulak seems to point to such MIT-licensed, non-problematic code), then if I recall correctly you can take that MIT code and put it in your package, stating MIT in a file, alongside the whole GPL, but I’m not a lawyer so please confirm.

Another option would be to prominently copy their license to your LICENSE file, but since theirs is non-free (as in freedom)/non-GPL compatible, that’s not possible if some of the other code you made MUST be GPL. As that license states “solely for non-commercial use”, please don’t do that and apply such terms to other code you have. That’s worse than the GPL in my opinion, and it is why it makes the library non-GPL compatible.

Since your package has a lot of different stat tools that may or may not be used together, it seems tempting to have one license for one part and another for another part, but that seems highly dubious when those licenses are contradictory and many users would use both parts. Even users that distribute their code and/or [with] your code, even non-commercially, would be in violation of the GPL.

You can still distribute that translated code in a separate package, since you already made it and wouldn’t want it to go to waste. I see you want your package to be “exhaustive”, a good goal, and people can then just depend on two packages and do whatever they like privately; the GPL doesn’t prevent you from doing anything with it in private (even within your company, and subsidiaries under their control, and employees (work-for-hire) people can use).

If your users would use that non-GPL code in private non-commercially, at least you’re off the hook, and it would be their problem if they distribute the code or a package depending on both of your packages.

The Software is distributed “AS IS” under this License solely for
non-commercial use in the hope that it will be useful, but

[…]

No part of the Software may be reproduced, modified, transmitted or
transferred in any form or by any means, electronic or mechanical,
without the express permission of the University. The permission of
the University is not required if the said reproduction, modification,
transmission or transference is done without financial return,
conditions of this License are imposed upon the receiver of the
product, and all original and amended source code is included in any
transmitted product. You may be held legally responsible for any
copyright infringement that is caused […]

Clement_POIRET · November 2, 2020, 8:01am

Thanks @abulak and @Palli for pointing this out!
I think that for now I’ll just remove Shapiro from the lib, while waiting for HypothesisTests.jl to have its own method. All in all, it’s not wasted, I learned something from translating it

Otherwise as of now, my PhD with lockdown constraints is taking most of my time, so I’m focusing on making wrapper functions around HypothesisTests for non-parametric tests. As some of you suggested, I’m also refactoring the doc (Pingouin.jl Documentation · Pingouin), and I’m optimizing the code (deleting eval calls, using multiple dispatch, etc.)
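To illustrate the kind of refactoring mentioned above (replacing `eval` calls with multiple dispatch), here is a minimal sketch. It is not taken from Pingouin.jl: the marker types and the `effsize_sketch` function are hypothetical, and Glass' delta is computed here under the assumption that `y` is the control group.

```julia
using Statistics

# Hypothetical marker types standing in for effect-size names.
abstract type EffectSize end
struct CohenD <: EffectSize end
struct GlassDelta <: EffectSize end

# Pooled standard deviation of two independent samples.
pooled_sd(x, y) = sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                       (length(x) + length(y) - 2))

# One method per effect size: Julia selects the method from the argument type,
# so no string has to be turned into code with `eval`.
effsize_sketch(::CohenD, x, y) = (mean(x) - mean(y)) / pooled_sd(x, y)
effsize_sketch(::GlassDelta, x, y) = (mean(x) - mean(y)) / std(y)  # y assumed to be the control group

x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 2.0, 3.0, 4.0]
d = effsize_sketch(CohenD(), x, y)
delta = effsize_sketch(GlassDelta(), x, y)
```

Adding a new effect size then only requires a new type and a new method, and the compiler can specialize each method instead of evaluating code at run time.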

abulak · November 2, 2020, 9:30pm

You can either use the fortran version (as indicated in https://github.com/JuliaStats/HypothesisTests.jl/pull/124/files#diff-851fadd01a7f02254ea7f69728364d187a8e0ef3d9af1f2c016ffb7dc231e653R182), or use the version I wrote. In either case you need to specify clearly in your Readme and on the top of the file the license of the content. The source of swilk.f seems to be owned by the Journal of the Royal Statistical Society, (unclear license? someone please shed some light on this here?), or in case of my implementation MIT. PM me if you absolutely need GPL.

anon92994695 · November 2, 2020, 10:09pm

I have a QQ plots recipe and some other goodies in ChemometricsTools - feel free to lift it and disperse it elsewhere :).

Clement_POIRET · November 3, 2020, 10:23am

I think I’ll use your implementation @abulak, and specify it’s distributed under MIT.

@anon92994695 I’m not working on plots yet, but I’ll keep your lib in mind for when I start

BadBoy · February 3, 2023, 6:37pm

Now, in Julia 1.8.5, HypothesisTests.jl (v0.10.11) does not have any function for the Shapiro-Wilk test. So, how can I perform this test?

Thanks in advance.

nilshg · February 4, 2023, 7:31am

Looks like it’s close to merging so you can give this PR a try:

github.com/JuliaStats/HypothesisTests.jl

Shapiro-Wilk normality test

JuliaStats:master ← kalmarek:ShapiroWilk

opened 06:51PM - 04 Jan 18 UTC

kalmarek

+405 -0

implements ShapiroWilkTest following

> PATRICK ROYSTON
> Approximating the Shapiro-Wilk W-test for non-normality
> *Statistics and Computing* (1992) **2**, 117-119
> DOI: [10.1007/BF01891203](https://doi.org/10.1007/BF01891203)

This is work in progress, all comments are welcome!

I tried to follow the paper closely, but **copied** e.g. computed constants from the original [swilk.f](https://github.com/scipy/scipy/blob/master/scipy/stats/statlib/swilk.f). Please let me know if You are ok (license-wise) with this.

These polynomial interpolations are computed at loadtime for speed. Currently

```
julia> using BenchmarkTools
julia> srand(1);
julia> k = 5000;
julia> X = sort(randn(k));
julia> @btime ShapiroWilkTest(X);
  112.523 μs (18 allocations: 137.45 KiB)
julia> swc = SWCoeffs(k);
julia> @btime ShapiroWilkTest(X, swc);
  62.476 μs (11 allocations: 78.50 KiB)
```

whereas calling `swilkfort` directly

```
julia> srand(1);
julia> X = sort(randn(5000));
julia> A = zeros(2500);
julia> @btime swilkfort!(X, A);
  205.763 μs (10 allocations: 224 bytes)
julia> @btime swilkfort!(X, A, false);
  75.692 μs (10 allocations: 224 bytes)
```

Still missing:
- [x] documentation
- [ ] censored data
- [ ] polynomial interpolation with better precision (?),

and probably more.

I tried to compute exact values of `SWCoeffs` (via MonteCarlo simulation), but the results I'm getting are off the reported ones in Table 1 *op. cit.* Would be glad if anyone could help.

```julia
function approximate_A(N, samps)
    s = [sort(randn(N)) for i in 1:samps]
    m = sum(s[i][1:N] for i in 1:samps)/samps
    ss = vcat(s...)
    Vinv = inv(cov(ss, ss))
    A = (-m'*Vinv/sqrt(m'*Vinv*Vinv*m))'
    return ((A - reverse(A))/2)[1:div(N,2)]
end;
```

`approximate_A(10,1000000)` results in

```
[0.5469493640066541, 0.3559315327467397, 0.23322491989287789, 0.1336531031109471, 0.043609401791162974]
```

Compare `swilk.f`'s, and `SWCoeffs(10).A`:

```
[0.5737146934337275, 0.3289700603561813, 0.21434902886647542, 0.12279063037794058, 0.04008871216291867]
[0.5737147066903874, 0.3289700464878161, 0.21434901803439887, 0.1227906248657577, 0.04008871105102476]
```

## Rahman & Govindarajulu

A different implementation of SWCoeffs following

> M. Mahibbur Rahman & Z. Govindarajulu,
> A modification of the test of Shapiro and Wilk for normality
> *Journal of Applied Statistics* 24:2, 219-236, 1997
> DOI: [10.1080/02664769723828](http://dx.doi.org/10.1080/02664769723828)

```julia
function HypothesisTests.SWCoeffs(N::Int, ::Type{Val{:Rahman}})
    M = [norminvcdf(i/(N+1)) for i in 1:N];
    normM = normpdf.(M)
    c = (N+1)*(N+2)
    d = 2c.*normM.^2
    dl= [-c*normM[i]*normM[i+1] for i in 1:N-1]
    Vinv = SymTridiagonal(d, dl)
    C = sqrt(M'*(Vinv*Vinv)*M)
    Astar = -(M'*Vinv)'/C
    return SWCoeffs(N, Astar[1:div(N,2)])
end
```

But they don't provide critical values for their statistic (or I couldn't find it), so that would require further simulation. Also their `SWCoeffs(10, Val{:Rahman})` are quite different:

```
[0.6010309510135708, 0.2975089851946684, 0.19212508233094944, 0.10979215067183574, 0.03583065785524651]
```

On the other hand `ALGORITHM AS R94` is a **standard**.
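A note on the Monte Carlo attempt in the PR description: in `approximate_A`, `vcat(s...)` flattens all samples into one long vector, so `cov(ss, ss)` is a single number rather than an N × N matrix. That appears to explain why the reported values look like the normalised expected order statistics instead of the `swilk.f` coefficients. Below is a minimal sketch of a matrix-based variant, written for a current Julia version; `approximate_A_sketch` is a hypothetical name, not code from the PR or from HypothesisTests.jl, and its output is still subject to simulation noise.

```julia
using LinearAlgebra, Random, Statistics

# Monte Carlo estimate of the Shapiro-Wilk coefficients a = V⁻¹m / ‖V⁻¹m‖,
# where m and V are the mean vector and covariance matrix of the order
# statistics of N standard normal draws.
function approximate_A_sketch(N, samps; rng=Random.default_rng())
    # One sorted sample per row: a samps × N matrix of order statistics.
    S = sort(randn(rng, samps, N); dims=2)
    m = vec(mean(S; dims=1))   # estimated expected order statistics
    V = cov(S)                 # N × N covariance (rows are observations)
    w = V \ m                  # V⁻¹m without forming the inverse explicitly
    a = w / norm(w)            # normalise so that sum(abs2, a) == 1
    # Antisymmetrise and report the upper-tail half as positive values.
    return ((reverse(a) - a) / 2)[1:div(N, 2)]
end

A10 = approximate_A_sketch(10, 200_000; rng=MersenneTwister(1))
```

The estimate can be compared with the `swilk.f` and `SWCoeffs(10).A` values quoted above; agreement is only expected to within Monte Carlo error, which shrinks slowly as `samps` grows.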
