[ANN] Pingouin.jl: a simple yet exhaustive statistical package

Hi everyone,

I have always wanted to learn Julia, but I never found a satisfying library for conducting statistical tests. I used Pingouin, a stats library in Python made by Raphael Vallat, and I always wished for an equivalent package in Julia. So here is my version, written entirely in Julia (a pre-release with only a limited set of features).

As of now, Pingouin.jl 0.1.0 (GitHub - clementpoiret/Pingouin.jl: Reimplementation of Raphaelvallat's Pingouin in Julia) supports distribution-related functions such as:

  • Anderson-Darling test of distribution,
  • Geometric standard (Z) score,
  • Levene & Bartlett tests for homoscedasticity,
  • Shapiro-Wilk, Shapiro-Francia and Jarque Bera tests of normality,
  • Mauchly and JNS tests for sphericity,
  • Epsilon adjustment factor for repeated measures (i.e. Greenhouse-Geisser, Huynh-Feldt, lower bound).
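
To make one of the tests listed above concrete, here is a minimal Python sketch of the Jarque-Bera normality test (Python only because the original Pingouin is a Python library). This is not Pingouin.jl's code: the function name `jarque_bera` is hypothetical, and the p-value relies on the asymptotic chi-square distribution with 2 degrees of freedom, whose survival function is exp(-x/2).

```python
from math import exp


def jarque_bera(x):
    """Return (statistic, asymptotic p-value) of the Jarque-Bera test."""
    n = len(x)
    m = sum(x) / n
    # Biased (population) central moments of order 2, 3 and 4.
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    skew = m3 / m2 ** 1.5   # sample skewness
    kurt = m4 / m2 ** 2     # sample kurtosis (equals 3 for a normal law)
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # Chi-square with 2 degrees of freedom: P(X > jb) = exp(-jb / 2).
    p_value = exp(-jb / 2)
    return jb, p_value


if __name__ == "__main__":
    print(jarque_bera([1, 2, 3, 4, 5]))
```

The asymptotic p-value is unreliable for small samples, so treat the numbers from a five-point example as an illustration of the arithmetic only.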

It also supports effect-size-related functions:

Effect sizes between two Arrays:

  • Unbiased Cohen d,
  • Hedges g,
  • Glass delta,
  • Correlation coefficient (Pearson),
  • Eta-square,
  • Odds ratio,
  • Area Under the Curve,
  • Common Language Effect Size.
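
As an illustration of two of the effect sizes above, here is a minimal Python sketch of Cohen's d (pooled standard deviation) and Hedges' g (a common small-sample correction of d). The function names are hypothetical and this is not Pingouin.jl's implementation; the correction factor 1 - 3 / (4N - 9) is the usual approximation, assumed here.

```python
from math import sqrt
from statistics import mean, variance


def cohen_d(x, y):
    """Cohen's d for two independent samples, using the pooled SD."""
    nx, ny = len(x), len(y)
    # statistics.variance is the unbiased (n - 1) sample variance.
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)


def hedges_g(x, y):
    """Hedges' g: Cohen's d shrunk by an approximate small-sample factor."""
    n_total = len(x) + len(y)
    return cohen_d(x, y) * (1 - 3 / (4 * n_total - 9))


if __name__ == "__main__":
    a, b = [2, 4, 6, 8], [1, 3, 5, 7]
    print(cohen_d(a, b), hedges_g(a, b))
```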

The conversion of Pearson’s r and Cohen’s d to:

  • Unbiased Cohen d,
  • Hedges g,
  • Eta-square,
  • Odds ratio,
  • Area Under the Curve.
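
The conversions above follow standard textbook formulas. The Python sketch below shows them under stated assumptions: the function names are hypothetical, the d-to-r formula uses the constant 4 (which assumes equal group sizes), and none of this is taken from Pingouin.jl's source.

```python
from math import erf, exp, pi, sqrt


def r_to_d(r):
    """Pearson's r to Cohen's d."""
    return 2 * r / sqrt(1 - r ** 2)


def d_to_r(d):
    """Cohen's d to Pearson's r, assuming equal group sizes (constant 4)."""
    return d / sqrt(d ** 2 + 4)


def d_to_eta_square(d):
    """Cohen's d to eta-square."""
    half = d / 2
    return half ** 2 / (1 + half ** 2)


def d_to_odds_ratio(d):
    """Cohen's d to an odds ratio via the logistic approximation."""
    return exp(d * pi / sqrt(3))


def d_to_auc(d):
    """Cohen's d to area under the curve: Phi(d / sqrt(2))."""
    z = d / sqrt(2)
    # Standard normal CDF written with the error function.
    return 0.5 * (1 + erf(z / sqrt(2)))


if __name__ == "__main__":
    d = r_to_d(0.5)
    print(d, d_to_r(d), d_to_eta_square(d), d_to_odds_ratio(d), d_to_auc(d))
```

With the constant 4, `d_to_r(r_to_d(r))` returns `r` again, which is a quick sanity check on the pair of formulas.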

But also the computation of effect sizes from T-values, parametric confidence intervals around a Cohen d or a correlation coefficient, and bootstrapped confidence intervals of univariate and bivariate functions.

The main goal is to provide a really simple API for both simple and advanced statistics. Version 0.1.0 will soon be published to the default Julia package registry.

It is my first real project in Julia, so I really hope you’ll like it. I’m a newbie, so feel free to send any suggestions, contributions, or remarks; I want to improve my Julia skills :slight_smile:

The next release will include paired and unpaired non-parametric tests such as Mann-Whitney U, Wilcoxon Signed Rank, or Friedman.

32 Likes

Have you seen HypothesisTests.jl? I haven’t looked at your package in detail so I might be speaking out of turn here, but it seems to me that your work is closely related and might be worth contributing to the established package to prevent fragmentation?

12 Likes

@nilshg raises a great point.
You’ve already filled a few gaps in goodness-of-fit tests for the Julia ecosystem.

It would be convenient if these tests were all in one place.

Suppose a Julia user wants to test a hypothesis H_0. There are often many different tests, each w/ different properties (some have better power, others better size etc). Having more tests together allows for easier discovery, maintainability, and comparisons across tests.

For example, if I wanna test whether my sample is normally distributed in Mathematica, it automatically returns all relevant goodness-of-fit tests (along w/ stats & pvalues):
[Image: Mathematica output listing goodness-of-fit tests with statistics and p-values]

6 Likes

I agree that it would be better to file a PR to the HypothesisTests.jl package if possible. I believe the work would have a bigger impact.

1 Like

Thank you for your comments! I agree with all of you, especially about the Shapiro-Wilk test, which is fairly common but not yet implemented in Julia (except here, haha). I’ll open some PRs, but the end goal of Pingouin (as you can see in the readme or in the original Python package) is not only hypothesis tests. E.g., it’ll include some plotting methods like QQ-plots, or even estimation statistics (which are not really hypothesis testing): Estimation Stats. I don’t think that’s the goal of HypothesisTests.jl; maybe it’s out of scope?

As of now I have started with some hypothesis tests because that’s what I use the most, but it could be a wrapper around HypothesisTests.jl (I already use it, for example, for Jarque-Bera) :slight_smile:

I’ll work on the PR when I have some time.

11 Likes

For plotting, it might be worth adding to Plots.jl or StatsPlots.jl.

The overall package idea seems good to me. I agree that it would be better to add the hypothesis tests to the existing HypothesisTests.jl and use it internally in your package. Then you can add the additional features that you have planned and still contribute to the common ecosystem infrastructure.

8 Likes

QQ-Plots are implemented in both StatsPlots and AbstractPlotting, but if you have other statistical visualizations that are not covered by those packages, a PR to integrate them would definitely be welcome!

6 Likes

Thanks, all of you, for your kind advice. I’ll be happy to submit PRs and then use them in my package :slight_smile:

Thank you, @Clement_POIRET, it is a nice package. I see that all functions are nicely documented, even with examples. I suggest using Documenter.jl or similar to document the package; it is very simple, can be done in minutes, and could be useful for users.

1 Like

I’ll take a look at Documenter.jl, thanks for the tip @dmolina :slight_smile:

I see you took Shapiro Wilk from ExploreASL
https://github.com/clementpoiret/Pingouin.jl/blob/master/src/_shapiro.jl

However, aren’t these codes under incompatible licences? I remember that the original code (in Fortran) was by no means GPL…

As a matter of fact there is a Shapiro-Wilk test PR in HT:
https://github.com/JuliaStats/HypothesisTests.jl/pull/124/files
that I did for the purpose of computing the exact coefficients (I’m yet to come back to this :)

1 Like

I remember that the original code (in fortran) was by no means GPL…

And neither is the MATLAB code you took (and translated): https://github.com/ExploreASL/ExploreASL/blob/495ecc662cd0fd2c59ebcce3469d615eb6b0a89d/Functions/xASL_stat_ShapiroWilk.m

so I would recommend that you, @Clement_POIRET, ditch that code (and tag a new incompatible version according to semver, i.e. 0.2.0), as people and you could be liable (even for what others do) for non-commercial use. If there’s other similar code (@abulak seems to point to such MIT-licensed, non-problematic code), IF I recall correctly you can take that MIT code and put it in your package, stating MIT in a file, alongside the whole GPL, but I’m not a lawyer so please confirm.

Another option would be to prominently copy their license to your LICENSE file, but since theirs is non-free (as in freedom)/non-GPL-compatible, that’s not possible if some of the other code you made MUST be GPL. As that license states “solely for non-commercial use”, please don’t do that and apply such terms to other code you have. That’s worse than the GPL in my opinion, and it’s why it makes the library non-GPL-compatible.

Since your package has a lot of different stat tools that may or may not be used together, it seems tempting to have one license for one part and another for another part, but that seems highly dubious when those licenses are contradictory and many users would use both parts. Even users that distribute their code and/or [with] your code, even non-commercially, would be in violation of the GPL.

You can still distribute that translated code in a separate package, since you already made it and wouldn’t want it to go to waste. I see you’d like your package to be “exhaustive”, a good goal, and people can then just depend on two packages and do whatever they like privately; the GPL doesn’t prevent you from doing anything with it in private (even within your company, subsidiaries under its control, and employees (work-for-hire)).

If your users would use that non-GPL code in private non-commercially, at least you’re off the hook, and it would be their problem if they distribute the code or a package depending on both of your packages.

The Software is distributed “AS IS” under this License solely for
non-commercial use in the hope that it will be useful, but

[…]

No part of the Software may be reproduced, modified, transmitted or
transferred in any form or by any means, electronic or mechanical,
without the express permission of the University. The permission of
the University is not required if the said reproduction, modification,
transmission or transference is done without financial return,
conditions of this License are imposed upon the receiver of the
product, and all original and amended source code is included in any
transmitted product. You may be held legally responsible for any
copyright infringement that is caused […]

1 Like

Thanks @abulak and @Palli for pointing this out!
I think that for now I’ll just remove Shapiro from the lib, while waiting for HypothesisTests.jl to have its own method. All in all, it’s not wasted, I learned something from translating it :slight_smile:

Otherwise, as of now, my PhD with lockdown constraints is taking most of my time, so I’m focusing on making wrapper functions around HypothesisTests for non-parametric tests. As some of you suggested, I’m also refactoring the doc (Pingouin.jl Documentation · Pingouin), and I’m optimizing the code (deleting eval calls, using multiple dispatch, etc.)

2 Likes

You can either use the fortran version (as indicated in https://github.com/JuliaStats/HypothesisTests.jl/pull/124/files#diff-851fadd01a7f02254ea7f69728364d187a8e0ef3d9af1f2c016ffb7dc231e653R182), or use the version I wrote. In either case you need to specify clearly in your Readme and on the top of the file the license of the content. The source of swilk.f seems to be owned by the Journal of the Royal Statistical Society, (unclear license? someone please shed some light on this here?), or in case of my implementation MIT. PM me if you absolutely need GPL.

1 Like

I have a QQ plots recipe and some other goodies in ChemometricsTools - feel free to lift it and disperse it elsewhere :).

1 Like

I think I’ll use your implementation @abulak, and specify it’s distributed under MIT.

@anon92994695 I’m still not working on plots, but I’ll keep your lib in mind when I start :slight_smile:

Now, in Julia 1.8.5, HypothesisTests.jl (v0.10.11) does not have any function for the Shapiro-Wilk test. So, how can I perform this test?

Thanks in advance.

Looks like it’s close to merging so you can give this PR a try: