Questions about contributing to Distributions.jl

Hi all,

I am thinking about contributing to Distributions.jl. Two questions:

  1. Is it of interest to contribute yet another distribution, or is the philosophy of that package to only include the most widely used distributions so as to not clutter up the package?
  2. If I contribute a distribution, is it ok if not all properties of the distribution (e.g. cf, pdf) are known?

Thanks,

Mattias

4 Likes

Welcome @matvil.

  1. The more distributions the better. It makes the Julia ecosystem stronger.
  2. When you submit a PR w/ a new distribution you don’t necessarily need to include every property of that distribution. Other properties can be added by you or others in the future.
4 Likes

Great, thanks!

I agree with the more the better!

I just noticed that there’s no SkewNormal (see “Skew normal distribution” on Wikipedia), for example.

3 Likes
  1. @dlakelan I submitted a PR for Skew Normal
  2. I think it would be much easier for people to contribute if there was a template for submitting a distribution w/ say 50 properties for each distribution.
    To be clear, a contributor can add a distribution w/ only a few properties… Others can add more in the future.
  3. I find it unfortunate when others write packages w/ their own distribution instead of contributing to Julia’s main Distributions package…
1 Like

Nice! skew-normal is a very useful distribution for real-world models.

2 Likes

This was my first post here at Julialang on Discourse. Impressed by the very quick responses. Very encouraging!

4 Likes

Just to color this a bit, there’s a difference between “is the property known” and “did you include it in your PR”. If it is not known, then it’s fine that you didn’t include it. If it is known, then whether or not your PR should include it depends on how core it is to the utility of the distribution. If you don’t implement the kurtosis or the entropy, we can probably leave that for another day. If you leave out the mean when it’s known, we’d probably wait on a merge until you added it.

For example, Distributions does not currently include the alpha stable family (PRs welcome!). That family doesn’t have a known pdf, so that’s fine. But if your PR left out the characteristic function or the CMS sampler, I doubt we would merge it until you added them.
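
To make the "core vs. optional" distinction concrete, here is a minimal sketch of what a small first PR might define. `ToyPower` is a made-up example type (density a·x^(a-1) on [0, 1]) and is not part of Distributions.jl; the choice of methods reflects my reading of the comments above, not an official checklist. Kurtosis and entropy are deliberately left out, as the kind of property that could follow in a later PR.

```julia
using Distributions, Random

# Hypothetical example type, for illustration only.
struct ToyPower{T<:Real} <: ContinuousUnivariateDistribution
    a::T
    function ToyPower(a::T) where {T<:Real}
        a > 0 || throw(ArgumentError("shape a must be positive"))
        return new{T}(a)
    end
end

# Support of the distribution.
Base.minimum(::ToyPower) = 0.0
Base.maximum(::ToyPower) = 1.0

# Density: zero outside the support, so logpdf is -Inf there.
Distributions.pdf(d::ToyPower, x::Real) =
    0 <= x <= 1 ? d.a * x^(d.a - 1) : zero(float(x))
Distributions.logpdf(d::ToyPower, x::Real) = log(pdf(d, x))

# CDF and its inverse are both available in closed form.
Distributions.cdf(d::ToyPower, x::Real) = clamp(x, 0, 1)^d.a
Distributions.quantile(d::ToyPower, p::Real) = p^(1 / d.a)

# Sampling by inverting the CDF.
Base.rand(rng::AbstractRNG, d::ToyPower) = quantile(d, rand(rng))

# Core moments that are known in closed form.
Distributions.mean(d::ToyPower) = d.a / (d.a + 1)
Distributions.var(d::ToyPower) = d.a / ((d.a + 1)^2 * (d.a + 2))
```

With these in place, calls such as `mean(ToyPower(2.0))` or `cdf(ToyPower(2.0), 0.5)` work like they do for built-in distributions.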

3 Likes

Yes, I read that between the lines, but good to clarify. Thanks.

While I think that Distributions.jl is the best place for most commonly used distributions, it would be nice to make it easier for packages to define distributions with only a lightweight dependency, so I opened

https://github.com/JuliaStats/Distributions.jl/issues/1139

5 Likes

If you’re going down this route, I’ll propose a huge amount of additional work :slight_smile:

I often wonder if Distributions.jl would benefit from a refactor into UnsafeDistributions.jl that defines the types and does no error checking. That package could be wrapped with the current parameter checking to create Distributions.jl.

Unsure if people still complain about the error checking cost these days, but it was a theme of a few comments in the early days.

1 Like

Can you please give an example of what kind of error checking you would skip?

Unclear if these are worth removing, but there’s still some checking logic that is slightly wasteful in a tight loop like this: https://github.com/JuliaStats/Distributions.jl/blob/20c91d9efcc5f96913bf8e38be2e3fb14b21942b/src/multivariate/dirichlet.jl#L34

I don’t see any of the heavier checks I thought existed in the past for MvNormal, but I might have misremembered those.

1 Like

I would propose a uniform syntax to skip checks, similar to Base.undef.

E.g. a singleton Distributions.unsafe, as in

Dirichlet(unsafe, alpha)

as an additional inner constructor.
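
A minimal sketch of how such a singleton could work, using a toy type so that nothing here is mistaken for the actual Distributions.jl API. `Unsafe`, `unsafe`, and `ToyDirichlet` are all hypothetical names introduced for illustration.

```julia
# Hypothetical marker type, analogous in spirit to Base.undef.
struct Unsafe end
const unsafe = Unsafe()

struct ToyDirichlet{T<:Real}
    alpha::Vector{T}
    alpha0::T   # cached sum of alpha

    # Fast path: trust the caller and skip all validation.
    ToyDirichlet(::Unsafe, alpha::Vector{T}) where {T<:Real} =
        new{T}(alpha, sum(alpha))

    # Default path: validate first, then delegate to the fast path.
    function ToyDirichlet(alpha::Vector{T}) where {T<:Real}
        all(>(0), alpha) || throw(ArgumentError("alpha must be a positive vector"))
        return ToyDirichlet(unsafe, alpha)
    end
end
```

Here `ToyDirichlet([1.0, 2.0])` runs the check, while `ToyDirichlet(unsafe, [1.0, 2.0])` skips it. Because the choice is made by dispatch on the first argument, the unchecked path carries no runtime branch.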

2 Likes

I don’t have anything to add, but I have been idly wondering about this. When I look at the Turing stuff for example, I see a lot that makes me uneasy, but I think it ultimately reflects on the structure of Distributions. DistributionsAD.jl for instance includes line-for-line recoding of stuff from Distributions to get around certain conventions or inconsistencies about error checking or PDMats. And then AdvancedMH.jl introduces the idea of a “DensityModel” as the target for MH. I don’t know what this could mean other than a distribution, so it would at least be cute if you could just use Distribution subtypes there, but again I think the point is they’re getting around error checking and inconsistencies about what logpdf returns when you’re out of the support.

1 Like

I like the idea of having a core set of distributions in Distributions.jl, with other distribution packages building on core functionality from Distributions.jl. There are simply too many distributions out there, and Distributions.jl would quickly be overwhelmed, which is why I asked my original question.

The unsafe suggestion is something I always wanted in Matlab (I used to define my own unsafe versions). Not sure if the overhead for safe checking is a big burden in Julia though.

Yes, I’ve been reading the code again lately and there seems to be no coherent strategy for when to return 0/-Inf, return NaN, or throw an error. Would be great to clean up.

2 Likes

Perhaps a related question.

In my field (HEP), one often (always) needs to use custom PDFs.
A few years ago I was considering adding the functions I need to Distributions, but:

  • too many functions: one always needs some new shape of signal and background, so it is not worth implementing all of them.
  • too much to code (although I still have only a vague idea of what it takes to add a new function)

Here is a prototype of my package AlgebraPDF.jl for creating custom PDFs and combinations of them, and for generating and fitting.
I am using it in my analysis, but have not yet found time to polish and publish it. It certainly overlaps with Distributions.jl. I am wondering if we can make use of it.

1 Like

I’m a little confused about what’s going on.

  1. StatsFuns.jl currently has 14 core distributions.
    Each distribution has exactly 10 properties: pdf/cdf/invcdf …
    It says:
    “We recommend using the Distributions.jl package for a more convenient interface.”
  2. Distributions.jl has ≈ 80 distributions w/ several more in progress.
    Each has different properties.
    Some properties are not defined for that distribution, some still need to be added.
    Distributions.jl allows users to create new distributions from existing ones via: truncation-mixture-products-convolution…
  3. One advantage of keeping distributions in the same repo is how easy it is to access all of them.
    Suppose I got a new dataset & wanna see which “name-brand” distribution best fits it. I can automatically fit all relevant distributions w/ a single package.
Code to fit all relevant distributions in Distributions.jl
using Distributions, Random, HypothesisTests
using InteractiveUtils: subtypes   # loaded automatically in the REPL, but not in scripts

# Candidate families: every direct subtype of UnivariateDistribution.
Uni = subtypes(UnivariateDistribution)
#Cts_Uni = subtypes(ContinuousUnivariateDistribution)
DGP_True = LogNormal(17, 7)
Random.seed!(123)
const d_train = rand(DGP_True, 1_000)
const d_test  = rand(DGP_True, 1_000)

Er = []; D_fit = []   # failed families and successful fits
for d in Uni
    println(d)
    try
        # Families without a fit method (or that fail on this data) end up in Er.
        D̂ = fit(d, d_train)
        Score = [loglikelihood(D̂, d_test),
                OneSampleADTest(d_test, D̂)            |> pvalue,
                ApproximateOneSampleKSTest(d_test, D̂) |> pvalue,
                ExactOneSampleKSTest(d_test, D̂)       |> pvalue,
                #PowerDivergenceTest(d_test,lambda=1)  Not working!!!
                JarqueBeraTest(d_test)                |> pvalue   # Only meaningful for Normal
        ]
        # TODO: compute a better score.
        push!(D_fit, [d, D̂, Score])
    catch e
        println(e, d)
        push!(Er, (d, e))
    end
end

# Rows of `a`: family, fitted distribution, score vector.
a = hcat(D_fit...)
M_names = a[1, :]; M_fit = a[2, :]; M_scores = a[3, :]
# Score vectors compare lexicographically, so this ranks by test log-likelihood first.
idx = sortperm(M_scores, rev=true)
Dfit_sort = hcat(M_names[idx], M_fit[idx], M_scores[idx])
Output
julia> Dfit_sort
11×3 Array{Any,2}:
 LogNormal              …  [-20600.7, 0.823809, 0.789128, 0.781033, 0.0]
 Gamma                     [-21159.4, 6.0e-7, 2.45426e-68, 1.23247e-69, 0.0]
 Cauchy                    [-24823.3, 6.0e-7, 2.91142e-213, 8.6107e-227, 0.0]
 InverseGaussian           [-26918.1, 6.0e-7, 0.0, 0.0, 0.0]
 Exponential               [-33380.3, 6.0e-7, 0.0, 0.0, 0.0]
 Normal                 …  [-40611.5, 6.0e-7, 1.32495e-213, 3.51792e-227, 0.0]
 Rayleigh                  [-61404.6, 6.0e-7, 0.0, 0.0, 0.0]
 Laplace                   [-2.03419e9, 6.0e-7, 1.49234e-138, 5.47197e-144, 0.0]
 DiscreteNonParametric     [-Inf, 6.0e-7, 0.197933, 0.193494, 0.0]
 Pareto                    [-Inf, 6.0e-7, 6.69184e-108, 3.7704e-111, 0.0]
 Uniform                …  [-Inf, 6.0e-7, 0.0, 0.0, 0.0]
2 Likes

I’d like to better understand the advantage to keeping different distributions in different repos. The points I’ve heard are:

  1. it could “clutter up the package” so leave Distributions.jl “for most commonly used distributions”
    But Distributions.jl already has 80 distributions. Should we move out the ones that are not widely used? Where? Who decides which are widely used?
    Isn’t that why we have StatsFuns.jl?
  2. other arguments I’ve heard have to do w/ reducing dependency.
  3. Something about safety checking.
    I don’t see why that is an obstacle that can’t be overcome.

Back in 2013 @johnmyleswhite made a nice list of distributions to add to Distributions.jl: https://github.com/JuliaStats/Distributions.jl/issues/124

My biggest concern about having various distributions scattered throughout the Julia ecosystem:

  1. Many great distributions in smaller repos will be neglected with probability one.
    They are more likely to be maintained in Distributions.jl.
    For example, SkewDist.jl has great stuff but was last updated in 2017 & currently doesn’t work.
    I would be A LOT more willing to submit a PR if SkewDist.jl was inside a “blessed repo” such as Distributions.jl.
  2. Look at the giant mess for distributions in the R ecosystem.
    Imagine how much stronger R would be if it had a culture that made it easy to submit PRs…
2 Likes