Questions about contributing to Distributions.jl

I would be a bit careful about claiming this: it’s not a given that people who are willing to add new distributions are willing to maintain existing ones. In general, OSS maintenance is a mess: the core Python dev team points out that keeping Python going runs on a time commitment equivalent to 1.5 full-time positions.

1 Like

Maybe I don’t understand.
I’m not saying the person who adds a new distribution will maintain that one or other existing ones.
I’m saying at any point in time there are a lot more people looking @ Distributions.jl than @ SkewDist.jl.
If I add a new distribution to Distributions.jl and then “get hit by a bus” it is more likely to be maintained than if I put it in my own private GitHub repo.
Unless I’m mistaken?

Were you referring to “with probability one”?
I should be more careful.
My point is things are more likely to be neglected in a private repo than a big repo w/ many users/contributors. They are also much easier to discover.
I guess this is a falsifiable empirical prediction.

It’s true that there are more eyes on Distributions.jl than on a smaller repo. But that potentially creates a stream of bugs and maintenance work that scales much faster than the rate at which new maintainers volunteer – so the burden of maintenance gets worse and worse on the few people who are actively maintaining the code. This experience is one reason why maintainers burn out or become jaded about new contributions.

3 Likes

That explains a lot. Thank you.
There must be ways to make it easier to maintain a package.

For example, maybe if there was a standardized template for contributing a distribution?
Then perhaps have a GSoC or JSoC student fill it up w/ 100s of brand name distributions.

1 Like

Yeah, I would love to see people figure that out. I did a terrible job of it when I was maintaining Julia packages, but I do believe others could do a good job. I think the struggle is that OSS isn’t able to employ the techniques I see in industry: assigning people levels based on demonstrated skillsets, giving them direct managers who are held responsible for helping people and empowered to ensure they provide value relative to their abilities, etc. But industry pulls that kind of structure off because it can pay a huge number of people very well.

1 Like

I think any “named” distribution with an iid sampler and an analytically tractable cdf/pdf/chf should be in Distributions.jl some day. Maybe even just the former. I like it being a one-stop-shop that enforces some uniformity about how they’re all handled.

2 Likes

Of course there isn’t a right answer to this.

  1. Moving everything to Distributions.jl makes the package bulky, which leads to long loading times if used as a dependency. It also increases the pressure on the maintainers as more bugs are filed and more features are requested.

  2. On the other hand, splitting distributions across various packages obviously comes with its own problems, mainly fragmentation of the ecosystem and different styles/API usage.

How about this idea? A page (e.g. a GitHub page) that shows all possible distributions and what package they are in. Some distributions could be in two packages (e.g. StatsFuns and Distributions) and the user can choose which one to bring in as a dependency. But now the question is: who will maintain this GitHub page?

1 Like

StatsFuns.jl has the special functions that are necessary for a lot of distribution-related calculations. These special functions are tricky to construct and code, many of them actually come from

As the README explains, these are not meant to be used directly. But it’s great that they are collected and maintained in one place.

This may be an exercise with dubious foundations in statistical theory, so I don’t think any package needs to design its API for this use case.

2 Likes

FYI: It looks like there is a registered package for this: AlphaStableDistributions.jl

There is also: RandomMatrixDistributions.jl
As well as various unregistered packages, such as SkewDist.jl, which are often unmaintained…

I agree with this very strongly!!!
After all Distributions.jl is one of the signature Julia packages.

There are very many “named” distributions that are used in one specialized setting and pretty much nowhere else. They should share the interface of Distributions.jl without living in that package.

To make things concrete, consider the growth charts (for height, weight, BMI) for children compiled in 2006 by the WHO. The charts use a Box-Cox transformed normal distribution, adjusting the tails outside $\pm 3\sigma$ back to normal (since there is insufficient data to fit the transformed version there). It is a very, very specialized distribution that is mostly useful with the actual estimated table data (e.g., see this R package). Consequently, including it in Distributions.jl may not be warranted.

2 Likes

There’s also RandomMatrices.jl, which is provided by a core Julia team and is technically building on the Distributions hierarchy, but it’s not being developed much anymore, and in any case its intended users don’t overlap so closely with the folks that would use the matrix-variates in Distributions (imo).

Definitely agree with this. Where is the multivariate normal, again? And is it mvrnorm? rmvnorm?

Even if it had such a culture, submitting PRs could still be hard since R isn’t really written in R.

I’m not sure that your example shouldn’t be included, especially if we could make the transformation generic like LocationScale.
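For what it’s worth, here is a rough sketch of what “generic like LocationScale” could mean: one wrapper that applies the change-of-variables formula for any increasing transformation. Everything in it is hypothetical (`Transformed`, `transformed_pdf`, `locationscale`, `boxcox` are made-up names, not Distributions.jl API), the L/M/S parameterization of the Box-Cox transform is my assumption, and the WHO tail adjustment and the truncation at zero are ignored.

```julia
# Standard normal density, written out to keep the sketch dependency-free.
stdnormal_pdf(z) = exp(-z^2 / 2) / sqrt(2π)

# Hypothetical generic wrapper: X is distributed so that h(X) has density `basepdf`.
struct Transformed{F,H,D}
    basepdf::F   # density of Z = h(X)
    h::H         # increasing, differentiable map x -> z
    dh::D        # derivative of h
end

# Change of variables for increasing h: f_X(x) = f_Z(h(x)) * h'(x).
transformed_pdf(t::Transformed, x) = t.basepdf(t.h(x)) * t.dh(x)

# Location-scale is the special case h(x) = (x - μ) / σ.
locationscale(μ, σ, basepdf) =
    Transformed(basepdf, x -> (x - μ) / σ, x -> 1 / σ)

# Box-Cox style transform (L ≠ 0, x > 0): h(x) = ((x / M)^L - 1) / (L * S).
boxcox(L, M, S, basepdf) =
    Transformed(basepdf, x -> ((x / M)^L - 1) / (L * S), x -> (x / M)^(L - 1) / (M * S))

ls = locationscale(10.0, 1.0, stdnormal_pdf)
bc = boxcox(1.0, 10.0, 0.1, stdnormal_pdf)   # L = 1 reduces to location-scale with σ = M * S
```

With L = 1 the two wrappers should give the same density, which is a cheap sanity check that the transformation really is the only thing that differs.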

This is maybe too whimsical, but one small benefit of having everything in one place is that you stumble upon stuff you weren’t looking for. I’d never heard of the Studentized range until I passed it in the docs looking for something else.

2 Likes

I’d love to see this conversation take a slight turn. Here’s my two cents:

  • There is a balance between the number of maintainers and the number of distributions that can be safely maintained. So there must come a point at which a package has too much functionality.
  • But no one really knows what that point is.
  • But no one needs to know: the bigger insight is that the limiting factor is the number of maintainers.

So I think the big question is: how do you make it easier for new people to ramp up on maintenance work and make it easier to calibrate expectations for those new people to maximize some combination of their happiness and their productivity?

Years ago, I called this the onboarding problem: OSS projects don’t know how to articulate the minimum bar of skill required to start being a level X maintainer on a project or what kinds of projects are appropriate for a level X contributor. But if you had such expectations listed out formally, I think you could quickly get more people involved in maintenance. And with more people, you (a) have a voting body that could decide how many distributions can be supported and (b) can support more distributions.

5 Likes

One large cost of including very specialized distributions that require more esoteric expertise is that maintaining them becomes difficult if the original contributor is no longer available. This, unfortunately, happens in practice over the span of a few years.

(This is in addition to the general extra cost of “just” maintaining a more complex package.)

2 Likes

Might not necessarily be the fastest way to compute them. So for some common distributions I think we should probably create some specialized evaluation functions etc for speed.

But as a first step, I think a macro which lets you define @isakindofdist foo(a,b) bar(1,a,b,3) for example and have it write all the code to make foo(a,b) automatically would be great.
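A minimal sketch of how such a macro might look, just to make the idea concrete. Only the macro name comes from the post above; the toy `Bar` type, its fields, and `total` are hypothetical stand-ins for a general family and a method defined on it. This version only generates a constructor-style function and does not create a new type.

```julia
# Toy four-parameter family standing in for a general distribution (hypothetical).
struct Bar
    shape::Float64
    a::Float64
    b::Float64
    scale::Float64
end

# A method written once for the general family.
total(d::Bar) = d.shape + d.a + d.b + d.scale

# `@isakindofdist foo(a, b) Bar(1, a, b, 3)` expands to `foo(a, b) = Bar(1, a, b, 3)`,
# so anything returned by `foo` can use every method defined for `Bar`.
macro isakindofdist(special, general)
    # Both arguments must be call expressions such as `foo(a, b)`.
    special isa Expr && special.head == :call ||
        error("expected a call expression like foo(a, b)")
    general isa Expr && general.head == :call ||
        error("expected a call expression like Bar(1, a, b, 3)")
    # esc keeps the user's names as written instead of hygiene-renaming them.
    return esc(:($special = $general))
end

@isakindofdist foo(a, b) Bar(1, a, b, 3)

d = foo(2.0, 5.0)
```

A real version would presumably want a distinct type per special case (for dispatch and printing), which is where the specialized fast paths mentioned above could hook in.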

3 Likes

Precise examples will help make clear the different visions for Distributions.jl.

MLJ.jl is an interface to many ML models.
Some of their models are wrapped by MLJ maintainers, some models are wrapped by the original package authors using MLJModelInterface.jl.
Currently they have models from 14 packages w/ more on the way.
It is nice that the MLJ maintainers don’t have to maintain 14+ packages.

@Tamas_Papp is this the type of vision you have in mind for Distributions.jl?
Distributions.jl will be the interface to various distributions in the Julia ecosystem w/ the main distributions inside the original package?
Then authors of private repos (such as SkewDist.jl) can make their stuff available via the Distributions.jl interface?

I prefer the current structure of Distributions.jl unless I misunderstood your point.

@tlienart @cscherrer I’m curious to hear your views (if you have anything to add)

@dlakelan you bring up a great point which is discussed here.
PERT (3 param) is a special case of Beta (4 param), which is a special case of GeneralizedBeta (5 param).
Actually 15+ distributions are a special case of GeneralizedBeta.

This could save A LOT of code bc we know the closed form for the n-th moment of the GeneralizedBeta & thus for its 17 sub-distributions:
image
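To illustrate the code-saving idea on the smallest link of that chain, here is a sketch that derives PERT moments from standard Beta moments. It uses the textbook Beta raw moment E[X^n] = ∏_{k=0}^{n-1} (α+k)/(α+β+k) and the usual PERT parameterization (minimum a, mode b, maximum c), not the GeneralizedBeta formula referred to above. The function names are made up for illustration.

```julia
# n-th raw moment of a standard Beta(α, β) on [0, 1]:
# E[X^n] = prod over k = 0, ..., n-1 of (α + k) / (α + β + k)
function beta_moment(α, β, n::Integer)
    m = 1.0
    for k in 0:n-1
        m *= (α + k) / (α + β + k)
    end
    return m
end

# PERT(a, b, c) is a + (c - a) * X with X ~ Beta(α, β), where
# α = 1 + 4(b - a)/(c - a) and β = 1 + 4(c - b)/(c - a).
function pert_moment(a, b, c, n::Integer)
    α = 1 + 4 * (b - a) / (c - a)
    β = 1 + 4 * (c - b) / (c - a)
    # Binomial expansion of E[(a + (c - a) X)^n] in terms of Beta moments.
    return sum(binomial(n, j) * a^(n - j) * (c - a)^j * beta_moment(α, β, j) for j in 0:n)
end

pert_mean = pert_moment(1.0, 3.0, 11.0, 1)
pert_var = pert_moment(1.0, 3.0, 11.0, 2) - pert_mean^2
```

The mean should agree with the familiar PERT formula (a + 4b + c)/6, so only the Beta moment ever has to be written by hand.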

This field guide claims that over 100 named continuous univariate distributions can be written as special cases of only 5 generalized families: Pearson, GeneralizedBeta, GeneralizedBetaPrime…

ack. I apparently edited a post rather than posting a new one… whoops.

see above: Questions about contributing to Distributions.jl - #34 by dlakelan

wow that’s a cool resource!

1 Like

I think that Tamas’s proposal of splitting out a DistributionsBase.jl is not necessarily about breaking up Distributions.jl and/or having lots of packages that implement a single distribution. His suggestion would allow for people to hook into the Distributions.jl functionality in situations where they need to implement their own distribution in a project or package for whatever purpose (like the example he mentioned). So maybe I’m solving a model and I need to use some weird parameterization of the Skew-Normal distribution, DistributionsBase.jl has what I need to get off the ground in a timely fashion and still have my distribution be consistent with the functionality that any distribution in Distributions.jl would provide (and with a smaller dependency).

That’s how I saw it anyway.

2 Likes

I thought users already can do that w/ using Distributions, StatsFuns...?
For example: SkewDist.jl
Unless the point is to have a light-weight package for doing this?

Yes I think the idea is a lightweight package for this, which doesn’t import hundreds of distributions and functions. Basically something with just the types and the basic API interface. Like Tables.jl is an interface for tabular data.

2 Likes

As @tbeason and @dlakelan said above, ideally I would like to have

  1. a small package that defines the API, not unlike the excellent Tables.jl. Users would not need to use this directly.
  2. packages that use this API as the common interface. This is a very successful model to emulate for Julia packages.
  3. among these packages, Distributions.jl in a special role: the package to reach for when one wants to use a common distribution.
  4. at the same time, less pressure to include more esoteric distributions in Distributions.jl, because they could just go into their own package and still be equivalent from a user perspective.
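
A rough sketch of that layout, using modules in one file to stand in for separate packages. None of this is existing API: `DistributionsBaseSketch`, `AbstractUnivariate`, and `MyExponential` are hypothetical names, and a real interface package would define far more than three functions.

```julia
using Random

# Hypothetical "API-only" package: types and function names, no distributions.
module DistributionsBaseSketch
    export AbstractUnivariate, logpdf, pdf, cdf

    abstract type AbstractUnivariate end

    # Functions that implementing packages are expected to extend.
    function logpdf end
    function cdf end

    # Generic fallback: any type that defines logpdf gets pdf for free.
    pdf(d::AbstractUnivariate, x::Real) = exp(logpdf(d, x))
end

# A downstream package that depends only on the small API package.
module MyExponentialSketch
    using Random
    import ..DistributionsBaseSketch: AbstractUnivariate, logpdf, cdf

    struct MyExponential <: AbstractUnivariate
        rate::Float64
    end

    logpdf(d::MyExponential, x::Real) = x < 0 ? -Inf : log(d.rate) - d.rate * x
    cdf(d::MyExponential, x::Real) = x < 0 ? 0.0 : 1 - exp(-d.rate * x)
    # Inverse-CDF sampling, extending Base.rand.
    Base.rand(rng::AbstractRNG, d::MyExponential) = -log(1 - rand(rng)) / d.rate
end

using .DistributionsBaseSketch
using .MyExponentialSketch: MyExponential

d = MyExponential(2.0)
p = pdf(d, 0.5)                      # generic fallback from the API package
x = rand(Random.default_rng(), d)    # sampler defined by the downstream package
```

From the user’s side, `MyExponential` behaves like any other distribution that follows the interface, even though it lives outside the main package.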
3 Likes