Reworking Distributions.jl

Sure it is. My point above was that you can simply

  1. have a method for each permutation, written out “manually”,
  2. generate those methods with a macro, or
  3. pre-sort NamedTuples into some canonical order.

I think (3) is cleanest, but this should be checked (rough sketch below).
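A minimal sketch of (3), canonicalizing a NamedTuple by sorting its keys; canonical is a made-up helper name, not an existing function:

function canonical(nt::NamedTuple)
    ks = Tuple(sort(collect(keys(nt))))          # keys in a canonical (sorted) order
    return NamedTuple{ks}(Tuple(nt[k] for k in ks))
end

canonical((b = 2.0, a = 1.0))  # (a = 1.0, b = 2.0)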

And do you actually need all this detail to be available from the API directly, before sampling?

IMO the best approach to complicated types is heuristic: get a value, and suppose that all values are like that (maybe the compiler can infer it, which is nice), otherwise degrade gracefully with a branch. This is pretty much what collect does, for example.
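Something like this sketch, for instance; sample_many is a hypothetical name, and Base’s collect does something similar with more careful widening:

using Random

function sample_many(rng, d, n)
    x = rand(rng, d)                   # look at one value
    out = Vector{typeof(x)}()          # assume the rest will look like it
    push!(out, x)
    for _ in 2:n
        xi = rand(rng, d)
        if !(xi isa eltype(out))       # degrade gracefully: widen the container
            out = convert(Vector{promote_type(eltype(out), typeof(xi))}, out)
        end
        push!(out, xi)
    end
    return out
end

sample_many(Random.default_rng(), 1:10, 5)  # Vector{Int}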

I don’t think that we should go overboard with interfaces that operate on types, even if they are useful occasionally.

Eg have a function value_kind that returns a trait describing sampled values: a discrete value, a real number, a vector, etc. When it is a real number, value_bounds would be defined, returning a tuple of bounds, which may be infinite.
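Roughly like this (all names below are hypothetical, just to illustrate the trait idea):

abstract type ValueKind end
struct DiscreteValued <: ValueKind end
struct RealValued     <: ValueKind end
struct VectorValued   <: ValueKind end

value_kind(d) = RealValued()          # fallback trait

# Defined only when value_kind(d) isa RealValued; bounds may be infinite.
value_bounds(d) = (-Inf, Inf)

# e.g. for a hypothetical uniform-on-(a, b) distribution:
# value_kind(::MyUniform) = RealValued()
# value_bounds(d::MyUniform) = (d.a, d.b)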

While I don’t think that the current API is ideal, we should realize that it is the accretion of many years of contributions, a lot of which happened before current patterns of Julia programming evolved. An API redesign is something that usually happens when someone gets sufficiently fed up with the current API to put in all the work :wink:

1 Like

I think the parameter space as a type parameter can just be left out. I am not saying that the particular distributions should not be parametrised by the types of their parameters, just that the abstract type does not really need this. There is no point in dispatching on “Distribution with Float64 type parameter”; it is not even clear whether the type parameters are values (e.g. a location parameter), whole numbers, or probabilities (e.g. in a Binomial).

There are two quantities which perhaps would be useful to know statically on the abstract type level: the type of the samples and the type in which probabilities are measured.

Though right now I cannot really anticipate a use for dispatching a generic distribution on the type of its probabilities. In any case, if we leave the probability type out, we can still use a trait with fallback probtype(::Any) = Float64, which matches what we are doing now.
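For example (probtype and MyBernoulli are hypothetical names):

probtype(::Any) = Float64                 # fallback, matching current behavior

struct MyBernoulli{T<:Real}
    p::T
end

probtype(::MyBernoulli{T}) where {T} = T  # a distribution can opt in to another type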

The sample type could be part of the abstract type Sampleable, because if you solicit numbers from a random stream it would be good to have a default.

In total

abstract type Sampleable{X} end

should be more than enough and even

abstract type Sampleable end

would be workable by hooking into the eltype trait, which already exists and makes perfect sense as a type trait for Sampleables.
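For example, a minimal sketch of the unparametrised option; ScaledUniform is a made-up type, and the rand method is just for illustration:

using Random

abstract type Sampleable end

# A hypothetical sampler over [0, width):
struct ScaledUniform{T<:AbstractFloat} <: Sampleable
    width::T
end

Base.eltype(::Type{ScaledUniform{T}}) where {T} = T

# Generic code recovers the sample type from the eltype trait rather than
# from a parameter on the abstract type:
Base.rand(rng::AbstractRNG, s::ScaledUniform) = s.width * rand(rng, eltype(typeof(s)))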

2 Likes

I’m very on-the-fence about this whole unordered parameters thing. It could be handy in some cases, but most distributions have a standard ordering. Anyway, I think this is a detail we don’t need to worry about too much until we have a few more examples.

“Need” is a strong word. I’m getting by without it, but things could be much better. I’m not sure what you mean by “all this detail”, which seems to imply unneeded complexity. But the proposal is much simpler than the current API.

I’ve done this in some cases, and have been disappointed it’s not cleaner. To me it feels heavy-handed and cumbersome to write, and there’s some definite performance overhead. This kind of thing is the whole idea of type systems, so why not take advantage of what we have?

This is an interesting perspective. In your view, is the proposal “overboard” in this way?

Again interesting. Types give us a way to reason statically about the behavior of a program. This helps us catch bugs, prevent dynamic dispatch, and find opportunities for optimization. A great thing about Julia is the ability to encode whatever you know statically into the type system, without any of it being forced. In my experience the usefulness isn’t “occasional”, but “almost always”. The difficulty with the current API isn’t that it uses types, but that the information we need isn’t there, and what is there doesn’t lend itself to extension.

I don’t understand how this is simpler than just having the return type available as a parameter. How many of these functions would there need to be? Would adding a “tree” type require going back through all existing distributions and adding “not a tree”? Would adding a new distribution require manually specifying which of these traits it satisfies? In that case we’d be back to non-extensibility.

Absolutely. It’s unusual for any version 1.0 API to anticipate all use cases, and Distributions is clearly written with classical statistics in mind, rather than probabilistic programming. For classical stats, “here’s a fixed, finite set of distributions you can use” is really common. But for probabilistic programming that’s not enough.

There are certainly benefits to community effort, and building interest in this is one of the goals of this discussion. I’d much rather have something that builds to a widely-used and community-maintained library, than a quick one-off without many distributions or users. There may be some intermediate growing pains, but I think stats in Julia can be much better, and in the long run it’s well worth it.

I disagree. To me the most obvious argument for the proposed API is that it treats distributions as an abstraction of functions, typed in a way familiar to anyone who has used functional programming. This is a big benefit when you start thinking about abstraction and composition.

From a more traditional perspective, consider parameterizing a Gaussian by (mean, variance) vs (mean, precision). Or a binomial by probability vs logit(probability). These things can have significant benefits, both in terms of convenience and (statistical and computational) performance.
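As a rough, hypothetical illustration of how alternative parameterizations could coexist by dispatching on the names carried in a NamedTuple (MyNormal and normlogpdf are made-up names, not part of any package):

struct MyNormal{P<:NamedTuple}
    par::P
end

# mean / variance parameterization
normlogpdf(d::MyNormal{<:NamedTuple{(:μ, :σ²)}}, x) =
    -0.5 * (log(2π * d.par.σ²) + (x - d.par.μ)^2 / d.par.σ²)

# mean / precision parameterization: no division needed
normlogpdf(d::MyNormal{<:NamedTuple{(:μ, :τ)}}, x) =
    0.5 * (log(d.par.τ / (2π)) - d.par.τ * (x - d.par.μ)^2)

normlogpdf(MyNormal((μ = 0.0, σ² = 2.0)), 1.0) ≈
    normlogpdf(MyNormal((μ = 0.0, τ = 0.5)), 1.0)   # true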

Do you mean the return type of pdf? You may be right, but it’s not clear to me yet that this is intrinsic to a distribution.

In my experience, the best way to get fast and safe code is to get as much information to the compiler as possible. Usually “the compiler” means Julia, but in the case of probabilistic programming, it may also be something written by the user.

Can you help me understand the point of eliminating the P in Sampleable{P,X}?

1 Like

The proposal is not sufficiently detailed for me to form a definite opinion, since the devil is in the parameters X. If they end up being too complex, it may not be worth it (for most users). Unfortunately, I think that the smallest unit of code that will demonstrate the advantages and disadvantages of any proposal is a PR rewriting at least a significant portion of the package.

I have a very different take on this. I don’t want to reason about types if I can avoid it, it’s something that the compiler is good at, and for me it is a chore that I only do when needed (eg optimizing some code).

If I find myself calculating a lot of stuff in type space, I consider that bad. I do recognize that it is sometimes necessary, and occasionally we have to design for it, but it can get very complex rather quickly, so for me it is a red flag in API design in Julia.

Conversely, the fewer type parameters the user has to care about, the better. Most callers should never care about the type of some distribution parameters, and while some will need to know that eg Normal returns a Float64, the more callers can ignore it, the better the API will be.

2 Likes

I disagree. This is enough of a change that it should be a separate package. My hope is that some will be enthusiastic enough about it to help, and that the benefits will soon become clear, allowing deprecation of the old API, or at least widespread support for the parameterized API.

I agree! Types should be “opt-in” when possible. All statically-available information should be easy for package designers to access, but we should minimize the need for end users to deal with types directly. My hope is that the proposed API will improve on this point.

Distributions.jl definitely needs some reworking.

For example, the current definition of Uniform is not type stable.
https://github.com/JuliaStats/Distributions.jl/issues/1041

I tried adding a method in https://github.com/JuliaStats/Distributions.jl/pull/1035/ to make it usable, but it doesn’t help.

using Distributions
set = Uniform(Float32(0.0), Float32(2.0))
v = rand(set, 3)
julia> eltype(v)
Float64

The definition should not be dependent on the type of a and b.
https://github.com/JuliaStats/Distributions.jl/blob/master/src/univariate/continuous/uniform.jl#L26

Instead, the user should provide the type explicitly, defaulting to Float64 if no type is given.

The following is my proposed definition.
For an Array-like constructor (Uniform{T}(a, b)):

struct Uniform{T<:Real} <: ContinuousUnivariateDistribution
    a::T
    b::T
    # a and b are converted to the explicitly chosen T
    Uniform{T}(a, b) where {T <: Real} = new{T}(a, b)
end

For a zeros-like constructor (Uniform(T, a, b)):

function Uniform(::Type{T}, a, b) where {T <: Real}
    return Uniform{T}(a, b)
end

Constructor with no type specified:

function Uniform(a, b)
    return Uniform{Float64}(a, b)
end

API examples:

julia> Uniform(1,2)
Uniform{Float64}(a=1.0, b=2.0)

julia> Uniform(Float32, 1,2)
Uniform{Float32}(a=1.0f0, b=2.0f0)

julia> Uniform{Int64}(1,2)
Uniform{Int64}(a=1, b=2)
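As a sketch (not part of the linked PR), the explicit parameter T could then drive the sample type, independent of how a and b were supplied; shown only for floating-point T:

import Random

Base.rand(rng::Random.AbstractRNG, d::Uniform{T}) where {T<:AbstractFloat} =
    d.a + (d.b - d.a) * rand(rng, T)

julia> rand(Random.default_rng(), Uniform(Float32, 0, 2)) isa Float32
true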

This is a little different from the usual definition of type stability, which only asks that the return type be predictable from the argument types, but it’s certainly worth making more regular.

Ok, but your proposal still uses the old API. Have you read the original post in this thread? The proposed API would easily handle your request, since we could have

function Uniform(a::A, b::B) where {A <: Real, B <: Real}
    T = promote_type(A, B)
    return Uniform{Tuple{T,T}, T}(T(a), T(b))
end
2 Likes

I think the burden of proof is on your side. :wink: Can you come up with a situation where you would like to dispatch on / work with the parameters of a distribution at the abstract type level, without having a concrete type in mind? Especially in a way which cannot be handled using a function params?

2 Likes

Haha, ok. The most obvious is composition of Sampleables:

∘(::Sampleable{B,C}, ::Sampleable{A,B}) :: Sampleable{A,C}

[You could do this for Distributions as well, but implicitly this integrates over the intermediate variable. Maybe we also need something that doesn’t? This could be similar to Gen’s concept of a ChoiceMap. @Marco_Cusumano-Towne or @alex-lew, any thoughts on this?]
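A minimal, hypothetical sketch of that composition, with Sampleable2 standing in for the proposed two-parameter Sampleable{P,X} and sample_from standing in for whatever the sampling function ends up being:

using Random

abstract type Sampleable2{P,X} end        # P: parameter type, X: sample type

struct Composed{A,C,G,F} <: Sampleable2{A,C}
    g::G   # Sampleable2{B,C}
    f::F   # Sampleable2{A,B}
end

Base.:∘(g::Sampleable2{B,C}, f::Sampleable2{A,B}) where {A,B,C} =
    Composed{A,C,typeof(g),typeof(f)}(g, f)

# Sampling feeds the inner sample forward as the outer sampler's parameter.
sample_from(rng::AbstractRNG, c::Composed, θ) =
    sample_from(rng, c.g, sample_from(rng, c.f, θ))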

Soss’s For combinator will simplify a lot:

For(::AbstractArray{P,N}, ::Distribution{P,X}) <: Distribution{AbstractArray{P,N}, X}

And it will be great to be able to more easily build Markov chains:

Unfold(::Distribution{P,X}, ::Distribution{X,X}) :: Distribution{P, Sequence{X}}

Sequence will be an iterator of some sort, but we don’t seem to have an AbstractIterator type (which is super weird, why don’t we have that?)

Note that this isn’t assuming we have “built-in” Distribution{X,X} types; the whole point of this is to be able to easily build things on the fly.

Some things to think about:

  • A given type will sometimes have a rand or logpdf, or neither, or both. The “logpdf but no rand” case isn’t covered by the original API, but it comes up, for example in energy-based models.
  • Will there ever be cases where we know a given type has a rand method, but we don’t realize there’s also a logpdf? Or vice-versa? Can this cause trouble?
  • Should we be focusing so narrowly on distributions, rather than on measures more broadly? (#manifolds Slack with @sethaxen got me thinking about this)
  • If we represent measures, we should have a way to easily get to Lebesgue measure on the same space.
  • How can we bring mathematical laws in for use in tests?
  • If we went in that direction, could we still keep it easy to extend and add new distributions and combinators?
3 Likes

I think we had this discussion in the past, but I would prefer Distributions.jl to be used for the well-known families of distributions people use as building blocks, which have a PDF, CDF, and IID random sampler available.

While technically all simulation codes or log pdfs define distributions, they lack either an IID random sampler or a pdf, so it is not very useful to use the same API for them in practice.

3 Likes

A few things about this are confusing to me:

  • As described above, this will not (at least at first) be Distributions.jl.
  • The current Distributions.jl includes Sampleable for the likelihood-free case
  • In Distributions.jl (and in life in general) CDF is typically only available for the one-dimensional case

This is currently true, at least in part because defining a Distribution or Sampleable takes a lot of work.

Overall, I think I’m missing what point you’re making here, could you give some more detail?

All I was suggesting is that I prefer Distributions.jl to be an implementation of the distributions which have some sort of a closed-form characterization (this is obviously subjective, but let’s say you need at least a pdf and IID sampling; CDF/quantiles are optional, having at least a mean is nice when known in closed form, other moments as available).

So, to me, the ideal Distributions.jl is just code implementing formulas and tables usually found in the appendix of some classical stats textbook or a paper. The “basic” stuff, which at the same time can be difficult to program because of all the finicky details.

Many other things are distributions, ie basically any finite measure can be used to define one, and the outcome of a simulation also defines a distribution. I find it great that people want to experiment with describing these things in a single unified conceptual framework, and I follow this with interest, but I don’t think that Distributions.jl should be used to hammer out these ideas.

Of course, if it needs to be changed just a bit to support these projects, that’s great and it should be done. But I don’t think that Distributions.jl is the right place to think about distributions over arbitrary objects, DSLs to describe them, generalized sampling, etc.

I mostly agree with this. I’m not proposing a redesign, but a replacement. The “basic” stuff should of course be included. But the fact that we have a commonly used thing called a Distribution that’s so painfully non-extensible is something that needs to be improved.

We’re well past band-aids; the problem is fundamental to the design.

I see this a lot like the view of Julia from the outside. People say they’d rather have an expressive language than a fast one, without considering that they can have both.

I think it’s entirely possible to have an API that’s simpler than the existing one, and also extensible.

I don’t think anyone is questioning this, it’s just that the discussion would be more focused with an actual implementation of this API.

I agree, but it’s easier to lead with some discussion in case there are aspects of this I’m missing. I think my mental model of how this will work is converging, and I hope to get a proof of concept together over the next week or so.

5 Likes

Looking forward to it!

2 Likes

Ari Katz (are you on here, Ari?) had a question on Slack #manifolds about GPU methods. This is another benefit of including the return type as a parameter: it’s easy to add a rand(::SomeDistribution{P,X}) where {X<:CuArray} method that specializes for GPU.
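A hypothetical sketch of such a specialization; MyDist, its fields, and both rand methods are made up for illustration, with CUDA.jl assumed for the device RNG:

using CUDA

struct MyDist{P,X}
    μ::P
    n::Int
end

# CPU method: plain Vector samples
Base.rand(d::MyDist{P,Vector{Float64}}) where {P} = d.μ .+ randn(d.n)

# GPU method, selected purely by the sample-type parameter X
Base.rand(d::MyDist{P,X}) where {P,X<:CuArray{Float32}} =
    Float32(d.μ) .+ CUDA.randn(Float32, d.n)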

2 Likes

I’ll follow that with great interest. :+1:t2:

1 Like

Making some progress:
https://github.com/cscherrer/Measures.jl

6 Likes