Hello all,
I’ve been thinking some more about a refactoring of Distributions.jl, and I’d like to ask for feedback on a proof of concept.
First, the abstract types:
abstract type Sampleable{P,X} end
abstract type Distribution{P,X} <: Sampleable{P,X} end
Here P
is the parameter space, and X
is the type of the sample space. Critically, there are no constraints on either of these at the type level. If you want X
to be something other than <: Real
or <: Array
, go for it. No problem.
P
can be anything, but NamedTuple
s are especially convenient, because they make it easy to use different parameterizations. For example, a Gamma
could look like this:
struct Gamma{P,X} <: Distribution{P,X}
par :: P
Gamma(nt::NamedTuple) = Gamma{typeof(nt), typeof(nt[1])}(nt)
end
Again, the types are very flexible. A Gamma
over symbolic values, for example, is easy.
Passing a NamedTuple
is a bit awkward, but that’s easily fixed with
Gamma(;kwargs...) = Gamma((;kwargs...))
We may also want a default parameterization without requiring names:
Gamma(k :: Real, θ :: Real) = Gamma(k=k, θ=θ)
logpdf
then looks like this:
function logpdf(d::Gamma{P,X}, x::X) where {P <: NamedTuple{(:k,:θ)}, X <: Real}
k = d.par.k; θ = d.par.θ
(k - 1) * log(x) - x/θ - k * log(θ) - loggamma(k)
end
This is for a shape k
and scale θ
. Maybe we’d rather parameterize by a shape and rate? Ok:
function logpdf(d::Gamma{P,X}, x::X) where {P <: NamedTuple{(:α,:β)}, X <: Real}
α = d.par.α; β = d.par.β
(α - 1) * log(x) - β*x + α * log(β) - loggamma(α)
end
Or by a shape and a mean?
function logpdf(d::Gamma{P,X}, x::X) where {P <: NamedTuple{(:k,:μ)}, X <: Real}
μ = d.par.μ; k = d.par.k
α = k; β = k/μ
(α - 1) * log(x) - β*x + α * log(β) - loggamma(α)
end
This is already much faster than Distributions (at least in part due to lack of inlining in the latter):
julia> @btime logpdf(Gamma(2.0,1.0),4.0)
13.964 ns (0 allocations: 0 bytes)
-2.613705638880109
vs
julia> using Distributions
julia> @btime logpdf(Gamma(2.0,1.0),4.0)
36.097 ns (0 allocations: 0 bytes)
-2.613705638880109
Some things to consider:
- Lots of packages depend on
Distributions.jl
, and these breaking changes could take a while to be fully supported - Defining separate methods for each parameterization is a bit verbose, but there’s always the option to fall back on one canonical parameterization
- Some aspects of this would probably need to be implemented using traits. But let’s not rush to that until it’s clearly the best approach
- This approach specifies a parameterization in terms of the names for a
NamedTuple
. But there’s likely to be some contention exactly what those names should be, for example should we replacek
andθ
withshape
andscale
? - Fields in
NamedTuple
s are ordered, which is a little troubling for this use case. There may be a clever way to dispatch according to the set of names, rather than the tuple itself