Distributions.jl structure

How can I automatically find all univariate continuous cdfs in Distributions.jl?
I believe this can make a lot of tasks much easier including maintaining Distributions.jl.

Eg MLJ.jl: using MLJ; X, y = @load_boston;
m = models(): creates a vector of all 132 models.
m = models(matching(X, y)): vector of 53 models that work w/ the data
m = models(matching(X, y), x -> x.prediction_type == :deterministic): vector of 50 models

In Distributions.jl

using Distributions, Random;    
Random.seed!(123);

Currently: Distributions.continuous_distributions creates a vector{string} w/ 48 elements
We cannot use this bc arcsine should be Arcsine & betaprime should be BetaPrime etc
I scraped some from the repo:

Code

Cts_Uni =["Arcsine",
         "Beta",
         "BetaPrime",
         "Cauchy",
         "Chernoff",
         "Chi",
         "Chisq",
         "Cosine",
         "Epanechnikov",
         "Erlang",
         "Exponential",
         "FDist",
         "Frechet",
         "Gamma",
         "GeneralizedExtremeValue",
         "GeneralizedPareto",
         "Gumbel",
         "InverseGamma",
         "InverseGaussian",
         "Laplace",
         "Levy",
         "Logistic",
         "LogNormal",
         "NoncentralBeta",
         "NoncentralChisq",
         "NoncentralF",
         "NoncentralT",
         "Normal",
         "NormalInverseGaussian",
         "NormalCanon",
         "Pareto",
         "Rayleigh",
         "StudentizedRange",
         "SymTriangularDist",
         "TDist",
         "TriangularDist",
         "TruncatedNormal",
         "Uniform",
         "VonMises",
         "Weibull"];
Suppose I wanna compute the mean of a random variable w/ the default parametrization of each dist:
Er =[];
D_mean =[];
for d in Cts_Uni
    println(d)
    try
        d0 = "$(d)()" |> Meta.parse |> eval
        μ  = mean(d0)
        push!(D_mean, (d, μ))
    catch e
        println("!!! Error ", d, e)
        push!(Er, (d,e) )
    end
end
  1. We automatically see 13/40 distributions don’t have default parameters.
    Example: Chi() gives MethodError: no method matching Chi()
  2. 7 of the 27/40 dist w/ default parameters give mean= NaN or mean=Inf.
    Eg: BetaPrime(), Cauchy(), Pareto()…
Er =[];
D_ent =[];
for d in Cts_Uni
    println(d)
    try
        d0 = "$(d)()" |> Meta.parse |> eval
        ε = entropy(d0)
        push!(D_ent, (d, ε))
    catch e
        println("!!! Error ", d, e)
        push!(Er, (d,e) )
    end
end

5/27 dist w/ default parameters don’t have entropy.
(It doesn’t say NaN or Inf just gives an error message.)
Perhaps: “The entropy for this distribution has not been coded. Please submit a PR.”

Same w/ quantiles:

Er =[];
D_q =[];
for d in Cts_Uni
    println(d)
    try
        d0 = "$(d)()" |> Meta.parse |> eval
        q = quantile(d0, 0.025)
        push!(D_q, (d, q))
    catch e
        println("!!! Error ", d, e)
        push!(Er, (d,e) )
    end
end

Suppose I have some data & I want to find the probability distribution that best fits it.

DGP_True = LogNormal(-1.5);
const ds = rand(DGP_True, 1_000); #Training Data.
x = range(-2, stop=3, length=200);  #Test Data.
#
Er =[];
D_fit  =[];
for d in Cts_Uni
    println(d)
    try
        dd = "$(d)"   |> Meta.parse |> eval
        D̂ = fit(dd, ds)
        Score = loglikelihood(D̂, ds) #TODO: compute a better score.
        push!(D_fit, [d, D̂, Score])
    catch e
        println(e, d)
        push!(Er, (d,e))
    end
end
#
Er
D_fit
a=hcat(D_fit...)
M_names =  a[1,:]
M_fit   =  a[2,:]
M_scores = a[3,:]
idx =sortperm(M_scores, rev=true)
Dfit_sort=hcat(M_names[idx], M_fit[idx], sort(M_scores, rev=true) )
#
using Plots
plot( x, pdf.(DGP_True, x), label= "DGP_True" )
plot!(x, pdf.(Dfit_sort[1,2], x), label=Dfit_sort[1,1]  )
plot!(x, pdf.(Dfit_sort[2,2], x), label=Dfit_sort[2,1]   )
plot!(x, pdf.(Dfit_sort[3,2], x), label=Dfit_sort[3,1]   )
plot!(x, pdf.(Dfit_sort[4,2], x), label=Dfit_sort[4,1]   )

image

There is a growing beautiful literature on probabilistic forecasting that can benefit greatly from more structure in Distributions.jl:
ngboost, CatBoostLSS, XGBoostLSS , Gamlss, GamboostLSS, bamlss , disttree

Do you have ideas for better ways to organize Distributions.jl?

1 Like

I’ve recently started working with Distributions.jl quite a bit, and my impression is that it’s very well-organized. There are some gaps in what’s actually implemented, like the one you pointed out with entropy, but I don’t think that’s an issue with the structure of the package itself.

I can see the value in having a built-in collection of all the distribution types in Distributions.jl, rather than just their names, simliar to MLJ.models(). You can achieve something analogous to MLJ.matching by filtering a series of distributions on a condition like all(insupport.(d, y)).

You can do this a bit more easily by iterating over subtypes of ContinuousUnivariateDistribution:

using Distributions

for distribution = filter(!isabstracttype, subtypes(ContinuousUnivariateDistribution))
    try
        println(string(distribution, " distribution has mean ", mean(distribution()), "."))
    catch e
        if isa(e, MethodError)
            print(string(distribution, " distribution has no default constructor.\n"))
        else
            throw(e)
        end
    end
end

I don’t have a strong opinion either way on this. I can see the argument for making default constructors for every distribution, but not all distributions have a “standard” parameterization like Normal(), and I’m struggling to think of a real world use case where you’d want to use, say, a Chi distribution without specifying the degrees of freedom explicitly.

Some distributions don’t have a well-defined mean, either in general or for particular sets of parameters. I’m not sure why some would return NaN and others would return Inf, so there might be room for some clean-up there. (Or there might be a good reason for it.)

If those distributions do have undefined or infinite entropy in practice, I agree that entropy should return NaN or Inf as appropriate. Otherwise, throwing a MethodError makes sense - their entropy isn’t NaN/Inf, Julia just doesn’t know what it is! In general, if a method could and should exist but throws a MethodError, you can safely interpret that as “This method has not been coded. Please submit a PR.”

1 Like

Thank you @tfehring.
You gave a convenient way to get all distributions of a certain type (that I didn’t see in the docs)

filter(!isabstracttype, subtypes(Distribution))
filter(!isabstracttype, subtypes(UnivariateDistribution))
filter(!isabstracttype, subtypes(MultivariateDistribution))
filter(!isabstracttype, subtypes(MatrixDistribution))
##
filter(!isabstracttype, subtypes(ContinuousDistribution))
filter(!isabstracttype, subtypes(ContinuousUnivariateDistribution))
filter(!isabstracttype, subtypes(ContinuousMultivariateDistribution))
filter(!isabstracttype, subtypes(ContinuousMatrixDistribution))
##
filter(!isabstracttype, subtypes(DiscreteDistribution))
filter(!isabstracttype, subtypes(DiscreteUnivariateDistribution))
#no categorical.
filter(!isabstracttype, subtypes(DiscreteMultivariateDistribution))
filter(!isabstracttype, subtypes(DiscreteMatrixDistribution))

Why doesn’t filter(!isabstracttype, subtypes(MultivariateDistribution)) include MVNormal?

MvNormal is a subtype of AbstractMvNormal, which is a subtype of MultivariateDistribution. It’s structured this way because there’s an alternate parameterization of the multivariate normal distribution, MvNormalCanon, that’s also a subtype of AbstractMvNormal. You can use supertype(MvNormal) to confirm this.

I don’t know of a built-in way to recursively traverse the type hierarchy all the way down to concrete types, and having just tried to implement it myself it’s a little bit trickier than I expected. That would be a nice addition to core Julia though, IMO.

2 Likes

Quick update on my last comment - turns out this is about as easy as I originally expected:

function concrete_subtypes(x::Type)::Vector{Type}
    s = subtypes(x)
    sort!(vcat(filter(!isabstracttype, s), concrete_subtypes.(filter(isabstracttype, s))...), by=string)
end

That allows you to do something like the following:

using Distributions
concrete_subtypes(MultivariateDistribution)
10-element Array{Type,1}:
 Dirichlet
 DirichletMultinomial
 Distributions.GenericMvTDist
 MixtureModel{Multivariate,VS,C,CT} where CT<:Real where C<:Distribution where VS<:ValueSupport
 Multinomial
 MvLogNormal
 MvNormal
 MvNormalCanon
 Product
 VonMisesFisher
1 Like

Is it possible to fit a truncated distribution?

using Distributions
xs = rand(LogNormal(), 100)
fit(Normal, xs)
fit(truncated(Normal, lb, ub), xs) #ERROR