Construct a generic discrete distribution in Distributions.jl?

This may be a silly question, but what is the easiest/most “Julian” way to construct a generic discrete distribution type, given a vector of outcomes and a vector of probabilities of the same length?

It seems one workaround is to exploit the fact that Uniform(x, x + eps(x)) is essentially a point mass at x for any floating-point x. A MixtureModel can then be constructed to “approximate” the discrete distribution, with each component of the form Uniform(x, x + eps(x)).
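For concreteness, a minimal sketch of that hack (untested; outcomes and probs are just placeholder names):

using Distributions

outcomes = [1.5, 2.0, 7.3]   # placeholder outcome values
probs    = [0.2, 0.5, 0.3]   # matching probabilities, summing to 1

# Each component is (numerically) a point mass at one outcome.
d = MixtureModel([Uniform(x, x + eps(x)) for x in outcomes], probs)

rand(d)   # draws (approximately) from the intended discrete distribution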

This is clearly a hack and is not the most performant way. Is there a plan to add such a distribution type to Distributions.jl or is there a convenient way to do it?

I think a pull request to implement such a distribution would be well-received.

I’d use the DiscreteUniform distribution as a workaround instead of the Uniform hack.

Isn’t the Categorical distribution from Distributions almost what you want here? It takes a custom probability vector but it doesn’t allow custom outcomes.


Isn’t the Categorical distribution from Distributions almost what you want here?

Yeah, I was wrong to suggest DiscreteUniform – it doesn’t allow custom probability vectors.

It’s this issue. I adapted the Categorical code, but I didn’t make a PR out of it:

""" Provides AnyCategoricals, which is just like Distributions.Categorical, but
can have an arbitrary domain (aka support).  """
module AnyCategoricals

export AnyCategorical

importall Distributions
using Distributions: NoArgCheck, @check_args, isprobvec

doc"""
    AnyCategorical(support, p)
An *AnyCategorical distribution* is parameterized by a probability vector `p` over a support vector `support` of the same length.
$P(X = \text{support}[k]) = p[k] \quad \text{for } k = 1, 2, \ldots, K.$

`p` must be a real vector, of which all components are nonnegative and sum to one.
**Note:** The input vector `p` is directly used as a field of the constructed distribution, without being copied.
External links:
* [Categorical distribution on Wikipedia](http://en.wikipedia.org/wiki/Categorical_distribution)
"""
immutable AnyCategorical{U, T<:Real} <: DiscreteUnivariateDistribution
    support::Vector{U}
    p::Vector{T}

    AnyCategorical(support::Vector{U}, p::Vector{T}, ::NoArgCheck) =
        new(support, p)

    function AnyCategorical(support::Vector{U}, p::Vector{T})
        @assert isprobvec(p) "The given vector $p must sum to 1"
        @assert length(support) == length(p) "$support and $p must have the same length"
        @assert length(unique(support)) == length(support) "Duplicates in `support`"
        new(support, p)
    end
end
AnyCategorical{U, T}(support::Vector{U}, p::Vector{T}) = 
    AnyCategorical{U, T}(support, p)

AnyCategorical{K, V}(di::Associative{K, V}) =
    AnyCategorical{K, V}(collect(keys(di)), collect(values(di)))

ncategories(d::AnyCategorical) = length(d.support)
probs(d::AnyCategorical) = d.p
params(d::AnyCategorical) = (d.support, d.p,)
support(d::AnyCategorical) = d.support
@inline partype{U, T<:Real}(d::AnyCategorical{U, T}) = T

mode(d::AnyCategorical) = support(d)[indmax(probs(d))]

get_index(d::AnyCategorical, x) = findfirst(y -> y == x, d.support)  # position of x in the support

pdf(d::AnyCategorical) = copy(d.p)
_pdf{U, T}(d::AnyCategorical{U, T}, x) =
    insupport(d, x) ? d.p[get_index(d, x)] : zero(T)
pdf(d::AnyCategorical, x) = _pdf(d, x)

insupport(d::AnyCategorical, x) = x in support(d)

# Disambiguations
pdf(d::AnyCategorical, x::Integer) = _pdf(d, x)
pdf(d::AnyCategorical, x::Real) = _pdf(d, x)
insupport(d::AnyCategorical, x::Integer) = x in support(d)
insupport(d::AnyCategorical, x::Real) = x in support(d)
end
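
For illustration, here is how the module above could be used, assuming it behaves as written (the coin-flip values are made up):

using AnyCategoricals   # after include-ing the module definition above

d = AnyCategorical(["heads", "tails"], [0.25, 0.75])

pdf(d, "heads")         # 0.25
pdf(d, "edge")          # 0.0 -- "edge" is not in the support
insupport(d, "tails")   # true
mode(d)                 # "tails", the most probable outcome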

I don’t think that’s the same issue. An arbitrary discrete distribution is something like a distribution that returns γ with probability 1/3 and returns π with probability 2/3. But that issue talks about an integer-offset version of Categorical.

Yes, I agree. I can use a Categorical random variable to generate the index into the outcome vector.
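
For example (a small sketch; the outcome values are taken from the γ/π example above):

using Distributions

outcomes = [0.5772156649015329, 3.141592653589793]   # γ and π
p = [1/3, 2/3]

ix = Categorical(p)        # distribution over the indices 1:2
outcomes[rand(ix)]         # one draw: γ with probability 1/3, π with probability 2/3
outcomes[rand(ix, 1000)]   # 1000 draws at once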

My issue is that I’d like to define a new distribution type, but that would require implementing a series of methods, many of which I probably won’t use (e.g. skewness, kurtosis, …). I was looking for a simpler approach.

For your own use, there’s no reason you need to define those methods. You can just inherit from the right type and define Base.rand. I do this all the time.

Could you share a boilerplate for it? I’ve tried the following (for the empirical distribution) but it doesn’t work.

using Distributions

type Empirical{T<:Real} <: Distributions.DiscreteUnivariateDistribution
     K::Int
     x::Vector{T}
    
    function (::Type{Empirical{T}}){T}(x::Vector{T})
        @check_args(Empirical, length(x) >= 1)
        new{T}(length(x), x)
    end    
end

function Distributions.rand(Empirical(x))
    x[rand(DiscreteUniform(1, length(x)))]
end

This kind of destructuring / pattern matching in a function argument is not supported. Try:

using Distributions

type Empirical{T<:Real} <: Distributions.DiscreteUnivariateDistribution
     x::Vector{T}
    
    function (::Type{Empirical{T}}){T}(x::Vector{T})
        Distributions.@check_args(Empirical, length(x) >= 1)
        new{T}(x)
    end    
end

function Base.rand(dist::Empirical)
    dist.x[rand(DiscreteUniform(1, length(dist.x)))]
end

Then rand(Empirical{Float64}([42.0, 24.0])) works as expected for me.

You also seem to have an extraneous K::Int field in your type definition (it is just length(x)).

Thanks! It appears the type parameter {Float64} must be spelled out explicitly when constructing an Empirical to pass to the customized rand function. How can this be fixed?

Simply add an outer constructor.

(::Type{Empirical}){T}(x::Vector{T}) = Empirical{T}(x)

This will then take the type from the vector’s element type.
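
For instance, the earlier call then works without the explicit parameter:

rand(Empirical([42.0, 24.0]))   # no {Float64} needed anymore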