Cdf for multinomial

Is it possible to get a cdf for the multinomial distribution?

julia> let n = 1000000
           k = 10
           d = Multinomial(n,k)
           r = rand(1:k,n)
           cdf(d, r)
           end
ERROR: MethodError: no method matching cdf(::Multinomial{Float64, Vector{Float64}}, ::Vector{Int64})

I looked through the source code, but could not find a method for MultivariateDiscrete. One approach is to sum the pmf for all possible permutations from 0 to the values in r.

n = 10
k = 2
d = Multinomial(n, k)
r = rand(d)

sum_pmf = 0.0 
for i1 ∈ 0:r[1],i2 ∈ 0:r[2] 
    sum_pmf += pdf(d, [i1,i2])
end
sum_pmf

Recursion is one approach to generalize. Perhaps there is something better?

1 Like

I take that back. After searching Google, i could not find information on the cdf. I think it may not be defined for the multi nomial. If you think about it for a moment, the sum must be fixed. So its not clear to me how the cdf would be computed with changing bounds. You may want to do a search and double check

1 Like

I found something here:

3 Likes

Being a multivariate distribution, the CDF could be as giving the probability that the sampled outcome vector is (elementwise) smaller or equal to some given count vector. Unfortunately, I’m not aware of an efficient way to compute this. Here is a simple implementation just summing over all potential outcomes:

function mycdf(md::Multinomial, counts)
    reduce(+,
           pdf(md, collect(ns))
               for ns in Iterators.product((0:min(k, md.n) for k in counts)...);
           init = 0.0)
end

You might be able to speed it up a bit more by only summing over the ns vectors in the support of the distribution, i.e., where sum(ns) == md.n, but would need some clever way of only generating that set of vectors in the first place.

1 Like

Sorry for the retracted post. I noticed an error. The code below needs further testing, but it might serve as a good starting point.

I also found the same source. However, I thought the restriction that the last element c[k] = N - sum(c[1:(k-1)]) to be arbitrary. Why not c[1], c[2], or c[j]? But as written, the sum is fixed to N, which is a correct restriction, but arguably arbitrary. Using the notation defined in that source, here is my first attempt using recursion:


using Distributions 
"""
k is the number of outcomes 
n is the number of occurrences for outcomes 1-k 
N = sum(n) is the total number of trials
c is the maximum number of occurrences for outcomes 1-(k-1). c[k] = N - sum(c[1:(k-1)])
"""
function compute_cdf(dist, c::Vector{Int}, n::Vector{Int}=zeros(Int, length(c)+1), i::Int=1, sum_pmf=0.0)
    if i > length(c)
        println("n $n, N $(sum(n))")
        sum_pmf += pdf(dist, n)
        return sum_pmf
    end

    for j ∈ 0:c[i]
        n[i] = j
        # maybe inelegant but needs to omit cases where c[k] < 0
        if sum(n[1:i]) > dist.n
            break
        end
        n[end] = dist.n - sum(n[1:(end-1)])
        sum_pmf = compute_cdf(dist, c, n, i + 1, sum_pmf)
    end
    return sum_pmf
end

# number of trials 
N = 10
# number of outcomes 
k = 3
# maximum occurences. c[k] = N - sum(c[1:(k-1)])
c = [2,3]
dist = Multinomial(N, k)
sum_pmf = compute_cdf(dist, c)
println("cdf: ", sum_pmf)

In the output below, N is fixed to 10 and there are (2 +1) * (3 +1) = 12 permutations, which seems reasonable to me.

n [0, 0, 10], N 10
n [0, 1, 9], N 10
n [0, 2, 8], N 10
n [0, 3, 7], N 10
n [1, 0, 9], N 10
n [1, 1, 8], N 10
n [1, 2, 7], N 10
n [1, 3, 6], N 10
n [2, 0, 8], N 10
n [2, 1, 7], N 10
n [2, 2, 6], N 10
n [2, 3, 5], N 10
cdf: 0.09586953208352386

I fixed a bug so that the code escapes the for loop once the sum N is reached. If we set the max values to N, it sums to 1 as expected:

# number of trials 
N = 10
# number of outcomes 
k = 3
# maximum occurences. c[k] = N - sum(c[1:(k-1)])
c = [10,10]
dist = Multinomial(N, k)
sum_pmf = compute_cdf(dist, c)
println("cdf: ", sum_pmf)

Please play around with this and see if it breaks somewhere. If someone finds a solution, it might be worth opening a PR on Distributions.jl. Hopefully, this helps as a starting point.

2 Likes

This can also be calculated with the following:

using Combinatorics
using Distributions

MultinomialCDF(D, c) = 
  sum(s->pdf(D, foldl((r,x)->(r[x]+=1; r), s; init=zeros(Int,length(c)))),
  multiset_combinations(1:length(c),c,D.n))

For example:

julia> D = Multinomial(10,3)
Multinomial{Float64, Vector{Float64}}(n=10, p=[0.3333333333333333, 0.3333333333333333, 0.3333333333333333])

julia> MultinomialCDF(D, [4,4,4])
0.3734186861758879

julia> MultinomialCDF(D, [10,10,10])
1.0000000000000002

(the latter is ≈ 1 as needed)

Some parameter checking would not hurt, but in this form it is a 1-liner :wink:

Also, the form of multiset_combinations used here is not so well documented, but still exported from Combinatorics (and used to calculate the documented case).

2 Likes

Very nice solution! I was wondering if some type of integer partition technique with padding would work e.g. [0, 3, 1], [0,0,4] etc. Your approach does just that.

The expression for the cdf in this source is difficult to read. Once I translated it into LaTeX, it was easier to read and I could see I misunderstood the constraint in P(…). The code Dan has is correct and mine is incorrect. Please disregard it.

The only improvement I can suggest is breaking up the oneliner into two functions that the padded integer partition (or better technical name) can be tested separately from the cdf calculation. For example:

padded_partition(c, s) = foldl((r,x)  -> (r[x]+=1; r), s; init=zeros(Int,length(c)))

MultinomialCDF(D, c) = sum(s -> pdf(D, padded_partition(c, s)),
    multiset_combinations(1:length(c), c, D.n)
)

Now we can look at the partitions and verify they sum to 5:

using Combinatorics
using Distributions
D = Multinomial(5, 3)
c = [2,3,4]
map(s -> padded_partition(c, s),
    multiset_combinations(1:length(c), c, D.n))

output


11-element Vector{Vector{Int64}}:
 [2, 3, 0]
 [2, 2, 1]
 [2, 1, 2]
 [2, 0, 3]
 [1, 3, 1]
 [1, 2, 2]
 [1, 1, 3]
 [1, 0, 4]
 [0, 3, 2]
 [0, 2, 3]
 [0, 1, 4]

@jar1, I think you can mark this as an intermediate solution.

@Dan, would you be interested in opening an issue on GitHub to add your solution, or allowing me to open an issue and give you credit? I think this would be useful for the community.

See also

for a representation of the multinomial CDF in terms of (truncated) Poisson random variables.

3 Likes

@Christopher_Fisher You can certainly open an issue.

As for the function, there is another optimization which I’ve thought of, avoiding the allocation of a zero vector for each partition. Specifically, it looked like this:

function MultinomialCDF(D, c)
    k = length(c)
    tmp = zeros(Int, k)
    sum(s->pdf(D, foldl((r,x)->(r[x]+=1; r), s; init=((tmp .= 0;tmp)))),
      multiset_combinations(1:k,c,D.n))
end

This isn’t as clean, and it still allocates, but those come from multiset_combinations. There are probably methods to do the iteration over combinations without allocation, and getting those into Combinatorics would be a real improvement, as these enumerations creep into many serious time-consuming algorithms.

1 Like

@Dan, thanks again for your help. I submitted an issue on Distributions.jl and will provide the link for future reference.

I second this.

That doesn’t make the calculation more efficient though, because a closed expression for the sum of truncated poisson rvs is not available. At least I didn’t find it.

Calculating the sum by convolution is at least polynomial while enumerating the subsets might get exponentially large. So perhaps worth looking into. Convolution could be calculated in the Fourier domain.

1 Like

@skleinbo @Dan Yeah, the convolution thing works and one can express the characteristic function of the sum of iid truncated Poissons as a simple product of incomplete gamma functions (times a ratio involving only an exponential and a factorial) with first argument an integer. Haven’t tried the inverse transformation, which would be the final step.

Frey described an alternative and included R code. Lebrun described another alternative that is more efficient though the code is only available by request. From Lebrun

It is only recently that reasonably efficient algorithms have been described (see Corrado 2011; Frey 2009), using completely different roots than the previous authors. But even with these algorithms, the only distribution for which it is possible to compute arbitrary rectangular probabilities is the multinomial one, with a polynomial space and time complexity.

We propose to change this situation by providing an algorithm which is essentially exact up to machine precision for all the multivariate discrete distributions considered in Butler and Stutton (1998) and Levin (1983), amongst which the multinomial, multivariate hypergeometric and multivariate Pólya distributions. In the multinomial case, our algorithm is more efficient than the algorithm described in Frey (2009), both with respect to space and time complexity, for an equivalent or even better accuracy when implemented in double precision.

The derivations are beyond my quick understanding, but the code in Frey can very easily be rewritten in Julia without any special functions beyond log-gamma (fairly direct translation below).

using SpecialFunctions

function multinom_cdf(lb, ub, nobs, probs)
    k = length(probs)
    sum_lb = sum(lb)
    sum_ub = sum(ub)

    logc = loggamma(nobs + 1) + sum(lb .* log.(probs)) - sum(loggamma.(lb .+ 1))
    r = nobs - sum_lb
    exc = sum_ub - sum_lb
    fac = zeros(exc + 1)
    sind = zeros(exc + 1)
    loc = 1
    fac[1] = 0

    for i in 1:k
        if ub[i] > lb[i]
            mi = ub[i] - lb[i]
            for j in 1:mi
                loc += 1
                fac[loc] = probs[i] / (lb[i] + j)

                if j == 1
                    sind[loc] = 1
                end
            end
        end
    end

    cprob = zeros(exc + 1)
    cprob[1] = 1.0

    for t in 1:r
        tbelow = [0;cumsum(cprob[1:exc])]
        fac2 = tbelow .* sind + [0;cprob[1:exc]] .* (1.0 .- sind)
        cprob = fac .* fac2
        m = maximum(cprob)
        logc += log(m)
        cprob = cprob ./ m
    end

    exp(logc + log(sum(cprob)))
end

@time multinom_cdf([0,0,0,0], [4,4,4,4], 10, [0.25,0.25,0.25,0.25])
@time multinom_cdf([0,0,0,0], [20,20,20,20], 60, [0.25,0.25,0.25,0.25])
@time multinom_cdf([0,0,0,0], [200,200,200,200], 600, [0.25,0.25,0.25,0.25])

Output

  0.000030 seconds (108 allocations: 19.781 KiB)
0.6889343261718746

  0.000173 seconds (608 allocations: 433.875 KiB)
0.7849628688153565

  0.048421 seconds (6.01 k allocations: 37.629 MiB, 30.49% gc time)
0.9999922199658132

For comparison

function MultinomialCDF(D, c)
    k = length(c)
    tmp = zeros(Int, k)
    sum(s->pdf(D, foldl((r,x)->(r[x]+=1; r), s; init=((tmp .= 0;tmp)))),
      multiset_combinations(1:k,c,D.n))
end

D1 = Multinomial(10, [0.25,0.25,0.25,0.25])
D2 = Multinomial(60, [0.25,0.25,0.25,0.25])
D3 = Multinomial(600, [0.25,0.25,0.25,0.25])

@time MultinomialCDF(D1, [4,4,4,4])
@time MultinomialCDF(D2, [20,20,20,20])
@time MultinomialCDF(D3, [200,200,200,200])

Output

  0.000260 seconds (1.44 k allocations: 67.672 KiB)
0.6889343261718759

  0.038326 seconds (125.75 k allocations: 6.001 MiB, 31.74% gc time)
0.7849628688153811

187.492780 seconds (1.28 G allocations: 46.206 GiB, 4.31% gc time)
0.999992219962507
4 Likes

The papers are gated, so a PDF would also be nice.

UPDATE: No need. Obtained PDFs through the usual channels.

Awesome. The scaling of this algorithm is much better. My only question is whether there are any licensing issues.

The thesis of Lebrun has a chapter where he mentions that the algorithm is implemented in the OpenTURNS library. This file seems to contain the corresponding C++ code released under the LGPL.

True, but if I didn’t make a mistake (which is very possible!), the incomplete gamma functions need to be evaluated at complex values, for which SpecialFunctions.jl has no implementation.

But the algorithm by Lebrun seems really nice!

2 Likes