Cdf for multinomial

jar1 · September 15, 2023, 10:10pm

Is it possible to get a cdf for the multinomial distribution?

julia> let n = 1000000
           k = 10
           d = Multinomial(n,k)
           r = rand(1:k,n)
           cdf(d, r)
           end
ERROR: MethodError: no method matching cdf(::Multinomial{Float64, Vector{Float64}}, ::Vector{Int64})

Christopher_Fisher · September 15, 2023, 11:00pm

I looked through the source code, but could not find a method for MultivariateDiscrete. One approach is to sum the pmf for all possible permutations from 0 to the values in r.

n = 10
k = 2
d = Multinomial(n, k)
r = rand(d)

sum_pmf = 0.0 
for i1 ∈ 0:r[1],i2 ∈ 0:r[2] 
    sum_pmf += pdf(d, [i1,i2])
end
sum_pmf

Recursion is one approach to generalize. Perhaps there is something better?

Christopher_Fisher · September 16, 2023, 1:52am

I take that back. After searching Google, i could not find information on the cdf. I think it may not be defined for the multi nomial. If you think about it for a moment, the sum must be fixed. So its not clear to me how the cdf would be computed with changing bounds. You may want to do a search and double check

nilshg · September 16, 2023, 6:24am

I found something here:

bertschi · September 16, 2023, 8:13am

Being a multivariate distribution, the CDF could be as giving the probability that the sampled outcome vector is (elementwise) smaller or equal to some given count vector. Unfortunately, I’m not aware of an efficient way to compute this. Here is a simple implementation just summing over all potential outcomes:

function mycdf(md::Multinomial, counts)
    reduce(+,
           pdf(md, collect(ns))
               for ns in Iterators.product((0:min(k, md.n) for k in counts)...);
           init = 0.0)
end

You might be able to speed it up a bit more by only summing over the ns vectors in the support of the distribution, i.e., where sum(ns) == md.n, but would need some clever way of only generating that set of vectors in the first place.

Christopher_Fisher · September 16, 2023, 10:59am

Sorry for the retracted post. I noticed an error. The code below needs further testing, but it might serve as a good starting point.

I also found the same source. However, I thought the restriction that the last element c[k] = N - sum(c[1:(k-1)]) to be arbitrary. Why not c[1], c[2], or c[j]? But as written, the sum is fixed to N, which is a correct restriction, but arguably arbitrary. Using the notation defined in that source, here is my first attempt using recursion:


using Distributions 
"""
k is the number of outcomes 
n is the number of occurrences for outcomes 1-k 
N = sum(n) is the total number of trials
c is the maximum number of occurrences for outcomes 1-(k-1). c[k] = N - sum(c[1:(k-1)])
"""
function compute_cdf(dist, c::Vector{Int}, n::Vector{Int}=zeros(Int, length(c)+1), i::Int=1, sum_pmf=0.0)
    if i > length(c)
        println("n $n, N $(sum(n))")
        sum_pmf += pdf(dist, n)
        return sum_pmf
    end

    for j ∈ 0:c[i]
        n[i] = j
        # maybe inelegant but needs to omit cases where c[k] < 0
        if sum(n[1:i]) > dist.n
            break
        end
        n[end] = dist.n - sum(n[1:(end-1)])
        sum_pmf = compute_cdf(dist, c, n, i + 1, sum_pmf)
    end
    return sum_pmf
end

# number of trials 
N = 10
# number of outcomes 
k = 3
# maximum occurences. c[k] = N - sum(c[1:(k-1)])
c = [2,3]
dist = Multinomial(N, k)
sum_pmf = compute_cdf(dist, c)
println("cdf: ", sum_pmf)

In the output below, N is fixed to 10 and there are (2 +1) * (3 +1) = 12 permutations, which seems reasonable to me.

n [0, 0, 10], N 10
n [0, 1, 9], N 10
n [0, 2, 8], N 10
n [0, 3, 7], N 10
n [1, 0, 9], N 10
n [1, 1, 8], N 10
n [1, 2, 7], N 10
n [1, 3, 6], N 10
n [2, 0, 8], N 10
n [2, 1, 7], N 10
n [2, 2, 6], N 10
n [2, 3, 5], N 10
cdf: 0.09586953208352386

I fixed a bug so that the code escapes the for loop once the sum N is reached. If we set the max values to N, it sums to 1 as expected:

# number of trials 
N = 10
# number of outcomes 
k = 3
# maximum occurences. c[k] = N - sum(c[1:(k-1)])
c = [10,10]
dist = Multinomial(N, k)
sum_pmf = compute_cdf(dist, c)
println("cdf: ", sum_pmf)

Please play around with this and see if it breaks somewhere. If someone finds a solution, it might be worth opening a PR on Distributions.jl. Hopefully, this helps as a starting point.

Dan · September 16, 2023, 8:43pm

This can also be calculated with the following:

using Combinatorics
using Distributions

MultinomialCDF(D, c) = 
  sum(s->pdf(D, foldl((r,x)->(r[x]+=1; r), s; init=zeros(Int,length(c)))),
  multiset_combinations(1:length(c),c,D.n))

For example:

julia> D = Multinomial(10,3)
Multinomial{Float64, Vector{Float64}}(n=10, p=[0.3333333333333333, 0.3333333333333333, 0.3333333333333333])

julia> MultinomialCDF(D, [4,4,4])
0.3734186861758879

julia> MultinomialCDF(D, [10,10,10])
1.0000000000000002

(the latter is ≈ 1 as needed)

Some parameter checking would not hurt, but in this form it is a 1-liner

Also, the form of multiset_combinations used here is not so well documented, but still exported from Combinatorics (and used to calculate the documented case).

Christopher_Fisher · September 17, 2023, 10:26am

Very nice solution! I was wondering if some type of integer partition technique with padding would work e.g. [0, 3, 1], [0,0,4] etc. Your approach does just that.

The expression for the cdf in this source is difficult to read. Once I translated it into LaTeX, it was easier to read and I could see I misunderstood the constraint in P(…). The code Dan has is correct and mine is incorrect. Please disregard it.

The only improvement I can suggest is breaking up the oneliner into two functions that the padded integer partition (or better technical name) can be tested separately from the cdf calculation. For example:

padded_partition(c, s) = foldl((r,x)  -> (r[x]+=1; r), s; init=zeros(Int,length(c)))

MultinomialCDF(D, c) = sum(s -> pdf(D, padded_partition(c, s)),
    multiset_combinations(1:length(c), c, D.n)
)

Now we can look at the partitions and verify they sum to 5:

using Combinatorics
using Distributions
D = Multinomial(5, 3)
c = [2,3,4]
map(s -> padded_partition(c, s),
    multiset_combinations(1:length(c), c, D.n))

output


11-element Vector{Vector{Int64}}:
 [2, 3, 0]
 [2, 2, 1]
 [2, 1, 2]
 [2, 0, 3]
 [1, 3, 1]
 [1, 2, 2]
 [1, 1, 3]
 [1, 0, 4]
 [0, 3, 2]
 [0, 2, 3]
 [0, 1, 4]

@jar1, I think you can mark this as an intermediate solution.

@Dan, would you be interested in opening an issue on GitHub to add your solution, or allowing me to open an issue and give you credit? I think this would be useful for the community.

skleinbo · September 17, 2023, 11:12am

Add CDF for Multinomial Distribution

opened 02:36PM - 17 Sep 23 UTC

itsdfish

Hi, Currently, Distributions does not have a method for the CDF of the Multin…omial distribution. Another user on [Discourse](https://discourse.julialang.org/t/cdf-for-multinomial/103917/10) proposed a solution which I reproduce below. If this would be acceptable, I can submit a PR and add some tests. ``` using Distributions using Combinatorics padded_partition(c, s, tmp) = foldl((r,x) -> (r[x]+=1; r), s; init=((tmp .= 0;tmp))) function MultinomialCDF(D, c) k = length(c) tmp = zeros(Int, k) return sum(s -> pdf(D, padded_partition(c, s, temp)), multiset_combinations(1:k, c, D.n)) end D = Multinomial(5, 3) # upper bounds on n c = [2,3,4] MultinomialCDF(D, c) ```

Joris_Pinkse · September 18, 2023, 4:33am

I second this.

skleinbo · September 18, 2023, 4:13pm

That doesn’t make the calculation more efficient though, because a closed expression for the sum of truncated poisson rvs is not available. At least I didn’t find it.

Dan · September 18, 2023, 4:44pm

Calculating the sum by convolution is at least polynomial while enumerating the subsets might get exponentially large. So perhaps worth looking into. Convolution could be calculated in the Fourier domain.

Joris_Pinkse · September 18, 2023, 8:19pm

@skleinbo @Dan Yeah, the convolution thing works and one can express the characteristic function of the sum of iid truncated Poissons as a simple product of incomplete gamma functions (times a ratio involving only an exponential and a factorial) with first argument an integer. Haven’t tried the inverse transformation, which would be the final step.

simonbrauer · September 18, 2023, 8:21pm

Frey described an alternative and included R code. Lebrun described another alternative that is more efficient though the code is only available by request. From Lebrun

It is only recently that reasonably efficient algorithms have been described (see Corrado 2011; Frey 2009), using completely different roots than the previous authors. But even with these algorithms, the only distribution for which it is possible to compute arbitrary rectangular probabilities is the multinomial one, with a polynomial space and time complexity.

We propose to change this situation by providing an algorithm which is essentially exact up to machine precision for all the multivariate discrete distributions considered in Butler and Stutton (1998) and Levin (1983), amongst which the multinomial, multivariate hypergeometric and multivariate Pólya distributions. In the multinomial case, our algorithm is more efficient than the algorithm described in Frey (2009), both with respect to space and time complexity, for an equivalent or even better accuracy when implemented in double precision.

The derivations are beyond my quick understanding, but the code in Frey can very easily be rewritten in Julia without any special functions beyond log-gamma (fairly direct translation below).

using SpecialFunctions

function multinom_cdf(lb, ub, nobs, probs)
    k = length(probs)
    sum_lb = sum(lb)
    sum_ub = sum(ub)

    logc = loggamma(nobs + 1) + sum(lb .* log.(probs)) - sum(loggamma.(lb .+ 1))
    r = nobs - sum_lb
    exc = sum_ub - sum_lb
    fac = zeros(exc + 1)
    sind = zeros(exc + 1)
    loc = 1
    fac[1] = 0

    for i in 1:k
        if ub[i] > lb[i]
            mi = ub[i] - lb[i]
            for j in 1:mi
                loc += 1
                fac[loc] = probs[i] / (lb[i] + j)

                if j == 1
                    sind[loc] = 1
                end
            end
        end
    end

    cprob = zeros(exc + 1)
    cprob[1] = 1.0

    for t in 1:r
        tbelow = [0;cumsum(cprob[1:exc])]
        fac2 = tbelow .* sind + [0;cprob[1:exc]] .* (1.0 .- sind)
        cprob = fac .* fac2
        m = maximum(cprob)
        logc += log(m)
        cprob = cprob ./ m
    end

    exp(logc + log(sum(cprob)))
end

@time multinom_cdf([0,0,0,0], [4,4,4,4], 10, [0.25,0.25,0.25,0.25])
@time multinom_cdf([0,0,0,0], [20,20,20,20], 60, [0.25,0.25,0.25,0.25])
@time multinom_cdf([0,0,0,0], [200,200,200,200], 600, [0.25,0.25,0.25,0.25])

Output

  0.000030 seconds (108 allocations: 19.781 KiB)
0.6889343261718746

  0.000173 seconds (608 allocations: 433.875 KiB)
0.7849628688153565

  0.048421 seconds (6.01 k allocations: 37.629 MiB, 30.49% gc time)
0.9999922199658132

For comparison

function MultinomialCDF(D, c)
    k = length(c)
    tmp = zeros(Int, k)
    sum(s->pdf(D, foldl((r,x)->(r[x]+=1; r), s; init=((tmp .= 0;tmp)))),
      multiset_combinations(1:k,c,D.n))
end

D1 = Multinomial(10, [0.25,0.25,0.25,0.25])
D2 = Multinomial(60, [0.25,0.25,0.25,0.25])
D3 = Multinomial(600, [0.25,0.25,0.25,0.25])

@time MultinomialCDF(D1, [4,4,4,4])
@time MultinomialCDF(D2, [20,20,20,20])
@time MultinomialCDF(D3, [200,200,200,200])

Output

  0.000260 seconds (1.44 k allocations: 67.672 KiB)
0.6889343261718759

  0.038326 seconds (125.75 k allocations: 6.001 MiB, 31.74% gc time)
0.7849628688153811

187.492780 seconds (1.28 G allocations: 46.206 GiB, 4.31% gc time)
0.999992219962507

Dan · September 18, 2023, 8:56pm

The papers are gated, so a PDF would also be nice.

UPDATE: No need. Obtained PDFs through the usual channels.

Christopher_Fisher · September 18, 2023, 9:00pm

Awesome. The scaling of this algorithm is much better. My only question is whether there are any licensing issues.

bertschi · September 18, 2023, 9:34pm

The thesis of Lebrun has a chapter where he mentions that the algorithm is implemented in the OpenTURNS library. This file seems to contain the corresponding C++ code released under the LGPL.

skleinbo · September 19, 2023, 5:24pm

True, but if I didn’t make a mistake (which is very possible!), the incomplete gamma functions need to be evaluated at complex values, for which SpecialFunctions.jl has no implementation.

But the algorithm by Lebrun seems really nice!

Topic		Replies	Views
Cdf of multivariate normal in distributions.jl Statistics	24	10450	April 24, 2023
MVN CDF -- have it coded, need help getting integrating into Distributions.jl Statistics	18	2848	March 25, 2022
CDF of a copula from Copulas.jl? Statistics question , package	24	954	August 19, 2022
A comparison of common distributions in Julia, Python and R Statistics distributions	1	1337	May 21, 2021
Interface for a distribution defined by its cdf Statistics distributions	16	924	October 26, 2023

Cdf for multinomial

Related topics