MVN CDF -- have it coded, need help getting integrating into Distributions.jl

Summary
I have written a Julia function to compute the Multivariate Normal integral.
I need a partner who is an expert in Distributions.jl to properly integrate it into the code base.

Introduction

I have written a function in Julia to compute the central Multivariate Normal probabilities:

\Phi_k(\mathbf{a},\mathbf{b};\mathbf{\Sigma} ) = \frac{1}{\sqrt{\left | \mathbf{\Sigma} \right |{(2\pi )^k}}}\int_{a_1}^{b_1}\int_{a_2}^{b_2}\begin{align*} &...\end{align*} \int_{a_k}^{b_k}e^{^{-\frac{1}{2}}\mathbf{x}^t{\mathbf{\Sigma }}^{-1}\boldsymbol{\textbf{x}}}dx_k...dx_1

where \Sigma is a symmetric, positive definite k x k covariance matrix. Vector a is the lower integration limit, and b is the upper integration limit. As this is the “central” MVN, \mu = \vec{0}.

This is a translation of the MATLAB qsimvnv function written by Prof. Alan Genz.

Examples:

julia> r = [4 3 2 1;3 5 -1 1;2 -1 4 2;1 1 2 5]
4×4 Array{Int64,2}:
 4   3   2  1
 3   5  -1  1
 2  -1   4  2
 1   1   2  5
julia> a = [-Inf; -Inf; -Inf; -Inf]
4-element Array{Float64,1}:
 -Inf
 -Inf
 -Inf
 -Inf
julia> b = [1; 2; 3; 4 ]
4-element Array{Int64,1}:
 1
 2
 3
 4
julia> (p,e)=qsimvnv(r,a,b)
(0.6049240592769761, 0.0016812280451818128)

In this example, 0.604924… is the computed probability, and 0.00168122… is the estimated error of the calculation. The function uses quasi-random integration to compute the integral, so you get a slightly different result each time it’s run. The results of the function have been checked for correctness by comparing them to R, MATLAB, and published results.

julia> (p,e)=qsimvnv(r,a,b)
(0.6060880398660634, 0.001540997560720256)
julia> (p,e)=qsimvnv(r,a,b)
(0.6060880398660634, 0.001540997560720256)
julia> (p,e)=qsimvnv(r,a,b)
(0.6067495056529963, 0.0013883605551699985)

You can increase the iterations to reduce the error. The default is 1000*dimension of \Sigma. Decrease in error is roughly linearly with increase in iterations:

julia> (p,e)=qsimvnv(r,a,b;m=5000)
(0.6061177664363478, 0.0018974040888668662)

julia> (p,e)=qsimvnv(r,a,b;m=6000)
(0.605451604851963, 0.0009217826063167869)

julia> (p,e)=qsimvnv(r,a,b;m=7000)
(0.6057328187356109, 0.0009514371151440057)

julia> (p,e)=qsimvnv(r,a,b;m=80000)
(0.6056921305482306, 0.0002024445811897692)

julia> (p,e)=qsimvnv(r,a,b;m=9000)
(0.6060092106313155, 0.0009367451258961832)

julia> (p,e)=qsimvnv(r,a,b;m=10000)
(0.605821023964934, 0.0005674428432139305)

This could be the basis of a MVN cdf function for Distributions.jl which operates with the MvNormal objects.

julia> r=float(r)
4×4 Array{Float64,2}:
 4.0   3.0   2.0  1.0
 3.0   5.0  -1.0  1.0
 2.0  -1.0   4.0  2.0
 1.0   1.0   2.0  5.0
julia> mu=[0, 0, 0, 0]
4-element Array{Int64,1}:
 0
 0
 0
 0
julia> mvn1 = MvNormal(mu, r)
FullNormal(
dim: 4
μ: [0.0, 0.0, 0.0, 0.0]
Σ: [4.0 3.0 2.0 1.0; 3.0 5.0 -1.0 1.0; 2.0 -1.0 4.0 2.0; 1.0 1.0 2.0 5.0]
)
julia> mvn_cdf(D::AbstractMvNormal, a::AbstractVector, b::AbstractVector; m::Union{Integer, Nothing} = nothing) = qsimvnv(D.Σ.mat,a .- D.μ, b .- D.μ;m=m)
mvn_cdf (generic function with 1 method)

julia> mvn_cdf(mvn1,a,b;m=5000)
(0.6056845764526123, 0.0015238130294856398)

What I have

  • Julia implementation of qsimvnv()
  • Julia implementation of Prof. Genz’s function that 1) scales and re-arranges the integration vectors, and 2) computes the cholesky decomposition of \Sigma. This is called by qsimvnv().
  • Several test cases, consisting of matrices of various sizes, the expected probability and error estimate results, and tolerances for each. This could form the basis of test code.

What I Don’t Have

  • A good understanding of the structure of the Distributions.jl code and all of the different MvNormal objects. When I look at the code, I can’t even figure out where I would put the two functions I have.
  • Programming skills with Julia types, methods, multiple dispatch. The “mvn_cdf” example above is the pinnacle of my achievement, and it doesn’t work unless you specify m= in it (i.e., not specifying m to use the default results in an error).
  • Git experience

What I Need

  • I need a partner who is an expert in the structure of Distributions.jl and the MvNormal objects, who knows how to integrate it into the code base and write the methods for all of the different MvNormal objects. I would expect his partner would also identify improvements to the code I’ve written for qsimvnv() and _chlrdr (the cholesky decomposition function).
  • Note that MvNormal objects can have nonzero mean vectors; to use qsimvnv(), the \mu vector needs to be subtracted from the integration limits a and b.
  • Some of the MvNormal objects use Correlation matrices instead of Covariance matrices. These need to be converted to covariance matrices before calling the function.
  • Given a template to work from, I can probably contribute to writing the test code. I can also run tests, and I can compare the results to MATLAB as needed.

Next Steps
If someone wants to partner up on this, let me know what to do next. I could create a PR to dump the code in to the /src directory, but my partner would have to fix it afterwards. Or … let me know how to proceed.

Appendix - Statistics Notes

One way to define the Multivariate Normal CDF is for “simultaneous X” – that is, a for a given random vector \vec{x}, the probability that every component of \vec{x} is less than a reference vector, \vec{x_0}:

F(\vec{x_{0}}) = P(\vec{x} \leqslant \vec{x_{0}})

which can be generalized to:

F(\vec{a},\vec{b}) = P(\vec{a} \leqslant \vec{x} \leqslant \vec{b})

Fn(\vec{a},\vec{b};\mu,\Sigma) = P(ai < Xi < bi;i=1...n)

F_{n}(\vec{a},\vec{b};\vec{\mu} ,\Sigma ) = P(a_{i}<X_{i}<b_{i}; i=1...n) = \int_{a_{1}}^{b_{1}}\int_{a_{2}}^{b_{2}}\cdots \int_{a_{i}}^{b_{i}}f(x_{1},x_{2},...,x_{i})dx_{1}dx_{2}\cdots dx_{i}

We define:

F_{n}(\vec{a},\vec{b};\vec{\mu} ,\Sigma ) = \Phi_k(\mathbf{a},\mathbf{b},\mathbf{\Sigma} )

where:

\Phi_k(\mathbf{a},\mathbf{b};\mathbf{\Sigma} ) = \frac{1}{\sqrt{\left | \mathbf{\Sigma} \right |{(2\pi )^k}}}\int_{a_1}^{b_1}\int_{a_2}^{b_2}\begin{align*} &...\end{align*} \int_{a_k}^{b_k}e^{^{-\frac{1}{2}}\mathbf{x}^t{\mathbf{\Sigma }}^{-1}\boldsymbol{\textbf{x}}}dx_k...dx_1

Note that you’re only really computing the CDF if \vec{a} = -\infty.
If \vec{a}=\vec{0} and \vec{b} = +\vec{\infty}, this is called the orthant probability.

Non-Central MVN

The non-central MVN integral is given by:

\Phi_k(\mathbf{a},\mathbf{b};\mathbf{\Sigma},\delta ) = \frac{1}{\sqrt{\left | \mathbf{\Sigma} \right |{(2\pi )^k}}}\int_{a_1}^{b_1}\int_{a_2}^{b_2}\begin{align*} &...\end{align*} \int_{a_k}^{b_k}e^{^{-\frac{1}{2}}\mathbf{(x-\delta)}^t{\mathbf{\Sigma }}^{-1}\mathbf{(x-\delta)}}dx_k...dx_1

where \delta denotes the k x 1 non-centrality vector with -{\infty} < \delta_i < +{\infty}.

The non-central MVN integral can determined from the central MVN with shifted integration limits:

\Phi_k(\mathbf{a},\mathbf{b};\mathbf{\Sigma},\delta ) = \Phi_k(\mathbf{a-\delta},\mathbf{b-\delta };\mathbf{\Sigma})

Standardized MVN

The non-central MVN probability \Phi_k(\mathbf{a},\mathbf{b};\mathbf{\Sigma},\delta ) can also be transformed into the standardized MVN probability. The standardized MVN has \vec{\mu} = \vec{0} and \Sigma = I_n.

Let \mathbf{D} denote the diagonal matrix of the square roots of the diagonal entries of \Sigma. The correlation matrix \mathbf{R} is defined by \mathbf{\Sigma = DRD}. Then the transformation \mathbf{x = Dy + \delta} reduces the non-central MVN probability to

\Phi_k(\mathbf{a},\mathbf{b};\mathbf{\Sigma},\delta ) = \Phi_k(\mathbf{D^{-1}(a-\delta),D^{-1}(b-\delta)};\mathbf{R})

Mahalanobis Distance

The other way to think about the MVN probability is that of a sample vector \vec{x_0} falls within the ellipsoid defined by MVN(\vec{\mu},\Sigma). This has a closed form solution. If you’re interested in this, see Distribution of observation-level Mahalanobis distance.

2 Likes

First pass, Distributions.jl doesn’t implement the CDF for any MultivariateDistribution yet, so there’s no agreed upon standard there. StatsFuns.jl might be the appropriate home for this in the medium-term. However, you have to be careful about the licensing issues that might surround "a translation of the MATLAB qsimvnv".

2 Likes

Before you invest time in this, it’s worth making sure that you have the relevant IP rights from Alan Genz to include a translation of their code in a MIT-licensed project like Distributions.jl.

If that’s all set, I think you don’t need a full time partner to get started with this if you’re passionate and interested to experiment on your own. I would do the following:

  1. Locate your local copy of Distributions. You should be able to find this using something like
using Distributions
@which Bernoulli(0.5)
  1. Once you’ve found the directory, look for the mvnormal.jl file.

  2. Add a method like the following:

function cdf(d::MvNormal, x::AbstractArray)
    # YOUR CODE GOES HERE
end

Once you fill in code for cdf, you should be able to run it locally (since you’re editing your copy of Distributions) and test it.

If you’re confused by Git, you could wait to get your cdf function working and then submit a pull request using the steps from https://www.youtube.com/watch?v=0gshzIYqkcY&feature=youtu.be.

By the time you’ve done that, you’ll have learned a bunch and could finish the work with much less help from someone with commit rights to Distributions.jl. As @johnczito notes, the real blocker is getting people to agree what CDF for multivariate distributions does.

5 Likes

Prof. Genz allows redistribution of his code:

% Copyright © 2013, Alan Genz, All rights reserved.
%
% Redistribution and use in source and binary forms, with or without
% modification, are permitted provided the following conditions are met:
% 1. Redistributions of source code must retain the above copyright
% notice, this list of conditions and the following disclaimer.
% 2. Redistributions in binary form must reproduce the above copyright
% notice, this list of conditions and the following disclaimer in
% the documentation and/or other materials provided with the
% distribution.
% 3. The contributor name(s) may not be used to endorse or promote
% products derived from this software without specific prior
% written permission.

1 Like

Great. And I remember now that a Genz port of the bivariate normal cdf already exists in StatsFuns.jl/src/tvpack.jl. It happens not to be exported. And in that case I think someone just emailed him for permission.

If the contribution is a perfectly numerical function that computes multivariate normal probabilities over hyperrectangles, then I’d say the best place to put that is StatsFuns.jl. And that process could be fairly uncomplicated. It sounds like you’re already pretty close. Observe the proper licensing issues, write some thorough tests, and then follow @johnmyleswhite’s advice above.

Like I said, trying to go direct to Distributions.jl could be a challenge because there’s no agreed-upon convention for how cdf(d::MultivariateDistribution, x::AbstractVector) ought to behave. So your PR could get caught up in lengthy discussions about that. But no matter what convention is ultimately settled on, this function will be relevant, so if you put it in StatsFuns.jl, it will be “ready and waiting.”

1 Like

If the contribution is a perfectly numerical function that computes multivariate normal probabilities over hyperrectangles, then I’d say the best place to put that is StatsFuns.jl .

This is how things are structured for a lot of the univariates anyway. For things like gamma, beta, etc. The function that actually computes the cdf is in StatsFuns.jl. Distributions.jl is just a front end.

It’s not so simple, I think.

For MvNormal, we have:
const IsoNormal = MvNormal{ScalMat, Vector{Float64}}
const DiagNormal = MvNormal{PDiagMat, Vector{Float64}}
const FullNormal = MvNormal{PDMat, Vector{Float64}}

const ZeroMeanIsoNormal{Axes} = MvNormal{ScalMat, Zeros{Float64,1,Axes}}
const ZeroMeanDiagNormal{Axes} = MvNormal{PDiagMat, Zeros{Float64,1,Axes}}
const ZeroMeanFullNormal{Axes} = MvNormal{PDMat, Zeros{Float64,1,Axes}}

The MvNormal object stores its data differently depending on how it’s defined. \Sigma may be stored as an array, a PDMat, a PDDiagMat, a ScalMat, a Vector, or a real-valued number. If the covariance matrix is stored in a compact form, a full k x k covariance matrix needs to be created before calling qsimvnv().

In addition, there are the MvNoralCannon types, which store the potential vector \mathbf{h} and precision matrix \mathbf{J} instead of \mu and \Sigma. The MvNormalCannon types has just as many variations as the MvNormal type.

If you look at function fit_mle() in MvNormal.jl, you’ll see all of the different cases that need to be covered. There’s a lot in there I don’t understand – objects and type system in general – but also objects like MvNormalKnownCov.

I thinking adding it to StatsFun.jl is a lot more manageable – more of the work of forming a \Sigma with the proper integration limits is put on the user.

Yeah, sorting out those front-end issues is secondary. It will all resolve to a function like what you have in the end.

It’s true that doing a good job isn’t that simple, but to unblock yourself, you could start by simplifying the problem to avoid dealing with types you don’t understand:

function cdf(d::MvNormal, x::AbstractArray)
    mu = Vector(d.μ)
    sigma = Matrix(d.Σ)
    # YOUR CODE HERE USES mu AND sigma

Eventually you’d likely need to handle all of the complexity, but you could make progress by simplifying the problem to a narrower space that you’re comfortable operating in and then gradually increasing complexity to provide the performance/accuracy you’d lose through the simplification.

I don’t contribute to Distributions.jl anymore and can’t speak to how much support you could rely on from the current maintainers, but I suspect blocking on support from a volunteer partner is likely to slow you down substantially more than trying to fumble your way through things and making mistakes that will eventually teach you how to do things through simple trial and error.

All that said, I think @johnczito’s approach is a good one – it’s another simplification that will help you get unblocked.

1 Like

Just stumbled on this issue. I’d be really good if your implementation was made widely available; I was looking for this at one point and I was disappointed not to find it. I can help out a little bit by looking at the code and suggesting how to make it more Julian and more ready for submission to StatsFun.jl. (Not that I’ve ever contibuted to StatsFun.jl, but, you know, general direction :slight_smile: ).
You’re welcome to ping me if/when you get the code onto github.

1 Like

Any update on this?

I’m very frustrated that cdf function in Distributions.jl does not have a method for multivariate distributions.

1 Like