Generating high-dimensional Sobol sequences

#1

Hi there,

I am using Sobol sequences to generate say 10,000 draws from the Multivariate Normal with dimension 100, mean vector 0 and covariance matrix equal to the identity matrix.

Generating the sequence is fine, but I find that the distribution of the resulting points is highly concentrated. That is, the density of each draw, relative to that of the other draws, is almost zero for 99% of points, then several orders of magnitude higher for just a few points. I’d expect the relative densities to be a little more even.

Example below. Any ideas what I’m doing wrong?

Thanks

using Sobol
using Distributions
using UnicodePlots

function generate_z(K::Int, nr::Int)
    # Initialize result with draws from the Uniform distribution over the unit cube [0, 1]^K.
    result = fill(0.0, K, nr)
    s = SobolSeq(K)
    skip(s, nr)
    for r = 1:nr
        result[:, r] .= next!(s)
    end

    # Convert to draws from MVN(z | 0_k, I_k) using the inverse CDF
    d = Normal(0.0, 1.0)
    for r = 1:nr
        for i = 1:K
            result[i, r] = quantile(d, result[i, r])
        end
    end
    result
end

function zdensity(z::Array{Float64, 2})
    K, nr  = size(z)
    result = fill(1.0, nr)
    d      = Normal(0.0, 1.0)
    totaldens = 0.0
    for r = 1:nr
        for i = 1:K
            result[r] *= pdf(d, z[i, r])
        end
        totaldens += result[r]
    end
    result ./= totaldens
end

z = generate_z(100, 10_000);
dens = zdensity(z)
lineplot(dens)
describe(dens)
0 Likes

#2

First, I would use the sum of log pdfs, which is much less prone to numerical error. These look fine to me when compared with random normals. Also simple statistics look ok.

using Statistics, LinearAlgebra
extrema(mapslices(mean, z; dims = 2))
extrema(mapslices(std, z; dims = 2))
z_log_density(z) = normalize!(vec(sum(logpdf.(Ref(Normal(0, 1)), z); dims = 1)), 1)
z = generate_z(100, 10_000)
z_dens = z_log_density(z)
sim_z = randn(size(z))
sim_z_dens = z_log_density(sim_z)
ps = 0:0.1:1
quantile(z_dens, ps)
quantile(sim_z_dens, ps)

yields

julia> extrema(mapslices(mean, z; dims = 2))
(-0.0007172362463843937, 0.0009932045561835572)

julia> extrema(mapslices(std, z; dims = 2))
(0.999483368783979, 1.0003834076282714)

julia> quantile(z_dens, ps)
11-element Array{Float64,1}:
 -0.00012070140687908017
 -0.0001063122814842569 
 -0.00010400024477877038
 -0.00010240952128016883
 -0.00010104242015569829
 -9.979156638516813e-5  
 -9.854879329589064e-5  
 -9.727505743865635e-5  
 -9.584836020836684e-5  
 -9.393933488262388e-5  
 -8.430228880599951e-5  

julia> quantile(sim_z_dens, ps)
11-element Array{Float64,1}:
 -0.00012556606255873136
 -0.00010655950256849811
 -0.0001041351251443149 
 -0.0001024130143312689 
 -0.00010104015366172799
 -9.97620830194371e-5   
 -9.849225327092044e-5  
 -9.72325802687073e-5   
 -9.577554898483448e-5  
 -9.377537093476237e-5  
 -8.342424675283704e-5  
0 Likes

#3

Thanks for responding.

I don’t think numerical error is at play here. While it is true that across each of the 100 dimensions we get a good (low-discrepancy) sample from the univariate standard normal, I think we are seeing a natural consequence of high-dimensional sampling…that is, a handful of points dominating the distribution in terms of density.

I suspect this is unavoidable because within a draw all of the 100 component densities are less than 0.4 (pdf(Normal(0, 1), 0)), and their product is usually very small. But sometimes (rarely) most of the 100 component densities will be relatively close to 0.4 (or at least on the order of 10^-1), which results in a much larger density than that of most draws…several orders of magnitude in fact.

0 Likes