Hi there,
I am using Sobol sequences to generate say 10,000 draws from the Multivariate Normal with dimension 100, mean vector 0 and covariance matrix equal to the identity matrix.
Generating the sequence is fine, but I find that the distribution of the resulting points is highly concentrated. That is, the density of each draw, relative to that of the other draws, is almost zero for 99% of points, then several orders of magnitude higher for just a few points. I’d expect the relative densities to be a little more even.
Example below. Any ideas what I’m doing wrong?
Thanks
using Sobol
using Distributions
using UnicodePlots
function generate_z(K::Int, nr::Int)
# Initialize result with draws from the Uniform distribution over the unit cube [0, 1]^K.
result = fill(0.0, K, nr)
s = SobolSeq(K)
skip(s, nr)
for r = 1:nr
result[:, r] .= next!(s)
end
# Convert to draws from MVN(z | 0_k, I_k) using the inverse CDF
d = Normal(0.0, 1.0)
for r = 1:nr
for i = 1:K
result[i, r] = quantile(d, result[i, r])
end
end
result
end
function zdensity(z::Array{Float64, 2})
K, nr = size(z)
result = fill(1.0, nr)
d = Normal(0.0, 1.0)
totaldens = 0.0
for r = 1:nr
for i = 1:K
result[r] *= pdf(d, z[i, r])
end
totaldens += result[r]
end
result ./= totaldens
end
z = generate_z(100, 10_000);
dens = zdensity(z)
lineplot(dens)
describe(dens)
1 Like
First, I would use the sum of log pdfs, which is much less prone to numerical error. These look fine to me when compared with random normals. Also simple statistics look ok.
using Statistics, LinearAlgebra
extrema(mapslices(mean, z; dims = 2))
extrema(mapslices(std, z; dims = 2))
z_log_density(z) = normalize!(vec(sum(logpdf.(Ref(Normal(0, 1)), z); dims = 1)), 1)
z = generate_z(100, 10_000)
z_dens = z_log_density(z)
sim_z = randn(size(z))
sim_z_dens = z_log_density(sim_z)
ps = 0:0.1:1
quantile(z_dens, ps)
quantile(sim_z_dens, ps)
yields
julia> extrema(mapslices(mean, z; dims = 2))
(-0.0007172362463843937, 0.0009932045561835572)
julia> extrema(mapslices(std, z; dims = 2))
(0.999483368783979, 1.0003834076282714)
julia> quantile(z_dens, ps)
11-element Array{Float64,1}:
-0.00012070140687908017
-0.0001063122814842569
-0.00010400024477877038
-0.00010240952128016883
-0.00010104242015569829
-9.979156638516813e-5
-9.854879329589064e-5
-9.727505743865635e-5
-9.584836020836684e-5
-9.393933488262388e-5
-8.430228880599951e-5
julia> quantile(sim_z_dens, ps)
11-element Array{Float64,1}:
-0.00012556606255873136
-0.00010655950256849811
-0.0001041351251443149
-0.0001024130143312689
-0.00010104015366172799
-9.97620830194371e-5
-9.849225327092044e-5
-9.72325802687073e-5
-9.577554898483448e-5
-9.377537093476237e-5
-8.342424675283704e-5
1 Like
Thanks for responding.
I don’t think numerical error is at play here. While it is true that across each of the 100 dimensions we get a good (low-discrepancy) sample from the univariate standard normal, I think we are seeing a natural consequence of high-dimensional sampling…that is, a handful of points dominating the distribution in terms of density.
I suspect this is unavoidable because within a draw all of the 100 component densities are less than 0.4 (pdf(Normal(0, 1), 0)
), and their product is usually very small. But sometimes (rarely) most of the 100 component densities will be relatively close to 0.4 (or at least on the order of 10^-1), which results in a much larger density than that of most draws…several orders of magnitude in fact.
1 Like