After looking into this a bit, a few notes:
- At least for me, the type inference for `XLSX.readtable` did not work. I had to manually set the eltypes before creating the `DataFrame` (rough sketch below, after the next snippet).
- Constructing `param_prior_df` and then filling in the columns is type-unstable. Instead, it's better to draw the parameters from the priors and then construct the `DataFrame` from those parameters, e.g.
param_prior_df = DataFrame(
:β => [β₁, β₂],
:α => fill(α, 2),
:θₙ => fill(θₙ, 2),
:θᵢ => fill(θᵢ, 2),
:γₙ => fill(γₙ, 2),
:γᵢ => fill(γᵢ, 2),
:γ => fill(γ, 2),
:λ => fill(λ, 2),
:δᵢ => fill(δᵢ, 2),
)
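Here's roughly how I worked around the eltype issue mentioned in the first bullet. The file, sheet, and column names are placeholders for whatever your spreadsheet actually uses, and depending on the XLSX.jl version `readtable` returns either a Tables.jl-compatible object or a `(data, labels)` tuple that needs splatting:
using XLSX, DataFrames
# Placeholder file/sheet names; newer XLSX.jl versions can drop the `...`.
df = DataFrame(XLSX.readtable("data.xlsx", "Sheet1")...)
# Columns whose eltype wasn't inferred come back as Vector{Any};
# convert them explicitly (placeholder column names) before using the DataFrame downstream.
df.cases = Float64.(df.cases)
df.week = Int.(df.week)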
- Even without doing AD, just computing the log-density results in `NaN`s. E.g. here we use Turing's mode estimation machinery to get a negative log-density function in unconstrained space (where HMC samples) and an initial point drawn from the prior in that space:
julia> prob = optim_problem(combo1_model, MAP(); constrained=false);
julia> prob.prob.f(prob.prob.u0, nothing)
NaN
- A key issue here, I think, is that `inference_solution` can be 0. `MvNormal` requires that the covariance matrix be positive definite, so every entry of the diagonal must be greater than 0. Intuitively, `MvNormal` with a diagonal covariance is equivalent to a product of `Normal`s. As the standard deviation of a `Normal` goes to 0, its density becomes a Dirac delta function: infinite if the data is exactly equal to the mean and 0 everywhere else. This can cause `-Inf` and `+Inf` to be added together in the log-density, which makes the log-density `NaN` (small illustration right after this list).
- I worked around the above for now by adding `1e-3` to `inference_solution` in the diagonal of the covariances. I don't recommend this approach though; instead I'd recommend thinking about whether these 0 covariances make sense.
- After the above, I got non-`NaN` log-densities, but the numerical errors persisted, which could be due to bad posterior geometry or a bad initialization.
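For intuition on the zero-variance point, here's a minimal illustration with a univariate `Normal` (none of the model code is needed; a diagonal `MvNormal` behaves the same way dimension by dimension):
using Distributions
logpdf(Normal(0.0, 0.0), 0.0)  # +Inf: the density is a Dirac spike at the mean
logpdf(Normal(0.0, 0.0), 1.0)  # -Inf: zero density anywhere else
Inf + -Inf                     # NaN: how such terms poison a joint log-density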
To find a good initialization for sampling, we can use Pathfinder.jl. Multi-path Pathfinder constructs an approximation to the posterior from a mixture of multivariate normals:
] add Optim, Pathfinder#retry
julia> using Random, Pathfinder, Optim
julia> prob = optim_problem(combo1_model, MAP(); constrained=false);
julia> rng = MersenneTwister(42);
julia> optimizer = LBFGS(; m=6);
julia> x0s = [rand!(rng, Uniform(-2, 2), similar(prob.prob.u0)) for _ in 1:8];
julia> res = multipathfinder(
prob.prob.f,
x0s, 1_000;
rng,
optimizer,
ndraws_per_run=1_000,
ndraws_elbo=100,
);
[ Info: Optimized for 88 iterations (tries: 1). Maximum ELBO of -1742.7 reached at iteration 43.
[ Info: Optimized for 86 iterations (tries: 1). Maximum ELBO of -1740.86 reached at iteration 76.
[ Info: Optimized for 89 iterations (tries: 1). Maximum ELBO of -2350.17 reached at iteration 72.
[ Info: Optimized for 63 iterations (tries: 1). Maximum ELBO of -1597.53 reached at iteration 57.
[ Info: Optimized for 124 iterations (tries: 1). Maximum ELBO of -1656.22 reached at iteration 97.
[ Info: Optimized for 103 iterations (tries: 1). Maximum ELBO of -1710.69 reached at iteration 90.
[ Info: Optimized for 60 iterations (tries: 1). Maximum ELBO of -2343.33 reached at iteration 36.
[ Info: Optimized for 86 iterations (tries: 1). Maximum ELBO of -1718.11 reached at iteration 68.
┌ Warning: Pareto shape k = 3.3 > 1. Corresponding importance sampling estimates are likely to be unstable and are unlikely to converge with additional samples.
└ @ PSIS ~/.julia/packages/PSIS/ZPzVl/src/core.jl:262
Pathfinder was pretty slow here, which implies HMC will be much slower. PSIS.jl's Pareto shape diagnostic indicates that Pathfinder's approximation is very bad. That can mean Pathfinder failed, or, for a model with as few dimensions as this one, that the posterior can't be well approximated with a multivariate normal (the best posterior geometry for most MCMC methods). Either way, we shouldn't trust the samples much, but they are often still useful for initializing HMC or for diagnosing high-level issues.
julia> q, ϕ, logqϕ = res;
julia> x = mapslices(prob.transform, ϕ; dims=1);
julia> _vi = prob.transform.vi;
julia> ts = [Turing.Inference.Transition(DynamicPPL.tonamedtuple(_vi), DynamicPPL.getlogp(_vi))];
julia> varnames, _ = Turing.Inference._params_to_array(ts);
julia> chns_pathfinder
┌ Warning: timestamp of type Missing unknown
└ @ MCMCChains ~/.julia/packages/MCMCChains/ctWhw/src/chains.jl:364
┌ Warning: timestamp of type Missing unknown
└ @ MCMCChains ~/.julia/packages/MCMCChains/ctWhw/src/chains.jl:364
Chains MCMC chain (1000×17×1 Array{Float64, 3}):
Iterations = 1:1:1000
Number of chains = 1
Samples per chain = 1000
parameters = β₁, β₂, α, θₙ, θᵢ, γₙ, γᵢ, γ, λ, δᵢ, σ₁, σ₂, σ₃, σ₄, σ₅, I0₁, I0₂
Summary Statistics
parameters mean std naive_se mcse ess rhat
Symbol Float64 Float64 Float64 Float64 Float64 Float64
β₁ 6.9985 0.0000 0.0000 0.0000 1005.3697 0.9990
β₂ 6.9863 0.0005 0.0000 0.0000 1024.4776 1.0019
α 1.4080 0.0004 0.0000 0.0000 1019.0961 1.0014
θₙ 0.1022 0.0003 0.0000 0.0000 1012.2757 0.9992
θᵢ 0.0053 0.0000 0.0000 0.0000 1055.3354 0.9999
γₙ 0.5485 0.0010 0.0000 0.0000 893.3647 1.0009
γᵢ 0.5160 0.0010 0.0000 0.0000 944.2894 0.9990
γ 2.0548 0.0016 0.0001 0.0001 900.2580 1.0008
λ 0.0091 0.0000 0.0000 0.0000 1127.4095 0.9991
δᵢ 0.5145 0.0014 0.0000 0.0000 1048.8733 0.9990
σ₁ 8.4751 0.1971 0.0062 0.0072 884.1938 1.0009
σ₂ 141.9986 2.4318 0.0769 0.0608 1054.5526 0.9995
σ₃ 3.9532 0.0688 0.0022 0.0020 971.5128 0.9995
σ₄ 0.5181 0.0173 0.0005 0.0006 928.3820 1.0012
σ₅ 0.2112 0.0106 0.0003 0.0003 1052.2068 0.9992
I0₁ 6.4298 0.0395 0.0012 0.0012 1034.2994 1.0006
I0₂ 99.9999 0.0000 0.0000 0.0000 945.1577 1.0005
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
β₁ 6.9985 6.9985 6.9985 6.9985 6.9985
β₂ 6.9857 6.9857 6.9865 6.9867 6.9872
α 1.4075 1.4075 1.4083 1.4083 1.4085
θₙ 0.1014 0.1023 0.1023 0.1023 0.1024
θᵢ 0.0053 0.0053 0.0053 0.0053 0.0054
γₙ 0.5468 0.5480 0.5482 0.5496 0.5496
γᵢ 0.5122 0.5161 0.5161 0.5168 0.5168
γ 2.0530 2.0530 2.0550 2.0566 2.0566
λ 0.0091 0.0091 0.0091 0.0091 0.0091
δᵢ 0.5115 0.5149 0.5149 0.5149 0.5180
σ₁ 8.2256 8.3736 8.3736 8.7023 8.7023
σ₂ 139.1153 139.2721 143.1914 143.1914 145.7702
σ₃ 3.8563 3.9477 3.9477 3.9530 4.1870
σ₄ 0.4963 0.5011 0.5314 0.5361 0.5361
σ₅ 0.1982 0.2068 0.2084 0.2084 0.2336
I0₁ 6.3699 6.3906 6.4615 6.4615 6.4811
I0₂ 99.9999 99.9999 99.9999 99.9999 99.9999
The main thing that jumps out to me from Pathfinder's draws is that many of the parameters are concentrated at the edge of their support. That may indicate that the model is strained, i.e. that the model (e.g. the choice of prior or likelihood) is incompatible with the data, so the model is stretching to try to satisfy the data. One way to check this would be to simulate a single dataset from this model (i.e. draw a single set of parameters from the prior and simulate a dataset using the rest of the model) and then fit the model to that.
Also, σ₂ is suspiciously much higher than the other σ parameters.
Here’s how we can initialize HMC with one of Pathfinder’s draws:
julia> init_params = x[:, 1]
17-element Vector{Float64}:
6.998486124852993
6.986700192730891
1.4082625428667448
0.10231177412323632
0.005335621912656097
0.5495912273365953
0.5167593488513013
2.053001567025573
0.009057902929811885
0.5148513860880671
8.70228440525855
143.191360543694
3.9529928098708353
0.5010519633309503
0.20838145500524408
6.390633180968371
99.99992890794998
julia> chns = sample(MersenneTwister(92), combo1_model, NUTS(0.65), 1_000; init_params)
I haven't let this sample to completion (it looks like it would take about half a day), but after warm-up I'm no longer getting numerical errors. I did run from a different initialization last night, and the resulting samples had not converged after 1,000 draws (R-hat values were far from 1), and the transitions saturated the NUTS tree depth of 10, which indicates the sampler never U-turns, i.e. even after an HMC transition of 1,024 steps, the final point has not crossed the posterior (which is what NUTS adapts to do). This usually indicates that the posterior geometry will be hard for HMC to sample, e.g. some parameters are highly correlated, or the log-density is heavy-tailed or "bumpy" without clear modes.
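Once sampling finishes, these diagnostics can be read straight off the chain's sampler statistics. A small sketch, assuming the statistic names recorded by recent Turing/MCMCChains versions (`numerical_error`, `tree_depth`); adjust if yours differ:
using Statistics  # for mean; `chns` is the chain returned by `sample` above
mean(chns[:numerical_error])     # fraction of transitions hitting a numerical error (divergence)
mean(chns[:tree_depth] .== 10)   # fraction of transitions saturating the max tree depth of 10
summarystats(chns)               # per-parameter ess and rhat (rhat far from 1 = not converged)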
So, in short, my advice at this stage is to 1) think about whether the chosen likelihood with variances of 0 makes sense, and 2) check the model by fitting it to data simulated from the same prior and likelihood.
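Here's a rough sketch of what (2) could look like. The constructor name `combo1` and its argument/field names are assumptions (I only have `combo1_model` from this thread); it relies on the usual Turing pattern where passing `missing` for the observations makes the model treat them as random:
using Turing, Random
rng = MersenneTwister(0)
# Hypothetical constructor call: same model, but observations marked `missing`
# so they are simulated rather than conditioned on.
sim_model = combo1(missing)
# One joint prior draw: parameter values plus a simulated dataset.
# (Assumes a DynamicPPL version where `rand(model)` returns a NamedTuple.)
sim = rand(rng, sim_model)
# Rebuild the model with the simulated data as observations and fit it.
check_model = combo1(sim.data)   # `data` is a placeholder field name
chns_check = sample(rng, check_model, NUTS(0.65), 1_000)
# If the posterior recovers the parameters stored in `sim`, the model and sampler are
# at least self-consistent; if the same pathologies appear, the problem is in the model
# itself rather than in the real data.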