Domain error for Turing model with callback

After looking into this a bit, a few notes:

  • At least for me, type inference for XLSX.readtable did not work; I had to set the eltypes manually before creating the DataFrame (see the XLSX sketch after this list).
  • Constructing param_prior_df and then filling in the columns is type-unstable. Instead, it’s better to draw the parameters from the priors and then construct the DataFrame from those parameters, e.g.
param_prior_df = DataFrame(
    :β => [β₁, β₂],
    :α => fill(α, 2),
    :θₙ => fill(θₙ, 2),
    :θᵢ => fill(θᵢ, 2),
    :γₙ => fill(γₙ, 2),
    :γᵢ => fill(γᵢ, 2),
    :γ => fill(γ, 2),
    :λ => fill(λ, 2),
    :δᵢ => fill(δᵢ, 2),
)
  • Even without doing AD, just computing the log-density results in NaNs. For example, here we use Turing’s mode estimation machinery to get a negative log-density function in unconstrained space (where HMC samples) and an initial point drawn from the prior in that space:
julia> prob = optim_problem(combo1_model, MAP(); constrained=false);

julia> prob.prob.f(prob.prob.u0, nothing)
NaN
  • A key issue here, I think, is that inference_solution can be 0. MvNormal requires the covariance matrix to be positive definite, so every entry of its diagonal must be greater than 0. Intuitively, MvNormal with a diagonal covariance is equivalent to a product of Normals. As the standard deviation of a Normal goes to 0, its density becomes a Dirac delta function: infinite if the data is exactly equal to the mean and 0 everywhere else. This can cause -Inf and +Inf to be added together in the log-density, which makes the log-density NaN (see the Distributions.jl sketch after this list).
  • I worked around the above for now by adding 1e-3 to inference_solution on the diagonal of the covariances. I don’t recommend this approach, though; instead, I’d recommend thinking about whether these 0 variances make sense.
  • After the above, I got non-NaN log-densities, but the numerical errors persisted, which could be due to bad posterior geometry or a bad initialization.
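
Regarding the XLSX.readtable point, here’s a minimal sketch of the kind of manual fix I mean; the file and sheet names are placeholders, and the exact call depends on your XLSX.jl version:

using XLSX, DataFrames

# Placeholder file/sheet names. If infer_eltypes=true still leaves Any-typed
# columns, narrow each column's eltype before working with the DataFrame:
df = DataFrame(XLSX.readtable("data.xlsx", "Sheet1"; infer_eltypes=true)...)
df = mapcols(col -> identity.(col), df)  # e.g. narrows Vector{Any} to Vector{Float64}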
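
To see the Dirac-delta intuition concretely, here’s a small sketch using Distributions.jl directly (not part of the original model code):

using Distributions

# A Normal with standard deviation 0 is a Dirac delta: the log-density is +Inf
# exactly at the mean and -Inf everywhere else.
logpdf(Normal(0.0, 0.0), 0.0)  # +Inf
logpdf(Normal(0.0, 0.0), 1.0)  # -Inf

# If both occur in the same joint log-density, they sum to NaN:
logpdf(Normal(0.0, 0.0), 0.0) + logpdf(Normal(0.0, 0.0), 1.0)  # NaN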

To find a good initialization for sampling, we can use Pathfinder.jl. Multi-path Pathfinder constructs an approximation to the posterior from a mixture of multivariate normals:

] add Optim, Pathfinder#retry

julia> using Random, Pathfinder, Optim

julia> prob = optim_problem(combo1_model, MAP(); constrained=false);

julia> rng = MersenneTwister(42);

julia> optimizer = LBFGS(; m=6);

julia> x0s = [rand!(rng, Uniform(-2, 2), similar(prob.prob.u0)) for _ in 1:8];

julia> res = multipathfinder(
           prob.prob.f,
           x0s, 1_000;
           rng,
           optimizer,
           ndraws_per_run=1_000,
           ndraws_elbo=100,
       );
[ Info: Optimized for 88 iterations (tries: 1). Maximum ELBO of -1742.7 reached at iteration 43.
[ Info: Optimized for 86 iterations (tries: 1). Maximum ELBO of -1740.86 reached at iteration 76.
[ Info: Optimized for 89 iterations (tries: 1). Maximum ELBO of -2350.17 reached at iteration 72.
[ Info: Optimized for 63 iterations (tries: 1). Maximum ELBO of -1597.53 reached at iteration 57.
[ Info: Optimized for 124 iterations (tries: 1). Maximum ELBO of -1656.22 reached at iteration 97.
[ Info: Optimized for 103 iterations (tries: 1). Maximum ELBO of -1710.69 reached at iteration 90.
[ Info: Optimized for 60 iterations (tries: 1). Maximum ELBO of -2343.33 reached at iteration 36.
[ Info: Optimized for 86 iterations (tries: 1). Maximum ELBO of -1718.11 reached at iteration 68.
┌ Warning: Pareto shape k = 3.3 > 1. Corresponding importance sampling estimates are likely to be unstable and are unlikely to converge with additional samples.
└ @ PSIS ~/.julia/packages/PSIS/ZPzVl/src/core.jl:262

Pathfinder was pretty slow here, which implies HMC will be much slower. PSIS.jl’s Pareto shape diagnostic indicates that Pathfinder’s approximation is very bad. That can mean Pathfinder failed, or, for a model with as few dimensions as this one, that the posterior can’t be well approximated by a multivariate normal (the best posterior geometry for most MCMC methods). Either way, we shouldn’t trust the samples much, but they are often still useful for initializing HMC or for diagnosing high-level issues.

julia> q, ϕ, logqϕ = res;

julia> x = mapslices(prob.transform, ϕ; dims=1);

julia> _vi = prob.transform.vi;

julia> ts = [Turing.Inference.Transition(DynamicPPL.tonamedtuple(_vi), DynamicPPL.getlogp(_vi))];

julia> varnames, _ = Turing.Inference._params_to_array(ts);
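
# NOTE: this step is not in the original post. One plausible way (assuming
# MCMCChains' array constructor) to assemble the chain displayed below from the
# transformed draws `x` (parameters × draws) and `varnames` is:
julia> chns_pathfinder = Chains(permutedims(x), varnames);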

julia> chns_pathfinder
┌ Warning: timestamp of type Missing unknown
└ @ MCMCChains ~/.julia/packages/MCMCChains/ctWhw/src/chains.jl:364
┌ Warning: timestamp of type Missing unknown
└ @ MCMCChains ~/.julia/packages/MCMCChains/ctWhw/src/chains.jl:364
Chains MCMC chain (1000×17×1 Array{Float64, 3}):

Iterations        = 1:1:1000
Number of chains  = 1
Samples per chain = 1000
parameters        = β₁, β₂, α, θₙ, θᵢ, γₙ, γᵢ, γ, λ, δᵢ, σ₁, σ₂, σ₃, σ₄, σ₅, I0₁, I0₂

Summary Statistics
  parameters       mean       std   naive_se      mcse         ess      rhat 
      Symbol    Float64   Float64    Float64   Float64     Float64   Float64 

          β₁     6.9985    0.0000     0.0000    0.0000   1005.3697    0.9990
          β₂     6.9863    0.0005     0.0000    0.0000   1024.4776    1.0019
           α     1.4080    0.0004     0.0000    0.0000   1019.0961    1.0014
          θₙ     0.1022    0.0003     0.0000    0.0000   1012.2757    0.9992
          θᵢ     0.0053    0.0000     0.0000    0.0000   1055.3354    0.9999
          γₙ     0.5485    0.0010     0.0000    0.0000    893.3647    1.0009
          γᵢ     0.5160    0.0010     0.0000    0.0000    944.2894    0.9990
           γ     2.0548    0.0016     0.0001    0.0001    900.2580    1.0008
           λ     0.0091    0.0000     0.0000    0.0000   1127.4095    0.9991
          δᵢ     0.5145    0.0014     0.0000    0.0000   1048.8733    0.9990
          σ₁     8.4751    0.1971     0.0062    0.0072    884.1938    1.0009
          σ₂   141.9986    2.4318     0.0769    0.0608   1054.5526    0.9995
          σ₃     3.9532    0.0688     0.0022    0.0020    971.5128    0.9995
          σ₄     0.5181    0.0173     0.0005    0.0006    928.3820    1.0012
          σ₅     0.2112    0.0106     0.0003    0.0003   1052.2068    0.9992
         I0₁     6.4298    0.0395     0.0012    0.0012   1034.2994    1.0006
         I0₂    99.9999    0.0000     0.0000    0.0000    945.1577    1.0005

Quantiles
  parameters       2.5%      25.0%      50.0%      75.0%      97.5% 
      Symbol    Float64    Float64    Float64    Float64    Float64 

          β₁     6.9985     6.9985     6.9985     6.9985     6.9985
          β₂     6.9857     6.9857     6.9865     6.9867     6.9872
           α     1.4075     1.4075     1.4083     1.4083     1.4085
          θₙ     0.1014     0.1023     0.1023     0.1023     0.1024
          θᵢ     0.0053     0.0053     0.0053     0.0053     0.0054
          γₙ     0.5468     0.5480     0.5482     0.5496     0.5496
          γᵢ     0.5122     0.5161     0.5161     0.5168     0.5168
           γ     2.0530     2.0530     2.0550     2.0566     2.0566
           λ     0.0091     0.0091     0.0091     0.0091     0.0091
          δᵢ     0.5115     0.5149     0.5149     0.5149     0.5180
          σ₁     8.2256     8.3736     8.3736     8.7023     8.7023
          σ₂   139.1153   139.2721   143.1914   143.1914   145.7702
          σ₃     3.8563     3.9477     3.9477     3.9530     4.1870
          σ₄     0.4963     0.5011     0.5314     0.5361     0.5361
          σ₅     0.1982     0.2068     0.2084     0.2084     0.2336
         I0₁     6.3699     6.3906     6.4615     6.4615     6.4811
         I0₂    99.9999    99.9999    99.9999    99.9999    99.9999

The main thing that jumps out to me from Pathfinder’s draws is that many of the parameters are concentrated at the edge of their support. That may indicate the model is strained, i.e. that the model (e.g. the choice of prior or likelihood) is incompatible with the data, so it is stretching to satisfy the data. One way to check this is to simulate a single dataset from the model (i.e. draw a single set of parameters from the prior and simulate a dataset using the rest of the model) and then fit the model to that simulated dataset.
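
As a rough sketch of that check (the names combo1 and args here are placeholders, not from your code, and I’m assuming the model function returns the quantities it simulates): passing missing for the data argument makes Turing draw the “observations” from the prior along with the parameters, and you can then refit to that simulated dataset.

# Placeholder model/argument names; assumes @model function combo1(data, args...)
# returns the quantities it simulates.
sim_model = combo1(missing, args...)
sim_data = sim_model()  # executes the model, drawing everything from the prior

# Refit to the simulated data; if the model is self-consistent, the posterior
# should concentrate around the parameter values that generated sim_data.
sim_chns = sample(combo1(sim_data, args...), NUTS(0.65), 1_000)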

Also, σ₂ is suspiciously much higher than the other σ parameters.

Here’s how we can initialize HMC with one of Pathfinder’s draws:

julia> init_params = x[:, 1]
17-element Vector{Float64}:
   6.998486124852993
   6.986700192730891
   1.4082625428667448
   0.10231177412323632
   0.005335621912656097
   0.5495912273365953
   0.5167593488513013
   2.053001567025573
   0.009057902929811885
   0.5148513860880671
   8.70228440525855
 143.191360543694
   3.9529928098708353
   0.5010519633309503
   0.20838145500524408
   6.390633180968371
  99.99992890794998

julia> chns = sample(MersenneTwister(92), combo1_model, NUTS(0.65), 1_000; init_params)

I haven’t let this sample to completion (it looks like it would take about half a day), but after warm-up I’m no longer seeing numerical errors. I did run from a different initialization last night, and the resulting samples had not converged after 1000 draws (R-hat values were far from 1). The transitions also saturated the NUTS tree depth of 10, which indicates the trajectories never U-turn, i.e. even after an HMC transition of 1024 steps, the final point has not crossed the posterior (which is what NUTS is adapted to do). This usually indicates that the posterior geometry will be hard for HMC to sample, e.g. some parameters are highly correlated, or the log-density is heavy-tailed or “bumpy” without clear modes.
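
If it’s useful, once sampling finishes you can check how widespread these problems are from the per-transition statistics NUTS stores in the chain’s internals (a sketch; chns is the object returned by sample above):

sum(chns[:numerical_error])     # number of transitions with numerical errors
mean(chns[:tree_depth] .== 10)  # fraction of transitions saturating the max tree depth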

So in short, my advice at this stage is 1) think about whether the chosen likelihood with variances of 0 makes sense and 2) check the model by fitting it to data simulated from the same prior and likelihood.
