Inconsistent size of the ODE solution object for Turing inference

Hello people,

I’ve been using Turing to infer the parameters of an ODE model for close to a year. Recently, however, I have started hitting the following non-reproducible error, which I had never encountered before:

Sampling (2 processes)   0%|                            |  ETA: N/A
Sampling (2 processes) 100%|████████████████████████████| Time: 0:26:00
ERROR: LoadError: TaskFailedException

    nested task error: On worker 2:
    DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 0 and 11")

Following is an MWE:

#=
Section 1: Importing packages
=#

using Distributed
using Turing
num_chains = 2;
addprocs(num_chains)

@everywhere using DifferentialEquations, Interpolations, XLSX, DataFrames
@everywhere using Distributions, DistributionsAD, MCMCChains, Turing
@everywhere using LinearAlgebra, Random
@everywhere Random.seed!(18431);
@everywhere using ForwardDiff, Preferences
@everywhere Turing.setadbackend(:forwarddiff)
@everywhere set_preferences!(ForwardDiff, "nansafe_mode" => true)

# Time period for simulation of ODEs
@everywhere tspan = (1.0, 11.0)

# Observation data
@everywhere actual_data = [
    10.0 0.0 0.0 0.0;
    10.55 0.15 0.01 0.01;
    11.36 0.09 0.01 0.01;
    12.12 0.05 0.01 0.01;
    13.06 0.03 0.01 0.01;
    14.79 0.04 0.01 0.01;
    19.58 0.08 0.01 0.01;
    39.94 0.27 0.03 0.02;
    172.38 1.56 0.16 0.04;
    1489.52 13.98 1.40 0.20;
    20918.01 191.95 19.20 2.18
]

#=
Section 2: ODE model
=#

@everywhere function ODE_model!(dy, y, p, t)

    N = 1_000_000;      # Total population size (a constant)

    @inbounds begin
    dy[1] = - p[1] * y[1] * y[3] / N

    dy[2] =   p[1] * y[1] * y[3] / N - p[2] * y[2]

    dy[3] =   p[2] * y[2] - p[3] * y[3]

    dy[4] =   p[3] * (1 - p[7]) * (1 - p[6]) * y[3] - p[5] * y[4]

    dy[5] =   p[3] * p[7] * (1 - p[6]) * y[3] - p[4] * y[5]

    dy[6] =   p[3] * p[6] * y[3] + p[5] * y[4] + p[4] * (1 - p[8]) * y[5]

    #=
    Compartments used for inference
    =#
    dy[7] =   p[4] * p[8] * y[5]

    dy[8] =   p[2] * y[2]

    dy[9] =   p[3] * (1 - p[6]) * y[3] - p[5] * y[4] - p[4] * y[5]

    dy[10] =   p[3] * p[7] * (1 - p[6]) * y[3] - p[4] * y[5]
    end
    return nothing
end
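
For completeness, the ODE part can be sanity-checked on its own with a single deterministic solve; a minimal sketch, using the medians of the LogNormal priors and means of the Beta priors from Section 3 below as plausible values (these values are illustrative, not from the original post):

# Sanity check: one deterministic solve with plausible parameter values
p_check = [2.8, 7/6.5, 7/3.2, 7/10, 7/8, 5/6.5, 1.5/6.5, 1.5/6.5]
u0_check = zeros(10)
u0_check[1] = 1_000_000 - 10
u0_check[3] = 10
u0_check[8] = 10
prob_check = ODEProblem(ODE_model!, u0_check, tspan, p_check)
sol_check = solve(prob_check, AutoVern7(Rodas5()), saveat = 1.0)
@show sol_check.retcode length(sol_check.t)   # expect :Success and 11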

#=
Section 3: Define the observation model
=#

@everywhere @model function inference_function!(data, ODEtspan)

    # Priors of ODE model parameters
    β ~ LogNormal(log(2.8), 0.5)
    α ~ LogNormal(log(7/6.5), 0.5)
    θ ~ LogNormal(log(7/3.2), 0.5)
    γᵢ ~ LogNormal(log(7/10), 0.5)
    γₙ ~ LogNormal(log(7/8), 0.5)
    ϕᵣ ~ Beta(5.0, 1.5)
    ϕᵢ ~ Beta(1.5, 5.0)
    ϕₖ ~ Beta(1.5, 5.0)

    param_prior = [β, α, θ, γᵢ, γₙ, ϕᵣ, ϕᵢ, ϕₖ]

    # Priors of observation model parameters
    σ₁ ~ Gamma(1.0, 5.0)
    σ₂ ~ Gamma(1.0, 5.0)
    σ₃ ~ Gamma(1.0, 5.0)
    σ₄ ~ Gamma(1.0, 5.0)

    # Defining the initial conditions and the problem
    u0 = zeros(10)          # 10 state variables in the ODE model
    u0[1] = 1_000_000 - 10  # Value of N - 10
    u0[3] = 10
    u0[8] = 10
    inference_problem = ODEProblem(ODE_model!, u0, ODEtspan, param_prior)

    # Solve the ODE model; with tspan = (1.0, 11.0) and saveat = 1.0 this should yield 11 save points
    inference_solution = solve(inference_problem, AutoVern7(Rodas5()), saveat = 1.0, save_everystep=false)

    # Likelihood: scaled and shifted Student's t-distributions (via LocationScale)
    data[2:11,1] ~ arraydist(LocationScale.(inference_solution[7,2:11],  σ₁.*sqrt.(data[2:11,1]), TDist.(4)))
    data[2:11,2] ~ arraydist(LocationScale.(inference_solution[8,2:11],  σ₂.*sqrt.(data[2:11,2]), TDist.(4)))
    data[2:11,3] ~ arraydist(LocationScale.(inference_solution[9,2:11],  σ₃.*sqrt.(data[2:11,3]), TDist.(4)))
    data[2:11,4] ~ arraydist(LocationScale.(inference_solution[10,2:11], σ₄.*sqrt.(data[2:11,4]), TDist.(4)))
end
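
A cheap way to exercise the whole model body (solve plus likelihood) before launching the distributed run is to draw a few samples from the prior; a hypothetical smoke test, not part of the original script:

# Smoke test: Prior() runs the model body end to end without NUTS
smoke_model = inference_function!(actual_data, tspan)
smoke_chain = sample(smoke_model, Prior(), 10)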

#=
Section 4: Run the inference procedure and save the chains
=#

@everywhere inference_model = inference_function!(actual_data, tspan)
# NUTS with 2000 adaptation steps and target acceptance 0.65; 5000 samples per chain
chains = sample(inference_model, NUTS(2000, 0.65), MCMCDistributed(), 5000, num_chains; progress=true)

The stacktrace shows that the error is thrown by the likelihood lines, where I use the Student's t-distributions with arraydist.

By a ‘non-reproducible’ error I mean that it is unpredictable: sometimes all the chains finish sampling without the error above, and sometimes they do not. The dimensions of the solution object inference_solution also vary, even though I am enforcing saveat while solving, which produces the error "arrays could not be broadcast to a common size; got a dimension with lengths x and 11", where x is unpredictable; the output above shows the case x = 0.
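
One way to confirm what is happening is to log the solver status inside the model whenever the solution is shorter than expected; a hypothetical debugging aid, placed just before the likelihood lines:

# Hypothetical debugging aid: report short solutions before they hit
# the broadcasts in the likelihood lines.
if length(inference_solution.t) != 11
    @warn "Short ODE solution" retcode = inference_solution.retcode npoints = length(inference_solution.t)
end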

The code was originally written no more than six months ago. I am using Julia v1.7.2 on a CentOS 7 HPC system with the following packages:

Status `~/.julia/environments/v1.7/Project.toml`
  [a93c6f00] DataFrames v1.3.4
  [0c46a032] DifferentialEquations v7.0.0
  [31c24e10] Distributions v0.25.60
  [ced4e74d] DistributionsAD v0.6.40
  [f6369f11] ForwardDiff v0.10.30
  [a98d9a8b] Interpolations v0.13.6
  [c7f686f2] MCMCChains v5.3.0
  [21216c6a] Preferences v1.3.0
  [fce5fe82] Turing v0.20.4
  [fdbf4ff8] XLSX v0.7.10
  [8ba89e20] Distributed
  [9a3f8284] Random

Please help me out; I am baffled by this error, which was completely absent until a week ago. :sweat:

Did the ODE solver print a warning saying that a solve was divergent, and did you handle the divergent trajectory as the docs mention?

https://sensitivity.sciml.ai/dev/training_tips/divergence/

I don’t see any check for whether the solution was successful.

Thank you for the suggestion, @ChrisRackauckas. I modified the code so that it displays the retcode when the error crops up.

Here is the output: the retcode is printed, followed by the solution object inference_solution. Note that this output comes from my main code rather than the MWE, so there are more parameters (some with subscripts) and a few other differences:

From worker 2:	retcode: Unstable
      From worker 2:	Interpolation: 1st order linear
      From worker 2:	t: 1-element Vector{Float64}:
      From worker 2:	 11.0
      From worker 2:	u: 1-element Vector{Vector{ForwardDiff.Dual{ForwardDiff.Tag{Turing.Essential.var"#f#4"{DynamicPPL.TypedVarInfo{…}, DynamicPPL.Model{typeof(delta_data_fitting!), …}, DynamicPPL.Sampler{NUTS{Turing.Essential.ForwardDiffAD{40}, (), AdvancedHMC.DiagEuclideanMetric}}, DynamicPPL.DefaultContext}, Float64}, Float64, 9}}}:

I find that different chains launched in parallel throw the exact same error at different points in time. So, correct me if I’m wrong, it looks like the sampled parameter values cause the ODE solve to diverge, as you rightly pointed out, after which further sampling of that chain is aborted. If my understanding is correct, could you tell me how to deal with this? Is there a way, for instance, to discard such samples and continue sampling, effectively making my inference code stable?

That’s what the linked tutorial shows how to do:

if tmp_sol.retcode == :Success
    # do normal thing
else
    # handle bad case
end

Look at the SDE case from the Turing tutorial:

    # Early exit if simulation could not be computed successfully.
    if predicted.retcode !== :Success
        Turing.@addlogprob! -Inf
        return nothing
    end

That’s how you make it throw away a sample.
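
Adapted to the MWE above, the guard sits directly after the solve call inside inference_function!; a minimal sketch (the rest of the model body stays unchanged):

# Right after solve() in inference_function!: reject the current sample
# if the solver diverged or any of the 11 save points is missing.
if inference_solution.retcode !== :Success || length(inference_solution.t) != 11
    Turing.@addlogprob! -Inf
    return nothing
end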

It worked. Thanks a lot! :slight_smile:

No problem. If you have any suggestions for how to make the tutorials and docs more clear, please let us know.
