Resume chains Turing

marcobonici · August 12, 2021, 5:46pm

Hi to everybody,
I want to use Turing on the lsf cluster of my university. Following the official documentation and using the ClusterManager package I was able to run Turing on my cluster.

using Turing
using Distributed
using ClusterManager

#add 20 processes on the long queue
ClusterManagers.addprocs_lsf(20; bsub_flags = `-q long`)

#loading Turing on the processes
@everywhere using Turing

#toy model found in the Turing documentation
@everywhere @model function gdemo(x)
           s² ~ InverseGamma(2, 3)
           m ~ Normal(0, sqrt(s²))

           for i in eachindex(x)
               x[i] ~ Normal(m, sqrt(s²))
           end
       end

@everywhere model = gdemo([1.5, 2.0])

#sampling
chains = sample(model, NUTS(), MCMCDistributed(), 1000, 10; save_state = true)

write("chain-file.jls", chains)

This worked: after generating the samples, by 10 processes on the cluster, the chains have been saved.
The next step would be to resume the chains from their state.

So, I tried this way

chains_reloaded = read("chain-file.jls", Chains)

#resume sampling
sample(model, NUTS(), MCMCDistributed(), 10000, 10; save_state = true, resume_from = chains_reloaded)

However, looks to me that this did not work. I obtained

Chains MCMC chain (10000×14×10 Array{Float64, 3}):

Iterations        = 1:1:10000
Number of chains  = 10
Samples per chain = 10000
Wall duration     = 1.85 seconds
Compute duration  = 5.34 seconds
parameters        = s², m
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size

Summary Statistics
  parameters      mean       std   naive_se      mcse          ess      rhat   ess_per_sec 
      Symbol   Float64   Float64    Float64   Float64      Float64   Float64       Float64 

          s²    2.0238    1.8881     0.0060    0.0102   35011.0145    1.0001     6556.3698
           m    1.1626    0.8159     0.0026    0.0037   46594.2097    1.0000     8725.5074

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5% 
      Symbol   Float64   Float64   Float64   Float64   Float64 

          s²    0.5652    1.0406    1.5241    2.3436    6.5242
           m   -0.4784    0.6858    1.1662    1.6391    2.7920

However, I expected to find 10 chains with 20000 samples…not 10000.
Is there anything I am doing wrong?

Cheers,
Marco

dlakelan · August 12, 2021, 7:37pm

I think you’ve asked for 10k samples starting from the end-point of the previous chain, and it gave you that, no? What were you expecting? that it would append your new samples and the original samples?

marcobonici · August 12, 2021, 7:51pm

How can I check if this are resumed from the previous chains? I would have preferred to have the new samples appended to the old ones. What is your suggestion? Manually merge the chains?

dlakelan · August 12, 2021, 11:32pm

Yes, I think just concatenate the chains. I’m not sure how you can check, have you tried comparing the first sample of the new chain to the last sample of the earlier chain?

marcobonici · August 24, 2021, 5:07pm

Sorry for the late answer.
I checked and I found out that all chains restarts from the last point of the first chains (!)… I am confused

dlakelan · August 24, 2021, 5:41pm

yeah, that’s what resume means right? start sampling where the first chains left off. Like if I drive to a nearby town, sleep overnight, and then resume my trip the next morning, it means I continue on from the town.

marcobonici · August 24, 2021, 5:56pm

Yeah, that’s true. But, if I am running more chains in parallel, I would expect that the i-th chain would resume from the last point of the i-th chain, not from the last point of the first chain.

Am I missing something?

dlakelan · August 24, 2021, 6:01pm

Oh, I see what you’re saying. I misunderstood.

Yes, I don’t know why the code doesn’t do that, but I think your work around will be to manually @spawn N threads and then resume from each single chain?

marcobonici · August 24, 2021, 6:48pm

I don’t know if the right solution is a workaround or rather looking inside Turing code and open an issue or a PR. This is not the best approach to resume chains, IMHO.

dlakelan · August 24, 2021, 10:43pm

You could certainly open a PR. Arguably it’s better to give the user control over where exactly they’re starting. You might for example have a chain that’s stuck and not want to restart that one. It’s not totally clear that there’s a single right answer.

cpfiffer · August 25, 2021, 1:19am

Yeah, this is not intended functionality and needs to be fixed. An issue would be most welcome.

dlakelan · October 1, 2021, 11:05pm

Here’s a related issue:

startch = sample(model,
                 MH(diagm([.002 for i in 1:npars])),1200; thinning=10, init_theta=op.values.array)

resumech = sample(model,MH(diagm([.002 for i in 1:npars])),1200;resume_from=startch)

The second line gives an error:

julia> resumech = sample(model,MH(diagm([.002 for i in 1:npars])),1200;resume_from=startch)
ERROR: type NamedTuple has no field model
Stacktrace:
 [1] getindex(t::NamedTuple{(:start_time, :stop_time), Tuple{Float64, Float64}}, i::Symbol)
   @ Base ./namedtuple.jl:127
 [2] resume(rng::Random._GLOBAL_RNG, chain::Chains{Float64, AxisArrays.AxisArray{Float64, 3, Array{Float64, 3}, Tuple{AxisArrays.Axis{:iter, StepRange{Int64, Int64}}, AxisArrays.Axis{:var, Vector{Symbol}}, AxisArrays.Axis{:chain, UnitRange{Int64}}}}, Missing, NamedTuple{(:parameters, :internals), Tuple{Vector{Symbol}, Vector{Symbol}}}, NamedTuple{(:start_time, :stop_time), Tuple{Float64, Float64}}}, args::Int64; progress::Bool, kwargs::Base.Pairs{Symbol, UnionAll, Tuple{Symbol}, NamedTuple{(:chain_type,), Tuple{UnionAll}}})
   @ Turing.Inference ~/.julia/packages/Turing/uMQmD/src/inference/Inference.jl:404
 [3] resume(chain::Chains{Float64, AxisArrays.AxisArray{Float64, 3, Array{Float64, 3}, Tuple{AxisArrays.Axis{:iter, StepRange{Int64, Int64}}, AxisArrays.Axis{:var, Vector{Symbol}}, AxisArrays.Axis{:chain, UnitRange{Int64}}}}, Missing, NamedTuple{(:parameters, :internals), Tuple{Vector{Symbol}, Vector{Symbol}}}, NamedTuple{(:start_time, :stop_time), Tuple{Float64, Float64}}}, args::Int64; kwargs::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:chain_type, :progress), Tuple{UnionAll, Bool}}})
   @ Turing.Inference ~/.julia/packages/Turing/uMQmD/src/inference/Inference.jl:396
 [4] sample(rng::Random._GLOBAL_RNG, model::DynamicPPL.Model{typeof(teamskill), (:Nteams, :orefteam, :drefteam, :m1, :m2, :m3, :m4, :week, :ns, :hometeam, :homepts, :awayteam, :awaypts), (), (), Tuple{Int64, Vector{Union{Nothing, Int64}}, Vector{Union{Nothing, Int64}}, Float64, Float64, Float64, Float64, Vector{Int64}, Vector{Bool}, Vector{Union{Nothing, Int64}}, Vector{Union{Missing, Int64}}, Vector{Union{Nothing, Int64}}, Vector{Union{Missing, Int64}}}, Tuple{}, DynamicPPL.DefaultContext}, sampler::DynamicPPL.Sampler{MH{(), RandomWalkProposal{false, ZeroMeanFullNormal{Tuple{Base.OneTo{Int64}}}}}}, N::Int64; chain_type::Type, resume_from::Chains{Float64, AxisArrays.AxisArray{Float64, 3, Array{Float64, 3}, Tuple{AxisArrays.Axis{:iter, StepRange{Int64, Int64}}, AxisArrays.Axis{:var, Vector{Symbol}}, AxisArrays.Axis{:chain, UnitRange{Int64}}}}, Missing, NamedTuple{(:parameters, :internals), Tuple{Vector{Symbol}, Vector{Symbol}}}, NamedTuple{(:start_time, :stop_time), Tuple{Float64, Float64}}}, progress::Bool, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Turing.Inference ~/.julia/packages/Turing/uMQmD/src/inference/Inference.jl:159
 [5] sample(rng::Random._GLOBAL_RNG, model::DynamicPPL.Model{typeof(teamskill), (:Nteams, :orefteam, :drefteam, :m1, :m2, :m3, :m4, :week, :ns, :hometeam, :homepts, :awayteam, :awaypts), (), (), Tuple{Int64, Vector{Union{Nothing, Int64}}, Vector{Union{Nothing, Int64}}, Float64, Float64, Float64, Float64, Vector{Int64}, Vector{Bool}, Vector{Union{Nothing, Int64}}, Vector{Union{Missing, Int64}}, Vector{Union{Nothing, Int64}}, Vector{Union{Missing, Int64}}}, Tuple{}, DynamicPPL.DefaultContext}, alg::MH{(), RandomWalkProposal{false, ZeroMeanFullNormal{Tuple{Base.OneTo{Int64}}}}}, N::Int64; kwargs::Base.Pairs{Symbol, Chains{Float64, AxisArrays.AxisArray{Float64, 3, Array{Float64, 3}, Tuple{AxisArrays.Axis{:iter, StepRange{Int64, Int64}}, AxisArrays.Axis{:var, Vector{Symbol}}, AxisArrays.Axis{:chain, UnitRange{Int64}}}}, Missing, NamedTuple{(:parameters, :internals), Tuple{Vector{Symbol}, Vector{Symbol}}}, NamedTuple{(:start_time, :stop_time), Tuple{Float64, Float64}}}, Tuple{Symbol}, NamedTuple{(:resume_from,), Tuple{Chains{Float64, AxisArrays.AxisArray{Float64, 3, Array{Float64, 3}, Tuple{AxisArrays.Axis{:iter, StepRange{Int64, Int64}}, AxisArrays.Axis{:var, Vector{Symbol}}, AxisArrays.Axis{:chain, UnitRange{Int64}}}}, Missing, NamedTuple{(:parameters, :internals), Tuple{Vector{Symbol}, Vector{Symbol}}}, NamedTuple{(:start_time, :stop_time), Tuple{Float64, Float64}}}}}})
   @ Turing.Inference ~/.julia/packages/Turing/uMQmD/src/inference/Inference.jl:142
 [6] sample(model::DynamicPPL.Model{typeof(teamskill), (:Nteams, :orefteam, :drefteam, :m1, :m2, :m3, :m4, :week, :ns, :hometeam, :homepts, :awayteam, :awaypts), (), (), Tuple{Int64, Vector{Union{Nothing, Int64}}, Vector{Union{Nothing, Int64}}, Float64, Float64, Float64, Float64, Vector{Int64}, Vector{Bool}, Vector{Union{Nothing, Int64}}, Vector{Union{Missing, Int64}}, Vector{Union{Nothing, Int64}}, Vector{Union{Missing, Int64}}}, Tuple{}, DynamicPPL.DefaultContext}, alg::MH{(), RandomWalkProposal{false, ZeroMeanFullNormal{Tuple{Base.OneTo{Int64}}}}}, N::Int64; kwargs::Base.Pairs{Symbol, Chains{Float64, AxisArrays.AxisArray{Float64, 3, Array{Float64, 3}, Tuple{AxisArrays.Axis{:iter, StepRange{Int64, Int64}}, AxisArrays.Axis{:var, Vector{Symbol}}, AxisArrays.Axis{:chain, UnitRange{Int64}}}}, Missing, NamedTuple{(:parameters, :internals), Tuple{Vector{Symbol}, Vector{Symbol}}}, NamedTuple{(:start_time, :stop_time), Tuple{Float64, Float64}}}, Tuple{Symbol}, NamedTuple{(:resume_from,), Tuple{Chains{Float64, AxisArrays.AxisArray{Float64, 3, Array{Float64, 3}, Tuple{AxisArrays.Axis{:iter, StepRange{Int64, Int64}}, AxisArrays.Axis{:var, Vector{Symbol}}, AxisArrays.Axis{:chain, UnitRange{Int64}}}}, Missing, NamedTuple{(:parameters, :internals), Tuple{Vector{Symbol}, Vector{Symbol}}}, NamedTuple{(:start_time, :stop_time), Tuple{Float64, Float64}}}}}})

Sahil_Khan · June 9, 2023, 8:13am

did you find any way to do this?

Topic		Replies	Views
Function for resuming mcmc chain Probabilistic programming question , turing , mcmc	13	851	September 12, 2024
How to save to disk my results so far? New to Julia question , turing	9	1931	May 8, 2020
Continuing previous chain in turing New to Julia turing	0	215	June 8, 2023
Setting up multiple chains in Turing.jl General Usage	23	2792	March 7, 2019
Turing.jl run not reproducible! Probabilistic programming turing , reproducibility	17	2547	October 10, 2022

Resume chains Turing

Related topics