Making Turing Fast with large numbers of parameters?

Ok, I did that.

https://github.com/TuringLang/Turing.jl/issues/1722

Ok, it turns out you need to use init_params, not init_theta; the docs are out of date… But I’ve changed the issue to be more generally about the lack of sampling support for truncated normals.
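
For later readers, the working call looks roughly like this (m and init_vals are placeholder names, not from the issue):

chn = sample(m, NUTS(0.65), 1_000; init_params = init_vals)  # init_params, not init_theta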

1 Like

FWIW, I really like this proposal. Please open an issue to discuss with the Turing team.

3 Likes

On my set-up (Julia v1.6.1, latest update of all packages), I get two error messages from this:

  1. UndefVarError: setrdcache not defined
  2. AD method:
MethodError: no method matching _setadbackend(::Val{:reversediff})
Closest candidates are:
  _setadbackend(::Val{:forwarddiff}) at C:\Users\User_name\.julia\packages\Turing\YGtAo\src\core\ad.jl:12
  _setadbackend(::Val{:tracker}) at C:\Users\User_name\.julia\packages\Turing\YGtAo\src\core\ad.jl:16
  _setadbackend(::Val{:zygote})

I have used forwarddiff – that works.

Question: is there an advantage to using reversediff? (But Turing doesn’t find it on my computer…)
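
For reference, the backend switch itself looks like this (setrdcache needs a newer Turing version than the one in the error paths above, plus Memoization loaded):

using Turing, ReverseDiff, Memoization
Turing.setadbackend(:reversediff)  # the MethodError above suggests this Turing build lacks the backend
Turing.setrdcache(true)            # cache compiled tapes between gradient calls

As for the advantage: forward-mode AD cost grows with the number of parameters, while reverse-mode cost grows with the number of outputs (here a single log density), so reverse-mode tends to win for models with many parameters.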

I have this data model:

for i = 1:Nd
    data[:,i] ~ MvNormal(SOL[:,i], sqrt(V))
end

Four comments:

  1. SOL is an array of the same size as data; SOL is the result of solving a DAE.
  2. MvNormal interprets a scalar second argument m (m = sqrt(V) in my example above) as a standard deviation and turns it into the covariance matrix m^2*I. I have tried to run Turing with sqrt(V) replaced by V*diagm(ones(nd)), but then it crashes. (See the sketch after this list.)
  3. I’m not a statistics expert, but I’ve read somewhere that it is common to assume that the variance has an inverse gamma distribution. If I did not take the square root of V as above, wouldn’t that mean the standard deviation is inverse gamma distributed instead? (Perhaps not a big deal?)
  4. I have four outputs in my case. I’m not sure how reasonable it is to assume the same variance on all outputs.
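
To make comments 2 and 3 concrete, here is a minimal sketch (mine, untested against the real DAE model; SOL stands in for the DAE solution and is passed in so the example is self-contained):

using Turing, Distributions

@model noise_model(data, SOL) = begin
    # Inverse gamma prior on the *variance* V (shape and scale are illustrative):
    V ~ InverseGamma(2, 3)
    for i = 1:size(data, 2)
        # A scalar second argument is a standard deviation, so passing sqrt(V)
        # gives covariance V*I, and the inverse gamma prior sits on the variance:
        data[:, i] ~ MvNormal(SOL[:, i], sqrt(V))
    end
end

Dropping the sqrt would instead put the inverse gamma prior on the standard deviation.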

So – running Turing is really slow… With a DAE of 6 descriptors (differential + algebraic variables) and some 600 data points, I try to estimate all 6 initial descriptors (perhaps not a good idea, since 3 of them are constrained by the algebraic equations) + 6 model parameters. Running 1000 iterations in Turing takes some 5 hours on my workstation.

Could I write the data model as:

 data ~ MvNormal(SOL, sqrt(V))

Would that correctly draw random values for V?
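
A hedged aside, added for clarity: MvNormal expects a vector mean, so the matrix SOL can’t be passed directly. Under the same independent-columns, shared-variance assumption, the loop above is equivalent to flattening both arrays:

vec(data) ~ MvNormal(vec(SOL), sqrt(V))

Either way, V is drawn from its prior; the flattened form just evaluates the same likelihood in one call.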

@BLI please read the relevant docs section, Automatic Differentiation, and start a separate discussion thread, so as not to add too many unrelated questions to this already quite long discussion thread.

1 Like

I’ll do that (although I think my data model question is related to the above discussion… but I agree, it is already a long discussion).

1 Like

Haha, my record for a single thread is 3231 posts and counting over on the OpenWrt forum so to me it feels like we are just getting started :sweat_smile:

1 Like

I assume you are solving your DAE at each transition, and if you are using NUTS it’s taking the gradient of the solution with respect to the parameters and doing HMC on the DAE, so I think this is just going to be slow. 5 hours sounds fantastic; I fit a bunch of ODEs via diffusive MCMC in R back in 2013 or so, and it took me weeks.

Not to say that you shouldn’t try to optimize it, just to have some realistic idea of what is possible.

1 Like

Seems like only :forwarddiff works. I’ve tried options other than NUTS (e.g., SMC), but the results are poor.

Done! https://github.com/TuringLang/Turing.jl/issues/1723

3 Likes

Ok, I’ve tried using a broadcasted Gamma in my “real” model, and I am able to optimize the model, find an initial position, run a few thousand steps of MH() sampling, take the final point, and hand it as init_params to NUTS… at that point it crashes with:

ERROR: LoadError: ArgumentError: Gamma: the condition α > zero(α) && θ > zero(θ) is not satisfied.
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/Distributions/b3JDl/src/utils.jl:6 [inlined]
  [2] #Gamma#114
    @ ~/.julia/packages/Distributions/b3JDl/src/univariate/continuous/gamma.jl:34 [inlined]
  [3] Gamma(α::Float64, θ::Float64)
    @ Distributions ~/.julia/packages/Distributions/b3JDl/src/univariate/continuous/gamma.jl:34
  [4] #16
    @ ./broadcast.jl:383 [inlined]
  [5] #10
    @ ./broadcast.jl:329 [inlined]
  [6] _broadcast_getindex_evalf
    @ ./broadcast.jl:670 [inlined]
  [7] _broadcast_getindex
    @ ./broadcast.jl:643 [inlined]
  [8] getindex
    @ ./broadcast.jl:597 [inlined]

In the code I have

@infiltrate ...

and I check the condition that the values I’m passing aren’t less than or equal to zero. Infiltrator does NOT trigger, so in theory I’m not actually passing anything invalid to Gamma.

WTH?

I realize this is impossible to debug so I’ll spend some time trying to trigger it in the MWE
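
For reference, the warm-start pipeline I mean is roughly this (m is a placeholder for the model; the chain indexing is illustrative rather than exact code from my script):

using Turing, Optim

map_est = optimize(m, MAP())                          # find a starting mode
init    = map_est.values.array                        # its parameter vector
chn_mh  = sample(m, MH(), 5_000; init_params = init)  # cheap MH exploration
θ_last  = Array(chn_mh)[end, :]                       # final MH draw
chn     = sample(m, NUTS(0.65), 1_000; init_params = θ_last)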

Sorry for the noise, this may have been me doing something stupid. I’ve got it working now.

I like greedy programmers.

Check that your derivatives work. With DAEProblem there are still some edge cases. That’s not Turing’s problem.
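
A generic sanity check (my sketch, with a stand-in loss; in practice loss(p) would wrap the DAE solve) is to compare AD against finite differences:

using ForwardDiff, FiniteDiff

loss(p) = sum(abs2, p)  # stand-in for a DAE-based loss
p0 = [1.0, 2.0]
g_ad = ForwardDiff.gradient(loss, p0)
g_fd = FiniteDiff.finite_difference_gradient(loss, p0)
@assert isapprox(g_ad, g_fd; rtol = 1e-3)  # a large mismatch means broken derivatives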

1 Like

Yeah, I didn’t actually mean to be discouraging here. Maybe just a little jealousy after spending literally weeks fitting ODEs back then.

Also, I should say that I found that low-precision fitting methods make more sense in the context of Bayesian model fitting. Bayesian models treat model error, measurement error, and everything else as one “probability of error”, which can absorb numerical imprecision just fine as long as there’s no massive divergence in behavior.
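
Concretely (my illustration, not from those fitting runs), that just means loosening the solver tolerances inside the model, e.g.:

using OrdinaryDiffEq

f(u, p, t) = -p[1] * u                        # toy ODE for illustration
prob = ODEProblem(f, 1.0, (0.0, 1.0), [2.0])
sol = solve(prob, Tsit5(); abstol = 1e-4, reltol = 1e-4)  # loose tolerances

The looser tolerances trade numerical precision for speed, and the Bayesian error model can usually absorb the imprecision.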

Hm… I’ll open a new thread on DAEs and Turing… the results are somewhat strange, and I’m not 100% sure as to why.

1 Like

I hope this is related to this topic, because it concerns performance issues with models that have large numbers of parameters.
While developing a package for item response models (see e.g. Item response models (Stan User’s Guide)), I found that even the simplest model (the 1PL or Rasch model) needs about 6 to 9 minutes in Turing. In contrast, Stan needs 10 to 20 seconds for the same model and the same sampling arguments (i.e. chain length, …). I was hoping to find some answers in this topic to solve the performance issues, but have not been successful yet.

Here is the full code with comments on benchmarks (btw packages are all up to date):

using LazyArrays
using ReverseDiff, Memoization, Distributions, Turing
using CSV, DataFrames, CategoricalArrays
import Turing: StatsFuns
auto_lazy_arraydist(x) = arraydist(BroadcastArray(xi -> xi, x))  # lazy product distribution over an array of distributions

Turing.setadbackend(:reversediff)
Turing.setrdcache(true)

url = "https://raw.githubusercontent.com/t-alfers/Rasch-Turing/master/dichotom.txt"
data = CSV.read(download(url), DataFrame, delim = "\t")
data.person = collect(1:nrow(data));
data_responses = data[:, Not([:Anger, :Sex])];

data_long = DataFrames.stack(data_responses, Not(:person), :person);

data_long.variable = CategoricalArray(data_long.variable)
levels!(data_long.variable, unique(data_long.variable))
data_long.variable = levelcode.(data_long.variable)

@model rasch(y, ii, jj) = begin
  I = maximum(ii);
  J = maximum(jj);

  # prior item parameter
  β ~ filldist(Normal(0, 1), I);
  δ ~ Normal(0.75, 1);

  # prior person parameter
  θ ~ filldist(Normal(0, 1), J);

  # target distribution
  ψ = view(θ, jj) .- view(β, ii) .+ δ

  # 495.154840 seconds (18.11 M allocations: 21.605 GiB, 0.12% gc time)
  # y .~ BernoulliLogit.(ψ)

  # 449.327347 seconds (18.10 M allocations: 21.216 GiB, 0.12% gc time)
  # y ~ arraydist(BernoulliLogit.(ψ))

  # 376.004346 seconds (18.30 M allocations: 22.736 GiB, 0.31% gc time)
  # y ~ auto_lazy_arraydist(BernoulliLogit.(ψ))

  # 351.287136 seconds (17.95 M allocations: 21.385 GiB, 0.18% gc time)
  y ~ arraydist(LazyArray(@~ BernoulliLogit.(ψ)))
end

chn = sample(rasch(data_long.value, data_long.variable, data_long.person),
             NUTS(1000, 0.8), 2000, progress = true)

1 Like

Definitely on topic. When you run this model and get the timing, are you doing it more than once? Compilation time will be included in the first run, whereas Stan puts that compilation time up front.
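
For example, timing it twice separates the two (reusing the model from your post):

model = rasch(data_long.value, data_long.variable, data_long.person)
@time sample(model, NUTS(1000, 0.8), 2000)  # first run includes compilation
@time sample(model, NUTS(1000, 0.8), 2000)  # second run is sampling time only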

The times stated in the comments are all from second runs.

try

y ~ arraydist(LazyArray(@~ BernoulliLogit.(θ[jj] .- β[ii] .+ δ)))

Thanks for your reply. Already tried it - but sadly no performance benefit:

# 470.161665 seconds (17.60 M allocations: 4.395 GiB, 0.07% gc time)
y ~ arraydist(LazyArray(@~ BernoulliLogit.(θ[jj] .- β[ii] .+ δ)))