Turing: help with slow model

So I’m making progress with implementing a bunch of models in Turing, but one I’ve had problems with is the following:

using Turing, StatsPlots

sᵇ = [15,11,15,14,15,18,16,16,18,16,15,13,18,12,11,13,17,18,16,11,17,18,12,18,18,14,21,18,17,10,11,12,16,18,17,15,19,12,21,15,16,20,15,19,16,16,14,18,16,19,17,11,19,18,16,16,11,19,18,12,15,18,20, 8,12,19,16,16,16,12,18,17,11,20]
nᵇ = 21
sⁿ = [15,12,14,15,13,14,10,17,13,16,16,10,15,15,10,14,17,18,19,12,19,18,10,18,16,13,15,20,13,15,13,14,19,19,19,18,13,12,19,16,14,17,15,16,15,16,13,15,14,19,12,11,17,13,18,13,13,19,18,13,13,16,18,14,14,17,12,12,16,14,16,18,13,13]
nⁿ = 21
@assert length(sᵇ) == length(sⁿ)

Φ(x) = cdf(Normal(0, 1), x)

@model function model(sᵇ, nᵇ, sⁿ, nⁿ)
    N = length(sᵇ)
    μ ~ truncated(Normal(0, 1), 0, Inf)
    σ ~ Uniform(0, 10)
    δ ~ truncated(Normal(0, 1), 0, Inf)
    σα ~ Uniform(0, 10)
    μα = δ * σα

    α = Vector(undef, N)
    ϕⁿ = Vector(undef, N)
    ϕᵇ = Vector(undef, N)
    θⁿ = Vector(undef, N)
    θᵇ = Vector(undef, N)
    for i in eachindex(sᵇ)
        α[i] ~ Normal(μα, σα)
        ϕⁿ[i] ~ Normal(μ, σ)
        ϕᵇ[i] = ϕⁿ[i] + α[i]
        θⁿ[i] = Φ(ϕⁿ[i])
        θᵇ[i] = Φ(ϕᵇ[i])
        # likelihood
        sⁿ[i] ~ Binomial(nⁿ, θⁿ[i])
        sᵇ[i] ~ Binomial(nᵇ, θᵇ[i])
    end
end

# sample from prior
prior_chain = sample(model(sᵇ, nᵇ, sⁿ, nⁿ), Prior(), 5000)
# sample from posterior
posterior_chain = sample(model(sᵇ, nᵇ, sⁿ, nⁿ), HMC(0.01, 10), 5000)

Inferences about the parameter of interest, δ, seem roughly correct compared to what I’m expecting.

The main problem is that sampling is extremely slow and sometimes seemingly stalls. I’m not sure if it’s because defining a function as Φ(x) = cdf(Normal(0, 1), x) is a sub-optimal way of doing it, or because I’m not being specific enough about the types of the deterministic variables. Any insights would be appreciated.

Would a LogitNormal distribution help in your case?
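
Something like this, as a minimal sketch (θ, s and n are generic placeholders, the parameter values are illustrative, and it changes the parameterisation, so it isn’t a drop-in replacement for the probit setup):

θ ~ LogitNormal(0, 1)   # the success probability lives in (0, 1) by construction
s ~ Binomial(n, θ)      # no explicit Φ transform needed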

As you suspected, the vectors are not type stable:

julia> Vector(undef,1)
1-element Vector{Any}:

I have not tested this code, but it might get you on the right track:

@model function model(sᵇ, nᵇ, sⁿ, nⁿ)
    N = length(sᵇ)
    μ ~ truncated(Normal(0, 1), 0, Inf)
    σ ~ Uniform(0, 10)
    δ ~ truncated(Normal(0, 1), 0, Inf)
    σα ~ Uniform(0, 10)
    μα = δ * σα

    T = typeof(μ)

    α = Vector{T}(undef, N)
    ϕⁿ = Vector{T}(undef, N)
    ϕᵇ = Vector{T}(undef, N)
    θⁿ = Vector{T}(undef, N)
    θᵇ = Vector{T}(undef, N)
    for i in eachindex(sᵇ)
        α[i] ~ Normal(μα, σα)
        ϕⁿ[i] ~ Normal(μ, σ)
        ϕᵇ[i] = ϕⁿ[i] + α[i]
        θⁿ[i] = Φ(ϕⁿ[i])
        θᵇ[i] = Φ(ϕᵇ[i])
        # likelihood
        sⁿ[i] ~ Binomial(nⁿ, θⁿ[i])
        sᵇ[i] ~ Binomial(nᵇ, θᵇ[i])
    end
end

T is defined in terms of your mu parameter, which should allow it to be a Float64 or a Dual as needed.
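
For illustration (a quick REPL check, not from the thread), a concretely typed buffer no longer has element type Any:

julia> eltype(Vector{Float64}(undef, 1))
Float64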

It looks like you have more parameters than data points, which makes me wonder whether your model is identifiable. That could cause the sampler to get stuck or produce anomalous results. Are the posterior distributions highly correlated?

Thanks for the tips so far. Will update soon. But the model should be identifiable - this is an example from a textbook that I’m translating. More soon…

Good to hear. The prior on mu, sigma, and delta must constrain the lower-level parameters sufficiently. Since the number of parameters is large, you might consider using reverse mode AD. Unfortunately, it’s still a bit slow in Julia compared to Stan, but it should give you a speed up. To use reverse mode AD, add the following to your code:

using ReverseDiff

and

Turing.setadbackend(:reversediff)

in addition to the other changes. Vectorized expressions as opposed to loops might be faster for ReverseDiff.

I was able to get a 3X speed up with the following code:

using Turing, StatsPlots, ReverseDiff

sᵇ = [15,11,15,14,15,18,16,16,18,16,15,13,18,12,11,13,17,18,16,11,17,18,12,18,18,14,21,18,17,10,11,12,16,18,17,15,19,12,21,15,16,20,15,19,16,16,14,18,16,19,17,11,19,18,16,16,11,19,18,12,15,18,20, 8,12,19,16,16,16,12,18,17,11,20]
nᵇ = 21
sⁿ = [15,12,14,15,13,14,10,17,13,16,16,10,15,15,10,14,17,18,19,12,19,18,10,18,16,13,15,20,13,15,13,14,19,19,19,18,13,12,19,16,14,17,15,16,15,16,13,15,14,19,12,11,17,13,18,13,13,19,18,13,13,16,18,14,14,17,12,12,16,14,16,18,13,13]
nⁿ = 21


@assert length(sᵇ) == length(sⁿ)

Φ(x) = cdf(Normal(0, 1), x)

@model function model(sᵇ, nᵇ, sⁿ, nⁿ)
    N = length(sᵇ)
    μ ~ truncated(Normal(0, 1), 0, Inf)
    σ ~ Uniform(0, 10)
    δ ~ truncated(Normal(0, 1), 0, Inf)
    σα ~ Uniform(0, 10)
    μα = δ * σα

    α ~ filldist(Normal(μα, σα), N)
    ϕⁿ ~ filldist(Normal(μ, σ), N)
    ϕᵇ = ϕⁿ .+ α

    θⁿ = Φ.(ϕⁿ)
    θᵇ = Φ.(ϕᵇ)

    # likelihood
    @. sⁿ ~ Binomial(nⁿ, θⁿ)
    @. sᵇ ~ Binomial(nᵇ, θᵇ)
end
Turing.setadbackend(:reversediff)

# sample from prior
prior_chain = sample(model(sᵇ, nᵇ, sⁿ, nⁿ), Prior(), 5000)
# sample from posterior
@time posterior_chain = sample(model(sᵇ, nᵇ, sⁿ, nⁿ), NUTS(500,.65), 5000)

Please verify that I vectorized it correctly. Also, you might need to verify that the likelihood is computed for both sᵇ and sⁿ. I never tried two data sets simultaneously.
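
One way to see what the two likelihood contributions should look like numerically is to compute them by hand with plain Distributions calls (a sketch with made-up θ values, just to check that both data sets contribute):

θⁿ_test = fill(0.7, length(sⁿ))   # hypothetical fixed success probabilities
θᵇ_test = fill(0.8, length(sᵇ))
ll_n = sum(logpdf.(Binomial.(nⁿ, θⁿ_test), sⁿ))
ll_b = sum(logpdf.(Binomial.(nᵇ, θᵇ_test), sᵇ))
ll_n + ll_b   # total log likelihood across both data sets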

Identifiability

I’ve run the original JAGS code (from the book examples I am converting) and this is what I get for delta. It matches up with what’s in the book, so I think I can cross identifiability issues off the list.
(screenshot: posterior distribution of δ from the JAGS run)

Sampling from the prior

Samples from the prior match up just fine. It’s pretty slow, but perhaps that’s because it’s sampling from vectors of α and ϕⁿ in the for loop? Timing sampling from the prior with @time sample(model(sᵇ, nᵇ, sⁿ, nⁿ), Prior(), 10_000); shows

11.531925 seconds (117.04 M allocations: 7.192 GiB, 26.96% gc time)

I’m still new to Julia, but that seems like a LOT of memory allocations. My goal was to keep the models as simple and interpretable as possible, so I was hoping to keep the loops… but I guess when it comes to severe issues it’s worth doing some mechanical tinkering. So I’ll explore vectorisation.

Sampling from the posterior

So I think memory is killing this. If you ask for just 500 samples with @time sample(model(sᵇ, nᵇ, sⁿ, nⁿ), HMC(0.01, 10), 500); you get

35.653252 seconds (341.23 M allocations: 20.948 GiB, 9.47% gc time)

Asking for much more than 5000 samples hangs my machine, so perhaps it’s a memory issue.

I’ll try out your vectorised version.

I’m still somewhat surprised that this seemed to work just fine with JAGS.

Yeah. I think using the vectorized code above will help. On my machine, sampling from the prior takes about 2 seconds and sampling from the posterior about 150 seconds. Reverse mode is kind of a pain point currently in Julia. How does the timing compare to JAGS (also note that you may not need as many samples with NUTS to achieve the same effective sample size as in JAGS)?

Hi,
It’s generally advisable to write the code in a more vectorised form if you use HMC or any other gradient-based inference algorithm.

You can for example write your models as follows:

using StatsFuns: normcdf  # normcdf (the standard normal CDF) comes from StatsFuns

@model function model2(sᵇ, nᵇ, sⁿ, nⁿ, ::Type{T} = Float64) where {T}
    N = length(sᵇ)
    μ ~ truncated(Normal(0, 1), 0, Inf)
    σ ~ Uniform(0, 10)
    δ ~ truncated(Normal(0, 1), 0, Inf)
    σα ~ Uniform(0, 10)

    α ~ filldist(Normal(δ * σα, σα), N)
    ϕⁿ ~ filldist(Normal(μ, σ), N)
    
    θⁿ = zero(T)
    θᵇ = zero(T)

    for i in eachindex(sᵇ)

        θⁿ = normcdf(ϕⁿ[i])
        θᵇ = normcdf(ϕⁿ[i] + α[i])

        # likelihood
        sⁿ[i] ~ Binomial(nⁿ, θⁿ)
        sᵇ[i] ~ Binomial(nᵇ, θᵇ)
    end
end

In your model it is also advisable to use reverse mode AD, given the number of parameters. You can do this, for example, as follows:

using ReverseDiff, Memoization
Turing.setadbackend(:reversediff)

Running inference with HMC for 1000 iterations now takes about 10.891953 seconds (125.91 M allocations: 5.309 GiB, 6.99% gc time) on my MacBook.

Here, filldist is similar in semantics to the Julia fill function. If the first argument is a Normal distribution, filldist will construct an MvNormal; in other cases the behaviour is different.
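
Roughly, as a sketch of the analogy (exact return types may differ across Turing/DistributionsAD versions):

fill(Normal(0, 1), 3)       # a plain 3-element array of independent Normals
filldist(Normal(0, 1), 3)   # a single 3-dimensional distribution
                            # (an MvNormal when the element is a Normal)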

There is also arraydist, which allows you to create a list or array of different distributions. For example:

μ ~ arraydist(map(m -> Normal(m, 1), 0:10))

In contrast to filldist, arraydist is really just a shorthand notation and doesn’t do any more efficient handling than writing it out yourself, at least at the moment.

@trappmartin, thank you for your help. Would you mind clarifying which parts of the code should be vectorized? In your version of the model, you used vectorization when initializing the parameters, but used the loop to compute the likelihood and fill in some of the parameter values. In my version, I vectorized all of the code, but it runs twice as slow. Is that a general finding?

JAGS takes about 5 seconds to generate 10,000 samples, which is pretty cool. Overall I’ve found that JAGS is pretty robust and the end user doesn’t have to think much about the inference side of things. The downside of that is that very often you have to come up with kludgy mechanistic fixes in your model specification. Turing seems much neater and simpler in the model specification, but a bit more thought has to go into the sampling/inference side of things.

Running your vectorised code and asking for 10,000 samples took 204 seconds. But yes, maybe the ESS is better than in JAGS, so I don’t need as many samples.

The results seem spot on, although my plotting skills in Julia are not up to scratch yet. This is based on a histogram, whereas the JAGS plot in R is based on a polynomial fit to a histogram. So yes, maybe I don’t need to be asking for 10,000 samples here.
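
For what it’s worth, a rough sketch of plotting just δ with the MCMCChains/StatsPlots recipes (subsetting first, since the chain also carries all the α and ϕⁿ entries):

plot(posterior_chain[:, [:δ], :])   # trace plot plus a smoothed density for δ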

Thanks. I think I will go and revisit some of the other models I’ve converted, which apparently work just fine. I’ll see if vectorising can make them much faster. Maybe vectorising all the models won’t actually be a bad tradeoff.

I’ll certainly explore vectorisation much more.

I think you might be able to obtain a reasonable effective sample size with 2,000 or even 1,000 samples after warmup with NUTS. @trappmartin’s version requires about 20 seconds on my machine with 1,000 samples. That might be closer to what you are finding in JAGS. Hopefully, more improvements will be made with reverse-mode AD.
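
To check whether fewer samples are enough, the chain summary reports effective sample sizes, e.g.:

summarystats(posterior_chain)   # the summary table includes an ess column per parameter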

By the way, I enjoyed your paper on muddled units. The idea sort of blew my mind because I never really considered how units combine in models. Do you have any recommendations for a book or paper that explains how to check for dimensional consistency? I suspect this is a ubiquitous problem.

Thanks very much! I don’t have anything specific, but there are some pretty nice guides (also YouTube videos) on Dimensional Analysis. I don’t pretend to be an expert on it, but it was pretty cool to find a specific case where it was a real problem. I’m keeping my eye out for more examples now.

Ok, so a kind of summary…

  • JAGS is somewhat magical in its speed for this particular model. I am pretty impressed with how robust it is to various model types.

Using the vectorised + loop model suggested by @trappmartin takes:

  • ~10 seconds for 10,000 samples from the prior. Although, since delta is the key variable of interest, you could just define a much simpler model for that (see the sketch after this list).
  • ~103 seconds for 10,000 samples from the posterior, using HMC. That’s not terrible to be honest.
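
As a sketch of that simpler-model idea (not code from the thread): δ’s prior doesn’t depend on the data, so its prior draws can be generated directly:

δ_prior_draws = rand(truncated(Normal(0, 1), 0, Inf), 10_000)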

Lessons:

  • don’t be so scared of vectorisation
  • the density plots from the textbook are good partly because they use the polspline library for plotting, rather than calculating tons of samples and using a histogram
  • need to learn how to make use of my 8 cores and run parallel chains

Thanks for the suggestions 🙂

Thanks for the leads. I look forward to reading more about this topic!

This should be easy to do in Turing. Just pass MCMCThreads() and the number of chains as follows:

sample(model(sᵇ, nᵇ, sⁿ, nⁿ), NUTS(500,.65), MCMCThreads(), 1000, n_chains)

Aside from that, there is no special setup required.
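
A small usage sketch (assuming Julia itself was launched with multiple threads, e.g. julia --threads 8, and with n_chains as an illustrative placeholder):

Threads.nthreads()   # how many threads Julia was started with
n_chains = 4
chains = sample(model(sᵇ, nᵇ, sⁿ, nⁿ), NUTS(500, 0.65), MCMCThreads(), 1000, n_chains)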

By the way, would you be willing to share a repo of the models that you are translating? That would certainly be helpful for other users. The Turing dev team would probably include a link to it in the documentation.

Of course. My plan is to make the repo public, either when it’s 100% ready or when it’s nearly ready. I’ll certainly be asking a few more questions here first, though.

Yes, that is often the case. The most efficient approach is to write out the log likelihood explicitly and increment the log joint used by Turing. If this is still slow, you can write the log joint by hand and plug it directly into AdvancedHMC. That way you can do all kinds of tricks and optimisations yourself. But it’s often sufficient to simply write the model in a slightly more vectorised form and use the best-suited AD backend.
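
For completeness, a rough sketch of what incrementing the log joint can look like inside a Turing model. It reuses the Φ helper defined earlier in the thread and the @addlogprob! macro, which may not exist in older Turing versions; it is not code from this thread and is untested:

@model function model3(sᵇ, nᵇ, sⁿ, nⁿ)
    N = length(sᵇ)
    μ ~ truncated(Normal(0, 1), 0, Inf)
    σ ~ Uniform(0, 10)
    δ ~ truncated(Normal(0, 1), 0, Inf)
    σα ~ Uniform(0, 10)

    α ~ filldist(Normal(δ * σα, σα), N)
    ϕⁿ ~ filldist(Normal(μ, σ), N)

    θⁿ = Φ.(ϕⁿ)
    θᵇ = Φ.(ϕⁿ .+ α)

    # add the binomial log likelihood for both data sets directly to the log joint
    Turing.@addlogprob! sum(logpdf.(Binomial.(nⁿ, θⁿ), sⁿ)) + sum(logpdf.(Binomial.(nᵇ, θᵇ), sᵇ))
end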