Turing: help speed up state space model

Hello all.

I am trying to use Turing to run a state space model based on Simon Jackman’s Bayesian Analysis for the Social Sciences. In chapter 9, he sets out a few different latent variable models, including a state space model for pooling for the polls. Basically, there is an underlying “true” voting intention that we are trying to estimate from a bunch of noisy polls. By forcing the model start and end with actual election results, we can estimate the house effects, which may or may not sum to zero. There a good case study of this model done using Stan by Peter Ellis.

I’ve run this model using Stan, and it takes about 7 minutes on my laptop. However, when I run it using Turing, it takes about 3 hours using NUTS. I suspect the reason has to do with using a for loop. Other samplers didn’t really give me an improvement in performance.

If anyone has any ideas on how to seed up the model , please let me know.

Code:

using Plots, StatsPlots
using DataFrames, CSV
using Turing, ReverseDiff, Memoization
using RDatasets
using StatsFuns
using Dates


# Helper function
function calc_moe(x, ss)
    return  sqrt(x * (1-x) / ss)
end



# load and clean data
df = RDatasets.dataset("pscl", "AustralianElectionPolling")

dateformat = DateFormat("y-m-d")
Election2004 = Date("2004-10-09")
Election2007 = Date("2007-11-24")
N_days = Dates.value(Election2007 - Election2004) + 1

df.NumDays = df.EndDate .- Election2004
df.NumDays = Dates.value.(df.NumDays) .+ 1

df.ALPProp = df.ALP ./ 100
df.Poll_ID = [1:size(df, 1);]

pollster_dict = Dict(key => idx for (idx, key) in enumerate(unique(df.Org)))
df.pollster_id = [pollster_dict[i] for i in df.Org]


# Plot polls
scatter((df.NumDays, df.ALPProp), group = df.Org)


# Define model
@model function state_space(y, y_moe, start_election, end_election, poll_date, poll_id, N_days, N_pollsters, pollster_id, ::Type{T} = Float64) where {T}

    # priors
    ξ = Vector{T}(undef, N_days)
    z_ξ ~ filldist(Normal(0, 1), (N_days-2))

    δ ~ filldist(Normal(0, 0.05), N_pollsters)

    σ ~ Exponential(1/5)
    
    # daily random walk
    ξ[1] = start_election
    ξ[N_days] = end_election

    for i in 2:(N_days - 1)
        ξ[i] = ξ[i-1] + σ * z_ξ[i-1]
    end

    # Measurement model
    y .~ Normal.(ξ[poll_date[poll_id]] .+ δ[pollster_id], y_moe[poll_id])

    return ξ

end


N_polls = size(df, 1)
N_pollsters = length(unique(df.pollster_id))
y_moe = calc_moe.(df.ALPProp, df.SampleSize)
y = df.ALPProp
start_election = .3764
end_election = .4338
poll_date = df.NumDays
poll_id = df.Poll_ID
pollster_id = df.pollster_id

n_adapt = 1_000
n_iter = 2_000
n_chains = 4

Turing.setadbackend(:reversediff)  # Zygote and Tracker do not work
model = state_space(y, y_moe, start_election, end_election, poll_date, poll_id, N_days, N_pollsters, pollster_id)
chn = sample(model, NUTS(n_adapt, .8), MCMCThreads(), n_iter, n_threads)


Thanks!

EDIT: Actually used an Exponential(1/5) distribution, as in julia it is the inverse of Stan.

There may be a few ways to speed up the model itself, but the biggest speed-up will come from calling Turing.setrdcache(true). With just this change, on my machine, Turing sampled 4 chains in 13 minutes. See Automatic Differentiation for a description of when this is safe to use.

Thanks! That sped it up tremendously. Naturally, I misread what was in Automatic Differentiation and thought that I could not use Turing.setrdcache(true).

With that change, it now takes about 10 minutes on my computer.

1 Like

please check this simulation