Improving performance of item response model in Turing.jl

Hi everyone,

I am trying to fit an item response model in Turing.jl and have some questions regarding performance. This question is related to the thread “Making Turing Fast with large numbers of parameters?” and is in fact referenced there in post #116.

For simplicity I constructed a minimal model that features just two parameter vectors.

  • theta: person-specific parameter
  • beta: item-specific parameter

where there is one parameter per person and one per item. In terms of hierarchical models, these can be framed as random intercepts for persons and items.
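Written out, this is essentially a Rasch-type (1PL) model; the standard normal priors correspond to the filldist(Normal(), ...) statements in the code below:

$$
y_n \sim \mathrm{Bernoulli}\!\left(\mathrm{logit}^{-1}(\theta_{p[n]} - \beta_{i[n]})\right), \qquad
\theta_j \sim \mathrm{Normal}(0, 1), \qquad
\beta_k \sim \mathrm{Normal}(0, 1)
$$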

A naive implementation of this model in Turing looks like

using Turing
using LogExpFunctions: logistic  # logistic could equally come from StatsFuns

@model function irt_naive(y, i, p; I=maximum(i), P=maximum(p))
    theta ~ filldist(Normal(), P)
    beta ~ filldist(Normal(), I)

    for n in eachindex(y)
        y[n] ~ Bernoulli(logistic(theta[p[n]] - beta[i[n]]))
    end
end

where y is the response vector, i is a vector of item indices, and p is a vector of person indices.
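For reference, data in this layout can be simulated along the following lines (a minimal sketch; the variable names and the exact simulation in the benchmark code may differ, and logistic is taken from LogExpFunctions here):

using Distributions, LogExpFunctions, Random

Random.seed!(1)
n_persons, n_items = 100, 20
theta_true = randn(n_persons)            # person abilities
beta_true  = randn(n_items)              # item difficulties
p = repeat(1:n_persons, inner=n_items)   # person index of each response
i = repeat(1:n_items, outer=n_persons)   # item index of each response
y = [rand(Bernoulli(logistic(theta_true[pp] - beta_true[ii]))) for (pp, ii) in zip(p, i)]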

I’ve tried a lot of things to improve the performance of this model (vectorizing, LazyArrays, …), and the most performant version I have come up with so far adds the likelihood manually via the @addlogprob! macro.

@model function irt(y, i, p; I=maximum(i), P=maximum(p))
    theta ~ filldist(Normal(), P)
    beta ~ filldist(Normal(), I)
    @addlogprob! sum(logpdf.(BernoulliLogit.(theta[p] - beta[i]), y))
end

This change improved performance by a factor of ~5.

Now, comparing the optimized irt model to Stan reveals that Turing is about 3-10 times slower.
Here are the timings for an increasing number of persons P and a fixed number of items I = 20 (running on a MacBook Pro M1 with Julia 1.8). Note that I tried to match the sampler settings in Turing to the Stan defaults, NUTS(1_000, 0.8; max_depth=10).
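Concretely, the Turing side of each timing is a plain sample call with these settings (n_samples denotes the number of post-adaptation draws used in the benchmark script):

# NUTS settings matched to Stan's defaults: 1000 adaptation steps,
# target acceptance rate 0.8, maximum tree depth 10.
alg = NUTS(1_000, 0.8; max_depth=10)

chain = sample(irt(y, i, p), alg, n_samples)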

P        Turing        Stan          ratio
10       575.398 ms    169.081 ms    3.40
100      14.295 s      1.462 s       9.78
1000     150.454 s     20.029 s      7.51
10000    2293.964 s    405.192 s     5.66

To conclude, my specific questions are:

  • Is it possible to further improve the code for the irt model?
  • For these types of models with a large number of parameters, is this performance gap relative to Stan expected, or should Turing be roughly as fast as Stan here?

You can download the full benchmark code from this gist: benchmarking turing vs stan on a simple IRT model · GitHub.


Have you tried using MvNormal priors, specifically with FillArrays and I from LinearAlgebra, as in this post: Regularized horseshoe prior?

Thanks for the suggestion (interesting thread btw). I’ve adapted the model to

using FillArrays: Zeros
import LinearAlgebra

@model function irt_v2(y, i, p; I=maximum(i), P=maximum(p))
    theta ~ MvNormal(Zeros(P), LinearAlgebra.I)
    beta ~ MvNormal(Zeros(I), LinearAlgebra.I)
    @addlogprob! sum(logpdf.(BernoulliLogit.(theta[p] - beta[i]), y))
end

but preliminary testing shows no difference between the two versions.

m1 = irt(y, i, p)
m2 = irt_v2(y, i, p)

@benchmark sample($m1, $alg, $n_samples, init_params=$zeros(1020))  # 138.772 s
@benchmark sample($m2, $alg, $n_samples, init_params=$zeros(1020))  # 160.522 s

I see. Do you have other versions with @addlogprob!? I assume this version allocates at least one vector for each gradient calculation?

Yes, I have tried switching between Bernoulli(logistic(...)) and BernoulliLogit(...). The latter gave me much more stable results, which I guess is the whole reason this distribution exists within Turing.

I also tested a version using views instead of indexing the parameter vectors,

@addlogprob! sum(logpdf.(BernoulliLogit.(view(theta, p) - view(beta, i)), y))

but this somehow had abysmal performance, so I didn’t pursue it further.

How would one go about profiling this effectively? I benchmarked a few variations of the log-likelihood function outside the model:

ll_v1(y, i, p, theta, beta) = sum(logpdf.(BernoulliLogit.(theta[p] - beta[i]), y))
ll_v2(y, i, p, theta, beta) = sum(@. logpdf(BernoulliLogit(theta[p] - beta[i]), y))
ll_v3(y, i, p, theta, beta) = sum(@. logpdf(Bernoulli(logistic(theta[p] - beta[i])), y))
ll_v4(y, i, p, theta, beta) = sum(logpdf.(Bernoulli.(logistic.(view(theta, p) - view(beta, i))), y))

@benchmark ll_v1(y, i, p, theta, beta)
BenchmarkTools.Trial: 4739 samples with 1 evaluation.
 Range (min … max):  968.791 μs …  16.477 ms  ┊ GC (min … max): 0.00% … 93.42%
 Time  (median):       1.004 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.053 ms ± 689.209 μs  ┊ GC (mean ± σ):  4.16% ±  5.87%

 Histogram (frequency by time): 969 μs … 1.12 ms

 Memory estimate: 625.20 KiB, allocs estimate: 9.

@benchmark ll_v2(y, i, p, theta, beta)
BenchmarkTools.Trial: 4821 samples with 1 evaluation.
 Range (min … max):  970.458 μs …  15.887 ms  ┊ GC (min … max): 0.00% … 93.27%
 Time  (median):     997.041 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.035 ms ± 608.350 μs  ┊ GC (mean ± σ):  3.23% ±  5.07%

 Histogram (frequency by time): 970 μs … 1.08 ms

 Memory estimate: 468.91 KiB, allocs estimate: 7.

@benchmark ll_v3(y, i, p, theta, beta)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  255.791 μs …  14.436 ms  ┊ GC (min … max): 0.00% … 97.83%
 Time  (median):     278.584 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   310.709 μs ± 536.731 μs  ┊ GC (mean ± σ):  9.46% ±  5.45%

 Histogram (frequency by time): 256 μs … 329 μs

 Memory estimate: 468.91 KiB, allocs estimate: 7.

@benchmark ll_v4(y, i, p, theta, beta)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  237.916 μs …  11.594 ms  ┊ GC (min … max): 0.00% … 97.68%
 Time  (median):     253.083 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   270.429 μs ± 369.099 μs  ┊ GC (mean ± σ):  6.13% ±  4.42%

 Histogram (frequency by time): 238 μs … 277 μs

 Memory estimate: 312.61 KiB, allocs estimate: 5.

A few things I take away from this benchmark:

  1. The log-likelihood function does indeed allocate
  2. ll_v4 (view + Bernoulli(logistic(...))) is the most performant version
  3. Somehow the performance of ll_v4 does not translate to the evaluation within @model

The new @profview_allocs macro somehow gives me 0 allocations, so I’m pretty much lost here…
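For completeness, a plain scalar-loop version of the likelihood (outside the model) that avoids the broadcast temporaries entirely would look like this (untested under AD; a scalar loop is usually not what ReverseDiff prefers, but it helps separate allocation cost from the actual computation):

# Loop-based log-likelihood with no intermediate vectors
# (BernoulliLogit is the distribution exported by Turing).
using Turing

function ll_loop(y, i, p, theta, beta)
    s = 0.0
    @inbounds for n in eachindex(y)
        s += logpdf(BernoulliLogit(theta[p[n]] - beta[i[n]]), y[n])
    end
    return s
end

# @benchmark ll_loop($y, $i, $p, $theta, $beta)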

Why not?

using LazyArrays

@model function irt(y, i, p; I=maximum(i), P=maximum(p))
    theta ~ filldist(Normal(), P)
    beta ~ filldist(Normal(), I)
    y ~ arraydist(LazyArray(@~ BernoulliLogit.(theta[p] - beta[i])))
end

I think this is the same model, right?

Or EDIT:

using LazyArrays
using LinearAlgebra

@model function irt(y, i, p; max_i=maximum(i), max_p=maximum(p))
    theta ~ MvNormal(I(max_p))
    beta ~ MvNormal(I(max_i))
    y ~ arraydist(LazyArray(@~ BernoulliLogit.(theta[p] - beta[i])))
end

Yes, this is the same model. I am aware of the LazyArray “trick”, but unfortunately it does not work for this particular model. In the thread I linked in the initial post a possible regression was discussed (post #122), but then this should affect all models using LazyArray, no? I then just followed what @dlakelan described and summed everything manually using @addlogprob!.

Here is the benchmark for @addlogprob! vs LazyArray version of the model (I had to reduce to P = 100 because of runtime):

@model function irt_lazy(y, i, p; max_i=maximum(i), max_p=maximum(p))
    theta ~ MvNormal(I(max_p))
    beta ~ MvNormal(I(max_i))
    y ~ arraydist(LazyArray(@~ BernoulliLogit.(theta[p] - beta[i])))
end

@benchmark sample($m3, $alg, $n_samples, init_params=$zeros(120)) 
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 10.272 s (0.72% GC) to evaluate,
 with a memory estimate of 1.06 GiB, over 4902514 allocations.

@benchmark sample($m_lazy, $alg, $n_samples, init_params=$zeros(120)) 
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 350.161 s (0.02% GC) to evaluate,
 with a memory estimate of 1.24 GiB, over 5181966 allocations.

So the LazyArray version takes 35 times longer than the @addlogprob! version.


In my experience, this is unfortunately the best one can do in Turing for many real world models with lots of parameters. I suspect most performance gains will be achieved with a better optimized reverse mode AD. Enzyme.jl seems promising, but I do not know when that might be ready to work with Turing.


Is this actively worked on?

I really enjoy using Turing and think it improves on a lot of things compared to Stan in terms of user experience, but if 3-10x Stan’s sampling time is the best that can be done, then it is unfortunately a non-starter for many applied projects with these types of models.


I completely agree. AD performance has been a pain point for MCMC sampling in Julia. My understanding is that Enzyme.jl is under active development and Turing should be able to support it at some point. But I don’t know what the timeline is for using it with Turing. I did not receive a reply on Slack a few weeks ago when I inquired.


That’s highly problem dependent. If you’re using the generic-ness in the right way, Turing is like 3x-5x faster than Stan for ODE problems. So it’s a give and take: it’s general so it’s more optimized for hard models and less optimized for the “simplest” models.

https://benchmarks.sciml.ai/html/ParameterEstimation/DiffEqBayesFitzHughNagumo.html

https://benchmarks.sciml.ai/html/ParameterEstimation/DiffEqBayesLotkaVolterra.html


This is definitely true. I should have specified that my benchmarks focused on multilevel linear models, IRT models, and multinomial processing trees. In some cases, the ratio was 2-5, which isn’t terrible, especially at the lower end of that range. Some models were consistently closer to 10. I suspect there is still room for performance improvement.

Another limitation of ReverseDiff is that it is only performant for vectorized code. Vectorized code is sometimes difficult to reason about and less flexible than a for loop, though in other cases it is quite elegant. Having the flexibility to use either approach without a significant performance cost would improve the user experience quite a bit.

Did you turn on tape compilation?

Unfortunately, I do not recall all of the details. It’s been at least half a year. I followed the performance tips in the Turing documentation and some additional tips from the Turing devs. That included Turing.setrdcache(true), using MvNormal for broadcasting, and memoization. Is compiling the tape different from those approaches?

I agree and I have had great success with other problem types. That’s why I specifically referred to β€œthese types of models”, e.g. IRT models or other hierarchical models with large number of parameters.

As far as I understand it, no. From the https://turing.ml/dev/docs/using-turing/autodiff docs:

When using ReverseDiff, to compile the tape only once and cache it for later use, the user has to call Turing.setrdcache(true).
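In code, with the APIs current at the time of this thread, that amounts to the following (ReverseDiff needs to be loaded for the backend to be selectable):

using Turing, ReverseDiff

Turing.setadbackend(:reversediff)  # switch the AD backend to ReverseDiff
Turing.setrdcache(true)            # compile the tape once and cache it

Caching the tape is only safe when the model's control flow does not change between evaluations, which is the case for the irt model here.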

If you are not tied to NUTS/HMC, then for these more standard types of models you may find ReactiveMP (https://github.com/biaslab/ReactiveMP.jl) a nice alternative, especially if you have a huge number of parameters and data points.

ZigzagBoomerang may be a good option as well. I’m sure I’ve seen examples where @mschauer uses it to sample from a Turing-generated model.

This may not specifically answer your question; it’s more of a general comment that the Julia PPL landscape is more than just HMC.


Out of curiosity, on those benchmarks, I note an unusual option – stepsize_jitter = 1.0. Could you explain the reasoning for this?

We don’t set that, so it would just be whatever the Stan default is.

Interesting. The Stan default is stepsize_jitter = 0, but Stan.jl erroneously supplied stepsize_jitter = 1.0 as the default for quite some time; it is possible that CmdStan.jl was also supplying the same, given that CmdStan.jl merged into Stan.jl. In any case, it would be interesting to see the benchmarks without the jittering, which, I suspect, may be slowing down Stan considerably.

To clarify my position: it is worth noting that I would love to see Turing.jl (or any other equivalent) have performance within a factor of 2 of Stan.

I suspect this is highly problem dependent. There are already examples where Turing.jl is within this margin, or maybe even faster than Stan, but then there are models where that’s not the case. For me, though I love what Stan can do, Turing.jl is my go-to in part because I’d rather write my model in Julia.