Improving performance of item response model in Turing.jl

p-gw · August 24, 2022, 8:52am

Hi everyone,

I am trying to fit an item response model in Turing.jl and have some questions regarding performance. This question is related to Making Turing Fast with large numbers of parameters? and in fact is referenced in post #116.

For simplicity I constructed a minimal model that features just two parameter vectors.

theta: person-specific parameter
beta: item-specific parameter

where there is 1 parameter for each person and item respectively. In terms of hierarchical models this can be framed as random intercepts for each person and item.

A naive implementation of this model in Turing looks like

@model function irt_naive(y, i, p; I=maximum(i), P=maximum(p))
    theta ~ filldist(Normal(), P)
    beta ~ filldist(Normal(), I)
    
    for n in eachindex(y)
        y[n] ~ Bernoulli(logistic(theta[p[n]] - beta[i[n]]))
    end
end

where y is the response vector, i is a vector of item indices, and p is a vector of person indices.

I’ve tried a lot of things to improve the performance of this model (including vectorizing, LazyArrays, …) and the most performant version I came up with so far is a version where the likelihood is added manually using the @addlogprob! macro.

@model function irt(y, i, p; I=maximum(i), P=maximum(p))
    theta ~ filldist(Normal(), P)
    beta ~ filldist(Normal(), I)
    @addlogprob! sum(logpdf.(BernoulliLogit.(theta[p] - beta[i]), y))
end

This change improved performance by a factor of ~5.

Now, comparing the optimized irt model to Stan reveals that Turing is about 3-10 times slower.
Here are the timings for increasing number of persons P and a fixed set of items I = 20 (running on Macbook Pro M1 and Julia 1.8). Note that I tried to match the algorithm in Turing to the Stan defaults, NUTS(1_000, 0.8; max_depth=10).

P = 10
Turing: 575.398 ms
Stan: 169.081 ms
ratio: 3.40

P = 100
Turing: 14.295 s
Stan: 1.462 s
ratio: 9.78

P = 1000
Turing: 150.454 s
Stan: 20.029 s
ratio: 7.51

P = 10000
Turing: 2293.964 s
Stan: 405.192 s
ratio: 5.66

To conclude, my specific questions are:

Is it possible to further improve the code for the irt model?
For these types of models with large number of parameters is this performance compared to Stan expected or should Turing be as fast as Stan here?

You can download the full benchmark code from this gist: benchmarking turing vs stan on a simple IRT model · GitHub.

EvoArt · August 24, 2022, 11:39am

Have you tried using MvNormal priors, specifically using FillArrays and I from LinearAlgebra as in this post? Regularized horseshoe prior

p-gw · August 24, 2022, 3:41pm

Thanks for the suggestion (interesting thread btw). I’ve adapted the model to

@model function irt_v2(y, i, p; I=maximum(i), P=maximum(p))
    theta ~ MvNormal(Zeros(P), LinearAlgebra.I)
    beta ~ MvNormal(Zeros(I), LinearAlgebra.I)
    @addlogprob! sum(logpdf.(BernoulliLogit.(theta[p] - beta[i]), y))
end

but preliminary testing shows no difference between the two versions.

m1 = irt(y, i, p)
m2 = irt_v2(y, i, p)

@benchmark sample($m1, $alg, $n_samples, init_params=$zeros(1020))  # 138.772 s
@benchmark sample($m2, $alg, $n_samples, init_params=$zeros(1020))  # 160.522 s

EvoArt · August 24, 2022, 5:21pm

I see. Do you have other versions with addlogprob? I assume this version allocates atleast one vector for each gradient calculation?

p-gw · August 24, 2022, 7:07pm

Yes, I have tried switching between Bernoulli(logistic(...)) and BernoulliLogit(...). The latter gave me lot more stable results, which I guess is the whole reason this distribution exists within Turing.

I also tested a version using views instead of indexing the parameter vectors,

@addlogprob! sum(logpdf.(BernoulliLogit.(view(theta, p) - view(beta, i), y))

but this somehow had abysmal performance so I didn’t pursue it further.

How would one go about profiling this effectively? I benchmarked a few variations of loglikelihood function outside the model:

ll_v1(y, i, p, theta, beta) = sum(logpdf.(BernoulliLogit.(theta[p] - beta[i]), y))
ll_v2(y, i, p, theta, beta) = sum(@. logpdf(BernoulliLogit(theta[p] - beta[i]), y))
ll_v3(y, i, p, theta, beta) = sum(@. logpdf(Bernoulli(logistic(theta[p] - beta[i])), y))
ll_v4(y, i, p, theta, beta) = sum(logpdf.(Bernoulli.(logistic.(view(theta, p) - view(beta, i))), y))

@benchmark ll_v1(y, i, p, theta, beta)
BenchmarkTools.Trial: 4739 samples with 1 evaluation.
 Range (min … max):  968.791 μs …  16.477 ms  ┊ GC (min … max): 0.00% … 93.42%
 Time  (median):       1.004 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.053 ms ± 689.209 μs  ┊ GC (mean ± σ):  4.16% ±  5.87%

             ▁▆██▄▂                                              
  ▂▄▅▃▂▂▂▂▁▂▃██████▇▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  969 μs           Histogram: frequency by time         1.12 ms <

 Memory estimate: 625.20 KiB, allocs estimate: 9.

@benchmark ll_v2(y, i, p, theta, beta)
BenchmarkTools.Trial: 4821 samples with 1 evaluation.
 Range (min … max):  970.458 μs …  15.887 ms  ┊ GC (min … max): 0.00% … 93.27%
 Time  (median):     997.041 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.035 ms ± 608.350 μs  ┊ GC (mean ± σ):  3.23% ±  5.07%

             ▂▆█▇▄▂                                              
  ▂▅▅▃▂▂▂▁▂▂▄██████▇▆▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  970 μs           Histogram: frequency by time         1.08 ms <

 Memory estimate: 468.91 KiB, allocs estimate: 7.

@benchmark ll_v3(y, i, p, theta, beta)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  255.791 μs …  14.436 ms  ┊ GC (min … max): 0.00% … 97.83%
 Time  (median):     278.584 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   310.709 μs ± 536.731 μs  ┊ GC (mean ± σ):  9.46% ±  5.45%

                 █▇█▅▃▁                                          
  ▆▃▂▂▁▁▁▁▁▁▁▁▁▁▂███████▆▆▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  256 μs           Histogram: frequency by time          329 μs <

 Memory estimate: 468.91 KiB, allocs estimate: 7.

@benchmark ll_v4(y, i, p, theta, beta)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  237.916 μs …  11.594 ms  ┊ GC (min … max): 0.00% … 97.68%
 Time  (median):     253.083 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   270.429 μs ± 369.099 μs  ┊ GC (mean ± σ):  6.13% ±  4.42%

                     ▂█▃▃▂▁                                      
  ▇▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▃███████▇▇▅▅▄▄▃▃▂▂▂▂▂▂▁▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  238 μs           Histogram: frequency by time          277 μs <

 Memory estimate: 312.61 KiB, allocs estimate: 5.

A few things I take away from this benchmark:

The loglikelihood function does indeed allocate
ll_v4 (view + Bernoulli(logistic(...))) is the most performant version
Somehow the performance ll_v4 does not translate to the evaluation within @model

The new @profview_allocs macro somehow gives me 0 allocations, so im pretty much lost here…

Storopoli · August 24, 2022, 10:52pm

Why not?

using LazyArrays

@model function irt(y, i, p; I=maximum(i), P=maximum(p))
    theta ~ filldist(Normal(), P)
    beta ~ filldist(Normal(), I)
    y ~ arraydist(LazyArray(@~ BernoulliLogit.(theta[p] - beta[i])))
end

I think this is the same model, right?

Or EDIT:

using LazyArrays
using LinearAlgebra

@model function irt(y, i, p; max_i=maximum(i), max_p=maximum(p))
    theta ~ MvNormal(I(max_p))
    beta ~ MvNormal(I(max_i))
    y ~ arraydist(LazyArray(@~ BernoulliLogit.(theta[p] - beta[i])))
end

p-gw · August 25, 2022, 8:42am

Yes, this is the same model. I am aware of the LazyArray “trick”, but unfortunately it does not work for this particular model. In the thread I linked in the initial post a possible regression was discussed (#122) but then this should affect all models using LazyArray, no? I then just followed what @dlakelan described and summed everything manually using @addlogprob!.

Here is the benchmark for @addlogprob! vs LazyArray version of the model (I had to reduce to P = 100 because of runtime):

@model function irt_lazy(y, i, p; max_i=maximum(i), max_p=maximum(p))
    theta ~ MvNormal(I(max_p))
    beta ~ MvNormal(I(max_i))
    y ~ arraydist(LazyArray(@~ BernoulliLogit.(theta[p] - beta[i])))
end

@benchmark sample($m3, $alg, $n_samples, init_params=$zeros(120)) 
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 10.272 s (0.72% GC) to evaluate,
 with a memory estimate of 1.06 GiB, over 4902514 allocations.

@benchmark sample($m_lazy, $alg, $n_samples, init_params=$zeros(120)) 
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 350.161 s (0.02% GC) to evaluate,
 with a memory estimate of 1.24 GiB, over 5181966 allocations.

So the LazyArray version takes 35 times longer than the @addlogprob! version.

Christopher_Fisher · August 25, 2022, 11:47am

In my experience, this is unfortunately the best one can do in Turing for many real world models with lots of parameters. I suspect most performance gains will be achieved with a better optimized reverse mode AD. Enzyme.jl seems promising, but I do not know when that might be ready to work with Turing.

p-gw · August 26, 2022, 7:01am

Is this actively worked on?

I really enjoy using Turing and think it improves on a lot of things compared to Stan in terms of the user experience, but if 3-10x sampling time is the best that can be done then it unfortunately is a non-starter for many applied projects with these types of models.

Christopher_Fisher · August 26, 2022, 12:40pm

I completely agree. AD performance has been a pain point for MCMC sampling in Julia. My understanding is that Enzyme.jl is under active development and Turing should be able to support it at some point. But I don’t know what the timeline is for using it with Turing. I did not receive a reply on Slack a few weeks ago when I inquired.

ChrisRackauckas · August 26, 2022, 12:44pm

That’s highly problem dependent. If you’re using the generic-ness in the right way, Turing is like 3x-5x faster than Stan for ODE problems. So it’s a give and take: it’s general so it’s more optimized for hard models and less optimized for the “simplest” models.

https://benchmarks.sciml.ai/html/ParameterEstimation/DiffEqBayesFitzHughNagumo.html

https://benchmarks.sciml.ai/html/ParameterEstimation/DiffEqBayesLotkaVolterra.html

Christopher_Fisher · August 26, 2022, 1:21pm

This is definitely true. I should have specified that my benchmarks focused on multilevel linear models, IRT models, and multinomial processing trees. In some cases, the ratio was 2-5, which isn’t terrible, especially at the lower end of that range. Some models were consistently closer to 10. I suspect there is still room for performance improvement.

Another limitation with ReverseDiff is that it is only performant for vectorized code. Sometimes vectorized code is difficult to reason about, and is less flexible than a for loop in some situations. In other cases it is quite elegant. Having the flexibility to use either approach without a significant performance cost would improve the user experience quite a bit.

ChrisRackauckas · August 26, 2022, 1:23pm

Did you turn on tape compilation?

Christopher_Fisher · August 26, 2022, 1:49pm

Unfortunately, I do not recall all of the details. Its been at least half a year. I followed the performance tips in the Turing documentation and followed some additional tips from the Turing devs. That included Turing.setrdcache(true), using MvNormal for broadcasting, and memoization. Is compiling the tape different from those approaches?

p-gw · August 26, 2022, 2:25pm

I agree and I have had great success with other problem types. That’s why I specifically referred to “these types of models”, e.g. IRT models or other hierarchical models with large number of parameters.

As far as I understand it, no. From the https://turing.ml/dev/docs/using-turing/autodiff docs:

When using ReverseDiff , to compile the tape only once and cache it for later use, the user has to call Turing.setrdcache(true) .

EvoArt · August 26, 2022, 4:17pm

If you are not tied to NUTS/HMC then for these more standard types of models you may find [ReactiveMP][(https://github.com/biaslab/ReactiveMP.jl) a nice alternative. Especially if you have a huge number of parameters and datapoints.

ZigzagBoomerang may be a good option also. I’m sure I’ve seen examples where @mschauer uses it to sample from a Turing generated model.

This may not specifically answer your question. More of a general comment that the Julia PPL landscape is more than HMC.

AndrewRadcliffe · August 30, 2022, 9:17pm

Out of curiosity, on those benchmarks, I note an unusual option – stepsize_jitter = 1.0. Could you explain the reasoning for this?

ChrisRackauckas · August 31, 2022, 12:22am

We don’t set that, so it would just be whatever the Stan default is.

AndrewRadcliffe · August 31, 2022, 7:14pm

Interesting. The Stan default is stepsize_jitter = 0, but Stan.jl erroneously supplied stepsize_jitter = 1.0 as the default for quite some time; it is possible that CmdStan.jl was also supplying the same, given that CmdStan.jl merged into Stan.jl. In any case, it would be interesting to see the benchmarks without the jittering, which, I suspect, may be slowing down Stan considerably.

To clarify my position: it is worth noting that I would love to see Turing.jl (or any other equivalent) have performance within a factor of 2 of Stan.

dlakelan · August 31, 2022, 7:40pm

I suspect this is highly problem dependent. There are already examples where Turing.jl is within this margin, or maybe even faster than Stan, but then there are models where that’s not the case. For me, though I love what Stan can do, Turing.jl is my go-to in part because I’d rather write my model in Julia.

Topic		Replies	Views
Making Turing Fast with large numbers of parameters? General Usage turing	127	9558	April 9, 2023
Speed up multilevel model in Turing.jl Probabilistic Programming turing	15	489	June 17, 2023
Turing.jl for Causal Inference model Probabilistic Programming performance , turing	19	1690	February 4, 2022
Case study: Speeding up a logistic regression with RHS prior (Turing vs Numpyro) - any tricks I'm missing? Probabilistic Programming	38	3328	February 6, 2023
Turing: help with slow model Probabilistic Programming	46	2638	February 13, 2021

Improving performance of item response model in Turing.jl

Related topics