Bayesian logistic regression with Turing.jl

For example, yes :) Tullio.jl, which I mentioned above, often ends up using LoopVectorization.jl under the hood. Calling @addlogprob! within the for-loop probably won't buy you much, though; instead, store the results in some other variable and then sum/reduce them in a single @addlogprob! call. With Tullio.jl you can also perform efficient reductions, and it will automatically derive adjoint rules for Zygote.jl and Tracker.jl.
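Roughly what I mean, as a minimal sketch (the model, data, and priors here are made up purely for illustration, not your actual setup):

using Turing, StatsFuns

# Hypothetical toy data.
X = randn(100)
Y = randn(100)

@model function demo(X, Y)
    sig ~ truncated(Normal(0, 1), 0, Inf)
    # Accumulate the pointwise log-likelihoods first, then add the total
    # with a single @addlogprob! call instead of one call per observation.
    lp = sum(StatsFuns.normlogpdf.(X, sig, Y))
    @addlogprob! lp
end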

I'm seeing a slowdown using Tullio for summing log probabilities. Probably I'm just doing it wrong.

using BenchmarkTools, LoopVectorization, StatsFuns, Tullio

X = rand(960*950)
Y = rand(960*950)
sig = 1

julia> @benchmark log_lik1 = sum(map(StatsFuns.normlogpdf, X, sig, Y))
BenchmarkTools.Trial: 
  memory estimate:  288 bytes
  allocs estimate:  6
  --------------
  minimum time:     696.503 ns (0.00% GC)
  median time:      903.497 ns (0.00% GC)
  mean time:        1.075 μs (3.42% GC)
  maximum time:     203.784 μs (99.31% GC)
  --------------
  samples:          10000
  evals/sample:     143

julia> @benchmark @tullio  log_lik2 := StatsFuns.normlogpdf(X[i],sig,Y[i])
BenchmarkTools.Trial: 
  memory estimate:  6.80 KiB
  allocs estimate:  158
  --------------
  minimum time:     1.191 ms (0.00% GC)
  median time:      1.502 ms (0.00% GC)
  mean time:        1.693 ms (0.00% GC)
  maximum time:     4.586 ms (0.00% GC)
  --------------
  samples:          2914
  evals/sample:     1

A couple of notes here:

  1. Your map is actually only computing a single element :) That's why it's significantly faster. map and broadcast behave differently when the arguments have different shapes: map stops at the smallest common length, which here is 1 because sig is a scalar, so it only calls the function once, while broadcast does what you want (see the short check after this list).
  2. When using BenchmarkTools.jl, you need to interpolate global variables with $ so that you're not measuring the overhead of accessing non-constant globals.
  3. If-statements within the function you're broadcasting over really hamper performance. StatsFuns.normlogpdf has two relevant methods: (1) takes mean, std, and value, from which it computes the z-value and then the logpdf, and (2) takes the z-value directly and computes the logpdf. (1) contains some if-statements while (2) has none, so (2) will be much faster when broadcasted/vectorized.
  4. Tullio.jl can't derive the gradients automatically when the RHS includes "complex" function calls, e.g. StatsFuns.normlogpdf. So in this particular case you probably shouldn't use it.
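To make point (1) concrete, here's a quick check on small made-up vectors:

using StatsFuns

x = rand(10); y = rand(10); s = 1.0

map(StatsFuns.normlogpdf, x, s, y)  # 1-element Vector: map stops at the scalar's length of 1
StatsFuns.normlogpdf.(x, s, y)      # 10-element Vector: broadcast expands the scalar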

All in all, your example should be something like:

using BenchmarkTools, StatsFuns, Tullio, CUDA

X = rand(960*950);
Y = rand(960*950);
sig = 1.0

f(X, Y, sig) = sum(StatsFuns.normlogpdf.(X, sig, Y))
g(X, Y, sig) = @tullio lp := StatsFuns.normlogpdf(X[i], sig, Y[i])

@benchmark $f($X, $Y, $sig)
# BenchmarkTools.Trial: 
#   memory estimate:  6.96 MiB
#   allocs estimate:  2
#   --------------
#   minimum time:     6.359 ms (0.00% GC)
#   median time:      6.662 ms (0.00% GC)
#   mean time:        6.847 ms (1.98% GC)
#   maximum time:     11.060 ms (23.53% GC)
#   --------------
#   samples:          730
#   evals/sample:     1

@benchmark $g($X, $Y, $sig)
# BenchmarkTools.Trial: 
#   memory estimate:  128 bytes
#   allocs estimate:  5
#   --------------
#   minimum time:     6.571 ms (0.00% GC)
#   median time:      6.799 ms (0.00% GC)
#   mean time:        6.821 ms (0.00% GC)
#   maximum time:     8.835 ms (0.00% GC)
#   --------------
#   samples:          732
#   evals/sample:     1


# Circumvent the if-statements in `StatsFuns.normlogpdf(::Number, ::Number, ::Number)`
# by passing the z-value directly.
h(X, Y, sig) = sum(StatsFuns.normlogpdf.((Y - X) / sig))
i(X, Y, sig) = @tullio (+) lp := StatsFuns.normlogpdf((Y[i] - X[i]) / sig)

@benchmark $h($X, $Y, $sig)
# BenchmarkTools.Trial: 
#   memory estimate:  20.87 MiB
#   allocs estimate:  6
#   --------------
#   minimum time:     2.947 ms (0.00% GC)
#   median time:      3.497 ms (0.00% GC)
#   mean time:        4.436 ms (12.28% GC)
#   maximum time:     13.297 ms (25.54% GC)
#   --------------
#   samples:          1123
#   evals/sample:     1

@benchmark $i($X, $Y, $sig)
# BenchmarkTools.Trial: 
#   memory estimate:  128 bytes
#   allocs estimate:  5
#   --------------
#   minimum time:     476.989 μs (0.00% GC)
#   median time:      556.951 μs (0.00% GC)
#   mean time:        562.737 μs (0.00% GC)
#   maximum time:     999.113 μs (0.00% GC)
#   --------------
#   samples:          8765
#   evals/sample:     1

But as mentioned above, you want gradients through it, so your best bet here is probably just to use h from the example above. EDIT: Actually, it might work very nicely with ForwardDiff.jl? I just know that for reverse-mode rules you can't use "complex" function calls on the RHS.
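If you want to check, one quick sanity test (assuming the h and i definitions above, and assuming ForwardDiff.jl can differentiate through the Tullio-generated loops) is to compare the two gradients directly:

using ForwardDiff

# Gradient w.r.t. X through the broadcast version and the Tullio version;
# if Tullio is really doing the same computation, these should agree up to
# floating-point noise.
grad_h = ForwardDiff.gradient(x -> h(x, Y, sig), X)
grad_i = ForwardDiff.gradient(x -> i(x, Y, sig), X)
grad_h ≈ grad_i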

Thanks a lot! I knew I must have been doing something silly.
I am using ForwardDiff, so I’ll give ‘i’ a shot.

Looking at the StatsFuns docs, the log-likelihood function should be normlogpdf((X - Y) / sig) - log(sig), just in case anyone is copy-pasting this into their own model.
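In other words, the corrected versions of h and i from above would look something like this (h2/i2 are just hypothetical names; the extra -log(sig) per observation is the term that was missing):

h2(X, Y, sig) = sum(StatsFuns.normlogpdf.((X - Y) / sig)) - length(X) * log(sig)
i2(X, Y, sig) = @tullio (+) lp := StatsFuns.normlogpdf((X[i] - Y[i]) / sig) - log(sig)

# Sanity check against the three-argument form:
h2(X, Y, sig) ≈ sum(StatsFuns.normlogpdf.(X, sig, Y))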


Ah yes, nice catch! I forgot about the log-abs-det Jacobian term.

Btw, this will be fixed as soon as [Merged by Bors] - Fix use of arrays of Distributions by devmotion · Pull Request #245 · TuringLang/DynamicPPL.jl · GitHub is merged, which should be very soon :)


Using Tullio in the log-likelihood gave me a big speed-up with ForwardDiff, but the rhat values were higher and the posterior was visibly messy, so I've stuck with broadcasting for now. Obviously I could increase the number of iterations thanks to the speed-up, but something seems off, so I don't quite trust it.

Could you share the code using Tullio.jl? I'm somewhat surprised here. If it's indeed just doing the same computation, it shouldn't be an issue.

Btw, the new version of DynamicPPL now allows you to write I .~ BernoulliLogit.(z), which should also be pretty fast :)
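Roughly like this, i.e. a minimal sketch of a logistic-regression model with made-up priors and data, just to show the vectorized observation statement (assuming the BernoulliLogit and filldist exported by Turing):

using Turing

# Toy data purely for illustration.
n, d = 100, 3
X = randn(n, d)
I_obs = rand(0:1, n)

@model function logreg(X, I)
    alpha ~ Normal(0, 1)                       # intercept
    beta ~ filldist(Normal(0, 1), size(X, 2))  # coefficients
    z = alpha .+ X * beta
    # Vectorized observation statement: each I[i] ~ BernoulliLogit(z[i]).
    I .~ BernoulliLogit.(z)
end

chain = sample(logreg(X, I_obs), NUTS(), 500)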

I want to try this. What packages are you using? Is there any other prep not in the MWE?