Bayesian logistic regression with Turing.jl

A few notes here:

  1. Your map is actually only computing a single element :) That’s why it’s significantly faster. map and broadcast behave differently when the arguments have different shapes: map (I believe) falls back to the smallest common length, which here is 1 since a scalar iterates as a length-1 collection, so the function is only called once, while broadcast expands the scalar and does what you want (see the first sketch after this list).
  2. When using BenchmarkTools.jl, you need to interpolate the arguments with $ so that you’re measuring the function itself rather than the overhead of accessing non-constant globals (see the second sketch after this list).
  3. If-statements within the function you’re broadcasting over really hamper performance. StatsFuns.normlogpdf has two relevant methods: (1) takes the mean, std, and value, from which it computes the z-value and then the logpdf, and (2) takes the z-value directly and computes the logpdf. There are branches in (1) but none in (2), so (2) will be much faster when broadcasted/vectorized.
  4. Tullio.jl can’t derive the gradients automatically when the RHS includes “complex” function calls, e.g. StatsFuns.normlogpdf, so in this particular case you probably shouldn’t use it.

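To make (1) concrete, here’s a minimal sketch (sizes and values are just for illustration):

using StatsFuns

X = rand(10); Y = rand(10); sig = 1.0

map(StatsFuns.normlogpdf, X, sig, Y)  # 1-element result: `sig` iterates as a length-1 collection, so map stops after the first element
StatsFuns.normlogpdf.(X, sig, Y)      # 10-element result: broadcast expands `sig` across the arrays
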
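And for (2), the difference interpolation makes (the exact timings will of course vary):

using BenchmarkTools

X = rand(1000);

@benchmark sum(X)   # `X` is a non-constant global, so you also measure dynamic-dispatch overhead
@benchmark sum($X)  # interpolating with `$` splices the value in, so only `sum` itself is measured
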
All in all, your example should be something like:

using Tullio, CUDA, StatsFuns, BenchmarkTools

X = rand(960*950);
Y = rand(960*950);
sig = 1.0

f(X, Y, sig) = sum(StatsFuns.normlogpdf.(X, sig, Y))
g(X, Y, sig) = @tullio lp := StatsFuns.normlogpdf(X[i], sig, Y[i])

@benchmark $f($X, $Y, $sig)
# BenchmarkTools.Trial: 
#   memory estimate:  6.96 MiB
#   allocs estimate:  2
#   --------------
#   minimum time:     6.359 ms (0.00% GC)
#   median time:      6.662 ms (0.00% GC)
#   mean time:        6.847 ms (1.98% GC)
#   maximum time:     11.060 ms (23.53% GC)
#   --------------
#   samples:          730
#   evals/sample:     1

@benchmark $g($X, $Y, $sig)
# BenchmarkTools.Trial: 
#   memory estimate:  128 bytes
#   allocs estimate:  5
#   --------------
#   minimum time:     6.571 ms (0.00% GC)
#   median time:      6.799 ms (0.00% GC)
#   mean time:        6.821 ms (0.00% GC)
#   maximum time:     8.835 ms (0.00% GC)
#   --------------
#   samples:          732
#   evals/sample:     1


# Circumvent if-statements used in `StatsFuns.normlogpdf(::Number, ::Number, ::Number)`.
h(X, Y, sig) = sum(StatsFuns.normlogpdf.((Y - X) / sig))
i(X, Y, sig) = @tullio (+) lp := StatsFuns.normlogpdf((Y[i] - X[i]) / sig)

@benchmark $h($X, $Y, $sig)
# BenchmarkTools.Trial: 
#   memory estimate:  20.87 MiB
#   allocs estimate:  6
#   --------------
#   minimum time:     2.947 ms (0.00% GC)
#   median time:      3.497 ms (0.00% GC)
#   mean time:        4.436 ms (12.28% GC)
#   maximum time:     13.297 ms (25.54% GC)
#   --------------
#   samples:          1123
#   evals/sample:     1

@benchmark $i($X, $Y, $sig)
# BenchmarkTools.Trial: 
#   memory estimate:  128 bytes
#   allocs estimate:  5
#   --------------
#   minimum time:     476.989 μs (0.00% GC)
#   median time:      556.951 μs (0.00% GC)
#   mean time:        562.737 μs (0.00% GC)
#   maximum time:     999.113 μs (0.00% GC)
#   --------------
#   samples:          8765
#   evals/sample:     1

But as mentioned above, you want gradients through it, so your best bet here is probably just to use h from the example above. EDIT: Actually, it might work very nicely with ForwardDiff.jl? I just know that for reverse-mode rules you can’t use “complex” function calls on the RHS.
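
To illustrate what that would look like, here’s a minimal, untested sketch of taking a gradient through h with ForwardDiff.jl (smaller arrays here just for a quick check; this shows only the forward-mode route, not necessarily what Turing ends up using):

using ForwardDiff, StatsFuns

h(X, Y, sig) = sum(StatsFuns.normlogpdf.((Y - X) / sig))

X = rand(100); Y = rand(100); sig = 1.0

# normlogpdf(z) is plain Julia code, so ForwardDiff's dual numbers pass straight through the broadcast.
ForwardDiff.gradient(x -> h(x, Y, sig), X)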