A couple of notes here:
- Your `map` is actually only computing a single element :) That's why it's significantly faster. `map` and `broadcast` have different behavior when the arguments are of different shapes: `map` I believe falls back to the smallest common size, which in this case is 1, so it will only call the function once, while `broadcast` will do what you want (see the small sketch after this list).
- When using BenchmarkTools.jl, you need to use interpolation to ensure that you're not measuring a slowdown caused by non-constant globals and the like (an example follows the list as well).
- If-statements within the function you're broadcasting over really hamper performance. In the case of `StatsFuns.normlogpdf` there are two relevant methods: (1) takes the mean, the std, and the value, from which it computes the z-value and then finally the logpdf, and (2) takes the z-value directly and computes the logpdf of the standard normal. In (1) there are some if-statements while in (2) there are none, so (2) will be much faster when broadcast/vectorized (see the sketch below the list).
- Tullio.jl can't derive the gradients automatically when the RHS includes "complex" function calls, e.g. `StatsFuns.normlogpdf`, so in this particular case you probably shouldn't use it.
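To illustrate the `map` vs `broadcast` point, here's a small sketch (not from the original example) of the difference when one argument has length 1:

# `map` stops as soon as the shortest argument is exhausted, so pairing a
# long vector with a length-1 argument only computes one element:
map(+, [1, 2, 3], [10])        # 1-element Vector: [11]

# `broadcast` (or dot syntax) expands the length-1 argument instead:
broadcast(+, [1, 2, 3], [10])  # [11, 12, 13]
[1, 2, 3] .+ 10                # [11, 12, 13]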
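And a minimal example of the BenchmarkTools.jl interpolation point (the names here are just for illustration):

using BenchmarkTools

x = rand(1000)      # non-constant global

@benchmark sum(x)   # `x` is looked up as an untyped global inside the benchmark
@benchmark sum($x)  # interpolated: the benchmark measures `sum` on a concrete value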
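As for the two `StatsFuns.normlogpdf` methods, the relationship is roughly the following (a sketch assuming `sig > 0`; the `- log(sig)` term is why you can drop down to the one-argument method when `sig == 1`):

using StatsFuns

mu, sig, y = 0.3, 1.0, 0.7
z = (y - mu) / sig

# (1) mean/std/value method: computes the z-value internally and branches on
#     edge cases such as `sig == 0`.
StatsFuns.normlogpdf(mu, sig, y)

# (2) z-value method: logpdf of the standard normal, no branches.
StatsFuns.normlogpdf(z) - log(sig)   # same value as (1) for sig > 0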
All in all, your example should be something like:
using Tullio, CUDA, StatsFuns, BenchmarkTools
X = rand(960*950);
Y = rand(960*950);
sig = 1.0
f(X, Y, sig) = sum(StatsFuns.normlogpdf.(X, sig, Y))
g(X, Y, sig) = @tullio lp = StatsFuns.normlogpdf(X[i], sig, Y[i])
@benchmark $f($X, $Y, $sig)
# BenchmarkTools.Trial:
# memory estimate: 6.96 MiB
# allocs estimate: 2
# --------------
# minimum time: 6.359 ms (0.00% GC)
# median time: 6.662 ms (0.00% GC)
# mean time: 6.847 ms (1.98% GC)
# maximum time: 11.060 ms (23.53% GC)
# --------------
# samples: 730
# evals/sample: 1
@benchmark $g($X, $Y, $sig)
# BenchmarkTools.Trial:
# memory estimate: 128 bytes
# allocs estimate: 5
# --------------
# minimum time: 6.571 ms (0.00% GC)
# median time: 6.799 ms (0.00% GC)
# mean time: 6.821 ms (0.00% GC)
# maximum time: 8.835 ms (0.00% GC)
# --------------
# samples: 732
# evals/sample: 1
# Circumvent the if-statements in `StatsFuns.normlogpdf(::Number, ::Number, ::Number)`
# by passing the z-value directly (with `sig = 1.0` the `log(sig)` correction vanishes).
h(X, Y, sig) = sum(StatsFuns.normlogpdf.((Y - X) / sig))
i(X, Y, sig) = @tullio (+) lp = StatsFuns.normlogpdf((Y[i] - X[i]) / sig)
@benchmark $h($X, $Y, $sig)
# BenchmarkTools.Trial:
# memory estimate: 20.87 MiB
# allocs estimate: 6
# --------------
# minimum time: 2.947 ms (0.00% GC)
# median time: 3.497 ms (0.00% GC)
# mean time: 4.436 ms (12.28% GC)
# maximum time: 13.297 ms (25.54% GC)
# --------------
# samples: 1123
# evals/sample: 1
@benchmark $i($X, $Y, $sig)
# BenchmarkTools.Trial:
# memory estimate: 128 bytes
# allocs estimate: 5
# --------------
# minimum time: 476.989 μs (0.00% GC)
# median time: 556.951 μs (0.00% GC)
# mean time: 562.737 μs (0.00% GC)
# maximum time: 999.113 μs (0.00% GC)
# --------------
# samples: 8765
# evals/sample: 1
But as mentioned above, you want gradients through it, so your best bet here is probably just to use `h` from the example above. EDIT: Actually, it might work very nicely with ForwardDiff.jl? I just know that for reverse-mode rules you can't use "complex" function calls on the RHS (rough sketch below).
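For completeness, a rough sketch of what the ForwardDiff.jl route could look like (untested, and the smaller arrays are just for illustration):

using ForwardDiff, StatsFuns

h(X, Y, sig) = sum(StatsFuns.normlogpdf.((Y - X) / sig))

X = rand(100);
Y = rand(100);
sig = 1.0

# Forward-mode gradient of the summed logpdf with respect to X.
# Note that ForwardDiff's cost grows with the number of inputs, so for very
# large X a reverse-mode backend may end up being preferable.
ForwardDiff.gradient(x -> h(x, Y, sig), X)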