I wanted to use Tullio to get rid of lots of nested loops in my code and write Einstein notation instead.
Code
I'm trying to compute a simple sum:
using Tullio, LoopVectorization  # LoopVectorization provides @tturbo; @tullio also uses it when loaded

elbo_tullio(G::AbstractMatrix, p::AbstractVector, mu::AbstractVector, var::AbstractVector, x::AbstractVector) =
    @tullio ret := G[k,n] * (log(p[k]) - (log(2pi) + log(var[k]) + (x[n] - mu[k])^2 / var[k]) / 2 - log(G[k, n] + 1e-100)) grad=false
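For reference, this is the sum written out as I read it from the expression above. The symbols are only my notation, not something from the code: $\sigma_k^2$ stands for `var[k]`, $\mu_k$ for `mu[k]`, and $K$, $N$ are the dimensions of `G`.

```latex
\mathrm{ret} \;=\; \sum_{k=1}^{K} \sum_{n=1}^{N} G_{kn}
\left[
  \log p_k
  \;-\; \frac{1}{2}\left( \log 2\pi + \log \sigma_k^2 + \frac{(x_n - \mu_k)^2}{\sigma_k^2} \right)
  \;-\; \log\!\left( G_{kn} + 10^{-100} \right)
\right]
```

The $10^{-100}$ term is the `1e-100` offset in the code, which keeps the logarithm finite when an entry of `G` is exactly zero.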
Same function with raw loops using LoopVectorization:
function elbo(G, p, mu, var, x)
    K, N = size(G)
    ret = G |> eltype |> zero
    @tturbo for k ∈ 1:K, n ∈ 1:N
        q = G[k, n]
        ret += q * (
            log(p[k]) - (log(2pi) + log(var[k]) + (x[n] - mu[k])^2 / var[k]) / 2 - log(q + 1e-100)
        )
    end
    ret
end
Benchmarks
$ julia-1.8 --threads=4
julia> G, p, mu, var, x = rand(3, 400), rand(3), randn(3), rand(3), randn(400);
julia> # Define `elbo` and `elbo_tullio`...
julia> elbo_tullio(G, p, mu, var, x) ≈ elbo(G, p, mu, var, x)
true
julia> @benchmark $elbo_tullio($G, $p, $mu, $var, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.097 μs … 79.455 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.237 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.558 μs ±  1.491 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  18.1 μs      Histogram: log(frequency) by time      21.9 μs <

 Memory estimate: 16 bytes, allocs estimate: 1.
julia> @benchmark $elbo($G, $p, $mu, $var, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  9.715 μs … 110.423 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.793 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.095 μs ±  1.826 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  9.72 μs      Histogram: log(frequency) by time      13.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> versioninfo()
Julia Version 1.8.0-beta3
Commit 3e092a2521 (2022-03-29 15:42 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: 4 × Intel(R) Core(TM) i5-3330S CPU @ 2.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, ivybridge)
Threads: 4 on 4 virtual cores
Apparently, Tullio is about 1.84 times slower than my basic implementation (mean 18.558 μs vs. 10.095 μs)? Moreover, Tullio uses only one core, while LoopVectorization uses two. I guess that's why Tullio is around two times slower? I tried playing around with the threads option to @tullio and didn't get any speedups. For example, with the above setup, any threads<100 uses all 4 cores, but takes 45 μs to complete, which is way slower than the 18 μs of my original Tullio code.
Am I doing something wrong? Does it not make sense to use Tullio here? Is it possible to get the speed back with Tullio?
- Julia 1.8.0-beta3
- LoopVectorization v0.12.107
- Tullio v0.3.3