Tullio seems two times slower than basic LoopVectorization

I wanted to use Tullio to get rid of lots of nested loops in my code and write Einstein notation instead.

Code

I’m trying to compute a simple sum:

elbo_tullio(G::AbstractMatrix, p::AbstractVector, mu::AbstractVector, var::AbstractVector, x::AbstractVector) =
    @tullio ret := G[k,n] * (log(p[k]) - (log(2pi) + log(var[k]) + (x[n] - mu[k])^2 / var[k]) / 2 - log(G[k, n] + 1e-100)) grad=false

The same function written with raw loops using LoopVectorization:

function elbo(G, p, mu, var, x)
    K, N = size(G)
    ret = G |> eltype |> zero
    @tturbo for k ∈ 1:K, n ∈ 1:N
        q = G[k, n]
        ret += q * (
            log(p[k]) - (log(2pi) + log(var[k]) + (x[n] - mu[k])^2 / var[k]) / 2 - log(q + 1e-100)
        )
    end

    ret
end
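For checking that both versions compute the right thing, here is the same sum as plain Julia loops with no macros at all (elbo_ref is just a name for this sketch):

```julia
# Plain-Julia reference implementation (no @tullio / @tturbo), useful for
# verifying the macro versions agree. Same loop body as `elbo` above.
function elbo_ref(G, p, mu, var, x)
    K, N = size(G)
    ret = zero(eltype(G))
    for n in 1:N, k in 1:K
        q = G[k, n]
        ret += q * (
            log(p[k]) - (log(2pi) + log(var[k]) + (x[n] - mu[k])^2 / var[k]) / 2 - log(q + 1e-100)
        )
    end
    ret
end
```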

Benchmarks

$ julia-1.8 --threads=4
julia> G, p, mu, var, x = rand(3, 400), rand(3), randn(3), rand(3), randn(400);

julia> # Define `elbo` and `elbo_tullio`...

julia> elbo_tullio(G, p, mu, var, x) ≈ elbo(G, p, mu, var, x)
true

julia> @benchmark $elbo_tullio($G, $p, $mu, $var, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.097 μs … 79.455 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.237 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.558 μs ±  1.491 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇▇█▁     ▃▃▄                 ▂▂▃        ▁▂▂                 ▂
  ████▄▁▃▃▁████▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███▇▆▃▁▃▁▁▁████▅▄▄▃▃▃▃▁▄▃▁▁▁▁▃ █
  18.1 μs      Histogram: log(frequency) by time      21.9 μs <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark $elbo($G, $p, $mu, $var, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.715 μs … 110.423 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      9.793 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.095 μs ±   1.826 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▃         ▅▄    ▃▄                                         ▂
  ███▁▁▁▁▁▁▁▁███▇▅▄▃███▆▃▄▄▅▄▃▃▁▃▁▁▃▁▁▃▄▅▄▁▄▃▃▁▁▁▁▁▃▁▃▃▁▁▃▁▃▁▃ █
  9.72 μs       Histogram: log(frequency) by time      13.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.8.0-beta3
Commit 3e092a2521 (2022-03-29 15:42 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: 4 × Intel(R) Core(TM) i5-3330S CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, ivybridge)
  Threads: 4 on 4 virtual cores

Apparently, Tullio is about 1.84 times slower than my basic implementation? Moreover, Tullio uses only one core, while LoopVectorization uses two. I guess that’s why Tullio is around two times slower? I tried playing around with the threads option to @tullio and didn’t get any speedups. For example, with the above setup, any threads<100 uses all 4 cores, but takes 45 μs to complete, which is way slower than the 18 μs of my original Tullio code.


Am I doing something wrong? Does it not make sense to use Tullio here? Is it possible to get the speed back with Tullio?

  • Julia 1.8.0-beta3
  • LoopVectorization v0.12.107
  • Tullio v0.3.3

Are the comparisons the same if you replace 400 by 4000? Your arrays are pretty small, so you may not be able to use more than one core effectively.
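For a quick scan over sizes, something like this rough Base-only helper works (bench is a made-up name here; BenchmarkTools.@benchmark gives much more reliable numbers):

```julia
# Crude best-of-many wall-clock timing in seconds, using only Base.
# Taking the minimum over many repeats suppresses noise somewhat.
function bench(f, args...; reps=1_000)
    f(args...)                      # warm up / force compilation
    minimum(@elapsed(f(args...)) for _ in 1:reps)
end

# Hypothetical usage, scanning N as suggested:
# for N in (100, 400, 4000)
#     G, p, mu, var, x = rand(3, N), rand(3), randn(3), rand(3), randn(N)
#     @show N bench(elbo_tullio, G, p, mu, var, x) bench(elbo, G, p, mu, var, x)
# end
```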

With 4000 Tullio is on par with LoopVectorization, as expected:

  • Tullio: mean 98.010 μs ± 17.693 μs
  • LoopVectorization: mean 98.271 μs ± 6.692 μs

However, I don’t need such big arrays: I’m trying to squeeze as much speed as possible out of arrays of sizes 100-500, but that’s where Tullio slows down significantly. Essentially, I wanted to use Tullio purely because of the concise Einstein notation, expecting the speed to be the same as for LoopVectorization.

Yes, Tullio is deciding not to multi-thread this, based on some heuristic. Whereas @tturbo seems to be getting a factor of 2 gain from multi-threading. Without threading, they are more or less the same speed.

The heuristic isn’t completely wrong, as overriding it to use 2 or 4 threads makes it slower, as you say. I believe this is about the overhead of Julia’s @spawn etc. Although it’s possible that I’ve missed something, and there is some way to make this step cheaper. Might be worth investigating.
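That overhead is easy to measure in isolation. A minimal Base-only sketch (nothing here is Tullio-specific; spawn_overhead and work are made-up names):

```julia
using Base.Threads: @spawn

# Measure the round-trip cost of spawning a task and fetching its result,
# averaged over many repetitions.
work(x) = x + 1

function spawn_overhead(n)
    fetch(@spawn work(0))          # warm up / compile
    t = @elapsed for i in 1:n
        fetch(@spawn work(i))
    end
    t / n                          # seconds per spawn + fetch
end
```

Typically this comes out around a microsecond or more per task, which is already a large fraction of the whole 3×400 reduction above.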

LoopVectorization.jl avoids that by using its own Polyester.jl, which is good at exactly this problem: getting value out of multiple threads when the problem is only just big enough. It would be nice to teach Tullio.jl to use it, but I haven’t got there – #113 is the issue.

julia> N = 400; G, p, mu, var, x = rand(3, N), rand(3), randn(3), rand(3), randn(N);

# same size as above, same functions:

julia> [ @benchmark $elbo_tullio($G, $p, $mu, $var, $x)
         @benchmark $elbo($G, $p, $mu, $var, $x) ]
2-element Vector{BenchmarkTools.Trial}:
┌ Trial [1]:
│  min 7.739 μs, median 7.781 μs, mean 7.804 μs, 99ᵗʰ 8.469 μs
│  1 allocation, 16 bytes
│                                                                       ◡*
│                                                                        █
│  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄█▂▂▂▂▂▂▂▂▂▂▂ ▂
└  3.5 μs                  10_000 samples, each 4 evaluations                  8.5 μs +
┌ Trial [2]:
│  min 3.531 μs, median 4.026 μs, mean 4.015 μs, 99ᵗʰ 5.672 μs
│  0 allocations
│        ◔ *◡
│        ▂ █
│  ▂▁▁▂▂▄█▂█▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▂▂▂▂▂▁▂▁▂▂▁▂▂▁▂▂▁▂▁▁▂▁▁▂▁▂▁▁▂▁▁▂▁▂▂▁▂▁▁▂▁▁▂▁▂▂▁▂▁▁▂▁▁▂▁▁▂▁ ▂
└  3.5 μs                  10_000 samples, each 8 evaluations                  8.5 μs +

# constrain to 1 thread & they are roughly equivalent:

julia> [ @benchmark $elbo1_tullio($G, $p, $mu, $var, $x)  # with threads=false
         @benchmark $elbo1($G, $p, $mu, $var, $x) ]       # with @turbo
2-element Vector{BenchmarkTools.Trial}:
┌ Trial [1]:
│  min 7.823 μs, median 7.854 μs, mean 7.916 μs, 99ᵗʰ 9.781 μs
│  1 allocation, 16 bytes
│              ◡ *
│              █
│  ▁▁▁▁▁▁▁▁▁▁▁▃█▅▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▂▂▂▁▂▁▂▂▁▂▂▁▂▂▁▂▂▂▂▂▂▁▁▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
└  7.5 μs                  10_000 samples, each 4 evaluations                  9.8 μs +
┌ Trial [2]:
│  min 7.750 μs, median 7.781 μs, mean 7.796 μs, 99ᵗʰ 8.281 μs
│  0 allocations
│           ◡*
│            █
│  ▁▁▁▁▁▁▁▁▂▇█▃▂▂▂▂▁▂▁▁▁▁▁▁▂▁▁▂▂▁▁▁▂▂▂▂▂▂▁▁▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▂▂▁▂▂▁▂▁▁▁▁▂▂▁▁▁ ▂
└  7.5 μs                  10_000 samples, each 4 evaluations                  9.8 μs +

# encourage Tullio to use all 4 threads, by setting the block size threads=100 (tiny)

julia> @benchmark $elbo_tullio_th($G, $p, $mu, $var, $x)
┌ Trial:
│  min 10.000 μs, median 21.292 μs, mean 25.040 μs, 99ᵗʰ 62.336 μs
│  87 allocations, total 4.66 KiB
│              ◔    ◡     *     ◕
│             █ ▂   ▁
│  ▂▁▁▁▁▂▂▁▂▂▄█▇█▃▄▃█▃▅▃▃▂▃▂▃▃▃▃▃▃▂▃▃▃▃▅▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
└  10 μs                    10_000 samples, each 1 evaluation                   63 μs +

# smaller N than original post: much closer, and @turbo faster than @tturbo

julia> N = 100; G, p, mu, var, x = rand(3, N), rand(3), randn(3), rand(3), randn(N);

julia> [ @benchmark $elbo_tullio($G, $p, $mu, $var, $x)
         @benchmark $elbo($G, $p, $mu, $var, $x) ]
2-element Vector{BenchmarkTools.Trial}:
┌ Trial [1]:
│  min 1.995 μs, median 2.014 μs, mean 2.022 μs, 99ᵗʰ 2.088 μs
│  1 allocation, 16 bytes
│          ◡*
│          █
│  ▁▁▁▁▁▁▁▃█▃▂▂▂▂▂▂▂▁▁▁▁▁▂▂▁▂▁▁▂▁▁▂▂▁▁▁▁▁▂▁▁▁▁▂▁▂▂▂▂▂▂▁▂▁▁▁▂▂▂▁▁▂▁▁▂▂▂▂▂▁▂▂▁▂▂▂▂▁▁▂▂▁ ▂
└  1.9 μs                  10_000 samples, each 9 evaluations                    3 μs +
┌ Trial [2]:
│  min 2.097 μs, median 2.551 μs, mean 2.605 μs, 99ᵗʰ 3.000 μs
│  0 allocations
│                                                  ◡   *      ◕
│                                                  █
│  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂▂▁▁▁▂▁▁▂▁▂▂▂▁▂▁▂▂▁▁▂▂█▂▂▂▂▂▂▂▂▂▃▇▂▁▁▁▁▁▁▂▁▂▂▁▁▂▂▁▂▁▁▁▂▂ ▂
└  1.9 μs                  10_000 samples, each 9 evaluations                    3 μs +

julia> [ @benchmark $elbo1_tullio($G, $p, $mu, $var, $x)
         @benchmark $elbo1($G, $p, $mu, $var, $x) ]
2-element Vector{BenchmarkTools.Trial}:
┌ Trial [1]:
│  min 2.056 μs, median 2.069 μs, mean 2.073 μs, 99ᵗʰ 2.102 μs
│  1 allocation, 16 bytes
│                                               ◔◡*
│                                               ▄█
│  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▄▁██▆▃▁▃▂▂▂▁▂▂▂▂▁▂▂▂▁▂▂▂▂▁▁▂▁▁▁▁▁▂▁▁▁▁▁ ▂
└  1.9 μs                  10_000 samples, each 9 evaluations                  2.2 μs +
┌ Trial [2]:
│  min 1.992 μs, median 2.004 μs, mean 2.006 μs, 99ᵗʰ 2.021 μs
│  0 allocations
│                             ◔◡*
│                             ▂█
│  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃██▇▄▃▁▂▂▂▂▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▁ ▂
└  1.9 μs                  10_000 samples, each 10 evaluations                 2.2 μs +