Tullio seems two times slower than basic LoopVectorization

I wanted to use Tullio to get rid of lots of nested loops in my code and write Einstein notation instead.

Code

I’m trying to compute a simple sum:

elbo_tullio(G::AbstractMatrix, p::AbstractVector, mu::AbstractVector, var::AbstractVector, x::AbstractVector) =
    @tullio ret := G[k,n] * (log(p[k]) - (log(2pi) + log(var[k]) + (x[n] - mu[k])^2 / var[k]) / 2 - log(G[k, n] + 1e-100)) grad=false

The same function written with raw loops using LoopVectorization:

function elbo(G, p, mu, var, x)
    K, N = size(G)
    ret = G |> eltype |> zero
    @tturbo for k ∈ 1:K, n ∈ 1:N
        q = G[k, n]
        ret += q * (
            log(p[k]) - (log(2pi) + log(var[k]) + (x[n] - mu[k])^2 / var[k]) / 2 - log(q + 1e-100)
        )
    end

    ret
end
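For checking that both versions compute the right thing, here is the same sum as plain Julia loops with no macros at all (elbo_ref is just a name for this sketch):

```julia
# Plain-Julia reference implementation (no @tullio / @tturbo), useful for
# verifying the macro versions agree. Same loop body as `elbo` above.
function elbo_ref(G, p, mu, var, x)
    K, N = size(G)
    ret = zero(eltype(G))
    for n in 1:N, k in 1:K
        q = G[k, n]
        ret += q * (
            log(p[k]) - (log(2pi) + log(var[k]) + (x[n] - mu[k])^2 / var[k]) / 2 - log(q + 1e-100)
        )
    end
    ret
end
```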

Benchmarks

$ julia-1.8 --threads=4
julia> G, p, mu, var, x = rand(3, 400), rand(3), randn(3), rand(3), randn(400);

julia> # Define `elbo` and `elbo_tullio`...

julia> elbo_tullio(G, p, mu, var, x) ≈ elbo(G, p, mu, var, x)
true

julia> @benchmark $elbo_tullio($G, $p, $mu, $var, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.097 μs … 79.455 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.237 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.558 μs ±  1.491 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇▇█▁     ▃▃▄                 ▂▂▃        ▁▂▂                 ▂
  ████▄▁▃▃▁████▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███▇▆▃▁▃▁▁▁████▅▄▄▃▃▃▃▁▄▃▁▁▁▁▃ █
  18.1 μs      Histogram: log(frequency) by time      21.9 μs <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark $elbo($G, $p, $mu, $var, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.715 μs … 110.423 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      9.793 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.095 μs ±   1.826 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▃         ▅▄    ▃▄                                         ▂
  ███▁▁▁▁▁▁▁▁███▇▅▄▃███▆▃▄▄▅▄▃▃▁▃▁▁▃▁▁▃▄▅▄▁▄▃▃▁▁▁▁▁▃▁▃▃▁▁▃▁▃▁▃ █
  9.72 μs       Histogram: log(frequency) by time      13.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.8.0-beta3
Commit 3e092a2521 (2022-03-29 15:42 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: 4 × Intel(R) Core(TM) i5-3330S CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, ivybridge)
  Threads: 4 on 4 virtual cores

Apparently, Tullio is about 1.84 times slower than my basic implementation? Moreover, Tullio uses only one core, while LoopVectorization uses two. I guess that’s why Tullio is around two times slower? I tried playing around with the threads option to @tullio and didn’t get any speedups. For example, with the above setup, any threads<100 uses all 4 cores, but takes 45 μs to complete, which is way slower than the 18 μs of my original Tullio code.


Am I doing something wrong? Does it not make sense to use Tullio here? Is it possible to get the speed back with Tullio?

  • Julia 1.8.0-beta3
  • LoopVectorization v0.12.107
  • Tullio v0.3.3

Are the comparisons the same if you replace 400 by 4000? Your arrays are pretty small, so you may not be able to use more than one core effectively.
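For a quick scan over sizes, something like this rough Base-only helper works (bench is a made-up name here; BenchmarkTools.@benchmark gives much more reliable numbers):

```julia
# Crude best-of-many wall-clock timing in seconds, using only Base.
# Taking the minimum over many repeats suppresses noise somewhat.
function bench(f, args...; reps=1_000)
    f(args...)                      # warm up / force compilation
    minimum(@elapsed(f(args...)) for _ in 1:reps)
end

# Hypothetical usage, scanning N as suggested:
# for N in (100, 400, 4000)
#     G, p, mu, var, x = rand(3, N), rand(3), randn(3), rand(3), randn(N)
#     @show N bench(elbo_tullio, G, p, mu, var, x) bench(elbo, G, p, mu, var, x)
# end
```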

With 4000 Tullio is on par with LoopVectorization, as expected:

  • Tullio: mean 98.010 μs ± 17.693 μs
  • LoopVectorization: mean 98.271 μs ± 6.692 μs

However, I don’t need such big arrays: I’m trying to squeeze as much speed as possible out of arrays of sizes 100-500, but that’s where Tullio slows down significantly. Essentially, I wanted to use Tullio purely because of the concise Einstein notation, expecting the speed to be the same as for LoopVectorization.

Yes, Tullio is deciding not to multi-thread this, based on some heuristic. Whereas @tturbo seems to be getting a factor of 2 gain from multi-threading. Without threading, they are more or less the same speed.

The heuristic isn’t completely wrong, as overriding it to use 2 or 4 threads makes it slower, as you say. I believe this is about the overhead of Julia’s @spawn etc. Although it’s possible that I’ve missed something, and there is some way to make this step cheaper. Might be worth investigating.
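That overhead is easy to measure in isolation. A minimal Base-only sketch (nothing here is Tullio-specific; spawn_overhead and work are made-up names):

```julia
using Base.Threads: @spawn

# Measure the round-trip cost of spawning a task and fetching its result,
# averaged over many repetitions.
work(x) = x + 1

function spawn_overhead(n)
    fetch(@spawn work(0))          # warm up / compile
    t = @elapsed for i in 1:n
        fetch(@spawn work(i))
    end
    t / n                          # seconds per spawn + fetch
end
```

Typically this comes out around a microsecond or more per task, which is already a large fraction of the whole 3×400 reduction above.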

LoopVectorization.jl avoids that by using its own Polyester.jl, which is good at exactly this problem: getting value out of multiple threads when the problem is only just big enough. It would be nice to teach Tullio.jl to use it, but I haven’t got there – #113 is the issue.

julia> N = 400; G, p, mu, var, x = rand(3, N), rand(3), randn(3), rand(3), randn(N);

# same size as above, same functions:

julia> [ @benchmark $elbo_tullio($G, $p, $mu, $var, $x)
         @benchmark $elbo($G, $p, $mu, $var, $x) ]
2-element Vector{BenchmarkTools.Trial}:
┌ Trial [1]:
│  min 7.739 μs, median 7.781 μs, mean 7.804 μs, 99ᵗʰ 8.469 μs
│  1 allocation, 16 bytes
│                                                                       ◡*
│                                                                        █
│  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄█▂▂▂▂▂▂▂▂▂▂▂ ▂
└  3.5 μs                  10_000 samples, each 4 evaluations                  8.5 μs +
┌ Trial [2]:
│  min 3.531 μs, median 4.026 μs, mean 4.015 μs, 99ᵗʰ 5.672 μs
│  0 allocations
│        ◔ *◡
│        ▂ █
│  ▂▁▁▂▂▄█▂█▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▂▂▂▂▂▁▂▁▂▂▁▂▂▁▂▂▁▂▁▁▂▁▁▂▁▂▁▁▂▁▁▂▁▂▂▁▂▁▁▂▁▁▂▁▂▂▁▂▁▁▂▁▁▂▁▁▂▁ ▂
└  3.5 μs                  10_000 samples, each 8 evaluations                  8.5 μs +

# constrain to 1 thread & they are roughly equivalent:

julia> [ @benchmark $elbo1_tullio($G, $p, $mu, $var, $x)  # with threads=false
         @benchmark $elbo1($G, $p, $mu, $var, $x) ]       # with @turbo
2-element Vector{BenchmarkTools.Trial}:
┌ Trial [1]:
│  min 7.823 μs, median 7.854 μs, mean 7.916 μs, 99ᵗʰ 9.781 μs
│  1 allocation, 16 bytes
│              ◡ *
│              █
│  ▁▁▁▁▁▁▁▁▁▁▁▃█▅▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▂▂▂▁▂▁▂▂▁▂▂▁▂▂▁▂▂▂▂▂▂▁▁▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
└  7.5 μs                  10_000 samples, each 4 evaluations                  9.8 μs +
┌ Trial [2]:
│  min 7.750 μs, median 7.781 μs, mean 7.796 μs, 99ᵗʰ 8.281 μs
│  0 allocations
│           ◡*
│            █
│  ▁▁▁▁▁▁▁▁▂▇█▃▂▂▂▂▁▂▁▁▁▁▁▁▂▁▁▂▂▁▁▁▂▂▂▂▂▂▁▁▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▂▂▁▂▂▁▂▁▁▁▁▂▂▁▁▁ ▂
└  7.5 μs                  10_000 samples, each 4 evaluations                  9.8 μs +

# encourage Tullio to use all 4 threads, by setting the block size threads=100 (tiny)

julia> @benchmark $elbo_tullio_th($G, $p, $mu, $var, $x)
┌ Trial:
│  min 10.000 μs, median 21.292 μs, mean 25.040 μs, 99ᵗʰ 62.336 μs
│  87 allocations, total 4.66 KiB
│              ◔    ◡     *     ◕
│             █ ▂   ▁
│  ▂▁▁▁▁▂▂▁▂▂▄█▇█▃▄▃█▃▅▃▃▂▃▂▃▃▃▃▃▃▂▃▃▃▃▅▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
└  10 μs                    10_000 samples, each 1 evaluation                   63 μs +

# smaller N than original post: much closer, and @turbo faster than @tturbo

julia> N = 100; G, p, mu, var, x = rand(3, N), rand(3), randn(3), rand(3), randn(N);

julia> [ @benchmark $elbo_tullio($G, $p, $mu, $var, $x)
         @benchmark $elbo($G, $p, $mu, $var, $x) ]
2-element Vector{BenchmarkTools.Trial}:
┌ Trial [1]:
│  min 1.995 μs, median 2.014 μs, mean 2.022 μs, 99ᵗʰ 2.088 μs
│  1 allocation, 16 bytes
│          ◡*
│          █
│  ▁▁▁▁▁▁▁▃█▃▂▂▂▂▂▂▂▁▁▁▁▁▂▂▁▂▁▁▂▁▁▂▂▁▁▁▁▁▂▁▁▁▁▂▁▂▂▂▂▂▂▁▂▁▁▁▂▂▂▁▁▂▁▁▂▂▂▂▂▁▂▂▁▂▂▂▂▁▁▂▂▁ ▂
└  1.9 μs                  10_000 samples, each 9 evaluations                    3 μs +
┌ Trial [2]:
│  min 2.097 μs, median 2.551 μs, mean 2.605 μs, 99ᵗʰ 3.000 μs
│  0 allocations
│                                                  ◡   *      ◕
│                                                  █
│  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂▂▁▁▁▂▁▁▂▁▂▂▂▁▂▁▂▂▁▁▂▂█▂▂▂▂▂▂▂▂▂▃▇▂▁▁▁▁▁▁▂▁▂▂▁▁▂▂▁▂▁▁▁▂▂ ▂
└  1.9 μs                  10_000 samples, each 9 evaluations                    3 μs +

julia> [ @benchmark $elbo1_tullio($G, $p, $mu, $var, $x)
         @benchmark $elbo1($G, $p, $mu, $var, $x) ]
2-element Vector{BenchmarkTools.Trial}:
┌ Trial [1]:
│  min 2.056 μs, median 2.069 μs, mean 2.073 μs, 99ᵗʰ 2.102 μs
│  1 allocation, 16 bytes
│                                               ◔◡*
│                                               ▄█
│  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▄▁██▆▃▁▃▂▂▂▁▂▂▂▂▁▂▂▂▁▂▂▂▂▁▁▂▁▁▁▁▁▂▁▁▁▁▁ ▂
└  1.9 μs                  10_000 samples, each 9 evaluations                  2.2 μs +
┌ Trial [2]:
│  min 1.992 μs, median 2.004 μs, mean 2.006 μs, 99ᵗʰ 2.021 μs
│  0 allocations
│                             ◔◡*
│                             ▂█
│  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃██▇▄▃▁▂▂▂▂▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▁ ▂
└  1.9 μs                  10_000 samples, each 10 evaluations                 2.2 μs +