# LoopVectorization: @turbo performs worse than @inbounds on trivial loop

I am a long-time admirer of julia who is finally getting serious about learning it. I am comparing the timings when applying `@turbo`, `@inbounds`, and `@simd` to the loop in a simple element-wise multiplication function and am surprised at the timings Iβm seeing. Here is my code:

``````# runtimings.jl
using LoopVectorization: @turbo
using BenchmarkTools: @btime
using Statistics: median

function vmul!(a, b, c)
for i in eachindex(a)
c[i] = a[i] * b[i]
end
end

function vmul_turbo!(a, b, c)
@turbo for i in eachindex(a)
c[i] = a[i] * b[i]
end
end

function vmul_inbounds!(a, b, c)
@inbounds for i in eachindex(a)
c[i] = a[i] * b[i]
end
end

function vmul_inbounds_simd!(a, b, c)
@inbounds @simd for i in eachindex(a)
c[i] = a[i] * b[i]
end
end

function run_btime(func)
a = Float64.(collect(1:64))
b = Float64.(collect(1:64))
c = zero(a)
@btime \$func(\$a, \$b, \$c)
end

print("baseline:")
run_btime(vmul!)
print("@turbo:")
run_btime(vmul_turbo!)
print("@inbounds:")
run_btime(vmul_inbounds!)
print("@inbounds @simd:")
run_btime(vmul_inbounds_simd!)
``````

Here are the timings I get:

``````\$ julia runtimings.jl
baseline:  91.415 ns (0 allocations: 0 bytes)
@turbo:  28.591 ns (0 allocations: 0 bytes)
@inbounds:  20.238 ns (0 allocations: 0 bytes)
@inbounds @simd:  20.907 ns (0 allocations: 0 bytes)
``````

I am running this on an i5-3320M with julia 1.6.1 and LoopVectorization v0.12.65.

I have two main questions:

1. Why are the timings for `@inbounds` and `@inbounds @simd` essentially the same, i.e. why do I not get any additional speedup from `@simd`? `lscpu` returns `sse`, `sse2`, and `avx`, so Iβd expect some speedup for floating-point operations.
2. Why is `@turbo` worse than `@inbounds`?

I ran the timings across a range of array sizes to see what difference it makes and the relative speed ordering seems the same: `@turbo` < `@inbounds` == `@inbounds @simd`.

Thanks very much for any guidance.

Mark the `@turbo` function `@inline`:

``````julia> a = rand(64);

julia> b = rand(64);

julia> c = rand(64);

julia> @inline function vmul_turbo!(a, b, c)
@turbo for i in eachindex(a)
c[i] = a[i] * b[i]
end
end
vmul_turbo! (generic function with 1 method)

julia> @inline function vmul_inbounds_simd!(a, b, c)
@inbounds @simd for i in eachindex(a)
c[i] = a[i] * b[i]
end
end
vmul_inbounds_simd! (generic function with 1 method)

julia> @inline function vmul_inbounds!(a, b, c)
@inbounds for i in eachindex(a)
c[i] = a[i] * b[i]
end
end
vmul_inbounds! (generic function with 1 method)

julia> @btime vmul_turbo!(\$c, \$a, \$b)
3.374 ns (0 allocations: 0 bytes)

julia> @btime vmul_inbounds_simd!(\$c, \$a, \$b)
3.938 ns (0 allocations: 0 bytes)

julia> @btime vmul_inbounds!(\$c, \$a, \$b)
3.942 ns (0 allocations: 0 bytes)
``````

This is what I got before (without `@inline`):

``````julia> @btime vmul_inbounds!(\$c, \$a, \$b) # inlined speed: 3.942 ns (0 allocations: 0 bytes)
3.938 ns (0 allocations: 0 bytes)

julia> @btime vmul_inbounds_simd!(\$c, \$a, \$b) # inlined speed: 3.938 ns (0 allocations: 0 bytes)
3.944 ns (0 allocations: 0 bytes)

julia> @btime vmul_turbo!(\$c, \$a, \$b) # inlined speed: 3.374 ns (0 allocations: 0 bytes)
11.054 ns (0 allocations: 0 bytes)
``````

Basically, Juliaβs compiler automatically inlines functions it estimates to be cheap.
It however automatically estimates `@turbo` to be very expensive (because it things `llvmcall`, which it relies on, is very expensive β even though it normally calls just a single instruction [or less!]).

Hence, the functions without `@turbo` inlined into your benchmark, but the `@turbo` version did not, adding function call overhead.

For a βfairβ comparison, you should add `@inline` (or maybe `@noinline`) to all the functions.

Why are the timings for `@inbounds` and `@inbounds @simd` essentially the same, i.e. why do I not get any additional speedup from `@simd` ? `lscpu` returns `sse` , `sse2` , and `avx` , so Iβd expect some speedup for floating-point operations.

`@simd` tells the compiler it is okay to change the order of operations in order to SIMD your code, i.e. assume that floating point operations are associative (even though they arenβt!).
This speeds up your code, but changes the answer. Normally, it actually makes the answer more accurate, because using separate parallel accumulators in a reduction (e.g. a sum) makes it much more accurate (along with much faster).
However, this changes the result vs what you literally wrote, so you must give the compiler permission to do it.

With your simple loop, however, the compiler can SIMD it without changing the answer at all.
Try:

``````julia> @code_llvm debuginfo=:none vmul_inbounds!(c, a, b)
``````

Among the output, I get:

``````vector.body:                                      ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%19 = getelementptr inbounds double, double* %16, i64 %index
%20 = bitcast double* %19 to <8 x double>*
%wide.load = load <8 x double>, <8 x double>* %20, align 8
%21 = getelementptr inbounds double, double* %19, i64 8
%22 = bitcast double* %21 to <8 x double>*
%wide.load17 = load <8 x double>, <8 x double>* %22, align 8
%23 = getelementptr inbounds double, double* %19, i64 16
%24 = bitcast double* %23 to <8 x double>*
%wide.load18 = load <8 x double>, <8 x double>* %24, align 8
%25 = getelementptr inbounds double, double* %19, i64 24
%26 = bitcast double* %25 to <8 x double>*
%wide.load19 = load <8 x double>, <8 x double>* %26, align 8
%27 = getelementptr inbounds double, double* %17, i64 %index
%28 = bitcast double* %27 to <8 x double>*
%wide.load20 = load <8 x double>, <8 x double>* %28, align 8
%29 = getelementptr inbounds double, double* %27, i64 8
%30 = bitcast double* %29 to <8 x double>*
%wide.load21 = load <8 x double>, <8 x double>* %30, align 8
%31 = getelementptr inbounds double, double* %27, i64 16
%32 = bitcast double* %31 to <8 x double>*
%wide.load22 = load <8 x double>, <8 x double>* %32, align 8
%33 = getelementptr inbounds double, double* %27, i64 24
%34 = bitcast double* %33 to <8 x double>*
%wide.load23 = load <8 x double>, <8 x double>* %34, align 8
%39 = getelementptr inbounds double, double* %18, i64 %index
%40 = bitcast double* %39 to <8 x double>*
store <8 x double> %35, <8 x double>* %40, align 8
%41 = getelementptr inbounds double, double* %39, i64 8
%42 = bitcast double* %41 to <8 x double>*
store <8 x double> %36, <8 x double>* %42, align 8
%43 = getelementptr inbounds double, double* %39, i64 16
%44 = bitcast double* %43 to <8 x double>*
store <8 x double> %37, <8 x double>* %44, align 8
%45 = getelementptr inbounds double, double* %39, i64 24
%46 = bitcast double* %45 to <8 x double>*
store <8 x double> %38, <8 x double>* %46, align 8
%index.next = add i64 %index, 32
%47 = icmp eq i64 %index.next, %n.vec
br i1 %47, label %middle.block, label %vector.body
``````

Note all the `<8 x double>`. This is an easy way to confirm SIMD.

Also, as a final comment, Iβd be interested in seeing the chart with `@inline` (preferably) or `@noinline` applied to each of the functions. I expect LoopVectorization to win overall, particularly whenever `N % 16` is more than a couple.

11 Likes

Thanks so much for these insights.

That makes sense that the compiler can SIMD it. As a follow-up, is it guaranteed that the compiler will automatically SIMD code that has no associativity concerns?

As an aside, if I change my functions to compute the dot product instead of the element-wise product, and so julia cannot automatically SIMD it without permission, I see timings like I originally expected (N=64 here):

``````@turbo:  15.907 ns (0 allocations: 0 bytes)
@inbounds:  56.515 ns (0 allocations: 0 bytes)
@inbounds @simd:  16.286 ns (0 allocations: 0 bytes)
``````

Indeed my goal with this exercise was to eventually recreate that wonderful sawtooth plot in the LoopVectorization.jl docs. I will rerun this tonight and post the result.

2 Likes

In the following, all variants are marked with `@inline` except for `@turbo @noinline`.

Here is the plot for the element-wise multiplication. I wish I had shrunk the x-range a bit so it would be easier to see detail, but unfortunately I didnβt keep the timings.

And here is one for the dot product:

And zoomed in for detail:

Somewhat unexpected results for `@turbo`, no? Here is the function corresponding to `@turbo` in that plot:

``````@inline function dot_product_turbo(a, b)
dp = zero(eltype(a))
@turbo for i in eachindex(a)
dp = dp + a[i] * b[i]
end
return dp
end
``````

Interesting, `@turbo` is much less steady than I would have expected in the dot product benchmark.
It looks like `@turbo` is only faster when `N % 16 >= 8`.

`@turbo`'s performance is much worse when `N % 8 != 0`, which strikes me as a little unusual. 8 iterations corresponds to a cacheline,

If you have code to copy/paste and run the benchmarks, I can run them on Haswell and Tiger Lake laptops.

Also, LVβs loop remainder handling is worse on older CPUs that have relatively slower vmaskmov instructions (not applicable to AVX512 CPUs, which donβt need it).

Something else you could run

``````using Random, BenchmarkTools, LoopVectorization
x = rand(512); y = rand(512);
zs = similar(x); Ns = shuffle(axes(x,1));

@inline function dotsimd(x,y)
s = zero(promote_type(eltype(x),eltype(y)))
@inbounds @simd for i in eachindex(x,y)
s += x[i]*y[i]
end
s
end
@inline function dotturbo(x,y)
s = zero(promote_type(eltype(x),eltype(y)))
@turbo for i in eachindex(x,y)
s += x[i]*y[i]
end
s
end

foreachn!(f::F, zs, x, y, Ns) where {F} = map!(n -> f(view(x,1:n),view(y,1:n)), zs, Ns)
@benchmark foreachn!(dotsimd, \$zs, \$x, \$y, \$Ns)
@benchmark foreachn!(dotturbo, \$zs, \$x, \$y, \$Ns)
``````

I get

``````julia> @benchmark foreachn!(dotsimd, \$zs, \$x, \$y, \$Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max):  10.965 ΞΌs β¦  26.926 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     11.500 ΞΌs               β GC (median):    0.00%
Time  (mean Β± Ο):   11.438 ΞΌs Β± 700.343 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

ββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
11 ΞΌs           Histogram: frequency by time         12.6 ΞΌs <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foreachn!(dotturbo, \$zs, \$x, \$y, \$Ns)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
Range (min β¦ max):  7.171 ΞΌs β¦  17.318 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     7.889 ΞΌs               β GC (median):    0.00%
Time  (mean Β± Ο):   7.948 ΞΌs Β± 261.149 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

β        β        β        ββ        ββ        βββ        β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
7.17 ΞΌs      Histogram: log(frequency) by time      8.28 ΞΌs <

Memory estimate: 0 bytes, allocs estimate: 0.
``````
Old Benchmarks with BenchmarkTools Histogram Bug
``````julia> @benchmark foreachn!(dotsimd, \$zs, \$x, \$y, \$Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max):  10.573 ΞΌs β¦  28.419 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     10.819 ΞΌs               β GC (median):    0.00%
Time  (mean Β± Ο):   11.057 ΞΌs Β± 784.480 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
19.7 ΞΌs         Histogram: frequency by time         11.1 ΞΌs <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foreachn!(dotturbo, \$zs, \$x, \$y, \$Ns)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
Range (min β¦ max):  6.740 ΞΌs β¦  14.426 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     7.771 ΞΌs               β GC (median):    0.00%
Time  (mean Β± Ο):   7.774 ΞΌs Β± 609.367 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

β
βββββββββββββββββββββββββββββ ββββββββββββββββββββββββββββ  β
12.4 ΞΌs         Histogram: frequency by time        7.77 ΞΌs <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.8.0-DEV.390
Commit c88db4e32a* (2021-08-23 19:46 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
``````

I.e., much better performance over the random assortment of sizes from LoopVectorization.
However, IIRC, performance on my Haswell CPU is roughly comparable between the two.

Note that the main loop body is the same between `@simd` and `@turbo` here, the primary difference is just how they handle remainders.
Things are more likely to differ for more complicated loops, e.g. if you evaluate `exp` or `log`.

Also, on your CPU, `exp` vectorized by `@turbo` should actually be more accurate than the `Base` version.

4 Likes

Here is what I get β much less of a difference:

``````julia> @benchmark foreachn!(dotsimd, \$zs, \$x, \$y, \$Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max):  34.489 ΞΌs β¦ 113.794 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     34.863 ΞΌs               β GC (median):    0.00%
Time  (mean Β± Ο):   35.359 ΞΌs Β±   2.933 ΞΌs  β GC (mean Β± Ο):  0.00% Β± 0.00%

βββ       ββ                                                 β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
34.5 ΞΌs       Histogram: log(frequency) by time      48.9 ΞΌs <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foreachn!(dotturbo, \$zs, \$x, \$y, \$Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max):  33.193 ΞΌs β¦ 77.087 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     33.623 ΞΌs              β GC (median):    0.00%
Time  (mean Β± Ο):   34.231 ΞΌs Β±  2.594 ΞΌs  β GC (mean Β± Ο):  0.00% Β± 0.00%

ββββ       β       β                                        β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
33.2 ΞΌs      Histogram: log(frequency) by time      46.6 ΞΌs <

Memory estimate: 0 bytes, allocs estimate: 0.
``````

Are you doing something special to get such tight histograms? I just left my laptop alone while it ran, but I still had a browser open and Iβm sure my OS was running things in the background that I could have disabled. Maybe CPU pinning or something?

The benchmarking code for that plot is nothing special, but here it is if you want to run it yourself:

Summary
``````using LoopVectorization: @turbo
using BenchmarkTools: @benchmark
using Statistics: median

@inline function dot_product(a, b)
dp = zero(eltype(a))
for i in eachindex(a)
dp = dp + a[i] * b[i]
end
return dp
end

@inline function dot_product_turbo(a, b)
dp = zero(eltype(a))
@turbo for i in eachindex(a)
dp = dp + a[i] * b[i]
end
return dp
end

@inline function dot_product_inbounds(a, b)
dp = zero(eltype(a))
@inbounds for i in eachindex(a)
dp = dp + a[i] * b[i]
end
return dp
end

@inline function dot_product_inbounds_simd(a, b)
dp = zero(eltype(a))
@inbounds @simd for i in eachindex(a)
dp = dp + a[i] * b[i]
end
return dp
end

@noinline function dot_product_turbo_noinline(a, b)
dp = zero(eltype(a))
@turbo for i in eachindex(a)
dp = dp + a[i] * b[i]
end
return dp
end

function get_times_dot_product(func, N)
times = Float64[]
for n in N
a = rand(n)
b = rand(n)

bench = @benchmark \$func(\$a, \$b)
push!(times, median(bench.times))
end
return times
end

N = range(32, 256, step=1)

times1 = get_times_dot_product(dot_product, N)
times2 = get_times_dot_product(dot_product_turbo, N)
times3 = get_times_dot_product(dot_product_inbounds, N)
times4 = get_times_dot_product(dot_product_inbounds_simd, N)
times5 = get_times_dot_product(dot_product_turbo_noinline, N)

# plots

using CairoMakie: Figure, Axis, Legend, lines!

fig = Figure(resolution = (1000,500))
ax = Axis(fig[1, 1], xlabel="N", ylabel="GFLOPS")

l1 = lines!(ax, N, N ./ times1)
l2 = lines!(ax, N, N ./ times2)
l3 = lines!(ax, N, N ./ times3)
l4 = lines!(ax, N, N ./ times4)
l5 = lines!(ax, N, N ./ times5)

fig[1, 2] = Legend(fig, [l1, l2, l3, l4, l5], ["Baseline", "@turbo", "@inbounds", "@inbounds @simd", "@turbo @noinline"])

display(fig)
``````
2 Likes

Yes, on an old-ish i7-4790k Haswell, they have more comparable performance :

``````julia> @benchmark foreachn!(dotsimd, \$zs, \$x, \$y, \$Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max):  19.600 ΞΌs β¦ 162.000 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     19.700 ΞΌs               β GC (median):    0.00%
Time  (mean Β± Ο):   19.822 ΞΌs Β±   2.164 ΞΌs  β GC (mean Β± Ο):  0.00% Β± 0.00%

β  β   β  β                    β                             β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
19.6 ΞΌs       Histogram: log(frequency) by time      21.3 ΞΌs <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foreachn!(dotturbo, \$zs, \$x, \$y, \$Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max):  17.300 ΞΌs β¦  32.100 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     17.400 ΞΌs               β GC (median):    0.00%
Time  (mean Β± Ο):   17.434 ΞΌs Β± 370.649 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

β    β     β                                         β       β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
17.3 ΞΌs       Histogram: log(frequency) by time      18.3 ΞΌs <

Memory estimate: 0 bytes, allocs estimate: 0.
``````
2 Likes

Actually, I think itβs a bug in the histogram code.
Look at the min and max times (which I believe are reported correctly) vs the range of histogram values. They donβt match up.
A minimum and median time of 10.573 and 10.819 microseconds, respectively, while the histogramβs range is from a minimum of 19.7 microseconds to a maximum of 11.1 microseconds, even though 19.7 > 11.1.

You could try temporarily disabling your laptopβs turbo/boost to make sure it maintains a steady clock speed throughout your benchmark, but I didnβt do this.

Results on an i7 1165G6:

2 Likes
3 Likes

Great, thanks. I updated to to BenchmarkTools v1.1.4 and edited my post.

1 Like