LoopVectorization: @turbo performs worse than @inbounds on trivial loop

I am a long-time admirer of Julia who is finally getting serious about learning it. I am comparing the timings of @turbo, @inbounds, and @simd applied to the loop in a simple element-wise multiplication function, and I am surprised by the results. Here is my code:

# runtimings.jl
using LoopVectorization: @turbo
using BenchmarkTools: @btime
using Statistics: median

function vmul!(a, b, c)
    for i in eachindex(a)
        c[i] = a[i] * b[i]
    end
end

function vmul_turbo!(a, b, c)
    @turbo for i in eachindex(a)
        c[i] = a[i] * b[i]
    end
end

function vmul_inbounds!(a, b, c)
    @inbounds for i in eachindex(a)
        c[i] = a[i] * b[i]
    end
end

function vmul_inbounds_simd!(a, b, c)
    @inbounds @simd for i in eachindex(a)
        c[i] = a[i] * b[i]
    end
end

function run_btime(func)
    a = Float64.(collect(1:64))
    b = Float64.(collect(1:64))
    c = zero(a)
    @btime $func($a, $b, $c)
end

print("baseline:")
run_btime(vmul!)
print("@turbo:")
run_btime(vmul_turbo!)
print("@inbounds:")
run_btime(vmul_inbounds!)
print("@inbounds @simd:")
run_btime(vmul_inbounds_simd!)

Here are the timings I get:

$ julia runtimings.jl 
baseline:  91.415 ns (0 allocations: 0 bytes)
@turbo:  28.591 ns (0 allocations: 0 bytes)
@inbounds:  20.238 ns (0 allocations: 0 bytes)
@inbounds @simd:  20.907 ns (0 allocations: 0 bytes)

I am running this on an i5-3320M with Julia 1.6.1 and LoopVectorization v0.12.65.

I have two main questions:

  1. Why are the timings for @inbounds and @inbounds @simd essentially the same, i.e. why do I not get any additional speedup from @simd? lscpu returns sse, sse2, and avx, so I’d expect some speedup for floating-point operations.
  2. Why is @turbo worse than @inbounds?

I ran the timings across a range of array sizes to see what difference size makes, and the relative ordering stays the same: @turbo is slower than @inbounds, which performs the same as @inbounds @simd.

Thanks very much for any guidance.

Mark the @turbo function @inline:

julia> a = rand(64);

julia> b = rand(64);

julia> c = rand(64);

julia> @inline function vmul_turbo!(a, b, c)
           @turbo for i in eachindex(a)
               c[i] = a[i] * b[i]
           end
       end
vmul_turbo! (generic function with 1 method)

julia> @inline function vmul_inbounds_simd!(a, b, c)
           @inbounds @simd for i in eachindex(a)
               c[i] = a[i] * b[i]
           end
       end
vmul_inbounds_simd! (generic function with 1 method)

julia> @inline function vmul_inbounds!(a, b, c)
           @inbounds for i in eachindex(a)
               c[i] = a[i] * b[i]
           end
       end
vmul_inbounds! (generic function with 1 method)

julia> @btime vmul_turbo!($c, $a, $b)
  3.374 ns (0 allocations: 0 bytes)

julia> @btime vmul_inbounds_simd!($c, $a, $b)
  3.938 ns (0 allocations: 0 bytes)

julia> @btime vmul_inbounds!($c, $a, $b)
  3.942 ns (0 allocations: 0 bytes)

This is what I got before (without @inline):

julia> @btime vmul_inbounds!($c, $a, $b) # inlined speed: 3.942 ns (0 allocations: 0 bytes)
  3.938 ns (0 allocations: 0 bytes)

julia> @btime vmul_inbounds_simd!($c, $a, $b) # inlined speed: 3.938 ns (0 allocations: 0 bytes)
  3.944 ns (0 allocations: 0 bytes)

julia> @btime vmul_turbo!($c, $a, $b) # inlined speed: 3.374 ns (0 allocations: 0 bytes)
  11.054 ns (0 allocations: 0 bytes)

Basically, Julia’s compiler automatically inlines functions it estimates to be cheap.
However, it automatically estimates code using @turbo to be very expensive, because it thinks llvmcall (which @turbo relies on) is very expensive, even though an llvmcall usually compiles down to just a single instruction (or less!).

Hence, the functions without @turbo were inlined into your benchmark, but the @turbo version was not, which added function-call overhead to its measurement.

For a "fair" comparison, you should add @inline (or maybe @noinline) to all the functions.
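
If you want to check whether a given call actually got inlined, one option (just a sketch; the wrapper name below is made up for illustration) is to look at the typed IR of a small caller:

using InteractiveUtils  # for @code_typed

# Hypothetical wrapper, used only to see whether the call to vmul_turbo! inlines.
call_turbo!(a, b, c) = vmul_turbo!(a, b, c)

@code_typed call_turbo!(rand(64), rand(64), rand(64))
# If vmul_turbo! shows up as an explicit call/invoke in the output, it was not
# inlined; if its loop body appears expanded in place, it was.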

Why are the timings for @inbounds and @inbounds @simd essentially the same, i.e. why do I not get any additional speedup from @simd? lscpu returns sse, sse2, and avx, so I’d expect some speedup for floating-point operations.

@simd tells the compiler it is okay to reorder operations in order to SIMD your code, i.e. to assume that floating-point operations are associative (even though they aren’t!).
This speeds up your code but can change the answer. In a reduction (e.g. a sum), it normally makes the answer more accurate, because splitting the work across separate parallel accumulators reduces rounding error (in addition to being much faster).
However, because the result differs from what you literally wrote, you must give the compiler permission to do it.
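
For instance (a small sketch; the function names here are made up for illustration), a plain summation loop and an @simd summation loop generally give slightly different results on the same data, because the additions are re-associated:

function sum_loop(x)
    s = zero(eltype(x))
    @inbounds for i in eachindex(x)
        s += x[i]
    end
    return s
end

function sum_loop_simd(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

x = rand(10^6);
sum_loop(x) == sum_loop_simd(x)  # typically false: the additions happen in a different order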

With your simple loop, however, the compiler can SIMD it without changing the answer at all.
Try:

julia> @code_llvm debuginfo=:none vmul_inbounds!(c, a, b)

Among the output, I get:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %19 = getelementptr inbounds double, double* %16, i64 %index
  %20 = bitcast double* %19 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %20, align 8
  %21 = getelementptr inbounds double, double* %19, i64 8
  %22 = bitcast double* %21 to <8 x double>*
  %wide.load17 = load <8 x double>, <8 x double>* %22, align 8
  %23 = getelementptr inbounds double, double* %19, i64 16
  %24 = bitcast double* %23 to <8 x double>*
  %wide.load18 = load <8 x double>, <8 x double>* %24, align 8
  %25 = getelementptr inbounds double, double* %19, i64 24
  %26 = bitcast double* %25 to <8 x double>*
  %wide.load19 = load <8 x double>, <8 x double>* %26, align 8
  %27 = getelementptr inbounds double, double* %17, i64 %index
  %28 = bitcast double* %27 to <8 x double>*
  %wide.load20 = load <8 x double>, <8 x double>* %28, align 8
  %29 = getelementptr inbounds double, double* %27, i64 8
  %30 = bitcast double* %29 to <8 x double>*
  %wide.load21 = load <8 x double>, <8 x double>* %30, align 8
  %31 = getelementptr inbounds double, double* %27, i64 16
  %32 = bitcast double* %31 to <8 x double>*
  %wide.load22 = load <8 x double>, <8 x double>* %32, align 8
  %33 = getelementptr inbounds double, double* %27, i64 24
  %34 = bitcast double* %33 to <8 x double>*
  %wide.load23 = load <8 x double>, <8 x double>* %34, align 8
  %35 = fmul <8 x double> %wide.load, %wide.load20
  %36 = fmul <8 x double> %wide.load17, %wide.load21
  %37 = fmul <8 x double> %wide.load18, %wide.load22
  %38 = fmul <8 x double> %wide.load19, %wide.load23
  %39 = getelementptr inbounds double, double* %18, i64 %index
  %40 = bitcast double* %39 to <8 x double>*
  store <8 x double> %35, <8 x double>* %40, align 8
  %41 = getelementptr inbounds double, double* %39, i64 8
  %42 = bitcast double* %41 to <8 x double>*
  store <8 x double> %36, <8 x double>* %42, align 8
  %43 = getelementptr inbounds double, double* %39, i64 16
  %44 = bitcast double* %43 to <8 x double>*
  store <8 x double> %37, <8 x double>* %44, align 8
  %45 = getelementptr inbounds double, double* %39, i64 24
  %46 = bitcast double* %45 to <8 x double>*
  store <8 x double> %38, <8 x double>* %46, align 8
  %index.next = add i64 %index, 32
  %47 = icmp eq i64 %index.next, %n.vec
  br i1 %47, label %middle.block, label %vector.body

Note all the <8 x double>. This is an easy way to confirm SIMD.
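
If you'd rather not eyeball the IR, a small helper along these lines (a sketch; is_vectorized is a made-up name) can search the optimized LLVM IR for vector types programmatically:

using InteractiveUtils: code_llvm

# Hypothetical helper: true if the optimized IR for f(args...) contains <N x double> vectors.
function is_vectorized(f, args...)
    io = IOBuffer()
    code_llvm(io, f, Base.typesof(args...); debuginfo=:none)
    return occursin(r"<\d+ x double>", String(take!(io)))
end

is_vectorized(vmul_inbounds!, c, a, b)  # should return true here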

Also, as a final comment, I’d be interested in seeing the chart with @inline (preferably) or @noinline applied to each of the functions. I expect LoopVectorization to win overall, particularly whenever N % 16 is more than a couple.


Thanks so much for these insights.

That makes sense that the compiler can SIMD it. As a follow-up, is it guaranteed that the compiler will automatically SIMD code that has no associativity concerns?

As an aside, if I change my functions to compute the dot product instead of the element-wise product, so that Julia cannot automatically SIMD the loop without permission, I see timings like I originally expected (N=64 here):

@turbo:  15.907 ns (0 allocations: 0 bytes)
@inbounds:  56.515 ns (0 allocations: 0 bytes)
@inbounds @simd:  16.286 ns (0 allocations: 0 bytes)

Indeed my goal with this exercise was to eventually recreate that wonderful sawtooth plot in the LoopVectorization.jl docs. I will rerun this tonight and post the result.


In the following, all variants are marked with @inline except for @turbo @noinline.

Here is the plot for the element-wise multiplication. I wish I had shrunk the x-range a bit so it would be easier to see detail, but unfortunately I didn’t keep the timings.

And here is one for the dot product:

And zoomed in for detail:

Somewhat unexpected results for @turbo, no? Here is the function corresponding to @turbo in that plot:

@inline function dot_product_turbo(a, b)
    dp = zero(eltype(a))
    @turbo for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

Interesting, @turbo is much less steady than I would have expected in the dot product benchmark.
It looks like @turbo is only faster when N % 16 >= 8.

@turbo’s performance is much worse when N % 8 != 0, which strikes me as a little unusual; 8 iterations correspond to one cache line (8 × 8 bytes = 64 bytes).

If you have code to copy/paste and run the benchmarks, I can run them on Haswell and Tiger Lake laptops.

Also, LV’s loop remainder handling is worse on older CPUs that have relatively slower vmaskmov instructions (not applicable to AVX512 CPUs, which don’t need it).

Something else you could run:

using Random, BenchmarkTools, LoopVectorization
x = rand(512); y = rand(512);
zs = similar(x); Ns = shuffle(axes(x,1));

@inline function dotsimd(x,y)
    s = zero(promote_type(eltype(x),eltype(y)))
    @inbounds @simd for i in eachindex(x,y)
        s += x[i]*y[i]
    end
    s
end
@inline function dotturbo(x,y)
    s = zero(promote_type(eltype(x),eltype(y)))
    @turbo for i in eachindex(x,y)
        s += x[i]*y[i]
    end
    s
end

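# foreachn! applies f to the leading n-element views of x and y for every n in Ns
# (a shuffled set of lengths), writing the results into zs, so a single call
# benchmarks many different vector lengths back to back.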
foreachn!(f::F, zs, x, y, Ns) where {F} = map!(n -> f(view(x,1:n),view(y,1:n)), zs, Ns)
@benchmark foreachn!(dotsimd, $zs, $x, $y, $Ns)
@benchmark foreachn!(dotturbo, $zs, $x, $y, $Ns)

I get

julia> @benchmark foreachn!(dotsimd, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.965 μs …  26.926 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.438 μs ± 700.343 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     █▆
  ▄█▄▁▂▁▁▁▂▃▅▃▂▁▁▁▁▁▃██▃▂▁▁▁▂▁▂▃▃▂▂▂▂▁▁▁▂▂▂▂▂▂▂▁▂▁▁▂▂▂▂▂▂▂▂▂▂▂ ▂
  11 μs           Histogram: frequency by time         12.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foreachn!(dotturbo, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  7.171 μs …  17.318 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.889 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.948 μs ± 261.149 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃        ▁        ▁        ▄▄        ▇▇        ▃█▂        ▂ ▂
  █▆▁▁▁▁▁▁▄█▄▃▃▃▁▁▁▄█▃▁▁▁▁▁▁▁██▁▁▁▁▃▃▃▁██▃▄▁▁▄▄▁▄███▄▅▇█▇▇▇██ █
  7.17 μs      Histogram: log(frequency) by time      8.28 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Old benchmarks (affected by the BenchmarkTools histogram bug discussed further down):
julia> @benchmark foreachn!(dotsimd, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.573 μs …  28.419 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.819 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.057 μs ± 784.480 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
  19.7 μs         Histogram: frequency by time         11.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foreachn!(dotturbo, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.740 μs …  14.426 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.771 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.774 μs ± 609.367 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁  ▁
  12.4 μs         Histogram: frequency by time        7.77 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.8.0-DEV.390
Commit c88db4e32a* (2021-08-23 19:46 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

I.e., LoopVectorization gives much better performance over this random assortment of sizes.
However, IIRC, performance on my Haswell CPU is roughly comparable between the two.

Note that the main loop body is the same between @simd and @turbo here; the primary difference is just how they handle loop remainders.
Things are more likely to differ for more complicated loops, e.g. if you evaluate exp or log.

Also, on your CPU, exp vectorized by @turbo should actually be more accurate than the Base version.
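
If you want to spot-check that claim, something along these lines (a sketch; exp_turbo! is a made-up name, and this is not a rigorous error analysis) compares both against a BigFloat reference:

using LoopVectorization

function exp_turbo!(y, x)
    @turbo for i in eachindex(x)
        y[i] = exp(x[i])
    end
    return y
end

x = 10 .* randn(10_000);
ref = exp.(big.(x));                          # high-precision reference
relerr(v) = maximum(abs.((v .- ref) ./ ref))  # maximum relative error
relerr(exp.(x)), relerr(exp_turbo!(similar(x), x))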


Here is what I get – much less of a difference:

julia> @benchmark foreachn!(dotsimd, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  34.489 μs … 113.794 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.863 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.359 μs ±   2.933 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▄       ▂▁                                                 ▁
  ███▄▁▁▁▁▁▅██▇▇▆▅▄▁▃▁▁▄▃▁▃▄▁▄▃▁▁▁▁▃▄▁▅▆▆▅▅▆▄▅▅▅▄▁▁▅▃▃▅▄▅▅▆▅▆▅ █
  34.5 μs       Histogram: log(frequency) by time      48.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foreachn!(dotturbo, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  33.193 μs … 77.087 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     33.623 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.231 μs ±  2.594 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▇▄       ▃       ▃                                        ▁
  ████▅▃▆▄▁▃▇██▇▆▄▃▃▇█▃▄▃▄▄▄▄▄▃▅▃▃▁▅▄▄▁▃▃▅▇▇▆▇▇▇▆▇▆▅▁▃▃▃▃▃▄▄▆ █
  33.2 μs      Histogram: log(frequency) by time      46.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Are you doing something special to get such tight histograms? I just left my laptop alone while it ran, but I still had a browser open and I’m sure my OS was running things in the background that I could have disabled. Maybe CPU pinning or something?

The benchmarking code for that plot is nothing special, but here it is if you want to run it yourself:

using LoopVectorization: @turbo
using BenchmarkTools: @benchmark
using Statistics: median

@inline function dot_product(a, b)
    dp = zero(eltype(a))
    for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@inline function dot_product_turbo(a, b)
    dp = zero(eltype(a))
    @turbo for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@inline function dot_product_inbounds(a, b)
    dp = zero(eltype(a))
    @inbounds for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@inline function dot_product_inbounds_simd(a, b)
    dp = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@noinline function dot_product_turbo_noinline(a, b)
    dp = zero(eltype(a))
    @turbo for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end


function get_times_dot_product(func, N)
    times = Float64[]
    for n in N
        a = rand(n)
        b = rand(n)

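        # bench.times holds the sample times in nanoseconds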
        bench = @benchmark $func($a, $b)
        push!(times, median(bench.times))
    end
    return times
end

N = range(32, 256, step=1)

times1 = get_times_dot_product(dot_product, N)
times2 = get_times_dot_product(dot_product_turbo, N)
times3 = get_times_dot_product(dot_product_inbounds, N)
times4 = get_times_dot_product(dot_product_inbounds_simd, N)
times5 = get_times_dot_product(dot_product_turbo_noinline, N)


# plots

using CairoMakie: Figure, Axis, Legend, lines!

fig = Figure(resolution = (1000,500))
ax = Axis(fig[1, 1], xlabel="N", ylabel="GFLOPS")

l1 = lines!(ax, N, N ./ times1)
l2 = lines!(ax, N, N ./ times2)
l3 = lines!(ax, N, N ./ times3)
l4 = lines!(ax, N, N ./ times4)
l5 = lines!(ax, N, N ./ times5)

fig[1, 2] = Legend(fig, [l1, l2, l3, l4, l5], ["Baseline", "@turbo", "@inbounds", "@inbounds @simd", "@turbo @noinline"])

display(fig)

Yes, on an old-ish i7-4790K Haswell, they have more comparable performance:

julia> @benchmark foreachn!(dotsimd, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  19.600 μs … 162.000 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.822 μs ±   2.164 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅  █   ▆  ▂                    ▁                             ▁
  █▁▁█▁▁▁█▁▁█▁▁▁▇▁▁▄▁▁▁▁▁▁▄▁▁▁▇▁▁█▁▁▁█▁▁▇▁▁▁▅▁▁▅▁▁▁▅▁▁▃▁▁▁▇▁▁▆ █
  19.6 μs       Histogram: log(frequency) by time      21.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foreachn!(dotturbo, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.300 μs …  32.100 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.434 μs ± 370.649 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆    █     ▆                                         ▁       ▁
  █▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▆▁▁▁▁▁▄▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▇▁▁▁▁▁█▁▁▁▁▁▇ █
  17.3 μs       Histogram: log(frequency) by time      18.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Actually, I think it’s a bug in the histogram code.
Look at the min and max times (which I believe are reported correctly) vs the range of histogram values. They don’t match up.
For example, the minimum and median times are 10.573 and 10.819 microseconds, respectively, yet the histogram axis runs from 19.7 microseconds on the left to 11.1 microseconds on the right, even though 19.7 > 11.1.

You could try temporarily disabling your laptop’s turbo/boost to make sure it maintains a steady clock speed throughout your benchmark, but I didn’t do this.

Results on an i7-1165G7:


Great, thanks. I updated to BenchmarkTools v1.1.4 and edited my post.
