LoopVectorization: @turbo performs worse than @inbounds on trivial loop

Elrod · August 24, 2021, 1:33am

kevinsa5:

using LoopVectorization: @turbo
using BenchmarkTools: @btime
using Statistics: median

function vmul!(a, b, c)
    for i in eachindex(a)
        c[i] = a[i] * b[i]
    end
end

function vmul_turbo!(a, b, c)
    @turbo for i in eachindex(a)
        c[i] = a[i] * b[i]
    end
end

function vmul_inbounds!(a, b, c)
    @inbounds for i in eachindex(a)
        c[i] = a[i] * b[i]
    end
end

function vmul_inbounds_simd!(a, b, c)
    @inbounds @simd for i in eachindex(a)
        c[i] = a[i] * b[i]
    end
end

function run_btime(func)
    a = Float64.(collect(1:64))
    b = Float64.(collect(1:64))
    c = zero(a)
    @btime $func($a, $b, $c)
end

print("baseline:")
run_btime(vmul!)
print("@turbo:")
run_btime(vmul_turbo!)
print("@inbounds:")
run_btime(vmul_inbounds!)
print("@inbounds @simd:")
run_btime(vmul_inbounds_simd!)

Mark the @turbo function @inline:

julia> a = rand(64);

julia> b = rand(64);

julia> c = rand(64);

julia> @inline function vmul_turbo!(a, b, c)
           @turbo for i in eachindex(a)
               c[i] = a[i] * b[i]
           end
       end
vmul_turbo! (generic function with 1 method)

julia> @inline function vmul_inbounds_simd!(a, b, c)
           @inbounds @simd for i in eachindex(a)
               c[i] = a[i] * b[i]
           end
       end
vmul_inbounds_simd! (generic function with 1 method)

julia> @inline function vmul_inbounds!(a, b, c)
           @inbounds for i in eachindex(a)
               c[i] = a[i] * b[i]
           end
       end
vmul_inbounds! (generic function with 1 method)

julia> @btime vmul_turbo!($c, $a, $b)
  3.374 ns (0 allocations: 0 bytes)

julia> @btime vmul_inbounds_simd!($c, $a, $b)
  3.938 ns (0 allocations: 0 bytes)

julia> @btime vmul_inbounds!($c, $a, $b)
  3.942 ns (0 allocations: 0 bytes)

This is what I got before (without @inline):

julia> @btime vmul_inbounds!($c, $a, $b) # inlined speed: 3.942 ns (0 allocations: 0 bytes)
  3.938 ns (0 allocations: 0 bytes)

julia> @btime vmul_inbounds_simd!($c, $a, $b) # inlined speed: 3.938 ns (0 allocations: 0 bytes)
  3.944 ns (0 allocations: 0 bytes)

julia> @btime vmul_turbo!($c, $a, $b) # inlined speed: 3.374 ns (0 allocations: 0 bytes)
  11.054 ns (0 allocations: 0 bytes)

Basically, Julia’s compiler automatically inlines functions it estimates to be cheap.
It however automatically estimates @turbo to be very expensive (because it things llvmcall, which it relies on, is very expensive – even though it normally calls just a single instruction [or less!]).

Hence, the functions without @turbo inlined into your benchmark, but the @turbo version did not, adding function call overhead.

For a “fair” comparison, you should add @inline (or maybe @noinline) to all the functions.

Why are the timings for @inbounds and @inbounds @simd essentially the same, i.e. why do I not get any additional speedup from @simd ? lscpu returns sse , sse2 , and avx , so I’d expect some speedup for floating-point operations.

@simd tells the compiler it is okay to change the order of operations in order to SIMD your code, i.e. assume that floating point operations are associative (even though they aren’t!).
This speeds up your code, but changes the answer. Normally, it actually makes the answer more accurate, because using separate parallel accumulators in a reduction (e.g. a sum) makes it much more accurate (along with much faster).
However, this changes the result vs what you literally wrote, so you must give the compiler permission to do it.

With your simple loop, however, the compiler can SIMD it without changing the answer at all.
Try:

julia> @code_llvm debuginfo=:none vmul_inbounds!(c, a, b)

Among the output, I get:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %19 = getelementptr inbounds double, double* %16, i64 %index
  %20 = bitcast double* %19 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %20, align 8
  %21 = getelementptr inbounds double, double* %19, i64 8
  %22 = bitcast double* %21 to <8 x double>*
  %wide.load17 = load <8 x double>, <8 x double>* %22, align 8
  %23 = getelementptr inbounds double, double* %19, i64 16
  %24 = bitcast double* %23 to <8 x double>*
  %wide.load18 = load <8 x double>, <8 x double>* %24, align 8
  %25 = getelementptr inbounds double, double* %19, i64 24
  %26 = bitcast double* %25 to <8 x double>*
  %wide.load19 = load <8 x double>, <8 x double>* %26, align 8
  %27 = getelementptr inbounds double, double* %17, i64 %index
  %28 = bitcast double* %27 to <8 x double>*
  %wide.load20 = load <8 x double>, <8 x double>* %28, align 8
  %29 = getelementptr inbounds double, double* %27, i64 8
  %30 = bitcast double* %29 to <8 x double>*
  %wide.load21 = load <8 x double>, <8 x double>* %30, align 8
  %31 = getelementptr inbounds double, double* %27, i64 16
  %32 = bitcast double* %31 to <8 x double>*
  %wide.load22 = load <8 x double>, <8 x double>* %32, align 8
  %33 = getelementptr inbounds double, double* %27, i64 24
  %34 = bitcast double* %33 to <8 x double>*
  %wide.load23 = load <8 x double>, <8 x double>* %34, align 8
  %35 = fmul <8 x double> %wide.load, %wide.load20
  %36 = fmul <8 x double> %wide.load17, %wide.load21
  %37 = fmul <8 x double> %wide.load18, %wide.load22
  %38 = fmul <8 x double> %wide.load19, %wide.load23
  %39 = getelementptr inbounds double, double* %18, i64 %index
  %40 = bitcast double* %39 to <8 x double>*
  store <8 x double> %35, <8 x double>* %40, align 8
  %41 = getelementptr inbounds double, double* %39, i64 8
  %42 = bitcast double* %41 to <8 x double>*
  store <8 x double> %36, <8 x double>* %42, align 8
  %43 = getelementptr inbounds double, double* %39, i64 16
  %44 = bitcast double* %43 to <8 x double>*
  store <8 x double> %37, <8 x double>* %44, align 8
  %45 = getelementptr inbounds double, double* %39, i64 24
  %46 = bitcast double* %45 to <8 x double>*
  store <8 x double> %38, <8 x double>* %46, align 8
  %index.next = add i64 %index, 32
  %47 = icmp eq i64 %index.next, %n.vec
  br i1 %47, label %middle.block, label %vector.body

Note all the <8 x double>. This is an easy way to confirm SIMD.

Also, as a final comment, I’d be interested in seeing the chart with @inline (preferably) or @noinline applied to each of the functions. I expect LoopVectorization to win overall, particularly whenever N % 16 is more than a couple.

Topic		Replies	Views
Arithmetic performance of expression Performance	11	371	October 4, 2022
@turbo speeds routine, slows down everything else Performance loopvectorization	16	2610	June 5, 2021
A simple SIMD.jl loop that is slower than a vanilla `@inbounds @simd` Performance simd	8	1879	June 27, 2021
LoopVectorization almost doubles execution time? Performance loopvectorization	6	655	July 9, 2021
Poor performance of SIMD vectorization in the latest version of Julia (v1.11.2) Performance performance	19	813	January 8, 2025

LoopVectorization: @turbo performs worse than @inbounds on trivial loop

Related topics