I am a long-time admirer of Julia who is finally getting serious about learning it. I am comparing the timings when applying @turbo, @inbounds, and @simd to the loop in a simple element-wise multiplication function, and am surprised at the timings I'm seeing. Here is my code:
# runtimings.jl
using LoopVectorization: @turbo
using BenchmarkTools: @btime
using Statistics: median
function vmul!(a, b, c)
    for i in eachindex(a)
        c[i] = a[i] * b[i]
    end
end

function vmul_turbo!(a, b, c)
    @turbo for i in eachindex(a)
        c[i] = a[i] * b[i]
    end
end

function vmul_inbounds!(a, b, c)
    @inbounds for i in eachindex(a)
        c[i] = a[i] * b[i]
    end
end

function vmul_inbounds_simd!(a, b, c)
    @inbounds @simd for i in eachindex(a)
        c[i] = a[i] * b[i]
    end
end

function run_btime(func)
    a = Float64.(collect(1:64))
    b = Float64.(collect(1:64))
    c = zero(a)
    @btime $func($a, $b, $c)
end
print("baseline:")
run_btime(vmul!)
print("@turbo:")
run_btime(vmul_turbo!)
print("@inbounds:")
run_btime(vmul_inbounds!)
print("@inbounds @simd:")
run_btime(vmul_inbounds_simd!)
I am running this on an i5-3320M with Julia 1.6.1 and LoopVectorization v0.12.65.
I have two main questions:
1. Why are the timings for @inbounds and @inbounds @simd essentially the same, i.e. why do I not get any additional speedup from @simd? lscpu returns sse, sse2, and avx, so I'd expect some speedup for floating-point operations.
2. Why is @turbo worse than @inbounds?
I ran the timings across a range of array sizes to see what difference it makes and the relative speed ordering seems the same: @turbo < @inbounds == @inbounds @simd.
julia> a = rand(64);
julia> b = rand(64);
julia> c = rand(64);
julia> @inline function vmul_turbo!(a, b, c)
           @turbo for i in eachindex(a)
               c[i] = a[i] * b[i]
           end
       end
vmul_turbo! (generic function with 1 method)

julia> @inline function vmul_inbounds_simd!(a, b, c)
           @inbounds @simd for i in eachindex(a)
               c[i] = a[i] * b[i]
           end
       end
vmul_inbounds_simd! (generic function with 1 method)

julia> @inline function vmul_inbounds!(a, b, c)
           @inbounds for i in eachindex(a)
               c[i] = a[i] * b[i]
           end
       end
vmul_inbounds! (generic function with 1 method)
julia> @btime vmul_turbo!($c, $a, $b)
3.374 ns (0 allocations: 0 bytes)
julia> @btime vmul_inbounds_simd!($c, $a, $b)
3.938 ns (0 allocations: 0 bytes)
julia> @btime vmul_inbounds!($c, $a, $b)
3.942 ns (0 allocations: 0 bytes)
Basically, Julia's compiler automatically inlines functions it estimates to be cheap.
However, it automatically estimates @turbo to be very expensive, because it thinks llvmcall, which @turbo relies on, is very expensive (even though it normally lowers to just a single instruction, or less!).
Hence, the functions without @turbo were inlined into your benchmark, but the @turbo version was not, adding function-call overhead.
For a "fair" comparison, you should add @inline (or maybe @noinline) to all the functions.
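If you are curious how the compiler's cost model arrives at that estimate, newer Julia versions can print the inlining cost assigned to each statement. This is a sketch of mine, not from the thread; Base.print_statement_costs is an internal function whose availability and output format may vary across versions:

julia> Base.print_statement_costs(stdout, vmul_turbo!,
           (Vector{Float64}, Vector{Float64}, Vector{Float64}))

The llvmcall-based statements show up with a large cost, pushing the @turbo version over the inlining threshold.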
Why are the timings for @inbounds and @inbounds @simd essentially the same, i.e. why do I not get any additional speedup from @simd? lscpu returns sse, sse2, and avx, so I'd expect some speedup for floating-point operations.
@simd tells the compiler it is okay to change the order of operations in order to SIMD your code, i.e. assume that floating point operations are associative (even though they aren't!).
This speeds up your code, but changes the answer. Normally, it actually makes the answer more accurate, because using separate parallel accumulators in a reduction (e.g. a sum) makes it much more accurate (along with much faster).
However, this changes the result vs what you literally wrote, so you must give the compiler permission to do it.
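A concrete illustration of the reassociation issue (my example, not from the original post):

julia> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
false

The two groupings round differently, so reordering a floating-point reduction genuinely changes the result, and the compiler needs explicit permission to do it.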
With your simple loop, however, the compiler can SIMD it without changing the answer at all.
Try:
julia> @code_llvm debuginfo=:none vmul_inbounds!(c, a, b)
Among the output, I get:
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %19 = getelementptr inbounds double, double* %16, i64 %index
  %20 = bitcast double* %19 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %20, align 8
  %21 = getelementptr inbounds double, double* %19, i64 8
  %22 = bitcast double* %21 to <8 x double>*
  %wide.load17 = load <8 x double>, <8 x double>* %22, align 8
  %23 = getelementptr inbounds double, double* %19, i64 16
  %24 = bitcast double* %23 to <8 x double>*
  %wide.load18 = load <8 x double>, <8 x double>* %24, align 8
  %25 = getelementptr inbounds double, double* %19, i64 24
  %26 = bitcast double* %25 to <8 x double>*
  %wide.load19 = load <8 x double>, <8 x double>* %26, align 8
  %27 = getelementptr inbounds double, double* %17, i64 %index
  %28 = bitcast double* %27 to <8 x double>*
  %wide.load20 = load <8 x double>, <8 x double>* %28, align 8
  %29 = getelementptr inbounds double, double* %27, i64 8
  %30 = bitcast double* %29 to <8 x double>*
  %wide.load21 = load <8 x double>, <8 x double>* %30, align 8
  %31 = getelementptr inbounds double, double* %27, i64 16
  %32 = bitcast double* %31 to <8 x double>*
  %wide.load22 = load <8 x double>, <8 x double>* %32, align 8
  %33 = getelementptr inbounds double, double* %27, i64 24
  %34 = bitcast double* %33 to <8 x double>*
  %wide.load23 = load <8 x double>, <8 x double>* %34, align 8
  %35 = fmul <8 x double> %wide.load, %wide.load20
  %36 = fmul <8 x double> %wide.load17, %wide.load21
  %37 = fmul <8 x double> %wide.load18, %wide.load22
  %38 = fmul <8 x double> %wide.load19, %wide.load23
  %39 = getelementptr inbounds double, double* %18, i64 %index
  %40 = bitcast double* %39 to <8 x double>*
  store <8 x double> %35, <8 x double>* %40, align 8
  %41 = getelementptr inbounds double, double* %39, i64 8
  %42 = bitcast double* %41 to <8 x double>*
  store <8 x double> %36, <8 x double>* %42, align 8
  %43 = getelementptr inbounds double, double* %39, i64 16
  %44 = bitcast double* %43 to <8 x double>*
  store <8 x double> %37, <8 x double>* %44, align 8
  %45 = getelementptr inbounds double, double* %39, i64 24
  %46 = bitcast double* %45 to <8 x double>*
  store <8 x double> %38, <8 x double>* %46, align 8
  %index.next = add i64 %index, 32
  %47 = icmp eq i64 %index.next, %n.vec
  br i1 %47, label %middle.block, label %vector.body
Note all the <8 x double>. This is an easy way to confirm SIMD.
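If you would rather not scan the IR by eye, here is a small helper I am adding (a sketch; it assumes the textual IR is stable enough to grep, and has_simd is a name I made up) that captures the IR into a string and looks for vector types:

using InteractiveUtils: code_llvm
function has_simd(f, args...)
    io = IOBuffer()
    # print the optimized LLVM IR for this call signature into the buffer
    code_llvm(io, f, Tuple{map(typeof, args)...}; debuginfo=:none)
    # any <N x double> (or <N x float>) type indicates vectorized code
    occursin(r"<\d+ x (double|float)>", String(take!(io)))
end

has_simd(vmul_inbounds!, c, a, b)  # should return true here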
Also, as a final comment, I'd be interested in seeing the chart with @inline (preferably) or @noinline applied to each of the functions. I expect LoopVectorization to win overall, particularly whenever N % 16 is more than a couple.
That makes sense that the compiler can SIMD it. As a follow-up, is it guaranteed that the compiler will automatically SIMD code that has no associativity concerns?
As an aside, if I change my functions to compute the dot product instead of the element-wise product, so that Julia cannot automatically SIMD it without permission, I see timings like I originally expected (N = 64 here).
Indeed, my goal with this exercise was to eventually recreate that wonderful sawtooth plot in the LoopVectorization.jl docs. I will rerun this tonight and post the result.
In the following, all variants are marked with @inline except for @turbo @noinline.
Here is the plot for the element-wise multiplication. I wish I had shrunk the x-range a bit so it would be easier to see detail, but unfortunately I didn't keep the timings.
Interesting, @turbo is much less steady than I would have expected in the dot product benchmark.
It looks like @turbo is only faster when N % 16 >= 8.
@turbo's performance is much worse when N % 8 != 0, which strikes me as a little unusual; 8 Float64 iterations correspond to one 64-byte cache line.
If you have code to copy/paste and run the benchmarks, I can run them on Haswell and Tiger Lake laptops.
Also, LV's loop remainder handling is worse on older CPUs that have relatively slower vmaskmov instructions (not applicable to AVX512 CPUs, which don't need it).
Something else you could run:
using Random, BenchmarkTools, LoopVectorization
x = rand(512); y = rand(512);
zs = similar(x); Ns = shuffle(axes(x,1));
@inline function dotsimd(x, y)
    s = zero(promote_type(eltype(x), eltype(y)))
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    s
end

@inline function dotturbo(x, y)
    s = zero(promote_type(eltype(x), eltype(y)))
    @turbo for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    s
end
foreachn!(f::F, zs, x, y, Ns) where {F} = map!(n -> f(view(x,1:n),view(y,1:n)), zs, Ns)
@benchmark foreachn!(dotsimd, $zs, $x, $y, $Ns)
@benchmark foreachn!(dotturbo, $zs, $x, $y, $Ns)
julia> @benchmark foreachn!(dotsimd, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.573 μs … 28.419 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.819 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.057 μs ± 784.480 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram bars not recoverable from the original post]
  19.7 μs          Histogram: frequency by time          11.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foreachn!(dotturbo, $zs, $x, $y, $Ns)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.740 μs … 14.426 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.771 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.774 μs ± 609.367 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram bars not recoverable from the original post]
  12.4 μs          Histogram: frequency by time          7.77 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> versioninfo()
Julia Version 1.8.0-DEV.390
Commit c88db4e32a* (2021-08-23 19:46 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
That is, LoopVectorization gives much better performance over the random assortment of sizes.
However, IIRC, performance on my Haswell CPU is roughly comparable between the two.
Note that the main loop body is the same between @simd and @turbo here, the primary difference is just how they handle remainders.
Things are more likely to differ for more complicated loops, e.g. if you evaluate exp or log.
Also, on your CPU, exp vectorized by @turbo should actually be more accurate than the Base version.
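For instance, a loop like this (my sketch, not benchmarked in the thread) is where @turbo can pull ahead of @inbounds @simd, because it substitutes its own SIMD implementations of special functions:

using LoopVectorization: @turbo
function vexp!(y, x)
    @turbo for i in eachindex(y, x)  # @turbo swaps in a vectorized exp
        y[i] = exp(x[i])
    end
    return y
end

x = rand(1000); y = similar(x);
# compare: @btime vexp!($y, $x)   vs   @btime $y .= exp.($x)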
Are you doing something special to get such tight histograms? I just left my laptop alone while it ran, but I still had a browser open and I'm sure my OS was running things in the background that I could have disabled. Maybe CPU pinning or something?
The benchmarking code for that plot is nothing special, but here it is if you want to run it yourself:
using LoopVectorization: @turbo
using BenchmarkTools: @benchmark
using Statistics: median
@inline function dot_product(a, b)
    dp = zero(eltype(a))
    for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@inline function dot_product_turbo(a, b)
    dp = zero(eltype(a))
    @turbo for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@inline function dot_product_inbounds(a, b)
    dp = zero(eltype(a))
    @inbounds for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@inline function dot_product_inbounds_simd(a, b)
    dp = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

@noinline function dot_product_turbo_noinline(a, b)
    dp = zero(eltype(a))
    @turbo for i in eachindex(a)
        dp = dp + a[i] * b[i]
    end
    return dp
end

function get_times_dot_product(func, N)
    times = Float64[]
    for n in N
        a = rand(n)
        b = rand(n)
        bench = @benchmark $func($a, $b)
        push!(times, median(bench.times))
    end
    return times
end
N = range(32, 256, step=1)
times1 = get_times_dot_product(dot_product, N)
times2 = get_times_dot_product(dot_product_turbo, N)
times3 = get_times_dot_product(dot_product_inbounds, N)
times4 = get_times_dot_product(dot_product_inbounds_simd, N)
times5 = get_times_dot_product(dot_product_turbo_noinline, N)
# plots
using CairoMakie: Figure, Axis, Legend, lines!
fig = Figure(resolution = (1000,500))
ax = Axis(fig[1, 1], xlabel="N", ylabel="GFLOPS")
l1 = lines!(ax, N, N ./ times1)
l2 = lines!(ax, N, N ./ times2)
l3 = lines!(ax, N, N ./ times3)
l4 = lines!(ax, N, N ./ times4)
l5 = lines!(ax, N, N ./ times5)
fig[1, 2] = Legend(fig, [l1, l2, l3, l4, l5], ["Baseline", "@turbo", "@inbounds", "@inbounds @simd", "@turbo @noinline"])
display(fig)
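One small addition (a sketch; not part of my original run) that would have avoided losing the timings: write the raw medians to disk before plotting.

using DelimitedFiles: writedlm
# columns: N, then one column of median times (in ns) per variant
writedlm("timings.csv", [N times1 times2 times3 times4 times5], ',')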
Actually, I think it's a bug in the histogram code.
Look at the min and max times (which I believe are reported correctly) versus the range of histogram values; they don't match up.
The first benchmark reports minimum and median times of 10.573 and 10.819 microseconds, respectively, yet the histogram's axis runs from 19.7 microseconds up to 11.1 microseconds, even though 19.7 > 11.1.
You could try temporarily disabling your laptop's turbo/boost so that it maintains a steady clock speed throughout the benchmark, but I didn't do this.
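On Linux with the intel_pstate driver, for example, something along these lines should work (a sketch; it needs root privileges, and you should check that the sysfs path exists on your machine):

# disable turbo boost until reboot (intel_pstate systems only)
run(`sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"`)

Pinning the Julia process to a single core (e.g. with taskset) may also tighten the histograms.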