Read this post if you care about zero-cost abstractions.
This is a test of Julia’s ability to produce highly optimized code for higher-order functions, i.e. functions that take a function as an argument. The two functions below are equivalent, and the compiled code is mostly identical, except that the first function pays a ~2 ns penalty due to some “cruft” in the generated machine code before and after the core of the function. This seems to have something to do with GC. Any idea why this happens? Can it be prevented by the user, or maybe optimized away by the compiler?
using BenchmarkTools
ilovehofs(a::Vector{T}) where T = myfoldl(+, 0f0, A) # EDIT: this was just a silly typo
function myfoldl(r, init::T, aa::Vector{T}) where T
    @inbounds begin
        acc = init
        @simd for a in aa
            acc = r(acc, a)
        end
    end
    acc
end

function ihatehofs(A::Vector{T}) where T
    @inbounds begin
        a = zero(eltype(A))
        @simd for ai in A
            a = a + ai
        end
    end
    a
end
A = rand(Float32, 1000)
@code_llvm ilovehofs(A)
@code_llvm ihatehofs(A)
@btime ilovehofs($A)
@btime ihatehofs($A)
They seem to perform the same?
julia> @btime ilovehofs(a)
106.434 μs (1 allocation: 16 bytes)
500105.25f0
julia> @btime ihatehofs(a)
106.639 μs (1 allocation: 16 bytes)
500105.25f0
For large arrays, yes. Like I said, it’s just an extra 2 ns; the difference is only noticeable on small inputs.
Actually, on my machine the difference is much larger:
julia> @btime ilovehofs($A)
63.462 ns (1 allocation: 16 bytes)
508.37314f0
julia> @btime ihatehofs($A)
32.972 ns (0 allocations: 0 bytes)
508.37314f0
Looking at the code_native output, ilovehofs fails to produce code that uses AVX instructions (no ymm registers).
My versioninfo():
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin14.5.0)
CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
What’s yours?
On 1.2-rc2, ilovehofs does result in AVX instructions and is consequently much closer to ihatehofs, but there’s still a 4 ns gap on my machine. I’m mostly just very surprised that on 1.1 ilovehofs not only fails to produce AVX instructions, but even results in an allocation! I thought it might be a failure to specialize on the type of r, so I changed the signature to myfoldl(r::R, init::T, aa::Vector{T}) where {R, T}, but that didn’t help. Neither did adding @inline or removing the @simd annotation; in fact the latter results in downright horrible performance. Very surprising results.
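For reference, the signature variant described above can be sketched as follows (under a hypothetical name, myfoldl2, so it doesn’t clash with the original). Parameterizing on R is the standard way to force the compiler to specialize on a function-valued argument, although in this case it turned out not to be the issue:

```julia
# Sketch of the signature change tried above (hypothetical name).
# The type parameter R forces a fresh specialization of myfoldl2
# for each concrete function type passed as r.
function myfoldl2(r::R, init::T, aa::Vector{T}) where {R, T}
    acc = init
    @inbounds @simd for a in aa
        acc = r(acc, a)
    end
    acc
end
```

Called as, e.g., myfoldl2(+, 0f0, rand(Float32, 1000)).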
On 1.1, it appears that making A const results in the same performance for both functions. Again, very surprising, as I expected this to be equivalent to just having the @btime interpolation with $.
Edit:
Ah, d’oh.
ilovehofs(a::Vector{T}) where T = myfoldl(+, 0f0, A)
should be
ilovehofs(a::Vector{T}) where T = myfoldl(+, 0f0, a) # lower case a
Oops, sorry for that one!
To be clear, the typo completely explains the performance difference regardless of Julia version: it means a non-const global variable is referenced inside the function, which is a performance no-no; see Performance Tips · The Julia Language.
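A minimal sketch (with hypothetical names, not from the thread) of the non-const-global pitfall described above: the type of a non-const global can change at any time, so every access to it is dynamic and may allocate, while passing the array as an argument, or declaring the global const, lets the compiler specialize on its concrete type:

```julia
gdata = rand(Float32, 1000)        # non-const global: type is not fixed

sum_global() = sum(gdata)          # dynamic lookup on every call — slow path
sum_arg(x)   = sum(x)              # x has a known concrete type — fast path

const cdata = rand(Float32, 1000)  # const pins the type; fast again
sum_const() = sum(cdata)
```

Benchmarking sum_global() against sum_arg(gdata) with @btime should show the same kind of extra allocation and slowdown seen for ilovehofs above.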
Oh, OK, that was basically creating a closure instead of a pure function, is that right? The code seems identical now. Thanks again, now we are all free to love or hate higher-order functions as we wish!