Seemingly unnecessary code cruft in a compiled higher-order function

Read this post if you care about zero-cost abstractions.

This is a test of Julia’s ability to produce highly optimized code for higher-order functions, i.e. functions that take a function as an argument. The two functions below are equivalent and the compiled code is mostly identical, except that the first function pays a penalty of 2 ns due to some “cruft” in the generated machine code before and after the core of the function. This seems to have something to do with GC. Any idea why this happens? Can the user prevent it, or could the compiler optimize it away?

using BenchmarkTools

ilovehofs(a::Vector{T}) where T = myfoldl(+, 0f0, A) # EDIT: upper-case A here is a typo for the argument a (see the correction further down)

function myfoldl(r, init::T, aa::Vector{T}) where T
    @inbounds begin
        acc = init
        @simd for a in aa
            acc = r(acc, a)
        end
    end
    acc
end

function ihatehofs(A::Vector{T}) where T
    @inbounds begin
        a = zero(eltype(A))
        @simd for ai in A
            a = a + ai
        end
    end
    a
end

A = rand(Float32, 1000)

@code_llvm ilovehofs(A)
@code_llvm ihatehofs(A)

@btime ilovehofs($A)
@btime ihatehofs($A)

They seem to perform the same?

julia> @btime ilovehofs(a)
  106.434 μs (1 allocation: 16 bytes)
500105.25f0

julia> @btime ihatehofs(a)
  106.639 μs (1 allocation: 16 bytes)
500105.25f0

For large arrays, yes. Like I said, it’s just an extra 2 ns, so the difference is only noticeable for small inputs.

Actually, on my machine the difference is much larger:

julia> @btime ilovehofs($A)
  63.462 ns (1 allocation: 16 bytes)
508.37314f0

julia> @btime ihatehofs($A)
  32.972 ns (0 allocations: 0 bytes)
508.37314f0

Looking at the @code_native output, ilovehofs fails to produce code that uses AVX instructions (no ymm registers).
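For anyone who wants to repeat that check without reading the assembly by eye, here is a minimal sketch. It assumes the native code is captured as a string via InteractiveUtils.code_native and searched for "ymm". The function plainsum is a hypothetical stand-in, not one of the functions from the post, and the result depends on the CPU and Julia version, so no particular output is implied.

```julia
using InteractiveUtils  # provides code_native outside the REPL

# Hypothetical stand-in for the loop under test (not from the original post).
function plainsum(xs::Vector{Float32})
    acc = 0f0
    @inbounds @simd for x in xs
        acc += x
    end
    return acc
end

# Capture the generated native code as a String instead of printing it.
asm = sprint(code_native, plainsum, (Vector{Float32},))

# Report whether any 256-bit AVX registers are mentioned; this is machine-dependent.
println(occursin("ymm", asm) ? "ymm registers found" : "no ymm registers found")
```

The same two lines can be pointed at ilovehofs and ihatehofs to compare them directly.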

My versioninfo():

Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

What’s yours?

On 1.2-rc2, ilovehofs does result in AVX instructions and is consequently much closer to ihatehofs, but there’s still a 4 ns gap on my machine. I’m mostly just very surprised that on 1.1 ilovehofs not only fails to produce AVX instructions, but even results in an allocation! I thought it might be a failure to specialize on the type of r so I changed the signature to myfoldl(r::R, init::T, aa::Vector{T}) where {R, T}, but that didn’t help. Neither did adding @inline or removing the @simd annotation; in fact the latter results in downright horrible performance. Very surprising results.
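For reference, the signature change described above might look like the sketch below. The name myfoldl_spec is made up here to keep it apart from the original myfoldl. As far as I understand the performance tips, Julia already specializes on a function argument that is called inside the method body (as r is here), which would be consistent with the extra type parameter making no difference.

```julia
# Hypothetical variant of myfoldl with an explicit type parameter R for the function argument.
function myfoldl_spec(r::R, init::T, aa::Vector{T}) where {R, T}
    acc = init
    @inbounds @simd for a in aa
        acc = r(acc, a)   # r is called here, so the method is specialized on typeof(r)
    end
    return acc
end

xs = Float32[1, 2, 3, 4]
total = myfoldl_spec(+, 0f0, xs)
println(total)
```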

On 1.1, it appears that making A const results in the same performance for both functions. Again, very surprising, as I expected this to be equivalent to just having the btime interpolation with $.

Edit:
Ah, d’oh.

ilovehofs(a::Vector{T}) where T = myfoldl(+, 0f0, A)

should be

ilovehofs(a::Vector{T}) where T = myfoldl(+, 0f0, a) # lower case a

Oops, sorry for that one!

My versioninfo

Julia Version 1.3.0-DEV.540
Commit faefe2ae64* (2019-07-13 08:34 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel® Core™ i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

To be clear, the typo completely explains the performance difference regardless of Julia version: with upper-case A in the body, ilovehofs ignores its argument and reads a non-const global variable instead, which is a performance no-no. See https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables-1.
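A rough way to see this without benchmarking is to ask inference what each variant returns. The sketch below uses made-up names (G, CG, loopsum, via_global, via_const, via_arg) and a plain loop instead of the original myfoldl, so treat it as an illustration under those assumptions. The expectation is that the argument and const-global versions infer a concrete Float32, while the version reading the non-const global does not.

```julia
# Illustrative names only; none of these appear in the thread.
G = rand(Float32, 1000)          # non-const global, playing the role of A
const CG = rand(Float32, 1000)   # const global: its type can never change

function loopsum(r, init::T, xs::Vector{T}) where T
    acc = init
    for x in xs
        acc = r(acc, x)
    end
    return acc
end

via_global(a::Vector{T}) where T = loopsum(+, 0f0, G)   # ignores a, reads the non-const global
via_const(a::Vector{T}) where T = loopsum(+, 0f0, CG)   # ignores a, reads the const global
via_arg(a::Vector{T}) where T = loopsum(+, 0f0, a)      # uses the argument, as intended

# Print the return type inference derives for a Vector{Float32} argument.
for f in (via_global, via_const, via_arg)
    rt = first(Base.return_types(f, (Vector{Float32},)))
    println(f, " => ", rt)
end
```

Note that interpolating with $ in @btime only affects how the argument is passed; it does not change what a global referenced inside the function body looks like to the compiler.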

Oh, OK, that was basically creating a closure instead of a pure function, is that right? The code seems identical now. Thanks again, now we are all free to love or hate higher-order functions as we wish!