Seemingly unnecessary code cruft in a compiled higher-order function

xor0110 · July 14, 2019, 6:14pm

Read this post if you care about zero-cost abstractions.

This is a test of Julia’s ability to produce highly optimized code for higher-order functions, i.e. functions that take a function as an argument. The two functions below are equivalent and the compiled code is mostly identical, except that the first one pays a penalty of about 2 ns due to some “cruft” in the generated machine code before and after the core of the function. This seems to have something to do with GC. Any idea why this happens? Can the user prevent it, or could the compiler optimize it away?

using BenchmarkTools

ilovehofs(a::Vector{T}) where T = myfoldl(+, 0f0, A) # EDIT: the capital A here was just a silly typo; it should be lower case a

function myfoldl(r, init::T, aa::Vector{T}) where T
    @inbounds begin
        acc = init
        @simd for a in aa
            acc = r(acc, a)
        end
    end
    acc
end

function ihatehofs(A::Vector{T}) where T
    @inbounds begin
        a = zero(eltype(A))
        @simd for ai in A
            a = a + ai
        end
    end
    a
end

A = rand(Float32, 1000)

@code_llvm ilovehofs(A)
@code_llvm ihatehofs(A)

@btime ilovehofs($A)
@btime ihatehofs($A)

kristoffer.carlsson · July 14, 2019, 6:47pm

They seem to perform the same?

julia> @btime ilovehofs(a)
  106.434 μs (1 allocation: 16 bytes)
500105.25f0

julia> @btime ihatehofs(a)
  106.639 μs (1 allocation: 16 bytes)
500105.25f0

xor0110 · July 14, 2019, 9:09pm

For large arrays, yes. Like I said, it’s just an extra 2 ns, so the difference is only noticeable for small inputs.

tkoolen · July 14, 2019, 9:26pm

Actually, on my machine the difference is much larger:

julia> @btime ilovehofs($A)
  63.462 ns (1 allocation: 16 bytes)
508.37314f0

julia> @btime ihatehofs($A)
  32.972 ns (0 allocations: 0 bytes)
508.37314f0

Looking at the code_native, ilovehofs fails to produce code that uses AVX instructions (no ymm registers).
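The check described above can also be scripted rather than done by eye. This is only a rough sketch: uses_ymm and simd_sum are hypothetical names that do not appear in this thread, and the search for the substring "ymm" assumes an x86-64 CPU with AVX. The answer depends on the machine and the Julia version, so no expected result is given.

```julia
using InteractiveUtils  # provides code_native outside the REPL

# Hypothetical helper: reports whether the native code compiled for `f`
# with argument types `types` mentions any ymm (256-bit AVX) register.
function uses_ymm(f, types)
    asm = sprint(io -> code_native(io, f, types))
    return occursin("ymm", asm)
end

# Hypothetical stand-in for a hand-written reduction loop.
function simd_sum(xs::Vector{Float32})
    acc = 0f0
    @inbounds @simd for x in xs
        acc += x
    end
    return acc
end

# The result depends on the CPU and the Julia/LLVM version.
uses_ymm(simd_sum, (Vector{Float32},))
```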

My versioninfo():

Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

What’s yours?

tkoolen · July 14, 2019, 9:48pm

On 1.2-rc2, ilovehofs does result in AVX instructions and is consequently much closer to ihatehofs, but there’s still a 4 ns gap on my machine. I’m mostly just very surprised that on 1.1 ilovehofs not only fails to produce AVX instructions, but even results in an allocation! I thought it might be a failure to specialize on the type of r so I changed the signature to myfoldl(r::R, init::T, aa::Vector{T}) where {R, T}, but that didn’t help. Neither did adding @inline or removing the @simd annotation; in fact the latter results in downright horrible performance. Very surprising results.
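For reference, the signature change mentioned above might look like the sketch below. The name myfoldl_spec is hypothetical, chosen only to avoid clashing with the original myfoldl. The type parameter R asks Julia to specialize on the type of the function argument. The Performance Tips note that Julia may skip this specialization when a function argument is merely passed through to another call. Here r is called directly in the loop, so specialization should already happen, which is consistent with the change not helping.

```julia
# Sketch of the explicitly specialized signature; `myfoldl_spec` is a
# hypothetical name, the body mirrors the thread's `myfoldl`.
function myfoldl_spec(r::R, init::T, aa::Vector{T}) where {R, T}
    acc = init
    @inbounds @simd for a in aa
        acc = r(acc, a)
    end
    return acc
end

xs = Float32[1, 2, 3, 4]
myfoldl_spec(+, 0f0, xs)    # 10.0f0
myfoldl_spec(max, 0f0, xs)  # 4.0f0
```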

tkoolen · July 14, 2019, 10:14pm

On 1.1, it appears that making A const results in the same performance for both functions. Again, very surprising, as I expected this to be equivalent to just using the @btime interpolation with $.

Edit:
Ah, d’oh.

ilovehofs(a::Vector{T}) where T = myfoldl(+, 0f0, A)

should be

ilovehofs(a::Vector{T}) where T = myfoldl(+, 0f0, a) # lower case a

xor0110 · July 15, 2019, 11:21am

Oops, sorry for that one!

xor0110 · July 15, 2019, 11:22am

My versioninfo

Julia Version 1.3.0-DEV.540
Commit faefe2ae64* (2019-07-13 08:34 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

tkoolen · July 15, 2019, 1:03pm

To be clear, the typo completely explains the performance difference regardless of Julia version: it means that ilovehofs refers to the non-const global variable A instead of its argument, which is a performance no-no (see the “Performance Tips” page of the Julia manual).
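A minimal sketch of the three situations discussed in this thread, using hypothetical names (fold_sketch, data, DATA, from_global, from_const, from_arg) rather than the original ones. It assumes the code runs at top level, since const is only allowed there. The @code_warntype output is not reproduced here because it varies between Julia versions, but the read of the non-const global should show up annotated as Any.

```julia
using InteractiveUtils  # for @code_warntype outside the REPL

# Hypothetical fold, a plain sequential version of the thread's myfoldl.
function fold_sketch(r, init::T, xs::Vector{T}) where T
    acc = init
    for x in xs
        acc = r(acc, x)
    end
    return acc
end

data = rand(Float32, 1000)   # non-const global: could be rebound to any type later
const DATA = copy(data)      # const global: the compiler can rely on its type

from_global(x::Vector{Float32}) = fold_sketch(+, 0f0, data)  # the typo pattern: ignores x
from_const(x::Vector{Float32})  = fold_sketch(+, 0f0, DATA)  # what making the global const changes
from_arg(x::Vector{Float32})    = fold_sketch(+, 0f0, x)     # the corrected version

# Compare the inferred types; the global read in from_global should be ::Any.
@code_warntype from_global(data)
@code_warntype from_arg(data)

isconst(@__MODULE__, :data)  # false
isconst(@__MODULE__, :DATA)  # true
```

Note that interpolating with $ in @btime only affects the argument at the call site. It cannot change how a global referenced inside the function body is treated, which would explain why $A and const A behaved differently.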

xor0110 · July 15, 2019, 3:08pm

Oh, OK, that was basically creating a closure instead of a pure function, is that right? The code seems identical now. Thanks again, now we are all free to love or hate higher-order functions as we wish!
