Why is this small `@inline` function much slower than an equivalent macro?

Krastanov · June 26, 2021, 9:39pm

My understanding is that in Julia macros are meant to be used for expressiveness and to make DSLs, not for performance hacks. But in this bit-fiddling SIMD loop, a macro is much faster than the corresponding inline function. Could you help me understand why the inline function is so much slower and how to write this in idiomatic julia?

Problem statement: For a bit-matrix where the bits are packed in UInt64, I want to exchange the positions of two bits in each row. I calculate the index in the UInt64 matrix, then calculate the appropriate bitmask, then extract the bits (bit=matrix[row,column]&mask), reset them (matrix[row,column]&=~mask), and set them to the desired new bit (matrix[row,column]|=bit). I will be doing this in many different pieces of code, so I abstract away these bit operations in functions or macros called getbit and setbit. The macros are much faster.

Here is the code and benchmarks:

@inline getmask(col::Int) = UInt64(0x1) << ((col-1) & 63)
@inline getbigindex(col::Int) = (col-1)>>6+1

macro getxbit(s, r, cbig, mask)
    :( $(esc(s))[$(esc(r)),$(esc(cbig))] & $(esc(mask)) )
end
macro getzbit(matrix, r, cbig, mask)
    :( $(esc(matrix))[$(esc(r)),end÷2+$(esc(cbig))] & $(esc(mask)) )
end
macro setxbit(matrix, r, cbig, mask, x)
    quote
        $(esc(matrix))[$(esc(r)),$(esc(cbig))] &= ~$(esc(mask))
        $(esc(matrix))[$(esc(r)),$(esc(cbig))] |= $(esc(x))
    end
end
macro setzbit(matrix, r, cbig, mask, z)
    quote
        $(esc(matrix))[$(esc(r)),end÷2+$(esc(cbig))] &= ~$(esc(mask))
        $(esc(matrix))[$(esc(r)),end÷2+$(esc(cbig))] |= $(esc(z))
    end
end

@inline getxbit(matrix, r, cbig, mask) = @getxbit(matrix, r, cbig, mask)
@inline getzbit(matrix, r, cbig, mask) = @getzbit(matrix, r, cbig, mask)
@inline setxbit(matrix, r, cbig, mask, x) = @setxbit(matrix, r, cbig, mask, x)
@inline setzbit(matrix, r, cbig, mask, z) = @setzbit(matrix, r, cbig, mask, z)

function switch_bits_withmacro(matrix::AbstractMatrix{UInt64}, column)
    mask = getmask(column)
    cbig = getbigindex(column)
    @inbounds @simd for r in 1:size(matrix, 1)
        x = @getxbit(matrix, r, cbig, mask)
        z = @getzbit(matrix, r, cbig, mask)
        @setxbit(matrix, r, cbig, mask, z)
        @setzbit(matrix, r, cbig, mask, x)
    end
    matrix
end

function switch_bits_withinline(matrix::AbstractMatrix{UInt64}, column)
    mask = getmask(column)
    cbig = getbigindex(column)
    @inbounds @simd for r in 1:size(matrix, 1)
        x = getxbit(matrix, r, cbig, mask)
        z = getzbit(matrix, r, cbig, mask)
        setxbit(matrix, r, cbig, mask, z)
        setzbit(matrix, r, cbig, mask, x)
    end
    matrix
end

And here are the benchmarks:

julia> m = rand(UInt64, 50, 50);

julia> @btime switch_bits_withinline($m, 6);
  74.877 ns (0 allocations: 0 bytes)

julia> @btime switch_bits_withmacro($m, 6);
  28.273 ns (0 allocations: 0 bytes)

julia> switch_bits_withinline(copy(m),6) == switch_bits_withmacro(copy(m),6)
true

The inline functions are much slower than the macros… Am I doing something stupid?

simeonschaub · June 26, 2021, 10:02pm

You need to propagate the @inbounds annotation explitly to those functions by annotating them with Base.@propagate_inbounds. Otherwise the boundschecks will not be elided. See:

https://docs.julialang.org/en/v1/devdocs/boundscheck/#Propagating-inbounds

Also note that the explicit @inline annotation likely isn’t needed here (and Base.@propagate_inbounds implies inlining anyways), since the compiler should be smart enough to figure that out on its own.

Krastanov · June 26, 2021, 11:19pm

Well, this was embarrassingly simple ! The benchmark confirmed, this is all that was missing. Thank you!

Topic		Replies	Views
Inline macro Performance question	7	1216	September 22, 2021
What's the benefit of inlining? New to Julia question	7	16446	December 13, 2017
What do @inbounds and @inline do? General Usage question	2	1426	March 3, 2019
Small functions - best practices New to Julia question , compilation , function	5	812	February 8, 2021
Performance: replace conditional jump by some shift/and/or? Performance question	10	899	November 14, 2019

Why is this small `@inline` function much slower than an equivalent macro?

Related topics