# Why is this small `@inline` function much slower than an equivalent macro?

My understanding is that in Julia macros are meant to be used for expressiveness and for building DSLs, not as performance hacks. But in this bit-fiddling SIMD loop, a macro is much faster than the corresponding inline function. Could you help me understand why the inline function is so much slower, and how to write this in idiomatic Julia?

Problem statement: For a bit-matrix where the bits are packed in UInt64, I want to exchange the positions of two bits in each row. I calculate the index in the UInt64 matrix, then calculate the appropriate bitmask, then extract the bits (`bit=matrix[row,column]&mask`), reset them (`matrix[row,column]&=~mask`), and set them to the desired new bit (`matrix[row,column]|=bit`). I will be doing this in many different pieces of code, so I abstract away these bit operations in functions or macros called `getbit` and `setbit`. The macros are much faster.
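For concreteness, the three bit operations described above look like this on a single `UInt64` (the values here are arbitrary illustrations, not code from my actual project):

``````
a    = UInt64(0b1010_0000)
mask = UInt64(1) << 5   # mask for 1-based column 6
bit  = a & mask         # extract the bit: 0b0010_0000
a   &= ~mask            # reset it:        0b1000_0000
a   |= bit              # set the desired bit back
``````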

Here is the code and benchmarks:

``````
@inline getmask(col::Int) = UInt64(0x1) << ((col-1) & 63)
@inline getbigindex(col::Int) = (col-1)>>6+1

macro getxbit(matrix, r, cbig, mask)
    quote
        $(esc(matrix))[$(esc(r)),$(esc(cbig))] & $(esc(mask))
    end
end
macro getzbit(matrix, r, cbig, mask)
    quote
        $(esc(matrix))[$(esc(r)),end÷2+$(esc(cbig))] & $(esc(mask))
    end
end
macro setxbit(matrix, r, cbig, mask, x)
    quote
        $(esc(matrix))[$(esc(r)),$(esc(cbig))] &= ~$(esc(mask))
        $(esc(matrix))[$(esc(r)),$(esc(cbig))] |= $(esc(x))
    end
end
macro setzbit(matrix, r, cbig, mask, z)
    quote
        $(esc(matrix))[$(esc(r)),end÷2+$(esc(cbig))] &= ~$(esc(mask))
        $(esc(matrix))[$(esc(r)),end÷2+$(esc(cbig))] |= $(esc(z))
    end
end

@inline getxbit(matrix, r, cbig, mask) = @getxbit(matrix, r, cbig, mask)
@inline getzbit(matrix, r, cbig, mask) = @getzbit(matrix, r, cbig, mask)
@inline setxbit(matrix, r, cbig, mask, x) = @setxbit(matrix, r, cbig, mask, x)
@inline setzbit(matrix, r, cbig, mask, z) = @setzbit(matrix, r, cbig, mask, z)

function switch_bits_withmacro(matrix::AbstractMatrix{UInt64}, column)
    cbig = getbigindex(column)
    mask = getmask(column)
    @inbounds @simd for r in 1:size(matrix, 1)
        x = @getxbit(matrix, r, cbig, mask)
        z = @getzbit(matrix, r, cbig, mask)
        @setxbit(matrix, r, cbig, mask, z)
        @setzbit(matrix, r, cbig, mask, x)
    end
    matrix
end

function switch_bits_withinline(matrix::AbstractMatrix{UInt64}, column)
    cbig = getbigindex(column)
    mask = getmask(column)
    @inbounds @simd for r in 1:size(matrix, 1)
        x = getxbit(matrix, r, cbig, mask)
        z = getzbit(matrix, r, cbig, mask)
        setxbit(matrix, r, cbig, mask, z)
        setzbit(matrix, r, cbig, mask, x)
    end
    matrix
end
``````

And here are the benchmarks:

``````julia> m = rand(UInt64, 50, 50);

julia> @btime switch_bits_withinline($m, 6);
74.877 ns (0 allocations: 0 bytes)

julia> @btime switch_bits_withmacro($m, 6);
28.273 ns (0 allocations: 0 bytes)

julia> switch_bits_withinline(copy(m),6) == switch_bits_withmacro(copy(m),6)
true
``````

The inline functions are much slower than the macros… Am I doing something stupid?

You need to propagate the `@inbounds` annotation explicitly to those functions by annotating them with `Base.@propagate_inbounds`. Otherwise the bounds checks inside them will not be elided. See:

https://docs.julialang.org/en/v1/devdocs/boundscheck/#Propagating-inbounds

Also note that the explicit `@inline` annotation likely isn't needed here (and `Base.@propagate_inbounds` implies inlining anyway), since the compiler should be smart enough to figure that out on its own.
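A minimal, self-contained sketch of the idea (using a simplified stand-in accessor, `getxbit_fixed`, and a hypothetical driver `sum_xbits`, not the exact code from the question):

``````
# Base.@propagate_inbounds lets the caller's @inbounds reach the indexing
# expression inside the accessor, so the bounds check can be elided there.
Base.@propagate_inbounds getxbit_fixed(matrix, r, cbig, mask) = matrix[r, cbig] & mask

function sum_xbits(matrix::AbstractMatrix{UInt64}, column)
    cbig = (column-1)>>6 + 1
    mask = UInt64(1) << ((column-1) & 63)
    acc = zero(UInt64)
    @inbounds @simd for r in 1:size(matrix, 1)
        acc ⊻= getxbit_fixed(matrix, r, cbig, mask)  # no bounds check in the hot loop
    end
    acc
end
``````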


Well, this was embarrassingly simple! The benchmark confirms it: this was all that was missing. Thank you!