I am attempting to write a high-performance round-up function that treats Int64 and UInt64 identically. The compiled output is the same for both types, but performance is always worse with UInt. It's only a 10% difference, but it's surprisingly consistent.
This raises questions about what BenchmarkTools is actually measuring, or whether some extra step is missing from the assembly output.
# ------ Setup --------
using BenchmarkTools

# See the next two functions for a clearer description of what this does
function runbench_expand(::Type{T}) where T
    rg = T(1):T(1000)
    total = UInt(0)
    for n0 in rg
        for m0 in rg
            # reinterpret as UInt without a rollover check
            n = n0 % UInt
            m = m0 % UInt
            r = rem(n, m)
            if r == 0
                total += n
            else
                # round n up to the next multiple of m
                total += n + m - r
            end
        end
    end
    total
end
# --- Unused, but more human-readable version of the above function -------
#=
Round up to next multiple.
For example: 53, 10 -> 60
=#
function round_up_to_multiple(n0, m0)
    # conversion to UInt without rollover check
    n = n0 % UInt
    m = m0 % UInt
    r = rem(n, m)
    if r == 0
        return n
    end
    n + m - r
end
function runbench(::Type{T}) where T
    r = T(1):T(1000)
    sum(round_up_to_multiple(n, m) for n in r for m in r)
end
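# Sanity check (my addition): the expanded loop and the generator version
# should return the same total, and signed vs. unsigned inputs should not
# change the result, since every value in the range is positive.
@assert runbench_expand(Int) == runbench_expand(UInt)
@assert runbench(Int) == runbench_expand(Int)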
# ------- Suspiciously-consistent benchmarks -------
# ------ Faster with Int ------
julia> @btime runbench_expand(Int);
2.585 ms (0 allocations: 0 bytes)
julia> @btime runbench_expand(Int);
2.585 ms (0 allocations: 0 bytes)
julia> @btime runbench_expand(Int);
2.585 ms (0 allocations: 0 bytes)
# ------- Slower with UInt --------
julia> @btime runbench_expand(UInt);
2.869 ms (0 allocations: 0 bytes)
julia> @btime runbench_expand(UInt);
2.869 ms (0 allocations: 0 bytes)
julia> @btime runbench_expand(UInt);
2.869 ms (0 allocations: 0 bytes)
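# ------- Note on what @btime reports -------
# @btime prints only the minimum time over all samples. To check whether the
# 10% gap holds across the whole timing distribution rather than just at the
# minimum, the full @benchmark macro can be run instead (commands only; output
# not reproduced here):
@benchmark runbench_expand(Int)
@benchmark runbench_expand(UInt)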
Here is a diff of the LLVM and native assembly output. The only differences between the Int and UInt versions are the label offsets.
https://gist.github.com/milesfrain/12222c241f33d3527b53b98ee53ae457/revisions#diff-ff2f367dc3725e8bf3c1b9a976dffab8
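For anyone who wants to reproduce that comparison, this is roughly how the output can be dumped and diffed (the file names here are just my own choice):

using InteractiveUtils   # provides code_native / code_llvm outside the REPL

# Dump the native assembly for both type parameters, then compare the files
# from a shell with e.g. `diff native_int.asm native_uint.asm`.
open("native_int.asm", "w") do io
    code_native(io, runbench_expand, (Type{Int},))
end
open("native_uint.asm", "w") do io
    code_native(io, runbench_expand, (Type{UInt},))
end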