Consistent performance difference but identical LLVM IR

I am attempting to write a high-performance round-up function that treats Int64 and UInt64 identically. The compiled output is the same, but performance is always worse with UInt. It's only about a 10% difference, but it's surprisingly consistent.
This raises questions about what BenchmarkTools is actually measuring, or whether some extra steps are missing from the assembly output.

# ------ Setup --------
using BenchmarkTools

# See the next two functions for a clearer description of what this does
function runbench_expand(::Type{T}) where T
    rg = T(1):T(1000)
    total = UInt(0)
    for n0 in rg
        for m0 in rg
            n = n0 % UInt
            m = m0 % UInt
            
            r = rem(n, m)
            if r == 0
                total += n
            else
                total += n + m - r
            end
        end
    end
    total
end
# --- Unused, but more human-readable version of the above function -------
#=
Round up to next multiple.
For example: 53, 10 -> 60
=#
function round_up_to_multiple(n0, m0)
    # conversion to UInt without rollover check
    n = n0 % UInt
    m = m0 % UInt

    r = rem(n, m)
    if r == 0
        return n
    end
    n + m - r
end
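For reference, a quick standalone sanity check of the helper (the definition is repeated verbatim so the snippet runs on its own; note that the `% UInt` reinterpretation means negative inputs silently wrap to huge unsigned values):

```julia
# Standalone sanity check; function repeated from above.
function round_up_to_multiple(n0, m0)
    # conversion to UInt without rollover check (negative inputs wrap)
    n = n0 % UInt
    m = m0 % UInt

    r = rem(n, m)
    r == 0 ? n : n + m - r
end

@assert round_up_to_multiple(53, 10) == 60  # the example above
@assert round_up_to_multiple(60, 10) == 60  # already a multiple
@assert round_up_to_multiple(1, 7) == 7
```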

function runbench(::Type{T}) where T
    r = T(1):T(1000)
    sum(round_up_to_multiple(n, m) for n in r for m in r)
end

# ------- Suspiciously-consistent benchmarks -------

# ------ Faster with Int ------
julia> @btime runbench_expand(Int);
  2.585 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(Int);
  2.585 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(Int);
  2.585 ms (0 allocations: 0 bytes)

# ------- Slower with UInt --------
julia> @btime runbench_expand(UInt);
  2.869 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.869 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.869 ms (0 allocations: 0 bytes)
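In case it helps anyone reproduce this: `@btime` prints only the minimum sample, so a stable gap like this can also be inspected with `@benchmark`, which reports the full timing distribution. A minimal sketch with a stand-in workload (substitute `runbench_expand(UInt)` for the real comparison):

```julia
using BenchmarkTools

# Stand-in workload; swap in runbench_expand(UInt) to reproduce the
# original comparison.
work(r) = sum(n % UInt for n in r)

b = @benchmark work($(1:1000))
# minimum(b), median(b), and mean(b) expose the distribution behind the
# single number (the minimum) that @btime prints.
@assert minimum(b).time <= median(b).time
```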

Here is a diff of the LLVM IR and native assembly output. The only differences between the Int and UInt versions are the label offsets.
https://gist.github.com/milesfrain/12222c241f33d3527b53b98ee53ae457/revisions#diff-ff2f367dc3725e8bf3c1b9a976dffab8
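For anyone who wants to regenerate the comparison locally, the standard introspection macros dump the IR and native code (a sketch; I'm assuming the gist was produced this way, using a minimal kernel with the same shape as the loop body above):

```julia
using InteractiveUtils  # provides @code_llvm / @code_native outside the REPL

# Minimal kernel mirroring the loop body above.
roundup(n, m) = (r = rem(n % UInt, m % UInt); r == 0 ? n % UInt : n % UInt + m % UInt - r)

# debuginfo=:none suppresses line-number annotations, making diffs cleaner.
@code_llvm debuginfo=:none roundup(53, 10)
@code_native debuginfo=:none roundup(53, 10)
```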

I just ran these tests on my machine with julia 1.3.0-rc3 and I do not get your funny timings:

julia> @btime runbench_expand(Int);
  2.512 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(Int);
  2.513 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(Int);
  2.512 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.513 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.513 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.513 ms (0 allocations: 0 bytes)

Not sure why you got different timings. If I had to guess, given that identical machine code was produced, it could be related to the sorts of weird traps discussed in the excellent talk "Performance Matters" by Emery Berger (available on YouTube).

Namely, that in certain programs you can get up to a 40% performance difference depending on the addresses of your functions in memory!

Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)

also does not show the time difference.


> # ------- Suspiciously-consistent benchmarks -------

I have never seen such consistent times before.

julia> @btime runbench_expand(Int);
  2.458 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(Int);
  2.458 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(Int);
  2.458 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.458 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.458 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.458 ms (0 allocations: 0 bytes)

Thanks to everyone for investigating. I restarted, and now I’m observing identical times for Int and UInt.
