Consistent performance difference but identical LLVM IR

I am attempting to write a high-performance round-up function that treats Int64 and UInt64 identically. The compiled output is the same, but performance is always worse with UInt. It's only about a 10% difference, but it's surprisingly consistent.
This raises questions about what BenchmarkTools is actually measuring, or whether some extra steps are missing from the assembly output.

# ------ Setup --------
using BenchmarkTools

# See the next two functions for a clearer description of what this does
function runbench_expand(::Type{T}) where T
    rg = T(1):T(1000)
    total = UInt(0)
    for n0 in rg
        for m0 in rg
            n = n0 % UInt
            m = m0 % UInt
            
            r = rem(n, m)
            if r == 0
                total += n
            else
                total += n + m - r
            end
        end
    end
    total
end
# --- Unused, but more human-readable version of the above function -------
#=
Round up to next multiple.
For example: 53, 10 -> 60
=#
function round_up_to_multiple(n0, m0)
    # conversion to UInt without rollover check
    n = n0 % UInt
    m = m0 % UInt

    r = rem(n, m)
    if r == 0
        return n
    end
    n + m - r
end
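For reference, a quick standalone sanity check of the helper (the definition is repeated verbatim so the snippet runs on its own; note that the `% UInt` reinterpretation means negative inputs silently wrap to huge unsigned values):

```julia
# Standalone sanity check; function repeated from above.
function round_up_to_multiple(n0, m0)
    # conversion to UInt without rollover check (negative inputs wrap)
    n = n0 % UInt
    m = m0 % UInt

    r = rem(n, m)
    r == 0 ? n : n + m - r
end

@assert round_up_to_multiple(53, 10) == 60  # the example above
@assert round_up_to_multiple(60, 10) == 60  # already a multiple
@assert round_up_to_multiple(1, 7) == 7
```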

function runbench(::Type{T}) where T
    r = T(1):T(1000)
    sum(round_up_to_multiple(n, m) for n in r for m in r)
end

# ------- Suspiciously-consistent benchmarks -------

# ------ Faster with Int ------
julia> @btime runbench_expand(Int);
  2.585 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(Int);
  2.585 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(Int);
  2.585 ms (0 allocations: 0 bytes)

# ------- Slower with UInt --------
julia> @btime runbench_expand(UInt);
  2.869 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.869 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.869 ms (0 allocations: 0 bytes)
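In case it helps anyone reproduce this: `@btime` prints only the minimum sample, so a stable gap like this can also be inspected with `@benchmark`, which reports the full timing distribution. A minimal sketch with a stand-in workload (substitute `runbench_expand(UInt)` for the real comparison):

```julia
using BenchmarkTools

# Stand-in workload; swap in runbench_expand(UInt) to reproduce the
# original comparison.
work(r) = sum(n % UInt for n in r)

b = @benchmark work($(1:1000))
# minimum(b), median(b), and mean(b) expose the distribution behind the
# single number (the minimum) that @btime prints.
@assert minimum(b).time <= median(b).time
```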

Here is a diff of the LLVM IR and native assembly output. The only differences between the Int and UInt versions are the label offsets.
https://gist.github.com/milesfrain/12222c241f33d3527b53b98ee53ae457/revisions#diff-ff2f367dc3725e8bf3c1b9a976dffab8
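For anyone who wants to regenerate the comparison locally, the standard introspection macros dump the IR and native code (a sketch; I'm assuming the gist was produced this way, using a minimal kernel with the same shape as the loop body above):

```julia
using InteractiveUtils  # provides @code_llvm / @code_native outside the REPL

# Minimal kernel mirroring the loop body above.
roundup(n, m) = (r = rem(n % UInt, m % UInt); r == 0 ? n % UInt : n % UInt + m % UInt - r)

# debuginfo=:none suppresses line-number annotations, making diffs cleaner.
@code_llvm debuginfo=:none roundup(53, 10)
@code_native debuginfo=:none roundup(53, 10)
```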

I just ran these tests on my machine with julia 1.3.0-rc3 and I do not get your funny timings:

julia> @btime runbench_expand(Int);
  2.512 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(Int);
  2.513 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(Int);
  2.512 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.513 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.513 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.513 ms (0 allocations: 0 bytes)

Not sure why you got different timings. If I had to guess, given that identical machine code was produced, it could be related to the sorts of weird traps discussed in the excellent talk "Performance Matters" by Emery Berger (available on YouTube).

Namely, that in certain programs you can get up to a 40% performance difference depending on the addresses of your functions in memory!

Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)

also does not show the time difference.


> # ------- Suspiciously-consistent benchmarks -------

I have never seen such consistent times before.

julia> @btime runbench_expand(Int);
  2.458 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(Int);
  2.458 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(Int);
  2.458 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.458 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.458 ms (0 allocations: 0 bytes)

julia> @btime runbench_expand(UInt);
  2.458 ms (0 allocations: 0 bytes)

Thanks to everyone for investigating. I restarted, and now I’m observing identical times for Int and UInt.
