Unrolled loop results in Improbable @benchmark timing

pixel27 · March 6, 2021, 11:46pm

The TLDR is why would I get improbably timings when benchmarking passing in by reference i.e. $c but get maybe reasonable timings when passing in the value i.e. ‘c’. And how much can I trust the timing passing in the value directly?

Now for the details. I’m working on optimizing some code moving it from using Vectors to using SVectors. Everything was going great, until I unrolled a loop. When I unrolled the loop my performance went to:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     0.020 ns (0.00% GC)
  median time:      0.022 ns (0.00% GC)
  mean time:        0.022 ns (0.00% GC)
  maximum time:     0.038 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Well that’s probably not right…I’m passing in my values by reference, so I figured I’d try just passing them in. For that I get:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     108.384 ns (0.00% GC)
  median time:      110.113 ns (0.00% GC)
  mean time:        110.021 ns (0.00% GC)
  maximum time:     156.001 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     929

Which seems more reasonable. But what kind of “overhead” does passing the values in have? Now I’m nervous.

This is my attempt at at MWE, it’s still longer than I would like but it needs to be long enough to take some time. Also bitrotate is a 1.5 instruction apparently so you need Julia 1.5:

using BenchmarkTools
using StaticArrays

function ro(state, m)
    local a, b, c, d = state[1], state[5], state[9], state[13]

    a = a + b + m[1]
    d = bitrotate(d ⊻ a, -16)
    c = c + d
    b = bitrotate(b ⊻ c, -12)

    a = a + b + m[2]
    d = bitrotate(d ⊻ a, -8)
    c = c + d
    b = bitrotate(b ⊻ c, -7)

    return SVector(
        a, state[2],  state[3],  state[4],
        b, state[6],  state[7],  state[8],
        c, state[10], state[11], state[12],
        d, state[14], state[15], state[16]
    )
end

function co(value, block, t1, t2, t3)
    local state = SVector{16, UInt32}(
        value[1], value[2], value[3], value[4],
        value[5], value[6], value[7], value[8],
        UInt32(0x6A09E667), UInt32(0xBB67AE85),
        UInt32(0x3C6EF372), UInt32(0xA54FF53A),
        UInt32(t1 & 0xffffffff), UInt32(t1 >> 32),
        t2, t3
    )

    for _ in 1:10
        state = ro(state, block)
    end

    return SVector(
        state[1] ⊻ state[ 9], state[2] ⊻ state[10],
        state[3] ⊻ state[11], state[4] ⊻ state[12],
        state[5] ⊻ state[13], state[6] ⊻ state[14],
        state[7] ⊻ state[15], state[8] ⊻ state[16],
        state[ 9] ⊻ value[1], state[10] ⊻ value[2],
        state[11] ⊻ value[3], state[12] ⊻ value[4],
        state[13] ⊻ value[5], state[14] ⊻ value[6],
        state[15] ⊻ value[7], state[16] ⊻ value[8]
    )
end

function co2(value, block, t1, t2, t3)
    local state = SVector{16, UInt32}(
        value[1], value[2], value[3], value[4],
        value[5], value[6], value[7], value[8],
        UInt32(0x6A09E667), UInt32(0xBB67AE85),
        UInt32(0x3C6EF372), UInt32(0xA54FF53A),
        UInt32(t1 & 0xffffffff), UInt32(t1 >> 32),
        t2, t3
    )

    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)

    return SVector(
        state[1] ⊻ state[ 9], state[2] ⊻ state[10],
        state[3] ⊻ state[11], state[4] ⊻ state[12],
        state[5] ⊻ state[13], state[6] ⊻ state[14],
        state[7] ⊻ state[15], state[8] ⊻ state[16],
        state[ 9] ⊻ value[1], state[10] ⊻ value[2],
        state[11] ⊻ value[3], state[12] ⊻ value[4],
        state[13] ⊻ value[5], state[14] ⊻ value[6],
        state[15] ⊻ value[7], state[16] ⊻ value[8]
    )
end

The only different between co() and co2() is that the loop is unrolled in co2().

Benchmarking co() gets:

julia> c = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...);

julia> b = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...);

julia> @benchmark co($c, $b, UInt64(1), UInt32(0), UInt32(0))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     38.219 ns (0.00% GC)
  median time:      38.327 ns (0.00% GC)
  mean time:        38.764 ns (0.00% GC)
  maximum time:     59.882 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     991

While benchmarking co2() gets:

julia> c = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...);

julia> b = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...);

julia> @benchmark co2($c, $b, UInt64(1), UInt32(0), UInt32(0))
BenchmarkTools.Trial: 
 memory estimate:  0 bytes
 allocs estimate:  0
 --------------
 minimum time:     0.017 ns (0.00% GC)
 median time:      0.022 ns (0.00% GC)
 mean time:        0.024 ns (0.00% GC)
 maximum time:     8.666 ns (0.00% GC)
 --------------
 samples:          10000
 evals/sample:     1000

You can then pass in the values directly and have it actually take some time:

julia> @benchmark co2(c, b, UInt64(1), UInt32(0), UInt32(0))
BenchmarkTools.Trial: 
  memory estimate:  80 bytes
  allocs estimate:  1
  --------------
  minimum time:     60.180 ns (0.00% GC)
  median time:      61.169 ns (0.00% GC)
  mean time:        68.641 ns (7.86% GC)
  maximum time:     3.425 μs (96.98% GC)
  --------------
  samples:          10000
  evals/sample:     980

However the timing is twice as long as co() is that accurate?

FPGro · March 7, 2021, 3:12am

The last timing is probably not accurate. The compiler is really good at propagating things lately, which make proper benchmarks incredibly hard. Your case is solvable by letting the @benchmark macro initialise the input locally. This avoids the lookup time for global variables and doesn’t get constant-propped away (at least not on my machine):

julia> @benchmark co(A, B, UInt64(1), UInt32(0), UInt32(0)) setup = begin A = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...); B = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...) end
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     54.813 ns (0.00% GC)
  median time:      54.914 ns (0.00% GC)
  mean time:        56.826 ns (0.00% GC)
  maximum time:     165.957 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     987

julia> @benchmark co2(A, B, UInt64(1), UInt32(0), UInt32(0)) setup = begin A = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...); B = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...) end
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     51.064 ns (0.00% GC)
  median time:      51.165 ns (0.00% GC)
  mean time:        52.313 ns (0.00% GC)
  maximum time:     140.932 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     987

Elrod · March 7, 2021, 5:06am

$c is “passing in by value”. If you want to “pass by reference”

@benchmark co($(Ref(c))[], $(Ref(b))[], UInt64(1), UInt32(0), UInt32(0))

Whenever I’m benchmarking with isbits, I do the above trick. Otherwise, the compiler is likely to optimize your benchmark away (which is what results in the <0.1ns timings).

baggepinnen · March 7, 2021, 7:12am

I wonder if this common use case could be supported by special syntax in the benchmarking macro, just like it supports interpolation with a special syntax?

pixel27 · March 7, 2021, 12:28pm

Thank you, that works.

DNF · March 7, 2021, 1:01pm

Off topic, but you can do this much simpler and faster by just writing

c = rand(SVector{16, UInt32})

Also, I’m curious what the purpose of local is here:

pixel27 · March 7, 2021, 2:18pm

Thank you, I should have realized that would work last night.

Mostly this is just my background. I come from languages (C/Rust/Go/Java) where you need to declare new variables. The local helps me realize I’m defining a variable there. I think it also may help if I’m messing around in the REPL and declare a global variable with the same name any subsequent runs of the file will still use a local variable rather than recompiled to use a global?

Lastly the habit helps when I have multiple contexts like:

function foo()
   local a = 7
   ....
   for i in 1:60
       local a = i * 76
       ....
   end
   ...
   return a
   ...
end

The above is especially important for longish functions where I come back months later and edit the function adding the new variable in a lower scope, and don’t realize I’m using the same name in the loop.

I know it’s not very “Julian” but I like my code to be consistent so all local’s get the local declaration rather than only doing it when the compiler would do the "wrong’ thing. I guess I could just not use the same variable names in lower scopes, something to think about. I try to Julianify my MWE, but I missed the local declarations.