The TLDR is why would I get improbably timings when benchmarking passing in by reference i.e. $c but get maybe reasonable timings when passing in the value i.e. ‘c’. And how much can I trust the timing passing in the value directly?
Now for the details. I’m working on optimizing some code moving it from using Vectors to using SVectors. Everything was going great, until I unrolled a loop. When I unrolled the loop my performance went to:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 0.020 ns (0.00% GC)
median time: 0.022 ns (0.00% GC)
mean time: 0.022 ns (0.00% GC)
maximum time: 0.038 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
Well that’s probably not right…I’m passing in my values by reference, so I figured I’d try just passing them in. For that I get:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 108.384 ns (0.00% GC)
median time: 110.113 ns (0.00% GC)
mean time: 110.021 ns (0.00% GC)
maximum time: 156.001 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 929
Which seems more reasonable. But what kind of “overhead” does passing the values in have? Now I’m nervous.
This is my attempt at at MWE, it’s still longer than I would like but it needs to be long enough to take some time. Also bitrotate is a 1.5 instruction apparently so you need Julia 1.5:
using BenchmarkTools
using StaticArrays
function ro(state, m)
local a, b, c, d = state[1], state[5], state[9], state[13]
a = a + b + m[1]
d = bitrotate(d ⊻ a, -16)
c = c + d
b = bitrotate(b ⊻ c, -12)
a = a + b + m[2]
d = bitrotate(d ⊻ a, -8)
c = c + d
b = bitrotate(b ⊻ c, -7)
return SVector(
a, state[2], state[3], state[4],
b, state[6], state[7], state[8],
c, state[10], state[11], state[12],
d, state[14], state[15], state[16]
)
end
function co(value, block, t1, t2, t3)
local state = SVector{16, UInt32}(
value[1], value[2], value[3], value[4],
value[5], value[6], value[7], value[8],
UInt32(0x6A09E667), UInt32(0xBB67AE85),
UInt32(0x3C6EF372), UInt32(0xA54FF53A),
UInt32(t1 & 0xffffffff), UInt32(t1 >> 32),
t2, t3
)
for _ in 1:10
state = ro(state, block)
end
return SVector(
state[1] ⊻ state[ 9], state[2] ⊻ state[10],
state[3] ⊻ state[11], state[4] ⊻ state[12],
state[5] ⊻ state[13], state[6] ⊻ state[14],
state[7] ⊻ state[15], state[8] ⊻ state[16],
state[ 9] ⊻ value[1], state[10] ⊻ value[2],
state[11] ⊻ value[3], state[12] ⊻ value[4],
state[13] ⊻ value[5], state[14] ⊻ value[6],
state[15] ⊻ value[7], state[16] ⊻ value[8]
)
end
function co2(value, block, t1, t2, t3)
local state = SVector{16, UInt32}(
value[1], value[2], value[3], value[4],
value[5], value[6], value[7], value[8],
UInt32(0x6A09E667), UInt32(0xBB67AE85),
UInt32(0x3C6EF372), UInt32(0xA54FF53A),
UInt32(t1 & 0xffffffff), UInt32(t1 >> 32),
t2, t3
)
state = ro(state, block)
state = ro(state, block)
state = ro(state, block)
state = ro(state, block)
state = ro(state, block)
state = ro(state, block)
state = ro(state, block)
state = ro(state, block)
state = ro(state, block)
state = ro(state, block)
return SVector(
state[1] ⊻ state[ 9], state[2] ⊻ state[10],
state[3] ⊻ state[11], state[4] ⊻ state[12],
state[5] ⊻ state[13], state[6] ⊻ state[14],
state[7] ⊻ state[15], state[8] ⊻ state[16],
state[ 9] ⊻ value[1], state[10] ⊻ value[2],
state[11] ⊻ value[3], state[12] ⊻ value[4],
state[13] ⊻ value[5], state[14] ⊻ value[6],
state[15] ⊻ value[7], state[16] ⊻ value[8]
)
end
The only different between co() and co2() is that the loop is unrolled in co2().
Benchmarking co() gets:
julia> c = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...);
julia> b = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...);
julia> @benchmark co($c, $b, UInt64(1), UInt32(0), UInt32(0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 38.219 ns (0.00% GC)
median time: 38.327 ns (0.00% GC)
mean time: 38.764 ns (0.00% GC)
maximum time: 59.882 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 991
While benchmarking co2() gets:
julia> c = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...);
julia> b = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...);
julia> @benchmark co2($c, $b, UInt64(1), UInt32(0), UInt32(0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 0.017 ns (0.00% GC)
median time: 0.022 ns (0.00% GC)
mean time: 0.024 ns (0.00% GC)
maximum time: 8.666 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
You can then pass in the values directly and have it actually take some time:
julia> @benchmark co2(c, b, UInt64(1), UInt32(0), UInt32(0))
BenchmarkTools.Trial:
memory estimate: 80 bytes
allocs estimate: 1
--------------
minimum time: 60.180 ns (0.00% GC)
median time: 61.169 ns (0.00% GC)
mean time: 68.641 ns (7.86% GC)
maximum time: 3.425 μs (96.98% GC)
--------------
samples: 10000
evals/sample: 980
However the timing is twice as long as co() is that accurate?