TL;DR: why would I get improbable timings when benchmarking with my arguments passed in by reference, i.e. $c,
but seemingly reasonable timings when passing in the value, i.e. c? And how much can I trust the timing I get when passing in the value directly?
Now for the details. I’m working on optimizing some code, moving it from using Vectors to using SVectors. Everything was going great until I unrolled a loop. When I unrolled the loop, my benchmark result became:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 0.020 ns (0.00% GC)
median time: 0.022 ns (0.00% GC)
mean time: 0.022 ns (0.00% GC)
maximum time: 0.038 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
Well, that’s probably not right… I’m passing in my values by reference (with $), so I figured I’d try just passing them in directly. For that I get:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 108.384 ns (0.00% GC)
median time: 110.113 ns (0.00% GC)
mean time: 110.021 ns (0.00% GC)
maximum time: 156.001 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 929
Which seems more reasonable. But what kind of “overhead” does passing the values in have? Now I’m nervous.
This is my attempt at an MWE. It’s still longer than I would like, but it needs to be long enough to take some measurable time. Also, bitrotate was apparently added in Julia 1.5, so you need Julia 1.5 or later:
using BenchmarkTools
using StaticArrays
function ro(state, m)
    local a, b, c, d = state[1], state[5], state[9], state[13]
    a = a + b + m[1]
    d = bitrotate(d ⊻ a, -16)
    c = c + d
    b = bitrotate(b ⊻ c, -12)
    a = a + b + m[2]
    d = bitrotate(d ⊻ a, -8)
    c = c + d
    b = bitrotate(b ⊻ c, -7)
    return SVector(
        a, state[2], state[3], state[4],
        b, state[6], state[7], state[8],
        c, state[10], state[11], state[12],
        d, state[14], state[15], state[16]
    )
end
function co(value, block, t1, t2, t3)
    local state = SVector{16, UInt32}(
        value[1], value[2], value[3], value[4],
        value[5], value[6], value[7], value[8],
        UInt32(0x6A09E667), UInt32(0xBB67AE85),
        UInt32(0x3C6EF372), UInt32(0xA54FF53A),
        UInt32(t1 & 0xffffffff), UInt32(t1 >> 32),
        t2, t3
    )
    for _ in 1:10
        state = ro(state, block)
    end
    return SVector(
        state[1] ⊻ state[ 9], state[2] ⊻ state[10],
        state[3] ⊻ state[11], state[4] ⊻ state[12],
        state[5] ⊻ state[13], state[6] ⊻ state[14],
        state[7] ⊻ state[15], state[8] ⊻ state[16],
        state[ 9] ⊻ value[1], state[10] ⊻ value[2],
        state[11] ⊻ value[3], state[12] ⊻ value[4],
        state[13] ⊻ value[5], state[14] ⊻ value[6],
        state[15] ⊻ value[7], state[16] ⊻ value[8]
    )
end
function co2(value, block, t1, t2, t3)
    local state = SVector{16, UInt32}(
        value[1], value[2], value[3], value[4],
        value[5], value[6], value[7], value[8],
        UInt32(0x6A09E667), UInt32(0xBB67AE85),
        UInt32(0x3C6EF372), UInt32(0xA54FF53A),
        UInt32(t1 & 0xffffffff), UInt32(t1 >> 32),
        t2, t3
    )
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    state = ro(state, block)
    return SVector(
        state[1] ⊻ state[ 9], state[2] ⊻ state[10],
        state[3] ⊻ state[11], state[4] ⊻ state[12],
        state[5] ⊻ state[13], state[6] ⊻ state[14],
        state[7] ⊻ state[15], state[8] ⊻ state[16],
        state[ 9] ⊻ value[1], state[10] ⊻ value[2],
        state[11] ⊻ value[3], state[12] ⊻ value[4],
        state[13] ⊻ value[5], state[14] ⊻ value[6],
        state[15] ⊻ value[7], state[16] ⊻ value[8]
    )
end
The only difference between co() and co2() is that the loop is unrolled in co2().
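As a quick sanity check that unrolling by hand does not change the result, here is a minimal sketch on a toy round function. The names toy_round, toy_looped and toy_unrolled are made up for illustration and are not part of the MWE; they only mirror the loop-versus-unrolled shape of co() and co2().

```julia
# Hypothetical toy round (not part of the MWE): one add/xor/rotate step.
function toy_round(s::NTuple{2,UInt32})
    a, d = s
    a = a + d                      # UInt32 addition wraps on overflow
    d = bitrotate(d ⊻ a, -16)      # negative shift rotates right
    return (a, d)
end

# Ten rounds in a loop, like co().
function toy_looped(s)
    for _ in 1:10
        s = toy_round(s)
    end
    return s
end

# The same ten rounds written out by hand, like co2().
function toy_unrolled(s)
    s = toy_round(s)
    s = toy_round(s)
    s = toy_round(s)
    s = toy_round(s)
    s = toy_round(s)
    s = toy_round(s)
    s = toy_round(s)
    s = toy_round(s)
    s = toy_round(s)
    s = toy_round(s)
    return s
end

x = (rand(UInt32), rand(UInt32))

# Both versions should agree on any input.
@assert toy_looped(x) == toy_unrolled(x)
```

The equivalent check for the MWE, once c and b are defined, would be co(c, b, UInt64(1), UInt32(0), UInt32(0)) == co2(c, b, UInt64(1), UInt32(0), UInt32(0)).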
Benchmarking co() gets:
julia> c = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...);
julia> b = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...);
julia> @benchmark co($c, $b, UInt64(1), UInt32(0), UInt32(0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 38.219 ns (0.00% GC)
median time: 38.327 ns (0.00% GC)
mean time: 38.764 ns (0.00% GC)
maximum time: 59.882 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 991
While benchmarking co2() gets:
julia> c = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...);
julia> b = SVector{16, UInt32}([rand(UInt32) for x in 1:16]...);
julia> @benchmark co2($c, $b, UInt64(1), UInt32(0), UInt32(0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 0.017 ns (0.00% GC)
median time: 0.022 ns (0.00% GC)
mean time: 0.024 ns (0.00% GC)
maximum time: 8.666 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
You can then pass in the values directly and have it actually take some time:
julia> @benchmark co2(c, b, UInt64(1), UInt32(0), UInt32(0))
BenchmarkTools.Trial:
memory estimate: 80 bytes
allocs estimate: 1
--------------
minimum time: 60.180 ns (0.00% GC)
median time: 61.169 ns (0.00% GC)
mean time: 68.641 ns (7.86% GC)
maximum time: 3.425 μs (96.98% GC)
--------------
samples: 10000
evals/sample: 980
However, the timing is roughly 1.6 times as long as for co() (about 61 ns versus 38 ns). Is that accurate?
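One thing to try, assuming the 0.02 ns result comes from the compiler constant-folding the fully unrolled call: $c is BenchmarkTools interpolation, which splices the value into the benchmark expression rather than passing a reference. The BenchmarkTools manual suggests interpolating a Ref and dereferencing it inside the benchmark, so the value is loaded at run time. The sketch below shows that pattern on a made-up function, toy_mix, which is a stand-in and not part of the MWE.

```julia
using BenchmarkTools

# Hypothetical stand-in for the real kernel (not part of the MWE):
# the same add/xor/rotate pattern as ro(), on a plain 4-tuple.
function toy_mix(v::NTuple{4,UInt32})
    a, b, c, d = v
    a = a + b
    d = bitrotate(d ⊻ a, -16)
    c = c + d
    b = bitrotate(b ⊻ c, -12)
    a = a + b
    d = bitrotate(d ⊻ a, -8)
    c = c + d
    b = bitrotate(b ⊻ c, -7)
    return (a, b, c, d)
end

x = ntuple(_ -> rand(UInt32), 4)

# Plain interpolation: the value is spliced into the benchmark expression,
# so the compiler may treat it as a constant and fold the call away.
t_interp = @belapsed toy_mix($x)

# Ref interpolation: the Ref is spliced in, but the value is loaded from it
# on every evaluation, which is intended to keep the call from being folded.
t_ref = @belapsed toy_mix($(Ref(x))[])

println("interpolated value: ", t_interp * 1e9, " ns")
println("interpolated Ref:   ", t_ref * 1e9, " ns")
```

Applied to the MWE, the call would be @benchmark co2($(Ref(c))[], $(Ref(b))[], UInt64(1), UInt32(0), UInt32(0)). As far as I understand, the un-interpolated form co2(c, b, ...) reads c and b as non-constant globals, so its timing includes dynamic dispatch and the 80-byte allocation shown above, and probably overstates the cost of the kernel itself.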