I have a high-level question motivated by a specific example of something strange I’m observing.
The high-level question is whether L2/L3 memory caching can have an impact on benchmarking using BenchmarkTools.jl. Specifically, if I define a matrix and then use this in a benchmark of a memory bound operation, will I obtain optimistic measurements due to the matrix being cached between trials.
The motivation for this comes from experimenting with ways to quickly compute large batches of small matrix multiplications. My inputs/outputs are given as D x D x N arrays.
I have played around with using reinterpret
(which in all honestly, I don’t fully understand) to consider each matrix as an SMatrix
. When benchmarking this, I was surprised to find the speed was faster than the theoretical speed from saturating my RAM bandwidth. I assume this must be due to parts of the array being cached closer to the CPU than the RAM (giving an unfair estimate of practical performance).
I compared this to benchmarking with different matrices generated each time, brought the speed closer to the theoretical expectation but still a fair bit faster, leaving me still a bit confused.
Code and output is below.
using BenchmarkTools
using LinearAlgebra
using StaticArrays
D = 2
N = 1000000
T = Float64
A = rand(T, 2, 2, N);
B = rand(T, 2, 2, N);
C = similar(A);
function matmul_2x2_static_contiguous!(C, A_squashed, B_squashed)
N = size(A, 2)
T = eltype(A)
A_reinterp = reinterpret(reshape, SMatrix{2,2,T,4}, A_squashed)
B_reinterp = reinterpret(reshape, SMatrix{2,2,T,4}, B_squashed)
original_threads = BLAS.get_num_threads()
BLAS.set_num_threads(1)
try
Threads.@threads for n in 1:N
C[:, :, n] .= A_reinterp[n] * B_reinterp[n]
end
finally
BLAS.set_num_threads(original_threads)
end
return C
end
A_squashed = reshape(A, 4, N);
B_squashed = reshape(B, 4, N);
println("Questionable benchmark:")
@btime matmul_2x2_static_contiguous!($C, $A_squashed, $B_squashed);
# Memory bound
bandwidth = (
4800 * # clock speed (MHz)
1e6 * # MHz to Hz
2 * # DDR
8 * # 64-bit transfers
4 # channels
)
mem_use = 3 * sizeof(A) # reads and writes
bound = mem_use / bandwidth # seconds
println("\nMemory bound: $(bound * 1e6) μs")
println("\nBenchmark will independent setup:")
@btime matmul_2x2_static_contiguous!(C, A_squashed, B_squashed) setup = begin
A = rand(T, 2, 2, N);
B = rand(T, 2, 2, N);
C = similar(A);
A_squashed = reshape(A, 4, N);
B_squashed = reshape(B, 4, N);
end;
Questionable benchmark:
55.765 μs (295 allocations: 29.16 KiB)
Memory bound: 312.5 μs
Benchmark with independent setup:
115.155 μs (295 allocations: 29.16 KiB)
When computing products over a vector of SArrays, the speed is slower than the memory bound as would be expected.
Specifically, what might be going on with this benchmark? More generally, how should I think about the interaction of caching and benchmarking?