Interaction Between Caching and Benchmarking

I have a high-level question motivated by a specific example of something strange I’m observing.

The high-level question is whether L2/L3 memory caching can have an impact on benchmarking with BenchmarkTools.jl. Specifically, if I define a matrix and then use it in a benchmark of a memory-bound operation, will I obtain optimistic measurements because the matrix stays cached between trials?

The motivation for this comes from experimenting with ways to quickly compute large batches of small matrix multiplications. My inputs/outputs are given as D x D x N arrays.

I have played around with using reinterpret (which, in all honesty, I don’t fully understand) to treat each matrix as an SMatrix. When benchmarking this, I was surprised to find the speed was faster than the theoretical speed from saturating my RAM bandwidth. I assume this must be due to parts of the array being cached closer to the CPU than RAM (giving an unfairly optimistic estimate of practical performance).

Benchmarking instead with different matrices generated for each trial brought the speed closer to the theoretical expectation, but it was still a fair bit faster, leaving me a bit confused.

Code and output are below.

using BenchmarkTools
using LinearAlgebra
using StaticArrays

D = 2
N = 1000000
T = Float64

A = rand(T, 2, 2, N);
B = rand(T, 2, 2, N);
C = similar(A);

function matmul_2x2_static_contiguous!(C, A_squashed, B_squashed)
    N = size(A_squashed, 2)
    T = eltype(A_squashed)
    A_reinterp = reinterpret(reshape, SMatrix{2,2,T,4}, A_squashed)
    B_reinterp = reinterpret(reshape, SMatrix{2,2,T,4}, B_squashed)
    original_threads = BLAS.get_num_threads()
    BLAS.set_num_threads(1)
    try
        Threads.@threads for n in 1:N
            C[:, :, n] .= A_reinterp[n] * B_reinterp[n]
        end
    finally
        BLAS.set_num_threads(original_threads)
    end
    return C
end

A_squashed = reshape(A, 4, N);
B_squashed = reshape(B, 4, N);

println("Questionable benchmark:")
@btime matmul_2x2_static_contiguous!($C, $A_squashed, $B_squashed);

# Memory bound
bandwidth = (
    4800 *  # clock speed (MHz)
    1e6 *   # MHz to Hz
    2 *     # DDR
    8 *     # 64-bit transfers
    4       # channels
)
mem_use = 3 * sizeof(A)  # reads and writes
bound = mem_use / bandwidth   # seconds
println("\nMemory bound: $(bound * 1e6) μs")

println("\nBenchmark with independent setup:")
@btime matmul_2x2_static_contiguous!(C, A_squashed, B_squashed) setup = begin
    A = rand(T, 2, 2, N);
    B = rand(T, 2, 2, N);
    C = similar(A);
    A_squashed = reshape(A, 4, N);
    B_squashed = reshape(B, 4, N);
end;
Questionable benchmark:
  55.765 μs (295 allocations: 29.16 KiB)

Memory bound: 312.5 μs

Benchmark with independent setup:
  115.155 μs (295 allocations: 29.16 KiB)

When computing products over a vector of SArrays, the speed is slower than the memory bound as would be expected.

Specifically, what might be going on with this benchmark? More generally, how should I think about the interaction of caching and benchmarking?

Yes, BenchmarkTools will run the same function multiple times. Hence it will tell you “if I run the same function many times, on identical data, what will be the throughput”.

Previous runs will affect microarchitectural state – caches, branch-predictor, etc will still be hot.
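One concrete way to reduce that effect (a sketch, assuming BenchmarkTools is installed; `sum` over a fresh vector stands in for your kernel) is to combine `setup` with `evals = 1`, so every timed sample sees newly written inputs rather than arrays left hot in cache by the previous sample:

```julia
using BenchmarkTools

# `setup` runs before every sample; `evals = 1` means each sample times a
# single call, so no timed call reuses inputs warmed up by a previous eval.
N = 1_000_000
b = @benchmark sum(x) setup = (x = rand($N)) evals = 1
println(minimum(b.times) / 1e3, " μs")
```

Note this still only randomizes the data, not the rest of the microarchitectural state (branch predictor, TLBs), so it narrows but does not eliminate the gap to "cold" performance.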

So what you should do is always rescale the resulting timings by N, and then measure (plot) for multiple different values of N.
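A minimal Base-only sketch of that idea (a trivial elementwise kernel stands in for the real one; no BenchmarkTools, just minimum-of-repeats timing):

```julia
# Stand-in memory-bound kernel: read two arrays, do trivial work, write one.
function kernel!(C, A, B)
    @inbounds @simd for n in eachindex(C)
        C[n] = A[n] + B[n]
    end
    return C
end

# Best-of-`reps` time, normalized per element so different N are comparable.
function ns_per_element(N; reps = 20)
    A = rand(N); B = rand(N); C = similar(A)
    kernel!(C, A, B)  # warm-up / compilation
    best = minimum(@elapsed(kernel!(C, A, B)) for _ in 1:reps)
    return best / N * 1e9
end

for N in (10^3, 10^4, 10^5, 10^6, 10^7)
    println("N = $N: $(round(ns_per_element(N), digits = 3)) ns/element")
end
```

Plateaus in the ns/element column correspond to the working set fitting in L1/L2/L3/main memory respectively.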

Furthermore, you should always have a simplified reference kernel – in your example, something like

simplifiedModel(C, A, B) = Threads.@threads for n in 1:length(C)
    @inbounds C[n] = A[n] + B[n]
end

(read two large arrays, do a trivial computation, write it to the output array).

For output, you should not just print the “Memory bound”, but should also print the raw data going into it, e.g. “ok, 32MB arrays”, as well as the cache sizes of your CPU (e.g. from lscpu).
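For example (a sketch; the lscpu part is Linux-only and assumes the command is on the PATH):

```julia
# Report the sizes feeding into the memory-bound estimate, plus CPU caches.
T, N = Float64, 1_000_000
bytes_per_array = sizeof(T) * 2 * 2 * N    # one D x D x N array, D = 2
working_set = 3 * bytes_per_array          # A, B, C
println("arrays: 3 x $(bytes_per_array / 2^20) MiB, working set $(working_set / 2^20) MiB")
if Sys.islinux()
    for line in eachline(`lscpu`)
        occursin("cache", line) && println(line)
    end
end
```

With these numbers printed next to the timing, it is immediately visible whether the working set fits in L3 or not.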

You don’t need to change BLAS threading – you’re doing 2x2 matmuls on StaticArrays, which should be completely inlined, no BLAS involved. And it is obvious that the arithmetic density is pitiful – so you should tweak your code until it benchmarks similarly to simplifiedModel for large arrays, i.e. you should definitely saturate main-memory bandwidth (should you saturate L3 bandwidth? Good question. You can see that by plotting normalized runtime against N and checking whether some of the L1/L2/L3/main-memory plateaus vanish!).

This is, in general, a very common misconception about benchmarking: people assume a simplified model where each operation/function takes a fixed amount of time, and the time taken for a bunch of ops is the sum of the times taken for each one. Lol nope, “time taken” is not even approximately additive for smallish times.
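A small Base-only illustration of that (summing an L1-sized vector stands in for a “fixed-cost” operation): the very same call is cheaper back-to-back than after a cache-thrashing sweep, so there is no single “time per op” to add up.

```julia
small  = rand(10^3)      # ~8 KiB, fits comfortably in L1
thrash = rand(10^7)      # ~80 MiB, sweeping it evicts `small` from cache

sum(small); sum(thrash)  # warm-up / compilation

hot_time() = @elapsed sum(small)
function cold_time()
    sum(thrash)          # sweep the large array first
    @elapsed sum(small)  # then time the identical small call
end

reps = 100
println("hot:         $(minimum(hot_time()  for _ in 1:reps) * 1e9) ns")
println("after sweep: $(minimum(cold_time() for _ in 1:reps) * 1e9) ns")
```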