Interaction Between Caching and Benchmarking

Yes, BenchmarkTools will run the same function multiple times. Hence it tells you “if I run the same function many times, on identical data, what is the throughput?”.

Previous runs will affect microarchitectural state – caches, branch predictors, etc. will still be hot.

So you should always rescale the resulting timings by N, and then measure (plot) them for multiple different values of N.
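A minimal sketch of such a sweep (assuming BenchmarkTools is installed; `model!` here is a placeholder for your actual kernel):

```julia
using BenchmarkTools

# Placeholder kernel standing in for your model; replace with your own.
model!(C, A, B) = (C .= A .+ B)

for N in (2^10, 2^14, 2^18, 2^22)
    A, B, C = rand(N), rand(N), zeros(N)
    t = @belapsed model!($C, $A, $B)
    # Rescale by N so different sizes are comparable.
    println("N = $N: $(1e9 * t / N) ns/element")
end
```

Plotting ns/element against N typically shows plateaus as the working set falls out of L1, then L2, then L3, and finally lands in main memory.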

Furthermore, you should always have a simplified reference kernel – in your example, something like `simplifiedModel(C,A,B) = Threads.@threads for n=1:length(C); @inbounds C[n] = A[n] + B[n]; end` (read two large arrays, do a trivial computation, write the result to an output array).
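Written out as a standalone function, that reference kernel might look like this (renamed with a `!` per Julia’s mutating convention):

```julia
# Trivial reference kernel: read two large arrays, do one add per
# element, write the result. This is pure memory traffic.
function simplifiedModel!(C, A, B)
    Threads.@threads for n in 1:length(C)
        @inbounds C[n] = A[n] + B[n]
    end
    return C
end
```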

For output, you should not just print “memory bound”; you should also print the raw data going into that conclusion, e.g. “ok, 32 MB arrays”, as well as the cache sizes of your CPU (e.g. from lscpu).
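For example (this sketch assumes the third-party CpuId.jl package for the cache sizes; any source of that information works, including lscpu):

```julia
using CpuId  # third-party package; provides cachesize()

A = rand(Float64, 2^22)
println("array size: $(sizeof(A) / 2^20) MB")
println("data cache sizes (L1/L2/L3): ", cachesize())
```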

You don’t need to change BLAS threading – you’re doing 2x2 matmuls on StaticArrays, which should be completely inlined, no BLAS involved. And the arithmetic density is obviously pitiful. So you should tweak your code until it benchmarks similarly to simplifiedModel for large arrays, i.e. you should definitely saturate main-memory bandwidth. (Should you saturate L3 bandwidth? Good question. You can see that by plotting normalized runtime against N and checking whether some of the L1/L2/L3/main-memory plateaus vanish!)
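You can convince yourself that no BLAS call is involved for a 2x2 StaticArrays product by inspecting the generated code (sketch, assuming StaticArrays is installed):

```julia
using StaticArrays

a = @SMatrix rand(2, 2)
b = @SMatrix rand(2, 2)

# Shows fully unrolled multiply-adds, with no call out to BLAS:
@code_llvm a * b
```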

This is, in general, a very common misconception about benchmarking: people assume a simplified model where each operation/function takes a certain amount of time, and the time taken for a bunch of ops is the sum of the times taken for each one. Lol nope, “time taken” is not even approximately additive for smallish times.
