I presented an example of reading from and writing to an array. What puzzles me is the unexpected win of the storage scheme in which the accessed entries of the array are not adjacent in memory. Is there an explanation for this? Does anyone know?
For easier reference, the code is abbreviated here:
function f5!(v)
    for i in 1:size(v, 1)
        x, y, z = v[i, 1], v[i, 2], v[i, 3]
        v[i, 1], v[i, 2], v[i, 3] = 2x, x*y, x*z
    end
end

function f6!(v)
    for i in 1:size(v, 2)
        x, y, z = v[1, i], v[2, i], v[3, i]
        v[1, i], v[2, i], v[3, i] = 2x, x*y, x*z
    end
end
using BenchmarkTools
v5 = rand(1000000, 3);
v6 = rand(3, 1000000);
@btime f5!($v5) # 1.518 ms (0 allocations: 0 bytes) <========== FASTER!?
@btime f6!($v6) # 1.739 ms (0 allocations: 0 bytes)
In the first case, each component sits in its own contiguous column, so the loop can be rewritten to load and process whole chunks of each component at a time, where all the operations on the tmp_ variables are SIMD operations.
In the second case you cannot really do this, because the memory layout interleaves the components ([x, y, z, x, y, z, ...]), so you cannot easily operate on a contiguous chunk of, say, the x values.
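As a rough illustration (hypothetical code, not from the original post; f5_vectorized! and the tmp_ names are my guess at what the chunked rewrite looks like):

function f5_vectorized!(v)
    # With the 1000000×3 layout, each component is a contiguous column,
    # so each broadcast below is a SIMD loop over contiguous memory.
    x = @view v[:, 1]
    y = @view v[:, 2]
    z = @view v[:, 3]
    tmp_x = 2 .* x     # new values computed from the old ones first...
    tmp_y = x .* y
    tmp_z = x .* z
    x .= tmp_x         # ...then written back, matching f5!'s semantics
    y .= tmp_y
    z .= tmp_z
    return v
end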
I’d recommend adding @inbounds @simd (a sketch follows below). In this case, with 3 million Float64 values in the array (24 MB, about 22.9 MiB), the data well exceeds the L3 cache of most CPUs, making this operation memory-bandwidth bound.
At small sizes, with @inbounds @simd, I’d expect the struct-of-arrays version to be much faster.
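A minimal sketch of the suggested annotations applied to the column-major version (f5_simd! is a name I made up; I haven’t re-run the timings):

function f5_simd!(v)
    # @inbounds skips bounds checks; @simd tells the compiler the
    # iterations are independent and may be vectorized/reordered.
    @inbounds @simd for i in 1:size(v, 1)
        x, y, z = v[i, 1], v[i, 2], v[i, 3]
        v[i, 1], v[i, 2], v[i, 3] = 2x, x*y, x*z
    end
end

@btime f5_simd!($v5)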
This distance isn’t really important. If they were three disjoint arrays, they’d likely be even further apart. Would you expect accessing memory in more than one array at a time to be slow?
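For concreteness, the “three disjoint arrays” variant could look like this (f7! is a hypothetical name; it performs the same update as f5!/f6!):

function f7!(x, y, z)
    # Three unrelated allocations, each walked sequentially in lockstep.
    @inbounds @simd for i in eachindex(x, y, z)
        xi = x[i]
        x[i] = 2xi
        y[i] = xi * y[i]
        z[i] = xi * z[i]
    end
end

x, y, z = rand(1_000_000), rand(1_000_000), rand(1_000_000)
@btime f7!($x, $y, $z)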
The cache doesn’t know what an array is; it only sees addresses. You mentioned distant memory locations that are being accessed sequentially. I think you expect thrashing because the loop goes back and forth between distant memory locations, accessing each one sequentially. But this access pattern is very common, and caches (and hardware prefetchers) are designed for it. As far as I know, the limit here is cache associativity: each cache set can hold only a few lines, so if a set holds, say, 4 lines and you stream through 5 distant regions whose addresses happen to map to the same set, I’d expect the lines to keep evicting one another.
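If you want to probe that limit, here is a rough micro-experiment (my sketch; lockstep_sum is a made-up name and I haven’t measured it): stream over k distant arrays in lockstep and watch when the time per element degrades.

function lockstep_sum(arrs)
    # Walk k distant memory regions sequentially, in lockstep.
    s = 0.0
    @inbounds for i in eachindex(arrs[1])   # all arrays have equal length here
        for a in arrs
            s += a[i]
        end
    end
    return s
end

for k in (1, 2, 4, 8, 16)
    arrs = [rand(10^6) for _ in 1:k]   # k separate, distant allocations
    print("k = $k: ")
    @btime lockstep_sum($arrs)
end

If the reasoning above is right, the time per element (total time divided by k) should stay roughly flat for small k and only degrade once k exceeds what the prefetchers and the cache’s associativity can track.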