I presented an example of array access (read and write). The aspect that puzzles me is the unexpected win of the storage scheme in which the accessed entries of the array are not adjacent in memory. Is there an explanation for this? Does anyone know?
For easier reference, the code is abbreviated here:
```julia
function f5!(v)
    for i in 1:size(v, 1)
        x, y, z = v[i, 1], v[i, 2], v[i, 3]
        v[i, 1], v[i, 2], v[i, 3] = 2x, x*y, x*z
    end
end

function f6!(v)
    for i in 1:size(v, 2)
        x, y, z = v[1, i], v[2, i], v[3, i]
        v[1, i], v[2, i], v[3, i] = 2x, x*y, x*z
    end
end

v5 = rand(1000000, 3);
v6 = rand(3, 1000000);

@btime f5!($v5) # 1.518 ms (0 allocations: 0 bytes) <========== FASTER!?
@btime f6!($v6) # 1.739 ms (0 allocations: 0 bytes)
```
This is basically a struct-of-arrays (SoA) vs. array-of-structs (AoS) layout difference, where v5 is SoA and v6 is AoS.
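As a small illustration (toy data, not the benchmark arrays above), here is how Julia's column-major storage orders the same six numbers under the two layouts:

```julia
# SoA-style (like v5): a 2×3 matrix, each column holds one field (x, y, z).
soa = [1.0 10.0 100.0;
       2.0 20.0 200.0]

# AoS-style (like v6): a 3×2 matrix, each column holds one "struct" (x, y, z).
aos = [1.0 2.0;
       10.0 20.0;
       100.0 200.0]

# Linear (memory) order of the elements, since Julia is column-major:
vec(soa)  # [1.0, 2.0, 10.0, 20.0, 100.0, 200.0] — all x's, then y's, then z's
vec(aos)  # [1.0, 10.0, 100.0, 2.0, 20.0, 200.0] — x, y, z per point, interleaved
```

In the SoA layout, each component lives in one contiguous run, which is exactly what vectorized loads want.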
In f5! the loop is unrolled by a factor of 4, and the body then does something like:

```julia
tmp_x = v[i:i+3, 1]
tmp_y = v[i:i+3, 2]
tmp_z = v[i:i+3, 3]
v[i:i+3, 1] .= 2 .* tmp_x
v[i:i+3, 2] .= tmp_x .* tmp_y
v[i:i+3, 3] .= tmp_x .* tmp_z
i += 4
```

where all the operations on the tmp_ variables are SIMD operations.
In the second case you cannot really do this, because the memory layout interleaves the different components ([x, y, z, x, y, z, ...]), so you cannot easily operate on a contiguous "chunk" of a single component.
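A rough, runnable sketch of that unrolled body (chunked! is a made-up name; the compiler of course works in SIMD registers, not heap-allocated array temporaries):

```julia
# Process the SoA layout in 4-wide chunks, mimicking the unrolled loop body.
function chunked!(v)
    n = size(v, 1)
    i = 1
    while i + 3 <= n
        r = i:i+3
        tmp_x = v[r, 1]          # slices copy, standing in for SIMD registers
        tmp_y = v[r, 2]
        tmp_z = v[r, 3]
        v[r, 1] .= 2 .* tmp_x
        v[r, 2] .= tmp_x .* tmp_y
        v[r, 3] .= tmp_x .* tmp_z
        i += 4
    end
    for j in i:n                 # scalar remainder loop
        x, y, z = v[j, 1], v[j, 2], v[j, 3]
        v[j, 1], v[j, 2], v[j, 3] = 2x, x*y, x*z
    end
    return v
end
```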
This sounds like a good explanation. It does require fetching data from widely separated memory locations, though.
I’d recommend adding @inbounds @simd. In this case, with 3 million doubles in the array (about 23 MiB), the working set well exceeds the L3 cache of most CPUs, making this operation memory-bottlenecked. At small sizes, with @inbounds @simd, I’d expect the struct-of-arrays version to be much faster.
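A sketch of the suggested change applied to the row-major loop (f5_simd! is a made-up name for illustration):

```julia
# Same kernel as f5!, but with bounds checks removed and SIMD hinted.
function f5_simd!(v)
    @inbounds @simd for i in 1:size(v, 1)
        x, y, z = v[i, 1], v[i, 2], v[i, 3]
        v[i, 1], v[i, 2], v[i, 3] = 2x, x*y, x*z
    end
    return v
end
```

@inbounds is safe here because the loop range is derived from size(v, 1) itself; the speedup at small sizes, where the data fits in cache, is where the SoA layout should shine.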
This distance isn’t really important. If they were three disjoint arrays, they’d likely be even further apart. Would you expect accessing memory in more than one array at a time to be slow?
Right, and there are multiple cache lines, and each array (of two, in this case) can be cached in its own lines, most likely without contention.
Two? This is a read-write on a single array, isn’t it?
The cache doesn’t know what an array is. You mentioned distant memory locations, which are being accessed sequentially. I think you expect thrashing because you’re going back and forth between distant memory locations, accessing each sequentially. This access pattern is very common, and caches are designed for it. As far as I know, the limit is the number of cache lines available: e.g., if you have 4 cache lines and are accessing 5 distant memory regions sequentially, then I’d expect contention between the cache lines.
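One way to probe this empirically (a hedged sketch; sum_streams is a made-up name, and any timings are hardware-dependent):

```julia
# Read k separate arrays ("streams") in lockstep, each one sequentially.
# If caches handle a few concurrent sequential streams well, time per element
# should grow roughly linearly in k until the streams start evicting each other.
function sum_streams(arrays)
    s = 0.0
    for i in eachindex(arrays[1])
        for a in arrays          # hop between distant memory regions
            s += a[i]
        end
    end
    return s
end

streams = [rand(1_000_000) for _ in 1:3]
# using BenchmarkTools; @btime sum_streams($streams)  # compare k = 1, 3, 8, ...
```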