What is going on with memory locality when accessing an array?

I presented an example of access to an array: read and write. The aspect of this that puzzles me is the unexpected win of the storage scheme where the accessed entries of the array are not adjacent in memory. Is there an explanation to this? Anyone knows?

For easier reference, the code is abbreviated here:

function f5!(v)
    for i in 1:size(v, 1)
        x, y, z = v[i, 1], v[i, 2], v[i, 3]
        v[i, 1], v[i, 2], v[i, 3] = 2x, x*y, x*z

function f6!(v)
    for i in 1:size(v, 2)
        x, y, z = v[1, i], v[2, i], v[3, i]
        v[1, i], v[2, i], v[3, i] = 2x, x*y, x*z

using BenchmarkTools
v5 = rand(1000000, 3);
v6 = rand(3, 1000000);

@btime f5!($v5)  # 1.518 ms (0 allocations: 0 bytes)  <========== FASTER!?
@btime f6!($v6)    # 1.739 ms (0 allocations: 0 bytes)

This is basically a Struct Of Array (SoA) vs Array of Struct (AoS) layout difference where v5 is SoA and v6 is AoS.

For f5! the loop is unrolled by a factor of 4 and then the body does something like:

while !done
    tmp_x = v[i:i+4, 1]
    tmp_y = v[i:i+4, 2]
    tmp_z = v[i:i+4, 3]

    v[i:i+4, 1] = 2 * tmp_x
    v[i:i+4, 2]  = tmp_x * tmp_y
    v[i:i+4, 3]  = tmp_x * tmp_z
    i += 4

where all the operations on the tmp_ variables are SIMD operations.

In the second case you cannot really do like this because the memory layout mixes the different components ([x, y, z, x, y, z, ...]) so you cannot operate easily on a “chunk” of e.g. x easily.


This sounds like a good explanation. It does require fetching data from widely separated memory locations, though.

I’d recommend adding @inbounds @simd. In this case, with 3 million doubles in an array (about 22.8 megabytes), this well exceeds the amount of L3 cache most CPUs have, making this operation memory bottle-necked.
At small sizes, with @inbounds @simd, I’d expect the struct of arrays version to be much faster.

This distance isn’t really important. If they were three disjoint arrays, they’d likely be even further apart. Would you expect accessing memory in more than one array at a time to be slow?

1 Like

Right, and there are multiple cache lines, and each (of two, in this case) array can be cached in its own line, without contention most likely.

Two? This is a read-write on a single array, isn’t it?

The cache doesn’t know what an array is. You mentioned distant memory locations, which are being accessed sequentially. I think you expect thrashing because you’re going back and forth between distant memory locations, accessing each sequentially. This access pattern is very common and caches are designed for it. As far as I know, this corresponds to cache lines. E.g. if you have 4 cache lines, and are accessing 5 distant memory regions sequentially, then I’d expect contention between the cache lines.