I presented an example of array access (read and write). The aspect that puzzles me is the unexpected win of the storage scheme in which the accessed entries of the array are not adjacent in memory. Is there an explanation for this? Does anyone know?
For easier reference, the code is abbreviated here:
```julia
function f5!(v)
    for i in 1:size(v, 1)
        x, y, z = v[i, 1], v[i, 2], v[i, 3]
        v[i, 1], v[i, 2], v[i, 3] = 2x, x*y, x*z
    end
end

function f6!(v)
    for i in 1:size(v, 2)
        x, y, z = v[1, i], v[2, i], v[3, i]
        v[1, i], v[2, i], v[3, i] = 2x, x*y, x*z
    end
end

v5 = rand(1000000, 3);
v6 = rand(3, 1000000);

@btime f5!($v5) # 1.518 ms (0 allocations: 0 bytes) <========== FASTER!?
@btime f6!($v6) # 1.739 ms (0 allocations: 0 bytes)
```
This is basically a struct-of-arrays (SoA) vs. array-of-structs (AoS) layout difference, where v5 is SoA and v6 is AoS.
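As a small illustration (toy data, not the benchmark arrays above), here is how Julia's column-major storage orders the same six numbers under the two layouts:

```julia
# SoA-style (like v5): a 2×3 matrix, each column holds one field (x, y, z).
soa = [1.0 10.0 100.0;
       2.0 20.0 200.0]

# AoS-style (like v6): a 3×2 matrix, each column holds one "struct" (x, y, z).
aos = [1.0 2.0;
       10.0 20.0;
       100.0 200.0]

# Linear (memory) order of the elements, since Julia is column-major:
vec(soa)  # [1.0, 2.0, 10.0, 20.0, 100.0, 200.0] — all x's, then y's, then z's
vec(aos)  # [1.0, 10.0, 100.0, 2.0, 20.0, 200.0] — x, y, z per point, interleaved
```

In the SoA layout, each component lives in one contiguous run, which is exactly what vectorized loads want.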
In f5! the loop is unrolled by a factor of 4, and the body then does something like:

```julia
tmp_x = v[i:i+3, 1]
tmp_y = v[i:i+3, 2]
tmp_z = v[i:i+3, 3]
v[i:i+3, 1] .= 2 .* tmp_x
v[i:i+3, 2] .= tmp_x .* tmp_y
v[i:i+3, 3] .= tmp_x .* tmp_z
i += 4
```

where all the operations on the tmp_ variables are SIMD operations.
In the second case you cannot really do this, because the memory layout interleaves the different components ([x, y, z, x, y, z, ...]), so you cannot easily operate on a contiguous "chunk" of a single component.
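A rough, runnable sketch of that unrolled body (chunked! is a made-up name; the compiler of course works in SIMD registers, not heap-allocated array temporaries):

```julia
# Process the SoA layout in 4-wide chunks, mimicking the unrolled loop body.
function chunked!(v)
    n = size(v, 1)
    i = 1
    while i + 3 <= n
        r = i:i+3
        tmp_x = v[r, 1]          # slices copy, standing in for SIMD registers
        tmp_y = v[r, 2]
        tmp_z = v[r, 3]
        v[r, 1] .= 2 .* tmp_x
        v[r, 2] .= tmp_x .* tmp_y
        v[r, 3] .= tmp_x .* tmp_z
        i += 4
    end
    for j in i:n                 # scalar remainder loop
        x, y, z = v[j, 1], v[j, 2], v[j, 3]
        v[j, 1], v[j, 2], v[j, 3] = 2x, x*y, x*z
    end
    return v
end
```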
This sounds like a good explanation. It does require fetching data from widely separated memory locations, though.
I’d recommend adding @inbounds @simd. In this case, with 3 million doubles in the array (about 23 MiB), the working set well exceeds the L3 cache of most CPUs, making this operation memory-bottlenecked. At small sizes, with @inbounds @simd, I’d expect the struct-of-arrays version to be much faster.
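A sketch of the suggested change applied to the row-major loop (f5_simd! is a made-up name for illustration):

```julia
# Same kernel as f5!, but with bounds checks removed and SIMD hinted.
function f5_simd!(v)
    @inbounds @simd for i in 1:size(v, 1)
        x, y, z = v[i, 1], v[i, 2], v[i, 3]
        v[i, 1], v[i, 2], v[i, 3] = 2x, x*y, x*z
    end
    return v
end
```

@inbounds is safe here because the loop range is derived from size(v, 1) itself; the speedup at small sizes, where the data fits in cache, is where the SoA layout should shine.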
This distance isn’t really important. If they were three disjoint arrays, they’d likely be even further apart. Would you expect accessing memory in more than one array at a time to be slow?
Right, and there are multiple cache lines, and each array (of two, in this case) can be cached in its own lines, most likely without contention.
Two? This is a read-write on a single array, isn’t it?
The cache doesn’t know what an array is. You mentioned distant memory locations, which are being accessed sequentially. I think you expect thrashing because you’re going back and forth between distant memory locations, accessing each sequentially. This access pattern is very common, and caches are designed for it. As far as I know, the limit is the number of cache lines available: e.g., if you have 4 cache lines and are accessing 5 distant memory regions sequentially, then I’d expect contention between the cache lines.
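One way to probe this empirically (a hedged sketch; sum_streams is a made-up name, and any timings are hardware-dependent):

```julia
# Read k separate arrays ("streams") in lockstep, each one sequentially.
# If caches handle a few concurrent sequential streams well, time per element
# should grow roughly linearly in k until the streams start evicting each other.
function sum_streams(arrays)
    s = 0.0
    for i in eachindex(arrays[1])
        for a in arrays          # hop between distant memory regions
            s += a[i]
        end
    end
    return s
end

streams = [rand(1_000_000) for _ in 1:3]
# using BenchmarkTools; @btime sum_streams($streams)  # compare k = 1, 3, 8, ...
```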