Bad performance looping over SOAs with SIMD

I’m trying to write performant code (in particular a Monte Carlo simulation) and have been stumped by the SOA vs AOS debate for a while now. Generally, when trying to use SIMD, I have seen it’s recommended to use the SOA approach. However, in my program, I seem to get worse performance using this approach. I made a minimal example where I have a collection of SOAs and a function that works on any of those. The function itself can be up to 2x as performant when run on an SOA vs AOS, but when randomly picking one of the structures in a loop, and then running the function, the performance is about the same or even worse for SOA.

Consider the example:

using LoopVectorization

struct Connections

struct Graph
    adj_aos::Vector{Vector{Tuple{Int32, Float32}}}

# Initialize graph with given size and number of random connections per node
function Graph(size, n_connections)
    # Initialize state
    state = rand(Float32, size)

    # Initialize the array of structures
    adj_aos = Vector{Vector{Tuple{Int32, Float32}}}(undef, size)
    for i in 1:size
        adj_aos[i] = Vector{Tuple{Int32, Float32}}(undef, n_connections)
        for j in 1:n_connections
            adj_aos[i][j] = (rand(UnitRange{Int32}(1:size)), rand(Float32))
        sort!(adj_aos[i], by = x -> x[1])

    # Copy the data into a structure of arrays
    adj_soa = Vector{Connections}(undef, size)
    for i in 1:size
        conns = Connections([x[1] for x in adj_aos[i]], [x[2] for x in adj_aos[i]])
        adj_soa[i] = conns

    return Graph(state, adj_aos, adj_soa)

# Get the energy for a given state and connections
function getE_aos(state, connections)
    e = 0f0
    @inbounds @simd for conns_idx in eachindex(connections)
        c_idx = connections[conns_idx][1]
        weight = connections[conns_idx][2]
        e += -state[c_idx]*weight
    return e

# Get the energy for a given state and connections
function getE_soa(state, connections)
    e = 0f0
    idxs = connections.idxs
    weights = connections.weights
    @turbo for conns_idx in eachindex(idxs)
        idx = idxs[conns_idx]
        weight = weights[conns_idx]
        e += -state[idx]*weight
    return e

# Main loop AOS
function mainloop_aos(graph, iterations)
    etot = 0.
    @inbounds for _ in 1:iterations
        idx = rand(1:length(graph.state))
        etot += graph.state[idx]*getE_aos(graph.state, graph.adj_aos[idx])

    return etot

# Main loop SOA
function mainloop_soa(graph, iterations)
    etot = 0.
    @inbounds for _ in 1:iterations
        idx = rand(1:length(graph.state))
        etot += graph.state[idx]*getE_soa(graph.state, graph.adj_soa[idx])

    return etot

const g1000 = Graph(500^2, 1000)

Now I find that getE_soa is 2x as fast as getE_aos, but when benchmarking the mainloop functions, they’re about the same speed, even though basically all of the computation time per loop should be taken by evaluating the getE function. Can anybody help me interpret these results?

I’ve written posts about this before, here for example I provided a program more close to what I’m actually working with. For my actual code, using the SOA gives me actually worse performance generally, though for this example they seem to perform about the same, so I’m hoping it’s because of the same thing. I didn’t get any responses on my previous post, presumably because the code was not exactly minimal. Hopefully this example suffices.

You’re bottlnecked by memory.
When benchmarking one call to the function, the connections and state can be loaded in cache.
When running the main loop, however, you’re iterating across these, and therefore spending all your time streaming memory.
The actual compute performance becomes mostly irrelevant.
Note that AOS can SIMD, too.
It just requires a few more shuffle instructions to fix the memory layout, so the perf difference shouldn’t be too large to begin with.

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
       mainloop_aos(g1000, 1_000_000);
╶ cpu-cycles               6.87e+09   60.0%  #  3.9 cycles per ns
┌ instructions             7.54e+08   60.0%  #  0.1 insns per cycle
│ branch-instructions      3.30e+07   60.0%  #  4.4% of insns
└ branch-misses            3.17e+03   60.0%  #  0.0% of branch insns
┌ task-clock               1.77e+09  100.0%  #  1.8 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    1.12e+09   20.0%  # 490.8% of dcache loads
│ L1-dcache-loads          2.27e+08   20.0%
└ L1-icache-load-misses    1.81e+04   20.0%
┌ dTLB-load-misses         3.14e+06   20.0%  #  1.4% of dTLB loads
└ dTLB-loads               2.27e+08   20.0%
┌ iTLB-load-misses         3.76e+01   39.9%  # 750.0% of iTLB loads
└ iTLB-loads               5.01e+00   39.9%

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
       mainloop_soa(g1000, 1_000_000);
╶ cpu-cycles               6.76e+09   60.0%  #  3.9 cycles per ns
┌ instructions             4.91e+08   60.0%  #  0.1 insns per cycle
│ branch-instructions      3.00e+07   60.0%  #  6.1% of insns
└ branch-misses            3.97e+03   60.0%  #  0.0% of branch insns
┌ task-clock               1.74e+09  100.0%  #  1.7 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    1.11e+09   20.0%  # 503.7% of dcache loads
│ L1-dcache-loads          2.21e+08   20.0%
└ L1-icache-load-misses    3.04e+05   20.0%
┌ dTLB-load-misses         2.58e+06   20.0%  #  1.2% of dTLB loads
└ dTLB-loads               2.21e+08   20.0%
┌ iTLB-load-misses         3.00e+01   40.0%  # 171.4% of iTLB loads
└ iTLB-loads               1.75e+01   40.0%

Note soa requires 4.91e8 instructions vs 7.54e8 for aos.
We’re executing about 50% more instructions for AOS on my computer. Of course, a 50% speed improvement would be huge if we could get that.
But, here’s the striking thing:
While we have 4.91e8 or 7.54e8 instructions, we have around 1.19e9 L1d cache load misses!
That is, we have 1.5 (for aos) or 2 (for soa) L1d cache load misses per instruction we execute!

On my computer, performance isn’t much better when just calling getE_* over and over again.


Thanks so much, this is super helpful, though still a little bit outside of my comfort zone at the moment. At least I know some topics I have to read up on now.

I have a few questions though, adding the @simd to the AOS case does seem help with performance from my testing, so it does look like it does something? Still the performance difference compared to @turbo and SOA in the cached case is ~2x for large arrays, so that seems to imply to me the @simd macro should actually do nothing in that case, right?

You said AOS can simd too, which would be very helpful in my case, as it seems from what you said the memory layout of AOS is actually a bit more convenient for my use case. Do you have any pointers on how I could make it simd? I cannot figure out a way to make @turbo work.

Lastly, could you conceive of some better way of storing my data than these two options, or some other way of optimizing the memory access? This is an area with which I’m quite unfamiliar, but I do want to learn about. Any good references you might reccomend to read up on would also be highly appreciated, be it Julia specific or not.

Yes, it uses SIMD instructions.
The AOS data layout is “wrong” for SIMD, but there are SIMD permute and shuffle instructions.

These can be used to load the data, and then get it into the correct layout.
SOA is often faster, because it’s generally better not to have to do this work.
For example, here is SOA assembly:

.LBB0_5:                                # %L50
                                        # =>This Inner Loop Header: Depth=1
        vmovups zmm1, zmmword ptr [r11 + 4*rdi]
        vmovups zmm2, zmmword ptr [r11 + 4*rdi + 64]
        vmovups zmm3, zmmword ptr [r11 + 4*rdi + 128]
        vmovups zmm4, zmmword ptr [r11 + 4*rdi + 192]
        kxnorw  k1, k0, k0
        vxorps  xmm5, xmm5, xmm5
        vgatherdps      zmm5 {k1}, zmmword ptr [rax + 4*zmm1]
        kxnorw  k1, k0, k0
        vxorps  xmm1, xmm1, xmm1
        vgatherdps      zmm1 {k1}, zmmword ptr [rax + 4*zmm2]
        vfmsub132ps     zmm5, zmm0, zmmword ptr [rdx + 4*rdi] # zmm5 = (zmm5 * mem) - zmm0
        vfmadd132ps     zmm1, zmm5, zmmword ptr [rdx + 4*rdi + 64] # zmm1 = (zmm1 * mem) + zmm5
        kxnorw  k1, k0, k0
        vxorps  xmm2, xmm2, xmm2
        vgatherdps      zmm2 {k1}, zmmword ptr [rax + 4*zmm3]
        kxnorw  k1, k0, k0
        vxorps  xmm0, xmm0, xmm0
        vgatherdps      zmm0 {k1}, zmmword ptr [rax + 4*zmm4]
        vfmadd132ps     zmm2, zmm1, zmmword ptr [rdx + 4*rdi + 128] # zmm2 = (zmm2 * mem) + zmm1
        vfnmsub132ps    zmm0, zmm2, zmmword ptr [rdx + 4*rdi + 192] # zmm0 = -(zmm0 * mem) - zmm2
        add     rdi, 64
        add     rsi, -4
        jne     .LBB0_5

Versus AOS:

.LBB0_7:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vmovdqu64       zmm6, zmmword ptr [rcx + 8*rdi]
        vmovdqu64       zmm7, zmmword ptr [rcx + 8*rdi + 64]
        vmovdqu64       zmm8, zmmword ptr [rcx + 8*rdi + 128]
        vmovdqu64       zmm9, zmmword ptr [rcx + 8*rdi + 192]
        vmovdqu64       zmm10, zmmword ptr [rcx + 8*rdi + 320]
        vmovdqu64       zmm11, zmmword ptr [rcx + 8*rdi + 256]
        vmovdqu64       zmm12, zmmword ptr [rcx + 8*rdi + 448]
        vmovdqu64       zmm13, zmmword ptr [rcx + 8*rdi + 384]
        vmovdqa64       zmm14, zmm6
        vpermt2d        zmm14, zmm2, zmm7
        vmovdqa64       zmm15, zmm8
        vpermt2d        zmm15, zmm2, zmm9
        vmovdqa64       zmm16, zmm11
        vpermt2d        zmm16, zmm2, zmm10
        vmovdqa64       zmm17, zmm13
        vpermt2d        zmm17, zmm2, zmm12
        vpermt2d        zmm6, zmm3, zmm7
        vpermt2d        zmm8, zmm3, zmm9
        kxnorw  k1, k0, k0
        vpxor   xmm7, xmm7, xmm7
        kxnorw  k2, k0, k0
        vpxor   xmm9, xmm9, xmm9
        kxnorw  k3, k0, k0
        vpermt2d        zmm11, zmm3, zmm10
        vpxor   xmm10, xmm10, xmm10
        vgatherdps      zmm7 {k1}, zmmword ptr [rdx + 4*zmm14 - 4]
        kxnorw  k1, k0, k0
        vgatherdps      zmm9 {k2}, zmmword ptr [rdx + 4*zmm15 - 4]
        vxorps  xmm14, xmm14, xmm14
        vgatherdps      zmm10 {k3}, zmmword ptr [rdx + 4*zmm16 - 4]
        vgatherdps      zmm14 {k1}, zmmword ptr [rdx + 4*zmm17 - 4]
        vpermt2d        zmm13, zmm3, zmm12
        vfnmadd231ps    zmm1, zmm6, zmm7        # zmm1 = -(zmm6 * zmm7) + zmm1
        vfnmadd231ps    zmm0, zmm8, zmm9        # zmm0 = -(zmm8 * zmm9) + zmm0
        vfnmadd231ps    zmm4, zmm11, zmm10      # zmm4 = -(zmm11 * zmm10) + zmm4
        vfnmadd231ps    zmm5, zmm13, zmm14      # zmm5 = -(zmm13 * zmm14) + zmm5
        add     rdi, 64
        cmp     rsi, rdi
        jne     .LBB0_7

Notice all those vpermt2d instructions.
These permute elements of the SIMD vectors, transforming from AOS to SOA layout.

In this case, almost all the time is taken up by the gather instructions, which can run in parallel to the permutes.
Thus, we may need many more instructions, but the AOS isn’t really expected to run slower.
Note that the number of expected cycles is pretty similar, e.g. 35.6 vs 36.8 cycles for tiger lake.
So less than a 4% advantage for SOA.

You can see that those 4 gathers make up almost the entire cost.

Note that you can turn on the trace, and then click “open trace” to visualize what is going on.
It shows clock cycles, as well as which phase of execution each instruction (and the uops that they break down into) is in.