# Bad performance looping over SOAs with SIMD

I’m trying to write performant code (in particular a Monte Carlo simulation) and have been stumped by the SOA vs AOS debate for a while now. Generally, when trying to use SIMD, I have seen it’s recommended to use the SOA approach. However, in my program, I seem to get worse performance using this approach. I made a minimal example where I have a collection of SOAs and a function that works on any of those. The function itself can be up to 2x as performant when run on an SOA vs AOS, but when randomly picking one of the structures in a loop, and then running the function, the performance is about the same or even worse for SOA.

Consider the example:

``````using LoopVectorization

struct Connections
idxs::Vector{Int32}
weights::Vector{Float32}
end

struct Graph
state::Vector{Float32}
end

# Initialize graph with given size and number of random connections per node
function Graph(size, n_connections)
# Initialize state
state = rand(Float32, size)

# Initialize the array of structures
adj_aos = Vector{Vector{Tuple{Int32, Float32}}}(undef, size)
for i in 1:size
adj_aos[i] = Vector{Tuple{Int32, Float32}}(undef, n_connections)
for j in 1:n_connections
adj_aos[i][j] = (rand(UnitRange{Int32}(1:size)), rand(Float32))
end

sort!(adj_aos[i], by = x -> x[1])
end

# Copy the data into a structure of arrays
adj_soa = Vector{Connections}(undef, size)
for i in 1:size
conns = Connections([x[1] for x in adj_aos[i]], [x[2] for x in adj_aos[i]])
end

end

# Get the energy for a given state and connections
function getE_aos(state, connections)
e = 0f0
@inbounds @simd for conns_idx in eachindex(connections)
c_idx = connections[conns_idx][1]
weight = connections[conns_idx][2]
e += -state[c_idx]*weight
end
return e
end

# Get the energy for a given state and connections
function getE_soa(state, connections)
e = 0f0
idxs = connections.idxs
weights = connections.weights
@turbo for conns_idx in eachindex(idxs)
idx = idxs[conns_idx]
weight = weights[conns_idx]
e += -state[idx]*weight
end
return e
end

# Main loop AOS
function mainloop_aos(graph, iterations)
etot = 0.
@inbounds for _ in 1:iterations
idx = rand(1:length(graph.state))
etot += graph.state[idx]*getE_aos(graph.state, graph.adj_aos[idx])
end

return etot
end

# Main loop SOA
function mainloop_soa(graph, iterations)
etot = 0.
@inbounds for _ in 1:iterations
idx = rand(1:length(graph.state))
etot += graph.state[idx]*getE_soa(graph.state, graph.adj_soa[idx])
end

return etot
end

const g1000 = Graph(500^2, 1000)
``````

Now I find that getE_soa is 2x as fast as getE_aos, but when benchmarking the mainloop functions, they’re about the same speed, even though basically all of the computation time per loop should be taken by evaluating the getE function. Can anybody help me interpret these results?

I’ve written posts about this before, here for example I provided a program more close to what I’m actually working with. For my actual code, using the SOA gives me actually worse performance generally, though for this example they seem to perform about the same, so I’m hoping it’s because of the same thing. I didn’t get any responses on my previous post, presumably because the code was not exactly minimal. Hopefully this example suffices.

You’re bottlnecked by memory.
When benchmarking one call to the function, the connections and state can be loaded in cache.
When running the main loop, however, you’re iterating across these, and therefore spending all your time streaming memory.
The actual compute performance becomes mostly irrelevant.
Note that AOS can SIMD, too.
It just requires a few more shuffle instructions to fix the memory layout, so the perf difference shouldn’t be too large to begin with.

``````julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
mainloop_aos(g1000, 1_000_000);
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               6.87e+09   60.0%  #  3.9 cycles per ns
┌ instructions             7.54e+08   60.0%  #  0.1 insns per cycle
│ branch-instructions      3.30e+07   60.0%  #  4.4% of insns
└ branch-misses            3.17e+03   60.0%  #  0.0% of branch insns
┌ task-clock               1.77e+09  100.0%  #  1.8 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    1.12e+09   20.0%  # 490.8% of dcache loads
│ L1-dcache-loads          2.27e+08   20.0%
└ L1-icache-load-misses    1.81e+04   20.0%
┌ dTLB-load-misses         3.14e+06   20.0%  #  1.4% of dTLB loads
└ dTLB-loads               2.27e+08   20.0%
┌ iTLB-load-misses         3.76e+01   39.9%  # 750.0% of iTLB loads
└ iTLB-loads               5.01e+00   39.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

mainloop_soa(g1000, 1_000_000);
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               6.76e+09   60.0%  #  3.9 cycles per ns
┌ instructions             4.91e+08   60.0%  #  0.1 insns per cycle
│ branch-instructions      3.00e+07   60.0%  #  6.1% of insns
└ branch-misses            3.97e+03   60.0%  #  0.0% of branch insns
┌ task-clock               1.74e+09  100.0%  #  1.7 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    1.11e+09   20.0%  # 503.7% of dcache loads
│ L1-dcache-loads          2.21e+08   20.0%
└ L1-icache-load-misses    3.04e+05   20.0%
┌ dTLB-load-misses         2.58e+06   20.0%  #  1.2% of dTLB loads
└ dTLB-loads               2.21e+08   20.0%
┌ iTLB-load-misses         3.00e+01   40.0%  # 171.4% of iTLB loads
└ iTLB-loads               1.75e+01   40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
``````

Note `soa` requires `4.91e8` instructions vs `7.54e8` for aos.
We’re executing about 50% more instructions for AOS on my computer. Of course, a 50% speed improvement would be huge if we could get that.
But, here’s the striking thing:
While we have `4.91e8` or `7.54e8` instructions, we have around `1.19e9` L1d cache load misses!
That is, we have 1.5 (for aos) or 2 (for soa) L1d cache load misses per instruction we execute!

On my computer, performance isn’t much better when just calling `getE_*` over and over again.

2 Likes

Thanks so much, this is super helpful, though still a little bit outside of my comfort zone at the moment. At least I know some topics I have to read up on now.

I have a few questions though, adding the @simd to the AOS case does seem help with performance from my testing, so it does look like it does something? Still the performance difference compared to @turbo and SOA in the cached case is ~2x for large arrays, so that seems to imply to me the @simd macro should actually do nothing in that case, right?

You said AOS can simd too, which would be very helpful in my case, as it seems from what you said the memory layout of AOS is actually a bit more convenient for my use case. Do you have any pointers on how I could make it simd? I cannot figure out a way to make @turbo work.

Lastly, could you conceive of some better way of storing my data than these two options, or some other way of optimizing the memory access? This is an area with which I’m quite unfamiliar, but I do want to learn about. Any good references you might reccomend to read up on would also be highly appreciated, be it Julia specific or not.

Yes, it uses SIMD instructions.
The AOS data layout is “wrong” for SIMD, but there are SIMD permute and shuffle instructions.

These can be used to load the data, and then get it into the correct layout.
SOA is often faster, because it’s generally better not to have to do this work.
For example, here is SOA assembly:

``````.LBB0_5:                                # %L50
# =>This Inner Loop Header: Depth=1
vmovups zmm1, zmmword ptr [r11 + 4*rdi]
vmovups zmm2, zmmword ptr [r11 + 4*rdi + 64]
vmovups zmm3, zmmword ptr [r11 + 4*rdi + 128]
vmovups zmm4, zmmword ptr [r11 + 4*rdi + 192]
kxnorw  k1, k0, k0
vxorps  xmm5, xmm5, xmm5
vgatherdps      zmm5 {k1}, zmmword ptr [rax + 4*zmm1]
kxnorw  k1, k0, k0
vxorps  xmm1, xmm1, xmm1
vgatherdps      zmm1 {k1}, zmmword ptr [rax + 4*zmm2]
vfmsub132ps     zmm5, zmm0, zmmword ptr [rdx + 4*rdi] # zmm5 = (zmm5 * mem) - zmm0
vfmadd132ps     zmm1, zmm5, zmmword ptr [rdx + 4*rdi + 64] # zmm1 = (zmm1 * mem) + zmm5
kxnorw  k1, k0, k0
vxorps  xmm2, xmm2, xmm2
vgatherdps      zmm2 {k1}, zmmword ptr [rax + 4*zmm3]
kxnorw  k1, k0, k0
vxorps  xmm0, xmm0, xmm0
vgatherdps      zmm0 {k1}, zmmword ptr [rax + 4*zmm4]
vfmadd132ps     zmm2, zmm1, zmmword ptr [rdx + 4*rdi + 128] # zmm2 = (zmm2 * mem) + zmm1
vfnmsub132ps    zmm0, zmm2, zmmword ptr [rdx + 4*rdi + 192] # zmm0 = -(zmm0 * mem) - zmm2
jne     .LBB0_5
``````

Versus AOS:

``````.LBB0_7:                                # %vector.body
# =>This Inner Loop Header: Depth=1
vmovdqu64       zmm6, zmmword ptr [rcx + 8*rdi]
vmovdqu64       zmm7, zmmword ptr [rcx + 8*rdi + 64]
vmovdqu64       zmm8, zmmword ptr [rcx + 8*rdi + 128]
vmovdqu64       zmm9, zmmword ptr [rcx + 8*rdi + 192]
vmovdqu64       zmm10, zmmword ptr [rcx + 8*rdi + 320]
vmovdqu64       zmm11, zmmword ptr [rcx + 8*rdi + 256]
vmovdqu64       zmm12, zmmword ptr [rcx + 8*rdi + 448]
vmovdqu64       zmm13, zmmword ptr [rcx + 8*rdi + 384]
vmovdqa64       zmm14, zmm6
vpermt2d        zmm14, zmm2, zmm7
vmovdqa64       zmm15, zmm8
vpermt2d        zmm15, zmm2, zmm9
vmovdqa64       zmm16, zmm11
vpermt2d        zmm16, zmm2, zmm10
vmovdqa64       zmm17, zmm13
vpermt2d        zmm17, zmm2, zmm12
vpermt2d        zmm6, zmm3, zmm7
vpermt2d        zmm8, zmm3, zmm9
kxnorw  k1, k0, k0
vpxor   xmm7, xmm7, xmm7
kxnorw  k2, k0, k0
vpxor   xmm9, xmm9, xmm9
kxnorw  k3, k0, k0
vpermt2d        zmm11, zmm3, zmm10
vpxor   xmm10, xmm10, xmm10
vgatherdps      zmm7 {k1}, zmmword ptr [rdx + 4*zmm14 - 4]
kxnorw  k1, k0, k0
vgatherdps      zmm9 {k2}, zmmword ptr [rdx + 4*zmm15 - 4]
vxorps  xmm14, xmm14, xmm14
vgatherdps      zmm10 {k3}, zmmword ptr [rdx + 4*zmm16 - 4]
vgatherdps      zmm14 {k1}, zmmword ptr [rdx + 4*zmm17 - 4]
vpermt2d        zmm13, zmm3, zmm12
vfnmadd231ps    zmm1, zmm6, zmm7        # zmm1 = -(zmm6 * zmm7) + zmm1
vfnmadd231ps    zmm0, zmm8, zmm9        # zmm0 = -(zmm8 * zmm9) + zmm0
vfnmadd231ps    zmm4, zmm11, zmm10      # zmm4 = -(zmm11 * zmm10) + zmm4
vfnmadd231ps    zmm5, zmm13, zmm14      # zmm5 = -(zmm13 * zmm14) + zmm5
cmp     rsi, rdi
jne     .LBB0_7
``````

Notice all those `vpermt2d` instructions.
These permute elements of the SIMD vectors, transforming from AOS to SOA layout.

In this case, almost all the time is taken up by the `gather` instructions, which can run in parallel to the permutes.
Thus, we may need many more instructions, but the AOS isn’t really expected to run slower.
AOS
SOA
Note that the number of expected cycles is pretty similar, e.g. 35.6 vs 36.8 cycles for tiger lake.
So less than a 4% advantage for SOA.

You can see that those 4 gathers make up almost the entire cost.

Note that you can turn on the trace, and then click “open trace” to visualize what is going on.
It shows clock cycles, as well as which phase of execution each instruction (and the uops that they break down into) is in.