I’m trying to write performant code (in particular a Monte Carlo simulation) and have been stumped by the SOA vs AOS debate for a while now. Generally, when trying to use SIMD, I have seen it’s recommended to use the SOA approach. However, in my program, I seem to get worse performance using this approach. I made a minimal example where I have a collection of SOAs and a function that works on any of those. The function itself can be up to 2x as performant when run on an SOA vs AOS, but when randomly picking one of the structures in a loop, and then running the function, the performance is about the same or even worse for SOA.
Consider the example:
using LoopVectorization
struct Connections
idxs::Vector{Int32}
weights::Vector{Float32}
end
struct Graph
state::Vector{Float32}
adj_aos::Vector{Vector{Tuple{Int32, Float32}}}
adj_soa::Vector{Connections}
end
# Initialize graph with given size and number of random connections per node
function Graph(size, n_connections)
# Initialize state
state = rand(Float32, size)
# Initialize the array of structures
adj_aos = Vector{Vector{Tuple{Int32, Float32}}}(undef, size)
for i in 1:size
adj_aos[i] = Vector{Tuple{Int32, Float32}}(undef, n_connections)
for j in 1:n_connections
adj_aos[i][j] = (rand(UnitRange{Int32}(1:size)), rand(Float32))
end
sort!(adj_aos[i], by = x -> x[1])
end
# Copy the data into a structure of arrays
adj_soa = Vector{Connections}(undef, size)
for i in 1:size
conns = Connections([x[1] for x in adj_aos[i]], [x[2] for x in adj_aos[i]])
adj_soa[i] = conns
end
return Graph(state, adj_aos, adj_soa)
end
# Get the energy for a given state and connections
function getE_aos(state, connections)
e = 0f0
@inbounds @simd for conns_idx in eachindex(connections)
c_idx = connections[conns_idx][1]
weight = connections[conns_idx][2]
e += -state[c_idx]*weight
end
return e
end
# Get the energy for a given state and connections
function getE_soa(state, connections)
e = 0f0
idxs = connections.idxs
weights = connections.weights
@turbo for conns_idx in eachindex(idxs)
idx = idxs[conns_idx]
weight = weights[conns_idx]
e += -state[idx]*weight
end
return e
end
# Main loop AOS
function mainloop_aos(graph, iterations)
etot = 0.
@inbounds for _ in 1:iterations
idx = rand(1:length(graph.state))
etot += graph.state[idx]*getE_aos(graph.state, graph.adj_aos[idx])
end
return etot
end
# Main loop SOA
function mainloop_soa(graph, iterations)
etot = 0.
@inbounds for _ in 1:iterations
idx = rand(1:length(graph.state))
etot += graph.state[idx]*getE_soa(graph.state, graph.adj_soa[idx])
end
return etot
end
const g1000 = Graph(500^2, 1000)
Now I find that getE_soa is 2x as fast as getE_aos, but when benchmarking the mainloop functions, they’re about the same speed, even though basically all of the computation time per loop should be taken by evaluating the getE function. Can anybody help me interpret these results?
I’ve written posts about this before, here for example I provided a program more close to what I’m actually working with. For my actual code, using the SOA gives me actually worse performance generally, though for this example they seem to perform about the same, so I’m hoping it’s because of the same thing. I didn’t get any responses on my previous post, presumably because the code was not exactly minimal. Hopefully this example suffices.