Understanding the performance and overhead of a vector of SOA vs a vector of AOS for SIMD, and the effect of push!

I’ve been working on an interactive simulation tool for Monte Carlo simulations of Ising(-like) models. Performance is key, and it is something I’ve been a bit confused by. I made a very simplified version of the program here, which can be run as-is.

My main data structure is a graph represented by a state vector, holding scalars, and an adjacency list, representing the connectivity of the graph. It needs to be fast to iterate over all the connections of a vertex, so I’ve been investigating the best way to represent the adjacency list. So far I’ve found that a Vector{Vector{Tuple{Int32, Float32}}} is quite fast: the first index is the vertex index, and the inner vector can then be iterated over quickly in the inner loop.
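
For concreteness, this is roughly how such an adjacency list can be constructed (an illustrative helper, not the exact code from the linked repo): every vertex gets k random neighbours with random coupling weights.

# Build a random AOS-style adjacency list: nv vertices, k connections each.
function build_adj_aos(nv::Integer, k::Integer)
    adj = Vector{Vector{Tuple{Int32,Float32}}}(undef, nv)
    for v in 1:nv
        # k (neighbour index, weight) pairs per vertex
        adj[v] = [(rand(Int32(1):Int32(nv)), rand(Float32)) for _ in 1:k]
    end
    return adj
end

state   = rand(Float32, 1024)
adj_aos = build_adj_aos(1024, 8)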

The only function in my loop that uses the adjacency list is one that gathers part of the energy:

@inline function getEFactor(state::Vector{Float32}, connections::Vector{Tuple{Int32,Float32}})
  efac = 0f0
  @inbounds @simd for loop_index in eachindex(connections)
    conn_idx = first(connections[loop_index])
    weight = last(connections[loop_index])
    efac += state[conn_idx]*weight
  end
  return efac
end

The @simd macro works here (it gives better performance than without it), but representing the adjacency list this way goes against the common wisdom that an SOA should work better for SIMD than an AOS, and this layout makes @turbo from LoopVectorization.jl unusable.

I tried instead to define the following:

struct Connections
  idxs::Vector{Int32}
  weights::Vector{Float32}
end

# Wrap a vector of Connections so the whole adjacency list is an AbstractVector
struct AdjList{T <: Connections} <: AbstractVector{T}
  data::Vector{T}
end
Base.size(a::AdjList) = size(a.data)
Base.@propagate_inbounds Base.getindex(a::AdjList, i::Int) = a.data[i]

@inline function getEFactor(state, connections::Connections)
  # gather energy factor with @turbo over idxs/weights (see the sketch below)
end
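
The body of that second method is essentially the following (a sketch assuming LoopVectorization.jl is loaded; the exact version is in the linked repo):

using LoopVectorization

@inline function getEFactor(state, connections::Connections)
  efac = 0f0
  idxs, weights = connections.idxs, connections.weights
  # @turbo vectorizes the gather state[idxs[i]] together with the multiply-add
  @turbo for i in eachindex(idxs)
    efac += state[idxs[i]] * weights[i]
  end
  return efac
end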

Benchmarking both functions in isolation, the SOA layout is the clear winner. For a graph with 8 connections per vertex I get:

@benchmark getEFactor(state, $adj_aos[1])
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.166 ns … 21.334 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.250 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.284 ns ±  0.267 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▄     █     ▇     ▂    ▃     ▄                        ▁
  ▇▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▆▁▁▁▁▁▆▁▁▁▁█ █
  5.17 ns      Histogram: log(frequency) by time     5.58 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark getEFactor(state, $adj_soa[1])
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.834 ns … 23.833 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.958 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.979 ns ±  0.371 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▆      █     ▄      ▁     ▄      ▂           ▁ ▁
  ▄▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁█▁▁▁▁▁▇▁▁▁▁▁█ █
  3.83 ns      Histogram: log(frequency) by time     4.21 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

And the performance difference grows as the connectivity increases; for 440 connections:

@benchmark getEFactor(state, $adj_aos[1])
 BenchmarkTools.Trial: 10000 samples with 446 evaluations.
   Range (min … max):  231.036 ns … 471.787 ns  ┊ GC (min … max): 0.00% … 0.00%
   Time  (median):     231.594 ns               ┊ GC (median):    0.00%
   Time  (mean ± σ):   234.491 ns ±   7.250 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  
    █▆ ▁▁ ▁  ▁▁  ▃▃▃▂▂▁            ▁▁                             ▁
    ████████████████████████▇███▆▇▇██▇▇▆▆▆▅▅▅▅▅▄▅▆▅▃▄▄▅▄▅▅▅▄▅▆▆▆▆ █
    231 ns        Histogram: log(frequency) by time        260 ns <
  
   Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark getEFactor(state, $adj_soa[1])
 BenchmarkTools.Trial: 10000 samples with 919 evaluations.
   Range (min … max):  114.753 ns … 192.419 ns  ┊ GC (min … max): 0.00% … 0.00%
   Time  (median):     114.980 ns               ┊ GC (median):    0.00%
   Time  (mean ± σ):   116.215 ns ±   3.145 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  
    █▇▁▁▁  ▁▁▁▁▁  ▂▃▂▁▁                ▁▂                         ▁
    █████████████████████▇▇▇████▇▇▇▇▇▇▆██▆▇▇▇▅▆▅▄▄▅▅▃▄▃▅▄▅▃▄▃▄▆▅▄ █
    115 ns        Histogram: log(frequency) by time        127 ns <
  
   Memory estimate: 0 bytes, allocs estimate: 0.

So basically a 2x speed-up. This is all good and expected, but here comes my confusion: depending on the connectivity, real-world performance in my program can actually be worse with the SOA method. I’m quite confused as to why, because the only place the adjacency list is used is in the function above. My main algorithm looks approximately like this:

mutable struct Sim{G}
  const graph::G
  shouldrun::Bool
  updates::Int64
end
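
For context, the graph type in the simplified version looks roughly like this (the name here is my placeholder; only the state and adj fields are assumed by the code below):

# Minimal sketch of the graph type: a state vector plus an adjacency list.
struct IsingGraph{A}
  state::Vector{Float32}
  adj::A   # Vector{Vector{Tuple{Int32,Float32}}} (AOS) or Vector{Connections} (SOA)
end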

function mainMetropolisLoop(sim, state, adjlist::T) where T # annotate to force specialization
  statelength = length(state)
  while sim.shouldrun
    idx = rand(UnitRange{Int32}(1:statelength))
    # Basic Metropolis algo
    # ...

    efactor = getEFactor(state, adjlist[idx])

    # change the state stochastically based on efactor
    # ...

    sim.updates += 1
  end
end

function startsim(sim)
  graph = sim.graph
  state = graph.state
  adj = graph.adj

  mainMetropolisLoop(sim, state, adj)
end

Threads.@spawn startsim(sim)

To benchmark the energy function I use @benchmark from BenchmarkTools. To benchmark the actual loop, I increment a counter every iteration of the while loop, let the loop run for a fixed amount of time, and then quit out of it.
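
In essence the loop benchmark boils down to something like this (an illustrative helper, not the exact code from the repo):

function benchmarkloop(sim; seconds = 5.0)
  sim.updates   = 0
  sim.shouldrun = true
  task = Threads.@spawn startsim(sim)   # run the Metropolis loop on another thread
  sleep(seconds)                        # let it run for a fixed wall-clock time
  sim.shouldrun = false                 # signal the loop to stop
  wait(task)
  return sim.updates / seconds          # updates per second
end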

I ran multiple tests for different connectivities of the graph, which can be found here, with the comments showing the results on my machine.

Short summary of the results:

- Evaluating the energy function in isolation is always faster with @turbo and the SOA layout, and the relative speed-up grows monotonically with the connectivity, from about ~1.33x for small connectivity up to the ~2x shown above.
- Actually running the loop gives worse results: for small connectivity the AOS version is about ~1.3x faster. For higher connectivity the SOA becomes faster again, peaking at 48 connections at roughly ~1.33x, and then dropping back towards 1x at a connectivity of 440.

So somewhere there is overhead that doesn’t seem to come from the energy function, which is the only function in the loop that actually uses the adjlist. Where is it, is it fixable, and how can I find out myself where these kinds of problems lie?

Then there is another thing I don’t completely understand, which may be related. I made a topic about this earlier, but at the time I didn’t really understand the problem. The old topic is here. If I push! to the internal vectors, presumably the data has to be relocated. I made a function that pushes to all the internal vectors and then deletes the entry again, leaving the data unchanged but presumably relocating the internal vectors’ buffers. This can have quite an impact on performance.
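
The push-then-delete step is essentially the following (a sketch for the AOS layout; the SOA version pushes to idxs and weights instead, and the real code is in the repo):

# Push a dummy entry onto every inner vector and remove it again.
# The contents are unchanged, but push! may have reallocated the inner buffers.
function push_and_delete!(adj::Vector{Vector{Tuple{Int32,Float32}}})
  for conns in adj
    push!(conns, (Int32(1), 0f0))
    pop!(conns)
  end
  return adj
end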

If I then make a deepcopy of all the internal data and copy it back into the graph struct, I get much better performance (almost 2x for small connectivity with the AOS approach). I don’t completely understand how this can have such a big impact on performance, but I probably don’t know enough about how Julia handles its data structures internally. I’ve also included tests for this, and it seems to have non-trivial interactions with the SOA and AOS approaches. Any comments on this would be appreciated as well.
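
The deepcopy "repacking" step amounts to something like this (again a sketch; repack! is a name I’m using here, and g.adj is the adjacency-list field from above):

# Deep-copy the adjacency list and overwrite the existing entries, so every
# inner vector ends up in freshly allocated memory.
function repack!(g)
  fresh = deepcopy(g.adj)
  g.adj .= fresh
  return g
end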

My apologies, it seems I have been a bit sloppy; I was trying to write this up quickly, in between a lot of other activities. The first function above, getEFactor, should of course return the float efac. Also, I pushed the latest version of my code, so everything should work immediately. The file code.jl contains all the relevant functions, and the tests can be run after including that file, which should reproduce my findings directly.

I’m really hoping someone can give me some insight into why there is such a big discrepancy between the performance of the getEFactor function and the performance of the actual simulation loop. For high connectivity, most of the time of a single loop iteration should be spent evaluating this function, so I would expect to see a similar performance ratio, which, as my results show, is not the case at all.