Struct of Arrays (SoA) vs Array of Structs (AoS)

The Nxd (I think you meant 2xN) array case should be 100x faster, because all memory accesses happen in the optimal order: the element a[1,i] is exactly contiguous to a[2,i], whereas a[i,1] and a[i,2] are at two distant locations.

PS: I made a mistake, it’s slower than SoA.

No. The 2xN case is the slowest. That was the point of my post.

1 Like

Have an example you can point to here?
The issue is probably that d is dynamic rather than known at compile time, while for the AoS layout it is at least known at compile time.
Knowing d at compile time allows for some fairly efficient shuffling to load/store the data.

If you want to vectorize a loop, the easiest way is to perform multiple loop iterations in parallel. That means what you want is a[i,1] and a[i+1,1] to be close in memory, and you also want a[i,2] and a[i+1,2] to be close in memory.
That is what you get from Nxd and from SoA.

With AoS, however, a[1,i] and a[1,i+1] aren’t close. But if d is known at compile time and the d “loop” or operations across d are unrolled, the compiler can use a sequence of shuffles/permutes/etc to transition from AoS to SoA (on load) and then back to AoS on store.
This of course adds a little overhead, but generally not too much when it works.

If the stride of the dxN array is not known at compile time, this is much less efficient and may prevent the optimization altogether.
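As a rough illustration (my sketch, not from the posts above), one way to give the compiler a compile-time d is to store each element as a statically sized tuple; the hypothetical norm_tuple below is then free to unroll across the two components and vectorize the i loop:

# Hypothetical sketch: d = 2 is encoded in the element type NTuple{2,Float64},
# so the compiler can unroll across the components and vectorize over i.
function norm_tuple(x::Vector{NTuple{2,Float64}})
  out = Vector{Float64}(undef, length(x))
  @inbounds for i in eachindex(x)
    out[i] = sqrt(x[i][1]^2 + x[i][2]^2)
  end
  return out
end

# e.g. norm_tuple([(rand(), rand()) for _ in 1:1000])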

4 Likes

Here is the complete test with all cases.

using BenchmarkTools, StructArrays

struct ComplexAoS
  real::Float64
  imag::Float64
end

struct ComplexSoA
  real::Vector{Float64}
  imag::Vector{Float64}
end

function normSoA(x)
  out = Array{Float64,1}(undef, length(x.real))
  @inbounds for i in 1:length(x.real)
    out[i] = sqrt(x.real[i]^2 + x.imag[i]^2)
  end
  out
end

function normAoS(x)
  out = Array{Float64,1}(undef, length(x))
  @inbounds for i in 1:length(x)
    out[i] = sqrt(x[i].real^2 + x[i].imag^2)
  end
  out
end

function norm_dN(x)
  n = size(x,2);
  out = Array{Float64,1}(undef, n)
  @inbounds for i in 1:n
    out[i] = sqrt(x[1,i]^2 + x[2,i]^2)
  end
  out
end

function norm_Nd(x)
  n = size(x,1);
  out = Array{Float64,1}(undef, n)
  @inbounds for i in 1:n
    out[i] = sqrt(x[i,1]^2 + x[i,2]^2)
  end
  out
end

N = 10_000_000;

arr_Nd = rand(N,2);
arr_dN = permutedims(arr_Nd);

arrAoS = [ComplexAoS(arr_Nd[i,1],arr_Nd[i,2]) for i in 1:N];
arrSoA = ComplexSoA(arr_Nd[:,1],arr_Nd[:,2]);

@btime normAoS($arrAoS);
@btime normSoA($arrSoA);
@btime norm_Nd($arr_Nd);
@btime norm_dN($arr_dN);

# [normAoS(arrAoS) normSoA(arrSoA) norm_Nd(arr_Nd) norm_dN(arr_dN)]
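As an aside (my sketch, not part of the benchmark above), the StructArrays package loaded at the top can provide the SoA layout while keeping AoS-style element access, so the same normAoS loop runs over SoA storage. This assumes the tuple-of-field-arrays constructor StructArray{T}((col1, col2)):

# Hypothetical addition: an SoA container with AoS-style indexing.
arrStructSoA = StructArray{ComplexAoS}((arr_Nd[:,1], arr_Nd[:,2]));
@btime normAoS($arrStructSoA);  # same code as the AoS version, SoA memory layout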

For better benchmarking, I suggest

  1. preallocating the result vector (see the sketch below); no need to involve the GC.
  2. using smaller arrays so that the memory involved fits in cache, unless you do want to focus on benchmarking memory bandwidth.
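For the first suggestion, a minimal sketch (the in-place normSoA! name and signature are mine, not from the code above):

# In-place variant: write into a preallocated output vector, so the benchmark
# measures the loop itself rather than allocation and GC.
function normSoA!(out, x)
  @inbounds for i in 1:length(x.real)
    out[i] = sqrt(x.real[i]^2 + x.imag[i]^2)
  end
  out
end

out = Vector{Float64}(undef, N);
@btime normSoA!($out, $arrSoA);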
4 Likes

What is the proper Julia way to declare a fixed value for d in this example?
I want to write the general code for all d but state that “for this run please specialize the code for d = 2 only”

It is subtle to do something like this right and to grasp the implications correctly, but I would go with:

# Generic function that does not specialize on 'd'
f(d::Int, other_parameters) = ...
# Method that compiles a version specialized for a specific 'd'
f(::Val{d}, other_parameters) where {d} = f(d, other_parameters)

function main()
    ...
    # now I want to call f specialized for a specific d
    f(Val(2), other_parameters)
    ...
end
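Applied to this thread's example (a hypothetical variant of norm_dN, not from the posts above), that pattern could look like:

# norm_dN with the dimension D lifted into the type domain via Val
function norm_dN_val(x, ::Val{D}) where {D}
  n = size(x, 2)
  out = Vector{Float64}(undef, n)
  @inbounds for i in 1:n
    out[i] = sqrt(sum(ntuple(d -> x[d, i]^2, Val(D))))
  end
  out
end

norm_dN_val(arr_dN, Val(2))  # compiles a method specialized for d = 2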
2 Likes

Note that if you have a dynamically sized array, you should probably also add an assert on the size, or something else to tell LLVM that the stride equals d.
Results from adding a D == stride(Y,2) || throw(...) check:

julia> @btime interleave0!($Y, Val(3)) # without the check
  261.521 ns (0 allocations: 0 bytes)
341.0932625746682

julia> @btime interleave1!($Y, Val(3)) # with the check
  146.241 ns (0 allocations: 0 bytes)
341.0932625746683

Nearly a 2x improvement from adding a check! =)

Full code:

julia> using BenchmarkTools

julia> function interleave0!(Y, ::Val{D}) where {D}
           s = 0.0
           @fastmath for i in axes(Y,2)
               s += sqrt(mapreduce(abs2, +, ntuple(d -> @inbounds(Y[d,i]), Val(D))))
           end
           return s
       end
interleave0! (generic function with 1 method)

julia> function interleave1!(Y, ::Val{D}) where {D}
           # same as interleave0, except we add a check
           D == stride(Y,2) || throw(ArgumentError("Unexpected stride value for Y!"))
           s = 0.0
           @fastmath for i in axes(Y,2)
               s += sqrt(mapreduce(abs2, +, ntuple(d -> @inbounds(Y[d,i]), Val(D))))
           end
           return s
       end
interleave1! (generic function with 1 method)

julia> Y = randn(3,200);

julia> @btime interleave0!($Y, Val(3)) # without an assert
  261.521 ns (0 allocations: 0 bytes)
341.0932625746682

julia> @btime interleave1!($Y, Val(3)) # with the assert
  146.241 ns (0 allocations: 0 bytes)
341.0932625746683

EDIT:
I initially used an @assert, but replaced it with the check & throw because at some point in the future, @asserts may gain the option to be disabled for performance.
If that change replaces @asserts with @llvm.assumes, that'd be fine. But if it just removes them, we'd suddenly get worse performance instead!
Hence, I figured I'd go with throw.

9 Likes

That is interesting: so the compiler is smart enough to deduce
that it can use the stride guaranteed by the throw?

1 Like

Yes.
LLVM actually generates nice-looking vector code for interleave0!, but behind a check that stride(Y,2) == 1. Oops.
Of course, here stride(Y,2) != 1, so that branch and its nice-looking vector code are never taken.

The throw/manual branch tells it that stride(Y,2) == D, so from there it generates much less nice-looking code that assumes/requires stride(Y,2) == D, but which is still much faster than the scalar code the first version has to take (even though the vector sqrt’s throughput doesn’t actually scale with vector size on my computer…).
Knowing stride(Y,2) == D lets it load contiguous vectors and then permute them to go from AoS to SoA efficiently.

If stride(Y,2) == 1, that of course isn’t necessary, which is why the code from interleave0! looks better at first glance, despite being unusable here and thus only adding a check and increasing code size.
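If you want to see this yourself, the generated code can be inspected directly with the standard reflection macros (general tooling, not specific to this thread):

# Compare the generated IR for the two variants; the first contains the
# stride(Y,2) == 1 fast path that is never taken for this Y.
@code_llvm debuginfo=:none interleave0!(Y, Val(3))
@code_llvm debuginfo=:none interleave1!(Y, Val(3))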

1 Like

Is this the planned behavior when @assert gets disabled?