Struct of Arrays (SoA) vs Array of Structs (AoS)

e3c6 · October 17, 2019, 8:34am

There are two ways to arrange data in memory. SoA means Struct of Arrays and AoS means Array of Structs. I’m trying to understand when one is better than the other in Julia.

I was reading @ChrisRackauckas fantastic lectures (The Different Flavors of Parallelism), where he has an example of this.
We consider two ways of storing complex numbers in memory.

struct ComplexAoS
  real::Float64
  imag::Float64
end
struct ComplexSoA
  real::Vector{Float64}
  imag::Vector{Float64}
end

N = 1000
arrAoS = [ComplexAoS(rand(),rand()) for i in 1:N]
arrSoA = ComplexSoA(rand(N),rand(N))

Our goal is to compute the total norm of all the numbers. Depending on the data layout in memory, we can use two methods for this:

function normSoA(x)
  out = 0.0
  @simd for i in 1:length(x.real)
    @inbounds out += sqrt(x.real[i]^2 + x.imag[i]^2)
  end
  out
end

function normAoS(x)
  out = 0.0
  @simd for i in 1:length(x)
    @inbounds out += sqrt(x[i].real^2 + x[i].imag^2)
  end
  out
end

The idea here (if I understood Chris’ explanation correctly) is that since the operations we are using use the real and imaginary parts of each complex together, the AoS format is more memory aligned and should be faster.

However, this is not what benchmarking shows.

julia> using BenchmarkTools

julia> @btime normAoS($arrAoS);
  1.719 μs (0 allocations: 0 bytes)

julia> @btime normSoA($arrSoA);
  875.167 ns (0 allocations: 0 bytes)
80.5214449364876

Surprisingly the SoA approach is faster.

Note that I have made some modifications compared to Chris’ original code based on the comments in this thread.

So what is going on here? Shouldn’t AoS be faster in this example?

baggepinnen · October 17, 2019, 10:14am

You need @inbounds to make sure the loops simd properly to see the true difference. When simd is possible for one of them but not for the other, then you’ll see quite the difference

cortner · October 17, 2019, 11:16am

Also make sure to interpolate the argument

julia> @btime normSoA($arrSoA)

Elrod · October 17, 2019, 11:35am

Struct of Arrays are great when you want to SIMD across loop iterations.
That is, if your code is written as a loop that does some calculations with structs, the Struct of Arrays memory layout allows the compiler to calculate multiple iterations of the loop at a time. This is because it is straight forward to turn each access to a number into a load of an entire SIMD vector of them, etc for each subsequent operation.

This vectorization requires @inbounds to eliminate branches. It may require @simd for or even @simd ivdep for to vectorize.
You can check if it has been vectorized with @code_llvm.

e3c6 · October 17, 2019, 12:28pm

@baggepinnen @cortner @Elrod Thanks for your suggestions. Please see edited code example in the original post (Struct of Arrays (SoA) vs Array of Structs (AoS)). (I also tried the ivdep modifier for @simd but it makes no difference).

Now the benchmarking shows that the SoA approach is significantly faster than the AoS. I find this very surprising. I hope someone can shed some light on this.

DNF · October 17, 2019, 12:33pm

Did you check that they return the same answer? How can you have simd with an accumulator, like you have?

kristoffer.carlsson · October 17, 2019, 12:33pm

Should be rand(1000) there?

e3c6 · October 17, 2019, 12:35pm

Yes, fixed. Thanks.

kristoffer.carlsson · October 17, 2019, 12:35pm

It already got answered? Struct of Arrays (SoA) vs Array of Structs (AoS) - #4 by Elrod

e3c6 · October 17, 2019, 12:37pm

They use different random data.

Hmmm yes, good point. I just took this example from the lectures. Maybe we can think of a better example?
Though actually the example of sum (also in the lectures) seems to take advantage of SIMD, even though it is also an accumulator.

e3c6 · October 17, 2019, 12:39pm

No, because here each loop iteration should be using the fields of a single struct together. So I don’t see why SoA is better here. But maybe I’m missing something.

kristoffer.carlsson · October 17, 2019, 12:46pm

Because you can unroll the loop by a factor of 4 and do:

i = 1
while not_done
    tmp_x1 = x.real[i:i+3]
    tmp_x2 = tmp_1.^2
    tmp_y1 = x.imag[i:i+3]
    tmp_y2 = tmp_2.^2
    tmp_xy = tmp_x2 + tmp_y2
    tmp_xy_sqrt = sqrt(tmp_xy)
    out += tmp_xy_sqrt
    i += 4
end

where each of these operations is a SIMD instruction operating on 4 values at the same time.

If you have x and y intertwined in memory you cannot easily load a chunk from the real part and a chunk from the imaginary part directly into a “SIMD lane” and start operating on them independently.

ChrisRackauckas · October 17, 2019, 12:49pm

Interesting. I’ve seen the compiler go the other way on this for more complicated examples. Did it just not find out how to unroll it? Seeing it like this I see that you can make SoA always better if you put the time to make sure that it’s always vectorizing, but you just can expect it to auto-vectorize beyond a certain point?

DNF · October 17, 2019, 12:51pm

I should have said, the same answer with and without @simd.

Elrod · October 17, 2019, 1:34pm

Yes, I do think that when optimizing code SoA is better much more often than not.
It’s a hammer you can apply whenever your arrays are of length > 4; all you have to do is make sure the compiler is actually figuring it out.

Whenever the number of loop iterations is greater than the number of operations that can happen in parallel within a loop iteration, SoA is probably better.
Here, within a loop iteration, when we add tmp_x2 and tmp_y2, we can’t vectorize that, so AoS is pretty much stuck. Versus just using SoA and calculating 4 (or more!) tmp_x2s, tmp_y2s, and tmp_xys at a time ([i:i+3]).

If a loop iteration gets fancy enough so that maybe vectorization is possible, suddenly we have to start digging through everything that goes on, make sure we can actually load everything contiguously, and that there isn’t some hidden vectorization bottleneck. Or, make life easy, use SoA so that we know it’s possible to vectorize seemlessly, and then just check the @code_llvm (yuyichao strongly reccomends against @code_native; by actualy naming temporiaries, the llvm is definitely easier to read – added bonus is that discoure recognizes llvm as a language!).
A caveat is that maybe you’ll have to worry about register pressure if you’re getting really fancy.

There’s a great video somewhere of Keno talking about what he did with struct of arrays while working on Celeste.

julia> @code_llvm optimize=true debuginfo=:none normSoA(arrSoA)

; Function Attrs: uwtable
define double @julia_normSoA_16509(%jl_value_t addrspace(10)* nonnull align 8 dereferenceable(16)) #0 {
top:
  %1 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  %2 = bitcast %jl_value_t addrspace(11)* %1 to %jl_value_t addrspace(10)* addrspace(11)*
  %3 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %2, align 8
  %4 = addrspacecast %jl_value_t addrspace(10)* %3 to %jl_value_t addrspace(11)*
  %5 = bitcast %jl_value_t addrspace(11)* %4 to %jl_array_t addrspace(11)*
  %6 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %5, i64 0, i32 1
  %7 = load i64, i64 addrspace(11)* %6, align 8
  %8 = icmp sgt i64 %7, 0
  %9 = select i1 %8, i64 %7, i64 0
  %10 = add nsw i64 %9, -1
  %11 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %10, i64 1)
  %12 = extractvalue { i64, i1 } %11, 1
  br i1 %12, label %L18, label %L23

L18:                                              ; preds = %top
  call void @julia_throw_overflowerr_binaryop_12453(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 69001352 to %jl_value_t*) to %jl_value_t addrspace(10)*), i64 %10, i64 1)
  call void @llvm.trap()
  unreachable

L23:                                              ; preds = %top
  %13 = extractvalue { i64, i1 } %11, 0
  %14 = icmp slt i64 %13, 1
  br i1 %14, label %L67, label %L30.lr.ph

L30.lr.ph:                                        ; preds = %L23
  %15 = addrspacecast %jl_value_t addrspace(10)* %3 to %jl_value_t addrspace(11)*
  %16 = bitcast %jl_value_t addrspace(11)* %15 to double addrspace(13)* addrspace(11)*
  %17 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %16, align 8
  %18 = bitcast %jl_value_t addrspace(11)* %1 to i8 addrspace(11)*
  %19 = getelementptr inbounds i8, i8 addrspace(11)* %18, i64 8
  %20 = bitcast i8 addrspace(11)* %19 to %jl_value_t addrspace(10)* addrspace(11)*
  %21 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %20, align 8
  %22 = addrspacecast %jl_value_t addrspace(10)* %21 to %jl_value_t addrspace(11)*
  %23 = bitcast %jl_value_t addrspace(11)* %22 to double addrspace(13)* addrspace(11)*
  %24 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %23, align 8
  %min.iters.check = icmp ult i64 %9, 16
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L30.lr.ph
  %n.vec = and i64 %9, 9223372036854775792
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %57, %vector.body ]
  %vec.phi22 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %58, %vector.body ]
  %vec.phi23 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %59, %vector.body ]
  %vec.phi24 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %60, %vector.body ]
  %25 = getelementptr inbounds double, double addrspace(13)* %17, i64 %index
  %26 = bitcast double addrspace(13)* %25 to <4 x double> addrspace(13)*
  %wide.load = load <4 x double>, <4 x double> addrspace(13)* %26, align 8
  %27 = getelementptr double, double addrspace(13)* %25, i64 4
  %28 = bitcast double addrspace(13)* %27 to <4 x double> addrspace(13)*
  %wide.load25 = load <4 x double>, <4 x double> addrspace(13)* %28, align 8
  %29 = getelementptr double, double addrspace(13)* %25, i64 8
  %30 = bitcast double addrspace(13)* %29 to <4 x double> addrspace(13)*
  %wide.load26 = load <4 x double>, <4 x double> addrspace(13)* %30, align 8
  %31 = getelementptr double, double addrspace(13)* %25, i64 12
  %32 = bitcast double addrspace(13)* %31 to <4 x double> addrspace(13)*
  %wide.load27 = load <4 x double>, <4 x double> addrspace(13)* %32, align 8
  %33 = getelementptr inbounds double, double addrspace(13)* %24, i64 %index
  %34 = bitcast double addrspace(13)* %33 to <4 x double> addrspace(13)*
  %wide.load28 = load <4 x double>, <4 x double> addrspace(13)* %34, align 8
  %35 = getelementptr double, double addrspace(13)* %33, i64 4
  %36 = bitcast double addrspace(13)* %35 to <4 x double> addrspace(13)*
  %wide.load29 = load <4 x double>, <4 x double> addrspace(13)* %36, align 8
  %37 = getelementptr double, double addrspace(13)* %33, i64 8
  %38 = bitcast double addrspace(13)* %37 to <4 x double> addrspace(13)*
  %wide.load30 = load <4 x double>, <4 x double> addrspace(13)* %38, align 8
  %39 = getelementptr double, double addrspace(13)* %33, i64 12
  %40 = bitcast double addrspace(13)* %39 to <4 x double> addrspace(13)*
  %wide.load31 = load <4 x double>, <4 x double> addrspace(13)* %40, align 8
  %41 = fmul <4 x double> %wide.load, %wide.load
  %42 = fmul <4 x double> %wide.load25, %wide.load25
  %43 = fmul <4 x double> %wide.load26, %wide.load26
  %44 = fmul <4 x double> %wide.load27, %wide.load27
  %45 = fmul <4 x double> %wide.load28, %wide.load28
  %46 = fmul <4 x double> %wide.load29, %wide.load29
  %47 = fmul <4 x double> %wide.load30, %wide.load30
  %48 = fmul <4 x double> %wide.load31, %wide.load31
  %49 = fadd <4 x double> %41, %45
  %50 = fadd <4 x double> %42, %46
  %51 = fadd <4 x double> %43, %47
  %52 = fadd <4 x double> %44, %48
  %53 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %49)
  %54 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %50)
  %55 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %51)
  %56 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %52)
  %57 = fadd fast <4 x double> %vec.phi, %53
  %58 = fadd fast <4 x double> %vec.phi22, %54
  %59 = fadd fast <4 x double> %vec.phi23, %55
  %60 = fadd fast <4 x double> %vec.phi24, %56
  %index.next = add i64 %index, 16
  %61 = icmp eq i64 %index.next, %n.vec
  br i1 %61, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd fast <4 x double> %58, %57
  %bin.rdx32 = fadd fast <4 x double> %59, %bin.rdx
  %bin.rdx33 = fadd fast <4 x double> %60, %bin.rdx32
  %rdx.shuf = shufflevector <4 x double> %bin.rdx33, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx34 = fadd fast <4 x double> %bin.rdx33, %rdx.shuf
  %rdx.shuf35 = shufflevector <4 x double> %bin.rdx34, <4 x double> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx36 = fadd fast <4 x double> %bin.rdx34, %rdx.shuf35
  %62 = extractelement <4 x double> %bin.rdx36, i32 0
  %cmp.n = icmp eq i64 %9, %n.vec
  br i1 %cmp.n, label %L67, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %L30.lr.ph
  %bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %L30.lr.ph ]
  %bc.merge.rdx = phi double [ %62, %middle.block ], [ 0.000000e+00, %L30.lr.ph ]
  br label %L30

L30:                                              ; preds = %scalar.ph, %L30
  %value_phi218 = phi i64 [ %bc.resume.val, %scalar.ph ], [ %72, %L30 ]
  %value_phi17 = phi double [ %bc.merge.rdx, %scalar.ph ], [ %71, %L30 ]
  %63 = getelementptr inbounds double, double addrspace(13)* %17, i64 %value_phi218
  %64 = load double, double addrspace(13)* %63, align 8
  %65 = getelementptr inbounds double, double addrspace(13)* %24, i64 %value_phi218
  %66 = load double, double addrspace(13)* %65, align 8
  %67 = fmul double %64, %64
  %68 = fmul double %66, %66
  %69 = fadd double %67, %68
  %70 = call double @llvm.sqrt.f64(double %69)
  %71 = fadd fast double %value_phi17, %70
  %72 = add nuw nsw i64 %value_phi218, 1
  %73 = icmp ult i64 %72, %13
  br i1 %73, label %L30, label %L67

L67:                                              ; preds = %L30, %middle.block, %L23
  %value_phi6 = phi double [ 0.000000e+00, %L23 ], [ %71, %L30 ], [ %62, %middle.block ]
  ret double %value_phi6
}

The exciting bit:

  %41 = fmul <4 x double> %wide.load, %wide.load
  %42 = fmul <4 x double> %wide.load25, %wide.load25
  %43 = fmul <4 x double> %wide.load26, %wide.load26
  %44 = fmul <4 x double> %wide.load27, %wide.load27
  %45 = fmul <4 x double> %wide.load28, %wide.load28
  %46 = fmul <4 x double> %wide.load29, %wide.load29
  %47 = fmul <4 x double> %wide.load30, %wide.load30
  %48 = fmul <4 x double> %wide.load31, %wide.load31
  %49 = fadd <4 x double> %41, %45
  %50 = fadd <4 x double> %42, %46
  %51 = fadd <4 x double> %43, %47
  %52 = fadd <4 x double> %44, %48
  %53 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %49)
  %54 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %50)
  %55 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %51)
  %56 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %52)
  %57 = fadd fast <4 x double> %vec.phi, %53
  %58 = fadd fast <4 x double> %vec.phi22, %54
  %59 = fadd fast <4 x double> %vec.phi23, %55
  %60 = fadd fast <4 x double> %vec.phi24, %56

With all the <4 x double> (SIMD vector of 4 Float64s) that it’s doing what kristoffer.carlsson said, except that it unrolled that loop yet again by another factor of 4 (so it’s doing 4 copies of what he wrote) to also take advantage of OOO (out of order) execution.

e3c6 · October 17, 2019, 3:06pm

That’s very interesting. Thanks @kristoffer.carlsson & @Elrod

e3c6 · November 3, 2019, 9:59am

By the way, I was satisfied to find that StructArrays seems to facilitate the SIMD optimizations.

using StructArrays 

s = StructArray(ComplexAoS(rand(),rand()) for i in 1:N);

julia> @btime normAoS($arrAoS);
  2.076 μs (0 allocations: 0 bytes)

julia> @btime normSoA($arrSoA);
  1.142 μs (0 allocations: 0 bytes)

julia> @btime normSoA($s);
  1.229 μs (0 allocations: 0 bytes)

Edit: I’d be curious to see if someone who understands @code_llvm better than me saw the output produced by the StructArrays construct and checked that the SIMD optimizations were indeed there.

atrabattoni · March 1, 2020, 2:41pm

I’ve been playing with the example @e3c6 proposed but instead of computing the total norm I computed the individual norm:

function normSoA(x)
  out = Array{Float64,1}(undef, length(x.real))
  for i in 1:length(x.real)
    @inbounds out[i] = sqrt(x.real[i]^2 + x.imag[i]^2)
  end
  out
end

function normAoS(x)
  out = Array{Float64,1}(undef, length(x))
  for i in 1:length(x)
    @inbounds out[i] = sqrt(x[i].real^2 + x[i].imag^2)
  end
  out
end

In that case benchmarking give the exact same results for both cases. Using code_llvm shows that both codes use <4 x double> SIMD instructions.

Not sure how to interpret those findings but I wanted to share it with you. I’m a new Julia user.

yewalenikhil65 · February 19, 2022, 5:14pm

Interesting, @e3c6 I understand this is an old post, but i ran the same code you posted

and i get almost same benchmark time. No difference !

julia> N = 100000;

julia> arrAoS = [ComplexAoS(rand(),rand()) for i in 1:N];

julia> arrSoA = ComplexSoA(rand(N),rand(N));

julia> @btime normAoS($arrAoS);
  111.503 μs (0 allocations: 0 bytes)

julia> @btime normSoA($arrSoA);
  111.425 μs (0 allocations: 0 bytes)

pitsianis · February 26, 2022, 1:34pm

I have been following this discussion with great interest. The explanations provided by @Elrod and @kristoffer.carlsson make sense.

In the special case that the struct is a vector of length d, we can express our data as a two dimensional array, Nxd or dxN.

So I extended the above benchmark test to include the two cases above. In my computer, the dxN case which places the elements of the struct contiguous in memory, ends up being substantially slower than the Nxd version.

I find it surprising that the compiler vectorizes the AoS as well as the Nxd, but it fails to do the same on the dxN array instance.

Topic		Replies	Views
Bad performance looping over SOAs with SIMD General Usage	3	357	July 10, 2023
Can I make my code faster with parallelism, or just plain better coding? Performance	33	2781	February 19, 2019
Julia style? Arrays/matrix vs structs General Usage	19	3881	December 18, 2019
StructArrays Memory Layout New to Julia question , structarrays	14	1442	July 29, 2021
SOA transform Performance	8	406	October 14, 2021

Struct of Arrays (SoA) vs Array of Structs (AoS)

Related topics