Struct of Arrays (SoA) vs Array of Structs (AoS)

You need @inbounds to make sure the loops SIMD properly in order to see the true difference. When SIMD is possible for one of them but not for the other, you’ll see quite a difference.

Also make sure to interpolate the argument:

julia> @btime normSoA($arrSoA)

Struct of Arrays is great when you want to SIMD across loop iterations.
That is, if your code is written as a loop that does some calculations with structs, the Struct of Arrays memory layout allows the compiler to compute multiple iterations of the loop at a time. This is because it is straightforward to turn each access to a number into a load of an entire SIMD vector of them, and so on for each subsequent operation.

This vectorization requires @inbounds to eliminate branches. It may require @simd for or even @simd ivdep for to vectorize.
You can check if it has been vectorized with @code_llvm.

@baggepinnen @cortner @Elrod Thanks for your suggestions. Please see the edited code example in the original post (Struct of Arrays (SoA) vs Array of Structs (AoS)). (I also tried the ivdep modifier for @simd but it makes no difference.)

Now the benchmarking shows that the SoA approach is significantly faster than the AoS. I find this very surprising. I hope someone can shed some light on this.
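
For reference, here is a minimal sketch of what the benchmarked definitions presumably look like. The struct names, the real/imag fields, and the accumulator loops are inferred from how they are used elsewhere in this thread, not copied from the original post:

```julia
# Sketch of the benchmark setup inferred from this thread (field names
# and loop bodies are assumptions, not the original code).

struct ComplexAoS            # one struct per element: fields interleaved in memory
    real::Float64
    imag::Float64
end

struct ComplexSoA            # one array per field: each field contiguous in memory
    real::Vector{Float64}
    imag::Vector{Float64}
end

function normAoS(x)
    out = 0.0
    @inbounds @simd for i in 1:length(x)
        out += sqrt(x[i].real^2 + x[i].imag^2)
    end
    out
end

function normSoA(x)
    out = 0.0
    @inbounds @simd for i in 1:length(x.real)
        out += sqrt(x.real[i]^2 + x.imag[i]^2)
    end
    out
end

N = 1000
arrAoS = [ComplexAoS(rand(), rand()) for i in 1:N]
arrSoA = ComplexSoA(rand(N), rand(N))
```

With definitions like these, @btime normAoS($arrAoS) and @btime normSoA($arrSoA) from BenchmarkTools reproduce the comparison discussed here.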

Did you check that they return the same answer? How can you have SIMD with an accumulator like you have?

Should be rand(1000) there?

Yes, fixed. Thanks.

It already got answered? Struct of Arrays (SoA) vs Array of Structs (AoS) - #4 by Elrod

They use different random data.

Hmmm yes, good point. I just took this example from the lectures. Maybe we can think of a better example?
Though actually the example of sum (also in the lectures) seems to take advantage of SIMD, even though it is also an accumulator.
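
For instance (my own minimal sketch, not the lectures’ code), a reduction like this one still vectorizes, because @simd permits reassociating the floating-point accumulation so the compiler can keep several partial sums in vector lanes and combine them after the loop:

```julia
# An accumulator can still SIMD: @simd licenses reassociating the
# floating-point reduction, which is what the compiler needs to keep
# multiple partial sums in SIMD lanes.
function mysum(x)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    s
end

mysum([1.0, 2.0, 3.0])  # 6.0
```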

No, because here each loop iteration should be using the fields of a single struct together. So I don’t see why SoA is better here. But maybe I’m missing something.

Because you can unroll the loop by a factor of 4 and do:

i = 1
while not_done
    tmp_x1 = x.real[i:i+3]
    tmp_x2 = tmp_x1.^2
    tmp_y1 = x.imag[i:i+3]
    tmp_y2 = tmp_y1.^2
    tmp_xy = tmp_x2 + tmp_y2
    tmp_xy_sqrt = sqrt(tmp_xy)
    out += tmp_xy_sqrt
    i += 4
end

where each of these operations is a SIMD instruction operating on 4 values at the same time.

If you have x and y interleaved in memory you cannot easily load a chunk of the real parts and a chunk of the imaginary parts directly into SIMD lanes and start operating on them independently.
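
Here is a runnable version of the sketch above. The slicing lines in the sketch are pseudocode, so this assumes a container with real/imag vector fields and a length that is a multiple of 4 (it allocates temporaries for clarity; the compiler’s version keeps everything in registers):

```julia
# Runnable version of the unrolled pseudocode above. Assumes `x` has
# `real` and `imag` Vector{Float64} fields and length(x.real) % 4 == 0.
function norm_unrolled(x)
    out = 0.0
    i = 1
    n = length(x.real)
    @inbounds while i + 3 <= n
        tmp_x1 = x.real[i:i+3]      # load 4 contiguous real parts
        tmp_y1 = x.imag[i:i+3]      # load 4 contiguous imaginary parts
        tmp_x2 = tmp_x1 .^ 2
        tmp_y2 = tmp_y1 .^ 2
        tmp_xy = tmp_x2 .+ tmp_y2
        out += sum(sqrt.(tmp_xy))   # fold the 4 norms into the accumulator
        i += 4
    end
    out
end
```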

Interesting. I’ve seen the compiler go the other way on this for more complicated examples. Did it just not figure out how to unroll it? Seeing it like this, I see that you can make SoA always better if you put in the time to make sure it’s always vectorizing, but you just can’t expect it to auto-vectorize beyond a certain point?

I should have said, the same answer with and without @simd.

Yes, I do think that when optimizing code SoA is better much more often than not.
It’s a hammer you can apply whenever your arrays are of length > 4; all you have to do is make sure the compiler is actually figuring it out.

Whenever the number of loop iterations is greater than the number of operations that can happen in parallel within a loop iteration, SoA is probably better.
Here, within a loop iteration, when we add tmp_x2 and tmp_y2, we can’t vectorize that, so AoS is pretty much stuck. Versus just using SoA and calculating 4 (or more!) tmp_x2s, tmp_y2s, and tmp_xys at a time ([i:i+3]).

If a loop iteration gets fancy enough, then to know whether vectorization is still possible we suddenly have to start digging through everything that goes on, make sure we can actually load everything contiguously, and check that there isn’t some hidden vectorization bottleneck. Or, make life easy: use SoA so that we know it’s possible to vectorize seamlessly, and then just check the @code_llvm (yuyichao strongly recommends against @code_native; by actually naming temporaries, the LLVM is definitely easier to read, with the added bonus that Discourse recognizes LLVM as a language!).
A caveat is that maybe you’ll have to worry about register pressure if you’re getting really fancy.

There’s a great video somewhere of Keno talking about what he did with struct of arrays while working on Celeste.

julia> @code_llvm optimize=true debuginfo=:none normSoA(arrSoA)

; Function Attrs: uwtable
define double @julia_normSoA_16509(%jl_value_t addrspace(10)* nonnull align 8 dereferenceable(16)) #0 {
top:
  %1 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  %2 = bitcast %jl_value_t addrspace(11)* %1 to %jl_value_t addrspace(10)* addrspace(11)*
  %3 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %2, align 8
  %4 = addrspacecast %jl_value_t addrspace(10)* %3 to %jl_value_t addrspace(11)*
  %5 = bitcast %jl_value_t addrspace(11)* %4 to %jl_array_t addrspace(11)*
  %6 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %5, i64 0, i32 1
  %7 = load i64, i64 addrspace(11)* %6, align 8
  %8 = icmp sgt i64 %7, 0
  %9 = select i1 %8, i64 %7, i64 0
  %10 = add nsw i64 %9, -1
  %11 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %10, i64 1)
  %12 = extractvalue { i64, i1 } %11, 1
  br i1 %12, label %L18, label %L23

L18:                                              ; preds = %top
  call void @julia_throw_overflowerr_binaryop_12453(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 69001352 to %jl_value_t*) to %jl_value_t addrspace(10)*), i64 %10, i64 1)
  call void @llvm.trap()
  unreachable

L23:                                              ; preds = %top
  %13 = extractvalue { i64, i1 } %11, 0
  %14 = icmp slt i64 %13, 1
  br i1 %14, label %L67, label %L30.lr.ph

L30.lr.ph:                                        ; preds = %L23
  %15 = addrspacecast %jl_value_t addrspace(10)* %3 to %jl_value_t addrspace(11)*
  %16 = bitcast %jl_value_t addrspace(11)* %15 to double addrspace(13)* addrspace(11)*
  %17 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %16, align 8
  %18 = bitcast %jl_value_t addrspace(11)* %1 to i8 addrspace(11)*
  %19 = getelementptr inbounds i8, i8 addrspace(11)* %18, i64 8
  %20 = bitcast i8 addrspace(11)* %19 to %jl_value_t addrspace(10)* addrspace(11)*
  %21 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %20, align 8
  %22 = addrspacecast %jl_value_t addrspace(10)* %21 to %jl_value_t addrspace(11)*
  %23 = bitcast %jl_value_t addrspace(11)* %22 to double addrspace(13)* addrspace(11)*
  %24 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %23, align 8
  %min.iters.check = icmp ult i64 %9, 16
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L30.lr.ph
  %n.vec = and i64 %9, 9223372036854775792
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %57, %vector.body ]
  %vec.phi22 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %58, %vector.body ]
  %vec.phi23 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %59, %vector.body ]
  %vec.phi24 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %60, %vector.body ]
  %25 = getelementptr inbounds double, double addrspace(13)* %17, i64 %index
  %26 = bitcast double addrspace(13)* %25 to <4 x double> addrspace(13)*
  %wide.load = load <4 x double>, <4 x double> addrspace(13)* %26, align 8
  %27 = getelementptr double, double addrspace(13)* %25, i64 4
  %28 = bitcast double addrspace(13)* %27 to <4 x double> addrspace(13)*
  %wide.load25 = load <4 x double>, <4 x double> addrspace(13)* %28, align 8
  %29 = getelementptr double, double addrspace(13)* %25, i64 8
  %30 = bitcast double addrspace(13)* %29 to <4 x double> addrspace(13)*
  %wide.load26 = load <4 x double>, <4 x double> addrspace(13)* %30, align 8
  %31 = getelementptr double, double addrspace(13)* %25, i64 12
  %32 = bitcast double addrspace(13)* %31 to <4 x double> addrspace(13)*
  %wide.load27 = load <4 x double>, <4 x double> addrspace(13)* %32, align 8
  %33 = getelementptr inbounds double, double addrspace(13)* %24, i64 %index
  %34 = bitcast double addrspace(13)* %33 to <4 x double> addrspace(13)*
  %wide.load28 = load <4 x double>, <4 x double> addrspace(13)* %34, align 8
  %35 = getelementptr double, double addrspace(13)* %33, i64 4
  %36 = bitcast double addrspace(13)* %35 to <4 x double> addrspace(13)*
  %wide.load29 = load <4 x double>, <4 x double> addrspace(13)* %36, align 8
  %37 = getelementptr double, double addrspace(13)* %33, i64 8
  %38 = bitcast double addrspace(13)* %37 to <4 x double> addrspace(13)*
  %wide.load30 = load <4 x double>, <4 x double> addrspace(13)* %38, align 8
  %39 = getelementptr double, double addrspace(13)* %33, i64 12
  %40 = bitcast double addrspace(13)* %39 to <4 x double> addrspace(13)*
  %wide.load31 = load <4 x double>, <4 x double> addrspace(13)* %40, align 8
  %41 = fmul <4 x double> %wide.load, %wide.load
  %42 = fmul <4 x double> %wide.load25, %wide.load25
  %43 = fmul <4 x double> %wide.load26, %wide.load26
  %44 = fmul <4 x double> %wide.load27, %wide.load27
  %45 = fmul <4 x double> %wide.load28, %wide.load28
  %46 = fmul <4 x double> %wide.load29, %wide.load29
  %47 = fmul <4 x double> %wide.load30, %wide.load30
  %48 = fmul <4 x double> %wide.load31, %wide.load31
  %49 = fadd <4 x double> %41, %45
  %50 = fadd <4 x double> %42, %46
  %51 = fadd <4 x double> %43, %47
  %52 = fadd <4 x double> %44, %48
  %53 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %49)
  %54 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %50)
  %55 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %51)
  %56 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %52)
  %57 = fadd fast <4 x double> %vec.phi, %53
  %58 = fadd fast <4 x double> %vec.phi22, %54
  %59 = fadd fast <4 x double> %vec.phi23, %55
  %60 = fadd fast <4 x double> %vec.phi24, %56
  %index.next = add i64 %index, 16
  %61 = icmp eq i64 %index.next, %n.vec
  br i1 %61, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd fast <4 x double> %58, %57
  %bin.rdx32 = fadd fast <4 x double> %59, %bin.rdx
  %bin.rdx33 = fadd fast <4 x double> %60, %bin.rdx32
  %rdx.shuf = shufflevector <4 x double> %bin.rdx33, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx34 = fadd fast <4 x double> %bin.rdx33, %rdx.shuf
  %rdx.shuf35 = shufflevector <4 x double> %bin.rdx34, <4 x double> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx36 = fadd fast <4 x double> %bin.rdx34, %rdx.shuf35
  %62 = extractelement <4 x double> %bin.rdx36, i32 0
  %cmp.n = icmp eq i64 %9, %n.vec
  br i1 %cmp.n, label %L67, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %L30.lr.ph
  %bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %L30.lr.ph ]
  %bc.merge.rdx = phi double [ %62, %middle.block ], [ 0.000000e+00, %L30.lr.ph ]
  br label %L30

L30:                                              ; preds = %scalar.ph, %L30
  %value_phi218 = phi i64 [ %bc.resume.val, %scalar.ph ], [ %72, %L30 ]
  %value_phi17 = phi double [ %bc.merge.rdx, %scalar.ph ], [ %71, %L30 ]
  %63 = getelementptr inbounds double, double addrspace(13)* %17, i64 %value_phi218
  %64 = load double, double addrspace(13)* %63, align 8
  %65 = getelementptr inbounds double, double addrspace(13)* %24, i64 %value_phi218
  %66 = load double, double addrspace(13)* %65, align 8
  %67 = fmul double %64, %64
  %68 = fmul double %66, %66
  %69 = fadd double %67, %68
  %70 = call double @llvm.sqrt.f64(double %69)
  %71 = fadd fast double %value_phi17, %70
  %72 = add nuw nsw i64 %value_phi218, 1
  %73 = icmp ult i64 %72, %13
  br i1 %73, label %L30, label %L67

L67:                                              ; preds = %L30, %middle.block, %L23
  %value_phi6 = phi double [ 0.000000e+00, %L23 ], [ %71, %L30 ], [ %62, %middle.block ]
  ret double %value_phi6
}

The exciting bit:

  %41 = fmul <4 x double> %wide.load, %wide.load
  %42 = fmul <4 x double> %wide.load25, %wide.load25
  %43 = fmul <4 x double> %wide.load26, %wide.load26
  %44 = fmul <4 x double> %wide.load27, %wide.load27
  %45 = fmul <4 x double> %wide.load28, %wide.load28
  %46 = fmul <4 x double> %wide.load29, %wide.load29
  %47 = fmul <4 x double> %wide.load30, %wide.load30
  %48 = fmul <4 x double> %wide.load31, %wide.load31
  %49 = fadd <4 x double> %41, %45
  %50 = fadd <4 x double> %42, %46
  %51 = fadd <4 x double> %43, %47
  %52 = fadd <4 x double> %44, %48
  %53 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %49)
  %54 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %50)
  %55 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %51)
  %56 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %52)
  %57 = fadd fast <4 x double> %vec.phi, %53
  %58 = fadd fast <4 x double> %vec.phi22, %54
  %59 = fadd fast <4 x double> %vec.phi23, %55
  %60 = fadd fast <4 x double> %vec.phi24, %56

All those <4 x double> (SIMD vectors of 4 Float64s) show that it’s doing what kristoffer.carlsson said, except that it unrolled the loop yet again by another factor of 4 (so it’s doing 4 copies of what he wrote) to also take advantage of out-of-order (OOO) execution.

That’s very interesting. Thanks @kristoffer.carlsson & @Elrod

By the way, I was satisfied to find that StructArrays seems to facilitate the SIMD optimizations.

using StructArrays

# ComplexAoS and N as defined earlier in the thread
s = StructArray(ComplexAoS(rand(), rand()) for i in 1:N);

julia> @btime normAoS($arrAoS);
  2.076 μs (0 allocations: 0 bytes)

julia> @btime normSoA($arrSoA);
  1.142 μs (0 allocations: 0 bytes)

julia> @btime normSoA($s);
  1.229 μs (0 allocations: 0 bytes)

Edit: I’d be curious if someone who understands @code_llvm better than me could look at the output produced by the StructArrays version and check that the SIMD optimizations are indeed there.

I’ve been playing with the example @cossio proposed, but instead of computing the total norm I computed the individual norms:

function normSoA(x)
  out = Array{Float64,1}(undef, length(x.real))
  for i in 1:length(x.real)
    @inbounds out[i] = sqrt(x.real[i]^2 + x.imag[i]^2)
  end
  out
end

function normAoS(x)
  out = Array{Float64,1}(undef, length(x))
  for i in 1:length(x)
    @inbounds out[i] = sqrt(x[i].real^2 + x[i].imag^2)
  end
  out
end

In that case benchmarking gives the exact same results for both cases. Using @code_llvm shows that both versions use <4 x double> SIMD instructions.

Not sure how to interpret those findings but I wanted to share it with you. I’m a new Julia user. :slight_smile:

Interesting. @cossio, I understand this is an old post, but I ran the same code you posted

and I get almost the same benchmark time. No difference!

julia> N = 100000;

julia> arrAoS = [ComplexAoS(rand(),rand()) for i in 1:N];

julia> arrSoA = ComplexSoA(rand(N),rand(N));

julia> @btime normAoS($arrAoS);
  111.503 μs (0 allocations: 0 bytes)

julia> @btime normSoA($arrSoA);
  111.425 μs (0 allocations: 0 bytes)

I have been following this discussion with great interest. The explanations provided by @Elrod and @kristoffer.carlsson make sense.

In the special case that the struct is a vector of length d, we can express our data as a two-dimensional array, N×d or d×N.

So I extended the above benchmark test to include these two cases. On my computer, the d×N case, which places the elements of the struct contiguous in memory, ends up being substantially slower than the N×d version.

I find it surprising that the compiler vectorizes the AoS as well as the N×d case, but fails to do the same on the d×N array.
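
For concreteness, here is a sketch of what the two layouts might look like (function names and loop bodies are my guesses, since the extended benchmark code isn’t shown here). Julia arrays are column-major, so in the N×2 layout each field occupies a contiguous column (SoA-like), while in the 2×N layout each column holds one (real, imag) pair (AoS-like):

```julia
# Hypothetical matrix-layout variants of the norm benchmark
# (names invented for illustration; not the poster's actual code).
function norm_Nxd(a)   # a is N×2: column 1 = real parts, column 2 = imag parts
    out = 0.0
    @inbounds @simd for i in 1:size(a, 1)
        out += sqrt(a[i, 1]^2 + a[i, 2]^2)
    end
    out
end

function norm_dxN(a)   # a is 2×N: each column is one (real, imag) pair
    out = 0.0
    @inbounds @simd for i in 1:size(a, 2)
        out += sqrt(a[1, i]^2 + a[2, i]^2)
    end
    out
end
```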

The N×d (I think you meant 2×N) array case should be 100x faster because all memory accesses happen in the optimal order: the element a[1,i] is exactly contiguous to a[2,i], whereas a[i,1] and a[i,2] are at two distant locations.

PS: I made a mistake, it’s slower than SoA.