Strange performance of a loop

The compiler only knows the type, not the size of the arrays, and therefore not the length of the loop. That is the main point of StaticArrays: the type itself contains the size information.

There are other properties of StaticArrays that help performance too, such as immutability and stack-allocation, while ordinary Arrays are mutable and heap-allocated.

Imagine we have a scalar operation we want to apply to an array element-wise (for instance, calculating the log() of each element, or its cube, etc.).

In C we would write a function that accepts a pointer to the array and an integer giving the number of elements in the array.
The function loops over the elements and applies the operation to each one.
In that case any modern compiler would do 3 things:

  1. Vectorize the operation.
  2. Apply loop unrolling.
  3. If the operation is complex enough (i.e., not memory-bound), multithread the code.

The code the compiler generates will be about as efficient as it can be, since this is a toy example.
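For concreteness, a minimal Julia sketch of such an element-wise kernel (applyop! is a hypothetical name, used only for illustration):

function applyop!(f, y::Vector{Float64}, x::Vector{Float64})
    # apply the scalar operation f to each element, e.g. f = log or x -> x^3
    @inbounds @simd for i in eachindex(x, y)
        y[i] = f(x[i])
    end
    return y
end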

Now, leaving multithreading aside, will Julia generate the same code as above for a function which accepts the array? Namely, with all the loop unrolling and vectorization?

Will the code generated for a static array be faster than the C code in the toy example above?
How could that be?

Not in general, no. If the length is a runtime parameter, then a C compiler has to generate generic code that is valid for any possible value.

Julia uses LLVM to generate code, which is the same backend that e.g. clang uses. So, modulo the exact optimization passes used, the same code will be generated. The relevant question is: for which inputs does the generated code for the function have to be valid? As an example, a small sum function:

function sum(x::AbstractVector{Float64})
    s = 0.0
    @inbounds @simd for i in 1:length(x)
        s += x[i]
    end
    return s
end

using StaticArrays # `using` (rather than `import`) is needed so SVector is available unqualified

@code_llvm sum(rand(8))
@code_llvm sum(rand(SVector{8}))

Looking at some of the code in the first example:

vector.ph:                                        ; preds = %min.iters.checked
  %18 = insertelement <2 x double> <double undef, double 0.000000e+00>, double %s.0.ph, i32 0
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <2 x double> [ %18, %vector.ph ], [ %23, %vector.body ]
  %vec.phi122 = phi <2 x double> [ zeroinitializer, %vector.ph ], [ %24, %vector.body ]
  %19 = getelementptr double, double* %14, i64 %index
  %20 = bitcast double* %19 to <2 x double>*
  %wide.load = load <2 x double>, <2 x double>* %20, align 8
  %21 = getelementptr double, double* %19, i64 2
  %22 = bitcast double* %21 to <2 x double>*
  %wide.load124 = load <2 x double>, <2 x double>* %22, align 8
  %23 = fadd fast <2 x double> %vec.phi, %wide.load
  %24 = fadd fast <2 x double> %vec.phi122, %wide.load124
  %index.next = add i64 %index, 4
  %25 = icmp eq i64 %index.next, %n.vec
  br i1 %25, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd fast <2 x double> %24, %23
  %rdx.shuf = shufflevector <2 x double> %bin.rdx, <2 x double> undef, <2 x i32> <i32 1, i32 undef>
  %bin.rdx127 = fadd fast <2 x double> %bin.rdx, %rdx.shuf
  %26 = extractelement <2 x double> %bin.rdx127, i32 0
  %cmp.n = icmp eq i64 %5, %n.vec
  br i1 %cmp.n, label %L11.outer.L11.outer.split_crit_edge.loopexit, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %min.iters.checked, %if12.lr.ph
  %bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %if12.lr.ph ], [ 0, %min.iters.checked ]
  %bc.merge.rdx = phi double [ %26, %middle.block ], [ %s.0.ph, %if12.lr.ph ], [ %s.0.ph, %min.iters.checked ]
  br label %if12

Since this code has to be valid no matter the length of the array, it unrolls the loop by a factor of 4 (two SIMD accumulators of two doubles each), does SIMD on those, and then has a scalar fallback for when the number of elements is not divisible by 4. Very nice and fast if the array is big, but there is some overhead if it is small.
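In Julia-level pseudocode, the structure of the generated code is roughly the following (a conceptual sketch of the transformation, not the actual lowering):

function sum_unrolled(x::Vector{Float64})
    n = length(x)
    s1 = s2 = s3 = s4 = 0.0
    i = 1
    # main vector loop: four running sums, mapped onto two <2 x double> SIMD accumulators
    @inbounds while i + 3 <= n
        s1 += x[i]
        s2 += x[i + 1]
        s3 += x[i + 2]
        s4 += x[i + 3]
        i += 4
    end
    s = (s1 + s2) + (s3 + s4)
    # scalar fallback for the remainder when n is not divisible by 4
    @inbounds while i <= n
        s += x[i]
        i += 1
    end
    return s
end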

Now let’s look at the code for the static array:

julia> @code_llvm sum(rand(SVector{8}))

define double @julia_sum_33701({ [8 x double] } addrspace(11)* nocapture nonnull readonly dereferenceable(64)) {
top:
  %1 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 0
  %2 = load double, double addrspace(11)* %1, align 8
  %3 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 1
  %4 = load double, double addrspace(11)* %3, align 8
  %5 = fadd fast double %2, %4
  %6 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 2
  %7 = load double, double addrspace(11)* %6, align 8
  %8 = fadd fast double %5, %7
  %9 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 3
  %10 = load double, double addrspace(11)* %9, align 8
  %11 = fadd fast double %8, %10
  %12 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 4
  %13 = load double, double addrspace(11)* %12, align 8
  %14 = fadd fast double %11, %13
  %15 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 5
  %16 = load double, double addrspace(11)* %15, align 8
  %17 = fadd fast double %14, %16
  %18 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 6
  %19 = load double, double addrspace(11)* %18, align 8
  %20 = fadd fast double %17, %19
  %21 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 7
  %22 = load double, double addrspace(11)* %21, align 8
  %23 = fadd fast double %20, %22
  ret double %23
}

Here, LLVM knows there is no need to generate all the general code, because this function only needs to be valid for input arrays of length 8, and all the complicated stuff above is not worth it. Since the exact size is known, the whole loop can be unrolled.
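The size really is part of the type: StaticArrays defines length even on the type itself, so it can be read off without any instance (a quick check, assuming StaticArrays is loaded):

julia> using StaticArrays

julia> length(SVector{8,Float64})  # recovered purely from the type parameters
8

And benchmarking shows the payoff: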

julia> using BenchmarkTools

julia> @btime sum($(rand(8)))
  6.402 ns (0 allocations: 0 bytes)
2.405228701641119

julia> @btime sum($(rand(SVector{8})))
  3.278 ns (0 allocations: 0 bytes)
4.368139012221248

In e.g. C++ you would typically do this with templates, encoding the size of the arrays as a template parameter.
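In Julia the same idea falls out of the type system. A minimal hand-rolled sketch (MyVec is a hypothetical name, just to show the pattern StaticArrays builds on):

struct MyVec{N} <: AbstractVector{Float64}
    data::NTuple{N,Float64}   # the length N lives in the type, like a template parameter
end

Base.size(::MyVec{N}) where {N} = (N,)        # known at compile time, no runtime check
Base.getindex(v::MyVec, i::Int) = v.data[i]

v = MyVec((1.0, 2.0, 3.0))   # a MyVec{3}; a sum over it can be fully unrolled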


Wonderful!

So in general Julia creates code as optimal as C when the array size is a runtime parameter.
In the case of static arrays it removes the loop completely.

By the way, can we create a set of aligned arrays?
With the default being 512-bit alignment (for AVX-512)?

This means the code for those arrays, though they are not static, would be as efficient as for static arrays.

The alignment is set here

https://github.com/JuliaLang/julia/blob/9ed628780776f59bd5e52fa3d7f8244d615a0301/src/julia_internal.h#L252-L253

and depends on the size of the array. Large arrays are already 64-byte aligned:

julia> Int(pointer(rand(32))) % 64
48

julia> Int(pointer(rand(1024))) % 64
0

Hello,

That is true, as has been demonstrated. However, it appears to me that people do not use the possibilities Julia has to their full extent. Solutions like the ones quoted above never would have occurred to me as a beginner!

Actually, from the introduction to StaticArrays, which mentions a size limit of about 100 beyond which "normal" arrays become more efficient, I came to a similar conclusion as the thread starter (I hope I understood him correctly; correct me if I am wrong!): that you cannot really use them if your actual arrays can be larger, and that converting would be painful. As has been demonstrated, this is wrong; it actually is quite painless and requires only small code changes.

Now, https://docs.julialang.org/en/latest/manual/performance-tips/ does not even mention the word "static"! And that for a problem which appears to be a stumbling block so often that two people asked basically the same question at the same time :slight_smile: And as far as I have tested, loop fusion and in-place operations (i.e. using @. and/or .=) do not give the best performance you can reach with for-loops in my case (OK, I should retest with static arrays). This also seems to happen to other people, cf. https://discourse.julialang.org/t/efficient-finite-difference-operators/12439/11. While I can read that loop, I also like the vectorized syntax. Although Pre-allocating-outputs points in the right direction.

EDIT: I also tried @. @views, which did not really help, before going the pre-allocation route, rewriting quite some things and, since pre-allocation appeared to be orthogonal to fusion, doing everything in explicit loops. And then the array size thing got me by surprise :slight_smile:
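For reference, the pre-allocation plus fusion pattern discussed here looks like this (a minimal sketch):

x = rand(1000)
y = similar(x)        # pre-allocate the output once, outside any hot loop
@. y = log(x) + x^3   # fused, in-place broadcast: one pass, no temporary arrays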

So it appears to me that it might be useful if these scenarios, which have now been discussed and answered in these threads, could be included in a performance FAQ at some point?

Thank you all very much for the helpful discussion.

Best Regards

Christof


What you show is only the address of the first element.
What I suggest is making sure that no vectorized loop has an "anomaly" (edge case) to take care of.

I meant something like Intel IPP.

If we define a 1D array, it will be padded so its size is a multiple of 16 / 32 / 64 bytes.
If we define a 2D array, it will be padded so each row is also a multiple of 16 / 32 / 64 bytes.

This way all loops can be unrolled and vectorized without having to take care of edge cases.

Isn’t it a bit dangerous, relying on the layout of data in memory?

This is not going to change.

The famous last words … :wink:

There’s no reason to change it since the layout we have now is the optimal one and it has never had a different layout since the types involved have existed.

Are you referring to my post?

No.

Does unaligned vs aligned load really matter?

In Ryzen (page 88), Haswell (page 210), Skylake (page 243, no AVX-512), Skylake-X (page 260, AVX-512), and Knights Landing (page 328, has AVX-512) – it’s all the same.

When I look at really old processors, like the K10 (page 36), which only has up to xmm registers, it lists "MOVUPS/D" for operands "m,r" as having twice the reciprocal throughput (i.e., half the throughput) of "MOVAPS/D" for "m,r".
For Intel, that was the case going back to Sandy Bridge; see Agner’s CPU blog, "Test results for Intel's Sandy Bridge processor":

On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands, except for the fact that it uses more cache banks so that the risk of cache conflicts is higher when the operand is misaligned. Store-to-load forwarding also works with misaligned operands in most cases.

So, at least for x86-64 processors that are AVX-capable, it seems alignment doesn’t matter?

I’m no expert, but that’s basically all I found.

I think that prior to Sandy Bridge, if you accessed aligned data using the unaligned load instruction, it wouldn’t be efficient.
On modern CPUs, if the data is aligned, it doesn’t matter whether you use the load that assumes alignment or not.
But I still think accessing unaligned data is slower than accessing aligned data.

But my point is different.
We must make sure the length of the data allocated is a multiple of 16 bytes (for SSE) / 32 bytes (for AVX) / 64 bytes (for AVX-512).

The tricky part is dealing with 1D / 2D / 3D / etc. arrays.

Yeah, if operations on vectors of length 8 are faster than on vectors of length 5, why not pad it with 3 hidden zeros?
(Ditto for the columns of a matrix, treated as vectors.)

At the very least, it’d be straightforward to create an array type that wraps regular arrays and experiment.
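Such an experiment could look like this (a rough sketch; PaddedVector and padsum are hypothetical names, and the pad width is a free parameter):

struct PaddedVector{T} <: AbstractVector{T}
    data::Vector{T}   # backing storage, rounded up to a multiple of the pad width
    len::Int          # logical length
end

function PaddedVector(x::Vector{T}; pad::Int = 8) where {T}
    n = length(x)
    padded = cld(n, pad) * pad     # round up to the next multiple of pad
    data = zeros(T, padded)        # hidden tail elements stay zero
    copyto!(data, x)
    return PaddedVector{T}(data, n)
end

Base.size(v::PaddedVector) = (v.len,)
Base.@propagate_inbounds Base.getindex(v::PaddedVector, i::Int) = v.data[i]

# For a sum, the zero padding is harmless, so the loop can run over the
# full padded buffer with no remainder handling:
padsum(v::PaddedVector) = sum(v.data)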

That’s what I suggest.
This is what Intel does in Intel IPP.
All arrays are padded so there are no edge cases; you can always work on sets of 8 elements.