Strange performance of a loop

rtemp = x[i] - x[j]
dist2 += rtemp' * rtemp

They don’t have the same indexing behavior, they share the same memory layout—and indeed the very same memory. One behaves as a two-dimensional array, the other as a vector of vectors. But they both are represented in-memory by the same contiguous collection of floating-point values:

julia> A = rand(3, 5)
3×5 Array{Float64,2}:
 0.826885  0.523742  0.0546594  0.848417  0.678522
 0.308623  0.866344  0.0319207  0.249763  0.93445
 0.667832  0.195743  0.296378   0.996889  0.706275

julia> v = reinterpret(SVector{3,Float64}, reshape(A, :))
5-element reinterpret(SArray{Tuple{3},Float64,1,3}, ::Array{Float64,1}):
 [0.826885, 0.308623, 0.667832]
 [0.523742, 0.866344, 0.195743]
 [0.0546594, 0.0319207, 0.296378]
 [0.848417, 0.249763, 0.996889]
 [0.678522, 0.93445, 0.706275]

julia> A[2,3] *= -1
-0.031920712969890186

julia> A
3×5 Array{Float64,2}:
 0.826885  0.523742   0.0546594  0.848417  0.678522
 0.308623  0.866344  -0.0319207  0.249763  0.93445
 0.667832  0.195743   0.296378   0.996889  0.706275

julia> v
5-element reinterpret(SArray{Tuple{3},Float64,1,3}, ::Array{Float64,1}):
 [0.826885, 0.308623, 0.667832]
 [0.523742, 0.866344, 0.195743]
 [0.0546594, -0.0319207, 0.296378]
 [0.848417, 0.249763, 0.996889]
 [0.678522, 0.93445, 0.706275]

julia> v[3][2]
-0.031920712969890186

julia> pointer(A)
Ptr{Float64} @0x0000000119cebee0

julia> pointer(v)
Ptr{SArray{Tuple{3},Float64,1,3}} @0x0000000119cebee0
2 Likes

Thank you all for the helpful discussions, the SArrays version below is more than 20% faster than the Intel Fortran equivalent. I’m amazed by the performance of Julia with StaticArrays, small code changes payed me back very well.

using BenchmarkTools
using StaticArrays

function potential(r, a)
    m, n = size(r)
    Epot = 0.0
    @inbounds for i=1:n-1, j=i+1:n
        dist2 = 0.0
        for k = 1:m
            dist2 += (r[k,i]-r[k,j])^2
        end 
        r6 = (1/dist2)^3
        r12 = r6*r6
        Epot += r12 - r6
        for k = 1:m
            tmp = (r6 - 2r12)/dist2 * (r[k,i]-r[k,j])
            a[k,i] -= tmp
            a[k,j] += tmp
        end
    end
  return Epot
end

n = 550
r = rand(3,n)
a = zeros(3,n)
@btime potential($r,$a)

function potential2(r, a)
    n = size(r,1)
    Epot = 0.0
    @inbounds for i=1:n-1, j=i+1:n
        dist2 = 0.0
        rij = r[i]-r[j]
        dist2 = dot(rij,rij)

        r6 = (1/dist2)^3
        r12 = r6*r6
        Epot += r12 - r6

        tmp = (r6 - 2r12)/dist2 .* (r[i] - r[j])
        a[i] -= tmp
        a[j] += tmp
    end
  return Epot
end

n = 550
r = [@SVector rand(3) for i in 1:n]
a = [@SVector zeros(3) for i in 1:n]
@btime potential2($r,$a)

and timings

  1.471 ms (0 allocations: 0 bytes)  # Original Julia code
  1.010 ms (0 allocations: 0 bytes)  # Using StaticArrays
  1.234 ms                           # Intel Fortran equivalent
7 Likes

I wonder, in C wouldn’t the compiler be able to do some loop unrolling even if the number of loops is a parameter for the function?

Julia does it too, but the unrolling can be by too much. For example, in a simple dot product:

function dot_product(x::AbstractArray{T},y::AbstractArray{T}) where T
    @boundscheck length(x) == length(y) || throw(BoundsError())
    out = zero(T)
    @fastmath @inbounds for i ∈ eachindex(x)
        out += x[i] * y[i]
    end
    out
end

It unrolls by a factor of 16 on a computer with avx2 (it’d be 32 given avx512):

x = randn(1);
y = randn(1);
@code_native dot_product(x,y)

Gives output including:

; Function getindex; {
; Location: array.jl:731
L160:
	vmovupd	(%rdx,%rcx,8), %ymm4
	vmovupd	32(%rdx,%rcx,8), %ymm5
	vmovupd	64(%rdx,%rcx,8), %ymm6
	vmovupd	96(%rdx,%rcx,8), %ymm7
;}
; Function add_fast; {
; Location: fastmath.jl:161
	vfmadd231pd	(%rax,%rcx,8), %ymm4, %ymm0
	vfmadd231pd	32(%rax,%rcx,8), %ymm5, %ymm1
	vfmadd231pd	64(%rax,%rcx,8), %ymm6, %ymm2
	vfmadd231pd	96(%rax,%rcx,8), %ymm7, %ymm3
	addq	$16, %rcx
	cmpq	%rcx, %r8
	jne	L160
	vaddpd	%ymm0, %ymm1, %ymm0
	cmpq	%r8, %r9
	vaddpd	%ymm0, %ymm2, %ymm0
	vaddpd	%ymm0, %ymm3, %ymm0
	vextractf128	$1, %ymm0, %xmm1
	vaddpd	%ymm1, %ymm0, %ymm0
	vhaddpd	%ymm0, %ymm0, %ymm0
;}

This is the for loop.
The block following L160: says to move data from memory (eg, (%rdx,%rcx,8) into registers (eg, %ymm4). Registers starting with “y” are 256 bits, meaning that’s 4 doubles. It says to do this 4 times, for 16 doubles total.
With avx-512, the "y"s would be "z"s, and that’d be 32 numbers total.

Next up are four vfmadd231pd, saying to do four vector fused mutiply-adds on numbers from memory, those numbers just loaded, adding and storing them registers ymm0 through ymm3.
Then jne L160 says to conditionally jump back to L160, to repeat this process.

The problem is, if the loop is shorter than 16, it doesn’t get unrolled at all.

julia> using BenchmarkTools

julia> x15 = randn(15); y15 = randn(15);

julia> x16 = randn(16); y16 = randn(16);

julia> @benchmark dot_product($x15, $y15)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.567 ns (0.00% GC)
  median time:      9.628 ns (0.00% GC)
  mean time:        9.648 ns (0.00% GC)
  maximum time:     23.377 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark dot_product($x16, $y16)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.540 ns (0.00% GC)
  median time:      5.580 ns (0.00% GC)
  mean time:        5.636 ns (0.00% GC)
  maximum time:     20.368 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> x8 = randn(8); y8 = randn(8);

julia> @benchmark dot_product($x8, $y8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.510 ns (0.00% GC)
  median time:      5.580 ns (0.00% GC)
  mean time:        5.626 ns (0.00% GC)
  maximum time:     19.136 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Basically, it unrolls to optimize for long loops. If you know the loops are short, it helps to pass along that information.
With SVectors (which are amazingly faster), it does speed up for vectors of length 8 vs 16:

julia> sx16 = @SVector randn(16);

julia> sy16 = @SVector randn(16);

julia> @benchmark dot_product($sx16, $sy16)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.616 ns (0.00% GC)
  median time:      3.647 ns (0.00% GC)
  mean time:        3.865 ns (0.00% GC)
  maximum time:     22.252 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> sx8 = @SVector randn(8);

julia> sy8 = @SVector randn(8);

julia> @benchmark dot_product($sx8, $sy8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.915 ns (0.00% GC)
  median time:      2.955 ns (0.00% GC)
  mean time:        3.105 ns (0.00% GC)
  maximum time:     24.025 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Not that much faster though, which I think is I think why LLVM normally unrolls by 4 (instructions aren’t done one at a time).
15 is slower, I guess because it doesn’t neatly divide into chunks of 4:

julia> sx15 = @SVector randn(15);

julia> sy15 = @SVector randn(15);

julia> @benchmark dot_product($sx15, $sy15)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.548 ns (0.00% GC)
  median time:      4.589 ns (0.00% GC)
  mean time:        4.830 ns (0.00% GC)
  maximum time:     22.592 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
1 Like

Hello,

just wanted to chime in. I am the starter of The cost of size() in for-loops. What a coincidence.

This is for me a nice opportunity to learn how to correctly use static arrays ! I would have mostly the same questions, we have also some Lennard-Jones 12-6 in the mix.

Best Regards

Christof

I’m not an expert yet I’d thought this would be advantage of Julia that given a loop, since the data is abstracted and Julia knows about it more than a regular compiler, it will maximize the performance.

For instance, I don’t see why in the case of a Dot Product using Static Arrays would have any advantage to regular arrays.
At the beginning of the loop Julia’s knowledge should be the same (Length of the loop, size and type of the arrays).

The compiler only knows the type, not the size of the arrays, and therefore not the length of the loop. That is the main point of StaticArrays, that the type itself contains size information.

There are other properties of StaticArrays that help performance too, such as immutability and stack-allocation, while ordinary Arrays are mutable and heap-allocated.

Imaging we have a scalar operation we want to apply on an array in element wise fashion (For instance calculating the log() of each element or it cube, etc…).

In C we would write a function which accepts a pointer to array and an integer of the number of elements in the array.
The function loops over the element and applies the operation per element.
In that case any modern compiler would do 3 things:

  1. Vectorize the operation.
  2. Apply Loop Unrolling.
  3. If operation is complex enough (Not to be memory bounded) it will Multi Thread the code.

The code the Compiler will generate will be the most efficient as it could as this is a toy example.

Now, leave alone the Multi Threading, will Julia generate the same code as above for a function which accepts the array? Namely with all loop unrolling and vecotrization?

Will the code generated by Static Array be faster then the C code in the toy example above?
How could that be?

Not in general, no. If it’s a parameter then a C compiler would generate generic code for any possible value.

Julia uses LLVM to generate code which is the same backend as e.g. clang uses. So modulo the exact optimization passes used, the same code will be generated. The question that is relevant is for what input does the generated code for the function be valid. As an example, a small sum function:

function sum(x::AbstractVector{Float64})
    s = 0.0
    @inbounds @simd for i in 1:length(x)
        s += x[i]
    end
    return s
end

import StaticArrays

@code_llvm sum(rand(8))
@code_llvm sum(rand(SVector{8}))

Looking at some of the code in the first example:

vector.ph:                                        ; preds = %min.iters.checked
  %18 = insertelement <2 x double> <double undef, double 0.000000e+00>, double %s.0.ph, i32 0
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <2 x double> [ %18, %vector.ph ], [ %23, %vector.body ]
  %vec.phi122 = phi <2 x double> [ zeroinitializer, %vector.ph ], [ %24, %vector.body ]
  %19 = getelementptr double, double* %14, i64 %index
  %20 = bitcast double* %19 to <2 x double>*
  %wide.load = load <2 x double>, <2 x double>* %20, align 8
  %21 = getelementptr double, double* %19, i64 2
  %22 = bitcast double* %21 to <2 x double>*
  %wide.load124 = load <2 x double>, <2 x double>* %22, align 8
  %23 = fadd fast <2 x double> %vec.phi, %wide.load
  %24 = fadd fast <2 x double> %vec.phi122, %wide.load124
  %index.next = add i64 %index, 4
  %25 = icmp eq i64 %index.next, %n.vec
  br i1 %25, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = fadd fast <2 x double> %24, %23
  %rdx.shuf = shufflevector <2 x double> %bin.rdx, <2 x double> undef, <2 x i32> <i32 1, i32 undef>
  %bin.rdx127 = fadd fast <2 x double> %bin.rdx, %rdx.shuf
  %26 = extractelement <2 x double> %bin.rdx127, i32 0
  %cmp.n = icmp eq i64 %5, %n.vec
  br i1 %cmp.n, label %L11.outer.L11.outer.split_crit_edge.loopexit, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %min.iters.checked, %if12.lr.ph
  %bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %if12.lr.ph ], [ 0, %min.iters.checked ]
  %bc.merge.rdx = phi double [ %26, %middle.block ], [ %s.0.ph, %if12.lr.ph ], [ %s.0.ph, %min.iters.checked ]
  br label %if12

Since this code has to be valid no matter the length of the array, it unrolls the loop by a factor of 4, tries to do SIMD, then has a fallback when the number of elements is not divisible by 4 etc etc. Very nice and fast if the array is big, but some overhead if it is small.

Now let’s look at the code for the static array

julia> @code_llvm sum(rand(SVector{8}))

define double @julia_sum_33701({ [8 x double] } addrspace(11)* nocapture nonnull readonly dereferenceable(64)) {
top:
  %1 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 0
  %2 = load double, double addrspace(11)* %1, align 8
  %3 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 1
  %4 = load double, double addrspace(11)* %3, align 8
  %5 = fadd fast double %2, %4
  %6 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 2
  %7 = load double, double addrspace(11)* %6, align 8
  %8 = fadd fast double %5, %7
  %9 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 3
  %10 = load double, double addrspace(11)* %9, align 8
  %11 = fadd fast double %8, %10
  %12 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 4
  %13 = load double, double addrspace(11)* %12, align 8
  %14 = fadd fast double %11, %13
  %15 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 5
  %16 = load double, double addrspace(11)* %15, align 8
  %17 = fadd fast double %14, %16
  %18 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 6
  %19 = load double, double addrspace(11)* %18, align 8
  %20 = fadd fast double %17, %19
  %21 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 7
  %22 = load double, double addrspace(11)* %21, align 8
  %23 = fadd fast double %20, %22
  ret double %23
}

Here, LLVM knows that there is no need to generate all the general code because this function only needs to be valid for input arrays of length 8 and it knows all the complicated stuff above is not worth it. The exact size is known the whole loop can be unrolled.

julia> @btime sum($(rand(8)))
  6.402 ns (0 allocations: 0 bytes)
2.405228701641119

julia> @btime sum($(rand(SVector{8})))
  3.278 ns (0 allocations: 0 bytes)
4.368139012221248

In e.g. C++ you would typically do this with templates, encoding the size of the arrays as a template parameter.

4 Likes

Wonderful!

So in general Julia create optimal code as C for input parameter of the size of the array.
In case of Static Arrays it removes the loop completely.

By the way, can we create a set of Aligned Arrays?
With the default being aligned by 512 Bits (For AVX512)?

It means that the code for those, though are not static, will be as efficient as for Static Arrays.

The alignment is set here

and depends on the size of the array. Large arrays are already 64 bytes aligned:

julia> Int(pointer(rand(32))) % 64
48

julia> Int(pointer(rand(1024))) % 64
0

Hello,

That is true as has been demonstrated. However, to me it appears that people do not use the possibilities julia has to their full extent. Solutions like

or

never would have occured to me as a beginner !

Actually, from the introduction to Static Arrays which mentions a limit of about 100 in size at which the “normal” arrays become more efficient I came to a similar conclusion as the thread starter (I hope I understood him correctly, correct me if I am wrong !), i.e. that you cannot really use them if your actual arrays can be larger and that converting would be painful. As has been demonstrated this is wrong, it actually is quite painless and requires only small code changes.

Now, https://docs.julialang.org/en/latest/manual/performance-tips/ does not even mention the word “static” ! And that for a problem which appears to be a stumbling block so often that two people asked basically the same question at the same time :slight_smile: And as far as I tested this loop fusion and in-place operations (i.e. using @. and/or .=) does not give the best performance (ok, I should retest with static arrays) you can reach with for-loops in my case. This also appears to be something which happens for some more people, confirm https://discourse.julialang.org/t/efficient-finite-difference-operators/12439/11. While I can read that loop I also like the vectorized syntax. Although Pre-allocating-outputs points in the right direction.

EDIT: I also tried @. @views which did not really help before going the pre-allocation route, rewriting quite some things and, to make pre-allocation work as it appeared to be orthogonal to fusion, doing everything in explicit loops. And the the array size thing got me by suprise :slight_smile:

So it appears to me that it might be useful if these scenarios, which now have been already discussed and answered in these threads, could be included into a performance FAQ at some point ?

Thank you all very much for the helpful discussion.

Best Regards

Christof

2 Likes

What you show is only the address of the 1st element.
What I suggest is making sure any Loop with vectorization won’t have “Anomaly” to take care of.

I meant something like Intel IPP.

If we define 1D array it will be padded to have size which is multiplication of 16 / 32 / 64 Bytes.
If you define 2D array it will be padded with rows which are also multiplication of 16 / 32 / 64 Bytes.

This way all loops will be able to be unrolled and vectorized with no issues about taking care of edge cases.

Isn’t it a bit dangerous, relying on the layout of data in memory?

This is not going to change.

The famous last words … :wink:

There’s no reason to change it since the layout we have now is the optimal one and it has never had a different layout since the types involved have existed.

Are you referring to my post?