# Strange performance of a loop

I’m not sure if this is known or has been discussed before, but I noticed a performance difference (a 20% loss) when using

``````
m = size(a,1)
for i = 1:m
    # do something with a[i,k]
end
``````

compared to when using the literal constant 3 instead:

``````
for i = 1:3
    # do something with a[i,k]
end
``````

I don’t know if this is an unimplemented compiler optimization, but the difference is much bigger in v0.6 than in v0.7. Here is a complete example:

``````
using BenchmarkTools

function potential(r, a)
    m, n = size(r)
    Epot = 0.0
    @inbounds for i = 1:n-1, j = i+1:n
        dist2 = 0.0
        for k = 1:m
            dist2 += (r[k,i] - r[k,j])^2
        end

        r6 = (1/dist2)^3
        r12 = r6*r6
        Epot += r12 - r6

        for k = 1:m
            tmp = (r6 - 2r12)/dist2 * (r[k,i] - r[k,j])
            a[k,i] -= tmp
            a[k,j] += tmp
        end
    end
    return Epot
end

n = 550
r = rand(3, n)
a = zeros(3, n)

julia> @btime potential($r, $a)
  1.471 ms (0 allocations: 0 bytes)
2.125152650915103e27
``````

and when replacing `m` with 3:

``````
julia> @btime potential($r, $a)
  1.214 ms (0 allocations: 0 bytes)
2.1251526509151058e27
``````

In v0.6, the difference is more obvious, as I said:

``````
julia> @btime potential($r, $a)
  2.513 ms (0 allocations: 0 bytes)
5.334031316343885e22
``````

and

``````
julia> @btime potential($r, $a)
  1.219 ms (0 allocations: 0 bytes)
5.334031316343885e22
``````

Isn’t this the exact same discussion as *The cost of size() in for-loops*? If the number of iterations is a compile-time constant, the loop can be unrolled. If you want the compiler to specialize on specific array sizes, use StaticArrays.jl.
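To illustrate this, here is a minimal, hypothetical sketch (the function name `pair_dist2` is made up for this example) of how the fixed column length lets the compiler unroll:

``````
using StaticArrays

# Each column is an SVector{3,Float64}: the length 3 is part of the type,
# so any loop over it can be fully unrolled at compile time.
r = [@SVector rand(3) for _ in 1:550]

function pair_dist2(r, i, j)
    d = r[i] - r[j]       # elementwise subtraction, unrolled
    return sum(abs2, d)   # reduction over 3 elements, unrolled
end
``````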


Thank you Stefan, and sorry, I totally overlooked that question. The discussion there is really helpful, but I don’t want to switch to `SArray`s because it would force many code changes. I don’t know if it is a priority now, but I still think this is a missed optimization opportunity. In many of my own benchmarks, Julia outperformed C++ and could even match the performance of `ifort` when using `SArray`s. AFAIK, Fortran implements this optimization by default, so I hope Julia closes this small gap in a near 1.x release.

Such as? You should be able to write generic code that works with either `Array`s or `SArray`s (or `MArray`s if they need to be mutable) and automatically get loop unrolling and other fixed-size-array optimizations.


I would have to use an `Array` of `SVector`s because `n` is large, along with many other changes:

``````
r = rand(3,n)
# becomes
r = [SVector(rand(3)...,) for i = 1:n]

a = zeros(3,n)
# becomes
a = [SVector(zeros(3)...,) for i = 1:n]

m, n = size(r)
# becomes
m = length(r[1])  # the static size, 3
n = length(r)

dist2 += (r[k,i]-r[k,j])^2
# becomes
dist2 += (r[i][k]-r[j][k])^2

# The loop
for k = 1:m
    tmp = (r6 - 2r12)/dist2 * (r[k,i]-r[k,j])
    a[k,i] -= tmp
    a[k,j] += tmp
end
# becomes
tmp = (r6 - 2r12)/dist2 .* (r[i] - r[j])
a[i] -= tmp
a[j] += tmp
``````

The code becomes less natural and in this specific case, no performance improvement is obtained.

I see, so the issue is that the array has a static size in one dimension but a dynamic size in another dimension?

Yes, exactly.

It’s worth noting that a `Vector{SVector{3, Float64}}` with length `n` has exactly the same memory layout as a `Matrix{Float64}` of size `3 x n`, so you can even reinterpret from one to the other without copying the underlying data:

``````
julia> using StaticArrays

julia> x = rand(3, 5)
3×5 Array{Float64,2}:
0.734312  0.112768  0.582428  0.225611  0.649733
0.75691   0.719704  0.613424  0.757309  0.95314
0.635051  0.921791  0.94049   0.123226  0.262944

julia> reinterpret(SVector{3, Float64}, x, (size(x, 2),))
5-element Array{StaticArrays.SArray{Tuple{3},Float64,1,3},1}:
[0.734312, 0.75691, 0.635051]
[0.112768, 0.719704, 0.921791]
[0.582428, 0.613424, 0.94049]
[0.225611, 0.757309, 0.123226]
[0.649733, 0.95314, 0.262944]
``````

in this way you can pretty seamlessly move from one interpretation of the data to another without substantial changes in your code.


I haven’t tested most of your code, but the way you are constructing the `SVector` arrays is quite suboptimal. For `n = 20`:

``````
julia> @btime [SVector(rand(3)...,) for i in 1:$n];
  9.866 μs (123 allocations: 5.00 KiB)
``````

``````
julia> @btime [@SVector rand(3) for i in 1:$n];
  330.255 ns (1 allocation: 576 bytes)
``````

Still slower than `rand(3, n)`, but not as catastrophic as before.


Even faster:

``````
reinterpret(SVector{3,Float64}, reshape(rand(3, n), :))
``````

How can they have the same memory layout? In the former case, `x[i]` yields a vector, in the latter it yields a scalar. So, I cannot avoid code changes?

Thanks, I was using both constructs interchangeably without knowing about that speed difference. Still, what annoys me more is that the performance is the same: the compiler couldn’t constant-propagate the size of the array, even though it is fixed at run time. I’m not able to find a workaround either.

> How can they have the same memory layout? In the former case, x[i] yields a vector, in the latter it yields a scalar. So, I cannot avoid code changes?

Look at the example. That vector is the `i`th column of `x`.
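As a quick sanity check along the lines of that example, you can verify that the `i`th element of the reinterpreted vector is exactly the `i`th column of the matrix:

``````
using StaticArrays

x = rand(3, 5)
xx = reinterpret(SVector{3,Float64}, vec(x))

# the 2nd element of xx holds the same values as the 2nd column of x
xx[2] == SVector{3}(x[:, 2])
``````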

I’m interested in array access working the same way; for example, this loop will not work with the reinterpreted version:

``````
for k = 1:3
    dist2 += (r[k,i]-r[k,j])^2
end
``````

There is no such thing as `r[k,i]`; there is only `r[i][k]`. This is what I want to avoid. Do I misunderstand something?

``````
julia> x = rand(3, 5)
3×5 Array{Float64,2}:
0.661841   0.727968  0.0902786  0.535843  0.963917
0.0925486  0.310896  0.218177   0.157345  0.50067
0.0927625  0.47004   0.908158   0.098368  0.734819

julia> xx = reinterpret(SVector{3,Float64}, reshape(x, :))
5-element reinterpret(SArray{Tuple{3},Float64,1,3}, ::Array{Float64,1}):
[0.661841, 0.0925486, 0.0927625]
[0.727968, 0.310896, 0.47004]
[0.0902786, 0.218177, 0.908158]
[0.535843, 0.157345, 0.098368]
[0.963917, 0.50067, 0.734819]

julia> x[2,3]
0.21817716087561045

julia> xx[2,3]
ERROR: BoundsError: attempt to access 5-element reinterpret(SArray{Tuple{3},Float64,1,3}, ::Array{Float64,1}) at index [2, 3]
Stacktrace:
 throw_boundserror(::Base.ReinterpretArray{SArray{Tuple{3},Float64,1,3},1,Float64,Array{Float64,1}}, ::Tuple{Int64,Int64}) at .\abstractarray.jl:492
 checkbounds at .\abstractarray.jl:457 [inlined]
 _getindex at .\abstractarray.jl:935 [inlined]
 getindex(::Base.ReinterpretArray{SArray{Tuple{3},Float64,1,3},1,Float64,Array{Float64,1}}, ::Int64, ::Int64) at .\abstractarray.jl:913
 top-level scope at none:0
``````
With the reinterpreted array, the inner distance loop can instead be written as:

``````
rtemp = x[i] - x[j]
dist2 += rtemp' * rtemp
``````

They don’t have the same indexing behavior, but they share the same memory layout, and indeed the very same memory. One behaves as a two-dimensional array, the other as a vector of vectors. But both are represented in memory by the same contiguous collection of floating-point values:

``````
julia> A = rand(3, 5)
3×5 Array{Float64,2}:
0.826885  0.523742  0.0546594  0.848417  0.678522
0.308623  0.866344  0.0319207  0.249763  0.93445
0.667832  0.195743  0.296378   0.996889  0.706275

julia> v = reinterpret(SVector{3,Float64}, reshape(A, :))
5-element reinterpret(SArray{Tuple{3},Float64,1,3}, ::Array{Float64,1}):
[0.826885, 0.308623, 0.667832]
[0.523742, 0.866344, 0.195743]
[0.0546594, 0.0319207, 0.296378]
[0.848417, 0.249763, 0.996889]
[0.678522, 0.93445, 0.706275]

julia> A[2,3] *= -1
-0.031920712969890186

julia> A
3×5 Array{Float64,2}:
0.826885  0.523742   0.0546594  0.848417  0.678522
0.308623  0.866344  -0.0319207  0.249763  0.93445
0.667832  0.195743   0.296378   0.996889  0.706275

julia> v
5-element reinterpret(SArray{Tuple{3},Float64,1,3}, ::Array{Float64,1}):
[0.826885, 0.308623, 0.667832]
[0.523742, 0.866344, 0.195743]
[0.0546594, -0.0319207, 0.296378]
[0.848417, 0.249763, 0.996889]
[0.678522, 0.93445, 0.706275]

julia> v[3][2]
-0.031920712969890186

julia> pointer(A)
Ptr{Float64} @0x0000000119cebee0

julia> pointer(v)
Ptr{SArray{Tuple{3},Float64,1,3}} @0x0000000119cebee0
``````

Thank you all for the helpful discussion; the `SArray`s version below is more than 20% faster than the Intel Fortran equivalent. I’m amazed by the performance of Julia with `StaticArrays`; the small code changes paid off very well.

``````
using BenchmarkTools
using LinearAlgebra   # for dot
using StaticArrays

function potential(r, a)
    m, n = size(r)
    Epot = 0.0
    @inbounds for i = 1:n-1, j = i+1:n
        dist2 = 0.0
        for k = 1:m
            dist2 += (r[k,i]-r[k,j])^2
        end
        r6 = (1/dist2)^3
        r12 = r6*r6
        Epot += r12 - r6
        for k = 1:m
            tmp = (r6 - 2r12)/dist2 * (r[k,i]-r[k,j])
            a[k,i] -= tmp
            a[k,j] += tmp
        end
    end
    return Epot
end

n = 550
r = rand(3, n)
a = zeros(3, n)
@btime potential($r, $a)

function potential2(r, a)
    n = length(r)
    Epot = 0.0
    @inbounds for i = 1:n-1, j = i+1:n
        rij = r[i] - r[j]
        dist2 = dot(rij, rij)

        r6 = (1/dist2)^3
        r12 = r6*r6
        Epot += r12 - r6

        tmp = (r6 - 2r12)/dist2 .* rij
        a[i] -= tmp
        a[j] += tmp
    end
    return Epot
end

n = 550
r = [@SVector rand(3) for i in 1:n]
a = [@SVector zeros(3) for i in 1:n]
@btime potential2($r, $a)
``````

and the timings:

``````
1.471 ms (0 allocations: 0 bytes)  # Original Julia code
1.010 ms (0 allocations: 0 bytes)  # Using StaticArrays
1.234 ms                           # Intel Fortran equivalent
``````

I wonder, in `C` wouldn’t the compiler be able to do some loop unrolling even if the number of iterations is a parameter of the function?

Julia does it too, but the unrolling factor can be too large. For example, in a simple dot product:

``````
function dot_product(x::AbstractArray{T}, y::AbstractArray{T}) where T
    @boundscheck length(x) == length(y) || throw(BoundsError())
    out = zero(T)
    @fastmath @inbounds for i ∈ eachindex(x)
        out += x[i] * y[i]
    end
    out
end
``````

It unrolls by a factor of 16 on a computer with AVX2 (it would be 32 with AVX-512):

``````
x = randn(1);
y = randn(1);
@code_native dot_product(x,y)
``````

Gives output including:

``````
; Function getindex; {
; Location: array.jl:731
L160:
vmovupd	(%rdx,%rcx,8), %ymm4
vmovupd	32(%rdx,%rcx,8), %ymm5
vmovupd	64(%rdx,%rcx,8), %ymm6
vmovupd	96(%rdx,%rcx,8), %ymm7
;}
; Location: fastmath.jl:161
cmpq	%rcx, %r8
jne	L160
cmpq	%r8, %r9
vextractf128	\$1, %ymm0, %xmm1
;}
``````

This is the for loop.
The block following `L160:` moves data from memory (e.g., `(%rdx,%rcx,8)`) into registers (e.g., `%ymm4`). Registers starting with “y” are 256 bits wide, i.e. 4 doubles each. It does this 4 times, for 16 doubles in total.
With AVX-512, the “y”s would be “z”s, and that would be 32 numbers in total.

Next up are four `vfmadd231pd` instructions, which perform vector fused multiply-adds on the numbers just loaded, accumulating the results into registers `ymm0` through `ymm3`.
Then `jne L160` conditionally jumps back to `L160` to repeat the process.

The problem is, if the loop is shorter than 16, it doesn’t get unrolled at all.

``````
julia> using BenchmarkTools

julia> x15 = randn(15); y15 = randn(15);

julia> x16 = randn(16); y16 = randn(16);

julia> @benchmark dot_product($x15, $y15)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     9.567 ns (0.00% GC)
median time:      9.628 ns (0.00% GC)
mean time:        9.648 ns (0.00% GC)
maximum time:     23.377 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     999

julia> @benchmark dot_product($x16, $y16)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     5.540 ns (0.00% GC)
median time:      5.580 ns (0.00% GC)
mean time:        5.636 ns (0.00% GC)
maximum time:     20.368 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000

julia> x8 = randn(8); y8 = randn(8);

julia> @benchmark dot_product($x8, $y8)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     5.510 ns (0.00% GC)
median time:      5.580 ns (0.00% GC)
mean time:        5.626 ns (0.00% GC)
maximum time:     19.136 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000
``````

Basically, it unrolls to optimize for long loops. If you know the loops are short, it helps to pass that information along.
With `SVector`s (which are amazingly faster), it does speed up for vectors of length 8 vs. 16:

``````
julia> sx16 = @SVector randn(16);

julia> sy16 = @SVector randn(16);

julia> @benchmark dot_product($sx16, $sy16)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     3.616 ns (0.00% GC)
median time:      3.647 ns (0.00% GC)
mean time:        3.865 ns (0.00% GC)
maximum time:     22.252 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000

julia> sx8 = @SVector randn(8);

julia> sy8 = @SVector randn(8);

julia> @benchmark dot_product($sx8, $sy8)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.915 ns (0.00% GC)
median time:      2.955 ns (0.00% GC)
mean time:        3.105 ns (0.00% GC)
maximum time:     24.025 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000
``````

Not that much faster, though, which I think is why LLVM normally unrolls by 4 (instructions aren’t executed one at a time).
Length 15 is slower, I guess because it doesn’t divide neatly into chunks of 4:

``````
julia> sx15 = @SVector randn(15);

julia> sy15 = @SVector randn(15);

julia> @benchmark dot_product($sx15, $sy15)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     4.548 ns (0.00% GC)
median time:      4.589 ns (0.00% GC)
mean time:        4.830 ns (0.00% GC)
maximum time:     22.592 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000
``````
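One hedged way to “pass along that information” without StaticArrays is to lift the trip count into the type domain with `Val` (the function name `dot_n` is made up for this sketch):

``````
# N is a compile-time constant inside the method body, so the loop bound
# is known and the compiler can unroll the loop completely.
function dot_n(x, y, ::Val{N}) where {N}
    out = zero(eltype(x))
    @inbounds for i in 1:N
        out += x[i] * y[i]
    end
    return out
end

dot_n(randn(8), randn(8), Val(8))  # the caller supplies the known length
``````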

Hello,

just wanted to chime in. I am the starter of *The cost of size() in for-loops*. What a coincidence.

This is a nice opportunity for me to learn how to use static arrays correctly! I would have mostly the same questions; we also have some Lennard-Jones 12-6 potentials in the mix.

Best Regards

Christof

I’m not an expert, but I’d have thought this would be an advantage of Julia: given a loop, since the data is abstracted and Julia knows more about it than a regular compiler does, it could maximize performance.

For instance, I don’t see why, in the case of a dot product, using static arrays would have any advantage over regular arrays.
At the beginning of the loop, Julia’s knowledge should be the same in both cases (the length of the loop, and the size and type of the arrays).
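The difference, as I understand it, is *where* the length lives: for a plain `Vector` it is run-time data stored alongside the array, while for an `SVector` it is part of the type itself, so a method can be compiled for that exact length. A small illustration:

``````
using StaticArrays

v  = randn(3)            # Vector{Float64}: the length (3) is a run-time value
sv = @SVector randn(3)   # SVector{3,Float64}: the length 3 is in the type

# A function called with `sv` is specialized on SVector{3,Float64}, so the
# compiler sees the constant 3; with `v` it only sees Vector{Float64} and
# must emit a generic loop whose bound is read at run time.
typeof(v), typeof(sv)
``````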