Strange performance of a loop

I’m not sure if this is known or has been discussed before, but I noticed a performance difference (a ~20% loss) when using

m = size(a,1)
for i = 1:m
    # do something with a[i,k]
end

compared to using the literal constant 3 instead:

for i = 1:3
    # do something with a[i,k]
end

I don’t know if this is a missing compiler optimization, but the difference is much bigger in v0.6 than in v0.7. Here is a complete example:

using BenchmarkTools

function potential(r, a)
    m, n = size(r)   # m = spatial dimension (3), n = number of particles
    Epot = 0.0
    @inbounds for i=1:n-1, j=i+1:n
        # squared distance between particles i and j
        dist2 = 0.0
        for k = 1:m
            dist2 += (r[k,i]-r[k,j])^2
        end

        # Lennard-Jones 12-6 energy terms
        r6 = (1/dist2)^3
        r12 = r6*r6
        Epot += r12 - r6

        # accumulate the force contribution on both particles
        for k = 1:m
            tmp = (r6 - 2r12)/dist2 * (r[k,i]-r[k,j])
            a[k,i] -= tmp
            a[k,j] += tmp
        end
    end
    return Epot
end

n = 550
r = rand(3,n)
a = zeros(3,n)

julia> @btime potential($r,$a)
  1.471 ms (0 allocations: 0 bytes)
2.125152650915103e27

and when replacing m with the literal 3:

julia> @btime potential($r,$a)
  1.214 ms (0 allocations: 0 bytes)
2.1251526509151058e27

In v0.6, as I said, the difference is more pronounced:

julia> @btime potential($r, $a)
  2.513 ms (0 allocations: 0 bytes)
5.334031316343885e22

and

julia> @btime potential($r, $a)
  1.219 ms (0 allocations: 0 bytes)
5.334031316343885e22

Isn’t this the exact same discussion as “The cost of size() in for-loops”? If the number of iterations is constant, the loop can be unrolled. If you want the compiler to specialize on specific array sizes, use StaticArrays.jl.
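
To make that concrete, a minimal sketch of the difference (mine, not from the reply above): a literal trip count can be unrolled at compile time, while a size read from the array is just a runtime value.

function sum_static(a)
    s = 0.0
    @inbounds for i = 1:3              # compile-time constant trip count: unrollable
        s += a[i, 1]
    end
    return s
end

function sum_dynamic(a)
    s = 0.0
    @inbounds for i = 1:size(a, 1)     # trip count known only at run time
        s += a[i, 1]
    end
    return s
end

Both compute the same sum on a 3-row matrix; only sum_static has a trip count the compiler can see.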

4 Likes

Thank you Stefan, and sorry, I totally overlooked that question. The discussion there is really helpful, but I don’t want to change to SArrays because it would force many code changes. I don’t know if it is a priority now, but I still think this is a missed optimization opportunity. In many of my own benchmarks, Julia outperformed C++ and could even match the performance of ifort when using SArrays. AFAIK, Fortran implements this by default, so I expect Julia to close this tiny gap in a near 1.x release (or so I hope).

Such as? You should be able to write generic code that works with either Arrays or SArrays (or MArrays if they need to be mutable) and automatically get loop unrolling and other fixed-size-array optimizations.
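
For instance, a tiny sketch of container-generic code (assuming StaticArrays.jl is installed; the function name is mine):

using StaticArrays

sumsq(x) = sum(abs2, x)   # nothing here depends on the concrete array type

sumsq(rand(3))            # Vector{Float64}: length known only at run time
sumsq(@SVector rand(3))   # SVector{3,Float64}: length is in the type, loop unrolled
sumsq(@MVector rand(3))   # MVector: statically sized but mutable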

2 Likes

I will have to use an Array of SVectors because n is large, and make many other changes:

r = rand(3,n)
# becomes 
r = [SVector(rand(3)...,) for i=1:n]

a = zeros(3,n)
# becomes 
a = [SVector(zeros(3)...,) for i=1:n]

m, n = size(r)
# becomes
m = size(r[1],1)
n = size(r,1)

dist2 += (r[k,i]-r[k,j])^2
# becomes 
dist2 += (r[i][k]-r[j][k])^2

# The loop
for k = 1:m
  tmp = (r6 - 2r12)/dist2 * (r[k,i]-r[k,j])
  a[k,i] -= tmp
  a[k,j] += tmp
end
# becomes
tmp = (r6 - 2r12)/dist2 .* (r[i] - r[j])
a[i] -= tmp
a[j] += tmp

The code becomes less natural, and in this specific case no performance improvement is obtained.

I see, so the issue is that the array has a static size in one dimension but a dynamic size in another dimension?

Yes, exactly.

It’s worth noting that a Vector{SVector{3, Float64}} with length n has exactly the same memory layout as a Matrix{Float64} of size 3 x n, so you can even reinterpret from one to the other without copying the underlying data:

julia> using StaticArrays

julia> x = rand(3, 5)
3×5 Array{Float64,2}:
 0.734312  0.112768  0.582428  0.225611  0.649733
 0.75691   0.719704  0.613424  0.757309  0.95314 
 0.635051  0.921791  0.94049   0.123226  0.262944

julia> reinterpret(SVector{3, Float64}, x, (size(x, 2),))
5-element Array{StaticArrays.SArray{Tuple{3},Float64,1,3},1}:
 [0.734312, 0.75691, 0.635051] 
 [0.112768, 0.719704, 0.921791]
 [0.582428, 0.613424, 0.94049] 
 [0.225611, 0.757309, 0.123226]
 [0.649733, 0.95314, 0.262944] 

In this way you can pretty seamlessly move from one interpretation of the data to the other without substantial changes in your code.
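
For example, a round-trip sketch (mine; uses the Julia 0.7+ reinterpret API): both views share one buffer, so a write through one is visible through the other.

using StaticArrays

n = 5
v = [@SVector rand(3) for _ in 1:n]          # Vector of SVectors
M = reshape(reinterpret(Float64, v), 3, n)   # 3×n matrix view of the same memory

M[1, 2] = 0.0        # write through the matrix view...
v[2][1] == 0.0       # ...and the SVector view sees it (true)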

5 Likes

I haven’t tested most of your code, but the way you are constructing the SVector arrays is very much suboptimal. For n = 20:

julia> @btime [SVector(rand(3)...,) for i in 1:$n];
  9.866 μs (123 allocations: 5.00 KiB)

Do this instead:

julia> @btime [@SVector rand(3) for i in 1:$n];
  330.255 ns (1 allocation: 576 bytes)

Still slower than rand(3, n), but not as catastrophic as before.

1 Like

Even faster:

reinterpret(SVector{3,Float64}, reshape(rand(3, n), :))

How can they have the same memory layout? In the former case, x[i] yields a vector, in the latter it yields a scalar. So, I cannot avoid code changes?

Thanks, I was using both constructs interchangeably without knowing about that speed difference. Still, what annoys me more is that performance is the same; the compiler couldn’t constant-propagate the size of the array, which is fixed at run time. I’m not able to find a workaround either.
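
One standard workaround sketch (mine, not from the thread; 0.7-style Val syntax): hoist the row count into the type domain with Val through a function barrier, at the cost of one dynamic dispatch per call. Inside the specialized method, the trip count is a compile-time constant.

function potential_sized(r, a)
    m = size(r, 1)
    return _potential(r, a, Val(m))   # one dynamic dispatch here
end

function _potential(r, a, ::Val{M}) where {M}
    n = size(r, 2)
    Epot = 0.0
    @inbounds for i = 1:n-1, j = i+1:n
        dist2 = 0.0
        for k = 1:M                   # M is a compile-time constant
            dist2 += (r[k,i] - r[k,j])^2
        end
        r6 = (1/dist2)^3
        r12 = r6*r6
        Epot += r12 - r6
        for k = 1:M
            tmp = (r6 - 2r12)/dist2 * (r[k,i] - r[k,j])
            a[k,i] -= tmp
            a[k,j] += tmp
        end
    end
    return Epot
end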

> How can they have the same memory layout? In the former case, x[i] yields a vector, in the latter it yields a scalar. So, I cannot avoid code changes?

Look at the example. That vector is the ith column of x.

I’m interested in accessing the array in the same way; for example, this loop will not work with the reinterpreted version:

for k = 1:3
     dist2 += (r[k,i]-r[k,j])^2
end 

There is no such thing as r[k,i]; there is only r[i][k]. This is what I want to avoid. Do I misunderstand something?

This is your example:

julia> x = rand(3, 5)
3×5 Array{Float64,2}:
 0.661841   0.727968  0.0902786  0.535843  0.963917
 0.0925486  0.310896  0.218177   0.157345  0.50067
 0.0927625  0.47004   0.908158   0.098368  0.734819

julia> xx = reinterpret(SVector{3,Float64}, reshape(x, :))
5-element reinterpret(SArray{Tuple{3},Float64,1,3}, ::Array{Float64,1}):
 [0.661841, 0.0925486, 0.0927625]
 [0.727968, 0.310896, 0.47004]
 [0.0902786, 0.218177, 0.908158]
 [0.535843, 0.157345, 0.098368]
 [0.963917, 0.50067, 0.734819]

julia> x[2,3]
0.21817716087561045

julia> xx[2,3]
ERROR: BoundsError: attempt to access 5-element reinterpret(SArray{Tuple{3},Float64,1,3}, ::Array{Float64,1}) at index [2, 3]
Stacktrace:
 [1] throw_boundserror(::Base.ReinterpretArray{SArray{Tuple{3},Float64,1,3},1,Float64,Array{Float64,1}}, ::Tuple{Int64,Int64}) at .\abstractarray.jl:492
 [2] checkbounds at .\abstractarray.jl:457 [inlined]
 [3] _getindex at .\abstractarray.jl:935 [inlined]
 [4] getindex(::Base.ReinterpretArray{SArray{Tuple{3},Float64,1,3},1,Float64,Array{Float64,1}}, ::Int64, ::Int64) at .\abstractarray.jl:913
 [5] top-level scope at none:0

With the reinterpreted array you index whole columns instead, so the inner k loop over dimensions disappears:

rtemp = xx[i] - xx[j]
dist2 += rtemp' * rtemp

They don’t have the same indexing behavior, but they do share the same memory layout, and indeed the very same memory. One behaves as a two-dimensional array, the other as a vector of vectors. But both are represented in memory by the same contiguous collection of floating-point values:

julia> A = rand(3, 5)
3×5 Array{Float64,2}:
 0.826885  0.523742  0.0546594  0.848417  0.678522
 0.308623  0.866344  0.0319207  0.249763  0.93445
 0.667832  0.195743  0.296378   0.996889  0.706275

julia> v = reinterpret(SVector{3,Float64}, reshape(A, :))
5-element reinterpret(SArray{Tuple{3},Float64,1,3}, ::Array{Float64,1}):
 [0.826885, 0.308623, 0.667832]
 [0.523742, 0.866344, 0.195743]
 [0.0546594, 0.0319207, 0.296378]
 [0.848417, 0.249763, 0.996889]
 [0.678522, 0.93445, 0.706275]

julia> A[2,3] *= -1
-0.031920712969890186

julia> A
3×5 Array{Float64,2}:
 0.826885  0.523742   0.0546594  0.848417  0.678522
 0.308623  0.866344  -0.0319207  0.249763  0.93445
 0.667832  0.195743   0.296378   0.996889  0.706275

julia> v
5-element reinterpret(SArray{Tuple{3},Float64,1,3}, ::Array{Float64,1}):
 [0.826885, 0.308623, 0.667832]
 [0.523742, 0.866344, 0.195743]
 [0.0546594, -0.0319207, 0.296378]
 [0.848417, 0.249763, 0.996889]
 [0.678522, 0.93445, 0.706275]

julia> v[3][2]
-0.031920712969890186

julia> pointer(A)
Ptr{Float64} @0x0000000119cebee0

julia> pointer(v)
Ptr{SArray{Tuple{3},Float64,1,3}} @0x0000000119cebee0
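
A quick check of the correspondence, continuing the same session: each element of v is exactly a column of A.

julia> v[3] == A[:, 3]
true
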
2 Likes

Thank you all for the helpful discussion; the SArrays version below is more than 20% faster than the Intel Fortran equivalent. I’m amazed by the performance of Julia with StaticArrays; small code changes paid off very well.

using BenchmarkTools
using StaticArrays
using LinearAlgebra   # for dot (in Base on v0.6; needs this import on v0.7+)

function potential(r, a)
    m, n = size(r)
    Epot = 0.0
    @inbounds for i=1:n-1, j=i+1:n
        dist2 = 0.0
        for k = 1:m
            dist2 += (r[k,i]-r[k,j])^2
        end 
        r6 = (1/dist2)^3
        r12 = r6*r6
        Epot += r12 - r6
        for k = 1:m
            tmp = (r6 - 2r12)/dist2 * (r[k,i]-r[k,j])
            a[k,i] -= tmp
            a[k,j] += tmp
        end
    end
  return Epot
end

n = 550
r = rand(3,n)
a = zeros(3,n)
@btime potential($r,$a)

function potential2(r, a)
    n = size(r, 1)   # r is a Vector of SVectors, so this is the particle count
    Epot = 0.0
    @inbounds for i = 1:n-1, j = i+1:n
        rij = r[i] - r[j]          # whole-column difference, no inner k loop
        dist2 = dot(rij, rij)

        r6 = (1/dist2)^3
        r12 = r6*r6
        Epot += r12 - r6

        tmp = (r6 - 2r12)/dist2 * rij
        a[i] -= tmp
        a[j] += tmp
    end
    return Epot
end

n = 550
r = [@SVector rand(3) for i in 1:n]
a = [@SVector zeros(3) for i in 1:n]
@btime potential2($r,$a)

and the timings:

  1.471 ms (0 allocations: 0 bytes)  # Original Julia code
  1.010 ms (0 allocations: 0 bytes)  # Using StaticArrays
  1.234 ms                           # Intel Fortran equivalent
7 Likes

I wonder, in C wouldn’t the compiler be able to do some loop unrolling even if the number of iterations is a parameter of the function?

Julia does it too, but it can unroll by too much. For example, in a simple dot product:

function dot_product(x::AbstractArray{T},y::AbstractArray{T}) where T
    @boundscheck length(x) == length(y) || throw(BoundsError())
    out = zero(T)
    @fastmath @inbounds for i ∈ eachindex(x)
        out += x[i] * y[i]
    end
    out
end

It unrolls by a factor of 16 on a computer with AVX2 (it’d be 32 given AVX-512):

x = randn(1);
y = randn(1);
@code_native dot_product(x,y)

Gives output including:

; Function getindex; {
; Location: array.jl:731
L160:
	vmovupd	(%rdx,%rcx,8), %ymm4
	vmovupd	32(%rdx,%rcx,8), %ymm5
	vmovupd	64(%rdx,%rcx,8), %ymm6
	vmovupd	96(%rdx,%rcx,8), %ymm7
;}
; Function add_fast; {
; Location: fastmath.jl:161
	vfmadd231pd	(%rax,%rcx,8), %ymm4, %ymm0
	vfmadd231pd	32(%rax,%rcx,8), %ymm5, %ymm1
	vfmadd231pd	64(%rax,%rcx,8), %ymm6, %ymm2
	vfmadd231pd	96(%rax,%rcx,8), %ymm7, %ymm3
	addq	$16, %rcx
	cmpq	%rcx, %r8
	jne	L160
	vaddpd	%ymm0, %ymm1, %ymm0
	cmpq	%r8, %r9
	vaddpd	%ymm0, %ymm2, %ymm0
	vaddpd	%ymm0, %ymm3, %ymm0
	vextractf128	$1, %ymm0, %xmm1
	vaddpd	%ymm1, %ymm0, %ymm0
	vhaddpd	%ymm0, %ymm0, %ymm0
;}

This is the for loop.
The block following L160: moves data from memory (e.g., (%rdx,%rcx,8)) into registers (e.g., %ymm4). Registers starting with "y" are 256 bits wide, i.e., 4 doubles each. It does this 4 times, for 16 doubles total.
With AVX-512, the "y"s would be "z"s, and that would be 32 numbers total.

Next up are four vfmadd231pd instructions: each performs a vector fused multiply-add, multiplying values from memory with the registers just loaded and accumulating the results into registers ymm0 through ymm3.
Then jne L160 conditionally jumps back to L160 to repeat the process.

The problem is, if the loop is shorter than 16, it doesn’t get unrolled at all.

julia> using BenchmarkTools

julia> x15 = randn(15); y15 = randn(15);

julia> x16 = randn(16); y16 = randn(16);

julia> @benchmark dot_product($x15, $y15)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.567 ns (0.00% GC)
  median time:      9.628 ns (0.00% GC)
  mean time:        9.648 ns (0.00% GC)
  maximum time:     23.377 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark dot_product($x16, $y16)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.540 ns (0.00% GC)
  median time:      5.580 ns (0.00% GC)
  mean time:        5.636 ns (0.00% GC)
  maximum time:     20.368 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> x8 = randn(8); y8 = randn(8);

julia> @benchmark dot_product($x8, $y8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.510 ns (0.00% GC)
  median time:      5.580 ns (0.00% GC)
  mean time:        5.626 ns (0.00% GC)
  maximum time:     19.136 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Basically, it unrolls to optimize for long loops. If you know the loops are short, it helps to pass along that information.
With SVectors (which are amazingly faster), it does speed up for vectors of length 8 vs 16:

julia> sx16 = @SVector randn(16);

julia> sy16 = @SVector randn(16);

julia> @benchmark dot_product($sx16, $sy16)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.616 ns (0.00% GC)
  median time:      3.647 ns (0.00% GC)
  mean time:        3.865 ns (0.00% GC)
  maximum time:     22.252 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> sx8 = @SVector randn(8);

julia> sy8 = @SVector randn(8);

julia> @benchmark dot_product($sx8, $sy8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.915 ns (0.00% GC)
  median time:      2.955 ns (0.00% GC)
  mean time:        3.105 ns (0.00% GC)
  maximum time:     24.025 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Not that much faster though, which I think is why LLVM normally unrolls by 4 (instructions aren’t executed one at a time).
Length 15 is slower, I guess because it doesn’t divide neatly into chunks of 4:

julia> sx15 = @SVector randn(15);

julia> sy15 = @SVector randn(15);

julia> @benchmark dot_product($sx15, $sy15)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.548 ns (0.00% GC)
  median time:      4.589 ns (0.00% GC)
  mean time:        4.830 ns (0.00% GC)
  maximum time:     22.592 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
1 Like

Hello,

just wanted to chime in. I am the starter of “The cost of size() in for-loops”. What a coincidence.

This is a nice opportunity for me to learn how to use static arrays correctly! I would have mostly the same questions; we also have some Lennard-Jones 12-6 in the mix.

Best Regards

Christof

I’m not an expert, but I’d have thought this would be an advantage of Julia: given a loop, since the data is abstracted and Julia knows more about it than a regular compiler does, it could maximize performance.

For instance, I don’t see why, in the case of a dot product, using static arrays would have any advantage over regular arrays.
At the beginning of the loop, Julia’s knowledge should be the same (the length of the loop, and the size and type of the arrays).
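
As far as I understand it (a sketch of my reading, reusing the dot_product defined above): for a plain Vector the length is a runtime value stored in the object, so the compiled loop needs runtime trip-count checks and a remainder pass; for an SVector the length is a type parameter, so each specialization is compiled for one fixed length and can be unrolled exactly.

using StaticArrays

v  = randn(8)            # Vector{Float64}: length stored in the object, read at run time
sv = @SVector randn(8)   # SVector{8,Float64}: the length 8 is part of the type

dot_product(v, v)    # same generic method; trip count known only at run time
dot_product(sv, sv)  # specialization compiled for exactly 8 iterations, fully unrollable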