How to achieve the greatest speed with nested loops?

I have a large three-dimensional array X and wish to compute the sum over i of X[i,j,m] * X[i,t,m] producing a three-dimensional array Y. I’ve tried the following (my arrays will be larger than this).

```julia
using BenchmarkTools

function one(X::Array{Float64,3}, Y::Array{Float64,3})
    Y[:,:,:] = [ sum(X[:,j,m] .* X[:,t,m]) for j = 1:10, t = 1:10, m = 1:10 ]
end
function two(X::Array{Float64,3}, Y::Array{Float64,3})
    Y[:,:,:] = [ sum(X[i,j,m] * X[i,t,m] for i = 1:100) for j = 1:10, t = 1:10, m = 1:10 ]
end
function three(X::Array{Float64,3}, Y::Array{Float64,3})
    @views Y[:,:,:] = [ sum(X[:,j,m] .* X[:,t,m]) for j = 1:10, t = 1:10, m = 1:10 ]
end
function four(X::Array{Float64,3}, Y::Array{Float64,3})
    for j = 1:10
        for t = 1:10
            for m = 1:10
                Y[j,t,m] = sum(X[i,j,m] * X[i,t,m] for i = 1:100)
            end
        end
    end
end

function runme()
    Y = zeros(10,10,10)
    X = randn(100,10,10)
    Y .= 0.0
    @btime one($X, $Y)
    Y .= 0.0
    @btime two($X, $Y)
    Y .= 0.0
    @btime three($X, $Y)
    Y .= 0.0
    @btime four($X, $Y)
end

runme()
```

The times produced are

```
230.804 μs (3003 allocations: 2.57 MiB)
67.941 μs (3 allocations: 8.02 KiB)
68.552 μs (3003 allocations: 1008.02 KiB)
63.933 μs (0 allocations: 0 bytes)
```

Can I improve on these times?

You should check the output of `five()`, I don’t think that is doing what you think it is doing.

And the order of iteration in `four()` is backwards. Julia arrays are column-major, so you want `m` in the outermost loop and the first index, `j`, in the innermost loop.
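For column-major arrays that suggestion looks like the sketch below (the name `four_reordered!` is made up for illustration; it assumes the same 100×10×10 `X` and 10×10×10 `Y` as above):

```julia
# Sketch: the same computation as `four`, but with `m` outermost and the
# first index of `Y` (`j`) innermost, so writes to `Y` walk memory in
# column-major order. The inner sum over `i` already runs along `X`'s
# contiguous first axis, so only the outer three loops change.
function four_reordered!(Y, X)
    for m = 1:10
        for t = 1:10
            for j = 1:10
                Y[j,t,m] = sum(X[i,j,m] * X[i,t,m] for i = 1:100)
            end
        end
    end
    return Y
end
```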

And you should be passing `X` and `Y` to these functions instead of using them as global variables.

You should fix up some of these issues and rerun the benchmarks.

See here especially: https://docs.julialang.org/en/v1/manual/performance-tips/


Aha. I checked whether Y and Z were the same between the last two runs, but forgot to reset Y prior to running the last; thanks. Indeed not the same.

Modifying 3, 4, 5 to take arguments, and 5 to use Iterators.product, those three are remarkably close (at least on Julia 1.5). Switching the loop order is worth about 15%, and going all out is worth about 15×…

```julia
julia> @btime three!($Y1, $X);
102.077 μs (1001 allocations: 882.94 KiB)

julia> @btime four!($Y2, $X);
95.549 μs (0 allocations: 0 bytes)

julia> @btime five!($Y3, $X);
99.836 μs (0 allocations: 0 bytes)

julia> Y1 ≈ Y2 ≈ Y3
true

julia> @btime six!($Y2, $X);
87.453 μs (0 allocations: 0 bytes)

julia> @btime seven!($Y3, $X);
5.279 μs (7 allocations: 256 bytes)

julia> Y1 ≈ Y2 ≈ Y3
true
```

with these definitions:

```julia
function three!(Y, X)
    @views Y[:,:,:] = [ sum(X[:,j,m] .* X[:,t,m]) for j = 1:10, t = 1:10, m = 1:10 ]
end
function four!(Y, X)
    @inbounds for j = 1:10
        for t = 1:10
            for m = 1:10
                Y[j,t,m] = sum(X[i,j,m] * X[i,t,m] for i = 1:100)
            end
        end
    end
end
function five!(Y, X)
    @inbounds for (j,t,m) in Iterators.product(1:10, 1:10, 1:10)
        Y[j,t,m] = sum(X[i,j,m] * X[i,t,m] for i = 1:100)
    end
end
X = randn(100,10,10);
Y = zeros(10,10,10); Y1 = similar(Y); Y2 = similar(Y); Y3 = similar(Y);

using Einsum, Tullio, LoopVectorization
six!(Y, X) = @einsum Y[j,t,m] = X[i,j,m] * X[i,t,m]
seven!(Y, X) = @tullio Y[j,t,m] = X[i,j,m] * X[i,t,m]
```

(In fact using `@einsimd` gets you to 12.303 μs, making this a case where `@simd` actually changes things.)
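For reference, the `@einsimd` version differs from `six!` only in the macro used (the name `six_simd!` below is made up for illustration):

```julia
using Einsum  # provides both @einsum and @einsimd

# Same contraction as six!, but @einsimd additionally wraps the
# generated innermost loop in @simd.
six_simd!(Y, X) = @einsimd Y[j,t,m] = X[i,j,m] * X[i,t,m]
```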


On my machine, `@einsimd` is about 1.5 times slower than `@tullio`, except when I run a much larger problem 1_000 times on 32 physical cores at once (via `pmap`), in which case `@einsimd` wins by about 15% (16.64 seconds versus 19.11 seconds).


Excellent. Note that `@tullio` is multi-threaded by default (if it thinks your arrays are large enough), it’s possible that this doesn’t play well with `pmap`, perhaps that’s what you are seeing? There is also a little overhead to setting that up, most of which is avoided by `@tullio threads=false` (which would take `seven!` to 0 allocations).


Ha, I had set `JULIA_NUM_THREADS` to 1, but I guess Tullio does its own thing. Yes, `@tullio` is down to 14 seconds with `threads=false`. Wow!


Note that if your problem is larger, it’d help to reorder the memory layout of `X`.

```julia
julia> using Tullio, LoopVectorization

julia> eight!(Y, X) = @tullio Y[j,t,m] = X[i,j,m] * X[i,t,m] threads=false

julia> Y = zeros(20,20,20); Y1 = similar(Y); Y2 = similar(Y); Y3 = similar(Y);

julia> X = randn(100,20,20);

julia> Xp = PermutedDimsArray(permutedims(X,(2,1,3)),(2,1,3));

julia> @btime eight!($Y2, $X);
26.053 μs (0 allocations: 0 bytes)

julia> @btime eight!($Y3, $Xp);
16.583 μs (0 allocations: 0 bytes)

julia> Y2 ≈ Y3
true

julia> 2e-9 * length(Y) * size(X,1) / 16.583e-6 # calculate GFLOPS
96.48435144425015
```

This requires the first actual dimension in memory to be large enough for good vectorization. On my computer, 20 is big enough, but 10 is not.

`permutedims(X, (2,1,3))` creates a copy that swaps the first two dimensions of `X`, while `PermutedDimsArray(A, (2,1,3))` creates a lazy wrapper of this copy, pretending the first two axes are switched back so that we can reuse our old code without modification.
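A small sketch of what that round trip buys: the logical indices are unchanged, only the stride order in memory differs (the sizes here are just for illustration):

```julia
X  = randn(100, 10, 10)
# Copy with axes 1 and 2 swapped, then lazily swap them back:
Xp = PermutedDimsArray(permutedims(X, (2,1,3)), (2,1,3))

X[5, 3, 7] == Xp[5, 3, 7]   # same logical indexing
size(X) == size(Xp)         # same shape
strides(X)                  # (1, 100, 1000): first axis contiguous
strides(Xp)                 # (10, 1, 1000): second axis contiguous
```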


Thanks! Curiously, on my machine the permuted version is marginally slower for the sizes in your example, but if I replace the 20s with 50s then I do indeed see the speed-up that you’re referring to, though not nearly as large and it vanishes if I then increase the dimension being summed over. Btw: shouldn’t it be Y2==Y3?

It helps if the vectorized dimension, `X`'s first, is a multiple of your computer's SIMD vector width. You should see good performance with 24, for example.

```julia
julia> Y1 = zeros(24,24,24); Y2 = similar(Y1);

julia> X1 = randn(100,24,24);

julia> X2 = PermutedDimsArray(permutedims(X1,(2,1,3)),(2,1,3));

julia> @btime eight!($Y1, $X1);
47.458 μs (0 allocations: 0 bytes)

julia> @btime eight!($Y2, $X2);
22.203 μs (0 allocations: 0 bytes)

julia> 2e-9 * length(Y1) * size(X1,1) ./ (47.458e-6, 22.203e-6)
(58.25782797420877, 124.52371301175516)
```

Just making the dimension bigger would help, in that the remainder would be a smaller % of the overall size, so 50 may be better than 20.

There are a lot of things at play. If we just make the first axis of `X` a lot bigger, the permuted version slows down by a lot:

```julia
julia> X1 = randn(1000,24,24);

julia> X2 = PermutedDimsArray(permutedims(X1,(2,1,3)),(2,1,3));

julia> @btime eight!($Y1, $X1);
457.149 μs (0 allocations: 0 bytes)

julia> @btime eight!($Y2, $X2);
327.771 μs (0 allocations: 0 bytes)

julia> 2e-9 * length(Y1) * size(X1,1) ./ (457.149e-6, 327.771e-6)
(60.47918731092051, 84.35157472747741)
```

This is because it moves across that axis very quickly, and starts starving for memory bandwidth. That axis is also no longer contiguous, which makes moving across it quickly worse than it is for the non-permuted form.

It’s not that bad yet on my computer with the 24×24 or even 48×48 sizes. But 50 adds a few more problems:

```julia
julia> Y1 = zeros(50,50,50); Y2 = similar(Y1);

julia> X1 = randn(1000,50,50);

julia> X2 = PermutedDimsArray(permutedims(X1,(2,1,3)),(2,1,3));

julia> @btime eight!($Y1, $X1);
3.299 ms (0 allocations: 0 bytes)

julia> @btime eight!($Y2, $X2);
3.424 ms (0 allocations: 0 bytes)

julia> 2e-9 * length(Y1) * size(X1,1) ./ (3.299e-3, 3.424e-3)
(75.78053955744167, 73.0140186915888)
```

Those problems are memory alignment, that the axes it has to sum across are now further apart (separated by 50 instead of 24 elements), and that the remainder `50 % vector_width` is larger than it is for 24 or 48 on my computer.

For the permuted example, memory alignment isn’t normally that big a deal. But it is for the unpermuted case, where you need to quickly load vectors along the contiguous axis (this is slower on most computers when memory isn’t aligned, which happens for multidimensional arrays when the first axis isn’t a multiple of 64 bytes).
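One way to get alignment back is to zero-pad the first axis up to a multiple of 8 `Float64`s (64 bytes). The helper below is a hypothetical sketch, not from this thread; conveniently, for this particular contraction the zero rows contribute nothing to the sum over `i`, so summing over the full padded axis still gives the right answer:

```julia
# Hypothetical helper: copy A into an array whose first dimension is
# rounded up to the next multiple of 8 Float64s (64 bytes), so every
# column starts on a 64-byte-friendly boundary. Padding rows are zeros.
function pad_first_axis(A::Array{Float64,3})
    n, nj, nm = size(A)
    npad = 8 * cld(n, 8)    # round n up to a multiple of 8
    B = zeros(npad, nj, nm)
    B[1:n, :, :] .= A
    return B
end
```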
If we make the summed-over axis 999 elements instead of 1000, look how much slower it gets:

```julia
julia> X1 = randn(999,50,50);

julia> X2 = PermutedDimsArray(permutedims(X1,(2,1,3)),(2,1,3));

julia> @btime eight!($Y1, $X1);
4.763 ms (0 allocations: 0 bytes)

julia> @btime eight!($Y2, $X2);
3.462 ms (0 allocations: 0 bytes)

julia> 2e-9 * length(Y1) * size(X1,1) ./ (4.763e-3, 3.462e-3)
(52.43543984883477, 72.14038128249567)
```

Floating point math is not associative, so I check approximate instead of exact equality to allow for rounding differences.
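A one-line illustration of that non-associativity:

```julia
(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)   # false: the two orders round differently
(0.1 + 0.2) + 0.3 ≈  0.1 + (0.2 + 0.3)   # true: they agree to within rounding
```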

Thanks @Elrod. That’s great!