# Speed up multidimensional code without Base.Cartesian

Hello, I wonder if someone could help me speeding up my arbitrary dimension code to run as fast as the explicit version but without using `Base.Cartesian`

Consider the following examples

``````using Random
using BenchmarkTools

@inline bc(x, L) = x < L ? x + 1 : 1

function energy_explicit(Λ::AbstractArray{T,3}) where T
L = size(Λ, 1)
E = 0
@simd for r ∈ CartesianIndices(Λ)
x, y, z = Tuple(r)
@inbounds E += Λ[x, y, z] * (Λ[bc(x, L), y, z] + Λ[x, bc(y, L), z] + Λ[x, y, bc(z, L)])
end
return -E
end

function energy_arbitrary(Λ::AbstractArray{T,N}) where {T,N}
E = 0
@simd for r ∈ CartesianIndices(Λ)
@inbounds E += Λ[r] * sum(Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...] for d ∈ 1:N)
end
return -E
end
``````

On my machine I get the following results

``````julia> Λ = rand(Int8[-1,1], 64, 64, 64);

julia> @btime energy_explicit(\$Λ)
252.054 μs (0 allocations: 0 bytes)
-192

julia> @btime energy_arbitrary(\$Λ)
498.445 μs (0 allocations: 0 bytes)
-192
``````

I know that I could use `Base.Cartesian` and explicitly generate the expression from `energy_explicit` but can someone come up with a more simple solution?

I think what you want to do is make `Λ` be an `Array{StaticVector{T,L},N-1}` which will make the `ntuple` construction type stable.

Could you explain how to use it exactly and the second question is: why is ntuple not type stable in the first place? I mean, the `N` is known from dispatch…

For comparison reasons, here the generated version:

``````@generated function energy_generated(Λ::AbstractArray{T,N}) where {T,N}
quote
E = 0
@simd for r ∈ CartesianIndices(Λ)
E += @inbounds Λ[r] * Base.Cartesian.@ncall(\$N, +, d -> Base.Cartesian.@nref(\$N, Λ, x -> (x == d ? bc(r[x], size(Λ, x)) : r[x])))
end
return -E
end
end
``````
``````julia> @btime energy_generated(\$Λ)
252.085 μs (0 allocations: 0 bytes)
564
``````

I think you want to replace this with an `ntuple`.

``````function energy_arbitrary_ntuple(Λ::AbstractArray{T,N}) where {T,N}
E = 0
@simd for r ∈ CartesianIndices(Λ)
@inbounds E += @inbounds Λ[r] * (+)(ntuple(d -> Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...], N)...)
end
return -E
end
``````
``````julia> @btime energy_arbitrary_ntuple(\$Λ)
498.140 μs (0 allocations: 0 bytes)
564
``````

What if you write a custom (simple) sum function? Base sum is not the fastest, and it does not specialize for N either (AFAIK).

Following @lmiq’s suggestion, I was able to recover the same performance of the explicit version by defining a sum function that specialises on the dimension `N`:

``````@inline nsum(f::F, ::Val{N}) where {F,N} = sum(ntuple(f, Val(N)))

function energy_nsum(Λ::AbstractArray{T,N}) where {T,N}
E = 0
@simd for r ∈ CartesianIndices(Λ)
@inbounds E += Λ[r] *
nsum(Val(N)) do d
@inbounds Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...]
end
end
return -E
end
``````
``````ulia> @btime energy_explicit(\$Λ)
263.647 μs (0 allocations: 0 bytes)
-832

julia> @btime energy_nsum(\$Λ)
263.547 μs (0 allocations: 0 bytes)
-832
``````
1 Like

I think the issue with this solution is that the `@inbounds` is not propagated to the interior of the `ntuple`s. Otherwise, it should be exactly equivalent to my solution above.

If I throw in some additional `@inbounds` (and rewrite it to make it a little easier to read), I can also achieve the performance of the explicit version:

``````function energy_arbitrary_ntuple(Λ::AbstractArray{T,N}) where {T,N}
E = 0
@simd for r ∈ CartesianIndices(Λ)
E += @inbounds Λ[r] * (+)(ntuple(N) do d
inds = ntuple(N) do i
@inbounds i == d ? bc(r[i], size(Λ, i)) : r[i]
end
@inbounds Λ[inds...]
end...)
end
return -E
end

``````
``````julia> @btime energy_arbitrary_ntuple(\$Λ)
262.834 μs (0 allocations: 0 bytes)
-832
``````
1 Like

The performance difference of the original version comes from:

1. boundcheck (@inbounds won’t work if used outside the generator)
2. the length of generator is unknow.

The following modified version has a similar performance.

``````function energy_arbitrary_faster(Λ::AbstractArray{T,N}) where {T,N}
E = 0
@inbounds @simd for r ∈ CartesianIndices(Λ)
E += Λ[r] * sum((@inbounds Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...]) for d in ntuple(identity, N))
end
return -E
end
``````
1 Like

The fact that `bc` makes the indexing unpredictable is also about a factor of 2 effect, for me. Can you duplicate the first row/col/etc of `Λ` as the last+1? Otherwise, you can deal with the edge afterwards. Not hard to write for `N=2`, more complicated for N=3… in general you could try TiledIteration.EdgeIterator although I haven’t timed it here.

It specializes/inlines if you pass a `tuple` (whose length is known at compile-time) instead of a generator.

1 Like

Another nice option is HybridArrays.jl which are basically what you suggest, but pretend to be `N` dimensional, with some of the dimensions statically sized.

1 Like

Note that this is a perfect case for using ghost cells. Just make your array `Λ` bigger by one element in each direction, and copy the periodic boundaries before doing any “stencil” calculations.

Not only is this much more flexible and tends to lead to cleaner code (it separates your boundary conditions from your other looping code), but it is also significantly faster because you eliminate all of the `?` conditionals in your inner loops.

In particular, the following code gives the same answer as `energy_explicit` but uses ghost cells:

``````Λghost = similar(Λ, size(Λ) .+ 1);
Λghost[CartesianIndices(Λ)] = Λ;
function periodicboundaries!(Λghost::AbstractArray{<:Any,N}) where {N}
sz = size(Λghost)
for d = 1:N
from = CartesianIndices(ntuple(i -> i == d ? (1:1) : (1:sz[i]), N))
to = CartesianIndices(ntuple(i -> i == d ? (sz[i]:sz[i]) : (1:sz[i]), N))
for (i,j) in zip(from,to)
Λghost[j] = Λghost[i]
end
end
return Λghost
end

function energy_arbitrary_faster_ghost(Λ::AbstractArray{T,N}) where {T,N}
E = 0 + zero(T) # promote to at least Int
@inbounds @simd for r ∈ CartesianIndices(ntuple(i -> 1:size(Λ,i)-1, N))
E += Λ[r] * sum(ntuple(d -> @inbounds(Λ[ntuple(i -> i == d ? r[i]+1 : r[i], N)...]), N))
end
return -E
end

@btime periodicboundaries!(\$Λghost)
@btime energy_arbitrary_faster_ghost(\$Λghost)
``````

giving

``````  57.329 μs (0 allocations: 0 bytes)
91.855 μs (0 allocations: 0 bytes)
``````

vs 337µm for `energy_explicit` on my machine. So, even if you count the cost of copying over the periodic boundaries (which only needs to be done once whenever you update the array) it is more than 2× faster. If you don’t count the time for `periodicboundaries!`, then it is almost 4× faster.

PS. Note that I eliminated the generator completely, compared to @N5N3’s solution.

5 Likes

Yeah, that is a good point. Thank you. I have thought about `@inbounds` but about it inside `ntuple`.

This made my day. I was unaware that `1:N` and `ntuple(identity, N) can have different performance in a for-loop. Is it always the case, because that would make it very sensitive with respect to for loops in all such similar cases.

Yes, there are even more options like precalculate the boundary indices and so on. But here I would like to have the most straight-forward but fast implementation.

Thanks, that sounds great. I will inspect it more soon.

Thank all the answers. I think it answers all my questions by now.

Im principle it is a great thing. However, for a LxLxL = V lattice I usualy have V updates of individul sites before calculating the energy. So, it is still advisable to not copy the updates. But thanks for the great solution at this point.

Unfortunately, it is still slower on my machine… ~ 370 μs

Ok, so the culpit here was the`sum` function. I achieve the same performance of 250 μs with this one, which uses a normal `for` loop

``````function energy_arbitrary(Λ::AbstractArray{T,N}) where {T,N}
E = 0
@simd for r ∈ CartesianIndices(Λ)
E += @inbounds Λ[r] * (+)((@inbounds Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...] for d ∈ 1:N)...)
end
return -E
end
``````

Well, I looked into the generated LLVM IR.
Both `sum(x)` (my version) and `+(x...)` (your version) is unrolled by the optimizer.
The difference is: `sum(x::Generator)` call `mapreduce(identity, add_sum, x)` to avoid overflow, while `+(x...)` not.
If you replace `sum(x)` with `reduce(+,x)`, the performance should be exactly same.

1 Like

Ah, thanks. That sounds intuitive 