Speed up multidimensional code without Base.Cartesian

Hello, I wonder if someone could help me speed up my arbitrary-dimension code to run as fast as the explicit version, but without using Base.Cartesian.

Consider the following examples:

using Random 
using BenchmarkTools 

@inline bc(x, L) = x < L ? x + 1 : 1  # periodic boundary: next index along a dimension of length L, wrapping back to 1

function energy_explicit(Λ::AbstractArray{T,3}) where T
	L = size(Λ, 1)
	E = 0
	@simd for r ∈ CartesianIndices(Λ) 
		x, y, z = Tuple(r)
		@inbounds E += Λ[x, y, z] * (Λ[bc(x, L), y, z] + Λ[x, bc(y, L), z] + Λ[x, y, bc(z, L)])
	end
	return -E
end

function energy_arbitrary(Λ::AbstractArray{T,N}) where {T,N}
	E = 0
	@simd for r ∈ CartesianIndices(Λ) 
		@inbounds E += Λ[r] * sum(Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...] for d ∈ 1:N)
	end
	return -E
end

On my machine I get the following results:

julia> Λ = rand(Int8[-1,1], 64, 64, 64);

julia> @btime energy_explicit($Λ)
  252.054 μs (0 allocations: 0 bytes)
-192

julia> @btime energy_arbitrary($Λ)
  498.445 μs (0 allocations: 0 bytes)
-192

I know that I could use Base.Cartesian and explicitly generate the expression from energy_explicit, but can someone come up with a simpler solution?

I think what you want to do is make Λ an Array{StaticVector{L,T},N-1}, which will make the ntuple construction type stable.

Could you explain how to use it exactly? And a second question: why is ntuple not type stable in the first place? I mean, N is known from dispatch…

For comparison, here is the generated version:

@generated function energy_generated(Λ::AbstractArray{T,N}) where {T,N}
	quote
		E = 0
		@simd for r ∈ CartesianIndices(Λ) 
			E += @inbounds Λ[r] * Base.Cartesian.@ncall($N, +, d -> Base.Cartesian.@nref($N, Λ, x -> (x == d ? bc(r[x], size(Λ, x)) : r[x])))
		end
		return -E
	end
end

julia> @btime energy_generated($Λ)
  252.085 μs (0 allocations: 0 bytes)
564

I think you want to replace this with an ntuple.

function energy_arbitrary_ntuple(Λ::AbstractArray{T,N}) where {T,N}
	E = 0
	@simd for r ∈ CartesianIndices(Λ) 
		E += @inbounds Λ[r] * (+)(ntuple(d -> Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...], N)...)
	end
	return -E
end

julia> @btime energy_arbitrary_ntuple($Λ)
  498.140 μs (0 allocations: 0 bytes)
564

What if you write a custom (simple) sum function? Base sum is not the fastest, and it does not specialize for N either (AFAIK).

Following @lmiq’s suggestion, I was able to recover the same performance as the explicit version by defining a sum function that specialises on the dimension N:

@inline nsum(f::F, ::Val{N}) where {F,N} = sum(ntuple(f, Val(N)))

function energy_nsum(Λ::AbstractArray{T,N}) where {T,N}
    E = 0
    @simd for r ∈ CartesianIndices(Λ)
        @inbounds E += Λ[r] *
            nsum(Val(N)) do d
                @inbounds Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...]
            end
    end
    return -E
end

julia> @btime energy_explicit($Λ)
  263.647 μs (0 allocations: 0 bytes)
-832

julia> @btime energy_nsum($Λ)
  263.547 μs (0 allocations: 0 bytes)
-832

I think the issue with this solution is that the @inbounds is not propagated to the interior of the ntuples. Otherwise, it should be exactly equivalent to my solution above.

If I throw in some additional @inbounds (and rewrite it to make it a little easier to read), I can also achieve the performance of the explicit version:

function energy_arbitrary_ntuple(Λ::AbstractArray{T,N}) where {T,N}
    E = 0
    @simd for r ∈ CartesianIndices(Λ) 
        E += @inbounds Λ[r] * (+)(ntuple(N) do d
                inds = ntuple(N) do i
                    @inbounds i == d ? bc(r[i], size(Λ, i)) : r[i]
                end
                @inbounds Λ[inds...]
        end...)
    end
    return -E
end

julia> @btime energy_arbitrary_ntuple($Λ)
  262.834 μs (0 allocations: 0 bytes)
-832

The performance difference of the original version comes from:

  1. bounds checking (@inbounds won’t work if used outside the generator)
  2. the length of the generator being unknown at compile time.

The following modified version has similar performance.

function energy_arbitrary_faster(Λ::AbstractArray{T,N}) where {T,N}
	E = 0
	@inbounds @simd for r ∈ CartesianIndices(Λ)
		E += Λ[r] * sum((@inbounds Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...]) for d in ntuple(identity, N))
	end
	return -E
end

The fact that bc makes the indexing unpredictable also has about a factor-of-2 effect for me. Can you duplicate the first row/col/etc. of Λ as the last+1? Otherwise, you can deal with the edge afterwards. Not hard to write for N=2, more complicated for N=3… In general you could try TiledIteration.EdgeIterator, although I haven’t timed it here.
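
For concreteness, duplicating the edges for N=2 might look roughly like this (with_wrapped_edges is just an illustrative name, not something from a package):

function with_wrapped_edges(Λ::AbstractMatrix)
    # Copy Λ into an (Lx+1)×(Ly+1) array and duplicate the first row/column
    # past the opposite edge, so bc disappears from the hot loop.
    Lx, Ly = size(Λ)
    Λpad = similar(Λ, Lx + 1, Ly + 1)
    Λpad[1:Lx, 1:Ly] .= Λ
    Λpad[Lx+1, 1:Ly] .= @view Λ[1, :]   # first row wraps past the last
    Λpad[1:Lx, Ly+1] .= @view Λ[:, 1]   # first column wraps past the last
    Λpad[Lx+1, Ly+1] = Λ[1, 1]          # corner
    return Λpad
end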

It specializes/inlines if you pass a tuple (whose length is known at compile-time) instead of a generator.
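
A toy illustration of the difference (sumsq_tuple and sumsq_gen are just illustrative names):

sumsq_tuple(::Val{N}) where {N} = sum(ntuple(d -> d^2, Val(N)))  # length is in the type: fully unrolled
sumsq_gen(N) = sum(d^2 for d in 1:N)                             # generator: length only known at run time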


Another nice option is HybridArrays.jl which are basically what you suggest, but pretend to be N dimensional, with some of the dimensions statically sized.
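
If I remember the API correctly, usage looks roughly like this (the static sizes go in a Tuple, with Dynamic() marking run-time dimensions):

using HybridArrays, StaticArrays

# Rough sketch: first dimension statically 64, the rest dynamic
Λh = HybridArray{Tuple{64,StaticArrays.Dynamic(),StaticArrays.Dynamic()}}(rand(Int8[-1, 1], 64, 64, 64))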


Note that this is a perfect case for using ghost cells. Just make your array Λ bigger by one element in each direction, and copy the periodic boundaries before doing any “stencil” calculations.

Not only is this much more flexible and does it tend to lead to cleaner code (it separates your boundary conditions from your other looping code), but it is also significantly faster, because you eliminate all of the ? conditionals in your inner loops.

In particular, the following code gives the same answer as energy_explicit but uses ghost cells:

Λghost = similar(Λ, size(Λ) .+ 1);
Λghost[CartesianIndices(Λ)] = Λ;
function periodicboundaries!(Λghost::AbstractArray{<:Any,N}) where {N}
    sz = size(Λghost)
    for d = 1:N
        from = CartesianIndices(ntuple(i -> i == d ? (1:1) : (1:sz[i]), N))
        to = CartesianIndices(ntuple(i -> i == d ? (sz[i]:sz[i]) : (1:sz[i]), N))
        for (i,j) in zip(from,to)
            Λghost[j] = Λghost[i]
        end
    end
    return Λghost
end

function energy_arbitrary_faster_ghost(Λ::AbstractArray{T,N}) where {T,N}
	E = 0 + zero(T) # promote to at least Int
	@inbounds @simd for r ∈ CartesianIndices(ntuple(i -> 1:size(Λ,i)-1, N))
		E += Λ[r] * sum(ntuple(d -> @inbounds(Λ[ntuple(i -> i == d ? r[i]+1 : r[i], N)...]), N))
	end
	return -E
end

@btime periodicboundaries!($Λghost)
@btime energy_arbitrary_faster_ghost($Λghost)

giving

  57.329 μs (0 allocations: 0 bytes)
  91.855 μs (0 allocations: 0 bytes)

vs 337 µs for energy_explicit on my machine. So, even if you count the cost of copying over the periodic boundaries (which only needs to be done once whenever you update the array), it is more than 2× faster (57 µs + 92 µs ≈ 149 µs vs 337 µs). If you don’t count the time for periodicboundaries!, then it is almost 4× faster (337/92 ≈ 3.7).

PS. Note that I eliminated the generator completely, compared to @N5N3’s solution.


Yeah, that is a good point, thank you. I had thought about @inbounds, but not about putting it inside the ntuple.

This made my day. I was unaware that 1:N and ntuple(identity, N) can have different performance in a for loop. Is that always the case? Because that would make it very sensitive with respect to for loops in all similar cases.

Yes, there are even more options, like precalculating the boundary indices and so on. But here I would like the most straightforward yet fast implementation.

Thanks, that sounds great. I will inspect it more soon.

Thanks for all the answers. I think they answer all my questions by now.

In principle it is a great thing. However, for an L×L×L = V lattice I usually have V updates of individual sites before calculating the energy. So it is still advisable not to copy the updates. But thanks for the great solution at this point.
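
For reference, keeping the ghost layer consistent under single-site updates would mean mirroring each flip into every ghost image, which is extra bookkeeping on every update. A rough sketch (flipsite! is a hypothetical helper):

function flipsite!(Λghost::AbstractArray{<:Any,N}, r::CartesianIndex{N}) where {N}
    # r must be an interior (non-ghost) site; mirror the new value into every
    # ghost image of r along the dimensions where r sits on the first slice
    # (covers faces, edges, and corners).
    sz = size(Λghost)
    Λghost[r] = -Λghost[r]
    v = Λghost[r]
    for mask in CartesianIndices(ntuple(_ -> 0:1, N))
        j = CartesianIndex(ntuple(i -> (mask[i] == 1 && r[i] == 1) ? sz[i] : r[i], N))
        Λghost[j] = v
    end
    return Λghost
end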

Unfortunately, it is still slower on my machine… ~ 370 μs

Ok, so the culprit here was the sum function. I achieve the same performance of 250 μs with this one, which splats the generator into + instead of calling sum:

function energy_arbitrary(Λ::AbstractArray{T,N}) where {T,N}
	E = 0
	@simd for r ∈ CartesianIndices(Λ) 
		E += @inbounds Λ[r] * (+)((@inbounds Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...] for d ∈ 1:N)...)
	end
	return -E
end

Well, I looked into the generated LLVM IR.
Both sum(x) (my version) and +(x...) (your version) are unrolled by the optimizer.
The difference is: sum(x::Generator) calls mapreduce(identity, add_sum, x) to avoid overflow, while +(x...) does not.
If you replace sum(x) with reduce(+, x), the performance should be exactly the same.
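
A small demonstration of that widening difference:

julia> x = Int8[100, 100];

julia> sum(x)        # add_sum widens small integers to avoid overflow
200

julia> reduce(+, x)  # plain + wraps around in Int8
-56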


Ah, thanks. That sounds intuitive :wink: