Speed up multidimensional code without Base.Cartesian

Hello, I wonder if someone could help me speed up my arbitrary-dimension code to run as fast as the explicit version, but without using Base.Cartesian.

Consider the following examples:

using Random 
using BenchmarkTools 

@inline bc(x, L) = x < L ? x + 1 : 1  # periodic boundary: next index along a dimension of length L, wrapping back to 1

function energy_explicit(Λ::AbstractArray{T,3}) where T
	L = size(Λ, 1)
	E = 0
	@simd for r ∈ CartesianIndices(Λ) 
		x, y, z = Tuple(r)
		@inbounds E += Λ[x, y, z] * (Λ[bc(x, L), y, z] + Λ[x, bc(y, L), z] + Λ[x, y, bc(z, L)])
	end
	return -E
end

function energy_arbitrary(Λ::AbstractArray{T,N}) where {T,N}
	E = 0
	@simd for r ∈ CartesianIndices(Λ) 
		@inbounds E += Λ[r] * sum(Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...] for d ∈ 1:N)
	end
	return -E
end

On my machine I get the following results:

julia> Λ = rand(Int8[-1,1], 64, 64, 64);

julia> @btime energy_explicit($Λ)
  252.054 μs (0 allocations: 0 bytes)
-192

julia> @btime energy_arbitrary($Λ)
  498.445 μs (0 allocations: 0 bytes)
-192

I know that I could use Base.Cartesian and explicitly generate the expression from energy_explicit, but can someone come up with a simpler solution?

I think what you want to do is make Λ an Array{StaticVector{L,T},N-1}, which will make the ntuple construction type stable.

Could you explain how to use it exactly? And a second question: why is ntuple not type stable in the first place? I mean, N is known from dispatch…

For comparison, here is the generated version:

@generated function energy_generated(Λ::AbstractArray{T,N}) where {T,N}
	quote
		E = 0
		@simd for r ∈ CartesianIndices(Λ) 
			E += @inbounds Λ[r] * Base.Cartesian.@ncall($N, +, d -> Base.Cartesian.@nref($N, Λ, x -> (x == d ? bc(r[x], size(Λ, x)) : r[x])))
		end
		return -E
	end
end

julia> @btime energy_generated($Λ)
  252.085 μs (0 allocations: 0 bytes)
564

I think you want to replace this with an ntuple.

function energy_arbitrary_ntuple(Λ::AbstractArray{T,N}) where {T,N}
	E = 0
	@simd for r ∈ CartesianIndices(Λ) 
		E += @inbounds Λ[r] * (+)(ntuple(d -> Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...], N)...)
	end
	return -E
end

julia> @btime energy_arbitrary_ntuple($Λ)
  498.140 μs (0 allocations: 0 bytes)
564

What if you write a custom (simple) sum function? Base sum is not the fastest, and it does not specialize for N either (AFAIK).

Following @lmiq’s suggestion, I was able to recover the same performance as the explicit version by defining a sum function that specialises on the dimension N:

@inline nsum(f::F, ::Val{N}) where {F,N} = sum(ntuple(f, Val(N)))

function energy_nsum(Λ::AbstractArray{T,N}) where {T,N}
    E = 0
    @simd for r ∈ CartesianIndices(Λ)
        @inbounds E += Λ[r] *
            nsum(Val(N)) do d
                @inbounds Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...]
            end
    end
    return -E
end

julia> @btime energy_explicit($Λ)
  263.647 μs (0 allocations: 0 bytes)
-832

julia> @btime energy_nsum($Λ)
  263.547 μs (0 allocations: 0 bytes)
-832

I think the issue with this solution is that the @inbounds is not propagated to the interior of the ntuples. Otherwise, it should be exactly equivalent to my solution above.

If I throw in some additional @inbounds (and rewrite it to make it a little easier to read), I can also achieve the performance of the explicit version:

function energy_arbitrary_ntuple(Λ::AbstractArray{T,N}) where {T,N}
    E = 0
    @simd for r ∈ CartesianIndices(Λ) 
        E += @inbounds Λ[r] * (+)(ntuple(N) do d
                inds = ntuple(N) do i
                    @inbounds i == d ? bc(r[i], size(Λ, i)) : r[i]
                end
                @inbounds Λ[inds...]
        end...)
    end
    return -E
end

julia> @btime energy_arbitrary_ntuple($Λ)
  262.834 μs (0 allocations: 0 bytes)
-832

The performance difference of the original version comes from:

  1. bounds checking (@inbounds won’t work if used outside the generator)
  2. the length of the generator being unknown at compile time.

The following modified version has similar performance.

function energy_arbitrary_faster(Λ::AbstractArray{T,N}) where {T,N}
	E = 0
	@inbounds @simd for r ∈ CartesianIndices(Λ)
		E += Λ[r] * sum((@inbounds Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...]) for d in ntuple(identity, N))
	end
	return -E
end

The fact that bc makes the indexing unpredictable also has about a factor-of-2 effect for me. Can you duplicate the first row/col/etc. of Λ as the last+1? Otherwise, you can deal with the edge afterwards. Not hard to write for N=2, more complicated for N=3… In general you could try TiledIteration.EdgeIterator, although I haven’t timed it here.
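
For concreteness, duplicating the edges for N=2 might look roughly like this (with_wrapped_edges is just an illustrative name, not something from a package):

function with_wrapped_edges(Λ::AbstractMatrix)
    # Copy Λ into an (Lx+1)×(Ly+1) array and duplicate the first row/column
    # past the opposite edge, so bc disappears from the hot loop.
    Lx, Ly = size(Λ)
    Λpad = similar(Λ, Lx + 1, Ly + 1)
    Λpad[1:Lx, 1:Ly] .= Λ
    Λpad[Lx+1, 1:Ly] .= @view Λ[1, :]   # first row wraps past the last
    Λpad[1:Lx, Ly+1] .= @view Λ[:, 1]   # first column wraps past the last
    Λpad[Lx+1, Ly+1] = Λ[1, 1]          # corner
    return Λpad
end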

It specializes/inlines if you pass a tuple (whose length is known at compile-time) instead of a generator.
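
A toy illustration of the difference (sumsq_tuple and sumsq_gen are just illustrative names):

sumsq_tuple(::Val{N}) where {N} = sum(ntuple(d -> d^2, Val(N)))  # length is in the type: fully unrolled
sumsq_gen(N) = sum(d^2 for d in 1:N)                             # generator: length only known at run time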


Another nice option is HybridArrays.jl which are basically what you suggest, but pretend to be N dimensional, with some of the dimensions statically sized.
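
If I remember the API correctly, usage looks roughly like this (the static sizes go in a Tuple, with Dynamic() marking run-time dimensions):

using HybridArrays, StaticArrays

# Rough sketch: first dimension statically 64, the rest dynamic
Λh = HybridArray{Tuple{64,StaticArrays.Dynamic(),StaticArrays.Dynamic()}}(rand(Int8[-1, 1], 64, 64, 64))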


Note that this is a perfect case for using ghost cells. Just make your array Λ bigger by one element in each direction, and copy the periodic boundaries before doing any “stencil” calculations.

Not only is this much more flexible and does it tend to lead to cleaner code (it separates your boundary conditions from your other looping code), but it is also significantly faster, because you eliminate all of the ? conditionals in your inner loops.

In particular, the following code gives the same answer as energy_explicit but uses ghost cells:

Λghost = similar(Λ, size(Λ) .+ 1);
Λghost[CartesianIndices(Λ)] = Λ;
function periodicboundaries!(Λghost::AbstractArray{<:Any,N}) where {N}
    sz = size(Λghost)
    for d = 1:N
        from = CartesianIndices(ntuple(i -> i == d ? (1:1) : (1:sz[i]), N))
        to = CartesianIndices(ntuple(i -> i == d ? (sz[i]:sz[i]) : (1:sz[i]), N))
        for (i,j) in zip(from,to)
            Λghost[j] = Λghost[i]
        end
    end
    return Λghost
end

function energy_arbitrary_faster_ghost(Λ::AbstractArray{T,N}) where {T,N}
	E = 0 + zero(T) # promote to at least Int
	@inbounds @simd for r ∈ CartesianIndices(ntuple(i -> 1:size(Λ,i)-1, N))
		E += Λ[r] * sum(ntuple(d -> @inbounds(Λ[ntuple(i -> i == d ? r[i]+1 : r[i], N)...]), N))
	end
	return -E
end

@btime periodicboundaries!($Λghost)
@btime energy_arbitrary_faster_ghost($Λghost)

giving

  57.329 μs (0 allocations: 0 bytes)
  91.855 μs (0 allocations: 0 bytes)

vs 337 µs for energy_explicit on my machine. So, even if you count the cost of copying over the periodic boundaries (which only needs to be done once whenever you update the array), it is more than 2× faster (57 µs + 92 µs ≈ 149 µs vs 337 µs). If you don’t count the time for periodicboundaries!, then it is almost 4× faster (337/92 ≈ 3.7).

PS. Note that I eliminated the generator completely, compared to @N5N3’s solution.


Yeah, that is a good point, thank you. I had thought about @inbounds, but not about putting it inside the ntuple.

This made my day. I was unaware that 1:N and ntuple(identity, N) can have different performance in a for loop. Is that always the case? Because that would make it very sensitive with respect to for loops in all similar cases.

Yes, there are even more options, like precalculating the boundary indices and so on. But here I would like the most straightforward yet fast implementation.

Thanks, that sounds great. I will inspect it more soon.

Thanks for all the answers. I think they answer all my questions by now.

In principle it is a great thing. However, for an L×L×L = V lattice I usually have V updates of individual sites before calculating the energy. So it is still advisable not to copy the updates. But thanks for the great solution at this point.
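
For reference, keeping the ghost layer consistent under single-site updates would mean mirroring each flip into every ghost image, which is extra bookkeeping on every update. A rough sketch (flipsite! is a hypothetical helper):

function flipsite!(Λghost::AbstractArray{<:Any,N}, r::CartesianIndex{N}) where {N}
    # r must be an interior (non-ghost) site; mirror the new value into every
    # ghost image of r along the dimensions where r sits on the first slice
    # (covers faces, edges, and corners).
    sz = size(Λghost)
    Λghost[r] = -Λghost[r]
    v = Λghost[r]
    for mask in CartesianIndices(ntuple(_ -> 0:1, N))
        j = CartesianIndex(ntuple(i -> (mask[i] == 1 && r[i] == 1) ? sz[i] : r[i], N))
        Λghost[j] = v
    end
    return Λghost
end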

Unfortunately, it is still slower on my machine… ~ 370 μs

Ok, so the culprit here was the sum function. I achieve the same performance of 250 μs with this one, which splats the generator into + instead of calling sum:

function energy_arbitrary(Λ::AbstractArray{T,N}) where {T,N}
	E = 0
	@simd for r ∈ CartesianIndices(Λ) 
		E += @inbounds Λ[r] * (+)((@inbounds Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...] for d ∈ 1:N)...)
	end
	return -E
end

Well, I looked into the generated LLVM IR.
Both sum(x) (my version) and +(x...) (your version) are unrolled by the optimizer.
The difference is: sum(x::Generator) calls mapreduce(identity, add_sum, x) to avoid overflow, while +(x...) does not.
If you replace sum(x) with reduce(+, x), the performance should be exactly the same.
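
A small demonstration of that widening difference:

julia> x = Int8[100, 100];

julia> sum(x)        # add_sum widens small integers to avoid overflow
200

julia> reduce(+, x)  # plain + wraps around in Int8
-56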


Ah, thanks. That sounds intuitive :wink: