Hello, I wonder if someone could help me speed up my arbitrary-dimension code to run as fast as the explicit version, but without using Base.Cartesian.
Consider the following examples:
using Random
using BenchmarkTools
@inline bc(x, L) = x < L ? x + 1 : 1
function energy_explicit(Λ::AbstractArray{T,3}) where T
    L = size(Λ, 1)
    E = 0
    @simd for r ∈ CartesianIndices(Λ)
        x, y, z = Tuple(r)
        @inbounds E += Λ[x, y, z] * (Λ[bc(x, L), y, z] + Λ[x, bc(y, L), z] + Λ[x, y, bc(z, L)])
    end
    return -E
end
function energy_arbitrary(Λ::AbstractArray{T,N}) where {T,N}
    E = 0
    @simd for r ∈ CartesianIndices(Λ)
        @inbounds E += Λ[r] * sum(Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...] for d ∈ 1:N)
    end
    return -E
end
I know that I could use Base.Cartesian and explicitly generate the expression from energy_explicit, but can someone come up with a simpler solution?
Could you explain exactly how to use it? And a second question: why is ntuple not type stable in the first place? I mean, N is known from dispatch…
For comparison, here is the generated version:
@generated function energy_generated(Λ::AbstractArray{T,N}) where {T,N}
    quote
        E = 0
        @simd for r ∈ CartesianIndices(Λ)
            E += @inbounds Λ[r] * Base.Cartesian.@ncall($N, +, d -> Base.Cartesian.@nref($N, Λ, x -> (x == d ? bc(r[x], size(Λ, x)) : r[x])))
        end
        return -E
    end
end
function energy_arbitrary_ntuple(Λ::AbstractArray{T,N}) where {T,N}
    E = 0
    @simd for r ∈ CartesianIndices(Λ)
        @inbounds E += @inbounds Λ[r] * (+)(ntuple(d -> Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...], N)...)
    end
    return -E
end
Following @lmiq’s suggestion, I was able to recover the same performance as the explicit version by defining a sum function that specialises on the dimension N:
@inline nsum(f::F, ::Val{N}) where {F,N} = sum(ntuple(f, Val(N)))
function energy_nsum(Λ::AbstractArray{T,N}) where {T,N}
    E = 0
    @simd for r ∈ CartesianIndices(Λ)
        @inbounds E += Λ[r] *
            nsum(Val(N)) do d
                @inbounds Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...]
            end
    end
    return -E
end
I think the issue with this solution is that the @inbounds is not propagated to the interior of the ntuples. Otherwise, it should be exactly equivalent to my solution above.
If I throw in some additional @inbounds (and rewrite it to make it a little easier to read), I can also achieve the performance of the explicit version:
function energy_arbitrary_ntuple(Λ::AbstractArray{T,N}) where {T,N}
    E = 0
    @simd for r ∈ CartesianIndices(Λ)
        E += @inbounds Λ[r] * (+)(ntuple(N) do d
            inds = ntuple(N) do i
                @inbounds i == d ? bc(r[i], size(Λ, i)) : r[i]
            end
            @inbounds Λ[inds...]
        end...)
    end
    return -E
end
The performance difference of the original version comes from:
bounds checking (@inbounds won’t work if used outside the generator)
the length of the generator being unknown.
The following modified version has similar performance.
function energy_arbitrary_faster(Λ::AbstractArray{T,N}) where {T,N}
    E = 0
    @inbounds @simd for r ∈ CartesianIndices(Λ)
        E += Λ[r] * sum((@inbounds Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...]) for d in ntuple(identity, N))
    end
    return -E
end
The fact that bc makes the indexing unpredictable also has about a factor-of-2 effect for me. Can you duplicate the first row/column/etc. of Λ as the last+1? Otherwise, you can deal with the edges afterwards. This is not hard to write for N=2, more complicated for N=3… In general you could try TiledIteration.EdgeIterator, although I haven’t timed it here.
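For illustration, a minimal sketch of that duplication for N = 2 (pad_periodic_2d is a hypothetical helper name, not from any package); the ghost-cell code further below generalises the same idea to arbitrary N:
# A minimal sketch of the padding idea for N = 2: duplicate the first row/column
# as an extra last+1 row/column so the inner loop can use r[i] + 1 instead of bc.
function pad_periodic_2d(Λ::AbstractMatrix)
    L1, L2 = size(Λ)
    P = similar(Λ, L1 + 1, L2 + 1)
    P[1:L1, 1:L2] .= Λ
    P[L1 + 1, 1:L2] .= @view Λ[1, :]   # first row duplicated as row L1 + 1
    P[1:L1, L2 + 1] .= @view Λ[:, 1]   # first column duplicated as column L2 + 1
    P[L1 + 1, L2 + 1] = Λ[1, 1]        # corner element
    return P
end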
Another nice option is HybridArrays.jl, which is basically what you suggest but pretends to be N-dimensional, with some of the dimensions statically sized.
Note that this is a perfect case for using ghost cells. Just make your array Λ bigger by one element in each direction, and copy the periodic boundaries before doing any “stencil” calculations.
This is not only more flexible and tends to lead to cleaner code (it separates your boundary conditions from the rest of your looping code); it is also significantly faster, because it eliminates all of the ? conditionals in your inner loops.
In particular, the following code gives the same answer as energy_explicit but uses ghost cells:
Λghost = similar(Λ, size(Λ) .+ 1);
Λghost[CartesianIndices(Λ)] = Λ;
function periodicboundaries!(Λghost::AbstractArray{<:Any,N}) where {N}
    sz = size(Λghost)
    for d = 1:N
        from = CartesianIndices(ntuple(i -> i == d ? (1:1) : (1:sz[i]), N))
        to = CartesianIndices(ntuple(i -> i == d ? (sz[i]:sz[i]) : (1:sz[i]), N))
        for (i, j) in zip(from, to)
            Λghost[j] = Λghost[i]
        end
    end
    return Λghost
end
function energy_arbitrary_faster_ghost(Λ::AbstractArray{T,N}) where {T,N}
    E = 0 + zero(T) # promote to at least Int
    @inbounds @simd for r ∈ CartesianIndices(ntuple(i -> 1:size(Λ,i)-1, N))
        E += Λ[r] * sum(ntuple(d -> @inbounds(Λ[ntuple(i -> i == d ? r[i]+1 : r[i], N)...]), N))
    end
    return -E
end
@btime periodicboundaries!($Λghost)
@btime energy_arbitrary_faster_ghost($Λghost)
vs 337 µs for energy_explicit on my machine. So, even if you count the cost of copying over the periodic boundaries (which only needs to be done once whenever you update the array), it is more than 2× faster. If you don’t count the time for periodicboundaries!, it is almost 4× faster.
PS. Note that I eliminated the generator completely, compared to @N5N3’s solution.
Yeah, that is a good point, thank you. I had thought about @inbounds, but not about putting it inside the ntuple.
This made my day. I was unaware that 1:N and ntuple(identity, N) can have different performance in a for loop. Is that always the case? That would make all similar for loops very sensitive to this choice.
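For illustration, a minimal sketch of the two forms being compared (these helper names are mine, not from the thread): the range 1:N carries its length only as a run-time value, whereas the NTuple returned by ntuple(identity, N) carries the length N in its type, so the compiler can reason about the trip count of a generator over it.
# Two generator variants over the same tuple; only the second iterates over a
# collection whose length is encoded in its type.
sum_over_range(x::NTuple{N}) where {N} = sum(x[d] for d in 1:N)                  # generator over a UnitRange
sum_over_tuple(x::NTuple{N}) where {N} = sum(x[d] for d in ntuple(identity, N))  # generator over an NTuple
In the REPL, @code_llvm sum_over_range((1, 2, 3, 4)) and @code_llvm sum_over_tuple((1, 2, 3, 4)) can be compared to see how each reduction is lowered.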
Yes, there are even more options, like precalculating the boundary indices and so on. But here I would like to have the most straightforward but fast implementation.
Thanks, that sounds great. I will inspect it more soon.
Thanks for all the answers. I think they cover all my questions now.
In principle it is a great thing. However, for an L×L×L = V lattice I usually have V updates of individual sites before calculating the energy. So it is still advisable not to copy the updates. But thanks for the great solution at this point.
Ok, so the culprit here was the sum function. I achieve the same performance of 250 μs with this one, which uses a normal for loop:
function energy_arbitrary(Λ::AbstractArray{T,N}) where {T,N}
    E = 0
    @simd for r ∈ CartesianIndices(Λ)
        E += @inbounds Λ[r] * (+)((@inbounds Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...] for d ∈ 1:N)...)
    end
    return -E
end
Well, I looked into the generated LLVM IR.
Both sum(x) (my version) and +(x...) (your version) are unrolled by the optimizer.
The difference is that sum(x::Generator) calls mapreduce(identity, add_sum, x) to avoid overflow, while +(x...) does not.
If you replace sum(x) with reduce(+, x), the performance should be exactly the same.
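For reference, a minimal sketch of that suggestion applied to the earlier arbitrary-dimension loop (the name energy_arbitrary_reduce is chosen here for illustration and has not been benchmarked):
# Same loop as energy_arbitrary_faster above, but with reduce(+, ...) in place
# of sum(...), so no widening via add_sum occurs in the reduction.
function energy_arbitrary_reduce(Λ::AbstractArray{T,N}) where {T,N}
    E = 0
    @inbounds @simd for r ∈ CartesianIndices(Λ)
        E += Λ[r] * reduce(+, ((@inbounds Λ[ntuple(i -> i == d ? bc(r[i], size(Λ, i)) : r[i], N)...]) for d in ntuple(identity, N)))
    end
    return -E
end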