Poor performance with custom multidimensional iterator

I am trying to implement a custom iterator for multidimensional arrays, which among other things, allows one to specify the order in which the axes are iterated. Sadly, I am having trouble getting even close to the performance of CartesianIndices even though almost all the code is identical. In particular, I found that the performance depends greatly on whether the index calculation is done within the iterator or in the caller.

Below is a MWE. Iter1 is basically a stripped-down version of CartesianIndices and achieves good performance. It returns a tuple which is converted to an appropriate linear index in the caller. Iter2 is nearly identical to Iter1; the only difference is that the conversion to a linear index is done within the iterate function instead of in the caller. Surprisingly to me, this results in nearly 8x slowdown. (This is on Julia 1.5.)

struct Iter1{N}
	strides::Dims{N}
	stop::Dims{N}
	function Iter1(sz::Dims{N}, p::Dims{N}) where {N}
		strides = cumprod(ntuple(i -> i>1 ? sz[i-1] : 1, Val(N)))
		new{N}(ntuple(i->strides[p[i]], Val(N)), sz)
	end
end

# (Exactly the same as Iter1)
struct Iter2{N}
	strides::Dims{N}
	stop::Dims{N}
	function Iter2(sz::Dims{N}, p::Dims{N}) where {N}
		strides = cumprod(ntuple(i -> i>1 ? sz[i-1] : 1, Val(N)))
		new{N}(ntuple(i->strides[p[i]], Val(N)), sz)
	end
end


# Iterating Iter1 returns a Tuple
@inline function Base.iterate(it::Iter1{N}) where {N}
	if any(map(s -> s<1, it.stop))
		return nothing
	end
	I = ntuple(i->1, Val(N))
	return I, I
end

@inline function Base.iterate(it::Iter1{N}, state::Dims{N}) where {N}
	valid, I = __inc(state, it.stop)
	valid || return nothing
	return I, I
end


# Iterating Iter2 returns a linear index
@inline function Base.iterate(it::Iter2{N}) where {N}
	if any(map(s -> s<1, it.stop))
		return nothing
	end
	I = ntuple(i->1, Val(N))
	return 1, I
end

@inline function Base.iterate(it::Iter2, state::Dims{N}) where {N}
	valid, I = __inc(state, it.stop)
	valid || return nothing
	return calc_index(I, it.strides), I
end



# helper functions
@inline __inc(::Tuple{}, ::Tuple{}) = false, ()
@inline function __inc(state::Dims{N}, stop::Dims{N}) where {N}
	 if state[1] < stop[1]
		  return true, (state[1]+1, Base.tail(state)...)
	 end
	 valid, I = __inc(Base.tail(state), Base.tail(stop))
	 return valid, (1, I...)
end

@inline function calc_index(I::Dims{N}, strides::Dims{N}) where {N}
	1 + sum(ntuple(i -> (I[i]-1)*strides[i], Val(N)))
end


# Benchmarks

function test1(A::AbstractArray{Float64,N}, p::Dims{N}) where {N}
	it = Iter1(size(A), p)
	for I in it
		iA = calc_index(I, it.strides)
		A[iA] += 1.0
	end
end


function test2(A::AbstractArray{Float64,N}, p::Dims{N}) where {N}
	it = Iter2(size(A), p)
	for iA in it
		A[iA] += 1.0
	end
end

function test2(A::AbstractArray{Float64,N}) where {N}
	it = Iter2(size(A), ntuple(i->1+N-i, Val(N)))
	for iA in it
		A[iA] += 1.0
	end
end


using BenchmarkTools
A = rand(4,4,4,4,4)

@btime test1($A, (5,4,3,2,1))		# 710 ns
@btime test2($A, (5,4,3,2,1))		# 5.2 μs
@btime test2($A)				# 1.6 μs

Note that the only difference between test1 and test2 is where calc_index is called. Evidently the compiler is utilizing information about it.strides in test1 that it is not using in iterate(::Iter2). Hard-coding the permutation p (the second version of test2) helps somewhat, but is still 2x slower than test1.

I’d greatly appreciate any insight anyone has to offer. I’ve been pulling my hair out for two days trying to figure out why my iterator is performing so poorly.

1 Like

Without having looked at your code, let me start with a dumb question: can you just combine PermutedDimsArray with ordinary CartesianIndices iteration?

2 Likes

It’s not a dumb question at all. I originally tried multiple variations on permuting CartiesianIndex’s, but they didn’t perform nearly as well as without permutation. Then I thought I might directly iterate the subscripts in a custom order, so that no permutation would be required. Unfortunately that performed poorly as well. I finally determined that calculating a linear index with custom strides was fastest, but only if the calculation was done in the caller. At that point I threw my hands up and posted my problem here :slight_smile:.

For the record, here’s a benchmark with PermutedDimsArray:

function test0(A::AbstractArray{Float64,N}, p::Dims{N}) where {N}
	AP = PermutedDimsArray(A, p)
	for ci in CartesianIndices(AP)
		A[ci] += 1.0
	end
end

@btime test0($A, (5,4,3,2,1))		# 2.3 μs

Incidentally, my actual goal is more complicated than just changing the iteration sequence of a single array – it is to create an iterator for tensor operations, as a sort of alternative to existing tensor packages. But the issue posted here lies at the heart of it. And as a secondary goal I am hoping to get a better understanding of the idiosyncracies of compiler optimization (or the lack thereof).

That code only works if A is of isotropic size. Here’s what I meant:

function inc_cartesian!(B::AbstractArray)
    for I in CartesianIndices(B)
        B[I] += 1
    end
    return B
end

A = rand(4, 5, 6, 7, 8)   # notice that this is a bigger array
AP = PermutedDimsArray(copy(A), (5,4,3,2,1))

julia> @btime inc_cartesian!($A);
  5.718 μs (0 allocations: 0 bytes)

julia> @btime inc_cartesian!($AP);
  4.308 μs (0 allocations: 0 bytes)

There’s a crucial difference here: in my version, Julia’s compiler knows the permutation, but in yours it doesn’t. Why? You passed the permutation as a value, but I passed it as a type parameter:

julia> typeof(AP)
PermutedDimsArray{Float64, 5, (5, 4, 3, 2, 1), (5, 4, 3, 2, 1), Array{Float64, 5}}

The permutation itself (as well as the inverse permutation, which in this case happens to be the same) is encoded in the type.

If you doubt this achieves what you were asking, let’s try a different implementation:

julia> function address_order!(B)
           k = 0
           for I in CartesianIndices(B)
               B[I] = k += 1
           end
       end
address_order! (generic function with 1 method)

julia> A = zeros(Int, 3, 4);

julia> AP = PermutedDimsArray(copy(A), (2, 1));

julia> address_order!(A);

julia> address_order!(AP);

julia> A
3×4 Matrix{Int64}:
 1  4  7  10
 2  5  8  11
 3  6  9  12

julia> AP.parent
3×4 Matrix{Int64}:
 1   2   3   4
 5   6   7   8
 9  10  11  12

You see that AP.parent was visited in row-major order.

2 Likes

That code only works if A is of isotropic size.

Oops. That’s because I had a typo. It should’ve been

function test0(A::AbstractArray{Float64,N}, p::Dims{N}) where {N}
	AP = PermutedDimsArray(A, p)
	for ci in CartesianIndices(AP)
		AP[ci] += 1.0
	end
end

which clocks in at a whopping 77 microseconds (!). To put your suggestion in my framework above:

function test0_TH(A::AbstractArray{Float64,N}, p::Dims{N}) where {N}
	AP = PermutedDimsArray(A, p)
	inc_cartesian!(AP)
end

julia> @btime test0_TH($A, (5,4,3,2,1));
  1.010 μs (4 allocations: 208 bytes)

which i think is pretty good – it’s not too far off the 700ns for the unpermuted iteration. However, this approach is a bit less convenient than I was hoping for: It requires the user to create a permuted array and define a separate function to do the loop, whereas I am aiming for an iterator that can be used in a transparent way. But I bet a macro could give me something close to the kind of API I would like.

Incidentally, one of my earliest attempts was to incorporate index permutation based on a type parameter (a la PermutedDimsArray) into an iterator, along these lines:

struct Iter3{N, P}
	stop::Dims{N}
	function Iter3(sz::Dims{N}, p::Dims{N}) where {N}
		new{N,p}(sz)
	end
end

@inline function Base.iterate(it::Iter3{N,P}) where {N,P}
	if any(map(s -> s<1, it.stop))
		return nothing
	end
	I = ntuple(i->1, Val(N))
	return CartesianIndex{N}(map(j->I[j], P)), I
end

@inline function Base.iterate(it::Iter3{N,P}, state::Dims{N}) where {N,P}
	valid, I = __inc(state, it.stop)
	valid || return nothing
	return CartesianIndex{N}(map(j->I[j], P)), I
end

function test3(A::AbstractArray{Float64,N}, p::Dims{N}) where {N}
	for iA in Iter3(size(A), p)
		A[iA] += 1.0
	end
end

Although the permutation is a type parameter as it was in your case, the performance is quite poor:

julia> @btime test3($A, (5,4,3,2,1));
  44.399 μs (2050 allocations: 144.05 KiB)

I suspect this is because there is no function barrier between the type-unstable calling function and the iterating function.