CartesianIndices overhead

I’ve been experimenting with the techniques described in the Multidimensional algorithms and iteration blog post for writing some generic multi-dimensional operations.

Let’s look at a nan-skipping sum:

function nansum!(B, A)
    # It's assumed that B has size 1 along any dimension that we're summing,
    # and otherwise matches A
    fill!(B, NaN)

    Bmax = last(CartesianIndices(B))

    @inbounds for I in CartesianIndices(A)
        dest = min(Bmax, I)
        ai = A[I]
        bi = B[dest]
        B[dest] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
    end
    return B
end

julia> A = rand([NaN, 3.4, 4.5], 100, 40, 60);

julia> B = Array{Float64}(undef, 1, 40, 60);

julia> @btime nansum!($B, $A);
  788.930 μs (0 allocations: 0 bytes)

This works, and handles arbitrary dimensions. Cool!

nansum2! below is hard-coded for the 3-dimensional case with dims=1, i.e. reducing only the first dimension:

function nansum2!(B::AbstractArray{T,3}, A::AbstractArray{T,3}) where T
    fill!(B, NaN)

    @inbounds for k in 1:size(A, 3)
        for j in 1:size(A, 2)
            for i in 1:size(A, 1)
                ai = A[i, j, k]
                bi = B[1, j, k]
                B[1, j, k] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
            end
        end
    end
    return B
end

julia> @btime nansum2!($B, $A);
  450.433 μs (0 allocations: 0 bytes)

The performance is significantly better, which is a shame. It seems that the generic solution not knowing at compile time which dims have size 1 does have a cost (it has to go through the min on every iteration).
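For reference, min on a pair of CartesianIndexes compares elementwise, which is what clamps each source index into B's bounds here; a quick sketch of what dest = min(Bmax, I) computes:

```julia
# min on CartesianIndex is elementwise, so it clamps I to 1 along B's
# size-1 dimensions while passing the other indices through unchanged.
I    = CartesianIndex(57, 12, 3)
Bmax = CartesianIndex(1, 40, 60)   # last(CartesianIndices(B)) for a 1×40×60 B
min(Bmax, I)                       # CartesianIndex(1, 12, 3)
```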

Does anyone have suggestions for how to keep the simplicity and genericity of the first version, but with genuinely no overhead? It’ll have to dispatch / recompile based on the reduction dims; I’m just not sure what the nicest way to code that is.


How about reshaping both, then processing, then reshaping back?

function nansum!(B, A)
    # It's assumed that B has size 1 along any dimension that we're summing,
    # and otherwise matches A
    bsize = size(B)
    A = reshape(A, (size(A,1), length(B)))
    B = reshape(B, length(B))
    fill!(B, NaN)

    @inbounds for j in 1:size(A,2)
        for i in 1:size(A,1)
            aij = A[i,j]
            bj = B[j]
            B[j] = ifelse(isnan(aij), bj, ifelse(isnan(bj), aij, bj + aij))
        end
    end

    B = reshape(B, bsize)
    return B
end

On my machine this is about as fast as your fast version.

Yes, that’s currently a known issue with CartesianIndices, unfortunately. See https://github.com/JuliaLang/julia/issues/9080 and https://github.com/JuliaLang/julia/pull/35074.


It seems like min on CartesianIndex is the culprit here. This should be equivalent, and on my machine it’s as fast as nansum2!:

julia> function nansum3!(B, A)
           # It's assumed that B has size 1 along any dimension that we're summing,
           # and otherwise matches A
           fill!(B, NaN)

           @inbounds for I in CartesianIndices(A)
               dest = CartesianIndex(1, Base.tail(I.I)...)
               ai = A[I]
               bi = B[dest]
               B[dest] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
           end
           return B
       end
nansum3! (generic function with 1 method)
julia> @btime nansum!($B, $A);
  688.682 μs (0 allocations: 0 bytes)

julia> @btime nansum2!($B, $A);
  478.752 μs (0 allocations: 0 bytes)

julia> @btime nansum3!($B, $A);
  479.218 μs (0 allocations: 0 bytes)

These become the same speed if I write dest = CartesianIndex(1, Base.tail(I.I)...), so, as you say, it looks like it’s the min, not just the CartesianIndices.

[Beaten by @rdeits, with identical notation too!]

Perhaps you could make a fixdims(I, ::Val{dims}) and call a function using that after checking which sizes are 1?
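For what it's worth, such a fixdims could look something like this (the name and signature are from the suggestion above; the body is my own guess at it):

```julia
# Sketch of the suggested fixdims: replace the indices named in `dims` with 1,
# keeping the rest of I. Val{dims} makes dims a compile-time constant, so the
# ntuple unrolls and the branch disappears at compile time.
@inline function fixdims(I::CartesianIndex{N}, ::Val{dims}) where {N,dims}
    CartesianIndex(ntuple(k -> k in dims ? 1 : I[k], Val(N)))
end

fixdims(CartesianIndex(7, 3, 5), Val((1,)))  # CartesianIndex(1, 3, 5)
```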

A2 = permutedims(A, (2,3,1));
B2 = B[1,:,:];

using LoopVectorization
function nansumavx!(B::AbstractArray{T,2}, A::AbstractArray{T,3}) where T
    fill!(B, NaN)
    for i ∈ axes(A,3); @avx for J ∈ CartesianIndices(B)
        ai = A[J, i]
        bi = B[J]
        B[J] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
    end; end
    return B
end

This yields

julia> @benchmark nansum2!($B, $A)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     353.777 μs (0.00% GC)
  median time:      354.128 μs (0.00% GC)
  mean time:        354.892 μs (0.00% GC)
  maximum time:     427.012 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark nansumavx!($B2, $A2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     76.999 μs (0.00% GC)
  median time:      78.071 μs (0.00% GC)
  mean time:        78.213 μs (0.00% GC)
  maximum time:     148.726 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> reshape(B, (size(B,2),size(B,3))) ≈ B2
true

Answering your initial question: you can give the compiler information about which dimensions have size 1:

@inline @generated function update_indices(I::CartesianIndex{N}, J::CartesianIndex{N}, ::Val{positions}) where {N, positions}
    list = Array{Expr}(undef, N)
    for k = 1:N
        list[k] = if k in positions
            :(J[$k])
        else
            :(I[$k])
        end
    end
    :(CartesianIndex($(list...)))
end

function nansum4!(B, A, positions = Val(Tuple(k for k in 1:ndims(B) if isone(size(B, k)))))
    # It's assumed that B has size 1 along any dimension that we're summing,
    # and otherwise matches A
    fill!(B, NaN)

    Bmax = last(CartesianIndices(B))

    @inbounds for I in CartesianIndices(A)
        dest = update_indices(I, Bmax, positions)
        ai = A[I]
        bi = B[dest]
        B[dest] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
    end
    return B
end

positions tells the generated function update_indices which dimensions have size 1.
In your example:

positions = Val((1,))

and update_indices lowers to:

update_indices(I, Bmax, positions) = CartesianIndex(Bmax[1], I[2], I[3])

It is important to note that, by placing positions as an optional argument of nansum4!, I’m isolating the type instability of computing positions from the computationally intensive part of the function. This is known as the function-barrier technique.

If I had written positions = ... in the body of the function, the performance would be much worse. In addition, passing positions as a positional argument allows you to manually specify which dimensions have size 1, eliminating the type instability entirely.
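Stripped of the nansum details, the pattern is just this (hypothetical names, for illustration only):

```julia
# compute_positions is type-unstable: the Val's type depends on runtime sizes.
compute_positions(B) = Val(Tuple(k for k in 1:ndims(B) if isone(size(B, k))))

# kernel is compiled once per concrete Val type, so it is fully type-stable;
# here it just returns its argument, standing in for the hot loop.
kernel(B, positions::Val) = positions

# The instability is confined to this one dynamic call site.
outer(B) = kernel(B, compute_positions(B))
```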

@btime nansum!($B, $A);
@btime nansum2!($B, $A);
@btime nansum4!($B, $A);

Output:

  628.300 μs (0 allocations: 0 bytes)
  434.399 μs (0 allocations: 0 bytes)
  440.901 μs (5 allocations: 240 bytes)

The allocations of nansum4! are due to the type instability of computing positions. We can eliminate this problem with:

@btime nansum4!($B, $A, Val((1,)));

Output:

  429.399 μs (0 allocations: 0 bytes)

Of course, you can adapt nansum4! to use LoopVectorization.jl as suggested by @Elrod to get even faster code.

Thanks, and to Elrod. This would be helpful for the nanmean and nanstd I need when porting code from Octave (or actually MATLAB; they have different functions… so far I’ve only ported to Octave, on the way to Julia).

Should this be in Julia Base? Is it redundant with the skipmissing I found instead? Is that the way you should go in pure Julia? Or should one function do both?

NaNMath.jl may be useful.
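On the skipmissing question above: skipmissing skips missing values, not NaN (the two are distinct in Julia), so for NaN-laden data you would still need an explicit filter; a quick sketch:

```julia
# skipmissing handles `missing`, not NaN:
sum(skipmissing([1.0, missing, 2.0]))   # 3.0

# For NaN you can filter explicitly instead:
nansum(A) = sum(x for x in A if !isnan(x))
nansum([1.0, NaN, 2.0])                 # 3.0
```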

Thanks for the help everyone, this has given me some good ideas. I’ll report back if I have further problems!