CartesianIndices overhead

I’ve been experimenting with the techniques described in the Multidimensional algorithms and iteration blog post for writing some generic multi-dimensional operations.

Let’s look at a nan-skipping sum:

function nansum!(B, A)
    # It's assumed that B has size 1 along any dimension that we're summing,
    # and otherwise matches A
    fill!(B, NaN)

    Bmax = last(CartesianIndices(B))

    @inbounds for I in CartesianIndices(A)
        dest = min(Bmax, I)
        ai = A[I]
        bi = B[dest]
        B[dest] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
    end
    return B
end

julia> A = rand([NaN, 3.4, 4.5], 100, 40, 60);

julia> B = Array{Float64}(undef, 1, 40, 60);

julia> @btime nansum!($B, $A);
  788.930 μs (0 allocations: 0 bytes)

This works, and handles arbitrary dimensions. Cool!

nansum2! below is hard-coded for the 3-dimensional case with dims=1, i.e. reducing only the first dimension:

function nansum2!(B::AbstractArray{T,3}, A::AbstractArray{T,3}) where T
    fill!(B, NaN)

    @inbounds for k in 1:size(A, 3)
        for j in 1:size(A, 2)
            for i in 1:size(A, 1)
                ai = A[i, j, k]
                bi = B[1, j, k]
                B[1, j, k] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
            end
        end
    end
    return B
end

julia> @btime nansum2!($B, $A);
  450.433 μs (0 allocations: 0 bytes)

The performance is significantly better, which is a shame. It seems that the generic solution not knowing at compile time which dims have size 1 does have a cost (it has to go through the min on every iteration).
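For reference, min on a pair of CartesianIndexes compares elementwise, which is what clamps each source index into B's bounds here; a quick sketch of what dest = min(Bmax, I) computes:

```julia
# min on CartesianIndex is elementwise, so it clamps I to 1 along B's
# size-1 dimensions while passing the other indices through unchanged.
I    = CartesianIndex(57, 12, 3)
Bmax = CartesianIndex(1, 40, 60)   # last(CartesianIndices(B)) for a 1×40×60 B
min(Bmax, I)                       # CartesianIndex(1, 12, 3)
```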

Does anyone have suggestions for how to keep the simplicity and genericity of the first version, but with genuinely no overhead? It’ll have to dispatch / recompile based on the reduction dims; I’m just not sure what the nicest way to code that is.


How about reshaping both, then processing, then reshaping back?

function nansum!(B, A)
    # It's assumed that B has size 1 along any dimension that we're summing,
    # and otherwise matches A
    bsize = size(B)
    A = reshape(A, (size(A,1), length(B)))
    B = reshape(B, length(B))
    fill!(B, NaN)

    @inbounds for j in 1:size(A,2)
        for i in 1:size(A,1)
            aij = A[i,j]
            bj = B[j]
            B[j] = ifelse(isnan(aij), bj, ifelse(isnan(bj), aij, bj + aij))
        end
    end

    B = reshape(B, bsize)
    return B
end

On my machine this is about as fast as your fast version.

Yes, that’s currently a known issue with CartesianIndices, unfortunately. See https://github.com/JuliaLang/julia/issues/9080 and https://github.com/JuliaLang/julia/pull/35074.


It seems like min on CartesianIndex is the culprit here. This should be equivalent, and on my machine it’s as fast as nansum2!:

julia> function nansum3!(B, A)
           # It's assumed that B has size 1 along any dimension that we're summing,
           # and otherwise matches A
           fill!(B, NaN)

           @inbounds for I in CartesianIndices(A)
               dest = CartesianIndex(1, Base.tail(I.I)...)
               ai = A[I]
               bi = B[dest]
               B[dest] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
           end
           return B
       end
nansum3! (generic function with 1 method)
julia> @btime nansum!($B, $A);
  688.682 μs (0 allocations: 0 bytes)

julia> @btime nansum2!($B, $A);
  478.752 μs (0 allocations: 0 bytes)

julia> @btime nansum3!($B, $A);
  479.218 μs (0 allocations: 0 bytes)

These become the same speed if I write dest = CartesianIndex(1, Base.tail(I.I)...), so, as you say, it looks like it’s the min, not just the CartesianIndices.

[Beaten by @rdeits, with identical notation too!]

Perhaps you could make a fixdims(I, ::Val{dims}) and call a function using that after checking which sizes are 1?
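For what it's worth, such a fixdims could look something like this (the name and signature are from the suggestion above; the body is my own guess at it):

```julia
# Sketch of the suggested fixdims: replace the indices named in `dims` with 1,
# keeping the rest of I. Val{dims} makes dims a compile-time constant, so the
# ntuple unrolls and the branch disappears at compile time.
@inline function fixdims(I::CartesianIndex{N}, ::Val{dims}) where {N,dims}
    CartesianIndex(ntuple(k -> k in dims ? 1 : I[k], Val(N)))
end

fixdims(CartesianIndex(7, 3, 5), Val((1,)))  # CartesianIndex(1, 3, 5)
```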

A2 = permutedims(A, (2,3,1));
B2 = B[1,:,:];

using LoopVectorization
function nansumavx!(B::AbstractArray{T,2}, A::AbstractArray{T,3}) where T
    fill!(B, NaN)
    for i ∈ axes(A,3); @avx for J ∈ CartesianIndices(B)
        ai = A[J, i]
        bi = B[J]
        B[J] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
    end; end
    return B
end

This yields

julia> @benchmark nansum2!($B, $A)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     353.777 μs (0.00% GC)
  median time:      354.128 μs (0.00% GC)
  mean time:        354.892 μs (0.00% GC)
  maximum time:     427.012 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark nansumavx!($B2, $A2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     76.999 μs (0.00% GC)
  median time:      78.071 μs (0.00% GC)
  mean time:        78.213 μs (0.00% GC)
  maximum time:     148.726 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> reshape(B, (size(B,2),size(B,3))) ≈ B2
true

Answering your initial question: you can give the compiler information about which dimensions have size 1:

@inline @generated function update_indices(I::CartesianIndex{N}, J::CartesianIndex{N}, ::Val{positions}) where {N, positions}
    list = Array{Expr}(undef, N)
    for k = 1:N
        list[k] = if k in positions
            :(J[$k])
        else
            :(I[$k])
        end
    end
    :(CartesianIndex($(list...)))
end

function nansum4!(B, A, positions = Val(Tuple(k for k in 1:ndims(B) if isone(size(B, k)))))
    # It's assumed that B has size 1 along any dimension that we're summing,
    # and otherwise matches A
    fill!(B, NaN)

    Bmax = last(CartesianIndices(B))

    @inbounds for I in CartesianIndices(A)
        dest = update_indices(I, Bmax, positions)
        ai = A[I]
        bi = B[dest]
        B[dest] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
    end
    return B
end

positions tells the generated function update_indices which dimensions have size 1.
In your example:

positions = Val((1,))

and update_indices lowers to:

update_indices(I, Bmax, positions) = CartesianIndex(Bmax[1], I[2], I[3])

It is important to note that, by placing positions as an optional argument of nansum4!, I’m isolating the type instability of computing positions from the computationally intensive part of the function. This is known as the function-barrier technique.

If I had written positions = ... in the body of the function, the performance would be much worse. In addition, passing positions as a positional argument allows you to manually specify which dimensions have size 1, eliminating the type instability entirely.
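Stripped of the nansum details, the pattern is just this (hypothetical names, for illustration only):

```julia
# compute_positions is type-unstable: the Val's type depends on runtime sizes.
compute_positions(B) = Val(Tuple(k for k in 1:ndims(B) if isone(size(B, k))))

# kernel is compiled once per concrete Val type, so it is fully type-stable;
# here it just returns its argument, standing in for the hot loop.
kernel(B, positions::Val) = positions

# The instability is confined to this one dynamic call site.
outer(B) = kernel(B, compute_positions(B))
```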

@btime nansum!($B, $A);
@btime nansum2!($B, $A);
@btime nansum4!($B, $A);

Output:

  628.300 μs (0 allocations: 0 bytes)
  434.399 μs (0 allocations: 0 bytes)
  440.901 μs (5 allocations: 240 bytes)

The allocations of nansum4! are due to the type instability of computing positions. We can eliminate this problem with:

@btime nansum4!($B, $A, Val((1,)));

Output:

  429.399 μs (0 allocations: 0 bytes)

Of course, you can adapt nansum4! to use LoopVectorization.jl as suggested by @Elrod to get even faster code.

Thanks, and to Elrod. This would be helpful for the nanmean and nanstd I need when porting code from Octave (or actually MATLAB; they have different functions… so far I’ve only ported to Octave, on the way to Julia).

Should this be in Julia Base? Is it redundant with the skipmissing I found instead? Is that the way you should go in pure Julia? Or should one function do both?

NaNMath.jl may be useful.
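On the skipmissing question above: skipmissing skips missing values, not NaN (the two are distinct in Julia), so for NaN-laden data you would still need an explicit filter; a quick sketch:

```julia
# skipmissing handles `missing`, not NaN:
sum(skipmissing([1.0, missing, 2.0]))   # 3.0

# For NaN you can filter explicitly instead:
nansum(A) = sum(x for x in A if !isnan(x))
nansum([1.0, NaN, 2.0])                 # 3.0
```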

Thanks for the help everyone, this has given me some good ideas. I’ll report back if I have further problems!