I’ve been experimenting with the techniques described in Multidimensional algorithms and iteration for writing some generic multi-dimensional operations.

Let’s look at a nan-skipping sum:

```
function nansum!(B, A)
    # It's assumed that B has size 1 along any dimension that we're summing,
    # and otherwise matches A
    fill!(B, NaN)
    Bmax = last(CartesianIndices(B))
    @inbounds for I in CartesianIndices(A)
        dest = min(Bmax, I)
        ai = A[I]
        bi = B[dest]
        B[dest] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
    end
    return B
end

julia> using BenchmarkTools

julia> A = rand([NaN, 3.4, 4.5], 100, 40, 60);

julia> B = Array{Float64}(undef, 1, 40, 60);

julia> @btime nansum!($B, $A);
  788.930 μs (0 allocations: 0 bytes)
```

This works, and handles arbitrary dimensions. Cool!
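To make the clamping step concrete: `min` on `CartesianIndex` is elementwise, so with the shapes above it pins the reduced dimension to 1 while passing the others through (the specific indices here are just illustrative):

```
# B has size (1, 40, 60), so the last index of CartesianIndices(B) is:
Bmax = CartesianIndex(1, 40, 60)

# An arbitrary index into A (size (100, 40, 60)):
I = CartesianIndex(57, 12, 30)

# Elementwise min clamps the summed dimension to 1 and keeps the rest,
# mapping each index of A onto its destination in B:
dest = min(Bmax, I)  # CartesianIndex(1, 12, 30)
```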

`nansum2!` below is hard-coded for the 3-dimensional case with `dims=1`, i.e. reducing only the first dimension:

```
function nansum2!(B::AbstractArray{T,3}, A::AbstractArray{T,3}) where T
    fill!(B, NaN)
    @inbounds for k in 1:size(A, 3)
        for j in 1:size(A, 2)
            for i in 1:size(A, 1)
                ai = A[i, j, k]
                bi = B[1, j, k]
                B[1, j, k] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
            end
        end
    end
    return B
end

julia> @btime nansum2!($B, $A);
  450.433 μs (0 allocations: 0 bytes)
```

The performance is significantly better, which is a shame. It seems that the generic solution not knowing at compile time which dims have size 1 does have a cost (it has to go through the `min` on every iteration).

Does anyone have suggestions for how to keep the simplicity and genericity of the first version, but with genuinely no overhead? It’ll have to dispatch / recompile based on the reduction dims - just not sure what the nicest way to code that is.
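One possible direction, as a sketch only (`nansum3!` and the `Val((1,))` calling convention are my own invention here, and I haven’t benchmarked it): pass the reduction dims as a `Val` type parameter so the compiler specializes the destination-index computation per dims tuple, letting the `d in dims` test constant-fold away:

```
function nansum3!(B, A, ::Val{dims}) where {dims}
    fill!(B, NaN)
    @inbounds for I in CartesianIndices(A)
        # dims is a compile-time constant tuple, so this ntuple should
        # specialize: reduced dims map to index 1, the rest pass through.
        dest = CartesianIndex(ntuple(d -> d in dims ? 1 : I[d], Val(ndims(A))))
        ai = A[I]
        bi = B[dest]
        B[dest] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
    end
    return B
end

# Usage: reduce the first dimension, matching nansum2! above
# nansum3!(B, A, Val((1,)))
```

The idea is that each distinct dims tuple gets its own compiled method instance, so the per-iteration `min` is replaced by index selection that is fixed at compile time; whether it fully matches the hand-written loop would need checking with `@btime`.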