 I’ve been experimenting with the techniques described here https://julialang.org/blog/2016/02/iteration/#computing_a_reduction for writing some generic multi-dimensional operations.

Let’s look at a nan-skipping sum:

```julia
function nansum!(B, A)
    # It's assumed that B has size 1 along any dimension that we're summing,
    # and otherwise matches A
    fill!(B, NaN)

    Bmax = last(CartesianIndices(B))

    @inbounds for I in CartesianIndices(A)
        dest = min(Bmax, I)
        ai = A[I]
        bi = B[dest]
        B[dest] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
    end
    return B
end

julia> using BenchmarkTools

julia> A = rand([NaN, 3.4, 4.5], 100, 40, 60);

julia> B = Array{Float64}(undef, 1, 40, 60);

julia> @btime nansum!($B, $A);
  788.930 μs (0 allocations: 0 bytes)
```

This works, and handles arbitrary dimensions. Cool!

`nansum2!` below is hard-coded for the 3-dimensional case with `dims=1`, i.e. reducing only the first dimension:

```julia
function nansum2!(B::AbstractArray{T,3}, A::AbstractArray{T,3}) where T
    fill!(B, NaN)

    @inbounds for k in 1:size(A, 3)
        for j in 1:size(A, 2)
            for i in 1:size(A, 1)
                ai = A[i, j, k]
                bi = B[1, j, k]
                B[1, j, k] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
            end
        end
    end
    return B
end

julia> @btime nansum2!($B, $A);
  450.433 μs (0 allocations: 0 bytes)
```

The performance is significantly better, which is a shame. It seems that the generic solution not knowing at compile time which dims have size 1 does have a cost: it has to go through the `min` on every iteration.
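To make that `min` concrete: `min` on `CartesianIndex` is elementwise in Base, which is how each index of `A` gets clamped onto the size-1 dimensions of `B`. A quick demonstration (using the array shapes from above):

```julia
# Elementwise min on CartesianIndex: the size-1 dimensions of B
# clamp the corresponding indices of A down to 1, and the other
# dimensions of A pass through unchanged.
Bmax = CartesianIndex(1, 40, 60)   # last index of B
I    = CartesianIndex(37, 12, 5)   # some index of A
min(Bmax, I)                       # CartesianIndex(1, 12, 5)
```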

Does anyone have suggestions for how to keep the simplicity and genericity of the first version, but with genuinely no overhead? It'll have to dispatch/recompile based on the reduction dims; I'm just not sure what the nicest way to code that is.


How about reshaping both, then processing, then reshaping back?

```julia
function nansum!(B, A)
    # It's assumed that B has size 1 along any dimension that we're summing,
    # and otherwise matches A
    bsize = size(B)
    A = reshape(A, (size(A, 1), length(B)))
    B = reshape(B, length(B))
    fill!(B, NaN)

    @inbounds for j in 1:size(A, 2)
        for i in 1:size(A, 1)
            aij = A[i, j]
            bj = B[j]
            B[j] = ifelse(isnan(aij), bj, ifelse(isnan(bj), aij, bj + aij))
        end
    end

    B = reshape(B, bsize)
    return B
end
```

On my machine this is about as fast as your fast version.

Yes, that’s currently a known issue with `CartesianIndices`, unfortunately. See https://github.com/JuliaLang/julia/issues/9080 and https://github.com/JuliaLang/julia/pull/35074.


It seems like `min` on `CartesianIndex` is the culprit here. This should be equivalent, and on my machine it’s as fast as `nansum2!`:

```julia
julia> function nansum3!(B, A)
           # It's assumed that B has size 1 along any dimension that we're summing,
           # and otherwise matches A
           fill!(B, NaN)

           @inbounds for I in CartesianIndices(A)
               dest = CartesianIndex(1, Base.tail(I.I)...)
               ai = A[I]
               bi = B[dest]
               B[dest] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
           end
           return B
       end
nansum3! (generic function with 1 method)
```
```julia
julia> @btime nansum!($B, $A);
  688.682 μs (0 allocations: 0 bytes)

julia> @btime nansum2!($B, $A);
  478.752 μs (0 allocations: 0 bytes)

julia> @btime nansum3!($B, $A);
  479.218 μs (0 allocations: 0 bytes)
```

These become the same speed if I write `dest = CartesianIndex(1, Base.tail(I.I)...)`, so, as you say, it looks like it's the `min`, not just the `CartesianIndices`.

[Beaten by @rdeits, with identical notation too!]

Perhaps you could make a `fixdims(I, ::Val{dims})` and call a function using that after checking which sizes are 1?
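A rough sketch of what that hypothetical `fixdims` could look like, assuming the reduced dims are carried in the `Val` so the tuple construction can unroll at compile time:

```julia
# Hypothetical fixdims: replace each reduced dimension of I with 1.
# `dims` lives in the type domain, so `k in dims` can constant-fold
# and the ntuple over Val(N) unrolls.
@inline function fixdims(I::CartesianIndex{N}, ::Val{dims}) where {N, dims}
    CartesianIndex(ntuple(k -> k in dims ? 1 : I[k], Val(N)))
end

fixdims(CartesianIndex(5, 3, 7), Val((1,)))  # CartesianIndex(1, 3, 7)
```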

```julia
A2 = permutedims(A, (2, 3, 1));
B2 = B[1, :, :];

using LoopVectorization
function nansumavx!(B::AbstractArray{T,2}, A::AbstractArray{T,3}) where T
    fill!(B, NaN)
    for i ∈ axes(A, 3)
        @avx for J ∈ CartesianIndices(B)
            ai = A[J, i]
            bi = B[J]
            B[J] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
        end
    end
    return B
end
```

This yields

```julia
julia> @benchmark nansum2!($B, $A)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     353.777 μs (0.00% GC)
  median time:      354.128 μs (0.00% GC)
  mean time:        354.892 μs (0.00% GC)
  maximum time:     427.012 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark nansumavx!($B2, $A2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     76.999 μs (0.00% GC)
  median time:      78.071 μs (0.00% GC)
  mean time:        78.213 μs (0.00% GC)
  maximum time:     148.726 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> reshape(B, (size(B,2), size(B,3))) ≈ B2
true
```

Answering your initial question: you can give the compiler information about which dimensions have size 1:

```julia
@inline @generated function update_indices(I::CartesianIndex{N}, J::CartesianIndex{N}, ::Val{positions}) where {N, positions}
    list = Array{Expr}(undef, N)
    for k = 1:N
        list[k] = if k in positions
            :(J[$k])
        else
            :(I[$k])
        end
    end
    :(CartesianIndex($(list...)))
end

function nansum4!(B, A, positions = Val(Tuple(k for k in 1:ndims(B) if isone(size(B, k)))))
    # It's assumed that B has size 1 along any dimension that we're summing,
    # and otherwise matches A

    fill!(B, NaN)

    Bmax = last(CartesianIndices(B))

    @inbounds for I in CartesianIndices(A)
        dest = update_indices(I, Bmax, positions)
        ai = A[I]
        bi = B[dest]
        B[dest] = ifelse(isnan(ai), bi, ifelse(isnan(bi), ai, bi + ai))
    end
    return B
end
```

`positions` tells the generated function `update_indices` which dimensions have size 1. For the `B` above,

```julia
positions = Val((1,))
```

and `update_indices` effectively lowers to:

```julia
update_indices(I, Bmax, positions) = CartesianIndex(Bmax[1], I[2], I[3])
```

It is important to note that, by placing `positions` as an optional argument of `nansum4!`, I'm isolating the type instability of computing `positions` from the computationally intensive part of the function. This is referred to as the function-barrier technique.

If I had written `positions = ...` in the body of the function, the performance would be much worse. In addition, giving `positions` as a positional argument allows you to manually define which dimensions have size 1, eliminating the type instability problem completely.

```julia
@btime nansum!($B, $A);
@btime nansum2!($B, $A);
@btime nansum4!($B, $A);
```

Output:

```
  628.300 μs (0 allocations: 0 bytes)
  434.399 μs (0 allocations: 0 bytes)
  440.901 μs (5 allocations: 240 bytes)
```

The allocations of `nansum4!` are due to the type instability of computing `positions`. We can eliminate this problem with:

```julia
@btime nansum4!($B, $A, Val((1,)));
```

Output:

```
  429.399 μs (0 allocations: 0 bytes)
```

Of course, you can adapt `nansum4!` to use LoopVectorization.jl as suggested by @Elrod and build even faster code.

Thanks, and to Elrod. This would be helpful for the nanmean and nanstd I need when porting code from Octave (or actually MATLAB; they have different functions… so far I've only ported to Octave, on the way to Julia).

Should this be in Julia Base? Is it redundant with `skipmissing`, which I found instead? Is that the way you should go in pure Julia? Or should a function do both?
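From what I can tell, `skipmissing` only skips `missing`, not `NaN`, so using it for this would mean converting first. A small sketch:

```julia
# skipmissing skips `missing` values; NaNs must be converted first.
# (replace compares with isequal, so it does match NaN.)
A = [1.0, NaN, 2.5]
sum(skipmissing(replace(A, NaN => missing)))  # 3.5

# Or simply filter NaNs without the conversion:
sum(x for x in A if !isnan(x))                # 3.5
```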

NaNMath may be useful.
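For instance, NaNMath provides NaN-skipping reductions; a quick sketch, assuming its `sum`/`mean` API, which operates over a whole iterable rather than along dims:

```julia
using NaNMath

# NaN-skipping reductions over an iterable (not along dims).
x = [1.0, NaN, 2.5]
NaNMath.sum(x)    # 3.5
NaNMath.mean(x)   # 1.75
```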

Thanks for the help everyone, this has given me some good ideas. I’ll report back if I have further problems!