General reduction performance

I need to efficiently apply a custom reduction function along a specific axis of a large array, which is basically what mapslices does. However, the performance is about 30 times worse than expected:

using BenchmarkTools
using Statistics

A = randn(100, 100, 100, 100)
mean(A, dims=2) ≈ mapslices(mean, A, dims=2)  # true
@btime mean(A, dims=2)  # ~60 ms, ~7 MiB allocated
@btime mapslices(mean, A, dims=2)  # ~1.7 s, ~312 MiB allocated

It looks like mapslices dynamically builds the resulting array instead of allocating it all at once. Are there any possible solutions? Manual for loops would probably make the code efficient, but it would be difficult to specify the dimension dynamically.
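
To illustrate the manual-loop idea, here is a minimal sketch (manual_reduce is just an illustrative name, not an existing function): it builds an index tuple with a Colon in the chosen dimension, so it works for any runtime dim, at the cost of that tuple not being type-stable:

function manual_reduce(f, A::AbstractArray{T,N}, dim::Integer) where {T,N}
    out_size = ntuple(i -> i == dim ? 1 : size(A, i), N)  # reduced dim collapses to length 1
    out = similar(A, float(T), out_size)
    for I in CartesianIndices(out_size)
        idx = ntuple(i -> i == dim ? Colon() : I[i], N)   # e.g. (i1, :, i3, i4) for dim = 2
        out[I] = f(view(A, idx...))
    end
    return out
end

manual_reduce(mean, A, 2) ≈ mean(A, dims=2)  # should be true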

The implementation of mapslices looks pretty complicated and I see a bunch of areas that look suboptimal… Not sure if it solves a more complex problem that I currently don't see, or if it's simply left over from pre-1.0 Julia.
But this implementation of mapslices is 10x faster for your case:

function fast_mapslice(f, a, dims = size(a))
    # mark the sliced dimensions with `:` and the iterated ones with `true`
    mask = ntuple(i -> ifelse(i in dims, (:), true), ndims(a))
    # the sliced dimensions collapse to length 1 in the output
    dims = ntuple(ndims(a)) do i
        ifelse(mask[i] === (:), 1, axes(a, i))
    end
    map(CartesianIndices(dims)) do I
        It = Tuple(I)
        # build an index tuple like (i1, :, i3, i4) and apply f to that view
        idx = ntuple(ndims(a)) do i
            ifelse(mask[i] === (:), :, It[i])
        end
        @inbounds return f(view(a, idx...))
    end
end
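
As a quick sanity check: since the sliced dimension is kept with length 1, the result has the same (100, 1, 100, 100) shape as the Base versions, so

fast_mapslice(mean, A, (2,)) ≈ mean(A, dims=2)  # should be true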

But it's still slower than mean(A, dims=2), because mean doesn't get inlined, so the allocation of the view doesn't get eliminated. Proof:

@inline function fast_mean(A)
    x = zero(eltype(A))
    @simd for a in A  # simple accumulation loop; small enough to inline and vectorize
        x += a
    end
    x / length(A)
end
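
(Note: @simd lets the compiler reassociate the accumulation, so fast_mean may differ from Statistics.mean in the last bits; use ≈ rather than == when comparing.)

v = randn(10^6)
fast_mean(v) ≈ mean(v)  # true up to floating-point rounding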

@btime fast_mapslice(mean, A, (2,))
155.195 ms (1000017 allocations: 68.67 MiB)
@btime fast_mapslice(fast_mean, A, (2,))
108.065 ms (16 allocations: 7.63 MiB)

As you can see, the allocations are gone once mean inlines, but it's still slower than mean(A, dims=2)…

https://github.com/bramtayl/JuliennedArrays.jl is the go-to solution when you want fast mapslices.

JuliennedArrays is a really nice library! I benchmarked all solutions from this thread:

@btime mean(A, dims=2)
  58.155 ms (17 allocations: 7.63 MiB)
@btime mapslices(mean, A, dims=2)
  1.577 s (11000080 allocations: 312.81 MiB)
@btime mapslices(fast_mean, A, dims=2)
  1.583 s (11000080 allocations: 312.81 MiB)
@btime map(mean, julienne(A, (*, :, *, *)))
  153.212 ms (1000004 allocations: 68.66 MiB)
@btime map(fast_mean, julienne(A, (*, :, *, *)))
  139.943 ms (4 allocations: 7.63 MiB)
@btime fast_mapslice(mean, A, (2,))  
  158.287 ms (1000017 allocations: 68.67 MiB)
@btime fast_mapslice(fast_mean, A, (2,))
  138.010 ms (17 allocations: 7.63 MiB)

So the Base mapslices performs the slowest with both the inlined and the non-inlined reduction, and it doesn't seem to benefit at all from inlining. Both fast_mapslice and julienne indeed work more efficiently with fast_mean, and presumably with any other simple inlined function. The speed difference between mean and fast_mean is quite small for me, smaller than for @sdanisch, while the memory usage clearly becomes more optimal with fast_mean: they allocate the output array and basically nothing else.

To sum up: use JuliennedArrays :slight_smile: