# General reduction performance

I need to efficiently apply a custom reduction function along specific axis of a large array, basically what `mapslices` does. However the performance is 30 times worse than expected:

``````using BenchmarkTools
using Statistics

A = randn(100, 100, 100, 100)
mean(A, dims=2) ≈ mapslices(mean, A, dims=2)  # true
@btime mean(A, dims=2)  # 60 ms, 7 Mb alloc
@btime mapslices(mean, A, dims=2)  # 1.7 s, 312 Mb alloc
``````

It looks like `mapslices` dynamically builds the resulting array instead of allocating it at once. Any possible solutions? Manual `for` loops would probably make the code efficient, but it will be difficult to specify `dim` dynamically.

The implementation of mapslice looks pretty complicated and I see a bunch of areas that look suboptimal… Not sure if it solves a more complex problem that I currently don’t see, or if it’s simply from pre julia 1.0.
But this implementation of mapslice is 10x faster for your case:

``````function fast_mapslice(f, a, dims = size(a))
mask = ntuple(i-> ifelse(i in dims, (:), true), ndims(a))
dims = ntuple(ndims(a)) do i
ifelse(mask[i] === (:), 1, axes(a, i))
end
map(CartesianIndices(dims)) do I
It = Tuple(I)
idx = ntuple(ndims(a)) do i
end
@inbounds return f(view(a, idx...))
end
end
``````
3 Likes

But it’s still slower then mean(A, dims = 2), because mean doesn’t get inlined, and so the allocation of the view doesn’t get eliminated. Proof:

``````@inline function fast_mean(A)
x = zero(eltype(A))
@simd for a in A
x += a
end
x / length(A)
end

@btime fast_mapslice(mean, a, (2,))
155.195 ms (1000017 allocations: 68.67 MiB)
@btime fast_mapslice(fast_mean, a, (2,))
108.065 ms (16 allocations: 7.63 MiB
``````

As you can see the allocations are gone, when mean inlines - but it’s still slower…

https://github.com/bramtayl/JuliennedArrays.jl is the go-to solution when you want fast `mapslices`.

2 Likes

JuliennedArrays is a really nice library! I benchmarked all solutions from this thread:

``````@btime mean(A, dims=2)))
58.155 ms (17 allocations: 7.63 MiB)
@btime mapslices(mean, A, dims=2)
1.577 s (11000080 allocations: 312.81 MiB)
@btime mapslices(fast_mean, A, dims=2)
1.583 s (11000080 allocations: 312.81 MiB)
@btime map(mean, julienne(A, (*, :, *, *)))
153.212 ms (1000004 allocations: 68.66 MiB)
@btime map(fast_mean, julienne(A, (*, :, *, *)))
139.943 ms (4 allocations: 7.63 MiB)
@btime fast_mapslice(mean, A, (2,))
158.287 ms (1000017 allocations: 68.67 MiB)
@btime fast_mapslice(fast_mean, A, (2,))
138.010 ms (17 allocations: 7.63 MiB)
``````

So the `Base` `mapslices` perform the slowest for both inlined and not inlined reduction and it doesn’t seem to perform any optimizations for inlined case. Both `fast_mapslices`and `julienne` indeed work more optimally for `fast_mean` and presumably for any other simple inlined function. The difference in speed between `mean` and `fast_mean` is quite small for me, less than for @sdanisch, while the memory usage certainly becomes more optimal for `fast_mean`. Indeed, they allocate the output array and that’s basically it.

To sum up: use `JuliennedArrays`