General reduction performance

aplavin · September 29, 2018, 9:22pm

I need to efficiently apply a custom reduction function along specific axis of a large array, basically what mapslices does. However the performance is 30 times worse than expected:

using BenchmarkTools
using Statistics

A = randn(100, 100, 100, 100)
mean(A, dims=2) ≈ mapslices(mean, A, dims=2)  # true
@btime mean(A, dims=2)  # 60 ms, 7 Mb alloc
@btime mapslices(mean, A, dims=2)  # 1.7 s, 312 Mb alloc

It looks like mapslices dynamically builds the resulting array instead of allocating it at once. Any possible solutions? Manual for loops would probably make the code efficient, but it will be difficult to specify dim dynamically.

sdanisch · September 30, 2018, 11:28am

The implementation of mapslice looks pretty complicated and I see a bunch of areas that look suboptimal… Not sure if it solves a more complex problem that I currently don’t see, or if it’s simply from pre julia 1.0.
But this implementation of mapslice is 10x faster for your case:

function fast_mapslice(f, a, dims = size(a))
    mask = ntuple(i-> ifelse(i in dims, (:), true), ndims(a))
    dims = ntuple(ndims(a)) do i 
        ifelse(mask[i] === (:), 1, axes(a, i))
    end
    map(CartesianIndices(dims)) do I
        It = Tuple(I)
        idx = ntuple(ndims(a)) do i 
            ifelse(mask[i] === (:), :, It[i])
        end
        @inbounds return f(view(a, idx...))
    end
end

sdanisch · September 30, 2018, 11:42am

But it’s still slower then mean(A, dims = 2), because mean doesn’t get inlined, and so the allocation of the view doesn’t get eliminated. Proof:

@inline function fast_mean(A)
    x = zero(eltype(A))
    @simd for a in A
        x += a
    end
    x / length(A)
end

@btime fast_mapslice(mean, a, (2,))
155.195 ms (1000017 allocations: 68.67 MiB)    
@btime fast_mapslice(fast_mean, a, (2,))
108.065 ms (16 allocations: 7.63 MiB

As you can see the allocations are gone, when mean inlines - but it’s still slower…

tim.holy · September 30, 2018, 12:02pm

https://github.com/bramtayl/JuliennedArrays.jl is the go-to solution when you want fast mapslices.

aplavin · October 3, 2018, 7:59am

JuliennedArrays is a really nice library! I benchmarked all solutions from this thread:

@btime mean(A, dims=2)))
  58.155 ms (17 allocations: 7.63 MiB)
@btime mapslices(mean, A, dims=2)
  1.577 s (11000080 allocations: 312.81 MiB)
@btime mapslices(fast_mean, A, dims=2)
  1.583 s (11000080 allocations: 312.81 MiB)
@btime map(mean, julienne(A, (*, :, *, *)))
  153.212 ms (1000004 allocations: 68.66 MiB)
@btime map(fast_mean, julienne(A, (*, :, *, *)))
  139.943 ms (4 allocations: 7.63 MiB)
@btime fast_mapslice(mean, A, (2,))  
  158.287 ms (1000017 allocations: 68.67 MiB)
@btime fast_mapslice(fast_mean, A, (2,))
  138.010 ms (17 allocations: 7.63 MiB)

So the Base mapslices perform the slowest for both inlined and not inlined reduction and it doesn’t seem to perform any optimizations for inlined case. Both fast_mapslicesand julienne indeed work more optimally for fast_mean and presumably for any other simple inlined function. The difference in speed between mean and fast_mean is quite small for me, less than for @sdanisch, while the memory usage certainly becomes more optimal for fast_mean. Indeed, they allocate the output array and that’s basically it.

To sum up: use JuliennedArrays

Topic		Replies	Views
Mapslices very slow GPU	3	782	September 13, 2022
Speeding up mapslices Performance performance , arrays , mapslices	0	458	May 27, 2021
Bikeshedding mapslices Internals & Design mapslices	18	2692	April 14, 2018
Performant Alternatives to mapslices General Usage question	4	489	October 18, 2022
Mapslices and @threads General Usage multithreading , mapslices	4	2049	January 11, 2024

General reduction performance

Related topics