Performant Alternatives to mapslices

There are a few discussions I’ve found on this already, but I’m hoping my use case is simple enough that a solution already exists.

Bikeshedding mapslices

I’m interested in broadcasting an arbitrary function over the columns of a matrix and having the output be a matrix. The output is guaranteed to be the same size as the input if that helps implementation.

using BenchmarkTools
a = rand(3,3)
foo(x) = reverse(x)
@btime mapslices(foo, a, dims=1)

2.046 μs (50 allocations: 2.27 KiB)

However, the time to actually perform the operation is quite small, so there seems to be a lot of room for improvement.

julia> @btime map(foo, eachcol($a));
83.419 ns (4 allocations: 320 bytes)

One alternative:

julia> @btime hcat(map(foo, eachcol($a))...);
175.830 ns (6 allocations: 480 bytes)

This is way faster, but hcat still takes a good chunk of the runtime. Is there a better way to construct the matrix directly?
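One way to avoid both the vector-of-vectors intermediate and the `hcat` splat is to preallocate the output and fill it column by column. This is only a sketch, and `mapcols!` is a hypothetical helper name; it assumes the mapped function has an in-place variant (here `reverse!` stands in for `foo`):

```julia
# Sketch: preallocate the output and mutate each column in place,
# so no intermediate array of columns is ever built.
function mapcols!(f!, out::AbstractMatrix, a::AbstractMatrix)
    for j in axes(a, 2)
        col = @view out[:, j]
        col .= @view a[:, j]   # copy the input column into the output
        f!(col)                # apply the in-place function to that column
    end
    return out
end

a = rand(3, 3)
out = mapcols!(reverse!, similar(a), a)
```

With `@views` and an in-place `f!`, the only allocation is the output matrix itself (or zero if you reuse a buffer).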

BTW, here is the same thing with StaticArrays

using BenchmarkTools, StaticArrays
a = rand(SMatrix{3,3})
foo(x) = reverse(x)
@btime map(foo, eachcol($a));
@btime hcat(map(foo, eachcol($a))...);
@btime mapslices(foo, a, dims=1);

19.289 ns (1 allocation: 128 bytes)
21.858 ns (1 allocation: 128 bytes)
1.692 μs (41 allocations: 2.11 KiB)

We don’t see the slowdown from hcat with StaticArrays, but there’s still an allocation I can’t track down.

mapslices should be much faster on Julia nightly (but not on 1.8):

julia> A = rand(3, 30);  # slightly larger example

julia> @btime mapslices(reverse, $A, dims=1);
  min 2.222 μs, mean 2.677 μs (44 allocations, 3.53 KiB)     # 1.9 native
  min 21.834 μs, mean 23.504 μs (212 allocations, 8.84 KiB)  # 1.7 rosetta

julia> @btime reduce(hcat, map(reverse, eachcol($A)));
  min 1.450 μs, mean 1.820 μs (32 allocations, 3.42 KiB)  # 1.9
  min 2.116 μs, mean 2.339 μs (32 allocations, 3.42 KiB)  # 1.7

julia> @btime mapslices(reverse!, $A, dims=1);  # you are free to mutate the slice
  min 1.375 μs, mean 1.519 μs (14 allocations, 1.19 KiB)  # 1.9

julia> @btime stack(eachcol($A)) do col  # with PR43334, accepts tuples
         col[3], col[2], col[1]
       end;
  min 157.769 ns, mean 176.099 ns (1 allocation, 816 bytes)
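For readers on a current release: `stack` landed in Base in Julia 1.9, so the one-pass form no longer needs the PR. A minimal sketch, assuming Julia ≥ 1.9:

```julia
# Base.stack (Julia ≥ 1.9) maps a function over slices and
# reassembles the results into a matrix in a single pass.
A = rand(3, 30)
B = stack(reverse, eachcol(A))  # 3×30 Matrix, each column reversed
```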

Thanks for the explanation and preview of coming improvements.

Digging in a little further, I think I’m most interested in what is driving the performance in the SVector case. When I profile the code, I see that allocating the SizedArray through the collect call inside map takes the vast majority of the time, compared to the actual execution of foo.

Do folks know what is going on here and why there is an allocation in this case?

Just a quick update on this in case anyone runs into the same thing. To get this working well with StaticArrays I had to define a specific mapcolumns function that tells the compiler what the output type of any function foo is (as well as use the hidden sacollect method).

function mapcolumns(OutType::Type{<:SVector}, fn::Function, m::SMatrix{N,M,T}) where {N,M,T}
    return reduce(hcat, map(x -> fn(x)::OutType, StaticArrays.sacollect(SVector{M}, eachcol(m))))
end

# default assumes the function outputs the same type as its input
mapcolumns(fn::Function, m::SMatrix{N,M,T}) where {N,M,T} = mapcolumns(SVector{N,T}, fn, m)
using BenchmarkTools, StaticArrays
foo(x) = reverse(x)
a = rand(SMatrix{3,30})

julia> @btime mapcolumns(foo, $a);
  29.271 ns (0 allocations: 0 bytes)

julia> @btime map(foo, eachcol($a));
  47.306 ns (1 allocation: 816 bytes)

julia> @btime reduce(hcat, map(foo, eachcol($a)));
  72.713 ns (1 allocation: 816 bytes)

julia> @btime mapslices(foo, $a, dims=1);
  11.708 μs (229 allocations: 49.81 KiB)
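Another sketch that builds the output SMatrix directly, without `reduce(hcat, ...)` or `sacollect`: construct the columns as a compile-time tuple with `ntuple` and splat them into `hcat`. Here `mapcols` is a hypothetical name, and it assumes `fn` maps an `SVector{N}` to an `SVector` of the same length:

```julia
using StaticArrays

# Build the result column-by-column as a tuple; Val(M) keeps
# ntuple type-stable so the compiler can unroll the whole loop.
mapcols(fn, m::SMatrix{N,M}) where {N,M} =
    hcat(ntuple(j -> fn(m[:, j]), Val(M))...)

m = @SMatrix [1 3; 2 4]
r = mapcols(reverse, m)   # reverses each column, returns an SMatrix
```

Because `m[:, j]` on an `SMatrix` already yields an `SVector`, and `hcat` of `SVector`s yields an `SMatrix`, everything stays on the stack with no heap allocation.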

Curious if anyone else has a more direct approach.

JuliennedArrays.jl should be pretty performant (and if not let me know).