Broadcast into higher dimensional array vs nested array

I want to broadcast a vector-valued function and store the results in a higher dimensional array as opposed to a vector of arrays. I like the performance of this code:

import .Base: Broadcast
import .Broadcast: Broadcasted

function bccopyto!(dest::AbstractArray, bc::Base.Broadcast.Broadcasted)
    Base.tail(axes(dest)) == axes(bc) || throw(ArgumentError("array could not" *
    " be broadcast to be compatible with destination"))
    bc′ = Base.Broadcast.preprocess(dest, bc)
    @simd for I in eachindex(bc′)
        @inbounds dest[:,I] = bc′[I]
    end
    return dest
end
@inline function materialize2array!(dest, bc::Broadcast.Broadcasted{Style}) where {Style}
    return bccopyto!(dest, Broadcast.instantiate(Broadcast.Broadcasted{Style}(bc.f, bc.args, Base.tail(axes(dest)))))
end
broadcast2array!(f::Tf, dest, As::Vararg{Any,N}) where {Tf,N} = (materialize2array!(dest, Broadcast.broadcasted(f, As...),); dest)
A = Array{Float64,2}(undef,(2,100))
broadcast2array!(x -> [sin(x),cos(x)], A, 1:100)  # 100 allocations: 9.38 KiB

The downside is that I cannot use the convenient dot-notation for broadcasting.

The alternatives that I have tried are hcat(broadcast(x->[sin(x),cos(x)], 1:100)...)) (106 allocations: 12.92 KiB) and creating a new type and interface with broadcast:

import .Base: copyto!
mutable struct MyType{T,N} <: AbstractArray{T,N}
    arr::Array{T,N}
end
Base.size(A::MyType) = size(A.arr)
Base.getindex(A::MyType, inds::Vararg{Int,N}) where {T,N} = A.arr[inds...]
Base.setindex!(A::MyType, val, inds::Vararg{Int,N}) where {T,N} = A.arr[inds...] = val
Base.showarg(io::IO, A::MyType, toplevel) = print(io, typeof(A))

function copyto!(dest::MyType, bc::Base.Broadcast.Broadcasted)
    Base.tail(axes(dest)) == axes(bc) || throw(ArgumentError("array could not" *
    " be broadcast to be compatible with destination"))
    bc′ = Base.Broadcast.preprocess(dest, bc)
    @simd for I in eachindex(bc′)
        @inbounds dest[:,I] = bc′[I]
    end
    return dest
end
@inline function Broadcast.materialize!(dest::MyType, bc::Broadcast.Broadcasted{Style}) where {Style}
    return copyto!(dest, Broadcast.instantiate(Broadcast.Broadcasted{Style}(bc.f, bc.args, Base.tail(axes(dest)))))
end
A = MyType(Array{Float64,2}(undef,(2,100))
broadcast!(x -> [sin(x), cos(x)], A, 1:100)  # (300 allocations: 12.50 KiB)

Is there a better way to solve this problem?

Dot broadcasting doesn’t really work slice-wise by itself. It’s not so obvious to me, from your question, what you don’t like about your functions, but the standard ways to do this would be

julia> f(x) = [sin(x), cos(x)];

julia> reduce(hcat, [f(x) for x in 1:10])
2×10 Array{Float64,2}:
 0.841471   0.909297   0.14112   -0.756802  …  0.656987   0.989358   0.412118  -0.544021
 0.540302  -0.416147  -0.989992  -0.653644     0.753902  -0.1455    -0.91113   -0.839072

or, if you must mutate A:

julia> A = zeros(2,10);

julia> foreach((x,y) -> copyto!(x, f(y)), eachcol(A), 1:10)

julia> A
2×10 Array{Float64,2}:
 0.841471   0.909297   0.14112   -0.756802  …  0.656987   0.989358   0.412118  -0.544021
 0.540302  -0.416147  -0.989992  -0.653644     0.753902  -0.1455    -0.91113   -0.839072

but f still allocates all the slices. So this would be better:

julia> f!(y, x) = (y[1] = sin(x); y[2] = cos(x));

julia> foreach(f!, eachcol(A), 1:10)
1 Like

Thank you for the suggestions. I think the last solution using foreach is what I was looking for, though I thought that having a better performing method that uses broadcast could be desirable because of the possibility of fusion. The way I specialized on MyType in my original post allocates a lot more than necessary for reasons I do not quite understand.

For some reason, this definition of f2! allocates less than @mcabbott’s definition of f!:

function f2!(y,x)
    y[1] = sin(x)
    y[2] = cos(x)
    return y
end

I would be very grateful if someone would be willing to explain why.

After digging more into foreach, I also found that this method does not allocate at all:

f2(x) = (sin(x), cos(x));
function foo!(A, X)
    for (c, x) in zip(eachcol(A),X)
        c .= f2(x)
    end
    return nothing
end
A = zeros(2,1000)
X = 1:1000
@btime foo!($A,$X)  # 30.062 μs (0 allocations: 0 bytes)

But f3!(y,x) = (y.=f2(x)) combined with foreach(f3!, eachcol(A),X) does allocate. Is this because splatting itrs and z in the definition of foreach allocates? Here is the definition of foreach:

foreach(f, itrs...) = (for z in zip(itrs...); f(z...); end; nothing)

Thanks in advance!

Partially answering part of my own question: Splatting itrs does not allocate but z... does. We can see this because, if I add a method foreach(f, itr1, itr2) = (for (z1,z2) in zip(itr1, itr2); f(z1,z2); end; nothing), then foreach(f3!,A,X) allocates less than it did before (1002 allocations [46.92 KiB] instead of 1492 allocations [54.59 KiB]). On the other hand, splatting (itr1, itr2)... into zip does not change anything.

But if f3! is inlined, the number of allocations goes down to 2 (48 bytes). I think these last two allocations come from eachcol(A). Inside of my foo function above, passing eachcol(A) as an argument to zip() did not result in any allocations, though. So it seems to matter where eachcol() is evaluated.

That is the behavior I observe in this specific example, but I do not know why each of these three things result in allocations: z..., the non-inlined f3!, and passing eachcol(A) as an argument to foreach.

Note that there is a pull request currently that seems to improve the inferability and thus improvement of some iteration related things.