Lazy broadcasting speed issues

I am working on a new package called SeparableFunctions.jl, which aims to boost the performance of functions that are expensive to calculate on multidimensional grids but can profit a lot from broadcasting.

To this end, it turns out that the fastest implementation on CPU and GPU alike is to utilize Julia’s optimized broadcasting mechanism. For this reason SeparableFunctions.jl returns a tuple of vectors, each oriented along a particular axis. An example looks like this:

julia> exp_sep = exp_ikx_sep((5,6), (3.3,2.2))
(ComplexF32[-0.4257795f0 + 0.90482694f0im; -0.5358267f0 - 0.844328f0im; … ; -0.5358267f0 + 0.844328f0im; -0.4257795f0 - 0.90482694f0im;;], ComplexF32[0.8090169f0 + 0.5877854f0im -0.10452834f0 - 0.9945219f0im … -0.6691307f0 - 0.7431448f0im -0.10452834f0 + 0.9945219f0im])

julia> res = .*(exp_sep...)
5×6 Matrix{ComplexF32}:
 -0.876307+0.481753im   0.944376+0.328867im  …    0.95732-0.289032im  -0.855364-0.518027im
 0.0627908-0.998027im  -0.783694+0.621148im      -0.26892+0.963163im   0.895712-0.444635im
  0.809017+0.587785im  -0.104528-0.994522im     -0.669131-0.743145im  -0.104528+0.994522im
 -0.929777+0.368124im   0.895712+0.444635im      0.985996-0.166769im  -0.783694-0.621148im
  0.187381-0.982287im  -0.855364+0.518027im     -0.387515+0.921863im   0.944376-0.328867im

which works very well, but is admittedly ugly for newcomers to use. It would be nice if there were a simple way to delay this expression into a lazy array without any performance loss: something like a macro which self-expands into the .*(exp_sep...) expression upon use. Yet all my attempts using LazyArrays.jl caused massive losses in performance. Is there any other way to hide the .*(arr...) expression from the user and let this broadcasting construct be used like an ordinary array, e.g. res = collect(exp_sep)? Maybe the broadcasting mechanism itself has a half-way processed broadcasted expression type which could be used to package such a lazy broadcast evaluation?

One approach would be to create a new data type

struct LazyResult
   arrays::<The type as returned by exp_ikx_sep>
end

and provide a method for it

evaluate(r::LazyResult) = .*(r.arrays...)

Then exp_ikx_sep would return a LazyResult and the user would call evaluate when needed:

r = exp_ikx_sep(...)
res = evaluate(r)

(The names can certainly be improved.)
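
Put together, a minimal self-contained sketch of this idea might look like the following (fake_sep is just a hypothetical stand-in for exp_ikx_sep):

struct LazyResult{T<:Tuple}
    arrays::T
end

# Materialize by splatting the stored factors into one fused broadcast.
evaluate(r::LazyResult) = .*(r.arrays...)

# Hypothetical stand-in for a separable function: one factor per axis.
fake_sep(sz) = LazyResult((randn(sz[1], 1), randn(1, sz[2])))

r = fake_sep((5, 6))
res = evaluate(r)   # 5×6 Matrix{Float64}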

Sidenote: did you consider using reduce(.*, exp_sep) instead of using the splatting operator?

Thanks for the suggestion. But this does not quite fulfill the need for two reasons:

  1. The user still has to call evaluate. I could also just make an evaluate function which unpacks it.
  2. (more importantly) This way the expression does not participate in other broadcasting operations, i.e. myarr .= evaluate(r) or myarr .*= evaluate(r) would each create an unnecessary copy, which would kill most of the speed advantage (see the sketch below).
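
To make the second point concrete, here is roughly what I mean, reusing exp_sep from the first post:

exp_sep = exp_ikx_sep((5, 6), (3.3, 2.2))   # tuple of oriented vectors
myarr = zeros(ComplexF32, 5, 6)

# Going through an eager evaluate first materializes a temporary 5×6 array,
# which the dotted assignment then copies into myarr:
tmp = .*(exp_sep...)   # allocates the full array
myarr .= tmp           # extra copy

# Splatting directly into the dotted assignment keeps everything in one
# fused broadcast and writes into myarr without a temporary:
myarr .= .*(exp_sep...)

Any other idea?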

You could perhaps return a Broadcasted, which is the lazy object Base uses in building up fused broadcasting expressions:

julia> bc = Broadcast.broadcasted(*, [1,2], [3 4 5 6]);

julia> copy(bc)
2×4 Matrix{Int64}:
 3  4   5   6
 6  8  10  12

julia> sqrt.(bc)  # fuses
2×4 Matrix{Float64}:
 1.73205  2.0      2.23607  2.44949
 2.44949  2.82843  3.16228  3.4641

If you want to reduce this, note that sum(bc) and sum(bc; dims=1) are different; you may need init and may want Broadcast.instantiate.


This is great, exactly what I needed. I do not quite get the comment about the sums and why they should be different, since bc can be seen as an array. Or am I missing a point here?

Playing a bit with data of various sizes, there indeed seems to be a numerical difference between the sum over a Broadcasted object and the sum over the collected array. But I guess this is acceptable.

@mcabbott Now I think I see the problem: a sum over a dimension is not defined for a Base.Broadcast.Broadcasted object. One may wonder whether it would make sense to define this in Base.

… and size(collect(bc)) also does not yield the expected result, but instantiate seems to take care of this. Yet the sum over dimensions still does not work, so maybe init fixes this.

This is because the sum over a Broadcasted uses the single-pass iterator version of mapreduce, whereas the sum over an Array uses a pairwise reduction. The pairwise reduction is sometimes preferable (more accurate) when summing many small elements, see this discussion here.
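
A rough way to compare the two paths (exact values depend on the data and the Julia version):

x = rand(Float32, 1000); y = rand(Float32, 1, 1000)
bc = Broadcast.instantiate(Broadcast.broadcasted(*, x, y))

sum(collect(bc))   # sum over the materialized Array, which uses pairwise summation
sum(bc)            # sum over the lazy Broadcasted, which may take a different path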

Yes, ideally these would all work. I think some work was done to make some of them fast already; maybe fixing initialisation for all cases is next. (I don’t know if there’s an issue summarising this.)

With init, do you mean the optional init argument of the sum function, or is there anything called init in Broadcast which I could not find?

@mcabbott The package is now released!
Thanks for your help! However, I did not manage to get sum working with the dims argument. You mention something about init, which is presumably the init argument of sum, but I did not understand how this argument can help here. Can you explain?

Sorry I mean roughly this:

julia> bc = Broadcast.broadcasted(*, [1,2], [3 4 5 6]);

julia> sum(Broadcast.instantiate(bc); dims=1, init=0.0)  # uses typeof(init)
1×4 Matrix{Float64}:
 9.0  12.0  15.0  18.0

julia> sum(Broadcast.instantiate(bc))  # type Int, from bc
54

julia> sum(Broadcast.instantiate(bc); dims=1)  # without init, this fails
ERROR: MethodError: no method matching reducedim_init(::typeof(identity), ::typeof(Base.add_sum), ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Vector{Int64}, Matrix{Int64}}}, ::Int64)

Here reducedim_init is what figures out the array to allocate before reducing, and it doesn’t understand how to infer the eltype needed from Broadcasted. But with the init keyword, it just follows what’s given.

In addition to such errors, some cases of summing Broadcasted are as fast as arrays, but some fall back to slower methods. But I’m not entirely sure which.
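
One way to check a particular case on your Julia version is simply to time both paths, e.g. with BenchmarkTools (sizes here are arbitrary):

using BenchmarkTools

x = rand(1000); y = rand(1, 1000)
bc = Broadcast.instantiate(Broadcast.broadcasted(*, x, y))

@btime sum($bc)        # reduce over the lazy Broadcasted
@btime sum($x .* $y)   # materialize first, then reduce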

Thanks for the explanation. That’s interesting. Indeed there seems to be trouble figuring out the result datatype, which needs the eltypes of the arguments, but at least size works fine:

julia> eltype(Broadcast.instantiate(Broadcast.broadcasted(*,rand(Float32,1,10),rand(Float64,10,1))))
Any

julia> eltype(Broadcast.broadcasted(*,rand(Float32,1,10),rand(Float64,10,1)))
Any

julia> size(Broadcast.broadcasted(*,rand(Float32,1,10),rand(Float64,10,1)))
(10, 10)

julia> size(Broadcast.instantiate(Broadcast.broadcasted(*,rand(Float32,1,10),rand(Float64,10,1))))
(10, 10)

How about specifying eltype for this case as follows?

Base.eltype(b::Base.Broadcast.Broadcasted) = eltype(reduce((x, y) -> b.f.(x, y), (el[] for el in eltype.(b.args))))

julia> bc = Broadcast.broadcasted(*, UInt8.([1,2]), [3 4 5 6]);

julia> eltype(bc)
Int64

This seems to work fine. Yet reducedim_init still has trouble:

julia> sum(Broadcast.instantiate(bc); dims=1)
ERROR: MethodError: no method matching reducedim_init(::typeof(identity), ::typeof(Base.add_sum), ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Vector{UInt8}, Matrix{Int64}}}, ::Int64)
Closest candidates are:
  reducedim_init(::Any, ::typeof(&), ::Union{Base.AbstractBroadcasted, AbstractArray}, ::Any) at reducedim.jl:206
  reducedim_init(::Any, ::typeof(|), ::Union{Base.AbstractBroadcasted, AbstractArray}, ::Any) at reducedim.jl:207
  reducedim_init(::Any, ::Union{typeof(+), typeof(Base.add_sum)}, ::Union{AbstractArray{ComplexF16}, AbstractArray{ComplexF32}, AbstractArray{ComplexF64}, AbstractArray{Complex{Int128}}, AbstractArray{Complex{Int16}}, AbstractArray{Complex{Int32}}, AbstractArray{Complex{Int64}}, AbstractArray{Complex{Int8}}, AbstractArray{Complex{UInt128}}, AbstractArray{Complex{UInt16}}, AbstractArray{Complex{UInt32}}, AbstractArray{Complex{UInt64}}, AbstractArray{Complex{UInt8}}, AbstractArray{Float16}, AbstractArray{Float32}, AbstractArray{Float64}, AbstractArray{Int128}, AbstractArray{Int16}, AbstractArray{Int32}, AbstractArray{Int64}, AbstractArray{Int8}, AbstractArray{UInt128}, AbstractArray{UInt16}, AbstractArray{UInt32}, AbstractArray{UInt64}, AbstractArray{UInt8}}, ::Any) at reducedim.jl:217

so either the operator needs to be converted to add_sum, or something needs to be changed in reducedim_init, I guess?
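
As an aside, Broadcast already has an internal helper that computes an element type in a similar spirit (internal, so not guaranteed API, but perhaps worth comparing against):

julia> bc = Broadcast.broadcasted(*, UInt8.([1,2]), [3 4 5 6]);

julia> Base.Broadcast.combine_eltypes(bc.f, bc.args)
Int64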

@mcabbott Would it be useful to post this query in General or in a broadcasting thread? It seems useful to at least add the eltype definition.