Broadcast with >3 parameters 30x slower

Yamen · November 5, 2020, 6:39am

Trying to understand this behaviour, once we go to 4+ entries it slows down dramatically. Understand I may need to add the specialisations for 4+ myself, but not sure where to start…

a = rand(Bool, 100_000)
b = rand(Bool, 100_000)
c = rand(Bool, 100_000)
d = rand(Bool, 100_000)
out = zeros(Bool, 100_000)

using BenchmarkTools

julia> @btime broadcast!(&, $out, $a, $b, $c)
  2.822 μs (0 allocations: 0 bytes)

julia> @btime broadcast!(&, $out, $a, $b, $c, $d)
  67.999 μs (0 allocations: 0 bytes)

moeddel · November 5, 2020, 7:40am

I can confirm this behavior.

While things seem normal for small array sizes

N = 100
a = rand(Bool,N)
b = rand(Bool,N)
c = rand(Bool,N)
d = rand(Bool,N)
out = zeros(Bool,N)

@btime broadcast!(&, $out, $a, $b, $c)
96.947 ns (0 allocations: 0 bytes)
@btime broadcast!(&, $out, $a, $b, $c, $d)
139.146 ns (0 allocations: 0 bytes)

things are awkward at e.g. N=150, where I get

@btime broadcast!(&, $out, $a, $b, $c)
36.427 ns (0 allocations: 0 bytes)
@btime broadcast!(&, $out, $a, $b, $c, $d)
196.133 ns (0 allocations: 0 bytes)

Not only does the additional array slow things down by a factor of 5.4, but the execution time with the original 3 Array broadcast is actually faster than it was for Ǹ=100 .

Sukera · November 5, 2020, 8:27am

Judging from @code_native, looks like the 3 argument version uses SIMD to make the & fast and the 4 argument version does not. LLVM is not always the best when vectorizing

Yamen · November 5, 2020, 8:42am

Yep, can definitely make it go faster by implementing a vectorised version directly:

function Broadcast.broadcast!(::typeof(&), out::T, a::T, b::T, c::T, d::T) where T <: Vector{Bool}
    @avx for i in eachindex(out)
        out[i] = a[i] & b[i] & c[i] & d[i]
    end
end

julia> @btime broadcast!(&, $out, $a, $b, $c)
  2.800 μs (0 allocations: 0 bytes)

julia> @btime broadcast!(&, $out, $a, $b, $c, $d)
  3.788 μs (0 allocations: 0 bytes)

That’s more like it! Would appreciate any hints on how this might be solved generally, but good to have a solution for now.

Yamen · November 5, 2020, 9:02am

For those following along, here’s the terser version and a catch-all:

using LoopVectorization

Broadcast.broadcast!(::typeof(&), out::T, a1::T, a2::T, a3::T, a4::T) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4
Broadcast.broadcast!(::typeof(&), out::T, a1::T, a2::T, a3::T, a4::T, a5::T) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4 .& a5
Broadcast.broadcast!(::typeof(&), out::T, a1::T, a2::T, a3::T, a4::T, a5::T, a6::T) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4 .& a5 .& a6
Broadcast.broadcast!(::typeof(&), out::T, args::Vararg{T}) where T <: Vector{Bool} = @avx out .= .&(args...)

Sukera · November 5, 2020, 11:00am

Note that @avx does not do bounds checking at all, so I wouldn’t use this deceivingly simple definition in real code.

julia> a = rand(Bool, 100_000);

julia> b = rand(Bool, 100_000);

julia> c = rand(Bool, 100);

julia> size.((a,b,c))
((100000,), (100000,), (100,))

julia> @avx out .= a .& b .& c; # there should be an error here...

julia> out .= a .& b .& c; # regular broadcasting errors
ERROR: DimensionMismatch("array could not be broadcast to match destination")
Stacktrace:

It’s also pirating typeof(&) (which might be a different & that’s not commutatitve/associative!) and Vector{Bool}.

Yamen · November 5, 2020, 12:38pm

Thanks, these are good points. This is for internal code where I know sizes will match up etc, but regardless I’ve created a dummy function and() and rewritten it (I need to use Base.broadcast! as and is just one type of function that might be called in this part of the code):

using LoopVectorization
function and() end
Broadcast.broadcast!(::typeof(and), out::T, a1, a2) where T <: Vector{Bool} = @avx out .= a1 .& a2
Broadcast.broadcast!(::typeof(and), out::T, a1, a2, a3) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3
Broadcast.broadcast!(::typeof(and), out::T, a1, a2, a3, a4) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4
Broadcast.broadcast!(::typeof(and), out::T, a1, a2, a3, a4, a5) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4 .& a5
Broadcast.broadcast!(::typeof(and), out::T, a1, a2, a3, a4, a5, a6) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4 .& a5 .& a6
Broadcast.broadcast!(::typeof(and), out::T, args...) where T <: Vector{Bool} = @avx out .= .&(args...)

Notwithstanding the point about bounds checking which is well made, I’m a little confused by the piracy one. The specialisation here is very specific - function & with all Vector{Bool} inputs and Vector{Bool} output for 4+ arguments. Under what scenarios would that result in unexpected behaviour for in some unrelated package? Helpful for me to understand to avoid this kind of issue in future.

I suppose I’m trying to understand the expected pattern to fill gaps in Base library code.

Topic		Replies	Views
Why is a multi-argument inplace map much faster in this case than a broadcast? Performance question , broadcast , map	16	667	December 12, 2022
Broadcast of .== slow performance, allocations Performance question , performance	5	795	May 16, 2019
Performance of simple broadcasting operations with many arguments Performance performance , broadcast	15	1591	November 29, 2021
Array broadcasting slower than numpy? Performance	20	716	June 4, 2022
Arithmetic broadcasting in Julia 5x slower than MATLAB Performance	17	1068	May 26, 2022

Broadcast with >3 parameters 30x slower

Related topics