Broadcast with >3 parameters 30x slower

Trying to understand this behaviour, once we go to 4+ entries it slows down dramatically. Understand I may need to add the specialisations for 4+ myself, but not sure where to start…

a = rand(Bool, 100_000)
b = rand(Bool, 100_000)
c = rand(Bool, 100_000)
d = rand(Bool, 100_000)
out = zeros(Bool, 100_000)

using BenchmarkTools

julia> @btime broadcast!(&, $out, $a, $b, $c)
  2.822 μs (0 allocations: 0 bytes)

julia> @btime broadcast!(&, $out, $a, $b, $c, $d)
  67.999 μs (0 allocations: 0 bytes)
2 Likes

I can confirm this behavior.

While things seem normal for small array sizes

N = 100
a = rand(Bool,N)
b = rand(Bool,N)
c = rand(Bool,N)
d = rand(Bool,N)
out = zeros(Bool,N)

@btime broadcast!(&, $out, $a, $b, $c)
96.947 ns (0 allocations: 0 bytes)
@btime broadcast!(&, $out, $a, $b, $c, $d)
139.146 ns (0 allocations: 0 bytes)

things are awkward at e.g. N=150, where I get

@btime broadcast!(&, $out, $a, $b, $c)
36.427 ns (0 allocations: 0 bytes)
@btime broadcast!(&, $out, $a, $b, $c, $d)
196.133 ns (0 allocations: 0 bytes)

Not only does the additional array slow things down by a factor of 5.4, but the execution time with the original 3 Array broadcast is actually faster than it was for Ǹ=100 :thinking:.

Judging from @code_native, looks like the 3 argument version uses SIMD to make the & fast and the 4 argument version does not. LLVM is not always the best when vectorizing :man_shrugging:

3 Likes

Yep, can definitely make it go faster by implementing a vectorised version directly:

function Broadcast.broadcast!(::typeof(&), out::T, a::T, b::T, c::T, d::T) where T <: Vector{Bool}
    @avx for i in eachindex(out)
        out[i] = a[i] & b[i] & c[i] & d[i]
    end
end

julia> @btime broadcast!(&, $out, $a, $b, $c)
  2.800 μs (0 allocations: 0 bytes)

julia> @btime broadcast!(&, $out, $a, $b, $c, $d)
  3.788 μs (0 allocations: 0 bytes)

That’s more like it! Would appreciate any hints on how this might be solved generally, but good to have a solution for now.

1 Like

For those following along, here’s the terser version and a catch-all:

using LoopVectorization

Broadcast.broadcast!(::typeof(&), out::T, a1::T, a2::T, a3::T, a4::T) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4
Broadcast.broadcast!(::typeof(&), out::T, a1::T, a2::T, a3::T, a4::T, a5::T) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4 .& a5
Broadcast.broadcast!(::typeof(&), out::T, a1::T, a2::T, a3::T, a4::T, a5::T, a6::T) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4 .& a5 .& a6
Broadcast.broadcast!(::typeof(&), out::T, args::Vararg{T}) where T <: Vector{Bool} = @avx out .= .&(args...)
3 Likes

Note that @avx does not do bounds checking at all, so I wouldn’t use this deceivingly simple definition in real code.

julia> a = rand(Bool, 100_000);

julia> b = rand(Bool, 100_000);

julia> c = rand(Bool, 100);

julia> size.((a,b,c))
((100000,), (100000,), (100,))

julia> @avx out .= a .& b .& c; # there should be an error here...

julia> out .= a .& b .& c; # regular broadcasting errors
ERROR: DimensionMismatch("array could not be broadcast to match destination")
Stacktrace:

It’s also pirating typeof(&) (which might be a different & that’s not commutatitve/associative!) and Vector{Bool}.

3 Likes

Thanks, these are good points. This is for internal code where I know sizes will match up etc, but regardless I’ve created a dummy function and() and rewritten it (I need to use Base.broadcast! as and is just one type of function that might be called in this part of the code):

using LoopVectorization
function and() end
Broadcast.broadcast!(::typeof(and), out::T, a1, a2) where T <: Vector{Bool} = @avx out .= a1 .& a2
Broadcast.broadcast!(::typeof(and), out::T, a1, a2, a3) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3
Broadcast.broadcast!(::typeof(and), out::T, a1, a2, a3, a4) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4
Broadcast.broadcast!(::typeof(and), out::T, a1, a2, a3, a4, a5) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4 .& a5
Broadcast.broadcast!(::typeof(and), out::T, a1, a2, a3, a4, a5, a6) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4 .& a5 .& a6
Broadcast.broadcast!(::typeof(and), out::T, args...) where T <: Vector{Bool} = @avx out .= .&(args...)

Notwithstanding the point about bounds checking which is well made, I’m a little confused by the piracy one. The specialisation here is very specific - function & with all Vector{Bool} inputs and Vector{Bool} output for 4+ arguments. Under what scenarios would that result in unexpected behaviour for in some unrelated package? Helpful for me to understand to avoid this kind of issue in future.

I suppose I’m trying to understand the expected pattern to fill gaps in Base library code.