Trying to understand this behaviour, once we go to 4+ entries it slows down dramatically. Understand I may need to add the specialisations for 4+ myself, but not sure where to start…
a = rand(Bool, 100_000)
b = rand(Bool, 100_000)
c = rand(Bool, 100_000)
d = rand(Bool, 100_000)
out = zeros(Bool, 100_000)
using BenchmarkTools
julia> @btime broadcast!(&, $out, $a, $b, $c)
2.822 μs (0 allocations: 0 bytes)
julia> @btime broadcast!(&, $out, $a, $b, $c, $d)
67.999 μs (0 allocations: 0 bytes)
Not only does the additional array slow things down by a factor of 5.4, but the execution time with the original 3 Array broadcast is actually faster than it was for Ǹ=100 .
Judging from @code_native, looks like the 3 argument version uses SIMD to make the & fast and the 4 argument version does not. LLVM is not always the best when vectorizing
Note that @avx does not do bounds checking at all, so I wouldn’t use this deceivingly simple definition in real code.
julia> a = rand(Bool, 100_000);
julia> b = rand(Bool, 100_000);
julia> c = rand(Bool, 100);
julia> size.((a,b,c))
((100000,), (100000,), (100,))
julia> @avx out .= a .& b .& c; # there should be an error here...
julia> out .= a .& b .& c; # regular broadcasting errors
ERROR: DimensionMismatch("array could not be broadcast to match destination")
Stacktrace:
It’s also pirating typeof(&) (which might be a different & that’s not commutatitve/associative!) and Vector{Bool}.
Thanks, these are good points. This is for internal code where I know sizes will match up etc, but regardless I’ve created a dummy function and() and rewritten it (I need to use Base.broadcast! as and is just one type of function that might be called in this part of the code):
using LoopVectorization
function and() end
Broadcast.broadcast!(::typeof(and), out::T, a1, a2) where T <: Vector{Bool} = @avx out .= a1 .& a2
Broadcast.broadcast!(::typeof(and), out::T, a1, a2, a3) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3
Broadcast.broadcast!(::typeof(and), out::T, a1, a2, a3, a4) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4
Broadcast.broadcast!(::typeof(and), out::T, a1, a2, a3, a4, a5) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4 .& a5
Broadcast.broadcast!(::typeof(and), out::T, a1, a2, a3, a4, a5, a6) where T <: Vector{Bool} = @avx out .= a1 .& a2 .& a3 .& a4 .& a5 .& a6
Broadcast.broadcast!(::typeof(and), out::T, args...) where T <: Vector{Bool} = @avx out .= .&(args...)
Notwithstanding the point about bounds checking which is well made, I’m a little confused by the piracy one. The specialisation here is very specific - function & with all Vector{Bool} inputs and Vector{Bool} output for 4+ arguments. Under what scenarios would that result in unexpected behaviour for in some unrelated package? Helpful for me to understand to avoid this kind of issue in future.
I suppose I’m trying to understand the expected pattern to fill gaps in Base library code.