First, this is a benchmarking artifact.

Second, the base/broadcast variant appears to use some tricks. Before looking at the actual implementation, the two obvious things to do are: (1) branch-free implementation and (2) accumulate 64 values into a register and only commit it to memory afterwards. If necessary, then commit it to memory for every entry, but don’t read it back in between (writing may be necessary because the predicate might throw an exception). A write `out[i]= ...`

involves a 64-bit read, some bit-wrangling, and a 64-bit write. Most of the time, the next index will read the just written 64-bit value again. This is a dependency chain mediated through memory, and requires load-store-forwarding. This is much slower than just keeping the value in register or even a normal load from L1 cache, cf eg here.

```
julia> check_bc(x, y)= x.==y;
julia> function check_naive(x::Vector, y)
l = length(x)
out = BitArray{1}(undef,l)
for i ∈ 1:l
out[i] = x[l] == 1
end
return out
end
check_naive (generic function with 1 method)
julia> using BenchmarkTools
julia> y=1; xsmall=[1]; xlarge=rand(1:4, 100_003);
julia> @btime check_bc($xsmall, $y); @btime check_naive($xsmall, $y);
872.460 ns (3 allocations: 4.31 KiB)
57.026 ns (2 allocations: 128 bytes)
julia> @btime check_bc($xlarge, $y); @btime check_naive($xlarge, $y);
154.259 μs (3 allocations: 16.59 KiB)
295.649 μs (2 allocations: 12.41 KiB)
julia> @show hash(check_bc(xlarge, y).chunks);
hash((check_bc(xlarge, y)).chunks) = 0xd427141e4a359ed3
```

Using that knowledge about how modern CPU work, we can try to monkey-patch the base implementation. First with the “fast but wrong in view of exceptions” variant:

```
julia> @eval Base.Broadcast function copyto!(dest::BitArray, bc::Broadcasted{Nothing})
axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
ischunkedbroadcast(dest, bc) && return chunkedcopyto!(dest, bc)
destc = dest.chunks
ind = cind = 1
bcp = preprocess(dest, bc)
length(bcp)<=0 && return dest
@inbounds for i = 0:Base.num_bit_chunks(length(bcp))-2
z = UInt64(0)
for j=0:63
z |= (bcp[i*64 + j + 1]::Bool) << (j&63)
end
destc[i+1] = z
end
i = Base.num_bit_chunks(length(bcp))-1
z = UInt64(0)
@inbounds for j=0:(length(bcp)-1)&63
z |= (bcp[i*64 + j + 1]::Bool) << (j&63)
end
@inbounds destc[i+1] = z
return dest
end
julia> @btime check_bc($xsmall, $y); @btime check_naive($xsmall, $y);
80.744 ns (4 allocations: 192 bytes)
56.908 ns (2 allocations: 128 bytes)
julia> @btime check_bc($xlarge, $y); @btime check_naive($xlarge, $y);
42.910 μs (4 allocations: 12.47 KiB)
295.515 μs (2 allocations: 12.41 KiB)
julia> @show hash(check_bc(xlarge, y).chunks);
hash((check_bc(xlarge, y)).chunks) = 0xd427141e4a359ed3
```

Now with the correct variant:

```
julia> @eval Base.Broadcast function copyto!(dest::BitArray, bc::Broadcasted{Nothing})
axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
ischunkedbroadcast(dest, bc) && return chunkedcopyto!(dest, bc)
destc = dest.chunks
ind = cind = 1
bcp = preprocess(dest, bc)
length(bcp)<=0 && return dest
@inbounds for i = 0:Base.num_bit_chunks(length(bcp))-2
z = UInt64(0)
for j=0:63
z |= (bcp[i*64 + j + 1]::Bool) << (j&63)
destc[i+1] = z
end
end
i = Base.num_bit_chunks(length(bcp))-1
z = UInt64(0)
@inbounds for j=0:(length(bcp)-1)&63
z |= (bcp[i*64 + j + 1]::Bool) << (j&63)
@inbounds destc[i+1] = z
end
return dest
end
julia> @btime check_bc($xsmall, $y); @btime check_naive($xsmall, $y);
80.319 ns (4 allocations: 192 bytes)
57.064 ns (2 allocations: 128 bytes)
julia> @btime check_bc($xlarge, $y); @btime check_naive($xlarge, $y);
119.142 μs (4 allocations: 12.47 KiB)
295.410 μs (2 allocations: 12.41 KiB)
julia> @show hash(check_bc(xlarge, y).chunks);
hash((check_bc(xlarge, y)).chunks) = 0xd427141e4a359ed3
```

I guess I’ll make a PR and hope that @mbauman can help me figure out whether this (much simpler!) algorithm can be made to work with arbitrary axes, and whether we can find a way to get the less correct variant working (the 3x speedup is definitely valid for `broadcast`

, because we hold the only reference to the destination: In case of exceptions we don’t need to leave it in a valid state. I don’t really understand why the stores are so freakishly expensive.).

PS. When I look at the current code, then we can use the fast version: The current code makes no guarantees at all about the state of `dest`

if an exception is triggered by the predicate (because there is no try/catch/rethrow block that dumps the current bitcache and then rethrows the exception). One might argue that this is a bug/quirk, but I am ok with that current slightly surprising behavior (semantic mismatch between `broadcast!`

to `BitArray`

and `Array{Bool}`

), and setting up an exception handler would be ridiculous.

PPS. PR opened.