I’m trying my hand at performance-hacking the bitwise AND function, and although what I have is fairly good, it seems to be allocating a lot of memory, even though it’s always just 2 allocations regardless of the size of a and b. Does anyone have advice for reducing this to just one allocation?
import Base: &
function (&)(a::BitArray, b::BitArray)::BitArray
    if size(a) != size(b)
        throw(DimensionMismatch("BitArrays must have the same dimensions"))
    end
    c = BitArray(undef, size(a))  # Create a new BitArray to store the result
    c.chunks .= a.chunks .& b.chunks
    return c
end
Here are the BenchmarkTools @benchmark results for inputs of length 100_000, compared to c = a .& b.
Creating a BitArray requires at least 2 allocations: one for the BitArray object itself (a mutable struct, so it lives on the heap), and one for the internal chunks array (a Vector{UInt64}) that it wraps. So, I think that’s all you’re seeing.
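If the goal is to stop paying those allocations on every call, one option (a sketch, not something from this thread; the name and! is hypothetical, and .chunks is internal BitArray layout) is to allocate the destination once and reuse it:

```julia
# Hypothetical in-place variant: pay the two allocations once, up front,
# then write a & b into the preallocated destination on every call.
function and!(c::BitArray, a::BitArray, b::BitArray)
    size(a) == size(b) == size(c) ||
        throw(DimensionMismatch("BitArrays must have the same dimensions"))
    c.chunks .= a.chunks .& b.chunks  # in-place broadcast over the UInt64 chunks
    return c
end

a = trues(100_000)
b = falses(100_000)
c = similar(a)      # the only allocations happen here
and!(c, a, b)       # subsequent calls reuse c and allocate nothing
```

This works because a valid BitArray keeps the unused bits of its last chunk zeroed, so ANDing whole chunks preserves that invariant.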
No, that’s not how .= works. c.chunks .= a.chunks .& b.chunks is effectively equivalent to broadcast!((&), c.chunks, a.chunks, b.chunks) … it’s not a compiler optimization that may or may not happen, it’s a guaranteed syntactic transformation.
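A small illustrative check (not from the thread) that the dotted assignment and the explicit broadcast! call do the same thing, since the former is lowered to the latter:

```julia
y = rand(UInt64, 8)
z = rand(UInt64, 8)

x1 = zeros(UInt64, 8)
x1 .= y .& z             # lowers to materialize!(x1, broadcasted(&, y, z))

x2 = zeros(UInt64, 8)
broadcast!(&, x2, y, z)  # the explicit form of the same in-place operation

@assert x1 == x2
```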
Maybe not as much as it could be? I don’t know, but I wanted to push back gently on the assumption that whatever the compiler does is always as good as it is possible to do.
@code_native on a .& b shows a lot of AVX vpand instructions on my x86 box, though.
Interesting, I get about the same performance for both on my machine (with Julia 1.10).
(To get accurate benchmarking results, it is recommended to interpolate any global variables: Manual · BenchmarkTools.jl )
(EDIT: Changed the construction of a and b to something more efficient – see discussion below – in case someone comes across this for copy-pasting. Results are not affected by this.)
julia> a = Random.bitrand(100000);
julia> b = Random.bitrand(100000);
julia> @benchmark $a .& $b
BenchmarkTools.Trial: 10000 samples with 87 evaluations.
Range (min … max): 848.345 ns … 2.003 ms ┊ GC (min … max): 0.00% … 99.83%
Time (median): 1.517 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.082 μs ± 21.756 μs ┊ GC (mean ± σ): 48.61% ± 14.70%
█▅ ▁▁ ▁
██████▆▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▄▄▃▄▄▄▅▅▆▄▅▅▆▅▆▅▆▅▆▆ █
848 ns Histogram: log(frequency) by time 58.8 μs <
Memory estimate: 12.41 KiB, allocs estimate: 2.
julia> @benchmark dotand($a, $b)
BenchmarkTools.Trial: 10000 samples with 60 evaluations.
Range (min … max): 844.800 ns … 2.372 ms ┊ GC (min … max): 0.00% … 99.77%
Time (median): 1.520 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.233 μs ± 26.118 μs ┊ GC (mean ± σ): 49.97% ± 12.29%
█▃▂ ▁
████▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▄▁▄▄▃▄▁▅▄▄▅▅▅▅▅▅ █
845 ns Histogram: log(frequency) by time 85 μs <
Memory estimate: 12.41 KiB, allocs estimate: 2.
julia> versioninfo()
Julia Version 1.10.5
Commit 6f3fdf7b362 (2024-08-27 14:19 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (x86_64-apple-darwin22.4.0)
CPU: 8 × Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, icelake-client)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
JULIA_PKG_PRESERVE_TIERED_INSTALLED = true
I think my implementation saves some time compared to the broadcast .& in 1.10 and 1.11, but not very much. And yeah, type piracy, I hadn’t thought about that.