Performance hacking `(&)(a::BitArray, b::BitArray)`

I’m trying my hand at performance-hacking the bitwise AND function. Although what I have is fairly good, it seems to be allocating a lot of memory, even though the allocation count is apparently always just 2 regardless of the size of a and b. Does anyone have advice for reducing this to just one allocation?

import Base: &

function (&)(a::BitArray, b::BitArray)::BitArray
    if size(a) != size(b)
        throw(DimensionMismatch("BitArrays must have the same dimensions"))
    end
    c = BitArray(undef, size(a))  # Create a new BitArray to store the result
    c.chunks .= a.chunks .& b.chunks
    return c
end

Here are the BenchmarkTools @benchmark results for inputs of length 100_000, compared to c = a .&& b.

My suspicion is that my function is allocating local variable c, then re-allocating the field, c.chunks rather than modifying in-place.

Creating a BitArray requires at least 2 allocations: one for the BitArray object itself (a mutable struct, so it lives on the heap), and one for the internal chunks array (a Vector{UInt64}) that it wraps. So, I think that’s all you’re seeing.
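A quick way to sanity-check this claim is the `@allocations` macro (Julia 1.9+). Note that the exact count can vary by Julia version — the 1.11 benchmarks later in this thread report 4 allocations where 1.10 reports 2 — so treat the number as version-dependent:

```julia
# Count allocations of constructing a fresh BitArray.
# Warm the function up first so compilation isn't counted.
make() = BitArray(undef, 100_000)
make()
n = @allocations make()
@show n  # at least 2: the BitArray struct plus its chunks Vector{UInt64}
```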

No, that’s not how .= works. c.chunks .= a.chunks .& b.chunks is effectively equivalent to broadcast!((&), c.chunks, a.chunks, b.chunks) … it’s not a compiler optimization that may or may not happen, it’s a guaranteed syntactic transformation.
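This equivalence is easy to verify directly with plain `UInt64` vectors (the same element type as a BitArray’s chunks):

```julia
# `dest .= x .& y` lowers to an in-place broadcast!, so nothing new is
# allocated for the destination; both spellings produce the same result.
x = rand(UInt64, 4)
y = rand(UInt64, 4)
dest1 = zeros(UInt64, 4)
dest2 = zeros(UInt64, 4)
dest1 .= x .& y                # dot-assignment syntax
broadcast!(&, dest2, x, y)     # explicit in-place broadcast
@assert dest1 == dest2 == x .& y
```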

Okay, great, thanks! So this implementation is about as efficient as we can expect, is what I’m reading?

There already is a performance-optimized broadcasted & over bit-arrays — it’s just .&! It performs similarly to your implementation:

julia> @benchmark $a .& $b
BenchmarkTools.Trial: 10000 samples with 191 evaluations.
 Range (min … max):  304.319 ns … 150.002 μs  ┊ GC (min … max):  0.00% … 97.20%
 Time  (median):     875.220 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.324 μs ±   2.706 μs  ┊ GC (mean ± σ):  37.54% ± 22.73%

  ▆▂▂▅█▆▃▁                                           ▁▁▁▁▁      ▂
  ████████▇▅▄▄▄▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▄▄▆▆▆▆▄▄▄▄▆█████████▇▇▆ █
  304 ns        Histogram: log(frequency) by time        8.1 μs <

julia> function dotand(a::BitArray, b::BitArray)::BitArray
           if size(a) != size(b)
               throw(DimensionMismatch("BitArrays must have the same dimensions"))
           end
           c = BitArray(undef, size(a))  # Create a new BitArray to store the result
           c.chunks .= a.chunks .& b.chunks
           return c
       end
dotand (generic function with 1 method)

julia> @benchmark dotand($a, $b)
BenchmarkTools.Trial: 10000 samples with 249 evaluations.
 Range (min … max):  308.402 ns … 75.301 μs  ┊ GC (min … max):  0.00% … 98.98%
 Time  (median):     921.267 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.449 μs ±  2.126 μs  ┊ GC (mean ± σ):  38.56% ± 24.34%

  ▄▁▁▄█▆▃                                                ▁  ▁▁ ▁
  ███████▇▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▃▃▄▁▃▃▄▄▄▄▆▇███▇▇▇▇▇▇████████ █
  308 ns        Histogram: log(frequency) by time      8.33 μs <

 Memory estimate: 12.34 KiB, allocs estimate: 4.

The .&& path doesn’t have the same optimization, but it could.

Ah! Okay, that makes even more sense, I don’t know why I was comparing to .&& in retrospect.

I just double checked, and I think that my implementation is about 4 times faster for larger arrays.

In [7]:  size(a)
Out[7]: (100000,)
In [8]: @benchmark c = a .& b
Out[8]: BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.060 μs … 493.100 μs  ┊ GC (min … max):  0.00% … 98.11%
 Time  (median):     4.440 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   6.687 μs ±  21.936 μs  ┊ GC (mean ± σ):  16.95% ±  5.13%

  ▆▆▂▂▁         ▄█▇▄▂▁▁           ▅▇▇▆▄▂▁                     ▂
  █████▇▅▆▅▄▅▅▆▄████████████▇████████████████▇▇▇▇▇▆▇▆▆▇▇▇█▆▇▇ █
  1.06 μs      Histogram: log(frequency) by time      12.8 μs <

 Memory estimate: 12.47 KiB, allocs estimate: 4.

In [9]: function dotand(a::BitArray, b::BitArray)::BitArray
                   if size(a) != size(b)
                       throw(DimensionMismatch("BitArrays must have the same dimensions"))
                   end
                   c = BitArray(undef, size(a))  # Create a new BitArray to store the result
                   c.chunks .= a.chunks .& b.chunks
                   return c
               end
Out[9]: dotand (generic function with 1 method)

In [10]: @benchmark c = dotand(a,b)
Out[10]: BenchmarkTools.Trial: 10000 samples with 121 evaluations.
 Range (min … max):  732.231 ns … 40.890 μs  ┊ GC (min … max):  0.00% … 93.30%
 Time  (median):       1.123 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.637 μs ±  2.148 μs  ┊ GC (mean ± σ):  14.57% ± 12.49%

  ▄▅█▄▁ ▁                        ▂▁                            ▁
  ███████▇▇▆▆▆▃▅▆▅▄▃▃▅▃▅▃▆▆▅▆▆▃▅▃███▆▆▅▆▅▅▄▆▅▅▄▅▆▆▆▆▆▅▅▅▅▆▆▆▅▆ █
  732 ns        Histogram: log(frequency) by time      9.98 μs <

 Memory estimate: 12.41 KiB, allocs estimate: 2.

Oh, you can maybe play games with SIMD and threads to make it faster.
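For example, here is a sketch of a thread-parallel version that splits the work over the 64-bit chunks (`threaded_and` is just an illustrative name; whether it actually wins depends on array size and thread count — at 100_000 bits there are only ~1563 chunks, so threading overhead may well dominate):

```julia
# Sketch: parallelize the AND over the underlying UInt64 chunks.
function threaded_and(a::BitArray, b::BitArray)
    size(a) == size(b) || throw(DimensionMismatch("BitArrays must have the same dimensions"))
    c = BitArray(undef, size(a))
    ac, bc, cc = a.chunks, b.chunks, c.chunks
    Threads.@threads for i in eachindex(cc)
        @inbounds cc[i] = ac[i] & bc[i]  # one 64-bit AND per chunk
    end
    return c
end
```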

Isn’t it automatically vectorized?

Maybe not as much as it could be? I don’t know, but I wanted to push back gently on the assumption that whatever the compiler does is always as good as it is possible to do.

@code_native on a .& b shows a lot of AVX vpand instructions on my x86 box, though.
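For anyone who wants to reproduce that inspection (the output is machine-specific; on AVX-capable x86 you should see vpand over wide registers):

```julia
# Inspect the generated assembly for the broadcasted AND.
using Random, InteractiveUtils

a = bitrand(100_000)
b = bitrand(100_000)
@code_native debuginfo=:none a .& b
```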


Interesting, I get about the same performance for both on my machine (with Julia 1.10).
(To get accurate benchmarking results, it is recommended to interpolate any global variables: Manual · BenchmarkTools.jl )

(EDIT: Changed the construction of a and b to something more efficient – see discussion below – in case someone comes across this for copy-pasting. Results are not affected by this.)

julia> a = Random.bitrand(100000);

julia> b = Random.bitrand(100000);

julia> @benchmark $a .& $b
BenchmarkTools.Trial: 10000 samples with 87 evaluations.
 Range (min … max):  848.345 ns …  2.003 ms  ┊ GC (min … max):  0.00% … 99.83%
 Time  (median):       1.517 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):     3.082 μs ± 21.756 μs  ┊ GC (mean ± σ):  48.61% ± 14.70%

  █▅ ▁▁                                                        ▁
  ██████▆▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▄▄▃▄▄▄▅▅▆▄▅▅▆▅▆▅▆▅▆▆ █
  848 ns        Histogram: log(frequency) by time      58.8 μs <

 Memory estimate: 12.41 KiB, allocs estimate: 2.

julia> @benchmark dotand($a, $b)
BenchmarkTools.Trial: 10000 samples with 60 evaluations.
 Range (min … max):  844.800 ns …  2.372 ms  ┊ GC (min … max):  0.00% … 99.77%
 Time  (median):       1.520 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):     3.233 μs ± 26.118 μs  ┊ GC (mean ± σ):  49.97% ± 12.29%

  █▃▂                                                          ▁
  ████▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▄▁▄▄▃▄▁▅▄▄▅▅▅▅▅▅ █
  845 ns        Histogram: log(frequency) by time        85 μs <

 Memory estimate: 12.41 KiB, allocs estimate: 2.
julia> versioninfo()
Julia Version 1.10.5
Commit 6f3fdf7b362 (2024-08-27 14:19 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, icelake-client)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_PKG_PRESERVE_TIERED_INSTALLED = true

Yes same here on 1.11.1

julia> @b $a .& $b
311.111 ns (4 allocs: 12.344 KiB)

julia> @b dotand($a, $b)
314.286 ns (4 allocs: 12.344 KiB)

this is a

20 × 13th Gen Intel(R) Core(TM) i7-13800H

Also note that this is a good example of type piracy.

You define a method of & for BitArray arguments, but neither & nor BitArray is owned by your code/module.

In this case it probably has no implications, but as part of a larger package, it could.

So it’s safer to rename it.
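Concretely, the safe pattern is to put the same method body behind your own function name instead of extending Base (`bitand` and `MyBits` here are just illustrative names):

```julia
# No piracy: a new function in our own module, leaving Base's & untouched.
module MyBits

export bitand

function bitand(a::BitArray, b::BitArray)::BitArray
    size(a) == size(b) || throw(DimensionMismatch("BitArrays must have the same dimensions"))
    c = BitArray(undef, size(a))
    c.chunks .= a.chunks .& b.chunks
    return c
end

end # module

using .MyBits
```

With this, `a .& b` keeps its standard meaning for every other package, and `bitand(a, b)` is unambiguously yours.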

Oh, right, for unrelated reasons I was testing on 1.9, so there are probably performance improvements on 1.10 and 1.11

In case you didn’t know, BitVector(rand(Bool, n)) creates a vector of Bools first and then converts it to a BitVector. You can construct the random BitVector directly, though:

julia> @btime BitVector(rand(Bool, 100000));
  15.100 μs (7 allocations: 110.09 KiB)

julia> @btime Random.bitrand(100000);
  908.571 ns (4 allocations: 12.34 KiB)

Thanks, you are spot on (that I didn’t know)!
I thought there had to be a better way to do this, but I was too lazy to look it up :sweat_smile:

I think my implementation saves some time compared to the broadcast .& in 1.10 and 1.11, but not very much. But yeah, type piracy, I didn’t think about that.